The author of KAN said: The message I want to convey is not “KAN is great”, but “Try to critically think about the current architecture, and seek fundamentally different alternatives. These alternatives can accomplish interesting and useful things.”
The multilayer perceptron (MLP), also known as the fully connected feedforward neural network, is the foundational building block of today’s deep learning models. The importance of MLPs cannot be overstated, as they are the default method used in machine learning to approximate nonlinear functions.
However, researchers from MIT and other institutions have recently proposed a very promising alternative: KAN. This method outperforms MLP in both accuracy and interpretability, and it can match or exceed much larger MLPs while using very few parameters. For example, the authors state that with KAN they rediscovered mathematical laws in knot theory and reproduced DeepMind’s results with a smaller network and a higher degree of automation. Specifically, DeepMind’s MLP has about 300,000 parameters, while their KAN has only about 200.
These astonishing results have quickly popularized KAN, attracting many people to embark on research into it. Soon, some people raised questions. Among them, a Colab document titled “KAN is just MLP” became the focus of discussion.
Is KAN just an ordinary MLP?
The author of the aforementioned document states that you can write a KAN as an MLP, as long as you add some repetition and shifting before the ReLU.
In a brief example, the author demonstrates how to rewrite a KAN network as a regular MLP with the same number of parameters and a slightly atypical structure.
It’s worth noting that KAN has activation functions on its edges; the original work uses B-splines. In the example shown here, for simplicity’s sake, the author uses only piecewise linear functions. This does not change the network’s modeling ability.
Below is an example of a piecewise linear function:
import torch
import matplotlib.pyplot as plt

def f(x):
    if x < 0:
        return -2*x
    if x < 1:
        return -0.5*x
    return 2*x - 2.5

X = torch.linspace(-2, 2, 100)
plt.plot(X, [f(x) for x in X])
plt.grid()
The author states that we can easily rewrite this function using multiple ReLUs and linear functions. Please note that sometimes it is necessary to shift the input of the ReLU.
plt.plot(X, -2*X + torch.relu(X)*1.5 + torch.relu(X - 1)*2.5)
plt.grid()
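As a quick sanity check, the piecewise definition and the ReLU form can be compared numerically. This is a minimal sketch assuming the coefficients above (-2, -0.5, and 2x - 2.5 for the three pieces):

```python
import torch

def f(x):
    # piecewise linear function with three pieces
    if x < 0:
        return -2*x
    if x < 1:
        return -0.5*x
    return 2*x - 2.5

X = torch.linspace(-2, 2, 100)
piecewise = torch.stack([f(x) for x in X])
# the same function written as a base linear term plus shifted ReLUs
relu_form = -2*X + torch.relu(X)*1.5 + torch.relu(X - 1)*2.5
print(torch.allclose(piecewise, relu_form, atol=1e-5))  # True
```

The two curves agree up to floating-point tolerance, which is exactly the point: a k-piece linear spline is a sum of one linear term and k - 1 shifted ReLUs.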
The real question is how to rewrite a KAN layer as a typical MLP layer. Suppose there are n input neurons, m output neurons, and the piecewise function has k pieces. This requires n*m*k parameters (each edge has k parameters, and there are n*m edges).
Now, consider a KAN edge. For this, you need to copy the input k times, shift each copy by a constant, and then run it through a ReLU and a linear layer (except for the first layer). Graphically, it looks like this (C is a constant, W is a weight):
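In code, a single KAN edge under these assumptions could be sketched as follows. The shift constants C and weights W here are illustrative placeholders, not values from the paper:

```python
import torch

k = 3                                  # number of linear pieces on the edge
x = torch.randn(10)                    # a batch of scalar inputs for one edge

shifts = torch.linspace(-1, 1, k).reshape(k, 1)   # constants C_1..C_k
w = torch.randn(k)                                # weights W_1..W_k (illustrative)

copied = x.unsqueeze(0).repeat(k, 1)              # copy the input k times -> (k, batch)
activated = torch.relu(copied + shifts)           # shift each copy, then ReLU
edge_out = (w.unsqueeze(1) * activated).sum(0)    # linear mix of the copies -> (batch,)
```

Each edge thus computes a learned piecewise linear function of its single scalar input.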
Now, one can repeat this process for each edge. But one thing to note is that if the piecewise linear function grids are the same everywhere, we can share the intermediate ReLU outputs and just need to mix the weights on top of it. Just like this:
import torch
import torch.nn as nn

k = 3            # Grid size (number of pieces)
inp_size = 5
out_size = 7
batch_size = 10

X = torch.randn(batch_size, inp_size)        # Our input
linear = nn.Linear(inp_size * k, out_size)   # Weights

repeated = X.unsqueeze(1).repeat(1, k, 1)            # copy the input k times
shifts = torch.linspace(-1, 1, k).reshape(1, k, 1)   # one shift per copy
shifted = repeated + shifts
# keep the first copy linear, ReLU the rest, then flatten for the linear layer
intermediate = torch.cat([shifted[:, :1, :], torch.relu(shifted[:, 1:, :])], dim=1).flatten(1)
outputs = linear(intermediate)
Now our layer looks like this:
Expand + shift + ReLU
Linear
Now consider three such layers, one after another:

Expand + shift + ReLU (The first layer starts from here)
Linear
Expand + shift + ReLU (The second layer starts from here)
Linear
Expand + shift + ReLU (The third layer starts from here)
Linear
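Under these assumptions, the three-layer stack can be sketched by reusing the expand-shift-ReLU step from the snippet above. The layer widths chosen here are arbitrary, for illustration only:

```python
import torch
import torch.nn as nn

k = 3  # grid size

def expand_shift_relu(x):
    shifts = torch.linspace(-1, 1, k).reshape(1, k, 1)
    shifted = x.unsqueeze(1).repeat(1, k, 1) + shifts
    # keep the first copy linear, ReLU the rest, flatten for the linear layer
    return torch.cat([shifted[:, :1, :], torch.relu(shifted[:, 1:, :])], dim=1).flatten(1)

widths = [5, 7, 7, 1]   # arbitrary layer widths for illustration
linears = [nn.Linear(widths[i] * k, widths[i + 1]) for i in range(3)]

x = torch.randn(10, widths[0])
for lin in linears:     # three repetitions of: expand + shift + ReLU, then linear
    x = lin(expand_shift_relu(x))
```

After the loop, `x` has shape `(10, 1)`: each expansion multiplies the feature width by k, and each linear layer consumes that expanded width.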
Ignoring the input expansion, we can rearrange:

Linear (The first layer starts from here)
Expand + shift + ReLU
Linear (The second layer starts from here)
Expand + shift + ReLU
The layers that follow can then basically be called an MLP. You can also make the linear layers larger and remove the expand and shift, gaining better modeling capability (though at a higher parameter cost).

Linear (The second layer starts from here)
Expand + shift + ReLU
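Seen this way, each rearranged layer is just a standard linear map followed by an unusual elementwise nonlinearity. A minimal comparison sketch, with illustrative sizes (shapes are the only thing being demonstrated):

```python
import torch
import torch.nn as nn

k = 3
lin = nn.Linear(5, 7)
x = torch.randn(10, 5)

def expand_shift_relu(x):
    # the "unusual nonlinearity": each unit fans out into k shifted ReLU copies
    shifts = torch.linspace(-1, 1, k).reshape(1, k, 1)
    return torch.relu(x.unsqueeze(1).repeat(1, k, 1) + shifts).flatten(1)

mlp_style = torch.relu(lin(x))         # ordinary MLP layer -> shape (10, 7)
kan_style = expand_shift_relu(lin(x))  # rearranged KAN-style layer -> shape (10, 21)
```

The only structural difference is that the KAN-style nonlinearity widens the representation by a factor of k before the next linear layer.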
Through this example, the author shows that a KAN can be written as a kind of MLP. This claim has led people to rethink the relationship between the two methods.
Re-evaluating KAN’s approach, methods, and results
In fact, in addition to the unclear relationship with MLP, KAN has also been questioned in many other ways.
In summary, researchers’ discussions mainly focus on the following points.
First, the main contribution of KAN lies in its interpretability, not in its scaling behavior, accuracy, and so on.
The author of the paper once said:

KAN scales faster than MLP. KAN achieves better accuracy than an MLP with fewer parameters.

KAN can be intuitively visualized. KAN provides interpretability and interactivity that MLP cannot provide. We can potentially discover new scientific laws using KAN.
Among these, the interpretability of a network is of self-evident importance when models are used to solve real problems:
But the problem is: “I think their only claim is that it learns faster and offers interpretability, not anything else. If KAN really has far fewer parameters than an equivalent NN, that claim is meaningful. I still feel that training KAN is very unstable.”
So, can KAN really have much fewer parameters than the equivalent NN?
There are still questions about this claim. In the paper, the authors of KAN stated that with just 200 parameters, their KAN could reproduce DeepMind’s research on discovering mathematical theorems, which used an MLP with about 300,000 parameters. Upon seeing this result, two students of Associate Professor Humphrey Shi of Georgia Tech revisited DeepMind’s experiment and found that DeepMind’s MLP could match KAN’s accuracy of 81.6% with just 122 parameters. Moreover, they made no significant modifications to DeepMind’s code; to achieve this result, they merely reduced the network size, used a random seed, and increased the training time.
In response to this, the authors of the paper also gave a positive response:
Second, there is no essential difference between KAN and MLP in terms of methodology.
“Yes, this is clearly the same thing. They first do the activation in KAN, then do the linear combination, while in MLP, they first do the linear combination, then do the activation. When you zoom in, it’s basically the same thing. To my knowledge, the main reason to use KAN is for interpretability and symbolic regression.”
In addition to questioning the method, researchers also called for a rational evaluation of this paper:
“I think people need to stop viewing the KAN paper as a huge shift in the fundamental unit of deep learning, and instead see it as a good paper on deep learning interpretability. The interpretability of the nonlinear function learned on each edge is the major contribution of this paper.”
Third, some researchers say that the idea of KAN is not new.
“People have been studying this since the 1980s. A discussion on Hacker News mentioned an Italian paper that discussed this issue. So it’s not really anything new. Forty years have passed, and this is just something that either came back or was rejected and is now being re-examined.”
It is clear, though, that the authors of the KAN paper did not hide this issue.
“These ideas are not new, but I don’t think the author avoided this. He just packaged everything up nicely and did some good experiments on toy data. But that’s also a contribution.”
Meanwhile, Maxout (https://arxiv.org/pdf/1302.4389), a paper by Ian Goodfellow and Yoshua Bengio from over a decade ago, was also brought up, with some researchers noting that although the two are “slightly different, the idea is somewhat similar”.
Author: The original research goal was indeed interpretability.
The result of the intense discussion was that one of the authors, Sachin Vaidya, stepped forward.
On the GitHub homepage, one of the paper’s authors, Liu Ziming, also responded to the evaluations received by this research:
The most common question I’ve been asked recently is whether KAN will become the next generation LLM. I don’t have a clear judgement on this.
KAN is specifically designed for applications that care about high accuracy and interpretability. We indeed care about the interpretability of LLM, but interpretability might mean different things for LLM and science. Do we care about the high accuracy of LLM? The scaling laws seem to imply so, but the accuracy may not be that high. Furthermore, for LLM and science, accuracy might also mean different things.
I welcome people to criticize KAN. Practice is the only criterion for testing truth. Many things cannot be known in advance until they are genuinely tried and shown to succeed or fail. Although I would like to see KAN succeed, I am equally curious about how it might fail.
KAN and MLP are not interchangeable, they each have advantages in some scenarios and limitations in others. I would be interested in a theoretical framework that includes both, or even proposing new alternatives (physicists like unified theories, sorry).
The first author of the KAN paper is Liu Ziming. He is a physicist and machine learning researcher, currently a third-year PhD student at MIT and IAIFI, advised by Max Tegmark. His research interests focus on the intersection of artificial intelligence (AI) and physics.