A twist after going viral? KAN, which "took down MLP overnight," says: actually, I'm an MLP too

The author of KAN said: The message I want to convey is not “KAN is great”, but “Try to critically think about the current architecture, and seek fundamentally different alternatives. These alternatives can accomplish interesting and useful things.”
The multilayer perceptron (MLP), also known as the fully connected feedforward neural network, is the foundational building block of today’s deep learning models. The importance of MLPs cannot be overstated, as they are the default method used in machine learning to approximate nonlinear functions.


However, researchers from MIT and other institutions recently proposed a very promising alternative: KAN. This method performs better than MLP in terms of accuracy and interpretability, and it can match much larger MLPs while using far fewer parameters. For example, the authors state that with KAN they rediscovered mathematical laws in knot theory and reproduced DeepMind's results with a smaller network and a higher degree of automation: DeepMind's MLP had about 300,000 parameters, while the KAN had only about 200.


These astonishing results quickly made KAN popular and attracted many people to study it. Soon, however, questions were raised. Among them, a Colab document titled "KAN is just MLP" became the focus of discussion.
Is KAN just an ordinary MLP?   
The author of the aforementioned document states that you can write a KAN as an MLP, as long as you add some repetition and shifting before the ReLU.


In a brief example, the author demonstrates how to rewrite a KAN network as a regular MLP with the same number of parameters and a slightly atypical structure.


It's worth noting that KAN places activation functions on the edges, implemented as B-splines. In the example shown, for simplicity's sake, the author uses only piece-wise linear functions. This doesn't change the network's modeling ability.


Below is an example of a piece-wise linear function:
import torch
import matplotlib.pyplot as plt

def f(x):
    if x < 0:
        return -2*x
    if x < 1:
        return -0.5*x
    return 2*x - 2.5

X = torch.linspace(-2, 2, 100)
plt.plot(X, [f(x) for x in X])
The author states that we can easily rewrite this function using multiple ReLUs and linear functions. Please note that sometimes it is necessary to shift the input of the ReLU.
plt.plot(X, -2*X + torch.relu(X)*1.5 + torch.relu(X - 1)*2.5)
plt.grid()
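As a quick sanity check (a small sketch of my own, reusing the f and X defined above), one can verify numerically that the ReLU form matches the piece-wise definition at every sampled point:

```python
import torch

def f(x):
    if x < 0:
        return -2*x
    if x < 1:
        return -0.5*x
    return 2*x - 2.5

X = torch.linspace(-2, 2, 100)
piecewise = torch.tensor([f(float(x)) for x in X])
# the same function written as a base slope plus shifted ReLUs
relu_form = -2*X + torch.relu(X)*1.5 + torch.relu(X - 1)*2.5
assert torch.allclose(piecewise, relu_form, atol=1e-5)
```

Each ReLU term switches on at one breakpoint and adds the difference between the new slope and the previous one (1.5 at x=0, 2.5 at x=1).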
The real question is how to rewrite a KAN layer into a typical MLP layer. Suppose there are n input neurons, m output neurons, and the piece-wise function has k pieces. This requires nmk parameters (each edge has k parameters, and you have n*m edges).
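As a concrete instance of this count (sizes chosen to match the shared-grid snippet further below; the comparison against a Linear layer's weight count is my own illustration):

```python
import torch.nn as nn

n, m, k = 5, 7, 3                 # inputs, outputs, pieces per edge
edge_params = n * m * k           # each of the n*m edges carries k parameters
print(edge_params)                # 105

# with a shared grid, the same count shows up as the weight matrix
# of a single Linear(n * k, m) layer (bias omitted for the comparison)
linear = nn.Linear(n * k, m, bias=False)
print(linear.weight.numel())      # 105
```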


Now, consider a KAN edge. You need to copy the input k times, shift each copy by a constant, and then run it through a ReLU and a linear layer (except for the first layer). Schematically, each edge computes a weighted sum of ReLUs of shifted copies of its input, where C is a shift constant and W is a weight.
Now, one can repeat this process for each edge. But one thing to note is that if the piece-wise linear function grids are the same everywhere, we can share the intermediate ReLU outputs and just need to mix the weights on top of it. Just like this:

import torch
import torch.nn as nn

k = 3            # Grid size (number of pieces per edge)
inp_size = 5
out_size = 7
batch_size = 10

X = torch.randn(batch_size, inp_size)        # Our input

linear = nn.Linear(inp_size * k, out_size)   # Weights, shared across the grid

# Copy the input k times and shift each copy by a constant
repeated = X.unsqueeze(1).repeat(1, k, 1)    # (batch, k, inp_size)
shifts = torch.linspace(-1, 1, k).reshape(1, k, 1)
shifted = repeated + shifts

# Keep the first copy linear; apply ReLU to the shifted copies
intermediate = torch.cat([shifted[:, :1, :],
                          torch.relu(shifted[:, 1:, :])], dim=1).flatten(1)

outputs = linear(intermediate)

Now our layer looks like this:
  • Expand + shift + ReLU
  • Linear
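The two steps can be bundled into a single module. This is a minimal sketch of my own (the class name KANLayerAsMLP is not from the notebook), assuming the same shared grid as above:

```python
import torch
import torch.nn as nn

class KANLayerAsMLP(nn.Module):
    """One KAN layer in MLP form: expand + shift + ReLU, then linear."""
    def __init__(self, inp_size, out_size, k=3):
        super().__init__()
        # the shifts define the shared piece-wise linear grid
        self.register_buffer("shifts", torch.linspace(-1, 1, k).reshape(1, k, 1))
        self.linear = nn.Linear(inp_size * k, out_size)

    def forward(self, x):                       # x: (batch, inp_size)
        shifted = x.unsqueeze(1) + self.shifts  # (batch, k, inp_size)
        # keep the first copy linear, ReLU the shifted copies
        act = torch.cat([shifted[:, :1, :],
                         torch.relu(shifted[:, 1:, :])], dim=1)
        return self.linear(act.flatten(1))

layer = KANLayerAsMLP(5, 7, k=3)
out = layer(torch.randn(10, 5))
print(out.shape)  # torch.Size([10, 7])
```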
Consider three consecutive layers, one by one:
  • Expand + shift + ReLU (the first layer starts here)
  • Linear
  • Expand + shift + ReLU (the second layer starts here)
  • Linear
  • Expand + shift + ReLU (the third layer starts here)
  • Linear
Ignoring the input expansion, we can regroup:
  • Linear (the first layer starts here)
  • Expand + shift + ReLU
  • Linear (the second layer starts here)
  • Expand + shift + ReLU
A stack of such layers can basically be called an MLP. You can also make the linear layers larger and drop the expand-and-shift step to get more modeling capacity (though at a higher parameter cost):
  • Linear (the second layer starts here)
  • ReLU
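The regrouping can be sketched in code. This is my own illustration with arbitrary sizes, showing that once the input expansion is set aside, each KAN layer contributes a Linear followed by an expand-shift-ReLU block:

```python
import torch
import torch.nn as nn

k, d = 3, 5  # grid size and feature width (illustrative)
shifts = torch.linspace(-1, 1, k).reshape(1, k, 1)

def expand_shift_relu(x):
    # expand + shift + ReLU, keeping the first copy linear
    s = x.unsqueeze(1) + shifts                              # (batch, k, d)
    return torch.cat([s[:, :1, :], torch.relu(s[:, 1:, :])], dim=1).flatten(1)

# regrouped stack: Linear -> expand+shift+ReLU -> Linear -> ...
lin1 = nn.Linear(d * k, d)   # closes KAN layer 1
lin2 = nn.Linear(d * k, 7)   # closes KAN layer 2

x = torch.randn(10, d)
h = expand_shift_relu(x)     # the input expansion, ignored in the regrouping
h = lin1(h)
h = expand_shift_relu(h)
out = lin2(h)
print(out.shape)  # torch.Size([10, 7])
```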
Through this example, the author shows that KAN is a kind of MLP. This claim has prompted many to rethink the relationship between the two methods.
Reevaluation of the KAN approach, methods, and results.
In fact, in addition to the unclear relationship with MLP, KAN has also been questioned in many other ways.


In summary, researchers’ discussions mainly focus on the following points.


First, the main contribution of KAN lies in its interpretability, not in scaling speed, accuracy, and the like.


The author of the paper once said:
  1. KAN scales faster than MLP: it achieves better accuracy than MLP with fewer parameters.
  2. KAN can be intuitively visualized. KAN provides interpretability and interactivity that MLP cannot provide. We can potentially discover new scientific laws using KAN.
Among these, the importance of network interpretability for models solving real problems is self-evident.
But one commenter raised a problem: "I think their claim is only that it learns faster and offers interpretability, not anything else. If KAN has far fewer parameters than an equivalent NN, that is meaningful. I still find training KAN very unstable."
So, can KAN really have much fewer parameters than the equivalent NN?


There are still questions about this claim. In the paper, the KAN authors stated that with just 200 parameters their KAN could reproduce the mathematical-theorem discovery that DeepMind's MLP achieved with 300,000 parameters. After seeing this result, two students of Georgia Tech Associate Professor Humphrey Shi revisited DeepMind's experiment and found that DeepMind's MLP could match KAN's 81.6% accuracy with just 122 parameters, without any significant modifications to DeepMind's code. To achieve this, they merely reduced the size of the network, adjusted the random seed, and trained for longer.
In response, the authors of the paper also replied positively.
Second, there is no essential difference between KAN and MLP in terms of methodology.
“Yes, this is clearly the same thing. They first do the activation in KAN, then do the linear combination, while in MLP, they first do the linear combination, then do the activation. When you zoom in, it’s basically the same thing. To my knowledge, the main reason to use KAN is for interpretability and symbolic regression.”
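The order-of-operations point in this quote can be made concrete with a schematic sketch (my own; real KAN edges use per-edge splines rather than a single shared ReLU):

```python
import torch
import torch.nn as nn

x = torch.randn(10, 5)
W = nn.Linear(5, 7)

# MLP layer: linear combination first, then activation
mlp_style = torch.relu(W(x))

# KAN layer (schematically): activate the inputs on each edge first,
# then take the linear combination
kan_style = W(torch.relu(x))

print(mlp_style.shape, kan_style.shape)  # both (10, 7)
```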
In addition to questioning the method, researchers also called for a rational evaluation of this paper:


“I think people need to stop viewing the KAN paper as a huge shift in the fundamental unit of deep learning, and instead see it as a good paper on deep learning interpretability. The interpretability of the non-linear function learned on each edge is the major contribution of this paper.”


Third, some researchers say that the idea of KAN is not new.
"People have been studying this since the 1980s. A discussion on Hacker News mentioned an Italian paper on the issue. So it's not really anything new. Forty years later, this is just an old idea coming back around, or one that was rejected and is now being re-examined."


To be fair, the authors of the KAN paper did not hide this.


“These ideas are not new, but I don’t think the author avoided this. He just packaged everything up nicely and did some good experiments on toy data. But that’s also a contribution.”


Meanwhile, Maxout (https://arxiv.org/pdf/1302.4389), a paper by Ian Goodfellow and Yoshua Bengio from over a decade ago, was also brought up, with some researchers noting that although the two are "slightly different, the idea is somewhat similar."
Author: The original research goal was indeed interpretability.
As the discussion intensified, one of the paper's authors, Sachin Vaidya, stepped forward to respond.
On the GitHub homepage, one of the paper’s authors, Liu Ziming, also responded to the evaluations received by this research:
The most common question I’ve been asked recently is whether KAN will become the next generation LLM. I don’t have a clear judgement on this.


KAN is specifically designed for applications that care about high accuracy and interpretability. We indeed care about the interpretability of LLM, but interpretability might mean different things for LLM and science. Do we care about the high accuracy of LLM? The scaling laws seem to imply so, but the accuracy may not be that high. Furthermore, for LLM and science, accuracy might also mean different things.
I welcome criticism of KAN. Practice is the only criterion for testing truth. Many things cannot be known in advance until they are genuinely tried and shown to succeed or fail. Although I would like to see KAN succeed, I am equally curious about its failures.


KAN and MLP are not interchangeable; each has advantages in some scenarios and limitations in others. I would be interested in a theoretical framework that encompasses both, or even in proposing new alternatives (physicists like unified theories, sorry).
The first author of the KAN paper is Liu Ziming, a physicist and machine learning researcher, currently a third-year PhD student at MIT and IAIFI advised by Max Tegmark. His research focuses on the intersection of artificial intelligence and physics.