Creating a GPU from scratch, following Nvidia CUDA’s design, took only two weeks.

Latest release: 1 yr ago (2024) · Lyan23

It all started with learning the basics of how chips work.

“I spent two weeks building a GPU from scratch with no experience, which was much more difficult than imagined.”

People keep saying that the supply of Jensen Huang’s chips can’t meet demand and that everyone is itching to build their own GPUs. Now someone has actually tried it.

Recently, Adam Majmudar, a founding engineer at a U.S. web3 development company, shared how he built a GPU by hand, winning a lot of praise online. Surprisingly, he completed the feat in just two weeks. In a thread on Twitter/X, Majmudar posted updates as he went, walking readers through the entire process step by step.

The homemade GPU has, of course, been published on GitHub, where the project now has 5,300 stars.

Project link: https://github.com/adam-maj/tiny-gpu

To be clear, the project currently stands at a Verilog design whose chip layout was verified with the OpenLane EDA software. Next, the GPU will be submitted for tape-out through Tiny Tapeout 7, so it is set to become a physical chip in the coming months.

Majmudar has detailed his workflow for designing the GPU. As a “from scratch” project, it naturally required a lot of research and thinking before the first exploratory step. Because the field is dominated by proprietary technology, GPUs are a relatively hard subject to study: difficult to reason about, and even more difficult to put into practice.

What are the steps to build a GPU by hand?

For Majmudar, the process involved even more steps than usual, since he had no technical foundation and had to start by learning the basics of GPU architecture.

He began with the GPU software model, studying Nvidia’s CUDA framework. This led him to the single instruction, multiple data (SIMD) programming pattern used for writing GPU programs (known as kernels).
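As a rough illustration of that pattern, the sketch below emulates it in plain Python: one kernel body is run once per thread, and only the thread index differs between runs. The names `add_kernel` and `launch` are hypothetical helpers made up for this example, and the threads run sequentially here, whereas real GPU hardware runs them in parallel.

```python
from typing import Callable, List


def add_kernel(thread_id: int, a: List[int], b: List[int], out: List[int]) -> None:
    """Kernel body: every thread runs this same code on its own element."""
    out[thread_id] = a[thread_id] + b[thread_id]


def launch(kernel: Callable[..., None], num_threads: int, *args) -> None:
    """Hypothetical launcher: emulates num_threads threads one after another."""
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [0] * len(a)
launch(add_kernel, len(a), a, b, out)
print(out)  # [11, 22, 33, 44]
```

The kernel never loops over the data itself; the parallelism comes entirely from launching many copies of the same instructions.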

With this background, Majmudar dug into the core components of a GPU: global memory, compute cores, layered caches, the memory controller, and program scheduling.

Then, within each compute core, we also need to understand the key units: including registers, local/shared memory, load/store units (LSU), compute units, schedulers, fetchers, and decoders.

Okay, now that you understand modern GPU architecture, let’s get our hands dirty and build a GPU.

Here, Majmudar notes that due to the high complexity, we must simplify the GPU to a level that a novice can design, otherwise the project timeline will explode.

Next is creating your own GPU architecture. Our goal is to create a minimal GPU that highlights the core concepts of the GPU and eliminates unnecessary complexity so that others can more easily understand the GPU.

Majmudar says designing your own GPU architecture is an incredibly practical experience.

He learned by doing, and decided to emphasize the following points in the design:

Parallelization – Implementing the SIMD pattern in hardware;
Memory access – Observing how the GPU addresses the challenge of accessing large amounts of data from slow and bandwidth-limited memory;
Resource management – Maximizing resource utilization and efficiency.

After multiple iterations on this architecture, Majmudar decided to focus on general-purpose computing on graphics processing units (GPGPU), aiming at a broader range of use cases such as machine learning.

In that sense, the design keeps pace with the times.

Everything here is in its simplest form.

The third step is to write custom assembly language for this GPU.

Majmudar said that one of the most important requirements was that his GPU could actually execute kernels written in the SIMD programming model. To achieve this, he had to design his own instruction set architecture (ISA) in which kernels could be written. He created a small 11-instruction ISA inspired by the LC4 ISA, then wrote some simple matrix math kernels as a proof of concept.
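To make the idea of an instruction format concrete, here is a minimal sketch of how a tiny ISA can pack an instruction into a fixed-width word. Everything specific in it is an assumption for illustration: the 16-bit width, the 4-bit field layout, and the opcode numbers are hypothetical and are not taken from Majmudar’s ISA table.

```python
from typing import Tuple

# Hypothetical opcode numbers, chosen only for this example.
OPCODES = {"NOP": 0x0, "ADD": 0x3, "MUL": 0x5, "RET": 0xF}
MNEMONICS = {code: name for name, code in OPCODES.items()}


def encode(op: str, rd: int = 0, rs: int = 0, rt: int = 0) -> int:
    """Pack an instruction as [opcode | rd | rs | rt], 4 bits per field."""
    for field in (rd, rs, rt):
        if not 0 <= field <= 0xF:
            raise ValueError("register index must fit in 4 bits")
    return (OPCODES[op] << 12) | (rd << 8) | (rs << 4) | rt


def decode(word: int) -> Tuple[str, int, int, int]:
    """Unpack a 16-bit word back into (mnemonic, rd, rs, rt)."""
    return (
        MNEMONICS[(word >> 12) & 0xF],
        (word >> 8) & 0xF,
        (word >> 4) & 0xF,
        word & 0xF,
    )


word = encode("ADD", rd=2, rs=0, rt=1)
print(f"{word:016b}")  # 0011001000000001
print(decode(word))    # ('ADD', 2, 0, 1)
```

A fixed layout like this keeps the hardware decoder simple, since each field can be read by slicing a fixed range of bits out of the word.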

This is the complete table of the ISA proposed by Adam Majmudar, which includes the exact structure of each instruction.

Next, Majmudar wrote two matrix math kernels that run on his GPU. These matrix addition and multiplication kernels demonstrate the GPU’s key functionality and serve as evidence that it is applicable to graphics and machine learning tasks.

The kernels written for matrix addition and multiplication.

Building the GPU in Verilog was the most challenging part of the project. Majmudar learned a lot, but he also had to rewrite the code many times. Notably, he received advice and help from the renowned American hacker George Hotz.

Initially, he implemented the global memory as SRAM. Feedback indicated that this contradicted the purpose of building a GPU – the biggest design challenge of a GPU is managing access to asynchronous memory (DRAM) with limited bandwidth.

Therefore, Majmudar eventually rebuilt the design using external asynchronous memory, and finally realized that a memory controller also needed to be added.
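Why a memory controller becomes necessary can be seen in a small cycle-level model. The sketch below is a simplified assumption, not Majmudar’s Verilog: several requesters submit reads, the memory has only a few channels, and each read takes a fixed number of cycles to come back. The class name, the latency of 3 cycles, and the channel counts are all made up for the example.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class MemoryController:
    """Toy arbiter between many requesters and a few slow memory channels."""

    def __init__(self, memory: List[int], num_channels: int, latency: int) -> None:
        self.memory = memory
        self.latency = latency
        self.pending: Deque[Tuple[int, int]] = deque()  # (consumer, address)
        # Each channel is idle (None) or holds [consumer, address, cycles_left].
        self.channels: List[Optional[List[int]]] = [None] * num_channels
        self.responses: Dict[int, int] = {}

    def submit(self, consumer: int, address: int) -> None:
        self.pending.append((consumer, address))

    def busy(self) -> bool:
        return bool(self.pending) or any(c is not None for c in self.channels)

    def tick(self) -> None:
        """Advance one clock cycle: finish due reads, then refill idle channels."""
        for i, slot in enumerate(self.channels):
            if slot is not None:
                slot[2] -= 1
                if slot[2] == 0:
                    consumer, address, _ = slot
                    self.responses[consumer] = self.memory[address]
                    self.channels[i] = None
            if self.channels[i] is None and self.pending:
                consumer, address = self.pending.popleft()
                self.channels[i] = [consumer, address, self.latency]


def cycles_to_serve(num_channels: int) -> int:
    """Serve four reads and count how many cycles that takes."""
    mc = MemoryController([10, 20, 30, 40], num_channels=num_channels, latency=3)
    for consumer in range(4):
        mc.submit(consumer, address=consumer)
    cycles = 0
    while mc.busy():
        mc.tick()
        cycles += 1
    assert mc.responses == {0: 10, 1: 20, 2: 30, 3: 40}
    return cycles


print(cycles_to_serve(2), cycles_to_serve(4))
```

With two channels the four reads have to queue and are served in two waves, while four channels serve them in one. That queuing is the kind of bandwidth limit the feedback pointed to.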

Initially, Majmudar implemented the GPU with a warp scheduler, which was a big mistake: it was too complicated and unnecessary for the project. George Hotz provided timely feedback on this, but at first Majmudar lacked the background knowledge to fully understand it, so he spent a lot of time trying to build a warp scheduler before he realized the problem.

Moreover, Majmudar did not correctly implement the scheduling in each core in his initial design, so he had to go back and incrementally design the core execution to achieve the correct control flow.

Ultimately, Majmudar’s third rewrite of the code met the goal and fixed the execution scheduling of the cores.

This is the execution process of a single thread in the GPU built with Verilog. It works very much like a CPU.
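The CPU-like flow of a single thread can be sketched as a fetch, decode, execute loop. The instruction names and the tuple-based program format below are hypothetical, chosen only to keep the example short; they are not Majmudar’s actual instructions.

```python
from typing import List, Tuple

Instruction = Tuple  # e.g. ("ADD", rd, rs, rt); format is hypothetical


def run_thread(program: List[Instruction], num_registers: int = 4) -> List[int]:
    """Run one thread to completion and return its register file."""
    regs = [0] * num_registers
    pc = 0  # program counter
    while True:
        instr = program[pc]      # fetch
        op, *operands = instr    # decode
        if op == "CONST":        # execute: load an immediate value
            rd, imm = operands
            regs[rd] = imm
        elif op == "ADD":
            rd, rs, rt = operands
            regs[rd] = regs[rs] + regs[rt]
        elif op == "MUL":
            rd, rs, rt = operands
            regs[rd] = regs[rs] * regs[rt]
        elif op == "RET":        # thread is done
            return regs
        else:
            raise ValueError(f"unknown instruction: {op}")
        pc += 1                  # move on to the next instruction


program = [
    ("CONST", 0, 6),
    ("CONST", 1, 7),
    ("MUL", 2, 0, 1),
    ("ADD", 3, 2, 0),
    ("RET",),
]
print(run_thread(program))  # [6, 7, 42, 48]
```

In a GPU, many such threads step through the same program together, each with its own registers.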

After a lot of redesign, the GPU can finally be seen running the matrix addition and multiplication kernels. Seeing everything work properly and the GPU output the correct results is an incredible feeling.

Then, we also need to take the design through the EDA process to convert it into a complete chip layout.

The complete Verilog design was taken through OpenLane EDA, targeting the Skywater 130nm process node (for Tiny Tapeout). Majmudar noted that some design rule checks (DRC) failed and that rework is required.

After two weeks of effort, the 3D visualization of Majmudar’s GPU design is shown in the figure below:

Both the CPU and GPU have been made.

Adam Majmudar said that within a short period he had learned the basics of chip architecture, studied the details of chip manufacturing, and used EDA tools to complete his first full chip layout, a hand-crafted CPU.

As for how to “hand-craft a chip,” Majmudar summarized the process in six main steps:

1. Learn the basics of chip architecture;
2. Learn the basics of chip manufacturing, including materials, wafer preparation, patterning, and packaging;
3. Begin electronic design automation by building CMOS transistors layer by layer;
4. Create the first complete circuit with Verilog;
5. Implement simulation and formal verification for the circuit;
6. Design a complete chip layout, using OpenLane (an open-source EDA tool) for design and optimization.

Among engineers, people occasionally try “hand-crafting chips” as the most hardcore way to understand the basics of chip architecture. In the past, however, the difficulty involved meant that most of them attempted CPUs.

In 2020, the University of Chinese Academy of Sciences announced the results of the first “One Student One Chip” initiative, which sparked heated discussion. It was the first such program in China with tape-out as its goal. In it, five undergraduates from the class of 2016 led the design and tape-out of a 64-bit RISC-V processor SoC.

This project also caught the attention of Professor David Patterson, one of the founders of the RISC architecture and a Turing Award laureate.

Thanks to new industry trends such as open-source chips and agile design, the barrier to entry for chip design keeps getting lower.
Now that there is a precedent for a hand-crafted GPU, perhaps we will see more, and more powerful, homemade chips.