The following is my understanding of the performance quirks of GPUs that stem from memory, caches, and coalesced memory access. Just like with Primer 01, if you have a decent understanding of CUDA or OpenCL you can skip this.
Ok, buckle up.
Back in the day (I assume, the first computer I remember using had DDR-200) computer memory was FAST. Most of the time the limiting factor was the CPU, though correctly timing video output was also a driving force. As an example, the C64 ran the memory at 2x the CPU frequency so the VIC-II graphics chip could share the CPU memory by stealing half the cycles. In the almost 40 years since the C64, humanity has gotten much better at making silicon and precious metals do our bidding. Feeding data into the CPU from memory has become the slow part. Memory is slow.
Why is memory slow? To be honest, it seems to me that it’s caused by two things:
In general, the farther the ACTUAL ELECTRONS are from the thing doing the math, the slower they are to access.
This leads to an optimization problem: how do you keep the data a program needs close to the processor? Modern processor designers use a complex system of tiered memory, consisting of several layers of small, fast, on-die cache in front of large, slow, distant, off-die memory.
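If you want to see those tiers for yourself, a rough sketch like the following works (plain C#, nothing ILGPU-specific; the working-set sizes are just guesses at typical cache capacities). It chases a random chain through arrays of different sizes, so each load depends on the previous one, and the average access time jumps as the working set falls out of each cache level.

```csharp
using System;
using System.Diagnostics;

class CacheTierSketch
{
    // Chase a random chain through a working set of the given size and
    // return the average nanoseconds per access. Each load depends on the
    // previous one, so prefetching cannot hide the latency.
    static double NsPerAccess(int sizeInBytes)
    {
        int count = sizeInBytes / sizeof(int);
        int[] next = new int[count];

        // Build one big random cycle: next[a] holds the next index to visit.
        int[] order = new int[count];
        for (int i = 0; i < count; i++) order[i] = i;
        var rng = new Random(42);
        for (int i = count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (order[i], order[j]) = (order[j], order[i]);
        }
        for (int i = 0; i < count; i++)
            next[order[i]] = order[(i + 1) % count];

        const int accesses = 1 << 24;
        int index = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < accesses; i++)
            index = next[index];
        sw.Stop();

        GC.KeepAlive(index); // keep the dependency chain from being optimized away
        return sw.Elapsed.TotalMilliseconds * 1_000_000.0 / accesses;
    }

    static void Main()
    {
        // 16 KiB should sit in L1, 4 MiB somewhere around L2/L3,
        // and 256 MiB is firmly in main memory.
        foreach (int kib in new[] { 16, 4 * 1024, 256 * 1024 })
            Console.WriteLine($"{kib,8} KiB: {NsPerAccess(kib * 1024):F2} ns per access");
    }
}
```

Pointer chasing is used here precisely because it defeats prefetching; a plain sequential scan would hide most of the latency.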
A processor can also perform a few tricks to help deal with the fact that memory is slow. One example is prefetching: if a program uses the memory at location X, it will probably use the memory at location X+1 next, so the processor prefetches a whole chunk of memory and puts it in the cache, closer to the processor. That way, if you do need the memory at X+1, it is already in cache.
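As a quick illustration (again a plain C# sketch; actual timings depend entirely on your hardware), the snippet below sums the same 2D array twice: once in memory order, where the prefetched neighbors get used, and once with a large stride, where they mostly don't. The row-major pass is typically several times faster.

```csharp
using System;
using System.Diagnostics;

class PrefetchDemo
{
    const int N = 4096; // 4096 x 4096 ints = 64 MB, far larger than any cache

    static void Main()
    {
        var data = new int[N, N];

        var sw = Stopwatch.StartNew();
        long rowSum = 0;
        for (int i = 0; i < N; i++)       // sequential: the next element is
            for (int j = 0; j < N; j++)   // almost always already in cache
                rowSum += data[i, j];
        Console.WriteLine($"row-major sum:    {sw.ElapsedMilliseconds} ms (sum {rowSum})");

        sw.Restart();
        long colSum = 0;
        for (int j = 0; j < N; j++)       // strided: every access lands on a
            for (int i = 0; i < N; i++)   // different cache line, so the
                colSum += data[i, j];     // prefetched neighbors go unused
        Console.WriteLine($"column-major sum: {sw.ElapsedMilliseconds} ms (sum {colSum})");
    }
}
```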
I am getting off topic. For a more detailed explanation, see this thing I found on Google.
On paper, GPUs have TONS of memory bandwidth; my GPU has around 10x the memory bandwidth my CPU does. Right? Yeah…
If we go back into spherical cow territory and ignore a ton of important details, we can illustrate an important quirk in GPU design that directly impacts performance.
My CPU, a Ryzen 5 3600 with dual-channel DDR4, gets around 34 GB/s of memory bandwidth. The GDDR6 in my GPU, an RTX 2060, gets around 336 GB/s of memory bandwidth.
But let's compare bandwidth per thread.
CPU: Ryzen 5 3600: 34 GB/s / 12 threads ≈ 2.83 GB/s per thread
GPU: RTX 2060: 336 GB/s / (30 SMs × 512 threads¹) ≈ 0.0219 GB/s, or just 22.4 MB/s per thread
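For what it's worth, here is that back-of-the-envelope arithmetic as a tiny C# program, using the spec-sheet bandwidth figures quoted above (not measured values; the 512 threads per SM comes from footnote 1):

```csharp
using System;

class BandwidthPerThread
{
    static void Main()
    {
        // CPU: Ryzen 5 3600, dual-channel DDR4 (~34 GB/s), 12 hardware threads.
        double cpuGBs = 34.0;
        int cpuThreads = 12;
        Console.WriteLine($"CPU: {cpuGBs / cpuThreads:F2} GB/s per thread");

        // GPU: RTX 2060, GDDR6 (~336 GB/s), 30 SMs with at least
        // 16 warps * 32 threads issuing memory accesses (see footnote 1).
        double gpuGBs = 336.0;
        int gpuThreads = 30 * 16 * 32;
        Console.WriteLine($"GPU: {gpuGBs / gpuThreads * 1024:F1} MB/s per thread");
    }
}
```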
In the end, computers need memory because programs need memory. There are a few things I keep in mind as I program that I think help:
Again, this all boils down to the very simple notion that memory is slow, and it gets slower the farther it gets from the processor.
0 This is obviously a complex topic. In general, modern memory has both a bandwidth problem and a latency problem. They are different, but in subtle ways. If you are interested in this, I would do some more research; I am just some random dude on the internet.
1 I thought this would be simple, but after double checking, I found that "How many threads can a GPU run at once?" is a hard question, and also the wrong question to answer. According to the CUDA manual, an SM (Streaming Multiprocessor) can have at most 16 warps executing simultaneously, with 32 threads per warp, so it can be issuing at least 512 memory accesses per cycle. You may have more warps scheduled due to memory / instruction latency, but a minimum estimate will do. This still provides a good illustration of how little memory bandwidth you have per thread. We will get into more detail in a grouping tutorial.