Memory Subsystem Overview

We mentioned how changes to the module design can require changes to the memory controller as well. When an address arrives at the memory, it does not simply appear there directly from the CPU; we are really talking about several steps. First, the CPU sends the request to the cache, and if the data is not in the cache, the request is forwarded to the memory controller via the Front Side Bus (FSB). (In some newer systems like the Athlon 64, requests may arrive via a HyperTransport bus, but the net result is basically the same.) The memory controller then sends the request to the memory modules over the memory bus. Once the data is retrieved internally on the memory module, it gets sent from the RAM via the memory bus back to the memory controller. The memory controller then sends it onto the FSB, and eventually, the requested data arrives at the CPU.
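
To make the chain of steps a bit more concrete, here is a minimal sketch that walks a request through the same stages described above. The stage names follow the text, but the nanosecond figures are invented placeholders for illustration only - they are not measurements of any real chipset.

```python
# Illustrative model of the CPU-to-RAM round trip described above.
# All latency numbers are made up for the sake of the example.

ROUND_TRIP_STAGES = [
    ("CPU -> cache lookup (miss)",          1.0),   # request checks the cache first
    ("cache -> memory controller (FSB)",    5.0),   # forwarded over the Front Side Bus
    ("controller -> modules (memory bus)",  5.0),   # command sent over the memory bus
    ("DRAM internal access",               40.0),   # row/column access inside the module
    ("modules -> controller (memory bus)",  5.0),   # data returned over the memory bus
    ("controller -> CPU (FSB)",             5.0),   # data placed back on the FSB
]

def total_round_trip_ns(stages):
    """Sum the per-stage delays to get the full request-to-data latency."""
    return sum(delay for _, delay in stages)

if __name__ == "__main__":
    for name, delay in ROUND_TRIP_STAGES:
        print(f"{name:40s} {delay:5.1f} ns")
    print(f"{'total':40s} {total_round_trip_ns(ROUND_TRIP_STAGES):5.1f} ns")
```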

Note that the data could also be requested/sent somewhere else. DMA (Direct Memory Access) allows other devices such as network adapters, sound cards, graphics cards, controller cards, etc. to send requests directly to the memory controller, bypassing the CPU. In this overview, we were talking about the CPU to RAM pathway, but the CPU could be replaced by other devices. Normally, the CPU generates the majority of the memory traffic, and that is what we will mostly cover. However, there are other uses of the RAM that can come into play, and we will address those when applicable.

Now that we have explained how the requests actually arrive, we need to cover a few details about how the data is transmitted from the memory module(s). As we said before, when the requested column is ready to be transmitted back to the memory controller, it is sent in "bursts". What this means is that data will be sent on every memory bus clock edge - think of it as a "slot" - for the RAM's burst length. If the memory bus is running at a different speed than the FSB, though - especially if it is running slower - there can be some additional delays. The significance of these delays varies by implementation, but at best, you will end up with some "bubbles" (empty slots) on the FSB. Consider the following specific example.
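
As a rough illustration of those "bubbles", the short sketch below compares how many data slots per second a pumped bus offers against how many the memory bus can actually fill, using the same bus speeds as the example that follows. The helper name slots_per_second is just something we made up for this sketch.

```python
def slots_per_second(clock_mhz, transfers_per_clock):
    """Each clock edge that can carry data counts as one 'slot'."""
    return clock_mhz * 1_000_000 * transfers_per_clock

# Quad-pumped 200 MHz FSB vs. double-pumped 166 MHz memory bus (DDR333),
# matching the example below. Both buses are assumed to be 64 bits wide,
# so a slot on one bus carries the same amount of data as a slot on the other.
fsb_slots = slots_per_second(200, 4)   # 800 million slots per second
ram_slots = slots_per_second(166, 2)   # ~332 million slots per second

empty_fraction = 1 - ram_slots / fsb_slots
print(f"FSB slots/s: {fsb_slots:,}   RAM slots/s: {ram_slots:,}")
print(f"Roughly {empty_fraction:.0%} of the FSB slots would go unfilled ('bubbles')")
```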

On Intel's quad-pumped bus, each non-empty transmission needs to be completely full, so all four slots need to have data. (There are caveats that allow this rule to be "bent", but they incur a loss of performance and so they are avoided whenever possible.) If you have a quad-pumped 200 MHz FSB (the current P4 bus) and the RAM is running on a double-pumped 166 MHz bus, the FSB is capable of transmitting more data than the RAM is supplying. In order to guarantee that all four slots on an FSB clock cycle contain data, the memory controller needs to buffer the data to make sure an "underrun" does not occur - i.e. the memory controller starts sending data and then runs out after the first one or two slots. Each FSB cycle comes at 5 ns intervals, and with a processor running at 3.0 GHz, a delay of 5 ns could mean as many as 15 missed CPU cycles!
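
The arithmetic behind that "15 missed cycles" figure is easy to check, and the same numbers show how long the memory controller has to buffer before it can guarantee four full slots. The sketch below assumes a 64-bit wide memory bus delivering one slot's worth of data per transfer; the variable names are ours.

```python
FSB_CLOCK_HZ = 200e6              # quad-pumped 200 MHz FSB
CPU_CLOCK_HZ = 3.0e9              # 3.0 GHz Pentium 4
RAM_TRANSFERS_PER_S = 166e6 * 2   # double-pumped 166 MHz memory bus (DDR333)

fsb_cycle_ns = 1e9 / FSB_CLOCK_HZ                    # 5 ns between FSB cycles
cpu_cycles_lost = fsb_cycle_ns * CPU_CLOCK_HZ / 1e9  # CPU cycles per FSB cycle
print(f"One FSB cycle = {fsb_cycle_ns:.1f} ns = {cpu_cycles_lost:.0f} CPU cycles at 3.0 GHz")

# Gathering four slots' worth of data takes four memory bus transfers,
# but only a single FSB cycle to send back out.
time_to_fill_ns = 4 / RAM_TRANSFERS_PER_S * 1e9
print(f"Buffering four slots from the memory bus takes ~{time_to_fill_ns:.1f} ns, "
      f"vs. {fsb_cycle_ns:.1f} ns to transmit them over the FSB")
```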

There are a couple of options to help speed up the flow of data from the memory controller to the FSB. One is to use dual-channel memory, so the buffer will fill up in half the time. This helps to explain why Intel benefits more from dual-channel RAM than AMD: Intel's FSB and memory controller are really designed for the higher bandwidth. Another option is simply to get faster RAM - fast enough to match the bandwidth of the FSB. Either one generally works well, but having a memory subsystem with less bandwidth than what the FSB can use is not an ideal situation, especially for the Intel design. This is why most people recommend against running your memory and system buses asynchronously. Running RAM that provides a higher bandwidth than what the FSB can use does not really help, other than to reduce latencies in certain situations. If the memory can provide 8.53 GB/s of bandwidth and the FSB can only transmit 6.4 GB/s, the added bandwidth generally goes to waste. For those wondering why benchmarks using DDR2-533 with an 800 FSB P4 do not show much of an advantage for the faster memory, this is the main reason. (Of course, on solutions with integrated graphics, the additional memory bandwidth could be used for graphics work, and in servers, the additional bandwidth can be helpful for I/O.)
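
The bandwidth figures quoted above fall straight out of the bus widths and effective transfer rates. Here is a quick sketch of that math - peak theoretical numbers only, and the helper function is our own invention:

```python
def peak_bandwidth_gbps(effective_mhz, bus_width_bits=64, channels=1):
    """Peak theoretical bandwidth in GB/s (using 1 GB/s = 10^9 bytes/s)."""
    return effective_mhz * 1e6 * (bus_width_bits // 8) * channels / 1e9

fsb_800     = peak_bandwidth_gbps(800)              # quad-pumped 200 MHz FSB
ddr2_533_dc = peak_bandwidth_gbps(533, channels=2)  # dual-channel DDR2-533
ddr333_dc   = peak_bandwidth_gbps(333, channels=2)  # dual-channel DDR333

print(f"800 MHz FSB:           {fsb_800:.2f} GB/s")
print(f"Dual-channel DDR2-533: {ddr2_533_dc:.2f} GB/s (excess over the FSB is wasted)")
print(f"Dual-channel DDR333:   {ddr333_dc:.2f} GB/s")
```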

If you take that entire description of the memory subsystem, you can also see how AMD was able to benefit by moving the memory controller onto the CPU die. Now, the delays associated with the transmission of data over the FSB are almost entirely removed. The memory controller still has to do work, but with the controller running at CPU clock speeds, it will be much faster than before. The remaining performance deficit that Athlon 64 and Opteron processors suffer when running slower RAM can be attributed to the loss of bandwidth and the increased latencies, which we will discuss more in a moment. There are a few other details that we would like to mention first.

Comments

  • ariafrost - Tuesday, September 28, 2004 - link

    Good choice. You really don't want to get generic RAM... it is generally slow, unstable, and gives you the much-hated BSOD... I've only bought CAS 2 RAM (Corsair XMS) but I may consider buying some CAS 2.5 if the price delta isn't too great.
  • IKnowNothing - Tuesday, September 28, 2004 - link

    It's like you read my mind. I'm purchasing an Athlon 64 3500+ and wasn't sure if I should purchase generic RAM or high performance RAM.

    Cheers.
