Excel file to find 1st, 4th, 8th word Latency!

SomeGuy133

I found the 4th and 8th word equations on Wikipedia and added them to my Excel sheet. So I'm making a thread to keep it in; please share it so people can figure out exactly what RAM they want. I'll update this as I use it, but it's fairly complete since it can do all the major words now, and you start to see some cool things in how MT/s affects 4th and 8th word latencies. It is interesting how DDR4 vs DDR3 can really help in terms of having higher bandwidth.
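For anyone who wants the math without the spreadsheet, the first/4th/8th word formulas from the Wikipedia CAS latency article can be sketched in Python. This is a simplified sketch assuming a standard double-data-rate bus (two transfers per I/O clock) and ignoring other timings like tRCD:

```python
# First/Nth word latency for DDR SDRAM, per the formulas on
# Wikipedia's CAS latency article. DDR transfers 2 words per clock,
# so the I/O clock in MHz is MT/s divided by 2.

def word_latency_ns(mt_per_s, cas_latency, word_number=1):
    """Nanoseconds until the Nth word of a burst arrives.

    The first word takes CL clock cycles; each later word in the
    burst adds half a cycle (one transfer on a double data rate bus).
    """
    clock_mhz = mt_per_s / 2
    cycles = cas_latency + (word_number - 1) / 2
    return cycles / clock_mhz * 1000  # 1/MHz -> microseconds -> ns

# Example kits:
ddr4_first  = word_latency_ns(3200, 14)     # 8.75 ns
ddr3_first  = word_latency_ns(2400, 10)     # ~8.33 ns
ddr4_eighth = word_latency_ns(3200, 14, 8)  # 10.9375 ns
```

The example numbers match the intuition in the thread: DDR4-3200 CL14 and DDR3-2400 CL10 land within about half a nanosecond of each other on first word latency.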

I am curious what a word actually is, though, and how it relates to real data and access times. Can anyone answer that? I have been wanting to know how bandwidth may affect access time for larger files, and at what file size access changes from being latency-dependent to bandwidth-dependent.

It ain't pretty but no one else made one so deal with it :p
https://www.dropbox.com/s/7un7fpwg5y30wuj/DDR RAM True Latency Calculator.xlsx?dl=0

UPDATED LINK: if it goes down just post in thread or PM me.

The new 3200 MT/s CL14 is comparable to 2400 MT/s CL10, so high-end DDR4 is matching high-end DDR3 now. That new DDR4 released in January, IIRC. There were supposedly some magical DDR3 kits in the past with better latency, but I have never seen them in the wild, so those two speeds are for the most part the best you can get ATM.
 
I don't know how much of this you know so forgive me if you find this redundant.

A word is the most commonly used size of data in a processor. This isn't necessarily the largest size the processor can handle (e.g. floating-point numbers). Generally, a 32-bit processor has registers of 32 bits and a 64-bit processor has registers of 64 bits, although I believe current x86-64 processors have a set of each.

To understand what is meant by first word, fourth word, etc., you have to understand how a processor reads from memory. If the processor wants a data value that is not in one of its registers, it tries to get the data from its smallest cache (L1). If the data is not in the L1 cache, this request is called an L1 cache miss, and the L1 cache tries to find the data in a higher (L2) cache. If the data is not in any of the caches, the highest cache will attempt to retrieve the data from main memory.

Data in caches is typically stored in contiguous multi-word sections called blocks to take advantage of spatial locality. This simply means that data close to an accessed data address is more likely to be accessed. A typical block size in a cache is 64 bytes, which is 512 bits or eight 64-bit words. Caches operate on blocks, so if there is a miss on a single word, the whole block the word is in will be copied into the cache from the higher-level cache or memory.
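As a rough sketch of how a block-based cache splits an address: the 32 KiB direct-mapped cache below is a made-up example, not any particular CPU, but the 64-byte block size matches the paragraph above.

```python
# Splitting a memory address into cache fields, assuming a 64-byte
# block and a hypothetical 32 KiB direct-mapped cache (512 blocks).

BLOCK_SIZE = 64                        # bytes per block (8 x 64-bit words)
NUM_BLOCKS = 32 * 1024 // BLOCK_SIZE   # 512 blocks in the cache

def cache_fields(address):
    offset = address % BLOCK_SIZE       # which byte within the block
    block_addr = address // BLOCK_SIZE  # which block of memory
    index = block_addr % NUM_BLOCKS     # which cache slot it maps to
    tag = block_addr // NUM_BLOCKS      # identifies the block in that slot
    return tag, index, offset

# Two addresses inside the same 64-byte block share a tag and index,
# so a miss on one brings the other into the cache too.
```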

There is a simple optimization that is widely used that comes from this setup called critical word first. When there is a read miss, the processor is waiting for the one word and time is being wasted. Instead of waiting for the entire block to be copied, the cache or memory will give the one word value that the processor is waiting on, then update the rest of the block. This way, the processor can get back to work ASAP.
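The delivery order under critical word first can be sketched as a wrapped burst (this is a simplified model assuming an 8-word block; real DRAM burst ordering has some extra rules):

```python
# Critical-word-first ("wrapped") burst order: the word the processor
# is stalled on is delivered first, then the rest of the block follows,
# wrapping around to the start of the block.

def wrap_burst_order(critical_word, burst_length=8):
    return [(critical_word + i) % burst_length for i in range(burst_length)]

# Requesting word 5 of an 8-word block:
# wrap_burst_order(5) -> [5, 6, 7, 0, 1, 2, 3, 4]
```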

If the processor only wants a single data value from a block (low spatial locality), your best bet would be to have as low a latency for the critical word as possible. If the data has a high degree of spatial locality, the rest of the block will probably be accessed soon after the critical word and the time to get the whole block is more important than the time to get the one critical word.

Example: If your program needs to read four words from a particular cache block, and the program cannot continue until all four are read, then the time to get the fourth word is much more important than the first word latency. On the other hand, if your program only needs a single word from a block, the first word latency is very important and the rest are irrelevant.

Of course, this is made much more complex when you take into account multiple cores, out of order execution, etc.
 
wow thanks for the response!

So the first word is 64 0s/1s, right? And 4 words is 256 0s/1s, and so on?

Put another way: 1 word is 8 bytes, 4 words is 32 B, and 8 words is 64 B?

If i got that right thanks!

So how does out-of-order work, and stuff like that? In your opinion, what is the most common request: first word or larger?

Does it go larger than 8 words? I would assume it would be based on the largest die or something like that. Not sure if there is a single channel or multiple channels per die... not sure how memory functions in the grand scheme of things in that regard.
 
You are correct with the 64-bit word being 64 1's and 0's which is the same as 8 bytes ... 4 words is 32 bytes or 256 1's and 0's :)

With out-of-order execution the processor can continue executing instructions even after a read miss, since it recognizes which instructions do not depend on the miss value. This lessens the impact of memory accesses somewhat. Similarly, simultaneous multi-threading can allow a processor to execute another thread's instructions while waiting (Intel's version is called hyperthreading and allows 2 threads per core), allowing work to be done while waiting for a memory access.

The cache block sizes are important, and bigger is not necessarily better. Choosing the right block size is a complex tradeoff between several factors. For instance, a larger block size allows for better spatial locality, but with a constant cache size the larger blocks result in fewer blocks. However, generally designers try to take advantage of increasing memory bandwidth with larger block sizes.
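The capacity-vs-block-count side of that tradeoff is simple arithmetic; here is a sketch with an assumed 32 KiB cache (the size is arbitrary, chosen only for illustration):

```python
# With a fixed cache capacity, doubling the block size halves the
# number of blocks the cache can hold, so fewer distinct regions of
# memory can be cached at once.

CACHE_BYTES = 32 * 1024  # hypothetical 32 KiB cache, capacity held fixed

def num_blocks(block_size):
    return CACHE_BYTES // block_size

# 32 B blocks -> 1024 blocks; 64 B -> 512; 128 B -> 256
```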

Multiple cores are a significant factor in the block size being 8 words. With multiple cores that each have their own cache, keeping a data value consistent across each core's cache becomes a major problem. If core 2 requests a value that core 1 has the most recent version of, the data value could be in the L1, L2, or L3 cache. It is simpler to keep the block size consistent across all levels of the cache, and this probably has something to do with Intel keeping the block size the same in all its caches. An example of differing block sizes is the early Pentium 4s, which had 64-byte L1 cache blocks and 128-byte L2 cache blocks.

There are a lot of terms used for the way DRAM is organized, and they can be confusing: channels, banks, rows, columns, etc. Dies are by definition single pieces of silicon and have capacities measured in gigabits. Parallelism occurs at much smaller levels, I believe. I don't know the intricacies of DRAM, but a good source is Memory Systems: Cache, DRAM, Disk by Bruce Jacob, Spencer W. Ng, and David T. Wang. They go into good detail.

For good technical info on overall computer design I highly recommend the books by Hennessy and Patterson. One is called Computer Architecture and the other is called Computer Organization and Design. They go into great detail and are pretty accessible in my opinion.
 
Cool, thanks! So are there bigger words than 8 in memory, or is that the largest?

And what's more influential in performance: 1, 4, 8, or what word size? For:
1. OS operations (Explorer and basic OS functionality)
2. Browsers
3. Larger programs

I understand if you can't answer that question. It is a very deep technical question, so I understand if you don't have a clue. If you do know the answer, state whether it is a fact, an educated guess, or a shot in the dark, so I know how certain you are and don't misunderstand/misinterpret something. I don't mind educated guesses as long as they are stated as such. Educated guesses can be informative, and I'll take whatever I can get :D
 
A "word" is whatever you define it to be.

For most systems, that's equal to the internal bus size of the processor - the amount of data that can be transferred into or out of a register in a single operation. That isn't necessarily the same size as a register, but often it will be.

Now, x86 assembly defines "word" as a 16-bit data type. It also defines double-word and quad-word data types.

So it really depends upon the context in which it's used.
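A quick way to check those x86 data-type sizes is Python's struct module, whose standard-size format codes map onto the same 16/32/64-bit widths:

```python
import struct

# x86 assembly's historical data-type names vs their sizes in bytes.
# '<H' is a 16-bit unsigned value, '<I' 32-bit, '<Q' 64-bit.
word_bytes  = struct.calcsize('<H')  # WORD:  2 bytes (16 bits)
dword_bytes = struct.calcsize('<I')  # DWORD: 4 bytes (32 bits)
qword_bytes = struct.calcsize('<Q')  # QWORD: 8 bytes (64 bits)
```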

Performance based on word size: again, this depends largely upon context, but using the generic definition that a word equals the internal bus size, I can't think of a situation where a smaller bus size is faster, unless you are getting into physical layout and the physics of silicon design, where more traces start to limit frequency response. It takes one operation to transfer a word, and you can't execute in less than one cycle, so transferring something smaller than a word doesn't save you any time. I can't think of any benefit to having a smaller word size. That doesn't mean there isn't some case where there is one, and maybe smarter people than me can explain.

In general, the larger the bus, the larger the word size, and the faster all processes. It's not quite the same as the cache block size, where you can get too large.
 
I wasn't asking about bus size. What I was trying to ask... I might have phrased it badly for all I know -_-

So what I was asking is: for Windows performance, what is used more? 1-word accesses, or 4 words, or 8 words, and so on. When the OS needs to access memory, what is the distribution of requests? Is it just fetching 1 word, or a set of words? I know this is kind of an extreme question with tons of variables, but has anyone figured out the common distribution of requests?

Making this up as an example:
So like for Win 7 the OS itself makes request that are:
1 word = 50%
4 word = 15%
8 word = 25%
>8 words = 10%

Same thing for browsers, games, and other programs: are they making tiny fetches or large fetches? Is 1-word latency most important for a snappy system, or is larger-word latency better?

Is there even a way to monitor this?

I am interested because I want to know what is best to focus on for memory: which latency provides the best overall performance. As you can see, I'm in deep and love to actually understand this stuff. I am an avid seeker of ultra-low response times :D. In the near future I'm going to figure out how to strip Win7 of "fixed" animations. I have a feeling Windows makes your OS less snappy due to animations being fixed in certain areas.
 