So, the maximum real RAM bandwidth for both reading and writing on K7 is 1647 MB/s (77% of the theoretical maximum of 2133 MB/s). On Opteron the maximum real RAM bandwidth is 5150 MB/s for reading (98% of the theoretical maximum of 5250 MB/s) and 4880 MB/s for writing (93% of the theoretical maximum).

D-Cache/RAM Latency

This test estimates the average latency of every D-cache level and of RAM.

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

The L1 cache latency is 3 clocks for all access methods (forward, backward, random). The L2 cache latency of the Athlon XP is 24 clocks for forward access (because of overheads in the implementation of the Hardware Prefetch algorithms) and 20 clocks for backward/random access. For Opteron it is 17 clocks on average in all cases. In both cases this is a bit higher than the specified minimal latency; we will try to reach it in the next test.

Now let's estimate the average memory access latency. On K7 it is about 205 clocks (133 ns) for forward access and about 300 clocks (195 ns) for backward/random access, which means that the Hardware Prefetch mechanism works well for forward RAM access. Nevertheless, the RAM latency of K8 looks much better. First of all, Hardware Prefetch works excellently for both forward and backward access (the memory latency is only 50 clocks, i.e. 28 ns!). The random access latency is higher - 144 clocks (80 ns). The gradual growth of this value in both cases (at block sizes of 1 MB and over on K7, 2 MB and over on K8) is caused by the growing number of L2 D-TLB misses: the L2 D-TLB is large enough to cover random addressing of blocks up to this size, but not larger ones.
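The clock-to-nanosecond figures quoted above, and the block sizes at which L2 D-TLB misses set in, can be checked with a little arithmetic. The sketch below is only an illustration: the function names are made up for this example, the core clocks (1533.3 MHz for the Athlon XP 1800+, 1800 MHz for the Opteron 244) and the 4 KB page size are assumptions, and the L2 D-TLB sizes (256 and 512 entries) are the ones measured in the D-TLB section of this article.

```python
# Hypothetical helpers, not part of RMMA.

def clocks_to_ns(clocks, cpu_mhz):
    """Convert a latency in CPU clocks to nanoseconds."""
    return clocks * 1000.0 / cpu_mhz


def tlb_reach_kb(entries, page_kb=4):
    """Amount of memory (in KB) a TLB can map, assuming 4 KB pages."""
    return entries * page_kb


K7_MHZ = 1533.3   # assumed Athlon XP 1800+ core clock
K8_MHZ = 1800.0   # assumed Opteron 244 core clock

if __name__ == "__main__":
    for label, clocks, mhz in [
        ("K7 forward", 205, K7_MHZ),
        ("K7 backward/random", 300, K7_MHZ),
        ("K8 forward/backward", 50, K8_MHZ),
        ("K8 random", 144, K8_MHZ),
    ]:
        print(f"{label}: {clocks} clocks = {clocks_to_ns(clocks, mhz):.1f} ns")

    # L2 D-TLB reach: the block size beyond which random access starts to miss.
    print("K7 L2 D-TLB reach:", tlb_reach_kb(256), "KB")
    print("K8 L2 D-TLB reach:", tlb_reach_kb(512), "KB")
```

Under these assumptions the reach of the L2 D-TLB comes out at 1 MB for K7 and 2 MB for K8, which agrees with the block sizes at which the random access latency starts to grow.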
Minimal L2 D-Cache/RAM Latency

This test estimates the minimal latency of the L2 cache.

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

Note that in both cases the minimal L2 latency (11 clocks on AMD K7, 12 clocks on AMD K8) is reached in all access modes with 24 NOPs inserted. Since every NOP (implemented as or eax, edx) takes exactly one clock to execute on AMD K7/K8, this corresponds to 24 clocks of L1-L2 bus unloading between two successive accesses. We cannot yet explain the jumps at 33 NOPs (K7) and 38 NOPs (K8), nor the behaviour of the curves before the "24 NOP" mark.

Let's estimate the minimal RAM latency with the Minimal RAM Latency, 4M Block preset. Taking into account the latency results obtained above, we reduce the block size to 1 MB for K7 and to 2 MB for K8 in order to minimize the losses caused by L2 TLB misses.

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

First comes the K7 platform. As the L2-RAM bus is unloaded more and more, the forward access latency gradually falls. This test doesn't show whether the minimal value is reached, but a separate test demonstrates that it is reached with 380 NOPs between every two successive accesses and equals 25-26 CPU clocks. The backward and random access curves look different - they have teeth at certain points. The backward access latency is within 280 - 304 clocks (183 - 198 ns), and the random access latency is just a bit higher (284 - 305 clocks, 185 - 199 ns), since it was measured with almost no L2 TLB misses.

The teeth on the backward and random access curves are caused by the fact that a new memory exchange cycle can start/end only on a whole number of FSB clocks. Here the interval between teeth is 23 NOPs, which means that a memory exchange cycle in this system (with the multiplier equal to 1533.3/133.3 = 11.5) can take place only at every even (or odd) memory bus clock. It's not clear why some AMD K7 platforms behave this way.
However, we are sure that it depends neither on multiprocessing nor on the memory type (DDR or SDRAM); it must be related to peculiarities of particular chipsets.

The minimal RAM latency of Opteron looks different. The minimal latency for forward/backward memory access is reached already at 39 NOPs (almost 10 times fewer than the Athlon XP needs) and equals 21 CPU clocks (11.7 ns). The random access latency is much higher (this is probably the only objective measure of memory latency in systems with Hardware Prefetch enabled). It varies from 139 to 148 clocks (77 - 82 ns), and the curve has the same teeth with a period of 11 NOPs, which means that a memory cycle can take place at every clock of its 166 MHz bus.
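The reasoning about the teeth can be written down as a short calculation: since one NOP takes one CPU clock, dividing the teeth period (in NOPs) by the CPU-to-bus clock multiplier gives the period in bus clocks. This is a minimal sketch with a hypothetical function name; the 1800 MHz Opteron core clock is an assumption, the other figures are taken from the text.

```python
# Hypothetical helper, not part of RMMA.

def teeth_period_in_bus_clocks(teeth_period_nops, cpu_mhz, bus_mhz):
    """Express the interval between latency teeth in memory bus clocks.

    Assumes one NOP costs exactly one CPU clock, as stated in the article.
    """
    multiplier = cpu_mhz / bus_mhz
    return teeth_period_nops / multiplier


if __name__ == "__main__":
    k7 = teeth_period_in_bus_clocks(23, 1533.3, 133.3)  # multiplier 11.5
    k8 = teeth_period_in_bus_clocks(11, 1800.0, 166.0)  # assumed 1800 MHz core
    print(f"K7: one tooth every {k7:.2f} bus clocks")
    print(f"K8: one tooth every {k8:.2f} bus clocks")
```

A result close to 2 corresponds to a memory cycle starting only at every second bus clock (the K7 case), while a result close to 1 corresponds to a cycle that can start at every bus clock (the K8 case).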
D-Cache Associativity

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

The curves of both AMD Athlon XP and Opteron have an area that corresponds to the 2-way associative L1 cache (1-2 chains), where the latency remains 3 clocks for all access types. The area of 3-18 chains is related to the L2 cache associativity, which looks overstated in our case (18 against the specified L2 associativity of 16). This is a direct consequence of the exclusive architecture of the AMD K7/K8 caches. Since such an architecture does not duplicate L1 data in L2, the value of 18 equals the "summary" associativity of the L1+L2 D-caches (2 + 16). Processors with the inclusive cache architecture have a different curve, which bends at the points corresponding to the L1 and L2 associativity.

Note that the L2 and RAM latency in both cases is overstated in this test, especially for sequential access. This is probably caused by the overhead of continually re-associating cache lines with memory lines under the LRU (Least Recently Used) scheme in order to keep in the cache hierarchy all the data taken from addresses that are "bad" from the standpoint of the D-cache organization.

Real L1-L2 Bus Bandwidth

In this case we again use the third test (D-Cache Bandwidth), with the L1-L2 Cache Bus Bandwidth preset.

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

On Athlon XP the real L1-L2 bus bandwidth is 3.2 bytes/clock for all access types. Note that in the exclusive cache architecture every access to L2 (loading one L2 cache line into L1) is accompanied by a writeback of the evicted line from L1 into L2, i.e. each access transfers twice the data over the L1-L2 bus. The effective L1-L2 bus bandwidth is therefore 6.4 bytes/clock, which corresponds to the 64-bit data bus between the L1 and L2 caches.
On Opteron the real L1-L2 bus bandwidth is much higher (10.9 bytes/clock with the exclusive architecture taken into account), which corresponds to the 128-bit bus between the L1 and L2 caches specified in AMD's documentation. Another interesting peculiarity of the exclusive cache architecture is that the L1-L2 bus is no less efficient at writing than at reading cache lines from L2 into L1. It is especially evident in the AMD K7 architecture (the real L1-L2 bus bandwidth is 3.2 (6.4) bytes/clock for both reading and writing), while on AMD K8 the L1-L2 bus efficiency at writing is a bit lower: the real bus bandwidth is 4.9 (9.8) bytes/clock.

Now we are going to obtain some more characteristics of the L1-L2 caches with the D-Cache Arrival test, which measures the summary latency of two accesses to the same data cache line. We use the L1-L2 Bus Data Arrival, 64 bytes preset.

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

The figures above show how the summary latency of the double access depends on the offset (in bytes) of the second access relative to the first one. In this case (when the bus is sufficiently unloaded by the 64 NOPs between two successive accesses to neighboring cache lines) the forward, backward and random access curves coincide. When the second element is offset by 4-20 bytes relative to the first one, the summary latency of the two accesses is 14/15 clocks (for K7/K8), which equals an L2 cache access (11/12 clocks) followed by an L1 cache access (3 clocks). The increased summary latency at offsets of 24 bytes and over is caused by the fact that the maximum theoretical read bandwidth of the bidirectional 64-bit L1-L2 bus of the AMD Athlon XP is 8 bytes/clock (during the 3 clocks of the L1 access the L2 can transfer at most 8x3 = 24 bytes). Opteron behaves similarly, which means that its effective L1-L2 bus width in read operations is 64 bits.
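The 24-byte border can be reproduced with a very simple model: the bus delivers 8 bytes per clock, and the second access sees no extra delay if its data arrived during the 3 clocks spent on the first L1 access. The sketch below is an assumption-laden illustration, not a description of RMMA internals; the names are hypothetical, and `extra_clocks` stands for any additional idle clocks between the two accesses (such as the SyncNOPs discussed next).

```python
# Simplified, hypothetical model of data arrival over a 64-bit L1-L2 bus.

LINE_BYTES = 64           # cache line size
BUS_BYTES_PER_CLOCK = 8   # 64-bit bus in the read direction
L1_LATENCY_CLOCKS = 3     # L1 access latency measured above


def arrival_border(extra_clocks=0):
    """Smallest second-access offset (bytes) that has not yet arrived in L1."""
    arrived = BUS_BYTES_PER_CLOCK * (L1_LATENCY_CLOCKS + extra_clocks)
    return min(LINE_BYTES, arrived)


def delayed_offsets(extra_clocks=0, step=4):
    """Second-access offsets within the line that show increased latency."""
    border = arrival_border(extra_clocks)
    return [off for off in range(step, LINE_BYTES, step) if off >= border]


if __name__ == "__main__":
    for extra in range(6):
        print(extra, "extra clocks -> border at", arrival_border(extra), "bytes")
```

In this model each extra clock moves the border by 8 bytes, and once 8 clocks in total are available the whole 64-byte line has arrived, so no offset is delayed any more.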
This aspect was closely studied in Appendix 1 to the article where we examined the AMD64 architecture. An additional one-clock delay between two successive accesses to the same cache line (obtained by increasing the number of SyncNOPs by one) shifts the border on the diagram by 8 bytes. With more than 5 SyncNOPs this border disappears, because 3 (L1) + 5 (SyncNOPs) = 8 clocks of L2 cache access are enough to transfer 8x8 = 64 bytes via the L1-L2 bus, which is equal to a whole cache line.

Finally, let's find out how cache lines are read from L2 into L1: from the beginning of the line, irrespective of the offset of the first element, or from the demanded position, wrapping around the end of the requested line. For this purpose we will draw a diagram that shows how the summary latency depends on the offset of the first word relative to the beginning of the cache line, with the following Custom parameters of the D-Cache Arrival test:
[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

Setting the second word offset relative to the first one to 60 bytes means that the real offset is -4 bytes (except at the first point, where the offset is indeed 60 bytes), i.e. the second word is shifted one position to the left of the first word within the cache line.

Now look at the diagrams. First of all, the hypothesis that data are always read from the beginning of the line, irrespective of the location of the first element, is wrong. If it were true, then, since the second element in this test is read in the order (60, 0, 4, ..., 56), the curves would look the same but would be shifted to the left by 4 bytes. So, in these architectures (K7 and K8) reading of a cache line can start from a non-zero position that depends on where the first element is located.

To find out how these values are related, let's look at the curves. As you can see, they take only two summary latency values: the maximum of 26/27 clocks (when the first word offset is a multiple of 8; let's call it even) and the minimum of 14/15 clocks (for odd offsets). It means that with an odd first word offset the second word (located 4 bytes before the first one) reaches the L1 cache immediately, while with an even offset its arrival is delayed the most.

The results suggest that reading of a cache line from L2 into L1 in AMD K7/K8 processors can start from any offset that is a multiple of 8 bytes. Reading then goes on, wrapping around the end of the line, until the whole line is read.

Here are two examples. In the first one the data are requested from the "even" offset of 24 bytes. In this case data are read from L2 into L1, one 8-byte portion per clock, in the following sequence: (24-31, 32-39, 40-47, 48-55, 56-63, 0-7, 8-15, 16-23).
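The wrap-around read order suggested by these results can be sketched in a few lines. This is a model of the hypothesis only, with hypothetical function names; it assumes a 64-byte line delivered in 8-byte portions starting from the portion that contains the first requested word.

```python
# Hypothetical model of the wrap-around cache line read order.

def burst_order(first_offset, line_bytes=64, chunk_bytes=8):
    """Return the (low, high) byte ranges in the order they are read."""
    start = (first_offset // chunk_bytes) * chunk_bytes
    chunks = line_bytes // chunk_bytes
    order = []
    for i in range(chunks):
        low = (start + i * chunk_bytes) % line_bytes  # wrap around line end
        order.append((low, low + chunk_bytes - 1))
    return order


def arrival_step(first_offset, second_offset):
    """Index of the transfer step that delivers the second word (0 = first)."""
    for step, (low, high) in enumerate(burst_order(first_offset)):
        if low <= second_offset <= high:
            return step
    raise ValueError("offset outside the cache line")


if __name__ == "__main__":
    print(burst_order(24))        # "even" first word offset
    print(arrival_step(24, 20))   # second word at 24 - 4 = 20 bytes
    print(burst_order(44))        # "odd" first word offset
    print(arrival_step(44, 40))   # second word at 44 - 4 = 40 bytes
```

In this model the word at offset 20 comes with the very last portion when the first word is at offset 24, while the word at offset 40 comes with the very first portion when the first word is at offset 44.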
This read order explains why the delay of the second word, whose actual offset is 24 - 4 = 20 bytes, is maximal (26/27 clocks). Now let the first word offset be "odd", for example, 44 bytes. Data will be read in the following order: (40-47, 48-55, 56-63, 0-7, 8-15, 16-23, 24-31, 32-39). It explains why the summary latency of accessing the elements at offsets 44 and 40 bytes is minimal (14/15 clocks). This confirms the suggestion above.

Instruction Cache, Decode Efficiency

First of all we will estimate the characteristics of the I-Cache levels (remember that the L2 cache holds both data and executable code) and the efficiency of decoding a simple 6-byte instruction, cmp eax, 0x00000000, which allows reaching the maximum decoding rate. For this purpose we use the I-Cache test with the L1i Size / Decode Bandwidth, CMP Instructions 3 preset.

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

The L1 instruction cache (L1i) size is, as expected, 64 KB for both K7 and K8. The exclusive organization of the L1-L2 hierarchy holds for code caching as well: the decode curves have the second bend at 320 KB (64+256) on K7 and at 1088 KB (64+1024) on K8. The efficiency of decoding/executing code stored in L1i almost reaches the maximum for these architectures, 16 bytes/clock, which was mentioned above when we estimated the real L1 D-Cache bandwidth. Execution of this type of instruction from the L2 cache is much less efficient.

Let's see how fast other types of instructions are executed from the L1i/L2 caches.
We can see that the speed of executing instructions from L2 doesn't depend on their type on either K7 (1.97 bytes/clock) or K8 (2.56 bytes/clock). On K8 the efficiency of executing code from L2 is higher (by almost 30%), but both results are still far from the theoretical read limit of the 64-bit L1-L2 bus. When code runs from the L1i cache, decoding of large instructions (such as the 6- and 8-byte cmp) is limited by the L1 bandwidth of 16 bytes/clock (2.66 instructions/clock and lower). With small independent instructions (1, 2 or 4 bytes long) it is quite possible to reach the maximum decoding/execution rate of 3 instructions/clock for K7/K8. Compared with the K7 decoder, the K8 decoder has some improvements for simple ALU operations which allow them to reach the maximum execution rate.

Now we are going to estimate the associativity of the L1i and the shared I/D L2 caches (for the latter it must equal the value we found when we analyzed the L1/L2 D-cache associativity). We use the I-Cache Associativity preset.

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

The test results are pretty clear, especially for Opteron. In both cases the L1i associativity is 2 and the summary L1i+L2 associativity is 18, which confirms that the cache levels in these processors have the exclusive organization.

Now let's see how effectively these processors handle a large number of (actually useless) prefixes preceding a single meaningful x86 instruction. We use the Prefixed NOP Decode Efficiency preset.

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

The K7 curve reaches its maximum at 3 prefixes (11.2 bytes/clock / 4 = 2.8 operations/clock), and as the number of prefixes grows the decode efficiency (in operations per clock) quickly goes down. K8 handles prefixes a bit better, though not as well as we would like.
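The conversions between bytes/clock and operations/clock used in this section are plain arithmetic and can be sketched as follows. The function names are hypothetical; the 16 bytes/clock and 3 instructions/clock limits are the figures quoted in the text, and a prefixed NOP is assumed to occupy one byte per prefix plus one byte for the NOP itself.

```python
# Hypothetical helpers for the decode-rate arithmetic.

def l1i_decode_limit(instr_bytes, fetch_bytes_per_clock=16.0,
                     max_instr_per_clock=3.0):
    """Upper bound on instructions/clock for instructions of a given length."""
    return min(max_instr_per_clock, fetch_bytes_per_clock / instr_bytes)


def ops_per_clock(bytes_per_clock, n_prefixes, base_bytes=1):
    """Recalculate a measured bytes/clock figure into operations/clock."""
    return bytes_per_clock / (n_prefixes + base_bytes)


if __name__ == "__main__":
    for size in (1, 2, 4, 6, 8):
        limit = l1i_decode_limit(size)
        print(f"{size}-byte instructions: at most {limit:.2f} instr/clock")
    print(f"3 prefixes at 11.2 bytes/clock: {ops_per_clock(11.2, 3):.1f} ops/clock")
```

The same recalculation applies to any point of the prefix curves: a higher bytes/clock figure at a larger number of prefixes does not necessarily mean more operations per clock.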
The K8 curve has additional maxima, but they stop being maxima when recalculated into the number of operations executed. However, exactly such instructions - [0x66]nNOP - are recommended as neutral padding code in AMD's software optimization guide for K8 processors, for example, for aligning the start of a loop.

D-TLB levels characteristics

First of all we are going to examine the general picture of the TLB levels of each processor. For this purpose we use the D-TLB test with the D-TLB Size preset for Athlon XP. For the second system (Opteron) we set the parameters manually, increasing the maximum number of D-TLB entries to 1024, because the L2 D-TLB of this processor holds 512 entries.

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

Both processors have a two-level D-TLB with an L1 D-TLB of 32 entries. The L2 D-TLB of the Athlon XP has 256 entries, and in Opteron it is twice as large (512 entries). Since the second jump lies in the zone that corresponds to the L2 D-TLB size rather than to the L1+L2 D-TLB size, we can suggest that the D-TLB of both processors has an inclusive architecture (unfortunately, documentation usually gives no details on how TLB levels are organized when there are several of them). In both cases an L1 D-TLB miss raises the L1 access latency to 8 clocks. L2 D-TLB misses have a greater effect: as the number of misses grows, the L1 access latency gradually rises to values that definitely exceed the L2 latency.

According to AMD's documentation, the L1 D-TLB of K7/K8 is fully associative. Let's find out if it's true. Preset: D-TLB Associativity, 16 Entries; this number of entries easily fits in the L1 D-TLB.

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

The test shows that the L1 cache latency is 3 clocks when accessing 16 memory pages on both processors, for any number of dependent access chains. The curves look the same when the maximum number of chains is 32.
It means that the L1 D-TLB is fully associative (its associativity is not lower than its size). Now we are going to estimate the L2 D-TLB associativity; for this purpose we use a number of pages that is certainly higher than the number of entries in the L1 TLB but lower than that in the L2 TLB (presets: D-TLB Associativity, 64 Entries for K7; D-TLB Associativity, 128 Entries for K8).

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

When only the L2 D-TLB is in use, the L1 access latency sharply increases once the number of chains exceeds 4. It means that the L2 D-TLB in these processors is 4-way associative.

I-TLB level parameters

The I-TLB parameters are measured the same way as for the D-TLB. The I-TLB Size preset of the I-TLB test is used to measure the I-TLB size (level structure).

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

For Opteron we modified the standard preset by increasing the maximum number of TLB entries to 1024. Both processors have two I-TLB levels. The first level (L1 I-TLB) in AMD K8 was doubled (to 32 entries) compared to the previous-generation K7 architecture. The L2 I-TLB was expanded to 512 entries in the newer architecture. Nevertheless, in both processors the I-TLB levels interact in the same way, which we call inclusive, as with the D-TLB.

To estimate the associativity of every I-TLB level we take the I-TLB Associativity, 16 Entries preset. For AMD K7 the number of I-TLB entries used is reduced to 15 in order not to exceed the bounds of the L1 I-TLB (since one of its entries is taken by the page of the test code itself).

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

The L1 I-TLB is fully associative in both cases. Let's estimate the associativity of the second I-TLB level. Presets: I-TLB Associativity, 32 Entries for K7 and I-TLB Associativity, 128 Entries for K8.

[Figures: AMD Athlon XP 1800+; AMD Opteron 244]

In both cases the results match the specified L2 I-TLB associativity of 4.
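The idea behind the associativity tests in this and the previous sections can be illustrated with a toy model: if more conflicting entries are touched in turn than a set has ways, every access evicts an entry that will soon be needed again. The sketch below is not how RMMA is implemented; it assumes a single set with true LRU replacement (the actual replacement policy of these TLBs is not documented here), and the function name is hypothetical.

```python
from collections import OrderedDict

# Toy model: one set of a set-associative structure with LRU replacement.

def miss_rate(n_chains, ways, rounds=100):
    """Miss rate when n_chains conflicting entries are touched cyclically.

    The first round only warms the set up and is not counted.
    """
    lru = OrderedDict()
    misses = 0
    accesses = 0
    for rnd in range(rounds):
        for tag in range(n_chains):
            counted = rnd > 0
            if counted:
                accesses += 1
            if tag in lru:
                lru.move_to_end(tag)         # mark as most recently used
            else:
                if counted:
                    misses += 1
                if len(lru) >= ways:
                    lru.popitem(last=False)  # evict least recently used
                lru[tag] = True
    return misses / accesses


if __name__ == "__main__":
    for chains in range(1, 9):
        print(chains, "chains, 4 ways -> miss rate", miss_rate(chains, 4))
```

In this model the miss rate stays at zero up to 4 chains and jumps as soon as the number of chains exceeds the number of ways, which is the kind of sharp latency increase the test looks for.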

Conclusion

Today we carried out the first thorough low-level test of AMD K7/K8 platforms using the universal RightMark Memory Analyzer. The results show that this test suite can be successfully used for estimating key low-level platform parameters. Next time we will continue our examination and test the Intel Pentium 4 platform.

Dmitry Besedin (dmitri_b@ixbt.com)