Analysing NVIDIA G8x Performance in Modern Games
This article was initially intended to continue our analysis of shader units and to show how their number affects NVIDIA G8x performance in modern games. We planned to change the parameters of a G80-based graphics card to make it resemble the mid-range G84-based products. Our plan was to use RivaTuner by Alexei Nikolaychuk to leave only 32 active unified processors in the GeForce 8800, and to change the GPU and memory clock rates.
The GPU clock rate was to be reduced to the G84 level, and the video memory clock rate adjusted to scale the memory bandwidth of the GeForce 8800 down to that of the GeForce 8600 (the memory clock must be three times lower, because the memory bus widths differ by a factor of 384/128 = 3). Theoretically, such cards would have differed only in the number of ROPs (24 versus 8), as well as in on-die caches and other optimizations.
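The clock scaling described above can be illustrated with a minimal sketch. It assumes that memory bandwidth is simply bus width times effective clock; the function names are hypothetical, and the 700 MHz memory clock used for the GeForce 8600 GT is an assumed reference value that does not come from this article.

```python
def bandwidth_gb_s(bus_bits, effective_clock_mhz):
    """Peak memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_bits / 8 * effective_clock_mhz * 1e6 / 1e9


def matched_memory_clock(target_clock_mhz, target_bus_bits, source_bus_bits):
    """Clock the wider-bus card must run at to equal the narrower card's bandwidth."""
    # Bandwidth is proportional to bus width * clock, so the clock scales
    # by the inverse of the bus width ratio (128/384 = 1/3 here).
    return target_clock_mhz * target_bus_bits / source_bus_bits


# ASSUMPTION: 700 MHz (1400 MHz effective, double data rate) for the 8600 GT.
ASSUMED_8600_MEM_CLOCK_MHZ = 700.0

downclocked_8800 = matched_memory_clock(ASSUMED_8600_MEM_CLOCK_MHZ, 128, 384)

if __name__ == "__main__":
    print(f"8800 memory clock needed: {downclocked_8800:.1f} MHz")
    print(f"8600 bandwidth: {bandwidth_gb_s(128, 2 * ASSUMED_8600_MEM_CLOCK_MHZ):.1f} GB/s")
    print(f"8800 bandwidth after downclock: {bandwidth_gb_s(384, 2 * downclocked_8800):.1f} GB/s")
```

Whatever the actual clock of the 8600 is, the ratio stays the same: the 384-bit card has to run its memory at one third of the 128-bit card's clock to deliver equal bandwidth.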
Our objective was to determine how much the different number of ROPs affects performance, and how much the shader units differ between G84 and G80. We wanted to analyze the performance of modern games using NVIDIA PerfKit. Unfortunately, our plans never got off the ground: the hardware performance counters of the G80, which are essential here, do not work when some shader units are disabled, and such an analysis would have been useless without them. So we decided not to throw away our test results and to use them in a brief comparative analysis of performance and some other parameters of modern 3D games.
One more idea was discarded in the process - stream_out_busy readings. This counter proved to be practically useless, even though most of our tests are Direct3D 10 applications. The counter indicated that stream output units were used in only one game - Lost Planet: Extreme Condition. Moreover, their load was below 1%, so we decided to discard this counter as well.
Testbed configuration and settings
We used the following testbed configuration:
- CPU: AMD Athlon 64 X2 4600+ Socket 939
- Motherboard: Foxconn WinFast NF4SK8AA-8KRS (NVIDIA nForce4 SLI)
- RAM: 2048 MB DDR SDRAM PC3200
- Graphics cards: NVIDIA GeForce 8800 GTX 768MB and NVIDIA GeForce 8600 GT 256MB
- HDD: Seagate Barracuda 7200.7 120 GB SATA
- Operating system: Microsoft Windows Vista Home Premium
- Video driver: NVIDIA ForceWare 163.16 (instrumented)
We used only one video mode with the most popular resolution, 1280x1024 (or 1280x960 for games that do not support the former), with MSAA 4x and anisotropic filtering 16x. Both features were enabled from the game options; nothing was changed in the control panel of the video driver.
Our set of game tests includes recent projects. We gave preference to games that support Direct3D 10 or contain interesting new 3D techniques. Here is the full list: Call of Juarez DX10 benchmark, Company of Heroes, S.T.A.L.K.E.R.: Shadow of Chernobyl, Lost Planet: Extreme Condition DX10 benchmark, Colin McRae Rally: DiRT, PT Boats: Knights of the Sea DX10 benchmark, SEGA Rally Revo, Clive Barker's Jericho. Several of these games have no built-in way to run demos, so their results should be treated as approximate. Additional software: NVIDIA PerfKit 5, RivaTuner 2.05, and Microsoft PIX for Windows from the DirectX SDK.
Unfortunately, our tests did not include such interesting applications as World in Conflict, BioShock, Medal of Honor: Airborne, Test Drive Unlimited, TimeShift, Call of Duty 4: Modern Warfare, Half-Life 2: Episode 2, etc. Some demos and games did not arrive in time, and two projects failed to run under the PIX debugger: BioShock and Test Drive Unlimited.
Test Results
Lost Planet: Extreme Condition
We start our analysis with one of the most technically advanced games launched in 2007. Lost Planet: Extreme Condition came to the PC from the Xbox 360, and its engine was changed in many ways for Direct3D 10 GPUs. Compared to the console version (which is a high-tech game as well), it has some new features: an FP16 frame buffer, higher-quality motion blur and depth of field, fur, more samples and improved shadow map filtering, ambient occlusion, soft particles, advanced parallax mapping, etc.

Lost Planet was tested with a built-in demo consisting of two game levels. They differ considerably: there are few objects in the first level, so its performance is limited mostly by the graphics card; the second level contains a lot of objects, so its performance is limited by both the CPU and the GPU.
| NVIDIA PerfKit counter | GeForce 8800 GTX (G80) | GeForce 8600 GT (G84) |
|---|---|---|
| FPS (avg) | 20.8 | 4.5 |
| FPS (min) | 11.0 | 2.1 |
| Video memory, MB | 507 | 505 |
| batch count (avg) | 738 | 799 |
| batch count (max) | 4375 | 4336 |
| primitive count (avg) | 497858 | 475061 |
| primitive count (max) | 2335540 | 2104705 |
| setup triangle count (avg) | 83318 | 82474 |
| setup triangle count (max) | 311494 | 292719 |
| gpu_idle, % | 0.1 | 0.1 |
| rop_busy, % | 7.9 | 5.2 |
| texture_busy (avg), % | 74.4 | 77.5 |
| texture_busy (max), % | 83.4 | 90.6 |
| shader_busy (avg), % | 84.6 | 88.3 |
| shader_busy (max), % | 89.0 | 98.0 |
| geom_busy, % | 0.2 | 1.2 |
| input_assembler_busy, % | 3.1 | 0.8 |
In our tests we used the DirectX 10 demo of the game with its built-in benchmark, which does not reflect the real performance of the release version with all patches applied. That is why we can evaluate the frame rate only in relative terms; the rendering speed of the latest release is much higher. We can see that the G84 is heavily outperformed by the G80. Let's locate the bottleneck. First of all, the GeForce 8600 can be slowed down by its video memory size, which is about half of what the game uses (256 MB on board versus roughly 505 MB allocated). Secondly, judging by the results, rendering speed is affected by the number of shader units and TMUs - the difference in performance is roughly proportional to the difference in units.
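The proportionality argument can be checked with simple arithmetic. The sketch below takes the frame rates and video memory usage from the Lost Planet results and the on-board memory sizes from the testbed list. The count of 32 unified processors for the G84 follows the introduction; the count of 128 for the GeForce 8800 GTX is an assumption taken from its public specification rather than from this article. Clock rate differences are not modelled.

```python
# Measured values from the Lost Planet results.
fps_avg = {"G80": 20.8, "G84": 4.5}
fps_min = {"G80": 11.0, "G84": 2.1}
video_memory_used_mb = {"G80": 507, "G84": 505}

# On-board memory from the testbed list.
onboard_memory_mb = {"G80": 768, "G84": 256}

# ASSUMPTION: 128 unified processors on the 8800 GTX (public spec);
# 32 on the G84, as stated in the introduction.
unified_processors = {"G80": 128, "G84": 32}


def g80_over_g84(values):
    """Ratio of the G80 value to the G84 value."""
    return values["G80"] / values["G84"]


def memory_pressure(gpu):
    """Video memory in use divided by on-board memory (above 1.0 means a shortfall)."""
    return video_memory_used_mb[gpu] / onboard_memory_mb[gpu]


if __name__ == "__main__":
    print(f"avg FPS ratio:       {g80_over_g84(fps_avg):.2f}")
    print(f"min FPS ratio:       {g80_over_g84(fps_min):.2f}")
    print(f"shader unit ratio:   {g80_over_g84(unified_processors):.2f}")
    for gpu in ("G80", "G84"):
        print(f"{gpu} memory pressure: {memory_pressure(gpu):.2f}")
```

Under these assumptions the frame rate gap comes out somewhat larger than the unit ratio, which would fit the idea that the memory shortfall on the GeForce 8600 adds to the effect of fewer units.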
Let's have a look at some interesting results. The average number of draw calls does not reflect the real picture, because it depends on the level: there are few calls in the first scene, while in the second the number exceeds 4000, which is a lot even for state-of-the-art games. The amount of geometry in this game is above average, but I don't quite understand the difference between primitive count and setup triangle count. Both GPUs were busy almost all the time (gpu_idle is 0.1%), so we can see that performance does not depend on the CPU. Geometry and raster units are lightly loaded, while texture and shader units are working at close to full capacity. We haven't seen such active usage of shader units before. That's what I call good optimization - performance depends on the GPU only.
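The reading of the counters above can be expressed as a small sketch that picks the most loaded unit from the averaged PerfKit values in the table. The helper names and the 5% idle threshold are hypothetical choices for illustration, not part of PerfKit.

```python
# Averaged counter values (percent) from the Lost Planet results.
counters = {
    "G80": {"gpu_idle": 0.1, "rop_busy": 7.9, "texture_busy": 74.4,
            "shader_busy": 84.6, "geom_busy": 0.2, "input_assembler_busy": 3.1},
    "G84": {"gpu_idle": 0.1, "rop_busy": 5.2, "texture_busy": 77.5,
            "shader_busy": 88.3, "geom_busy": 1.2, "input_assembler_busy": 0.8},
}


def busiest_unit(sample):
    """Name of the *_busy counter with the highest load."""
    busy = {name: load for name, load in sample.items() if name.endswith("_busy")}
    return max(busy, key=busy.get)


def is_gpu_bound(sample, idle_threshold=5.0):
    """Treat the run as GPU-bound when the GPU is almost never idle.

    ASSUMPTION: the 5% threshold is an arbitrary illustrative cut-off.
    """
    return sample["gpu_idle"] < idle_threshold


if __name__ == "__main__":
    for gpu, sample in counters.items():
        print(gpu, busiest_unit(sample), "GPU-bound" if is_gpu_bound(sample) else "CPU-bound")
```

For both cards the shader units carry the highest load, with the texture units close behind, which matches the conclusion drawn from the table.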