







NVIDIA PerfKit 5: New Tools for 3D Developers

Introduction

Over a year has passed since our review of NVPerfKit 2. We've been using PerfKit and PerfHUD all this time to analyze modern games, and game developers have been using these tools in their projects. Much has changed since then: Microsoft released Windows Vista and the DirectX 10 API, and NVIDIA launched a new series of graphics cards based on G8x GPUs that support this API.

That older version of PerfKit didn't support these new features, which are vital for game developers, so the kit had to be updated. At the end of this summer NVIDIA launched PerfKit 5, in which PerfHUD underwent the most changes. Both names have also lost the NV prefix: the tools are now called NVIDIA PerfKit and NVIDIA PerfHUD.

These tools are vital for 3D developers: modern 3D applications are so complex that it is hard to use all the features of the new GPUs effectively without utilities that help detect mistakes and performance bottlenecks. A GPU performs many different operations in its pipeline stages during rendering, and the overall performance of an application is determined by the slowest stage, so you need convenient tools to find it. Pipelines have grown more complex in recent years, and it's very difficult to make head or tail of these processes unaided, especially as unified shaders (vertex, geometry, and pixel) have changed the approach to optimizing 3D applications. The habitual approach of shifting load from pixel to vertex shaders and back may not work here, especially if the number of vertices is comparable to the number of pixels. Pixel and vertex shaders no longer limit performance separately; now there are only unified shaders to worry about.

There used to be only simple debugging tools; specialized tools such as PerfHUD appeared later. The utility has been used by major developers in many of the most popular games. NVIDIA itself lists the following titles: Battlefield 2142 (DICE), World of Warcraft (Blizzard Entertainment), Gamebryo (Emergent Technologies), the engine behind TES4: Oblivion, Company of Heroes (Relic Entertainment), Settlers VI (Blue Byte). According to our research, PerfHUD was also used to develop the following projects: Armed Assault, Gothic 3, Far Cry, Serious Sam 2, S.T.A.L.K.E.R., Dark Messiah of Might and Magic, Need for Speed: Carbon, Test Drive: Unlimited, Splinter Cell: Double Agent, GTR 2, and many others.

PerfHUD has become this popular because it really helps developers optimize their projects more easily and efficiently. According to NVIDIA, hundreds of PerfHUD users have improved the performance of their programs. PerfHUD compares favorably with similar utilities because it works in real time with the analyzed application: the entire process of debugging and detecting bottlenecks is carried out right in the running application, while other utilities rely on offline analysis, which is less convenient. The new PerfKit 5 is available on NVIDIA's web site for developers; below we review its features.

PerfKit 5 Overview

PerfKit 5 is a bundle of programs for developers of 3D applications. It contains powerful tools for analyzing the performance of Direct3D and OpenGL applications using driver performance counters and GPU hardware counters. These counters can be used to determine the reasons for low performance and to find out how well a given application uses the available GPU resources.

PerfKit 5 components:
  • NVIDIA ForceWare instrumented driver with the Performance Data Helper (PDH) interface.
  • PerfHUD 5.0 - a powerful utility for analyzing the performance of Direct3D 9 and Direct3D 10 applications.
  • NVIDIA plugin for Microsoft PIX for Windows - imports data from NVIDIA counters into the DirectX SDK debugger.
  • PerfSDK - access to performance counters from OpenGL and Direct3D applications, with source code samples.
  • GLExpert - a PerfKit component for analyzing the performance of OpenGL applications and debugging them.

Although PerfKit's Direct3D features are richer than its OpenGL ones, OpenGL is supported as well. NVIDIA recommends using gDEBugger for OpenGL applications. A pre-release version of this utility used to come with PerfKit; now you can download it from the official web site for developers.

PerfKit 5 system requirements

  • A graphics card with a modern NVIDIA GPU from the following list: NV40, NV43, G70, G72, G80, G84. In other words, support is now limited to mid-range and high-end graphics cards of the GeForce 8, GeForce 7, and GeForce 6 families, as well as the corresponding professional cards of the Quadro FX family. Earlier GPUs are not supported, or their support is limited.
  • Microsoft Windows XP or Windows Vista with the latest Microsoft DirectX update (preferably with the latest version of the DirectX SDK as well).
  • Special debug drivers from NVIDIA (PerfKit 5 installs the right version automatically and enables the debug option by default).

This special driver is a must for PerfKit. Debug drivers of this kind are also called instrumented drivers: they contain additional code to monitor and measure performance. Debugging tools such as PerfHUD communicate with the driver to get the necessary information about the operation of the GPU and the NVIDIA driver. Instrumented drivers should not be used for comparative performance tests, because they affect rendering speed. However, the slowdown does not exceed a few percent, and the instrumentation can be disabled in the NVIDIA control panel.

Let's review the key features of PerfKit one by one. We'll start with a brief description of counters, as they are used by the entire kit, including the popular PerfHUD, which relies on them heavily.

PerfKit counters

PerfKit offers counters of several types: hardware counters that read data from the GPU, and software Direct3D and OpenGL counters that contain data obtained from the debug driver. There are also so-called "simplified experiments" - multipass operations that provide detailed information about GPU status.

Hardware counters contain results accumulated since the last time the GPU was sampled. For example, the value of the setup_triangle_count counter equals the number of triangles processed since the last query. When counters are read through PDH, for example from the Performance Monitor (PerfMon) built into Windows, they are queried once per second. But when you integrate counters into your own application, you can query them as often as you want. Unlike hardware counters, driver counters return values accumulated over the last rendered frame.
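
The effect of the sampling interval on an accumulating counter can be illustrated with a toy model. This is only a sketch in Python, not PerfKit code: the AccumulatingCounter class, the simulate helper, and the frame and triangle numbers are all hypothetical and merely mimic a counter that accumulates events until it is read.

```python
class AccumulatingCounter:
    """Toy stand-in for a hardware counter (hypothetical, not a PerfKit API).

    Events pile up between samples; sampling returns the total and resets it.
    """

    def __init__(self):
        self._pending = 0

    def record(self, events):
        # Called by the simulated GPU whenever it processes something.
        self._pending += events

    def sample(self):
        # Return everything accumulated since the previous sample, then reset.
        value, self._pending = self._pending, 0
        return value


def simulate(frames, triangles_per_frame, frames_per_sample):
    """Render `frames` frames, sampling every `frames_per_sample` frames."""
    counter = AccumulatingCounter()
    samples = []
    for frame in range(1, frames + 1):
        counter.record(triangles_per_frame)
        if frame % frames_per_sample == 0:
            samples.append(counter.sample())
    return samples


# Sampled once per second at an assumed 60 fps (the PDH/PerfMon case).
per_second = simulate(frames=120, triangles_per_frame=10_000, frames_per_sample=60)
# Sampled every frame from inside the application.
per_frame = simulate(frames=120, triangles_per_frame=10_000, frames_per_sample=1)
```

With these assumed numbers each once-per-second sample holds 600,000 triangles, while each per-frame sample holds 10,000: the workload is the same, only the reading interval differs.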

When using the PDH interface, counters can be reported in one of two ways: raw or percentage. Raw counters count events (triangles, pixels, milliseconds, etc.) since the last call. Percentage counters return the share of time a certain GPU unit was busy or waiting for data from another unit. If you read counter data from a program using NVPerfAPI functions, they return raw values together with the total number of completed GPU cycles. Triangle and vertex counters return the number of processed elements.

Software performance counters for Direct3D and OpenGL applications include: FPS, the number of draw calls, the number of vertices and triangles per frame (with and without instancing), video memory usage (textures, vertices, buffers, etc.), and several special SLI counters that show the number and volume of data transfers from GPU to GPU, the number of transferred render buffers, etc.

Hardware counters are more interesting: GPU Idle, Shader Utilization, ROP Utilization, Shader Stalls, ROP Stalls, Vertex Count, Primitive Count, Triangle Count, and Pixel Count. Let's examine GPU counters in more detail. PerfKit 5 adds more of them, which makes it easier to analyze the performance of 3D applications.
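
A raw reading paired with a cycle count can be turned into a percentage by simple division. The sketch below assumes that the raw value of a busy or wait counter is expressed in the same GPU cycles as the total; the busy_percent function and the sample numbers are hypothetical and do not call any PerfKit API.

```python
def busy_percent(raw_value, total_cycles):
    """Convert a raw busy/wait reading into a percentage of elapsed GPU cycles."""
    if total_cycles <= 0:
        return 0.0  # nothing elapsed since the last sample
    return 100.0 * raw_value / total_cycles


# Hypothetical reading: a unit was busy for 750,000 of 1,000,000 elapsed cycles.
print(busy_percent(750_000, 1_000_000))  # 75.0
```

Event counters such as triangle and vertex counts need no such conversion, since they already report the number of processed elements.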

  • gpu_idle and gpu_busy count the number of clock ticks that the GPU was idle and busy since the last call. Together, this pair of counters shows GPU load. The data comes in handy when you balance the load: it helps you find out whether an application is limited by the CPU or the GPU.


  • shader_busy - the percentage of time that the unified processors were busy, that is, the shader units in general (vertex, geometry, and pixel shaders together). It helps determine whether performance is limited by the unified processors.


  • vertex_shader_busy - the percentage of time that vertex shader units were busy (or unified processors that process vertex shaders, depending on the GPU). In old (non-unified) architectures, this counter can be used to balance the amount and complexity of vertex processing against pixel shader load. In unified architectures it helps developers understand how complex vertex shaders are.


  • pixel_shader_busy - the percentage of time that pixel shader units (or unified units) were busy processing pixels. Just like in the previous case, this value can be used in old architectures to determine whether performance is limited by pixel processors. In new architectures it can be used to evaluate the complexity of pixel processing.


  • geometry_shader_busy - the percentage of time that unified processors were busy processing geometry shaders. It is of no use on old architectures, since geometry shaders are available only in unified architectures. Values from this counter help developers find out whether performance is limited by geometry processing.


  • rop_busy - the percentage of time that the ROP unit was actively doing work. High readings are usually the result of heavy alpha blending or of high overdraw, another widespread limiter of rendering performance. Multisampling also increases the ROP load noticeably, as does the use of "heavy" buffer formats such as FP16 and FP32.


  • geom_busy - the percentage of time that the geometry unit was busy. It helps determine whether overall performance is limited by the amount of geometry. However, this rarely happens in real applications.


  • texture_busy - the percentage of time that the texture unit was busy. It shows how heavily the TMUs and TFUs are loaded. This counter is important for determining whether a 3D application uses too many texture fetches and whether they limit the overall rendering speed.


  • stream_out_busy - the time that the stream output interface (the stage after the geometry shader) was busy. This feature appeared in Direct3D 10. Stream output returns the data processed by the vertex part of the pipeline for further use. Under certain conditions a 3D application can be limited by the performance of this unit, and the counter helps detect such situations.


  • shaded_pixel_count - the number of pixels sent by the rasterizer to pixel shader units. Together with the number of processed triangles, this value can be used to determine an optimal balance between real geometry and its imitation by normal mapping.


  • rasterizer_pixels_killed_zcull_count - the number of pixels that the rasterizer culled at the ZCull stage. This counter is used, for example, to evaluate ZCull efficiency.


  • input_assembler_busy - the percentage of time the input assembler was busy. This unit fetches geometry and other data from memory for other GPU units. When it's overloaded, it may limit rendering speed.


  • setup_point_count/setup_line_count/setup_triangle_count/setup_primitive_count - the number of points/lines/triangles/primitives sent to a GPU for processing. For example, it can be used to evaluate efficiency of triangle formats, such as strips and fans, and to determine geometry complexity of a scene.


  • geom_primitive_in_count/geom_primitive_out_count - the number of input and output primitives (points, lines, and triangles) that were received and transformed by the geometry unit. It can also be used to evaluate efficiency of triangle storage formats and to obtain information.


  • geom_vertex_in_count/geom_vertex_out_count - the number of input and output vertices, just like in the previous item.


  • shader_waits_for_texture - the amount of time that the unified shader units were stalled waiting for a texture fetch. Texture stalls usually happen if textures don't have mipmaps or if a high level of anisotropic filtering is used.


  • shader_waits_for_rop - the percentage of time that the unified shader unit is stalled by the raster operations unit (ROP), waiting to blend a pixel and write it to the frame buffer. This can be a performance bottleneck if the application performs a lot of alpha blending, uses high anti-aliasing levels, or has a lot of overdraw.


  • shader_waits_for_geom - the percentage of time that the unified shader unit is stalled by the geometry unit. Our analysis shows that application performance will hardly ever be limited by geometry processing, but it doesn't hurt to check.


  • input_assembler_waits_for_fb - percentage of time that the input assembler is stalled by the frame buffer unit. It often happens when performance is limited by frame buffer operations, rather than by a fetch unit. So this counter usually returns big values.


  • rop_waits_for_shader - time that the ROP unit is stalled waiting for the unified shader unit. This rarely happens in reality, but this counter may still be useful.


  • rop_waits_for_fb - percentage of time that ROP is stalled by the frame buffer unit. Just like the previous case, it rarely happens.


  • texture_waits_for_fb - percentage of time that the texture fetch unit (TMU/TFU) is stalled by the frame buffer unit.


  • texture_waits_for_shader - the percentage of time that the texture fetch unit is stalled by the unified shader units. It is the reverse of the shader_waits_for_texture counter listed above and indicates when the texture units are stalled.
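
As a rough illustration of how the counters above might be combined, the Python sketch below performs a first-pass triage. It assumes the readings have already been converted to percentages; the classify_bottleneck and pixels_per_triangle helpers, the 20% idle threshold, and the sample frame are all hypothetical and are not part of PerfKit or an NVIDIA recommendation.

```python
# Unit counters considered when the GPU itself is the limiter.
UNIT_COUNTERS = (
    "shader_busy",
    "rop_busy",
    "geom_busy",
    "texture_busy",
    "input_assembler_busy",
    "stream_out_busy",
)


def classify_bottleneck(percentages, idle_threshold=20.0):
    """Return "cpu_limited" or the name of the busiest GPU unit.

    `percentages` maps counter names to values in the 0..100 range.
    `idle_threshold` is an arbitrary illustrative cut-off, not an NVIDIA figure.
    """
    # A GPU that idles a lot is waiting for the CPU to feed it work.
    if percentages.get("gpu_idle", 0.0) >= idle_threshold:
        return "cpu_limited"
    # Otherwise report the unit with the highest busy percentage.
    return max(UNIT_COUNTERS, key=lambda name: percentages.get(name, 0.0))


def pixels_per_triangle(shaded_pixel_count, setup_triangle_count):
    """Average number of shaded pixels per triangle (0.0 if no triangles)."""
    if setup_triangle_count == 0:
        return 0.0
    return shaded_pixel_count / setup_triangle_count


# Hypothetical frame: the GPU is almost never idle and the ROPs are the busiest.
frame = {"gpu_idle": 3.0, "shader_busy": 41.0, "rop_busy": 88.0, "texture_busy": 35.0}
print(classify_bottleneck(frame))  # rop_busy
```

A result such as rop_busy would then point to the causes named in the list (alpha blending, overdraw, multisampling, heavy buffer formats), while the wait counters help confirm which unit is stalling which.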
Alexei Berillo (sbe@ixbt.com)
October 19, 2007





Copyright © by Digit-Life.com, 1997-2008. Produced by iXBT.com