ATI RADEON 9700 preview or DX9=R300*NV30
CONTENTS
- Lecture 1 - introductory
- Lecture 2 - characteristics of the main hero
- Lecture 3 - comparison of key characteristics
- Lecture 4 - memory controller
- Lecture 5 - pixel pipelines and texture units
- Lecture 6 - vertex shaders and higher-level languages
- Lecture 7 - anti-aliasing and video capabilities
- Conclusion
Lecture 1 - introductory
The epoch of flexibly programmable graphics accelerators has at last begun.
Certainly, there are drawbacks! But they are being corrected. At the same
time, realistic graphics programs show that the capabilities of even the
latest generation of accelerators are miserable. But they are developing in
the right direction, and soon you will see an incarnation of realistic
graphics on every desktop. Thank you for your attention (lengthy applause).
Today we take a preliminary look at the capabilities of ATI's recently
announced new-generation chip; we will also discuss its main competitor,
the still unannounced NV30, and the prospects of hardware graphics
acceleration.
Lecture 2 - characteristics of the main hero
So, ATI has announced the RADEON 9700 in advance - earlier it was known as
the R300:
This chip unveils a new generation of graphics architectures from ATI by
implementing the latest hardware trends outlined by the DirectX 9 API. Some
time ago we already touched upon the key requirements that DX9 sets for
accelerators.
Here are the promised characteristics of the new chip and of the flagship
card based on it, the RADEON 9700:
- Technology: 0.15 micron;
- Transistors: 107 million;
- Core clock speed: 300 MHz (315/325 possible);
- Memory bus: 256bit DDR (DDR II will probably come later);
- Local memory: up to 256 MB;
- Memory clock speed: 300 DDR (600) MHz or more, 20 GB/s bandwidth;
- Interface bus: AGP 8x, 2 GB/s throughput;
- Full support of basic DX9 capabilities:
- Floating-point 64 and 128bit data formats for textures and frame buffer
(vectors of 4 components of F16 or F32);
- Pixel shaders with floating-point arithmetic (4*F32 computation format);
- Pixel Shaders 2.0;
- 4 independent vertex pipelines;
- Vertex Shaders 2.0;
- Hardware tessellation of N-Patches with Displacement Mapping and,
optionally, an adaptive level of detail;
- 8 independent pixel pipelines;
- 8 texture units (one per pixel pipeline) able to perform trilinear filtering
without speed losses and (at last) combine anisotropic and trilinear filtering;
- 4-channel (four 64bit channels) memory controller connected to the
accelerator's core and AGP through a full crossbar;
- HyperZ III memory optimization technology (quick cleanup and compression
of the Z buffer using 8x8 units, hierarchical Z buffer for quick visibility
determination);
- Early Z test (pixel shader is used only for visible pixels);
- Hardware acceleration of MPEG 1/2 decompression and compression, possibility
to process a video stream arbitrarily with pixel shaders (VIDEOSHADER technology);
- 2 independent CRTC;
- 2 integrated 10bit 400MHz RAMDACs with hardware gamma correction;
- Integrated TV-Out;
- Integrated DVI (TMDS transmitter) interface, up to 2048*1536;
- Integrated general-purpose digital interface for connection of an external
RAMDAC or a DVI transmitter or coupling with a TV tuner.
- FC packaging (FlipChip).
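The bandwidth figures in the list above can be checked with simple arithmetic. Below is a minimal sketch, assuming peak bandwidth is just bus width times effective transfer rate; the function name is ours, and the AGP parameters (a 32bit bus with a 66.67 MHz base clock and 8 transfers per clock) are not from the announcement but are the usual AGP 8x figures.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, effective_clock_mhz: float) -> float:
    """Peak transfer rate in GB/s, counting 1 GB as 10**9 bytes."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_clock_mhz * 1e6 / 1e9


# Local memory: 256bit bus, 300 MHz DDR, i.e. 600 MHz effective.
local_memory = peak_bandwidth_gb_s(256, 600)

# AGP 8x (assumed parameters): 32bit bus, 66.67 MHz base clock, 8 transfers per clock.
agp_8x = peak_bandwidth_gb_s(32, 66.67 * 8)

print(f"local memory: {local_memory:.1f} GB/s")
print(f"AGP 8x:       {agp_8x:.2f} GB/s")
```

So the promised "20 GB/s" is a rounded-up 19.2 GB/s at 300 MHz, and it is reached exactly only if the memory runs a little faster than that.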
Well, the characteristics are really impressive. Later we will comment
on each item and now we are turning to
Lecture 3 - comparison of key characteristics
For comparison we have chosen the most popular gaming solutions as well as
the main future competitor of the R300, the NV30.
The possible NV30 specs given here are neither official nor precise: they
are taken from different sources and based on rumors found on the Net.
A considerable part of the parameters is inferred from the open data on the
new cross-API higher-level languages C Graphics / Cine FX, which are meant
to facilitate programming of such flexible chips. Besides, some assumptions
are based on the DX9 requirements:
| Accelerator | R200 (RADEON 8500, 128MB) | NV25 (GeForce4 Ti 4600) | RV250 (RADEON 9000 PRO) | R300 (RADEON 9700) | NV30 |
| Technology (micron), transistors | 0.15, 62M | 0.15, 68M | 0.15, ~40M(?) | 0.15, 107M | 0.13, 120M |
| AGP | 4x | 4x | 8x | 8x | 8x |
| Memory bus, bit | 128 DDR | 128 DDR | 128 DDR | 256 DDR (II)(1) | 256 DDR II |
| Memory frequency, MHz | 275 | 325 | 275 | 300 (?) | 400+ (?) |
| Core frequency, MHz | 275 | 300 | 275 | 300 (?) | 400 (?) |
| Pixel pipelines | 4 | 4 | 4 | 8 | 8 |
| Texture units | 4x2 | 4x2 | 4x1 | 8x1 (2) | 8x2 (?) |
| Textures per pass, max. | 6 | 4 | 6 | 16 (3) | 16 (3)(?) |
| Vertex shaders | 2 | 2 | 1 | 4 | 4 (?) |
| Fixed T&L unit | Yes | No | No | No | No (?) |
| N-Patches | DX8 | No | DX8 (4) | DM (DX9) | DM (DX9) (?) |
| Vertex Shaders, version | 1.1 | 1.1 | 1.1 | 2.0 | 2.0 (5)(?) |
| Pixel Shaders, version | 1.4 | 1.3 | 1.4 | 2.0 | 2.0 (5)(?) |
| Memory controller | 2x64 | 4x32 | 1x128 | 4x64 | 4x64 (?) |
| Integrated RAMDAC | 1x400 MHz | 2x360 MHz | 2x350 MHz | 2x400 MHz | 2x400 MHz (?) |
| Memory optimization technology | Yes (HyperZ II) | Yes (LightSpeed II) | Yes (HyperZ II ?) | Yes (HyperZ III) | Yes (LightSpeed III ?) |
Notes:
- (1) Most likely, DDR II will be supported together with the DDR.
- (2) Each texture unit can perform trilinear sampling by itself, without a performance penalty.
- (3) According to the DX9 requirements, up to 16 different textures with
8 precalculated (interpolated over the triangle) 4D texture coordinates
can be used in a pass. In a pixel shader it's possible to sample up to
32 values from these textures.
- (4) Software emulation.
- (5) To all appearances, the hardware part will have capabilities exceeding
the DX requirements for vertex and especially pixel shaders 2.0.
What general conclusions can be drawn from this comparison?
- At the moment the R300 is the undoubted leader among gaming accelerators
(if we ignore the rumors about the parameters of the yet unannounced NV30),
both in architecture and in rough performance as estimated from the specs
and the first results of such cards.
However, its real market position can be judged only after comparing the
specs and the application performance of the final versions of the R300
and the NV30. And the R300 is not available yet. The potential of the new
architectures can be fully revealed only with DirectX 9, which is due to
arrive in autumn. The NV30 will probably be released by that time as well.
In autumn we will be able to witness a new battle of giants. That is why
its earlier announcement gives the R300 no trumps except a doubtful
priority in the PR sphere.
- The .15-micron fab process, typical of the previous generation, allows
for mass manufacture of the R300: reportedly, production volumes on the
.13-micron process won't be attainable for ATI till winter. Besides, the
.15 technology is not new to ATI, as it was used in its previous products;
this can help to raise the yield of working chips from the very beginning.
On the other hand, so many transistors on such a process can cause a low
yield of good chips, high power consumption and a high production cost
that leaves no room for price competition.
NVIDIA decided to take risks: being one of the first to get access to the
.13 process, the company is in a completely different situation. The new
process still has to have its imperfections corrected, mass production can
slip, and the yield of working chips can be very low in the beginning. On
the other hand, as the process is tweaked, NVIDIA will gain more in
production cost and clock speed (NVIDIA's inherently higher-frequency
architectures plus the finer process give 400 MHz against 300). So, time
works for NVIDIA; that is probably why ATI was in such a hurry with the
"paper" release and will possibly put the first cards on the market even
before DirectX 9.
However that may be, staking on being king for a day is risky.
- The R300 complies with the DX9 requirements and is a deliberate hardware
incarnation of this API. Rumor has it that the NV30 can offer more.
The question is whether these NV30 capabilities will be included
into DX (for example, as DX9.1, shaders 2.1 etc.) or will be available
only as OpenGL extensions.
- We are about to enjoy tough competition between two products that are
close in characteristics, aimed at the same market niche and probably going
to be released at the same time.
Lecture 4 - memory controller
In the new product ATI uses an approach to memory control familiar from
NVIDIA products, which includes a 4-channel memory controller and an
internal switch on the chip:
Earlier ATI preferred two- or one-channel controllers and large data
blocks, while NVIDIA's caching and memory access have been based on smaller
blocks ever since the NV20. Both approaches have advantages and
disadvantages. For example, NVIDIA's approach heats the memory more and is
more sensitive to its parameters and quality; as a result, the overclocking
potential is lower. ATI's approach is gentler on the memory but is less
efficient in complex tasks which access memory through many streams. As
accelerators become more flexible, the number of streams read from memory
simultaneously increases: there are several data flows for vertex shaders,
and 4 or 6 textures in a single pass. That is why NVIDIA's approach is more
effective in modern applications, and starting with the R300 ATI uses it
as well :-).
The memory optimization technology got one more Roman numeral in its name:
now it is called HyperZ III. The idea is the same: there are no new
techniques, but the old ones are improved. The technology provides quick
compression and clearing of the Z buffer using 8x8 blocks, and 3 levels of
a hierarchical representation of the Z buffer for early visibility
determination of whole blocks of pixels at once.
So, we have a shaded polygon (1) located close to the observer. And we want
to shade polygon 2, which is located further away and is therefore partially
overlapped. First we look at the highest level of the hierarchical Z buffer,
which stores distances for the largest 4x4 blocks, and mark the block (3)
which lies entirely behind the nearer triangle and doesn't need to be
shaded. Thus, we get rid of 16 pixels at once. Then we go to a lower level
and discard 8 2x2 blocks. At the last level, with 1-pixel precision, we
find several more pixels which need not be shaded. Although this
illustration is simplified, it is enough to get an idea of how the
hierarchical Z buffer operates and of the computation it saves.
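The walk through the three levels can be sketched in a few lines. This is only an illustration of the principle, not ATI's implementation: we assume an 8x8 tile, a "smaller is nearer" depth convention, a far polygon of constant depth that covers the whole tile, and a pyramid that keeps the farthest stored depth of every 1x1, 2x2 and 4x4 block. All names are ours.

```python
def build_pyramid(zbuf):
    """Return {block_size: grid of the farthest stored depth per block} for sizes 1, 2, 4."""
    levels = {1: [row[:] for row in zbuf]}
    size = 1
    while size < 4:
        prev = levels[size]
        size *= 2
        # Each coarser cell keeps the farthest depth of its four children.
        levels[size] = [
            [max(prev[2 * y][2 * x], prev[2 * y][2 * x + 1],
                 prev[2 * y + 1][2 * x], prev[2 * y + 1][2 * x + 1])
             for x in range(len(prev[0]) // 2)]
            for y in range(len(prev) // 2)
        ]
    return levels


def reject(levels, new_z, bx, by, size, rejected):
    """Return how many pixels of block (bx, by) the new polygon need not shade."""
    if new_z >= levels[size][by][bx]:
        # Everything already stored in this block is nearer: drop the whole block.
        rejected[size] = rejected.get(size, 0) + 1
        return size * size
    if size == 1:
        return 0
    half = size // 2
    return sum(reject(levels, new_z, 2 * bx + dx, 2 * by + dy, half, rejected)
               for dy in (0, 1) for dx in (0, 1))


FAR = 1.0
# Polygon 1 (depth 0.2) already covers a 5x6 corner of the tile.
zbuf = [[0.2 if (x < 5 and y < 6) else FAR for x in range(8)] for y in range(8)]
levels = build_pyramid(zbuf)

# Polygon 2 (depth 0.6) arrives; start at the four 4x4 blocks of the tile.
rejected = {}
skipped = sum(reject(levels, 0.6, bx, by, 4, rejected)
              for by in range(2) for bx in range(2))
print(skipped, rejected)
```

In this toy scene one 4x4 block, two 2x2 blocks and six single pixels are thrown away, 30 pixels in all, and most of them are rejected without ever touching the per-pixel level.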
Like all modern accelerators, the R300 sports an Early Z Test. Its idea
is simple: real color values (hence texture values and test results as
well) are calculated only for visible pixels. Obviously, the more
complicated the shaders and texturing methods, the more memory bandwidth
and accelerator clocks this technology saves. On a typical scene with an
overdraw factor of 2 it will discard about a quarter or a third of the
pixels, and at best 50% if the scene is rendered in sorted order.
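Where do "a quarter" and "50%" come from? Here is a back-of-the-envelope sketch under our own simplifying assumptions: every pixel is covered by exactly `overdraw` opaque layers, and in the unsorted case the layers arrive in a uniformly random order. The function names are hypothetical.

```python
from fractions import Fraction


def rejected_random_order(overdraw: int) -> Fraction:
    """Expected share of fragments failing the early Z test with unsorted layers."""
    # The i-th fragment at a pixel passes only if it is the nearest of the
    # first i fragments, which happens with probability 1/i.
    passed = sum(Fraction(1, i) for i in range(1, overdraw + 1))
    return 1 - passed / overdraw


def rejected_front_to_back(overdraw: int) -> Fraction:
    """Share rejected when the scene is drawn nearest-first: only the first layer passes."""
    return 1 - Fraction(1, overdraw)


print(rejected_random_order(2))    # unsorted scene, overdraw factor 2
print(rejected_front_to_back(2))   # sorted scene, overdraw factor 2
```

For an overdraw factor of 2 this gives 1/4 for an unsorted scene and 1/2 for a sorted one, which agrees with the estimate above; real scenes are partially sorted and land somewhere in between.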
It is interesting how NVIDIA is going to name the similar technologies
of its new chip: LMA III or, unlike ATI, LMA 3? However that may be, it is
clear that NVIDIA won't keep the previous name LMA II :-).
Lecture 5 - pixel pipelines and texture units
With DX9 the requirements for the complexity of the chip's pixel pipelines
will rise. The main catalyst of these requirements is version 2.0 of
pixel shaders:
| Version | 1.1 | 1.4 | 2.0 |
| Textures per pass, max. | 4 | 6 | 16 |
| Texture sampling instructions, max. | 4 | 6*2 | 32 |
| Computational instructions, max. | 8 | 8*2 | 64 |
| Data formats | I8[4] | I12[4] | F32[4] |
| Instruction flow management | No | No | No |
| Output of several values | No | No | Up to 4 values |
| Z buffer access | No | write | read and write |
| Constant registers | 8 | 8 | 16 |
| General registers | 2 | 2 | 8 |
When describing Cine FX, an API-independent analog of the higher-level
effect files of DirectX 9 that compiles both for the latest versions of
OpenGL and for DX9, NVIDIA mentions pixel shaders of 1024 instructions
(!) processed continuously in one pass. A pixel shader can use up to 512
constants, each counted as one instruction. It seems that in this
respect the NV30 is far ahead of the DX9 requirements.
Earlier, pixel shaders were implemented with stages: the number of texture
stages was equal to the maximum number of textures used, and the number of
computational stages was equal to the maximum number of instructions. Each
computational stage had a normal ALU and could execute any shader
instruction. The stages were configured for their instructions and then
combined in a chain. As a result, the data being processed (the values of
two general registers) passed through all the stages, and each stage
carried out its instruction on them. An operation took one clock, hence a
pipeline of 8 stages processed up to 8 different pixels at different stages
simultaneously. Clock by clock, the pipeline produced the following:
| | Clock 1 | Clock 2 | Clock 3 | Clock 4 |
| Stage 1 (ADD) | pixel 1 | pixel 2 | pixel 3 | pixel 4 |
| Stage 2 (MUL) | - | pixel 1 | pixel 2 | pixel 3 |
| Stage 3 (MUL) | - | - | pixel 1 | pixel 2 |
| Result | - | - | - | pixel 1 |
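The table can be reproduced with a toy simulation. This is a sketch of the stage principle only, assuming every stage takes exactly one clock and a new pixel enters the chain on every clock; the function and variable names are ours.

```python
def simulate_stage_pipeline(stages, pixels, clocks):
    """Return one (stage occupancy, finished pixel) pair per clock."""
    slots = [None] * len(stages)      # pixel currently held by each stage
    next_pixel = 1
    history = []
    for _ in range(clocks):
        finished = slots[-1]          # pixel that has left the last stage
        slots = [None] + slots[:-1]   # every pixel advances by one stage
        if next_pixel <= pixels:
            slots[0] = next_pixel     # a new pixel enters the first stage
            next_pixel += 1
        history.append((list(slots), finished))
    return history


stages = ["ADD", "MUL", "MUL"]
history = simulate_stage_pipeline(stages, pixels=4, clocks=4)
for clock, (slots, finished) in enumerate(history, start=1):
    print(f"clock {clock}: stages={slots} result={finished}")
```

After the initial latency of three clocks the chain delivers one finished pixel per clock, however many stages it contains; the price is one ALU per stage.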
But the chip makers couldn't actually afford even 8 stages per pipeline:
32 normal ALUs, even of an integer format, would occupy too much space on
the chip. Usually each pixel pipeline was given 2 or 4 stages (the Matrox
Parhelia 512 had 5), and for a longer shader the stages of 2 or 4 pipelines
were combined in a chain. The number of pixels shaded per clock fell 2-4
times in that case.
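The 2-4 times drop follows directly from the chaining. A minimal sketch, assuming whole pipelines are joined until the chain is long enough for the shader (the function name and the exact chaining rule are our simplification):

```python
import math


def pixels_per_clock(pipelines: int, stages_per_pipeline: int, shader_length: int) -> int:
    """Pixels finished per clock when pipelines are chained to fit the shader."""
    # Number of pipelines that must be joined so the chain holds every instruction.
    chained = math.ceil(shader_length / stages_per_pipeline)
    if chained > pipelines:
        raise ValueError("shader does not fit into one pass")
    return pipelines // chained


for length in (2, 4, 8):
    print(length, "instructions ->", pixels_per_clock(4, 2, length), "pixels per clock")
```

With 4 pipelines of 2 stages each, a 2-instruction shader runs at 4 pixels per clock, a 4-instruction one at 2, and an 8-instruction one at a single pixel per clock.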
As shaders get more complex, this approach ceases to be advantageous. It
would be necessary to provide at least 64 single-clock ALUs (for the stage
approach), which is unrealizable, especially with floating-point data
representation. Besides, the number of temporary registers whose values
must be stored in each ALU and passed from stage to stage at each clock
is increasing. And what should we do when shaders become lengthier still?
Let's see what we have on the R300. There are 8 pixel pipelines, each
equipped with its own pixel shader processor. This is not a set of
switched stages with ALUs but a genuine (RISC) processor which executes
one instruction per clock. The lack of instruction flow management
simplifies matters. The longer the shader, the longer the wait for its
result. On the other hand, the amount of work done per clock is not so
crucial anymore: now we can build almost any scene in one or two passes,
and this is much more beneficial than several passes of speedier but
simpler shaders. The limit on the number of instructions in the new
approach is quite nominal: nothing prevents the processor from executing
256 or 1024 instructions in turn; the only thing required is memory on the
chip. It's interesting that, for compatibility with the first shader
versions, the pixel pipelines of the R300 and the NV30 support calculations
not only in the floating-point formats F32 and F16 but also in the integer
format I12. Without such support, processing old shaders could bring some
unpleasant problems: emulation of some instructions might require up to 4
operations!
Editor's note: Almost a portrait of the author of this article.
Moreover, to accelerate calculations we can try a superscalar approach,
if only in its simplest version, as in the first superscalar RISC
processors. Each ALU has several functional units: an addition and
subtraction unit, a multiplication unit, a division unit, and a separate
unit managing data transfer between registers. It's not a great problem to
create a processor which can simultaneously execute instructions that use
different units, provided that they are not dependent, i.e. when the
following instruction can be processed irrespective of the result of the
previous one. That is why accelerator developers and Microsoft recommend
taking dependences between neighbouring instructions into account and
getting rid of them where possible. On the other hand, more advanced
speculative execution, with instruction reordering and rollback and with
register renaming, makes no sense for shader processors now: it is too
expensive, given the unjustified increase in the complexity of each shader
processor. As usual in graphics, it's more advantageous to parallelize
shader execution at the object level (the level of vertices and pixels),
by increasing the number of parallel processors, than at the instruction
level: the algorithms are short and neighboring instructions are too
tightly bound. That is why the number of pixel and vertex pipelines has
doubled compared with the R200.
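To see why dependences between neighbouring instructions matter, here is a toy model of the simplest superscalar scheme. It is our own simplification, not a description of the R300: instructions issue strictly in order, at most two per clock, and two neighbours share a clock only if they use different functional units and the second neither reads nor overwrites the result of the first. Register and unit names are hypothetical.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    unit: str     # functional unit, e.g. "add" or "mul"
    dest: str     # register written
    srcs: tuple   # registers read


def clocks_dual_issue(program):
    """Clocks needed with in-order issue of up to two instructions per clock."""
    clocks = 0
    i = 0
    while i < len(program):
        first = program[i]
        if i + 1 < len(program):
            second = program[i + 1]
            independent = (first.dest not in second.srcs
                           and first.dest != second.dest)
            if independent and first.unit != second.unit:
                i += 1        # the neighbour issues in the same clock
        i += 1
        clocks += 1
    return clocks


# A chain where every instruction needs the previous result.
chain = [
    Instr("mul", "r0", ("t0", "c0")),
    Instr("add", "r0", ("r0", "c1")),
    Instr("mul", "r0", ("r0", "c2")),
    Instr("add", "r0", ("r0", "c3")),
]
# Four instructions with no dependences between neighbours.
mixed = [
    Instr("mul", "r0", ("t0", "c0")),
    Instr("add", "r1", ("t1", "c1")),
    Instr("mul", "r2", ("t2", "c2")),
    Instr("add", "r3", ("t3", "c3")),
]
print(clocks_dual_issue(chain), clocks_dual_issue(mixed))
```

The dependent chain takes 4 clocks and the independent mix only 2, which is exactly the kind of gain the recommendation about neighbouring instructions is after.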
In the near future pixel processors will become full equivalents (in
capabilities) of vertex ones because of the same data format and the
same arithmetic instructions; the only thing lacking is instruction flow
management, but this problem can be solved. The distinction between
pixel and vertex processors will vanish. In several architecture
generations a graphics accelerator will turn into a set of identical
general-purpose vector processors with flexibly configurable queues for
asynchronous transfer of parameters between them. The processors will be
assigned on the fly, depending on the rendering approach used and on the
balance of performance required for particular tasks:
- some will be in charge of animation and tessellation (geometry generation),
- some will control geometrical transformations,
- some will manage shading and lighting,
- some will deal with texture sampling (they will be intelligent texture
units able to program arbitrary filtering methods or calculate procedural
textures).
The R300 is based on the 8x1 configuration - each pixel pipeline has only
one texture unit connected:
One of eight pixel pipelines of the R300
It seems that this is a forced economy caused by the .15-micron fab
process. We can come up with a lot of real situations in a pixel shader
where waiting for the results of a single texture unit significantly slows
down the processing of the shader! Such stalls could be avoided with a
second texture unit, lifting the speed of pixel shader processing 1.5 or
2 times. Well, let's leave that to ATI and be happy that, in spite of
there being just one texture unit, it can perform trilinear filtering
without speed losses, and can also combine trilinear and anisotropic
filtering (the inability to do so was a well-known downside of the R200).
For such long shaders it's rational to organize the texture units a bit
differently. Suppose the units are not bound to a certain pixel pipeline
but serve any of them as texture sampling requests arrive. We could then
run shaders on different pipelines with a time shift of several
instructions to even out the irregular interleaving of calculations and
texture accesses. First all the units would serve the pipelines which are
waiting for textures while the other pipelines carry out calculations;
then the roles would swap. In this case the downtime would be much lower,
and 8 shared units would be enough for 8 pipelines. It's possible that ATI
follows this approach but doesn't want to reveal the details. And it's
possible that NVIDIA will take this approach for one of its future chips,
because this idea was once discussed by engineers from 3dfx, which was
absorbed by NVIDIA.
Lecture 6 - vertex shaders and higher-level languages
Vertex shaders haven't changed as much as pixel ones, but at the same time
they are improved by a great margin: they are now able to control the
instruction flow. Now we have subroutines, loops, conditional and
unconditional jumps.
| Versions | 1.0 | 2.0 |
| Instructions, max. | 128 | 256 |
| Instruction flow management | No | Yes |
| Data format | F32[4] | F32[4] |
| Constant registers | 96 | 256 |
| General registers | 8 | 16 |
At present all decisions to change the instruction flow are based on
constants passed to the shader; this makes it a problem to take decisions
on the fly separately for each vertex. It's not clear why Microsoft has
decided on this: the ATI R300 (and the NVIDIA NV30) are likely not to
unroll loops and subroutines into a continuous row of instructions but to
let the pointer to the next instruction move around the instruction memory
inside the chip. Well, in the next DX generation this limitation will be
eliminated, and we will be able to call the vertex pipelines of any
accelerator vertex processors. Unlike the R300, the NV30 is already able
to control the order of instructions according to data from temporary
registers, like any ordinary processor. On the other hand, the R300 allows
executing shaders of up to 1024 instructions, the NV30 only up to 256 (and
up to 65536 instructions when loops and subroutines are unrolled).
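Because loop counts come only from constants, they are the same for every vertex, so a loop can always be unrolled into a flat row of instructions before the shader runs. The sketch below shows the difference between the stored and the executed instruction counts; the program representation, the instruction mnemonics used as plain strings and the constant name `num_lights` are hypothetical and have nothing to do with any real driver.

```python
def unroll(program, constants):
    """Expand ("loop", constant_name, body) entries into a flat instruction list."""
    flat = []
    for item in program:
        if isinstance(item, tuple) and item[0] == "loop":
            _, count_name, body = item
            # The trip count is a shader constant, identical for all vertices.
            for _ in range(constants[count_name]):
                flat.extend(unroll(body, constants))
        else:
            flat.append(item)
    return flat


program = ["mov", ("loop", "num_lights", ["dp3", "max", "mad"]), "mov"]
constants = {"num_lights": 4}

executed = unroll(program, constants)
print(len(executed), executed)
```

Here 5 stored instructions (plus the loop bookkeeping) turn into 14 executed ones; a chip that moves its instruction pointer around the instruction memory gets the same result without spending memory on the unrolled copy.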
Everything that was said in the previous part about the superscalar
implementation applies, probably to an even greater degree, to vertex
shaders. Quite lengthy shaders make us think about optimizing them so
that instructions can be executed in parallel.
When the hardware and API developers got the possibility to execute shaders
of thousands of instructions, they turned to higher-level languages. It's
much more pleasant to deal with some C dialect than with assembler code,
which hasn't been used elsewhere for 8-10 years already. At last the
hardware corresponds to the required level, and now instead of thousands of
constants and instructions we have hundreds and instead of hundreds we have
tens. Soon the complexity of programs for an accelerator can become equal
to that of programs for ordinary processors, at least for the part that
manages 3D graphics.
For example, NVIDIA announced its C Graphics (CG) dialect, which at first
wasn't user-friendly at all; despite all its disadvantages, it is a
cross-API tool: shader code can be compiled both in the OpenGL and in the
Direct3D environments. The compiler comes with a rich set of effects and
samples. There is a new CG version, for DX9, which is handier regarding
data binding and usage, and it can be called a de facto standard.
Microsoft is not in a hurry either: it is debugging its HLSL, which
is actually the same CG (or it can be vice versa, because the development
was carried out by NVIDIA and Microsoft together) but works only
within DirectX. Besides, at present HLSL works only with vertex
shaders.
ATI doesn't stand idle either and has announced its Render Monkey. This
dialect is different. NVIDIA's CG and Cine FX (an analog of techniques
and effects from DX9 which, like CG, is cross-API!) are the most convenient
ones, at least thanks to the export plugins for popular 3D modeling and
realistic graphics packages.
Rendered Monkey :-)
Lecture 7 - anti-aliasing and video capabilities
There is no breakthrough in the anti-aliasing technique: we have the
same SMOOTHVISION 2x, 4x and 6x, although it is now named SMOOTHVISION 2.0.
However, with the same approach to forming pseudorandom sample patterns,
we now have the multisampling method (MSAA), which should improve
performance compared with the SSAA-based SMOOTHVISION of the R200. The
first one was also good, though. The speed of the MSAA version has
reportedly become greater, maybe because of the wider bus or the optimized
algorithm. In the practical part of the R300 review we will carefully
examine the performance drop when FSAA and anisotropic filtering are
enabled. It should also be noted that on transparent textures (with an
alpha channel) the chip switches to the SSAA mode and computes all samples
for each pixel of the triangle (not only for its edges).
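The cost difference between the two modes is easy to estimate. A rough sketch under our own assumptions: with MSAA the colour (textures plus shader) is evaluated once per pixel and only depth and coverage are kept per sample, with SSAA the colour is evaluated for every sample, and a given share of the screen falls back to SSAA because of alpha textures. The function name and the example numbers are hypothetical.

```python
def colour_evaluations(pixels: int, samples: int, alpha_share: float = 0.0) -> float:
    """Colour evaluations per frame for one layer of geometry."""
    msaa_pixels = pixels * (1 - alpha_share)   # colour computed once per pixel
    ssaa_pixels = pixels * alpha_share         # colour computed for every sample
    return msaa_pixels + ssaa_pixels * samples


pixels = 1024 * 768
pure_msaa = colour_evaluations(pixels, samples=4)
pure_ssaa = colour_evaluations(pixels, samples=4, alpha_share=1.0)
mixed = colour_evaluations(pixels, samples=4, alpha_share=0.25)
print(pure_msaa, pure_ssaa, mixed)
```

So a scene where a quarter of the screen is covered by alpha textures already costs 1.75 times more colour work than pure 4x MSAA, though still well below full 4x SSAA.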
It is interesting what NVIDIA is going to offer in its new chip, since its
various MSAA-based hybrids look outdated compared with SMOOTHVISION.
One more significant aspect of the R300 is the VideoShader technology.
It uses the computational capabilities of the pixel pipelines for some
tasks of encoding/decoding MPEG1/2 video streams, color space conversion,
deinterlacing and some other video processing tasks. The following diagram
shows which tasks fall on the shoulders of the pixel shaders and which
are still handled by dedicated hardware units:
In the near future the flexibility and performance of shader processors
will let them solve quite complicated 2D video tasks (or, rather, the most
computation-intensive parts of such tasks), up to MPEG4 decoding. It might
also be possible to hand them sound compression and voice recognition!
Why not use this huge power to turn an accelerator into a general-purpose
coprocessor?
Conclusion
Well, it's too early to consider ATI and its R300 winners: I'd rather say
the company offered the best combination of price and capabilities
with the junior chip of the 9000 line, the RV250. It's also unfair to
consider the R300 a loser, because it is a competitive solution. So, let's
wait for the cards and for DX9.
According to the information that is available now, and ignoring the yet
unknown prices, I'd put the competitors in the following order: NV30,
R300. Well, friendship loses again.