RCC
Turbo-Charges Simulation
High-density 10 million gate
designs and reconfigurability are posing severe restrictions on conventional
simulation. But RCC-super-charged simulation hands designers a faster,
more efficient methodology.
Overview
System-on-Chip (SoC) and reconfigurable designs are introducing
major new exciting advantages for networking, telecommunication
and multi-media industries. But at the same time, this emerging
technology is posing design issues and challenges, most notably
in the simulation and verification arenas. Chip densities and complexities
of SoC designs pose part of the problem; the other part is that
system designers must now deal with reconfigurable chips, whose
many configurations demand an inordinate amount of testing.
In these cases, general-purpose CPU-based and
conventional fixed instruction set processor-based simulation engines
prove ineffective. General-purpose CPU-based simulation is limited
in performance, while processor-based simulation engines have their
own set of shortcomings. The most glaring is a fixed simulation
algorithm for a fixed design style as well as low performance. Moreover,
this simulation system is considerably more expensive due to the
high cost of the special purpose processor driving it.
Ideally, for simulating SoC designs (see Box:
RCC Simulates SoC Designs), a special engine is critical for turbo-charging
simulation speed. In this instance, the idea is to deploy a
very large number of processing elements to perform parallel
simulation. Rather than simulating a gate at a time as in conventional
simulation, this special engine performs parallel simulation in
one clock cycle. For example, such a system can simulate millions
of elements at a time, and that's how the high speed is achieved.
ReConfigurable Computing (RCC) engines for
simulating this new wave of SoC designs use FPGAs as basic building
components, which present virtually unlimited configurations. Also,
the simulation algorithm is reconfigurable so that the system designer
or developer can execute different algorithms for different design
styles and use modes.
The same RCC engine can be configured for simulating
one design style, and then it can be configured again a different
way to simulate yet another design style to achieve even better
performance. Moreover, the RCC engine can be configured to perform
emulation, as well. Some processor-based engines cannot do emulation
because they have a fixed algorithm for simulation only.
RCC-based simulation hands designers a considerably
more cost-effective and faster approach than does conventional simulation.
The reason is that an FPGA is a standard component, widely used in many
applications, thus providing system OEMs with a device that is well-proven
and cost-effective based on semiconductor economies of scale.
The RCC compilation time is short, with the
complete process taking less than one hour for designs of up to ten
million gates. For design changes, incremental compilation is available,
which reduces compilation time to less than 30 minutes.
An RCC architecture for functional verification
achieves its high speed by having a co-processor containing a massively
parallel structure of processing elements (PEs) specially configured
for each design. A processing element is a small, compact processor
dedicated to performing one function. An example is Axis Systems' Xcite®
product, which utilizes custom processing elements to simulate Verilog
RTL "case" and "if" statements.
Figure 1
When executing, the RCC engine acts as a co-processor: it
obtains instructions and data from the host microprocessor (µP)
and sends the execution command to a single-instruction, multiple-data
(SIMD) controller, which sequences the evaluation and communication
of all RCC processing elements. The sequencer's next step is to
collect all evaluation results from the processing elements, pack
them in a data-stream format and send the resulting data back to
the µP to continue simulation.
By compiling the design's RTL constructs onto
its custom interconnected processing elements, the RCC engine is
programmed for maximum performance execution for each design being
verified. Using a proprietary systolic array interconnection architecture,
communication between processing elements and between multiple devices
is fast and efficient.
Different FPGA Technologies
for Design Verification
Currently, there are two main applications of FPGA technology in
design verification: direct or re-timing prototyping, and processor
composition (RCC engine). In the first category, a design is mapped
wire for wire, gate for gate into the FPGAs. This approach lacks
built-in intelligence or a methodology for composing or changing
a design, and no other processor is introduced on top of the design
for the purpose of executing that design.
The prototyping-based approach is to first
synthesize an RTL or gate-level design into a netlist. Then, the
netlist is partitioned into multiple FPGAs, followed by timing management.
Performing timing management is crucial to achieve design success
once the design is mapped into multiple FPGAs. However, prototyping-based
approaches to simulation pose several issues associated with timing
management, processing speed, and inter-FPGA communication.
As for timing management, race conditions
and glitches arise in prototyping-based approaches when the design
is partitioned into multiple FPGAs. In these instances, clock transitions
reach different chips at different times. Hence, when a set of data
arrives earlier than the clock, a different value is consequently
latched.
Different techniques are used to resolve this
issue. In direct prototyping, the delay is calculated. Once a single
wire crosses a chip boundary, extra delay is introduced. Therefore,
it is important to compensate for those delays. Here, a path delay
adjustment is implemented to avoid race conditions and glitches.
In effect, delay logic is added.
Global re-timing to a virtual clock is used
in the re-timing prototype scheme to avoid race conditions and glitches.
A reference clock is used here to re-time and schedule communication
within the entire system. Thus, all latches and flip flops are enabled
based on the reference clock so that circuit components are properly
scheduled for operation.
Processing speed issues are similar for prototyping-based
technologies. In direct prototype, the whole system must be
slowed down to meet the longest delay path in the prototype. The
same is true for re-timing prototype. The whole system is slowed
down to the longest scheduled virtual clock cycles after re-timing.
Lastly, there are issues associated with inter-FPGA
communication. Direct prototype utilizes direct wiring, which requires
a global crossbar network to provide the needed wires. Here, chip
pins are configured to specific locations; however, those configurations
change for different designs. Therefore, the system needs dynamic
interconnects between chips. A crossbar dynamically changes configuration
between chips. The re-timing technology, on the other hand, uses
a virtual wire or clock to schedule a signal to statically transfer
across the chip boundary.
In total, these timing management, processing
speed, and inter-FPGA communication issues present a real problem
to the user of prototyping-based engines. Basically, they are very
difficult to use. There are three major drawbacks. First, they cannot
swap states with a software simulator. Secondly, they cannot compare
results with a software simulator. And thirdly, they don't have
much design debug support. The only way to debug such a design is
to use a logic analyzer on a particular signal, which proves extremely
difficult for the system designer.
Processor Composition
Processor composition is an entirely different approach to utilizing
the reconfigurability of FPGAs, and this technology is used to compile
a design into the ReConfigurable Computing (RCC) engine. This compilation
technology generates code and composes processors at the same time.
It is different from the compiler for processor-based engines, which
generates code for a fixed-processor architecture. Rather than mapping
wire to wire, this approach directly compiles the RTL or gate primitives
into processing elements (PEs), which are then composed into
design-specific processors and a global SIMD sequencer.
The composed processors have custom instructions
based on a particular design style. Examples include read/write
next register word (an I/O function is needed for swapping with
the software simulator), execute clock, update registers, execute
logic, and read/write memory elements. The global SIMD sequencer
provides the sequencing and synchronization to execute instructions.
Most processor instructions are synchronous
and take a single clock cycle. Other instructions are asynchronous
with event sensing condition feedback to the SIMD sequencer. For
instance, the execute logic instruction is asynchronous with event
sensing. The SIMD sequencer knows the instruction finishes only
when no further event is detected from all the processors.
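The event-sensing behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the engine's actual implementation: the `ProcessingElement` class, the `execute_logic` function, and the pass-based evaluation are hypothetical names and simplifications introduced here for illustration.

```python
from typing import Callable, Dict, List

Signals = Dict[str, int]


class ProcessingElement:
    """Hypothetical PE: computes one output signal from the current signal values."""

    def __init__(self, output: str, func: Callable[[Signals], int]) -> None:
        self.output = output
        self.func = func

    def evaluate(self, signals: Signals) -> int:
        return self.func(signals)


def execute_logic(pes: List[ProcessingElement], signals: Signals,
                  max_passes: int = 1000) -> int:
    """Run evaluation passes until no PE reports an event.

    Returns the number of passes used, which depends on the input values
    rather than on a fixed worst-case delay.
    """
    for passes in range(1, max_passes + 1):
        # Every PE evaluates against the same snapshot (lockstep, SIMD-style).
        new_values = {pe.output: pe.evaluate(signals) for pe in pes}
        # An "event" is any output whose value changed in this pass.
        events = [name for name, value in new_values.items()
                  if signals.get(name) != value]
        if not events:
            return passes  # the sequencer sees no events: instruction is done
        signals.update(new_values)
    raise RuntimeError("logic did not settle; possible combinational loop")


def demo_circuit() -> List[ProcessingElement]:
    """Assumed two-gate example: b = NOT a, c = b AND d."""
    return [
        ProcessingElement("b", lambda s: 1 - s["a"]),
        ProcessingElement("c", lambda s: s["b"] & s["d"]),
    ]
```

In this model, an input that disturbs little logic settles in fewer passes than one that ripples through the whole chain, which is the input-dependent completion the sequencer relies on.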
With this approach, timing management isn't
required, since execution is under processor control. Here, the design
is compiled to a processor, and the processor executes the model.
Because the model runs under processor control, it is timing-insensitive
and glitch-free by construction.
Also, a new technology called dynamic event
sensing is used to self-adjust the processing time per input, as
opposed to fixed processing time for all possible inputs used in
the prototyping-based schemes. Dynamic event sensing permits processing
time to be adjusted to the best possible time for a given input,
thus optimizing the processing. By adjusting the system to execute
at the minimum delay time based on the input stimulus, the system
can run 10 to 30 times faster compared to the fixed worst delay
time. The reason is that more than 95 percent of inputs exercise less
than 10 percent of the logic in most ASIC designs. Lastly, event-driven
communication is used for inter-FPGA communication. This means that
only the signals that change values are sent to other chips. If
no signals change, there is no communication overhead incurred.
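The idea of event-driven communication between chips can be sketched in a few lines. The `ChipLink` class and `changed_signals` helper below are hypothetical, introduced only to illustrate "send only what changed"; they do not describe the actual inter-FPGA protocol.

```python
from typing import Dict


def changed_signals(previous: Dict[str, int],
                    current: Dict[str, int]) -> Dict[str, int]:
    """Return only the boundary signals whose value differs from the last transfer."""
    return {name: value for name, value in current.items()
            if previous.get(name) != value}


class ChipLink:
    """Hypothetical link carrying boundary signals from one FPGA to another."""

    def __init__(self) -> None:
        self.last_sent: Dict[str, int] = {}
        self.messages_sent = 0

    def transfer(self, boundary: Dict[str, int]) -> Dict[str, int]:
        """Send the changed signals, if any, and return what was sent."""
        delta = changed_signals(self.last_sent, boundary)
        if delta:
            # Traffic is generated only when at least one signal changed.
            self.messages_sent += 1
            self.last_sent.update(delta)
        return delta
```

Calling `transfer` twice with identical boundary values sends nothing the second time, so an idle boundary costs no communication in this model.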
Best of all, this processor composition approach
allows the designer to swap the simulation state at the RTL level.
This allows designers to change simulation to run fast or slow under
their control. Also, processor composition permits complete state
compare, time step by time step, without creating value change dump
(VCD). This approach allows the designer to run the software simulator
and then compare the design with the hardware engine at the same
time. This is important for designers who start their chip designs
in a software simulator for convenience.
But once the design becomes bigger and more complex, the designer
migrates it from the software simulation environment to a hardware
acceleration engine.
This migration process is difficult in many
cases because the hardware may produce different results than those
produced by software simulation. Processor composition is valuable
in this regard because once the design is mapped, the designer can
keep the software simulator running and compare the software and
hardware simulation results, time step by time step, to determine
if there is a mismatch. If there is, processor composition can tell
which signal out of millions in the design is mismatched. Also,
the design can be checked for race and glitch problems, as well
as other design bugs.
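A time-step-by-time-step state compare can be pictured as follows. This is a minimal sketch that assumes each engine exposes its state as a mapping from signal name to value at every step; the `first_mismatch` function and that data layout are illustrative assumptions, not the product's interface.

```python
from typing import Dict, Iterable, Optional, Tuple

State = Dict[str, int]
Mismatch = Tuple[int, str, Optional[int], Optional[int]]


def first_mismatch(software: Iterable[State],
                   hardware: Iterable[State]) -> Optional[Mismatch]:
    """Compare two state traces step by step.

    Returns (time_step, signal_name, software_value, hardware_value) for the
    first differing signal, or None if the traces agree over their common length.
    """
    for step, (sw, hw) in enumerate(zip(software, hardware)):
        # Check every signal known to either engine, in a stable order.
        for name in sorted(sw.keys() | hw.keys()):
            if sw.get(name) != hw.get(name):
                return step, name, sw.get(name), hw.get(name)
    return None
```

Because the comparison works directly on states, it needs no intermediate waveform file, and the first mismatch identifies both the time step and the signal.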
As for design debug support, processor composition
completely eliminates the logic analyzer since the technology allows
swapping between the software and hardware simulation. It also includes
a special debugging technology called Value Change Dump (VCD) on-demand.
VCD, a term from hardware description language (HDL) simulation, refers
to a record of the signal value changes that make up a simulation's waveform.
In traditional simulation, the designer specifies
where he or she wants to create a waveform or VCD. Here, the designer
has to specify a small area in a million or multi-million gate
design. Otherwise, if the VCD or waveform selected is too large,
the designer will exhaust the system's disk storage. The created
waveform is then analyzed to determine the root cause of a design
problem. In this case, the disk space requirement for a million-cycle
simulation of a 2 million-gate design is about 100 gigabytes.
VCD on-demand technology, on the other hand,
eliminates the need for creating a waveform or VCD file during simulation.
More importantly, it permits the designer to have on-demand extraction
of waveform history information during simulation. For example,
as shown in Fig. 2, the designer can recall necessary data after
five hours of simulation to check on a particular portion of a design
and create a value change dump on-demand. All node changes in a
waveform format can be extracted within any time range and design
hierarchy from time zero without restarting simulation. VCD on-demand
requires only one megabyte of disk storage for a million-cycle simulation
of a 2 million-gate design, which is a 100,000-fold reduction in
disk space.
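The storage figures quoted above can be checked with simple arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes); the helper names are illustrative only.

```python
GB = 10**9  # assumption: decimal gigabyte
MB = 10**6  # assumption: decimal megabyte
CYCLES = 10**6  # the million-cycle simulation from the text


def reduction_factor(full_bytes: int, on_demand_bytes: int) -> float:
    """How many times smaller the on-demand storage is than a full dump."""
    return full_bytes / on_demand_bytes


def bytes_per_cycle(total_bytes: int, cycles: int) -> float:
    """Average storage consumed per simulated cycle."""
    return total_bytes / cycles


full_vcd = 100 * GB   # full VCD for a 2 million-gate design
on_demand = 1 * MB    # VCD on-demand for the same run
```

Under these assumptions, `reduction_factor(full_vcd, on_demand)` gives the 100,000-fold figure, and the per-cycle averages work out to 100,000 bytes for the full dump versus 1 byte for VCD on-demand.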
Box: RCC Simulates SoC Level
Designs and Reconfigurable Systems
The RCC technology is largely directed at system-on-a-chip (SoC)
designs; however, it is also highly applicable to reconfigurable systems.
In the future, RCC-simulation will be an even more important tool
as the industry increasingly moves into reconfigurable applications.
Reconfigurable systems are gaining greater momentum in many design
camps, particularly in networking applications, which rely heavily
on high-density four- to 10-million-gate ASIC chips. Typically, network
system OEMs use these chips for Layer 1 (100BASE-T and SONET) and
Layer 2 (Ethernet and ATM) functions of various port types. These
are general purpose chips that comply with most system designs.
However, aside from their configurability in some cases, they are
highly complex and demand intensive system simulation. A design
in these instances includes multiple RISC or reconfigurable computing
processors, memory, buffers, lookup tables, and comparison logic.
Due to the size of these designs, some
reaching upwards of 10 million gates, verification based on conventional
simulation takes an inordinate amount of time. For example,
a regression run for an OC-192 performance system design takes five
days on multiple workstations.
Moreover, the programmable or reconfigurable
nature of reconfigurable systems and chips further exacerbates these
simulation issues. The crux of the issue is the need to run
a virtually endless variety of tests because a design can have many
different configurations. In theory, the designer should test
all possible configurations to verify a chip design, although, practically
speaking, that is impossible. Still, the more tests performed on
a reconfigurable design, the greater the confidence in the
chip's eventual correct operation.
Due to increasing complexity, density, and
programmability, and the growing popularity of reconfigurability in
next-generation systems, RCC-based simulation is in great demand.
The reason is that it is inherently faster and more efficient, and is
critical for reducing functional verification time
from days to hours, helping system OEMs improve their time to
market.