RCC Turbo-Charges Simulation
High-density, 10-million-gate designs and reconfigurability are placing severe restrictions on conventional simulation. But RCC-super-charged simulation hands designers a faster, more efficient methodology.


Overview

System-on-Chip (SoC) and reconfigurable designs are bringing major new advantages to the networking, telecommunications, and multimedia industries. At the same time, this emerging technology poses design challenges, most notably in simulation and verification. The density and complexity of SoC designs are one part of the problem; the other is that system designers must now deal with reconfigurable chips, which require an inordinate amount of testing.

In these cases, general-purpose CPU-based simulation and conventional fixed-instruction-set, processor-based simulation engines prove ineffective. General-purpose CPU-based simulation is limited in performance, while processor-based simulation engines have their own shortcomings. The most glaring are a fixed simulation algorithm tied to a fixed design style and low performance. Moreover, such a simulation system is considerably more expensive due to the high cost of the special-purpose processor driving it.

Ideally, for simulating SoC designs (see Box: RCC Simulates SoC Level Designs and Reconfigurable Systems), a special engine is critical for turbo-charging simulation speed. The idea is to deploy a very large number of processing elements to perform parallel simulation. Rather than simulating one gate at a time, as in conventional simulation, this special engine evaluates many elements in parallel in one clock cycle. For example, such a system can simulate millions of elements at a time, and that is how the high speed is achieved.

ReConfigurable Computing (RCC) engines for simulating this new wave of SoC designs use FPGAs as basic building components, which present virtually unlimited configurations. Also, the simulation algorithm is reconfigurable so that the system designer or developer can execute different algorithms for different design styles and use modes.

The same RCC engine can be configured to simulate one design style and then reconfigured to simulate another design style with even better performance. Moreover, the RCC engine can also be configured to perform emulation. Some processor-based engines cannot do emulation because they have a fixed algorithm for simulation only.

RCC-based simulation hands designers a considerably more cost-effective and faster approach than does conventional simulation. The reason is an FPGA is a standard component, widely used in many applications, thus providing system OEMs with a device that is well-proven and cost-effective based on semiconductor economies of scale.

The RCC compilation time is short, with the complete process taking less than one hour for designs of up to ten million gates. For design changes, incremental compilation is available, which reduces compilation time to less than 30 minutes.

An RCC architecture for functional verification achieves its high speed by having a co-processor containing a massively parallel structure of processing elements (PEs) specially configured for each design. A processing element is a small compact processor dedicated to perform one function. An example is Axis Systems' Xcite® product, which utilizes custom processing elements to simulate Verilog RTL "case" and "if" statements.

Figure 1: RCC

When executing, the RCC engine acts as a co-processor: it obtains instructions and data from the host microprocessor (µP) and sends the execution command to a single-instruction, multiple-data (SIMD) controller, which sequences the evaluation and communication of all RCC processing elements. The sequencer then collects the evaluation results from the processing elements, packs them into a data-stream format, and sends the resulting data back to the µP to continue simulation.
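The flow described above can be sketched in a few lines of Python. This is only an illustrative model, not the Xcite implementation: the class names `ProcessingElement` and `SimdSequencer`, the function `host_step`, and the two example elements are all hypothetical, and the loop stands in for what the hardware would evaluate in parallel.

```python
from typing import Callable, Dict, List, Tuple

Inputs = Dict[str, int]


class ProcessingElement:
    """Hypothetical PE: a small processor dedicated to one function."""

    def __init__(self, name: str, func: Callable[[Inputs], int]) -> None:
        self.name = name
        self.func = func

    def evaluate(self, data: Inputs) -> int:
        return self.func(data)


class SimdSequencer:
    """Hypothetical SIMD controller: one command, applied to every PE."""

    def __init__(self, elements: List[ProcessingElement]) -> None:
        self.elements = elements

    def execute(self, data: Inputs) -> List[Tuple[str, int]]:
        # Real hardware would evaluate all PEs in parallel; this loop only
        # models the result. The list plays the role of the packed stream.
        return [(pe.name, pe.evaluate(data)) for pe in self.elements]


def host_step(sequencer: SimdSequencer, data: Inputs) -> Dict[str, int]:
    """Host side: send data, receive the packed results, unpack them."""
    stream = sequencer.execute(data)
    return dict(stream)


sequencer = SimdSequencer([
    ProcessingElement("and_ab", lambda d: d["a"] & d["b"]),
    ProcessingElement("or_ab", lambda d: d["a"] | d["b"]),
])
results = host_step(sequencer, {"a": 1, "b": 0})
```

The host sees a single call per step, while the sequencer decides how the elements are evaluated and how their results are packed.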

By compiling the design's RTL constructs onto its custom interconnected processing elements, the RCC engine is programmed for maximum execution performance on each design being verified. Using a proprietary systolic array interconnection architecture, communication between processing elements and between multiple devices is fast and efficient.


Different FPGA Technologies for Design Verification

Currently, there are two main applications of FPGA technology in design verification: direct or re-timing prototype, and processor composition (RCC engine). In the first category, a design is mapped wire for wire, gate for gate into the FPGAs. This approach lacks built-in intelligence or a methodology for composing or changing a design, and no other processor is introduced on top of the design for the purpose of executing that design.

The prototyping-based approach first synthesizes an RTL or gate-level design into a netlist. The netlist is then partitioned across multiple FPGAs, followed by timing management. Timing management is crucial to design success once the design is mapped into multiple FPGAs. However, prototyping-based approaches to simulation pose several issues associated with timing management, processing speed, and inter-FPGA communication.

As for timing management, race conditions and glitches arise in prototyping-based approaches when the design is partitioned into multiple FPGAs. In these instances, clock transitions reach different chips at different times. Hence, when a set of data arrives earlier than the clock, a different value is latched than the one intended.

Different techniques are used to resolve this issue. In direct prototyping, the delay is calculated. Whenever a wire crosses a chip boundary, extra delay is introduced, so it is important to compensate for those delays. Here, a path delay adjustment is implemented to avoid race conditions and glitches. In effect, delay logic is added.
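A minimal sketch of the race and of the path delay adjustment follows. It assumes a deliberately simplified model in which a flip-flop is meant to capture the old value, and captures the new one instead if the data change reaches it before the clock transition does. The function names, the 0.5 ns margin, and the arrival times are invented for illustration.

```python
def latched_value(old_value: int, new_value: int,
                  data_arrival_ns: float, clock_arrival_ns: float) -> int:
    """Value a flip-flop captures in this simplified model.

    The flop is supposed to capture old_value. If the new data arrives
    before the (skewed) clock transition, new_value is latched instead.
    """
    if data_arrival_ns < clock_arrival_ns:
        return new_value  # race: data beat the clock
    return old_value


def compensating_delay_ns(data_arrival_ns: float, clock_arrival_ns: float,
                          margin_ns: float = 0.5) -> float:
    """Extra data-path delay so data arrives margin_ns after the clock."""
    return max(0.0, clock_arrival_ns + margin_ns - data_arrival_ns)


# Hypothetical numbers: the clock reaches the second chip 1 ns after the data.
raced = latched_value(0, 1, data_arrival_ns=2.0, clock_arrival_ns=3.0)
pad = compensating_delay_ns(2.0, 3.0)
fixed = latched_value(0, 1, data_arrival_ns=2.0 + pad, clock_arrival_ns=3.0)
```

With the padding applied, the data change arrives after the clock transition and the intended value is captured, which is the effect the added delay logic is meant to achieve.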

Global re-timing to a virtual clock is used in the re-timing prototype scheme to avoid race conditions and glitches. A reference clock is used here to re-time and schedule communication within the entire system. Thus, all latches and flip flops are enabled based on the reference clock so that circuit components are properly scheduled for operation.

Processing speed issues are similar for both prototyping-based technologies. In direct prototype, the whole system must be slowed down to meet the longest delay path in the prototype. The same is true for re-timing prototype: the whole system is slowed down to the longest scheduled virtual clock cycle after re-timing.

Lastly, there are issues associated with inter-FPGA communication. Direct prototype utilizes direct wiring, which requires a global crossbar network to provide the needed wires. Here, chip pins are configured to specific locations; however, those configurations change for different designs. Therefore, the system needs dynamic interconnects between chips, and a crossbar dynamically changes the configuration between chips. The re-timing technology, on the other hand, uses a virtual wire or clock to schedule a signal to transfer statically across the chip boundary.

In total, these timing management, processing speed, and inter-FPGA communication issues present a real problem to the user of prototyping-based engines. Basically, they are very difficult to use. There are three major drawbacks. First, they cannot swap states with a software simulator. Secondly, they cannot compare results with a software simulator. And thirdly, they don't have much design debug support. The only way to debug such a design is to use a logic analyzer on a particular signal, which proves extremely difficult for the system designer.


Processor Composition

Processor composition is an entirely different approach to utilizing the reconfigurability of FPGAs, and this technology is used to compile a design into the ReConfigurable Computing (RCC) engine. This compilation technology generates code and composes processors at the same time. It differs from the compiler for processor-based engines, which generates code for a fixed processor architecture. Rather than mapping wire to wire, this approach directly compiles the RTL or gate primitives into processing elements (PEs), which are then composed into design-specific processors and a global SIMD sequencer.

The composed processors have custom instructions based on a particular design style. Examples include read/write next register word (an I/O function is needed for swapping with the software simulator), execute clock, update registers, execute logic, and read/write memory elements. The global SIMD sequencer provides the sequencing and synchronization to execute instructions.

Most processor instructions are synchronous and take a single clock cycle. Other instructions are asynchronous with event sensing condition feedback to the SIMD sequencer. For instance, the execute logic instruction is asynchronous with event sensing. The SIMD sequencer knows the instruction finishes only when no further event is detected from all the processors.
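The asynchronous "execute logic" behavior can be illustrated with a small fixed-point loop: keep evaluating until a pass produces no events. This is a sketch under assumptions, not the engine's actual algorithm; `execute_logic`, the gate tuples, and the three-gate circuit are hypothetical, and each pass reads a snapshot of the signals to mimic elements evaluating in parallel.

```python
from typing import Callable, Dict, List, Tuple

Gate = Tuple[str, Callable[..., int], Tuple[str, ...]]  # (output, function, inputs)


def execute_logic(signals: Dict[str, int], gates: List[Gate],
                  max_passes: int = 1000) -> int:
    """Evaluate all gates repeatedly until no output changes.

    Returns the number of passes, including the final quiet pass in which
    no event was detected. Updates `signals` in place.
    """
    for passes in range(1, max_passes + 1):
        snapshot = dict(signals)  # all gates see the same input values
        events = 0
        for out, func, ins in gates:
            new_value = func(*(snapshot[name] for name in ins))
            if signals[out] != new_value:
                signals[out] = new_value
                events += 1
        if events == 0:  # no event sensed: the instruction is finished
            return passes
    raise RuntimeError("logic did not settle (possible combinational loop)")


gates: List[Gate] = [
    ("n1", lambda a, b: a & b, ("a", "b")),
    ("n2", lambda x: 1 - x, ("n1",)),
    ("out", lambda x, c: x | c, ("n2", "c")),
]
signals = {"a": 1, "b": 1, "c": 0, "n1": 0, "n2": 0, "out": 0}
passes = execute_logic(signals, gates)
```

The number of passes depends on how far a change has to propagate, which is why the sequencer cannot assume a fixed cycle count for this instruction.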

With this approach, timing management isn't required because execution is under processor control. The design is compiled to a processor, and the processor executes the model. Because execution is sequenced by the processor rather than by wire delays, it is timing-insensitive and glitch-free by construction.

Also, a new technology called dynamic event sensing is used to self-adjust the processing time per input, as opposed to the fixed processing time for all possible inputs used in the prototyping-based schemes. Dynamic event sensing permits processing time to be adjusted to the best possible time for a given input, thus optimizing the processing. By adjusting the system to execute at the minimum delay time based on the input stimulus, the system can run 10 to 30 times faster than with the fixed worst-case delay time. The reason is that, in most ASIC designs, more than 95 percent of inputs exercise less than 10 percent of the logic. Lastly, event-driven communication is used for inter-FPGA communication. This means that only the signals that change values are sent to other chips. If no signals change, no communication overhead is incurred.
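Both ideas in this paragraph lend themselves to a short illustration. The sketch below compares a fixed worst-case processing time with a per-input time, and filters inter-chip traffic down to changed signals. All names and numbers are invented for illustration (95 inputs that settle in 5 ns and 5 that need the 100 ns worst case); they are not measurements from any RCC engine.

```python
from typing import Dict, List


def fixed_time_ns(settle_times_ns: List[int]) -> int:
    """Every input is given the worst-case settling time."""
    return len(settle_times_ns) * max(settle_times_ns)


def adaptive_time_ns(settle_times_ns: List[int]) -> int:
    """Each input takes only as long as its own events need to settle."""
    return sum(settle_times_ns)


def changed_signals(previous: Dict[str, int],
                    current: Dict[str, int]) -> Dict[str, int]:
    """Return only the boundary signals whose values changed."""
    return {name: value for name, value in current.items()
            if previous.get(name) != value}


# Invented workload: most inputs touch little logic and settle quickly.
settle_times_ns = [5] * 95 + [100] * 5
speedup = fixed_time_ns(settle_times_ns) / adaptive_time_ns(settle_times_ns)

# Invented chip-boundary state before and after one evaluation.
prev_boundary = {"req": 0, "ack": 0, "data": 7}
next_boundary = {"req": 1, "ack": 0, "data": 7}
to_send = changed_signals(prev_boundary, next_boundary)
```

With these made-up numbers the adaptive scheme is roughly 10 times faster, and only one of the three boundary signals has to cross to the other chip. The actual gain depends on the design and the stimulus.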

Best of all, this processor composition approach allows the designer to swap the simulation state at the RTL level. This allows designers to change simulation to run fast or slow under their control. Also, processor composition permits complete state compare, time step by time step, without creating value change dump (VCD). This approach allows the designer to run the software simulator and then compare the design with the hardware engine at the same time. This is important for designers who start out their chip designs using a software simulator at the beginning due to convenience. But once the design becomes bigger and more complex, the designer migrates it from the software simulation environment to a hardware acceleration engine.

This migration process is difficult in many cases because the hardware may produce different results than those produced by software simulation. Processor composition is valuable in this regard because once the design is mapped, the designer can keep the software simulator running and compare the software and hardware simulation results, time step by time step, to determine if there is a mismatch. If there is, processor composition can tell which signal out of millions in the design is mismatched. Also, the design can be checked for race and glitch problems, as well as other design bugs.
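The time-step-by-time-step comparison can be pictured as follows. This is a minimal sketch assuming each simulator exposes its state as one name-to-value mapping per time step; the function `first_mismatch` and the tiny traces are hypothetical and stand in for a state comparison over millions of signals.

```python
from typing import Dict, List, Optional, Tuple

State = Dict[str, int]


def first_mismatch(sw_trace: List[State],
                   hw_trace: List[State]) -> Optional[Tuple[int, List[str]]]:
    """Return (time step, mismatched signal names) for the first divergence.

    Returns None if the two traces agree at every compared time step.
    """
    for step, (sw_state, hw_state) in enumerate(zip(sw_trace, hw_trace)):
        diff = sorted(name for name, value in sw_state.items()
                      if hw_state.get(name) != value)
        if diff:
            return step, diff
    return None


# Hypothetical traces: the hardware run diverges on signal "r" at step 1.
sw_trace = [{"q": 0, "r": 1}, {"q": 1, "r": 1}]
hw_trace = [{"q": 0, "r": 1}, {"q": 1, "r": 0}]
mismatch = first_mismatch(sw_trace, hw_trace)
```

Stopping at the first divergent step is a deliberate choice here: later mismatches are often consequences of the first one.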

As for design debug support, processor composition completely eliminates the logic analyzer since the technology allows swapping between the software and hardware simulation. It also includes a special debugging technology called Value Change Dump (VCD) on-demand. VCD, a term used in hardware description language (HDL), refers to the entire waveform of a simulation.

In traditional simulation, the designer specifies where he or she wants to create a waveform or VCD. Here, the designer has to specify a small area of a million-gate or multi-million-gate design. Otherwise, if the selected VCD or waveform is too large, the designer will use up the system's disk storage. The created waveform is then analyzed to determine the root cause of a design problem. In this case, the disk space requirement for a million-cycle simulation of a 2 million-gate design is about 100 gigabytes.

VCD on-demand technology, on the other hand, eliminates the need for creating a waveform or VCD file during simulation. More importantly, it permits on-demand extraction of waveform history information during simulation. For example, as shown in Fig. 2, the designer can recall the necessary data after five hours of simulation to check on a particular portion of a design and create a value change dump on demand. All node changes can be extracted in waveform format within any time range and design hierarchy from time zero without restarting simulation. VCD on-demand requires only one megabyte of disk storage for a million-cycle simulation of a 2 million-gate design, which is a 100,000-fold reduction in disk space.
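The storage figures quoted above can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes); the variable names are illustrative only.

```python
GB = 10**9  # decimal gigabyte, assumed
MB = 10**6  # decimal megabyte, assumed

cycles = 1_000_000
full_vcd_bytes = 100 * GB   # conventional full VCD, per the article
on_demand_bytes = 1 * MB    # VCD on-demand, per the article

# Ratio between the two storage requirements.
reduction = full_vcd_bytes // on_demand_bytes

# Average storage per simulated cycle under each scheme.
full_bytes_per_cycle = full_vcd_bytes // cycles
on_demand_bytes_per_cycle = on_demand_bytes // cycles
```

Under these assumptions a full dump averages about 100 kilobytes per cycle, against about one byte per cycle for the on-demand scheme.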


Box: RCC Simulates SoC Level Designs and Reconfigurable Systems

RCC technology is largely directed at system-on-a-chip (SoC) designs; however, it is also highly applicable to reconfigurable systems. In the future, RCC simulation will be an even more important tool as the industry increasingly moves into reconfigurable applications. Reconfigurable systems are gaining momentum in many design camps, particularly in networking applications, which rely heavily on high-density ASIC chips of four to 10 million gates. Typically, network system OEMs use these chips for Layer 1 (100 BaseT and SONET) and Layer 2 (Ethernet and ATM) functions of various port types. These are general-purpose chips that comply with most system designs. However, aside from their configurability in some cases, they are highly complex and demand intensive system simulation. A design in these instances includes multiple RISC or reconfigurable computing processors, memory, buffers, lookup tables, and comparison logic.

Due to the size of these designs, some reaching upwards of 10 million gates, verification based on conventional simulation takes an inordinate amount of time. For example, a regression run for an OC-192 performance system design takes five days on multiple workstations.

Moreover, the programmable or reconfigurable nature of these systems and chips further exacerbates the simulation issues. The crux of the issue is the need to run a virtually endless variety of tests because a design can have many different configurations. In theory, the designer should test all possible configurations to verify a chip design, although, practically speaking, that is impossible. Still, the more tests performed on a reconfigurable design, the greater the confidence that the chip will eventually function correctly.

Due to the increasing complexity, density, and programmability of next-generation systems, and the growing popularity of reconfigurability, RCC-based simulation is in great demand. The reason is that it is inherently faster and more efficient, and it is critical for reducing functional verification time from days to hours, helping system OEMs improve their time to market.

© Copyright 2005 Verisity Design, Inc. All rights reserved.