
iFlow

iEDA

1. Introduction to the Backend Design Flow

1.1 Chip Design Flow (From Function Definition to Chip Return)

The basic chip design process is shown in the following figure:

Digital IC backend design process: (From the post-synthesis netlist to GDS)

1.2 Logic Synthesis

1.2.1 Logic Foundation

Map the RTL code from the front-end to a specific technology library, add constraint information, and perform logical optimization on the RTL code to generate a gate-level netlist.

  • Synthesis tool: yosys

  • Synthesis process: Translation + Optimization + Mapping

    • Translation: yosys converts the RTL code into a netlist built from its internal generic cell library, performing structural and logical optimizations along the way. This netlist is independent of the technology library and plays the same role as the GTECH format in commercial tools.
    • Optimization: Perform structural optimization of the cells based on constraint information (timing, area, power consumption constraints).
    • Mapping: Map the cells into the corresponding gate-level circuits in the technology library.

Basic strategies of synthesis:

  • Top-down: Take the top-level module as the current design module and complete the synthesis of the entire design at once.
    • Advantage: Good optimization effect for medium-scale designs, and no additional processing is required for module boundaries.
    • Disadvantage: Slow synthesis speed for very large-scale designs and may even fail to converge.
  • Bottom-up: Synthesize the bottom-level modules first, and then the top-level module calls the sub-modules generated by the synthesis to complete the entire synthesis process.
    • Advantage: Reduces the memory requirement and is suitable for very large-scale designs.
    • Disadvantage: Additional processing is required for module boundaries.

1.2.2 Input Files

  • RTL code (including Verilog, VHDL).
  • Library files (.lib files), containing information of all standard cells and macro cells:
    • Cell information: Function, area, power consumption, etc.
    • Wire load model: Resistance, capacitance.
    • Working environment: Technology, voltage, temperature.
    • Constraint rules involved: Maximum and minimum capacitance, maximum and minimum transition time, maximum and minimum fanout.
  • Constraint files (.sdc files), containing all timing constraints in the design: PVT (select the worst case), input drive strength, transition times, capacitive output loads, and internal parasitic RC (wire load model):
    • Environmental conditions PVT (process, voltage, temperature): The influence of the surrounding environment such as technology, voltage, and temperature on device delay. (fast, typical, slow) The higher the temperature, the slower the speed; the higher the voltage, the faster the speed.
    • Wire load model: To calculate the path delay accurately, the wire delay must be included in addition to the gate (cell) delay.
    • Driving strength: Specifying a driving cell gives the input a realistic slope. It tells DC (Design Compiler) that the input port is driven by a real external cell rather than an ideal source, so DC knows the transition time at the input port and can accurately calculate the delay of the input circuit.
    • Capacitive load: Specifies the external capacitive load on a port, so that the delay of the output circuit can be calculated accurately.
    • Maximum transition time (max transition): The longest time allowed for a signal to change from 0->1 or 1->0.
    • Maximum fanout (max fanout): The maximum number of gates directly driven by a single logic gate. In Figure 6, the fanout of BUF1 is 3.
    • Maximum capacitance (max capacitance): The maximum load that an output can drive. In Figure 7, the capacitance value of BUF2 is 0.05 + 0.03 + 0.02 + 0.07 = 0.17.
    • Timing constraints: A timing path is a point-to-point data path along which data is transferred. The four path types are listed here.
> Path1: Input port to register

Assume the external input circuit delay is 4ns, and the clock cycle $T_{clk}$ is 10ns, then the maximum delay from the input end to the register (internal logic) is 10 - 4 - $T_{setup}$ (ns), where $T_{setup}$ is the setup time.

> Path2: Register to register

The delay should satisfy $T_{comp}$ < $T_{clk}$ - $T_{ck2q}$ - $T_{setup}$, where $T_{comp}$ is the combinational logic delay, and $T_{ck2q}$ is the delay from the CK end to the Q end of the register.

> Path3: Register to output port

Assume the external output path delay is 4ns and the clock cycle $T_{clk}$ is 10ns, then the maximum internal logic delay is 10 - 4 - $T_{ck2q}$ (ns).

> Path4: Input port to output port

Maximum combinational logic delay: $T_{clk}$ - $T_{input\_delay}$ - $T_{output\_delay}$
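The four budgets above can be checked with a few lines of Python. This is only a sketch: the function name `timing_budgets` is made up for illustration, and the values $T_{setup}$ = 0.2ns and $T_{ck2q}$ = 0.3ns are assumed numbers, not taken from any library. The 10ns clock and the 4ns external delays are the ones used in the text.

```python
def timing_budgets(t_clk, t_setup, t_ck2q, t_input_delay, t_output_delay):
    """Return the maximum internal logic delay (ns) allowed on each path type."""
    return {
        # Path1: external input delay and the setup time eat into the cycle
        "input_to_register": t_clk - t_input_delay - t_setup,
        # Path2: clock-to-Q delay and the setup time eat into the cycle
        "register_to_register": t_clk - t_ck2q - t_setup,
        # Path3: external output delay and clock-to-Q delay eat into the cycle
        "register_to_output": t_clk - t_output_delay - t_ck2q,
        # Path4: purely combinational, bounded by both external delays
        "input_to_output": t_clk - t_input_delay - t_output_delay,
    }


# t_setup and t_ck2q are assumed example values
budgets = timing_budgets(
    t_clk=10.0, t_setup=0.2, t_ck2q=0.3, t_input_delay=4.0, t_output_delay=4.0)
for path, limit in budgets.items():
    print(f"{path}: at most {limit:.2f} ns of internal logic delay")
```

With these assumed numbers the register-to-register path has by far the largest budget, because it is the only one that does not lose time to an external delay.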

1.3 Formal Verification

1.3.1 Formal Verification

Compare two designs through logical abstraction to ensure that their functionality is consistent (only logic is compared; timing is not checked). As shown in Figure 9, formal verification covers RTL vs. netlist (post-synthesis) and netlist (post-synthesis) vs. netlist (post-PR). Because yosys synthesis does not produce the svf file required by the commercial tool Formality, the RTL vs. netlist (post-synthesis) comparison cannot be performed.

1.3.2 Comparison Principle

  • Divide the design into multiple Logic Cones and Compare Points.
    • Logic Cone: A cone of logic in which a group of inputs converges on one compare point. The inputs of a cone can be register outputs, input ports, or black-box outputs.
    • Compare Point: The point being compared, which can be a register input, an output port, or a black-box input.
  • Compare the logical interfaces: check whether the compare points of the reference design and the implementation design match (this process is called match).
  • Compare the logical functions: apply stimuli to the logic cones and check whether the results at the compare points are consistent (this process is called verify).
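The match and verify steps can be illustrated with a toy model in Python. This is a sketch under strong assumptions: the compare point names and logic functions are invented, matching is done purely by name, and verification is done by exhaustive simulation, which only works for tiny cones. Production equivalence checkers rely on techniques such as BDDs or SAT solving instead.

```python
from itertools import product


def match(ref, imp):
    """Pair compare points by name; also report the unmatched points on each side."""
    matched = sorted(ref.keys() & imp.keys())
    only_ref = sorted(ref.keys() - imp.keys())
    only_imp = sorted(imp.keys() - ref.keys())
    return matched, only_ref, only_imp


def verify(ref_cone, imp_cone, n_inputs):
    """Apply every input pattern to both logic cones and compare the outputs."""
    for pattern in product([False, True], repeat=n_inputs):
        if ref_cone(*pattern) != imp_cone(*pattern):
            return False, pattern  # first counterexample found
    return True, None


# Hypothetical designs: compare point name -> (logic cone, number of cone inputs)
ref = {
    "q0_reg/D": (lambda a, b: not (a and b), 2),
    "out_y": (lambda a, b, c: (a and b) or c, 3),
}
imp = {
    "q0_reg/D": (lambda a, b: (not a) or (not b), 2),  # De Morgan equivalent
    "out_y": (lambda a, b, c: (a or c) and b, 3),      # deliberately different
    "dbg_reg/D": (lambda a: a, 1),                     # extra logic, no partner in ref
}

matched, only_ref, only_imp = match(ref, imp)
print("unmatched in ref:", only_ref, "| unmatched in imp:", only_imp)

results = {}
for name in matched:
    ref_cone, n_inputs = ref[name]
    imp_cone, _ = imp[name]
    results[name] = verify(ref_cone, imp_cone, n_inputs)
    passed, counterexample = results[name]
    print(name, "passed" if passed else f"failed on inputs {counterexample}")
```

In this toy case the implementation has one more unmatched point than the reference, which corresponds to the "extra logic" row of Table 1.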

1.3.3 Reasons Leading to Unmatch

Table 1 Reasons and Solutions for Unmatch in Formal Verification

| Symptom | Possible Reasons | Solutions |
| --- | --- | --- |
| The number of unmatched points in ref and imp is different | The design has been renamed | Manually set user match; turn on the signature analysis option |
| The number of unmatched points in ref is more than that in imp | Redundant registers have been logically optimized during synthesis | No special processing is required |
| The number of unmatched points in ref is more than that in imp | Some missing cells have generated black boxes | Read in the missing cells |
| The number of unmatched points in ref is less than that in imp | Extra logic has been generated during synthesis | Check the logical mapping |

1.4 Placement and Routing

1.4.1 Placement and Routing

Placement and routing is the process of converting the circuit netlist into a physical layout. The design process is shown in Figure 13:

1.4.2 Init

Input data:

  • Gate-level netlist after synthesis or DFT.
  • Physical library: techlef and cell lef.
  • Timing library: .lib; commercial tools also use .db.

1.4.3 Floorplan

  1. Area Planning
  • Die area: The area occupied by the entire layout.
  • Core area: The area available for placing cells.
  • Standard cell utilization = Total area of standard cells / (Core area - Area of macro cells). The initial empirical value is between 70% and 80%. Because the placement and routing functions of open-source EDA tools are still immature, a utilization that is too high may make routing difficult; this can be solved by lowering the utilization.
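As a quick illustration of the utilization formula, the following Python sketch computes the utilization for a given floorplan and the core area needed for a lower target. The function names and all area values (in square micrometers) are assumptions chosen for the example.

```python
def std_cell_utilization(std_cell_area, core_area, macro_area):
    """Utilization = standard cell area / (core area - macro cell area)."""
    placeable_area = core_area - macro_area
    if placeable_area <= 0:
        raise ValueError("macro cells leave no room for standard cells")
    return std_cell_area / placeable_area


def core_area_for_target(std_cell_area, macro_area, target_utilization):
    """Core area needed so that the standard cells reach the target utilization."""
    return std_cell_area / target_utilization + macro_area


# Assumed example areas in um^2
util = std_cell_utilization(
    std_cell_area=300_000.0, core_area=600_000.0, macro_area=200_000.0)
print(f"utilization: {util:.0%}")

# If routing is congested, relax the target and enlarge the core accordingly
needed = core_area_for_target(
    std_cell_area=300_000.0, macro_area=200_000.0, target_utilization=0.60)
print(f"core area for 60% utilization: {needed:.0f} um^2")
```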
  2. Planning of Macro Cell Placement Positions

    Issues to be considered: optimal timing (iterative), no routing congestion (iterative), power supply feasibility, narrow channels caused by macro cell placement, and the port positions of macro cells.

    The narrow channels reserved between macro cells serve two purposes: they can hold standard cells, and they make it easier to route the macro cell ports, which reduces congestion.

  3. Planning of Port Placement Positions

    Generally, ports are grouped and placed according to their functions and signal directions.

  4. Power Supply Planning

As shown in Figure 15, the odd-numbered layers of the power supply lines are used for horizontal routing, and the even-numbered layers are used for vertical routing. TM1 and TM2 are used to design the main power supply network, M2-M8 are used for the secondary power supply network, and M1 is the power supply network of the standard cell library.

Power supply capacity meets the requirements:

$I_{sup(TM2)}$ > $P_{total}$ / $V_{sup}$

$I_{sup(TM1)}$ > $P_{total}$ / $V_{sup}$

$I_{sup(M4)}$ > $P_{stdcel}$ / $V_{sup}$

$I_{sup(M5)}$ > $P_{macrocel}$ / $V_{sup}$

In the formulas, $P_{total}$ is the total power consumption of the entire design, $P_{stdcel}$ is the total power consumption of the standard cells, $P_{macrocel}$ is the total power consumption of the macro cells, and $V_{sup}$ is the supply voltage.
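The four supply-capacity conditions can be checked numerically. The sketch below is illustrative only: the function name `supply_margins`, the power numbers, the 1.2 V supply, and the per-layer current capacities are all assumed values, not data from a real process.

```python
def supply_margins(i_sup, p_total, p_stdcell, p_macro, v_sup):
    """Return capacity minus demand (in A) per layer; a positive margin passes."""
    demand = {
        "TM2": p_total / v_sup,    # main network carries the whole design
        "TM1": p_total / v_sup,
        "M4": p_stdcell / v_sup,   # secondary network feeding standard cells
        "M5": p_macro / v_sup,     # secondary network feeding macro cells
    }
    return {layer: i_sup[layer] - demand[layer] for layer in demand}


# Assumed example: powers in W, supply in V, capacities in A
margins = supply_margins(
    i_sup={"TM2": 0.60, "TM1": 0.60, "M4": 0.30, "M5": 0.15},
    p_total=0.50, p_stdcell=0.30, p_macro=0.20, v_sup=1.2)

for layer, margin in margins.items():
    status = "OK" if margin > 0 else "insufficient"
    print(f"{layer}: margin {margin:+.3f} A ({status})")
```

In this assumed setup M5 fails the check, which would call for wider or denser M5 power straps over the macro cells.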

Factors to be considered in power supply planning:

  • Routing resources: Metal layers available for implementing the power supply network and the maximum power supply capacity.
  • Power supply requirements: The maximum current demand under a given voltage.
  • Component power PINs: It is necessary to understand the PINs of VDD and VSS of macro cells and standard cells and their approximate connection methods to the power supply network.
  • Narrow channels: Special attention needs to be paid to the power supply of standard cells in narrow channels.

1.4.4 Clock Tree Synthesis (CTS)

Clock tree synthesis grows a tree of clock buffers/inverters from the root point of the clock to each sink point, so that the time deviation (skew) with which the clock signal reaches the clock terminals of the registers is as small as possible.

As shown in Figure 16, before clock tree synthesis, a clock source is fanned out to the clock terminals of many registers. After clock tree synthesis, a clock tree is composed of multiple levels of buffers.
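To make the skew target concrete, the following Python sketch computes the global skew (latest minus earliest clock arrival) over a set of sinks. The sink names and arrival times are invented for illustration and do not come from any real design or report.

```python
def clock_skew(arrival_ns):
    """Global skew: latest minus earliest clock arrival over all sinks (ns)."""
    return max(arrival_ns.values()) - min(arrival_ns.values())


# Hypothetical clock arrival times at register clock pins, in ns
before_cts = {"ff1/CK": 0.12, "ff2/CK": 0.45, "ff3/CK": 0.80}
after_cts = {"ff1/CK": 0.61, "ff2/CK": 0.63, "ff3/CK": 0.60}

skew_before = clock_skew(before_cts)
skew_after = clock_skew(after_cts)
print(f"skew before CTS: {skew_before:.2f} ns")
print(f"skew after CTS:  {skew_after:.2f} ns")
```

Note that in this made-up example the buffers increase the latency of the early sinks, while the arrival times become nearly equal. That trade of latency for balance is what a buffered clock tree is meant to achieve.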

  1. Clock Source

    External crystal oscillator + internal clock generator + high-frequency clock generated by internal PLL + various frequencies of clocks generated by internal frequency division.
    First, generate a certain frequency clock (such as 25 MHz) from the crystal oscillator or clock generator, then generate a frequency-multiplied clock (high-frequency clock) through the PLL, and finally generate various frequency clocks through the frequency divider and send them to each functional module.

  2. Number of Phase-Locked Loops (PLL)

    PLL occupies a large area, so the number of PLLs should be as small as possible. First, count the clock frequency requirements of each functional module, design the frequency divider, and finally calculate the number of PLLs.

  3. Location of PLL

    The location of the PLL determines the length of the clock tree (Clock Tree Latency). It is necessary to clarify the multiplexing relationship of each clock, which modules the PLL frequency-multiplied clock supplies and the locations of these modules.

  4. Clock Constraints

    • The first part is crystal oscillator -> PLL
    • The second part is PLL -> clock gen module (generating divided clock signals)
    • The third part is the output of the frequency divider -> each functional module
  5. CTS Steps

    1. Grow the clock tree
    2. Optimize the clock tree and timing
    3. Route the clock tree
    4. Manually adjust the clock tree
    5. View the clock tree report and repeat the previous four steps as needed

1.4.5 Route

  • Track: Shown as yellow and blue dashed lines, without width. Grid-based routing requires all metal traces to lie on tracks.
  • Pitch: The distance between two tracks.
  • Trace: The actual metal trace on the track, with width.
  • Grid point: The intersection of two tracks.
  • The height and width of the standard cell are integer multiples of the pitch, and the pins of the standard cell are placed on the grid points during placement.
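The relationship between tracks, pitch, and grid points can be sketched in Python. All values here are assumptions for illustration: coordinates are integers in nanometers, both routing directions use a 200 nm pitch with a 100 nm offset, and the helper names are made up.

```python
def nearest_track(coord, offset, pitch):
    """Snap a coordinate to the nearest routing track (all values in nm)."""
    index = round((coord - offset) / pitch)
    return offset + index * pitch


def on_grid_point(x, y, v_offset, v_pitch, h_offset, h_pitch):
    """A grid point lies on both a vertical track (x) and a horizontal track (y)."""
    return (x - v_offset) % v_pitch == 0 and (y - h_offset) % h_pitch == 0


PITCH, OFFSET = 200, 100  # assumed track pitch and first-track offset, in nm

snapped_x = nearest_track(730, OFFSET, PITCH)
print("x = 730 snaps to track at", snapped_x)
print("pin (700, 500) on grid:", on_grid_point(700, 500, OFFSET, PITCH, OFFSET, PITCH))
print("pin (730, 500) on grid:", on_grid_point(730, 500, OFFSET, PITCH, OFFSET, PITCH))

# A standard cell width should be an integer multiple of the pitch
cell_width = 1400
print("cell width is a multiple of the pitch:", cell_width % PITCH == 0)
```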

Steps of Route:

  1. Global routing

    Global routing is to plan the routing paths, determine the general position and direction, and does not make actual connections.

  2. Track assignment

    Assign each wire to a track and perform the actual routing for each connection. When routing, prefer long metal segments and as few vias as possible. No design rule check (DRC) is performed at this stage.

  3. Detailed routing

    Use the paths generated during global routing and track assignment to complete the routing and place vias. Since track assignment only tries to keep wires as long as possible, many DRC violations will occur. During detailed routing, fixed-size sboxes are used to fix violations. Sboxes are small grids that evenly divide the entire layout. Violations inside a grid are fixed, but DRC violations at the boundaries between grids cannot be fixed here and are left to the next step.

  4. Search and repair

    Repair DRC violations that have not been completely eliminated in detailed routing. In this step, gradually increase the size of the sbox to find and repair DRC violations.

Note: The clock tree routing has the highest priority.

1.4.6 Insert fillers

Filler cells connect the N-wells of each row of standard cells, which improves the stability of the power supply network.

Insert redundant vias: Replace single vias with double vias as much as possible to improve the yield.

1.4.7 Export Files

Export the layout gds file and the Verilog gate-level netlist for use in subsequent processes.

1.5 Static Timing Analysis (STA)

Static timing analysis is a method of verifying the timing validity of a circuit by checking the timing information of all paths. Its principle is shown in Figure 18.

  1. Divide the design into several paths
  2. Calculate the delay of each path separately
  3. Check whether the delay of each path meets the requirements

1.5.1 Setup Time and Hold Time

  1. Setup Time $T_{setup}$

    The time during which the data must remain stable before the rising edge of the clock.

    The arrival time of the data at the D terminal of UFF1:

    $T_a$ = $T_{launch}$ + $T_{ck2q}$ + $T_{db}$

    The latest arrival time allowed to meet the setup requirement:

    $T_r$ = $T_{capture}$ + $T_{clk}$ - $T_{setup}$

    $T_{slack}$ = $T_r$ - $T_a$ > 0, that is, $T_{capture}$ + $T_{clk}$ - $T_{setup}$ - $T_{launch}$ - $T_{ck2q}$ - $T_{db}$ > 0

    Let $T_{capture}$ - $T_{launch}$ = $T_{skew}$, and after rearranging:

    $T_{skew}$ + $T_{clk}$ > $T_{setup}$ + $T_{ck2q}$ + $T_{db}$

    Methods to fix $T_{setup}$ timing violations:

    1. Increase $T_{clk}$: Decrease the frequency.
    2. Decrease $T_{db}$: Optimize combinational logic, split the path into pipeline stages, and reduce the load on the critical path.
    3. Decrease $T_{ck2q}$: Replace the register with a faster sequential cell, such as HVT->LVT.
  2. Hold Time $T_{hold}$

    The time during which the data must remain stable after the rising edge of the clock.

    The arrival time of the data at the D terminal of DFF1:

    $T_a$ = $T_{launch}$ + $T_{ck2q}$ + $T_{db}$

    The earliest arrival time allowed to meet the hold requirement:

    $T_r$ = $T_{capture}$ + $T_{hold}$

    $T_{slack}$ = $T_a$ - $T_r$ > 0, that is, $T_{launch}$ + $T_{ck2q}$ + $T_{db}$ - $T_{capture}$ - $T_{hold}$ > 0

    Let $T_{capture}$ - $T_{launch}$ = $T_{skew}$, and after rearranging:

    $T_{skew}$ + $T_{hold}$ < $T_{ck2q}$ + $T_{db}$

    Methods to fix $T_{hold}$ timing violations:

    1. Increase $T_{db}$: Increase the combinational path delay, for example by inserting buffers.
    2. Decrease $T_{skew}$: A negative skew can even be used.
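The setup and hold slack formulas above translate directly into code. The following Python sketch uses made-up delay values in ns; the function names are illustrative. It relies on the usual convention that the setup check uses the longest data path delay and the hold check uses the shortest one.

```python
def setup_slack(t_launch, t_capture, t_clk, t_ck2q, t_db, t_setup):
    """Setup slack = required time - arrival time (must be > 0)."""
    arrival = t_launch + t_ck2q + t_db
    required = t_capture + t_clk - t_setup
    return required - arrival


def hold_slack(t_launch, t_capture, t_ck2q, t_db, t_hold):
    """Hold slack = arrival time - required time (must be > 0)."""
    arrival = t_launch + t_ck2q + t_db
    required = t_capture + t_hold
    return arrival - required


# Assumed example values in ns; t_skew = t_capture - t_launch = 0.2
clock = dict(t_launch=0.5, t_capture=0.7)

s_slack = setup_slack(**clock, t_clk=10.0, t_ck2q=0.3, t_db=8.0, t_setup=0.2)
print(f"setup slack: {s_slack:+.2f} ns")

# The hold check uses the shortest data path delay
h_slack = hold_slack(**clock, t_ck2q=0.3, t_db=0.1, t_hold=0.25)
print(f"hold slack:  {h_slack:+.2f} ns")

# Fix the hold violation by inserting a buffer that adds 0.1 ns to the data path
h_slack_fixed = hold_slack(**clock, t_ck2q=0.3, t_db=0.1 + 0.1, t_hold=0.25)
print(f"hold slack after buffer insertion: {h_slack_fixed:+.2f} ns")
```

With these numbers the positive skew helps the setup check and hurts the hold check, which is why the hold fix list mentions reducing $T_{skew}$.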

1.5.2 Input Files

  1. db file: Consistent with the db file of synthesis, and libraries under multiple corners such as ss and ff are required

  2. Gate-level netlist

  3. Constraint file (.sdc)

  4. Back-annotation files: sdf, spef

    SDF (Standard Delay Format): Describes the timing information of the design, including pin-to-pin delays within modules, clock-to-data delays, and interconnect delays. The sdf file can be used directly for post-layout simulation of the circuit.

    SPEF (Standard Parasitic Exchange Format): Standard parasitic exchange format, the RC value information extracted from the netlist, a file format for transferring RC information between the extraction tool and the timing verification tool. SPEF provides RC information, and the delay calculation is relatively more accurate.

SDF file back-annotation includes cell delay and wire delay, and parasitic SPEF back-annotation describes RC parameters. SDF back-annotation runs faster than SPEF back-annotation.

2. Introduction to the Open Source EDA Process iFlow

2.1 Build iFlow

System environment: iFlow supports Ubuntu 20.04; versions lower than 20.04 are not recommended.

Install dependent tools and libraries:

Tools

  • build-essential 12.8
  • cmake 3.16.3
  • clang 10.0
  • bison 3.5.1
  • flex 2.6.4
  • swig 4.0
  • klayout 0.26

Library

  • libeigen3-dev 3.3.7-2
  • libbo