Very Large Scale Integration (VLSI)

Sunday, 14 August 2011

How do I reset my FPGA?

In an FPGA design, a reset acts as a synchronization signal that sets all the storage elements to a known state. In a digital design, designers normally implement a global reset as an external pin to initialize the design on power-up. The global reset pin is similar to any other input pin and is often applied asynchronously to the FPGA. Designers can then choose to use this signal to reset their design asynchronously or synchronously inside the FPGA.
The choice is not always obvious, but with the help of a few hints and tips, designers can find a more suitable reset structure. An optimal reset structure will enhance device utilization, timing and power consumption in an FPGA.

Understanding the flip-flop reset behavior
Before we delve into reset techniques, it is important to understand the behavior of flip-flops inside an FPGA slice. Devices in the Xilinx 7 series architecture contain eight registers per slice, and all these registers are D-type flip-flops. All of these flip-flops share a common control set.
The control set of a flip-flop is the clock input (CLK), the active-high chip enable (CE) and the active-high SR port. The SR port in a flip-flop can serve as a synchronous set/reset or an asynchronous preset/clear port (see Figure 1).

Fig-1: Slice flip-flop with control signals

The RTL code that infers the flip-flop also infers the type of reset a flip-flop will use. The code will infer an asynchronous reset when the reset signal is present in the sensitivity list of an RTL process and is tested outside the clock-edge condition (as shown in the Asynchronous Reset listing below). The synthesis tool will infer a flip-flop with an SR port configured as a preset or clear port (represented by the FDCE or FDPE flip-flop primitive). When the SR port is asserted, the flip-flop output is immediately forced to the value given by the SRVAL attribute of the flip-flop.

Asynchronous Reset

signal Q : std_logic := '1';            -- Init
--
async    : process(clk, rst)
begin
if (rst = '1') then
    Q <= '0';                           -- Reset Value
else
    if (clk'event and clk = '1') then
      Q <= D;
    end if;
end if;
end process async;

Synchronous Reset

signal Q : std_logic := '1';            -- Init
--
sync    : process(clk)                  -- only clk is needed in the sensitivity list
begin
if (clk'event and clk = '1') then
    if (rst = '1') then
      Q <= '0';                         -- Reset Value
    else
      Q <= D;
    end if;
end if;
end process sync;

Above is the VHDL code to infer asynchronous and synchronous reset.

In the case of synchronous resets, the synthesis tool will infer a flip-flop whose SR port is configured as a set or reset port (represented by an FDSE or FDRE flip-flop primitive). When the SR port is asserted, the flip-flop output is forced to the SRVAL attribute of the flip-flop on the next rising edge of the clock.

In addition, you can initialize the flip-flop output to the value the INIT attribute specifies. The INIT value is loaded into the flip-flop during configuration and when the global set reset (GSR) signal is asserted.

The flip-flops in Xilinx FPGAs can support both asynchronous and synchronous reset and set controls. However, the underlying flip-flop can natively implement only one set / reset / preset / clear at a time. Coding for more than one set / reset / preset / clear condition in the RTL code will result in the implementation of one condition using the SR port of the flip-flop and the other conditions in fabric logic, thus using more FPGA resources.

If one of the conditions is synchronous and the other is asynchronous, the asynchronous condition will be implemented using the SR port and the synchronous condition in fabric logic. In general, it’s best to avoid more than one set/reset/preset/clear condition. Furthermore, only one attribute for each group of four flip-flops in a slice determines if the SR ports of flip-flops are synchronous or asynchronous.

Reset methodology
Regardless of the reset type used (synchronous or asynchronous), you will generally need to synchronize the reset with the clock. As long as the duration of the global reset pulse is long enough, all the device flip-flops will enter the reset state. However, the deassertion of the reset signal must satisfy the timing requirements of the flip-flops to ensure that the flip-flops transition cleanly from their reset state to their normal state. Failure to meet this requirement can result in flip-flops entering a metastable state.

Furthermore, for correct operation of some subsystems, like state machines and counters, all flip-flops must come out of reset on the same clock edge. If different bits of the same state machine come out of reset on different clocks, the state machine may transition into an illegal state. This reinforces the need to make the deassertion of reset synchronous to the clock.

For designs that use a synchronous reset methodology for a given clock domain, it is sufficient to use a standard metastability resolution circuit (two back-to-back flip-flops) to synchronize the global reset pin onto a particular clock domain. This synchronized reset signal can then initialize all storage elements in the clock domain by using the synchronous SR port on the flip-flops. Because both the synchronizer and the flip-flops to be reset are on the same clock domain, the standard PERIOD constraint of the clock covers the timing of the paths between them. Each clock domain in the device needs to use a separate synchronizer to generate a synchronized version of the global reset for that clock domain.
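
The two back-to-back flip-flops described above can be sketched in VHDL roughly as follows. This is a minimal illustration, not a reference implementation: the entity name reset_sync, the port names, and the choice of INIT = '1' (so that the domain starts out in reset after configuration) are assumptions made for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-stage synchronizer for an active-high global reset pin.
entity reset_sync is
  port (
    clk      : in  std_logic;  -- clock of the domain to be reset
    rst_pin  : in  std_logic;  -- asynchronous global reset from the pin
    rst_sync : out std_logic   -- reset synchronized to clk
  );
end entity reset_sync;

architecture rtl of reset_sync is
  -- Assumption: both stages initialize to '1' so the domain powers up in reset.
  signal meta : std_logic := '1';
  signal sync : std_logic := '1';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      meta <= rst_pin;  -- first stage samples the asynchronous input
      sync <= meta;     -- second stage allows one clock period to resolve
    end if;
  end process;

  rst_sync <= sync;     -- drives the synchronous SR ports in this domain
end architecture rtl;
```

One instance of such a block would be needed per clock domain, each clocked by the clock of the domain it serves.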

Now let’s get down to brass tacks. Here are some specific hints and tips that will help you arrive at the best reset strategy for your design.

Tip 1: When driving the synchronous SR port of flip-flops, every clock domain requires its own localized version of the global reset, synchronized to that domain.

Sometimes a portion of a design is not guaranteed to have a valid clock. This can occur in systems that use recovered clocks or clocks that are sourced by a hot-pluggable module. In such cases, the storage elements in the design may need to be initialized with an asynchronous reset using the asynchronous SR port on the flip-flops. Even though the storage elements use an asynchronous SR port, the deasserting edge of the reset must still be synchronous to the clock. This requirement is characterized by the reset-recovery timing arc of the flip-flops, which is similar to a setup requirement of the deasserting edge of an asynchronous SR to the rising edge of the clock. Failure to meet this timing arc can cause flip-flops to enter a metastable state and synchronous subsystems to enter unwanted states.

The reset bridge circuit shown in Figure 2 provides a mechanism to assert reset asynchronously (and hence take effect even in the absence of a valid clock) and deassert reset synchronously. In this circuit, it is assumed that the SR ports of the two flip-flops have an asynchronous preset functionality (SRVAL=1).

Figure 2. Reset bridge circuit asserts asynchronously
and deasserts synchronously.

You can use the output of such a reset bridge to drive the asynchronous reset for a given clock domain. This synchronized reset can initialize all storage elements in the clock domain by using the asynchronous SR port on the flip-flops. Again, each clock domain in the device needs a separate, synchronized version of the global reset generated by a separate reset bridge.
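
A reset bridge of the kind described above might be written as in the sketch below. It assumes an active-high reset pin and two flip-flops with asynchronous preset, as the text states. The entity name reset_bridge and the port names rst_pin and rst_clk_a are hypothetical; clk_a is the domain clock.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical reset bridge: asserts asynchronously, deasserts synchronously.
entity reset_bridge is
  port (
    clk_a     : in  std_logic;  -- clock of the domain to be reset
    rst_pin   : in  std_logic;  -- asynchronous, active-high global reset
    rst_clk_a : out std_logic   -- reset whose release is synchronous to clk_a
  );
end entity reset_bridge;

architecture rtl of reset_bridge is
  signal ff1 : std_logic := '1';
  signal ff2 : std_logic := '1';
begin
  process (clk_a, rst_pin)
  begin
    if rst_pin = '1' then
      -- Asynchronous preset: takes effect even without a running clock
      ff1 <= '1';
      ff2 <= '1';
    elsif rising_edge(clk_a) then
      -- After rst_pin is released, a '0' shifts through both stages,
      -- so the output falls on a clk_a edge
      ff1 <= '0';
      ff2 <= ff1;
    end if;
  end process;

  rst_clk_a <= ff2;
end architecture rtl;
```

The output stays high for roughly two clk_a cycles after rst_pin is released, which gives the first stage time to settle before the reset to the rest of the domain is removed.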

Tip 2: A reset bridge circuit provides a safe mechanism to deassert an asynchronous reset synchronously. Every clock domain requires its own localized version of the global reset with the use of a reset bridge circuit.

The circuit in Figure 2 assumes that the clock (clk_a) for clocking the reset bridge and the associated logic is stable and error free. In an FPGA, clocks can come directly from an off-chip clock source (ideally via a clock-capable pin), or can be generated internally using an MMCM or phase-locked loop (PLL). Any MMCM or PLL that you’ve used to generate a clock requires calibration after it is reset. Hence, you may have to insert additional logic in the global reset path to stabilize that clock.

Tip 3: Ensure that the clock the MMCM or PLL has generated is stable and locked before deasserting the global reset to the FPGA.
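
One simple way to hold a domain in reset until its clock is stable is to combine the external reset with the inverted lock indication of the clock generator. The sketch below assumes the MMCM or PLL exposes an active-high lock signal; the entity and signal names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical reset gate: keeps reset asserted while the clock is not locked.
entity reset_gate is
  port (
    rst_pin     : in  std_logic;  -- asynchronous, active-high global reset
    mmcm_locked : in  std_logic;  -- assumed active-high lock indication
    rst_raw     : out std_logic   -- asynchronous reset request for the domain
  );
end entity reset_gate;

architecture rtl of reset_gate is
begin
  -- Reset is requested while the pin is asserted or the clock is not yet locked
  rst_raw <= rst_pin or (not mmcm_locked);
end architecture rtl;
```

The rst_raw output is still asynchronous to the generated clock, so it would feed a synchronizer or reset bridge clocked by the MMCM or PLL output rather than drive flip-flops directly.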

Figure 3 illustrates a typical reset implementation in an FPGA. The SR control port on Xilinx registers is active high. If the RTL code describes active-low set / reset / preset / clear functionality, the synthesis tool will infer an inverter before it can directly drive the control port of a register. This inversion must be implemented in a lookup table, thus taking up a LUT input. The additional logic that active-low control signals infer may lead to longer runtimes and result in poorer device utilization. It will also affect timing and power.

Figure 3. Typical reset implementation in FPGAs

The bottom line? Use active-high control signals wherever possible in the HDL code or instantiated components. When you cannot control the polarity of a control signal within the design, you need to invert the signal in the top-level hierarchy of the code. When described in this manner, the inferred inverter can be absorbed into the I/O logic without using any additional FPGA logic or routing.

Tip 4: Active-high resets enable better device utilization and improve performance.
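
When the board supplies an active-low reset, the inversion can be done once at the top level, with every lower-level process using the active-high version. A minimal sketch, with a hypothetical top-level entity and port names:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical top level: the active-low pin is inverted exactly once here.
entity top is
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;  -- active-low reset from the board
    d     : in  std_logic;
    q     : out std_logic
  );
end entity top;

architecture rtl of top is
  signal rst   : std_logic;          -- active-high reset used internally
  signal q_reg : std_logic := '0';
begin
  rst <= not rst_n;  -- single inversion at the top of the hierarchy

  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then  -- active high matches the polarity of the SR port
        q_reg <= '0';
      else
        q_reg <= d;
      end if;
    end if;
  end process;

  q <= q_reg;
end architecture rtl;
```

In a real design rst would normally also pass through a synchronizer before use; that step is left out here to keep the polarity handling in focus.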

It’s important to note that FPGAs do not necessarily require a global reset. Global resets compete for the same routing resources as other nets in a design. A global reset would typically have high fanout because it needs to be propagated to every flip-flop in the design. This can consume a significant amount of routing resources and can have a negative impact on device utilization and timing performance. As a result, it is worth exploring other reset mechanisms that do not rely on a complete global reset.

When a Xilinx FPGA is configured or reconfigured, every cell (including flip-flops and block RAMs) is initialized as shown in Figure 4. Hence, FPGA configuration has the same effect as a global reset in that it sets the initial state of every storage element in the FPGA to a known state.

Figure 4. FPGA initialization after configuration

You can infer flip-flop initialization values from RTL code. The listing below demonstrates how to code initialization of a register in RTL. FPGA tools can synthesize initialization of the signals even though it is a common misconception that this is not possible. The initialization value of the underlying VHDL signal or Verilog reg becomes the INIT value for the inferred flip-flop, which is the value loaded into the flip-flop during configuration.

Signal initialization in RTL code (VHDL)

signal reg : std_logic_vector (7 downto 0) := (others => '0'); -- Init
--
process(clk)
begin
if (clk'event and clk = '1') then
    if (rst = '1') then
      reg <= (others => '0');           -- Reset Value
    else
      reg <= D;
    end if;
end if;
end process;

As with registers, you can also initialize block RAMs during configuration. With the increase in embedded RAMs in processor-based systems, BRAM initialization has become a useful feature. This is because a predefined RAM facilitates easier simulation setup and eliminates the requirement to have boot-up sequences to clear memory for embedded designs.
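
Memory contents can be given initial values in RTL in much the same way as registers, by attaching an initial value to the array signal. The sketch below assumes a 256 x 8 single-port RAM with synchronous read; the entity name, the sizes and the fill pattern (each location holds its own address) are illustrative assumptions, and whether the tool maps it to block RAM or distributed RAM depends on the tool and its settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical single-port RAM with contents defined at configuration time.
entity init_ram is
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  unsigned(7 downto 0);
    din  : in  std_logic_vector(7 downto 0);
    dout : out std_logic_vector(7 downto 0)
  );
end entity init_ram;

architecture rtl of init_ram is
  type ram_t is array (0 to 255) of std_logic_vector(7 downto 0);

  -- Builds the initial contents; here location i simply holds the value i
  function init_contents return ram_t is
    variable r : ram_t;
  begin
    for i in r'range loop
      r(i) := std_logic_vector(to_unsigned(i, 8));
    end loop;
    return r;
  end function init_contents;

  signal ram : ram_t := init_contents;  -- loaded during configuration
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(addr)) <= din;
      end if;
      dout <= ram(to_integer(addr));  -- synchronous read, no reset on the array
    end if;
  end process;
end architecture rtl;
```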

The global set reset (GSR) signal is a special prerouted reset signal that holds the design in the initial state while the FPGA is being configured. After the configuration is complete, the GSR is released and all of the flip-flops and other resources now possess the INIT value. In addition to operating it during the configuration process, a user design can access the GSR net by instantiating the STARTUPE2 module and connecting to the GSR port. Using this port, the design can reassert the GSR net, which will return all storage elements in the FPGA to the state specified by their INIT property.

The deassertion of GSR is asynchronous and can take several clocks to affect all flip-flops in the design. State machines, counters or any other logic that can change state autonomously will require an explicit reset that deasserts synchronously to the user clock. As a result, using GSR as the sole reset mechanism can result in an unreliable system. Hence, you are better served by adopting a mixed approach to manage the startup effectively.

Tip 5: A hybrid approach that relies on the built-in initialization the GSR provides, along with explicit resets for portions of the design that can start autonomously, will result in better utilization and performance.

After using the GSR to set the initial state of the entire design, use explicit resets for logic elements, like state machines, that require a synchronous reset. Generate the synchronized version of the explicit reset using either a standard metastability resolution circuit or a reset bridge.
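
The hybrid approach can be illustrated with a block in which a datapath register relies only on its INIT value while a free-running counter, which changes state on its own, gets an explicit synchronous reset. The entity, port names and widths below are assumptions for the example, and rst_sync is assumed to be already synchronized to clk.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical block mixing INIT-only registers with an explicitly reset counter.
entity hybrid_reset is
  port (
    clk      : in  std_logic;
    rst_sync : in  std_logic;  -- assumed synchronous to clk
    d        : in  std_logic_vector(7 downto 0);
    q        : out std_logic_vector(7 downto 0);
    count    : out unsigned(3 downto 0)
  );
end entity hybrid_reset;

architecture rtl of hybrid_reset is
  signal data_reg : std_logic_vector(7 downto 0) := (others => '0');  -- INIT only
  signal cnt      : unsigned(3 downto 0)         := (others => '0');
begin
  -- Datapath register: its value is overwritten by incoming data,
  -- so the initial state from configuration is sufficient
  process (clk)
  begin
    if rising_edge(clk) then
      data_reg <= d;
    end if;
  end process;

  -- Counter: advances autonomously, so all bits must leave reset
  -- on the same clock edge
  process (clk)
  begin
    if rising_edge(clk) then
      if rst_sync = '1' then
        cnt <= (others => '0');
      else
        cnt <= cnt + 1;
      end if;
    end if;
  end process;

  q     <= data_reg;
  count <= cnt;
end architecture rtl;
```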

Use appropriate resets to maximize utilization
The style of reset used in RTL code can have a significant impact on the ability of the tools to map a design to the underlying FPGA resources. When writing RTL code, it is important that designers tailor the reset style of their subdesign to enable the tools to map to these resources.

Other than using the GSR mechanism for initialization, you cannot reset the contents of SRLs, LUTRAMs and block RAMs using an explicit reset. Thus, when writing code that is expected to map to these resources, it is important to code specifically without reset. For example, if RTL code describes a 32-bit shift register with an explicit reset for the 32 stages in the shift register, the synthesis tool would not be able to map this RTL code directly to an SRL32E because it cannot meet the requirements of the coded reset using this resource. Instead, it would either infer 32 flip-flops or infer some additional circuitry around an SRL32E in order to implement the required reset functionality. Both of these solutions would require more resources than if you had coded the RTL without reset.

Tip 6: When mapping to SRLs, LUTRAMs or block RAMs, do not code for a reset of the SRL or RAM array.
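
A 32-stage shift register coded without any reset, as recommended above, might look like the sketch below. The entity name and ports are hypothetical; the initial value on the signal supplies the power-up contents, and the absence of a reset branch is what leaves the tool free to consider an SRL32E.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical 32-bit shift register with clock enable and no reset.
entity shift32 is
  port (
    clk : in  std_logic;
    ce  : in  std_logic;
    d   : in  std_logic;
    q   : out std_logic
  );
end entity shift32;

architecture rtl of shift32 is
  -- Initial contents come from configuration, not from a reset
  signal sr : std_logic_vector(31 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if ce = '1' then
        sr <= sr(30 downto 0) & d;  -- shift in at bit 0, oldest bit at 31
      end if;
    end if;
  end process;

  q <= sr(31);  -- only the last stage is observed
end architecture rtl;
```

Reading only the final stage matters as well: tapping many intermediate bits would also push the tool toward discrete flip-flops.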

In 7 series devices, you cannot pack flip-flops with different control signals into the same slice. For low-fanout resets, this can have a negative impact on overall slice utilization. With synchronous resets, the synthesis tool can implement the reset functionality using LUTs (as shown in Figure 5) rather than control ports of flip-flops, thereby removing the reset as a control port. This allows you to pack the resulting LUT/flip-flop pair with other flip-flops that do not use their SR ports. This may result in higher LUT utilization but improved slice utilization.

Figure 5. Control set reduction on SR

Tip 7: Synchronous resets enhance FPGA utilization. Use them in your designs rather than asynchronous resets.

Some of the larger dedicated resources (namely block RAMs and DSP48E1 cells) contain registers that can be inferred as part of the dedicated resource functionality. Block RAMs have optional output registers that you can use to improve clock frequency by means of an additional clock of latency. DSP48E1 cells have many registers that you can use both for pipelining, to increase maximum clock speed, as well as for cycle delays (z^-1). However, these registers only have synchronous set/reset capabilities.

Tip 8: Using synchronous resets allows the synthesis tool to use the registers inside dedicated resources like DSP48E1 slices or block RAMs. This improves overall device utilization and performance for that portion of the design, and also reduces overall power consumption.

If the RTL code describes asynchronous set/reset, then the synthesis tool will not be able to use these internal registers. Instead, it will use slice flip-flops since they can implement the requested asynchronous set/reset functionality. This will not only result in poor device utilization but will also have a negative impact on performance and power.
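
As an illustration of the point about dedicated registers, the sketch below describes a pipelined multiplier in which every register uses a synchronous reset, leaving the tool free to absorb them into a DSP cell. The entity name, the 18-bit operand widths and the three-stage pipeline are assumptions for the example; actual mapping depends on the synthesis tool and its settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical pipelined multiplier using synchronous resets only.
entity mult_pipe is
  port (
    clk : in  std_logic;
    rst : in  std_logic;  -- synchronous, active high
    a   : in  signed(17 downto 0);
    b   : in  signed(17 downto 0);
    p   : out signed(35 downto 0)
  );
end entity mult_pipe;

architecture rtl of mult_pipe is
  signal a_reg, b_reg : signed(17 downto 0) := (others => '0');
  signal m_reg, p_reg : signed(35 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        a_reg <= (others => '0');
        b_reg <= (others => '0');
        m_reg <= (others => '0');
        p_reg <= (others => '0');
      else
        a_reg <= a;              -- input registers
        b_reg <= b;
        m_reg <= a_reg * b_reg;  -- multiplier register (18 x 18 -> 36 bits)
        p_reg <= m_reg;          -- output register
      end if;
    end if;
  end process;

  p <= p_reg;
end architecture rtl;
```

Moving rst into the sensitivity list and testing it before the clock edge would turn these into asynchronous resets, which the registers inside the dedicated resources cannot implement.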

Many options
Various reset options are available for FPGAs, each with its own advantages and disadvantages. The recommendations outlined here will help designers choose a suitable reset structure for their design. An optimal reset structure will enhance the device utilization, timing and power consumption of an FPGA.

Friday, 5 August 2011

2-bit Counter VHDL Code

library ieee;
use ieee.std_logic_1164.all;

entity bit_counter is
port (
    clk       : in std_logic;
    rst       : in std_logic;
    count_out : out std_logic_vector(1 downto 0));

end bit_counter;

architecture bit_counter_ar of bit_counter is

-- Defining Internal Signals

signal sig1,sig2 : std_logic;
signal count_out_sig : std_logic_vector (1 downto 0);

begin -- bit_counter_ar

process (clk, rst)

-- purpose: sequential part of counter
-- type : sequential
-- inputs : clk, rst
-- outputs:

begin -- process
    if rst = '1' then                   -- asynchronous reset (active high)
      count_out_sig <= "11";
    elsif clk'event and clk = '1' then -- rising clock edge
      count_out_sig(0) <= sig1;
      count_out_sig(1) <= sig2;
    end if;
end process;

-- Combinational Logic

sig1 <= not count_out_sig(0);
sig2 <= count_out_sig(1) xor count_out_sig(0);
count_out <= count_out_sig;

end bit_counter_ar;

Friday, 15 July 2011

Polar Satellite Launch Vehicle (PSLV-C17) Successfully Launches GSAT - 12 Satellite

India's Polar Satellite Launch Vehicle (PSLV-C17) successfully launched the GSAT-12 communication satellite today (July 15, 2011) from Satish Dhawan Space Centre (SDSC) SHAR, Sriharikota. The launch of PSLV-C17 was the eighteenth successive successful flight of PSLV.

After a smooth countdown of 53 hours, the vehicle lifted off from the Second Launch Pad at the opening of the launch window at 16:48 hrs (IST). After about 20 minutes of flight time, GSAT-12 was successfully injected into sub-Geosynchronous Transfer Orbit (sub-GTO) with a perigee of 284 km and an apogee of 21,020 km with an orbital inclination of 17.9 deg.

The preliminary flight data indicates that all major flight events involving stage ignition and burnouts, performance of solid and liquid stages, indigenously developed advanced mission computers and telemetry systems have performed well.

ISRO Telemetry Tracking and Command Network (ISTRAC)'s ground station at Biak, Indonesia acquired the signals from GSAT-12 immediately after the injection of the satellite. The solar panels of the satellite were deployed automatically. Initial checks on the satellite have indicated normal health of the satellite.

India's Advanced Communication Satellite GSAT-8 Launched Successfully

India's advanced communication satellite, GSAT-8, was successfully launched at 02:08 hrs IST today (May 21, 2011) by the Ariane-V launch vehicle of Arianespace from Kourou, French Guiana. Ariane V placed GSAT-8 into the intended Geosynchronous Transfer Orbit (GTO) of 35,861 km apogee and 258 km perigee, with an orbital inclination of 2.503 deg with respect to the equator.
ISRO's Master Control Facility (MCF) at Hassan in Karnataka acquired the signals from GSAT-8 satellite immediately after the injection. Initial checks on the satellite have indicated normal health of the satellite. The satellite was captured in three-axis stabilization mode. Preparations are underway for the firing of 440 Newton Liquid Apogee Motor (LAM) during the third orbit of the satellite on May 22, 2011 at 03:58 hrs IST as a first step towards taking the satellite to its geostationary orbital home.

About GSAT-8:

GSAT-8, India’s advanced communication satellite, is a high power communication satellite being inducted in the INSAT system. Weighing about 3100 Kg at lift-off, GSAT-8 is configured to carry 24 high power transponders in Ku-band and a two-channel GPS Aided Geo Augmented Navigation (GAGAN) payload operating in L1 and L5 bands.
The 24 Ku band transponders will augment the capacity in the INSAT system. The GAGAN payload provides the Satellite Based Augmentation System (SBAS), through which the accuracy of the positioning information obtained from the GPS Satellite is improved by a network of ground based receivers and made available to the users in the country through the geostationary satellites.

Mission : Communication

Weight : 3093 kg (mass at lift-off); 1426 kg (dry mass)

Power : Solar array providing 6242 watts; three 100 Ah Lithium Ion batteries

Physical Dimensions : 2.0 x 1.77 x 3.1m cuboid

Propulsion : 440 Newton Liquid Apogee Motor (LAM) with Mono Methyl Hydrazine (MMH) as fuel and Mixed Oxides of Nitrogen (MON-3) as oxidizer for orbit raising.

Stabilization : 3-axis body stabilized in orbit using Earth Sensors, Sun Sensors, Momentum and Reaction Wheels, Magnetic Torquers and eight 10 Newton and eight 22 Newton bipropellant thrusters

Antennas : Two indigenously developed 2.2 m diameter transmit/receive polarization sensitive dual grid shaped beam deployable reflectors with offset-fed feeds illumination for Ku-band; 0.6 m C-band and 0.8x0.8 sq m L-band helix antenna for GAGAN

Launch date : 21-May-11

Launch site : Kourou, French Guiana

Launch vehicle : Ariane-5 VA-202

Orbit : Geosynchronous (55° E)

Mission life : More Than 12 Years

Saturday, 9 July 2011

Intel’s Haswell Microarchitecture

Haswell is the codename for a processor microarchitecture to be developed by Intel's Oregon team as successor to the Sandy Bridge architecture. Haswell will use a 22 nm process. CPUs based on the Haswell microarchitecture are expected to be released in 2013. There are currently no details regarding this microarchitecture's development.

Haswell is confirmed to have:

A 22 nm process.
3D tri-gate transistors.
Advanced Vector Extensions 2 (AVX2) instruction set (or Haswell New Instructions).

Haswell is expected to have:

FMA3 instructions.
A 14 stage pipeline.
A new cache design.
Up to 8 cores available.
New advanced power-saving system.
64 kB data + 64 kB instruction L1 cache per core, 8-way associativity
1 MB L2 cache per core, 8-way associativity.
Up to 32 MB L3 cache shared by all cores, 16-way associativity.

Sandy Bridge

Sandy Bridge is the codename for a processor microarchitecture developed by Intel's Israel Development Center. Development began in 2005 targeting the 32 nm process. The codename for this architecture was previously "Gesher" (which means "bridge" in Hebrew). Sandy Bridge processors were first released on January 9, 2011. Intel first previewed a Sandy Bridge processor with A1 stepping at 2 GHz during the Intel Developer Forum in 2009. The yet-to-be released 22 nm die shrink of Sandy Bridge has the codename Ivy Bridge.

Sandy Bridge is one of the most ambitious and aggressive microprocessors designed at Intel. The degree of complexity and integration is simply astounding. It combines a new CPU microarchitecture and a new graphics microarchitecture, each of which is a substantial departure from the previous generation. On top of that, the chip-level integration has taken a huge step forward, with a much more complex system agent and a new L3 cache and ring interconnect shared by all the components. Coherent communication between the CPU and GPU in Sandy Bridge is a substantial advance for the industry and presents many opportunities. Dealing with all these different facets of Sandy Bridge in a single discussion is impossible given the scope of changes.

Sandy Bridge is a fundamentally new microarchitecture for Intel. While it outwardly resembles Nehalem and the P6, it is internally far different. The essence of an out-of-order microarchitecture is tracking, re-ordering, renaming and dynamically scheduling operations to achieve the limit of data flow. Nehalem and Westmere rely on the same mechanisms that date back to the original P6. Sandy Bridge changes the underlying out-of-order engine and uses the more efficient approach taken by the EV6 and P4. That one change alone qualifies Sandy Bridge as a different breed entirely from the P6. But, there are changes in almost every other aspect of the design. The uop cache is a huge improvement for the front-end, largely by eliminating many of the vagaries of x86 fetch and decode. The implementation is quite clever and achieves many of the aims of the P4’s trace cache, but in a far more efficient and reliable manner. AVX improves execution throughput and most importantly, the more flexible memory pipelines benefit almost all workloads.

In the coming year, three new microarchitectures will grace the x86 world. This abundance of new designs is exciting; especially since each one embodies a different philosophy. At the high-end, Sandy Bridge focuses on efficient per-core performance, while Bulldozer explicitly trades away some per-core performance for higher aggregate throughput. AMD’s Bobcat takes an entirely different road, emphasizing low-power, but retaining performance. In contrast, Intel’s Atom is truly intended to reach the most power sensitive applications. The two high-end microarchitectures, Sandy Bridge and Bulldozer, are shown below in Figure 7. Note that each Bulldozer module would include two integer cores while sharing the front-end and floating point cluster. Also, the floating point cluster in Bulldozer does not directly access memory, instead it uses the memory pipelines in the two attached cores, which then forward results to the FP cluster.

With the limited details, it is hard to predict the chip level performance for products based on these two microarchitectures. Frequencies are still undisclosed, or have yet to be determined and the client and server products will be rather different. In the case of Sandy Bridge, the clock speed should be in the same vicinity as Nehalem or Westmere – however, Bulldozer is clearly intended to run faster, but the frequency will probably be dictated by power consumption. For Bulldozer, there are also numerous details on the integration (e.g. L3 cache design, snoop filter) that are undisclosed. Nonetheless, it is possible to make some educated estimates about the performance of the two microarchitectures.

In looking at the two designs, it is sensible to compare a multi-threaded Sandy Bridge core to a Bulldozer module and separately consider single threaded operation as a special case. Both support two threads although the resources are very different. At a high level, Sandy Bridge shares everything between threads, whereas Bulldozer flexibly shares the front-end and floating point units, while separating the integer cores.

A Sandy Bridge core should have substantially higher performance than a Bulldozer module across the board for single threaded or lightly threaded code. It will also have an additional advantage for floating point workloads that use AVX, (e.g. numerical analysis for finance, engineering). With AVX, each Sandy Bridge core can have up to 2X the FLOP/cycle of a Bulldozer module, although they would be at parity if the code is compiled to use AMD’s FMA4 (e.g. via OpenCL). FMA4 will be relatively rare because, while elegant, it is likely to be a historical footnote for x86, supplanted by Intel’s FMA3. For software still relying on SSE, the difference between the two should be minimal. In comparison, Bulldozer will favor heavily multi-threaded software. Each module has twice the memory pipelines and slightly more resources (e.g. retirement queue/ROB entries, memory buffers) than a single Sandy Bridge core with two threads, so Bulldozer should do very well in many highly parallel integer workloads that exercise the memory pipelines.

In many ways, the strengths of Sandy Bridge reflect the intentions of the architects. Sandy Bridge is first and foremost a client microprocessor – which requires single threaded performance. Bulldozer is firmly aimed at the server market, where sacrificing single threaded performance for aggregate throughput is an acceptable decision in some cases. Perhaps in future articles, we can examine the components of performance in greater detail (e.g. frequency, IPC, etc.), but for now, high level guidance seems appropriate – given the level of disclosure from both vendors.

Ultimately, we will be waiting for real hardware to see how the Sandy Bridge client performs in the wild. The base clocks, realistic turbo frequencies and power consumption will all be very interesting to observe – and help estimate server performance as well. For now the hardware certainly looks promising and while we await products, we’ll have other reports on different aspects of Sandy Bridge to keep us occupied. The design team certainly deserves a round of congratulations for a job well done, redoing the microarchitecture from the ground up while tackling all the integration challenges.

Intel Nehalem Architecture

Nehalem is the codename for an Intel processor microarchitecture, successor to the Core microarchitecture. Nehalem processors use the 45 nm process. A preview system with two Nehalem processors was shown at Intel Developer Forum in 2007. The first processor released with the Nehalem architecture was the desktop Core i7, which was released in November 2008.

Nehalem, a recycled codename, refers to a completely different architecture from NetBurst, although Nehalem still has some things in common with NetBurst. Nehalem-based microprocessors utilize higher clock speeds and are more energy-efficient than Penryn microprocessors. Hyper-Threading is reintroduced along with an L3 cache, which is missing from most Core-based microprocessors.