Very Large Scale Integration (VLSI)

Thursday, 6 September 2012

Choosing FPGA or DSP for your Application

FPGA or DSP - The Two Solutions

The DSP is a specialised microprocessor - typically programmed in C, perhaps with assembly code for performance. It is well suited to extremely complex maths-intensive tasks, with conditional processing. It is limited in performance by the clock rate, and the number of useful operations it can do per clock. As an example, a TMS320C6201 has two multipliers and a 200MHz clock – so can achieve 400M multiplies per second.

In contrast, an FPGA is an uncommitted "sea of gates". The device is programmed by connecting the gates together to form multipliers, registers, adders and so forth. Using the Xilinx Core Generator this can be done at a block-diagram level. Blocks range from a single gate up to high-level functions such as an FIR or FFT. An FPGA's performance is limited by the number of gates it has and the clock rate. Recent FPGAs have included multipliers especially for performing DSP tasks more efficiently. For example, a 1M-gate Virtex-II™ device has 40 multipliers that can operate at more than 100MHz. Compared with the DSP above, this gives 4000M multiplies per second, ten times as many.

Where They Excel

When sample rates grow above a few MHz, a DSP has to work very hard to transfer the data without any loss. This is because the processor must use shared resources like memory buses, or even the processor core, which can be prevented from taking interrupts for some time. An FPGA, on the other hand, dedicates logic to receiving the data, so it can maintain high rates of I/O.

A DSP is optimised for use of external memory, so a large data set can be used in the processing. FPGAs have a limited amount of internal storage so need to operate on smaller data sets. However FPGA modules with external memory can be used to eliminate this restriction.

A DSP is designed to offer simple re-use of the processing units, for example a multiplier used for calculating an FIR can be re-used by another routine that calculates FFTs. This is much more difficult to achieve in an FPGA, but in general there will be more multipliers available in the FPGA.

If a major context switch is required, the DSP can implement this by branching to a new part of the program. In contrast, an FPGA needs to build dedicated resources for each configuration. If the configurations are small, then several can exist in the FPGA at the same time. Larger configurations mean the FPGA needs to be reconfigured – a process which can take some time.

The DSP can take a standard C program and run it. This C code can have a high level of branching and decision making – for example, the protocol stacks of communications systems. This is difficult to implement within an FPGA.

Most signal processing systems start life as a block diagram of some sort. Actually translating the block diagram to the FPGA may well be simpler than converting it to C code for the DSP.

Making a Choice

There are a number of elements to the design of most signal processing systems, not least the expertise and background of the engineers working on the project. These all have an impact on the best choice of implementation. In addition, consider the resources available – in many cases, I/O modules have FPGAs on board. Using these with a DSP processor may provide an ideal split.

As a rough guideline, try answering these questions:

What is the sampling rate of this part of the system? If it is more than a few MHz, FPGA is the natural choice.
Is your system already coded in C? If so, a DSP may implement it directly. It may not be the highest performance solution, but it will be quick to develop.
What is the data rate of the system? If it is more than perhaps 20-30Mbyte/second, then FPGA will handle it better.
How many conditional operations are there? If there are none, FPGA is perfect. If there are many, a software implementation may be better.
Does your system use floating point? If so, this is a factor in favour of the programmable DSP. None of the Xilinx cores support floating point today, although you can construct your own.
Are libraries available for what you want to do? Both DSP & FPGA offer libraries for basic building blocks like FIRs or FFTs. However, more complex components may not be available, and this could sway your decision to one approach or the other.

In reality, most systems are made up of many blocks. Some of those blocks are best implemented in FPGA, others in DSP. Lower sampling rates and increased complexity suit the DSP approach; higher sampling rates, especially combined with rigid, repetitive tasks, suit the FPGA.

Some Examples

Here are a few examples of signal processing blocks, along with how we would implement them:

First decimation filter in a digital wireless receiver. Typically, this is a CIC filter, operating at a sample rate of 50-100MHz. A 5-stage CIC has 10 registers & 10 adds, giving an "add rate" of 500-1000MHz.
At these rates any DSP processor would find it extremely difficult to keep up. However, the CIC has an extremely simple structure, and implementing it in an FPGA would be easy. A sample rate of 100MHz should be achievable, and even the smallest FPGA will have a lot of resource left for further processing.
Communications Protocol Stack – ISDN, IEEE1394 etc; these are complex large pieces of C code, completely unsuitable for the FPGA. However the DSP will implement them easily. Not only that, a single code base can be maintained, allowing the code stack to be implemented on a DSP in one product, or a separate control processor in another; and bringing the opportunity to licence the code stack from a specialist supplier.
Digital radio receiver – baseband processing. Some receiver types would require FFTs for signal acquisition, then matched filters once a signal is acquired. Both blocks can be easily implemented by either approach. However, there is a mode change – from signal acquisition to signal reception.
It may well be that this is better suited to the DSP, as the FPGA would need to implement both blocks simultaneously. Note that the RF processing is better in an FPGA, so this is likely to be a mixed system.
(Note – with today’s larger FPGAs, both modes of this system could be included in the FPGA at the same time.)
Image processing. Here, most of the operations on an image are simple and very repetitive – best implemented in an FPGA. However, an imaging pipeline is often used to identify "blobs" or "Regions of Interest" in an object being inspected. These blobs can be of varying sizes, and subsequent processing tends to be more complex. The algorithms used are often adaptive, depending on what the blob turns out to be… so a DSP-based approach may be better for the back end of the imaging pipeline.
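To give a feel for how simple the CIC structure in the first example is, here is a minimal Verilog sketch of a 5-stage CIC decimator. The module name, port names, decimation ratio R and bit widths are all assumptions chosen for illustration; they are not taken from any particular design. It uses five integrator registers and five comb delay registers, which matches the 10 registers and 10 adds mentioned above.

```verilog
// Minimal CIC decimator sketch (illustrative names and widths, not a vendor core).
// N integrators run at the input rate; every R-th sample is passed through
// N comb (differentiator) stages at the output rate.
module cic_decimator #(
    parameter N  = 5,   // number of stages (5-stage example from the text)
    parameter R  = 8,   // decimation ratio (assumed)
    parameter IW = 12,  // input width in bits (assumed)
    parameter W  = 27   // internal width, at least IW + N*log2(R)
) (
    input  wire                 clk,
    input  wire                 rst,
    input  wire signed [IW-1:0] din,
    output reg  signed [W-1:0]  dout,
    output reg                  dout_valid
);
    reg signed [W-1:0] integ [0:N-1];  // integrator registers
    reg signed [W-1:0] delay [0:N-1];  // comb delay registers
    reg signed [W-1:0] comb;           // temporary value for the comb chain
    reg [31:0]         count;          // counts input samples up to R-1
    integer            i;

    always @(posedge clk) begin
        if (rst) begin
            for (i = 0; i < N; i = i + 1) begin
                integ[i] <= 0;
                delay[i] <= 0;
            end
            count      <= 0;
            dout       <= 0;
            dout_valid <= 0;
        end else begin
            // Integrator chain: one add per stage on every input sample.
            integ[0] <= integ[0] + din;
            for (i = 1; i < N; i = i + 1)
                integ[i] <= integ[i] + integ[i-1];

            dout_valid <= 0;
            if (count == R-1) begin
                count <= 0;
                // Comb chain: one subtract per stage on every R-th sample.
                // 'comb' is a blocking temporary; delay[i] still holds the
                // previous value when it is read on the right-hand side.
                comb = integ[N-1];
                for (i = 0; i < N; i = i + 1) begin
                    delay[i] <= comb;
                    comb = comb - delay[i];
                end
                dout       <= comb;
                dout_valid <= 1;
            end else begin
                count <= count + 1;
            end
        end
    end
endmodule
```

There are no multipliers and no branching beyond the decimation counter, which is why this block maps so naturally onto FPGA logic.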

Summary

FPGA and DSP represent two very different approaches to signal processing – each good at different things. There are many high sampling rate applications that an FPGA does easily, while the DSP could not. Equally, there are many complex software problems that the FPGA cannot address.

Feature Detection on FPGA

Autonomous landing and roving on the Moon require fast computation. Space-tolerant processors are not fast enough to run the computer vision algorithms needed to land safely on the surface of the Moon. Field Programmable Gate Arrays (FPGAs) can parallelize computations that a processor executes serially. An FPGA programmer can directly program the logic fabric of the device, whereas a processor is restricted to a set of assembly commands which it fetches, decodes, and executes in order to run a program. The accompanying video shows several blurs executing on an FPGA. Blurring is required for the Scale Invariant Feature Transform (SIFT) feature detection algorithm that is currently being developed.

Monday, 3 September 2012

Intel Chips Will Support Wireless Charging by 2014

Earlier this month it was rumored that Intel was developing a new wireless charging solution it could integrate into the Ultrabook platform. In so doing, it would remove the need to plug your Ultrabook into a power socket directly, instead placing it on a charging pad or at least near a power source.

Intel’s interest in wireless charging has today been confirmed through a new partnership with Integrated Device Technology (IDT). IDT will develop a new integrated transmitter and receiver for Intel, allowing for wireless charging using resonance technology from up to several feet away.

“Our extensive experience in developing the innovative and highly integrated IDTP9030 transmitter and multi-mode IDTP9020 receiver has given IDT a proven leadership position in the wireless power market,” said Arman Naghavi, vice president and general manager of the Analog and Power Division at IDT.

As for when we can expect to see an Intel-branded wireless charging system, IDT is working to provide samples in the first half of 2013, suggesting product integration should happen in time for the major holiday season at the end of next year.

IDT’s Gary Huang has also suggested that eventually wireless charging will expand to power everything on your desktop. So your wireless keyboard, mouse, backup storage device, smartphone, and PC/laptop will all be completely wireless, with each including the necessary components and battery to be charged.

The chipmaker is entering a market where there is already a proposed standard called Qi. Qi has received support from a wide array of companies, including Energizer, Texas Instruments, Verizon, and phone manufacturers including Nokia, Research In Motion, LG, and HTC.

Currently, 88 products are listed by the Wireless Power Consortium as being Qi-compatible, including phones from NTT DoCoMo and HTC.

Intel is not a part of that group, and its wireless charging effort, which is based on a platform created by IDT, is apparently not Qi-compatible. Since Qi is already getting widespread support and Intel’s chips have made it into very few mobile devices so far, Intel has some work ahead if it is to be a success.

A completely wire-free desk at home sounds great to me, and if this IDT/Intel venture is successful it could be a reality within a year or two.

Wednesday, 29 August 2012

Altera’s 28-nm Stratix V FPGAs Contain Industry’s Fastest Backplane-capable Transceivers

San Jose, Calif., July 31, 2012—Altera Corporation (Nasdaq: ALTR) today announced it is shipping in volume production the FPGA industry’s highest performance backplane-capable transceivers. Altera’s Stratix® V FPGAs are the industry’s only FPGAs to offer 14.1 Gbps transceiver bandwidth and are the only FPGAs capable of supporting the latest generation of the Fibre Channel protocol (16GFC). Developers of backplanes, switches, data centers, cloud computing applications, test and measurement systems and storage area networks can achieve significantly higher data rate speeds as well as rapid storage and retrieval of information by leveraging Altera’s latest generation 28-nm high-performance FPGA. For OTN (optical transport network) applications, Stratix V FPGAs allow carriers to scale quickly to support the tremendous growth of traffic on their networks.

Altera started shipping engineering samples of 28-nm FPGAs featuring integrated 14.1 Gbps transceivers over one year ago. These high-performance devices are the latest in Altera’s 28-nm FPGA portfolio to ship in volume production. The transceivers in Stratix V GX and Stratix V GS FPGAs deliver high system bandwidth (up to 66 lanes operating up to 14.1 Gbps) at the lowest power consumption (under 200 mW per channel). Transceivers in Altera’s FPGAs are equipped with advanced equalization circuit blocks, including low-power CTLE (continuous time linear equalization), DFE (decision feedback equalization), and a variety of other signal conditioning features for optimal signal integrity to support backplane, optical module, and chip-to-chip applications. This advanced signal conditioning circuitry enables direct drive of 10GBASE-KR backplanes using Stratix V FPGAs.

“Developers of next-generation protocols need to leverage the latest test equipment that integrates the latest technologies,” said Michael Romm, vice president of product development, at LeCroy Protocol Solutions Group, a leading manufacturer of test and measurement equipment. “Altera’s latest family of 28-nm FPGAs gives us the capability to build the most sophisticated and advanced test equipment so our customers can rapidly develop and bring to market their next-generation systems.”

Published by Altera.

Sunday, 26 August 2012

Inside TSMC – A FAB Tour

An up-to-date overview of semiconductor manufacturing technology from TSMC in Taiwan. Nicely produced and informative if you tune out the voice-over slightly. Better access than any fab tour.
Recommended if you have any interest in how semiconductors are manufactured in volume right now.

In the microelectronics industry a semiconductor fabrication plant (commonly called a fab) is a factory where devices such as integrated circuits are manufactured.

A business that operates a semiconductor fab for the purpose of fabricating the designs of other companies, such as fabless semiconductor companies, is known as a foundry. If a foundry does not also produce its own designs, it is known as a pure-play semiconductor foundry.

Fabs require many expensive devices to function. Estimates put the cost of building a new fab over one billion U.S. dollars with values as high as $3–4 billion not being uncommon. TSMC will be investing 9.3 billion dollars in its Fab15 300 mm wafer manufacturing facility in Taiwan to be operational in 2012.

The central part of a fab is the clean room, an area where the environment is controlled to eliminate all dust, since even a single speck can ruin a microcircuit, which has features much smaller than dust. The clean room must also be damped against vibration and kept within narrow bands of temperature and humidity. Controlling temperature and humidity is critical for minimizing static electricity.

The clean room contains the steppers for photolithography, as well as etching, cleaning, doping and dicing machines. All these devices are extremely precise and thus extremely expensive. Prices for the most common pieces of equipment for processing 300 mm wafers range from $700,000 to upwards of $4,000,000 each, with a few pieces of equipment (e.g. steppers) reaching as high as $50,000,000 each. A typical fab will have several hundred equipment items.

Taiwan Semiconductor Manufacturing Company, Limited or TSMC is the world's largest dedicated independent semiconductor foundry, with its headquarters and main operations located in the Hsinchu Science Park in Hsinchu, Taiwan.

Facilities at TSMC:

One 150 mm (6 inches) wafer fab in full operation (Fab 2)
Four 200 mm (8 inches) wafer fabs in full operation (Fabs 3, 5, 6, 8)
Two 300 mm (12 inches) wafer fabs in production (Fabs 12, 14)
TSMC (Shanghai)
WaferTech, TSMC's wholly owned subsidiary 200 mm (8 inches) fab in Camas, Washington, USA
SSMC (Systems on Silicon Manufacturing Co.), a joint venture with NXP Semiconductors in Singapore which has also brought increased capacity since the end of 2002

TSMC announced plans to invest US$9.4 billion to build its third 12-inch (300 mm) wafer fabrication facility in Central Taiwan Science Park (Fab 15), which will use advanced 40 and 20-nanometer technologies. It is expected to become operational by March 2012. The facility will output over 100,000 wafers a month and generate $5 billion per year of revenue. On January 12, 2011, TSMC announced the acquisition of land from Powerchip Semiconductor for NT$2.9 billion (US$96 million) to build two additional 300 mm fabs to cope with increasing global demand. Further, TSMC has disclosed plans to build a 450 mm fab, which may begin its pilot lines in 2013 and production as early as 2015.

Tuesday, 21 August 2012

Blocking and Non-Blocking Assignment

keysymbols: =, <=.

Blocking (the = operator)

With blocking assignments, the statements in a block are executed in sequential order, each one completing before the next begins. If there is a time delay on one line, then the next statement will not be executed until this delay is over.

integer a,b,c,d;

initial begin
	a = 4;  b = 3;					// example 1
	#10 c = 18;
	#5 d = 7;
end

Above, at time=0, a and b are assigned 4 and 3 respectively. At time=10, c will equal 18, and at time=15, d will equal 7.

Non-Blocking (the <= operator)

Non-blocking assignments tackle the procedure of assigning values to variables in a totally different way. Instead of executing each statement as it is found, the right-hand side variables of all non-blocking statements are read and stored in temporary memory locations. When they have all been read, the left-hand side variables are updated, either at the end of the current time step or after the delay given in the assignment. They are non-blocking because they allow the execution of other events to occur in the block even if there are time delays set.

integer a,b,c,d;

initial begin
	a = 67;
	#10;
	a <= 4;						// example 2
	c <= #15 a;
	d <= #10 9;
	b <= 3;
end

This example sets a=67, then waits for a count of 10. Then the right-hand variables are read and stored in temporary memory locations. Here the only variable read is a, which is still 67. Then the left-hand variables are set. At time=10, a and b will be set to 4 and 3. Then at time=20, d=9. Finally at time=25, c is given the stored value of a, which was 67, therefore c=67.

Note that d is set before c. This is because the four statements for setting a-d are performed at the same time. Variable d is not waiting for variable c to complete its task. This is similar to a Parallel Block.

This example has used both blocking and non-blocking statements. The blocking statement could be non-blocking, but this method saves on simulator memory and will not have as large a performance drain.
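To watch this timing in a simulator, example 2 can be wrapped in a small testbench. The following is a minimal sketch: the module name is made up for illustration, and the only addition to the original statements is a $monitor call that reports each change.

```verilog
// Illustrative testbench for example 2 (module name is an assumption).
module nonblocking_timing;
    integer a, b, c, d;

    initial begin
        // $monitor prints a line at the end of any time step in which
        // one of the listed variables has changed.
        $monitor("time=%0t a=%0d b=%0d c=%0d d=%0d", $time, a, b, c, d);
        a = 67;         // blocking: takes effect immediately at time 0
        #10;
        a <= 4;         // updated at time 10
        c <= #15 a;     // a is read now (67), c is updated at time 25
        d <= #10 9;     // d is updated at time 20
        b <= 3;         // updated at time 10
    end
endmodule
```

The monitor output should show a and b changing at time 10, d at time 20 and c at time 25, in line with the description above. Variables that have not yet been assigned are reported as unknown (x).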

Application of Non-Blocking Assignments:

We have already seen that non-blocking assignments can be used to enable variables to be set anywhere in time without worrying what the previous statements are going to do.

Another important use of the non-blocking assignment is to prevent race conditions. If the programmer wishes two variables to swap their values and blocking operators are used, the output is not what is expected:

initial begin
	x = 5;
	y = 3;
end
							// example 3
always @(negedge clock) begin
	x = y;
	y = x;
end

This will give both x and y the same value, which is not the swap that was intended. The simulator gives a stable output, but not the output expected: it assigns x the value of y (3), and then y is assigned x. As x is now 3, y does not change its value, and the original value of x (5) is lost. Written this way, the result depends entirely on the order of the two statements. If the non-blocking operator is used instead:

always @(negedge clock) begin
	x <= y;						// example 4
	y <= x;
end

both the values of x and y are stored first. Then x is assigned the old value of y (3) and y is then assigned the old value of x (5).
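The two behaviours can be compared side by side in one simulation. This is a minimal sketch; the module name and the extra variables p and q (used for the blocking version) are assumptions added for illustration. The clock is started at 1 so that the first falling edge happens cleanly at time 5.

```verilog
// Illustrative comparison of examples 3 and 4 (names p, q and swap_demo are assumptions).
module swap_demo;
    reg clock;
    integer x, y;   // swapped using non-blocking assignments
    integer p, q;   // "swapped" using blocking assignments

    initial begin
        clock = 1;
        x = 5; y = 3;
        p = 5; q = 3;
    end

    always #5 clock = ~clock;   // falling edges at time 5, 15, 25, ...

    always @(negedge clock) begin
        x <= y;     // both right-hand sides are read first,
        y <= x;     // then both left-hand sides are updated
    end

    always @(negedge clock) begin
        p = q;      // p becomes 3 immediately
        q = p;      // q is assigned the new p, so it stays 3
    end

    initial begin
        #7;         // just after the first falling edge
        $display("non-blocking: x=%0d y=%0d", x, y);  // x=3 y=5
        $display("blocking:     p=%0d q=%0d", p, q);  // p=3 q=3
        $finish;
    end
endmodule
```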

Another case where the non-blocking operator should be used is in clocked logic where a variable is set to a new value that involves its old value, such as a counter or a shift register.

i <= i+1;						// example 5
// or
register[5:0] <= {register[4:0] , new_bit};		// example 6
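For context, here is how the shift register line might sit inside a complete module. This is a sketch only: the module name, the port names and the choice of a rising clock edge are assumptions, and the register is renamed shift_reg for clarity.

```verilog
// Illustrative 6-bit shift register (module and port names are assumptions).
module shift6 (
    input  wire       clock,
    input  wire       new_bit,
    output reg  [5:0] shift_reg
);
    // On each rising edge the old contents move up one place and
    // new_bit enters at bit 0. The right-hand side is read before
    // shift_reg is updated, so the old value is always the one used.
    always @(posedge clock)
        shift_reg[5:0] <= {shift_reg[4:0], new_bit};
endmodule
```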

Race condition in Verilog

In Verilog, certain types of assignments or expressions are scheduled for execution at the same time, and the order of their execution is not guaranteed. This means they could be executed in any order, and the order could change from one run or simulator to another. This non-determinism is called a race condition in Verilog.

Verilog execution order:

If you look at the active event queue, it holds multiple types of statements and commands with equal priority, which means they are all scheduled to be executed together in any order. This leads to many of the races.

Let's look at some of the common race conditions that one may encounter.

1) Read-Write or Write-Read race condition.

Take the following example :

always @(posedge clk)
x = 2;

always @(posedge clk)
y = x;

Both assignments have the same sensitivity (posedge clk), which means that when the clock rises, both will be scheduled to execute at the same time. Either ‘x’ could be assigned the value ‘2’ first and then ‘y’ assigned ‘x’, in which case ‘y’ would end up with the value ‘2’. Or it could be the other way around: ‘y’ could be assigned the value of ‘x’ first, which could be something other than ‘2’, and then ‘x’ is assigned the value ‘2’. So depending on the order, the final value of ‘y’ could be different.

How can you avoid this race? It depends on what your intention is. If you want a specific order, put both statements in that order within a ‘begin’…‘end’ block inside a single ‘always’ block. Let’s say you want ‘x’ to be updated first and then ‘y’; you can do the following. Remember that blocking assignments within a ‘begin’…‘end’ block are executed in the order they appear.

always @(posedge clk)
begin
x = 2;
y = x;
end
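If the intention is instead the usual register behaviour, where ‘y’ should pick up the value ‘x’ had before the clock edge, the two ‘always’ blocks can be kept separate and written with non-blocking assignments. The sketch below assumes a module name, port names and 4-bit widths purely for illustration.

```verilog
// Illustrative race-free version using non-blocking assignments
// (module name, ports and widths are assumptions).
module rw_no_race (
    input  wire       clk,
    output reg  [3:0] x,
    output reg  [3:0] y
);
    always @(posedge clk)
        x <= 2;

    // The right-hand side is read before any non-blocking update is
    // applied, so y always receives the value x had before this edge,
    // whichever always block the simulator happens to run first.
    always @(posedge clk)
        y <= x;
endmodule
```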

2) Write-Write race condition.

always @(posedge clk)
x = 2;

always @(posedge clk)
x = 9;

Here again both blocking assignments have the same sensitivity, which means they both get scheduled to execute at the same time in the ‘active event’ queue, in any order. Depending on the order, the final value of ‘x’ could be either ‘2’ or ‘9’. If you want a specific order, you can follow the example given for the previous race condition.

3) Race condition arising from a ‘fork’…’join’ block.

always @(posedge clk)
fork
x = 2;
y = x;
join

Unlike a ‘begin’…‘end’ block, where statements are executed in the order they appear, statements within a ‘fork’…‘join’ block are executed in parallel. This parallelism can be the source of a race condition, as shown in the example above.

Both blocking assignments are scheduled to execute in parallel, and depending on the order of their execution the eventual value of ‘y’ could be either ‘2’ or the previous value of ‘x’. It cannot be determined beforehand.

4) Race condition because of variable initialization.

reg clk = 0;

initial
clk = 1;

In Verilog, a ‘reg’ type variable can be initialized within the declaration itself. This initialization is executed at time step zero, just like an initial block, so if you also have an initial block that assigns to the same ‘reg’ variable, you have a race condition.

There are a few other situations where race conditions can come up. For example, if a function is invoked from more than one active block at the same time, the execution order could become non-deterministic.
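One simple way to avoid this particular race is to give the variable its starting value in exactly one place. The following is a minimal sketch, with an assumed module name and clock period, in which the clock is both initialized and toggled from a single initial block.

```verilog
// Illustrative single-point initialization (module name and timing are assumptions).
module init_once;
    reg clk;                    // no initializer in the declaration

    initial begin
        clk = 1;                // the only place clk gets its starting value
        forever #5 clk = ~clk;  // toggle every 5 time units
    end

    initial #20 $finish;        // stop the simulation after a few edges
endmodule
```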