Very Large Scale Integration (VLSI)

Sunday, 11 September 2016

64 core processor from Chinese chip maker Phytium

While the world awaits the AMD K12 and Qualcomm Hydra ARM server chips to join the ranks of the Applied Micro X-Gene and Cavium ThunderX processors already in the market, it could be upstart Chinese chip maker Phytium Technology that gets a brawny chip into the field first and also gets traction among actual datacenter server customers, not just tire kickers.

Phytium Technology has announced a 64-core ARM server CPU, which according to the press release will deliver 512 gigaflops of performance. The new chip, known as FT-2000/64, is aimed at “high throughput and high performance servers.”

Phytium is a chip design enterprise based in Tianjin, China. In March 2015, the company released its first products: the FT-1500A/4 and FT-1500A/16, which are 4-core and 16-core implementations, respectively, of the ARMv8 design.

Phytium was on hand at last week’s Hot Chips 28 conference, showing off its chippery and laptop, desktop and server machines employing its “Earth” and “Mars” FT series of ARM chips. Most of the interest that people showed was in the server variants, which are both based on variants of the “Xiaomi” core design that the company has cooked up based on ARMv8 intellectual property licensed from ARM Holdings. There is chatter that one of the three Chinese exascale machines, which we wrote about here, will employ a future Phytium processor, but we were unable to confirm this with the Phytium executives at the event. What we can tell you is that the first engineering samples of the two Earth ARM chips, the FT-1500A/4 and the FT-1500A/16, as well as the one Mars ARM chip, the FT-2000/64, are back from Taiwan Semiconductor Manufacturing Corp and that we saw systems running the Kylin Linux operating system (a variant of Canonical’s Ubuntu) at the Hot Chips event.

Here are the key chip features from the FT-2000/64 product page:

Process：Manufacturing with 28nm process
Core：Integrating sixty-four FTC661 cores
Frequency：Running at 1.5GHz~2.0GHz
Cache：Integrating 32MB L2 cache and extending 128MB LLC
Extension Interface：Integrating eight proprietary extension interfaces, each delivering 19.2GB/s effective r/w bandwidth
Memory Interface：Extending sixteen DDR3-1600 memory controllers, which can deliver 204.8GB/s memory access bandwidth.
I/O Interface：Integrating two x16 or four x8 PCIE Gen3 interface
Power：Max. power 100W
Package：FCBGA package with 2892 pins
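The headline numbers in this list can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not anything from Phytium: it assumes each DDR3-1600 controller drives one standard 64-bit channel, and the figure of 4 floating point operations per core per cycle is a hypothetical value chosen because it reproduces the quoted 512 gigaflops at 2.0GHz.

```python
def channel_bandwidth_gbs(mega_transfers_per_s, bus_width_bits=64):
    """Peak bandwidth of one DDR channel in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * bytes_per_transfer / 1000


# Assumption: one 64-bit channel per controller, sixteen controllers in total.
memory_gbs = 16 * channel_bandwidth_gbs(1600)

# Eight extension interfaces at 19.2GB/s each, as listed on the product page.
extension_gbs = 8 * 19.2

# Assumption (hypothetical): 4 floating point operations per core per cycle.
FLOPS_PER_CYCLE = 4
peak_gflops = 64 * 2.0 * FLOPS_PER_CYCLE

print(f"memory bandwidth:    {memory_gbs:.1f} GB/s")
print(f"extension bandwidth: {extension_gbs:.1f} GB/s")
print(f"peak throughput:     {peak_gflops:.0f} gigaflops")
```

Under these assumptions the memory figure comes out at the 204.8GB/s quoted above, which suggests the product page is quoting theoretical peak bandwidth.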

No pricing was provided on the new chips, and it’s unclear from the press release if the product is available today. The next time we hear about the FT-2000/64 might very well be when it shows up in a TOP500 supercomputer. Stay tuned.

4μm thick fabric like flexible circuit

According to the Korea Advanced Institute of Science and Technology (KAIST), complete with substrate, an active matrix for a flexible display need only be 4μm thick.

The matrix of ultra-thin n-type transparent oxide thin-film transistors (TFTs) for the back plane was initially fabricated on a sacrificial laser-reactive substrate.

Laser irradiation from the backside of the substrate split off only the oxide TFT array as a result of reaction with the laser-reactive layer.

The freed transistors were transferred to a 4μm polyethylene terephthalate (PET) substrate, and then the combination was further transferred conformally to the surface of human skin and artificial leather to demonstrate the possibility of wearable applications.

“The attached oxide TFTs showed high optical transparency of 83% and mobility of 40cm2/Vs even under several cycles of severe bending tests,” said KAIST.

The method is called inorganic-based laser lift-off (ILLO).

“By using our ILLO process, the technological barriers for high performance transparent flexible displays have been overcome at a relatively low cost by removing expensive polyimide substrates. Moreover, the high-quality oxide semiconductor can be easily transferred onto skin-like, or any flexible, substrate for wearable application,” said Professor Keon Jae Lee.

Conformal displays are a potential application.

“With the advent of the Internet of Things era, demand has grown for wearable and transparent displays that can be applied to fields such as augmented reality and skin-like thin flexible devices,” said KAIST. “However, previous flexible transparent displays have poor transparency and low electrical performance. To improve the transparency and performance, past research efforts have tried to use inorganic-based electronics, but the fundamental thermal instabilities of plastic substrates have hampered the high temperature process, an essential step necessary for the fabrication of high performance electronic devices.”

Monday, 18 July 2016

Mega Processor to Understand Micro Processor

Have you ever imagined how a microprocessor works or what's going on inside it? Think about a bigger version of a microprocessor that you can walk inside to see how it really works.

You may have heard that your smartphone contains more computing power than all the computers used on the Apollo mission combined. But imagine taking the computing power of a Super Nintendo, and packing it into a computer the size of--a living room?

The "mega-processor" is essentially a blown up version of a tiny chip that allows you to see how all the elements of a computer chip join together and how it actually works.

A Cambridge resident has finished building a 10-metre wide and 2-metre high computer in his living room, which he uses to play the video game Tetris.

James Newman took four years and £40,000 to build the processor, which works exactly like the microprocessor chip, about the size of a SIM card, found in a regular desktop computer or laptop.

This room-sized megaprocessor has 40,000 transistors, 10,000 LED lights, weighs around half a tonne (500kg) and burns 500W of electricity, according to Newman, who explains the entire contraption in a video.

James Newman said his Mega Processor relies almost entirely on hand-soldered components, and will ultimately demonstrate how data travels through and is processed in a simple CPU core. He's just finished putting together the general purpose registers, and in May completed the arithmetic and logic unit.

Each transistor acts like a digital switch, and transistors can be chained together to form huge decision-making circuits that execute software, instruction by instruction.
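To make the idea of chaining switches concrete, here is a minimal sketch that models a NAND gate as two switches in series and then builds an adder out of nothing but NAND gates. It is only an illustration of the principle. It is not Newman's actual circuit design, and all the function names are made up for this example.

```python
def nand(a, b):
    # Two switches in series pull the output low only when both are on.
    return 0 if (a and b) else 1


def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))


def xor(a, b):
    shared = nand(a, b)
    return nand(nand(a, shared), nand(b, shared))


def full_adder(a, b, carry_in):
    """Add three bits, returning (sum_bit, carry_out)."""
    partial = xor(a, b)
    sum_bit = xor(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return sum_bit, carry_out


def add(x, y, width=8):
    """Ripple-carry addition of two unsigned integers, one bit at a time."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result  # the final carry is dropped, as in a fixed-width register


print(add(5, 9))      # 14
print(add(200, 100))  # 44, because 300 does not fit in 8 bits
```

Every gate here boils down to calls to nand, which is the sense in which simple switches combine into circuits that can do arithmetic.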

Newman, whose background is in software development and FPGA programming, told The Register he has spent about £40k on the project to date. He started planning the processor in 2012, and began building the beast a year later.

Monday, 4 July 2016

The World's First 1,000 Processor Chip ( KiloCore Chip )

A team of scientists from the University of California, Davis has created the world's first microchip with 1,000 independent processors. Called the 'KiloCore' chip, it is also claimed to be the fastest chip ever designed at a university. The chip, which was presented this week at the 2016 Symposium on VLSI Technology and Circuits, is capable of 1.78 trillion instructions per second and contains 621 million transistors. The KiloCore chip, partially funded by the Department of Defense, was ultimately built by IBM using existing 32 nanometer semiconductor fabrication technology.

Unfortunately, a 1,000 core chip isn't something that could just be plugged into the next line of MacBook Pros. It wouldn't even really suffice as a graphics processor, where massively parallel computation is the norm. In fact, many GPUs exceed the 1,000 cores of the UC Davis chip, but with the caveat that the individual cores are directed according to a central controller. The KiloCore, by contrast, is built from completely independent cores capable of running completely independent computer programs.

Here's all you need to know about the chip:

This microchip has been designed by a team at the University of California, Davis, Department of Electrical and Computer Engineering.
KiloCore chip executes instructions more than 100 times more efficiently than a modern laptop processor.
Each processor core can run its own small program independently of the others, which is a fundamentally more flexible approach than the Single-Instruction-Multiple-Data approaches utilized by processors such as graphics processing units (GPUs). Because each processor is independently clocked, it can shut itself down to further save energy when not needed.
The chip has been fabricated by IBM using its 32nm CMOS technology.
Cores operate at an average maximum clock frequency of 1.78 GHz, and they transfer data directly to each other rather than using a pooled memory area that can become a bottleneck for data.

The independence of the cores makes the KiloCore chip a multiple instruction multiple data (MIMD) computer. This is in contrast to the more typical single instruction multiple data (SIMD) variety of parallel computation, as would be expected in a graphics processor. A SIMD machine's version of parallelism is to implement the same single operation across many different cores - that is, do the same thing to many different units of data. This is the norm in image processing, for example, where a lot of different pixels holding a lot of different values are all updated in the same way. A MIMD machine can be expected to do much more complex calculations.

Together, the 1,000 processors can execute 115 billion instructions per second while dissipating only 0.7 Watts. As noted in a UC Davis press release, this power requirement is low enough that it could be supplied by a single AA battery, achieving an efficiency of around 100 times that of a normal laptop processor.
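The two figures in the paragraph above are enough to work out the energy cost of a single instruction. The short sketch below does that arithmetic. The function names are made up for illustration, and the result is only as good as the quoted 115 billion instructions per second and 0.7 Watts.

```python
def instructions_per_joule(instructions_per_second, power_watts):
    """How many instructions are executed for each joule of energy."""
    return instructions_per_second / power_watts


def picojoules_per_instruction(instructions_per_second, power_watts):
    """Average energy spent on one instruction, in picojoules."""
    return power_watts / instructions_per_second * 1e12


rate = 115e9   # instructions per second, from the text above
power = 0.7    # Watts, from the text above

print(f"{instructions_per_joule(rate, power) / 1e9:.1f} billion instructions per joule")
print(f"{picojoules_per_instruction(rate, power):.2f} pJ per instruction")
```

At this operating point the chip spends roughly 6 picojoules on each instruction.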

The energy savings here largely come from abandoning the traditional system memory architecture, in which data for multiple cores is stored in a central RAM unit. Rather than sharing data in this way, the KiloCore chip uses a built-in networking scheme in which data is transferred directly between the different processors using packet- and circuit-switched networking.

Friday, 22 January 2016

Radix number systems and conversions

We learn and use the decimal numbering system simply because humans are born with ten fingers. Hence the number system we use every day is decimal, but this system is not convenient for machines, since they handle information encoded as ON or OFF bits.

This means we have to learn the binary system in addition to the decimal system. We will also discuss the octal and hexadecimal systems, because conversion to and from binary is easy and numbers in these systems are easier for humans to read than binary numbers.

This way of encoding makes it necessary to know the positional methods of calculation, which allow us to express a number in any base we need.

A base of a number system or radix defines the range of values that a digit may have.

Binary Number System

In the binary system or base 2, there can be only two values for each digit of a number, either a "0" or a "1".

Digital and computer technology is based on the binary number system, since the foundation is based on a transistor, which only has two states: on or off.

Each digit of a binary number is called a bit, which is short for binary digit.

An 8-bit group is referred to as a byte
A 4-bit group is referred to as a nibble

Each bit is weighted based on its position in the sequence (powers of 2) from the Least Significant Bit (LSB) to the Most Significant Bit (MSB).

Each bit must be less than 2 which means it has to be either 0 or 1.

For example (1010.11)2 is evaluated as:

(1010.11)2 = 8 + 0 + 2 + 0 + 0.5 + 0.25 = (10.75)10

Note: The general term for decimal point is radix point

In binary, the count starts at 0 (called 0-referencing), whereas in decimal, the count typically starts with 1 (called 1-referencing).
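The evaluation of (1010.11)2 shown earlier in this section can be reproduced with a few lines of Python. This is a minimal sketch and binary_terms is just an illustrative name. It lists the weighted value of every bit so that the individual terms are visible before they are summed.

```python
def binary_terms(bits):
    """Return the weighted value of each bit in a string such as '1010.11'."""
    if any(ch not in "01." for ch in bits) or bits.count(".") > 1:
        raise ValueError("expected only 0, 1 and at most one radix point")
    integer_part, _, fraction_part = bits.partition(".")
    terms = []
    # Integer bits: weights 2^(n-1) down to 2^0, reading left to right.
    for i, bit in enumerate(integer_part):
        exponent = len(integer_part) - 1 - i
        terms.append(int(bit) * 2.0 ** exponent)
    # Fraction bits: weights 2^-1, 2^-2, and so on.
    for i, bit in enumerate(fraction_part, start=1):
        terms.append(int(bit) * 2.0 ** -i)
    return terms


terms = binary_terms("1010.11")
print(terms)       # [8.0, 0.0, 2.0, 0.0, 0.5, 0.25]
print(sum(terms))  # 10.75
```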

Octal Number System

In the octal system or base 8, there can be eight choices for each digit of a number:

"0", "1", "2", "3", "4", "5", "6", "7".

Octal and hexadecimal numbers are used by humans as a representation of long strings of bits since they are:

Easier to read and write, for example 347 in octal is easier to read and write than 011100111 in binary.
Easy to convert: each octal digit stands for a group of 3 bits, and each hexadecimal digit for a group of 4 bits.

The most common way is to use hex to write the binary equivalent; two hexadecimal digits make a byte (a group of 8 bits), which is the basic block of data in computers.
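The grouping idea is easy to try out. The sketch below splits a bit string into groups of 3 or 4 bits, padding with zeros on the left when needed, and maps each group to one octal or hexadecimal digit. The function names are made up for this example.

```python
DIGITS = "0123456789ABCDEF"


def group_bits(bits, group_size):
    """Split a bit string into equal groups, padding with zeros on the left."""
    padding = (-len(bits)) % group_size
    padded = "0" * padding + bits
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


def bits_to_digits(bits, group_size):
    """Map each group of bits to one digit (3 bits -> octal, 4 bits -> hex)."""
    return "".join(DIGITS[int(group, 2)] for group in group_bits(bits, group_size))


print(group_bits("011100111", 3))      # ['011', '100', '111']
print(bits_to_digits("011100111", 3))  # 347
print(bits_to_digits("011100111", 4))  # 0E7 (the leading 0 comes from the padding)
```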

Decimal Number System

In the decimal system or base 10, there are ten different values for each digit of a number:

"0", "1", "2", "3", "4", "5", "6", "7", "8", "9".

The decimal number system is the default and is easy for us to use. For example, when you see the number 56, you assume that its base or radix is 10, i.e. “56 base 10”.

Each digit is weighted based on its position in the sequence (power of 10) from the Least Significant Digit (LSD, power of 0) to the Most Significant Digit (MSD, highest power).
Each digit must be less than 10 (0 to 9)

Hexadecimal Number System

In the hexadecimal system, we allow 16 values for each digit of a number:

"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A", "B", "C", "D", "E", and "F".

Where “A” stands for 10, “B” for 11 and so on.

Conversion among different radices

1. Convert from Decimal to Any Base

Let’s think about what you do to obtain each digit. As an example, let's start with a decimal number 1234 and convert it to decimal notation. To extract the last digit, you move the decimal point left by one digit, which means that you divide the given number by its base 10.

1234/10 = 123 + 4/10

The remainder of 4 is the last digit. To extract the next last digit, you again move the decimal point left by one digit and see what drops out.

123/10 = 12 + 3/10

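The remainder of 3 is the next last digit. You repeat this process until there is nothing left. Then you stop. In summary, you divide the number by the base, keep the remainder as the next digit (starting from the least significant one), and repeat with the quotient until it reaches zero.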

Conversion of decimal number to binary

Now, let's try a nontrivial example. Let's express a decimal number 1341 in binary notation.

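Note that the desired base is 2, so we repeatedly divide the given decimal number by 2 and keep each remainder:

1341/2 = 670 + 1/2
670/2 = 335 + 0/2
335/2 = 167 + 1/2
167/2 = 83 + 1/2
83/2 = 41 + 1/2
41/2 = 20 + 1/2
20/2 = 10 + 0/2
10/2 = 5 + 0/2
5/2 = 2 + 1/2
2/2 = 1 + 0/2
1/2 = 0 + 1/2

Reading the remainders from the last one upwards gives (10100111101)2.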

Conversion of decimal number to octal

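Now, let's express the same decimal number 1341 in octal notation.

1341/8 = 167 + 5/8
167/8 = 20 + 7/8
20/8 = 2 + 4/8
2/8 = 0 + 2/8

Reading the remainders from the last one upwards gives (2475)8.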

Conversion of decimal number to hexadecimal

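Let's express the same decimal number 1341 in hexadecimal notation.

1341/16 = 83 + 13/16
83/16 = 5 + 3/16
5/16 = 0 + 5/16

The remainder 13 is written as D, so reading the remainders from the last one upwards gives (53D)16.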
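The repeated division used for all three conversions can be written as one small function. This is a minimal sketch for non-negative integers and bases up to 16, and to_base is just an illustrative name.

```python
DIGITS = "0123456789ABCDEF"


def to_base(number, base):
    """Convert a non-negative integer to a digit string by repeated division."""
    if not 2 <= base <= len(DIGITS):
        raise ValueError("base must be between 2 and 16")
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    remainders = []
    while number > 0:
        number, remainder = divmod(number, base)
        remainders.append(DIGITS[remainder])
    # The first remainder is the least significant digit, so read them backwards.
    return "".join(reversed(remainders))


for base in (2, 8, 16):
    print(base, to_base(1341, base))
```

Python's built-in bin, oct and hex functions give the same digits with a prefix such as 0b, which makes them handy for checking the results.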

The easiest way to convert fixed point numbers to any base is to convert each part separately. We begin by separating the number into its integer and fractional parts. The integer part is converted using the remainder method, that is, by successively dividing the number by the base until a zero quotient is obtained. At each division the remainder is kept, and the new number in base r is obtained by reading the remainders from the last remainder upwards.

The fractional part is converted by successively multiplying the fraction by the base. At each step, the integer part of the product is the next digit, and the process is repeated on the remaining fraction to obtain successive significant digits. This forms the basis of the multiplication method of converting fractions between bases.
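The multiplication method for the fractional part can be sketched as follows. The function name and the max_digits cut-off are choices made for this example. Exact fractions are used instead of floating point numbers, so that a value such as 0.3 is not distorted by binary rounding before the conversion starts.

```python
from fractions import Fraction

DIGITS = "0123456789ABCDEF"


def fraction_to_base(fraction, base, max_digits=8):
    """Convert a fraction in [0, 1), given as a string like '0.3', to digits."""
    remaining = Fraction(fraction)
    if not 0 <= remaining < 1:
        raise ValueError("fraction must be at least 0 and less than 1")
    digits = []
    # Many fractions never terminate in the new base, so stop after max_digits.
    while remaining and len(digits) < max_digits:
        remaining *= base
        digit = remaining.numerator // remaining.denominator  # integer part
        digits.append(DIGITS[digit])
        remaining -= digit
    return "".join(digits)


print(fraction_to_base("0.75", 2))      # 11
print(fraction_to_base("0.3", 16, 4))   # 4CCC
```

Note that 0.3 has no finite hexadecimal representation, so the result is cut off after the requested number of digits.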

Example:

Convert the decimal number 3315 to hexadecimal notation. What about the hexadecimal equivalent of the decimal number 3315.3?

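Solution:

Integer part, by repeated division by 16:

3315/16 = 207 + 3/16
207/16 = 12 + 15/16
12/16 = 0 + 12/16

Reading the remainders from the last one upwards gives 12, 15, 3, that is (CF3)16.

Fractional part, by repeated multiplication by 16:

0.3 * 16 = 4.8, so the first digit is 4 and 0.8 remains
0.8 * 16 = 12.8, so the next digit is C and 0.8 remains again

From here on the digit C repeats forever, so (3315.3)10 = (CF3.4CCC...)16.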

2. Convert from Any Base to Decimal

Let's try to understand what a decimal number means. For example, 1234 means that there are four boxes (digits): there are 4 ones in the right-most box (least significant digit), 3 tens in the next box, 2 hundreds in the next box, and finally 1 thousand in the left-most box (most significant digit). The total is 1234,

or simply, 1*1000 + 2*100 + 3*10 + 4*1 = 1234

Thus, each digit has a value: 10^0 =1 for the least significant digit, increasing to 10^1 =10, 10^2 =100, 10^3 =1000, and so forth.

Likewise, each digit in a hexadecimal number has a value:

16^0 =1 for the least significant digit, increasing to
16^1 =16 for the next digit,
16^2 =256 for the next,
16^3 =4096 for the next, and so forth.

Thus, the hexadecimal number 1234 means that there are four boxes (digits): there are 4 ones in the right-most box (least significant digit), 3 sixteens in the next box, 2 256s in the next, and 1 4096 in the left-most box (most significant digit). The total is:

1*4096 + 2*256 + 3*16 + 4*1 = 4660

In summary, the conversion from any base to base 10 can be obtained from the formula

X10 = d(n-1)*b^(n-1) + ... + d(1)*b^1 + d(0)*b^0 + d(-1)*b^(-1) + ... + d(-m)*b^(-m)

where b is the base, d(i) is the digit at position i, m is the number of digits after the radix point, n is the number of digits of the integer part, and X10 is the resulting number in decimal. This forms the basis of the polynomial method of converting numbers from any base to decimal.
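The polynomial method translates almost word for word into code. The sketch below multiplies each digit by the matching power of the base and adds up the terms. The names to_decimal and digit_value are made up for this example, and the result is returned as a floating point number, which is exact for the short examples used here.

```python
def digit_value(char, base):
    """Numeric value of one digit character, checked against the base."""
    value = int(char, 36)  # '0'-'9' map to 0-9, 'A'-'Z' map to 10-35
    if value >= base:
        raise ValueError(f"digit {char!r} is not valid in base {base}")
    return value


def to_decimal(text, base):
    """Evaluate a digit string such as '4B3.3' written in the given base."""
    integer_part, _, fraction_part = text.partition(".")
    total = 0.0
    # Integer digits: position 0 is the digit just left of the radix point.
    for position, char in enumerate(reversed(integer_part)):
        total += digit_value(char, base) * base ** position
    # Fraction digits: weights base^-1, base^-2, and so on.
    for position, char in enumerate(fraction_part, start=1):
        total += digit_value(char, base) * base ** -position
    return total


print(to_decimal("1234", 16))   # 4660.0
print(to_decimal("1010.11", 2)) # 10.75
```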

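Example: Convert 234.14 expressed in an octal notation to decimal.

2*8^2 + 3*8^1 + 4*8^0 + 1*8^-1 + 4*8^-2 = 128 + 24 + 4 + 0.125 + 0.0625 = 156.1875

Example: Convert the hexadecimal number 4B3 to decimal notation. What about the decimal equivalent of the hexadecimal number 4B3.3?

4*16^2 + 11*16^1 + 3*16^0 = 1024 + 176 + 3 = 1203

The fractional digit adds 3*16^-1 = 0.1875, so 4B3.3 is 1203.1875 in decimal.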

Tuesday, 10 November 2015

UVM Interview Questions - 5

Q31: What is virtual sequencer and virtual sequence in UVM?

A virtual sequencer is a sequencer that is not connected to a driver itself, but contains handles to the sequencers in the testbench hierarchy. It is an optional component for running virtual sequences; optional because virtual sequences need no driver hookup and instead call other sequences, which run on real sequencers.

A virtual sequence is a sequence that controls stimulus generation across more than one sequencer, coordinating the stimulus across different interfaces and the interactions between them. It is usually the top level of the sequence hierarchy, i.e. the 'master sequence' or 'coordinator sequence'. Virtual sequences do not need their own sequencer, as they do not link directly to drivers. When they have one, it is called a virtual sequencer.

Here is a good article which explains how to use virtual sequence and virtual sequencer.

http://www.learnuvmverification.com/index.php/2016/02/23/how-virtual-sequence-works-part-1/

Q32: How does set_config_* work?

The uvm_config_db class provides a convenience interface on top of the uvm_resource_db to simplify the basic interface that is used for configuring uvm_component instances. The older set_config_* methods of uvm_component serve the same purpose; they are deprecated in UVM 1.2 in favour of uvm_config_db.

Configuration is a mechanism in UVM by which higher level components in a hierarchy can configure the variables of lower level components. Using the set_config_* methods, the user can configure integers, strings and objects of lower level components. Without this mechanism, the user would have to access the lower level component using hierarchical paths, which restricts reusability.

The set_config_* mechanism can be used only with components. Sequences and transactions cannot be configured using this mechanism. When a set_config_* method is called, the data is stored in a table, indexed by strings. There is also a global configuration table.

A higher level component can set the configuration data in the lower level component's table. It is the responsibility of the lower level component to get the data from the component table and update the appropriate fields.

The following are the methods to configure an integer, a string and an object of a uvm_object based class, respectively.

function void set_config_int (string inst_name, string field_name, uvm_bitstream_t value)

function void set_config_string (string inst_name, string field_name, string value)

function void set_config_object (string inst_name, string field_name, uvm_object value, bit clone = 1)

Q33: What are the advantages of the UVM RAL model?

The RAL (register abstraction layer) provides access to the DUT and also keeps track of the register contents of the DUT.
UVM RAL can be used to automate the creation of a high level, object oriented abstraction model of the registers and memories in the DUT.
The register layer makes the register abstraction and access to its contents independent of the bus protocol that is used to transfer data in and out of the registers inside the design.
The hierarchical model provided by RAL makes the reuse of test bench components very easy.
Changes in the initial configuration of registers or in the specification can be easily communicated to the entire environment. The RAL layer supports both front door and backdoor access. Backdoor access does not use the bus interface; rather, it uses HDL defined paths for direct communication with the device. Thus the registers of the device can be reconfigured in zero simulation time using backdoor access, and verification can be started.
One more advantage of backdoor access is that it can be used to verify whether accesses through the front door are happening correctly. To achieve this, a front door write is verified using a backdoor read.

Q34: What are the different override types?

Two types of overriding are supported by UVM:

1. Type overriding

Type overriding means that every time a component class type is created in the Testbench hierarchy, a substitute type i.e. derived class of the original component class, is created in its place. It applies to all the instances of that component type.

Syntax:

<original_type>::type_id::set_type_override(<substitute_type>::get_type(), replace);

where “replace” is a bit that, when set to 1, allows this override to replace an existing override; otherwise the existing override is honoured.

2. Instance overriding

In instance overriding, as the name indicates, only a particular instance of the component, or a set of instances, is substituted with the intended component. The instance to be substituted is specified using the UVM component hierarchy.

Syntax:

<original_type>::type_id::set_inst_override(<substitute_type>::get_type(), <path_string>);

Where “path_string” is the hierarchical path of the component instance to be replaced.

Q35: Explain end of simulation in UVM?

Different approaches to finish the UVM Test using the objection mechanism are

1. Raising & dropping objections

raise_objection() and drop_objection() are the methods to be used to do that.

2. phase_ready_to_end

phase_ready_to_end method is executed automatically by UVM once ‘all dropped’ condition is achieved during Run Phase.

3. set_drain_time

Another approach supported by UVM is setting the drain time for the simulation environment. The drain time is the extra time allocated to the UVM environment to process leftover activities, e.g. last packet analysis and comparison, after all the stimulus has been applied and processed.


Monday, 5 October 2015

IBM steps forward to replace Silicon Transistors with Carbon Nanotubes

Carbon Nanotube

The breakthrough is that IBM has improved carbon nanotube scaling below 10nm. However, before calling it a breakthrough, we should also check what other giants like Intel, AMD, TSMC and Samsung are working on. This breakthrough relates to Moore's Law. Yes, you got it right! It says that transistor counts double every 18 months or so. More than 40 years after Intel introduced the 4004 microprocessor, there is now some fear that progress will soon hit a wall.

You can refer to my post History and Evolution of Integrated Circuits where it shows clear progress of semiconductor industry.

But not to worry, IBM has developed a way that could help the semiconductor industry continue to make ever more dense chips to support Moore's law. These chips will be both faster and more power efficient.

A few highlights of carbon nanotube transistors

Carbon nanotube transistors can operate at ten nanometers
Equivalent to 10,000 times thinner than a strand of human hair
Less than half the size of today’s leading silicon technology
Could also mean wearables that attach directly to skin and internal organs

Here I have an animation for Animated Nanofactory in Action.

As devices become smaller, increased contact resistance for carbon nanotubes has hindered performance gains until now.

These results could overcome contact resistance challenges all the way to the 1.8 nanometer node – four technology generations away.

A project at IBM is now aiming to have transistors built using carbon nanotubes ready to take over from silicon transistors soon after 2020. According to the semiconductor industry’s roadmap, transistors at that point must have features as small as five nanometers to keep up with the continuous miniaturization of computer chips.

IBM has previously shown that carbon nanotube transistors can operate as excellent switches at channel dimensions of less than ten nanometers – the equivalent to 10,000 times thinner than a strand of human hair and less than half the size of today’s leading silicon technology.

IBM's new contact approach overcomes the other major hurdle in incorporating carbon nanotubes into semiconductor devices, which could result in smaller chips with greater performance and lower power consumption.

Earlier this summer, IBM unveiled the first 7 nanometer node silicon test chip, pushing the limits of silicon technologies and ensuring further innovations for IBM Systems and the IT industry.

By advancing research of carbon nanotubes to replace traditional silicon devices, IBM is paving the way for a post-silicon future and delivering on its $3 billion chip R&D investment announced in July 2014.

IBM’s chosen design uses six nanotubes lined up in parallel to make a single transistor. Each nanotube is 1.4 nanometers wide, about 30 nanometers long, and spaced roughly eight nanometers apart from its neighbors. Both ends of the six tubes are embedded into electrodes that supply current, leaving around 10 nanometers of their lengths exposed in the middle. A third electrode runs perpendicularly underneath this portion of the tubes and switches the transistor on and off to represent digital 1s and 0s.

The IBM team has tested nanotube transistors with that design, but so far it hasn't found a way to position the nanotubes closely enough together, because existing chip technology can’t work at that scale. The favored solution is to chemically label the substrate and nanotubes with compounds that would cause them to self-assemble into position. Those compounds could then be stripped away, leaving the nanotubes arranged correctly and ready to have electrodes and other circuitry added to finish a chip.