High-Performance Graphics, CPUs, FPGA Systems and More at the Hot Chips Conference

August 22nd, 2017

Although the weather in northern California has cooled down in mid-August, the upcoming Hot Chips Conference, held at the Flint Center at De Anza College in Cupertino, Calif., Aug. 20-22, promises to heat things up. Presentations will cover the latest high-performance graphics engines, compute engines, field-programmable gate array accelerators, and other application processors, including one presentation detailing a five-microwatt self-timed microcontroller powered by energy-harvesting approaches. Two tutorials on Sunday, Aug. 20, will cover the new P4 language and hardware implementation issues for Software Defined Networks in the morning, and End-to-End Autonomous Vehicle Platforms in the afternoon.

Due to the solar eclipse on Monday, Aug. 21, after the opening paper from Microsoft detailing its Scorpio processor in its forthcoming Xbox One X system, a short break will take place for eclipse viewing. After the conference resumes, AMD and Nvidia will describe their respective high-performance graphics engines, the Vega 10 and Volta. The remainder of the morning session includes a paper by SiFive detailing the company’s Freedom system-on-chip processors based on the open-source RISC-V CPU core. Following that presentation, ETA Compute will show off its self-timed ARM M3-based microcontroller that consumes just 5 microwatts, thus allowing it to be powered by various energy-harvesting power sources.

Monday afternoon will kick off with a keynote speech discussing “The Direct Human/Machine Interface and Hints of a General Artificial Intelligence” presented by Dr. Phillip Alvelda, now at Wiseteachers.com. Following his presentation are two papers covering autonomous vehicle technology, one by Renesas Electronics Corp. and the other by Swift Navigation. About eight poster papers on various topics will be hosted during the afternoon coffee break. Topics include using texture compression hardware for neural network inference; sound tracing, a real-time sound propagation hardware accelerator; a memory-efficient persistent key-value store on eNVM SSDs; accelerating big data workloads with FPGAs; Loom, a precision-exploiting neural network accelerator; Epiphany-V, a TFLOPS-scale 16 nm 1024-core 64-bit RISC array processor; a fully integrated surround vision and mirror replacement SoC for ADAS/automated driving; and GRVI Phalanx, a 1680-core, 26-Mbyte RISC-V FPGA-based parallel processor.

Closing out the day on Monday, three processor presentations from Baidu/Intel, UCSD/Cornell/University of Michigan, and ThinCI provide some insights into highly parallel solutions. The Baidu/Intel paper details a programmable FPGA accelerator that handles diverse workloads; the UCSD et al. paper details Celerity, a tiered accelerator fabric based on the open-source RISC-V processor; and the ThinCI presentation shows off a graph streaming processor the company deems a “next-generation computing architecture.”

Kicking off the second day of the Hot Chips conference will be FPGA papers by Xilinx, Altera/Intel, and a second Xilinx paper, with a paper by Amazon completing the foursome. The first Xilinx presentation details the monolithic integration of RF data converters on a programmable fabric using 16-nm FinFETs for digital-RF communications applications. Following that, Altera/Intel will show off a 14-nm heterogeneous FPGA system-in-package that forms a platform for system-level integration. The second Xilinx presentation highlights a 16-nm FPGA family that incorporates high-bandwidth memory modules and targets datacenter applications. Lastly, Amazon will show how it uses FPGAs to accelerate computing subsystems in its AWS F1 instances.

Following the morning break on Tuesday, attention turns to neural networks with presentations from Wave Computing discussing a dataflow processing chip for training deep neural networks, and another from Microsoft discussing the acceleration of persistent neural networks at the scale of datacenters. Google follows these two papers with a keynote presentation examining recent advances in artificial intelligence via machine learning and its implications for computer system design. Additional presentations on neural networks follow the lunch break with papers from Harvard/ARM Research, KAIST, and Google. Harvard/ARM will show off a deep neural network inference engine, KAIST will detail a deep neural network processor with on-chip stereo matching, and Google will provide a performance analysis of its Tensor Processing Unit for its AI algorithms.

The last sessions of the conference focus on processor architectures, including Cisco’s 400 Gbit/s multicore network processor and ARM’s DynamIQ, a processor employing cluster-based multiprocessing. Last but not least, the final four papers spotlight some extreme-performance processors: IBM will take us through its z14 microprocessor chip set, AMD will highlight its next-generation enterprise server processor architecture, Intel will explore its recently released Xeon Scalable processor (formerly Skylake-SP), and Qualcomm will dive into its Centriq 2400 processor. The Centriq processor, also known as Falkor, is based on the 64-bit ARMv8-compliant architecture and was designed for cloud-computing applications.

For program details or at-conference registration, go to hotchips.org

Dave Bursky
Semiconductor Technology Editor


Transitioning the Internet of Things to the Internet of Everything

June 8th, 2016

By Dave Bursky, Semiconductor Technology Editor, Chip Design

From voice biometrics to ubiquitous connectivity, this year’s IoT smorgasbord covered a lot of ground.

The huge growth predicted for the Internet of Things (IoT), in which every electronic device will be interconnected, can only happen if system and device suppliers overcome the many challenges and instill confidence that the devices will be secure and will interoperate.

These are just two of the many issues raised at the IoT DevCon conference that took place in Santa Clara on May 25 and 26. Many of the keynotes and technical sessions examined security issues and approaches to making the systems more secure. Additional presentations focused on defining the ways that disparate devices can intercommunicate through the use of a standard platform or gateway that can accept devices with different interfaces (WiFi, ZigBee, Bluetooth, Z-Wave, and proprietary interfaces and protocols).

Digital Uniqueness

For example, in the Wednesday morning keynote, Maarten Bron, the Director of Innovations at Underwriters Laboratories, examined the state of IoT security today and took a look at the future, where crowd-sourced testing and public ledger technology could improve security. On Wednesday afternoon, a keynote by Rod Schultz of Rubicon Labs looked at the challenges of provisioning the identity of millions of devices. To do that properly he expects systems will provide secure digital uniqueness coupled with a system or service that validates that uniqueness. Competing for attention with the Rubicon presentation, a presentation by Steven Woo of Rambus examined the trends in semiconductors and all the potential threat sources that a device can face (Figure 1).

Figure 1: From the day it is manufactured to its end-use in a product, a chip can face multiple security challenges as illustrated in this scenario suggested by Rambus.

On Thursday, an entire track from morning till evening focused on multiple aspects of security, with presentations by companies such as Silicon Labs, Barco Silex, Renesas Electronics, Infineon, aicas GmbH and Xilinx, Icon Labs, Knurld, Intel, Secure RF Corp., and still others. Presentations examined chip-level approaches, encryption options, the use of voice biometrics, and still other techniques to ensure the IoT device and the system are secure.

The many interface options that connect all the IoT devices to the gateway were the focus of Wednesday’s panel and of an all-day track of presentations dealing with Connectivity, Protocols and Standards as well as the design of gateways. A presentation by Ericsson, for example, examined running Internet protocol on IoT devices to provide ubiquitous connectivity using standard protocols. A speaker from Infiswift offered an in-depth overview of multiple low-power wide area network technologies to help designers select the best connectivity option for their IoT application. A related presentation by Silicon Labs examined various wireless protocols to best fit a connectivity option to an application. The issue of interoperability and the use of standards was examined by a presenter from Real-Time Innovations.

The design challenges of IoT gateways were also a key theme discussed on Thursday, with presenters from Mentor Graphics, PTC Inc., Dell Computers, PrismTech, and ARM examining different aspects of gateway design. For example, a secure converged reference design for an IoT gateway was the focus of Mentor’s presentation, while the presenter from Dell examined the performance of IoT gateways.

Thus, by creating an open platform that can handle multiple wireless communication interfaces, designers can achieve a high degree of interoperability while maintaining secure communications from multiple end-point devices, through the gateways and on to the host system.

EDA Community Honors Lucio Lanza with Phil Kaufman Award

November 14th, 2014

Over the many years I have spent as an editor, there are a few people who stand out for their creativity and impact on the semiconductor industry. One such person, Dr. Lucio Lanza, more than qualifies as one of those standouts. He was just honored by the Electronic Design Automation community with the 2014 Phil Kaufman Award for his contributions to the EDA industry. The award is named in honor of Phil Kaufman, a former Intel employee who actually worked side-by-side with Lucio in the late 1970s and went on to become CEO of Quickturn Systems and to found Silicon Compilers, where he was the chairman and president. Unfortunately, during one of his overseas business trips for Silicon Compilers he died of a sudden heart attack.

Lucio’s career started in 1968 as an engineer at Olivetti in Italy, where he was responsible for processor architecture and design. In 1977 he joined Intel Corp. and moved to California, where he rose to Chairman of the Microprocessor Strategic Business Segment. During his time at Intel he worked closely with Phil Kaufman, who managed the graphics and communications product lines. While at Intel, Kaufman was a driving force behind the IEEE Ethernet standard and played a large role in developing the IEEE Floating-Point Standard.

While at Intel, Lucio saw the need for design tools that would help design productivity keep pace with the increasing complexity of the chips that designers had to create. Keeping an eye out for trends, Lucio also saw opportunities in Ethernet communications and invested in Crescendo Communications, a successful venture that was eventually purchased by Cisco, which had great success with its Catalyst series of network switches that emerged from the acquisition.

The need to fill gaps in the design chain gave Lucio the impetus to leave Intel in 1983, after which he joined Daisy Systems as VP of Marketing and became the general manager of the company’s EDA division. After parting ways with Daisy in 1986, he started EDA Systems, a company that would fill the gap he perceived by creating a tool framework that eased the integration of third-party tools into a unified environment. That company was acquired by Digital Equipment Corp., and following the acquisition he connected with Cadence Design Systems, serving as a consultant and guiding the company through 13 acquisitions of other tool vendors.

In 1995 he parted with Cadence and helped start PDF Solutions and other familiar organizations such as Sandcraft (a MIPS IP supplier) and Forte Design Systems. (Forte was recently acquired by Cadence.) Around that time he also joined U.S. Venture Partners, a venture capital company, and struck out on his own, spending about five years as an independent consultant to companies in the semiconductor, communications, and EDA industries.

In 2001 he decided to switch full time to the investment side and started his own company, Lanza TechVentures, an early-stage venture capital and investment company, and since 2008 he has been a general partner and chief technology strategist for Radnorwood Capital LLC, an investor in public technology companies. Never one to sit still, Lucio also joined up with ARM Holdings and became a non-executive director at the company. He resigned his post at ARM in 2010 to pursue still other investment opportunities, some in non-electronic areas such as pharmaceuticals and health-related products, as well as automotive systems, analog design tools, and the Internet of Things.

Lucio doesn’t see any end in sight for the opportunities and even has some words of advice to startups – be sure you are working on something that has the potential to change the world, and be intellectually honest (don’t lie to yourself), analyze your strategy and rethink it regularly. The CEO job is the loneliest job and you can’t show “feels” – you have to demonstrate concise thinking based on market facts. These words of advice from Lucio echo how he has driven his career over the years in developing new companies and investing in many others.

Again, I want to congratulate Lucio on winning the Kaufman award and wish him continued success in his future endeavors.

Dave Bursky
Semiconductor Technology Editor

Server system-on-chips pack up to 48 64-bit ARM cores

June 18th, 2014

Targeting secure cloud servers, storage servers, compute servers, and data-plane applications, the ThunderX series of multicore SoCs delivers power-efficient computing solutions.

Dave Bursky
Semiconductor Technology Editor

Multicore processors based on x86 cores are a very common choice for servers and for handling packets in data-networking applications. Although x86-based servers command most of the IT market, other processors such as MIPS and PowerPC are key players in the deeply embedded applications such as network switches and routers, handling both data plane and control plane functions. ARM processors have started to make inroads in the server market, and with the release of the A57 64-bit core, the ARM processors are poised to make significant inroads into all the applications that are currently employing the x86, MIPS, and PowerPC cores.

One example of that opportunity takes aim at low-power servers and secure network communications — the just-released ThunderX series of multicore processors from Cavium. This family includes versions containing from 8 to 48 customized ARM 64-bit processor cores that can operate at up to 2.5 GHz. There will actually be four families of processors in the ThunderX series–each optimized for a different type of workload. The ThunderX_SC is targeted at security applications, the ThunderX_ST for storage control and management, the ThunderX_NT for networking systems, and the ThunderX_CP for computational applications.

Implemented in a low-power 28-nm process, the basic ThunderX architecture brings together up to 48 full custom 64-bit processor cores that are fully compliant with the ARMv8 architecture specification and ARM’s Server Base System Architecture (SBSA). Included on each multi-core chip are a cache subsystem (each processor has level 1 instruction and data caches, and all processors share an L2 cache), Ethernet interfaces capable of 10/40/100 Gbit/s data rates, multiple PCIe gen3 and SATA v3 interfaces, up to four DDR3/4 memory controllers, additional I/O ports, and various accelerators depending on the market segment the processor is optimized to tackle (see the figure).

Members of the ThunderX family from Cavium contain up to 48 ARM64 processor cores, application-specific hardware accelerators, high-speed Ethernet ports, both PCIe gen3 and SATA v3 ports and many other system support features to support Compute, Storage, Networking, and Secure Computing applications.






For example, the ThunderX_SC family is optimized for Secure Web frontend, security appliances and Cloud RAN type workloads. It includes specialized hardware accelerators consisting of Cavium’s 4th generation NITROX and TurboDPI technology with acceleration for IPSec, SSL, anti-virus, anti-malware, firewall and DPI. The NITROX engine can deliver 50 Mbps to 40 Gbps of encryption bandwidth with 1K to 200K RSA/DH operations per second. Additionally, the TurboDPI block employs the company’s Uniscan technology that simultaneously blocks malicious or inappropriate URLs, identifies hundreds of widely used protocols and applications, helps block thousands of different intrusion attempts and locates over a hundred thousand varieties of virus and malware threats, all with just a single scan of the data stream.

Also integrated on the ThunderX_SC are multiple 10/40 Gbit/s Ethernet ports, multiple PCIe Gen3 and SATA 3 ports, up to four high-memory-bandwidth DDR3 or DDR4 72-bit memory controllers able to support 2400 MHz memories, a cache-coherent interconnect across dual sockets thanks to the Cavium Coherent Processor Interconnect, and a scalable fabric for east-west as well as north-south traffic connectivity. Most of these features are also available on the other ThunderX families along with accelerators for each target application segment: the ST series includes storage accelerators for data protection, data integrity, security and compression, as well as efficient user-to-user data movement; the CP series includes core-to-I/O virtualization in hardware; and the NT series processors include full virtualization support and network accelerators for QoS, traffic shaping, tunnel termination, high packet-throughput processing, network virtualization, and data monitoring.

Interesting product developments at DAC

June 4th, 2014

Dave Bursky

Many interesting IP and design verification announcements ran through this year’s Design Automation Conference. Several IP announcements from CAST Inc., for example, offer solutions in video decoding, graphics acceleration, and image decoding. Although developed by the Fraunhofer Heinrich Hertz Institute, an H.265 HEVC decoder core is now available from CAST, and it is the first in a series of high-efficiency video coding cores that CAST will offer. The core implements the MPI-D main profile intra HEVC decoding and will be available in the third quarter of this year. The decoder design makes clever use of internal and external memory, and its application-specific internal memory architecture enables the core to reuse already fetched data, thus reducing the number of memory fetches. Fewer fetches give more memory bus bandwidth back to the CPU, while at the same time reducing the power needed by the core.

Another core offered by CAST, developed by IP partner Think Silicon, saves energy in graphics applications by offloading a GPU or a CPU that does not include GPU support. The Think2.5D graphics accelerator is a rendering engine that accelerates two-dimensional graphics functions and pseudo three-dimensional effects such as reflected and shadowed icons. The engine significantly offloads a system’s GPU, performing the calculations at a reduced power level. And for systems without a GPU, the core can offload the host CPU and accelerate the calculations, providing a “snappier” feel to the screen operations, and at lower power consumption levels. Also available from CAST is a graphics processing unit that was also created by Think Silicon. The ThinkVG core supports the Khronos Group OpenVG 1.1 standard, and CAST claims it is one of the smallest and lowest power GPU cores available. Inside the core is a floating-point SIMD streaming engine specifically designed for graphics applications (Vshader) plus graphics accelerators for the blending, rasterization, and texture-mapping functions.

For still images, Alma Technologies, another CAST partner, developed a 12-bit extended-resolution JPEG decoder, the JPEG-D-X, that CAST supports. The core supports applications requiring images with greater dynamic range, such as in medical imaging and machine vision. It is able to decode static images or motion JPEG streams compressed in Baseline or Extended JPEG formats with 8 or 12 bits per sample precision. The decoder complements the company’s previously released 12-bit JPEG encoder, and provides efficient, low-latency decompression of deep-color images and video with a tiny silicon footprint and low power consumption.

It’s not often tool vendors will offer a free version of one of their new tools, but Agnisys has done just that, offering free versions of DVinsight, a correct-by-construction tool for design and verification applications. The tool is an integrated development environment for the development of Universal Verification Methodology (UVM) based SystemVerilog (SV) design verification (DV) code. DVinsight ensures compliance with best practices in using UVM while adhering to established standards. The tool provides on-the-fly checks and guides for creating SV/UVM code, provides auto code completion and context-based hints, and includes many built-in rules to ensure correct-by-construction DV code development.

Another newcomer in the DV space is SmartDV North America, the U.S. arm of SmartDV Technologies India Private Ltd. The company provides well-supported verification IP blocks that include compliance test suites and complete functional coverage models that help accelerate time to market. The verification models are generated by the company’s internally developed compiler technology, which allows the company to rapidly generate the verification IP and to tweak it in days rather than weeks if customers need any customization or a bug must be corrected. Also offering verification IP, TrueChip provides support for USB 3 and various versions of ARM’s AMBA bus, and will shortly have the new USB 3.1 verification IP.

Additional DAC product updates will appear in the next column.

From 3-D transistors to 2.5D or 3D systems

December 31st, 2013

From the ultra-small 3D transistors described in papers at this month’s International Electron Devices Meeting (IEDM) in Washington, D.C., to the 2.5D and 3D multichip structures described at the 3D Architectures for Semiconductor Integration and Packaging (ASIP) conference held in Burlingame, Calif., designers are finding more ways to pack more transistors on a chip and to pack more functions into a limited area on a printed-circuit board. For instance, at IEDM, TSMC’s Shien-Yang Wu and his team of researchers described a 16-nm FinFET process in paper 9.1 that they feel is one of the world’s most advanced semiconductor technologies.

The process is the first integrated technology platform to be announced below the 20 nm node, with key capabilities that include a 48-nm fin pitch and the smallest SRAM cell ever incorporated into an integrated process: a 128-Mb SRAM with a cell area of just 0.07 µm² per bit. The process’ short-channel effects were well controlled, with DIBL <30 mV/V, saturation current of 520/525 µA/µm at 0.75 V (NMOS and PMOS, respectively) and an off-current of 30 pA/µm. Depending on the designer’s goal, the process delivers either a 35% speed gain or a 55% power reduction in comparison with TSMC’s existing 28-nm high-k/metal-gate planar process, and with twice the transistor density (Figure 1).

Figure 1: The 16 nm process platform developed by TSMC allows designers to get 55% reduction in operating power or a 35% improvement in operating speed vs the company’s established 28 nm high-K/metal-gate process.
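
To put the SRAM figures in perspective, the short Python sketch below multiplies the quoted cell area by the quoted capacity. It is only a back-of-the-envelope estimate: it assumes 1 Mbit equals 2**20 bits and counts the bit cells alone, ignoring sense amplifiers, decoders, redundancy, and routing, none of which the paper summary quantifies. The function and constant names are illustrative.

```python
# Figures quoted in the article for TSMC's 16-nm FinFET SRAM.
CELL_AREA_UM2 = 0.07   # area of one bit cell, in square micrometres
CAPACITY_MBIT = 128    # array capacity, in megabits


def sram_cell_area_mm2(capacity_mbit, cell_area_um2, bits_per_mbit=2**20):
    """Return the area of the bit cells alone, in mm^2.

    Assumptions (not from the article): 1 Mbit = 2**20 bits, and no
    allowance for peripheral circuits, so a real macro would be larger.
    """
    bits = capacity_mbit * bits_per_mbit
    area_um2 = bits * cell_area_um2
    return area_um2 / 1e6  # 1 mm^2 = 1e6 um^2


if __name__ == "__main__":
    area = sram_cell_area_mm2(CAPACITY_MBIT, CELL_AREA_UM2)
    print(f"Bit cells alone occupy about {area:.1f} mm^2")
```

Under these assumptions the cells of the full 128-Mb array fit in under 10 mm² of silicon.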

Creation of a “superchip” was the goal of researchers at the New Industry Creation Hatchery Center at Tohoku University in Sendai, Japan. The heterogeneous 3D integration described by Professor Mitsumasa Koyanagi in a plenary presentation at IEDM allows various kinds of device chips with different sizes, different functions, and different materials to be stacked to form the superchip. A key technology developed to achieve this consists of self-assembly and electrostatic (SAE) temporary bonding. To demonstrate the technology, the university fabricated several prototype superchips. Examples include stacking MEMS chips, spin memory chips and a photonic device chip on a CMOS logic chip; a 3D back-illuminated image sensor with through-silicon vias stacked on top of an image processing chip; and a 3D microprocessor with self-test and self-repair functions.

The assembly process to create superchips starts with known good die (KGD) that are sorted from several device wafers and simultaneously bonded as a batch onto a carrier wafer (Figure 2, left). High alignment accuracy is achieved using the self-assembly and electrostatic bonding. The process repeats with additional carrier wafers. Multiple carrier wafers with the KGDs are then stacked onto a target interposer wafer. This allows multiple superchips to simultaneously be fabricated. The surface tension of liquid is used in the self-assembly scheme to simultaneously align many dies in parallel. Hydrophilic areas and hydrophobic areas are formed on the surface of the wafer or chip to obtain high alignment accuracy. As many as 500 chips have been simultaneously aligned with an average alignment accuracy of 0.05 µm within 0.1 seconds.

Figure 2: The assembly process to create superchips starts with known good die (KGD) that are sorted from several device wafers and simultaneously bonded as a batch onto a carrier wafer (left). The process repeats with additional carrier wafers, and multiple carrier wafers with the KGDs are then stacked onto a target interposer wafer using electrostatic bonding and debonding (right).


The electrostatic temporary-bonding and de-bonding method for assembly of the multiple carrier wafers allows the stacking integration of multiple chips (Figure 2, right). Many chips are simultaneously bonded onto the electrostatic carrier wafer (e-carrier) by the electrostatic force after the simultaneous alignment by self-assembly. The electrostatic force for temporary bonding is generated by applying a high voltage to the electrodes embedded in the e-carrier wafer. A high voltage with opposite polarity is applied to the electrodes for de-bonding the chips.

These two presentations are just the proverbial tip of the iceberg representing several hundred paper presentations at IEDM that covered process and manufacturing, memory technology, nano-device technology, power and compound semiconductors, advanced CMOS technology, and many other subjects. For more information, go to www.ieee-iedm.org.

Running concurrently with IEDM but on the opposite coast, the 3D-ASIP Conference in Burlingame delved into many aspects of 2.5D and 3D integration, ranging from basic integration, to various interposer technologies, to wafer handling and thermal challenges, to name a few. Many of the presentations examined the evolution of assembly techniques to move from 2D to 2.5D to true 3D implementations. Doug Yu, the Director of the Integrated Interconnect and Packaging Division at TSMC, provided an overview of wafer-level system integration technology, while Robert Patti, the CTO of Tezzaron Semiconductor, described a combination of dis-integration and then integration to create a high-density and high-performance memory stack (See “Advances in DRAM and non-volatile memories keep upping system performance”, Aug. 26, 2013). The architecture of the memory array provides 256 independent channels, each containing 256 Mbits of storage and capable of transferring data at 64 Gbits/s with a latency of just 9 ns.

Another memory presentation by Eric Beyne, the Program Director for 3D System Design at IMEC, examined high-bandwidth memory-logic 3D integration by either direct stacking or the use of interposers. One of the key aspects of leveraging 3D integration is to reduce the power consumption of the chip-to-chip interconnects by lowering the voltage swing, widening the I/O to lower the transfer frequency, and using vertical interconnects in a chip stack to reduce the wiring length (Figure 3). Using 3D through-silicon vias and microbump interconnects, designers at IMEC were able to assemble high-density chip stacks, but encountered issues with the increased power density. The high power density can result in thermal issues (increased temperatures), and higher temperatures could affect DRAM data retention since retention time decreases as temperature increases.

Figure 3: Multiple factors such as the I/O width, the load capacitance, the transfer frequency, and the operating voltage must be taken into account when estimating the power consumed in chip-to-chip interconnects. (Source, IMEC).
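
The factors listed in the caption combine in the standard dynamic-switching estimate, power ≈ N × a × C × V² × f, where N is the I/O width, a the switching activity, C the load capacitance per line, V the voltage swing, and f the transfer frequency. The Python sketch below applies that textbook formula to two links of equal bandwidth. It is not IMEC's model: every number is a hypothetical value chosen for illustration, the activity factor of 0.5 is an assumption, and the swing is assumed equal to the supply voltage.

```python
def interconnect_power_watts(io_width, load_capacitance_f, voltage_v,
                             transfer_frequency_hz, activity=0.5):
    """Dynamic switching power of a chip-to-chip link: N * a * C * V^2 * f.

    Assumes the signal swing equals the supply voltage and that a fraction
    `activity` of the lines toggles each cycle (both are assumptions).
    """
    return (io_width * activity * load_capacitance_f
            * voltage_v ** 2 * transfer_frequency_hz)


def bandwidth_bits_per_s(io_width, transfer_frequency_hz):
    """Raw link bandwidth, one bit per line per transfer."""
    return io_width * transfer_frequency_hz


# Hypothetical narrow, fast, board-level link.
NARROW = dict(io_width=32, load_capacitance_f=2e-12,
              voltage_v=1.2, transfer_frequency_hz=1.6e9)
# Hypothetical wide, slow, low-swing link over short vertical interconnects.
WIDE = dict(io_width=1024, load_capacitance_f=0.1e-12,
            voltage_v=0.6, transfer_frequency_hz=50e6)

if __name__ == "__main__":
    for name, link in (("narrow", NARROW), ("wide", WIDE)):
        bw = bandwidth_bits_per_s(link["io_width"],
                                  link["transfer_frequency_hz"])
        power = interconnect_power_watts(**link)
        print(f"{name}: {bw / 1e9:.1f} Gbit/s at {power * 1e3:.2f} mW")
```

With these illustrative values both links carry the same raw bandwidth, yet the wide, slow, low-voltage link burns a small fraction of the power, which is the trade-off the presentation describes.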

Just such thermal issues were discussed by Joseph Maurer, a support contractor to DARPA, and by Muhannad Bakir, Associate Professor, School of Electrical and Computer Engineering at the Georgia Institute of Technology. At DARPA, Maurer described multiple projects aimed at pulling out the heat and improving thermal conductivity. Techniques such as the use of copper nanosprings; near-junction thermal transport with liquid cooling and high-thermal conductivity diamond substrates; the use of a 3D vapor chamber with vibrating elements; the use of thin-film superlattice materials; and still other approaches are all being explored. Examining the use of microfluidic cooling on 3D ICs, Bakir showed a potential solution using coolant cycled through a multichip stack composed of two processor layers and a memory stack (Figure 4). With such a stack there are concerns about the reliability of the circulation system used for the microfluidic cooling, as well as the endurance of the TSVs since they are under pressure from the liquid flowing between the layers.

Figure 4: Microfluidic cooling between layers of chips can pull out the heat, but there are concerns about the reliability of the microfluidic I/O technology as well as power-supply noise and the durability of TSVs due to the pressure of the liquid coolant as it flows through the package. (Diagram courtesy of Georgia Tech).


These papers were just a few of the presentations at the 3D-ASIP conference. For more details, go to www.3dasip.org to view the program or purchase the proceedings.

Dave Bursky
Semiconductor Technology Editor

Advances in CPUs, System Architecture, Heat Up the Performance Race at Hot Chips

September 14th, 2013

Celebrating its 25th year, the annual Hot Chips Conference held at Stanford University last month lived up to its reputation for highlighting high-performance processor solutions as well as advances in low-power designs for mobile applications. One of the highest performance processors unveiled at the conference was the Power 8, the next-generation Power series processor developed by IBM. The processor chip was fabricated using a 22 nm silicon-on-insulator process that allows designers to pack 12 CPU cores, lots of level-2 cache (512 kbytes 8-way per core), a large embedded-DRAM level-3 shared cache (96 Mbytes, 12 x 8 Mbytes 8-way bank), PCIe Gen 3 ports capable of 8 Gbits/s transfers, eight DRAM ports, and many other sub-processors to offload the CPU and manage power (Figure 1). All these functions add up to about 5 billion transistors, all squeezed into a 650 mm² chip with 15 layers of metal that can be clocked at 4 GHz and consumes over 300 W.


Figure 1: Containing about 5 billion transistors, the Power8 CPU developed by IBM combines 12 processor cores, 96 Mbytes of embedded DRAM for a level 3 on-chip cache, dual gen 3 PCIe ports, and eight memory ports. A separate memory buffer chip that connects to each memory port handles four channels of DDR3 memory.


The high throughput of the multicore processor posed another challenge: transferring data from or to the memory at a rate that won’t starve the processor cores. To solve that problem, along with the processor chip, IBM designers crafted a memory buffer chip dubbed Centaur that packs 16 Mbytes of cache and provides a 9.6 Gbyte/s high-speed interface to the processor. The Power 8 processor can support up to eight of the Centaur memory buffers to achieve a sustained transfer rate of 230 Gbytes/s between the buffers and the processors. Each buffer chip has four DDR3 memory channels that yield a peak throughput of 410 Gbytes/s when all 32 memory channels are transferring data. A fully configured processor socket can address up to 1 Tbyte.
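
The channel count and peak figure above can be cross-checked with a few lines of Python. The sketch divides the quoted 410 Gbytes/s across the 32 channels and compares the result with the peak rate of a 64-bit DDR3-1600 channel. The DDR3-1600 speed grade is an assumption made here for comparison only; the presentation summary does not state which speed grade was used.

```python
# Figures quoted in the article for a fully configured Power 8 socket.
BUFFERS_PER_SOCKET = 8
CHANNELS_PER_BUFFER = 4
PEAK_GBYTES_PER_S = 410
SUSTAINED_GBYTES_PER_S = 230


def total_channels(buffers, channels_per_buffer):
    """Number of DDR3 channels behind one processor socket."""
    return buffers * channels_per_buffer


def ddr_channel_peak_gbytes_per_s(mega_transfers_per_s, bus_bytes=8):
    """Peak rate of one DDR channel: transfers per second times bus width."""
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9


if __name__ == "__main__":
    channels = total_channels(BUFFERS_PER_SOCKET, CHANNELS_PER_BUFFER)
    per_channel = PEAK_GBYTES_PER_S / channels
    # Assumption: DDR3-1600 on a 64-bit (8-byte) data bus.
    ddr3_1600 = ddr_channel_peak_gbytes_per_s(1600)
    print(f"{channels} channels, {per_channel:.2f} Gbytes/s each at peak")
    print(f"DDR3-1600 channel peak: {ddr3_1600:.1f} Gbytes/s")
    print(f"Sustained per buffer: "
          f"{SUSTAINED_GBYTES_PER_S / BUFFERS_PER_SOCKET:.2f} Gbytes/s")
```

The per-channel share of the quoted peak lands very close to the DDR3-1600 figure, so the 410 Gbytes/s number is at least consistent with that speed grade.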

Additional high-performance CPUs based on the SPARC architecture included the SPARC 64X+, a next-generation processor targeted at UNIX servers developed by Fujitsu, and the M6 SPARC processor for enterprise systems developed by Oracle. A companion ASIC to the M6, the Sixby, is a scalability and coherency directory chip to support the company’s highly-scalable enterprise systems.

Fujitsu’s design packs 16 processor cores, each capable of running two threads, a shared L2 cache of 24 Mbytes, dual DDR3 memory interface controllers, several hardware-based software accelerators for cipher, database, and decimal calculations, dual PCI gen3 I/O controllers, and many other enhancements. The SPARC 64X+ processor was fabricated on a 28 nm CMOS process, contains almost 3 billion transistors, has 1500 signal pins, and can clock at over 3.5 GHz. A single processor chip can deliver a peak performance of over 448 GFLOPS and has a memory throughput of 102 Gbytes/s. Designed for use in systems with 1 to 64 CPU sockets, the processors incorporate a high-speed interconnect that can transfer data at up to 25 Gbits/s per lane directly between CPU sockets (about 70% faster than the company’s previous SPARC 64X CPU). A basic system building block comprises four CPU sockets and two crossbar chips, and a full 64-socket system would contain 16 building-block modules capable of executing 2048 program threads.
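
The 2048-thread figure follows directly from the building-block numbers given above. A minimal sketch of that arithmetic, using constant names made up for this example:

```python
# Figures quoted in the article for a SPARC 64X+ based system.
CORES_PER_CHIP = 16
THREADS_PER_CORE = 2
SOCKETS_PER_BLOCK = 4       # CPU sockets in one building block
CROSSBARS_PER_BLOCK = 2     # crossbar chips in one building block
BLOCKS_PER_SYSTEM = 16      # building blocks in a full system

sockets = SOCKETS_PER_BLOCK * BLOCKS_PER_SYSTEM
crossbars = CROSSBARS_PER_BLOCK * BLOCKS_PER_SYSTEM
threads_per_chip = CORES_PER_CHIP * THREADS_PER_CORE
system_threads = sockets * threads_per_chip

print(f"CPU sockets: {sockets}")
print(f"Crossbar chips: {crossbars}")
print(f"Threads per chip: {threads_per_chip}")
print(f"Threads per system: {system_threads}")
```

Sixteen blocks of four sockets give the 64-socket maximum, and 64 chips at 32 threads each give the 2048 threads cited.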

Similar in some features to the Fujitsu processor, the Oracle M6 is also fabricated in a 28 nm process, packs 12 processor cores, and integrates two 8-lane PCIe gen 3 ports. However, the M6 will be capable of executing eight program threads per core vs the two threads per core of the SPARC 64X+, and incorporates 48 Mbytes of L3 cache vs no L3 cache on the Fujitsu processor. Four DDR3 memory schedulers, each capable of handling four memory channels, provide a total of 16 DDR channels that can address a total of 1 Tbyte per CPU socket (Figure 2).

Figure 2: The Oracle M6 processor packs 12 processor cores, each capable of executing eight program threads, 48 Mbytes of L3 cache, a pair of 8-lane PCIe gen 3 ports, and four DDR3 memory schedulers that together can address a total of 1 Tbyte per CPU socket.

In the large systems targeted by the Power8 and the SPARC processors, the CPUs are only a small part of the overall system. At the other extreme, however, in handheld, laptop and desktop computers, a system-on-a-chip solution is typically the main component. One example of that, the Kabini processor developed by AMD, packs four Jaguar CPU cores, a high-performance graphics and multimedia engine with Display Port, HDMI, and VGA outputs, SATA, USB, PCIe interfaces, advanced power management, and still other functions (Figure 3). Implemented in a 28 nm process, the chip is only 105 mm², or one-sixth the area and transistor count of the IBM Power8 and close to 1/30th the power. The four processor cores share a 2-Mbyte level-2 cache and are fed through a 64-bit DDR3 memory interface that can transfer data at up to 10.3 Gbytes/s with DDR3-1600 memory DIMMs. Additionally, the graphics/multimedia core is based on the company’s Radeon HD8000 Graphics Core Next (GCN) architecture that can handle 4k by 2k resolution. The core includes a video codec engine that can encode H.264 streams, and a universal video decoder that handles over half-a-dozen codec formats.

Figure 3: A highly-integrated system-on-a-chip solution for notebook and other portable computing systems, the Kabini processor packs four Jaguar CPU cores that share a 2 Mbyte level-2 cache and a 64-bit DDR3 memory interface. Included on the chip is a high-performance graphics/multimedia processor based on the HD8000 Radeon graphics core.


Making its move into the low-power handheld systems arena, Intel showed off its Clovertrail+, an SoC targeted at smartphones that is a significant upgrade over Intel’s Medfield-based smartphone solutions. Implemented in a 32-nm high-k metal-gate process, the Clovertrail+ (Atom Z2580) contains dual Atom CPU cores, a dual-core 2D/3D graphics processor, a multi-standard video decoder capable of handling 1080p/60 Hz video, a video encoder able to encode 1080p/30 Hz video, a camera/imaging subsystem based on a programmable very-long-instruction-word SIMD vector processor, a security engine, and many other control, power management, and system interface support functions (Figure 4).

Figure 4: Taking aim at next-generation smart phones, the Clovertrail+ system-on-a-chip platform developed by Intel employs two dual-thread Atom processor cores that can run at up to 2 GHz. The chip contains a 2D/3D dual core graphics engine as well as dedicated video decode and encode blocks, an image signal processor, a crypto engine, and many other interface and control blocks.


Each Atom core has a 512-kbyte L2 cache, can execute two program threads, and can run at a top speed of 2 GHz. The processor’s memory interface can support up to 2 Gbytes of low-power DDR2 (533 MHz) and address up to 256 Gbytes over an eMMC 4.41 interface. The enhancements in the Clovertrail+ processor yield a doubling of overall performance vs the Medfield processor and up to a 3X improvement in graphics performance.

Also focusing on portable system solutions, the just-unveiled Richland processor detailed by AMD incorporates the company’s Turbo-Core temperature-smart technology to manage core performance and power consumption (Figure 5). The Turbo-Core technology is designed to more effectively exploit temperature margins by detecting favorable thermal conditions in real time and adjusting operating voltage and frequency.

Figure 5: The Turbo-Core temperature-smart technology developed by AMD measures core temperature and performs various calculations to determine the optimum operating frequency and voltage for the CPU and graphics cores on the Richland processor chip.
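
The general idea of temperature-aware frequency and voltage selection can be sketched in a few lines of Python. This is not AMD's algorithm: the operating-point table, the headroom thresholds, and the function name `select_operating_point` are all hypothetical, chosen only to illustrate picking a faster setting when thermal margin is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    freq_ghz: float
    voltage_v: float
    min_headroom_c: float  # thermal margin required to run at this point


# Hypothetical table, ordered fastest first; all values are illustrative.
OPERATING_POINTS = [
    OperatingPoint(3.5, 1.25, 20.0),
    OperatingPoint(3.0, 1.15, 10.0),
    OperatingPoint(2.5, 1.05, 0.0),
]


def select_operating_point(core_temp_c, temp_limit_c, points=OPERATING_POINTS):
    """Return the fastest point whose headroom requirement is satisfied."""
    headroom = temp_limit_c - core_temp_c
    for point in points:
        if headroom >= point.min_headroom_c:
            return point
    # No margin left (core is above the limit): use the slowest point.
    return points[-1]


for temp in (60.0, 85.0, 98.0):
    chosen = select_operating_point(temp, temp_limit_c=100.0)
    print(f"{temp:5.1f} C -> {chosen.freq_ghz} GHz at {chosen.voltage_v} V")
```

A real controller would also have to account for power limits, sensor filtering, and hysteresis, none of which is modeled in this sketch.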


The processor contains two dual-core CPU modules, each based on AMD’s recently released Piledriver CPU core. Each module has a 2 Mbyte L2 cache that is shared across the two CPU cores in the module. Like the Kabini processor, the Richland also incorporates an HD8000 series graphics processing unit and multimedia accelerators to offload the CPUs. The chip delivers up to 29% higher CPU performance and 41% higher GPU performance than the company’s previous generation solution while keeping power consumption down, allowing systems to deliver 10 or more hours of idle operation, or over five hours of video playback. Also supported by the processor chip is AMD’s wireless display for Windows 8.1, a low-latency wireless interface that can stream HD video at 1080p/60 Hz along with rich audio playback.

With low power consumption on nearly every processor designer’s mind, a new body bias technology used in conjunction with deeply-depleted-channel (DDC) transistors developed by SuVolta promises to reduce processor power consumption by as much as 50%. Implemented in an ARM Cortex M0 processor to validate the concepts, the body bias network requires minimal routing resources and the DDC transistor performance can be optimized to reduce leakage current for high-performance designs, or reduce active power for low-voltage threshold devices, thus permitting the processor designers to optimize the performance and power. The M0 processor implemented with 65 nm design rules achieved comparable performance at half the power vs a standard 65 nm process, or delivered 35% better performance at comparable power levels.

These are only a few of the many presentations at the Hot Chips conference. To view the full conference program, go to www.hotchips.org.

Dave Bursky
Semiconductor Technology Editor
Chip Design Magazine

Advances in DRAM and non-volatile memories keep upping system performance

August 26th, 2013

In the drive to improve system performance, faster processors often end
up spotlighting system bottlenecks, especially in the memory subsystem. To
reduce those bottlenecks, designers are developing faster-accessing memories,
faster interfaces with reduced overheads, and even new memory architectures and
technologies. At this month’s MemCon conference in Santa Clara, Calif., presentations
highlighted many of the developments in storage subsystems and devices that
promise to improve memory subsystem performance. Memory interfaces such as DDR3
will soon give way to DDR4 and the low-power DDR3 will give way to LPDDR4,
while new interfaces such as HMC (hybrid memory cube), Wide I/O 2, eMMC 5.0, and
NVMe are gaining acceptance for future system designs.

When designers try to start implementing systems based on these new
standards, having functional models that can link into the memory subsystem
helps verify system designs before any hardware gets built. Released at the
conference, new verification IP models developed by Cadence Design Systems for
DDR4, LPDDR4, Wide I/O 2, eMMC 5.0, and HMC allow designers to check out their
designs, while providing them with trace debugging, address scrambling, and
backdoor memory access. The models support all leading third-party simulators,
verification languages, and methodologies, thus enabling SoC designers to
verify the correctness of the interfaces to the specialized memories.

New memory architectures, such as the Dis-integrated 3D RAM developed
by Tezzaron Semiconductor (Figure 1a) and the Hybrid Memory Cube developed by
Micron Technology (Figure 1b) in conjunction with Samsung, SK Hynix, Open
Silicon, IBM, ARM, Altera, and Xilinx, promise to provide much higher bandwidth
– in the case of the HMC module, a bandwidth of 160 Gbytes/s, which is a 15X
boost in memory bandwidth over a DDR3 memory module, while Tezzaron is
projecting a data bandwidth of 16 Tbits/s for its novel memory structure.


Figure 1a: The high-density memory subsystem proposed by Tezzaron consists of multiple layers of DRAM memory cells and access transistors. These layers sit on top of a layer of sense amplifiers and control logic, which, in turn, sits on top of another chip that contains the I/O circuits.

Figure 1b: The hybrid memory cube developed at Micron also consists of multiple thinned chips that each contain multiple blocks of DRAM cells. The multiple chips are stacked using through-silicon vias and sit on top of a logic chip that controls access to all the memories and the I/O operations.


Both the Tezzaron and Micron solutions have a somewhat similar approach
– multiple layers of thinned DRAM storage chips all interconnected and then connected
to a lower layer or two that contains the control logic and I/O. In Tezzaron’s
concept, there are 256 independent memory channels with each channel containing
256 Mbits of storage and delivering a 64 Gbit/s bandwidth. This gives the
DiRAM4 stack the capability of delivering 21 billion transactions per second.
Micron’s hybrid memory cube consists of a single package containing multiple
memory die and one logic die, stacked together and interconnected using
through-silicon via (TSV) technology. Within an HMC, memory is organized into
vaults. Each vault is functionally and operationally independent. Each vault
has a memory controller in the logic base (called a vault controller) that manages
all memory reference operations within that vault. Each vault controller
determines its own timing requirements. Refresh operations are controlled by
the vault controller, eliminating this function from the host memory controller.
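
The Tezzaron figures quoted above are internally consistent, as a quick calculation shows. The sketch below assumes binary prefixes (1024 Mbits per Gbit and 1024 Gbits per Tbit) and an even spread of transactions across channels; both are assumptions made here for illustration, and the names are not from the presentation.

```python
# Figures quoted in the article for Tezzaron's memory stack.
CHANNELS = 256
MBITS_PER_CHANNEL = 256
GBITS_PER_S_PER_CHANNEL = 64
TRANSACTIONS_PER_S = 21e9   # for the whole stack

# Assumption: binary prefixes (1024 Mbits = 1 Gbit, 1024 Gbits = 1 Tbit).
capacity_gbits = CHANNELS * MBITS_PER_CHANNEL / 1024
capacity_gbytes = capacity_gbits / 8
bandwidth_tbits_s = CHANNELS * GBITS_PER_S_PER_CHANNEL / 1024

# Assumption: transactions are spread evenly across the channels.
transactions_per_channel = TRANSACTIONS_PER_S / CHANNELS

print(f"Capacity: {capacity_gbits:.0f} Gbits ({capacity_gbytes:.0f} Gbytes)")
print(f"Aggregate bandwidth: {bandwidth_tbits_s:.0f} Tbits/s")
print(f"Per channel: {transactions_per_channel / 1e6:.0f} million transactions/s")
```

With these inputs, 256 channels at 64 Gbits/s each reproduce the 16 Tbits/s projection cited earlier in the article.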

One additional presentation at MemCon discussed the future of ReRAM (resistive
RAM) as a possible DRAM replacement. A new material developed by 4DS, dubbed
MOHJO (metal oxide heterojunction operation) allows the company to develop ReRAMs
with a high cycle life, low power dissipation, good data retention, reduced
manufacturing time and cost, and it also solves the word-line drop problem that
occurs with other ReRAM solutions. The MOHJO material is deposited on the
back-end of the process flow, on top of a standard CMOS manufacturing flow. The
material has a low-current reset state that permits large blocks of memory to be
erased, and MOHJO-based memories can be especially useful in solid-state drive
systems by lowering the energy consumption by almost 100X. Compared with
flash, spin-torque technology (STT), and phase-change memories (PCM),
the MOHJO technology has about the same endurance as PCM storage; its
endurance is significantly lower than that of STT memories yet higher than that
of flash. Read and write performance of the MOHJO memories is symmetrical and
ranges from 10 to 50 ns, which is about 200X faster than flash and competitive
with STT and PCM memories (see the table). The company expects this technology
to be used both in flash replacement applications and as a nonvolatile cache in hybrid DRAM

Performance comparison of Flash, STT, PCM and MOHJO technologies

Dave Bursky

Semiconductor Technology Editor

The Big and Small Come Together at Semicon and Intersolar

July 25th, 2013

The recently held Semicon West and Intersolar Conferences in San Francisco were interesting examples of technology extremes. At Semicon, for example, major efforts are underway to define the equipment and fabrication facilities needed to produce chips based on 450 mm diameter wafers, while at the other extreme at Semicon, device and equipment designers were challenging each other to define and design ultra-small transistors and the lithography and other systems capable of fabricating devices with gate dimensions of 14 nm, 10 nm, and even smaller features. Intersolar also had its extremes, with presentations ranging from the energy efficiency of photovoltaic cells measuring a few square inches to the performance aspects of multi-square-meter PV panels and the implementation of large multi-acre commercial PV arrays.

Large research consortiums such as IMEC (formerly the Interuniversity Microelectronics Centre) in Leuven, Belgium, LETI (Laboratoire D’Electronique et de Technologies de L’Information) in Grenoble, France, Sematech in Albany, NY, as well as foundries such as Global Foundries, TSMC, and others are all working hard to define and qualify the processes needed for future-generation chips. Over the past few decades, scaling has lowered the cost of transistors by integrating more and more devices on a chip, even as the cost to fabricate the chips continued to increase.

However, Kurt Ronse, director of advanced patterning at IMEC, explains that the extremely high cost of fabrication tools and facilities to implement features of 14 nm and smaller has led to an increasing cost per transistor. The higher cost comes as a result of the use of triple or quadruple patterning with 193 nm immersion lithography. Such patterning techniques require many more masks to create the ultra-small features, and the higher number of masks adds considerably to the fabrication cost. Subi Kengeri, the Vice President of Advanced Technology Architecture at Global Foundries, confirmed the rising cost of lithography in comparison to other factors in a TechXpot presentation. In the graph he presented, the relative costs of various steps in the manufacturing process (etch, CMP, doping, metrology, metal deposition, dry etch, diffusion and dielectric deposition, and lithography) were compared (Figure 1). In most cases only moderate cost increases were observed. However, lithography costs escalated the most, for 193i at nodes below 20 nm.

Figure 1: A comparison of costs of different aspects of the manufacturing flow was done by Global Foundries across four process nodes to show the cost increases as the process nodes shrink from 28 nm to 20 nm, from 20 nm to N+1 nm using 193i immersion lithography, and alternately from 20 nm to N+1 using extreme ultraviolet lithography. As the graph shows, lithography costs skyrocket for the N+1 node using immersion lithography, but when EUV lithography is used the lithography cost drops considerably.


According to Ronse, it will still be a few years before EUV systems can be used in mass production, but only EUV systems can enable the 50% scaling needed to reach the 10 nm node. Current EUV research has led to UV power sources capable of delivering about 55 W. However, for a tool capable of commercial production, UV sources capable of 250 W will be needed. Such sources are not expected until 2015 at the earliest. Additionally, Ronse is hopeful that at the 10 nm node, EUV lithography can reverse the cost escalation trend since double or triple patterning would not be needed to create the 10 nm features (Figure 2).

Figure 2: Researchers at IMEC also agree that lithography costs have become a significant portion of the fabrication flow. In this graph the 28 nm node is used as the relative reference, with the 20 nm node costing almost 50% more and the N+1 193i showing an almost 80% cost increase over the 28 nm node, while the use of EUV promises to reduce the cost increase to just 20% vs the 28 nm node.
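
The relative costs described in the Figure 2 caption can be compared directly. In the sketch below the percentages are rounded to 50%, 80%, and 20%, whereas the caption says "almost" for the first two, so the results are approximate; the dictionary keys are labels invented for this example.

```python
# Relative cost per node from the Figure 2 caption, with 28 nm as the
# reference. Values are rounded and should be read as approximate.
relative_cost = {
    "28 nm": 1.00,
    "20 nm": 1.50,
    "N+1 (193i)": 1.80,
    "N+1 (EUV)": 1.20,
}

# Savings from using EUV instead of 193i immersion at the N+1 node.
euv_vs_193i = 1 - relative_cost["N+1 (EUV)"] / relative_cost["N+1 (193i)"]

# Cost of the N+1 EUV node relative to the 20 nm node.
euv_vs_20nm = relative_cost["N+1 (EUV)"] / relative_cost["20 nm"]

print(f"EUV saves about {euv_vs_193i:.0%} vs 193i at the N+1 node")
print(f"N+1 with EUV costs about {euv_vs_20nm:.0%} of the 20 nm node")
```

On these rounded numbers, EUV would cut the N+1 cost by roughly a third relative to 193i and would even bring it below the 20 nm node, which is the reversal of the cost trend that Ronse describes.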


At the device level there has been much written about the three-dimensional FinFET structures and the high performance that such transistors can deliver. However, there is a competition brewing between FinFET advocates and the supporters of planar fully-depleted silicon-on-insulator (FDSOI) device structures. Additionally, to further boost device performance, researchers are looking beyond silicon for the channel material in future transistor structures — options being researched include III-V materials, silicon-germanium, germanium, and carbon nanotubes.

As Maud Vinet, the FDSOI Manager at LETI, explained, the planar FDSOI structure requires fewer masks and is easier to scale than the 3D structure of the FinFET. Additionally, with FDSOI, designers can take one of two directions – they can opt for lower power consumption at comparable performance to current designs, or they can design for higher performance at comparable power levels. This quarter Vinet expects LETI’s development partner, STMicro, to start releasing FDSOI 14 nm design kits, while in early 2014, device and process models for 10 nm FDSOI designs should be ready for STMicro to develop design kits for release in the third quarter of 2014.

Dave Bursky, Chip Design Magazine

Microfluidics – an interesting blend of MEMS, IC technologies, and paper

July 3rd, 2013

At a recently held MEMS Technology and Business Symposium hosted by MEPTEC (the Microelectronics Packaging and Test Engineering Council) in San Jose, Calif., many advances in MEMS technology focusing on health care demonstrated the implementation in silicon of pumps, valves, chemical sensing, and still other functions. Additional research on the use of paper rather than silicon as the substrate shows a lot of promise since paper is very inexpensive, is compatible with many chemical/biochemical/medical applications, and it transports liquids using capillary forces, thus eliminating the need for a MEMS-based pump.

This combination of microscopic mechanical functions, silicon control circuits, and paper-based sensors is making possible a wide range of products for the medical eHealth market and for industrial and military applications. As demonstrated in a presentation by Dr. Gisela Lin from University of California at Irvine, silicon technology can now implement all the functions to form a “lab-on-a-chip” – bubble pumps, fluid channels, a mixing chamber, a polysilicon heater, and valves, all interconnected and controlled by an off-chip processor (Figure 1). The technology is similar to that used in ink-jet printer print heads.

Figure 1: Implemented in silicon, this lab-on-a-chip can pump liquid through fluid channels, warm the liquid using polysilicon heaters and control the liquid flow into mixing chambers via electrically controlled valves.



And the innovation doesn’t stop there, as Dr. Janusz Bryzek, the Vice President of Development for MEMS and Sensing Solutions at Fairchild Semiconductor, pointed out in the conference’s opening presentation. Driving that development is the growth in the wearable health monitoring market – according to ABI Research, a market research company, in 2010 just 10 million monitoring devices were deployed, mostly for sports and fitness applications. However, by 2014, ABI analysts expect the market to grow to 420 million wearable health monitors, with about 59 million used at home.

Ongoing research at several universities is examining the ability to print sensors directly on skin, allowing direct-contact measurements. For example, at the University of Illinois at Urbana-Champaign, researchers have succeeded in printing a triple-function sensor that senses the skin’s temperature, strain, and hydration state, all of which are useful to track general health and wellness, as well as for monitoring wound healing (Figure 2). An even more complex sensor circuit developed at the University of California at San Diego combines ECG and EMG sensors, temperature sensors, strain gauges, photodetectors, a wireless antenna, a wireless communications oscillator, a power pick-up coil to capture transmitted power, and an LED, all in a thin layer of rubbery polyester that allows the sensors to stretch, bend, or wrinkle. Such a solution can provide a means to monitor premature babies to detect the onset of seizures, which could lead to epilepsy or brain development problems (Figure 3).

Figure 2: Sensors directly printed on the skin by researchers from the University of Illinois at Urbana-Champaign can sense temperature, strain, and hydration state.



Figure 3: Multiple sensors as well as a wireless power pick-up coil and simple transmitter and antenna allow this sensing solution in a thin flexible polymer, from the University of California at San Diego, to be used for various patient monitoring applications. One such application could be to monitor premature babies to detect the onset of seizures, which could affect the baby’s development.


In addition to these advanced research prototypes, there are many real examples of Appcessories – application software and peripherals that link to and run on smartphones such as the Apple iPhone. Bryzek highlighted just a few – Proteus offers digestible sensors that send a wireless signal through the body to a receiver. The sensors measure heart rate, activity, and respiratory rate. GeneZ offers a low-cost DNA chip containing up to 64 reactions of less than 1 microliter in volume – assay time is 10 to 30 minutes and the cost is less than $1000 (the chip cost is just $5 to $10). Uchek from MIT uses the smartphone’s image sensor and a software application available on the Apple App Store to read test strips, and it can detect up to 25 conditions such as diabetes, urinary tract infections, and pre-eclampsia. The test strips can also measure the levels of glucose, proteins, ketones, and still other health factors.

Putting a doctor in a pocket, Scanadu released three home diagnostic tools that leverage the sensors and processing capability in a smartphone to perform imaging, sound analysis, molecular diagnostics, data analytics, and run a suite of algorithms that can create a comprehensive, real-time picture of your health. A “Lab on a Chip” developed by STMicroelectronics is employed by Veredus Laboratories to detect the current subtype of H7N9 (Avian Flu) along with other human subtypes of Influenza A. The Lab on a Chip combines two powerful molecular biological applications – polymerase chain reaction and microarray – and can detect the infection with a high accuracy and sensitivity within two hours while providing genetic information on the infection that traditionally would take days to weeks to learn. One last example provided by Bryzek is a device that performs DNA and RNA sensing – Nanobiosim, an engine that integrates physics, biomedicine, and nanotechnology and that can rapidly and accurately detect genetic fingerprints from any biological organism.

Dave Bursky, Technology Editor

For conference program details, go to https://www.meptec.org/meptec11thannual.html
