Posts Tagged ‘ARM’

Server system-on-chips pack up to 48 64-bit ARM cores

Wednesday, June 18th, 2014

Targeting secure cloud servers, storage servers, compute servers, and data-plane applications, the ThunderX series of multicore SoCs delivers power-efficient computing solutions.

Dave Bursky
Semiconductor Technology Editor

Multicore processors based on x86 cores are a very common choice for servers and for handling packets in data-networking applications. Although x86-based servers command most of the IT market, other processors such as MIPS and PowerPC are key players in deeply embedded applications such as network switches and routers, handling both data-plane and control-plane functions. ARM processors have started to make inroads in the server market, and with the release of the A57 64-bit core, they are poised to expand into all the applications that currently employ x86, MIPS, and PowerPC cores.

One example of that opportunity takes aim at low-power servers and secure network communications — the just-released ThunderX series of multicore processors from Cavium. This family includes versions containing from 8 to 48 customized ARM 64-bit processor cores that can operate at up to 2.5 GHz. There will actually be four families of processors in the ThunderX series–each optimized for a different type of workload. The ThunderX_SC is targeted at security applications, the ThunderX_ST for storage control and management, the ThunderX_NT for networking systems, and the ThunderX_CP for computational applications.

Implemented in a low-power 28-nm process, the basic ThunderX architecture brings together up to 48 full custom 64-bit processor cores that are fully compliant with the ARMv8 architecture specification and ARM’s Server Base System Architecture (SBSA). Included on each multi-core chip are a cache subsystem (each processor has level 1 instruction and data caches, and all processors share an L2 cache), Ethernet interfaces capable of 10/40/100 Gbit/s data rates, multiple PCIe gen3 and SATA v3 interfaces, up to four DDR3/4 memory controllers, additional I/O ports, and various accelerators depending on the market segment the processor is optimized to tackle (see the figure).

Members of the ThunderX family from Cavium contain up to 48 ARM64 processor cores, application-specific hardware accelerators, high-speed Ethernet ports, both PCIe gen3 and SATA v3 ports and many other system support features to support Compute, Storage, Networking, and Secure Computing applications.






For example, the ThunderX_SC family is optimized for secure Web front-end, security appliance, and Cloud RAN workloads. It includes specialized hardware accelerators consisting of Cavium’s 4th generation NITROX and TurboDPI technology with acceleration for IPSec, SSL, anti-virus, anti-malware, firewall, and DPI. The NITROX engine can deliver 50 Mbit/s to 40 Gbit/s of encryption bandwidth with 1K to 200K RSA/DH operations per second. Additionally, the TurboDPI block employs the company’s Uniscan technology, which simultaneously blocks malicious or inappropriate URLs, identifies hundreds of widely used protocols and applications, helps block thousands of different intrusion attempts, and locates over a hundred thousand varieties of virus and malware threats, all with just a single scan of the data stream.

Also integrated on the ThunderX_SC are multiple 10/40 Gbit/s Ethernet ports, multiple PCIe Gen3 and SATA 3 ports, up to four high-memory-bandwidth DDR3 or DDR4 72-bit memory controllers able to support 2400 MHz memories, a cache-coherent interconnect across dual sockets thanks to the Cavium Coherent Processor Interconnect, and a scalable fabric for east-west as well as north-south traffic connectivity. Most of these features are also available on the other ThunderX families, along with accelerators for each target application segment. The ST series includes storage accelerators for data protection, data integrity, security, and compression, as well as efficient user-to-user data movement. The CP series includes core-to-I/O virtualization in hardware. The NT series processors include full virtualization support and network accelerators for QoS, traffic shaping, tunnel termination, high packet-throughput processing, network virtualization, and data monitoring.
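To put the memory-controller figures in perspective, the peak theoretical bandwidth can be estimated with simple arithmetic. The sketch below rests on two assumptions that the article does not state: that "2400 MHz memories" means 2400 million transfers per second (as in DDR4-2400), and that each 72-bit interface carries 64 data bits plus 8 ECC bits, as is conventional for ECC memory. The function name is illustrative only.

```python
def ddr_peak_bandwidth_gb_s(transfers_mt_s: float,
                            data_bits: int = 64,
                            controllers: int = 1) -> float:
    """Peak theoretical bandwidth in GB/s (10^9 bytes per second).

    data_bits excludes ECC bits: a 72-bit ECC interface is assumed
    to move 64 bits of payload per transfer.
    """
    bytes_per_transfer = data_bits / 8
    return transfers_mt_s * 1e6 * bytes_per_transfer * controllers / 1e9


# Assumed reading of the article: 2400 MT/s, up to four controllers.
one_channel = ddr_peak_bandwidth_gb_s(2400)
four_channels = ddr_peak_bandwidth_gb_s(2400, controllers=4)

print(f"per controller:   {one_channel:.1f} GB/s")
print(f"four controllers: {four_channels:.1f} GB/s")
```

These are upper bounds; sustained bandwidth in a real system would be lower because of refresh, bank conflicts, and read/write turnaround.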

New Processor Core Options Try Some ARM Wrestling

Monday, May 13th, 2013

When designing a system on a chip (SoC) that employs one or more embedded processor cores, the choice of available processors continues to expand. At last month’s Design West conference in San Jose, Calif., designers were presented with many processor options. Leading the pack is ARM, whose broad array of cores offers a wide range of performance choices, from the Cortex-M0 at the low end of the performance spectrum to the 64-bit Cortex A57 at the high end. Although ARM’s cores dominate some SoC market segments, they aren’t the only game in town. EDA tool suppliers Synopsys and Cadence have acquired core suppliers ARC and Tensilica, respectively, and recently, Imagination Technologies acquired MIPS. Thus the number of independent processor core IP providers dropped considerably, but not for long.

One newcomer to the U.S. market, Andes Technology, has crafted multiple synthesizable processor-core families (the N7, N8, N9, N10, N12, and N13) that offer 32-bit cores with gate counts that start at just 12K gates (for the N7). For applications that don’t require legacy compatibility, these cores can challenge ARM and other vendors for embedded applications. Based on a proprietary instruction-set architecture (ISA), the N7 family cores can deliver about 1.19 MIPS/MHz, which is about 20 percent higher than the ARM Cortex-M0. Additionally, the cores consume about 30 percent less power at the same performance level as the M0. The low-gate-count core, referred to as the Hummingbird, also requires a small amount of chip real estate – less than 0.04 mm² when fabricated using a 90 nm process. With optional features such as a prefetch buffer that can serve as a small instruction cache, the core can deliver up to 1.45 DMIPS/MHz, but reaching that higher performance raises the gate count to close to 30K gates.

Figure 1: One of the higher-end processor cores from Andes Technology is
the N12. It contains an eight-stage pipeline with dynamic branch prediction, a
memory-management unit and instruction and data caches.


The ISA consists of a mix of 16- and 32-bit instructions that execute on the N7, which has a simple two-stage pipelined architecture. At the high end, the N12 and N13 series implement the ISA on an eight-stage pipeline and pack a memory-management unit, instruction and data caches, and dynamic branch prediction (Figure 1). Programming tools and a good compiler make the proprietary ISA a non-issue and allow designers to program using tools like GCC/Linux. The Hummingbird core is targeted at applications such as Bluetooth, the Internet of Things (IoT)/machine-to-machine communications, touchscreen controllers, and other embedded applications. Its licensing fees are considerably lower than what ARM charges for its M0 core, thus keeping down the cost of the SoC. The higher-performance cores take on performance-sensitive applications such as embedded Linux systems.

Figure 2: Between the commercial CPUs and a dedicated fixed-function solution is the ASIP (application-specific
instruction-set processor), a block of customer-defined intellectual property (left). Tools from Target Compiler Technologies allow designers to craft the IP block and incorporate the block in an ASIC, thus enabling the designers to significantly improve the power efficiency as well as the performance of their ASIC solution (right).



Taking a different approach to crafting an embedded processor core, Target Compiler Technologies offers tools that let designers define everything from their own optimized processor cores to a complete multicore application-specific SoC. The company’s IP Designer tools let designers craft their own application-specific instruction-set processor (ASIP) and support architectural exploration, SDK generation (C compiler, instruction set simulator, debugger, etc.), and RTL generation. Once the IP blocks are defined, the MP Designer tools for multicore ASIC design perform code parallelization, communication and synchronization, and multicore platform generation (Figure 2).

Figure 3: A single-tile xCore processor SoC platform from XMOS can
emulate up to eight “logical” processors and has areas set aside that designers
can use to customize the I/O and bus interface/communications channel. The
platform chips from XMOS can contain 1, 2, or 4 physical processor tiles (up to
32 logical processors) and can clock at up to 500 MHz.


Somewhere between a dedicated processor core and a fully definable multicore platform sits the configurable processor SoC platform developed by XMOS. The company offers a partially predefined multiple-processor platform that contains 1, 2, or 4 processor “tiles”, with each tile able to run up to eight threads (or eight logical processors), plus basic support blocks such as SRAM, PLLs, timing (schedulers, timers, clocks), security (one-time-programmable ROM), and a JTAG debug port (Figure 3). The remainder of the platform consists of configurable sections into which designers can drop special IP blocks from the XMOS library or their own proprietary interface/special-function logic IP that connects to the platform’s I/O ports and X-Connect interface channels/links.

Each processor tile can deliver up to 500 MIPS of compute power when running at 500 MHz. Each logical processor (a thread) shares processing resources and memory in the tile, but each logical processor has its own register files and gets a guaranteed slice of the tile processor’s compute power (125 MIPS at 500 MHz). The high performance of the processor tiles allows the xCore to take on many applications in consumer and audio systems, automotive systems, industrial control, and display/imaging systems.
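The relationship between the 500 MIPS tile figure and the 125 MIPS per-thread figure can be sketched with a simple model. The model below is an assumption rather than something the article spells out: the tile issues one instruction per clock, its issue slots are shared evenly among the active threads, and a single thread can issue at most once every four clocks. Under that reading, 125 MIPS is the per-thread ceiling with up to four active threads, and the share shrinks as more threads become active. The function name and the `min_issue_interval` parameter are hypothetical.

```python
def per_thread_mips(clock_mhz: float,
                    active_threads: int,
                    min_issue_interval: int = 4) -> float:
    """Estimated MIPS available to each active thread on one tile.

    Assumes one instruction issued per clock for the whole tile and
    that a thread can issue at most once every `min_issue_interval`
    clocks (an assumed value, chosen to match 125 MIPS at 500 MHz).
    """
    if not 1 <= active_threads <= 8:
        raise ValueError("a tile runs between 1 and 8 threads")
    return clock_mhz / max(active_threads, min_issue_interval)


for n in (1, 4, 6, 8):
    print(f"{n} active thread(s): {per_thread_mips(500, n):.1f} MIPS each")
```

Because the share depends only on the number of active threads and not on what the other threads are doing, a model of this kind would give each logical processor a predictable slice of the tile.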

Dave Bursky
Semiconductor Technology Editor

Multicore Processors Deliver Performance Boosts For Handheld Platforms

Tuesday, February 5th, 2013

As designers push to improve the performance of smartphones and other portable systems, they are turning more and more to multicore processors to speed the computations, handle more audio and video operations, and provide better connectivity – all while reducing the system power envelope. At this year’s International Consumer Electronics Show (CES), several new multicore system-on-a-chip solutions were unveiled by Nvidia, Qualcomm, and Samsung. These solutions promise to raise the performance bar of handheld systems to new levels by allowing multiple applications to run concurrently, rendering 3D graphics to support the most complex gaming applications, and delivering real-time response. Let’s take a closer look at what each of the companies has done to deliver the performance needed for next-generation systems.

The highest-performance solution from the trio comes from Samsung – the Exynos 5 Octa leverages the big.LITTLE concept developed by ARM to craft a system-on-a-chip design that contains eight processor cores plus lots of other system logic in a high-density ball-grid-array package (see the photo). The cores are divided into two clusters – one contains four high-performance ARM Cortex A15 processors, while the other cluster consists of four power-efficient ARM Cortex A7 cores. Only one cluster of CPU cores can be active at any time, and the cluster not in use goes to sleep to reduce power consumption. When switching between clusters, there is a 30 to 50 ms switchover delay.
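The one-cluster-at-a-time behavior can be illustrated with a toy governor. This is a sketch under stated assumptions, not Samsung’s or ARM’s actual switching policy: the load thresholds, the normalized load scale, and the class and attribute names are hypothetical, and the 40 ms delay is simply the midpoint of the 30 to 50 ms range quoted above. Two thresholds are used so that a load hovering near a single cutoff does not trigger a costly switch on every sample.

```python
from dataclasses import dataclass


@dataclass
class ClusterGovernor:
    """Toy model: exactly one cluster ("A7" or "A15") is active at a time."""
    up_threshold: float = 0.80     # hypothetical: switch to A15 above this load
    down_threshold: float = 0.30   # hypothetical: switch back to A7 below this
    switch_delay_ms: float = 40.0  # assumed midpoint of the quoted 30-50 ms
    active: str = "A7"
    switches: int = 0

    def update(self, load: float) -> str:
        """Feed one normalized load sample (0.0-1.0); return the active cluster."""
        if self.active == "A7" and load > self.up_threshold:
            self.active = "A15"
            self.switches += 1
        elif self.active == "A15" and load < self.down_threshold:
            self.active = "A7"
            self.switches += 1
        return self.active

    @property
    def total_switch_delay_ms(self) -> float:
        return self.switches * self.switch_delay_ms


governor = ClusterGovernor()
trace = [0.1, 0.5, 0.9, 0.6, 0.4, 0.2, 0.85]  # made-up load samples
history = [governor.update(load) for load in trace]
print(history)
print(governor.switches, "switches,", governor.total_switch_delay_ms, "ms spent switching")
```

With a switchover cost of tens of milliseconds, keeping the number of transitions low matters as much as picking the right cluster, which is why the gap between the two thresholds is deliberately wide in this sketch.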

The A7 cores, when in use, reduce power consumption by 3.3X vs the quad A15 cluster, but still give the system enough performance to handle most of the basic housekeeping functions and many applications that don’t require the high performance of the A15 cores. Additionally, the A7 cores are much smaller than the A15 core – all four A7 cores occupy only about half the area of a single A15 core, so the area penalty to add the four A7 cores has minimal impact on the chip area.

This high-density ball-grid array package houses Samsung’s Exynos 5 Octa processor that contains two quad-core clusters – one comprised of four Cortex A15 high-performance cores and the other containing four low-power second-generation Cortex A7 cores.

Qualcomm has crafted two new multicore additions to its Snapdragon family – the Snapdragon 600 and Snapdragon 800. The top-of-the-line 800 series, fabricated on a 28-nm process, not only has Qualcomm’s latest CPU core, the Krait 400, in a quad-core configuration, but also updated versions of the company’s GPU, the Adreno 330, and the Hexagon v5 DSP engine. Furthermore, a 4G LTE Cat 5 modem integrated on the chip allows Snapdragon to connect to the fastest mobile networks. The processor cores can run at clock rates of up to 2.3 GHz, and each core is active only when needed, so the entire system is designed to conserve power whenever possible. Additionally, the video support includes the ability to capture and display ultra HD, which delivers four times the pixel count of the standard 1080p display. The chip also supports displays of up to 2560 by 2048 pixels as well as Miracast wireless video streaming at 1080p.
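The four-to-one relationship between ultra HD and 1080p is easy to verify. The short sketch below assumes ultra HD means 3840 by 2160 pixels (the article does not give the figure) and compares it, along with the 2560 by 2048 maximum panel size quoted above, against a 1920 by 1080 frame.

```python
def pixel_count(width: int, height: int) -> int:
    """Total number of pixels in a frame of the given dimensions."""
    return width * height


FULL_HD = pixel_count(1920, 1080)    # standard 1080p frame
ULTRA_HD = pixel_count(3840, 2160)   # assumed ultra HD resolution
MAX_PANEL = pixel_count(2560, 2048)  # largest display quoted in the article

print(f"1080p:     {FULL_HD:,} pixels")
print(f"ultra HD:  {ULTRA_HD:,} pixels ({ULTRA_HD / FULL_HD:.1f}x 1080p)")
print(f"2560x2048: {MAX_PANEL:,} pixels ({MAX_PANEL / FULL_HD:.2f}x 1080p)")
```

Doubling both the width and the height is what produces the factor of four, so the capture and display pipeline has to move four times as many pixels per frame as it does at 1080p.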

For more conservative system designs, the Snapdragon 600 embeds a quad-core Krait 300 CPU cluster that runs at 1.9 GHz, an Adreno 320-series graphics processor, and support for low-power DDR3 memory. Designers also included many other enhancements that allow the chip to deliver about 40% better performance than the previous-generation Snapdragon S4 Pro processor at even lower power consumption levels.

Last of the trio, NVIDIA unveiled details of the Tegra 4, which integrates four ARM Cortex A15 cores plus a second-generation battery-saver core, similar to the approach used in the Tegra 3 processor. This variable symmetrical multiprocessor architecture developed by NVIDIA allows all four A15 cores to operate simultaneously or power down when not needed, allowing the lower-power battery-saver core to take over for housekeeping and non-performance-critical tasks such as music and video playback. The integrated graphics processor unit (GPU) contains 72 custom cores that help it deliver top-notch gaming performance as well as advanced media and web capabilities, including WebGL and HTML5. Lastly, the Tegra 4 ties into the company’s Icera 450 soft-modem chipset to deliver high-throughput HSPA+ communications with data rates as high as 28 Mbits/s. Additional Icera solutions, the Icera 410 and 400, are also compatible with the Tegra 4.

These multicore processors are just the tip of the proverbial iceberg with regard to the many product developments unveiled at CES. Future columns will highlight some of the additional developments from the show.

Dave Bursky
Chip Design Magazine

ARMed and Ready for the Future

Thursday, November 29th, 2012

About a year ago, ARM unveiled its big.LITTLE processor architecture, which melded a low-power Cortex A7 processor core with one or more high-performance 32-bit Cortex A15 processors. In this approach, software execution could seamlessly switch between the high-performance cores and the low-power core, depending on the task at hand. This allowed designers to get the high performance of the A15 cores when compute-intensive tasks were running, and save power when light tasks such as system housekeeping and monitoring functions had to be executed. Although this is not a radically new concept (Nvidia introduced a similar approach in its Tegra 3 series processors, which are used in various mobile applications such as tablets), the ability to smoothly switch execution improves the user experience and helps extend the battery life.

However, system designers are always asking for higher performance. To meet those demands, at this past October’s ARM Developer Forum held in Santa Clara, Calif., ARM unveiled its second-generation big.LITTLE multicore solution employing cores based on the new ARMv8 architecture. The ARMv8 represents ARM’s venture into 64-bit computing, and the first ARM A50 series processor cores, the A53 and A57, are the first to implement the v8 instruction set architecture, AArch64. Both cores are also fully compatible with the large 32-bit ecosystem that supports previous ARM processor cores. Either core can be used independently as a stand-alone solution, or the two can be combined into a big.LITTLE configuration that delivers high performance with excellent power efficiency. The two processors can seamlessly transition from their 32-bit execution mode to their 64-bit mode, thus providing performance scalability to 64-bit operation in mobile and enterprise computing applications.

The high-performance A57 core’s architecture includes a complex, out-of-order multi-issue pipeline, and can deliver about triple the performance of high-end processors used in today’s “superphones” in its 32-bit mode while maintaining the same power envelope. Designed for use in highly-scalable applications, the A57 core can be used in compute clusters that can range from a single core to beyond 16 cores. The core includes optimized instructions to improve software execution and new instructions to speed up encryption algorithms by close to 10X. Targeting some enterprise applications, the 64-bit support also includes enhanced floating-point performance.

Complementing the A57 core is the Cortex A53, claimed by ARM to be the world’s smallest 64-bit processor. The A53 delivers performance comparable to that of the Cortex A9 but is 40%+ smaller than the A9 when fabricated in the same process. Both the A57 and A53 are supported by the recently released ARM-developed blocks of IP – the CoreLink 400 AMBA 4 next-generation on-chip bus interconnect and the CoreLink 500 cache-coherency management logic.

ARM expects the cores to deliver multi-gigahertz performance when implemented on advanced CMOS and FinFET processes at 20 nm and eventually 14 nm process nodes. The small size of the cores will permit designers to include multiple instances of the cores on a system-on-a-chip solution that can range from a system containing a dual-core A57 block and a quad A53 processor block for a next-generation “superphone”, to an enterprise solution that could pack four quad A57 clusters for a compute server, or four quad A53 clusters for a low-power web server (see the figure).

In these two sample configurations of the big.LITTLE approach, the Cortex A50 series processors deliver the best balance of power and performance for “superphones” using one A57 core and four A53 cores (Figure, left), or top-notch performance in a web server application using four quad-core clusters of A57 cores (Figure, right).

Products employing the new cores are not expected to debut until 2014, but early licensees including AMD, Broadcom, Calxeda, HiSilicon, Samsung, and STMicroelectronics have a jump start on product development and could potentially release products in late 2013.

Dave Bursky
Chip Design Magazine

©2019 Extension Media. All Rights Reserved.