Featured articles

Moore's Law according to Mark Bohr

Moore’s Law, new architectures, machine learning

The International Symposium on Computer Architecture is right around the corner. We bring you a short overview of the topics in store for this year’s edition.

First, Intel’s Mark Bohr will deliver a keynote on CMOS scaling. Bohr is a Senior Fellow at Intel and regularly speaks about future scaling challenges to a variety of audiences. The ISCA audience is a demanding one, so it will be interesting to see what message Intel brings to this venue. One thing that keeps popping up is an optimistic 14nm+/10nm projection, which assumes above-average gains per transistor. Naturally, there is also a lot of talk about heterogeneous integration, which is progressing these days on many fronts: from chip level to data center level.

Intel’s 14nm and 10nm projections (Image: Intel)

The second keynote comes from none other than Google, which is in a very good position to talk about how the progression of Moore’s Law impacts real-life workloads. The speaker is Partha Ranganathan, formerly the Chief Technologist of HP Labs.

ISCA is definitely moving with the times, dedicating two sessions to machine learning. On this front, EE Times has an interesting report on a few new items that will be coming up during the conference.

One particularly interesting development is Plasticine, a reconfigurable architecture from Stanford that is still a pre-silicon design and could offer 100x the performance-per-Watt of FPGAs. The chip is currently said to reach 12.3 single-precision TFLOPS at a 1 GHz clock and a 49 W TDP. The concept behind the design is that higher-level abstractions can be extracted from applications, leading to a more informed understanding of data and control flow, including data locality and memory access. Hence the names of Plasticine’s components – “Pattern Compute Units” and “Pattern Memory Units”. The design leverages a range of optimizations, including hierarchical distributed memory management and streaming optimizations. The designers say that rather than focusing on neural-network convolutions, they chose to support frequently changing dense and sparse algorithms more efficiently.
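As a purely illustrative aside (this is not Plasticine’s actual toolchain, and the tile size below is invented for the example), the following Python sketch shows what it means to express a computation as explicit parallel patterns: a map over independent tiles and a reduce over the partial results. It is precisely this kind of structure that a pattern-aware compiler could assign to parallel “Pattern Compute Units” while keeping each tile resident in a “Pattern Memory Unit”.

# Illustrative sketch only: a dot product written as explicit patterns
# (map over tiles, reduce over partial sums) instead of an opaque loop nest.
from functools import reduce

TILE = 4  # hypothetical tile size, standing in for an on-chip buffer capacity

def tiles(xs, size=TILE):
    # Split a vector into contiguous tiles - candidates for on-chip buffers.
    return [xs[i:i + size] for i in range(0, len(xs), size)]

def dot(a, b):
    # Outer "map" over tile pairs: independent work for parallel compute units.
    partials = [sum(x * y for x, y in zip(ta, tb))
                for ta, tb in zip(tiles(a), tiles(b))]
    # Final "reduce": would become a tree reduction in hardware.
    return reduce(lambda acc, p: acc + p, partials, 0)

print(dot([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]))  # 120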

Plasticine vs. a 28nm Stratix FPGA (Image: Stanford)

More on the convolutional side, NVIDIA will present their SCNN inference accelerator. It is said to deliver 2.3x the energy efficiency of a comparable accelerator through a more aggressive approach to optimizing math operations. A range of other machine learning optimizations are included, focusing on weight and activation parameter delivery and reducing the overall pressure on DRAM. According to the authors, the chip hasn’t been produced yet, but its main commercial advantage would be the fact that SCNN exploits sparsity in both weights and activations.
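SCNN’s real dataflow is considerably more involved than anything we can show here, but the principle behind exploiting sparsity is easy to illustrate. In the toy Python sketch below (our own example, not NVIDIA’s design), a 1-D convolution multiplies only non-zero weight/activation pairs, so zeros cost neither math operations nor operand traffic.

# Toy sketch of sparsity exploitation in a 1-D "valid" convolution:
# only non-zero weights and activations are ever multiplied.
def sparse_conv1d(activations, weights):
    out = [0.0] * (len(activations) - len(weights) + 1)
    nz_w = [(j, w) for j, w in enumerate(weights) if w != 0.0]      # compressed weights
    nz_a = [(i, a) for i, a in enumerate(activations) if a != 0.0]  # compressed activations
    for i, a in nz_a:
        for j, w in nz_w:
            o = i - j                      # output position this pair contributes to
            if 0 <= o < len(out):
                out[o] += a * w
    return out

acts = [0.0, 2.0, 0.0, 0.0, 3.0, 0.0]   # mostly zero, as post-ReLU activations often are
wts  = [1.0, 0.0, -1.0]                 # pruned kernel
print(sparse_conv1d(acts, wts))         # [0.0, 2.0, -3.0, 0.0]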

These two approaches seem to go in slightly different directions, but both use cutting-edge reconfigurable computing as a solid base for research. It is quite likely that in the not so distant future we will see pervasive reconfigurable logic in general purpose chips, and then it will be up to the software to expose and exploit the wonders within.

AMD wafer

What’s next for server chips?

For anyone following the server processor market over the last few years, one thing should be clear as day: “it’s about to go down”. Competition between arch-rivals AMD and Intel is back on, while CPU designer ARM is closing deals and opening up new markets. Let’s take a closer look at what is in store.

Will Intel keep their crown?

Possibly the biggest news of recent months is the real-world performance of AMD’s new Ryzen CPUs. Recently launched in the client space, Ryzen chips offer approximately double the number of cores of their Intel counterparts, within roughly the same TDP and also on a 14nm process. AMD’s chips also feature pretty big caches for what we’re used to seeing in this segment. Such features are probably of less interest to the current target of AMD’s marketing efforts – hardcore gamers – but could make many data center managers very happy. A higher core count means higher integration, more performance per system and lower TCO. Ryzen chips also feature the goodies Intel got us used to – AVX, Turbo mode and simultaneous multithreading (SMT, branded Hyper-Threading on Intel x86). How do we know Intel is feeling threatened? Money. They’ve significantly dropped the prices of their desktop processors, in some cases by as much as 25%. If (or “once”) Ryzen threatens Intel’s Xeon cash cow, we can be sure that Intel will defend it very vigorously and will work aggressively to deliver competitive 10nm parts in 2018.

AMD Naples overview (Image: AMD/Anandtech)

AMD indeed has a 32-core part in the pipeline, codenamed “Naples”, with strong DDR4 support. New dual-socket systems based on this chip are said to support up to 2 TB of memory – 512 GB in practice – vs. the 384 GB most Intel platforms offer today. We should keep in mind that AMD has a history of undercutting Intel’s high-performance enterprise offerings – for instance, by offering 4-socket platforms at attractive price points, persuading Intel customers to ditch two 2-socket servers in favor of a single 4-socket machine.
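As a back-of-the-envelope sanity check – where the channel and DIMM counts below are our assumptions based on AMD’s public Naples material, not confirmed specifications – the capacity figures line up as follows:

# Rough capacity math (assumed configuration, not confirmed specs):
# 2 sockets x 8 DDR4 channels per socket x 2 DIMMs per channel = 32 DIMM slots.
sockets, channels_per_socket, dimms_per_channel = 2, 8, 2
slots = sockets * channels_per_socket * dimms_per_channel           # 32

print(slots * 64, "GB with 64 GB DIMMs")   # 2048 GB - the 2 TB headline figure
print(slots * 16, "GB with 16 GB DIMMs")   # 512 GB - the "in practice" figure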

ARM creeping in

Another interesting development is Microsoft’s interest in using ARM chips in production in its cloud business. At the Open Compute Project Summit, Microsoft declared that ARM would be the base of a future server design plugging into the Project Olympus form factor. This work is triggering new partnerships around the idea, as well as the development of a number of related components.

Microsoft has already had some stake in ARM development – for instance, with past versions of Windows. Now, it looks like the majority of Windows server-focused functionality will have to run smoothly on ARM, which is a big deal – especially for a company making one of the most used operating systems in the world.

The future of platforms, according to Microsoft

We all know that Intel’s round of layoffs and accelerated retirements leaked a solid number of talented engineers and executives into the marketplace. Interestingly, it seems that Intel might be facing competition from the very people who used to fuel the company. The Bloomberg article on the matter quotes both Anand Chandrasekher, now in charge of Qualcomm’s server chips, and Kushagra Vaid, now responsible for Azure hardware infrastructure at Microsoft.

Piz Daint - CSCS

Swiss supercomputing – a closer look

Swiss supercomputing is making headlines again after the recent upgrades made to the flagship “Piz Daint” machine. Notably, the system is now #2 on the Green500 list and #8 on the Top500 list. It is operated by the Swiss National Supercomputing Centre (CSCS) and could easily be considered a Swiss crown jewel.

The Cray-built XC50 supercomputer was submitted to the November Top500 with over 200,000 Intel x86 cores. A performance of 9.78 PFlops (Rmax) was obtained at only 1.3 MW. Current performance per Watt is roughly double that of the system submitted to the June Top500: 7.45 GFlops/W as opposed to 3.58 GFlops/W previously.
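The efficiency figure follows almost directly from those two numbers; the small gap to the quoted 7.45 GFlops/W presumably comes from the precise measured power used by the Green500 rather than the rounded 1.3 MW.

# Quick check of the quoted efficiency number.
rmax_pflops = 9.78      # Top500 Rmax, in PFlops
power_mw    = 1.3       # reported power, in MW

gflops_per_watt = (rmax_pflops * 1e6) / (power_mw * 1e6)   # PFlops -> GFlops, MW -> W
print(round(gflops_per_watt, 2), "GFlops/W")               # ~7.52, close to the listed 7.45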

These improvements were obtained by moving to more efficient Intel CPUs and a major upgrade to NVIDIA GPUs based on the Pascal architecture. In the process, the number of Top500-listed cores went up by nearly 80%, while power went down by 25%.

Piz Daint and the other Swiss supercomputers at CSCS are part of the Swiss national supercomputing service, providing academia, the public sector and national projects with number-crunching power. For instance, like other countries, Switzerland supports its national weather forecasting service, enabling a high-resolution 2.2 km grid. If there should be any doubt: simulating the weather of such a mountainous country 8 times a day is no easy feat. Another interesting application is the analysis of data from the Large Hadron Collider at CERN. The LHC is the biggest machine ever built, smashing particles together at nearly the speed of light. LHC detectors produce an avalanche of data coming in at a rate of multiple petabytes per second, which is later filtered, analyzed and stored by a grid of around 170 computing centres.

The Blue Brain IV system

Last but not least, one of the CSCS systems, an IBM Blue Gene/Q, was acquired from the Blue Brain Project at the Swiss Federal Institute of Technology in Lausanne (EPFL). The system is particularly well suited to brain simulations and neuroscientific research, with a simulation capacity of 200 million neurons – comparable to the brain of a rodent. An interesting tidbit: many of the CSCS supercomputers are named after prominent Swiss peaks.

If you’re curious to see the impressive infrastructure in person, the center offers guided tours, which can be booked through its webpage.

Images: CSCS

Is it finally time for FPGAs in general-purpose compute?

FPGAs have been around for decades, mostly in industrial applications. In recent years, ideas about “casual” FPGA use in general-purpose programming have been resurfacing with renewed strength.

It is no secret that FPGAs are a bit of a hassle to program. This is one of the main reasons why many shops today are examining options to move away from FPGAs in favor of GPUs, Xeon Phis or even standard x86.

At the same time, a lot of work is being done on making FPGAs more accessible. On the hardware side, vendors such as Intel have been working on x86+FPGA combos for a while now. At the Hot Chips ’16 conference, an expert from Baidu talked about using software-defined acceleration for fairly mainstream SQL processing. Since about 40% of data-analysis jobs at Baidu are written in SQL, the setup uses FPGAs programmed in RTL to process them, and boasts a more modest power envelope than that of GPUs.
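Baidu has not published the programming interface in detail, so the following Python sketch is only a conceptual illustration of software-defined acceleration: the host keeps the SQL-shaped query plan, while the data-parallel scan/filter/aggregate stage is handed off as one streaming operation – here mocked by an ordinary function standing in for the FPGA kernel.

# Conceptual sketch only: fpga_scan_filter_sum() stands in for an FPGA kernel.
def fpga_scan_filter_sum(rows, column, predicate):
    # Stream rows, filter, and accumulate - the data-parallel part of the query.
    return sum(r[column] for r in rows if predicate(r))

# Host side: something like  SELECT SUM(clicks) FROM logs WHERE region = 'EU'
logs = [{"region": "EU", "clicks": 10},
        {"region": "US", "clicks": 7},
        {"region": "EU", "clicks": 5}]

total = fpga_scan_filter_sum(logs, "clicks", lambda r: r["region"] == "EU")
print(total)  # 15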

Another interesting talk from Hot Chips ’16, given by DeePhi Tech, was on software-hardware co-design to accelerate neural networks. The company provides novel FPGA-based solutions for deep learning, supporting a range of mainstream applications such as detection, tracking, translation and recognition. In essence, modern, larger neural-network designs require memory bandwidth (among other constraints) that non-FPGA chips cannot easily deliver. The high-level workflow enables efficient connectors from frameworks such as TensorFlow. Raw performance results, compared to the ARM-based NVIDIA Tegra TK1, are not better on every front tested, but they are certainly promising – not to mention the improved power consumption.
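DeePhi’s compression pipeline is proprietary, but the bandwidth argument itself is simple to illustrate. In the toy Python sketch below (our own example, with made-up numbers), pruning near-zero weights and storing the survivors as (index, value) pairs reduces the bytes that have to stream from DRAM on every inference pass.

# Toy illustration of pruning for bandwidth (not DeePhi's actual pipeline).
def prune(weights, threshold=0.05):
    # Keep only weights whose magnitude exceeds the threshold.
    return [(i, w) for i, w in enumerate(weights) if abs(w) > threshold]

weights = [0.8, 0.01, -0.02, 0.3, 0.0, 0.04, -0.6, 0.02]
sparse  = prune(weights)

dense_bytes  = len(weights) * 4        # fp32 value per weight
sparse_bytes = len(sparse) * (4 + 4)   # 32-bit index + fp32 value per survivor

print(sparse)                                        # [(0, 0.8), (3, 0.3), (6, -0.6)]
print(dense_bytes, "->", sparse_bytes, "bytes streamed per layer pass")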

Perhaps the first general-purpose compute niche for FPGAs (if we may call it that) lies in specialized applications whose structure is already well understood thanks to the omnipresent push for data parallelism.