Hardware acceleration may be more interesting than you think

Writing software has never been more productive: we live amid an abundance of languages, frameworks, platforms, and tools that help us create (hopefully) working software faster. By contrast, for most people designing hardware remains a kind of black magic, a mysterious dark art that only skilled wizards are versed in.

That's too bad, because hardware can be incredibly powerful, and I attempt to show why in this post. But first, what do I mean by hardware accelerator? An integrated circuit that is dedicated to and optimized for a given application or set of applications, and that is implemented either with a special kind of programmable chip (an FPGA) or as a custom specialized integrated circuit (an ASIC). For example, the first GPUs were hardware accelerators for graphics processing (today's GPUs are really generic hardware accelerators, ever since they became programmable).

1. Massive parallelism

Perhaps the most obvious advantage of hardware acceleration is the massive parallelism it allows. Modern chips contain billions of transistors; not all of them can be active at the same time (because of data dependencies, among other things), but that still leaves a lot of available parallelism. By comparison, high-end desktop processors have 8 cores, and even server processors max out at 32.

On a side note, multi-core processors have been the norm for nine years (Intel released the "Core Duo" in January 2006). During that time, the number of transistors has kept increasing, tracking Moore's law, but the number of cores has not increased at the same pace: assuming a doubling every two years, we should have 32 cores by now (2 × 2^4).
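As a back-of-the-envelope check, the arithmetic above can be written out explicitly (illustrative only; the dates and core counts are the ones mentioned in the text, not measured data):

```python
# If core counts had doubled every two years since the 2-core
# Core Duo (January 2006), here is what we would expect by 2015.
start_cores = 2
years = 2015 - 2006        # roughly nine years at the time of writing
doublings = years // 2     # four full two-year doubling periods
expected_cores = start_cores * 2 ** doublings
print(expected_cores)      # 32 expected cores, versus 8-16 in practice
```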

The only platforms that come close are GPUs and many-core processors, with (announced) 1024 cores. Impressive, but still orders of magnitude smaller than the number of functions you can run in parallel on custom hardware.

2. Heterogeneous parallelism

Some applications (so-called embarrassingly parallel) are relatively easy to map onto many homogeneous cores. That's the case for 3D applications (hence GPUs) and many scientific applications (hence many-cores). But in the general case, applications do not exhibit obvious data parallelism.

Custom hardware allows you to control precisely how to implement a given algorithm or application, and to create the right architecture for it. Take cryptography for example: hashing and encrypting are purely sequential; in other words, they are defined as a sequence of N operations that cannot be executed in parallel because of data dependencies (each iteration requires data from the previous one).

Pipeline architecture (N = 4)

The easiest way to parallelize this kind of algorithm is to use a pipeline architecture. A pipeline has a latency of N cycles, the same as the original algorithm (it cannot be made lower), but its throughput is N times higher because it outputs a result every cycle (instead of every N cycles): each stage works on a different input at the same time. Amdahl's law does not apply here, because the speedup comes from overlapping independent inputs rather than from splitting a single computation.

This kind of architecture works very well in hardware because communication costs are basically zero (just wires between "cores"). But on a many-core processor, this kind of parallelism is only practical when the compute-to-communication ratio is favorable, i.e. when computations are large enough compared to inter-core communication costs.
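To make the latency/throughput behavior concrete, here is a minimal software sketch of a pipeline, simulated cycle by cycle (the stage functions are hypothetical stand-ins; in real hardware the stages are logic separated by registers, not Python calls):

```python
# A pipeline simulated one clock cycle at a time. In hardware, every
# stage works on a different input during the same cycle; the registers
# between stages carry intermediate results forward.

def run_pipeline(inputs, stages):
    n = len(stages)
    regs = [None] * n          # pipeline register after each stage
    outputs = []               # (cycle, value) of each finished result
    for cycle in range(len(inputs) + n):
        if regs[-1] is not None:          # a result leaves the pipeline
            outputs.append((cycle, regs[-1]))
        for i in range(n - 1, 0, -1):     # shift intermediate results
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        x = inputs[cycle] if cycle < len(inputs) else None
        regs[0] = stages[0](x) if x is not None else None
    return outputs

# Four hypothetical stages standing in for the N sequential operations.
stages = [lambda x: x + 1] * 4
results = run_pipeline([10, 20, 30], stages)
# The first result appears after N = 4 cycles (the latency), then one
# result per cycle (the throughput): [(4, 14), (5, 24), (6, 34)]
```

Note how the second and third results arrive only one cycle after the first: once the pipeline is full, a new input can enter and a finished result can exit on every cycle.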

3. High energy efficiency

High energy efficiency is why you can play a movie on your phone or tablet without it overheating or its battery dying in minutes. These devices are built around a SoC (System on Chip) with hardened video decoding functionality, to which most of the decoding work is offloaded. SoCs have hardware support for many other things, like encoding video, encoding/decoding pictures, transmitting voice, etc.

A dedicated piece of hardware gives you the best ratio of computational power to power consumption, and the flexibility to favor one or the other, allowing you to create ultra-low-power devices.


This is because there is no overhead, no energy spent decoding instructions and deciding which ones can execute in parallel and in what order. And since everything is hard wired, there is no need to store a program in memory and fetch instructions.

4. Other advantages

  • Ultra-low, deterministic latency. This is why the automotive industry has started to use FPGAs, and why the finance sector has been a heavy user of them. With no OS running, you decide what is executed, when, and for how long.
  • Very low level. You can go down to the bit level in virtually any kind of communication protocol; you can even invent your own. You get to tinker with processors, experiment new computing architectures, etc.
  • Might actually be easier to program than writing multi-core high-performance code with threads, OpenCL, OpenMP, MPI, or the like. We're making it easier than ever to design hardware with a new programming language for hardware that we've created.

What other advantages do you see in using hardware? Will you consider hardware acceleration for your next project? Let us know in the comments below!