Improving Performance and Energy Consumption in Embedded Systems via Binary Acceleration: A Survey


The breakdown of Dennard scaling has resulted in a decade-long stall of the maximum operating clock frequencies of processors. To mitigate this issue, computing shifted to multi-core devices. This introduced the need for programming flows and tools that facilitate the expression of workload parallelism at high abstraction levels. However, not all workloads are easily parallelizable, and the minor improvements to processor cores have not significantly increased single-threaded performance. Meanwhile, the Instruction-Level Parallelism available in applications remains considerably underexploited. This article reviews notable approaches that exploit this potential parallelism via automatic generation of specialized hardware from binary code. Although research on this topic spans more than 20 years, automatic acceleration of software via translation to hardware has gained new importance with the recent trend toward reconfigurable heterogeneous platforms. We characterize this kind of binary acceleration approach and the accelerator architectures on which it relies. We summarize notable state-of-the-art approaches individually and present a taxonomy and comparison. Performance gains from 2.6× to 5.6× are reported, mostly for bare-metal embedded applications, along with power consumption reductions between 1.3× and 3.9×. We believe the methodologies and results achievable by automatic hardware generation approaches are promising in the context of emergent reconfigurable devices.

This survey reviewed representative approaches to the automatic acceleration of applications by migrating their binary code to specialized accelerator hardware. Most of this work has been published within the past 10 years. The approaches offload the effort of optimizing the target application, ideally in terms of both performance and energy consumption, by automating the generation of custom accelerator hardware, automatically offloading computation to that hardware, or both. To accomplish this, they propose compilation flows or runtime techniques capable of targeting heterogeneous computing elements, preferably with minimal developer intervention. Functional validation and experimental evaluation rely on ASIC or FPGA implementations, along with simulations. We find that a platform for this type of approach, i.e., for automated hardware generation or runtime reconfiguration, is yet to be realized, especially if the process is to be offloaded to a runtime environment. There is, however, a trend toward hardware/software co-design and reconfigurable devices. For instance, FPGAs have evolved into fully fledged Systems-on-Chip (SoCs) that contain hardcore processors and specialized hardware modules (e.g., memory controllers and encryption modules), with a software interface to the reconfigurable logic [102]. This addresses one of the highlighted issues: the integration of custom hardware with the host processor. Simultaneously, Intel's new processors containing an integrated FPGA [49] are a step in the same direction, potentially standardizing how reconfigurable logic and the host processor integrate. In addition, this integration may allow the detection and translation techniques proposed by the reviewed approaches to be offloaded fully to runtime.
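To make the runtime detection step mentioned above concrete, the following is a minimal illustrative sketch, not taken from any of the surveyed tools: profiling backward-branch targets in an instruction trace is a common heuristic in this literature for finding hot loops that are candidates for offloading to custom hardware. The function name, trace format, and threshold are all hypothetical.

```python
# Hypothetical sketch of runtime hot-loop detection for binary
# acceleration: backward branches (target address <= branch address)
# typically close loops, so frequently executed backward-branch
# targets mark loop headers worth translating to hardware.

from collections import Counter

def detect_hot_loops(trace, threshold=100):
    """Return the set of loop-header addresses whose backward branch
    executed at least `threshold` times in the given (pc, target) trace."""
    counts = Counter()
    for pc, target in trace:
        if target is not None and target <= pc:  # backward branch: loop header
            counts[target] += 1
    return {addr for addr, n in counts.items() if n >= threshold}

# Simulated trace: a loop headed at 0x400 whose backward branch at
# 0x41c fires 150 times, plus a rarely taken forward branch.
trace = [(0x41c, 0x400)] * 150 + [(0x500, 0x600)] * 5
print(detect_hot_loops(trace))  # the loop header at 0x400 is detected
```

In an actual binary acceleration flow, the detected region would then be translated (e.g., to a dataflow graph mapped onto the accelerator), and execution would be redirected to the accelerator whenever the processor reaches that address.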
Given this, we believe that the reviewed binary acceleration techniques provide the infrastructure required by such future fully transparent acceleration approaches, and that they demonstrate the potential of hardware specialization to improve both performance and energy consumption.
