Architecture and Code Optimization (TACO)


ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers, Volume 9 Issue 4, January 2013

Feedback-driven binary code diversification
Bart Coppens, Bjorn De Sutter, Jonas Maebe
Article No.: 24
DOI: 10.1145/2400682.2400683

As described in many blog posts and in the scientific literature, exploits for software vulnerabilities are often engineered on the basis of patches. For example, “Microsoft Patch Tuesday” is often followed by “Exploit...

A performance and energy comparison of convolution on GPUs, FPGAs, and multicore processors
Jeremy Fowers, Greg Brown, John Wernsing, Greg Stitt
Article No.: 25
DOI: 10.1145/2400682.2400684

Recent architectural trends have focused on increased parallelism via multicore processors and increased heterogeneity via accelerator devices (e.g., graphics-processing units, field-programmable gate arrays). Although these architectures have...

Vectorization technology to improve interpreter performance
Erven Rohou, Kevin Williams, David Yuste
Article No.: 26
DOI: 10.1145/2400682.2400685

In the present computing landscape, interpreters are in use in a wide range of systems. Recent trends in consumer electronics have created a new category of portable, lightweight software applications. Typically, these applications have fast...

Fast asymmetric thread synchronization
Jimmy Cleary, Owen Callanan, Mark Purcell, David Gregg
Article No.: 27
DOI: 10.1145/2400682.2400686

For most multi-threaded applications, data structures must be shared between threads. Ensuring thread safety on these data structures incurs overhead in the form of locking and other synchronization mechanisms. Where data is shared among multiple...

PS-TLB: Leveraging page classification information for fast, scalable and efficient translation for future CMPs
Yong Li, Rami Melhem, Alex K. Jones
Article No.: 28
DOI: 10.1145/2400682.2400687

Traversing the page table during virtual to physical address translation causes pipeline stalls when misses occur in the translation-lookaside buffer (TLB). State-of-the-art translation proposals typically optimize a single aspect of translation...

Per-thread cycle accounting in multicore processors
Kristof Du Bois, Stijn Eyerman, Lieven Eeckhout
Article No.: 29
DOI: 10.1145/2400682.2400688

While multicore processors improve overall chip throughput and hardware utilization, resource sharing among the cores leads to unpredictable performance for the individual threads running on a multicore processor. Unpredictable per-thread...

Maxine: An approachable virtual machine for, and in, Java
Christian Wimmer, Michael Haupt, Michael L. Van De Vanter, Mick Jordan, Laurent Daynès, Douglas Simon
Article No.: 30
DOI: 10.1145/2400682.2400689

A highly productive platform accelerates the production of research results. The design of a Virtual Machine (VM) written in the Java™ programming language can be simplified through exploitation of interfaces, type and memory safety,...

A script-based autotuning compiler system to generate high-performance CUDA code
Malik Khan, Protonu Basu, Gabe Rudy, Mary Hall, Chun Chen, Jacqueline Chame
Article No.: 31
DOI: 10.1145/2400682.2400690

This article presents a novel compiler framework for CUDA code generation. The compiler structure is designed to support autotuning, which employs empirical techniques to evaluate a set of alternative mappings of computation kernels and...

Understanding fundamental design choices in single-ISA heterogeneous multicore architectures
Kenzo Van Craeynest, Lieven Eeckhout
Article No.: 32
DOI: 10.1145/2400682.2400691

Single-ISA heterogeneous multicore processors have gained substantial interest over the past few years because of their power efficiency, as they offer the potential for high overall chip throughput within a given power budget. Prior work in...

The CRNS framework and its application to programmable and reconfigurable cryptography
Samuel Antão, Leonel Sousa
Article No.: 33
DOI: 10.1145/2400682.2400692

This article proposes the Computing with the Residue Number System (CRNS) framework, which aims at the design automation of accelerators for Modular Arithmetic (MA). The framework provides a comprehensive set of tools ranging from a programming...

A decoupled local memory allocator
Boubacar Diouf, Can Hantaş, Albert Cohen, Özcan Özturk, Jens Palsberg
Article No.: 34
DOI: 10.1145/2400682.2400693

Compilers use software-controlled local memories to provide fast, predictable, and power-efficient access to critical data. We show that the local memory allocation for straight-line, or linearized programs is equivalent to a weighted...

Layout-oblivious compiler optimization for matrix computations
Huimin Cui, Qing Yi, Jingling Xue, Xiaobing Feng
Article No.: 35
DOI: 10.1145/2400682.2400694

Most scientific computations serve to apply mathematical operations to a set of preconceived data structures, e.g., matrices, vectors, and grids. In this article, we use a number of widely used matrix computations from the LINPACK library to...

Compiler support for lightweight context switching
Stephen Dolan, Servesh Muralidharan, David Gregg
Article No.: 36
DOI: 10.1145/2400682.2400695

We propose a new language-neutral primitive for the LLVM compiler, which provides efficient context switching and message passing between lightweight threads of control. The primitive, called Swapstack, can be used by any language...

LIGERO: A light but efficient router conceived for cache-coherent chip multiprocessors
Pablo Abad, Valentin Puente, Jose-Angel Gregorio
Article No.: 37
DOI: 10.1145/2400682.2400696

Although abstraction is the best approach to deal with computing system complexity, sometimes implementation details should be considered. Considering on-chip interconnection networks in particular, underestimating the underlying system...

Exploiting reuse locality on inclusive shared last-level caches
Jorge Albericio, Pablo Ibáñez, Víctor Viñals, Jose María Llabería
Article No.: 38
DOI: 10.1145/2400682.2400697

Optimization of the replacement policy used for Shared Last-Level Cache (SLLC) management in a Chip-MultiProcessor (CMP) is critical for avoiding off-chip accesses. Temporal locality, while being exploited by first levels of private cache...

Optimizing software runtime systems for speculative parallelization
Paraskevas Yiapanis, Demian Rosas-Ham, Gavin Brown, Mikel Luján
Article No.: 39
DOI: 10.1145/2400682.2400698

Thread-Level Speculation (TLS) overcomes limitations intrinsic to conservative compile-time auto-parallelizing tools by extracting parallel threads optimistically and only ensuring absence of data dependence violations at runtime.


Algorithmic species: A classification of affine loop nests for parallel programming
Cedric Nugteren, Pieter Custers, Henk Corporaal
Article No.: 40
DOI: 10.1145/2400682.2400699

Code generation and programming have become ever more challenging over the last decade due to the shift towards parallel processing. Emerging processor architectures such as multi-cores and GPUs increasingly exploit parallelism, requiring...

Optimal DPM and DVFS for frame-based real-time systems
Marco E. T. Gerards, Jan Kuper
Article No.: 41
DOI: 10.1145/2400682.2400700

Dynamic Power Management (DPM) and Dynamic Voltage and Frequency Scaling (DVFS) are popular techniques for reducing energy consumption. Algorithms for optimal DVFS exist, but optimal DPM and the optimal combination of DVFS and DPM are not yet...

An integrated pseudo-associativity and relaxed-order approach to hardware transactional memory
Zhichao Yan, Hong Jiang, Yujuan Tan, Dan Feng
Article No.: 42
DOI: 10.1145/2400682.2400701

Our experimental study and analysis reveal that the bottlenecks of existing hardware transactional memory systems are largely rooted in the extra data movements in version management and in the inefficient scheduling of conflicting transactions in...

Profile-guided floating- to fixed-point conversion for hybrid FPGA-processor applications
Doris Chen, Deshanand Singh
Article No.: 43
DOI: 10.1145/2400682.2400702

The key to enabling widespread use of FPGAs for algorithm acceleration is to allow programmers to create efficient designs without the time-consuming hardware design process. Programmers are used to developing scientific and mathematical...

Lock-contention-aware scheduler: A scalable and energy-efficient method for addressing scalability collapse on multicore systems
Yan Cui, Yingxin Wang, Yu Chen, Yuanchun Shi
Article No.: 44
DOI: 10.1145/2400682.2400703

In response to the increasing ubiquity of multicore processors, there has been widespread development of multithreaded applications that strive to realize their full potential. Unfortunately, lock contention within operating systems can limit the...

ADAPT: A framework for coscheduling multithreaded programs
Kishore Kumar Pusukuri, Rajiv Gupta, Laxmi N. Bhuyan
Article No.: 45
DOI: 10.1145/2400682.2400704

Since multicore systems offer greater performance via parallelism, future computing is progressing towards the use of multicore machines with a large number of cores. However, the performance of emerging multithreaded programs often does not scale to...

Continuous learning of compiler heuristics
Michele Tartara, Stefano Crespi Reghizzi
Article No.: 46
DOI: 10.1145/2400682.2400705

Optimizing programs to exploit the underlying hardware architecture is an important task. Much research has been done on enabling compilers to find the best set of code optimizations that can build the fastest and least resource-hungry executable...

HC-CART: A parallel system implementation of data mining classification and regression tree (CART) algorithm on a multi-FPGA system
Grigorios Chrysos, Panagiotis Dagritzikos, Ioannis Papaefstathiou, Apostolos Dollas
Article No.: 47
DOI: 10.1145/2400682.2400706

Data mining is a new field of computer science with a wide range of applications. Its goal is to extract knowledge from massive datasets in a human-understandable structure, for example, the decision trees. In this article we present an...

Dynamic code duplication with vulnerability awareness for soft error detection on VLIW architectures
Jongwon Lee, Yohan Ko, Kyoungwoo Lee, Jonghee M. Youn, Yunheung Paek
Article No.: 48
DOI: 10.1145/2400682.2400707

Soft errors are becoming a critical concern in embedded system designs. Code duplication techniques have been proposed to increase the reliability in multi-issue embedded systems such as VLIW by exploiting empty slots for duplicated instructions....

API compilation for image hardware accelerators
Fabien Coelho, François Irigoin
Article No.: 49
DOI: 10.1145/2400682.2400708

We present an API-based compilation strategy to optimize image applications, developed using a high-level image processing library, onto three different image processing hardware accelerators. We demonstrate that such a strategy is profitable for...

Fair CPU time accounting in CMP+SMT processors
Carlos Luque, Miquel Moreto, Francisco J. Cazorla, Mateo Valero
Article No.: 50
DOI: 10.1145/2400682.2400709

Processor architectures combining several paradigms of Thread-Level Parallelism (TLP), such as CMP processors in which each core is SMT, are becoming more and more popular as a way to improve performance at a moderate cost. However, the complex...

Significantly reducing MPI intercommunication latency and power overhead in both embedded and HPC systems
Pavlos M. Mattheakis, Ioannis Papaefstathiou
Article No.: 51
DOI: 10.1145/2400682.2400710

Highly parallel systems are becoming mainstream in a wide range of sectors ranging from their traditional stronghold high-performance computing, to data centers and even embedded systems. However, despite the quantum leaps of improvements in cost...

Improved loop tiling based on the removal of spurious false dependences
Riyadh Baghdadi, Albert Cohen, Sven Verdoolaege, Konrad Trifunović
Article No.: 52
DOI: 10.1145/2400682.2400711

To preserve the validity of loop nest transformations and parallelization, data dependences need to be analyzed. Memory dependences come in two varieties: true dependences or false dependences. While true dependences must be satisfied in order to...

OpenStream: Expressiveness and data-flow compilation of OpenMP streaming programs
Antoniu Pop, Albert Cohen
Article No.: 53
DOI: 10.1145/2400682.2400712

We present OpenStream, a data-flow extension of OpenMP to express dynamic dependent tasks. The language supports nested task creation, modular composition, variable and unbounded sets of producers/consumers, and first-class streams. These...

Polyhedral parallel code generation for CUDA
Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, Francky Catthoor
Article No.: 54
DOI: 10.1145/2400682.2400713

This article addresses the compilation of a sequential program for parallel execution on a modern GPU. To this end, we present a novel source-to-source compiler called PPCG. PPCG stands out for its ability to accelerate computations from any...

Delta-compressed caching for overcoming the write bandwidth limitation of hybrid main memory
Yu Du, Miao Zhou, Bruce Childers, Rami Melhem, Daniel Mossé
Article No.: 55
DOI: 10.1145/2400682.2400714

Limited PCM write bandwidth is a critical obstacle to achieving good performance from hybrid DRAM/PCM memory systems. The write bandwidth is severely restricted in PCM devices, which harms application performance. Indeed, as we show, it is more...

Finding good optimization sequences covering program space
Suresh Purini, Lakshya Jain
Article No.: 56
DOI: 10.1145/2400682.2400715

The compiler optimizations we enable and the order in which we apply them on a program have a substantial impact on the program execution time. Compilers provide default optimization sequences which can give good program speedup. As the default...

A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures
Mehmet E. Belviranli, Laxmi N. Bhuyan, Rajiv Gupta
Article No.: 57
DOI: 10.1145/2400682.2400716

Today's heterogeneous architectures bring together multiple general-purpose CPUs and multiple domain-specific GPUs and FPGAs to provide dramatic speedup for many applications. However, the challenge lies in utilizing these heterogeneous processors...

SCIN-cache: Fast speculative versioning in multithreaded cores
Anurag Negi, Ruben Titos-Gil
Article No.: 58
DOI: 10.1145/2400682.2400717

This article describes cache designs for efficiently supporting speculative techniques like transactional memory on chip multiprocessors with multithreaded cores. On-demand allocation and prompt freeing of speculative cache space in the design...

PARTANS: An autotuning framework for stencil computation on multi-GPU systems
Thibaut Lutz, Christian Fensch, Murray Cole
Article No.: 59
DOI: 10.1145/2400682.2400718

GPGPUs are a powerful and energy-efficient solution for many problems. For higher performance or larger problems, it is necessary to distribute the problem across multiple GPUs, increasing the already high programming complexity. In this...

Stream arbitration: Towards efficient bandwidth utilization for emerging on-chip interconnects
Chunhua Xiao, M-C. Frank Chang, Jason Cong, Michael Gill, Zhangqin Huang, Chunyue Liu, Glenn Reinman, Hao Wu
Article No.: 60
DOI: 10.1145/2400682.2400719

Alternative interconnects are attractive for scaling on-chip communication bandwidth in a power-efficient manner. However, efficient utilization of the bandwidth provided by these emerging interconnects still remains an open problem due to the...