Architecture and Code Optimization (TACO)


ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers, Volume 8 Issue 4, January 2012

Introduction to the special issue on high-performance and embedded architectures and compilers
Per Stenström, Koen De Bosschere
Article No.: 18
DOI: 10.1145/2086696.2086697

ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache
Jorge Albericio, Rubén Gran, Pablo Ibáñez, Víctor Viñals, Jose María Llabería
Article No.: 19
DOI: 10.1145/2086696.2086698

Hardware data prefetch is a very well known technique for hiding memory latencies. However, in a multicore system fitted with a shared Last-Level Cache (LLC), prefetch induced by a core consumes common resources such as shared cache space and main...

An architecture-independent instruction shuffler to protect against side-channel attacks
Ali Galip Bayrak, Nikola Velickovic, Paolo Ienne, Wayne Burleson
Article No.: 20
DOI: 10.1145/2086696.2086699

Embedded cryptographic systems, such as smart cards, require secure implementations that are robust to a variety of low-level attacks. Side-Channel Attacks (SCA) exploit information such as power consumption, electromagnetic radiation...

Approximate graph clustering for program characterization
John Demme, Simha Sethumadhavan
Article No.: 21
DOI: 10.1145/2086696.2086700

An important aspect of system optimization research is the discovery of program traits or behaviors. In this paper, we present an automated method of program characterization which is able to examine and cluster program graphs, i.e., dynamic data...

Bahurupi: A polymorphic heterogeneous multi-core architecture
Mihai Pricopi, Tulika Mitra
Article No.: 22
DOI: 10.1145/2086696.2086701

Computing systems have made an irreversible transition towards parallel architectures with the emergence of multi-cores. Moreover, power and thermal limits in embedded systems mandate the deployment of many simpler cores rather than a few complex...

Compiler mitigations for time attacks on modern x86 processors
Jeroen V. Cleemput, Bart Coppens, Bjorn De Sutter
Article No.: 23
DOI: 10.1145/2086696.2086702

This paper studies and evaluates the extent to which automated compiler techniques can defend against timing-based side channel attacks on modern x86 processors. We study how modern x86 processors can leak timing information through side channels...

Compiler techniques to improve dynamic branch prediction for indirect jump and call instructions
Jason McCandless, David Gregg
Article No.: 24
DOI: 10.1145/2086696.2086703

Indirect jump instructions are used to implement multiway branch statements and virtual function calls in object-oriented languages. Branch behavior can have significant impact on program performance, but fortunately hardware predictors can...

DAPSCO: Distance-aware partially shared cache organization
Antonio García-Guirado, Ricardo Fernández-Pascual, Alberto Ros, José M. García
Article No.: 25
DOI: 10.1145/2086696.2086704

Many-core tiled CMP proposals often assume a partially shared last level cache (LLC) since this provides a good compromise between access latency and cache utilization. In this paper, we propose a novel way to map memory addresses to LLC banks...

On-the-fly structure splitting for heap objects
Zhenjiang Wang, Chenggang Wu, Pen-Chung Yew, Jianjun Li, Di Xu
Article No.: 26
DOI: 10.1145/2086696.2086705

With the advent of multicore systems, the gap between processor speed and memory latency has grown worse because of their complex interconnect. Sophisticated techniques are needed more than ever to improve an application's spatial and temporal...

Efficient liveness computation using merge sets and DJ-graphs
Dibyendu Das, B. Dupont de Dinechin, Ramakrishna Upadrasta
Article No.: 27
DOI: 10.1145/2086696.2086706

In this work we devise an efficient algorithm that computes the liveness information of program variables. The algorithm employs SSA form and DJ-graphs as representation to build Merge sets. The Merge set of node n,...

Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era
George Patsilaras, Niket K. Choudhary, James Tuck
Article No.: 28
DOI: 10.1145/2086696.2086707

Extracting high memory-level parallelism (MLP) is essential for speeding up single-threaded applications which are memory bound. At the same time, the projected amount of dark silicon (the fraction of the chip powered off) on a chip is growing....

Exploring the limits of GPGPU scheduling in control flow bound applications
Roman Malits, Evgeny Bolotin, Avinoam Kolodny, Avi Mendelson
Article No.: 29
DOI: 10.1145/2086696.2086708

GPGPUs are optimized for graphics; for that reason, the hardware is optimized for massively data-parallel applications characterized by predictable memory access patterns and little control flow. For such applications, e.g., matrix multiplication,...

FlexSig: Implementing flexible hardware signatures
Lois Orosa, Elisardo Antelo, Javier D. Bruguera
Article No.: 30
DOI: 10.1145/2086696.2086709

With the advent of chip multiprocessors, new techniques have been developed to make parallel programming easier and more reliable. New parallel programming paradigms and new methods of making the execution of programs more efficient and more...

Hardware transactional memory with software-defined conflicts
Ruben Titos-Gil, Manuel E. Acacio, Jose M. Garcia, Tim Harris, Adrian Cristal, Osman Unsal, Ibrahim Hur, Mateo Valero
Article No.: 31
DOI: 10.1145/2086696.2086710

In this paper we investigate the benefits of turning the concept of transactional conflict from its traditionally fixed definition into a variable one that can be dynamically controlled in software. We propose the extension of the atomic...

Improving performance of nested loops on reconfigurable array processors
Yongjoo Kim, Jongeun Lee, Toan X. Mai, Yunheung Paek
Article No.: 32
DOI: 10.1145/2086696.2086711

Pipelining algorithms are typically concerned with improving only the steady-state performance, or the kernel time. The pipeline setup time happens only once and therefore can be negligible compared to the kernel time. However, for Coarse-Grained...

Making wide-issue VLIW processors viable on FPGAs
Madhura Purnaprajna, Paolo Ienne
Article No.: 33
DOI: 10.1145/2086696.2086712

Soft and highly-customized processors are emerging as a common way to efficiently control the large amount of computing resources available on FPGAs. Yet, some processor architectures of choice for DSP and media applications, such as wide-issue VLIW...

On the evaluation of the impact of shared resources in multithreaded COTS processors in time-critical environments
Petar Radojković, Sylvain Girbal, Arnaud Grasset, Eduardo Quiñones, Sami Yehia, Francisco J. Cazorla
Article No.: 34
DOI: 10.1145/2086696.2086713

Commercial Off-The-Shelf (COTS) processors are now commonly used in real-time embedded systems. The characteristics of these processors fulfill system requirements in terms of time-to-market, low cost, and high performance-per-watt ratio. However,...

Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks
Leonid Domnitser, Aamer Jaleel, Jason Loew, Nael Abu-Ghazaleh, Dmitry Ponomarev
Article No.: 35
DOI: 10.1145/2086696.2086714

We propose a flexibly-partitioned cache design that either drastically weakens or completely eliminates cache-based side channel attacks. The proposed Non-Monopolizable (NoMo) cache dynamically reserves cache lines for active threads and prevents...

On the simulation of large-scale architectures using multiple application abstraction levels
Alejandro Rico, Felipe Cabarcas, Carlos Villavieja, Milan Pavlovic, Augusto Vega, Yoav Etsion, Alex Ramirez, Mateo Valero
Article No.: 36
DOI: 10.1145/2086696.2086715

Simulation is a key tool for computer architecture research. In particular, cycle-accurate simulators are extremely important for microarchitecture exploration and detailed design decisions, but they are slow and, so, not suitable for simulating...

Optimizing explicit data transfers for data parallel applications on the cell architecture
Selma Saidi, Pranav Tendulkar, Thierry Lepley, Oded Maler
Article No.: 37
DOI: 10.1145/2086696.2086716

In this paper we investigate a general approach to automate some deployment decisions for a certain class of applications on multi-core computers. We consider data-parallelizable programs that use the well-known double buffering technique to bring...

PLDS: Partitioning linked data structures for parallelism
Min Feng, Changhui Lin, Rajiv Gupta
Article No.: 38
DOI: 10.1145/2086696.2086717

Recently, parallelization of computations in the presence of dynamic data structures has shown promising potential. In this paper, we present PLDS, a system for easily expressing and efficiently exploiting parallelism in computations that are...

Polyhedral parallelization of binary code
Benoit Pradelle, Alain Ketterlin, Philippe Clauss
Article No.: 39
DOI: 10.1145/2086696.2086718

Many automatic software parallelization systems have been proposed in the past decades, but most of them are dedicated to source-to-source transformations. This paper shows that parallelizing executable programs is feasible, even if they require...

ReNIC: Architectural extension to SR-IOV I/O virtualization for efficient replication
Yaozu Dong, Yu Chen, Zhenhao Pan, Jinquan Dai, Yunhong Jiang
Article No.: 40
DOI: 10.1145/2086696.2086719

Virtualization is gaining popularity in cloud computing and has become the key enabling technology in cloud infrastructure. By replicating the virtual server state to multiple independent platforms, virtualization improves the reliability and...

Sabrewing: A lightweight architecture for combined floating-point and integer arithmetic
Tom M. Bruintjes, Karel H. G. Walters, Sabih H. Gerez, Bert Molenkamp, Gerard J. M. Smit
Article No.: 41
DOI: 10.1145/2086696.2086720

In spite of the fact that floating-point arithmetic is costly in terms of silicon area, the joint design of hardware for floating-point and integer arithmetic is seldom considered. While components like multipliers and adders can potentially be...

Seamlessly portable applications: Managing the diversity of modern heterogeneous systems
Mario Kicherer, Fabian Nowak, Rainer Buchty, Wolfgang Karl
Article No.: 42
DOI: 10.1145/2086696.2086721

Nowadays, many possible configurations of heterogeneous systems exist, posing several new challenges to application development: different types of processing units usually require individual programming models with dedicated runtime systems and...

SYRANT: SYmmetric resource allocation on not-taken and taken paths
Nathanael Premillieu, Andre Seznec
Article No.: 43
DOI: 10.1145/2086696.2086722

In the multicore era, achieving ultimate single-process performance is still an issue, e.g., for single-process workloads or for sequential sections in parallel applications. Unfortunately, despite tremendous research effort on branch prediction,...

The gradient-based cache partitioning algorithm
William Hasenplaugh, Pritpal S. Ahuja, Aamer Jaleel, Simon Steely Jr., Joel Emer
Article No.: 44
DOI: 10.1145/2086696.2086723

This paper addresses the problem of partitioning a cache between multiple concurrent threads and in the presence of hardware prefetching. Cache replacement designed to preserve temporal locality (e.g., LRU) will allocate cache resources...

The migration prefetcher: Anticipating data promotion in dynamic NUCA caches
Javier Lira, Timothy M. Jones, Carlos Molina, Antonio González
Article No.: 45
DOI: 10.1145/2086696.2086724

The exponential increase in multicore processor (CMP) cache sizes, accompanied by growing on-chip wire delays, makes it difficult to implement traditional caches with a single, uniform access latency. Non-Uniform Cache Architecture (NUCA) designs...

Thread Tranquilizer: Dynamically reducing performance variation
Kishore Kumar Pusukuri, Rajiv Gupta, Laxmi N. Bhuyan
Article No.: 46
DOI: 10.1145/2086696.2086725

To realize the performance potential of multicore systems, we must effectively manage the interactions between memory reference behavior and the operating system policies for thread scheduling and migration decisions. We observe that these...

TL-plane-based multi-core energy-efficient real-time scheduling algorithm for sporadic tasks
Dongsong Zhang, Deke Guo, Fangyuan Chen, Fei Wu, Tong Wu, Ting Cao, Shiyao Jin
Article No.: 47
DOI: 10.1145/2086696.2086726

As the energy consumption of multi-core systems becomes increasingly prominent, it's a challenge to design an energy-efficient real-time scheduling algorithm in multi-core systems for reducing the system energy consumption while guaranteeing the...

The accelerator store: A shared memory framework for accelerator-based systems
Michael J. Lyons, Mark Hempstead, Gu-Yeon Wei, David Brooks
Article No.: 48
DOI: 10.1145/2086696.2086727

This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip's high performance and power-efficient hardware accelerators. In preparation for systems...

Toward high-throughput algorithms on many-core architectures
Daniel Orozco, Elkin Garcia, Rishi Khan, Kelly Livingston, Guang R. Gao
Article No.: 49
DOI: 10.1145/2086696.2086728

Advanced many-core CPU chips already have a few hundred processing cores (e.g., 160 cores in an IBM Cyclops-64 chip), and more and more processing cores become available as computer architecture progresses. The underlying runtime systems of...

Using machine learning to improve automatic vectorization
Kevin Stock, Louis-Noël Pouchet, P. Sadayappan
Article No.: 50
DOI: 10.1145/2086696.2086729

Automatic vectorization is critical to enhancing performance of compute-intensive programs on modern processors. However, there is much room for improvement over the auto-vectorization capabilities of current production compilers through careful...

Utilizing RF-I and intelligent scheduling for better throughput/watt in a mobile GPU memory system
Kanit Therdsteerasukdi, Gyungsu Byun, Jason Cong, M. Frank Chang, Glenn Reinman
Article No.: 51
DOI: 10.1145/2086696.2086730

Smartphones and tablets are becoming more and more powerful, replacing desktops and laptops as the users' main computing system. As these systems support higher and higher resolutions with more complex 3D graphics, a high-throughput and low-power...

VSim: Simulating multi-server setups at near native hardware speed
Frederick Ryckbosch, Stijn Polfliet, Lieven Eeckhout
Article No.: 52
DOI: 10.1145/2086696.2086731

Simulating contemporary computer systems is a challenging endeavor, especially when it comes to simulating high-end setups involving multiple servers. The simulation environment needs to run complete software stacks, including operating systems,...

Writeback-aware partitioning and replacement for last-level caches in phase change main memory systems
Miao Zhou, Yu Du, Bruce Childers, Rami Melhem, Daniel Mossé
Article No.: 53
DOI: 10.1145/2086696.2086732

Phase-Change Memory (PCM) has emerged as a promising low-power main memory candidate to replace DRAM. The main problems of PCM are that writes are much slower and more power hungry than reads, write bandwidth is much lower than read bandwidth, and...

A transactional memory with automatic performance tuning
Qingping Wang, Sameer Kulkarni, John Cavazos, Michael Spear
Article No.: 54
DOI: 10.1145/2086696.2086733

A significant obstacle to the acceptance of transactional memory (TM) in real-world parallel programs is the abundance of substantially different TM algorithms. Each TM algorithm appears well-suited to certain workload characteristics, but the...

sFtree: A fully connected and deadlock-free switch-to-switch routing algorithm for fat-trees
Bartosz Bogdanski, Sven-Arne Reinemo, Frank Olaf Sem-Jacobsen, Ernst Gunnar Gran
Article No.: 55
DOI: 10.1145/2086696.2086734

Existing fat-tree routing algorithms fully exploit the path diversity of a fat-tree topology in the context of compute node traffic, but they lack support for deadlock-free and fully connected switch-to-switch communication. Such support is...