Architecture and Code Optimization (TACO)


ACM Transactions on Architecture and Code Optimization (TACO), Volume 10 Issue 4, December 2013

Exploring single and multilevel JIT compilation policy for modern machines
Michael R. Jantz, Prasad A. Kulkarni
Article No.: 22
DOI: 10.1145/2541228.2541229

Dynamic or Just-in-Time (JIT) compilation is essential to achieve high-performance emulation for programs written in managed languages, such as Java and C#. It has been observed that a conservative JIT compilation policy is most...

A circuit-architecture co-optimization framework for exploring nonvolatile memory hierarchies
Xiangyu Dong, Norman P. Jouppi, Yuan Xie
Article No.: 23
DOI: 10.1145/2541228.2541230

Many new memory technologies are available for building future energy-efficient memory hierarchies. It is necessary to have a framework that can quickly find the optimal memory technology at each hierarchy level. In this work, we first build a...

Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface
Jishen Zhao, Guangyu Sun, Gabriel H. Loh, Yuan Xie
Article No.: 24
DOI: 10.1145/2541228.2541231

The performance of graphics processing unit (GPU) systems is improving rapidly to accommodate the increasing demands of graphics and high-performance computing applications. With such a performance improvement, however, power consumption of GPU...

An efficient multicharacter transition string-matching engine based on the Aho-Corasick algorithm
Chien-Chi Chen, Sheng-De Wang
Article No.: 25
DOI: 10.1145/2541228.2541232

A string-matching engine capable of inspecting multiple characters in parallel can multiply the throughput. However, the space required for implementing a matching engine that can process multiple characters in parallel generally grows...

The design and implementation of heterogeneous multicore systems for energy-efficient speculative thread execution
Yangchun Luo, Wei-Chung Hsu, Antonia Zhai
Article No.: 26
DOI: 10.1145/2541228.2541233

With the emergence of multicore processors, various aggressive execution models have been proposed to exploit fine-grained thread-level parallelism, taking advantage of the fast on-chip interconnection communication. However, the aggressive nature...

Virtually split cache: An efficient mechanism to distribute instructions and data
Dyer Rolán, Basilio B. Fraguela, Ramón Doallo
Article No.: 27
DOI: 10.1145/2541228.2541234

First-level caches are usually split for both instructions and data instead of unifying them in a single cache. Although that approach eases the pipeline design and provides a simple way to independently treat data and instructions, its global hit...

Using in-flight chains to build a scalable cache coherence protocol
Samantika Subramaniam, Simon C. Steely, Will Hasenplaugh, Aamer Jaleel, Carl Beckmann, Tryggve Fossum, Joel Emer
Article No.: 28
DOI: 10.1145/2541228.2541235

As microprocessor designs integrate more cores, scalability of cache coherence protocols becomes a challenging problem. Most directory-based protocols avoid races by using blocking tag directories that can impact the performance of parallel...

Modeling the impact of permanent faults in caches
Daniel Sánchez, Yiannakis Sazeides, Juan M. Cebrián, José M. García, Juan L. Aragón
Article No.: 29
DOI: 10.1145/2541228.2541236

The traditional performance cost benefits we have enjoyed for decades from technology scaling are challenged by several critical constraints including reliability. Increases in static and dynamic variations are leading to higher probability of...

Automatic parallelization of fine-grained metafunctions on a chip multiprocessor
Sanghoon Lee, James Tuck
Article No.: 30
DOI: 10.1145/2541228.2541237

Due to the importance of reliability and security, prior studies have proposed inlining metafunctions into applications for detecting bugs and security vulnerabilities. However, because these software techniques add frequent, fine-grained...

Dynamic microarchitectural adaptation using machine learning
Christophe Dubach, Timothy M. Jones, Edwin V. Bonilla
Article No.: 31
DOI: 10.1145/2541228.2541238

Adaptive microarchitectures are a promising solution for designing high-performance, power-efficient microprocessors. They offer the ability to tailor computational resources to the specific requirements of different programs or program phases....

E3CC: A memory error protection scheme with novel address mapping for subranked and low-power memories
Long Chen, Yanan Cao, Zhao Zhang
Article No.: 32
DOI: 10.1145/2541228.2541239

This study presents and evaluates E3CC (Enhanced Embedded ECC), a full design and implementation of a generic embedded ECC scheme that enables power-efficient error protection for subranked memory systems. It incorporates a novel...

Temporal-based multilevel correlating inclusive cache replacement
Yingying Tian, Samira M. Khan, Daniel A. Jiménez
Article No.: 33
DOI: 10.1145/2541228.2555290

Inclusive caches have been widely used in Chip Multiprocessors (CMPs) to simplify cache coherence. However, they have poor performance compared with noninclusive caches not only because of the limited capacity of the entire cache hierarchy but...

Hardware support for accurate per-task energy metering in multicore systems
Qixiao Liu, Miquel Moreto, Victor Jimenez, Jaume Abella, Francisco J. Cazorla, Mateo Valero
Article No.: 34
DOI: 10.1145/2541228.2555291

Accurately determining the energy consumed by each task in a system will become of prominent importance in future multicore-based systems because it offers several benefits, including (i) better application energy/performance optimizations, (ii)...

Tile size selection revisited
Sanyam Mehta, Gautham Beeraka, Pen-Chung Yew
Article No.: 35
DOI: 10.1145/2541228.2555292

Loop tiling is a widely used loop transformation to enhance data locality and allow data reuse. In the tiled code, however, tiles of different sizes can lead to significant variation in performance. Thus, selection of an optimal tile size is...

Fast pattern-specific routing for fat tree networks
Bogdan Prisacari, German Rodriguez, Cyriel Minkenberg, Torsten Hoefler
Article No.: 36
DOI: 10.1145/2541228.2555293

In the context of eXtended Generalized Fat Tree (XGFT) topologies, widely used in HPC and datacenter network designs, we propose a generic method, based on Integer Linear Programming (ILP), to efficiently determine optimal routes for arbitrary...

Selecting representative benchmark inputs for exploring microprocessor design spaces
Maximilien B. Breughe, Lieven Eeckhout
Article No.: 37
DOI: 10.1145/2541228.2555294

The design process of a microprocessor requires representative workloads to steer the search process toward an optimum design point for the target application domain. However, considering a broad set of workloads to cover the large space of...

Information flow tracking meets just-in-time compilation
Christoph Kerschbaumer, Eric Hennigan, Per Larsen, Stefan Brunthaler, Michael Franz
Article No.: 38
DOI: 10.1145/2541228.2555295

Web applications are vulnerable to cross-site scripting attacks that enable data thefts. Information flow tracking in web browsers can prevent communication of sensitive data to unintended recipients and thereby stop such data thefts....

Time- and space-efficient flow-sensitive points-to analysis
Rupesh Nasre
Article No.: 39
DOI: 10.1145/2541228.2555296

Compilation of real-world programs often requires hours. The term nightly build known to industrial researchers is an artifact of long compilation times. Our goal is to reduce the absolute analysis times for large C codes (of the order of...

Boosting timestamp-based transactional memory by exploiting hardware cycle counters
Wenjia Ruan, Yujie Liu, Michael Spear
Article No.: 40
DOI: 10.1145/2541228.2555297

Time-based transactional memories typically rely on a shared memory counter to ensure consistency. Unfortunately, such a counter can become a bottleneck. In this article, we identify properties of hardware cycle counters that allow their use in...

ReSense: Mapping dynamic workloads of colocated multithreaded applications using resource sensitivity
Tanima Dey, Wei Wang, Jack W. Davidson, Mary Lou Soffa
Article No.: 41
DOI: 10.1145/2541228.2555298

To utilize the full potential of modern chip multiprocessors and obtain scalable performance improvements, it is critical to mitigate resource contention created by multithreaded workloads. In this article, we describe ReSense, the first runtime...

Techniques to improve performance in requester-wins hardware transactional memory
Adrià Armejach, Ruben Titos-Gil, Anurag Negi, Osman S. Unsal, Adrián Cristal
Article No.: 42
DOI: 10.1145/2541228.2555299

The simplicity of requester-wins Hardware Transactional Memory (HTM) makes it easy to incorporate in existing chip multiprocessors. Hence, such systems are expected to be widely available in the near future. Unfortunately, these implementations...

Reducing DRAM row activations with eager read/write clustering
Myeongjae Jeon, Conglong Li, Alan L. Cox, Scott Rixner
Article No.: 43
DOI: 10.1145/2541228.2555300

This article describes and evaluates a new approach to optimizing DRAM performance and energy consumption that is based on eagerly writing dirty cache lines to DRAM. Under this approach, many dirty cache lines are written to DRAM before they are...

HPar: A practical parallel parser for HTML--taming HTML complexities for parallel parsing
Zhijia Zhao, Michael Bebenita, Dave Herman, Jianhua Sun, Xipeng Shen
Article No.: 44
DOI: 10.1145/2541228.2555301

Parallelizing HTML parsing is challenging due to the complexities of HTML documents and the inherent dependencies in its parsing algorithm. As a result, despite numerous studies in parallel parsing, HTML parsing remains sequential today. It forms...

Easy, fast, and energy-efficient object detection on heterogeneous on-chip architectures
Ehsan Totoni, Mert Dikmen, María Jesús Garzarán
Article No.: 45
DOI: 10.1145/2541228.2555302

We optimize a visual object detection application (that uses Vision Video Library kernels) and show that OpenCL is a unified programming paradigm that can provide high performance when running on the Ivy Bridge heterogeneous on-chip architecture....

ARI: Adaptive LLC-memory traffic management
Viacheslav V. Fedorov, Sheng Qiu, A. L. Narasimha Reddy, Paul V. Gratz
Article No.: 46
DOI: 10.1145/2543697

Decreasing the traffic from the CPU LLC to main memory is a very important issue in modern systems. Recent work focuses on cache misses, overlooking the impact of writebacks on the total memory traffic, energy consumption, IPC, and so forth....

Accelerating an application domain with specialized functional units
Cecilia González-Álvarez, Jennifer B. Sartor, Carlos Álvarez, Daniel Jiménez-González, Lieven Eeckhout
Article No.: 47
DOI: 10.1145/2541228.2555303

Hardware specialization has received renewed interest recently as chips are hitting power limits. Chip designers of traditional processor architectures have primarily focused on general-purpose computing, partially due to time-to-market pressure...

Revisiting memory management on virtualized environments
Xiaolin Wang, Lingmei Weng, Zhenlin Wang, Yingwei Luo
Article No.: 48
DOI: 10.1145/2541228.2555304

With the evolvement of hardware, 64-bit Central Processing Units (CPUs) and 64-bit Operating Systems (OSs) have dominated the market. This article investigates the performance of virtual memory management of Virtual Machines (VMs) with a large...

PCantorSim: Accelerating parallel architecture simulation through fractal-based sampling
Chuntao Jiang, Zhibin Yu, Hai Jin, Chengzhong Xu, Lieven Eeckhout, Wim Heirman, Trevor E. Carlson, Xiaofei Liao
Article No.: 49
DOI: 10.1145/2541228.2555305

Computer architects rely heavily on microarchitecture simulation to evaluate design alternatives. Unfortunately, cycle-accurate simulation is extremely slow, being at least 4 to 6 orders of magnitude slower than real hardware. This longstanding...

Profile-guided transaction coalescing—lowering transactional overheads by merging transactions
Srđan Stipić, Vesna Smiljković, Osman Unsal, Adrián Cristal, Mateo Valero
Article No.: 50
DOI: 10.1145/2541228.2555306

Previous studies in software transactional memory mostly focused on reducing the overhead of transactional read and write operations. In this article, we introduce transaction coalescing, a profile-guided compiler optimization technique...

WADE: Writeback-aware dynamic cache management for NVM-based main memory system
Zhe Wang, Shuchang Shan, Ting Cao, Junli Gu, Yi Xu, Shuai Mu, Yuan Xie, Daniel A. Jiménez
Article No.: 51
DOI: 10.1145/2541228.2555307

Emerging Non-Volatile Memory (NVM) technologies are explored as potential alternatives to traditional SRAM/DRAM-based memory architecture in future microprocessor design. One of the major disadvantages for NVM is the latency and energy overhead...

C1C: A configurable, compiler-guided STT-RAM L1 cache
Yong Li, Yaojun Zhang, Hai Li, Yiran Chen, Alex K. Jones
Article No.: 52
DOI: 10.1145/2541228.2555308

Spin-Transfer Torque RAM (STT-RAM), a promising alternative to SRAM for reducing leakage power consumption, has been widely studied to mitigate the impact of its asymmetrically long write latency. Recently, STT-RAM has been proposed for L1 caches...

Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential
Naznin Fauzia, Venmugil Elango, Mahesh Ravishankar, J. Ramanujam, Fabrice Rastello, Atanas Rountev, Louis-Noël Pouchet, P. Sadayappan
Article No.: 53
DOI: 10.1145/2541228.2555309

Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper, while the...

Designing a practical data filter cache to improve both energy efficiency and performance
Alen Bardizbanyan, Magnus Själander, David Whalley, Per Larsson-Edefors
Article No.: 54
DOI: 10.1145/2541228.2555310

Conventional Data Filter Cache (DFC) designs improve processor energy efficiency, but degrade performance. Furthermore, the single-cycle line transfer suggested in prior studies adversely affects Level-1 Data Cache (L1 DC) area and energy...

GPU code generation for ODE-based applications with phased shared-data access patterns
Andrei Hagiescu, Bing Liu, R. Ramanathan, Sucheendra K. Palaniappan, Zheng Cui, Bipasa Chattopadhyay, P. S. Thiagarajan, Weng-Fai Wong
Article No.: 55
DOI: 10.1145/2541228.2555311

We present a novel code generation scheme for GPUs. Its key feature is the platform-aware generation of a heterogeneous pool of threads. This exposes more data-sharing opportunities among the concurrent threads and reduces the memory requirements...

TornadoNoC: A lightweight and scalable on-chip network architecture for the many-core era
Junghee Lee, Chrysostomos Nicopoulos, Hyung Gyu Lee, Jongman Kim
Article No.: 56
DOI: 10.1145/2541228.2555312

The rapid emergence of Chip Multi-Processors (CMP) as the de facto microprocessor archetype has highlighted the importance of scalable and efficient on-chip networks. Packet-based Networks-on-Chip (NoC) are gradually cementing themselves as the...

A system architecture, processor, and communication protocol for secure implants
Christos Strydis, Robert M. Seepers, Pedro Peris-Lopez, Dimitrios Siskos, Ioannis Sourdis
Article No.: 57
DOI: 10.1145/2541228.2555313

Secure and energy-efficient communication between Implantable Medical Devices (IMDs) and authorized external users is attracting increasing attention these days. However, there currently exists no systematic approach to the problem, while...

Fast modulo scheduler utilizing patternized routes for coarse-grained reconfigurable architectures
Wonsub Kim, Yoonseo Choi, Haewoo Park
Article No.: 58
DOI: 10.1145/2541228.2555314

Coarse-Grained Reconfigurable Architectures (CGRAs) present a potential of high compute throughput with energy efficiency. A CGRA consists of an array of Functional Units (FUs), which communicate with each other through an interconnect network...

JIT technology with C/C++: Feedback-directed dynamic recompilation for statically compiled languages
Dorit Nuzman, Revital Eres, Sergei Dyshel, Marcel Zalmanovici, Jose Castanos
Article No.: 59
DOI: 10.1145/2541228.2555315

The growing gap between the advanced capabilities of static compilers as reflected in benchmarking results and the actual performance that users experience in real-life scenarios makes client-side dynamic optimization technologies imperative to...

Automatic data allocation and buffer management for multi-GPU machines
Thejas Ramashekar, Uday Bondhugula
Article No.: 60
DOI: 10.1145/2544100

Multi-GPU machines are being increasingly used in high-performance computing. Each GPU in such a machine has its own memory and does not share the address space either with the host CPU or other GPUs. Hence, applications utilizing multiple GPUs...

Analysis of dependence tracking algorithms for task dataflow execution
Hans Vandierendonck, George Tzenakis, Dimitrios S. Nikolopoulos
Article No.: 61
DOI: 10.1145/2541228.2555316

Processor architectures have taken a turn toward many-core processors, which integrate multiple processing cores on a single chip to increase overall performance, and there are no signs that this trend will stop in the near future. Many-core...

Evaluator-executor transformation for efficient pipelining of loops with conditionals
Yeonghun Jeong, Seongseok Seo, Jongeun Lee
Article No.: 62
DOI: 10.1145/2541228.2555317

Control divergence poses many problems in parallelizing loops. While predicated execution is commonly used to convert control dependence into data dependence, it often incurs high overhead because it allocates resources equally for both branches...

A decoupled non-SSA global register allocation using bipartite liveness graphs
Rajkishore Barik, Jisheng Zhao, Vivek Sarkar
Article No.: 63
DOI: 10.1145/2544101

Register allocation is an essential optimization for all compilers. A number of sophisticated register allocation algorithms have been developed over the years. The two fundamental classes of register allocation algorithms used in modern compilers...

Reducing instruction fetch energy in multi-issue processors
Peter Gavin, David Whalley, Magnus Själander
Article No.: 64
DOI: 10.1145/2541228.2555318

The need to minimize power while maximizing performance has led to recent developments of powerful superscalar designs targeted at embedded and portable use. Instruction fetch is responsible for a significant fraction of microprocessor power and...

List of distinguished reviewers ACM TACO

Article No.: 65
DOI: 10.1145/2560216