Architecture and Code Optimization (TACO)


ACM Transactions on Architecture and Code Optimization (TACO), Volume 12 Issue 2, July 2015

A Joint SW/HW Approach for Reducing Register File Vulnerability
Hamed Tabkhi, Gunar Schirner
Article No.: 9
DOI: 10.1145/2733378

The Register File (RF) is a particularly vulnerable component within a processor core and at the same time a hotspot with high power density. To reduce RF vulnerability, conventional HW-only approaches such as Error Correction Codes (ECCs) or...

Reliable Integrity Checking in Multicore Processors
Arun Kanuparthi, Ramesh Karri
Article No.: 10
DOI: 10.1145/2738052

Security and reliability have become important concerns in the design of computer systems. On one hand, microarchitectural enhancements for security (such as for dynamic integrity checking of code at runtime) have been proposed. On the other hand,...

A New Memory-Disk Integrated System with HW Optimizer
Do-Heon Lee, Su-Kyung Yoon, Jung-Geun Kim, Charles C. Weems, Shin-Dug Kim
Article No.: 11
DOI: 10.1145/2738053

Current high-performance computer systems utilize a memory hierarchy of on-chip cache, main memory, and secondary storage due to differences in device characteristics. Limiting the amount of main memory causes page swap operations and duplicates...

Dynamic Shared SPM Reuse for Real-Time Multicore Embedded Systems
Morteza Mohajjel Kafshdooz, Alireza Ejlali
Article No.: 12
DOI: 10.1145/2738051

Allocating the scratchpad memory (SPM) space to tasks is a challenging problem in real-time multicore embedded systems that use shared SPM. Proper SPM space allocation is important, as it considerably influences the application worst-case...

GPU Performance and Power Tuning Using Regression Trees
Wenhao Jia, Elba Garza, Kelly A. Shaw, Margaret Martonosi
Article No.: 13
DOI: 10.1145/2736287

GPU performance and power tuning is difficult, requiring extensive user expertise and time-consuming trial and error. To accelerate design tuning, statistical design space exploration methods have been proposed. This article presents Starchart, a...

An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations
Irshad Pananilath, Aravind Acharya, Vinay Vasista, Uday Bondhugula
Article No.: 14
DOI: 10.1145/2739047

The Lattice-Boltzmann method (LBM), a promising new particle-based simulation technique for complex and multiscale fluid flows, has seen tremendous adoption in recent years in computational fluid dynamics. Even with a state-of-the-art LBM solver...

Practical Iterative Optimization for the Data Center
Shuangde Fang, Wenwen Xu, Yang Chen, Lieven Eeckhout, Olivier Temam, Yunji Chen, Chengyong Wu, Xiaobing Feng
Article No.: 15
DOI: 10.1145/2739048

Iterative optimization is a simple but powerful approach that searches for the best possible combination of compiler optimizations for a given workload. However, iterative optimization is plagued by several practical issues that prevent it from being...

Buddy SM: Sharing Pipeline Front-End for Improved Energy Efficiency in GPGPUs
Tao Zhang, Naifeng Jing, Kaiming Jiang, Wei Shu, Min-You Wu, Xiaoyao Liang
Article No.: 16
DOI: 10.1145/2744202

A modern general-purpose graphics processing unit (GPGPU) usually consists of multiple streaming multiprocessors (SMs), each having a pipeline that incorporates a group of threads executing a common instruction flow. Although SMs are designed to...

EECache: A Comprehensive Study on the Architectural Design for Energy-Efficient Last-Level Caches in Chip Multiprocessors
Hsiang-Yun Cheng, Matt Poremba, Narges Shahidi, Ivan Stalev, Mary Jane Irwin, Mahmut Kandemir, Jack Sampson, Yuan Xie
Article No.: 17
DOI: 10.1145/2756552

Power management for large last-level caches (LLCs) is important in chip multiprocessors (CMPs), as the leakage power of LLCs accounts for a significant fraction of the limited on-chip power budget. Since not all workloads running on CMPs need the...

Intercepting Functions for Memoization: A Case Study Using Transcendental Functions
Arjun Suresh, Bharath Narasimha Swamy, Erven Rohou, André Seznec
Article No.: 18
DOI: 10.1145/2751559

Memoization is the technique of saving the results of executions so that future executions can be omitted when the input set repeats. Memoization has been proposed in previous literature at the instruction, basic block, and function levels using...

SECRET: A Selective Error Correction Framework for Refresh Energy Reduction in DRAMs
Chung-Hsiang Lin, De-Yu Shen, Yi-Jung Chen, Chia-Lin Yang, Cheng-Yuan Michael Wang
Article No.: 19
DOI: 10.1145/2747876

DRAMs are used as the main memory in most computing systems today. Studies show that DRAMs contribute to a significant part of overall system power consumption. One of the main challenges in low-power DRAM design is the inevitable refresh process....

Snippets: Taking the High Road to a Low Level
Doug Simon, Christian Wimmer, Bernhard Urban, Gilles Duboscq, Lukas Stadler, Thomas Würthinger
Article No.: 20
DOI: 10.1145/2764907

When building a compiler for a high-level language, certain intrinsic features of the language must be expressed in terms of the resulting low-level operations. Complex features are often expressed by explicitly weaving together bits of low-level...

Enabling GPGPU Low-Level Hardware Explorations with MIAOW: An Open-Source RTL Implementation of a GPGPU
Raghuraman Balasubramanian, Vinay Gangadhar, Ziliang Guo, Chen-Han Ho, Cherin Joseph, Jaikrishnan Menon, Mario Paulo Drumond, Robin Paul, Sharath Prasad, Pradip Valathol, Karthikeyan Sankaralingam
Article No.: 21
DOI: 10.1145/2764908

Graphic processing unit (GPU)-based general-purpose computing is developing as a viable alternative to CPU-based computing in many domains. Today’s tools for GPU analysis include simulators like GPGPU-Sim, Multi2Sim, and Barra. While useful...

Locality-Aware Work Stealing Based on Online Profiling and Auto-Tuning for Multisocket Multicore Architectures
Quan Chen, Minyi Guo
Article No.: 22
DOI: 10.1145/2766450

Modern mainstream powerful computers adopt multisocket multicore CPU architecture and NUMA-based memory architecture. While traditional work-stealing schedulers are designed for single-socket architectures, they incur severe shared cache misses...

Section-Based Program Analysis to Reduce Overhead of Detecting Unsynchronized Thread Communication
Madan Das, Gabriel Southern, Jose Renau
Article No.: 23
DOI: 10.1145/2766451

Most systems that test and verify parallel programs, such as deterministic execution engines, data race detectors, and software transactional memory systems, require instrumenting loads and stores in an application. This can cause a very...

Aging-Aware Compilation for GP-GPUs
Atieh Lotfi, Abbas Rahimi, Luca Benini, Rajesh K. Gupta
Article No.: 24
DOI: 10.1145/2778984

General-purpose graphic processing units (GP-GPUs) offer high computational throughput using thousands of integrated processing elements (PEs). These PEs are stressed during workload execution, and negative bias temperature instability (NBTI)...

Contech: Efficiently Generating Dynamic Task Graphs for Arbitrary Parallel Programs
Brian P. Railing, Eric R. Hein, Thomas M. Conte
Article No.: 25
DOI: 10.1145/2776893

Parallel programs can be characterized by task graphs encoding instructions, memory accesses, and the parallel work’s dependencies, while representing any threading library and architecture. This article presents Contech, a high performance...