ACM Transactions on Architecture and Code Optimization (TACO)

Latest Articles

An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns

GPUs provide high-bandwidth/low-latency on-chip shared memory and L1 cache to efficiently service a large number of concurrent memory requests....

The Power-optimised Software Envelope

Advances in processor design have delivered performance improvements for decades. As physical limits are reached, refinements to the same basic technologies are beginning to yield diminishing returns. Unsustainable increases in energy consumption are forcing hardware manufacturers to prioritise energy efficiency in their designs. Research suggests...

Caliper: Interference Estimator for Multi-tenant Environments Sharing Architectural Resources

We introduce Caliper, a technique for accurately estimating performance interference occurring in shared servers. Caliper overcomes the limitations of prior approaches by leveraging a micro-experiment-based technique. In contrast to state-of-the-art approaches that focus on periodically pausing co-running applications to estimate slowdown, Caliper...

Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution

Contemporary GPUs support multiple kernels to run concurrently on the same streaming multiprocessors (SMs). Recent studies have demonstrated that such...

Correct-by-Construction Parallelization of Hard Real-Time Avionics Applications on Off-the-Shelf Predictable Hardware

We present the first end-to-end modeling and compilation flow to parallelize hard real-time control applications while fully guaranteeing the respect...

Simplifying Transactional Memory Support in C++

C++ has supported a provisional version of Transactional Memory (TM) since 2015, via a technical specification. However, TM has not seen widespread adoption, and compiler vendors have been slow to implement the technical specification. We conjecture that the proposed TM support is too difficult for programmers to use, too complex for compiler...

MH Cache: A Multi-retention STT-RAM-based Low-power Last-level Cache for Mobile Hardware Rendering Systems

Mobile devices have become the most important devices in our lives. However, they are limited in battery capacity. Therefore, low-power computing is crucial for their long lifetime. Spin-transfer torque RAM (STT-RAM) has become an emerging memory technology because of its low leakage power consumption. We herein propose MH cache, a multi-retention...

Polyhedral Compilation for Multi-dimensional Stream Processing

We present a method for compilation of multi-dimensional stream processing programs from affine recurrence equations with unbounded domains into...

Toward On-chip Network Security Using Runtime Isolation Mapping

Many-cores execute a large number of diverse applications concurrently. Inter-application interference can lead to a security threat as timing channel...

A First Step Toward Using Quantum Computing for Low-level WCETs Estimations

Low-Level analysis of Worst Case Execution Time (WCET) is an important field for real-time system validation. It stands between computer architecture...

Memory-access-aware Safety and Profitability Analysis for Transformation of Accelerator-bound OpenMP Loops

Iteration Point Difference Analysis is a new static analysis framework that can be used to determine...

Morphable DRAM Cache Design for Hybrid Memory Systems

DRAM caches have emerged as an efficient new layer in the memory hierarchy to address the increasing diversity of memory components. When a small...

NEWS

TACO Goes Gold Open Access

As of July 2018, and for a four-year period, all papers published in ACM Transactions on Architecture and Code Optimization (TACO) will be published as Gold Open Access (OA) and will be free to read and share via the ACM Digital Library.

About TACO

The ACM Transactions on Architecture and Code Optimization focuses on hardware, software, and systems research spanning the fields of computer architecture and code optimization. Articles that appear in TACO present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to computer architects, hardware or software developers, system designers and tool builders are emphasized.

Exploiting Bank Conflict based Side-channel Timing Leakage of GPUs

We identify a novel fine-grained microarchitectural timing channel in the GPU's Shared Memory. By considering the timing channel caused by Shared Memory bank conflicts, we have developed a differential timing attack that can compromise table-based cryptographic algorithms, e.g., AES. We evaluate our attack method by attacking an implementation of the AES encryption algorithm that fully occupies the compute resources of the GPU. We extend our timing analysis onto the Pascal architecture. We also discuss countermeasures and experiment with a novel multi-key implementation, quantifying its resistance to our side-channel timing attack.
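To make the bank-conflict mechanism concrete, here is a minimal sketch of how a set of simultaneous shared-memory accesses maps onto banks. It is not the attack described in the abstract. It assumes a simplified model with 32 banks of 4-byte words and treats accesses to the same word as a single broadcast; the names `NUM_BANKS`, `WORD_BYTES`, and `conflict_degree` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumption: 32 shared-memory banks
WORD_BYTES = 4   # assumption: each bank serves one 4-byte word per cycle


def conflict_degree(byte_addresses):
    """Return the largest number of distinct words that fall into one bank.

    In this simplified model, a degree of 1 means the accesses can be served
    in one pass, while a degree of k means they are serialized into k passes.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


# 32 threads reading consecutive words: every thread hits a different bank.
consecutive = [WORD_BYTES * t for t in range(32)]
# Stride of two words: pairs of threads share a bank but read different words.
stride_two = [2 * WORD_BYTES * t for t in range(32)]
# Stride of 32 words: every thread lands in bank 0 with a different word.
stride_thirty_two = [32 * WORD_BYTES * t for t in range(32)]
# All threads read the same word: modeled here as a broadcast.
same_word = [0] * 32

degrees = [conflict_degree(a) for a in
           (consecutive, stride_two, stride_thirty_two, same_word)]
```

Under this model, the access pattern of a table lookup determines how many serialized passes are needed, which is the kind of data-dependent timing variation a timing channel can observe.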

BitSAD v2: Compiler Optimization and Analysis for Bitstream Computing

Bitstream computing (BC) enables complex robotic algorithms under a strict power budget. Yet, BC can expose complex design decisions. To address these challenges, we propose compiler extensions to BitSAD, a DSL for BC. Our work enables bit-level software emulation of hardware units, implements automated generation of synthesizable hardware from a program, highlights potential optimizations, and proposes compiler phases to implement them in a hardware-aware manner. Finally, we introduce population coding, a parallelization scheme for stochastic computing that decreases latency without sacrificing accuracy. This work is a series of analyses and experiments on BC designs that inform compiler extensions to BitSAD.

DNNTune: Automatic Benchmarking DNN Models for Mobile-Cloud Computing

DNN models are now being deployed in the cloud, on mobile devices, or even through mobile-cloud coordinated processing, making it a big challenge to select an optimal deployment strategy. This paper proposes a DNN tuning framework, DNNTune, which can provide layer-wise behavior analysis across a number of platforms. This paper selects 10 representative DNN models and three mobile devices to characterize the DNN models on these devices, to further assist users in finding opportunities for mobile-cloud coordinated computing. Experimental results demonstrate that DNNTune can find a coordinated deployment achieving up to 20% speedup and 15% energy saving compared with mobile-only deployment.

Chunking for Dynamic Linear Pipelines

Dynamic scheduling and dynamic creation of the pipeline structure are crucial for the efficient execution of pipelined programs. However, dynamic systems imply higher overhead than static systems, so chunking groups activities to decrease the synchronization and scheduling overhead. We present a chunking algorithm for dynamic systems that handles dynamic linear pipelines, which allow the number and duration of stages to be determined at runtime. The evaluation on 44 cores shows that dynamic chunking brings the overhead of a dynamic system down to that of an efficient static system. Therefore, dynamic chunking enables efficient and scalable execution of fine-grained workloads.
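The overhead argument behind chunking can be illustrated with a small sketch. It uses plain fixed-size chunks, not the dynamic chunking algorithm the abstract refers to, and it counts one scheduling event per (chunk, stage) pair as a stand-in for synchronization and scheduling cost. The names `chunked` and `run_pipeline` are hypothetical.

```python
def chunked(items, chunk_size):
    """Yield consecutive slices of `items` holding at most `chunk_size` elements."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def run_pipeline(items, stages, chunk_size):
    """Run every item through all stages in order, scheduling whole chunks.

    Returns the results and the number of scheduling events, where one event
    is counted each time a chunk is handed to a stage (a modeling assumption).
    """
    results = []
    scheduling_events = 0
    for chunk in chunked(items, chunk_size):
        for stage in stages:
            scheduling_events += 1
            chunk = [stage(x) for x in chunk]
        results.extend(chunk)
    return results, scheduling_events


stages = [lambda x: x + 1, lambda x: x * 2]
items = list(range(8))

fine_results, fine_events = run_pipeline(items, stages, chunk_size=1)
coarse_results, coarse_events = run_pipeline(items, stages, chunk_size=4)
```

Both runs compute the same results, but the chunked run needs 4 scheduling events instead of 16. A dynamic system would additionally have to pick chunk sizes at runtime, which this sketch does not attempt.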

Building of a Polyhedral Representation from an Instrumented Execution: Making Dynamic Analyses of non-Affine Programs Scalable

The polyhedral model is used in production compilers. Nevertheless, only a very restricted class of applications can benefit from it. Recent proposals investigated how runtime information could be used to apply polyhedral optimization on applications that do not statically fit the model. We go one step further in that direction. We propose the folding-based analysis that, from the output of an instrumented program execution, builds a compact polyhedral representation. It is able to accurately detect affine dependencies, fixed-stride memory accesses and induction variables in programs. It scales to real-life applications, which often include some non-affine dependencies and accesses in otherwise affi

Exploiting Nested MIMD-SIMD Parallelism on Heterogeneous Microprocessors

Integrated heterogeneous microprocessors provide fast CPU-GPU communication and "in-place" computation, permitting finer granularity GPU parallelism, and use of the GPU in more complex and irregular codes. This paper proposes exploiting nested parallelism, a common OpenMP program paradigm wherein SIMD loop(s) lie underneath an outer MIMD loop. Scheduling the MIMD loop on multiple CPU cores allows multiple instances of their inner SIMD loop(s) to be scheduled on the GPU, boosting GPU utilization, and parallelizing non-SIMD code. Our results on simulated and physical machines show exploiting nested MIMD-SIMD parallelism speeds up the next-best parallelization scheme per benchmark by 1.59x and 1.25x, respectively.
