ACM Transactions on Architecture and Code Optimization (TACO)

Latest Articles

An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns

GPUs provide high-bandwidth/low-latency on-chip shared memory and L1 cache to efficiently service a large number of concurrent memory requests....

The Power-optimised Software Envelope

Advances in processor design have delivered performance improvements for decades. As physical limits are reached, refinements to the same basic technologies are beginning to yield diminishing returns. Unsustainable increases in energy consumption are forcing hardware manufacturers to prioritise energy efficiency in their designs. Research suggests...

Caliper: Interference Estimator for Multi-tenant Environments Sharing Architectural Resources

We introduce Caliper, a technique for accurately estimating performance interference occurring in shared servers. Caliper overcomes the limitations of prior approaches by leveraging a micro-experiment-based technique. In contrast to state-of-the-art approaches that focus on periodically pausing co-running applications to estimate slowdown, Caliper...

Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution

Contemporary GPUs support multiple kernels to run concurrently on the same streaming multiprocessors (SMs). Recent studies have demonstrated that such...

Correct-by-Construction Parallelization of Hard Real-Time Avionics Applications on Off-the-Shelf Predictable Hardware

We present the first end-to-end modeling and compilation flow to parallelize hard real-time control applications while fully guaranteeing the respect...

Simplifying Transactional Memory Support in C++

C++ has supported a provisional version of Transactional Memory (TM) since 2015, via a technical specification. However, TM has not seen widespread adoption, and compiler vendors have been slow to implement the technical specification. We conjecture that the proposed TM support is too difficult for programmers to use, too complex for compiler...

MH Cache: A Multi-retention STT-RAM-based Low-power Last-level Cache for Mobile Hardware Rendering Systems

Mobile devices have become the most important devices in our lives. However, they are limited in battery capacity. Therefore, low-power computing is crucial for a long battery lifetime. Spin-transfer torque RAM (STT-RAM) has become an emerging memory technology because of its low leakage power consumption. We herein propose MH cache, a multi-retention...

Polyhedral Compilation for Multi-dimensional Stream Processing

We present a method for compilation of multi-dimensional stream processing programs from affine recurrence equations with unbounded domains into...

Toward On-chip Network Security Using Runtime Isolation Mapping

Many-cores execute a large number of diverse applications concurrently. Inter-application interference can lead to a security threat as timing channel...

A First Step Toward Using Quantum Computing for Low-level WCETs Estimations

Low-level analysis of Worst Case Execution Time (WCET) is an important field for real-time system validation. It stands between computer architecture...

Memory-access-aware Safety and Profitability Analysis for Transformation of Accelerator-bound OpenMP Loops

Iteration Point Difference Analysis is a new static analysis framework that can be used to determine...

Morphable DRAM Cache Design for Hybrid Memory Systems

DRAM caches have emerged as an efficient new layer in the memory hierarchy to address the increasing diversity of memory components. When a small...


TACO Goes Gold Open Access

As of July 2018, and for a four-year period, all papers published in ACM Transactions on Architecture and Code Optimization (TACO) will be published as Gold Open Access (OA) and will be free to read and share via the ACM Digital Library.

About TACO

The ACM Transactions on Architecture and Code Optimization focuses on hardware, software, and systems research spanning the fields of computer architecture and code optimization. Articles that appear in TACO present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to computer architects, hardware or software developers, system designers and tool builders are emphasized.

Exploiting Bank Conflict based Side-channel Timing Leakage of GPUs

We identify a novel fine-grained microarchitectural timing channel in the GPU's Shared Memory. By considering the timing channel caused by Shared Memory bank conflicts, we have developed a differential timing attack that can compromise table-based cryptographic algorithms, e.g., AES. We evaluate our attack method by attacking an implementation of the AES encryption algorithm that fully occupies the compute resources of the GPU. We extend our timing analysis onto the Pascal architecture. We also discuss countermeasures and experiment with a novel multi-key implementation, quantifying its resistance to our side-channel timing attack.

Optimizing Remote Communication in X10

In this paper, we present AT-Com, a scheme to optimize X10 code with place-change operations. AT-Com consists of two inter-related new optimizations: (i) AT-Opt, which minimizes the amount of data serialized and communicated during place-change operations, and (ii) AT-Pruning, which identifies and elides redundant place-change operations and executes place-change operations in parallel. We have implemented AT-Com in the x10v2.6.0 compiler and tested it over the IMSuite benchmark kernels. Compared to the current X10 compiler, the AT-Com optimized code achieved geometric mean speedups of 18.72x and 17.83x on a four-node (32 cores/node) Intel system and a two-node (16 cores/node) AMD system, respectively.

PIMBALL: Binary Neural Networks in Spintronic Memory

The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, Automatically

Deep learning frameworks automate the deployment and hardware acceleration of models represented as DAGs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When the computation does not match any predefined library call, custom operators must be implemented, often at high engineering cost and performance penalty, limiting the pace of innovation. To address this productivity gap, we propose and evaluate: a DSL with a tensor notation close to the mathematics of deep learning; a JIT optimizing compiler based on the polyhedral framework; carefully coordinated linear optimization and evolutionary algorithms to synthesize high-performance CUDA kernels.

MetaStrider: Architectures for Scalable Memory Centric Reduction of Sparse Data Streams

Reduction is an operation performed on the values of two or more key-value pairs that share the same key. Reduction of sparse data streams finds application in a wide variety of domains such as data and graph analytics, cybersecurity, machine learning and HPC applications. However, these applications exhibit low locality of reference, rendering traditional architectures and data representations inefficient. This paper presents MetaStrider, a significant algorithmic and architectural enhancement to the state-of-the-art, SuperStrider. Furthermore, these enhancements enable a variety of parallel, memory-centric architectures that we propose, resulting in demonstrated performance that scales near-linearly with available memory-level parallelism.

A Neural Network Prefetcher for Arbitrary Memory Access Patterns

Common memory prefetchers are designed to target specific memory access patterns, including spatio-temporal locality, recurring patterns, and irregular patterns. In this paper, we propose a conceptual neural network (NN) prefetcher that dynamically adapts to arbitrary access patterns to capture semantic locality. Leveraging recent advances in machine learning, the proposed NN prefetcher correlates program context with memory accesses using online training, and enables tapping into previously undetected access patterns. We present an architectural implementation of our prefetcher and evaluate it over SPEC2006, Graph500, and other kernels, showing it delivers up to 30% speedup over SPEC2006 and up to 4.4x speedup on some kernels.

Chunking for Dynamic Linear Pipelines

Dynamic scheduling and dynamic creation of the pipeline structure are crucial for the efficient execution of pipelined programs. However, dynamic systems imply higher overhead than static systems, so chunking groups activities to decrease the synchronization and scheduling overhead. We present a chunking algorithm for dynamic systems that handles dynamic linear pipelines, which allow the number and duration of stages to be determined at runtime. The evaluation on 44 cores shows that dynamic chunking brings the overhead of a dynamic system down to that of an efficient static system. Therefore, dynamic chunking enables efficient and scalable execution of fine-grained workloads.

Layup: Layer-Adaptive and Multi-Type Intermediate-Oriented Memory Optimization for GPU-Based CNNs

Evaluating Auto-Vectorizing Compilers through Objective Withdrawal of Useful Information

The information that compilers have at their disposal is instrumental for obtaining good auto-vectorization optimizations. However, the exact information available at compile-time varies greatly, as does the resulting performance. In this paper, we propose a novel method for evaluating the auto-vectorization capability of compilers by objectively withdrawing and withholding information that would otherwise aid the compiler in the auto-vectorization process. As such, our approach is orthogonal to well-known frameworks such as the Test Suite for Vectorizing Compilers (TSVC), and thus re-aligns compiler evaluations to reflect real-world conditions more realistically.

Building of a Polyhedral Representation from an Instrumented Execution: Making Dynamic Analyses of non-Affine Programs Scalable

The polyhedral model is used in production compilers. Nevertheless, only a very restricted class of applications can benefit from it. Recent proposals investigated how runtime information could be used to apply polyhedral optimization on applications that do not statically fit the model. We go one step further in that direction. We propose the folding-based analysis that, from the output of an instrumented program execution, builds a compact polyhedral representation. It is able to accurately detect affine dependencies, fixed-stride memory accesses and induction variables in programs. It scales to real-life applications, which often include some non-affine dependencies and accesses in otherwise affi

DCMI: A Scalable Strategy for Accelerating Iterative Stencil Loops on FPGAs

Iterative Stencil Loops (ISLs) are the key kernel within a range of compute-intensive applications. To accelerate ISLs with FPGAs, it is critical to exploit parallelism (1) among elements within the same iteration and (2) across loop iterations. We propose a novel ISL acceleration scheme called Direct Computation of Multiple Iterations (DCMI), which improves upon prior work by pre-computing the effective stencil coefficients after a number of iterations at design time, resulting in accelerators that use minimal on-chip memory and avoid redundant computation. This enables DCMI to improve throughput by up to 7.7x compared to the state-of-the-art cone-based architecture.
