ACM Transactions on

Architecture and Code Optimization (TACO)

Latest Articles

Optimizing Remote Communication in X10

X10 is a partitioned global address space programming language that supports the notion of places; a place consists of some data and some lightweight tasks called activities. Each activity runs at a place and may invoke a place-change operation (using the at-construct) to synchronously perform some computation at another place. These place-change... (more)

MetaStrider: Architectures for Scalable Memory-centric Reduction of Sparse Data Streams

Reduction is an operation performed on the values of two or more key-value pairs that share the same key. Reduction of sparse data streams finds application in a wide variety of domains such as data and graph analytics, cybersecurity, machine learning, and HPC applications. However, these applications exhibit low locality of reference, rendering... (more)

DCMI: A Scalable Strategy for Accelerating Iterative Stencil Loops on FPGAs

Iterative Stencil Loops (ISLs) are the key kernel within a range of compute-intensive applications. To accelerate ISLs with Field Programmable Gate Arrays, it is critical to exploit parallelism (1) among elements within the same iteration and (2) across loop iterations. We propose a novel ISL acceleration scheme called Direct Computation of... (more)

A Neural Network Prefetcher for Arbitrary Memory Access Patterns

Memory prefetchers are designed to identify and prefetch specific access patterns, including spatiotemporal locality (e.g., strides, streams),... (more)

The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, Automatically

Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When the computation does not match any... (more)

Layup: Layer-adaptive and Multi-type Intermediate-oriented Memory Optimization for GPU-based CNNs

Although GPUs have emerged as the mainstream for the acceleration of convolutional neural network (CNN) training processes, they usually have limited physical memory, meaning that it is hard to train large-scale CNN models. Many methods for memory optimization have been proposed to decrease the memory consumption of CNNs and to mitigate the... (more)

Evaluating Auto-Vectorizing Compilers through Objective Withdrawal of Useful Information

The need for compilers to generate highly vectorized code is at an all-time high with the increasing vectorization capabilities of modern processors.... (more)

PIMBALL: Binary Neural Networks in Spintronic Memory

Neural networks span a wide range of applications of industrial and commercial significance. Binary neural networks (BNN) are particularly effective in trading accuracy for performance, energy efficiency, or hardware/software complexity. Here, we introduce a spintronic, re-configurable in-memory BNN... (more)


Editor-In-Chief Call for Nominations

The term of the current Editor-in-Chief (EiC) of the ACM Transactions on Architecture and Code Optimization (TACO) is coming to an end, and the ACM Publications Board has set up a nominating committee to assist the Board in selecting the next EiC.

Nominations, including self-nominations, are invited for a three-year term as TACO EiC, beginning on May 1, 2020. The EiC appointment may be renewed at most one time. This is an entirely voluntary position, but ACM will provide appropriate administrative support. READ MORE


TACO Goes Gold Open Access

As of July 2018, and for a four-year period, all papers published in ACM Transactions on Architecture and Code Optimization (TACO) will be published as Gold Open Access (OA) and will be free to read and share via the ACM Digital Library. READ MORE

About TACO

The ACM Transactions on Architecture and Code Optimization focuses on hardware, software, and systems research spanning the fields of computer architecture and code optimization. Articles that appear in TACO present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to computer architects, hardware or software developers, system designers and tool builders are emphasized. READ MORE

Exploiting Bank Conflict based Side-channel Timing Leakage of GPUs

We identify a novel fine-grained microarchitectural timing channel in the GPU's Shared Memory. By considering the timing channel caused by Shared Memory bank conflicts, we have developed a differential timing attack that can compromise table-based cryptographic algorithms, e.g., AES. We evaluate our attack method by attacking an implementation of the AES encryption algorithm that fully occupies the compute resources of the GPU. We extend our timing analysis onto the Pascal architecture. We also discuss countermeasures and experiment with a novel multi-key implementation, quantifying its resistance to our side-channel timing attack.

Tiling Optimizations for Stencil Computations Using Rewrite Rules in Lift

BitSAD v2: Compiler Optimization and Analysis for Bitstream Computing

Bitstream computing (BC) enables these complex robotic algorithms under a strict power budget. Yet, BC can expose complex design decisions. To address these challenges, we propose compiler extensions to BitSAD, DSL for BC. Our work enables bit-level software emulation of hardware units, implements automated generation of synthesizable hardware from a program, highlights potential optimizations, and proposes compiler phases to implement them in a hardware-aware manner. Finally, we introduce population coding, a parallelization scheme for stochastic computing that decreases latency without sacrificing accuracy. This work is a series of analyses and experiments on BC designs that inform compiler extensions to BitSAD.

Data-Driven Mixed Precision Sparse Matrix Vector Multiplication for GPUs

We optimize Sparse Matrix Vector multiplication (SpMV) using a mixed precision strategy MpSpMV on an Nvidia V100 GPU. The approach has three benefits: (1) reduces computation time; (2) reduces size of the input matrix, and therefore reduces data movement; and, (3) provides an opportunity for increased parallelism across computations, e.g., distinct floating point units. MpSpMV's decision to lower to single precision is data-driven, based on individual nonzero values of the sparse matrix. On large representative matrices, we obtain average speedup of 2.35X over double precision, while maintaining two or more additional significant digits of accuracy compared to single precision.

DNNTune: Automatic Benchmarking DNN Models for Mobile-Cloud Computing

DNN models are now being deployed in the cloud, on the mobile devices, or even mobile-cloud coordinate processing, making it a big challenge to select an optimal deployment strategy. This paper proposes a DNN tuning framework \dnntune, which can provide layer-wise behavior analysis across a number of platforms. This paper selects 10 representative DNN models and three mobile devices to characterize the DNN models on these devices, to further assist users finding opportunities for mobile-cloud coordinate computing. Experimental results demonstrate that \dnntune can find a coordinated deployment achieving up to 20\% speedup and 15\% energy saving comparing with mobile-only deployment.

Chunking for Dynamic Linear Pipelines

Dynamic scheduling and dynamic creation of the pipeline structure are crucial for the efficient execution of pipelined programs. However, dynamic systems imply higher overhead than static systems, so chunking groups activities to decrease the synchronization and scheduling overhead. We present a chunking algorithm for dynamic systems that handles dynamic linear pipelines, which allow the number and duration of stages to be determined at runtime. The evaluation on 44 cores shows that dynamic chunking brings the overhead of a dynamic system down to that of an efficient static system. Therefore, dynamic chunking enables efficient and scalable execution of fine-grained workloads.

Compiler-Support for Critical Data Persistence in NVM

In this paper, we present a compiler-support that automatically inserts complex instructions into kernels to achieve NVMM data-persistence based on a simple programmer directive. We only make the data resulting from the computations durable. Specific parameters are also logged periodically to track a safe restart point. Data transfer operations are decreased because we scale to the cache line block size instead of individual data elements. Our compiler-support was implemented in the LLVM tool-chain and introduces the necessary modifications to loop-intensive computational kernels to force data persistence. The experiments show that our proposed compiler support performance overheads are insignificant.

Exploring Complex Brain-Simulation Workloads on Multi-GPU Deployments

In-silico brain simulations are the de-facto tools through which computational neuroscientists seek to understand brain-function dynamics. Current brain simulators do no scale efficiently to large-scale problem sizes. The goal of this work is to explore the use of true multi-GPU acceleration through NVIDIA's GPUDirect technology on the extended Hodgkin-Huxley, model of the inferior-olivary nucleus to assess its scalability. Not only the simulation of the cells, but also the setup of the network is taken into account. Network sizes simulated range from 65K to 4M cells, with 10, 100 and 1,000 synapses/neuron are executed on 8, 16, 32 and 48 GPUs.

Building of a Polyhedral Representation from an Instrumented Execution: Making Dynamic Analyses of non-Affine Programs Scalable

The polyhedral model is used in production compilers.Nevertheless, only a very restricted class of applications can benefit from it. Recent proposals investigated how runtime information could be used to apply polyhedral optimization on applications that do not statically fit the model. We go one step further in that direction. We propose the folding-based analysis that, from the output of an instrumented program execution, builds a compact polyhedral representation. It is able to accurately detect affine dependencies, fixed-stride memory accesses and induction variables in programs.It scales to real-life applications, which often include some non-affine dependencies and accesses in otherwise affi

FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation

We present FailAmp, a novel LLVM program transformation algorithm that makes a program manipulating arrays more robust against soft-errors by guarding its array index calculations. Without FailAmp, an offset calculation error can go undetected; with FailAmp, all subsequent offset calculations are relativized, building on the faulty one to be fully detected. FailAmp can exploit ISAs such as ARM to further reduce overheads. We present a thorough evaluation of FailAmp applied to a large collection of HPC benchmarks under a fault injection campaign. FailAmp provides full soft-error detection for address calculation while incurring an average overhead of only 5%.

Exploiting Nested MIMD-SIMD Parallelism on Heterogeneous Microprocessors

Integrated heterogeneous microprocessors provide fast CPU-GPU communication and ?in-place? computation, permitting finer granularity GPU parallelism, and use of the GPU in more complex and irregular codes. This paper proposes exploiting nested parallelism, a common OpenMP program paradigm wherein SIMD loop(s) lie underneath an outer MIMD loop. Scheduling the MIMD loop on multiple CPU cores allows multiple instances of their inner SIMD loop(s) to be scheduled on the GPU, boosting GPU utilization, and parallelizing non-SIMD code. Our results on simulated and physical machines show exploiting nested MIMD-SIMD parallelism speeds up the next-best parallelization scheme per benchmark by 1.59x and 1.25x, respectively.

Flextended Tiles: a Flexible Extension of Overlapped Tiles for Polyhedral Compilation

We revisit overlapped tiling, recasting it as an affine transformation on schedule trees composable with any affine scheduling algorithm. We demonstrate how to derive tighter tile shapes with less redundant computations. Our method models the traditional ?scalene trapezoid? shapes as well as original ?right-rectangle? variants. It goes beyond the state of the art by avoiding the restriction to a domain-specific language or introducing post-pass rescheduling and custom code generation. We conduct experiments on the PolyMage benchmarks and representative iterated stencils, validating the effectiveness and general applicability of our technique on both general-purpose multicores and GPU accelerators.

A Metric-Guided Method for Discovering Impactful Features and Architectural Insights for Skylake-based Processors

The slowdown in technology scaling puts architectural features at the forefront of the innovation in modern processors. This paper presents Metric-Guided Method (MGM) ? a new structured approach for analysis of architectural enhancements. We evaluate MGM through two evaluations at the microarchitecture and the Instruction Set Architecture (ISA) levels. Our evaluation results show that simple optimizations, such as improved representation of CISC instructions, broadly improve performance, while changes in the Floating-Point execution units had mixed impact. The paper also contributes a set of specially designed micro-benchmarks that isolates features of the Skylake processor that were influential on SPEC CPU benchmarks.

All ACM Journals | See Full Journal Index

Search TACO
enter search term and/or author name