ACM Transactions on Architecture and Code Optimization (TACO)

Latest Articles

Optimizing Remote Communication in X10

X10 is a partitioned global address space programming language that supports the notion of places; a place consists of some data and some lightweight tasks called activities. Each activity runs at a place and may invoke a place-change operation (using the at-construct) to synchronously perform some computation at another place. These place-change... (more)

MetaStrider: Architectures for Scalable Memory-centric Reduction of Sparse Data Streams

Reduction is an operation performed on the values of two or more key-value pairs that share the same key. Reduction of sparse data streams finds application in a wide variety of domains such as data and graph analytics, cybersecurity, machine learning, and HPC applications. However, these applications exhibit low locality of reference, rendering... (more)
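The reduction operation described above has simple semantics even though MetaStrider's contribution is a hardware architecture for it. As a plain illustration of what "reducing key-value pairs that share the same key" means (a dict-based sketch, not the paper's design), consider:

```python
from collections import defaultdict

def reduce_stream(pairs, op=lambda a, b: a + b):
    """Reduce a stream of (key, value) pairs: values sharing a key are combined."""
    result = defaultdict(lambda: None)
    for key, value in pairs:
        # First occurrence of a key stores the value; later ones are combined.
        result[key] = value if result[key] is None else op(result[key], value)
    return dict(result)

# Sparse stream: keys arrive in no particular order (low locality of reference).
stream = [("a", 1), ("b", 4), ("a", 2), ("c", 7), ("b", 1)]
print(reduce_stream(stream))  # {'a': 3, 'b': 5, 'c': 7}
```

The reduction operator is a parameter, so the same sketch covers sums, maxima, or any other associative combine.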

DCMI: A Scalable Strategy for Accelerating Iterative Stencil Loops on FPGAs

Iterative Stencil Loops (ISLs) are the key kernel within a range of compute-intensive applications. To accelerate ISLs with Field Programmable Gate Arrays, it is critical to exploit parallelism (1) among elements within the same iteration and (2) across loop iterations. We propose a novel ISL acceleration scheme called Direct Computation of... (more)
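For readers unfamiliar with the term, an iterative stencil loop repeatedly updates every element of an array from a fixed neighborhood of elements. A minimal 3-point Jacobi stencil in NumPy (an illustrative example, not taken from the paper) shows the two parallelism sources named above: elements within one iteration are independent, and successive iterations can be pipelined:

```python
import numpy as np

def jacobi_1d(a, iterations):
    """3-point Jacobi stencil: each interior element becomes the mean of
    itself and its two neighbors; boundary elements are held fixed."""
    for _ in range(iterations):
        b = a.copy()
        # All interior updates read only the previous iteration's values,
        # so they are independent of one another (intra-iteration parallelism).
        b[1:-1] = (a[:-2] + a[1:-1] + a[2:]) / 3.0
        a = b
    return a

print(jacobi_1d(np.array([0.0, 3.0, 0.0]), 1))  # [0. 1. 0.]
```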

A Neural Network Prefetcher for Arbitrary Memory Access Patterns

Memory prefetchers are designed to identify and prefetch specific access patterns, including spatiotemporal locality (e.g., strides, streams),... (more)

The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, Automatically

Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When the computation does not match any... (more)

Layup: Layer-adaptive and Multi-type Intermediate-oriented Memory Optimization for GPU-based CNNs

Although GPUs have emerged as the mainstream for the acceleration of convolutional neural network (CNN) training processes, they usually have limited physical memory, meaning that it is hard to train large-scale CNN models. Many methods for memory optimization have been proposed to decrease the memory consumption of CNNs and to mitigate the... (more)

Evaluating Auto-Vectorizing Compilers through Objective Withdrawal of Useful Information

The need for compilers to generate highly vectorized code is at an all-time high with the increasing vectorization capabilities of modern processors.... (more)

PIMBALL: Binary Neural Networks in Spintronic Memory

Neural networks span a wide range of applications of industrial and commercial significance. Binary neural networks (BNN) are particularly effective in trading accuracy for performance, energy efficiency, or hardware/software complexity. Here, we introduce a spintronic, re-configurable in-memory BNN... (more)

Exploiting Bank Conflict-based Side-channel Timing Leakage of GPUs

To prevent information leakage during program execution, modern software cryptographic implementations target constant-time functions, where the number... (more)

BitSAD v2: Compiler Optimization and Analysis for Bitstream Computing

Computer vision and machine learning algorithms operating under a strict power budget require an alternate computing paradigm. While bitstream computing (BC) satisfies these constraints, creating BC systems is difficult. To address the design challenges, we propose compiler extensions to BitSAD, a DSL for BC. Our work enables bit-level software... (more)

Chunking for Dynamic Linear Pipelines

Dynamic scheduling and dynamic creation of the pipeline structure are crucial for efficient execution of pipelined programs. Nevertheless, dynamic systems imply higher overhead than static systems. Chunking, which groups activities together, is therefore key to decreasing synchronization and scheduling overhead. We present a chunking algorithm for... (more)
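The general idea of chunking can be sketched generically (this is an illustration of the principle, not the paper's algorithm): if the scheduler is invoked once per chunk rather than once per activity, the per-activity overhead is amortized over the chunk size.

```python
def chunked(stream, chunk_size):
    """Group individual activities into fixed-size chunks so scheduling
    and synchronization happen once per chunk, not once per activity."""
    chunk = []
    for item in stream:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:          # flush the final, possibly partial, chunk
        yield chunk

# A two-stage pipeline processed chunk by chunk: 3 scheduling events for 7 items.
stage1 = lambda x: x * 2
stage2 = lambda x: x + 1
out = []
for chunk in chunked(range(7), 3):
    out.extend(stage2(stage1(x)) for x in chunk)
print(out)  # [1, 3, 5, 7, 9, 11, 13]
```

The trade-off the abstract alludes to is visible here: larger chunks mean fewer scheduler invocations but coarser load balancing.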


Editor-In-Chief Call for Nominations

The term of the current Editor-in-Chief (EiC) of the ACM Transactions on Architecture and Code Optimization (TACO) is coming to an end, and the ACM Publications Board has set up a nominating committee to assist the Board in selecting the next EiC.

Nominations, including self-nominations, are invited for a three-year term as TACO EiC, beginning on May 1, 2020. The EiC appointment may be renewed at most one time. This is an entirely voluntary position, but ACM will provide appropriate administrative support.


TACO Goes Gold Open Access

As of July 2018, and for a four-year period, all papers published in ACM Transactions on Architecture and Code Optimization (TACO) will be published as Gold Open Access (OA) and will be free to read and share via the ACM Digital Library.

About TACO

The ACM Transactions on Architecture and Code Optimization focuses on hardware, software, and systems research spanning the fields of computer architecture and code optimization. Articles that appear in TACO present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to computer architects, hardware or software developers, system designers and tool builders are emphasized.

Tiling Optimizations for Stencil Computations Using Rewrite Rules in Lift

Informed Prefetching for Indirect Memory Accesses

Indirect memory accesses have irregular access patterns that limit the performance of conventional software- and hardware-based prefetchers. To address this problem, we propose the Array Tracking Prefetcher (ATP), which tracks array-based indirect memory accesses using a novel combination of software and hardware. ATP yields an average speedup of 2.17 compared to a single core without prefetching. By contrast, the speedups for conventional software- and hardware-based prefetching are 1.84 and 1.32, respectively. For four cores, the average speedup for ATP is 1.85, while the corresponding speedups for software- and hardware-based prefetching are 1.60 and 1.25, respectively.
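The access pattern ATP targets is the indirect gather `data[idx[i]]`. A small NumPy illustration (ATP itself is a hardware/software mechanism; this only shows why the pattern defeats stride prefetchers) makes the irregularity concrete:

```python
import numpy as np

# Indirect gather: data[idx[i]]. The index array idx is traversed
# sequentially (easy for a stride prefetcher), but the resulting data
# addresses follow no fixed stride, which is what defeats conventional
# prefetchers.
data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
idx = np.array([4, 0, 3, 1])

gathered = data[idx]
print(gathered)  # [50. 10. 40. 20.]

# A tracking prefetcher in the spirit of ATP would peek ahead in idx and
# issue prefetches for data[idx[i + d]] some distance d before use.
```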

Data-Driven Mixed Precision Sparse Matrix Vector Multiplication for GPUs

We optimize Sparse Matrix Vector multiplication (SpMV) using a mixed-precision strategy, MpSpMV, on an Nvidia V100 GPU. The approach has three benefits: (1) it reduces computation time; (2) it reduces the size of the input matrix and therefore reduces data movement; and (3) it provides an opportunity for increased parallelism across computations, e.g., distinct floating-point units. MpSpMV's decision to lower to single precision is data-driven, based on the individual nonzero values of the sparse matrix. On large representative matrices, we obtain an average speedup of 2.35X over double precision, while maintaining two or more additional significant digits of accuracy compared to single precision.
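The per-nonzero precision decision can be sketched on a CSR matrix. Note the magnitude-threshold rule below is an illustrative assumption for the sketch, not necessarily MpSpMV's actual criterion; the point is that each nonzero independently chooses its precision while accumulation stays in double:

```python
import numpy as np

def mixed_precision_spmv(values, cols, row_ptr, x, threshold=1.0):
    """Sketch of data-driven mixed-precision SpMV on a CSR matrix.
    Nonzeros below a (hypothetical) magnitude threshold are multiplied in
    float32; the rest in float64. Both partial sums accumulate in float64."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=np.float64)
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            v, xj = values[k], x[cols[k]]
            if abs(v) < threshold:
                acc += float(np.float32(v) * np.float32(xj))  # single precision
            else:
                acc += v * xj                                  # double precision
        y[i] = acc
    return y

# 2x2 CSR matrix [[2, 0], [0.5, 3]] times x = [1, 1]
values = np.array([2.0, 0.5, 3.0])
cols = [0, 0, 1]
row_ptr = [0, 1, 3]
print(mixed_precision_spmv(values, cols, row_ptr, np.array([1.0, 1.0])))
```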

DNNTune: Automatic Benchmarking DNN Models for Mobile-Cloud Computing

DNN models are now being deployed in the cloud, on mobile devices, or across coordinated mobile-cloud processing, making it a big challenge to select an optimal deployment strategy. This paper proposes a DNN tuning framework, DNNTune, which can provide layer-wise behavior analysis across a number of platforms. This paper selects 10 representative DNN models and three mobile devices to characterize the DNN models on these devices and to further assist users in finding opportunities for mobile-cloud coordinated computing. Experimental results demonstrate that DNNTune can find a coordinated deployment achieving up to 20% speedup and 15% energy saving compared with mobile-only deployment.

ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0

Despite key technological advancements in racetrack memories (RMs) in the last decade, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. This paper presents data-placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime, thereby minimizing the number of shifts. We present an ILP formulation for data placement in RMs and revisit existing heuristics. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search that reduces the number of shifts by up to 52.5%, outperforming the state of the art by up to 16.1%.

Compiler-Support for Critical Data Persistence in NVM

In this paper, we present compiler support that automatically inserts complex instructions into kernels to achieve NVMM data persistence based on a simple programmer directive. We make only the data resulting from the computations durable. Specific parameters are also logged periodically to track a safe restart point. Data transfer operations are decreased because we scale to the cache-line block size instead of individual data elements. Our compiler support is implemented in the LLVM tool-chain and introduces the necessary modifications to loop-intensive computational kernels to force data persistence. The experiments show that the performance overheads of our proposed compiler support are insignificant.

Network Interface Architecture for Remote Indirect Memory Access (RIMA) in Datacenters

Low-latency Remote Direct Memory Access (RDMA) fabrics such as InfiniBand and Converged Ethernet are attractive as potential replacements for TCP in data centers (DCs) with latency-critical workloads. Unfortunately, InfiniBand's existing primitives ('verbs') are limited in ways that impose serious penalties (e.g., memory wastage, significant programmer burden, or performance loss) for typical DC traffic. We propose Remote Indirect Memory Access (RIMA), which avoids these pitfalls by providing network interface card (NIC) microarchitecture support for novel queue-based verbs.

Exploring Complex Brain-Simulation Workloads on Multi-GPU Deployments

In-silico brain simulations are the de-facto tools through which computational neuroscientists seek to understand brain-function dynamics. Current brain simulators do not scale efficiently to large problem sizes. The goal of this work is to explore the use of true multi-GPU acceleration through NVIDIA's GPUDirect technology on the extended Hodgkin-Huxley model of the inferior-olivary nucleus to assess its scalability. Not only the simulation of the cells but also the setup of the network is taken into account. Simulated network sizes range from 65K to 4M cells, with 10, 100, and 1,000 synapses/neuron, executed on 8, 16, 32, and 48 GPUs.

Building of a Polyhedral Representation from an Instrumented Execution: Making Dynamic Analyses of non-Affine Programs Scalable

The polyhedral model is used in production compilers. Nevertheless, only a very restricted class of applications can benefit from it. Recent proposals investigated how runtime information could be used to apply polyhedral optimization to applications that do not statically fit the model. We go one step further in that direction. We propose a folding-based analysis that, from the output of an instrumented program execution, builds a compact polyhedral representation. It is able to accurately detect affine dependencies, fixed-stride memory accesses, and induction variables in programs. It scales to real-life applications, which often include some non-affine dependencies and accesses in otherwise affine code.

FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation

We present FailAmp, a novel LLVM program transformation algorithm that makes a program manipulating arrays more robust against soft-errors by guarding its array index calculations. Without FailAmp, an offset calculation error can go undetected; with FailAmp, all subsequent offset calculations are relativized, building on the faulty one to be fully detected. FailAmp can exploit ISAs such as ARM to further reduce overheads. We present a thorough evaluation of FailAmp applied to a large collection of HPC benchmarks under a fault injection campaign. FailAmp provides full soft-error detection for address calculation while incurring an average overhead of only 5%.

Exploiting Nested MIMD-SIMD Parallelism on Heterogeneous Microprocessors

Integrated heterogeneous microprocessors provide fast CPU-GPU communication and "in-place" computation, permitting finer granularity GPU parallelism, and use of the GPU in more complex and irregular codes. This paper proposes exploiting nested parallelism, a common OpenMP program paradigm wherein SIMD loop(s) lie underneath an outer MIMD loop. Scheduling the MIMD loop on multiple CPU cores allows multiple instances of their inner SIMD loop(s) to be scheduled on the GPU, boosting GPU utilization, and parallelizing non-SIMD code. Our results on simulated and physical machines show exploiting nested MIMD-SIMD parallelism speeds up the next-best parallelization scheme per benchmark by 1.59x and 1.25x, respectively.

Flextended Tiles: A Flexible Extension of Overlapped Tiles for Polyhedral Compilation

We revisit overlapped tiling, recasting it as an affine transformation on schedule trees composable with any affine scheduling algorithm. We demonstrate how to derive tighter tile shapes with less redundant computations. Our method models the traditional "scalene trapezoid" shapes as well as original "right-rectangle" variants. It goes beyond the state of the art by avoiding the restriction to a domain-specific language or introducing post-pass rescheduling and custom code generation. We conduct experiments on the PolyMage benchmarks and representative iterated stencils, validating the effectiveness and general applicability of our technique on both general-purpose multicores and GPU accelerators.

Declarative Loop Tactics for Domain-Specific Optimization

A Metric-Guided Method for Discovering Impactful Features and Architectural Insights for Skylake-based Processors

The slowdown in technology scaling puts architectural features at the forefront of innovation in modern processors. This paper presents the Metric-Guided Method (MGM), a new structured approach to the analysis of architectural enhancements. We evaluate MGM at both the microarchitecture and the Instruction Set Architecture (ISA) levels. Our evaluation results show that simple optimizations, such as improved representation of CISC instructions, broadly improve performance, while changes in the floating-point execution units had mixed impact. The paper also contributes a set of specially designed micro-benchmarks that isolate features of the Skylake processor that were influential on SPEC CPU benchmarks.
