ACM Transactions on Architecture and Code Optimization (TACO)

Latest Articles

QuMan: Profile-based Improvement of Cluster Utilization

Modern data centers consolidate workloads to increase server utilization, reduce total cost of ownership, and cope with scaling limitations. However, server resource sharing introduces performance interference across applications and, consequently, increases performance volatility, which negatively affects user experience. Thus, a challenging...

LAPPS: Locality-Aware Productive Prefetching Support for PGAS

Prefetching is a well-known technique to mitigate scalability challenges in the Partitioned Global Address Space (PGAS) model. It has been studied as either an automated compiler optimization or a manual programmer optimization. Using PGAS locality awareness, we define a hybrid tradeoff. Specifically, we introduce locality-aware productive...

BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU

The Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats were proposed to improve this kernel on the recent GPU architectures. However, it has been widely observed that there is no “best-for-all” sparse format for...

An Alternative TAGE-like Conditional Branch Predictor

TAGE is one of the most accurate conditional branch predictors known today. However, TAGE does not exploit its input information perfectly, as it is...

Low Complexity Multiply-Accumulate Units for Convolutional Neural Networks with Weight-Sharing

Convolutional neural networks (CNNs) are one of the most successful machine-learning techniques for...

CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems

To exploit the parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques used in traditional GPU systems to hide memory latency and improve thread-level parallelism (TLP), memory interleaving and thread block scheduling, are at odds with efficient use of...

Global Dead-Block Management for Task-Parallel Programs

Task-parallel programs inefficiently utilize the cache hierarchy due to the presence of dead blocks in caches. Dead blocks may occupy cache space in...

High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach

The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches the performance of optimized BLAS...

Cluster Programming using the OpenMP Accelerator Model

Computation offloading is a programming model in which program fragments (e.g., hot loops) are annotated so that their execution is performed in...

Block Cooperation: Advancing Lifetime of Resistive Memories by Increasing Utilization of Error Correcting Codes

Block-level cooperation is an endurance management technique that operates on top of error correction mechanisms to extend memory lifetimes. Once an error recovery scheme fails to recover from faults in a data block, the entire physical page associated with that block is disabled and becomes unavailable to the physical address space. To reduce the...

Layer-Centric Memory Reuse and Data Migration for Extreme-Scale Deep Learning on Many-Core Architectures

Due to the popularity of Deep Neural Network (DNN) models, we have witnessed extreme-scale DNN...

Software-Directed Techniques for Improved GPU Register File Utilization

Throughput architectures such as GPUs require substantial hardware resources to hold the state of a massive number of simultaneously executing...

NEWS

TACO Goes Gold Open Access

As of July 2018, and for a four-year period, all papers published in ACM Transactions on Architecture and Code Optimization (TACO) will be published as Gold Open Access (OA) and will be free to read and share via the ACM Digital Library.

About TACO

The ACM Transactions on Architecture and Code Optimization focuses on hardware, software, and systems research spanning the fields of computer architecture and code optimization. Articles that appear in TACO present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to computer architects, hardware or software developers, system designers and tool builders are emphasized.

Static Prediction of Silent Stores

A store operation is called "silent" if it writes to memory a value that is already there. Silent stores are traditionally detected via profiling. We depart from this methodology and predict silentness by analyzing the syntax of programs. To accomplish this goal, we classify store operations in terms of syntactic features of programs. Based on such features, we develop different kinds of predictors, some of which go well beyond what a trivial approach could achieve. To illustrate how static prediction can be employed in practice, we use it to optimize programs running on non-volatile memory systems.
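The profiling baseline that this work departs from can be sketched in a few lines; a hedged Python illustration (the trace format and function name are hypothetical, not the paper's artifact):

```python
# Profiling-based silent-store detection: a store is "silent" when it
# writes a value that the target address already holds.
def count_silent_stores(trace):
    """trace: list of (address, value) store operations."""
    memory = {}          # simulated memory state
    silent = 0
    for addr, value in trace:
        if memory.get(addr) == value:
            silent += 1  # store rewrites the value already in memory
        memory[addr] = value
    return silent

# Example: the second store to 0x10 writes the same value -> silent.
trace = [(0x10, 7), (0x20, 3), (0x10, 7), (0x20, 4)]
print(count_silent_stores(trace))  # 1
```

On non-volatile memory, skipping such stores avoids needless expensive writes, which is why predicting them statically, without running a profiler, is attractive.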

SelSMaP: A Selective Stride Masking Prefetching Scheme

Although prefetching concepts have been proposed for decades, new challenges are introduced by sophisticated system architectures and emerging applications. Large instruction windows coupled with out-of-order execution make the program's data-access sequence appear distorted from the cache's perspective. Big data applications stress memory subsystems heavily with their large working-set sizes and complex data access patterns. To address such challenges, this work proposes a high-performance hardware prefetching scheme, SelSMaP. SelSMaP is able to detect both regular and non-uniform stride patterns by taking the minimum observed address offset (called the reference stride) as a heuristic.
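A hedged sketch of the reference-stride heuristic described above (the prefetch degree and function names are illustrative assumptions; SelSMaP's actual masking logic is more involved):

```python
# Reference-stride heuristic: take the minimum positive offset between
# observed addresses of a load stream as the stride, and issue
# prefetches at multiples of it past the last observed address.
def reference_stride(addresses):
    offsets = [b - a for a, b in zip(addresses, addresses[1:]) if b > a]
    return min(offsets) if offsets else None

def prefetch_candidates(addresses, degree=2):
    stride = reference_stride(addresses)
    if stride is None:
        return []
    last = addresses[-1]
    return [last + stride * i for i in range(1, degree + 1)]

# Non-uniform pattern: offsets are 64, 128, 64 -> reference stride 64.
addrs = [0, 64, 192, 256]
print(reference_stride(addrs))     # 64
print(prefetch_candidates(addrs))  # [320, 384]
```

Taking the minimum offset makes the heuristic robust to the out-of-order distortion mentioned above: reordered accesses change which offsets appear, but rarely produce an offset smaller than the true stride.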

SCP: Shared Cache Partitioning for High-performance GEMM

GEMM is the most fundamental routine in BLAS. Software prefetching and data packing are used to exploit LRU caches. However, processors equipped with shared non-LRU caches are emerging, which poses challenges to GEMM development in multi-threaded contexts. We present a Shared Cache Partitioning (SCP) method that eliminates inter-thread cache conflicts in GEMM by partitioning a shared cache into disjoint sets and assigning different sets to different threads. We implemented SCP in OpenBLAS and evaluated it on Phytium 2000+, a 64-core AArch64 processor with shared pseudo-random L2 caches. Evaluation shows that SCP effectively reduces conflict misses, improving performance by 2.75%-6.91%.
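The partitioning idea can be illustrated with a hedged sketch (set counts and names are hypothetical; the real implementation places OpenBLAS packing buffers so that their set indices fall within each thread's partition):

```python
# Shared-cache partitioning: the sets of a shared cache are divided into
# disjoint contiguous ranges, one per thread, so the packed GEMM buffers
# of different threads can never conflict in the same cache sets.
def set_partition(num_sets, num_threads, tid):
    """Half-open range of cache sets owned by thread `tid`."""
    per_thread = num_sets // num_threads
    start = tid * per_thread
    return range(start, start + per_thread)

# 2048 sets shared by 8 threads -> 256 disjoint sets per thread.
parts = [set_partition(2048, 8, t) for t in range(8)]
print(parts[0].start, parts[0].stop)  # 0 256
print(parts[7].start, parts[7].stop)  # 1792 2048
```

Because the ranges are pairwise disjoint, an eviction caused by one thread's buffer can never displace another thread's data, which is the property LRU-friendly packing loses on pseudo-random caches.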

Polyhedral Search Space Exploration in the ExaStencils Code Generator

Performance optimization of stencil codes requires data-locality improvements. The polyhedron model for loop transformation is well suited for such optimizations, with established techniques such as the PLuTo algorithm and diamond tiling. However, in the domain of our project ExaStencils (stencil codes), it fails to yield optimal results. As an alternative, we propose a new, optimized, multi-dimensional polyhedral search space exploration and demonstrate its effectiveness: we obtain better results than existing approaches. We also propose how to specialize the search for the domain of stencil codes, which dramatically reduces the exploration effort without significantly impairing performance.

RAGuard: An Efficient and User-Transparent Hardware Mechanism against ROP Attacks

This paper proposes RAGuard, an efficient and user-transparent hardware-based approach to prevent ROP attacks. RAGuard binds a MAC to each return address to ensure its integrity. To guarantee the security of the MAC and reduce runtime overhead, RAGuard (1) computes the MAC by encrypting the signature of a return address with AES-128, (2) develops a key management module based on a PUF, and (3) uses a dedicated register to reduce MAC load and store operations for leaf functions. Furthermore, we evaluate our mechanism on the LEON3 processor and show that RAGuard incurs acceptable performance overhead and occupies reasonable area.
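A hedged software analogue of binding a MAC to each return address (RAGuard computes the MAC in hardware with AES-128 and a PUF-derived key; HMAC-SHA256 and all the names below stand in purely for illustration):

```python
# Conceptual sketch: each pushed return address carries a MAC; on return,
# the MAC is recomputed and compared, so an attacker who overwrites the
# address (as in a ROP attack) cannot forge a matching tag.
import hmac, hashlib

KEY = b"puf-derived-secret"  # hypothetical per-chip secret key

def mac_of(return_addr: int) -> bytes:
    return hmac.new(KEY, return_addr.to_bytes(8, "little"),
                    hashlib.sha256).digest()[:8]

def push_return(stack, addr):
    stack.append((addr, mac_of(addr)))       # store address + MAC

def pop_return(stack):
    addr, tag = stack.pop()
    if not hmac.compare_digest(tag, mac_of(addr)):
        raise RuntimeError("ROP detected: return address tampered")
    return addr

stack = []
push_return(stack, 0x400123)
stack[-1] = (0xdeadbeef, stack[-1][1])       # attacker overwrites address
# pop_return(stack) now raises RuntimeError
```

The user-transparency claim follows from the same structure: the MAC check happens on every return without any change to program source or binaries.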

Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory

Modern Graphics Processing Units (GPUs) are equipped with a large register file (RF) to support fast context switching between massive numbers of threads, and with scratchpad memory (SPM) to support inter-thread communication within a cooperative thread array (CTA). However, GPU thread-level parallelism (TLP) is usually limited by inefficient management of the register file and scratchpad memory. To overcome this inefficiency, we propose EXPARS, a new resource management approach for GPUs. EXPARS provides a logically larger register file by expanding it into scratchpad memory. The experimental results show that our approach achieves a 20.01% performance improvement on average with negligible energy overhead.

On-GPU Thread-Data Remapping for Branch Divergence Reduction

To achieve runtime, on-the-spot branch divergence reduction, we propose the first on-GPU thread-data remapping scheme. Before kernel launch, our solution inserts code into GPU kernels immediately before each target branch to acquire actual divergence information. Threads can be remapped to data multiple times during a single kernel execution. We propose two on-GPU thread-data remapping algorithms. Effective on two generations of GPUs from NVIDIA and AMD, our solution achieves speedups of up to 2.718x on third-party benchmarks. We also implement three GPGPU frontier benchmarks and show that our solution works better than the traditional one.
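The effect of thread-data remapping on divergence can be illustrated with a hedged sketch (the warp size, predicate, and simple sort-by-outcome policy are assumptions; the paper's on-GPU algorithms are more sophisticated):

```python
# Thread-data remapping: before a divergent branch, reorder the
# thread->data mapping so threads in the same warp see data that takes
# the same branch direction, making each warp's branch uniform.
WARP = 4  # hypothetical warp size

def remap(data, predicate):
    """Group data items by branch outcome."""
    taken     = [d for d in data if predicate(d)]
    not_taken = [d for d in data if not predicate(d)]
    return taken + not_taken

def divergent_warps(data, predicate):
    """Count warps whose threads disagree on the branch outcome."""
    warps = [data[i:i + WARP] for i in range(0, len(data), WARP)]
    return sum(1 for w in warps if len({predicate(d) for d in w}) > 1)

data = [1, 8, 2, 9, 3, 7, 4, 6]   # branch outcomes alternate per thread
pred = lambda x: x > 5
print(divergent_warps(data, pred))              # 2 (both warps diverge)
print(divergent_warps(remap(data, pred), pred)) # 0 (warps are uniform)
```

Running this regrouping on the GPU itself, immediately before the branch, is what makes the scheme "on-the-spot": the divergence information is actual, not profiled ahead of time.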

Poker: Permutation-Based SIMD Execution of Intensive Tree Search by Path Encoding

We introduce Poker, a permutation-based approach for vectorizing queries over B+-trees. Our insight is to combine vector loads and path-encoding-based permutations to alleviate memory latency while keeping the number of key comparisons to a minimum. Implemented as a C++ template library, Poker represents a general-purpose solution for vectorizing queries over indexing trees on multicores. For five benchmarks evaluated with 24 configurations each, Poker outperforms the prior art by 2.11x with one thread and 2.28x with eight threads on Broadwell, on average. In addition, strip-mining queries further improves Poker's performance by 1.21x, on average.

Exposing Memory Access Patterns to Improve Instruction and Memory Efficiency

Modern applications often have high memory intensity, with common memory request patterns requiring a large number of address generation instructions and a high degree of memory-level parallelism. This paper proposes new memory instructions that exploit strided and indirect memory request patterns and improve efficiency in GPU architectures. The new instructions reduce address generation instructions by offloading addressing to dedicated hardware, and reduce memory interference by grouping related requests together. Our results show that we eliminate 33% of dynamic instructions, leading to an overall runtime improvement of 26%, an energy reduction of 18%, and a reduction in energy-delay product of 32%.
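A hedged sketch of the two request patterns the proposed instructions capture; in scalar code each address below would cost explicit address-generation instructions, whereas a strided or indirect (gather) memory instruction computes them in dedicated hardware (function names are illustrative only):

```python
# Strided pattern: A[0], A[stride], A[2*stride], ... -- one instruction
# can describe the whole sequence with (base, stride, count).
def strided_addresses(base, stride, n):
    return [base + stride * i for i in range(n)]

# Indirect pattern (gather): A[idx[0]], A[idx[1]], ... -- one instruction
# can describe it with (base, element size, index vector).
def indirect_addresses(base, elem_size, index_vector):
    return [base + elem_size * idx for idx in index_vector]

print(strided_addresses(0x1000, 8, 4))           # [4096, 4104, 4112, 4120]
print(indirect_addresses(0x2000, 4, [3, 0, 5]))  # [8204, 8192, 8212]
```

Grouping all addresses of one such pattern into a single request is also what reduces memory interference: related accesses reach the memory system together instead of interleaved with unrelated traffic.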

Performance Tuning and Analysis for Stencil-Based Applications on POWER8 Processor

In this paper, we first demonstrate how to combine general tuning techniques with the POWER8 hardware architecture by optimizing three representative stencil benchmarks. We then employ two typical real-world applications (kernels similar to those of the Gordon Bell Prize winners of 2016 and 2017) to illustrate how to make proper algorithmic modifications and fully combine hardware-oriented tuning strategies with the application algorithms. As a result, this work fills the gap between the hardware capability and software performance of the POWER8 processor, and provides useful guidance for optimizing stencil-based scientific applications on POWER systems.

AVPP: Address-first Value-next Predictor with Value Prefetching for Improving the Efficiency of Load Value Prediction

Value prediction improves instruction-level parallelism in superscalar processors by breaking true data dependencies. Although this technique can significantly improve overall performance, most approaches achieve it at a high hardware cost. Our work reduces the complexity of value prediction by optimizing the prediction infrastructure to predict only load instructions and by leveraging existing hardware in modern processors. We also propose a new load value predictor that outperforms state-of-the-art predictors at very low cost. Moreover, we propose a new taxonomy for the different policies that can be used in value prediction.
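A hedged sketch of the simplest load value prediction policy, last-value prediction, on which such predictors build (the trace format is hypothetical; AVPP itself adds address-first prediction and value prefetching on top of this baseline):

```python
# Last-value prediction for loads: a table indexed by the load's PC
# predicts that a load returns the value it returned last time. A hit
# lets dependent instructions execute before the load completes.
def simulate(trace):
    """trace: list of (pc, loaded_value); returns prediction accuracy."""
    table, correct = {}, 0
    for pc, value in trace:
        if table.get(pc) == value:
            correct += 1      # prediction matches the actual loaded value
        table[pc] = value     # train: remember the last value per PC
    return correct / len(trace)

# The load at PC 0x40 always returns 7 -> predictable; 0x44 is not.
trace = [(0x40, 7), (0x44, 1), (0x40, 7), (0x44, 2), (0x40, 7)]
print(simulate(trace))  # 0.4
```

Restricting the predictor to loads, as described above, is what keeps the table small: only one class of instructions needs entries, rather than every value-producing instruction.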
