
ACM Transactions on Architecture and Code Optimization (TACO)

Latest Articles

Polyhedral Search Space Exploration in the ExaStencils Code Generator

Performance optimization of stencil codes requires data locality improvements. The polyhedron model for loop transformation is well suited for such...

Performance Tuning and Analysis for Stencil-Based Applications on POWER8 Processor

This article demonstrates an approach for combining general tuning techniques with the POWER8 hardware architecture through optimizing three...

SelSMaP: A Selective Stride Masking Prefetching Scheme

Data prefetching, which intelligently loads data closer to the processor before it is demanded, is a popular cache performance optimization technique for addressing the growing processor-memory performance gap. Although prefetching concepts have been proposed for decades, increasingly sophisticated system architectures and emerging applications introduce new challenges...
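
A minimal sketch of the generic stride-detection idea that such prefetchers build on: a per-PC table learns a recurring address delta and issues prefetches once the stride looks stable. This is only an assumed, baseline-style illustration; SelSMaP's selective stride-masking scheme itself is not reproduced here, and the table sizes and thresholds are placeholders.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Hypothetical per-PC stride detector (illustrative only).
    struct StrideEntry {
        uint64_t last_addr  = 0;
        int64_t  stride     = 0;
        int      confidence = 0;   // saturating counter
    };

    class StridePrefetcher {
    public:
        // Observe one access (pc, addr); return addresses worth prefetching, if any.
        std::vector<uint64_t> access(uint64_t pc, uint64_t addr) {
            std::vector<uint64_t> prefetches;
            StrideEntry& e = table_[pc];
            int64_t observed = static_cast<int64_t>(addr) - static_cast<int64_t>(e.last_addr);
            if (e.last_addr != 0 && observed != 0 && observed == e.stride) {
                if (e.confidence < 3) ++e.confidence;   // same stride again: grow confidence
            } else {
                e.stride = observed;                    // new stride candidate
                e.confidence = 0;
            }
            e.last_addr = addr;
            if (e.confidence >= 2) {                    // stride looks stable: run ahead
                for (int d = 1; d <= 2; ++d)
                    prefetches.push_back(addr + d * e.stride);
            }
            return prefetches;
        }
    private:
        std::unordered_map<uint64_t, StrideEntry> table_;
    };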

SCP: Shared Cache Partitioning for High-Performance GEMM

GEneral Matrix Multiply (GEMM) is the most fundamental computational kernel routine in the BLAS library. To achieve high performance, in-memory data must be prefetched into fast on-chip caches before it is used. Two techniques, software prefetching and data packing, have been used to effectively exploit the capability of on-chip least recent...
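
A minimal sketch of data packing, one of the two techniques named above: a cache-sized block of B is copied into a contiguous buffer laid out the way a BLAS-style micro-kernel streams through it. The column-major layout, the panel width NR, and the block sizes are illustrative assumptions, not SCP's partition parameters.

    #include <cstddef>

    // Pack a kc x nc block of column-major B (leading dimension ldb) into a
    // contiguous buffer of NR-wide column panels, zero-padding the ragged edge.
    constexpr std::size_t NR = 4;   // illustrative micro-kernel panel width

    void pack_B(const double* B, std::size_t ldb,
                std::size_t kc, std::size_t nc, double* packed) {
        std::size_t p = 0;
        for (std::size_t j0 = 0; j0 < nc; j0 += NR) {            // one panel of NR columns
            std::size_t jb = (nc - j0 < NR) ? nc - j0 : NR;
            for (std::size_t k = 0; k < kc; ++k) {               // walk down the panel rows
                for (std::size_t j = 0; j < jb; ++j)
                    packed[p++] = B[(j0 + j) * ldb + k];
                for (std::size_t j = jb; j < NR; ++j)            // pad so the kernel sees full panels
                    packed[p++] = 0.0;
            }
        }
    }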

NEWS

TACO Goes Gold Open Access

As of July 2018, and for a four-year period, all papers in ACM Transactions on Architecture and Code Optimization (TACO) will be published as Gold Open Access (OA) and will be free to read and share via the ACM Digital Library.

About TACO

The ACM Transactions on Architecture and Code Optimization focuses on hardware, software, and systems research spanning the fields of computer architecture and code optimization. Articles that appear in TACO present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to computer architects, hardware or software developers, system designers, and tool builders are emphasized.

Static Prediction of Silent Stores

A store operation is called "silent" if it writes to memory a value that is already there. Silent stores are traditionally detected via profiling. We depart from this methodology and predict silentness by analyzing the syntax of programs. To accomplish this goal, we classify store operations in terms of syntactic features of programs. Based on these features, we develop different kinds of predictors, some of which go well beyond what any trivial approach could achieve. To illustrate how static prediction can be employed in practice, we use it to optimize programs running on non-volatile memory systems.
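
A minimal sketch of the concept: a silent store writes the value already in memory, and a store predicted to be silent can be guarded by a load-and-compare so the write is skipped. The function below is purely illustrative of the transformation, not the paper's predictors.

    // If a store is predicted silent, check first and skip the write when the
    // value is unchanged; this pays off e.g. on non-volatile memories where
    // writes are expensive.
    void store_maybe_silent(int* slot, int value) {
        if (*slot != value)     // the store would be silent when values match
            *slot = value;
    }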

GenMatcher: A Generic Clustering-based Arbitrary Matching Framework

This paper presents GenMatcher, a generic, software-only, arbitrary matching framework for fast, efficient searches. The key idea of our approach is to represent arbitrary rules with efficient prefix-based tries. Our contributions include a novel clustering-based grouping algorithm that groups rules based on their bit-level similarities. Our algorithm generates near-optimal trie groupings with low configuration times and provides significantly higher match throughput than prior techniques. Experiments with synthetic traffic show that our method can achieve a 58.9X speedup over the baseline on a single-core processor under a given memory constraint.
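
A minimal sketch, under assumed simplifications, of the prefix-trie idea: each rule is a bit string with a fixed prefix length, and a lookup walks the trie bit by bit to find the longest matching rule. GenMatcher's clustering of rules into multiple tries by bit-level similarity is outside this sketch.

    #include <cstdint>
    #include <memory>

    // Binary trie over rule prefixes. A rule (value, prefix_len) matches a key
    // whose top prefix_len bits equal those of value.
    struct TrieNode {
        int rule_id = -1;                        // -1: no rule terminates here
        std::unique_ptr<TrieNode> child[2];
    };

    class PrefixTrie {
    public:
        void insert(uint32_t value, int prefix_len, int rule_id) {
            TrieNode* n = &root_;
            for (int i = 0; i < prefix_len; ++i) {
                int b = (value >> (31 - i)) & 1;
                if (!n->child[b]) n->child[b] = std::make_unique<TrieNode>();
                n = n->child[b].get();
            }
            n->rule_id = rule_id;
        }
        // Longest-prefix match; returns the matching rule id or -1.
        int match(uint32_t key) const {
            const TrieNode* n = &root_;
            int best = n->rule_id;
            for (int i = 0; i < 32 && n; ++i) {
                int b = (key >> (31 - i)) & 1;
                n = n->child[b] ? n->child[b].get() : nullptr;
                if (n && n->rule_id != -1) best = n->rule_id;
            }
            return best;
        }
    private:
        TrieNode root_;
    };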

Automated Software Protection for the Masses Against Side-Channel Attacks

We present an approach and a tool that answer the need for effective, generic, and easily applicable protection against side-channel attacks. The protection mechanism is based on code polymorphism, so that the observable behaviour of the protected component is variable and unpredictable to the attacker. Our approach combines lightweight specialized runtime code generation with the optimization capabilities of static compilation. It is extensively configurable, and mitigations are set up against security holes related to runtime code generation. Experimental results show that programs secured by our approach achieve strong security levels and meet the performance requirements of constrained systems.
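
A minimal sketch of the underlying idea of code polymorphism: at run time, one of several semantically equivalent variants of the protected operation is selected unpredictably, so repeated observations of the same computation differ. The toy variants and the selection policy below are illustrative assumptions; the tool described above generates and regenerates such variants automatically through specialized runtime code generation.

    #include <cstdint>
    #include <random>

    // Two semantically equivalent variants of the same operation (a toy case;
    // real variants would also differ in instruction selection, scheduling, masking, ...).
    static uint32_t variant_a(uint32_t x, uint32_t k) { return x ^ k; }
    static uint32_t variant_b(uint32_t x, uint32_t k) { return (x | k) & ~(x & k); } // same as XOR

    uint32_t protected_op(uint32_t x, uint32_t k) {
        static thread_local std::mt19937 rng{std::random_device{}()};
        // Pick a variant unpredictably so the observable behaviour varies between calls.
        uint32_t (*variants[])(uint32_t, uint32_t) = {variant_a, variant_b};
        return variants[rng() % 2](x, k);
    }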

RAGuard: An Efficient and User-Transparent Hardware Mechanism against ROP Attacks

This paper proposes RAGuard, an efficient and user-transparent hardware-based approach to preventing ROP attacks. RAGuard binds a MAC to each return address to ensure its integrity. To guarantee the security of the MAC and reduce runtime overhead, RAGuard (1) computes the MAC by encrypting the signature of a return address with AES-128, (2) develops a key management module based on a PUF, and (3) uses a dedicated register to reduce MAC load and store operations for leaf functions. Furthermore, we evaluate our mechanism on the LEON3 processor and show that RAGuard incurs acceptable performance overhead and occupies reasonable area.
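
A minimal software sketch of the return-address MAC idea: bind a keyed tag to the return address on call and verify it before return. RAGuard does this in hardware with AES-128 and a PUF-derived key; the mixing function and the explicit frame structure below are stand-in assumptions for illustration only.

    #include <cstdint>
    #include <cstdlib>

    // Stand-in keyed tag; NOT cryptographic. RAGuard uses AES-128 keyed by a PUF.
    static uint64_t mac(uint64_t return_addr, uint64_t key) {
        uint64_t x = return_addr ^ key;
        x *= 0x9E3779B97F4A7C15ULL;
        return x ^ (x >> 32);
    }

    // Conceptual prologue/epilogue: a leaf function could keep the MAC in a
    // dedicated register instead of spilling it (one of RAGuard's optimizations).
    struct Frame { uint64_t ret_addr; uint64_t ret_mac; };

    void on_call(Frame& f, uint64_t ret_addr, uint64_t key) {
        f.ret_addr = ret_addr;
        f.ret_mac  = mac(ret_addr, key);
    }

    void on_return(const Frame& f, uint64_t key) {
        if (mac(f.ret_addr, key) != f.ret_mac)
            std::abort();               // return address was tampered with (ROP attempt)
    }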

Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory

Modern Graphics Processing Units (GPUs) are equipped with a large register file (RF) to support fast context switches between massive numbers of threads, and with scratchpad memory (SPM) to support inter-thread communication within a cooperative thread array (CTA). However, GPU thread-level parallelism (TLP) is often limited by inefficient management of the register file and scratchpad memory. To overcome this inefficiency, we propose EXPARS, a new resource management approach for GPUs. EXPARS provides a logically larger register file by expanding the register file into scratchpad memory. The experimental results show that our approach achieves a 20.01% performance improvement on average with negligible energy overhead.

Poker: Permutation-Based SIMD Execution of Intensive Tree Search by Path Encoding

We introduce Poker, a permutation-based approach for vectorizing queries over B+-trees. Our insight is to combine vector loads and path-encoding-based permutations to alleviate memory latency while keeping the number of key comparisons to a minimum. Implemented as a C++ template library, Poker represents a general-purpose solution for vectorizing queries over indexing trees on multicores. For five benchmarks evaluated with 24 configurations each, Poker outperforms the prior art by 2.11x with one thread and 2.28x with eight threads on Broadwell, on average. In addition, strip-mining queries further improves Poker's performance by 1.21x, on average.
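
A minimal sketch of the vectorized per-node step that such approaches build on: one AVX2 comparison of the query key against all keys in a B+-tree node, with the child slot derived from the resulting mask. The node layout and key count are illustrative assumptions; Poker's path encoding and permutation machinery are not reproduced here.

    #include <immintrin.h>
    #include <cstdint>

    // Find the child slot to descend into for `key` in a node holding 8 sorted
    // 32-bit keys, replacing a scalar comparison loop with one vector compare.
    // Compile with AVX2 enabled (e.g. -mavx2).
    int child_index(const int32_t keys[8], int32_t key) {
        __m256i node = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(keys));
        __m256i q    = _mm256_set1_epi32(key);
        // lanes where node key <= query key: the search descends past them
        __m256i le   = _mm256_or_si256(_mm256_cmpgt_epi32(q, node),
                                       _mm256_cmpeq_epi32(q, node));
        uint32_t mask = static_cast<uint32_t>(
            _mm256_movemask_ps(_mm256_castsi256_ps(le)));
        return __builtin_popcount(mask);   // number of keys <= query = child slot
    }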

Predicting New Workload or CPU Performance by Analyzing Public Datasets

Repositories of benchmark results are not always helpful when consumers need performance data for new processors or new workloads. Moreover, the aggregate scores for benchmark suites can be misleading. To address these problems, we have developed a deep neural network (DNN) model and applied it to datasets of Intel CPU specifications and the SPEC CPU2006 and Geekbench 3 benchmark suites. We show that we can generate useful predictions for new processors and new workloads. We also quantify the self-similarity of these suites for the first time in the literature.

Exposing Memory Access Patterns to Improve Instruction and Memory Efficiency

Modern applications often have high memory intensity, with common memory request patterns requiring a large number of address generation instructions and a high degree of memory-level parallelism. This paper proposes new memory instructions that exploit strided and indirect memory request patterns and improve efficiency in GPU architectures. The new instructions reduce address generation instructions by offloading addressing to dedicated hardware, and reduce memory interference by grouping related requests together. Our results show that we eliminate 33% of dynamic instructions, leading to an overall runtime improvement of 26%, an energy reduction of 18%, and a reduction in energy-delay product of 32%.
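
A minimal sketch of the two request patterns named above, as they appear in ordinary code: the strided loop generates addresses by pure arithmetic, and the indirect loop gathers through an index array. The proposed instructions would express each loop's address generation as a single strided or indirect memory operation instead of per-element address instructions; the hardware interface itself is not shown here.

    #include <cstddef>

    // Strided pattern: addresses are base + i*stride; the address arithmetic is
    // exactly what a strided memory instruction could absorb.
    float sum_strided(const float* a, std::size_t n, std::size_t stride) {
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i) s += a[i * stride];
        return s;
    }

    // Indirect pattern: addresses come from an index array (a gather); an indirect
    // memory instruction could issue the related requests as one group.
    float sum_indirect(const float* a, const int* idx, std::size_t n) {
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i) s += a[idx[i]];
        return s;
    }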

Processor-Tracing Guided Region Formation in Dynamic Binary Translation

This paper presents a lightweight region formation method guided by processor tracing, e.g., Intel PT. We leverage the branch history information stored in the processor to reconstruct the program execution profile and effectively form high-quality regions at low cost. We also present the designs of lightweight HPM sampling and a branch instruction decode cache to minimize region formation overhead. For ARM32-to-x86-64 translation, the experimental results show that our method achieves a performance speedup of up to 1.38x (1.10x on average) for SPEC CPU2006 benchmarks with reference inputs, compared to the well-known software-based trace formation method, Next Executing Tail (NET).

AVPP: Address-first Value-next Predictor with Value Prefetching for Improving the Efficiency of Load Value Prediction

Value prediction improves instruction-level parallelism in superscalar processors by breaking true data dependencies. Although this technique can significantly improve overall performance, most approaches achieve it at a high hardware cost. Our work reduces the complexity of value prediction by optimizing the prediction infrastructure to predict only load instructions and by leveraging existing hardware in modern processors. We also propose a new load value predictor that outperforms state-of-the-art predictors at very low cost. Moreover, we propose a new taxonomy of the policies that can be used in value prediction.
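
A minimal sketch of the simplest form of load value prediction, a per-PC last-value table with a confidence counter, to make the idea concrete; AVPP's address-first, value-next organization and its value prefetching go beyond this, and the table organization and thresholds below are illustrative assumptions.

    #include <cstdint>
    #include <unordered_map>

    // Last-value predictor: predict that a load (identified by its PC) returns
    // the same value it returned last time, once confidence is high enough.
    struct LVPEntry { uint64_t last_value = 0; int confidence = 0; };

    class LastValuePredictor {
    public:
        // Returns true and sets `prediction` only when the predictor is confident.
        bool predict(uint64_t pc, uint64_t& prediction) const {
            auto it = table_.find(pc);
            if (it == table_.end() || it->second.confidence < 2) return false;
            prediction = it->second.last_value;
            return true;
        }
        // Train with the value the load actually produced.
        void update(uint64_t pc, uint64_t actual) {
            LVPEntry& e = table_[pc];
            if (e.last_value == actual) { if (e.confidence < 3) ++e.confidence; }
            else                        { e.last_value = actual; e.confidence = 0; }
        }
    private:
        std::unordered_map<uint64_t, LVPEntry> table_;
    };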
