ACM Transactions on Architecture and Code Optimization (TACO)

Latest Articles

Polyhedral Search Space Exploration in the ExaStencils Code Generator

Performance optimization of stencil codes requires data locality improvements. The polyhedron model for loop transformation is well suited for such... (more)

Performance Tuning and Analysis for Stencil-Based Applications on POWER8 Processor

This article demonstrates an approach for combining general tuning techniques with the POWER8 hardware architecture through optimizing three... (more)

SelSMaP: A Selective Stride Masking Prefetching Scheme

Data prefetching, which intelligently loads data closer to the processor before it is demanded, is a popular cache performance optimization technique to address the increasing processor-memory performance gap. Although prefetching concepts have been proposed for decades, sophisticated system architectures and emerging applications introduce new challenges.... (more)
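The selective masking scheme itself is not shown in the teaser, but the baseline such schemes refine is classic per-PC stride prefetching. Below is a minimal, hypothetical sketch of that baseline (the class name, table layout, and confidence rule are illustrative, not taken from the paper):

```python
# Hypothetical sketch of a classic per-PC stride prefetcher: if a load
# at the same program counter repeats a constant address stride, predict
# the next address and issue a prefetch for it.
class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride, confidence)

    def access(self, pc, addr):
        """Record a demand load; return a prefetch address or None."""
        last_addr, last_stride, conf = self.table.get(pc, (None, 0, 0))
        if last_addr is None:
            self.table[pc] = (addr, 0, 0)
            return None
        stride = addr - last_addr
        # Confidence grows only while the same non-zero stride repeats.
        conf = conf + 1 if stride == last_stride and stride != 0 else 0
        self.table[pc] = (addr, stride, conf)
        return addr + stride if conf >= 1 else None
```

For a load walking an array at 64-byte strides (addresses 0, 64, 128, ...), the third access is the first with a confirmed stride, so prefetching begins there.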

SCP: Shared Cache Partitioning for High-Performance GEMM

GEneral Matrix Multiply (GEMM) is the most fundamental computational kernel routine in the BLAS library. To achieve high performance, in-memory data must be prefetched into fast on-chip caches before they are used. Two techniques, software prefetching and data packing, have been used to effectively exploit the capability of on-chip least recent... (more)
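Data packing, one of the two techniques the abstract names, copies a block of an input matrix into a small contiguous buffer so the inner loops stream through cache-friendly sequential memory. A toy sketch of that idea follows (illustrative only; real BLAS kernels pack into aligned buffers and drive vectorized microkernels):

```python
# Hypothetical sketch of the packing idea behind high-performance GEMM:
# copy a block of B into a contiguous buffer once, then reuse it for
# every row of A, so the innermost loop reads sequential memory.
def gemm_packed(A, B, block=2):
    n = len(A)  # square n x n matrices as lists of lists
    C = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, block):
        for jj in range(0, n, block):
            # Pack the B block once; it is reused for every row of A.
            packed = [[B[k][j] for j in range(jj, min(jj + block, n))]
                      for k in range(kk, min(kk + block, n))]
            for i in range(n):
                for kb, k in enumerate(range(kk, min(kk + block, n))):
                    a = A[i][k]
                    for jb, j in enumerate(range(jj, min(jj + block, n))):
                        C[i][j] += a * packed[kb][jb]
    return C
```

The payoff in a real kernel is that `packed` fits in a fast cache level and is traversed with unit stride, which is exactly the access pattern that prefetching and cache partitioning are tuned for.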

Static Prediction of Silent Stores

A store operation is called “silent” if it writes to memory a value that is already there. The ability to detect silent stores is important because they might indicate performance bugs, enable code optimizations, or reveal opportunities for automatic parallelization, for instance. Silent stores are traditionally... (more)
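A minimal sketch of what makes a store "silent" (the model and function name here are illustrative; the paper's contribution is a *static* prediction technique, whereas this sketch checks values dynamically):

```python
# Hypothetical sketch: a "silent" store writes a value that memory
# already holds, so eliding it leaves program state unchanged.
def filter_silent_stores(memory, stores):
    """Split stores into (executed, silent) under a simple memory model.

    memory: dict mapping address -> current value
    stores: list of (address, value) store operations, in program order
    """
    executed, silent = [], []
    for addr, value in stores:
        if memory.get(addr) == value:
            silent.append((addr, value))   # no-op: value already there
        else:
            memory[addr] = value           # real update
            executed.append((addr, value))
    return executed, silent

mem = {0x10: 7, 0x14: 0}
executed, silent = filter_silent_stores(mem, [(0x10, 7), (0x14, 3), (0x14, 3)])
# (0x10, 7) is silent; the first (0x14, 3) executes, the second is silent.
```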

Exposing Memory Access Patterns to Improve Instruction and Memory Efficiency in GPUs

Modern computing workloads often have high memory intensity, requiring high bandwidth access to memory. The memory request patterns of these workloads... (more)

Poker: Permutation-Based SIMD Execution of Intensive Tree Search by Path Encoding

We introduce Poker, a permutation-based approach for vectorizing multiple queries over B+-trees. Our key insight is to combine vector loads and path-encoding-based permutations to alleviate memory latency while keeping the number of key comparisons needed for a query to a minimum. Implemented as a C++ template library, Poker represents a... (more)

Automated Software Protection for the Masses Against Side-Channel Attacks

We present an approach and a tool to answer the need for effective, generic, and easily applicable protections against side-channel attacks. The... (more)

Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory

Modern Graphic Processing Units (GPUs) have become pervasive computing devices in datacenters due to... (more)

AVPP: Address-first Value-next Predictor with Value Prefetching for Improving the Efficiency of Load Value Prediction

Value prediction improves instruction-level parallelism in superscalar processors by breaking true data dependencies. Although this technique can significantly improve overall performance, most state-of-the-art value prediction approaches require high hardware cost, which is the main obstacle to their wide adoption in current processors. To... (more)

RAGuard: An Efficient and User-Transparent Hardware Mechanism against ROP Attacks

Control-flow integrity (CFI) is a general method for preventing code-reuse attacks, which utilize benign code sequences to achieve arbitrary code execution. CFI ensures that the execution of a program follows the edges of its predefined static Control-Flow Graph: any deviation constitutes a CFI violation and terminates the application. Despite... (more)

GenMatcher: A Generic Clustering-Based Arbitrary Matching Framework

Packet classification methods rely on matching packet content/headers against rules. Thus, the throughput of matching operations is critical in many networking applications. Further, with the advent of Software-Defined Networking (SDN), efficient implementation of software approaches to matching is critical for overall system performance. This... (more)


TACO Goes Gold Open Access

As of July 2018, and for a four-year period, all papers published in ACM Transactions on Architecture and Code Optimization (TACO) will be published as Gold Open Access (OA) and will be free to read and share via the ACM Digital Library. READ MORE

About TACO

The ACM Transactions on Architecture and Code Optimization focuses on hardware, software, and systems research spanning the fields of computer architecture and code optimization. Articles that appear in TACO present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to computer architects, hardware or software developers, system designers and tool builders are emphasized. READ MORE

HAWS: Accelerating GPU Wavefront Execution through Selective Out-of-Order Execution

In this paper, we present a novel Hint-Assisted Wavefront Scheduler (HAWS) to bypass long-latency stalls on GPUs. HAWS leverages our compiler infrastructure to identify potential opportunities to bypass memory stalls. HAWS includes a wavefront scheduler that can continue to execute instructions in the shadow of a memory stall, executing instructions speculatively, guided by compiler-generated hints. HAWS increases utilization of GPU resources by aggressively fetching and executing instructions speculatively. Based on our simulation results on the AMD Southern Islands GPU architecture, at an estimated cost of 0.4% of total chip area, HAWS can improve application performance by 15.3% on average for memory-intensive applications.

Memory-side Protection with a Capability Enforcement co-Processor

Byte-addressable non-volatile memory (NVM) blends the concepts of storage and memory and can radically improve data-centric applications, from in-memory databases to graph processing. NVM changes the nature of rack-scale systems and enables short-latency direct memory access while retaining data persistence properties and simplifying the software stack. This paper proposes CEP (Capability Enforcement Coprocessor), a memory-side coprocessor which implements fine-grained protection through the capability model. By doing so, it opens up important performance optimization opportunities (without compromising security).

SketchDLC: A Sketch on Distributed Deep Learning Communication via Trace Capturing

We provide a measurement of distributed deep learning communication via trace capturing. First, we provide a detailed analysis of the communication mechanism of MXNet. Second, we define the DLC trace format to record communication behaviors. Third, we present the implementation of our trace-capturing method. Fourth, we verify the communication mechanism by providing a glimpse of the trace files. Finally, we present statistics and analyses of distributed deep learning communication based on the captured trace files, including the communication pattern, the overlap ratio between computation and communication, synchronization overhead, and update overhead.

Efficient Data Supply for Parallel Heterogeneous Architectures

Decoupling techniques have been proposed to reduce the amount of memory latency exposed to high-performance accelerators as they fetch data. Although decoupled access-execute (DAE) and more recent decoupled data supply approaches offer promising single-threaded performance improvements, little work has considered how to extend them into parallel scenarios. This paper explores the opportunities and challenges of designing parallel, high-performance, resource-efficient decoupled data supply systems. We propose Mercury, a parallel decoupled data supply system that utilizes thread-level parallelism for high-throughput data supply with good portability attributes. Additionally, we introduce some micro-architectural improvements for data supply units to efficiently handle long-latency indirect loads.

Schedule Synthesis for Halide Pipelines through Reuse Analysis

Efficient code generation for image processing pipelines remains a challenge due to the inherently complex structure of many image processing applications, the plethora of transformations that can be applied as well as the interaction of these transformations with locality, parallelism and re-computation. We propose a novel optimization strategy that aims to maximize producer-consumer locality and reuse between stages of the pipeline. We implement it as a tool to be used alongside the Halide DSL and test it on a variety of benchmarks. Experimental results on three multi-core platforms show a performance improvement of over 40% compared to previous state-of-the-art approaches.

ITAP: Idle-Time-Aware Power Management for GPU Execution Units

In this paper, we propose a novel technique called Idle-Time-Aware Power Management (ITAP) to effectively reduce the static energy consumption of GPU execution units. ITAP employs three static power reduction modes with different overheads and capabilities of static power reduction. ITAP estimates the idle period length using prediction and look-ahead techniques in a synergistic way and then applies the most appropriate static power reduction mode based on the estimated idle period length. Our experimental results show that ITAP outperforms the state-of-the-art solution by an average of 27.6% in terms of static energy savings, with negligible performance overhead.

Efficient and Scalable Execution of Fine-Grained Dynamic Linear Pipelines

We present Pipelite, a dynamic scheduler that exploits the properties of dynamic linear pipelines to achieve high performance for fine-grained workloads. The flexibility of Pipelite allows the stages and their dependences to be determined at run-time. Pipelite unifies communication, scheduling, and synchronization algorithms with suitable data structures. This unified design introduces the local suspension mechanism and a wait-free enqueue operation, which allow efficient dynamic scheduling. The evaluation on a 44-core machine, using programs from three widely-used benchmark suites, shows that Pipelite incurs low overhead and significantly outperforms the state-of-the-art in terms of speedup, scalability, and memory usage.

Accelerating In-Memory Database Selections Using Latency Masking Hardware Threads

Inexpensive DRAMs have created new opportunities for in-memory data analytics. However, the major bottleneck in such systems is high memory access latency. Traditionally, this problem is addressed with large cache hierarchies, which only benefit regular applications; many data-intensive applications instead exhibit irregular behavior, with which hardware multithreading can better cope. This paper implements a multithreaded prototype (MTP) on FPGAs for the relational selection operator, which exhibits control-flow irregularity. On a standard TPC-H query, MTP achieves a normalized speedup of 1.8x over CPU and 3.2x over GPU, while consuming 2.5x and 3.4x less power, respectively.

SAQIP: A Scalable Architecture for Quantum Information Processors

Proposing an architecture that efficiently compensates for hardware inefficiencies with extra resources is one of the key issues in quantum computer design. Scaling today's small lab-scale quantum systems to large-scale systems capable of solving meaningful practical problems is the goal of much research. In this paper, a scalable architecture for quantum information processors, called SAQIP, is proposed, along with a flow to map a circuit onto this architecture. Experimental results show that the proposed architecture and design flow decrease the average latency of quantum circuits by about 83% for the attempted benchmarks.

Supporting Superpages and Lightweight Page Migration in Hybrid Memory Systems

Superpages have long been used to mitigate address translation overhead. However, superpages often preclude lightweight page migration in hybrid memory systems composed of DRAM and non-volatile memory (NVM). This paper presents Rainbow to bridge this fundamental conflict between superpages and lightweight page migration. Rainbow utilizes split TLBs to support different page sizes, and uses DRAM to cache frequently-accessed small pages in each NVM superpage. By a novel NVM-to-DRAM address remapping mechanism, Rainbow supports lightweight page migration without splintering superpages. Experimental results show that Rainbow can significantly reduce applications' TLB misses and improve application performance (IPC) by up to 2.9X.

Exploring an Alternative Cost Function for Combinatorial Register-Pressure-Aware Instruction Scheduling

In this paper, we explore an alternative cost function for combinatorial register-pressure-aware instruction scheduling: the Sum of Live Interval Lengths (SLIL). Unlike the classical peak cost function, which captures register pressure only at the highest-pressure point, SLIL captures register pressure at all points in the schedule. The paper describes a Branch-and-Bound (B&B) algorithm for minimizing SLIL, which is implemented in LLVM. Experimental results using SPEC CPU2006 on Intel x86 show that the proposed algorithm produces substantially less spilling and speeds up execution by up to 18% relative to LLVM's default scheduler.
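A small sketch contrasting the SLIL cost with the classical peak cost on toy live intervals (the interval representation is illustrative; the paper minimizes SLIL during scheduling with a B&B search, not after the fact):

```python
# Hypothetical sketch of the SLIL cost function: given, for each value,
# the schedule slot where it is defined and the slot of its last use,
# SLIL sums the lengths of all live intervals, while the classical
# "peak" cost only looks at the single most crowded slot.
def slil_and_peak(intervals, num_slots):
    """intervals: list of (def_slot, last_use_slot) pairs."""
    slil = sum(end - start + 1 for start, end in intervals)
    pressure = [0] * num_slots
    for start, end in intervals:
        for slot in range(start, end + 1):
            pressure[slot] += 1
    return slil, max(pressure)
```

Two schedules of the same values can share the same peak pressure yet differ in total interval length, which is why SLIL discriminates more finely between candidate schedules than the peak cost alone.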

DUCATI: High Performance Address Translation By Improving TLB Reach on GPU Accelerated Systems

We propose two hardware mechanisms for reducing the frequency and penalty of on-die TLB misses. The first, Unified CAche and TLB (UCAT), enables the conventional on-die last-level cache to store cache lines and TLB entries in a single unified structure, increasing on-die TLB capacity. The second, DRAM-TLB, memoizes virtual-to-physical address translations in DRAM, reducing the on-die TLB miss penalty when UCAT is unable to fully cover the application working-set size. Combining these two mechanisms, we propose DUCATI, an address translation architecture that improves GPU performance by 81% (up to 4.5x) while requiring minimal changes to the existing design.
