ACM Transactions on Architecture and Code Optimization (TACO)

Latest Articles

Polyhedral Search Space Exploration in the ExaStencils Code Generator

Performance optimization of stencil codes requires data locality improvements. The polyhedron model for loop transformation is well suited for such...

Performance Tuning and Analysis for Stencil-Based Applications on POWER8 Processor

This article demonstrates an approach for combining general tuning techniques with the POWER8 hardware architecture through optimizing three...

SelSMaP: A Selective Stride Masking Prefetching Scheme

Data prefetching, which intelligently loads data closer to the processor before demands, is a popular cache performance optimization technique to address the increasing processor-memory performance gap. Although prefetching concepts have been proposed for decades, sophisticated system architecture and emerging applications introduce new challenges....

SCP: Shared Cache Partitioning for High-Performance GEMM

GEneral Matrix Multiply (GEMM) is the most fundamental computational kernel routine in the BLAS library. To achieve high performance, in-memory data must be prefetched into fast on-chip caches before they are used. Two techniques, software prefetching and data packing, have been used to effectively exploit the capability of on-chip least recent...

Static Prediction of Silent Stores

A store operation is called “silent” if it writes to memory a value that is already there. The ability to detect silent stores is important because they might indicate performance bugs, might enable code optimizations, and might reveal opportunities for automatic parallelization, for instance. Silent stores are traditionally...
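The notion defined in the abstract can be made concrete with a small sketch. This dynamic detector (names hypothetical) flags a store as silent when the value being written already resides at the target address; note the paper itself predicts silentness statically, without running the program.

```python
# Hypothetical sketch of dynamic silent-store detection over a store trace.
# The paper performs *static* prediction; this only illustrates the definition.

def count_silent_stores(trace):
    """trace: iterable of (address, value) store operations."""
    memory = {}
    silent = 0
    for addr, value in trace:
        if memory.get(addr) == value:  # value already there -> silent store
            silent += 1
        memory[addr] = value
    return silent

# Example: the second write of 7 to address 0x10 is silent.
trace = [(0x10, 7), (0x20, 3), (0x10, 7), (0x20, 4)]
print(count_silent_stores(trace))  # -> 1
```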

Exposing Memory Access Patterns to Improve Instruction and Memory Efficiency in GPUs

Modern computing workloads often have high memory intensity, requiring high bandwidth access to memory. The memory request patterns of these workloads...

Poker: Permutation-Based SIMD Execution of Intensive Tree Search by Path Encoding

We introduce Poker, a permutation-based approach for vectorizing multiple queries over B+-trees. Our key insight is to combine vector loads and path-encoding-based permutations to alleviate memory latency while keeping the number of key comparisons needed for a query to a minimum. Implemented as a C++ template library, Poker represents a...

Automated Software Protection for the Masses Against Side-Channel Attacks

We present an approach and a tool to answer the need for effective, generic, and easily applicable protections against side-channel attacks. The...

Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory

Modern Graphic Processing Units (GPUs) have become pervasive computing devices in datacenters due to...

RAGuard: An Efficient and User-Transparent Hardware Mechanism against ROP Attacks

Control-flow integrity (CFI) is a general method for preventing code-reuse attacks, which utilize benign code sequences to achieve arbitrary code execution. CFI ensures that the execution of a program follows the edges of its predefined static Control-Flow Graph: any deviation that constitutes a CFI violation terminates the application. Despite...

GenMatcher: A Generic Clustering-Based Arbitrary Matching Framework

Packet classification methods rely upon matching packet content/headers against rules. Thus, the throughput of matching operations is critical in many networking applications. Further, with the advent of Software Defined Networking (SDN), efficient implementation of software approaches to matching is critical for overall system performance. This...

Processor-Tracing Guided Region Formation in Dynamic Binary Translation

Region formation is an important step in dynamic binary translation to select hot code regions for translation and optimization. The quality of the...


TACO Goes Gold Open Access

As of July 2018, and for a four-year period, all papers published in ACM Transactions on Architecture and Code Optimization (TACO) will be published as Gold Open Access (OA) and will be free to read and share via the ACM Digital Library.

About TACO

The ACM Transactions on Architecture and Code Optimization focuses on hardware, software, and systems research spanning the fields of computer architecture and code optimization. Articles that appear in TACO present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to computer architects, hardware or software developers, system designers, and tool builders are emphasized.

HAWS: Accelerating GPU Wavefront Execution through Selective Out-of-Order Execution

In this paper, we present a novel Hint-Assisted Wavefront Scheduler (HAWS) to bypass long-latency stalls on GPUs. HAWS leverages our compiler infrastructure to identify potential opportunities to bypass memory stalls. HAWS includes a wavefront scheduler that can continue to execute instructions in the shadow of a memory stall, executing instructions speculatively, guided by compiler-generated hints. HAWS increases the utilization of GPU resources by aggressively fetching and executing instructions speculatively. Based on our simulation results on the AMD Southern Islands GPU architecture, at an estimated cost of 0.4% total chip area, HAWS improves application performance by 15.3% on average for memory-intensive applications.

Dual-Page Checkpointing: An Architectural Approach to Efficient Data Persistence for In-Memory Applications

Data persistence is highly desired by many in-memory applications. This paper demonstrates that a hardware-based high-frequency checkpointing technique can be used to achieve efficient in-memory data persistence on NVM. We design a new dual-page checkpointing scheme that achieves low metadata cost and at the same time eliminates most excessive NVM writes, breaking the traditional trade-off between metadata space cost and extra data writes. Our solution achieves 24.2× higher throughput than the state-of-the-art data persistence software, and causes 28% less NVM wear with 1.25× higher throughput compared to state-of-the-art hardware checkpointing systems.

Accelerating Synchronization using Moving Compute to Data Model at 1000-core Multicore Scale

The shared-memory cache coherence paradigm is prevalent for data sharing in modern single-chip multicores. However, thread synchronization based on atomic instructions suffers from cache-line ping-pong overhead, which prevents performance scaling as core counts increase. To mitigate this bottleneck, this paper proposes to utilize in-hardware explicit messaging to enable a novel moving-computation-to-data model (MC) that offloads the execution of critical code regions to dedicated cores. The key idea is to utilize the low-latency, non-blocking capabilities of in-hardware messaging to overlap communication with computation, and to exploit data locality for efficient critical-code execution in multicores at the 1000-core scale.

SCORE: A Novel Scheme to Efficiently Cache Overlong ECCs in NAND Flash

Some solid state drives (SSDs) adopt overlong error correction codes (ECCs), whose redundancy size exceeds the spare area limit of flash pages, to improve reliability, but suffer from significantly degraded read performance. In this paper, we propose a novel scheme to cache overlong ECCs, called SCORE, to improve SSD performance. The excess ECC redundancy of logically consecutive data pages is grouped into ECC pages. SCORE partitions RAM to cache both data pages and ECC pages in a workload-adaptive manner. Experimental results show that SCORE improves SSD performance by an average of 34%, compared to the state-of-the-art schemes.

Speeding up Iterative Polyhedral Schedule Optimization with Surrogate Performance Models

To address the large computational effort of iterative compilation, we pursue the guidance of a genetic algorithm for program optimization via feedback from a surrogate performance model. We train the model on program transformations that were evaluated during the iterative optimization of a set of training programs. For the representation of programs and program transformations, we employ the polyhedron model. Our experimental evaluation reveals that our models can be used to speed up the future optimization of programs. We demonstrate that we can reduce the benchmarking effort required for an iterative optimization without substantially worsening the result.
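The benchmarking-reduction idea described above can be sketched in miniature. This is an illustrative toy, not the paper's model: `true_runtime` stands in for an expensive benchmark run, and a simple nearest-neighbour predictor over past measurements plays the role of the surrogate, so only a shortlist of promising candidates is actually benchmarked.

```python
# Toy sketch of surrogate-guided search (all names hypothetical).
import random

def true_runtime(x):          # stand-in for an expensive benchmark run
    return (x - 37) ** 2 + 5

def surrogate(x, history):    # 1-nearest-neighbour predictor over past runs
    xs, ys = zip(*history)
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
    return ys[i]

random.seed(0)
# "Training" evaluations gathered during a previous iterative optimization.
history = [(x, true_runtime(x)) for x in random.sample(range(100), 8)]
candidates = random.sample(range(100), 40)
# Benchmark only the 5 candidates the surrogate predicts to be fastest.
shortlist = sorted(candidates, key=lambda x: surrogate(x, history))[:5]
best = min(shortlist, key=true_runtime)
print(best, true_runtime(best))
```

Only 5 of the 40 candidates are benchmarked; the surrogate absorbs the rest of the search effort, which mirrors the trade-off the abstract describes.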

Energy-efficient Runtime Management of Heterogeneous Multicores using Online Projection

Heterogeneous multicores offer different core types and Dynamic Voltage and Frequency Scaling (DVFS), defining a vast configuration space. The optimal configuration choice is not always straightforward for single applications, and becomes even harder for dynamically changing scenarios of concurrent applications with unpredictable spawn and termination times and individual performance requirements. This paper proposes an integrated approach for runtime decision making for energy efficiency on such systems. The approach consists of a model that predicts performance and power for any possible decision and low-complexity heuristics that use this model to evaluate a subset of possible decisions and choose the best one.

The art of getting deep neural networks in shape

In this paper, we aim to automate the process of selecting an appropriate DNN topology that fulfills both functional and non-functional requirements of the application. Specifically, we tune two important hyperparameters, depth and width, which together define the shape of the neural network and directly affect its accuracy, speed, size, and energy consumption. By modeling the accuracy of DNNs we achieve up to 4x speed-up in design space traversal compared to exhaustive search. We are able to produce tuned ResNets, which are up to 4.22x faster than original depth-scaled ResNets on a batch of 128 images while matching their accuracy.

ITAP: Idle-Time-Aware Power Management for GPU Execution Units

In this paper, we propose a novel technique called Idle-Time-Aware Power Management (ITAP) to effectively reduce the static energy consumption of GPU execution units. ITAP employs three static power reduction modes with different overheads and capabilities of static power reduction. ITAP estimates the idle period length using prediction and look-ahead techniques in a synergistic way, and then applies the most appropriate static power reduction mode based on the estimated idle period length. Our experimental results show that ITAP outperforms the state-of-the-art solution by an average of 27.6% in terms of static energy savings, with a negligible performance overhead.

Reusing the Optimized Code for JavaScript Ahead-of-Time Compilation

Modern JavaScript engines achieve remarkable performance by employing a tiered execution architecture based on an interpreter, a baseline JITC, and an optimizing JITC. Unfortunately, they still suffer from substantial compilation overhead. We propose a novel AOTC that reuses optimized machine code for high-performance JavaScript engines. To do so, we resolve a few challenging issues related to reusing profile-based optimized code and relocating dynamic addresses. Our AOTC improves the performance of a commercial JavaScript engine by 6.36 times (max) and 1.99 times (average) on the Octane benchmarks, by reducing the compilation overhead and by running the optimized code from the first invocation.

Predicting New Workload or CPU Performance by Analyzing Public Datasets

Repositories of benchmark results are not always helpful when consumers need performance data for new processors or new workloads. Moreover, the aggregate scores for benchmark suites can be misleading. To address these problems, we have developed a deep neural network (DNN) model, and we have applied it to the datasets of Intel CPU specifications and SPEC CPU2006 and Geekbench 3 benchmark suites. We show that we can generate useful predictions for new processors and new workloads. We also quantify the self-similarity of these suites for the first time in the literature.

List of Distinguished Reviewers ACM TACO 2018

An Autotuning Framework for Scalable Execution of Tiled Code via Iterative Polyhedral Compilation

On modern many-core CPUs, tuning for complex memory subsystems and for parallel scalability is essential to achieving their potential. In this paper, we focus on loop tiling, which plays an important role in performance tuning, and develop a novel autotuning framework that analytically models load balance and empirically autotunes unpredictable cache behavior through iterative polyhedral compilation using LLVM/Polly. In an evaluation on many-core CPUs, we demonstrate that our autotuner achieves superior performance to conventional static tile size selection approaches and well-known autotuning heuristics, and nearly the same performance as a brute-force search-based approach.

A System-level Simulator for RRAM-based Neuromorphic Computing Chips

Advances in non-volatile resistive switching random access memory (RRAM) have made it a promising memory technology with applications in low-power and embedded in-memory computing devices, owing to advantages such as low energy consumption, small area cost, and good scaling. However, it is challenging to employ RRAM devices in neuromorphic chips due to the non-ideal behavior of RRAM. In this paper, we propose a cycle-accurate, scalable system-level simulator that can be used to study the effects of RRAM devices in neuromorphic chips. The simulator models a spatial neuromorphic chip architecture containing many neural cores with RRAM crossbars connected via a Network-on-Chip.

Automatic Kernel Code Generation for an Analogue SIMD Focal-Plane Sensor-Processor Array

Focal-plane Sensor-Processor Arrays (FPSPs) are new imaging devices with parallel Single Instruction Multiple Data (SIMD) computational capabilities built into every pixel. Compared to more traditional imaging devices, these devices enable massive pixel-parallel execution of image processing algorithms. This enables the application of certain image processing algorithms at extreme frame rates. By performing some early-stage processing in-situ, FPSPs have the potential to consume less power compared to conventional approaches using standard digital cameras. In this paper we explore code generation for an FPSP whose processors operate on analogue signal data, leading to further opportunities for power reduction.

Blaze-Tasks: A Framework for Computing Parallel Reductions over Tasks

This paper describes a simple yet effective prototype, Blaze-Tasks, for task scheduling and task reductions on shared-memory architectures. The framework has been designed with lock-free techniques and generic programming principles in mind. To load-balance the computation, Blaze-Tasks implements a task-stealing mechanism. To manage contention on a task pool, the number of lock-free attempts to steal a task depends on the distance between thief and pool owner. The paper evaluates the framework on an Intel dual-socket and an IBM dual-socket system using three different problems. The evaluation demonstrates that Blaze-Tasks is competitive with high-quality OpenMP implementations and Intel TBB.
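The task-stealing-plus-reduction combination described above can be illustrated with a minimal sequential sketch (names hypothetical, and no relation to Blaze-Tasks' actual lock-free implementation): each worker drains its own deque and, when empty, steals from the tail of another worker's deque, with per-worker partial sums combined into one reduction result.

```python
# Sequential illustration of task stealing with a parallel-style reduction.
# Real implementations (like the paper's) use lock-free deques and threads.
from collections import deque

def run_workers(tasks, num_workers=4):
    pools = [deque() for _ in range(num_workers)]
    for i, t in enumerate(tasks):            # distribute tasks round-robin
        pools[i % num_workers].append(t)
    totals = [0] * num_workers               # per-worker partial results
    for w in range(num_workers):
        while True:
            if pools[w]:
                totals[w] += pools[w].popleft()
            else:                            # local pool empty: try to steal
                victim = next((v for v in range(num_workers) if pools[v]), None)
                if victim is None:
                    break                    # no work left anywhere
                totals[w] += pools[victim].pop()  # steal from the victim's tail
    return sum(totals)                       # final reduction over partials

print(run_workers(range(1, 101)))  # -> 5050
```

Stealing from the tail while the owner pops from the head is the classic way to reduce contention between thief and owner; the distance-dependent retry count mentioned in the abstract refines exactly this step.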

Metric Selection for GPU Kernel Classification

Graphics Processing Units (GPUs) are widely used for running massively parallel programs. GPU kernels exhibit different behaviors at runtime; in particular, co-scheduling of compute-bound and memory-bound kernels seems promising. Kernels can be classified by instrumentation based on performance counters. The current work conducts a thorough analysis of the metrics collected from various benchmarks to find the minimum number of effective metrics that enables low-overhead classification of kernels. A wrapper-based feature selection method based on the Fisher criterion is used. Experiments show that only three out of more than 90 metrics are sufficient to classify kernels.
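The Fisher criterion at the heart of the selection method can be sketched briefly. This is a generic two-class formulation (not the paper's exact wrapper procedure): each metric is scored by how far apart the class means are relative to the within-class variances, and only the top-scoring metrics are kept.

```python
# Hedged sketch of Fisher-criterion metric ranking; toy data, not the paper's.

def fisher_score(values, labels):
    """Score one metric: (mean gap)^2 / (sum of within-class variances)."""
    a = [v for v, l in zip(values, labels) if l == 0]
    b = [v for v, l in zip(values, labels) if l == 1]
    mean = lambda xs: sum(xs) / len(xs)
    var = lambda xs: sum((x - mean(xs)) ** 2 for x in xs) / len(xs)
    return (mean(a) - mean(b)) ** 2 / (var(a) + var(b) + 1e-12)

# Toy data: metric0 separates the two kernel classes; metric1 is noise.
labels  = [0, 0, 0, 1, 1, 1]        # e.g. 0 = compute-bound, 1 = memory-bound
metric0 = [0.1, 0.2, 0.15, 0.9, 0.85, 0.95]   # discriminative
metric1 = [0.5, 0.4, 0.6, 0.5, 0.6, 0.4]      # uninformative
scores = [fisher_score(m, labels) for m in (metric0, metric1)]
print(scores[0] > scores[1])  # -> True
```

Ranking all ~90 counters this way and keeping only the highest scorers is what lets classification run with just a handful of metrics.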

POWAR: Power-Aware Routing in HPC Networks with On/Off Links

In order to save energy in HPC interconnection networks, one common proposal is to switch idle links into a low-power mode after a certain time without any transmission, as the IEEE Energy Efficient Ethernet standard proposes. Extending the low-power-mode mechanism, we propose Power-Aware Routing (PAR), a simple power-aware routing and selection function for fat-tree and torus networks. PAR adapts the number of usable network links to the network load, achieving substantial energy savings in the network (55%-65%) and the entire system (9%-10%) with negligible performance overhead.

Decoupled Fused Cache: Fusing a Decoupled LLC with a DRAM Cache

DRAM caches (DCs) have shown excellent performance potential; however, they are still far from their ideal performance. This paper introduces Decoupled Fused Cache (DFC), a DRAM cache design that alleviates the cost of tag accesses by fusing DC tags with the tags of the on-chip LLC. In most cases, DFC relies on the LLC tag access to retrieve the information required for accessing the DC, avoiding additional overheads. Compared to existing DCs, DFC improves performance by 6% on average and by 16-18% for large cacheline sizes. Finally, DFC reduces DRAM cache traffic by 18% and DRAM cache energy consumption by 7%.

Efficient Cache Performance Modeling in GPUs Using Reuse Distance Analysis

This paper presents a framework that uses a detailed RDA algorithm to generate reuse distance profiles and other performance parameters for the GPU cache hierarchy. The effects of reservation failures in cache blocks and miss-status holding registers are included in the RDA model. The framework is 264K× slower than native GPU execution (10× faster than GPGPU-Sim) and estimates L1 and L2 miss rates within an average error of 4.93% and 4.87%, respectively. To alleviate the computational complexity of RDA, we applied a statistical sampling method that achieves a 1.8× speedup on average and predicts L1/L2 miss rates within ~5%/9%, respectively.
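The reuse distance metric underlying the framework can be shown in a few lines. This is the standard textbook stack-distance definition, not the paper's GPU-specific RDA: the reuse distance of an access is the number of distinct addresses touched since the previous access to the same address, with cold (first) accesses taken as infinite.

```python
# Textbook reuse-distance sketch (O(n^2); real RDA tools use tree structures).

def reuse_distances(trace):
    dists, last_seen = [], {}
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct addresses touched since the last access to `addr`.
            dists.append(len(set(trace[last_seen[addr] + 1 : i])))
        else:
            dists.append(float("inf"))       # cold (first) access
        last_seen[addr] = i
    return dists

# 'a' is reused after touching {b, c}: distance 2; likewise 'b' after {c, a}.
print(reuse_distances(["a", "b", "c", "a", "b"]))  # -> [inf, inf, inf, 2, 2]
```

An access with reuse distance d hits in a fully associative LRU cache of capacity greater than d, which is why reuse distance profiles predict miss rates across cache sizes.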
