ACM Transactions on Architecture and Code Optimization (TACO)

Latest Articles

An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns

GPUs provide high-bandwidth/low-latency on-chip shared memory and L1 cache to efficiently service a large number of concurrent memory requests...

The Power-optimised Software Envelope

Advances in processor design have delivered performance improvements for decades. As physical limits are reached, refinements to the same basic technologies are beginning to yield diminishing returns. Unsustainable increases in energy consumption are forcing hardware manufacturers to prioritise energy efficiency in their designs. Research suggests...

Caliper: Interference Estimator for Multi-tenant Environments Sharing Architectural Resources

We introduce Caliper, a technique for accurately estimating performance interference occurring in shared servers. Caliper overcomes the limitations of prior approaches by leveraging a micro-experiment-based technique. In contrast to state-of-the-art approaches that focus on periodically pausing co-running applications to estimate slowdown, Caliper...

Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution

Contemporary GPUs support multiple kernels to run concurrently on the same streaming multiprocessors (SMs). Recent studies have demonstrated that such...


TACO Goes Gold Open Access

As of July 2018, and for a four-year period, all papers published in ACM Transactions on Architecture and Code Optimization (TACO) will be published as Gold Open Access (OA) and will be free to read and share via the ACM Digital Library.

About TACO

The ACM Transactions on Architecture and Code Optimization focuses on hardware, software, and systems research spanning the fields of computer architecture and code optimization. Articles that appear in TACO present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to computer architects, hardware or software developers, system designers, and tool builders are emphasized.

Optimizing Remote Communication in X10

In this paper, we present AT-Com, a scheme to optimize X10 code with place-change operations. AT-Com consists of two inter-related new optimizations: (i) AT-Opt, which minimizes the amount of data serialized and communicated during place-change operations, and (ii) AT-Pruning, which identifies and elides redundant place-change operations and executes the remaining ones in parallel. We have implemented AT-Com in the X10 v2.6.0 compiler and tested it over the IMSuite benchmark kernels. Compared to the current X10 compiler, the AT-Com-optimized code achieved geometric mean speedups of 18.72x and 17.83x on a four-node (32 cores/node) Intel system and a two-node (16 cores/node) AMD system, respectively.

Simplifying Transactional Memory Support in C++

The C++ Transactional Memory Technical Specification (TMTS) has not seen widespread adoption, in large part due to its complexity. We conjecture that the proposed TM support is too difficult for programmers to use, too complex for compiler designers to implement and verify, and not industry-proven enough to justify final standardization in its current form. We show that eliminating support for self-abort, coupled with the use of an "executor" interface to the TM system, can handle a wide range of transactional programs, delivering low instrumentation overhead with scalability and performance on par with the current state of the art.

A Relational Theory of Locality

Locality is a common concept used in many areas of program and system analysis and optimization. It has been defined and measured in many ways. Previous work has focused on individual types of locality; this paper focuses on how they are related. It formalizes their relations by categorizing commonly used definitions into three groups: access locality, timescale locality, and cache locality, and by showing whether and how we can convert between them. Often two different metrics, e.g., access frequency and miss ratio, are related but not equivalent. The formalization shows precisely how they are related and where they differ.
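One classic bridge between timescale locality and cache locality is the LRU stack (reuse) distance: for a fully associative LRU cache, an access misses exactly when its stack distance exceeds the cache size, so a reuse-distance histogram determines the miss ratio at every cache size. The sketch below illustrates that idea only; it is not the paper's formalization, and the trace is a hypothetical example.

```python
def reuse_distances(trace):
    """LRU stack distance of each access (inf for first-time accesses):
    the number of distinct blocks touched since the last use, inclusive."""
    stack = []   # LRU stack, most recently used at the end
    dists = []
    for addr in trace:
        if addr in stack:
            d = len(stack) - stack.index(addr)  # depth from the top
            stack.remove(addr)
        else:
            d = float('inf')                    # cold miss at any size
        stack.append(addr)                      # addr becomes most recent
        dists.append(d)
    return dists

def miss_ratio(trace, cache_size):
    """Miss ratio of a fully associative LRU cache of the given size:
    the fraction of accesses whose stack distance exceeds the size."""
    ds = reuse_distances(trace)
    return sum(d > cache_size for d in ds) / len(ds)
```

For the trace `['a', 'b', 'a', 'b', 'c', 'a']`, a 2-block cache misses on 4 of 6 accesses, while a 3-block cache misses only on the 3 cold accesses; one histogram yields the miss ratio for every size, which is the kind of metric conversion the paper formalizes.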

A Neural Network Prefetcher for Arbitrary Memory Access Patterns

Common memory prefetchers are designed to target specific memory access patterns, including spatio-temporal locality, recurring patterns, and irregular patterns. In this paper, we propose a conceptual neural network (NN) prefetcher that dynamically adapts to arbitrary access patterns to capture semantic locality. Leveraging recent advances in machine learning, the proposed NN prefetcher correlates program context with memory accesses using online training, and enables tapping into previously undetected access patterns. We present an architectural implementation of our prefetcher and evaluate it over SPEC2006, Graph500, and other kernels, showing it delivers up to 30% speedup on SPEC2006 and up to 4.4x speedup on some kernels.

Side-channel Timing Attack of RSA on a GPU

In this work, we build a timing model to capture the parallel characteristics of an RSA public-key cipher implemented on a GPU. We consider optimizations that include using Montgomery multiplication and sliding-window exponentiation to implement cryptographic operations. Our timing model accounts for the complications introduced by parallel execution. Based on our timing model, we launch successful timing attacks on RSA running on a GPU, extracting the RSA private key. We also present an effective error detection and correction mechanism. Our results demonstrate that GPU acceleration of RSA is vulnerable to side-channel timing attacks.
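The root cause such attacks exploit is that modular exponentiation takes key-dependent time. A minimal left-to-right square-and-multiply sketch (a standard textbook algorithm, not the paper's GPU implementation or its sliding-window variant) makes the dependence visible: the multiply step executes only for the 1-bits of the secret exponent, so running time correlates with the key.

```python
def modexp_lr(base, exp, mod):
    """Left-to-right binary (square-and-multiply) modular exponentiation.
    The conditional multiply makes running time depend on the Hamming
    weight and bit pattern of the exponent, which is exactly the kind of
    data dependence a timing side channel can exploit."""
    result = 1
    for bit in bin(exp)[2:]:              # scan exponent bits, MSB first
        result = (result * result) % mod  # every bit: square
        if bit == '1':
            result = (result * base) % mod  # only 1-bits: extra multiply
    return result
```

Optimized implementations replace this with Montgomery multiplication and sliding windows for speed, but the paper shows that residual data-dependent behavior still leaks through timing, even under GPU parallelism.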

Memory-access-aware safety and profitability analysis for transformation of accelerator OpenMP loops

Iteration Point Difference Analysis is a new static analysis framework that can be used to determine the memory coalescing characteristics of parallel loops that target GPU offloading and to ascertain the safety and profitability of loop transformations aimed at improving their memory access characteristics. The analysis can propagate definitions through control flow, works for non-affine expressions, and is capable of analyzing expressions that reference conditionally defined values. This framework enables safe and profitable loop transformations. Experimental results demonstrate the potential for dramatic performance improvements. This work also demonstrates how architecture-aware compilers improve code portability and reduce programmer effort.

Polyhedral Compilation for Multi-dimensional Stream Processing

A first step toward using Quantum Computing for Low-level WCETs estimations

In this paper, we want to show some potential advantages of using a formalism inspired by Quantum Computing (QC) to evaluate CMRDs with preemptions while avoiding the underlying NP-hard problem. The experimental results, obtained with a classical (non-quantum) numerical approach on a selection of Mälardalen benchmark programs, display very good accuracy, while the complexity of the evaluation is a low-order polynomial in the number of memory accesses. While this is not yet a full quantum algorithm, we provide a first roadmap toward that objective in future work.

Towards On-Chip Network Security Using Runtime Isolation Mapping

Many-cores execute a large number of diverse applications concurrently. Inter-application interference in the on-chip network can lead to a security threat such as a timing-channel attack. The mapping of applications effectively determines the interference among them in the on-chip network. In this work, we explore non-interference approaches through run-time mapping at the software and application level. Through run-time mapping, we can maximize utilization of the system without leaking information. The proposed run-time mapping policy requires no router modification, in contrast to the best known competing schemes, and its throughput degradation is, on average, 16% lower than that of the state-of-the-art non-secure baselines.

MH Cache: A Multi-retention STT-RAM-based Low-power Last-level Cache for Mobile Hardware Rendering Systems

We herein propose MH cache, a multi-retention STT-RAM-based cache management scheme for last-level caches (LLCs) that reduces their power consumption in mobile hardware rendering systems. We analyzed the memory access patterns of processes and observed how rendering methods affect process behavior. We propose a cache management scheme that dynamically measures the write intensity of each process and exploits it to manage the proposed cache. Our experimental results show that our techniques significantly reduce LLC power consumption, by 33% and 32.2% in single- and quad-core systems, respectively, compared to a full STT-RAM LLC.
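The core idea, per-process write-intensity measurement steering data between retention regions, can be sketched in a few lines. Everything below is illustrative: the class name, threshold, and interval scheme are hypothetical, not MH cache's actual mechanism; the sketch only conveys why write-heavy processes would be steered to a short-retention (cheap-to-write) STT-RAM region and read-mostly processes to a long-retention one.

```python
from collections import defaultdict

class WriteIntensityTracker:
    """Hypothetical sketch: count LLC writes per process over an interval
    and classify each process into a retention region accordingly."""

    def __init__(self, threshold=1000):
        self.threshold = threshold          # writes per interval (illustrative)
        self.writes = defaultdict(int)

    def record_write(self, pid):
        self.writes[pid] += 1               # one LLC write by this process

    def retention_class(self, pid):
        # Write-heavy -> short-retention region (lower write energy/latency);
        # read-mostly -> long-retention region (no refresh-like overhead).
        return "short" if self.writes[pid] >= self.threshold else "long"

    def new_interval(self):
        self.writes.clear()                 # re-measure dynamically each interval
```

Resetting the counters each interval is what makes the measurement dynamic: a process whose rendering method changes its write behavior is reclassified in the next interval.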

Morphable DRAM Cache Design for Hybrid Memory Systems

DRAM caches have emerged as an efficient new layer in the memory hierarchy to address the increasing diversity of memory components. This paper first investigates how prior approaches perform with diverse hybrid memory configurations, and observes that no single DRAM cache organization always outperforms the others across all the diverse scenarios. This paper proposes a reconfigurable DRAM cache design that can adapt to different hardware configurations and application patterns. Using a sample-based mechanism, the proposed DRAM cache controller dynamically finds the best organization among three candidates and applies it through reconfiguration.
