ACM Transactions on Architecture and Code Optimization (TACO)

Latest Articles

SketchDLC: A Sketch on Distributed Deep Learning Communication via Trace Capturing

With the fast development of deep learning (DL), communication is increasingly a bottleneck for distributed workloads, and a series of optimizations has been proposed to scale out successfully. Nevertheless, the network behavior has not been investigated much yet. We intend to analyze the network behavior and then carry out some research...

Efficient and Scalable Execution of Fine-Grained Dynamic Linear Pipelines

We present Pipelite, a dynamic scheduler that exploits the properties of dynamic linear pipelines to achieve high performance for fine-grained...

Efficient Data Supply for Parallel Heterogeneous Architectures

Decoupling techniques have been proposed to reduce the amount of memory latency exposed to high-performance accelerators as they fetch data. Although...

Schedule Synthesis for Halide Pipelines through Reuse Analysis

Efficient code generation for image processing applications continues to pose a challenge in a domain where high performance is often necessary to...

Supporting Superpages and Lightweight Page Migration in Hybrid Memory Systems

Superpages have long been used to mitigate address translation overhead in large-memory systems. However, superpages often preclude lightweight page...

SAQIP: A Scalable Architecture for Quantum Information Processors

Proposing an architecture that efficiently compensates for the inefficiencies of physical hardware with extra resources is one of the key issues in quantum computer design. Although the demonstration of quantum systems has been limited to some dozen qubits, scaling the current small-sized lab quantum systems to large-scale quantum systems that are...

Accelerating In-Memory Database Selections Using Latency Masking Hardware Threads

Inexpensive DRAMs have created new opportunities for in-memory data analytics. However, the major bottleneck in such systems is high memory access...

Transparent Acceleration for Heterogeneous Platforms With Compilation to OpenCL

Multi-accelerator platforms combine CPUs and different accelerator architectures within a single compute node. Such systems are capable of processing...

HAWS: Accelerating GPU Wavefront Execution through Selective Out-of-order Execution

Graphics Processing Units (GPUs) have become an attractive platform for accelerating challenging applications on a range of platforms, from High Performance Computing (HPC) to full-featured smartphones. They can overcome computational barriers in a wide range of data-parallel kernels. GPUs hide pipeline stalls and memory latency by utilizing...

A Self-aware Resource Management Framework for Heterogeneous Multicore SoCs with Diverse QoS Targets

In modern heterogeneous MPSoCs, the management of shared memory resources is crucial in delivering...

Combining Source-adaptive and Oblivious Routing with Congestion Control in High-performance Interconnects using Hybrid and Direct Topologies

Hybrid and direct topologies are cost-efficient and scalable options to interconnect thousands of end nodes in high-performance computing (HPC)...

Efficient Checkpointing with Recompute Scheme for Non-volatile Main Memory

Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing...


TACO Goes Gold Open Access

As of July 2018, and for a four-year period, all papers published in ACM Transactions on Architecture and Code Optimization (TACO) will be published as Gold Open Access (OA) and will be free to read and share via the ACM Digital Library.

About TACO

The ACM Transactions on Architecture and Code Optimization focuses on hardware, software, and systems research spanning the fields of computer architecture and code optimization. Articles that appear in TACO present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to computer architects, hardware or software developers, system designers and tool builders are emphasized.

Simplifying Transactional Memory Support in C++

The C++ Transactional Memory Technical Specification (TMTS) has not seen widespread adoption, in large part due to its complexity. We conjecture that the proposed TM support is too difficult for programmers to use, too complex for compiler designers to implement and verify, and not industry-proven enough to justify final standardization in its current form. We show that the elimination of support for self-abort, coupled with the use of an "executor" interface to the TM system, can handle a wide range of transactional programs, delivering low instrumentation overhead and scalability and performance on par with the current state of the art.

Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution

Contemporary GPUs support multiprogramming by allowing multiple kernels to run concurrently on the same streaming multiprocessors. Recent studies have demonstrated that such concurrent kernel execution improves both resource utilization and computational throughput. However, significant performance slowdown and unfairness are observed when latency-sensitive kernels co-run with bandwidth-intensive ones. In this paper, we first make a case that such problems cannot be sufficiently solved by managing CTA combinations alone and reveal the fundamental reasons. Then, we propose a coordinated approach for CTA combination and bandwidth partitioning. Our approach significantly improves the performance even compared with the exhaustively searched CTA combination.

Memory-access-aware safety and profitability analysis for transformation of accelerator OpenMP loops

Iteration Point Difference Analysis is a new static analysis framework that can be used to determine the memory coalescing characteristics of parallel loops that target GPU offloading and to ascertain the safety and profitability of loop transformations with the goal of improving their memory access characteristics. This analysis can propagate definitions through control flow, works for non-affine expressions, and is capable of analyzing expressions that reference conditionally-defined values. This analysis framework enables safe and profitable loop transformations. Experimental results demonstrate potential for dramatic performance improvements. This work also demonstrates how architecture-aware compilers improve code portability and reduce programmer effort.
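To make "memory coalescing characteristics" concrete, the following is a minimal sketch that counts how many memory segments one group of GPU threads touches for a strided access such as a[i * stride]. It is not the Iteration Point Difference Analysis itself; the function name, the 32-thread warp, and the 128-byte segment size are illustrative assumptions rather than details taken from the article.

```python
def segments_touched(stride_elems, elem_bytes=4, warp_size=32, segment_bytes=128):
    """Count distinct memory segments accessed by one warp reading a[lane * stride].

    warp_size and segment_bytes are assumed values chosen for illustration only.
    """
    segments = {
        (lane * stride_elems * elem_bytes) // segment_bytes
        for lane in range(warp_size)
    }
    return len(segments)


# Unit stride: all 32 four-byte accesses fall into a single 128-byte segment.
coalesced = segments_touched(stride_elems=1)

# Stride of 32 elements: every thread lands in its own segment.
uncoalesced = segments_touched(stride_elems=32)
```

Under these assumptions, a loop transformation that turns the second access pattern into the first reduces the number of segments fetched per warp, which is the kind of profitability question the abstract refers to.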

Polyhedral Compilation for Multi-dimensional Stream Processing

A first step toward using Quantum Computing for Low-level WCETs estimations

In this paper, we want to show some potential advantages of using a formalism inspired by Quantum Computing (QC) to evaluate CMRDs with preemptions and avoid the NP-hard problem underneath. The experimental results, with a classic (non-quantum) numerical approach, on a selection of Malardalen benchmark programs display very good accuracy, while the complexity of the evaluation is a low-order polynomial of the number of memory accesses. Whilst it is not yet a full quantum algorithm, we provide a first roadmap on how to reach such an objective in future work.

The Power-Optimised Software Envelope

In this paper, we present the Power Optimised Software Envelope (POSE) model, which allows developers to assess whether power optimisation is worth pursuing for their applications. We first outline the POSE model using the established Energy-Delay Product (EDP) family of metrics. We then provide formulations of our model using the novel Energy-Delay Sum and Energy-Delay Distance metrics, as we believe these metrics are more appropriate for energy-aware optimisation efforts. Finally, we introduce an extension to POSE, named System Summary POSE. System Summary POSE allows us to reason about system-wide scope for energy-aware optimisation independently of any particular application.
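As background for the Energy-Delay Product family mentioned above, here is a minimal sketch of the standard metric E × T^w, where w = 1 gives EDP and w = 2 gives ED2P. The function name and the sample energy and time values are hypothetical. The sketch does not implement the POSE model, nor the Energy-Delay Sum or Energy-Delay Distance metrics, whose definitions are not given in this abstract.

```python
def energy_delay_product(energy_joules, delay_seconds, weight=1):
    """Return E * T**weight (weight=1 is EDP, weight=2 is ED2P)."""
    if energy_joules < 0 or delay_seconds < 0:
        raise ValueError("energy and delay must be non-negative")
    return energy_joules * delay_seconds ** weight


# Hypothetical measurements: the "optimised" build saves energy but runs longer.
baseline_edp = energy_delay_product(10.0, 2.0)
optimised_edp = energy_delay_product(8.0, 2.5)

# With a heavier weight on delay, the same optimisation looks worse.
baseline_ed2p = energy_delay_product(10.0, 2.0, weight=2)
optimised_ed2p = energy_delay_product(8.0, 2.5, weight=2)
```

With these made-up numbers the two builds tie on EDP, while the slower build loses on ED2P, which illustrates why the choice of metric matters when judging whether a power optimisation is worthwhile.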

Towards On-Chip Network Security Using Runtime Isolation Mapping

Many-cores execute a large number of diverse applications concurrently. Inter-application interference can lead to security threats, such as timing-channel attacks, in the on-chip network. The mapping of applications can effectively determine the interference among applications in the on-chip network. In this work, we explore non-interference approaches through run-time mapping at software and application level. Through run-time mapping, we can maximize utilization of the system without leaking information. The proposed run-time mapping policy requires no router modification in contrast to the best known competing schemes, and the throughput degradation is, on average, 16% lower than that of the state-of-the-art non-secure baselines.

MH Cache: A Multi-retention STT-RAM-based Low-power Last-level Cache for Mobile Hardware Rendering Systems

We herein propose MH cache, a multi-retention STT-RAM-based cache management scheme for last-level caches (LLC) to reduce their power consumption in mobile hardware rendering systems. We analyzed the memory access patterns of processes and observed how rendering methods affect process behaviors. We propose a cache management scheme that measures the write-intensity of each process dynamically and exploits it to manage our proposed cache. Our experimental results show that our techniques significantly reduce the LLC power consumption by 33% and 32.2% in single- and quad-core systems, respectively, compared to a full STT-RAM

Caliper: Interference Estimator for Multi-tenant Environments Sharing Architectural Resources

We introduce Caliper, a technique for accurately estimating performance interference occurring in shared servers, overcoming the limitations of prior approaches by leveraging a micro-experiment-based technique. In contrast to state-of-the-art approaches that focus on periodically pausing co-running applications to estimate slowdown, Caliper utilizes a strategic phase-triggered technique to capture interference due to co-location. This allows Caliper to orchestrate an accurate and low-overhead interference estimation technique that can be readily deployed in existing production systems. We evaluate Caliper for a wide spectrum of workload scenarios, demonstrating that it seamlessly supports up to 16 applications running simultaneously and outperforms state-of-the-art approaches.

Morphable DRAM Cache Design for Hybrid Memory Systems

DRAM caches have emerged as an efficient new layer in the memory hierarchy to address the increasing diversity of memory components. This paper first investigates how prior approaches perform with diverse hybrid memory configurations, and observes that no single DRAM cache organization always outperforms the other organizations across all the diverse scenarios. This paper proposes a reconfigurable DRAM cache design which can adapt to different HW configurations and application patterns. Using a sample-based mechanism, the proposed DRAM cache controller dynamically finds the best organization from three candidates and applies it through reconfiguration.

Correct-by-Construction Parallelization of Hard Real-Time Avionics Applications on Off-the-Shelf Predictable Hardware

We present the first end-to-end modeling and compilation flow to parallelize hard real-time control applications while fully guaranteeing the respect of real-time requirements. It scales to thousands of data-flow nodes, and has been validated on two production avionics applications. Unlike classical optimizing compilation, it takes as input non-functional requirements (real-time, resource constraints). To ensure respect of requirements, the compiler follows a static resource allocation strategy, from coarse grain tasks communicating over an interconnection network, all the way to individual variables and memory accesses. It keeps track of timing interferences resulting from mapping decisions in a precise, safe, and scalable way.

An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns

In this paper, we propose Elastic-Cache to support both fine- and coarse-grained cache-line management to improve the L1 cache efficiency of GPUs. Specifically, it stores 32-byte words from non-contiguous memory locations in a single 128-byte cache line. Furthermore, it neither requires an extra memory structure nor reduces the capacity of the L1 cache for tag storage. To improve the bandwidth utilization of the L1 cache, we further propose Elastic-Plus to issue 32-byte requests in parallel, which can reduce the processing latency of instructions and improve the throughput of GPUs. Our experiment shows that Elastic-Cache and Elastic-Plus improve the performance by 104% and 131%, respectively.
