ACM Transactions on Architecture and Code Optimization (TACO)

Latest Articles

Exploring an Alternative Cost Function for Combinatorial Register-Pressure-Aware Instruction Scheduling

Multiple combinatorial algorithms have been proposed for pre-allocation instruction scheduling...

Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation

Single instruction multiple data (SIMD) has been adopted for decades because of its superior performance and power efficiency. The SIMD capability...

ITAP: Idle-Time-Aware Power Management for GPU Execution Units

Graphics Processing Units (GPUs) are widely used as the accelerator of choice for applications with massively data-parallel tasks. However, recent studies show that GPUs suffer heavily from resource underutilization, which, combined with their large static power consumption, imposes a significant...

Accelerating Synchronization Using Moving Compute to Data Model at 1,000-core Multicore Scale

Thread synchronization using the shared-memory hardware cache coherence paradigm is prevalent in multicore processors. However, as the number of cores...


TACO Goes Gold Open Access

As of July 2018, and for a four-year period, all papers published in ACM Transactions on Architecture and Code Optimization (TACO) will be published as Gold Open Access (OA) and will be free to read and share via the ACM Digital Library.

About TACO

The ACM Transactions on Architecture and Code Optimization focuses on hardware, software, and systems research spanning the fields of computer architecture and code optimization. Articles that appear in TACO present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to computer architects, hardware or software developers, system designers and tool builders are emphasized.

HAWS: Accelerating GPU Wavefront Execution through Selective Out-of-Order Execution

In this paper, we present a novel Hint-Assisted Wavefront Scheduler (HAWS) to bypass long-latency stalls on GPUs. HAWS leverages our compiler infrastructure to identify potential opportunities to bypass memory stalls. HAWS includes a wavefront scheduler that can continue to execute instructions in the shadow of a memory stall, executing instructions speculatively, guided by compiler-generated hints. HAWS increases utilization of GPU resources by aggressively fetching and executing instructions speculatively. Based on our simulation results on the AMD Southern Islands GPU architecture, at an estimated cost of 0.4% total chip area, HAWS can improve application performance by 15.3% on average for memory-intensive applications.

Comprehensive Characterization of an Open Source Document Search Engine

This work performs a thorough characterization and analysis of the Nutch web search benchmark, which is based on the popular open-source Lucene search library. The paper describes in detail the architecture, functionality, and micro-architectural behaviour of the search engine, and investigates prominent web search research issues. In particular, we study how intra-server index partitioning affects response time and throughput, explore the potential use of low-power servers for web search, and examine the potential causes of performance degradation and variability as well as causes of tail latencies.

SketchDLC: A Sketch on Distributed Deep Learning Communication via Trace Capturing

We provide a measurement study of distributed deep learning communication via trace capturing. First, we analyze the communication mechanism of MXNet in detail. Second, we define the DLC trace format to record communication behaviors. Third, we present the implementation of our trace-capturing method. Fourth, we verify the communication mechanism by examining the resulting trace files. Finally, we present statistics and analyses of distributed deep learning communication based on the captured traces, including the communication pattern, the overlap ratio between computation and communication, synchronization overhead, and update overhead.
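One of the statistics mentioned above, the overlap ratio between computation and communication, can be derived directly from timestamped trace intervals. The sketch below shows one way to compute it; the record format (plain start/end tuples) is an illustrative assumption, not the paper's DLC trace format.

```python
# Sketch: computing the computation/communication overlap ratio from
# captured trace intervals. The (start, end) tuple format here is a
# hypothetical simplification, not the paper's DLC trace format.

def total_overlap(compute_spans, comm_spans):
    """Sum of time during which a compute span and a comm span overlap."""
    overlap = 0.0
    for cs, ce in compute_spans:
        for ms, me in comm_spans:
            overlap += max(0.0, min(ce, me) - max(cs, ms))
    return overlap

def overlap_ratio(compute_spans, comm_spans):
    """Fraction of communication time hidden behind computation."""
    comm_time = sum(e - s for s, e in comm_spans)
    return total_overlap(compute_spans, comm_spans) / comm_time if comm_time else 0.0

# Example: one 3 ms parameter push overlapping two compute bursts.
compute = [(0.0, 4.0), (5.0, 9.0)]
comm    = [(3.0, 6.0)]
print(overlap_ratio(compute, comm))   # 2/3 of the push is hidden
```

A real trace would also have to handle overlapping compute spans (e.g., by merging them first); the quadratic loop is fine for a sketch but an interval sweep would scale better.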

Efficient Data Supply for Parallel Heterogeneous Architectures

Decoupling techniques have been proposed to reduce the amount of memory latency exposed to high-performance accelerators as they fetch data. Although decoupled access-execute (DAE) and more recent decoupled data supply approaches offer promising single-threaded performance improvements, little work has considered how to extend them into parallel scenarios. This paper explores the opportunities and challenges of designing parallel, high-performance, resource-efficient decoupled data supply systems. We propose Mercury, a parallel decoupled data supply system that utilizes thread-level parallelism for high-throughput data supply with good portability attributes. Additionally, we introduce some micro-architectural improvements for data supply units to efficiently handle long-latency indirect loads.

Schedule Synthesis for Halide Pipelines through Reuse Analysis

Efficient code generation for image processing pipelines remains a challenge due to the inherently complex structure of many image processing applications, the plethora of transformations that can be applied, and the interaction of these transformations with locality, parallelism, and recomputation. We propose a novel optimization strategy that aims to maximize producer-consumer locality and reuse between stages of the pipeline. We implement it as a tool to be used alongside the Halide DSL and test it on a variety of benchmarks. Experimental results on three multi-core platforms show a performance improvement of over 40% compared to previous state-of-the-art approaches.

Efficient and Scalable Execution of Fine-Grained Dynamic Linear Pipelines

We present Pipelite, a dynamic scheduler that exploits the properties of dynamic linear pipelines to achieve high performance for fine-grained workloads. The flexibility of Pipelite allows the stages and their dependences to be determined at run time. Pipelite unifies communication, scheduling, and synchronization algorithms with suitable data structures. This unified design introduces the local suspension mechanism and a wait-free enqueue operation, which allow efficient dynamic scheduling. The evaluation on a 44-core machine, using programs from three widely used benchmark suites, shows that Pipelite incurs low overhead and significantly outperforms the state of the art in terms of speedup, scalability, and memory usage.
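To make the programming model concrete, the sketch below runs a linear pipeline whose stages are plain functions chosen at run time, with one worker thread per stage and FIFO queues between stages. This illustrates only the model; lock-based queues like these are exactly the overhead that a scheduler such as Pipelite is designed to avoid for fine-grained stages.

```python
# Sketch: a dynamic linear pipeline -- stages and their order are
# picked at run time, connected by queues, one thread per stage.
# Illustrative only; not Pipelite's wait-free implementation.

import queue
import threading

def run_pipeline(stages, items):
    qs = [queue.Queue() for _ in range(len(stages) + 1)]
    DONE = object()  # sentinel propagated through every stage

    def worker(stage, q_in, q_out):
        while True:
            item = q_in.get()
            if item is DONE:
                q_out.put(DONE)
                return
            q_out.put(stage(item))

    threads = [threading.Thread(target=worker, args=(s, qs[i], qs[i + 1]))
               for i, s in enumerate(stages)]
    for t in threads:
        t.start()
    for item in items:
        qs[0].put(item)
    qs[0].put(DONE)

    out = []
    while True:
        item = qs[-1].get()
        if item is DONE:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out

# Stages chosen at run time:
print(run_pipeline([lambda x: x + 1, lambda x: x * 2], range(5)))
# [2, 4, 6, 8, 10]
```

Because each stage has a single worker and the queues are FIFO, output order matches input order, which is the defining property of a linear pipeline.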

Accelerating In-Memory Database Selections Using Latency Masking Hardware Threads

Inexpensive DRAMs have created new opportunities for in-memory data analytics. However, the major bottleneck in such systems is high memory access latency. Traditionally, this problem is addressed with large cache hierarchies, which benefit only regular applications; many data-intensive applications instead exhibit irregular behavior. Hardware multithreading can better cope with the high latency seen in such applications. This paper implements a multithreaded prototype (MTP) on FPGAs for the relational selection operator, which exhibits control-flow irregularity. On a standard TPC-H query, MTP achieves a normalized speedup of 1.8x over a CPU and 3.2x over a GPU, while consuming 2.5x and 3.4x less power, respectively.
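The intuition behind latency masking with hardware threads can be captured in a back-of-envelope model: if each thread computes for C cycles and then waits L cycles on memory, T threads keep the datapath busy for a fraction min(1, T·C/(C+L)) of the time. The numbers below are illustrative, not from the paper.

```python
# Sketch: a simple analytical model of latency masking via hardware
# multithreading. Each thread alternates C compute cycles with an
# L-cycle memory stall; threads are interleaved to hide the stalls.
# The parameters are illustrative assumptions, not the paper's.

import math

def utilization(threads, compute_cycles, mem_latency):
    """Fraction of cycles the datapath does useful work."""
    period = compute_cycles + mem_latency
    return min(1.0, threads * compute_cycles / period)

def threads_to_mask(compute_cycles, mem_latency):
    """Smallest thread count that fully hides the memory latency."""
    return math.ceil((compute_cycles + mem_latency) / compute_cycles)

C, L = 10, 190          # 10 cycles of work per 190-cycle memory access
print(utilization(1, C, L))    # 0.05 -- a single thread mostly stalls
print(threads_to_mask(C, L))   # 20 threads fully mask the latency
print(utilization(20, C, L))   # 1.0
```

This is why multithreading suits irregular workloads: it needs no cacheable locality, only enough independent work items in flight.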

Transparent Acceleration for Heterogeneous Platforms with Compilation to OpenCL

Multi-accelerator platforms combine CPUs and accelerators, enabling them to process parallel workloads very efficiently. However, their architectures are diverse, requiring developers to know different tools/compilers, programming languages/models, and low-level details to port their applications to different accelerators. To tackle this challenge, we propose an approach and practical realization that is completely transparent to the user. Our approach automatically detects hotspots in sequential applications, generates parallel OpenCL host/kernel code, and offloads hotspots to different OpenCL-enabled resources (CPU/GPGPU/Phi). We evaluate the performance/energy improvements for a diverse set of benchmark applications and achieve speedups/energy savings of up to two orders of magnitude.

A Self-Aware Resource Management Framework for Heterogeneous Multicore SoCs with Diverse QoS Targets

In heterogeneous MPSoCs, the management of shared memory resources is crucial in delivering end-to-end QoS. Previous frameworks have either focused on singular QoS targets or on the allocation of partitionable resources at relatively slow timescales. However, heterogeneous MPSoCs typically require instant response from the memory system, where most resources cannot be partitioned. Moreover, the health of different cores in an MPSoC is measured by diverse performance objectives. In this work, we propose the Self-Aware Resource Allocation framework. Priority-based adaptation allows cores to pursue different performance targets and self-monitor their own health. In response, the system allocates non-partitionable resources based on priorities.

The Power-Optimised Software Envelope

In this paper, we present the Power-Optimised Software Envelope (POSE) model, which allows developers to assess whether power optimisation is worth pursuing for their applications. We first outline the POSE model using the established Energy-Delay Product (EDP) family of metrics. We then provide formulations of our model using the novel Energy-Delay Sum and Energy-Delay Distance metrics, as we believe these metrics are more appropriate for energy-aware optimisation efforts. Finally, we introduce an extension to POSE, named System Summary POSE. System Summary POSE allows us to reason about system-wide scope for energy-aware optimisation independently of any particular application.
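The EDP family that POSE builds on weights delay against energy as E·D^w: w = 0 optimizes energy alone, w = 1 is classic EDP, and w = 2 is ED²P. The sketch below shows how the preferred configuration flips as w grows; the two configurations are hypothetical, and the paper's Energy-Delay Sum and Energy-Delay Distance metrics are not reproduced here.

```python
# Sketch: the weighted Energy-Delay Product family, E * D^w.
# The two configurations are hypothetical examples, not measurements
# from the paper, and EDS/EDD are deliberately omitted.

def edp(energy_j, delay_s, w=1):
    """Generalized energy-delay product: lower is better."""
    return energy_j * delay_s ** w

# Two hypothetical operating points for the same application:
fast   = {"energy_j": 120.0, "delay_s": 1.0}   # faster, hungrier
frugal = {"energy_j": 80.0,  "delay_s": 1.4}   # slower, thriftier

for w in (0, 1, 2):
    better = min((fast, frugal),
                 key=lambda c: edp(c["energy_j"], c["delay_s"], w))
    print(w, "frugal wins" if better is frugal else "fast wins")
```

Here the frugal point wins for w = 0 and w = 1 but loses at w = 2, illustrating why the choice of metric within the family changes which optimisations are worth pursuing.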

SAQIP: A Scalable Architecture for Quantum Information Processors

Proposing an architecture that efficiently compensates for the inefficiencies of hardware with extra resources is one of the key issues in quantum computer design. Scaling current small-scale lab quantum systems to large-scale systems capable of solving meaningful practical problems is the main goal of much research. In this paper, a scalable architecture for quantum information processors, called SAQIP, is proposed. Moreover, a flow is presented to map a circuit onto this architecture. Experimental results show that the proposed architecture and design flow decrease the average latency of quantum circuits by about 83% for the attempted benchmarks.

Combining Source-Adaptive and Oblivious Routing with Congestion Control in High-Performance Interconnects using Hybrid and Direct Topologies

We propose two solutions that reduce head-of-line (HoL) blocking by combining routing and queuing schemes: the Source-Adaptive Solution for Head-of-Line Blocking Avoidance (SASHA) and the Oblivious Solution for Head-of-Line Blocking Avoidance (OSHA). Both proposals leverage hybrid interconnection network topologies proposed in recent years, using exclusive virtual networks (VNs) to support deadlock-free path diversity, together with a new queuing scheme, called Dynamic Band-Based Queuing (DBBQ), to efficiently reduce congestion problems (i.e., HoL blocking) while requiring few resources, which makes them feasible for current network technologies, as the simulation-based results presented in the paper demonstrate.

Supporting Superpages and Lightweight Page Migration in Hybrid Memory Systems

Superpages have long been used to mitigate address translation overhead. However, superpages often preclude lightweight page migration in hybrid memory systems composed of DRAM and non-volatile memory (NVM). This paper presents Rainbow to bridge this fundamental conflict between superpages and lightweight page migration. Rainbow utilizes split TLBs to support different page sizes, and uses DRAM to cache frequently accessed small pages in each NVM superpage. Through a novel NVM-to-DRAM address remapping mechanism, Rainbow supports lightweight page migration without splintering superpages. Experimental results show that Rainbow can significantly reduce applications' TLB misses and improve application performance (IPC) by up to 2.9x.
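The core idea, caching hot small pages of an NVM superpage in DRAM while the TLB keeps a single superpage mapping, can be sketched as a per-superpage remap table consulted after translation. The sizes, table layout, and addresses below are illustrative assumptions, not Rainbow's actual design.

```python
# Sketch: redirecting hot small pages of an NVM superpage into a DRAM
# cache without splintering the superpage mapping. Sizes, addresses,
# and the table layout are illustrative, not Rainbow's design.

SMALL_PAGE = 4 * 1024           # 4 KiB small page
SUPERPAGE  = 2 * 1024 * 1024    # 2 MiB superpage

# superpage base (NVM) -> {small-page index within superpage: DRAM frame}
remap = {0x40000000: {3: 0x1000, 7: 0x5000}}

def translate(vaddr, superpage_base):
    """The TLB maps vaddr's superpage to superpage_base in NVM as one
    entry; the remap table then redirects cached small pages to DRAM."""
    offset = vaddr % SUPERPAGE
    idx, page_off = divmod(offset, SMALL_PAGE)
    dram_frame = remap.get(superpage_base, {}).get(idx)
    if dram_frame is not None:
        return ("DRAM", dram_frame + page_off)
    return ("NVM", superpage_base + offset)

print(translate(0x3010, 0x40000000))   # small page 3 is cached in DRAM
print(translate(0x9010, 0x40000000))   # small page 9 stays in NVM
```

Migrating a page in this model means adding or removing one remap entry; the superpage TLB entry itself never changes, which is what keeps migration lightweight.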
