
ACM Transactions on Architecture and Code Optimization (TACO)

Latest Articles

SketchDLC: A Sketch on Distributed Deep Learning Communication via Trace Capturing

With the fast development of deep learning (DL), communication is increasingly a bottleneck for distributed workloads, and a series of optimizations has been proposed to scale out successfully. Nevertheless, the network behavior itself has not yet been investigated much. We intend to analyze the network behavior and then carry out some research...

Efficient and Scalable Execution of Fine-Grained Dynamic Linear Pipelines

We present Pipelite, a dynamic scheduler that exploits the properties of dynamic linear pipelines to achieve high performance for fine-grained...

Schedule Synthesis for Halide Pipelines through Reuse Analysis

Efficient code generation for image processing applications continues to pose a challenge in a domain where high performance is often necessary to...

Supporting Superpages and Lightweight Page Migration in Hybrid Memory Systems

Superpages have long been used to mitigate address translation overhead in large-memory systems. However, superpages often preclude lightweight page...

SAQIP: A Scalable Architecture for Quantum Information Processors

Proposing an architecture that efficiently compensates for the inefficiencies of physical hardware with extra resources is one of the key issues in quantum computer design. Although demonstrations of quantum systems have been limited to a few dozen qubits, scaling the current small-sized lab quantum systems to large-scale quantum systems that are...

Accelerating In-Memory Database Selections Using Latency Masking Hardware Threads

Inexpensive DRAMs have created new opportunities for in-memory data analytics. However, the major bottleneck in such systems is high memory access...

Transparent Acceleration for Heterogeneous Platforms With Compilation to OpenCL

Multi-accelerator platforms combine CPUs and different accelerator architectures within a single compute node. Such systems are capable of processing...

HAWS: Accelerating GPU Wavefront Execution through Selective Out-of-order Execution

Graphics Processing Units (GPUs) have become an attractive platform for accelerating challenging applications on a range of platforms, from High Performance Computing (HPC) to full-featured smartphones. They can overcome computational barriers in a wide range of data-parallel kernels. GPUs hide pipeline stalls and memory latency by utilizing...

A Self-aware Resource Management Framework for Heterogeneous Multicore SoCs with Diverse QoS Targets

In modern heterogeneous MPSoCs, the management of shared memory resources is crucial in delivering...

Combining Source-adaptive and Oblivious Routing with Congestion Control in High-performance Interconnects using Hybrid and Direct Topologies

Hybrid and direct topologies are cost-efficient and scalable options to interconnect thousands of end nodes in high-performance computing (HPC)...

NEWS

TACO Goes Gold Open Access

As of July 2018, and for a four-year period, all papers published in ACM Transactions on Architecture and Code Optimization (TACO) will be published as Gold Open Access (OA) and will be free to read and share via the ACM Digital Library. READ MORE

About TACO

The ACM Transactions on Architecture and Code Optimization focuses on hardware, software, and systems research spanning the fields of computer architecture and code optimization. Articles that appear in TACO present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to computer architects, hardware or software developers, system designers and tool builders are emphasized. READ MORE

Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution

Contemporary GPUs support multiprogramming by allowing multiple kernels to run concurrently on the same streaming multiprocessors. Recent studies have demonstrated that such concurrent kernel execution improves both resource utilization and computational throughput. However, significant performance slowdown and unfairness are observed when latency-sensitive kernels co-run with bandwidth-intensive ones. In this paper, we first make the case that such problems cannot be sufficiently solved by managing CTA combinations alone, and we reveal the fundamental reasons. We then propose a coordinated approach to CTA combination and bandwidth partitioning. Our approach significantly improves performance, even compared with the best CTA combination found by exhaustive search.

Comprehensive Characterization of an Open Source Document Search Engine

This work performs a thorough characterization and analysis of the Nutch web search benchmark, which is based on the popular open-source Lucene search library. The paper describes in detail the architecture, functionality, and micro-architectural behaviour of the search engine, and investigates prominent web search research issues. In particular, we study how intra-server index partitioning affects response time and throughput, explore the potential use of low-power servers for web search, and examine the potential causes of performance degradation and variability as well as causes of tail latencies.

Efficient Data Supply for Parallel Heterogeneous Architectures

Decoupling techniques have been proposed to reduce the amount of memory latency exposed to high-performance accelerators as they fetch data. Although decoupled access-execute (DAE) and more recent decoupled data supply approaches offer promising single-threaded performance improvements, little work has considered how to extend them into parallel scenarios. This paper explores the opportunities and challenges of designing parallel, high-performance, resource-efficient decoupled data supply systems. We propose Mercury, a parallel decoupled data supply system that utilizes thread-level parallelism for high-throughput data supply with good portability. Additionally, we introduce micro-architectural improvements for data supply units to efficiently handle long-latency indirect loads.
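The decoupled access-execute idea behind such data supply systems can be gestured at with a small sketch: an "access" thread streams operands into a bounded buffer ahead of an "execute" thread that consumes them, so fetch latency overlaps with computation. This is a generic DAE illustration under assumed names, not Mercury's actual design.

```python
# Minimal sketch of decoupled access-execute (DAE): an access thread
# streams data into a bounded queue while an execute thread consumes it.
# All names and the workload are illustrative, not Mercury itself.
import threading, queue

data = list(range(100))           # stands in for memory the kernel reads
indices = list(range(0, 100, 2))  # hypothetical indirect index stream

q = queue.Queue(maxsize=8)        # bounded buffer that hides fetch latency
DONE = object()

def access_thread():
    # Run ahead of the consumer, performing the (indirect) loads.
    for i in indices:
        q.put(data[i])            # blocks when the buffer is full
    q.put(DONE)

def execute_thread(out):
    # Consume supplied operands without ever issuing a load itself.
    while True:
        v = q.get()
        if v is DONE:
            break
        out.append(v * 2)         # stand-in "compute" on supplied data

out = []
t = threading.Thread(target=access_thread)
t.start()
execute_thread(out)
t.join()
```

The bounded queue is the key design point: its depth plays the role of the decoupling buffer, limiting how far the access stream can run ahead of execution.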

Efficient Checkpointing of Loop-Based Codes for Non-volatile Main Memory

This paper extends a paper published at PACT 2017. In this work, we propose a novel recompute-based failure-safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we log only enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery.
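The recompute-on-failure idea can be sketched in miniature: persist only a loop index occasionally, and after a crash re-execute the idempotent iterations since the last persisted index instead of replaying a log. The persistence primitive and granularity below are illustrative assumptions, not the paper's mechanism.

```python
# Toy recompute-based recovery for a loop over NVMM-resident data:
# instead of logging every store, persist only the loop index now and
# then; after a failure, re-run the iterations since that index.
persisted = {"i": 0}              # stand-in for a small NVMM-resident record

def persist(i):
    persisted["i"] = i            # stand-in for a flush / persist barrier

def run_loop(a, out, start, crash_at=None):
    for i in range(start, len(a)):
        if i == crash_at:
            raise RuntimeError("simulated failure")
        out[i] = a[i] * 2         # idempotent: safe to re-execute
        if i % 4 == 3:            # persist the index only occasionally
            persist(i + 1)

a = list(range(10))
out = [0] * 10
try:
    run_loop(a, out, 0, crash_at=6)   # crash mid-loop
except RuntimeError:
    pass
# Recovery: recompute from the last persisted index (here 4), redoing the
# few un-persisted iterations rather than restoring from a checkpoint.
run_loop(a, out, persisted["i"])
```

Note that the loop body writes to a separate output array, which is what makes the re-executed iterations idempotent; a read-modify-write body would need extra care.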

The Power-Optimised Software Envelope

In this paper, we present the Power Optimised Software Envelope (POSE) model, which allows developers to assess whether power optimisation is worth pursuing for their applications. We first outline the POSE model using the established Energy-Delay Product (EDP) family of metrics. We then provide formulations of our model using the novel Energy-Delay Sum and Energy-Delay Distance metrics, as we believe these metrics are more appropriate for energy-aware optimisation efforts. Finally, we introduce an extension to POSE, named System Summary POSE. System Summary POSE allows us to reason about system-wide scope for energy-aware optimisation independently of any particular application.
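The EDP family that POSE builds on generalizes to E·tⁿ, where a larger exponent n weights delay more heavily relative to energy; a toy comparison of two hypothetical code variants shows how the chosen exponent can flip which variant "wins". (The paper's Energy-Delay Sum and Distance metrics are not reproduced here.)

```python
# The Energy-Delay Product family: E * t^n, with n = 0 pure energy,
# n = 1 the classic EDP, and n = 2 ED^2P. Variant figures are made up
# purely to illustrate how the exponent changes the trade-off.
def edp(energy_j, time_s, n=1):
    """Return E * t^n; larger n penalizes runtime more heavily."""
    return energy_j * time_s ** n

# Two hypothetical variants of the same code:
low_power = {"energy_j": 40.0, "time_s": 2.0}  # slower, lower energy
fast      = {"energy_j": 60.0, "time_s": 1.0}  # faster, higher energy

for n in (0, 1, 2):
    a = edp(**low_power, n=n)
    b = edp(**fast, n=n)
    winner = "low_power" if a < b else "fast"
    print(f"n={n}: low_power={a:.0f}, fast={b:.0f} -> {winner}")
```

Under pure energy (n = 0) the low-power variant wins, but already at n = 1 the faster variant has the better score, which is exactly the kind of sensitivity a model like POSE must account for.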

Caliper: Interference Estimator for Multi-tenant Environments Sharing Architectural Resources

We introduce Caliper, a technique for accurately estimating the performance interference in shared servers, overcoming the limitations of prior approaches with a micro-experiment-based technique. In contrast to state-of-the-art approaches that periodically pause co-running applications to estimate slowdown, Caliper uses a strategic phase-triggered technique to capture interference due to co-location. This allows Caliper to orchestrate an accurate, low-overhead interference estimation technique that can be readily deployed in existing production systems. We evaluate Caliper for a wide spectrum of workload scenarios, demonstrating that it seamlessly supports up to 16 applications running simultaneously and outperforms state-of-the-art approaches.

An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns

In this paper, we propose Elastic-Cache, which supports both fine- and coarse-grained cache-line management to improve the L1 cache efficiency of GPUs. Specifically, it stores 32-byte words from non-contiguous memory regions in a single 128-byte cache line. Furthermore, it neither requires an extra memory structure nor reduces the capacity of the L1 cache for tag storage. To improve the bandwidth utilization of the L1 cache, we further propose Elastic-Plus, which issues 32-byte requests in parallel, reducing instruction processing latency and improving GPU throughput. Our experiments show that Elastic-Cache and Elastic-Plus improve performance by 104% and 131%, respectively.
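As a rough illustration of fine-grained line management, the toy model below tags each 32-byte sector of a 128-byte line separately, so that sectors from non-contiguous blocks can co-reside in one line. This only gestures at the idea; Elastic-Cache's real tag scheme differs and, per the abstract, requires no extra tag storage.

```python
# Toy sector-tagged cache line: four 32-byte sectors, each with its own
# tag, so one 128-byte line can hold words from non-contiguous regions.
# The indexing scheme here is a deliberately simple assumption.
SECTOR = 32
LINE = 128
SLOTS = LINE // SECTOR  # 4 sectors per line

class SectoredLine:
    def __init__(self):
        # one tag per 32-byte sector instead of one tag per 128-byte line
        self.tags = [None] * SLOTS

    def _slot(self, addr):
        return (addr // SECTOR) % SLOTS   # sector slot within the line

    def lookup(self, addr):
        # Hit only if this sector's tag matches the 32-byte block number.
        return self.tags[self._slot(addr)] == addr // SECTOR

    def fill(self, addr):
        self.tags[self._slot(addr)] = addr // SECTOR

line = SectoredLine()
line.fill(0x1000)   # sector from one region
line.fill(0x20A0)   # non-contiguous sector co-resident in the same line
```

With a conventional whole-line tag, the second fill would have evicted the first; per-sector tags let both irregular accesses hit in the same physical line.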

