Architecture and Code Optimization (TACO)


ACM Transactions on Architecture and Code Optimization (TACO), Volume 14 Issue 3, September 2017

Extending Halide to Improve Software Development for Imaging DSPs
Sander Vocke, Henk Corporaal, Roel Jordans, Rosilde Corvino, Rick Nas
Article No.: 21
DOI: 10.1145/3106343

Specialized Digital Signal Processors (DSPs), which can be found in a wide range of modern devices, play an important role in power-efficient, high-performance image processing. Applications including camera sensor post-processing and computer...

Improving Loop Dependence Analysis
Nicklas Bo Jensen, Sven Karlsson
Article No.: 22
DOI: 10.1145/3095754

Programmers can no longer depend on new processors to have significantly improved single-thread performance. Instead, gains have to come from other sources such as the compiler and its optimization passes. Advanced passes make use of information...

Iterative Schedule Optimization for Parallelization in the Polyhedron Model
Stefan Ganser, Armin Grösslinger, Norbert Siegmund, Sven Apel, Christian Lengauer
Article No.: 23
DOI: 10.1145/3109482

The polyhedron model is a powerful model for systematically identifying and applying loop transformations that improve data locality (e.g., via tiling) and enable parallelization. In the polyhedron model, a loop transformation is, essentially,...

HAP: Hybrid-Memory-Aware Partition in Shared Last-Level Cache
Wei Wei, Dejun Jiang, Jin Xiong, Mingyu Chen
Article No.: 24
DOI: 10.1145/3106340

Data-center servers benefit from large-capacity memory systems to run multiple processes simultaneously. Hybrid DRAM-NVM memory is attractive for increasing memory capacity by exploiting the scalability of Non-Volatile Memory (NVM). However,...

Providing Predictable Performance via a Slowdown Estimation Model
Dongliang Xiong, Kai Huang, Xiaowen Jiang, Xiaolang Yan
Article No.: 25
DOI: 10.1145/3124451

Interapplication interference at shared main memory slows down different applications differently. A few slowdown estimation models have been proposed to provide predictable performance by quantifying memory interference, but they have relatively...

Programming Heterogeneous Systems from an Image Processing DSL
Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan-Kelley, Mark Horowitz
Article No.: 26
DOI: 10.1145/3107953

Specialized image processing accelerators are necessary to deliver the performance and energy efficiency required by important applications in computer vision, computational photography, and augmented reality. But creating,...

Efficient Generation of Compact Execution Traces for Multicore Architectural Simulations
Ayman Hroub, M. E. S. Elrabaa, M. F. Mudawar, A. Khayyat
Article No.: 27
DOI: 10.1145/3106342

Requiring no functional simulation, trace-driven simulation has the potential of achieving faster simulation speeds than execution-driven simulation of multicore architectures. An efficient, on-the-fly, high-fidelity trace generation method for...

MATOG: Array Layout Auto-Tuning for CUDA
Nicolas Weber, Michael Goesele
Article No.: 28
DOI: 10.1145/3106341

Optimal code performance is (besides correctness and accuracy) the most important objective in compute-intensive applications. In many of these applications, Graphics Processing Units (GPUs) are used because of their high compute power....

MiCOMP: Mitigating the Compiler Phase-Ordering Problem Using Optimization Sub-Sequences and Machine Learning
Amir H. Ashouri, Andrea Bignoli, Gianluca Palermo, Cristina Silvano, Sameer Kulkarni, John Cavazos
Article No.: 29
DOI: 10.1145/3124452

Recent compilers offer a vast number of multilayered optimizations targeting different code segments of an application. Choosing among these optimizations can significantly impact the performance of the code being optimized. The selection of the...

An Architecture for Integrated Near-Data Processors
Erik Vermij, Leandro Fiorin, Rik Jongerius, Christoph Hagleitner, Jan Van Lunteren, Koen Bertels
Article No.: 30
DOI: 10.1145/3127069

To increase the performance of data-intensive applications, we present an extension to a CPU architecture that enables arbitrary near-data processing capabilities close to the main memory. This is realized by introducing a component attached to...

SWITCHES: A Lightweight Runtime for Dataflow Execution of Tasks on Many-Cores
Andreas Diavastos, Pedro Trancoso
Article No.: 31
DOI: 10.1145/3127068

SWITCHES is a task-based dataflow runtime that implements a lightweight distributed triggering system for runtime dependence resolution and uses static scheduling and compile-time assignment policies to reduce runtime overheads. Unlike other...