Architecture and Code Optimization (TACO)


ACM Transactions on Architecture and Code Optimization (TACO), Volume 13 Issue 4, December 2016

Synergistic Analysis of Evolving Graphs
Keval Vora, Rajiv Gupta, Guoqing Xu
Article No.: 32
DOI: 10.1145/2992784

Evolving graph processing involves repeating analyses, which are often iterative, over multiple snapshots of the graph corresponding to different points in time. Since the snapshots of an evolving graph share a great number of vertices and edges,...

A Cross-Platform SpMV Framework on Many-Core Architectures
Yunquan Zhang, Shigang Li, Shengen Yan, Huiyang Zhou
Article No.: 33
DOI: 10.1145/2994148

Sparse Matrix-Vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory bandwidth...

AIM: Energy-Efficient Aggregation Inside the Memory Hierarchy
Junwhan Ahn, Sungjoo Yoo, Kiyoung Choi
Article No.: 34
DOI: 10.1145/2994149

In this article, we propose Aggregation-in-Memory (AIM), a new processing-in-memory system designed for energy efficiency and near-term adoption. In order to efficiently perform aggregation, we implement simple aggregation operations in main...

UMH: A Hardware-Based Unified Memory Hierarchy for Systems with Multiple Discrete GPUs
Amir Kavyan Ziabari, Yifan Sun, Yenai Ma, Dana Schaa, José L. Abellán, Rafael Ubal, John Kim, Ajay Joshi, David Kaeli
Article No.: 35
DOI: 10.1145/2996190

In this article, we describe how to ease memory management between a Central Processing Unit (CPU) and one or multiple discrete Graphics Processing Units (GPUs) by architecting a novel hardware-based Unified Memory Hierarchy (UMH). Adopting UMH, a...

Hardware-Accelerated Cross-Architecture Full-System Virtualization
Tom Spink, Harry Wagstaff, Björn Franke
Article No.: 36
DOI: 10.1145/2996798

Hardware virtualization solutions provide users with benefits ranging from application isolation through server consolidation to improved disaster recovery and faster server provisioning. While hardware assistance for virtualization is supported...

LDAC: Locality-Aware Data Access Control for Large-Scale Multicore Cache Hierarchies
Qingchuan Shi, George Kurian, Farrukh Hijaz, Srinivas Devadas, Omer Khan
Article No.: 37
DOI: 10.1145/2983632

The trend of increasing the number of cores to achieve higher performance has challenged efficient management of on-chip data. Moreover, many emerging applications process massive amounts of data with varying degrees of locality. Therefore,...

Evaluation of Histogram of Oriented Gradients Soft Errors Criticality for Automotive Applications
Fernando Fernandes, Lucas Weigel, Claudio Jung, Philippe Navaux, Luigi Carro, Paolo Rech
Article No.: 38
DOI: 10.1145/2998573

Pedestrian detection reliability is a key problem for autonomous or aided driving, and methods that use Histogram of Oriented Gradients (HOG) are very popular. Embedded Graphics Processing Units (GPUs) are exploited to run HOG in a very efficient...

Cooperative Caching for GPUs
Saumay Dublish, Vijay Nagarajan, Nigel Topham
Article No.: 39
DOI: 10.1145/3001589

The rise of general-purpose computing on GPUs has influenced architectural innovation on them. The introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however, indicate inefficient cache usage due to...

Accelerating Intercommunication in Highly Parallel Systems
Nikolaos Tampouratzis, Pavlos M. Mattheakis, Ioannis Papaefstathiou
Article No.: 40
DOI: 10.1145/3005717

Every HPC system consists of numerous processing nodes interconnected using a number of different inter-process communication protocols such as Message Passing Interface (MPI) and Global Arrays (GA). Traditionally, research has focused on...

Concurrent JavaScript Parsing for Faster Loading of Web Apps
Hyukwoo Park, Myungsu Cha, Soo-Mook Moon
Article No.: 41
DOI: 10.1145/3004281

JavaScript is a dynamic language mainly used as a client-side web script. Nowadays, the web is evolving into an application platform with its web apps, and JavaScript increasingly undertakes complex computations and interactive user...

Memory Access Scheduling Based on Dynamic Multilevel Priority in Shared DRAM Systems
Dongliang Xiong, Kai Huang, Xiaowen Jiang, Xiaolang Yan
Article No.: 42
DOI: 10.1145/3007647

Interapplication interference at shared main memory severely degrades performance, and increasing DRAM frequency calls for simple memory schedulers. Previous memory schedulers employ a per-application ranking scheme for high system performance or a...

A Reconfiguration Algorithm for Power-Aware Parallel Applications
Daniele De Sensi, Massimo Torquati, Marco Danelutto
Article No.: 43
DOI: 10.1145/3004054

In current computing systems, many applications require guarantees on their maximum power consumption to not exceed the available power budget. On the other hand, for some applications, it could be possible to decrease their performance, yet...

Impact of Intrinsic Profiling Limitations on Effectiveness of Adaptive Optimizations
Michael R. Jantz, Forrest J. Robinson, Prasad A. Kulkarni
Article No.: 44
DOI: 10.1145/3008661

Many performance optimizations rely on or are enhanced by runtime profile information. However, both offline and online profiling techniques suffer from intrinsic and practical limitations that affect the quality of delivered profile data. The...

Extending the WCET Problem to Optimize for Runtime-Reconfigurable Processors
Marvin Damschen, Lars Bauer, Jörg Henkel
Article No.: 45
DOI: 10.1145/3014059

The correctness of a real-time system does not depend on the correctness of its calculations alone but also on the non-functional requirement of adhering to deadlines. Guaranteeing these deadlines by static timing analysis, however, is practically...

MaxPB: Accelerating PCM Write by Maximizing the Power Budget Utilization
Zheng Li, Fang Wang, Dan Feng, Yu Hua, Jingning Liu, Wei Tong
Article No.: 46
DOI: 10.1145/3012007

Phase Change Memory (PCM) is one of the promising memory technologies but suffers from some critical problems such as poor write performance and high write energy consumption. Due to the high write energy consumption and limited power supply, the...

Designing a Tunable Nested Data-Parallel Programming System
Saurav Muralidharan, Michael Garland, Albert Sidelnik, Mary Hall
Article No.: 47
DOI: 10.1145/3012011

This article describes Surge, a nested data-parallel programming system designed to simplify the porting and tuning of parallel applications to multiple target architectures. Surge decouples high-level specification of computations, expressed...

Accuracy Bugs: A New Class of Concurrency Bugs to Exploit Algorithmic Noise Tolerance
Ismail Akturk, Riad Akram, Mohammad Majharul Islam, Abdullah Muzahid, Ulya R. Karpuzcu
Article No.: 48
DOI: 10.1145/3017991

Parallel programming introduces notoriously difficult bugs, usually referred to as concurrency bugs. This article investigates the potential for deviating from the conventional wisdom of writing concurrency bug-free, parallel programs. It...

Selecting Heterogeneous Cores for Diversity
Erik Tomusk, Christophe Dubach, Michael O'Boyle
Article No.: 49
DOI: 10.1145/3014165

Mobile devices with heterogeneous processors are becoming mainstream. With a heterogeneous processor, the runtime scheduler can pick the best CPU core for a given task based on program characteristics, performance requirements, and power...

Some Mathematical Facts About Optimal Cache Replacement
Pierre Michaud
Article No.: 50
DOI: 10.1145/3017992

This article exposes and proves some mathematical facts about optimal cache replacement that were previously unknown or not proved rigorously. An explicit formula is obtained, giving OPT hits and misses as a function of past references. Several...

Static and Dynamic Frequency Scaling on Multicore CPUs
Wenlei Bao, Changwan Hong, Sudheer Chunduri, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, P. Sadayappan
Article No.: 51
DOI: 10.1145/3011017

Dynamic Voltage and Frequency Scaling (DVFS) typically adapts CPU power consumption by modifying a processor’s operating frequency (and the associated voltage). Typical DVFS approaches include using default strategies such as running at the...

Pot: Deterministic Transactional Execution
Tiago M. Vale, João A. Silva, Ricardo J. Dias, João M. Lourenço
Article No.: 52
DOI: 10.1145/3017993

This article presents Pot, a system that leverages the concept of preordered transactions to achieve deterministic multithreaded execution of programs that use Transactional Memory. Preordered transactions eliminate the root cause of...

Aggregate Flow-Based Performance Fairness in CMPs
Zhonghai Lu, Yuan Yao
Article No.: 53
DOI: 10.1145/3014429

In CMPs, multiple co-executing applications create mutual interference when sharing the underlying network-on-chip architecture. Such interference causes different performance slowdowns to different applications. To mitigate the unfairness...

Energy-Proportional Photonic Interconnects
Yigit Demir, Nikos Hardavellas
Article No.: 54
DOI: 10.1145/3018110

Photonic interconnects have emerged as the prime candidate technology for efficient networks on chip at future process nodes. However, the high optical loss of many nanophotonic components coupled with the low efficiency of current laser sources...

User-Assisted Store Recycling for Dynamic Task Graph Schedulers
Mehmet Can Kurt, Sriram Krishnamoorthy, Gagan Agrawal, Bin Ren
Article No.: 55
DOI: 10.1145/3018111

The emergence of the multi-core era has led to increased interest in designing effective yet practical parallel programming models. Models based on task graphs that operate on single-assignment data are attractive in several ways. Notably,...

Fine-Grain Power Breakdown of Modern Out-of-Order Cores and Its Implications on Skylake-Based Systems
Jawad Haj-Yihia, Ahmad Yasin, Yosi Ben Asher, Avi Mendelson
Article No.: 56
DOI: 10.1145/3018112

A detailed analysis of power consumption at low system levels becomes important as a means for reducing the overall power consumption of a system and its thermal hot spots. This work presents a new power estimation method that allows understanding...

A Software Cache Partitioning System for Hash-Based Caches
Alberto Scolari, Davide Basilio Bartolini, Marco Domenico Santambrogio
Article No.: 57
DOI: 10.1145/3018113

Contention on the shared Last-Level Cache (LLC) can have a fundamental negative impact on the performance of applications executed on modern multicores. An interesting software approach to address LLC contention issues is based on page...