Architecture and Code Optimization (TACO)


ACM Transactions on Architecture and Code Optimization (TACO), Volume 12 Issue 4, January 2016

Reuse Distance-Based Probabilistic Cache Replacement
Subhasis Das, Tor M. Aamodt, William J. Dally
Article No.: 33
DOI: 10.1145/2818374

This article proposes Probabilistic Replacement Policy (PRP), a novel replacement policy that evicts the line with minimum estimated hit probability under optimal replacement instead of the line with maximum expected reuse distance. The...

MINIME-GPU: Multicore Benchmark Synthesizer for GPUs
Etem Deniz, Alper Sen
Article No.: 34
DOI: 10.1145/2818693

We introduce MINIME-GPU, a novel automated benchmark synthesis framework for graphics processing units (GPUs) that serves to speed up architectural simulation of modern GPU architectures. Our framework captures important characteristics of...

Scalable Energy Efficiency with Resilience for High Performance Computing Systems: A Quantitative Methodology
Li Tan, Zizhong Chen, Shuaiwen Leon Song
Article No.: 35
DOI: 10.1145/2822893

Ever-growing performance of supercomputers nowadays brings demanding requirements of energy efficiency and resilience, due to rapidly expanding size and duration in use of the large-scale computing systems. Many application/architecture-dependent...

Tumbler: An Effective Load-Balancing Technique for Multi-CPU Multicore Systems
Kishore Kumar Pusukuri, Rajiv Gupta, Laxmi N. Bhuyan
Article No.: 36
DOI: 10.1145/2827698

Schedulers used by modern OSs (e.g., Oracle Solaris 11™ and GNU/Linux) balance load by balancing the number of threads in run queues of different cores. While this approach is effective for a single CPU multicore system, we show that it can...

Four Metrics to Evaluate Heterogeneous Multicores
Erik Tomusk, Christophe Dubach, Michael O’Boyle
Article No.: 37
DOI: 10.1145/2829950

Semiconductor device scaling has made single-ISA heterogeneous processors a reality. Heterogeneous processors contain a number of different CPU cores that all implement the same Instruction Set Architecture (ISA). This enables greater flexibility...

SPCM: The Striped Phase Change Memory
Morteza Hoseinzadeh, Mohammad Arjomand, Hamid Sarbazi-Azad
Article No.: 38
DOI: 10.1145/2829951

Phase Change Memory (PCM) devices are one of the known promising technologies to take the place of DRAM devices with the aim of overcoming the obstacles of reducing feature size and stopping ever-growing amounts of leakage power. In exchange for...

Two-Level Hybrid Sampled Simulation of Multithreaded Applications
Chuntao Jiang, Zhibin Yu, Lieven Eeckhout, Hai Jin, Xiaofei Liao, Chengzhong Xu
Article No.: 39
DOI: 10.1145/2818353

Sampled microarchitectural simulation of single-threaded applications has been a mature technology for over a decade now. Sampling multithreaded applications, on the other hand, is much more complicated. Not until very recently have researchers proposed...

Integrated Mapping and Synthesis Techniques for Network-on-Chip Topologies with Express Channels
Sandeep D'Souza, Soumya J, Santanu Chattopadhyay
Article No.: 40
DOI: 10.1145/2831233

The addition of express channels to a traditional mesh network-on-chip (NoC) has emerged as a viable solution to solve the problem of high latency. In this article, we address the problem of integrated mapping and synthesis for express...

PARSECSs: Evaluating the Impact of Task Parallelism in the PARSEC Benchmark Suite
Dimitrios Chasapis, Marc Casas, Miquel Moretó, Raul Vidal, Eduard Ayguadé, Jesús Labarta, Mateo Valero
Article No.: 41
DOI: 10.1145/2829952

In this work, we show how parallel applications can be implemented efficiently using task parallelism. We also evaluate the benefits of such a parallel paradigm with respect to other approaches. We use the PARSEC benchmark suite as our test bed,...

A Framework for Application-Guided Task Management on Heterogeneous Embedded Systems
Francisco Gaspar, Luis Taniça, Pedro Tomás, Aleksandar Ilic, Leonel Sousa
Article No.: 42
DOI: 10.1145/2835177

In this article, we propose a general framework for fine-grain application-aware task management in heterogeneous embedded platforms, which allows integration of different mechanisms for an efficient resource utilization, frequency scaling, and...

Managing Mismatches in Voltage Stacking with CoreUnfolding
Ehsan K. Ardestani, Rafael Trapani Possignolo, Jose Luis Briz, Jose Renau
Article No.: 43
DOI: 10.1145/2835178

Five percent to 25% of power could be wasted before it is delivered to the computational resources on a die, due to inefficiencies of voltage regulators and resistive loss. The power delivery could benefit if, at the same power, the...

FaultSim: A Fast, Configurable Memory-Reliability Simulator for Conventional and 3D-Stacked Systems
Prashant J. Nair, David A. Roberts, Moinuddin K. Qureshi
Article No.: 44
DOI: 10.1145/2831234

As memory systems scale, maintaining their Reliability, Availability, and Serviceability (RAS) is becoming more complex. To make matters worse, recent studies of DRAM failures in data centers and supercomputer environments have highlighted that...

Adaptive Correction of Sampling Bias in Dynamic Call Graphs
Byeongcheol Lee
Article No.: 45
DOI: 10.1145/2840806

This article introduces a practical low-overhead adaptive technique of correcting sampling bias in profiling dynamic call graphs. Timer-based sampling keeps the overhead low but sampling bias lowers the accuracy when either observable call events...

Fence Placement for Legacy Data-Race-Free Programs via Synchronization Read Detection
Andrew J. McPherson, Vijay Nagarajan, Susmit Sarkar, Marcelo Cintra
Article No.: 46
DOI: 10.1145/2835179

Shared-memory programmers traditionally assumed Sequential Consistency (SC), but modern systems have relaxed memory consistency. Here, the trend in languages is toward Data-Race-Free (DRF) models, where, assuming annotated synchronizations and the...

Optimizing Control Transfer and Memory Virtualization in Full System Emulators
Ding-Yong Hong, Chun-Chen Hsu, Cheng-Yi Chou, Wei-Chung Hsu, Pangfeng Liu, Jan-Jan Wu
Article No.: 47
DOI: 10.1145/2837027

Full system emulators provide virtual platforms for several important applications, such as kernel and system software development, co-verification with cycle accurate CPU simulators, or application development for hardware still in development....

The Polyhedral Model of Nonlinear Loops
Aravind Sukumaran-Rajam, Philippe Clauss
Article No.: 48
DOI: 10.1145/2838734

Runtime code optimization and speculative execution are becoming increasingly prominent to leverage performance in the current multi- and many-core era. However, a wider and more efficient use of such techniques is mainly hampered by the...

Citadel: Efficiently Protecting Stacked Memory from TSV and Large Granularity Failures
Prashant J. Nair, David A. Roberts, Moinuddin K. Qureshi
Article No.: 49
DOI: 10.1145/2840807

Stacked memory modules are likely to be tightly integrated with the processor. It is vital that these memory modules operate reliably, as memory failure can require the replacement of the entire socket. To make matters worse, stacked memory...

Automatic Vectorization of Interleaved Data Revisited
Andrew Anderson, Avinash Malik, David Gregg
Article No.: 50
DOI: 10.1145/2838735

Automatically exploiting short vector instruction sets (SSE, AVX, NEON) is a critically important task for optimizing compilers. Vector instructions typically work best on data that is contiguous in memory, and operating on non-contiguous data...

A Filtering Mechanism to Reduce Network Bandwidth Utilization of Transaction Execution
Lihang Zhao, Lizhong Chen, Woojin Choi, Jeffrey Draper
Article No.: 51
DOI: 10.1145/2837028

Hardware Transactional Memory (HTM) relies heavily on the on-chip network for intertransaction communication. However, the network bandwidth utilization of transactions has been largely neglected in HTM designs. In this work, we propose a cost...

Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study
Olivier Serres, Abdullah Kayi, Ahmad Anbar, Tarek El-Ghazawi
Article No.: 52
DOI: 10.1145/2842686

Due to its rich memory model, the partitioned global address space (PGAS) parallel programming model strikes a balance between locality-awareness and the ease of use of the global address space model. Although locality-awareness can lead to high...