Architecture and Code Optimization (TACO)




ACM Transactions on Architecture and Code Optimization (TACO), Volume 11 Issue 1, February 2014

Shared-port register file architecture for low-energy VLIW processors
Neeraj Goel, Anshul Kumar, Preeti Ranjan Panda
Article No.: 1
DOI: 10.1145/2533397

We propose a reduced-port Register File (RF) architecture for reducing RF energy in a VLIW processor. With port reduction, RF ports need to be shared among Function Units (FUs), which may lead to access conflicts, and thus, reduced performance....

Integrating profile-driven parallelism detection and machine-learning-based mapping
Zheng Wang, Georgios Tournavitis, Björn Franke, Michael F. P. O'Boyle
Article No.: 2
DOI: 10.1145/2579561

Compiler-based auto-parallelization is a much-studied area but has yet to find widespread application. This is largely due to the poor identification and exploitation of application parallelism, resulting in disappointing performance far below...

Leveraging GPUs using cooperative loop speculation
Mehrzad Samadi, Amir Hormati, Janghaeng Lee, Scott Mahlke
Article No.: 3
DOI: 10.1145/2579617

Graphics processing units, or GPUs, provide TFLOPs of additional performance potential in commodity computer systems that frequently go unused by most applications. Even with the emergence of languages such as CUDA and OpenCL, programming GPUs...

Endurance-aware cache line management for non-volatile caches
Jue Wang, Xiangyu Dong, Yuan Xie, Norman P. Jouppi
Article No.: 4
DOI: 10.1145/2579671

Nonvolatile memories (NVMs) have the potential to replace low-level SRAM or eDRAM on-chip caches because NVMs save standby power and provide large cache capacity. However, limited write endurance is a common problem for NVM technologies, and...

BPM/BPM+: Software-based dynamic memory partitioning mechanisms for mitigating DRAM bank-/channel-level interferences in multicore systems
Lei Liu, Zehan Cui, Yong Li, Yungang Bao, Mingyu Chen, Chengyong Wu
Article No.: 5
DOI: 10.1145/2579672

The main memory system is a shared resource in modern multicore machines that can result in serious interference leading to reduced throughput and unfairness. Many new memory scheduling mechanisms have been proposed to address the interference...

Trace transitioning and exception handling in a trace-based JIT compiler for Java
Christian Häubl, Christian Wimmer, Hanspeter Mössenböck
Article No.: 6
DOI: 10.1145/2579673

Trace-based Just-In-Time (JIT) compilation generates machine code for frequently executed paths (so-called traces) instead of whole methods. While this has several advantages, it complicates invocation of compiled traces as well as exception...

HMTT: A hybrid hardware/software tracing system for bridging the DRAM access trace's semantic gap
Yongbing Huang, Licheng Chen, Zehan Cui, Yuan Ruan, Yungang Bao, Mingyu Chen, Ninghui Sun
Article No.: 7
DOI: 10.1145/2579668

DRAM access traces (i.e., off-chip memory references) can be extremely valuable for the design of memory subsystems and performance tuning of software. Hardware snooping on the off-chip memory interface is an effective and nonintrusive approach to...

Adaptive workload-aware task scheduling for single-ISA asymmetric multicore architectures
Quan Chen, Minyi Guo
Article No.: 8
DOI: 10.1145/2579674

Single-ISA Asymmetric Multicore (AMC) architectures have shown high performance as well as power efficiency. However, current parallel programming environments do not perform well on AMC because they are designed for symmetric multicore...

Efficient hosted interpreters on the JVM
Gülfem Savrun-Yeniçeri, Wei Zhang, Huahan Zhang, Eric Seckler, Chen Li, Stefan Brunthaler, Per Larsen, Michael Franz
Article No.: 9
DOI: 10.1145/2532642

Many guest languages are implemented using the Java Virtual Machine (JVM) as a host environment. There are two major implementation choices: custom compilers and so-called hosted interpreters. Custom compilers are complex to build but offer good...

Refresh pausing in DRAM memory systems
Prashant J. Nair, Chia-Chen Chou, Moinuddin K. Qureshi
Article No.: 10
DOI: 10.1145/2579669

Dynamic Random Access Memory (DRAM) cells rely on periodic refresh operations to maintain data integrity. As the capacity of DRAM memories has increased, so has the amount of time consumed in doing refresh. Refresh operations contend with read...

Tuning the continual flow pipeline architecture with virtual register renaming
Komal Jothi, Haitham Akkary
Article No.: 11
DOI: 10.1145/2579675

Continual Flow Pipelines (CFPs) allow a processor core to process hundreds of in-flight instructions without increasing cycle-critical pipeline resources. When a load misses the data cache, CFP checkpoints the processor register state and then...

Predicate-aware, makespan-preserving software pipelining of scheduling tables
Thomas Carle, Dumitru Potop-Butucaru
Article No.: 12
DOI: 10.1145/2579676

We propose a software pipelining technique adapted to specific hard real-time scheduling problems. Our technique optimizes both computation throughput and execution cycle makespan, with makespan taking priority. It also takes advantage of the...

A scalable and near-optimal representation of access schemes for memory management
Angeliki Kritikakou, Francky Catthoor, Vasilios Kelefouras, Costas Goutis
Article No.: 13
DOI: 10.1145/2579677

Memory management searches for the resources required to store the concurrently alive elements. The solution quality is affected by the representation of the element accesses: a sub-optimal representation leads to overestimation and a non-scalable...

Automatic feature generation for machine learning–based optimising compilation
Hugh Leather, Edwin Bonilla, Michael O'Boyle
Article No.: 14
DOI: 10.1145/2536688

Recent work has shown that machine learning can automate and in some cases outperform handcrafted compiler optimisations. Central to such an approach is that machine learning techniques typically rely upon summaries or features of the program. The...