Exploring single and multilevel JIT compilation policy for modern machines
Michael R. Jantz, Prasad A. Kulkarni
Article No.: 22
Dynamic or Just-in-Time (JIT) compilation is essential to achieve high-performance emulation for programs written in managed languages, such as Java and C#. It has been observed that a conservative JIT compilation policy is most...
A circuit-architecture co-optimization framework for exploring nonvolatile memory hierarchies
Xiangyu Dong, Norman P. Jouppi, Yuan Xie
Article No.: 23
Many new memory technologies are available for building future energy-efficient memory hierarchies. It is necessary to have a framework that can quickly find the optimal memory technology at each hierarchy level. In this work, we first build a...
Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface
Jishen Zhao, Guangyu Sun, Gabriel H. Loh, Yuan Xie
Article No.: 24
The performance of graphics processing unit (GPU) systems is improving rapidly to accommodate the increasing demands of graphics and high-performance computing applications. With such a performance improvement, however, power consumption of GPU...
An efficient multicharacter transition string-matching engine based on the Aho-Corasick algorithm
Chien-Chi Chen, Sheng-De Wang
Article No.: 25
A string-matching engine capable of inspecting multiple characters in parallel can multiply the throughput. However, the space required for implementing a matching engine that can process multiple characters in parallel generally grows...
The design and implementation of heterogeneous multicore systems for energy-efficient speculative thread execution
Yangchun Luo, Wei-Chung Hsu, Antonia Zhai
Article No.: 26
With the emergence of multicore processors, various aggressive execution models have been proposed to exploit fine-grained thread-level parallelism, taking advantage of the fast on-chip interconnection communication. However, the aggressive nature...
Virtually split cache: An efficient mechanism to distribute instructions and data
Dyer Rolán, Basilio B. Fraguela, Ramón Doallo
Article No.: 27
First-level caches are usually split for both instructions and data instead of unifying them in a single cache. Although that approach eases the pipeline design and provides a simple way to independently treat data and instructions, its global hit...
Using in-flight chains to build a scalable cache coherence protocol
Samantika Subramaniam, Simon C. Steely, Will Hasenplaugh, Aamer Jaleel, Carl Beckmann, Tryggve Fossum, Joel Emer
Article No.: 28
As microprocessor designs integrate more cores, scalability of cache coherence protocols becomes a challenging problem. Most directory-based protocols avoid races by using blocking tag directories that can impact the performance of parallel...
The traditional performance cost benefits we have enjoyed for decades from technology scaling are challenged by several critical constraints including reliability. Increases in static and dynamic variations are leading to higher probability of...
Automatic parallelization of fine-grained metafunctions on a chip multiprocessor
Sanghoon Lee, James Tuck
Article No.: 30
Due to the importance of reliability and security, prior studies have proposed inlining metafunctions into applications for detecting bugs and security vulnerabilities. However, because these software techniques add frequent, fine-grained...
Dynamic microarchitectural adaptation using machine learning
Christophe Dubach, Timothy M. Jones, Edwin V. Bonilla
Article No.: 31
Adaptive microarchitectures are a promising solution for designing high-performance, power-efficient microprocessors. They offer the ability to tailor computational resources to the specific requirements of different programs or program phases....
E3CC: A memory error protection scheme with novel address mapping for subranked and low-power memories
Long Chen, Yanan Cao, Zhao Zhang
Article No.: 32
This study presents and evaluates E3CC (Enhanced Embedded ECC), a full design and implementation of a generic embedded ECC scheme that enables power-efficient error protection for subranked memory systems. It incorporates a novel...
Temporal-based multilevel correlating inclusive cache replacement
Yingying Tian, Samira M. Khan, Daniel A. Jiménez
Article No.: 33
Inclusive caches have been widely used in Chip Multiprocessors (CMPs) to simplify cache coherence. However, they have poor performance compared with noninclusive caches not only because of the limited capacity of the entire cache hierarchy but...
Hardware support for accurate per-task energy metering in multicore systems
Qixiao Liu, Miquel Moreto, Victor Jimenez, Jaume Abella, Francisco J. Cazorla, Mateo Valero
Article No.: 34
Accurately determining the energy consumed by each task in a system will become of prominent importance in future multicore-based systems because it offers several benefits, including (i) better application energy/performance optimizations, (ii)...
Loop tiling is a widely used loop transformation to enhance data locality and allow data reuse. In the tiled code, however, tiles of different sizes can lead to significant variation in performance. Thus, selection of an optimal tile size is...
Fast pattern-specific routing for fat tree networks
Bogdan Prisacari, German Rodriguez, Cyriel Minkenberg, Torsten Hoefler
Article No.: 36
In the context of eXtended Generalized Fat Tree (XGFT) topologies, widely used in HPC and datacenter network designs, we propose a generic method, based on Integer Linear Programming (ILP), to efficiently determine optimal routes for arbitrary...
Selecting representative benchmark inputs for exploring microprocessor design spaces
Maximilien B. Breughe, Lieven Eeckhout
Article No.: 37
The design process of a microprocessor requires representative workloads to steer the search process toward an optimum design point for the target application domain. However, considering a broad set of workloads to cover the large space of...
Web applications are vulnerable to cross-site scripting attacks that enable data thefts. Information flow tracking in web browsers can prevent communication of sensitive data to unintended recipients and thereby stop such data thefts....
Time- and space-efficient flow-sensitive points-to analysis
Article No.: 39
Compilation of real-world programs often requires hours. The term nightly build known to industrial researchers is an artifact of long compilation times. Our goal is to reduce the absolute analysis times for large C codes (of the order of...
Boosting timestamp-based transactional memory by exploiting hardware cycle counters
Wenjia Ruan, Yujie Liu, Michael Spear
Article No.: 40
Time-based transactional memories typically rely on a shared memory counter to ensure consistency. Unfortunately, such a counter can become a bottleneck. In this article, we identify properties of hardware cycle counters that allow their use in...
ReSense: Mapping dynamic workloads of colocated multithreaded applications using resource sensitivity
Tanima Dey, Wei Wang, Jack W. Davidson, Mary Lou Soffa
Article No.: 41
To utilize the full potential of modern chip multiprocessors and obtain scalable performance improvements, it is critical to mitigate resource contention created by multithreaded workloads. In this article, we describe ReSense, the first runtime...
Techniques to improve performance in requester-wins hardware transactional memory
Adrià Armejach, Ruben Titos-Gil, Anurag Negi, Osman S. Unsal, Adrián Cristal
Article No.: 42
The simplicity of requester-wins Hardware Transactional Memory (HTM) makes it easy to incorporate in existing chip multiprocessors. Hence, such systems are expected to be widely available in the near future. Unfortunately, these implementations...
Reducing DRAM row activations with eager read/write clustering
Myeongjae Jeon, Conglong Li, Alan L. Cox, Scott Rixner
Article No.: 43
This article describes and evaluates a new approach to optimizing DRAM performance and energy consumption that is based on eagerly writing dirty cache lines to DRAM. Under this approach, many dirty cache lines are written to DRAM before they are...
HPar: A practical parallel parser for HTML--taming HTML complexities for parallel parsing
Zhijia Zhao, Michael Bebenita, Dave Herman, Jianhua Sun, Xipeng Shen
Article No.: 44
Parallelizing HTML parsing is challenging due to the complexities of HTML documents and the inherent dependencies in its parsing algorithm. As a result, despite numerous studies in parallel parsing, HTML parsing remains sequential today. It forms...
Easy, fast, and energy-efficient object detection on heterogeneous on-chip architectures
Ehsan Totoni, Mert Dikmen, María Jesús Garzarán
Article No.: 45
We optimize a visual object detection application (that uses Vision Video Library kernels) and show that OpenCL is a unified programming paradigm that can provide high performance when running on the Ivy Bridge heterogeneous on-chip architecture....
Decreasing the traffic from the CPU LLC to main memory is a very important issue in modern systems. Recent work focuses on cache misses, overlooking the impact of writebacks on the total memory traffic, energy consumption, IPC, and so forth....
Accelerating an application domain with specialized functional units
Cecilia González-Álvarez, Jennifer B. Sartor, Carlos Álvarez, Daniel Jiménez-González, Lieven Eeckhout
Article No.: 47
Hardware specialization has received renewed interest recently as chips are hitting power limits. Chip designers of traditional processor architectures have primarily focused on general-purpose computing, partially due to time-to-market pressure...
Revisiting memory management on virtualized environments
Xiaolin Wang, Lingmei Weng, Zhenlin Wang, Yingwei Luo
Article No.: 48
With the evolution of hardware, 64-bit Central Processing Units (CPUs) and 64-bit Operating Systems (OSs) have dominated the market. This article investigates the performance of virtual memory management of Virtual Machines (VMs) with a large...
PCantorSim: Accelerating parallel architecture simulation through fractal-based sampling
Chuntao Jiang, Zhibin Yu, Hai Jin, Chengzhong Xu, Lieven Eeckhout, Wim Heirman, Trevor E. Carlson, Xiaofei Liao
Article No.: 49
Computer architects rely heavily on microarchitecture simulation to evaluate design alternatives. Unfortunately, cycle-accurate simulation is extremely slow, being at least 4 to 6 orders of magnitude slower than real hardware. This longstanding...
Profile-guided transaction coalescing—lowering transactional overheads by merging transactions
Srđan Stipić, Vesna Smiljković, Osman Unsal, Adrián Cristal, Mateo Valero
Article No.: 50
Previous studies in software transactional memory mostly focused on reducing the overhead of transactional read and write operations. In this article, we introduce transaction coalescing, a profile-guided compiler optimization technique...
WADE: Writeback-aware dynamic cache management for NVM-based main memory system
Zhe Wang, Shuchang Shan, Ting Cao, Junli Gu, Yi Xu, Shuai Mu, Yuan Xie, Daniel A. Jiménez
Article No.: 51
Emerging Non-Volatile Memory (NVM) technologies are explored as potential alternatives to traditional SRAM/DRAM-based memory architecture in future microprocessor design. One of the major disadvantages for NVM is the latency and energy overhead...
Spin-Transfer Torque RAM (STT-RAM), a promising alternative to SRAM for reducing leakage power consumption, has been widely studied to mitigate the impact of its asymmetrically long write latency. Recently, STT-RAM has been proposed for L1 caches...
Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential
Naznin Fauzia, Venmugil Elango, Mahesh Ravishankar, J. Ramanujam, Fabrice Rastello, Atanas Rountev, Louis-Noël Pouchet, P. Sadayappan
Article No.: 53
Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper, while the...
Designing a practical data filter cache to improve both energy efficiency and performance
Alen Bardizbanyan, Magnus Själander, David Whalley, Per Larsson-Edefors
Article No.: 54
Conventional Data Filter Cache (DFC) designs improve processor energy efficiency, but degrade performance. Furthermore, the single-cycle line transfer suggested in prior studies adversely affects Level-1 Data Cache (L1 DC) area and energy...
GPU code generation for ODE-based applications with phased shared-data access patterns
Andrei Hagiescu, Bing Liu, R. Ramanathan, Sucheendra K. Palaniappan, Zheng Cui, Bipasa Chattopadhyay, P. S. Thiagarajan, Weng-Fai Wong
Article No.: 55
We present a novel code generation scheme for GPUs. Its key feature is the platform-aware generation of a heterogeneous pool of threads. This exposes more data-sharing opportunities among the concurrent threads and reduces the memory requirements...
TornadoNoC: A lightweight and scalable on-chip network architecture for the many-core era
Junghee Lee, Chrysostomos Nicopoulos, Hyung Gyu Lee, Jongman Kim
Article No.: 56
The rapid emergence of Chip Multi-Processors (CMP) as the de facto microprocessor archetype has highlighted the importance of scalable and efficient on-chip networks. Packet-based Networks-on-Chip (NoC) are gradually cementing themselves as the...
A system architecture, processor, and communication protocol for secure implants
Christos Strydis, Robert M. Seepers, Pedro Peris-Lopez, Dimitrios Siskos, Ioannis Sourdis
Article No.: 57
Secure and energy-efficient communication between Implantable Medical Devices (IMDs) and authorized external users is attracting increasing attention these days. However, there currently exists no systematic approach to the problem, while...
Fast modulo scheduler utilizing patternized routes for coarse-grained reconfigurable architectures
Wonsub Kim, Yoonseo Choi, Haewoo Park
Article No.: 58
Coarse-Grained Reconfigurable Architectures (CGRAs) present a potential of high compute throughput with energy efficiency. A CGRA consists of an array of Functional Units (FUs), which communicate with each other through an interconnect network...
JIT technology with C/C++: Feedback-directed dynamic recompilation for statically compiled languages
Dorit Nuzman, Revital Eres, Sergei Dyshel, Marcel Zalmanovici, Jose Castanos
Article No.: 59
The growing gap between the advanced capabilities of static compilers as reflected in benchmarking results and the actual performance that users experience in real-life scenarios makes client-side dynamic optimization technologies imperative to...
Automatic data allocation and buffer management for multi-GPU machines
Thejas Ramashekar, Uday Bondhugula
Article No.: 60
Multi-GPU machines are being increasingly used in high-performance computing. Each GPU in such a machine has its own memory and does not share the address space either with the host CPU or other GPUs. Hence, applications utilizing multiple GPUs...
Analysis of dependence tracking algorithms for task dataflow execution
Hans Vandierendonck, George Tzenakis, Dimitrios S. Nikolopoulos
Article No.: 61
Processor architectures have taken a turn toward many-core processors, which integrate multiple processing cores on a single chip to increase overall performance, and there are no signs that this trend will stop in the near future. Many-core...
Evaluator-executor transformation for efficient pipelining of loops with conditionals
Yeonghun Jeong, Seongseok Seo, Jongeun Lee
Article No.: 62
Control divergence poses many problems in parallelizing loops. While predicated execution is commonly used to convert control dependence into data dependence, it often incurs high overhead because it allocates resources equally for both branches...
A decoupled non-SSA global register allocation using bipartite liveness graphs
Rajkishore Barik, Jisheng Zhao, Vivek Sarkar
Article No.: 63
Register allocation is an essential optimization for all compilers. A number of sophisticated register allocation algorithms have been developed over the years. The two fundamental classes of register allocation algorithms used in modern compilers...
Reducing instruction fetch energy in multi-issue processors
Peter Gavin, David Whalley, Magnus Själander
Article No.: 64
The need to minimize power while maximizing performance has led to recent developments of powerful superscalar designs targeted at embedded and portable use. Instruction fetch is responsible for a significant fraction of microprocessor power and...
List of distinguished reviewers ACM TACO
Article No.: 65