Bones: An Automatic Skeleton-Based C-to-CUDA Compiler for GPUs
Cedric Nugteren, Henk Corporaal
Article No.: 35
The shift toward parallel processor architectures has made programming and code generation increasingly challenging. To address this programmability challenge, this article presents a technique to fully automatically generate efficient and...
Building and Optimizing MRAM-Based Commodity Memories
Jue Wang, Xiangyu Dong, Yuan Xie
Article No.: 36
Emerging non-volatile memory technologies such as MRAM are promising design solutions for energy-efficient memory architecture, especially for mobile systems. However, building commodity MRAM by reusing DRAM designs is not straightforward. The...
Revisiting the Complexity of Hardware Cache Coherence and Some Implications
Rakesh Komuravelli, Sarita V. Adve, Ching-Tsun Chou
Article No.: 37
Cache coherence is an integral part of shared-memory systems but is also widely considered to be one of the most complex parts of such systems. Much prior work has addressed this complexity and the verification techniques to prove the correctness...
Volatile STT-RAM Scratchpad Design and Data Allocation for Low Energy
Gabriel Rodríguez, Juan Touriño, Mahmut T. Kandemir
Article No.: 38
On-chip power consumption is one of the fundamental challenges of current technology scaling. Cache memories consume a sizable part of this power, particularly due to leakage energy. STT-RAM is one of several new memory technologies that have been...
Topological Characterization of Hamming and Dragonfly Networks and Its Implications on Routing
Cristóbal Camarero, Enrique Vallejo, Ramón Beivide
Article No.: 39
Current High-Performance Computing (HPC) and data center networks rely on large-radix routers. Hamming graphs (Cartesian products of complete graphs) and dragonflies (two-level direct networks with nodes organized in groups) are some direct...
Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories
Hanbin Yoon, Justin Meza, Naveen Muralimanohar, Norman P. Jouppi, Onur Mutlu
Article No.: 40
New phase-change memory (PCM) devices have low access latencies (like DRAM) and high capacities (i.e., low cost per bit, like Flash). In addition to being able to scale to smaller cell sizes than DRAM, a PCM cell can also store multiple bits per...
ARM ISA-based processors are no longer low-cost, low-power processors. Nowadays, ARM ISA-based processor manufacturers are striving to implement medium-end to high-end processor cores, which implies implementing a state-of-the-art out-of-order...
Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems
Zheng Wang, Dominik Grewe, Michael F. P. O’Boyle
Article No.: 42
General-purpose GPU-based systems are highly attractive, as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This article presents a compiler-based approach to...
Improving Hybrid FTL by Fully Exploiting Internal SSD Parallelism with Virtual Blocks
Dan He, Fang Wang, Hong Jiang, Dan Feng, Jingning Liu, Wei Tong, Zheng Zhang
Article No.: 43
Compared with either block or page-mapping Flash Translation Layer (FTL), hybrid-mapping FTL for flash Solid State Disks (SSDs), such as Fully Associative Section Translation (FAST), has relatively high space efficiency because of its smaller...
MAPS: Optimizing Massively Parallel Applications Using Device-Level Memory Abstraction
Eri Rubin, Ely Levy, Amnon Barak, Tal Ben-Nun
Article No.: 44
GPUs play an increasingly important role in high-performance computing. While developing naive code is straightforward, optimizing massively parallel applications requires deep understanding of the underlying architecture. The developer must...
Improving Multibank Memory Access Parallelism with Lattice-Based Partitioning
Alessandro Cilardo, Luca Gallo
Article No.: 45
Emerging architectures, such as reconfigurable hardware platforms, provide the unprecedented opportunity of customizing the memory infrastructure based on application access patterns. This work addresses the problem of automated memory...
Jan Kasper Martinsen, Håkan Grahn, Anders Isberg
Article No.: 46
Recent developments in register allocation, mostly linked to static single assignment (SSA) form, have shown the benefits of decoupling the problem in two phases: a first spilling phase places load and store instructions so that the...
Modern superscalar CPUs contain large complex structures and diverse execution units, and exhibit a wide dynamic power range. Building a power delivery network for the worst-case power consumption is not energy efficient and is often impossible to fit...
Efficient Data Encoding for Convolutional Neural Network Application
Hong-Phuc Trinh, Marc Duranton, Michel Paindavoine
Article No.: 49
This article presents an approximate data encoding scheme called Significant Position Encoding (SPE). The encoding allows efficient implementation of the recall phase (forward propagation pass) of Convolutional Neural Networks (CNN)—a...
Mechanistic Analytical Modeling of Superscalar In-Order Processor Performance
Maximilien B. Breughe, Stijn Eyerman, Lieven Eeckhout
Article No.: 50
Superscalar in-order processors form an interesting alternative to out-of-order processors because of their energy efficiency and lower design complexity. However, despite the reduced design complexity, it is nontrivial to get performance...
Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks
Vivek Seshadri, Samihan Yedkar, Hongyi Xin, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry
Article No.: 51
Many modern high-performance processors prefetch blocks into the on-chip cache. Prefetched blocks can potentially pollute the cache by evicting more useful blocks. In this work, we observe that both accurate and inaccurate prefetches lead to cache...
The exponential performance growth of sequential processors has come to an end, and thus, parallel processing is probably the only way to achieve further performance growth. We propose the development of parallel architectures based on data-driven scheduling....
GP-SIMD, a novel hybrid general-purpose SIMD computer architecture, resolves the issue of data synchronization by in-memory computing through combining data storage and massively parallel processing. GP-SIMD employs a two-dimensional access memory...
The Impact of the SIMD Width on Control-Flow and Memory Divergence
Thomas Schaub, Simon Moll, Ralf Karrenberg, Sebastian Hack
Article No.: 54
Power consumption is a prevalent issue in current and future computing systems. SIMD processors amortize the power consumption of managing the instruction stream by executing the same instruction in parallel on multiple data. Therefore, in the...
Measuring Microarchitectural Details of Multi- and Many-Core Memory Systems through Microbenchmarking
Zhenman Fang, Sanyam Mehta, Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, Binyu Zang
Article No.: 55
As multicore and many-core architectures evolve, their memory systems are becoming increasingly more complex. To bridge the latency and bandwidth gap between the processor and memory, they often use a mix of multilevel private/shared caches that...
Low-Power High-Efficiency Video Decoding Using General-Purpose Processors
Chi Ching Chi, Mauricio Alvarez-Mesa, Ben Juurlink
Article No.: 56
In this article, we investigate how code optimization techniques and low-power states of general-purpose processors improve the power efficiency of HEVC decoding. The power and performance efficiency of the use of SIMD instructions, multicore...
Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly
Fabio Luporini, Ana Lucia Varbanescu, Florian Rathgeber, Gheorghe-Teodor Bercea, J. Ramanujam, David A. Ham, Paul H. J. Kelly
Article No.: 57
We study and systematically evaluate a class of composable code transformations that improve arithmetic intensity in local assembly operations, which represent a significant fraction of the execution time in finite element methods. Their...
Optimal Parallelogram Selection for Hierarchical Tiling
Xing Zhou, María J. Garzarán, David A. Padua
Article No.: 58
Loop tiling is an effective optimization to improve performance of multiply nested loops, which are the most time-consuming parts in many programs. Most massively parallel systems today are organized hierarchically, and different levels of the...
Making the Most of SMT in HPC: System- and Application-Level Perspectives
Leo Porter, Michael A. Laurenzano, Ananta Tiwari, Adam Jundt, William A. Ward, Jr., Roy Campbell, Laura Carrington
Article No.: 59
This work presents an end-to-end methodology for quantifying the performance and power benefits of simultaneous multithreading (SMT) for HPC centers and applies this methodology to a production system and workload. Ultimately, SMT’s value...
Optimizing Memory Translation Emulation in Full System Emulators
Xin Tong, Toshihiko Koju, Motohiro Kawahito, Andreas Moshovos
Article No.: 60
The emulation speed of a full system emulator (FSE) determines its usefulness. This work quantitatively measures where time is spent in QEMU [Bellard 2005], an industrial-strength FSE. The analysis finds that memory emulation is one of the most...
Compiler/Runtime Framework for Dynamic Dataflow Parallelization of Tiled Programs
Martin Kong, Antoniu Pop, Louis-Noël Pouchet, R. Govindarajan, Albert Cohen, P. Sadayappan
Article No.: 61
Task-parallel languages are increasingly popular. Many of them provide expressive mechanisms for intertask synchronization. For example, OpenMP 4.0 will integrate data-driven execution semantics derived from the StarSs research language. Compared...
Fast Crown Scheduling Heuristics for Energy-Efficient Mapping and Scaling of Moldable Streaming Tasks on Manycore Systems
Nicolas Melot, Christoph Kessler, Jörg Keller, Patrick Eitschberger
Article No.: 62
Effectively exploiting massively parallel architectures is a major challenge that stream programming can help address. We investigate the problem of generating energy-optimal code for a collection of streaming tasks that include parallelizable...
Language-level transactions are said to provide “atomicity,” implying that the order of operations within a transaction should be invisible to concurrent transactions and thus that independent operations within a transaction should be...
Using Template Matching to Infer Parallel Design Patterns
Zia Ul Huda, Ali Jannesari, Felix Wolf
Article No.: 64
The triumphant spread of multicore processors over the past decade increases the pressure on software developers to exploit the growing amount of parallelism available in the hardware. However, writing parallel programs is generally challenging....
Efficient Correction of Anomalies in Snapshot Isolation Transactions
Heiner Litz, Ricardo J. Dias, David R. Cheriton
Article No.: 65
Transactional memory systems providing snapshot isolation enable concurrent access to shared data without incurring aborts on read-write conflicts. Reducing aborts is extremely relevant as it leads to higher concurrency, greater performance, and...
Perfect Reconstructability of Control Flow from Demand Dependence Graphs
Helge Bahmann, Nico Reissmann, Magnus Jahre, Jan Christian Meyer
Article No.: 66
Demand-based dependence graphs (DDGs), such as the (Regionalized) Value State Dependence Graph ((R)VSDG), are intermediate representations (IRs) well suited for a wide range of program transformations. They explicitly model the flow of data and...
On Using the Roofline Model with Lower Bounds on Data Movement
Venmugil Elango, Naser Sedaghati, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, Radu Teodorescu, P. Sadayappan
Article No.: 67
The roofline model is a popular approach for “bound and bottleneck” performance analysis. It focuses on the limits to the performance of processors because of limited bandwidth to off-chip memory. It models upper bounds on performance...
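For context, the performance upper bound that the roofline model computes is commonly written as follows (this is the standard formulation of the model, not a formula taken from the article itself):

```latex
P_{\text{attainable}} \;=\; \min\!\left(P_{\text{peak}},\; B_{\text{mem}} \times I\right)
```

where \(P_{\text{peak}}\) is the processor's peak compute throughput (e.g., GFLOP/s), \(B_{\text{mem}}\) is the off-chip memory bandwidth (bytes/s), and \(I\) is the arithmetic intensity of the code (operations per byte moved); the article's contribution concerns lower bounds on that data movement, which in turn bound \(I\).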
List of Distinguished Reviewers ACM TACO 2014
Article No.: 68