Evolving graph processing involves repeating analyses, which are often iterative, over multiple snapshots of the graph corresponding to different points in time. Since the snapshots of an evolving graph share a great number of vertices and edges,...
A Cross-Platform SpMV Framework on Many-Core Architectures
Yunquan Zhang, Shigang Li, Shengen Yan, Huiyang Zhou
Article No.: 33
Sparse Matrix-Vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory bandwidth...
AIM: Energy-Efficient Aggregation Inside the Memory Hierarchy
Junwhan Ahn, Sungjoo Yoo, Kiyoung Choi
Article No.: 34
In this article, we propose Aggregation-in-Memory (AIM), a new processing-in-memory system designed for energy efficiency and near-term adoption. In order to efficiently perform aggregation, we implement simple aggregation operations in main...
UMH: A Hardware-Based Unified Memory Hierarchy for Systems with Multiple Discrete GPUs
Amir Kavyan Ziabari, Yifan Sun, Yenai Ma, Dana Schaa, José L. Abellán, Rafael Ubal, John Kim, Ajay Joshi, David Kaeli
Article No.: 35
In this article, we describe how to ease memory management between a Central Processing Unit (CPU) and one or multiple discrete Graphics Processing Units (GPUs) by architecting a novel hardware-based Unified Memory Hierarchy (UMH). Adopting UMH, a...
Hardware-Accelerated Cross-Architecture Full-System Virtualization
Tom Spink, Harry Wagstaff, Björn Franke
Article No.: 36
Hardware virtualization solutions provide users with benefits ranging from application isolation through server consolidation to improved disaster recovery and faster server provisioning. While hardware assistance for virtualization is supported...
The trend of increasing the number of cores to achieve higher performance has challenged efficient management of on-chip data. Moreover, many emerging applications process massive amounts of data with varying degrees of locality. Therefore,...
Evaluation of Histogram of Oriented Gradients Soft Errors Criticality for Automotive Applications
Fernando Fernandes, Lucas Weigel, Claudio Jung, Philippe Navaux, Luigi Carro, Paolo Rech
Article No.: 38
Pedestrian detection reliability is a key problem for autonomous or aided driving, and methods that use Histogram of Oriented Gradients (HOG) are very popular. Embedded Graphics Processing Units (GPUs) are exploited to run HOG in a very efficient...
The rise of general-purpose computing on GPUs has influenced architectural innovation on them. The introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however, indicate inefficient cache usage due to...
Accelerating Intercommunication in Highly Parallel Systems
Nikolaos Tampouratzis, Pavlos M. Mattheakis, Ioannis Papaefstathiou
Article No.: 40
Every HPC system consists of numerous processing nodes interconnected using a number of different inter-process communication protocols such as the Message Passing Interface (MPI) and Global Arrays (GA). Traditionally, research has focused on...
Hyukwoo Park, Myungsu Cha, Soo-Mook Moon
Article No.: 41
Memory Access Scheduling Based on Dynamic Multilevel Priority in Shared DRAM Systems
Dongliang Xiong, Kai Huang, Xiaowen Jiang, Xiaolang Yan
Article No.: 42
Interapplication interference at shared main memory severely degrades performance, and increasing DRAM frequency calls for simple memory schedulers. Previous memory schedulers employ a per-application ranking scheme for high system performance or a...
A Reconfiguration Algorithm for Power-Aware Parallel Applications
Daniele De Sensi, Massimo Torquati, Marco Danelutto
Article No.: 43
In current computing systems, many applications require guarantees on their maximum power consumption to not exceed the available power budget. On the other hand, for some applications, it could be possible to decrease their performance, yet...
Impact of Intrinsic Profiling Limitations on Effectiveness of Adaptive Optimizations
Michael R. Jantz, Forrest J. Robinson, Prasad A. Kulkarni
Article No.: 44
Many performance optimizations rely on or are enhanced by runtime profile information. However, both offline and online profiling techniques suffer from intrinsic and practical limitations that affect the quality of delivered profile data. The...
Extending the WCET Problem to Optimize for Runtime-Reconfigurable Processors
Marvin Damschen, Lars Bauer, Jörg Henkel
Article No.: 45
The correctness of a real-time system does not depend on the correctness of its calculations alone but also on the non-functional requirement of adhering to deadlines. Guaranteeing these deadlines by static timing analysis, however, is practically...
Phase Change Memory (PCM) is one of the promising memory technologies but suffers from some critical problems such as poor write performance and high write energy consumption. Due to the high write energy consumption and limited power supply, the...
Designing a Tunable Nested Data-Parallel Programming System
Saurav Muralidharan, Michael Garland, Albert Sidelnik, Mary Hall
Article No.: 47
This article describes Surge, a nested data-parallel programming system designed to simplify the porting and tuning of parallel applications to multiple target architectures. Surge decouples high-level specification of computations, expressed...
Accuracy Bugs: A New Class of Concurrency Bugs to Exploit Algorithmic Noise Tolerance
Ismail Akturk, Riad Akram, Mohammad Majharul Islam, Abdullah Muzahid, Ulya R. Karpuzcu
Article No.: 48
Parallel programming introduces notoriously difficult bugs, usually referred to as concurrency bugs. This article investigates the potential for deviating from the conventional wisdom of writing concurrency bug-free, parallel programs. It...
Mobile devices with heterogeneous processors are becoming mainstream. With a heterogeneous processor, the runtime scheduler can pick the best CPU core for a given task based on program characteristics, performance requirements, and power...
Some Mathematical Facts About Optimal Cache Replacement
Article No.: 50
This article exposes and proves some mathematical facts about optimal cache replacement that were previously unknown or not proved rigorously. An explicit formula is obtained, giving OPT hits and misses as a function of past references. Several...
Static and Dynamic Frequency Scaling on Multicore CPUs
Wenlei Bao, Changwan Hong, Sudheer Chunduri, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, P. Sadayappan
Article No.: 51
Dynamic Voltage and Frequency Scaling (DVFS) typically adapts CPU power consumption by modifying a processor’s operating frequency (and the associated voltage). Typical DVFS approaches include using default strategies such as running at the...
This article presents Pot, a system that leverages the concept of preordered transactions to achieve deterministic multithreaded execution of programs that use Transactional Memory. Preordered transactions eliminate the root cause of...
In CMPs, multiple co-executing applications create mutual interference when sharing the underlying network-on-chip architecture. Such interference causes different performance slowdowns to different applications. To mitigate the unfairness...
Photonic interconnects have emerged as the prime candidate technology for efficient networks on chip at future process nodes. However, the high optical loss of many nanophotonic components coupled with the low efficiency of current laser sources...
User-Assisted Store Recycling for Dynamic Task Graph Schedulers
Mehmet Can Kurt, Sriram Krishnamoorthy, Gagan Agrawal, Bin Ren
Article No.: 55
The emergence of the multi-core era has led to increased interest in designing effective yet practical parallel programming models. Models based on task graphs that operate on single-assignment data are attractive in several ways. Notably,...
Fine-Grain Power Breakdown of Modern Out-of-Order Cores and Its Implications on Skylake-Based Systems
Jawad Haj-Yihia, Ahmad Yasin, Yosi Ben Asher, Avi Mendelson
Article No.: 56
A detailed analysis of power consumption at low system levels becomes important as a means for reducing the overall power consumption of a system and its thermal hot spots. This work presents a new power estimation method that allows understanding...
A Software Cache Partitioning System for Hash-Based Caches
Alberto Scolari, Davide Basilio Bartolini, Marco Domenico Santambrogio
Article No.: 57
Contention on the shared Last-Level Cache (LLC) can have a fundamental negative impact on the performance of applications executed on modern multicores. An interesting software approach to address LLC contention issues is based on page...