Cooperative Multi-Agent Reinforcement Learning-Based Co-optimization of Cores, Caches, and On-chip Network
Rahul Jain, Preeti Ranjan Panda, Sreenivas Subramoney
Article No.: 32
Modern multi-core systems provide huge computational capabilities, which can be used to run multiple processes concurrently. To achieve the best possible performance within limited power budgets, the various system resources need to be allocated...
Bringing Parallel Patterns Out of the Corner: The P3 ARSEC Benchmark Suite
Daniele De Sensi, Tiziano De Matteis, Massimo Torquati, Gabriele Mencagli, Marco Danelutto
Article No.: 33
High-level parallel programming is an active research topic aimed at promoting parallel programming methodologies that provide the programmer with high-level abstractions to develop complex parallel software with reduced time to solution....
One problem on multicore systems is cache sharing, where the cache occupancy of a program depends on the cache usage of peer programs. An exclusive cache hierarchy, as used on AMD processors, is an effective solution to allow processor cores to have a...
Energy-Efficient Compilation of Irregular Task-Parallel Loops
Rahul Shrivastava, V. Krishna Nandivada
Article No.: 35
Energy-efficient compilation is an important problem for multi-core systems. In this context, irregular programs with task-parallel loops present interesting challenges: the threads with smaller workloads (non-critical threads)...
Compiler-Assisted Loop Hardening Against Fault Attacks
Julien Proy, Karine Heydemann, Alexandre Berzati, Albert Cohen
Article No.: 36
Secure elements widely used in smartphones, digital consumer electronics, and payment systems are subject to fault attacks. To thwart such attacks, software protections are manually inserted requiring experts and time. The explosion of the...
A Transactional Correctness Tool for Abstract Data Types
Christina Peterson, Damian Dechev
Article No.: 37
Transactional memory simplifies multiprocessor programming by providing the guarantee that a sequential block of code in the form of a transaction will exhibit atomicity and isolation. Transactional data structures offer the same guarantee to...
Power Consumption Models for Multi-Tenant Server Infrastructures
Matteo Ferroni, Andrea Corna, Andrea Damiani, Rolando Brondolin, Juan A. Colmenares, Steven Hofmeyr, John D. Kubiatowicz, Marco D. Santambrogio
Article No.: 38
Multi-tenant virtualized infrastructures allow cloud providers to minimize costs through workload consolidation. One of the largest costs is power consumption, which is challenging to understand in heterogeneous environments. We propose a power...
CG-OoO: Energy-Efficient Coarse-Grain Out-of-Order Execution Near In-Order Energy with Near Out-of-Order Performance
Milad Mohammadi, Tor M. Aamodt, William J. Dally
Article No.: 39
We introduce the Coarse-Grain Out-of-Order (CG-OoO) general-purpose processor designed to achieve close to In-Order (InO) processor energy while maintaining Out-of-Order (OoO) performance. CG-OoO is an energy-performance-proportional architecture....
ECS: Error-Correcting Strings for Lifetime Improvements in Nonvolatile Memories
Shivam Swami, Poovaiah M. Palangappa, Kartik Mohanram
Article No.: 40
Emerging nonvolatile memories (NVMs) suffer from low write endurance, resulting in early cell failures (hard errors), which reduce memory lifetime. It was recognized early on that conventional error-correcting codes (ECCs), which are designed for...
SLOOP: QoS-Supervised Loop Execution to Reduce Energy on Heterogeneous Architectures
M. Waqar Azhar, Per Stenström, Vassilis Papaefstathiou
Article No.: 41
Most systems allocate computational resources to each executing task without any actual knowledge of the application’s Quality-of-Service (QoS) requirements. Such best-effort policies lead to overprovisioning of the resources and increase...
Compression techniques at the last-level cache and the DRAM play an important role in improving system performance by increasing their effective capacities. A compressed block in DRAM also reduces the transfer time over the memory bus to the...
Fuse: Accurate Multiplexing of Hardware Performance Counters Across Executions
Richard Neill, Andi Drebes, Antoniu Pop
Article No.: 43
Collecting hardware event counts is essential to understanding program execution behavior. Contemporary systems offer few Performance Monitoring Counters (PMCs), thus only a small fraction of hardware events can be monitored simultaneously. We...
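The multiplexing idea behind Fuse can be illustrated with a toy simulation (not the paper's actual mechanism): event groups are rotated across the limited physical counters in round-robin fashion, and each event's total is extrapolated from the fraction of intervals during which it was actually monitored. All names below (`multiplex`, `event_trace`, `slots`) are illustrative.

```python
import itertools

def multiplex(event_trace, events, slots):
    """Time-division multiplexing of hardware events onto `slots` physical
    counters. Each interval monitors one group of events; per-event totals
    are extrapolated by the fraction of intervals the event was sampled."""
    # Partition the requested events into groups that fit the counters.
    groups = [events[i:i + slots] for i in range(0, len(events), slots)]
    observed = {e: 0 for e in events}   # counts seen while monitored
    sampled = {e: 0 for e in events}    # intervals the event was monitored
    schedule = itertools.cycle(groups)  # round-robin group rotation
    for interval in event_trace:        # interval: {event: true count}
        active = next(schedule)
        for e in active:
            observed[e] += interval.get(e, 0)
            sampled[e] += 1
    n = len(event_trace)
    # Scale each partial sum up to an estimate over all intervals.
    return {e: observed[e] * n / sampled[e] if sampled[e] else 0.0
            for e in events}
```

With steady event rates the extrapolation is exact; bursty behavior is what motivates the more accurate reconstruction the paper pursues.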
Could Compression Be of General Use? Evaluating Memory Compression across Domains
Somayeh Sardashti, David A. Wood
Article No.: 44
Recent proposals present compression as a cost-effective technique to increase cache and memory capacity and bandwidth. While these proposals show the potential of compression, several open questions remain about adopting them in real systems...
Improving the Efficiency of GPGPU Work-Queue Through Data Awareness
Libo Huang, Yashuai Lü, Li Shen, Zhiying Wang
Article No.: 45
The architecture and programming model of current GPGPUs are best suited for applications that are dominated by structured control and data flows across large regular datasets. Parallel workloads with irregular control and data structures cannot...
A Framework for Automated and Controlled Floating-Point Accuracy Reduction in Graphics Applications on GPUs
Alexandra Angerd, Erik Sintorn, Per Stenström
Article No.: 46
Reducing the precision of floating-point values can improve performance and/or reduce energy expenditure in computer graphics, among other applications. However, reducing the precision level of floating-point values in a controlled fashion needs...
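A minimal sketch of controlled precision reduction (not the paper's framework): truncating the low-order mantissa bits of an IEEE 754 binary32 value, which keeps the value representable while bounding the introduced error. The function name and parameter are illustrative.

```python
import struct

def reduce_precision(value: float, mantissa_bits: int) -> float:
    """Keep only the top `mantissa_bits` of a binary32 mantissa
    (23 bits available), zeroing the discarded low-order bits."""
    assert 0 <= mantissa_bits <= 23
    # Reinterpret the float's bit pattern as a 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    # Mask off the (23 - mantissa_bits) least significant mantissa bits.
    mask = ~((1 << (23 - mantissa_bits)) - 1) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits & mask))[0]
```

For example, 1.75 truncated to a single mantissa bit becomes 1.5; the relative error is bounded by 2^-(mantissa_bits), which is the kind of knob a controlled accuracy-reduction scheme can expose.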
Generating Fine-Grain Multithreaded Applications Using a Multigrain Approach
Jaime Arteaga, Stéphane Zuckerman, Guang R. Gao
Article No.: 47
The recent evolution in hardware landscape, aimed at producing high-performance computing systems capable of reaching extreme-scale performance, has reignited the interest in fine-grain multithreading, particularly at the intranode level. Indeed,...
CAIRO: A Compiler-Assisted Technique for Enabling Instruction-Level Offloading of Processing-In-Memory
Ramyad Hadidi, Lifeng Nai, Hyojong Kim, Hyesoon Kim
Article No.: 48
Three-dimensional (3D)-stacking technology and the memory-wall problem have popularized processing-in-memory (PIM) concepts again, which offers the benefits of bandwidth and energy savings by offloading computations to functional units inside the...
Triple Engine Processor (TEP): A Heterogeneous Near-Memory Processor for Diverse Kernel Operations
Hongyeol Lim, Giho Park
Article No.: 49
The advent of 3D memory stacking technology, which integrates a logic layer and stacked memories, is expected to be one of the most promising memory technologies to mitigate the memory wall problem by leveraging the concept of near-memory...
ReDirect: Reconfigurable Directories for Multicore Architectures
George Patsilaras, James Tuck
Article No.: 50
As we enter the dark silicon era, architects should not envision designs in which every transistor remains turned on permanently but rather ones in which portions of the chip are judiciously turned on/off depending on the characteristics of a...
HAShCache: Heterogeneity-Aware Shared DRAMCache for Integrated Heterogeneous Systems
Adarsh Patil, Ramaswamy Govindarajan
Article No.: 51
Integrated Heterogeneous System (IHS) processors pack throughput-oriented General-Purpose Graphics Processing Units (GPGPUs) alongside latency-oriented Central Processing Units (CPUs) on the same die sharing certain resources, e.g., shared...