Introduction to the special issue on high-performance and embedded architectures and compilers
Per Stenström, Koen De Bosschere
Article No.: 18
ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache
Jorge Albericio, Rubén Gran, Pablo Ibáñez, Víctor Viñals, Jose María Llabería
Article No.: 19
Hardware data prefetch is a very well known technique for hiding memory latencies. However, in a multicore system fitted with a shared Last-Level Cache (LLC), prefetch induced by a core consumes common resources such as shared cache space and main...
An architecture-independent instruction shuffler to protect against side-channel attacks
Ali Galip Bayrak, Nikola Velickovic, Paolo Ienne, Wayne Burleson
Article No.: 20
Embedded cryptographic systems, such as smart cards, require secure implementations that are robust to a variety of low-level attacks. Side-Channel Attacks (SCA) exploit information such as power consumption, electromagnetic radiation...
Approximate graph clustering for program characterization
John Demme, Simha Sethumadhavan
Article No.: 21
An important aspect of system optimization research is the discovery of program traits or behaviors. In this paper, we present an automated method of program characterization which is able to examine and cluster program graphs, i.e., dynamic data...
Bahurupi: A polymorphic heterogeneous multi-core architecture
Mihai Pricopi, Tulika Mitra
Article No.: 22
Computing systems have made an irreversible transition towards parallel architectures with the emergence of multi-cores. Moreover, power and thermal limits in embedded systems mandate the deployment of many simpler cores rather than a few complex...
Compiler mitigations for time attacks on modern x86 processors
Jeroen V. Cleemput, Bart Coppens, Bjorn De Sutter
Article No.: 23
This paper studies and evaluates the extent to which automated compiler techniques can defend against timing-based side channel attacks on modern x86 processors. We study how modern x86 processors can leak timing information through side channels...
Compiler techniques to improve dynamic branch prediction for indirect jump and call instructions
Jason McCandless, David Gregg
Article No.: 24
Indirect jump instructions are used to implement multiway branch statements and virtual function calls in object-oriented languages. Branch behavior can have significant impact on program performance, but fortunately hardware predictors can...
DAPSCO: Distance-aware partially shared cache organization
Antonio García-Guirado, Ricardo Fernández-Pascual, Alberto Ros, José M. García
Article No.: 25
Many-core tiled CMP proposals often assume a partially shared last level cache (LLC) since this provides a good compromise between access latency and cache utilization. In this paper, we propose a novel way to map memory addresses to LLC banks...
With the advent of multicore systems, the gap between processor speed and memory latency has grown worse because of their complex interconnect. Sophisticated techniques are needed more than ever to improve an application's spatial and temporal...
Efficient liveness computation using merge sets and DJ-graphs
Dibyendu Das, B. Dupont De Dinechin, Ramakrishna Upadrasta
Article No.: 27
In this work we devise an efficient algorithm that computes the liveness information of program variables. The algorithm employs SSA form and DJ-graphs as representation to build Merge sets. The Merge set of node n,...
Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era
George Patsilaras, Niket K. Choudhary, James Tuck
Article No.: 28
Extracting high memory-level parallelism (MLP) is essential for speeding up single-threaded applications which are memory bound. At the same time, the projected amount of dark silicon (the fraction of the chip powered off) on a chip is growing....
Exploring the limits of GPGPU scheduling in control flow bound applications
Roman Malits, Evgeny Bolotin, Avinoam Kolodny, Avi Mendelson
Article No.: 29
GPGPUs are optimized for graphics; for that reason, the hardware is optimized for massively data-parallel applications characterized by predictable memory access patterns and little control flow. For such applications, e.g., matrix multiplication,...
FlexSig: Implementing flexible hardware signatures
Lois Orosa, Elisardo Antelo, Javier D. Bruguera
Article No.: 30
With the advent of chip multiprocessors, new techniques have been developed to make parallel programming easier and more reliable. New parallel programming paradigms and new methods of making the execution of programs more efficient and more...
Hardware transactional memory with software-defined conflicts
Ruben Titos-Gil, Manuel E. Acacio, Jose M. Garcia, Tim Harris, Adrian Cristal, Osman Unsal, Ibrahim Hur, Mateo Valero
Article No.: 31
In this paper we investigate the benefits of turning the concept of transactional conflict from its traditionally fixed definition into a variable one that can be dynamically controlled in software. We propose the extension of the atomic...
Improving performance of nested loops on reconfigurable array processors
Yongjoo Kim, Jongeun Lee, Toan X. Mai, Yunheung Paek
Article No.: 32
Pipelining algorithms are typically concerned with improving only the steady-state performance, or the kernel time. The pipeline setup time happens only once and therefore can be negligible compared to the kernel time. However, for Coarse-Grained...
Soft and highly-customized processors are emerging as a common way to efficiently control the large amount of computing resources available on FPGAs. Yet, some processor architectures of choice for DSP and media applications, such as wide-issue VLIW...
On the evaluation of the impact of shared resources in multithreaded COTS processors in time-critical environments
Petar Radojković, Sylvain Girbal, Arnaud Grasset, Eduardo Quiñones, Sami Yehia, Francisco J. Cazorla
Article No.: 34
Commercial Off-The-Shelf (COTS) processors are now commonly used in real-time embedded systems. The characteristics of these processors fulfill system requirements in terms of time-to-market, low cost, and high performance-per-watt ratio. However,...
Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks
Leonid Domnitser, Aamer Jaleel, Jason Loew, Nael Abu-Ghazaleh, Dmitry Ponomarev
Article No.: 35
We propose a flexibly-partitioned cache design that either drastically weakens or completely eliminates cache-based side channel attacks. The proposed Non-Monopolizable (NoMo) cache dynamically reserves cache lines for active threads and prevents...
On the simulation of large-scale architectures using multiple application abstraction levels
Alejandro Rico, Felipe Cabarcas, Carlos Villavieja, Milan Pavlovic, Augusto Vega, Yoav Etsion, Alex Ramirez, Mateo Valero
Article No.: 36
Simulation is a key tool for computer architecture research. In particular, cycle-accurate simulators are extremely important for microarchitecture exploration and detailed design decisions, but they are slow and, so, not suitable for simulating...
Optimizing explicit data transfers for data parallel applications on the cell architecture
Selma Saidi, Pranav Tendulkar, Thierry Lepley, Oded Maler
Article No.: 37
In this paper we investigate a general approach to automate some deployment decisions for a certain class of applications on multi-core computers. We consider data-parallelizable programs that use the well-known double buffering technique to bring...
PLDS: Partitioning linked data structures for parallelism
Min Feng, Changhui Lin, Rajiv Gupta
Article No.: 38
Recently, parallelization of computations in the presence of dynamic data structures has shown promising potential. In this paper, we present PLDS, a system for easily expressing and efficiently exploiting parallelism in computations that are...
Many automatic software parallelization systems have been proposed in the past decades, but most of them are dedicated to source-to-source transformations. This paper shows that parallelizing executable programs is feasible, even if they require...
Virtualization is gaining popularity in cloud computing and has become the key enabling technology in cloud infrastructure. By replicating the virtual server state to multiple independent platforms, virtualization improves the reliability and...
Sabrewing: A lightweight architecture for combined floating-point and integer arithmetic
Tom M. Bruintjes, Karel H. G. Walters, Sabih H. Gerez, Bert Molenkamp, Gerard J. M. Smit
Article No.: 41
In spite of the fact that floating-point arithmetic is costly in terms of silicon area, the joint design of hardware for floating-point and integer arithmetic is seldom considered. While components like multipliers and adders can potentially be...
Seamlessly portable applications: Managing the diversity of modern heterogeneous systems
Mario Kicherer, Fabian Nowak, Rainer Buchty, Wolfgang Karl
Article No.: 42
Nowadays, many possible configurations of heterogeneous systems exist, posing several new challenges to application development: different types of processing units usually require individual programming models with dedicated runtime systems and...
SYRANT: SYmmetric resource allocation on not-taken and taken paths
Nathanael Premillieu, Andre Seznec
Article No.: 43
In the multicore era, achieving ultimate single-process performance is still an issue, e.g., for single-process workloads or for sequential sections in parallel applications. Unfortunately, despite tremendous research effort on branch prediction,...
This paper addresses the problem of partitioning a cache between multiple concurrent threads and in the presence of hardware prefetching. Cache replacement designed to preserve temporal locality (e.g., LRU) will allocate cache resources...
The migration prefetcher: Anticipating data promotion in dynamic NUCA caches
Javier Lira, Timothy M. Jones, Carlos Molina, Antonio González
Article No.: 45
The exponential increase in multicore processor (CMP) cache sizes, accompanied by growing on-chip wire delays, makes it difficult to implement traditional caches with a single, uniform access latency. Non-Uniform Cache Architecture (NUCA) designs...
Thread Tranquilizer: Dynamically reducing performance variation
Kishore Kumar Pusukuri, Rajiv Gupta, Laxmi N. Bhuyan
Article No.: 46
To realize the performance potential of multicore systems, we must effectively manage the interactions between memory reference behavior and the operating system policies for thread scheduling and migration decisions. We observe that these...
TL-plane-based multi-core energy-efficient real-time scheduling algorithm for sporadic tasks
Dongsong Zhang, Deke Guo, Fangyuan Chen, Fei Wu, Tong Wu, Ting Cao, Shiyao Jin
Article No.: 47
As the energy consumption of multi-core systems becomes increasingly prominent, it's a challenge to design an energy-efficient real-time scheduling algorithm in multi-core systems for reducing the system energy consumption while guaranteeing the...
The accelerator store: A shared memory framework for accelerator-based systems
Michael J. Lyons, Mark Hempstead, Gu-Yeon Wei, David Brooks
Article No.: 48
This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip's high performance and power-efficient hardware accelerators. In preparation for systems...
Advanced many-core CPU chips already have a few hundred processing cores (e.g., 160 cores in an IBM Cyclops-64 chip), and more and more processing cores become available as computer architecture progresses. The underlying runtime systems of...
Using machine learning to improve automatic vectorization
Kevin Stock, Louis-Noël Pouchet, P. Sadayappan
Article No.: 50
Automatic vectorization is critical to enhancing performance of compute-intensive programs on modern processors. However, there is much room for improvement over the auto-vectorization capabilities of current production compilers through careful...
Utilizing RF-I and intelligent scheduling for better throughput/watt in a mobile GPU memory system
Kanit Therdsteerasukdi, Gyungsu Byun, Jason Cong, M. Frank Chang, Glenn Reinman
Article No.: 51
Smartphones and tablets are becoming more and more powerful, replacing desktops and laptops as the users' main computing system. As these systems support higher and higher resolutions with more complex 3D graphics, a high-throughput and low-power...
VSim: Simulating multi-server setups at near native hardware speed
Frederick Ryckbosch, Stijn Polfliet, Lieven Eeckhout
Article No.: 52
Simulating contemporary computer systems is a challenging endeavor, especially when it comes to simulating high-end setups involving multiple servers. The simulation environment needs to run complete software stacks, including operating systems,...
Writeback-aware partitioning and replacement for last-level caches in phase change main memory systems
Miao Zhou, Yu Du, Bruce Childers, Rami Melhem, Daniel Mossé
Article No.: 53
Phase-Change Memory (PCM) has emerged as a promising low-power main memory candidate to replace DRAM. The main problems of PCM are that writes are much slower and more power hungry than reads, write bandwidth is much lower than read bandwidth, and...
A transactional memory with automatic performance tuning
Qingping Wang, Sameer Kulkarni, John Cavazos, Michael Spear
Article No.: 54
A significant obstacle to the acceptance of transactional memory (TM) in real-world parallel programs is the abundance of substantially different TM algorithms. Each TM algorithm appears well-suited to certain workload characteristics, but the...
sFtree: A fully connected and deadlock-free switch-to-switch routing algorithm for fat-trees
Bartosz Bogdanski, Sven-Arne Reinemo, Frank Olaf Sem-Jacobsen, Ernst Gunnar Gran
Article No.: 55
Existing fat-tree routing algorithms fully exploit the path diversity of a fat-tree topology in the context of compute node traffic, but they lack support for deadlock-free and fully connected switch-to-switch communication. Such support is...