Compilers for Coarse-Grained Reconfigurable Array (CGRA) architectures suffer from long compilation times and code quality levels far below the theoretical upper bounds. This paper presents a new scheduler, called the Bimodal Modulo Scheduler (BMS), to map inner loops onto (heterogeneous) CGRAs of the ADRES family. BMS significantly outperforms existing schedulers for similar architectures in terms of generated code quality and compilation time. This is achieved by combining new schemes for backtracking with extended and adapted forms of priority functions and cost functions, as described in the paper. BMS is evaluated by mapping multimedia and software-defined radio benchmarks onto tuned ADRES instances.
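As background on the "theoretical upper bounds" mentioned above: in modulo scheduling generally, throughput is bounded by the minimum initiation interval (MII), the larger of a resource-constrained bound and a recurrence-constrained bound. The following is a minimal sketch of that standard bound, not of BMS itself; the function names, resource classes, and example numbers are hypothetical.

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource bound: for each resource class, ceil(ops / units),
    maximized over all classes (hypothetical class names)."""
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(cycles):
    """Recurrence bound: for each dependence cycle given as
    (total latency, total iteration distance), ceil(latency / distance)."""
    return max((ceil(lat / dist) for lat, dist in cycles), default=1)

def min_ii(op_counts, unit_counts, cycles):
    """No modulo schedule can have an initiation interval below this value."""
    return max(res_mii(op_counts, unit_counts), rec_mii(cycles))

# Illustrative loop body: 10 ALU ops on 4 ALUs, 3 multiplies on 2 multipliers,
# 4 memory ops on 2 ports, and two dependence cycles.
ops = {"alu": 10, "mul": 3, "mem": 4}
units = {"alu": 4, "mul": 2, "mem": 2}
recurrences = [(4, 1), (5, 2)]
print(min_ii(ops, units, recurrences))
```

A scheduler's code quality can then be judged by how close its achieved initiation interval comes to this bound.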
This paper presents a benchmark of three financial applications, suitable for GPGPU execution. Common benchmark-design practice has been to provide essentially the same code for the sequential and parallel versions, optimized for only one class of datasets. In comparison, we (i) document all available parallelism via nested map-reduce operators in a functional implementation that closely resembles the original code structure, (ii) document the invariants and transformations that govern the trade-offs of a rich, data-sensitive optimization space, and (iii) report target multi-core and multi-versioned GPGPU code together with an evaluation that demonstrates these trade-offs. We believe this work provides useful insight into the language and compiler infrastructure capable of expressing and optimizing such applications.
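To illustrate what "nested map-reduce operators" can look like, here is a minimal sketch in Python. It is not taken from the benchmark: the payoff function, the data, and all names are hypothetical. The outer map ranges over strikes and the inner map-reduce averages a payoff over sample prices, so both levels of parallelism are visible in the code structure.

```python
from functools import reduce
from operator import add

def payoff(strike, price):
    """Hypothetical call-option payoff for one sampled price."""
    return max(price - strike, 0.0)

def mean_payoff(strike, prices):
    """Inner level: map the payoff over all prices, then reduce with +."""
    total = reduce(add, map(lambda p: payoff(strike, p), prices), 0.0)
    return total / len(prices)

def price_all(strikes, prices):
    """Outer level: map over strikes; each element runs an inner map-reduce."""
    return list(map(lambda k: mean_payoff(k, prices), strikes))

print(price_all([100.0, 110.0], [90.0, 100.0, 110.0, 120.0]))
```

Whether a compiler should parallelize the outer map, the inner reduction, or both depends on the sizes of the two inputs, which is the kind of data-sensitive trade-off the abstract refers to.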
The Network-on-Chip (NoC) is becoming increasingly susceptible to emerging reliability threats. In this work, we propose NoCAlert, a comprehensive on-line and real-time fault detection and localization mechanism that demonstrates 0% false negatives within the interconnect, for the fault models and stimulus set used in this study. Based on the concept of invariance checking, NoCAlert employs a group of lightweight micro-checker modules that collectively implement real-time hardware assertions. NoCAlert can pinpoint the location of the fault at various levels of granularity. Extensive cycle-accurate simulations in a 64-node CMP and analysis at the RTL netlist level demonstrate the efficacy of the proposed technique.
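The idea of invariance checking can be sketched in software as follows. This is a hypothetical illustration, not one of NoCAlert's actual checkers: it assumes a router arbiter whose request and grant vectors are encoded as bit masks, and checks two invariants that a fault-free arbiter of that kind would be expected to satisfy every cycle.

```python
def check_arbiter(requests, grants):
    """Return the names of the (assumed) invariants violated in one cycle.

    requests, grants: non-negative ints used as bit vectors, one bit per port.
    """
    violations = []
    # Assumed invariant 1: at most one port is granted per cycle.
    if bin(grants).count("1") > 1:
        violations.append("multiple_grants")
    # Assumed invariant 2: a port is granted only if it requested.
    if grants & ~requests:
        violations.append("grant_without_request")
    return violations

print(check_arbiter(0b0110, 0b0010))  # legal grant
print(check_arbiter(0b0110, 0b0011))  # two grants, one to a non-requesting port
```

Because each checker watches a single module's signals, a reported violation also indicates where the fault lies, which is how such checks support localization as well as detection.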
Dynamic Process Migration based on Block Access Patterns occurred in Storage Servers
We present an automated framework that identifies custom instructions that run on a domain-specific functional unit. A preliminary set of custom instruction hardware implementations is transformed to a new code abstraction that improves similarity identification across applications, based on exact and partial matching. New custom instructions cover either whole loop bodies or fragments of the code. The framework selects merged custom instructions to efficiently exploit the available area for specialization. For media applications, custom instructions improve both the energy-delay product (EDP) and speedup. Those with the highest utilization opportunities achieve an average EDP improvement of 3.8x compared to the baseline processor.
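For readers unfamiliar with the metric, EDP is the product of energy and execution time, and an improvement factor is the baseline EDP divided by the new EDP. A minimal sketch follows; the function names and the numbers are illustrative only and are not measurements from the paper.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds

def edp_improvement(base_energy, base_delay, new_energy, new_delay):
    """Improvement factor of the new design over the baseline (>1 is better)."""
    return edp(base_energy, base_delay) / edp(new_energy, new_delay)

# Illustrative values: halving both energy and delay gives a 4x EDP improvement.
print(edp_improvement(2.0, 4.0, 1.0, 2.0))
```

Since delay enters the product directly, a custom instruction that reduces both runtime and energy improves EDP more than proportionally to its speedup.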
This paper presents an innovative scheduling technique, called micro-scheduling, specifically adapted to the mapping of parametric dataflow programs onto dedicated MPSoCs. This technique is implemented in a new compilation flow that is able to compile parametric dataflow graphs. The compiler offers an actor-based C++ programming model to describe parametric graphs, a compilation front end for graph analysis, and a retargetable back end. The experimental results show compilation of 3GPP LTE-Advanced demodulation on a heterogeneous MPSoC with distributed scheduling features and memory size constraints. The compiled programs achieve performance similar to that of hand-written, hand-optimized code.
The variety of today's architectures forces programmers to spend effort porting and tuning application code across different platforms, because compilers' standard optimization levels often fail to deliver the best results: they are designed for the average case rather than customized to specific architectures. This paper proposes COBAYN: COmpiler autotuning framework using BAYesian Networks, a novel machine-learning-based approach to compiler autotuning that speeds up application performance and reduces the cost of the compiler optimization phases. The proposed framework is based on dynamic application characterization using micro-architecture-independent features and Bayesian Networks.
Parallel computers are becoming deeply hierarchical. Locality-aware programming models allow programmers to control locality at one level by establishing affinity between data and executing activities. This does not enable locality exploitation at other levels. Techniques applied directly by programmers beyond the first level burden them and hinder productivity. We propose the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). PHLAME is an execution model that abstracts and exploits hierarchical machine properties through locality-aware programming and a runtime system. This paper presents the concepts and techniques that drive such a runtime system. Experiments show that our techniques scale and achieve performance gains of up to 88%.