SWITCHES is a task-based data-flow runtime that implements a lightweight distributed triggering system for runtime dependency resolution and uses static scheduling and compile-time assignment policies to reduce runtime overheads. Unlike other systems, SWITCHES allows the granularity of loop tasks to be increased to favor data locality, even when there are dependencies across different loops. SWITCHES introduces explicit resource allocation mechanisms for tasks to improve system utilization and adopts the latest OpenMP API to maintain a high level of programming productivity. It provides a source-to-source tool that automatically produces thread-based code. Results on an Intel Xeon Phi show an average performance increase of 32% compared to OpenMP.
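The idea of resolving dependencies by triggering can be illustrated with a minimal, single-threaded sketch. This is not the SWITCHES implementation: the function `run_dataflow`, the task names, and the counter-per-task scheme are assumptions chosen for illustration. Each task holds a count of unfinished producers, and a finishing producer decrements the counters of its consumers directly, so no central dependency graph is consulted at run time.

```python
from collections import deque


def run_dataflow(tasks, deps):
    """Run tasks in dependency order (hypothetical helper, not a SWITCHES API).

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> list of producer task names
    Returns the list of task names in the order they were executed.
    """
    # Each task keeps a counter of producers that have not finished yet.
    pending = {name: len(deps.get(name, [])) for name in tasks}

    # Reverse map: producer -> consumers it must notify on completion.
    consumers = {name: [] for name in tasks}
    for name, producers in deps.items():
        for producer in producers:
            consumers[producer].append(name)

    # Tasks without producers are ready immediately.
    ready = deque(name for name in tasks if pending[name] == 0)
    order = []
    while ready:
        name = ready.popleft()
        tasks[name]()
        order.append(name)
        # The finished producer updates its consumers' counters directly.
        for consumer in consumers[name]:
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)
    return order


# Illustrative task graph: "store" waits for both "scale" and "blur".
log = []
tasks = {
    "load": lambda: log.append("load"),
    "scale": lambda: log.append("scale"),
    "blur": lambda: log.append("blur"),
    "store": lambda: log.append("store"),
}
deps = {"scale": ["load"], "blur": ["load"], "store": ["scale", "blur"]}
order = run_dataflow(tasks, deps)
```

In a multi-threaded runtime the counter updates would need to be atomic and the ready tasks would be mapped to threads by the scheduling policy; the sketch leaves both out.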
We present a method of iterative schedule optimization in the polyhedron model that targets tiling and parallelization. We can sample the search space of legal schedules at random or perform a more directed search via a genetic algorithm. For the latter, we propose a set of novel reproduction operators. We evaluate our approach against existing iterative and model-driven optimization strategies. Our approach outperforms existing optimization techniques in that it finds significantly faster schedules. We compare our genetic algorithm against random exploration. If well configured, random exploration turns out to be very profitable and reduces the lead of the genetic algorithm.
Specialized Digital Signal Processors (DSPs) play an important role in power-efficient, high-performance image processing. However, developing applications for DSPs is more time-consuming and error-prone than for general-purpose processors. Halide is a domain-specific language (DSL) which enables low-effort development of portable, high-performance imaging pipelines. We propose a set of extensions to Halide to support DSPs in combination with arbitrary C compilers, including a template solution to support scratchpad memories. Using a commercial DSP, we demonstrate that this solution can achieve performance within 20% of highly tuned C code, while leading to a reduction in development time and code complexity.
Modern multi-core systems require efficient allocation of multiple resources to achieve optimal system energy-delay product (EDP). Choosing between multiple optimizations at runtime is complex because their effects are non-additive. We present a novel method, Machine Learned Machines (MLM), which uses online reinforcement learning (RL) to dynamically partition the last-level cache (LLC) and to apply dynamic voltage and frequency scaling (DVFS) to the core and uncore. We show that this co-optimization results in much lower system EDP than any of the techniques applied individually. The results show average system EDP improvements of 20.77% and 23.63% on a 4-core and a 16-core system, respectively, with limited degradation of throughput and fairness.
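As a reminder of the metric, EDP is the product of energy and execution time, so a technique that saves energy but lengthens execution may leave EDP unchanged. The sketch below uses the standard definition; the function names and all numbers are hypothetical and are not taken from the MLM evaluation.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def edp_improvement(baseline_edp, optimized_edp):
    """Fractional EDP reduction relative to the baseline."""
    return (baseline_edp - optimized_edp) / baseline_edp


# Hypothetical numbers, for illustration only.
baseline = edp(100.0, 2.0)        # 200.0 J*s
lower_freq_only = edp(80.0, 2.5)  # energy drops, delay rises: 200.0 J*s
combined = edp(80.0, 2.0)         # same energy, delay recovered: 160.0 J*s

gain_lower_freq = edp_improvement(baseline, lower_freq_only)
gain_combined = edp_improvement(baseline, combined)
```

In this made-up case, lowering the frequency alone gives no EDP gain, while the same energy saving combined with a second technique that recovers the lost time gives a 20% gain.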
Optimal code performance is one of the most important objectives in compute-intensive applications. Many of these applications use GPUs because of their high compute power. However, because of the massively parallel GPU architecture, the code has to be adjusted specifically to the underlying hardware to achieve optimal performance, and therefore has to be reoptimized for each new hardware generation. In this paper we give a unified description of the MATOG auto-tuner, its previously published components, and a series of conceptual and implementation improvements. MATOG optimizes array access in CUDA applications, independent of their application domain.
Programmers can no longer depend on new processors to have significantly improved performance. Instead, gains have to come from other sources such as the compiler and its optimization passes. Advanced passes make use of information on the dependences related to loops. We improve the quality of that information by reusing the information given by the programmer for parallelization. We have implemented a prototype based on GCC, to which we also add a new optimization pass. Our approach increases the number of correctly classified dependences, resulting in a 46% average improvement in single-thread performance on benchmarks compared to GCC 6.1.
Inter-application interference at shared main memory slows down different applications differently. Previous memory schedulers focus on alleviating interference to improve system performance and fairness or quantifying interference to provide predictable performance, but few provide both. We propose a SlowDown-aware Memory Scheduler (SDMS). First, SDMS improves estimation accuracy by considering refresh and row-buffer interference and measuring IPC directly. Second, SDMS groups applications and allocates different bandwidth to different applications based on estimated alone MPKC. The evaluation results show that SDMS can improve harmonic speedup by 8.8% over FRFCFS and has higher estimation accuracy than MISE and STFM.
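For reference, the metrics named here can be computed as follows. The abstract does not spell out its definitions, so this sketch assumes the ones commonly used in the memory-scheduling literature: an application's slowdown is its IPC when running alone divided by its IPC when sharing memory, and harmonic speedup is the number of applications divided by the sum of their slowdowns. The function names and IPC values are hypothetical.

```python
def slowdowns(ipc_alone, ipc_shared):
    """Per-application slowdown: IPC alone divided by IPC when sharing."""
    return [alone / shared for alone, shared in zip(ipc_alone, ipc_shared)]


def harmonic_speedup(ipc_alone, ipc_shared):
    """Harmonic mean of per-application speedups (higher is better)."""
    sd = slowdowns(ipc_alone, ipc_shared)
    return len(sd) / sum(sd)


def max_slowdown(ipc_alone, ipc_shared):
    """Largest slowdown, often used as an unfairness indicator."""
    return max(slowdowns(ipc_alone, ipc_shared))


# Hypothetical IPC values for three co-running applications.
ipc_alone = [2.0, 1.0, 4.0]
ipc_shared = [1.0, 0.5, 1.0]

sd = slowdowns(ipc_alone, ipc_shared)        # [2.0, 2.0, 4.0]
hs = harmonic_speedup(ipc_alone, ipc_shared)  # 3 / 8 = 0.375
worst = max_slowdown(ipc_alone, ipc_shared)   # 4.0
```

The third application is slowed down twice as much as the other two, which is the kind of uneven interference a slowdown-aware scheduler tries to estimate and correct.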
This paper proposes MiCOMP: Mitigating the Compiler Phase-ordering problem using optimization sub-sequences and machine learning, an autotuning framework to effectively mitigate the compiler phase-ordering problem based on machine-learning techniques. The idea is to group the optimization passes of LLVM's O3 setting into clusters and to predict the speedup of a complete sequence of these optimization clusters, instead of having to deal with more than 60 individual optimization passes. The predictive model uses (i) platform-independent dynamic features, (ii) an encoded version of the compiler sequence and (iii) an exploration heuristic to tackle the problem.
To increase the performance of data-intensive applications, we present an extension to a CPU architecture which enables arbitrary near-data processing capabilities close to the main memory. This is realized by introducing a component attached to the CPU system bus and a component at the memory side, which together support the hardware-managed integration of near-data processing. We present an implementation of the components, as well as a system simulator that provides detailed performance estimations. Synthetic benchmarks and the Graph500 benchmark show high inter-NDP communication bandwidths, small overhead for the proposed coherence mechanisms, and the ability to outperform a real CPU by a factor of two.
In this paper we demonstrate that a pattern-based parallel programming approach is flexible enough to parallelize 12 out of 13 PARSEC applications. Our analysis, conducted on three different multi-core architectures, demonstrates that pattern-based parallel programming has reached a good level of maturity, providing performance comparable to both other parallel programming methodologies based on pragma annotations (i.e., OpenMP and OmpSs) and native implementations (i.e., Pthreads). Regarding programming effort, we also demonstrate a considerable improvement compared to Pthreads and results comparable to the other existing implementations.
Data-center servers benefit from large-capacity memory systems to run multiple processes simultaneously. Hybrid DRAM-NVM memory is attractive for increasing memory capacity by exploiting the scalability of NVM. However, current LLC policies are unaware of hybrid memory. Cache misses to NVM incur a high cost due to long NVM latency. Moreover, evicting dirty NVM data suffers from long write latency. We propose hybrid-memory-aware cache partitioning (HAP) to dynamically adjust cache space and give dirty NVM data more chances to reside in the LLC. Experimental results show that HAP improves performance by 46.7% and reduces energy consumption by 21.9% on average compared to LRU management.
Unlike execution-based simulation, trace-driven simulation can enable fast simulation of multi-core architectures. An efficient, on-the-fly, high-fidelity trace generation method for multi-threaded applications is reported. The generated trace, which does not exceed double the size of the original executable code, is encoded in an architecture-agnostic, instruction-like binary format that can be interpreted by a timing simulator. Evaluation with a complete tool suite developed for the proposed method shows that it produces smaller traces than existing trace compression methods while retaining maximum fidelity, including all threading- and synchronization-related events.