A key problem on multicore systems is cache sharing, where the cache occupancy of a program depends on the cache usage of its peer programs. The exclusive cache hierarchy used on AMD processors is an effective solution that allows processor cores to have large private caches while still benefiting from a shared cache. The shared cache stores victims, i.e., data evicted from the private caches, so performance depends on how the victims of co-running programs interact in the shared cache.
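To make the victim flow concrete, the sketch below shows the core insertion rule of an exclusive hierarchy: lines enter the shared cache only when evicted from a private cache, and a shared-cache hit migrates the line back to the private level. This is a minimal illustration with invented names and a FIFO stand-in for real set-associative caches, not the paper's model.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <unordered_set>

// Minimal sketch of exclusive-hierarchy bookkeeping: the shared cache
// holds only victims of the private cache, never duplicates.
class ExclusiveHierarchy {
    std::deque<uint64_t> priv;            // FIFO stand-in for a private cache
    std::unordered_set<uint64_t> shared;  // victim (shared) cache
    static constexpr std::size_t kPrivCap = 4;

public:
    void access(uint64_t line) {
        if (std::find(priv.begin(), priv.end(), line) != priv.end())
            return;                        // private hit: nothing moves
        shared.erase(line);                // shared hit: line migrates back,
        priv.push_back(line);              // keeping the two levels exclusive
        if (priv.size() > kPrivCap) {      // private eviction produces a victim,
            shared.insert(priv.front());   // the shared cache's only source
            priv.pop_front();              // of insertions
        }
    }
};
```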
SWITCHES is a task-based data-flow runtime that implements a lightweight distributed triggering system for runtime dependency resolution, and uses static scheduling and compile-time assignment policies to reduce runtime overheads. Unlike other systems, the granularity of loop tasks can be increased to favor data locality, even in the presence of dependencies across different loops. SWITCHES introduces explicit resource-allocation mechanisms for tasks to improve system utilization, and adopts the latest OpenMP API to maintain a high level of programming productivity. It provides a source-to-source tool that automatically produces thread-based code. Results on an Intel Xeon Phi show an average performance increase of 32% over OpenMP.
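A minimal sketch of the programming model SWITCHES builds on: plain OpenMP tasks with depend clauses express dependencies across different loops, so each block of iterations becomes a schedulable task. The function names and block size here are illustrative, not from the paper.

```cpp
// Block-sized loop tasks with cross-loop dependencies in standard OpenMP;
// SWITCHES accepts such annotated code and assigns tasks to threads
// statically instead of resolving everything at runtime.
void pipeline(double* a, double* b, int n, int blk) {
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; i += blk) {
        // producer loop-task: writes one block of a
        #pragma omp task depend(out: a[i])
        for (int j = i; j < i + blk && j < n; ++j) a[j] = 2.0 * j;

        // consumer loop-task from a *different* loop, tied to the same block
        #pragma omp task depend(in: a[i]) depend(out: b[i])
        for (int j = i; j < i + blk && j < n; ++j) b[j] = a[j] + 1.0;
    }
}
```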
Specialized Digital Signal Processors (DSPs) play an important role in power-efficient, high-performance image processing. However, developing applications for DSPs is more time-consuming and error-prone than for general-purpose processors. Halide is a domain-specific language (DSL) that enables low-effort development of portable, high-performance imaging pipelines. We propose a set of extensions to Halide that support DSPs in combination with arbitrary C compilers, including a template solution for scratchpad memories. Using a commercial DSP, we demonstrate that this solution achieves performance within 20% of highly tuned C code while reducing development time and code complexity.
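For context, a Halide pipeline separates the algorithm from its schedule; the schedule below targets a CPU, and the proposed extensions would let similar scheduling directives steer a DSP's vector units and scratchpad staging. This is a generic textbook-style example, not the paper's code.

```cpp
#include "Halide.h"
#include <cstdint>

int main() {
    Halide::Var x("x"), y("y");
    Halide::Func blur_x("blur_x"), blur_y("blur_y");
    Halide::Buffer<uint8_t> in(1024, 768);

    // Algorithm: a simple separable blur, independent of any target.
    blur_x(x, y) = (Halide::cast<uint16_t>(in(x, y)) + in(x + 1, y)) / 2;
    blur_y(x, y) = Halide::cast<uint8_t>((blur_x(x, y) + blur_x(x, y + 1)) / 2);

    // Schedule: where per-target tuning lives; retargeting means changing
    // these directives, not the algorithm above.
    Halide::Var xi, yi;
    blur_y.tile(x, y, xi, yi, 64, 8).vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);

    blur_y.realize({960, 720});
    return 0;
}
```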
Modern multi-core systems require efficient allocation of multiple resources to achieve optimal system energy-delay product (EDP). Choosing between multiple optimizations at runtime is complex because their effects are non-additive. We present a novel method, Machine Learned Machines (MLM), that uses online reinforcement learning (RL) to dynamically partition the last-level cache (LLC) while applying DVFS to the core and uncore. We show that this co-optimization results in much lower system EDP than any of the techniques applied individually. The results show average system EDP improvements of 20.77% on a 4-core and 23.63% on a 16-core system, with limited degradation of throughput and fairness.
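A toy sketch of such a co-optimization loop, under our own simplifying assumptions rather than the paper's model: a tabular Q-learning agent picks a joint action of DVFS level and LLC way allocation, observes an EDP-based reward for the last interval, and updates its value estimates.

```cpp
#include <algorithm>
#include <array>
#include <random>

// Tabular Q-learning over joint (DVFS level, LLC ways) actions; the state
// definition, reward, and table sizes are illustrative placeholders.
constexpr int kStates  = 8;               // e.g., binned IPC/MPKI phase signature
constexpr int kFreqs   = 4;               // DVFS levels
constexpr int kWays    = 4;               // LLC way allocations
constexpr int kActions = kFreqs * kWays;  // joint action space

struct Agent {
    std::array<std::array<double, kActions>, kStates> Q{};
    std::mt19937 rng{42};
    double alpha = 0.1, gamma = 0.9, eps = 0.1;

    int act(int s) {
        if (std::uniform_real_distribution<>(0, 1)(rng) < eps)   // explore
            return std::uniform_int_distribution<>(0, kActions - 1)(rng);
        int best = 0;                                            // exploit
        for (int a = 1; a < kActions; ++a)
            if (Q[s][a] > Q[s][best]) best = a;
        return best;  // decode: freq = best / kWays, ways = best % kWays
    }

    // reward could be, e.g., the negative measured EDP of the last interval
    void update(int s, int a, double reward, int s2) {
        double best = Q[s2][0];
        for (int a2 = 1; a2 < kActions; ++a2) best = std::max(best, Q[s2][a2]);
        Q[s][a] += alpha * (reward + gamma * best - Q[s][a]);
    }
};
```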
Optimal code performance is one of the most important objectives in compute-intensive applications, many of which use GPUs because of their high compute power. However, owing to their massively parallel architecture, code must be specifically tuned to the underlying hardware to achieve optimal performance, and therefore has to be re-optimized for each new hardware generation. In this paper we give a unified description of the MATOG auto-tuner, its previously published components, and a series of conceptual and implementation improvements. MATOG optimizes array accesses in CUDA applications, independent of the application domain.
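The flavor of choice that array-access autotuning automates: the same logical array can be laid out as array-of-structs or struct-of-arrays, and which is faster depends on the access pattern and hardware generation. A plain C++ sketch of the two layouts, purely illustrative:

```cpp
#include <cstddef>
#include <vector>

// Array-of-Structs: all fields of one element are contiguous. Good when a
// single thread touches every field of its element.
struct ParticleAoS { float x, y, z; };
using AoS = std::vector<ParticleAoS>;

// Struct-of-Arrays: each field is contiguous across elements. On GPUs this
// lets adjacent threads issue coalesced loads of a single field.
struct SoA {
    std::vector<float> x, y, z;
    explicit SoA(std::size_t n) : x(n), y(n), z(n) {}
};

// The same update under both layouts; a MATOG-style autotuner picks the
// layout (and related access parameters) per kernel and per GPU generation.
void shift_aos(AoS& p, float dx) { for (auto& e : p) e.x += dx; }
void shift_soa(SoA& p, float dx) { for (auto& v : p.x) v += dx; }
```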
This paper proposes MiCOMP (Mitigating the Compiler Phase-ordering problem), an autotuning framework that mitigates the compiler phase-ordering problem using optimization sub-sequences and machine learning. The idea is to cluster the optimization passes of LLVM's -O3 setting and predict the speedup of a complete sequence of optimization clusters, instead of having to search over more than 60 individual optimization passes. The predictive model uses (i) platform-independent dynamic features, (ii) an encoded version of the compiler sequence, and (iii) an exploration heuristic to tackle the problem.
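A schematic of the idea in miniature, heavily simplified and with invented names: each cluster of -O3 passes becomes a single token, a candidate cluster sequence is encoded alongside dynamic program features, and a learned predictor scores it instead of the compiler enumerating individual passes.

```cpp
#include <vector>

// Hypothetical encoding step: a sequence over pass *clusters* (a handful of
// symbols) replaces a sequence over 60+ individual LLVM passes.
enum class Cluster { A, B, C, D, E };  // clusters of -O3 passes

std::vector<double> encode(const std::vector<Cluster>& seq,
                           const std::vector<double>& dyn_features) {
    std::vector<double> feat = dyn_features;     // platform-independent features
    for (Cluster c : seq)                        // simple ordinal encoding of
        feat.push_back(static_cast<double>(c));  // the cluster sequence
    return feat;
}

// Placeholder for the trained regressor; the exploration heuristic would ask
// it to rank candidate cluster sequences and compile only the most promising.
double predict_speedup(const std::vector<double>& feat) {
    return feat.empty() ? 1.0 : 1.0;  // stub: a real model is learned from data
}
```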
To increase the performance of data-intensive applications, we present an extension to a CPU architecture that enables arbitrary near-data processing (NDP) capabilities close to main memory. This is realized by introducing a component attached to the CPU system bus and a component at the memory side, which together support the hardware-managed integration of near-data processing. We present an implementation of the components, as well as a system simulator providing detailed performance estimates. Synthetic benchmarks and the Graph500 benchmark show high inter-NDP communication bandwidths, small overheads for the proposed coherence mechanisms, and the ability to outperform a real CPU by a factor of two.
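One hypothetical software-side view of such an interface, entirely illustrative and not the paper's actual hardware design: the host fills a near-data operation descriptor in a memory-mapped command region, then rings a doorbell so the memory-side component executes the operation next to DRAM.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative descriptor for a near-data operation; the field layout and
// opcode numbering are invented for this sketch.
struct NdpCommand {
    uint64_t op;        // e.g., 0 = copy, 1 = reduce, 2 = pointer-chase
    uint64_t src, dst;  // address ranges handled at the memory side
    uint64_t len;
};

// The CPU-side component exposes a command queue; submissions are ordinary
// stores, so the proposed coherence mechanisms can track them in hardware.
struct NdpQueue {
    volatile NdpCommand* slots;    // memory-mapped command slots (64 assumed)
    std::atomic<uint32_t>* tail;   // doorbell/tail register

    void submit(const NdpCommand& c) {
        uint32_t t = tail->load(std::memory_order_relaxed);
        volatile NdpCommand& s = slots[t % 64];          // fill the slot...
        s.op = c.op; s.src = c.src; s.dst = c.dst; s.len = c.len;
        tail->store(t + 1, std::memory_order_release);   // ...then ring doorbell
    }
};
```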
In this paper we demonstrate that the pattern-based parallel programming approach is flexible enough to parallelize 12 out of the 13 PARSEC applications. Our analysis, conducted on three different multi-core architectures, demonstrates that pattern-based parallel programming has reached a good level of maturity, providing performance comparable to both parallel programming methodologies based on pragma annotations (i.e., OpenMP and OmpSs) and native implementations (i.e., Pthreads). Regarding programming effort, we also demonstrate a considerable improvement over Pthreads and comparable results with respect to the other existing implementations.
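As a flavor of what "pattern-based" means, here is a generic C++17 illustration (not the framework evaluated in the paper): the parallelism is expressed by naming a pattern, here a map, while the runtime owns the thread creation, partitioning, and joining that a Pthreads version would code by hand.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Map pattern via C++17 parallel algorithms: the code states *what* is
// parallel; the library decides how to run it on the available cores.
std::vector<float> brighten(const std::vector<float>& img, float gain) {
    std::vector<float> out(img.size());
    std::transform(std::execution::par_unseq, img.begin(), img.end(),
                   out.begin(), [gain](float p) { return p * gain; });
    return out;
}
```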
Data-center servers benefit from large-capacity memory systems to run multiple processes simultaneously. Hybrid DRAM-NVM memory is attractive for increasing memory capacity because it exploits the scalability of NVM. However, current last-level cache (LLC) policies are unaware of hybrid memory: cache misses to NVM are costly due to long NVM read latency, and evicting dirty NVM data suffers from long write latency. We propose hybrid-memory-aware cache partitioning (HAP) to dynamically adjust cache space and give dirty NVM data more chances to reside in the LLC. Experimental results show that HAP improves performance by 46.7% and reduces energy consumption by 21.9% on average compared with LRU management.
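A condensed sketch of the policy intuition, using our own pseudologic rather than the paper's exact algorithm: partition LLC ways between DRAM- and NVM-backed lines, and bias victim selection away from dirty NVM lines because their writeback is the most expensive outcome.

```cpp
#include <cstdint>
#include <vector>

// One LLC set under a hybrid-memory-aware policy; the cost model and way
// quota here are illustrative. Dirty NVM lines are the last-resort victims.
struct Line { uint64_t tag; bool valid, dirty, nvm; uint32_t age; };

int pick_victim(const std::vector<Line>& set, int nvm_quota) {
    int nvm_resident = 0;
    for (const Line& l : set) nvm_resident += (l.valid && l.nvm);

    int victim = -1; uint32_t oldest = 0;
    for (int w = 0; w < static_cast<int>(set.size()); ++w) {
        const Line& l = set[w];
        if (!l.valid) return w;                 // free way first
        bool costly = l.nvm && l.dirty;         // long NVM write latency
        bool over_quota = l.nvm && nvm_resident > nvm_quota;
        // prefer clean or DRAM-backed victims; evict dirty NVM data only
        // when NVM lines exceed their dynamically adjusted share
        if ((!costly || over_quota) && l.age >= oldest) {
            oldest = l.age; victim = w;
        }
    }
    return victim >= 0 ? victim : 0;            // fallback: oldest overall
}
```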
Unlike execution-based simulation, trace-driven simulation enables fast simulation of multi-core architectures. An efficient, on-the-fly, high-fidelity trace generation method for multi-threaded applications is reported. The generated trace, which does not exceed twice the size of the original executable code, is encoded in an architecture-agnostic, instruction-like binary format that can be interpreted by a timing simulator. A complete tool suite, developed and used to evaluate the proposed method, shows that it produces smaller traces than existing trace compression methods while retaining maximum fidelity, including all threading- and synchronization-related events.
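To illustrate the kind of encoding described (the record format below is invented for illustration, not the paper's actual one): each trace record is a compact, architecture-agnostic "instruction" that a timing simulator interprets, with explicit records for threading and synchronization events.

```cpp
#include <cstdint>
#include <fstream>
#include <vector>

// Compact, architecture-agnostic trace records; the opcode set and field
// widths are illustrative assumptions.
enum class Op : uint8_t {
    Compute, Load, Store, ThreadSpawn, LockAcq, LockRel, Barrier
};

#pragma pack(push, 1)
struct Record {
    Op       op;    // what the simulated core should do next
    uint8_t  tid;   // issuing thread
    uint16_t cost;  // abstract compute cost between memory/sync events
    uint64_t addr;  // effective address or synchronization-object id
};
#pragma pack(pop)

// Binary, instruction-like stream that a timing simulator interprets.
void flush(const std::vector<Record>& buf, std::ofstream& out) {
    out.write(reinterpret_cast<const char*>(buf.data()),
              static_cast<std::streamsize>(buf.size() * sizeof(Record)));
}
```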