In today's computers, heterogeneous processing is used to meet performance targets at manageable power. As compute specialization increases, however, so does the relative amount of time spent on communication. System and software optimizations for communication often come at the cost of increased complexity and reduced portability. The Decoupled Supply-Compute (DeSC) approach offers a way to attack communication latency bottlenecks automatically while maintaining good portability and low complexity. Our work extends prior Decoupled Access-Execute techniques with hardware/software specialization. Across the evaluated workloads, DeSC offers an average speedup of 2.04x on homogeneous CMPs and 1.56x on accelerator-based heterogeneous systems.
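The decoupled access-execute idea underlying DeSC can be illustrated with a minimal sketch (the structure below is an assumed software analogy, not the actual DeSC hardware): a "supply" slice performs all address generation and loads, feeding operands through a bounded queue to a "compute" slice that never issues a memory access itself, so memory latency is hidden from the computation.

```python
# Software analogy of decoupled supply/compute (illustrative, not DeSC itself).
import threading
import queue

def supply(data, q):
    # Supply slice: performs all data access and pushes operands downstream.
    for x in data:
        q.put(x)      # hand operand to the compute slice
    q.put(None)       # sentinel: end of stream

def compute(q):
    # Compute slice: consumes operands in order; never issues a load itself.
    total = 0
    while (x := q.get()) is not None:
        total += x * x
    return total

def decoupled_sum_of_squares(data, depth=8):
    # The bounded queue models the hardware buffer between the two slices.
    q = queue.Queue(maxsize=depth)
    t = threading.Thread(target=supply, args=(data, q))
    t.start()
    total = compute(q)
    t.join()
    return total

print(decoupled_sum_of_squares(range(5)))  # 0 + 1 + 4 + 9 + 16 = 30
```

The queue depth plays the role of the decoupling buffer: a deeper buffer lets the supply slice run further ahead of the compute slice.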
We present a method of iterative schedule optimization in the polyhedron model that targets tiling and parallelization. We can sample the search space of legal schedules at random or perform a more directed search via a genetic algorithm, for which we propose a set of novel reproduction operators. We evaluate our approach against existing iterative and model-driven optimization strategies and find that it yields significantly faster schedules. We also compare our genetic algorithm against random exploration: if well configured, random exploration turns out to be very profitable and narrows the genetic algorithm's lead.
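The search loop can be sketched as follows (the cost model and reproduction operators here are invented stand-ins, not the paper's actual operators): candidate schedules are modeled as loop permutations, and a small genetic algorithm with crossover and swap mutation evolves them against a toy cost function, where a real system would instead time the generated code.

```python
# Toy genetic search over loop orders (illustrative cost model and operators).
import random

DIMS = 8

def cost(schedule):
    # Hypothetical cost: distance from the (assumed optimal) identity order.
    # A real auto-tuner would compile and time the schedule instead.
    return sum(abs(pos - dim) for pos, dim in enumerate(schedule))

def crossover(a, b):
    # Order crossover: keep a prefix of parent a, fill the rest in b's order,
    # so the child is always a legal permutation.
    cut = random.randrange(1, DIMS)
    head = a[:cut]
    return head + [d for d in b if d not in head]

def mutate(s):
    # Swap two loop positions (models e.g. a loop interchange).
    i, j = random.sample(range(DIMS), 2)
    s = s[:]
    s[i], s[j] = s[j], s[i]
    return s

def genetic_search(pop_size=20, generations=30, seed=0):
    random.seed(seed)
    pop = [random.sample(range(DIMS), DIMS) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[: pop_size // 2]            # elitist truncation selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return min(pop, key=cost)

best = genetic_search()
print(best, cost(best))
```

Random exploration corresponds to skipping the selection and reproduction steps and simply drawing fresh permutations each round; with a large enough sample budget it can approach the same quality, which matches the abstract's observation.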
Specialized Digital Signal Processors (DSPs) play an important role in power-efficient, high-performance image processing. However, developing applications for DSPs is more time-consuming and error-prone than for general-purpose processors. Halide is a domain-specific language (DSL) that enables low-effort development of portable, high-performance imaging pipelines. We propose a set of extensions to Halide to support DSPs in combination with arbitrary C compilers, including a template solution to support scratchpad memories. Using a commercial DSP, we demonstrate that this solution can achieve performance within 20% of highly tuned C code, while reducing development time and code complexity.
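The scratchpad pattern the extensions target can be sketched generically (this is a common staging pattern, not Halide's or the paper's actual template mechanism): data is explicitly copied into a small local buffer, processed entirely there, and written back, one tile at a time.

```python
# Generic scratchpad-style tiling (illustrative; not the paper's template).
def process_with_scratchpad(image, tile, kernel):
    out = [0] * len(image)
    scratch = [0] * tile                        # stands in for the DSP's local memory
    for base in range(0, len(image), tile):
        n = min(tile, len(image) - base)        # last tile may be partial
        scratch[:n] = image[base:base + n]      # stage in (DMA-in on a real DSP)
        for i in range(n):                      # compute entirely in the scratchpad
            scratch[i] = kernel(scratch[i])
        out[base:base + n] = scratch[:n]        # stage out (DMA-out)
    return out

print(process_with_scratchpad(list(range(10)), 4, lambda x: x * 2))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

The point of generating this pattern from a DSL is that tile size and staging code can be retuned per target without touching the algorithm itself.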
Optimal code performance is one of the most important objectives in compute-intensive applications. In many of these applications, GPUs are used because of their high compute power. However, their massively parallel architecture means that code has to be tuned specifically to the underlying hardware to achieve optimal performance, and therefore has to be re-optimized for each new hardware generation. In this paper we give a unified description of the MATOG auto-tuner, its previously published components, and a series of conceptual and implementation improvements. MATOG optimizes array accesses in CUDA applications, independent of their application domain.
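A central choice in array-access tuning is the memory layout, e.g. array-of-structs (AoS) versus struct-of-arrays (SoA). The sketch below (interface invented for illustration, not MATOG's actual API) shows the key idea: the application reads fields through one uniform accessor, so the tuner is free to switch the underlying layout per GPU generation without changing the kernel code.

```python
# Layout-transparent array access (illustrative; not MATOG's real API).
class TunableArray:
    def __init__(self, fields, n, layout="aos"):
        self.fields, self.layout = list(fields), layout
        if layout == "aos":
            self.data = [[0] * len(fields) for _ in range(n)]  # one record per element
        else:  # "soa"
            self.data = {f: [0] * n for f in fields}           # one array per field

    def get(self, i, field):
        if self.layout == "aos":
            return self.data[i][self.fields.index(field)]
        return self.data[field][i]

    def set(self, i, field, value):
        if self.layout == "aos":
            self.data[i][self.fields.index(field)] = value
        else:
            self.data[field][i] = value

# The kernel is written once against the accessor; the auto-tuner then picks
# whichever layout times fastest on the target hardware.
for layout in ("aos", "soa"):
    a = TunableArray(["x", "y"], 4, layout)
    a.set(2, "y", 7)
    print(layout, a.get(2, "y"))  # aos 7 / soa 7
```

On a GPU, SoA typically enables coalesced loads when threads access the same field of consecutive elements, while AoS can win when whole records are consumed together, which is why the best choice is hardware- and kernel-dependent.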
This paper addresses the topic of programming CPU/FPGA heterogeneous systems for high-performance, energy-efficient image processing applications, including creating and programming FPGA accelerators and integrating them into systems. We extend the Halide DSL so that users can specify which portions of their applications should become hardware accelerators, and we provide a compiler that creates the accelerator along with the glue code needed for the user's application to access this hardware. Our system provides high-level semantics to explore different mappings of applications to a heterogeneous system, with the added flexibility of being able to map at different throughput rates.
Data-center servers benefit from large-capacity memory systems to run multiple processes simultaneously. Hybrid DRAM-NVM memory is attractive for increasing memory capacity by exploiting the scalability of NVM. However, current LLC policies are unaware of hybrid memory: cache misses to NVM incur high cost due to long NVM read latency, and evicting dirty NVM data suffers from long write latency. We propose Hybrid-memory-Aware cache Partitioning (HAP) to dynamically adjust cache space and give dirty NVM data more chances to reside in the LLC. Experimental results show that HAP improves performance by 46.7% and reduces energy consumption by 21.9% on average over LRU management.
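The core intuition can be sketched with a toy victim-selection policy (simplified from the abstract; the cost weights and look-back window below are invented, not the paper's exact HAP algorithm): among the oldest candidate lines, evict the one whose combined refill and writeback penalty is lowest, so dirty NVM lines, with their long write latency, stay cached longest.

```python
# Toy hybrid-memory-aware victim selection (illustrative, not the HAP paper's algorithm).
from dataclasses import dataclass

@dataclass
class Line:
    tag: int
    in_nvm: bool    # backed by NVM rather than DRAM
    dirty: bool
    lru_age: int    # higher = older

def eviction_cost(line):
    # Assumed relative penalties: NVM reads and especially writes are far
    # slower than DRAM; a clean line needs no writeback at all.
    refill = 3 if line.in_nvm else 1                       # future re-fetch cost
    writeback = (5 if line.in_nvm else 1) if line.dirty else 0
    return refill + writeback

def choose_victim(cands, window=2):
    # Consider the `window` oldest lines and evict the cheapest of them,
    # trading pure recency for memory-type awareness.
    oldest = sorted(cands, key=lambda l: -l.lru_age)[:window]
    return min(oldest, key=eviction_cost)

lines = [Line(0, in_nvm=True,  dirty=True,  lru_age=9),
         Line(1, in_nvm=False, dirty=False, lru_age=8),
         Line(2, in_nvm=False, dirty=True,  lru_age=1)]
print(choose_victim(lines).tag)  # 1: evicts the clean DRAM line, not dirty NVM
```

Pure LRU would evict line 0 here and pay an expensive NVM writeback plus an expensive NVM refill; the hybrid-aware choice evicts the clean DRAM line instead.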
Unlike execution-based simulation, trace-driven simulation enables fast simulation of multi-core architectures. We report an efficient, on-the-fly, high-fidelity trace generation method for multi-threaded applications. The generated trace, no larger than double the size of the original executable code, is encoded in an architecture-agnostic, instruction-like binary format that can be interpreted by a timing simulator. A complete tool suite has been developed and used to evaluate the proposed method; it produces smaller traces than existing trace compression methods while retaining full fidelity, including all threading- and synchronization-related events.
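What an architecture-agnostic, instruction-like binary trace format looks like can be sketched briefly (the record layout below is invented for illustration; the paper's actual format differs): each record packs an event kind, a thread id, and an address into a fixed-size binary word, covering memory accesses and synchronization events alike, which a timing simulator can then replay.

```python
# Illustrative fixed-size binary trace record (not the paper's actual format).
import struct

# Record layout: 1-byte event kind, 1-byte thread id, 8-byte address/payload.
FMT = "<BBQ"
LOAD, STORE, LOCK_ACQ, LOCK_REL = range(4)   # memory and synchronization events

def encode(events):
    return b"".join(struct.pack(FMT, kind, tid, addr)
                    for kind, tid, addr in events)

def decode(blob):
    size = struct.calcsize(FMT)
    return [struct.unpack(FMT, blob[i:i + size])
            for i in range(0, len(blob), size)]

trace = [(LOAD, 0, 0x1000), (LOCK_ACQ, 1, 0x2000), (STORE, 1, 0x1008)]
blob = encode(trace)
print(len(blob), decode(blob) == trace)  # 30 True
```

Because the records describe abstract events rather than ISA-specific instructions, the same trace can drive timing models of different target architectures.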