Tiling Optimizations for Stencil Computations Using Rewrite Rules in Lift
Indirect memory accesses have irregular access patterns that limit the effectiveness of conventional software- and hardware-based prefetchers. To address this problem, we propose the Array Tracking Prefetcher (ATP), which tracks array-based indirect memory accesses using a novel combination of software and hardware. ATP yields an average speedup of 2.17× over a single core without prefetching; by contrast, the speedups for conventional software- and hardware-based prefetching are 1.84× and 1.32×, respectively. On four cores, the average speedup for ATP is 1.85×, while the corresponding speedups for software- and hardware-based prefetching are 1.60× and 1.25×, respectively.
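The access pattern that defeats stride-based prefetchers can be sketched as follows; the function and arrays here are hypothetical, purely to illustrate the `values[index[i]]` gather that ATP is designed to track:

```python
def indirect_sum(values, index):
    """Indirect (gather) access: the address stream into `values` depends on
    the runtime contents of `index`, so a stride-based hardware prefetcher
    cannot predict it. A software or hybrid prefetcher can, however, read
    ahead in `index` to compute the upcoming addresses."""
    total = 0.0
    for i in range(len(index)):
        total += values[index[i]]   # irregular access: values[B[i]]
    return total
```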
We optimize Sparse Matrix-Vector multiplication (SpMV) using a mixed-precision strategy, MpSpMV, on an NVIDIA V100 GPU. The approach has three benefits: (1) it reduces computation time; (2) it reduces the size of the input matrix and therefore reduces data movement; and (3) it provides an opportunity for increased parallelism across computations, e.g., on distinct floating-point units. MpSpMV's decision to lower values to single precision is data-driven, based on the individual nonzero values of the sparse matrix. On large representative matrices, we obtain an average speedup of 2.35× over double precision, while maintaining two or more additional significant digits of accuracy compared to single precision.
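The core idea of a value-driven precision split can be sketched as follows. The magnitude threshold and the COO storage are simplifying assumptions for illustration; they are not MpSpMV's actual data-driven rule or storage format:

```python
import numpy as np

def mixed_precision_spmv(rows, cols, vals, x, threshold=1.0):
    """Illustrative sketch of value-driven mixed-precision SpMV.

    Nonzeros whose magnitude falls below `threshold` are stored and
    multiplied in single precision; the rest stay in double precision.
    The two partitions could be processed by distinct floating-point
    units in parallel on a real GPU."""
    rows, cols = np.asarray(rows), np.asarray(cols)
    vals = np.asarray(vals, dtype=np.float64)
    lo = np.abs(vals) < threshold          # candidates for float32
    y = np.zeros(rows.max() + 1, dtype=np.float64)

    # Double-precision partition.
    for r, c, v in zip(rows[~lo], cols[~lo], vals[~lo]):
        y[r] += v * x[c]

    # Single-precision partition: values and products in float32.
    x32 = np.asarray(x, dtype=np.float32)
    for r, c, v in zip(rows[lo], cols[lo], vals[lo].astype(np.float32)):
        y[r] += np.float32(v * x32[c])

    return y
```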
DNN models are now deployed in the cloud, on mobile devices, or even with coordinated mobile-cloud processing, making it challenging to select an optimal deployment strategy. This paper proposes a DNN tuning framework, \dnntune, which provides layer-wise behavior analysis across a range of platforms. We select 10 representative DNN models and three mobile devices to characterize the models on these devices and to assist users in finding opportunities for coordinated mobile-cloud computing. Experimental results demonstrate that \dnntune can find a coordinated deployment achieving up to 20\% speedup and 15\% energy saving compared with mobile-only deployment.
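How layer-wise characterization can drive a mobile-cloud split decision is sketched below. The cost model is a toy: it considers only per-layer latencies and one intermediate-tensor transfer, and ignores energy and input-upload cost, whereas \dnntune's actual characterization is far richer:

```python
def best_split(mobile_ms, cloud_ms, transfer_ms):
    """Pick the layer index k at which to hand off from mobile to cloud.

    Layers [0, k) run on the mobile device, layers [k, n) in the cloud,
    plus one transfer of layer k-1's output (input-upload cost for the
    all-cloud case k = 0 is ignored for simplicity). Returns
    (total_latency_ms, k)."""
    n = len(mobile_ms)
    candidates = []
    for k in range(n + 1):                 # k = 0: all-cloud, k = n: mobile-only
        cost = sum(mobile_ms[:k]) + sum(cloud_ms[k:])
        if 0 < k < n:
            cost += transfer_ms[k - 1]     # ship layer k-1's output to the cloud
        candidates.append((cost, k))
    return min(candidates)
```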
Despite key technological advancements in racetrack memories (RMs) over the last decade, the access latency and energy consumption of an RM-based system are still strongly influenced by the number of shift operations. This paper presents data-placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime, thereby minimizing the number of shifts. We present an ILP formulation of data placement in RMs and revisit existing heuristics. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search, reducing the number of shifts by up to 52.5% and outperforming the state of the art by up to 16.1%.
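The optimization problem can be sketched on a toy single-track model, where the shift count is the distance the track must move between consecutive accesses. The exhaustive search below stands in for the paper's ILP formulation and genetic search, which handle realistic RM organizations:

```python
from itertools import permutations

def shift_cost(placement, trace):
    """Number of shifts = sum of distances between consecutively accessed
    locations on the track (simplified single-track, single-port model)."""
    pos = {var: i for i, var in enumerate(placement)}
    return sum(abs(pos[a] - pos[b]) for a, b in zip(trace, trace[1:]))

def best_placement(variables, trace):
    """Exhaustive search over placements; only feasible for toy inputs,
    which is why heuristics and genetic search are needed in practice."""
    return min(permutations(variables), key=lambda p: shift_cost(p, trace))
```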
In this paper, we present compiler support that automatically inserts the instructions needed to achieve NVMM data persistence in kernels, driven by a simple programmer directive. We make only the data resulting from the computations durable, and specific parameters are logged periodically to track a safe restart point. Data transfer operations are reduced because we persist data at cache-line granularity instead of persisting individual data elements. Our compiler support is implemented in the LLVM toolchain and introduces the necessary modifications to loop-intensive computational kernels to enforce data persistence. Experiments show that the performance overheads of our compiler support are insignificant.
Low-latency Remote Direct Memory Access (RDMA) fabrics such as InfiniBand and Converged Ethernet are attractive as potential replacements for TCP in data centers (DCs) with latency-critical workloads. Unfortunately, InfiniBand's existing primitives ('verbs') are limited in ways that impose serious penalties (e.g., memory wastage, significant programmer burden, or performance loss) for typical DC traffic. We propose Remote Indirect Memory Access (RIMA), which avoids these pitfalls by providing network interface card (NIC) microarchitectural support for novel queue-based verbs.
In-silico brain simulations are the de facto tools through which computational neuroscientists seek to understand brain-function dynamics. Current brain simulators do not scale efficiently to large problem sizes. The goal of this work is to explore true multi-GPU acceleration through NVIDIA's GPUDirect technology on the extended Hodgkin-Huxley model of the inferior-olivary nucleus and to assess its scalability. Not only the simulation of the cells but also the setup of the network is taken into account. Simulated network sizes range from 65K to 4M cells with 10, 100, and 1,000 synapses/neuron, executed on 8, 16, 32, and 48 GPUs.
The polyhedral model is used in production compilers. Nevertheless, only a very restricted class of applications can benefit from it. Recent proposals have investigated how runtime information could be used to apply polyhedral optimization to applications that do not statically fit the model. We go one step further in that direction. We propose a folding-based analysis that builds a compact polyhedral representation from the output of an instrumented program execution. It accurately detects affine dependencies, fixed-stride memory accesses, and induction variables in programs. It scales to real-life applications, which often include some non-affine dependencies and accesses in otherwise affine code.
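The simplest instance of such trace-based pattern recovery is detecting a fixed stride in a one-dimensional address stream. The function below is a toy stand-in; the folding-based analysis of the paper handles multi-dimensional, nested, and partially non-affine patterns:

```python
def detect_affine_1d(addresses):
    """Check whether an address trace follows the 1-D affine pattern
    addr(i) = base + stride * i, returning (base, stride) if so,
    else None. Illustrative sketch only."""
    if len(addresses) < 2:
        return None
    base, stride = addresses[0], addresses[1] - addresses[0]
    if all(addresses[i] == base + stride * i for i in range(len(addresses))):
        return (base, stride)
    return None
```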
We present FailAmp, a novel LLVM program transformation that makes array-manipulating programs more robust against soft errors by guarding their array index calculations. Without FailAmp, an offset-calculation error can go undetected; with FailAmp, all subsequent offset calculations are relativized, each building on the previous one, so a single faulty calculation propagates and is reliably detected. FailAmp can exploit ISAs such as ARM to further reduce overheads. We present a thorough evaluation of FailAmp on a large collection of HPC benchmarks under a fault-injection campaign. FailAmp provides full soft-error detection for address calculation while incurring an average overhead of only 5%.
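The relativization idea can be sketched as follows. This is a simplified model written at the source level; the actual transformation operates on LLVM IR address computations:

```python
def relativized_offsets(strides):
    """FailAmp-style relativization, simplified sketch.

    Instead of recomputing each offset independently from the base,
    every offset is derived from the previous one. A soft error in any
    single calculation then corrupts all later offsets, so one cheap
    end-of-loop check catches it; with independent offsets the same
    error would stay local and could escape detection."""
    offsets, current = [], 0
    for s in strides:
        current += s            # each offset builds on the previous one
        offsets.append(current)
    # Detection check: the final chained offset must equal the
    # independently computed total.
    assert offsets[-1] == sum(strides), "soft error detected"
    return offsets
```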
Integrated heterogeneous microprocessors provide fast CPU-GPU communication and 'in-place' computation, permitting finer-grained GPU parallelism and use of the GPU in more complex and irregular codes. This paper proposes exploiting nested parallelism, a common OpenMP paradigm in which SIMD loop(s) lie underneath an outer MIMD loop. Scheduling the MIMD loop on multiple CPU cores allows multiple instances of the inner SIMD loop(s) to be scheduled on the GPU, boosting GPU utilization and parallelizing non-SIMD code. Our results on simulated and physical machines show that exploiting nested MIMD-SIMD parallelism speeds up the next-best parallelization scheme per benchmark by 1.59× and 1.25×, respectively.
We revisit overlapped tiling, recasting it as an affine transformation on schedule trees composable with any affine scheduling algorithm. We demonstrate how to derive tighter tile shapes with fewer redundant computations. Our method models the traditional 'scalene trapezoid' shapes as well as original 'right-rectangle' variants. It goes beyond the state of the art by avoiding both the restriction to a domain-specific language and the need for post-pass rescheduling and custom code generation. We conduct experiments on the PolyMage benchmarks and representative iterated stencils, validating the effectiveness and general applicability of our technique on both general-purpose multicores and GPU accelerators.
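The redundancy that tile-shape derivation tries to minimize shows up even in a one-dimensional example. The sketch below applies overlapped tiling to a 3-point stencil: each tile is widened by a halo that neighboring tiles recompute redundantly, letting several time steps run per tile without inter-tile communication. The specific tile shapes (trapezoid vs. right-rectangle) of the paper are not modeled here:

```python
import numpy as np

def overlapped_tiled_blur(a, tile=4, steps=2):
    """Run `steps` iterations of a 3-point averaging stencil using
    overlapped tiling (illustrative 1-D sketch; boundary cells are held
    fixed). Each tile is extended by `steps` halo cells per side so all
    time steps complete locally; halo cells are computed redundantly."""
    n = len(a)
    src = np.asarray(a, dtype=float)
    result = np.empty(n)
    for start in range(0, n, tile):
        end = min(n, start + tile)
        lo = max(0, start - steps)         # left halo
        hi = min(n, start + tile + steps)  # right halo
        buf = src[lo:hi].copy()
        for _ in range(steps):
            nxt = buf.copy()
            nxt[1:-1] = (buf[:-2] + buf[1:-1] + buf[2:]) / 3.0
            buf = nxt
        result[start:end] = buf[start - lo:end - lo]  # keep owned cells only
    return result
```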
Declarative Loop Tactics for Domain-Specific Optimization
The slowdown in technology scaling puts architectural features at the forefront of innovation in modern processors. This paper presents the Metric-Guided Method (MGM), a new structured approach to the analysis of architectural enhancements. We evaluate MGM through two case studies, at the microarchitecture and Instruction Set Architecture (ISA) levels. Our results show that simple optimizations, such as improved representation of CISC instructions, broadly improve performance, while changes in the floating-point execution units had mixed impact. The paper also contributes a set of specially designed micro-benchmarks that isolate features of the Skylake processor that were influential on the SPEC CPU benchmarks.