We identify a novel fine-grained microarchitectural timing channel in the GPU's Shared Memory. By exploiting the timing channel caused by Shared Memory bank conflicts, we develop a differential timing attack that can compromise table-based cryptographic algorithms, e.g., AES. We evaluate our attack method by attacking an implementation of the AES encryption algorithm that fully occupies the compute resources of the GPU. We extend our timing analysis to the Pascal architecture. We also discuss countermeasures and experiment with a novel multi-key implementation, quantifying its resistance to our side-channel timing attack.
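As a rough illustration of the leakage mechanism (my own sketch, not the paper's attack code): a warp's table-lookup addresses map onto shared-memory banks, and the access is serialized into as many transactions as the worst-case bank collision, so access time depends on the (key-dependent) indices. The bank parameters match NVIDIA's documented layout; the `conflict_degree` helper is an assumption, and the model deliberately ignores the hardware's same-word broadcast special case.

```c
#include <stdint.h>

#define WARP_SIZE 32
#define NUM_BANKS 32   /* NVIDIA shared memory: 32 banks of 4-byte words */

/* Conflict degree of one warp-wide shared-memory access: the maximum
 * number of threads whose 4-byte-word addresses fall into the same
 * bank. The access is serialized into that many transactions, so
 * latency grows with the degree -- the timing signal the attack uses.
 * (Simplified model: the broadcast case is not handled.) */
int conflict_degree(const uint32_t byte_addr[WARP_SIZE]) {
    int count[NUM_BANKS] = {0};
    int max = 0;
    for (int t = 0; t < WARP_SIZE; ++t) {
        int bank = (int)((byte_addr[t] / 4) % NUM_BANKS);
        if (++count[bank] > max) max = count[bank];
    }
    return max;
}
```

A unit-stride access (thread t reads word t) is conflict-free, while a stride-2 access doubles the latency; table-based AES produces data-dependent patterns in between.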
Tiling Optimizations for Stencil Computations Using Rewrite Rules in Lift
Bitstream computing (BC) enables these complex robotic algorithms under a strict power budget. Yet, BC exposes complex design decisions. To address these challenges, we propose compiler extensions to BitSAD, a DSL for BC. Our work enables bit-level software emulation of hardware units, implements automated generation of synthesizable hardware from a program, highlights potential optimizations, and proposes compiler phases that implement them in a hardware-aware manner. Finally, we introduce population coding, a parallelization scheme for stochastic computing that decreases latency without sacrificing accuracy. This work is a series of analyses and experiments on BC designs that inform compiler extensions to BitSAD.
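A hedged sketch of the stochastic-computing primitive underlying BC (my own illustration, not BitSAD code): a value in [0,1] is encoded as the density of 1s in a bitstream, and multiplication of uncorrelated streams reduces to a bitwise AND. Population coding, as described above, would split such a stream across parallel lanes so the same number of bits is consumed in fewer cycles.

```c
#include <stddef.h>

/* Stochastic-computing multiply: values in [0,1] are encoded as the
 * fraction of 1s in a bitstream; for uncorrelated streams, a bitwise
 * AND yields a stream whose density is the product of the inputs.
 * The example streams below are chosen so the count is exact. */
double sc_multiply(const unsigned char *a, const unsigned char *b, size_t n) {
    size_t ones = 0;
    for (size_t i = 0; i < n; ++i)
        ones += (size_t)(a[i] & b[i]);   /* one AND gate per bit */
    return (double)ones / (double)n;
}
```

For example, multiplying two streams that each encode 0.5 yields a stream encoding 0.25, at a hardware cost of a single AND gate per lane.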
We optimize Sparse Matrix-Vector multiplication (SpMV) using a mixed-precision strategy, MpSpMV, on an Nvidia V100 GPU. The approach has three benefits: (1) it reduces computation time; (2) it reduces the size of the input matrix, and therefore data movement; and (3) it provides an opportunity for increased parallelism across computations, e.g., distinct floating-point units. MpSpMV's decision to lower to single precision is data-driven, based on the individual nonzero values of the sparse matrix. On large representative matrices, we obtain an average speedup of 2.35X over double precision, while maintaining two or more additional significant digits of accuracy compared to single precision.
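A minimal CPU-side sketch of the per-nonzero precision split (the function name, the CSR layout, and the magnitude threshold here are my assumptions for illustration; the paper's actual lowering criterion is its own):

```c
#include <math.h>
#include <stddef.h>

/* Mixed-precision CSR SpMV sketch: nonzeros whose magnitude falls
 * below a (hypothetical) threshold are multiplied in single
 * precision, the rest in double; the partial sums are combined in
 * double. The split is data-driven, decided per nonzero value. */
void mp_spmv(int n, const int *rowptr, const int *col,
             const double *val, const double *x, double *y,
             double threshold) {
    for (int i = 0; i < n; ++i) {
        double sum_d = 0.0;   /* double-precision partial sum */
        float  sum_s = 0.0f;  /* single-precision partial sum */
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
            double v = val[k];
            if (fabs(v) < threshold)
                sum_s += (float)v * (float)x[col[k]];
            else
                sum_d += v * x[col[k]];
        }
        y[i] = sum_d + (double)sum_s;   /* combine in double */
    }
}
```

On a GPU, the two partitions could additionally be stored as separate float and double arrays, which is where the data-movement and parallelism benefits come from.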
DNN models are now being deployed in the cloud, on mobile devices, or even via coordinated mobile-cloud processing, making it a big challenge to select an optimal deployment strategy. This paper proposes a DNN tuning framework, \dnntune, which can provide layer-wise behavior analysis across a number of platforms. This paper selects 10 representative DNN models and three mobile devices to characterize the DNN models on these devices, and further assists users in finding opportunities for coordinated mobile-cloud computing. Experimental results demonstrate that \dnntune can find a coordinated deployment achieving up to 20\% speedup and 15\% energy saving compared with mobile-only deployment.
Dynamic scheduling and dynamic creation of the pipeline structure are crucial for the efficient execution of pipelined programs. However, dynamic systems imply higher overhead than static systems, so chunking is used to group activities together and decrease the synchronization and scheduling overhead. We present a chunking algorithm for dynamic systems that handles dynamic linear pipelines, which allow the number and duration of stages to be determined at runtime. The evaluation on 44 cores shows that dynamic chunking brings the overhead of a dynamic system down to that of an efficient static system. Therefore, dynamic chunking enables efficient and scalable execution of fine-grained workloads.
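To make the overhead argument concrete, here is a toy model of chunked stage execution (my own sketch, not the paper's algorithm): handing a stage `chunk` items per scheduling decision cuts the number of scheduler/synchronization events from n to ceil(n/chunk) while leaving the computed result unchanged.

```c
#include <stddef.h>

/* Chunked stage execution sketch: the runtime hands a stage `chunk`
 * items per scheduling decision instead of one. Returns the number of
 * scheduling events; *sum accumulates the stage's work (here: summing
 * the items) so correctness can be checked independently of chunking. */
size_t run_stage_chunked(const int *items, size_t n, size_t chunk, long *sum) {
    size_t events = 0;
    *sum = 0;
    for (size_t i = 0; i < n; i += chunk) {
        ++events;                           /* one scheduler interaction */
        size_t end = (i + chunk < n) ? i + chunk : n;
        for (size_t j = i; j < end; ++j)
            *sum += items[j];               /* fine-grained activities */
    }
    return events;
}
```

The dynamic part of the actual algorithm lies in choosing `chunk` at runtime, since stage count and duration are only known then.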
In this paper, we present compiler support that automatically inserts the necessary instructions into kernels to achieve NVMM data persistence, based on a simple programmer directive. We make only the data resulting from the computations durable. Specific parameters are also logged periodically to track a safe restart point. Data transfer operations are reduced because we operate at cache-line granularity instead of on individual data elements. Our compiler support is implemented in the LLVM tool-chain and introduces the necessary modifications into loop-intensive computational kernels to force data persistence. The experiments show that the performance overhead of our proposed compiler support is insignificant.
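The cache-line-granularity saving can be illustrated with simple arithmetic (an illustration of the idea, not the paper's LLVM pass; the helper name and 64-byte line size are assumptions): persisting a buffer needs one flush per cache line it touches, rather than one per element.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Number of flush operations needed to persist a buffer when flushing
 * whole cache lines: every line the buffer touches is flushed exactly
 * once, instead of issuing one flush per data element. */
size_t flushes_per_line(uintptr_t addr, size_t nbytes) {
    if (nbytes == 0) return 0;
    uintptr_t first = addr / CACHE_LINE;
    uintptr_t last  = (addr + nbytes - 1) / CACHE_LINE;
    return (size_t)(last - first + 1);
}
```

For 1024 contiguous doubles (8192 bytes), per-element flushing would issue 1024 operations, while per-line flushing issues only 128.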
In-silico brain simulations are the de-facto tools through which computational neuroscientists seek to understand brain-function dynamics. Current brain simulators do not scale efficiently to large problem sizes. The goal of this work is to explore the use of true multi-GPU acceleration through NVIDIA's GPUDirect technology on the extended Hodgkin-Huxley model of the inferior-olivary nucleus, and to assess its scalability. Not only the simulation of the cells but also the setup of the network is taken into account. Simulated network sizes range from 65K to 4M cells with 10, 100, and 1,000 synapses/neuron, executed on 8, 16, 32, and 48 GPUs.
The polyhedral model is used in production compilers. Nevertheless, only a very restricted class of applications can benefit from it. Recent proposals investigated how runtime information could be used to apply polyhedral optimization to applications that do not statically fit the model. We go one step further in that direction. We propose a folding-based analysis that, from the output of an instrumented program execution, builds a compact polyhedral representation. It is able to accurately detect affine dependencies, fixed-stride memory accesses, and induction variables in programs. It scales to real-life applications, which often include some non-affine dependencies and accesses in otherwise affine code.
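The simplest case of such trace-based detection can be sketched as follows (my own illustration of the principle, not the paper's folding algorithm): an access whose addresses advance by a constant delta across iterations is affine in the loop counter, i.e., addr = base + i * stride.

```c
#include <stddef.h>
#include <stdint.h>

/* Trace-based affinity check: returns 1 and stores the stride if the
 * instrumented address trace of one access is fixed-stride across all
 * iterations (hence affine in the loop counter), 0 otherwise. */
int detect_fixed_stride(const uintptr_t *trace, size_t n, intptr_t *stride) {
    if (n < 2) return 0;                       /* too short to classify */
    intptr_t d = (intptr_t)(trace[1] - trace[0]);
    for (size_t i = 2; i < n; ++i)
        if ((intptr_t)(trace[i] - trace[i - 1]) != d)
            return 0;                          /* non-affine access */
    *stride = d;
    return 1;
}
```

The compact "folded" representation in the paper goes well beyond this, handling multi-dimensional loops and isolating the non-affine parts rather than rejecting the whole trace.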
We present FailAmp, a novel LLVM program transformation algorithm that makes a program manipulating arrays more robust against soft errors by guarding its array index calculations. Without FailAmp, an offset-calculation error can go undetected; with FailAmp, all subsequent offset calculations are relativized, each building on the previous one, so that the faulty one is fully detected. FailAmp can exploit ISAs such as ARM to further reduce overheads. We present a thorough evaluation of FailAmp applied to a large collection of HPC benchmarks under a fault-injection campaign. FailAmp provides full soft-error detection for address calculation while incurring an average overhead of only 5%.
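A toy model of why relativization helps (my own sketch, not FailAmp's actual LLVM transformation): when each offset is derived from the previous one, a corrupted calculation shifts every later offset, so a single end-of-loop invariant check catches the fault instead of it silently affecting one access.

```c
#include <stddef.h>

/* Relativized index walk with an optional injected fault. Each offset
 * builds on the previous one, so a bit flipped at step `inject_at`
 * propagates into all later offsets and the final invariant check.
 * Returns 1 if the final offset matches the closed form (no fault
 * detected), 0 if the amplified error fails the check. */
int walk_relativized(size_t n, size_t stride, size_t inject_at, size_t flip) {
    size_t off = 0;
    for (size_t i = 0; i < n; ++i) {
        off += stride;            /* relativized: builds on prior offset */
        if (i == inject_at)
            off ^= flip;          /* simulated soft error in the index */
    }
    return off == n * stride;     /* invariant: detects any surviving error */
}
```

With independent (non-relativized) offset recomputation, the same single-step fault would corrupt only one access and leave such a final check satisfied.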
Integrated heterogeneous microprocessors provide fast CPU-GPU communication and "in-place" computation, permitting finer granularity GPU parallelism, and use of the GPU in more complex and irregular codes. This paper proposes exploiting nested parallelism, a common OpenMP program paradigm wherein SIMD loop(s) lie underneath an outer MIMD loop. Scheduling the MIMD loop on multiple CPU cores allows multiple instances of their inner SIMD loop(s) to be scheduled on the GPU, boosting GPU utilization, and parallelizing non-SIMD code. Our results on simulated and physical machines show exploiting nested MIMD-SIMD parallelism speeds up the next-best parallelization scheme per benchmark by 1.59x and 1.25x, respectively.
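The nested MIMD-SIMD shape can be sketched in plain OpenMP (my own example, not the paper's code; here both levels run on the CPU, whereas the paper's scheme would offload each inner SIMD loop instance to the integrated GPU):

```c
#include <stddef.h>

/* Nested MIMD-SIMD sketch: the outer loop's iterations are independent
 * (MIMD, one per CPU core), and each body contains a data-parallel
 * inner loop (SIMD) -- the candidate for GPU offload in the paper's
 * scheme. Computes row sums of an m x n row-major matrix. Without
 * OpenMP the pragmas are ignored and the code runs sequentially. */
void row_sums(const double *a, size_t m, size_t n, double *out) {
    #pragma omp parallel for             /* MIMD: rows across CPU cores */
    for (long i = 0; i < (long)m; ++i) {
        double s = 0.0;
        #pragma omp simd reduction(+:s)  /* SIMD: inner data parallelism */
        for (size_t j = 0; j < n; ++j)
            s += a[(size_t)i * n + j];
        out[i] = s;
    }
}
```

Running several outer iterations concurrently gives the GPU several inner loops to execute at once, which is where the utilization gain comes from.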
We revisit overlapped tiling, recasting it as an affine transformation on schedule trees composable with any affine scheduling algorithm. We demonstrate how to derive tighter tile shapes with less redundant computation. Our method models the traditional "scalene trapezoid" shapes as well as original "right-rectangle" variants. It goes beyond the state of the art by avoiding restriction to a domain-specific language and the introduction of post-pass rescheduling and custom code generation. We conduct experiments on the PolyMage benchmarks and representative iterated stencils, validating the effectiveness and general applicability of our technique on both general-purpose multicores and GPU accelerators.
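For readers unfamiliar with overlapped tiling, here is a minimal 1D sketch (my own illustration, assuming a 3-point Jacobi stencil and n <= 1024; it is not the schedule-tree formulation of the paper): each tile redundantly recomputes a one-cell halo of the intermediate time step, so tiles need no synchronization between the two steps and could each be one GPU thread block.

```c
#include <stddef.h>

/* One Jacobi step of a 3-point stencil on indices [lo, hi), reading
 * in[lo-1 .. hi]; boundary cells are copied through unchanged. */
static void step(const double *in, double *out, size_t n,
                 size_t lo, size_t hi) {
    for (size_t i = lo; i < hi; ++i) {
        if (i == 0 || i == n - 1) out[i] = in[i];          /* boundary */
        else out[i] = (in[i-1] + in[i] + in[i+1]) / 3.0;
    }
}

/* Overlapped tiling for two time steps: the intermediate step is
 * computed on a range widened by one halo cell per side, duplicating
 * work between neighboring tiles but making every tile independent. */
void stencil_2steps_tiled(const double *in, double *out, size_t n,
                          size_t tile) {
    for (size_t t0 = 0; t0 < n; t0 += tile) {
        size_t t1 = (t0 + tile < n) ? t0 + tile : n;
        size_t m0 = (t0 >= 1) ? t0 - 1 : 0;        /* widen by halo */
        size_t m1 = (t1 + 1 <= n) ? t1 + 1 : n;
        double tmp[1024];                  /* per-tile scratch, n <= 1024 */
        step(in, tmp, n, m0, m1);          /* redundant halo computation */
        step(tmp, out, n, t0, t1);         /* final values, no overlap */
    }
}
```

The paper's contribution is precisely to derive such shapes (and tighter ones) automatically, as affine schedule transformations rather than hand-written tile loops.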
The slowdown in technology scaling puts architectural features at the forefront of innovation in modern processors. This paper presents the Metric-Guided Method (MGM), a new structured approach to the analysis of architectural enhancements. We evaluate MGM through two case studies, at the microarchitecture and the Instruction Set Architecture (ISA) levels. Our evaluation results show that simple optimizations, such as improved representation of CISC instructions, broadly improve performance, while changes in the floating-point execution units had mixed impact. The paper also contributes a set of specially designed micro-benchmarks that isolate features of the Skylake processor that were influential on SPEC CPU benchmarks.