We identify a novel fine-grained microarchitectural timing channel in the GPU's shared memory. Exploiting the timing variation caused by shared-memory bank conflicts, we develop a differential timing attack that can compromise table-based cryptographic algorithms such as AES. We evaluate the attack against an AES implementation that fully occupies the compute resources of the GPU, and we extend our timing analysis to the Pascal architecture. We also discuss countermeasures and experiment with a novel multi-key implementation, quantifying its resistance to our side-channel timing attack.
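To make the channel concrete, the following Python sketch models how bank conflicts stretch a warp's table lookup; the 32-bank, 4-byte-word layout and the serialization rule match common NVIDIA shared memory, but the code is an illustration of the conflict model, not the paper's attack.

    # Sketch of the conflict model (assumptions: 32 banks, 4-byte words,
    # a 256-entry 4-byte T-table, one warp of 32 threads).
    NUM_BANKS = 32

    def bank(index):
        # Each 4-byte T-table entry occupies one bank slot.
        return index % NUM_BANKS

    def warp_cycles(indices):
        # Same-bank accesses to distinct addresses serialize: the warp's
        # shared-memory load costs as much as its most-conflicted bank.
        per_bank = {}
        for i in indices:
            per_bank.setdefault(bank(i), set()).add(i)
        return max(len(addrs) for addrs in per_bank.values())

    # A differential attack correlates key-dependent *predicted* conflict
    # counts with *measured* encryption times to rank key-byte guesses.
    print(warp_cycles(range(32)))                    # conflict-free: 1 cycle
    print(warp_cycles([t * 32 for t in range(32)]))  # all in bank 0: 32 cycles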
Bitstream computing (BC) makes such complex robotic algorithms feasible under a strict power budget. Yet BC exposes complex design decisions to the programmer. To address these challenges, we propose compiler extensions to BitSAD, a DSL for BC. Our work enables bit-level software emulation of hardware units, automates the generation of synthesizable hardware from a program, highlights potential optimizations, and proposes compiler phases that implement them in a hardware-aware manner. Finally, we introduce population coding, a parallelization scheme for stochastic computing that decreases latency without sacrificing accuracy. Taken together, this work is a series of analyses and experiments on BC designs that inform compiler extensions to BitSAD.
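As a rough illustration of the stochastic-computing substrate and of the population idea, consider the Python sketch below; the Bernoulli encoding and AND-based multiply are textbook stochastic computing, and the lane-splitting scheme is our reading of population coding, not BitSAD's actual API.

    # Stochastic-computing multiply plus a population-style parallel decode.
    import random

    def bitstream(value, length, rng):
        # Encode value in [0, 1] as a Bernoulli stream: P(bit = 1) = value.
        return [1 if rng.random() < value else 0 for _ in range(length)]

    def multiply(a_bits, b_bits):
        # For independent streams, bitwise AND multiplies the probabilities.
        return [a & b for a, b in zip(a_bits, b_bits)]

    def multiply_population(a, b, lanes, bits_per_lane, rng):
        # Spread the stream over `lanes` short parallel streams: latency drops
        # by ~lanes while the total bit budget (hence accuracy) is preserved.
        bits = []
        for _ in range(lanes):
            bits += multiply(bitstream(a, bits_per_lane, rng),
                             bitstream(b, bits_per_lane, rng))
        return sum(bits) / len(bits)

    rng = random.Random(0)
    print(multiply_population(0.5, 0.5, lanes=8, bits_per_lane=128, rng=rng))  # ~0.25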
DNN models are now deployed in the cloud, on mobile devices, or via coordinated mobile-cloud processing, making the selection of an optimal deployment strategy a significant challenge. This paper proposes \dnntune, a DNN tuning framework that provides layer-wise behavior analysis across platforms. We select 10 representative DNN models and three mobile devices, characterize the models on these devices, and use the results to help users find opportunities for coordinated mobile-cloud computing. Experimental results demonstrate that \dnntune can find a coordinated deployment achieving up to 20\% speedup and 15\% energy saving compared with mobile-only deployment.
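A toy version of the deployment decision that such layer-wise characterization enables might look like the following Python sketch; the latency numbers, the upload cost model, and the exhaustive split search are illustrative assumptions, not \dnntune's interface.

    # Choose a split point k: layers [0, k) run on the phone, layer k's
    # activations are uploaded, layers [k, n) run in the cloud (k = n is
    # mobile-only). All numbers below are made up for illustration.
    def best_split(mobile_ms, cloud_ms, upload_ms):
        n = len(mobile_ms)
        cost = {}
        for k in range(n + 1):
            c = sum(mobile_ms[:k])
            if k < n:
                c += upload_ms[k] + sum(cloud_ms[k:])
            cost[k] = c
        return min(cost, key=cost.get), cost

    mobile = [4.0, 9.0, 9.0, 2.0]   # per-layer latency on the device (ms)
    cloud  = [0.5, 1.0, 1.0, 0.3]   # per-layer latency in the cloud (ms)
    upload = [8.0, 3.0, 1.5, 1.0]   # activation upload cost before layer k (ms)
    print(best_split(mobile, cloud, upload))  # splitting after layer 0 wins here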
Dynamic scheduling and dynamic creation of the pipeline structure are crucial for the efficient execution of pipelined programs. However, dynamic systems incur higher overhead than static systems, so chunking is used to group activities and reduce synchronization and scheduling overhead. We present a chunking algorithm for dynamic systems that handles dynamic linear pipelines, in which the number and duration of stages are determined at runtime. Our evaluation on 44 cores shows that dynamic chunking brings the overhead of a dynamic system down to that of an efficient static system, enabling efficient and scalable execution of fine-grained workloads.
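The Python sketch below shows the shape of chunked hand-off between dynamically created stages; the fixed chunk size and the queue-based stage workers are simplifying assumptions standing in for the paper's adaptive chunking algorithm.

    # Stages pass CHUNK items per queue operation, amortizing synchronization.
    import queue, threading

    CHUNK = 64

    def stage(inq, outq, fn):
        while True:
            chunk = inq.get()            # one synchronization per CHUNK items
            if chunk is None:            # poison pill: propagate and exit
                if outq is not None:
                    outq.put(None)
                return
            out = [fn(x) for x in chunk]
            if outq is not None:
                outq.put(out)

    q1, q2, results = queue.Queue(), queue.Queue(), []
    threading.Thread(target=stage, args=(q1, q2, lambda x: x + 1)).start()
    sink = threading.Thread(target=stage, args=(q2, None, results.append))
    sink.start()

    items = list(range(256))
    for i in range(0, len(items), CHUNK):
        q1.put(items[i:i + CHUNK])
    q1.put(None)
    sink.join()
    print(len(results))  # 256 items flowed through in 4 chunked hand-offs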
The polyhedral model is used in production compilers. Nevertheless, only a very restricted class of applications can benefit from it. Recent proposals have investigated how runtime information could be used to apply polyhedral optimization to applications that do not statically fit the model. We go one step further in that direction. We propose a folding-based analysis that, from the output of an instrumented program execution, builds a compact polyhedral representation. It accurately detects affine dependencies, fixed-stride memory accesses, and induction variables in programs. It scales to real-life applications, which often include some non-affine dependencies and accesses in otherwise affine code.
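As a simplified illustration of such a compaction step, the Python sketch below folds a raw address trace into (base, stride, count) runs; this greedy one-dimensional folding is far simpler than the actual analysis, which handles multi-dimensional affine structure.

    def fold(trace):
        # Greedily fold a flat address trace into (base, stride, count) runs.
        runs, i = [], 0
        while i < len(trace):
            base, stride, count = trace[i], 0, 1
            while i + count < len(trace):
                d = trace[i + count] - trace[i + count - 1]
                if count == 1:
                    stride = d
                elif d != stride:
                    break
                count += 1
            runs.append((base, stride, count))
            i += count
        return runs

    # A stride-4 sweep, one irregular access, then a stride-8 sweep:
    print(fold([0, 4, 8, 12, 100, 200, 208, 216]))
    # -> [(0, 4, 4), (100, 100, 2), (208, 8, 2)]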
Integrated heterogeneous microprocessors provide fast CPU-GPU communication and "in-place" computation, permitting finer-grained GPU parallelism and the use of the GPU in more complex and irregular codes. This paper proposes exploiting nested parallelism, a common OpenMP program paradigm in which SIMD loop(s) lie beneath an outer MIMD loop. Scheduling the MIMD loop on multiple CPU cores allows multiple instances of its inner SIMD loop(s) to be scheduled on the GPU, boosting GPU utilization and parallelizing non-SIMD code. Our results on simulated and physical machines show that exploiting nested MIMD-SIMD parallelism speeds up the next-best parallelization scheme per benchmark by 1.59x and 1.25x, respectively.
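The nested MIMD-over-SIMD structure can be sketched in Python as follows; CPU worker threads stand in for the outer MIMD loop and a vectorized numpy operation stands in for a GPU-offloaded inner SIMD loop, which is only an analogy to the paper's CPU-GPU scheduling scheme.

    # Outer MIMD loop on CPU workers; the inner data-parallel loop is one
    # vectorized numpy op, standing in for a GPU kernel launch.
    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    def outer_iteration(i):
        # Irregular (MIMD) per-iteration work sets up the inner problem...
        x = np.arange(1000.0) + i
        # ...whose regular inner loop is offloaded as one SIMD-style op.
        return float((x * x).sum())

    with ThreadPoolExecutor(max_workers=4) as pool:  # the MIMD loop
        results = list(pool.map(outer_iteration, range(8)))
    print(results[:2])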