We identify a novel fine-grained microarchitectural timing channel in the GPU's Shared Memory. By considering the timing channel caused by Shared Memory bank conflicts, we have developed a differential timing attack that can compromise table-based cryptographic algorithms, e.g., AES. We evaluate our attack method by attacking an implementation of the AES encryption algorithm that fully occupies the compute resources of the GPU. We extend our timing analysis onto the Pascal architecture. We also discuss countermeasures and experiment with a novel multi-key implementation, quantifying its resistance to our side-channel timing attack.
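As an illustration of the bank-conflict effect that such a timing channel rests on, the following minimal sketch counts how many serialized passes one warp-wide Shared Memory access would need. It assumes the commonly documented NVIDIA layout of 32 banks, each 4 bytes wide, with threads that read the same word served by a single broadcast; the function name and parameters are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def bank_conflict_degree(byte_addresses, num_banks=32, bank_width=4):
    """Return the number of serialized passes for one warp-wide access.

    Assumption: bank index = (address // bank_width) % num_banks, and
    threads touching the same word in a bank do not conflict (broadcast).
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // bank_width
        words_per_bank[word % num_banks].add(word)
    # The slowest bank determines how many passes the access needs.
    return max((len(words) for words in words_per_bank.values()), default=0)


# Consecutive 4-byte words hit 32 different banks: no conflict.
unit_stride = [4 * i for i in range(32)]
# A 128-byte stride maps every thread to bank 0: fully serialized.
bank_stride = [128 * i for i in range(32)]
print(bank_conflict_degree(unit_stride), bank_conflict_degree(bank_stride))
```

Under these assumptions, a table lookup whose index depends on secret data changes the conflict degree, and therefore the access latency, which is the kind of data-dependent timing difference a differential attack can exploit.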
In this paper, we present AT-Com, a scheme to optimize X10 code with place-change operations. AT-Com consists of two inter-related new optimizations: (i) AT-Opt, which minimizes the amount of data serialized and communicated during place-change operations, and (ii) AT-Pruning, which identifies and elides redundant place-change operations and executes place-change operations in parallel. We have implemented AT-Com in the x10v2.6.0 compiler and tested it on the IMSuite benchmark kernels. Compared to the current X10 compiler, the AT-Com optimized code achieved geometric mean speedups of 18.72x and 17.83x on a four-node (32 cores/node) Intel system and a two-node (16 cores/node) AMD system, respectively.
PIMBALL: Binary Neural Networks in Spintronic Memory
Deep learning frameworks automate the deployment and hardware acceleration of models represented as DAGs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When the computation does not match any predefined library call, custom operators must be implemented, often at high engineering cost and performance penalty, limiting the pace of innovation. To address this productivity gap, we propose and evaluate: a DSL with a tensor notation close to the mathematics of deep learning; a JIT optimizing compiler based on the polyhedral framework; carefully coordinated linear optimization and evolutionary algorithms to synthesize high-performance CUDA kernels.
Reduction is an operation performed on the values of two or more key-value pairs that share the same key. Reduction of sparse data streams finds application in a wide variety of domains such as data and graph analytics, cybersecurity, machine learning and HPC applications. However, these applications exhibit low locality of reference, rendering traditional architectures and data representations inefficient. This paper presents MetaStrider, a significant algorithmic and architectural enhancement to the state-of-the-art, SuperStrider. Furthermore, these enhancements enable a variety of parallel, memory-centric architectures that we propose, resulting in demonstrated performance that scales near-linearly with available memory-level parallelism.
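The definition of reduction given above can be sketched in a few lines. This is only a software illustration of the operation itself, assuming addition as the default reduction operator; it is not the MetaStrider or SuperStrider algorithm, and the name reduce_stream is hypothetical.

```python
def reduce_stream(pairs, op=lambda a, b: a + b):
    """Combine the values of all key-value pairs that share a key.

    `pairs` is any iterable of (key, value) tuples; `op` is an assumed
    associative reduction operator (addition by default).
    """
    accumulated = {}
    for key, value in pairs:
        if key in accumulated:
            accumulated[key] = op(accumulated[key], value)
        else:
            accumulated[key] = value
    return accumulated


# Keys arrive in no particular order, as in a sparse data stream.
stream = [("a", 1), ("b", 2), ("a", 3), ("c", 5), ("b", 4)]
print(reduce_stream(stream))
```

Because each incoming key may land anywhere in the accumulator, consecutive updates touch unrelated memory locations, which is the low locality of reference the abstract refers to.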
Common memory prefetchers are designed to target specific memory access patterns, including spatio-temporal locality, recurring patterns, and irregular patterns. In this paper, we propose a conceptual neural network (NN) prefetcher that dynamically adapts to arbitrary access patterns to capture semantic locality. Leveraging recent advances in machine learning, the proposed NN prefetcher correlates program context with memory accesses using online training, which lets it exploit previously undetected access patterns. We present an architectural implementation of our prefetcher and evaluate it over SPEC2006, Graph500, and other kernels, showing it delivers up to 30% speedup over SPEC2006 and up to 4.4x speedup on some kernels.
Dynamic scheduling and dynamic creation of the pipeline structure are crucial for the efficient execution of pipelined programs. However, dynamic systems imply higher overhead than static systems, so chunking groups activities to decrease the synchronization and scheduling overhead. We present a chunking algorithm for dynamic systems that handles dynamic linear pipelines, which allow the number and duration of stages to be determined at runtime. The evaluation on 44 cores shows that dynamic chunking brings the overhead of a dynamic system down to that of an efficient static system. Therefore, dynamic chunking enables efficient and scalable execution of fine-grained workloads.
Layup: Layer-Adaptive and Multi-Type Intermediate-Oriented Memory Optimization for GPU-Based CNNs
The information that compilers have at their disposal is instrumental for obtaining good auto-vectorization optimizations. However, the exact information available at compile-time varies greatly, as does the resulting performance. In this paper, we propose a novel method for evaluating the auto-vectorization capability of compilers by objectively withdrawing and withholding information that would otherwise aid the compiler in the auto-vectorization process. As such, our approach is orthogonal to well-known frameworks such as the Test Suite for Vectorizing Compilers (TSVC), and thus re-aligns compiler evaluations to better reflect real-world conditions.
The polyhedral model is used in production compilers. Nevertheless, only a very restricted class of applications can benefit from it. Recent proposals investigated how runtime information could be used to apply polyhedral optimization on applications that do not statically fit the model. We go one step further in that direction. We propose the folding-based analysis that, from the output of an instrumented program execution, builds a compact polyhedral representation. It is able to accurately detect affine dependencies, fixed-stride memory accesses and induction variables in programs. It scales to real-life applications, which often include some non-affine dependencies and accesses in otherwise affi
Iterative Stencil Loops (ISLs) are the key kernel within a range of compute-intensive applications. To accelerate ISLs with FPGAs, it is critical to exploit parallelism (1) among elements within the same iteration and (2) across loop iterations. We propose a novel ISL acceleration scheme called Direct Computation of Multiple Iterations (DCMI), which improves upon prior work by pre-computing the effective stencil coefficients after a number of iterations at design time, resulting in accelerators that use minimal on-chip memory and avoid redundant computation. This enables DCMI to improve throughput by up to 7.7X compared to the state-of-the-art cone-based architecture.
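The idea of pre-computing effective stencil coefficients can be illustrated for the simplest case. The sketch below assumes a one-dimensional linear stencil with constant coefficients and ignores boundary effects; under those assumptions, applying the stencil n times is equivalent to applying once the n-fold self-convolution of its coefficients. All function names are hypothetical, and this is not the DCMI hardware design.

```python
def convolve(a, b):
    """Full discrete convolution of two coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def effective_coefficients(stencil, iterations):
    """Coefficients of one stencil equivalent to `iterations` applications."""
    effective = [1.0]  # identity stencil
    for _ in range(iterations):
        effective = convolve(effective, stencil)
    return effective


def apply_stencil(grid, coeffs):
    """Apply an odd-length stencil once, treating out-of-range cells as 0."""
    radius = len(coeffs) // 2
    n = len(grid)
    return [
        sum(c * grid[i + k - radius]
            for k, c in enumerate(coeffs)
            if 0 <= i + k - radius < n)
        for i in range(n)
    ]


stencil = [0.25, 0.5, 0.25]
grid = [0.0] * 11
grid[5] = 1.0  # impulse far enough from the edges to avoid boundary effects

two_steps = apply_stencil(apply_stencil(grid, stencil), stencil)
one_shot = apply_stencil(grid, effective_coefficients(stencil, 2))
print(all(abs(x - y) < 1e-12 for x, y in zip(two_steps, one_shot)))
```

The one-shot form reads each input once and needs no intermediate grid, at the cost of a wider stencil. Near the grid boundary the two forms can differ under zero padding, so a real design would have to treat boundaries separately.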