In this paper, we present AT-Com, a scheme to optimize X10 code with place-change operations. AT-Com consists of two new, inter-related optimizations: (i) AT-Opt, which minimizes the amount of data serialized and communicated during place-change operations, and (ii) AT-Pruning, which identifies and elides redundant place-change operations and executes place-change operations in parallel. We have implemented AT-Com in the X10 v2.6.0 compiler and evaluated it on the IMSuite benchmark kernels. Compared to the current X10 compiler, the AT-Com optimized code achieved geometric-mean speedups of 18.72x and 17.83x on a four-node (32 cores/node) Intel system and a two-node (16 cores/node) AMD system, respectively.
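In X10, an `at (p) S` statement serializes every variable captured by the body S and ships it to place p, so capturing more than the body needs inflates communication. The C++ sketch below is a rough illustration of the kind of rewrite AT-Opt performs, not the paper's implementation; `run_at` is a hypothetical stand-in for the place-change, and the runtime's actual serialization machinery is not modeled.

```cpp
#include <string>
#include <vector>

// Hypothetical large object whose 'rows' field is the only data
// the remote computation actually reads.
struct Matrix {
    std::vector<double> rows;   // used at the remote place
    std::string         label;  // unused remotely
    std::vector<double> cache;  // unused remotely
};

// Hypothetical stand-in for a place-change: a real runtime would
// serialize everything the closure captures before shipping it.
template <typename Closure>
void run_at(int place, Closure body) {
    (void)place;  // destination place; irrelevant to this sketch
    body();
}

double sum_naive(const Matrix& m, int place) {
    double s = 0;
    // Captures all of `m`: `label` and `cache` would be serialized
    // and communicated even though the body never touches them.
    run_at(place, [&m, &s] {
        for (double v : m.rows) s += v;
    });
    return s;
}

double sum_optimized(const Matrix& m, int place) {
    double s = 0;
    // AT-Opt-style rewrite: capture only the field that is used,
    // so only `rows` crosses the place boundary.
    const std::vector<double>& rows = m.rows;
    run_at(place, [&rows, &s] {
        for (double v : rows) s += v;
    });
    return s;
}
```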
PIMBALL: Binary Neural Networks in Spintronic Memory
Deep learning frameworks automate the deployment and hardware acceleration of models represented as DAGs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When a computation does not match any predefined library call, custom operators must be implemented, often at high engineering cost and performance penalty, limiting the pace of innovation. To address this productivity gap, we propose and evaluate: (i) a DSL with a tensor notation close to the mathematics of deep learning; (ii) an optimizing JIT compiler based on the polyhedral framework; and (iii) carefully coordinated linear optimization and evolutionary algorithms that synthesize high-performance CUDA kernels.
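To make the tensor notation concrete, the sketch below shows the loop nest that an einsum-style comprehension such as `C(m,n) +=! A(m,k) * B(k,n)` denotes. The comprehension syntax is illustrative of the DSL's flavor rather than its exact grammar, and the C++ loops are the naive lowering, not the polyhedral-optimized CUDA the compiler would emit.

```cpp
#include <vector>

// Naive lowering of an einsum-style comprehension: indices on the
// left-hand side (m, n) become independent loops, indices appearing
// only on the right-hand side (k) become reduction loops, and the
// "+=!" form means the reduction is initialized before accumulating.
void matmul(const std::vector<float>& A, const std::vector<float>& B,
            std::vector<float>& C, int M, int N, int K) {
    for (int m = 0; m < M; ++m)             // free index -> parallel loop
        for (int n = 0; n < N; ++n) {       // free index -> parallel loop
            float acc = 0.0f;               // reduction initialized here
            for (int k = 0; k < K; ++k)     // bound index -> reduction loop
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}
```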
Reduction is an operation performed on the values of two or more key-value pairs that share the same key. Reduction of sparse data streams finds application in a wide variety of domains, such as data and graph analytics, cybersecurity, machine learning, and HPC. However, these applications exhibit low locality of reference, rendering traditional architectures and data representations inefficient. This paper presents MetaStrider, a significant algorithmic and architectural enhancement to the state-of-the-art, SuperStrider. These enhancements enable a variety of parallel, memory-centric architectures that we propose, with demonstrated performance that scales near-linearly with available memory-level parallelism.
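For reference, the operation being accelerated has simple software semantics. The hash-map sketch below computes the same result a hardware reducer would; its scattered, pointer-chasing memory accesses on a sparse key stream are precisely the low locality of reference described above. Summation is used as one example reduction operator.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Software baseline for key-value reduction: merge the values of all
// pairs that share a key. Correct, but each lookup can touch a cold
// cache line when keys are sparse, which is the locality bottleneck
// that memory-centric designs like MetaStrider target.
std::unordered_map<uint64_t, double>
reduce_stream(const std::vector<std::pair<uint64_t, double>>& stream) {
    std::unordered_map<uint64_t, double> acc;
    for (const auto& [key, value] : stream)
        acc[key] += value;  // '+' is one example reduction operator
    return acc;
}
```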
Common memory prefetchers are designed to target specific memory access patterns, including spatio-temporal locality, recurring patterns, and irregular patterns. In this paper, we propose a conceptual neural network (NN) prefetcher that dynamically adapts to arbitrary access patterns to capture semantic locality. Leveraging recent advances in machine learning, the proposed NN prefetcher correlates program context with memory accesses using online training, enabling it to tap into previously undetected access patterns. We present an architectural implementation of our prefetcher and evaluate it on SPEC2006, Graph500, and other kernels, showing speedups of up to 30% on SPEC2006 and up to 4.4x on some kernels.
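The sketch below is a minimal stand-in for the prefetcher's interface, not the paper's neural architecture: it replaces the NN with a plain lookup table that is trained online, and it assumes a simple context feature set (load PC combined with the last observed address delta) chosen here for illustration.

```cpp
#include <cstdint>
#include <unordered_map>

// Context-correlating prefetcher skeleton: on every access it
// (a) trains, by recording the delta that followed the previous
// context, and (b) predicts, by looking up the current context.
// A learned model would replace the table; the interface is the same.
class ContextPrefetcher {
public:
    // Observe one access; returns a predicted prefetch address, or 0.
    uint64_t observe(uint64_t pc, uint64_t addr) {
        // Train: the previous context's outcome is now known.
        int64_t delta = last_addr_ ? static_cast<int64_t>(addr - last_addr_) : 0;
        if (last_addr_ != 0)
            table_[prev_ctx_] = delta;

        // Predict: hash the current context (PC + last delta) and look
        // up the delta that followed it before.
        uint64_t ctx = pc ^ (static_cast<uint64_t>(delta) * 0x9E3779B97F4A7C15ULL);
        auto it = table_.find(ctx);
        uint64_t prefetch = (it != table_.end()) ? addr + it->second : 0;

        prev_ctx_ = ctx;
        last_addr_ = addr;
        return prefetch;
    }

private:
    std::unordered_map<uint64_t, int64_t> table_;  // context -> next delta
    uint64_t prev_ctx_ = 0;
    uint64_t last_addr_ = 0;
};
```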
Layup: Layer-Adaptive and Multi-Type Intermediate-Oriented Memory Optimization for GPU-Based CNNs
The information that compilers have at their disposal is instrumental for obtaining good auto-vectorization results. However, the exact information available at compile time varies greatly, as does the resulting performance. In this paper, we propose a novel method for evaluating the auto-vectorization capability of compilers by objectively withdrawing and withholding information that would otherwise aid the compiler in the auto-vectorization process. As such, our approach is orthogonal to well-known frameworks such as the Test Suite for Vectorizing Compilers (TSVC), and re-aligns compiler evaluations with more realistic, real-world conditions.
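As one concrete illustration of withholding information (assuming, for this sketch, that aliasing and trip-count knowledge are the facts being controlled), the same loop body can be compiled in two variants: one where the compiler can see that the pointers do not alias and the trip count is fixed, and one where neither fact is available, forcing it to prove safety or insert runtime checks. Whether a given compiler still vectorizes the second variant is what such an evaluation would measure.

```cpp
#include <cstddef>

// Full information: non-aliasing pointers (GCC/Clang __restrict__
// spelling) and a compile-time trip count. Easy to vectorize.
void saxpy_known(float* __restrict__ y, const float* __restrict__ x,
                 float a) {
    for (int i = 0; i < 1024; ++i)   // trip count visible
        y[i] += a * x[i];
}

// Information withheld: the pointers may alias and the trip count is
// opaque, so the compiler must add runtime alias checks, emit a scalar
// fallback, or give up on vectorization entirely.
void saxpy_opaque(float* y, const float* x, float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```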
Iterative Stencil Loops (ISLs) are a key kernel in a range of compute-intensive applications. To accelerate ISLs with FPGAs, it is critical to exploit parallelism (1) among elements within the same iteration and (2) across loop iterations. We propose a novel ISL acceleration scheme called Direct Computation of Multiple Iterations (DCMI), which improves upon prior work by pre-computing, at design time, the effective stencil coefficients after a given number of iterations, resulting in accelerators that use minimal on-chip memory and avoid redundant computation. This enables DCMI to improve throughput by up to 7.7x compared to the state-of-the-art cone-based architecture.
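The coefficient pre-computation rests on a simple identity: for a linear, constant-coefficient stencil, applying it k times is equivalent to applying a single wider stencil whose coefficients are the k-fold convolution of the originals. The 1-D sketch below shows only this design-time step under that linearity assumption; the FPGA datapath that applies the resulting stencil is not modeled. For example, composing the 3-point average {1/3, 1/3, 1/3} with itself yields the 5-point stencil {1/9, 2/9, 3/9, 2/9, 1/9}.

```cpp
#include <cstddef>
#include <vector>

// Discrete convolution of two coefficient vectors: applying stencil
// `a` and then stencil `b` equals applying their convolution once.
std::vector<double> compose(const std::vector<double>& a,
                            const std::vector<double>& b) {
    std::vector<double> c(a.size() + b.size() - 1, 0.0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            c[i + j] += a[i] * b[j];
    return c;
}

// Design-time step: fold `iters` applications of stencil `s` into one
// wider stencil. The accelerator then applies it in a single pass,
// with no intermediate grids and no redundant recomputation.
std::vector<double> effective_coeffs(const std::vector<double>& s,
                                     int iters) {
    std::vector<double> eff = {1.0};  // identity stencil
    for (int k = 0; k < iters; ++k)
        eff = compose(eff, s);        // one convolution per iteration
    return eff;
}
```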