It has been widely observed that there is no "best-for-all" sparse format for the SpMV kernel on GPUs. Indeed, without a careful selection of the sparse format, performance can degrade by an order of magnitude. To address this problem, we propose in this paper BestSF (Best Sparse Format), a new learning-based sparse meta-format that automatically selects the most appropriate sparse format for a given input matrix. Our experimental results on two different NVIDIA GPUs, using a large number of real-world sparse matrices, show that BestSF achieves a noticeable overall improvement in both performance and energy efficiency.
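A minimal sketch of the selection step such a meta-format implies. The feature set, thresholds, and format choices below are illustrative assumptions, not BestSF's actual model, which is learned from training data:

```cpp
// Extract cheap structural features from a CSR matrix and pick a format.
// A real system would replace the threshold rules with a trained classifier.
#include <algorithm>
#include <vector>

enum class Format { CSR, ELL, COO, HYB };

struct CsrMatrix {
    int rows;
    std::vector<int> rowPtr;     // size rows + 1
    std::vector<int> colIdx;     // column index per nonzero
    std::vector<double> vals;    // nonzero values
};

Format pickFormat(const CsrMatrix& A) {
    double mean = static_cast<double>(A.vals.size()) / A.rows;
    int maxRow = 0;
    for (int r = 0; r < A.rows; ++r)
        maxRow = std::max(maxRow, A.rowPtr[r + 1] - A.rowPtr[r]);
    // Hypothetical rules: regular row lengths favor ELL, highly skewed
    // ones favor HYB, everything else falls back to CSR.
    if (maxRow < 2 * mean) return Format::ELL;
    if (maxRow > 32 * mean) return Format::HYB;
    return Format::CSR;
}
```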
Running extreme-scale DNNs, whose depth and width continue to grow, on a single GPU is urgently needed. To reduce the memory footprint of extreme-scale deep learning, we present Layrub, a runtime data placement strategy that achieves layer-centric memory reuse at both the intra-layer and inter-layer levels. Compared to Caffe, Layrub cuts memory usage by up to 98.9%, at a moderate average time cost of 24.1%. Layrub outperforms systems such as GeePS, vDNN, MXNet, and TensorFlow. It also tackles extreme-scale tasks, including successfully training a ResNet with 1517 layers on a single GPU with 12 GB of memory.
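A minimal sketch of the buffer-recycling idea behind layer-centric reuse. The pool below is an illustrative stand-in, not Layrub's actual placement policy, which also coordinates intra- and inter-layer lifetimes:

```cpp
// Recycle activation buffers between layers instead of keeping one live
// allocation per layer for the whole forward/backward pass.
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

class ReusePool {
    std::multimap<std::size_t, std::vector<float>> free_;  // capacity -> idle buffers
public:
    std::vector<float> acquire(std::size_t n) {
        auto it = free_.lower_bound(n);            // reuse a big-enough buffer
        if (it != free_.end()) {
            std::vector<float> buf = std::move(it->second);
            free_.erase(it);
            buf.resize(n);
            return buf;
        }
        return std::vector<float>(n);              // otherwise allocate fresh
    }
    void release(std::vector<float> buf) {         // layer done: return buffer
        free_.emplace(buf.capacity(), std::move(buf));
    }
};
```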
Modern data centers employ workload consolidation to increase server utilization, reduce total cost of ownership, and mitigate data center scaling limitations. However, server resource sharing creates performance interference across applications, which increases performance variability and hurts application QoS and user experience. A challenging problem today is to increase shared server utilization while maintaining application QoS. In this paper, we present QuMan, a server resource manager that uses application isolation and profiling to increase server utilization while controlling the degradation of application performance.
The efficiency of general matrix-matrix multiplication (GEMM) is of high importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations, and all existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches more than 85% of the performance of an optimized BLAS library without the need for an external implementation or automatic tuning.
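A minimal sketch of the kind of code such a compiler pass would generate, assuming register blocking as the core transformation. Tile sizes are placeholders; production-quality code also packs operands and vectorizes the inner loop:

```cpp
// Register-blocked GEMM micro-kernel: C += A * B for row-major matrices.
// For brevity, assumes M and N are multiples of the tile sizes.
void gemm_tile(const double* A, const double* B, double* C,
               int M, int N, int K) {
    constexpr int TM = 4, TN = 4;                 // assumed register tile
    for (int i = 0; i < M; i += TM)
        for (int j = 0; j < N; j += TN) {
            double acc[TM][TN] = {};              // accumulators stay in registers
            for (int k = 0; k < K; ++k)
                for (int ii = 0; ii < TM; ++ii)
                    for (int jj = 0; jj < TN; ++jj)
                        acc[ii][jj] += A[(i + ii) * K + k] * B[k * N + (j + jj)];
            for (int ii = 0; ii < TM; ++ii)
                for (int jj = 0; jj < TN; ++jj)
                    C[(i + ii) * N + (j + jj)] += acc[ii][jj];
        }
}
```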
Block-level cooperation is an endurance management technique that operates on top of error correction mechanisms to extend memory lifetime. Once an error recovery scheme fails to recover from faults in a data block, the entire physical page associated with that block is disabled and becomes unavailable to the physical address space. To reduce the page waste caused by early block failures, other blocks can be used to support the failed block, working cooperatively to keep it alive and extend the faulty page's lifetime.
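A minimal sketch of the bookkeeping such cooperation implies. The remapping table and translation function are illustrative assumptions, not the paper's actual design:

```cpp
// When ECC gives up on a block, a healthy "supporter" block donates capacity
// so the faulty page stays in the physical address space.
#include <cstdint>
#include <unordered_map>

struct Remap { uint64_t supporter; };             // supporter block id

class CooperationTable {
    std::unordered_map<uint64_t, Remap> failed_;  // failed block -> supporter
public:
    void onUncorrectableFault(uint64_t block, uint64_t supporter) {
        failed_[block] = {supporter};             // keep the page alive via a donor
    }
    uint64_t translate(uint64_t block) const {    // redirect accesses
        auto it = failed_.find(block);
        return it == failed_.end() ? block : it->second.supporter;
    }
};
```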
CNN hardware accelerators typically contain large numbers of MAC units, whose multipliers are costly in IC gate count and power consumption. We reduce power and area by implementing PASM in a weight-shared CNN. PASM re-architects the MAC to instead count the frequency of each weight and place it in a bin; the accumulated value is computed in a subsequent multiply phase, significantly reducing the gate count and power consumption of the CNN. Experiments show that our approach results in fewer gates, smaller logic, and reduced power, with only a slight increase in latency, in both ASIC and FPGA implementations.
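A minimal software sketch of the accumulate-then-multiply idea for a weight-shared layer; variable names are illustrative, and the paper implements this in hardware:

```cpp
// Instead of one multiply per MAC, sum the activations that share each
// weight bin, then do a single multiply per bin at the end.
#include <cstddef>
#include <vector>

float pasm_dot(const std::vector<float>& x,       // activations
               const std::vector<int>& widx,      // per-element weight index
               const std::vector<float>& wdict)   // K shared weight values
{
    std::vector<float> bins(wdict.size(), 0.0f);
    for (std::size_t i = 0; i < x.size(); ++i)
        bins[widx[i]] += x[i];                    // counting/accumulate phase
    float acc = 0.0f;
    for (std::size_t k = 0; k < wdict.size(); ++k)
        acc += bins[k] * wdict[k];                // one multiply per bin
    return acc;
}
```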
To exploit multiple GPUs in a system efficiently, it is critical to co-place compute and data. However, two key techniques that have been used to hide memory latency and improve thread-level parallelism (TLP) in traditional GPU systems, memory interleaving and thread-block scheduling, are at odds with the efficient use of multiple GPUs. Distributing data across multiple GPUs to improve memory bandwidth utilization incurs high remote traffic when data and compute are misaligned, and nondeterministic thread-block scheduling to improve compute resource utilization impedes compute and data co-placement. This paper proposes a mechanism that identifies exclusive data and places it, along with the code accessing it, in the same GPU.
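A minimal sketch of exclusive-data detection and placement; the first-touch claim policy and interleaved fallback below are illustrative assumptions, not necessarily the paper's mechanism:

```cpp
// Pages touched exclusively by one GPU's thread blocks are placed on that
// GPU; pages accessed by several GPUs fall back to address interleaving.
#include <cstdint>
#include <unordered_map>

struct PageInfo { int owner = -1; bool shared = false; };

class Placement {
    std::unordered_map<uint64_t, PageInfo> pages_;
    int numGpus_;
public:
    explicit Placement(int numGpus) : numGpus_(numGpus) {}
    void recordAccess(uint64_t page, int gpu) {
        auto& p = pages_[page];
        if (p.owner == -1) p.owner = gpu;         // first toucher claims page
        else if (p.owner != gpu) p.shared = true; // accessed by more than one GPU
    }
    int placeOn(uint64_t page) const {
        auto it = pages_.find(page);
        if (it != pages_.end() && !it->second.shared)
            return it->second.owner;              // exclusive: co-place with compute
        return static_cast<int>(page % numGpus_); // shared: interleave
    }
};
```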
This paper introduces computer cluster nodes as simple OpenMP offloading devices that can be used either from a local computer or from the cluster head node. It proposes a methodology that transforms OpenMP directives to Spark runtime calls with fully integrated communication management, so that the cluster appears to the programmer as yet another accelerator device. Results show that although data transfers can impose overheads, cloud offloading from a local machine can still achieve promising speedups for larger granularity: up to 115x on 256 cores for the 2MM benchmark using 1GB sparse matrices.
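A minimal sketch of what the programmer would write under this model: a standard OpenMP 4.x target region with explicit data mapping, which the proposed toolchain would lower to Spark runtime calls. The device number standing for the cluster is an assumption for illustration:

```cpp
// Plain OpenMP offload region; under the proposed scheme the "device"
// is a cluster rather than a local GPU or coprocessor.
#include <cstddef>
#include <vector>

void scale(std::vector<double>& v, double a) {
    double* p = v.data();
    std::size_t n = v.size();
    #pragma omp target teams distribute parallel for \
            map(tofrom: p[0:n]) device(1)
    for (std::size_t i = 0; i < n; ++i)
        p[i] *= a;
}
```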
To achieve on-the-spot branch divergence reduction at runtime, we propose the first on-GPU thread-data remapping scheme. Before kernel launch, our solution inserts code into GPU kernels immediately before each target branch so as to acquire actual divergence information. Threads can be remapped to data multiple times during a single kernel execution. We propose two on-GPU thread-data remapping algorithms. Effective on two generations of GPUs from NVIDIA and AMD, our solution achieves speedups of up to 2.718x on third-party benchmarks. We also implement three frontier GPGPU benchmarks and show that our solution works better than the traditional approach.
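A minimal sketch of the core remapping step, shown on the host for clarity even though the paper performs it on the GPU, inside the kernel, before each target branch:

```cpp
// Reorder the thread->data mapping so threads in the same warp process
// data items that take the same side of the branch.
#include <vector>

// pred[i] is the branch outcome that data item i would produce.
std::vector<int> remap(const std::vector<char>& pred) {
    std::vector<int> map(pred.size());
    int lo = 0, hi = static_cast<int>(pred.size());
    for (int i = 0; i < static_cast<int>(pred.size()); ++i)
        map[pred[i] ? lo++ : --hi] = i;   // partition data indices by outcome
    return map;                           // thread t now processes map[t]
}
```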
Prefetching is a well-known technique to mitigate scalability challenges in the PGAS model. It has been studied as either an automated compiler optimization or a manual programmer optimization. Using the locality awareness of PGAS, we define a hybrid tradeoff. Specifically, we introduce LAPPS: Locality-Aware Productive Prefetching Support for PGAS. Our novel, user-driven approach strikes a balance between the ease of use of compiler-based automated prefetching and the high performance of laborious manual prefetching. Our prototype implementation in Chapel shows that significant scalability and performance improvements can be achieved with minimal effort in common applications.
TAGE is one of the most accurate conditional branch predictors known today. However, TAGE does not exploit its input information perfectly, as it is possible to obtain significant prediction accuracy improvements by complementing TAGE with a statistical corrector using the same input information. This paper proposes an alternative TAGE-like predictor making statistical correction practically superfluous.
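For context, a minimal sketch of a TAGE-style lookup with one tagged component plus a base predictor; real TAGE uses several tagged tables indexed with geometric history lengths, and the allocation and useful-bit logic is omitted here:

```cpp
// Longest-matching tagged component wins; otherwise fall back to the base.
#include <algorithm>
#include <array>
#include <cstdint>

struct TaggedEntry { uint16_t tag = 0; int8_t ctr = 0; };

class TinyTage {
    std::array<int8_t, 4096> base_{};             // bimodal base predictor
    std::array<TaggedEntry, 1024> t1_{};          // one tagged component
    uint64_t ghist_ = 0;                          // global branch history
public:
    bool predict(uint64_t pc) const {
        uint32_t idx = (pc ^ ghist_) & 1023;      // history-hashed index
        uint16_t tag = static_cast<uint16_t>((pc >> 10) ^ ghist_) & 0x3ff;
        if (t1_[idx].tag == tag)                  // tag hit: use tagged entry
            return t1_[idx].ctr >= 0;
        return base_[pc & 4095] >= 0;             // miss: fall back to base
    }
    void update(uint64_t pc, bool taken) {
        // Sketch: update only the base counter and the history register.
        int8_t& c = base_[pc & 4095];
        c = std::min<int8_t>(1, std::max<int8_t>(-2, c + (taken ? 1 : -1)));
        ghist_ = (ghist_ << 1) | (taken ? 1 : 0);
    }
};
```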
Task-parallel programs utilize the cache hierarchy inefficiently due to the presence of dead blocks in caches. Dead blocks may occupy cache space at multiple cache levels for a long time without providing any utility until they are finally evicted. Existing dead-block prediction schemes make decisions locally for each cache level and do not efficiently manage the entire cache hierarchy. This paper introduces runtime-orchestrated global dead-block management, in which static and dynamic information about tasks available to the runtime system is used to effectively detect and manage dead blocks across the cache hierarchy.
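A minimal sketch of how a task runtime could orchestrate dead-block hints; the region-based interface and evictHint placeholder are illustrative assumptions, not the paper's design:

```cpp
// The runtime knows each task's data footprint, so when the last consumer
// of a region retires it can tell the cache hierarchy the blocks are dead.
#include <cstdint>
#include <unordered_map>

class DeadBlockManager {
    std::unordered_map<uint64_t, int> consumers_; // region -> pending tasks
public:
    void onTaskCreated(uint64_t region) { ++consumers_[region]; }
    void onTaskFinished(uint64_t region) {
        if (--consumers_[region] == 0)            // last use just completed
            evictHint(region);                    // blocks are globally dead
    }
private:
    static void evictHint(uint64_t /*region*/) {
        // Placeholder: would issue whatever demote/evict operations the
        // assumed cache-hierarchy interface provides.
    }
};
```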
In this paper, we first demonstrate how to combine general tuning techniques with the POWER8 hardware architecture by optimizing three representative stencil benchmarks. We then employ two typical real-world applications (kernels similar to those of the 2016 and 2017 Gordon Bell Prize winning programs) to illustrate how to make proper algorithmic modifications and fully combine the hardware-oriented tuning strategies with the application algorithms. As a result, this work fills the gap between the hardware capability and the software performance of the POWER8 processor, and provides useful guidance for optimizing stencil-based scientific applications on POWER systems.
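A minimal sketch of one general tuning technique such work applies: 2D cache blocking of a 5-point Jacobi stencil. The tile sizes are placeholders to be tuned to the target cache hierarchy (e.g., POWER8's L2/L3 capacities):

```cpp
// Blocked 5-point Jacobi sweep over an nx-by-ny grid (row-major layout).
#include <algorithm>

void jacobi5(const double* in, double* out, int nx, int ny) {
    constexpr int BX = 64, BY = 64;               // assumed tile sizes
    for (int jb = 1; jb < ny - 1; jb += BY)
        for (int ib = 1; ib < nx - 1; ib += BX)
            for (int j = jb; j < std::min(jb + BY, ny - 1); ++j)
                for (int i = ib; i < std::min(ib + BX, nx - 1); ++i)
                    out[j * nx + i] = 0.25 * (in[j * nx + i - 1] + in[j * nx + i + 1]
                                            + in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
}
```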
GPUs require substantial hardware resources to hold the state of a massive number of simultaneously executing threads. While GPU register files are already enormous, reaching capacities of 256KB per streaming multiprocessor (SM), we find that nearly half of the real-world applications we examined are register-bound and would benefit from a larger register file to enable more concurrent threads. Our paper demonstrates three approaches to better utilize GPU register files and shows that while any one approach may fail to free very many registers, together they synergistically free enough registers to launch additional parallel work.
Value prediction improves instruction-level parallelism in superscalar processors by breaking true data dependencies. Although this technique can significantly improve overall performance, most approaches achieve it at a high hardware cost. Our work reduces the complexity of value prediction by restricting the prediction infrastructure to load instructions and leveraging existing hardware in modern processors. We also propose a new load value predictor that outperforms state-of-the-art predictors at very low cost. Moreover, we propose a new taxonomy for the different policies that can be used in value prediction.
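A minimal sketch of a load-only value predictor with confidence tracking, in the spirit of such designs; this last-value scheme is illustrative, not the paper's proposed predictor:

```cpp
// Predict that a load returns the same value it returned last time, and
// use the prediction only once the entry has proven itself.
#include <array>
#include <cstdint>

struct LvpEntry { uint64_t value = 0; uint8_t conf = 0; };

class LoadValuePredictor {
    std::array<LvpEntry, 1024> table_{};
public:
    bool predict(uint64_t pc, uint64_t& value) const {
        const LvpEntry& e = table_[pc & 1023];
        value = e.value;
        return e.conf >= 3;                       // only predict when confident
    }
    void train(uint64_t pc, uint64_t actual) {
        LvpEntry& e = table_[pc & 1023];
        if (e.value == actual) { if (e.conf < 3) ++e.conf; }
        else { e.conf = 0; e.value = actual; }    // reset on a wrong value
    }
};
```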