It has been widely observed that there is no "best-for-all" sparse format for the SpMV kernel on GPUs. Indeed, without careful selection of the sparse format, performance can degrade by up to an order of magnitude. To address this problem, we propose in this paper BestSF (Best Sparse Format), a new learning-based sparse meta-format that automatically selects the most appropriate sparse format for a given input matrix. Our experimental results on two different NVIDIA GPUs, using a large number of real-world sparse matrices, show that BestSF achieves noticeable overall improvements in both performance and energy efficiency.
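As a rough illustration of the idea behind a learned format selector, the sketch below extracts simple per-row nonzero statistics from a matrix and picks among CSR, ELL, and COO. The feature names, thresholds, and hand-written rules are illustrative assumptions standing in for a trained model; they are not BestSF's actual features or classifier.

```python
import numpy as np

def row_features(dense):
    """Per-row nonzero statistics used as selector features (illustrative)."""
    nnz_per_row = (dense != 0).sum(axis=1)
    return {
        "mean_nnz": float(nnz_per_row.mean()),
        "max_nnz": int(nnz_per_row.max()),
        "cv_nnz": float(nnz_per_row.std() / (nnz_per_row.mean() + 1e-12)),
    }

def select_format(dense):
    """Hand-written stand-in rules; a real system would use a trained model."""
    f = row_features(dense)
    if f["cv_nnz"] < 0.2:                 # uniform rows -> ELL padding is cheap
        return "ELL"
    if f["max_nnz"] > 4 * f["mean_nnz"]:  # a few very long rows -> COO
        return "COO"
    return "CSR"                          # good general-purpose default

uniform = np.eye(8) + np.eye(8, k=1)      # nearly uniform rows
skewed = np.zeros((8, 8))
skewed[0, :] = 1                          # one dense row dominates
skewed[1, 0] = 1
print(select_format(uniform), select_format(skewed))  # ELL COO
```

The point of the sketch is the structure of the decision, not the rules themselves: a learning-based selector replaces the `if` chain with a model trained on measured SpMV performance.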
Modern data centers employ workload consolidation to increase server utilization, reduce total cost of ownership, and mitigate data center scaling limitations. However, server resource sharing creates performance interference across applications and, consequently, increases performance variability and hurts application QoS and user experience. A challenging problem today is to increase shared server utilization while maintaining application QoS. In this paper, we present QuMan, a server resource manager that uses application isolation and profiling to increase server utilization while controlling the degradation of application performance.
The efficiency of generalized matrix-matrix multiplication (GEMM) is of high importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations, and all existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches more than 85% of the performance of an optimized BLAS library without the need for an external implementation or automatic tuning.
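A core ingredient of any such GEMM optimization is loop tiling, so that tiles of the operands stay cache-resident. The sketch below shows the blocking structure only; the block size is an illustrative assumption, and a real compiler would additionally vectorize and unroll an inner micro-kernel rather than delegate the tile product as done here.

```python
import numpy as np

def blocked_gemm(A, B, bs=4):
    """GEMM with i/j/k loop tiling; bs is an illustrative block size."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for k0 in range(0, k, bs):
                # Update one C tile with a small A-tile x B-tile product;
                # NumPy slicing handles ragged edge tiles automatically.
                C[i0:i0+bs, j0:j0+bs] += (
                    A[i0:i0+bs, k0:k0+bs] @ B[k0:k0+bs, j0:j0+bs]
                )
    return C

A = np.arange(36.0).reshape(6, 6)
B = np.ones((6, 6))
print(np.allclose(blocked_gemm(A, B), A @ B))  # True
```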
CNN hardware accelerators typically contain large numbers of MAC units, whose multipliers are large in IC gate count and power consumption. We reduce power and area by implementing PASM in a weight-shared CNN. PASM re-architects the MAC to instead count, for each shared weight, the inputs that use it, accumulating them into a per-weight bin. The accumulated values are multiplied by their weights in a subsequent multiply phase, significantly reducing the gate count and power consumption of the CNN. Experiments show that our approach results in fewer gates, smaller logic, and reduced power, with only a slight increase in latency, in both ASIC and FPGA implementations.
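The count-then-multiply idea can be illustrated on a weight-shared dot product: the accumulate phase uses only additions, and the multiply phase needs just one multiply per codebook entry rather than one per input. The function name, tiny codebook, and data below are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def shared_weight_dot(activations, weight_ids, codebook):
    """Dot product via per-bin accumulation, then one multiply per bin."""
    bins = np.zeros(len(codebook))
    for a, w in zip(activations, weight_ids):  # accumulate phase: adds only
        bins[w] += a
    return float(np.dot(bins, codebook))       # multiply phase: len(codebook) muls

codebook = np.array([-1.0, 0.5, 2.0])          # shared weight values
weight_ids = np.array([0, 1, 1, 2, 0, 2])      # each input's codebook index
acts = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

direct = float(np.dot(acts, codebook[weight_ids]))   # conventional MAC result
print(shared_weight_dot(acts, weight_ids, codebook), direct)  # 16.5 16.5
```

With 6 inputs and 3 shared weights, the conventional path needs 6 multiplies while the binned path needs 3; the saving grows with the input count for a fixed codebook.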
To exploit multiple GPUs in a system efficiently, it is critical to co-place compute and data. However, two key techniques that have been used to hide memory latency and improve TLP in traditional GPU systems, memory interleaving and thread-block scheduling, are at odds with efficient use of multiple GPUs. Distributing data across multiple GPUs to improve memory bandwidth utilization incurs high remote traffic when data and compute are misaligned, and nondeterministic thread-block scheduling to improve compute resource utilization impedes compute and data co-placement. This paper proposes a mechanism that identifies exclusively accessed data and places it in the same GPU as the code that accesses it.
This paper introduces computer cluster nodes as simple OpenMP offloading devices that can be used either from a local computer or from the cluster head node. It proposes a methodology that transforms OpenMP directives into Spark runtime calls with fully integrated communication management, so that a cluster appears to the programmer as yet another accelerator device. Results show that although data transfers can impose overheads, cloud offloading from a local machine can still achieve promising speedups at larger granularity: up to 115x on 256 cores for the 2MM benchmark using 1GB sparse matrices.
Prefetching is a well-known technique to mitigate scalability challenges in the PGAS model. It has been studied as either an automated compiler optimization or a manual programmer optimization. Using the locality awareness of PGAS, we define a hybrid tradeoff. Specifically, we introduce LAPPS: Locality-Aware Productive Prefetching Support for PGAS. Our novel, user-driven approach strikes a balance between the ease of use of compiler-based automated prefetching and the high performance of laborious manual prefetching. Our prototype implementation in Chapel shows that significant scalability and performance improvements can be achieved with minimal effort in common applications.
TAGE is one of the most accurate conditional branch predictors known today. However, TAGE does not exploit its input information perfectly: significant prediction accuracy improvements can be obtained by complementing TAGE with a statistical corrector that uses the same input information. This paper proposes an alternative TAGE-like predictor that makes statistical correction practically superfluous.
Task-parallel programs utilize the cache hierarchy inefficiently due to the presence of dead blocks in caches. Dead blocks may occupy cache space in multiple cache levels for a long time without providing any utility until they are finally evicted. Existing dead-block prediction schemes make decisions locally for each cache level and do not efficiently manage the entire cache hierarchy. This paper introduces runtime-orchestrated global dead-block management, in which static and dynamic information about tasks available to the runtime system is used to effectively detect and manage dead blocks across the cache hierarchy.
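The intuition can be sketched with a toy consumer count per data region: once the last task that reads a region finishes, the runtime knows every cached block of that region is dead and can be evicted early rather than lingering across cache levels. The task graph and bookkeeping below are illustrative assumptions, not the paper's actual runtime interface.

```python
from collections import Counter

# Hypothetical task graph: which regions each task reads.
task_reads = {"t1": ["A"], "t2": ["A", "B"], "t3": ["B"]}

# Remaining-consumer count per region, derived from the task graph.
consumers = Counter(r for reads in task_reads.values() for r in reads)

def on_task_finish(task):
    """Return the regions that just became dead (safe to evict eagerly)."""
    dead = []
    for region in task_reads[task]:
        consumers[region] -= 1
        if consumers[region] == 0:
            dead.append(region)
    return dead

dead1 = on_task_finish("t1")   # A still has a consumer (t2)
dead2 = on_task_finish("t2")   # last reader of A finished
dead3 = on_task_finish("t3")   # last reader of B finished
print(dead1, dead2, dead3)     # [] ['A'] ['B']
```

A local, per-level dead-block predictor would have to rediscover this liveness from access patterns at each level; the runtime can derive it directly from the task graph.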
Value prediction improves instruction-level parallelism in superscalar processors by breaking true data dependencies. Although this technique can significantly improve overall performance, most approaches achieve it at a high hardware cost. Our work reduces the complexity of value prediction by restricting the prediction infrastructure to load instructions and leveraging existing hardware in modern processors. We also propose a new load value predictor that outperforms all state-of-the-art predictors at very low cost. Moreover, we propose a new taxonomy of the policies that can be used in value prediction.
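For readers unfamiliar with the design space, the simplest member is a last-value predictor: a table indexed by the load's PC that predicts the load will return the value it returned last time. This sketch is a generic textbook baseline, not the predictor proposed above; the table size and indexing are illustrative assumptions.

```python
class LastValuePredictor:
    """Minimal last-value load predictor sketch (PC-indexed table)."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = {}                   # table index -> last committed value

    def predict(self, pc):
        """Predicted value for the load at pc, or None if no prediction."""
        return self.table.get(pc % self.entries)

    def update(self, pc, value):
        """Train at commit with the load's actual value."""
        self.table[pc % self.entries] = value

p = LastValuePredictor()
p.update(0x40, 7)                         # load at 0x40 returned 7
print(p.predict(0x40), p.predict(0x44))   # 7 None
```

Real designs add confidence counters so mispredictions (which force a pipeline recovery) are attempted only for loads whose values are stable.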