A store operation is called "silent" if it writes to memory a value that is already there. Silent stores are traditionally detected via profiling. We depart from this methodology and predict silentness by analyzing the syntax of programs. To accomplish this goal, we classify store operations in terms of syntactic features of programs. Based on such features, we develop different kinds of predictors, some of which go well beyond what any trivial approach could achieve. To illustrate how static prediction can be employed in practice, we use it to optimize programs running on non-volatile memory systems.
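The silent-store condition itself is easy to state operationally. The following minimal Python sketch (the dict-based memory model and function names are illustrative, not from the paper) shows what it means for a store to be silent:

```python
def is_silent_store(memory, addr, value):
    """A store is 'silent' if it writes a value already present at addr."""
    return memory.get(addr) == value

def store(memory, addr, value):
    """Perform the store, reporting whether it was silent."""
    silent = is_silent_store(memory, addr, value)
    memory[addr] = value
    return silent

mem = {0x10: 7}
assert store(mem, 0x10, 7) is True    # writes the value already there: silent
assert store(mem, 0x10, 8) is False   # changes the stored value: not silent
```

Profiling detects silentness by performing exactly this check at runtime; the paper's contribution is predicting it statically, from program syntax, without executing the store.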
Although prefetching concepts have been proposed for decades, sophisticated system architectures and emerging applications introduce new challenges. Large instruction windows coupled with out-of-order execution make the program's data access sequence appear distorted from the cache's perspective. Big data applications stress memory subsystems heavily with their large working-set sizes and complex data access patterns. To address these challenges, this work proposes a high-performance hardware prefetching scheme, SelSMaP. SelSMaP detects both regular and non-uniform stride patterns by taking the minimum observed address offset (called the reference stride) as a heuristic.
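The reference-stride heuristic can be sketched in a few lines. This is a simplified illustration only; SelSMaP's actual pattern detection and prefetch issue logic are hardware mechanisms described in the paper, and the function names here are assumptions:

```python
def reference_stride(addresses):
    """Take the minimum positive offset between successive accesses as the
    reference stride (a simplified rendering of SelSMaP's heuristic)."""
    offsets = [b - a for a, b in zip(addresses, addresses[1:]) if b > a]
    return min(offsets) if offsets else None

# A non-uniform stride pattern: successive offsets are 8, 16, 8, 24
addrs = [100, 108, 124, 132, 156]
stride = reference_stride(addrs)
assert stride == 8
# Prefetch candidates are multiples of the reference stride past the last access
prefetch = [addrs[-1] + k * stride for k in (1, 2, 3)]
assert prefetch == [164, 172, 180]
```

The point of using the minimum offset is that even when accesses are reordered or non-uniform, the smallest observed step tends to expose the underlying element granularity of the pattern.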
GEMM is the most fundamental routine in BLAS. Software prefetching and data packing are commonly used to exploit LRU caches. However, processors equipped with shared non-LRU caches are emerging, which poses challenges to GEMM development in multi-threaded contexts. We present a Shared Cache Partitioning (SCP) method that eliminates inter-thread cache conflicts in GEMM by partitioning a shared cache into disjoint sets and assigning different sets to different threads. We implemented SCP in OpenBLAS and evaluated it on Phytium 2000+, a 64-core AArch64 processor with shared pseudo-random L2 caches. The evaluation shows that SCP effectively reduces conflict misses, improving performance by 2.75%-6.91%.
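The core idea of assigning disjoint cache sets to threads can be illustrated abstractly. This sketch only shows the disjointness property; the actual SCP mechanism in OpenBLAS operates through address layout and is hardware-specific, and all names below are illustrative:

```python
def partition_sets(num_sets, num_threads):
    """Split cache set indices into disjoint contiguous ranges, one per
    thread, so no two threads ever compete for the same sets."""
    per_thread = num_sets // num_threads
    return {t: range(t * per_thread, (t + 1) * per_thread)
            for t in range(num_threads)}

parts = partition_sets(num_sets=2048, num_threads=64)
assert len(parts[0]) == 32
# Disjointness: no set index is assigned to two threads
all_sets = [s for r in parts.values() for s in r]
assert len(all_sets) == len(set(all_sets)) == 2048
```

With pseudo-random replacement, a line may be evicted from any way, so isolating threads by set index (rather than relying on replacement-policy behavior) is what makes inter-thread conflicts impossible by construction.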
Performance optimization of stencil codes requires data locality improvements. The polyhedron model for loop transformation is well suited for such optimizations, with established techniques such as the PLuTo algorithm and diamond tiling. However, for the domain of our project ExaStencils, namely stencil codes, it fails to yield optimal results. As an alternative, we propose a new, optimized, multi-dimensional polyhedral search space exploration and demonstrate its effectiveness: we obtain better results than existing approaches. We also propose how to specialize the search for the domain of stencil codes, which dramatically reduces the exploration effort without significantly impairing performance.
This paper proposes RAGuard, an efficient and user-transparent hardware-based approach to preventing ROP attacks. RAGuard binds a message authentication code (MAC) to each return address to ensure its integrity. To guarantee the security of the MAC and reduce runtime overhead, RAGuard (1) computes the MAC by encrypting the signature of a return address with AES-128, (2) develops a key management module based on a physical unclonable function (PUF), and (3) uses a dedicated register to reduce MAC load and store operations for leaf functions. We evaluate our mechanism on the LEON3 processor and show that RAGuard incurs acceptable performance overhead and occupies a reasonable area.
Modern Graphics Processing Units (GPUs) are equipped with a large register file (RF) to support fast context switching between massive numbers of threads, and with scratchpad memory (SPM) to support inter-thread communication within a cooperative thread array (CTA). However, the thread-level parallelism (TLP) of GPUs is often limited by inefficient management of the register file and scratchpad memory. To overcome this inefficiency, we propose a new resource management approach for GPUs, EXPARS. EXPARS provides a logically larger register file by expanding the register file into scratchpad memory. Our experiments show that this approach achieves a 20.01% performance improvement on average with negligible energy overhead.
To achieve on-the-spot branch divergence reduction at runtime, we propose the first on-GPU thread-data remapping scheme. Before kernel launch, our solution inserts code into GPU kernels immediately before each target branch so as to acquire actual divergence information. Threads can be remapped to data multiple times during a single kernel execution. We propose two on-GPU thread-data remapping algorithms. Effective on two generations of GPUs from NVIDIA and AMD, our solution achieves speedups of up to 2.718 with third-party benchmarks. We also implement three frontier GPGPU benchmarks and show that our solution works better than the traditional one.
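The effect of thread-data remapping can be sketched abstractly: reorder the thread-to-data assignment so that threads in the same warp take the same side of a branch. This Python sketch is a host-side illustration only (the paper's algorithms run on the GPU itself), and the warp size and names are assumptions:

```python
def remap(data, take_branch, warp_size=4):
    """Reorder the thread->data mapping so elements with the same branch
    outcome are contiguous; warps then diverge at most at one boundary."""
    taken = [i for i, d in enumerate(data) if take_branch(d)]
    not_taken = [i for i, d in enumerate(data) if not take_branch(d)]
    return taken + not_taken

# Alternating outcomes: without remapping, every warp diverges.
data = [5, -2, 7, -9, 1, -4, 3, -8]
order = remap(data, lambda x: x > 0)
assert order == [0, 2, 4, 6, 1, 3, 5, 7]
# After remapping, the first warp of 4 threads processes only positives
assert all(data[i] > 0 for i in order[:4])
```

In the original mapping each 4-thread warp contains both outcomes and must execute both branch paths serially; after remapping, only the warp straddling the taken/not-taken boundary (if any) diverges.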
We introduce Poker, a permutation-based approach for vectorizing queries over B+-trees. Our insight is to combine vector loads with path-encoding-based permutations to alleviate memory latency while keeping the number of key comparisons to a minimum. Implemented as a C++ template library, Poker represents a general-purpose solution for vectorizing queries over indexing trees on multicores. For five benchmarks evaluated with 24 configurations each, Poker outperforms the prior art by 2.11x with one thread and by 2.28x with eight threads on Broadwell, on average. In addition, strip-mining queries further improves Poker's performance by 1.21x, on average.
Modern applications often have high memory intensity, with common memory request patterns requiring a large number of address generation instructions and a high degree of memory-level parallelism. This paper proposes new memory instructions that exploit strided and indirect memory request patterns and improve efficiency in GPU architectures. The new instructions reduce address generation instructions by offloading addressing to dedicated hardware, and reduce memory interference by grouping related requests together. Our results show that we eliminate 33% of dynamic instructions, leading to an overall runtime improvement of 26%, an energy reduction of 18%, and a reduction in energy-delay product of 32%.
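The two access patterns the new instructions target can be illustrated abstractly. In this Python sketch (function names and the dict-based memory model are illustrative, not the paper's ISA), a single strided or indirect request stands in for what would otherwise be many per-element address generation instructions:

```python
def strided_gather(mem, base, stride, count):
    """One 'strided load' request: addresses come from (base, stride, count)
    instead of per-element address arithmetic in the instruction stream."""
    return [mem[base + i * stride] for i in range(count)]

def indirect_gather(mem, base, indices):
    """One 'indirect load' request: mem[base + idx] for each index,
    covering a[b[i]]-style access patterns."""
    return [mem[base + idx] for idx in indices]

mem = {addr: addr * 10 for addr in range(64)}
assert strided_gather(mem, base=0, stride=4, count=4) == [0, 40, 80, 120]
assert indirect_gather(mem, base=8, indices=[3, 0, 5]) == [110, 80, 130]
```

Because the whole group of related addresses is visible to the memory system at once, such requests can also be scheduled together, which is the source of the reduced memory interference the abstract mentions.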
In this paper, we first demonstrate how to combine general tuning techniques with the POWER8 hardware architecture by optimizing three representative stencil benchmarks. We then employ two typical real-world applications (kernels similar to those of the 2016 and 2017 Gordon Bell Prize winning programs) to illustrate how to make appropriate algorithmic modifications and fully combine hardware-oriented tuning strategies with the application algorithms. As a result, this work fills the gap between the hardware capability and the software performance of the POWER8 processor, and provides useful guidance for optimizing stencil-based scientific applications on POWER systems.
Value prediction improves instruction-level parallelism in superscalar processors by breaking true data dependencies. Although this technique can significantly improve overall performance, most approaches achieve it at a high hardware cost. Our work reduces the complexity of value prediction by optimizing the prediction infrastructure to predict only load instructions and by leveraging hardware that already exists in modern processors. We also propose a new load value predictor that outperforms all state-of-the-art predictors at very low cost. Moreover, we propose a new taxonomy for the different policies that can be used in value prediction.
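For context, the simplest form of load value prediction is a last-value predictor indexed by the load's program counter. The sketch below shows that baseline only; the paper's proposed predictor is more sophisticated, and the class and field names here are illustrative:

```python
class LastValuePredictor:
    """Minimal load value predictor: predict that a load (identified by
    its PC) will return the same value it returned last time."""
    def __init__(self):
        self.table = {}  # PC -> last observed load value

    def predict(self, pc):
        return self.table.get(pc)  # None means no prediction available

    def update(self, pc, actual):
        self.table[pc] = actual    # train on the committed load value

p = LastValuePredictor()
assert p.predict(0x400) is None   # no history for this load yet
p.update(0x400, 42)
assert p.predict(0x400) == 42     # predicts the last observed value
p.update(0x400, 43)
assert p.predict(0x400) == 43
```

A correct prediction lets dependent instructions issue before the load completes, breaking the true dependency; a misprediction requires recovery, which is why predictor accuracy and confidence policies (part of the taxonomy the abstract mentions) matter.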