In this paper we present a novel Hint-Assisted Wavefront Scheduler (HAWS) to bypass long-latency stalls on GPUs. HAWS leverages our compiler infrastructure to identify opportunities to bypass memory stalls. HAWS includes a wavefront scheduler that can continue to execute instructions speculatively in the shadow of a memory stall, guided by compiler-generated hints. HAWS increases the utilization of GPU resources by fetching and executing instructions aggressively and speculatively. Based on our simulation results on the AMD Southern Islands GPU architecture, at an estimated cost of 0.4% of total chip area, HAWS improves application performance by 15.3% on average for memory-intensive applications.
Data persistence is highly desired by many in-memory applications. This paper demonstrates that a hardware-based high-frequency checkpointing technique can achieve efficient in-memory data persistence on NVM. We design a new dual-page checkpointing scheme that achieves low metadata cost while eliminating most excessive NVM writes, breaking the traditional trade-off between metadata space cost and extra data writes. Our solution achieves 24.2× higher throughput than state-of-the-art data-persistence software, and causes 28% less NVM wear while delivering 1.25× higher throughput than state-of-the-art hardware checkpointing systems.
A store operation is called "silent" if it writes to memory a value that is already there. Silent stores are traditionally detected via profiling. We depart from this methodology and predict silentness by analyzing the syntax of programs. To accomplish this goal, we classify store operations in terms of syntactic features of programs. Based on such features, we develop different kinds of predictors, some of which go well beyond what any trivial approach could achieve. To illustrate how static prediction can be employed in practice, we use it to optimize programs running on non-volatile memory systems.
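To make the notion of silentness concrete, here is a minimal sketch (not the paper's method) that counts silent stores in a hypothetical trace of (address, value) store operations:

```python
# Minimal sketch of silent-store detection over a hypothetical trace of
# (address, value) store operations. A store is "silent" when it writes
# the value the memory location already holds.

def count_silent_stores(trace):
    memory = {}          # simulated memory: address -> current value
    silent = 0
    for addr, value in trace:
        if memory.get(addr) == value:
            silent += 1  # the store would not change memory: silent
        memory[addr] = value
    return silent

trace = [(0x10, 1), (0x10, 1), (0x20, 5), (0x10, 2), (0x20, 5)]
print(count_silent_stores(trace))  # -> 2
```

A profiler detects such stores by observing traces like this at run time; the paper's contribution is predicting silentness statically, from syntactic features, without running the program.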
This paper presents GenMatcher, a generic, software-only, arbitrary matching framework for fast, efficient searches. The key idea of our approach is to represent arbitrary rules with efficient prefix-based tries. Our contribution includes a novel, clustering-based grouping algorithm to group rules based upon their bit-level similarities. Our algorithm generates near-optimal trie groupings with low configuration times and provides significantly higher match throughput compared to prior techniques. Experiments with synthetic traffic show that our method can achieve a 58.9X speedup compared to the baseline on a single-core processor under a given memory constraint.
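The general idea of prefix-based rule tries can be sketched as follows; this is an illustrative longest-prefix matcher over bit strings, not GenMatcher's implementation, and the rule names are hypothetical:

```python
# Minimal sketch of prefix-based trie matching over bit strings: rules are
# inserted as '0'/'1' prefixes, and a lookup returns the longest matching rule.

def insert(trie, prefix, rule_id):
    node = trie
    for bit in prefix:                 # prefix is a string of '0'/'1'
        node = node.setdefault(bit, {})
    node['rule'] = rule_id

def longest_match(trie, key):
    node, best = trie, None
    for bit in key:
        if 'rule' in node:
            best = node['rule']        # remember the shortest match so far
        if bit not in node:
            return best                # cannot descend further
        node = node[bit]
    return node.get('rule', best)

trie = {}
insert(trie, '10', 'A')     # matches keys starting with 10
insert(trie, '1011', 'B')   # more specific rule wins when it applies
print(longest_match(trie, '101110'))  # -> 'B'
print(longest_match(trie, '100000'))  # -> 'A'
```

Grouping rules by bit-level similarity, as the paper proposes, would decide which rules share a trie so that each trie stays shallow and dense.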
Some solid-state drives (SSDs) adopt overlong error-correction codes (ECCs), whose redundancy size exceeds the spare-area limit of flash pages, to improve reliability, but they suffer from significantly degraded read performance. In this paper, we propose SCORE, a novel scheme that caches overlong ECCs to improve SSD performance. The excess ECC redundancy of logically consecutive data pages is grouped into ECC pages. SCORE partitions RAM to cache both data pages and ECC pages in a workload-adaptive manner. Experimental results show that SCORE improves SSD performance by an average of 34% compared to state-of-the-art schemes.
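The grouping of excess ECC into shared ECC pages implies a simple address mapping, sketched below. The group size is an assumed example value, not a parameter taken from the paper:

```python
# Minimal sketch of mapping the excess ECC of logically consecutive data
# pages into shared ECC pages. GROUP_SIZE is an illustrative assumption:
# here the overflow ECC of 8 consecutive data pages fits one ECC page.

GROUP_SIZE = 8

def ecc_location(data_page):
    """Return (ecc_page_index, slot) holding this data page's excess ECC."""
    return data_page // GROUP_SIZE, data_page % GROUP_SIZE

print(ecc_location(0))   # -> (0, 0)
print(ecc_location(13))  # -> (1, 5)
```

Because consecutive data pages share one ECC page, a cached ECC page serves reads to a whole run of neighboring data pages, which is what makes caching them worthwhile.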
To address the large computational effort of iterative compilation, we guide a genetic algorithm for program optimization with feedback from a surrogate performance model. We train the model on program transformations that were evaluated during the iterative optimization of a set of training programs. For the representation of programs and program transformations, we employ the polyhedron model. Our experimental evaluation reveals that our models can be used to speed up the future optimization of programs. We demonstrate that we can reduce the benchmarking effort required for iterative optimization without substantially worsening the result.
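The core of surrogate-guided search can be sketched in a few lines; both the "true" runtime function and the noisy surrogate below are illustrative stand-ins, not the paper's polyhedral model:

```python
# Minimal sketch of surrogate-guided search: a cheap model ranks candidate
# configurations so that only the most promising ones are benchmarked for real.
import random

def true_runtime(x):          # stand-in for an expensive benchmark run
    return (x - 7) ** 2 + 1

def surrogate(x):             # stand-in for a cheap, slightly noisy model
    return (x - 7) ** 2 + random.uniform(-2, 2)

random.seed(0)
candidates = list(range(20))           # e.g. mutated transformation configs
ranked = sorted(candidates, key=surrogate)
benchmarked = ranked[:3]               # benchmark only the top 3 candidates
best = min(benchmarked, key=true_runtime)
print(best)
```

The saving is that `true_runtime` is evaluated 3 times instead of 20; in iterative compilation each evaluation is a full compile-and-run, so this ratio is what shrinks the benchmarking effort.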
In this paper, we aim to automate the process of selecting an appropriate DNN topology that fulfills both functional and non-functional requirements of the application. Specifically, we tune two important hyperparameters, depth and width, which together define the shape of the neural network and directly affect its accuracy, speed, size, and energy consumption. By modeling the accuracy of DNNs we achieve up to 4x speed-up in design space traversal compared to exhaustive search. We are able to produce tuned ResNets, which are up to 4.22x faster than original depth-scaled ResNets on a batch of 128 images while matching their accuracy.
We present an approach and a tool to address the need for effective, generic, and easily applicable protections against side-channel attacks. The protection mechanism is based on code polymorphism, so that the observable behaviour of the protected component is variable and unpredictable to the attacker. Our approach combines lightweight specialized runtime code generation with the optimization capabilities of static compilation. It is extensively configurable, and mitigations are set up against security holes related to runtime code generation. Experimental results show that programs secured by our approach exhibit strong security levels and meet the performance requirements of constrained systems.
In this paper, we propose a novel technique called Idle-Time-Aware Power Management (ITAP) to effectively reduce the static energy consumption of GPU execution units. ITAP employs three static-power-reduction modes with different overheads and power-saving capabilities. ITAP estimates the idle-period length using prediction and look-ahead techniques in a synergistic way, and then applies the most appropriate static-power-reduction mode based on the estimated idle-period length. Our experimental results show that ITAP outperforms the state-of-the-art solution by an average of 27.6% in terms of static energy savings, with negligible performance overhead.
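The mode-selection step can be sketched as a threshold policy over the predicted idle length. The mode names and cycle thresholds below are illustrative assumptions, not values from the paper:

```python
# Minimal sketch of choosing among three static-power-reduction modes from a
# predicted idle-period length. Deeper modes save more static power but have
# higher wake-up overhead, so they only pay off for longer idle periods.

THRESHOLDS = [                # (min predicted idle cycles, mode) -- assumed values
    (1000, "power-gate"),     # deepest mode: highest wake-up overhead
    (100,  "deep-sleep"),
    (10,   "light-sleep"),
]

def pick_mode(predicted_idle_cycles):
    for min_cycles, mode in THRESHOLDS:
        if predicted_idle_cycles >= min_cycles:
            return mode
    return "active"           # too short to amortize any mode's overhead

print(pick_mode(5000))  # -> 'power-gate'
print(pick_mode(50))    # -> 'light-sleep'
```

The accuracy of the idle-length estimate is what matters here: overestimating pushes the unit into a deep mode it cannot amortize, while underestimating leaves static power on the table.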
This paper proposes RAGuard, an efficient and user-transparent hardware-based approach to prevent ROP attacks. RAGuard binds a MAC to each return address to ensure its integrity. To guarantee the security of the MAC and reduce runtime overhead, RAGuard (1) computes the MAC by encrypting the signature of a return address with AES-128, (2) develops a key management module based on a PUF, and (3) uses a dedicated register to reduce MAC load and store operations for leaf functions. Furthermore, we evaluate our mechanism on the LEON3 processor and show that RAGuard incurs acceptable performance overhead and occupies reasonable chip area.
Modern Graphics Processing Units (GPUs) are equipped with a large register file (RF) to support fast context switching between massive numbers of threads, and with scratchpad memory (SPM) to support inter-thread communication within a cooperative thread array (CTA). However, the thread-level parallelism (TLP) of GPUs is usually limited by inefficient management of the register file and scratchpad memory. To overcome this inefficiency, we propose EXPARS, a new resource management approach for GPUs. EXPARS logically provides a larger register file by expanding the register file into scratchpad memory. The experimental results show that our approach achieves a 20.01% performance improvement on average with negligible energy overhead.
We introduce Poker, a permutation-based approach for vectorizing queries over B+-trees. Our insight is to combine vector loads and path-encoding-based permutations to alleviate memory latency while keeping the number of key comparisons to a minimum. Implemented as a C++ template library, Poker represents a general-purpose solution for vectorizing queries over indexing trees on multicores. For five benchmarks evaluated with 24 configurations each, Poker outperforms the prior art by 2.11x with one thread and 2.28x with eight threads on Broadwell, on average. In addition, strip-mining queries further improves Poker's performance by 1.21x, on average.
Repositories of benchmark results are not always helpful when consumers need performance data for new processors or new workloads. Moreover, the aggregate scores for benchmark suites can be misleading. To address these problems, we have developed a deep neural network (DNN) model, and we have applied it to the datasets of Intel CPU specifications and SPEC CPU2006 and Geekbench 3 benchmark suites. We show that we can generate useful predictions for new processors and new workloads. We also quantify the self-similarity of these suites for the first time in the literature.
Advances in non-volatile resistive switching random access memory (RRAM) have made it a promising memory technology with applications in low-power and embedded in-memory computing devices, owing to advantages such as low energy consumption, low area cost, and good scalability. However, it is challenging to employ RRAM devices in neuromorphic chips due to the non-ideal behavior of RRAM. In this paper, we propose a cycle-accurate, scalable system-level simulator that can be used to study the effects of RRAM devices in neuromorphic chips. The simulator models a spatial neuromorphic chip architecture containing many neural cores with RRAM crossbars connected via a Network-on-Chip.
Focal-plane Sensor-Processor Arrays (FPSPs) are new imaging devices with parallel Single Instruction Multiple Data (SIMD) computational capabilities built into every pixel. Compared to more traditional imaging devices, these devices enable massive pixel-parallel execution of image processing algorithms. This enables the application of certain image processing algorithms at extreme frame rates. By performing some early-stage processing in-situ, FPSPs have the potential to consume less power compared to conventional approaches using standard digital cameras. In this paper we explore code generation for an FPSP whose processors operate on analogue signal data, leading to further opportunities for power reduction.
Parallel computers are beginning to adopt bandwidth-asymmetric memory architectures that combine traditional DDR memory with High Bandwidth Memory (HBM) for higher bandwidth. However, existing task schedulers suffer from low bandwidth usage and poor data locality on this architecture. To solve these two problems, we propose BATS, a task scheduling system that consists of an HBM-aware data allocator, a bandwidth-aware traffic balancer, and a hierarchical task-stealing scheduler. Experiments on an Intel Knights Landing server with bandwidth-asymmetric memory show that BATS reduces the execution time of memory-bound programs by up to 83.5% compared with traditional task-stealing schedulers.
This paper presents a lightweight region formation method guided by processor tracing, e.g., Intel PT. We leverage the branch history information stored in the processor to reconstruct the program execution profile and effectively form high-quality regions at low cost. We also present the designs of lightweight HPM sampling and a branch-instruction decode cache to minimize region formation overhead. Using ARM32-to-x86-64 translation, the experimental results show that our method achieves a performance speedup of up to 1.38x (1.10x on average) for SPEC CPU2006 benchmarks with reference inputs, compared to the well-known software-based trace formation method, Next Executing Tail (NET).
Value prediction improves instruction-level parallelism in superscalar processors by breaking true data dependencies. Although this technique can significantly improve overall performance, most approaches achieve it at a high hardware cost. Our work reduces the complexity of value prediction by optimizing the prediction infrastructure to predict only load instructions and by leveraging existing hardware in modern processors. We also propose a new load value predictor that outperforms all state-of-the-art predictors at very low cost. Moreover, we propose a new taxonomy for the different policies that can be used in value prediction.
This paper presents a framework that uses a detailed reuse distance analysis (RDA) algorithm to generate reuse distance profiles and other performance parameters for the GPU cache hierarchy. The effects of reservation failures in cache blocks and miss status holding registers are included in the RDA model. The framework is 264KX slower than GPU executions (10X faster than GPGPU-Sim) and predicts L1 and L2 miss rates within average errors of 4.93% and 4.87%, respectively. To alleviate the complexity of RDA computations, we applied a statistical sampling method that achieves a 1.8X speedup on average and predicts the L1/L2 miss rates within ~5/9%, respectively.
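For readers unfamiliar with reuse distance analysis, here is a minimal sketch of the textbook definition over an address trace; the paper's GPU-specific model (reservation failures, MSHRs) builds on this basic quantity:

```python
# Minimal sketch of computing reuse distances for a memory-address trace.
# The reuse distance of an access is the number of DISTINCT addresses
# touched since the previous access to the same address (infinite, here
# None, on the first access).

def reuse_distances(trace):
    last_seen = {}       # address -> index of its previous access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses between the two accesses to `addr`
            window = set(trace[last_seen[addr] + 1:i])
            distances.append(len(window))
        else:
            distances.append(None)   # first access: infinite distance
        last_seen[addr] = i
    return distances

print(reuse_distances(['a', 'b', 'c', 'a', 'b']))  # -> [None, None, None, 2, 2]
```

With a fully associative LRU cache holding C blocks, an access hits exactly when its reuse distance is less than C, which is why reuse distance profiles predict miss rates without simulating the cache cycle by cycle.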