In this paper we present a novel Hint-Assisted Wavefront Scheduler (HAWS) to bypass long-latency stalls on GPUs. HAWS leverages our compiler infrastructure to identify opportunities to bypass memory stalls. HAWS includes a wavefront scheduler that can continue to execute instructions speculatively in the shadow of a memory stall, guided by compiler-generated hints. By aggressively fetching and executing instructions speculatively, HAWS increases the utilization of GPU resources. Based on our simulation results on the AMD Southern Islands GPU architecture, at an estimated cost of 0.4% of total chip area, HAWS improves application performance by 15.3% on average for memory-intensive applications.
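The benefit of continuing past a stall can be illustrated with a toy latency model. The sketch below is a hypothetical simplification, not HAWS itself: instruction format, latencies, and the single-outstanding-load assumption are all invented for illustration. A scheduler that may bypass keeps issuing instructions that a compiler hint marks as independent of the pending load, instead of stalling until the load returns.

```python
# Toy model of hint-guided bypassing in the spirit of HAWS. All parameters
# (instruction encoding, the 100-cycle load latency, one outstanding load)
# are illustrative assumptions, not the paper's design.

LOAD_LATENCY = 100

def cycles(instrs, bypass):
    """instrs: list of ('load'|'alu', hint_safe_to_bypass). Returns total cycles."""
    t, pending = 0, 0          # pending = cycle at which the outstanding load completes
    for kind, safe in instrs:
        if kind == "load":
            t += 1
            pending = t + LOAD_LATENCY
        else:
            # Without bypassing (or for hint-unsafe instructions), the
            # wavefront stalls until the load returns.
            if pending > t and not (bypass and safe):
                t = pending
            t += 1
    return max(t, pending)

# One long-latency load followed by 50 hint-safe independent ALU ops
# and one op that truly depends on the load.
prog = [("load", False)] + [("alu", True)] * 50 + [("alu", False)]
print(cycles(prog, bypass=False))  # 152 cycles: everything waits on the load
print(cycles(prog, bypass=True))   # 102 cycles: safe work hides the stall
```

The independent work executes in the shadow of the load, so total time approaches the load latency itself rather than latency plus serialized work.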
The shared-memory cache coherence paradigm is prevalent for data sharing in modern single-chip multicores. However, thread synchronization based on atomic instructions suffers from cache-line ping-pong overhead, which prevents performance from scaling as core counts increase. To mitigate this bottleneck, this paper proposes to utilize in-hardware explicit messaging to enable a novel moving-computation-to-data (MC) model that offloads the execution of critical code regions to dedicated cores. The key idea is to exploit the low-latency, non-blocking capabilities of in-hardware messaging to overlap communication with computation, and to exploit data locality for efficient critical-code execution in multicores at the 1000-core scale.
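The MC idea can be sketched as a software analogue: instead of every thread acquiring a lock and pulling shared data into its own cache, threads send the critical section as a message to a dedicated core that owns the data. The class and method names below are hypothetical, and Python threads/queues stand in for the paper's in-hardware messaging.

```python
import queue
import threading

# Illustrative software analogue of the moving-computation-to-data (MC) model.
# A dedicated "service core" owns the shared data and executes offloaded
# critical sections serially, so the data never ping-pongs between caches.

class ServiceCore:
    def __init__(self):
        self.inbox = queue.Queue()      # stands in for the hardware message queue
        self.shared_counter = 0         # data owned by this core
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            critical_section, done = self.inbox.get()
            if critical_section is None:            # shutdown sentinel
                return
            critical_section(self)                  # executed where the data lives
            done.set()

    def offload(self, critical_section):
        # Non-blocking send: the caller can overlap other work and only
        # waits on the returned event when it needs the result.
        done = threading.Event()
        self.inbox.put((critical_section, done))
        return done

    def shutdown(self):
        self.inbox.put((None, None))
        self.worker.join()

core = ServiceCore()
events = [core.offload(lambda c: setattr(c, "shared_counter", c.shared_counter + 1))
          for _ in range(1000)]
for e in events:
    e.wait()
core.shutdown()
print(core.shared_counter)  # 1000: all increments ran serially at the owner core
```

No atomic instruction or lock is needed at the clients; mutual exclusion falls out of the single owner processing its inbox in order.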
Byte-addressable non-volatile memory (NVM) blends the concepts of storage and memory and can radically improve data-centric applications, from in-memory databases to graph processing. NVM changes the nature of rack-scale systems, enabling low-latency direct memory access while retaining data persistence properties and simplifying the software stack. This paper proposes CEP (Capability Enforcement Coprocessor), a memory-side coprocessor that implements fine-grained protection through the capability model. In doing so, it opens up important performance optimization opportunities without compromising security.
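A capability, in the sense used here, bundles a memory region with the rights to access it, and every access is checked against it. The sketch below shows the general capability-model check; the field layout and names are assumptions for illustration, not CEP's actual format.

```python
# Minimal sketch of capability-style bounds and permission checking, the kind
# of enforcement a memory-side coprocessor like CEP performs. Field names and
# the permission set are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    base: int
    length: int
    perms: frozenset  # subset of {"r", "w"}

def check(cap, addr, size, op):
    """Allow an access only if it is within bounds and permitted."""
    if op not in cap.perms:
        raise PermissionError(f"capability lacks '{op}' permission")
    if addr < cap.base or addr + size > cap.base + cap.length:
        raise PermissionError("access out of capability bounds")
    return True

buf = Capability(base=0x1000, length=64, perms=frozenset({"r"}))
print(check(buf, 0x1000, 8, "r"))   # True: in-bounds read
try:
    check(buf, 0x103C, 8, "r")      # 8-byte read crossing the upper bound
except PermissionError as e:
    print("denied:", e)
```

Performing this check at the memory side, rather than in every CPU's MMU path, is what lets the host-side fast path be simplified without losing fine-grained protection.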
SIMD has been adopted for decades because of its superior efficiency. However, typical methods of migrating existing applications to another host ISA underutilize the host's SIMD parallelism, register capacity, and advanced instructions. In this paper, we present a novel binary translation technique called spill-aware SLP (saSLP), which combines ARMv8 instructions and registers to exploit the x86 host's SIMD resources. Experimental results show that saSLP improves performance by 1.6X (2.3X) across several benchmarks and reduces spilling by 97% (99%) for ARMv8-to-x86 AVX2 (AVX-512) translation. Furthermore, with AVX gather instructions, saSLP speeds up several data-irregular applications by up to 4.2X.
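The core of superword-level parallelism (SLP), which saSLP builds on, is grouping isomorphic scalar operations on adjacent data into one vector operation. The sketch below shows that grouping step on a toy IR; the instruction format and the fixed pack width are invented simplifications, and saSLP itself operates on ARMv8-to-x86 binary translation rather than this IR.

```python
# Illustrative SLP-style packing: runs of the same opcode touching consecutive
# elements are fused into a single vector instruction. The IR and the 4-lane
# pack width are hypothetical.

VECTOR_WIDTH = 4  # e.g., four 32-bit lanes of a 128-bit register (assumed)

def pack_slp(scalar_ops):
    """scalar_ops: list of (opcode, element_index) tuples, in program order.
    Returns the op list with packable runs replaced by vector ops."""
    packed, i = [], 0
    while i < len(scalar_ops):
        op, idx = scalar_ops[i]
        run = 1
        # Extend the pack while the next op is isomorphic and adjacent.
        while (run < VECTOR_WIDTH and i + run < len(scalar_ops)
               and scalar_ops[i + run] == (op, idx + run)):
            run += 1
        if run > 1:
            packed.append((f"v{op}x{run}", idx))   # one vector instruction
        else:
            packed.append((op, idx))               # left scalar
        i += run
    return packed

ops = [("add", 0), ("add", 1), ("add", 2), ("add", 3), ("mul", 4)]
print(pack_slp(ops))  # [('vaddx4', 0), ('mul', 4)]
```

Each pack also frees guest registers by holding several values in one wide host register, which is where the spill reduction comes from.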
In this paper, we propose a novel technique called Idle-Time-Aware Power Management (ITAP) to effectively reduce the static energy consumption of GPU execution units. ITAP employs three static power reduction modes with different overheads and static power reduction capabilities. ITAP estimates idle period length using prediction and look-ahead techniques in a synergistic way, and then applies the most appropriate static power reduction mode based on the estimated idle period length. Our experimental results show that ITAP outperforms the state-of-the-art solution by an average of 27.6% in terms of static energy savings, with negligible performance overhead.
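The mode-selection step can be sketched as follows. The mode names, wakeup penalties, leakage fractions, and the break-even rule are all illustrative assumptions, not ITAP's actual parameters; the point is only that a deeper mode pays off only when the predicted idle period amortizes its wakeup cost.

```python
# Hypothetical idle-time-aware mode selection in the spirit of ITAP.
# (mode name, wakeup penalty in cycles, fraction of static power still leaked)
MODES = [
    ("clock-gate",  1,   0.90),   # cheap to enter/exit, small savings
    ("light-sleep", 10,  0.40),
    ("power-gate",  100, 0.05),   # large savings, costly wakeup
]

def select_mode(predicted_idle_cycles):
    """Pick the deepest mode whose wakeup penalty is amortized over the
    predicted idle period (simple 2x break-even rule, assumed)."""
    best = None
    for name, penalty, leak in MODES:
        if predicted_idle_cycles > 2 * penalty:
            best = name               # modes are ordered shallow to deep
    return best or "stay-on"

print(select_mode(3))     # clock-gate: only the cheapest mode pays off
print(select_mode(50))    # light-sleep
print(select_mode(1000))  # power-gate: long idle amortizes the deep mode
```

Misprediction matters in both directions: overestimating idle length triggers a deep mode whose wakeup penalty hurts performance, while underestimating leaves static energy savings on the table, which is why combining prediction with look-ahead is useful.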
We present Pipelite, a dynamic scheduler that exploits the properties of dynamic linear pipelines to achieve high performance for fine-grained workloads. The flexibility of Pipelite allows the stages and their dependences to be determined at run time. Pipelite unifies communication, scheduling, and synchronization algorithms with suitable data structures. This unified design introduces the local suspension mechanism and a wait-free enqueue operation, which allow efficient dynamic scheduling. The evaluation on a 44-core machine, using programs from three widely used benchmark suites, shows that Pipelite incurs low overhead and significantly outperforms the state of the art in terms of speedup, scalability, and memory usage.
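A dynamic linear pipeline, in this sense, is one whose stage functions and their order are fixed only at run time. The toy below shows that shape with one worker per stage and FIFO queues between stages; Pipelite's actual contributions (local suspension, wait-free enqueue, its scheduling policy) are runtime-level mechanisms that this sketch does not model.

```python
import queue
import threading

# Toy dynamic linear pipeline: the stage list is chosen at run time, and each
# stage runs on its own worker thread with a queue to the next stage.

STOP = object()  # sentinel that flows through the pipeline to drain it

def run_pipeline(stages, items):
    qs = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(fn, qin, qout):
        while True:
            x = qin.get()
            if x is STOP:
                qout.put(STOP)       # propagate shutdown downstream
                return
            qout.put(fn(x))

    threads = [threading.Thread(target=worker, args=(f, qs[i], qs[i + 1]))
               for i, f in enumerate(stages)]
    for t in threads:
        t.start()
    for x in items:
        qs[0].put(x)
    qs[0].put(STOP)

    out = []
    while True:
        y = qs[-1].get()
        if y is STOP:
            break
        out.append(y)
    for t in threads:
        t.join()
    return out

# Stages decided at run time:
stages = [lambda x: x + 1, lambda x: x * 2]
print(run_pipeline(stages, range(5)))  # [2, 4, 6, 8, 10]
```

For fine-grained workloads, the per-item queue and wakeup costs in a naive design like this dominate, which is precisely the overhead Pipelite's unified communication/scheduling/synchronization design targets.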
In this paper, we explore an alternative cost function for combinatorial register-pressure-aware instruction scheduling: the Sum of Live Interval Lengths (SLIL). Unlike the classical peak cost function, which captures register pressure only at the highest-pressure point, SLIL captures register pressure at all points in the schedule. The paper describes a Branch-and-Bound (B&B) algorithm for minimizing SLIL. The algorithm is implemented in LLVM. Experimental results using SPEC CPU2006 on Intel x86 show that the proposed algorithm gives substantially less spilling and speeds up execution by up to 18% relative to LLVM's default scheduler.
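The two cost functions can be computed directly from a schedule's live intervals. The sketch below uses a toy schedule representation (sets of values defined and used per cycle) that is an assumption for illustration, not the paper's or LLVM's representation.

```python
# Illustrative computation of the Sum of Live Interval Lengths (SLIL) cost
# versus peak register pressure, on a toy schedule.

def live_intervals(schedule):
    """schedule: list of (defs, uses) per cycle.
    Returns {value: [def_cycle, last_use_cycle]}."""
    interval = {}
    for cycle, (defs, uses) in enumerate(schedule):
        for v in defs:
            interval[v] = [cycle, cycle]
        for v in uses:
            interval[v][1] = cycle
    return interval

def slil(schedule):
    # SLIL sums the length of every live interval, so it penalizes
    # register pressure at all points, not just at the peak.
    return sum(end - start for start, end in live_intervals(schedule).values())

def peak_pressure(schedule):
    ivs = live_intervals(schedule).values()
    return max(sum(1 for s, e in ivs if s <= c <= e)
               for c in range(len(schedule)))

# a = ...; b = ...; c = a + b   (a and b stay live until cycle 2)
sched = [({"a"}, set()), ({"b"}, set()), ({"c"}, {"a", "b"})]
print(slil(sched))           # (2-0) + (2-1) + (2-2) = 3
print(peak_pressure(sched))  # a, b, c all live at cycle 2 -> 3
```

Two schedules can share the same peak yet differ in SLIL, and the lower-SLIL one keeps fewer values live overall, which is why minimizing SLIL correlates better with reduced spilling.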
We propose two hardware mechanisms for reducing the frequency and penalty of on-die TLB misses. The first, Unified CAche and TLB (UCAT), enables the conventional on-die last-level cache to store cache lines and TLB entries in a single unified structure, increasing on-die TLB capacity. The second, DRAM-TLB, memoizes virtual-to-physical address translations in DRAM, reducing the on-die TLB miss penalty when UCAT is unable to fully cover the application's working-set size. Combining these two mechanisms, we propose DUCATI, an address translation architecture that improves GPU performance by 81% (up to 4.5x) while requiring minimal changes to the existing design.
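The DRAM-TLB idea amounts to a second translation level between the on-die TLB and the page-table walker. The model below is a conceptual sketch: the capacities, latencies, and toy page table are assumptions for illustration, not the paper's parameters.

```python
from collections import OrderedDict

# Conceptual two-level translation path in the spirit of DRAM-TLB: a small
# on-die TLB backed by a large translation memo stored in DRAM, with a full
# page-table walk as the last resort.

ON_DIE_CAPACITY = 4
LAT_ON_DIE, LAT_DRAM_TLB, LAT_WALK = 1, 100, 500   # cycles (assumed)

on_die = OrderedDict()   # tiny LRU on-die TLB
dram_tlb = {}            # large memo table in DRAM
page_table = {vpn: vpn + 0x1000 for vpn in range(64)}  # toy page table

def translate(vpn):
    """Return (physical page, latency in cycles) for a virtual page."""
    if vpn in on_die:                      # on-die hit
        on_die.move_to_end(vpn)
        return on_die[vpn], LAT_ON_DIE
    if vpn in dram_tlb:                    # miss, but memoized in DRAM
        ppn, lat = dram_tlb[vpn], LAT_DRAM_TLB
    else:                                  # cold miss: walk, then memoize
        ppn, lat = page_table[vpn], LAT_WALK
        dram_tlb[vpn] = ppn
    on_die[vpn] = ppn                      # refill on-die TLB with LRU eviction
    if len(on_die) > ON_DIE_CAPACITY:
        on_die.popitem(last=False)
    return ppn, lat

# A working set of 8 pages overflows the 4-entry on-die TLB: the first pass
# pays full walks, but later passes hit the DRAM memo instead of re-walking.
first = sum(translate(v)[1] for v in range(8))
second = sum(translate(v)[1] for v in range(8))
print(first, second)   # 4000 (8 walks), then 800 (8 DRAM-TLB hits)
```

The memo turns repeated on-die misses over a large working set from full-walk cost into a single DRAM access, which is where the miss-penalty reduction comes from.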