In this paper, we present key techniques for optimizing HPCG on Sunway TaihuLight and demonstrate how to achieve high performance in memory-bound applications by exploiting specific characteristics of the hardware architecture. In particular, we utilize a block multi-coloring approach for parallelization, and propose methods such as requirement-based data mapping and a customized gather collective to enhance the effective memory bandwidth. Experiments indicate that the optimized HPCG code can sustain 77% of the theoretical memory bandwidth and scale to the full system of over ten million cores, with an aggregated performance of 480.8 Tflop/s and a parallel efficiency of 87.3%.
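The block multi-coloring idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a 2D grid of blocks with 8-neighbor coupling (a simplified stand-in for a 3D stencil), and the names `color_blocks` and `schedule` are hypothetical. Blocks that share a color have no mutual dependence, so a Gauss-Seidel-style sweep could process them in parallel, one color at a time.

```python
from itertools import product


def color_blocks(nx, ny):
    """Greedily color an nx-by-ny grid of blocks (assumed 8-neighbor coupling).

    Returns a dict mapping (x, y) block coordinates to a color index such
    that no two neighboring blocks share a color.
    """
    colors = {}
    for y, x in product(range(ny), range(nx)):  # row-major visiting order
        used = {
            colors[(x + dx, y + dy)]
            for dx, dy in product((-1, 0, 1), repeat=2)
            if (dx, dy) != (0, 0) and (x + dx, y + dy) in colors
        }
        color = 0
        while color in used:  # smallest color not taken by a colored neighbor
            color += 1
        colors[(x, y)] = color
    return colors


def schedule(colors):
    """Group blocks by color; each group is one parallel phase of the sweep."""
    groups = {}
    for block, color in colors.items():
        groups.setdefault(color, []).append(block)
    return [groups[c] for c in sorted(groups)]


if __name__ == "__main__":
    for phase, blocks in enumerate(schedule(color_blocks(4, 4))):
        print(f"phase {phase}: {blocks}")
```

Coloring whole blocks rather than individual rows keeps the data inside each block contiguous, which is one plausible reason such a scheme suits a memory-bound kernel.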
We present SymGraph, a graph engine with symbolic iteration that enables dependent computations in graph processing to run in parallel by allowing an abstract symbolic value to be used in place of a concrete value when the desired data is unavailable. To maximize the potential of symbolic iteration, we also propose a set of tailored techniques that enable SymGraph to scale out efficiently for large-scale graph processing. Experimental results show that SymGraph outperforms traditional engines.
This paper proposes Benzene, an energy-efficient distributed SRAM/STT-RAM hybrid cache for manycore systems running multiple applications. It is based on the observation that a naive application of hybrid cache techniques to distributed caches in a manycore architecture suffers from limited energy reduction due to uneven utilization of scarce SRAM. We propose two-level optimization techniques: intra-bank and inter-bank. Intra-bank optimization leverages highly-associative cache design, achieving more uniform distribution of writes within a bank. Inter-bank optimization evenly balances the amount of write-intensive data across the banks. Our evaluation results show that Benzene significantly reduces energy consumption of distributed hybrid caches.
The Sunway TaihuLight supercomputer is powered by the SW26010, a new many-core processor designed with on-chip heterogeneous techniques. In this paper, we present our work on optimizing convolutional neural networks on the SW26010. We first derive a performance model based on the architecture. Guided by the model, we propose a parallel algorithm design and a systematic optimization process targeting different architecture features. Our work achieves a double-precision performance of 1.68 TFlops, which is 56% of the theoretical peak performance. Compared with cuDNN on a K40m GPU, our work achieves a 2.5x speedup and about a 24% improvement in hardware efficiency.
In this paper, we make a case for a more effective boosting strategy, which invests energy in activities with the best estimated return. In addition to running faster clocks, we can also use a look-ahead thread to overlap the penalties of cache misses and branch mispredictions. Overall, for similar power consumption, the proposed adaptive turbo boosting strategy can achieve about twice the performance benefit while halving the energy overhead.
Phase Change Memory (PCM) is the most promising candidate for use at the main memory level of the hierarchy. To cope with the limited lifetime of PCM, some extra storage per memory line is required to correct permanent hard errors (stuck-at faults). Since the extra storage is used only when permanent faults occur, its utilization remains low for a long time before hard errors start to appear. In this paper, we utilize the extra storage to improve the read/write latency in a 2-bit MLC PCM using a relaxation scheme for reading and writing cells at intermediate resistance levels.
While the paradigm shift from the planar (2D) to the vertical (3D) model can prepare NAND flash technology for next-generation storage applications, fast threshold drift can introduce charge loss in such 3D NAND cells and generate errors. In this work, we present an elastic read reference (ERR) scheme for reducing such errors, introduce an intra-block page organization scheme (hitch-hike) to provide stronger ECC for error-prone pages, and propose a reinforcement-learning-based data refill scheme (iRefill) to counter fast drift with minimum overhead. Finally, we present the first analytic model to characterize fast drift. Compared to conventional designs, our design can reduce fast-drift errors by 87% on average and significantly lower the ECC latency and energy overhead.
SIMT machines have emerged as primary computing devices in high-performance computing, since the SIMT execution paradigm can exploit data-level parallelism effectively. This paper explores the potential of SIMT execution on homogeneous multi-core processors, which generally run in MIMD mode when utilizing their multi-core resources. We address three architectural issues in enabling the SIMT execution model on multi-core processors: the multithreading execution model, kernel thread context placement, and thread divergence. Our study shows that more effective data-parallel processing reduces dynamic instructions by 36% on average and enables SIMT execution to achieve up to 5x speedup over the MIMD counterpart for OpenCL benchmarks.
Simulators are the most popular tool to study computer architecture. However, modern simulators have become prohibitively complex to fully understand and utilize. Users therefore end up analyzing and modifying only the modules of interest when performing simulations. In this paper, we propose DiagSim, an efficient and systematic method to diagnose simulators. It ensures the target modules behave as expected to perform simulation. We diagnose three popular open-source simulators and show that they have different modeling details and interactions. We observe that these differences create large performance discrepancies for an identical processor design, and illustrate that diagnosis can mitigate them.
Dennard scaling has ended. Lowering Vdd to sub-volt levels causes intermittent losses in signal integrity, rendering further scaling unacceptable. However, it is possible to correct the errors caused by lower Vdd efficiently, and thereby effectively lower power. By deploying redundancy, we can strike a balance between the overhead incurred in achieving reliability and the energy savings realized by permitting lower Vdd. We use a redundant residue number system (RRNS) to design a Computationally-Redundant, Energy-Efficient core, including the microarchitecture, ISA, and RRNS-centered algorithms. This system can reduce the energy-delay product by about 3× for multiplication-intensive workloads and by about 2× in general.
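To make the redundancy idea concrete, the following is a minimal sketch of single-error correction in a redundant residue number system. It is an illustration under stated assumptions, not the core's actual design: the moduli (3, 5, 7 as information moduli, 11 and 13 as redundant moduli) and the function names are hypothetical choices. A value is stored as its residues, multiplication proceeds residue by residue, and a corrupted residue is located by checking which reconstruction falls back into the legitimate range.

```python
from math import prod

# Assumed toy parameters: three information moduli plus two redundant moduli.
MODULI = (3, 5, 7, 11, 13)
K = 3                             # number of information moduli
LEGIT_RANGE = prod(MODULI[:K])    # valid values are 0 .. 104


def encode(x):
    """Represent x by its residue modulo every modulus."""
    return tuple(x % m for m in MODULI)


def multiply(a, b):
    """Residue-wise multiplication; each channel works independently."""
    return tuple((ra * rb) % m for ra, rb, m in zip(a, b, MODULI))


def crt(residues, moduli):
    """Chinese Remainder Theorem reconstruction for pairwise coprime moduli."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)  # modular inverse (Python 3.8+)
    return x % total


def decode(residues):
    """Return (value, index of the corrected residue or None)."""
    value = crt(residues, MODULI)
    if value < LEGIT_RANGE:
        return value, None        # all residues are consistent
    for i in range(len(MODULI)):  # drop one residue at a time
        rest_r = residues[:i] + residues[i + 1:]
        rest_m = MODULI[:i] + MODULI[i + 1:]
        candidate = crt(rest_r, rest_m)
        if candidate < LEGIT_RANGE:
            return candidate, i   # residue i was the faulty one
    raise ValueError("more than one residue is corrupted")


if __name__ == "__main__":
    product_residues = multiply(encode(9), encode(11))
    faulty = list(product_residues)
    faulty[3] = (faulty[3] + 1) % MODULI[3]   # inject a single-residue error
    print(decode(tuple(faulty)))              # (99, 3)
```

With these moduli any three of them multiply to at least 105, so a single corrupted residue can never yield a second in-range candidate. This is what makes the drop-one search unambiguous in this toy setting.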
In this paper, we perform a novel scalability analysis from the perspective of throughput utilization of various GPU components, including off-chip DRAM, multiple levels of caches, and the interconnect between L1 D-caches and L2 partitions. We show that the interconnect bandwidth is a critical bound for GPU performance scalability. For applications that do not saturate the throughput of any particular resource, performance scales with increased TLP. We propose a fast context switching approach to improve TLP for such applications. With this fine-grained fast context switching, higher TLP can be supported without increasing the sizes of critical resources.
This work introduces a lightweight persistent object framework, dubbed Scalable In-Memory Persistent Object (SIMPO), to support data persistence for high-concurrency big-data applications through optimized exploitation of NVRAM. Using optimized redo logging, we propose a deferrable programming and execution model to support efficient data persistence with zero data loss. Our model is well-suited to in-memory big data computing workloads with improved data locality and concurrency. SIMPO features a write-combining checkpointing scheme to save overheads of flushing checkpoints to NVRAM. Experimental results with various benchmarks show that SIMPO incurs less than 5% runtime overhead, and achieves 2.35x more speedup in highly-threaded situations.
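The combination of redo logging and write-combining checkpoints can be pictured with a toy model. This sketch is not SIMPO's API: the class `RedoLogStore`, its methods, and the interpretation of write combining as "keep only the last logged value per key" are all assumptions made for illustration, and an ordinary dictionary stands in for NVRAM.

```python
class RedoLogStore:
    """Toy key-value store with a redo log and write-combining checkpoints."""

    def __init__(self):
        self.persistent = {}  # stands in for NVRAM-resident data (assumption)
        self.log = []         # redo records, oldest first: (key, new_value)
        self.flushes = 0      # counts writes applied to the persistent copy

    def write(self, key, value):
        """Append a redo record; the persistent copy is not touched yet."""
        self.log.append((key, value))

    def read(self, key):
        """Return the newest logged value, falling back to persistent data."""
        for logged_key, value in reversed(self.log):
            if logged_key == key:
                return value
        return self.persistent.get(key)

    def checkpoint(self):
        """Apply the log, combining repeated writes to the same key."""
        combined = dict(self.log)  # later records override earlier ones
        for key, value in combined.items():
            self.persistent[key] = value
            self.flushes += 1
        self.log.clear()
        return len(combined)


if __name__ == "__main__":
    store = RedoLogStore()
    for key, value in [("a", 1), ("a", 2), ("b", 3), ("a", 4)]:
        store.write(key, value)
    print(store.checkpoint(), store.persistent)
```

In this model four logged writes collapse into two persistent updates, which conveys the kind of flushing overhead a write-combining checkpoint aims to save.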
Trace-driven simulation of chip multi-processors (CMPs) offers many advantages over execution-driven simulation, such as reduced simulation time and complexity, portability, and scalability. However, trace-based simulation approaches have had difficulty capturing and accurately replaying multi-threaded traces due to the inherent non-determinism in the execution of multi-threaded programs. In this work, we present SynchroTrace, a scalable, flexible, and accurate trace-based multi-threaded simulation methodology for fast design space exploration of CMPs. Our trace-based approach achieves a peak speedup of 15.7X over full-system simulation in gem5, with an average speedup of about 8X, and scales efficiently up to 64 threads.
Modern data centers increasingly employ FPGA-based heterogeneous acceleration platforms because of their great potential for continued performance and energy-efficiency gains. Today, FPGAs provide more hardware parallelism than is possible with GPUs or CPUs, while C-like programming environments shorten development time. In this work, we address limitations and overheads in accessing and transferring data to accelerators over PCIe. Additionally, matching accelerators' computing capacity and requirements, such as performance per watt or bandwidth, to the diversity of workloads and to servers' performance or low-energy requirements is an equally important factor, which we address at the architectural level.
Several cache partitioning schemes have been proposed for multi-cores with monolithic LLCs to optimize energy and performance metrics, primarily by minimizing interference between applications competing for the shared LLC. This work presents DarkCache, a novel cache partitioning scheme for tiled LLC architectures that optimizes the EDP by dynamically turning off unused LLC banks. The proposed architecture is application-independent and allows LLC reconfiguration without blocking application execution. Results for both 16- and 64-core architectures using the complete Splash2x suite show an average EDP improvement of up to 38% with a performance degradation limited to 2%.
Today's hardware transactional memory (HTM) systems rely on existing coherence protocols, which implement a requester-wins strategy. This, in turn, leads to poor performance when transactions frequently conflict, causing them to resort to a non-speculative fallback path. Often, such a path severely limits parallelism. In this paper, we propose very simple architectural changes to the existing requester-wins HTM implementations that enhance conflict resolution between hardware transactions and thus improve their parallelism. Our idea is backward-compatible with existing HTM systems, requires no changes to target applications that employ traditional lock synchronization, and is shown to provide robust performance benefits.
Parallelism is one of the key performance sources in modern computer systems. When heuristics-based automatic parallelization fails to improve performance, a cumbersome and error-prone manual transformation is often required. As a solution, we propose an interactive visual approach building on the polyhedral model that visualizes exact dependences and parallelism; decomposes and replays a complex automatically-computed transformation step by step; and allows for directly manipulating the visual representation as a means of transforming the program with immediate feedback. User studies suggest that our visualization is understood by experts and non-experts alike, and that it may favor an exploratory approach.
The diversity and complexity of modern computing platforms make the development of scalable and portable software challenging. Tuning the amount of work per task can balance parallelism and scheduling overheads, but cannot easily tune for memory bandwidth or avoid inter-task interference. We propose a complementary approach, tuning the amount of resources allocated to tasks, and combine it with software-defined task topologies to provide portable locality. These ideas are combined into a low-overhead resource management scheme called Elastic Places. Experimental results show that Elastic Places provides both high scalability and performance portability, with speed-ups of up to 2.3x compared to state-of-the-art runtimes.
CGRAs excel at exploiting loop-level parallelism at a high performance-per-watt ratio, yet 25-45% of the consumed energy is spent on the instruction memory and on fetches from it. This article presents a hardware/software co-design methodology that reduces the energy consumed by the instruction decode logic by 60%. The hardware modifications improve the spatial organization of code by reorganizing the configuration memory into separate partitions based on a statistical analysis. A compiler technique optimizes code in the temporal dimension by minimizing the number of signal changes. These optimizations enable a code size reduction of 55% for different application domains.
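The quantity that the temporal optimization targets, the number of signal changes between consecutive configuration words, can be measured with a short sketch. This only illustrates the metric, not the article's compiler technique: the function names and the two example streams are hypothetical, and the code assumes each configuration word is an unsigned integer whose bits map one-to-one onto control signals.

```python
def toggles(previous, current):
    """Number of bit positions that differ between two configuration words."""
    return bin(previous ^ current).count("1")


def total_toggles(words):
    """Sum of signal changes over a sequence of configuration words."""
    return sum(toggles(a, b) for a, b in zip(words, words[1:]))


if __name__ == "__main__":
    # Two hypothetical 4-bit configuration streams of the same length.
    stream_a = [0b1010, 0b0101, 0b1011, 0b0100]
    stream_b = [0b1010, 0b1011, 0b0101, 0b0100]
    print(total_toggles(stream_a))  # 11
    print(total_toggles(stream_b))  # 5
```

A compiler that is free to choose among equivalent encodings or orderings could use a cost function of this kind to prefer the stream with fewer toggles.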