Ensuring fairness in a system with scarce and commonly preferred resources requires time sharing. We consider a heterogeneous system with a few ``big'' and many ``small'' processors. We allocate heterogeneous processors using a novel token mechanism that supports game-theoretic notions of fairness. The mechanism frames the allocation problem as a repeated game. In each round of the game, users request big processors and spend a token if their request is granted. We formulate game dynamics and optimize users' strategies to produce an equilibrium. Our token mechanism outperforms classical fair mechanisms by 1.7x and is competitive with a performance-maximizing mechanism.
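The round structure described above can be pictured as a toy simulation. This is a minimal sketch under stated assumptions: the user names, token budgets, the request-iff-token strategy, and the random tie-breaking are all illustrative and are not the paper's actual mechanism or equilibrium strategies.

```python
import random

def play_round(tokens, num_big, rng):
    """One round of the (hypothetical) token game: users holding at
    least one token request a big processor; up to num_big requests
    are granted, and each granted user spends one token."""
    # Assumed strategy: request a big processor iff a token remains.
    requesters = [u for u, t in tokens.items() if t > 0]
    rng.shuffle(requesters)           # assumed random tie-breaking
    granted = requesters[:num_big]    # scarce big processors
    for u in granted:
        tokens[u] -= 1                # spend a token on a granted request
    return granted

rng = random.Random(0)
tokens = {"u1": 2, "u2": 1, "u3": 0}       # hypothetical token budgets
granted = play_round(tokens, num_big=1, rng=rng)
```

Here only one big processor exists, so exactly one token-holding user is granted per round; users without tokens fall back to small processors.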
High-order stencil computations, frequently found in many applications, pose severe challenges to emerging many-core platforms due to the complexities of hardware architectures as well as the sophisticated computing and data movement patterns. In this work, we tackle the challenge of high-order WENO computations in extreme-scale simulations of 3-D gaseous waves on Sunway TaihuLight. We design efficient parallelization algorithms and present effective optimization techniques to fully exploit various parallelisms with reduced memory footprint, enhanced data reuse, and balanced computation load. Test results show the optimized code can scale to 9.98 million cores, solving 12.74 trillion unknowns with 23.12 Pflops double-precision performance.
While the paradigm shift from planar (2D) to vertical (3D) structures can prepare NAND-flash technology for next-generation storage applications, fast threshold drift can introduce charge loss in such 3D-NAND cells and generate errors. In this work, we present an elastic read reference (ERR) for reducing such errors, introduce an intra-block page organization scheme (hitch-hike) to provide stronger ECC for error-prone pages, and propose a reinforcement-learning-based data refill scheme (iRefill) to counter fast drift with minimum overhead. Finally, we present the first analytic model to characterize fast drift. Compared to conventional designs, our design reduces fast-drift errors by 87% on average and lowers the ECC latency/energy overhead significantly.
Thread coarsening on GPUs combines the work of several threads into one. We present a method for estimating the optimal coarsening factor, based on a low-cost, approximate analysis of cache-line re-use and a performance prediction model based on occupancy. An LLVM-based tool chain is also described that implements this model-directed coarsening optimisation for two different coarsening strategies. The results obtained on three different NVIDIA GPU architectures show that achieving speedups of up to 4.85x can thus be automated and that cache-line re-use analysis avoids coarsening in around 89% of cases that would otherwise lead to worse performance.
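As a language-agnostic illustration (not the paper's LLVM implementation), coarsening by a factor of 2 can be pictured as one "thread" producing two adjacent outputs, amortizing per-thread launch and index-computation overhead. The vector-add kernel below is a hypothetical example; real GPU coarsening operates on compiled kernels, and profitability depends on occupancy and cache-line re-use as the abstract notes.

```python
def kernel(tid, a, b, out):
    """Original fine-grained 'thread': one output element per thread."""
    out[tid] = a[tid] + b[tid]

def kernel_coarsened(tid, a, b, out, factor=2):
    """Coarsened 'thread': one thread produces `factor` adjacent
    elements, so half as many threads cover the same work."""
    base = tid * factor
    for i in range(factor):
        out[base + i] = a[base + i] + b[base + i]

n = 8
a = list(range(n))
b = list(range(n))
out_fine = [0] * n
out_coarse = [0] * n
for t in range(n):            # n fine-grained threads
    kernel(t, a, b, out_fine)
for t in range(n // 2):       # n/2 coarsened threads, factor 2
    kernel_coarsened(t, a, b, out_coarse)
assert out_fine == out_coarse  # same result, fewer 'threads'
```

The trade-off the model captures is that larger factors reduce parallelism (and thus occupancy) while increasing per-thread data re-use, so the best factor is workload- and architecture-dependent.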
This paper introduces DDRNoC, an on-chip interconnection network able to route packets at Dual Data Rate. The cycle time of current 2D-mesh Network-on-Chip routers is limited by their control logic rather than the datapath, which exhibits significant slack. DDRNoC capitalizes on this observation, allowing two flits per cycle to share the same datapath and thereby improving throughput. Alternatively, using lower-voltage circuits, the above slack can be exploited to reduce power consumption while matching SDR throughput. In addition, DDRNoC exhibits reduced clock distribution power, improving energy efficiency, as it needs a slower clock than an SDR network that routes packets at the same rate.
ReRAM is a promising candidate to replace DRAM. In this paper, we propose a novel circuit-architecture co-optimization framework for improving the performance, reliability, and energy of ReRAM-based main memory systems. At the circuit level, we propose a novel double-sided write driver design to reduce the IR drops along bitlines. At the circuit-architecture level, we propose a RESET disturbance detection scheme that solves the write disturbance problem with high reliability and low overhead. At the architecture level, a region partition with address remapping and two flip schemes are proposed to improve the access latency, reliability, and energy of ReRAM arrays.
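As one hedged illustration of what a flip scheme can accomplish, the sketch below implements a Flip-N-Write-style encoding, assumed here purely for illustration; the paper's two flip schemes may differ in detail. The idea is that if a write would change more than half of a word's bits, the complement is stored together with a flip flag, bounding the number of cell updates (and hence costly RESET operations) per write.

```python
def flip_encode(old_word, new_word):
    """Store new_word over old_word, flipping all bits when that
    touches fewer cells (a Flip-N-Write-style idea, assumed here)."""
    n = len(new_word)
    changed = sum(o != w for o, w in zip(old_word, new_word))
    if changed > n // 2:
        return [1 - b for b in new_word], 1  # flipped data, flag set
    return list(new_word), 0                 # data as-is, flag clear

def flip_decode(stored, flag):
    """Recover the logical word from the stored bits and flip flag."""
    return [1 - b for b in stored] if flag else list(stored)

old = [0, 0, 0, 0]
new = [1, 1, 1, 0]  # writing directly would change 3 of 4 cells
stored, flag = flip_encode(old, new)
# Storing the complement changes only 1 data cell (plus the flag cell),
# and decoding recovers the original word.
assert flip_decode(stored, flag) == new
```

Capping the number of changed cells per write is useful in ReRAM both for energy and for limiting write disturbance to neighboring cells.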
Convolutional Neural Networks (CNNs) have become increasingly popular for their capability in pattern classification. However, modern CPU architectures employ NUMA to integrate multiple sockets, which incurs unique challenges for designing highly efficient CNN frameworks. To this end, we propose a NUMA-aware, multi-solver-based CNN design, named NUMA-Caffe, for accelerating deep neural networks on multi- and many-core CPU architectures. Through a thorough empirical study on four contemporary NUMA-based multi- and many-core architectures, our experimental results demonstrate that NUMA-Caffe significantly outperforms the state-of-the-art Caffe designs in terms of both throughput and scalability.
Memory efficiency has become critically important for a wide range of computing domains. It is difficult to control the distribution and usage of memory power because these effects depend upon activity across the vertical execution stack. To address this challenge, we construct a novel framework that employs object placement, cross-layer communication, and page-level management to effectively distribute application objects in the DRAM hardware to achieve desired power/performance goals. This work describes the design and implementation of our framework, which is the first to integrate automatic profiling and analysis of application data with fine-grained management of memory hardware in the OS.
Several cache partitioning schemes have been proposed for multi-cores that implement monolithic LLCs, optimizing energy and performance metrics primarily by minimizing interference between applications competing for the shared LLC. This work presents DarkCache, a novel cache partitioning scheme for tiled LLC architectures that optimizes the EDP by dynamically turning off unused LLC banks. The proposed architecture is application independent and allows LLC reconfiguration without blocking application execution. Results on both 16- and 64-core architectures using the complete Splash2x suite show an average EDP improvement of up to 38% with a performance degradation limited to 2%.
The diversity and complexity of modern computing platforms make the development of scalable and portable software challenging. Tuning the amount of work per task can balance parallelism and scheduling overheads, but cannot easily tune for memory bandwidth or avoid inter-task interference. We propose a complementary approach, tuning the amount of resources allocated to tasks, and combine it with software-defined task topologies to provide portable locality. These ideas are combined into a low-overhead resource management scheme called Elastic Places. Experimental results show that Elastic Places provides both high scalability and performance portability, with speed-ups of up to 2.3x compared to state-of-the-art runtimes.