Three-dimensional stacking technology and the memory-wall problem have popularized processing-in-memory (PIM), which offers bandwidth and energy savings by offloading computation to memory. Although industry prototypes such as the Hybrid Memory Cube (HMC) have motivated studies of efficient methods and architectures for PIM, researchers have not yet proposed a systematic way to identify the benefits of instruction-level PIM offloading. In this paper, we analyze the advantages of instruction-level PIM offloading in the context of HMC for graph-computing applications and propose CAIRO, a compiler-assisted technique and decision model that enables instruction-level PIM offloading without any burden on programmers.
We present SymGraph, a judicious graph engine with symbolic iteration, which enables dependent computations to run in parallel alongside the embarrassingly-parallel portions of graph computation by operating on abstract symbolic values (instead of concrete values) when the desired data is unavailable. To maximize the potential of symbolic iteration, we also propose a chain of tailored techniques that enable SymGraph to scale out with a new level of efficiency for large-scale graph processing. Experimental results show that SymGraph outperforms traditional graph engines.
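The core idea of symbolic iteration can be illustrated with a minimal sketch. The names below (`Sym`, `rank_update`, `resolve`) are hypothetical and only illustrate the concept, not SymGraph's actual API: a vertex update proceeds immediately with whatever inputs are available, leaving a symbolic placeholder for each missing value, and the placeholder is substituted once the remote data arrives.

```python
# Illustrative sketch of symbolic iteration (hypothetical names, not
# SymGraph's real interface).
class Sym:
    """Placeholder for a neighbor value that is not yet available."""
    def __init__(self, vertex):
        self.vertex = vertex

def rank_update(neighbor_values):
    # Start a sum-style vertex update without waiting for remote data:
    # sum the concrete inputs now, keep the symbolic ones unresolved.
    concrete = [v for v in neighbor_values if not isinstance(v, Sym)]
    symbols = [v for v in neighbor_values if isinstance(v, Sym)]
    return (sum(concrete), symbols)   # partial sum + unresolved terms

def resolve(partial_result, arrived):
    # Once the remote values arrive, substitute them for the symbols
    # to obtain the final concrete result.
    partial, symbols = partial_result
    return partial + sum(arrived[s.vertex] for s in symbols)
```

In this sketch the dependent computation (`rank_update`) no longer blocks on unavailable data, which is what allows it to run in parallel with the rest of the iteration.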
3D memory stacking technology, which integrates a logic layer with stacked memory, is expected to mitigate the memory-wall problem by leveraging near-memory processing (NMP). Previous studies have focused on accelerating and offloading specific kernel operations, but the limited functionality of the NMP logic architecture can degrade performance and energy efficiency by increasing the communication overhead with the host processor. In this study, we propose a Triple Engine Processor (TEP) that can efficiently process various kernel operations. The proposed TEP achieves about a 3.4x speedup over the baseline system.
In this paper, we make a case for a more effective boosting strategy that invests energy in the activities with the best estimated return. In addition to running faster clocks, we can also use a look-ahead thread to overlap the penalties of cache misses and branch mispredictions. Overall, at similar power consumption, the proposed adaptive turbo boosting strategy achieves about twice the performance benefit while halving the energy overhead.
In this work we provide an efficient, portable, and robust programming framework, based on the Data-Driven Multithreading (DDM) model of execution, that enables data-driven concurrency on HPC systems. The proposed framework has been evaluated using a suite of eight benchmarks with different characteristics on two different systems: a 4-node AMD system with a total of 128 cores and a 64-node Intel HPC system with a total of 768 cores. Our evaluation shows that the proposed system scales well and achieves comparable or better performance when compared with other systems such as MPI/ScaLAPACK, SWARM, and DDM-VM.
Integrated Heterogeneous System (IHS) processors pack throughput-oriented GPGPUs alongside latency-oriented CPUs on the same die, sharing certain resources. We propose adding a large-capacity stacked DRAM, used as a shared last-level cache, to IHS processors. However, adding the DRAMCache naively leaves significant performance on the table due to the disparate demands of CPU and GPU cores. We propose three techniques to enhance the performance of IHS processors: (i) PrIS, a heterogeneity-aware DRAMCache scheduler; (ii) ByE, a heterogeneity-aware temporal bypassing mechanism; and (iii) Chaining, a heterogeneity-aware occupancy-controlling mechanism. Together, these techniques (HAShCache) result in an average system performance improvement of 41% over a naive DRAMCache.
This work introduces a lightweight persistent object framework, dubbed Scalable In-Memory Persistent Object (SIMPO), to support data persistence for high-concurrency big-data applications through optimized exploitation of NVRAM. Using optimized redo logging, we propose a deferrable programming and execution model that supports efficient data persistence with zero data loss. Our model is well-suited to in-memory big-data computing workloads, with improved data locality and concurrency. SIMPO features a write-combining checkpointing scheme that reduces the overhead of flushing checkpoints to NVRAM. Experimental results with various benchmarks show that SIMPO incurs less than 5% runtime overhead and achieves a 2.35x speedup in highly threaded situations.
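The redo-logging principle behind such frameworks can be sketched briefly. This is a generic illustration of redo logging, not SIMPO's actual design: the update intent is appended to a persistent log before the in-place state is modified, so that after a crash the log can be replayed to reconstruct every completed update. NVRAM durability barriers (cache-line flushes) are modeled here as a comment.

```python
# Generic redo-logging sketch (illustrative; not SIMPO's real API).
class RedoLogStore:
    def __init__(self):
        self.log = []      # stands in for the NVRAM-resident redo log
        self.data = {}     # stands in for the in-memory object state

    def put(self, key, value):
        self.log.append((key, value))   # 1. persist the intent first
        # flush_to_nvram(...)           # durability barrier would go here
        self.data[key] = value          # 2. then apply the update in place

    def recover(self):
        # Replaying the redo log in order reconstructs the state after a
        # crash, so no logged update is lost.
        recovered = {}
        for key, value in self.log:
            recovered[key] = value
        return recovered
```

A write-combining checkpoint, as described in the abstract, would periodically fold this log into a compact snapshot so that repeated writes to the same object are flushed to NVRAM only once.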
Shared-memory machines increase in scale by adding more parallelism through additional cores, memory, and bandwidth. Often, executing multiple applications concurrently provides greater efficiency than executing an individual application with large thread counts. However, contention for shared resources can limit the improvements from concurrency. In this paper we contribute SCALO, a solution that orchestrates concurrent application execution to increase throughput. SCALO monitors co-executing applications at runtime, evaluates their scalability, and adapts the parallelism of each program. Unlike previous approaches, SCALO reflects dynamic contention effects and controls parallelism during the execution of parallel regions. Thus, it significantly outperforms previous state-of-the-art approaches.
Trace-driven simulation of chip multiprocessors (CMPs) offers many advantages over execution-driven simulation, such as reduced simulation time and complexity, portability, and scalability. However, trace-based simulation approaches have had difficulty capturing and accurately replaying multi-threaded traces due to the inherent non-determinism in the execution of multi-threaded programs. In this work, we present SynchroTrace, a scalable, flexible, and accurate trace-based multi-threaded simulation methodology for fast design space exploration of CMPs. Our trace-based approach has a peak speedup of 15.7x over gem5 full-system simulation, with an average speedup of about 8x, and efficiently scales up to 64 threads.
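One way to tame the non-determinism mentioned above is to record synchronization dependencies in the trace and honor them at replay time. The sketch below is a hypothetical illustration of that general idea, not SynchroTrace's actual trace format: each event carries the indices of events that must complete first (for example, a lock release before the matching acquire), and the replayer emits events only when their dependencies are satisfied.

```python
# Illustrative dependency-ordered replay of a multi-threaded trace
# (hypothetical event format, not SynchroTrace's).
def replay(events):
    # Each event: (thread, op, deps), where deps lists indices of events
    # that must complete before this one may execute.
    done = set()
    order = []
    pending = list(range(len(events)))
    while pending:
        for idx in pending:
            if all(d in done for d in events[idx][2]):
                order.append(idx)
                done.add(idx)
                pending.remove(idx)
                break
        else:
            raise RuntimeError("cyclic dependency in trace")
    return order
```

Because only the recorded dependencies constrain the order, independent events from different threads remain free to interleave, which is what lets a trace-based simulator explore timing without violating correctness.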
Several architectural trends put into question the overall design of cache coherence in multicore chips. First, as the number of cores scales up, the static power and area costs of directories also increase. Second, limited power budgets either restrict the operating frequency and voltage of the system or require disabling parts of the chip to stay within a fixed power budget. We argue for a different approach that leverages underprovisioned, reconfigurable directories. Our directories are used only in the presence of sharing, and otherwise can be disabled to save power or reused as cache to speed up cache-sensitive workloads.
Multicore systems have become increasingly powerful, and thereby very useful in high-performance computing. However, many applications still cannot take full advantage of these systems, mainly due to a shortage of optimization techniques for irregular control structures. In particular, the well-known polyhedral model fails to optimize loop nests that handle sparse matrices in their packed formats. In this paper, we propose using 2d-packed layouts and simple affine transformations to enable the optimization of triangular and banded matrix operations. The benefit of our proposal is demonstrated through an experimental study over a set of linear algebra benchmarks.
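To make the packed-format setting concrete, the classical 1D packed layout for a lower-triangular matrix stores only the nonzero triangle, addressed by an affine-in-structure index; the 2d-packed layouts proposed in the abstract generalize this kind of mapping to two dimensions. The sketch below shows only the standard 1D packing and its index formula, as a baseline illustration, not the paper's layout.

```python
def pack_lower(A):
    # Pack the lower triangle of a square matrix A (list of rows) into a
    # dense 1D array, dropping the zero upper triangle.
    n = len(A)
    return [A[i][j] for i in range(n) for j in range(i + 1)]

def packed_index(i, j):
    # Offset of lower-triangular element (i, j), j <= i, in the packed
    # array: rows 0..i-1 contribute 1 + 2 + ... + i = i*(i+1)/2 entries,
    # then j more within row i.
    return i * (i + 1) // 2 + j
```

The quadratic term in `packed_index` is exactly what puts such loop nests outside the affine accesses the polyhedral model can analyze, which motivates layouts whose index functions stay affine.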
CGRAs excel at exploiting loop-level parallelism at a high performance-per-watt ratio, yet 25-45 percent of the consumed energy is spent on the instruction memory and fetches from it. This article presents a hardware/software co-design methodology that reduces the energy consumed by the instruction decode logic by 60%. The hardware modifications improve the spatial organization of code by reorganizing the configuration memory into separate partitions based on a statistical analysis. A compiler technique optimizes code in the temporal dimension by minimizing the number of signal changes. These optimizations enable a code-size reduction of 55% across different application domains.
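The temporal optimization mentioned above rests on a simple observation: dynamic energy in the decode logic scales with the number of bits that toggle between consecutive configuration words. A minimal sketch of that idea, using a greedy nearest-neighbor reordering by Hamming distance, is shown below; it is only an illustration of the objective, not the paper's actual compiler algorithm.

```python
def toggle_count(seq):
    # Total number of bit flips between consecutive configuration words;
    # a proxy for dynamic switching energy in the decode logic.
    return sum(bin(a ^ b).count("1") for a, b in zip(seq, seq[1:]))

def greedy_min_toggle_order(words):
    # Greedily pick the next word with the fewest bit flips relative to
    # the previous one (illustrative only; assumes the words may be
    # freely reordered, which a real compiler must prove).
    remaining = list(words)
    order = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda w: bin(order[-1] ^ w).count("1"))
        remaining.remove(nxt)
        order.append(nxt)
    return order
```

For example, reordering `[0b0000, 0b1111, 0b0001]` to place the nearly identical words adjacently cuts the toggle count from 7 to 4.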