Emerging non-volatile memories (NVMs) suffer from low write endurance, resulting in early cell failures (hard errors) that reduce memory lifetime. This paper proposes error-correcting strings (ECS), which adopt a base-offset approach to store pointers to failed memory cells. Unlike fixed-length error-correcting pointers (ECP), ECS uses variable-length offsets to point to failed cells, thereby fitting more pointers and tolerating more hard errors per memory block. Furthermore, this paper proposes eXtended-ECS (XECS), a page-level error-correction architecture that employs dynamic on-demand ECS allocation and opportunistic pattern-based data compression to improve NVM lifetime with negligible impact on system performance.
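The base-offset idea can be illustrated with a small sketch. All bit widths below are illustrative assumptions, not the paper's exact encoding: fixed-length ECP spends a full cell address per failure, while the base-offset scheme spends one full base address plus short variable-length offsets for subsequent failures, which tend to cluster.

```python
def ecp_bits(fails, addr_bits=9):
    # fixed-length pointers: one full cell address per failed cell
    return len(fails) * addr_bits

def ecs_bits(fails, addr_bits=9, len_field=3):
    # base-offset sketch: a full pointer to the first failed cell (the
    # base), then variable-length offsets to each subsequent failure;
    # the len_field prefix records how many offset bits follow
    fails = sorted(fails)
    bits = addr_bits  # the base pointer
    for prev, cur in zip(fails, fails[1:]):
        delta = cur - prev  # failures cluster, so deltas stay small
        bits += len_field + delta.bit_length()
    return bits
```

For three clustered failures in a 512-cell block (9-bit addresses), the variable-length encoding already wins, which is how ECS packs more pointers into the same error-correction budget.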
Three-dimensional stacking technology and the memory-wall problem have popularized processing-in-memory (PIM), which offers bandwidth and energy savings by offloading computations to the memory. Although industry prototypes such as the Hybrid Memory Cube (HMC) have motivated studies investigating efficient methods and architectures for PIM, researchers have not proposed a systematic way to identify the benefits of instruction-level PIM offloading. In this paper, we analyze the advantages of instruction-level PIM offloading in the context of HMC for graph-computing applications and propose CAIRO, a compiler-assisted technique and decision model that enables instruction-level PIM offloading without any burden on programmers.
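A decision model of this flavor can be sketched as a simple cost comparison. The rule and the cycle numbers below are illustrative assumptions, not CAIRO's actual model:

```python
def should_offload(miss_ratio, hit_cycles, miss_cycles, pim_cycles):
    # offload a candidate instruction to memory-side logic only when
    # its expected host-side latency, weighted by the cache miss
    # ratio, exceeds the cost of executing it near memory
    expected_host = miss_ratio * miss_cycles + (1 - miss_ratio) * hit_cycles
    return pim_cycles < expected_host
```

Graph workloads with poor locality (high `miss_ratio`) tip the comparison toward offloading, which matches the paper's focus on graph-computing applications.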
3D memory stacking technology, which integrates a logic layer with the stacked memory, is expected to mitigate the memory-wall problem by leveraging the concept of near-memory processing (NMP). Previous studies have focused on acceleration and offloading of specific kernel operations, but the limited functionality of NMP logic architectures can degrade performance and energy efficiency by increasing the communication overhead with the host processor. In this study, we propose a Triple Engine Processor (TEP), which can efficiently process various kernel operations. The proposed TEP achieves about 3.4× speedup over the baseline system.
This paper addresses the automated protection of loops with complex control- and data-flow patterns at compilation time. The security property we consider is that a sensitive loop must always perform the expected number of iterations; otherwise an attack must be reported. We propose a generic and portable compile-time loop hardening scheme and also investigate how to preserve the security property along the compilation flow while enabling aggressive optimizations. On average, the compiler automatically hardens 95% of the sensitive loops of typical security benchmarks, and 97% of simulated faults are detected. Performance and code-size overheads remain quite affordable.
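The security property (the loop performs exactly its expected number of iterations) can be sketched at source level with a redundant counter. This is a simplified stand-in for the paper's compile-time scheme, which inserts such checks automatically:

```python
def hardened_sum(data):
    # redundant-counter loop hardening (assumed simplification):
    # a shadow counter is incremented alongside the loop body and
    # checked against the expected trip count after the loop exits;
    # a fault that skips or replays iterations trips the check
    expected = len(data)
    redundant = 0
    total = 0
    for x in data:
        total += x
        redundant += 1
    if redundant != expected:
        raise RuntimeError("fault detected: unexpected iteration count")
    return total
```

A fault-injection attack that forces early loop exit would leave `redundant` below `expected`, triggering the report.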
In this work we provide an efficient, portable, and robust programming framework, based on the Data-Driven Multithreading (DDM) model of execution, that enables data-driven concurrency on HPC systems. The proposed framework has been evaluated using a suite of eight benchmarks with different characteristics on two different systems: a 4-node AMD system with a total of 128 cores and a 64-node Intel HPC system with a total of 768 cores. Our evaluation shows that the proposed system scales well and achieves comparable or better performance when compared with other systems, such as MPI/ScaLAPACK, SWARM, and DDM-VM.
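The data-driven firing rule underlying DDM can be sketched as a toy scheduler over an acyclic dependency graph; the real runtime distributes this bookkeeping across nodes and cores rather than running it sequentially:

```python
def ddm_schedule(deps):
    # toy data-driven scheduler: deps maps each thread to the list of
    # threads producing its inputs; a thread fires once its ready
    # count (number of unfinished producers) drops to zero
    pending = {t: set(p) for t, p in deps.items()}
    order = []
    while pending:
        ready = sorted(t for t, p in pending.items() if not p)
        if not ready:
            raise ValueError("cyclic dependencies")
        for t in ready:
            order.append(t)
            del pending[t]
        for p in pending.values():
            p.difference_update(ready)
    return order
```

Because threads are released by data availability rather than barriers, independent threads in different "waves" can overlap, which is the source of DDM's latency tolerance.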
Integrated Heterogeneous System (IHS) processors pack throughput-oriented GPGPUs alongside latency-oriented CPUs on the same die, sharing certain resources. We propose adding a large-capacity stacked DRAM, used as a shared last-level cache, to IHS processors. However, adding the DRAMCache naively leaves significant performance on the table due to the disparate demands of CPU and GPU cores. We propose three techniques to enhance the performance of IHS processors: (i) PrIS, a heterogeneity-aware DRAMCache scheduler; (ii) ByE, a heterogeneity-aware temporal bypassing mechanism; and (iii) Chaining, a heterogeneity-aware occupancy-controlling mechanism. The resulting design, HAShCache, achieves an average system performance improvement of 41% over a naive DRAMCache.
Most systems allocate computational resources to each executing task without any actual knowledge of the application's Quality-of-Service (QoS) requirements. Such best-effort policies lead to over-provisioning of resources and increased energy consumption. This work assumes applications with soft QoS requirements and exploits the inherent timing slack to minimize the allocated computational resources and thereby reduce energy consumption. We propose a lightweight progress-tracking methodology based on the outer loops that enclose the application's kernel; it builds an online history that our proposed novel predictors use to estimate the total execution time.
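Tracking progress at the granularity of the kernel's outer loop can be sketched as follows; this naive mean-based extrapolation is an illustrative placeholder for the paper's more sophisticated predictors:

```python
def predict_total_time(iter_times, total_iters):
    # online history of outer-loop iteration durations observed so
    # far; extrapolate the total execution time from their mean
    mean = sum(iter_times) / len(iter_times)
    return mean * total_iters
```

The gap between the predicted finish time and the soft QoS deadline is the timing slack available for scaling down resources on the remaining iterations.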
Shared-memory machines increase in scale by adding more parallelism through additional cores, memory, and bandwidth. Often, executing multiple applications concurrently provides greater efficiency than executing an individual application with large thread counts. However, contention for shared resources can limit the improvements from concurrency. In this paper we contribute SCALO, a solution that orchestrates concurrent application execution to increase throughput. SCALO monitors co-executing applications at runtime, evaluates their scalability, and adapts the parallelism of each program. Unlike previous approaches, SCALO reflects dynamic contention effects and controls parallelism during the execution of parallel regions. Thus, it significantly outperforms previous state-of-the-art approaches.
Compression techniques at the last-level cache (LLC) and in DRAM play an important role in improving system performance by increasing their effective capacities. Applications exhibit data locality that spreads across multiple consecutive data blocks. We observe a significant opportunity to compress multiple consecutive data blocks into one single block, both at the LLC and in DRAM, and propose a mechanism (MBZip) to achieve this. Further, we also explore silent writes in DRAM and show that certain writes need not access the memory when blocks are zipped.
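Both observations can be sketched in a few lines; here `zlib` stands in for the hardware compressor, and the 64-byte block size is an assumption:

```python
import zlib

BLOCK_BYTES = 64  # assumed cache/DRAM block size

def zippable(blocks):
    # consecutive blocks can be zipped into one physical block when
    # their combined compressed payload fits a single block
    return len(zlib.compress(b"".join(blocks))) <= BLOCK_BYTES

def is_silent_write(stored: bytes, incoming: bytes) -> bool:
    # a write of data identical to what is already stored need not
    # touch the DRAM array at all
    return stored == incoming
```

Highly redundant neighbors (e.g., zero-filled blocks) zip easily, while high-entropy data does not, which is why the mechanism is opportunistic.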
Trace-driven simulation of chip multiprocessors (CMPs) offers many advantages over execution-driven simulation, such as reduced simulation time and complexity, as well as portability and scalability. However, trace-based simulation approaches have difficulty capturing and accurately replaying multi-threaded traces due to the inherent non-determinism in the execution of multi-threaded programs. In this work, we present SynchroTrace, a scalable, flexible, and accurate trace-based multi-threaded simulation methodology for fast design-space exploration of CMPs. Our trace-based approach achieves a peak speedup of up to 15.7X over gem5 full-system simulation, with an average speedup of about 8X, and efficiently scales up to 64 threads.
The recent evolution of the hardware landscape, aimed at producing high-performance computing systems capable of reaching extreme-scale performance, has reignited interest in fine-grain multithreading, particularly at the intra-node level. Indeed, popular programming models such as OpenMP, which features a simple interface for the parallelization of programs, are now incorporating fine-grain constructs. However, since coarse-grain directives are still heavily used, sometimes in conjunction with fine-grain constructs, the OpenMP runtime is forced to support both models of execution at the same time, potentially reducing the advantages obtained when executing an application in a fully fine-grain environment.
Several architectural trends put into question the overall design of cache coherence in multicore chips. First, as the number of cores scales up, the static power and area costs of directories also increase. Second, limited power budgets either restrict the operating frequency and voltage of the system or require disabling parts of the chip to stay within the budget. We argue for a different approach that leverages underprovisioned, reconfigurable directories. Our directories are used only in the presence of sharing; otherwise they can be disabled to save power or reused as cache to speed up cache-sensitive workloads.
Multicore systems have become increasingly powerful and thereby very useful in high-performance computing. However, many applications still cannot take full advantage of these systems, mainly due to the shortage of optimization techniques that deal with irregular control structures. In particular, the well-known polyhedral model fails to optimize loop nests that handle sparse matrices in their packed formats. In this paper, we propose using 2d-packed layouts and simple affine transformations to enable optimization of triangular and banded matrix operations. The benefit of our proposal is shown through an experimental study over a set of linear algebra benchmarks.
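For a lower-triangular matrix, for instance, a packed layout stores only the nonzero half, and the index mapping into the packed array is itself affine-friendly. The row-major packed scheme below is a standard one, not necessarily the paper's exact layout:

```python
def tri_packed_index(i, j):
    # maps element (i, j), with j <= i, of a lower-triangular matrix
    # to its position in a dense 1-D array that holds only the
    # non-zero half: rows 0..i-1 contribute i*(i+1)/2 elements
    assert j <= i
    return i * (i + 1) // 2 + j
```

Because the mapping is a simple polynomial in the loop indices, loop nests over the packed storage remain amenable to affine transformations.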
Work-queues are effective for mapping irregular-parallel workloads to GPGPUs. In this paper, we present a novel hardware work-queue design named DaQueue, which incorporates three data-aware features to improve the efficiency of work-queues. We evaluate our proposal on irregular-parallel workloads and carry out a case study on a path-tracing pipeline. Experimental results show that, for the selected workloads, DaQueue improves performance by 1.53X on average and by up to 1.91X. Compared with an idealized hardware worklist approach, the state-of-the-art prior work, DaQueue achieves an average of 29.54% extra speedup at a lower hardware area cost.
Hardware accelerators generated by polyhedral synthesis techniques make extensive use of affine expressions (affine functions and convex polyhedra) in control and steering logic. Since the control is pipelined, these affine objects must be evaluated at the same time for different values, which forbids aggressive reuse of operators. In this paper, we propose a method to factorize a collection of affine expressions without preventing pipelining. Our key contributions are (i) the use of semantic factorizations that exploit arithmetic properties of addition and multiplication, and (ii) a cost function whose minimization ensures correct usage of FPGA resources.
CGRAs excel at exploiting loop-level parallelism at a high performance-per-watt ratio, yet 25-45 percent of the consumed energy is spent on the instruction memory and fetches from it. This article presents a hardware/software co-design methodology that reduces the energy consumed by the instruction decode logic by 60%. The hardware modifications improve the spatial organization of code by reorganizing the configuration memory into separate partitions based on a statistical analysis. A compiler technique optimizes code in the temporal dimension by minimizing the number of signal changes. These optimizations enable a code size reduction of 55% across different application domains.
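The compiler-side objective, minimizing signal changes on the configuration path, amounts to minimizing bit toggles between consecutively issued instruction words. A toy cost function (not the article's algorithm) makes this concrete:

```python
def bus_toggles(words):
    # total bit flips seen on the fetch path when the configuration
    # words are issued in this order; fewer flips means less dynamic
    # energy dissipated in the instruction decode logic
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))
```

Reordering or re-encoding words so that temporal neighbors differ in few bits drives this cost down, complementing the hardware-side memory partitioning.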
Collecting hardware event counts is essential to understanding program execution behavior. Contemporary systems offer few Performance Monitoring Counters (PMCs), allowing only a small fraction of hardware events to be monitored simultaneously. We present new techniques to acquire counts for all available hardware events with high accuracy by multiplexing PMCs across multiple executions of the same program, then carefully reconciling and merging the multiple profiles into a single, coherent profile. We present a new metric for assessing the similarity of statistical distributions of event counts and show that our execution profiling approach performs significantly better than Hardware Event Multiplexing.
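The cross-execution multiplexing step can be sketched as partitioning the event list into groups that fit the available counters (one group per run) and then merging the per-run profiles. Names are illustrative, and the paper's reconciliation is statistical rather than a plain union:

```python
def schedule_events(events, num_pmcs):
    # one group of events per program execution, each group small
    # enough to occupy the available hardware counters exclusively
    return [events[i:i + num_pmcs] for i in range(0, len(events), num_pmcs)]

def merge_profiles(runs):
    # combine the per-run {event: count} maps into a single profile
    profile = {}
    for run in runs:
        profile.update(run)
    return profile
```

Since each event is counted for a whole execution rather than time-sliced within one run, no per-event extrapolation is needed, at the price of reconciling run-to-run variation.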
We take a holistic approach to evaluating the effectiveness of compression in the memory hierarchy, using several real applications with real data and complete runs of representative benchmarks. We introduce a new methodology to evaluate compressibility at both main memory and the caches on real machines. Using our toolset, we evaluate a collection of workloads from different domains, such as a 24-hour trace of a university department's web server. We analyze different compression properties for both real applications and benchmarks. Our results suggest that compression could be of general use both in main memory and in caches, and across different domains.
We introduce the Coarse-Grain Out-of-Order (CG-OoO) general-purpose processor, designed to achieve close to in-order processor energy while maintaining out-of-order performance. CG-OoO is an energy-performance-proportional architecture. It speculates, fetches, schedules, and commits code at block-level granularity. It eliminates unnecessary accesses to energy-consuming tables and turns large tables into smaller, distributed, cheaper-to-access tables. CG-OoO leverages dynamic block-level and instruction-level parallelism, and introduces Skipahead, a limited out-of-order scheduling model. CG-OoO closes 58% of the energy gap between the in-order and out-of-order baselines at the performance of the out-of-order design, making it 1.9× more efficient than the OoO baseline on the inverse energy-delay product metric.
Reducing the precision of floating-point values can improve performance in computer graphics applications. However, reducing precision levels in a controlled fashion requires support at both the compiler and microarchitecture levels. We propose an automated precision-selection method and a GPU register file organization that can densely store register values at arbitrary precisions. By allowing a small degradation in output quality, our method can remove up to 60% of the floating-point bits in the investigated kernels. Our register file exploits these lower-precision values by packing them into the same register, reducing the register pressure per thread by up to 47%.
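The packing idea can be sketched with Python's half-precision `struct` codes; 16-bit is just one illustrative reduced precision, whereas the proposed register file supports arbitrary widths:

```python
import struct

def pack_half2(a, b):
    # two fp16 values share one 32-bit register slot, halving the
    # register pressure for values that tolerate the precision loss
    return struct.pack("<2e", a, b)

def unpack_half2(word):
    # recover the two reduced-precision values from the packed slot
    return struct.unpack("<2e", word)
```

Values that survive the rounding to fp16 round-trip exactly; for others, the output-quality degradation bounds how aggressively the compiler may reduce precision.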