Emerging non-volatile memories (NVMs) suffer from low write endurance, resulting in early cell failures (hard errors) that reduce memory lifetime. This paper proposes error-correcting strings (ECS), which adopt a base-offset approach to store pointers to failed memory cells. Unlike fixed-length error-correcting pointers (ECP), ECS uses variable-length offsets to point to the failed cells, thereby providing more pointers and tolerating more hard errors per memory block. Furthermore, this paper proposes eXtended-ECS (XECS), a page-level error correction architecture that employs dynamic on-demand ECS allocation and opportunistic pattern-based data compression to improve NVM lifetime with negligible impact on system performance.
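As a rough illustration of the base-offset idea (the field widths, names, and threshold below are our own assumptions, not the paper's actual ECS format), the following sketch encodes a sorted list of failed-bit positions with one full-width base pointer and short variable-length offsets for nearby failures, so clustered failures cost far fewer metadata bits than fixed-width ECP pointers:

    /* Illustrative base-offset encoding for failed-cell pointers.
     * Field widths and names are assumptions, not the paper's format. */
    #include <stdio.h>

    #define BASE_BITS   9   /* full pointer into a 512-bit block             */
    #define OFFSET_BITS 4   /* short offset relative to the previous failure */

    /* Metadata bits needed to encode the sorted failed-bit positions:
     * one full base, then a short offset whenever the next failure is close. */
    static unsigned ecs_bits(const unsigned *fails, unsigned n)
    {
        if (n == 0)
            return 0;
        unsigned bits = BASE_BITS;                 /* first failure: full pointer */
        for (unsigned i = 1; i < n; i++) {
            unsigned delta = fails[i] - fails[i - 1];
            bits += (delta < (1u << OFFSET_BITS)) ? OFFSET_BITS : BASE_BITS;
        }
        return bits;
    }

    int main(void)
    {
        unsigned fails[] = { 17, 21, 26, 300 };    /* clustered failures + outlier */
        printf("ECS-style: %u bits vs. fixed ECP: %u bits\n",
               ecs_bits(fails, 4), 4u * BASE_BITS);
        return 0;
    }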
In this paper, we present the Transactional Correctness tool for Abstract Data Types (TxC-ADT), the first tool that can check the correctness of transactional data structures. TxC-ADT elevates the standard definitions of transactional correctness to be defined in terms of an abstract data type, an essential aspect for checking the correctness of transactions that synchronize only on high-level semantic conflicts. To accommodate an assortment of transactional correctness conditions, we present a technique for defining correctness as a happens-before relation. This technique enables an automated approach in which correctness is evaluated by generating and analyzing a transactional happens-before graph during model checking.
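One plausible shape of the analysis step (a sketch under our own assumptions; TxC-ADT's actual graph construction and correctness conditions are richer) is a cycle check over the generated transactional happens-before graph, since for a serializability-like condition a cycle signals a violation:

    /* Cycle check over a transactional happens-before graph (sketch). */
    #include <stdbool.h>

    #define MAX_TX 64

    static bool dfs(bool hb[][MAX_TX], int n, int v, int state[])
    {
        state[v] = 1;                          /* on the current DFS path */
        for (int w = 0; w < n; w++) {
            if (!hb[v][w]) continue;
            if (state[w] == 1) return true;    /* back edge: cycle found   */
            if (state[w] == 0 && dfs(hb, n, w, state)) return true;
        }
        state[v] = 2;                          /* fully explored */
        return false;
    }

    /* Returns true if the happens-before graph over n transactions is acyclic,
     * i.e. consistent with a serializability-like correctness condition.       */
    bool hb_graph_consistent(bool hb[][MAX_TX], int n)
    {
        int state[MAX_TX] = { 0 };
        for (int v = 0; v < n; v++)
            if (state[v] == 0 && dfs(hb, n, v, state))
                return false;
        return true;
    }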
This paper addresses the automated protection of loops with complex control- and data-flow patterns at compilation time. The security property we consider is that a sensitive loop must always perform the expected number of iterations; otherwise, an attack must be reported. We propose a generic and portable compile-time loop hardening scheme and also investigate how to preserve the security property along the compilation flow while enabling aggressive optimizations. On average, the compiler automatically hardens 95% of the sensitive loops of typical security benchmarks, and 97% of simulated faults are detected. Performance and code size overheads remain affordable.
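As a hand-written analogue of what such a hardening pass might insert (the names and the exact checks are assumptions, not the paper's scheme), a sensitive loop can carry a redundant counter that is verified against the expected trip count after the loop; note that an optimizing compiler may fold away such redundancy, which is precisely why preserving the property along the compilation flow is part of the problem:

    /* Hand-written analogue of a hardened sensitive loop (illustrative only). */
    #include <stdlib.h>

    static void attack_detected(void) { abort(); }  /* placeholder alarm handler */

    void wipe(volatile char *buf, unsigned len)
    {
        unsigned i, shadow = 0;            /* redundant copy of the loop counter */
        for (i = 0; i < len; i++) {
            buf[i] = 0;
            shadow++;                      /* updated alongside, but separately  */
        }
        /* The loop must have run exactly 'len' iterations: check both the
         * primary induction variable and its shadow before trusting the result. */
        if (i != len || shadow != len)
            attack_detected();
    }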
A problem on multicore systems is cache sharing, where the cache occupancy of a program depends on the cache usage of peer programs. An exclusive cache hierarchy, as used on AMD processors, is an effective solution that allows processor cores to have large private caches while still benefiting from a shared cache. The shared cache stores the victims, i.e., data evicted from the private caches. Performance therefore depends on how the victims of co-run programs interact in the shared cache.
Energy-efficient compilation of irregular programs with task-parallel loops is a challenging problem for multi-core systems. The problem becomes even more interesting on multi-socket multi-core systems, where all the cores attached to a socket run at a single frequency. We propose a mixed compile-time plus runtime scheme (X10Ergy) that obtains energy gains with minimal impact on execution time for languages such as X10 and HJ that support task-parallel loops and mutual exclusion. We implemented X10Ergy for X10 and obtained encouraging results on the IMSuite kernels.
Most systems allocate computational resources to each executing task without any actual knowledge of the application's Quality-of-Service (QoS) requirements. Such best-effort policies lead to over-provisioning of resources and increased energy consumption. This work targets applications with soft QoS requirements and exploits their inherent timing slack to minimize the allocated computational resources and thereby reduce energy consumption. We propose a lightweight progress-tracking methodology based on the outer loops that enclose the application's kernel; it builds an online history that our proposed predictors use to estimate the total execution time.
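A minimal sketch of the outer-loop progress idea, assuming a simple running-average predictor (the paper's history-based predictors are more elaborate): timestamp each outer-loop iteration and extrapolate the total execution time, which a runtime could then compare against the QoS deadline to decide how far resources can be scaled down:

    /* Outer-loop progress tracking with a running-average predictor (sketch). */
    #include <time.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* Predicted total execution time after 'done' of 'total' outer iterations
     * (done must be >= 1), extrapolating the average time per iteration so far. */
    double predict_total(double start_sec, unsigned done, unsigned total)
    {
        double elapsed  = now_sec() - start_sec;
        double per_iter = elapsed / (double)done;
        return per_iter * (double)total;
    }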
Modern multi-core systems require the efficient allocation of multiple resources to optimize system energy-delay product (EDP). Choosing between multiple optimizations at runtime is complex because their effects are non-additive. We present a novel method, Machine Learned Machines (MLM), which uses online reinforcement learning (RL) to dynamically partition the LLC together with DVFS of the core and uncore. We show that this co-optimization results in much lower system EDP than any of the techniques applied individually: an average system EDP improvement of 20.77% on a 4-core system and 23.63% on a 16-core system, with limited degradation of throughput and fairness.
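As a toy illustration of the RL flavor of such a controller (all constants, the state encoding, and the action set are assumptions, not the MLM design), an epsilon-greedy Q-learning agent can pick (LLC ways, frequency) actions and learn from a reward such as the negative EDP of the last interval:

    /* Epsilon-greedy Q-learning controller sketch: actions are (LLC ways,
     * frequency) pairs, reward is the negative EDP of the last interval.  */
    #include <stdlib.h>

    #define N_STATES  16      /* e.g. a coarse bucketing of observed MPKI/IPC */
    #define N_ACTIONS 12      /* e.g. 4 way-partitions x 3 frequency levels   */

    static double Q[N_STATES][N_ACTIONS];

    int choose_action(int s, double eps)
    {
        if ((double)rand() / RAND_MAX < eps)
            return rand() % N_ACTIONS;          /* explore */
        int best = 0;
        for (int a = 1; a < N_ACTIONS; a++)
            if (Q[s][a] > Q[s][best]) best = a;
        return best;                            /* exploit */
    }

    void q_update(int s, int a, double reward, int s_next,
                  double alpha, double gamma)
    {
        double best_next = Q[s_next][0];
        for (int a2 = 1; a2 < N_ACTIONS; a2++)
            if (Q[s_next][a2] > best_next) best_next = Q[s_next][a2];
        Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a]);
    }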
Compression techniques at the last-level cache (LLC) and the DRAM play an important role in improving system performance by increasing their effective capacities. Applications exhibit data locality that spreads across multiple consecutive data blocks. We observe that there is a significant opportunity to compress multiple consecutive data blocks into a single block, both at the LLC and in DRAM, and we propose a mechanism (MBZip) to exploit it. Further, we explore silent writes at the DRAM and show that certain writes need not access memory when blocks are zipped.
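A toy sketch of the two ideas, under our own simplifications (a trivial all-zero test stands in for the real compressor, and the 64-byte block size is illustrative): consecutive blocks are zipped into one slot when they compress far enough, and a write is silent when the stored bytes already match the incoming data:

    /* Toy zipping and silent-write checks; all sizes and names are assumptions. */
    #include <stdbool.h>
    #include <string.h>

    #define BLOCK_BYTES 64

    static bool all_zero(const unsigned char *b)
    {
        for (int i = 0; i < BLOCK_BYTES; i++)
            if (b[i]) return false;
        return true;
    }

    /* Consecutive blocks fit in one slot if at most one of them is non-zero. */
    bool can_zip(const unsigned char blocks[][BLOCK_BYTES], int nblocks)
    {
        int nonzero = 0;
        for (int i = 0; i < nblocks; i++)
            nonzero += !all_zero(blocks[i]);
        return nonzero <= 1;
    }

    /* A write is silent if the stored bytes already match the incoming data. */
    bool is_silent_write(const unsigned char *stored, const unsigned char *incoming)
    {
        return memcmp(stored, incoming, BLOCK_BYTES) == 0;
    }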
Multi-tenant virtualized infrastructures allow cloud providers to minimize costs through workload consolidation. One of the largest costs is power consumption, which is challenging to understand in heterogeneous environments. We propose a power modeling methodology that tackles this complexity with a divide-and-conquer approach. Experiments show that we outperform previous work, achieving a relative error of 2% on average and under 4% in almost all cases. The models are portable across similar architectures, enabling predictions before migrating a tenant to a different hardware platform. Moreover, we show how a scheduler can use these models to evaluate tenant colocations and minimize overall consumption.
Work-queues are effective for mapping irregular-parallel workloads to GPGPUs. In this paper, we present a novel hardware work-queue design named DaQueue, which incorporates three data-aware features to improve the efficiency of work-queues. We evaluate our proposal on irregular-parallel workloads and carry out a case study on a path tracing pipeline. Experimental results show that, for the selected workloads, DaQueue improves performance by 1.53x on average and by up to 1.91x. Compared with an idealized hardware worklist approach, the state-of-the-art prior work, DaQueue achieves an average of 29.54% additional speedup at a lower hardware area cost.
In this paper we demonstrate that the pattern-based parallel programming approach is flexible enough to parallelize 12 out of the 13 PARSEC applications. Our analysis, conducted on three different multi-core architectures, shows that pattern-based parallel programming has reached a good level of maturity, providing performance comparable to both pragma-based parallel programming methodologies (i.e., OpenMP and OmpSs) and native implementations (i.e., Pthreads). Regarding programming effort, we also demonstrate a considerable improvement over Pthreads and results comparable to the other existing implementations.
Collecting hardware event counts is essential to understanding program execution behavior. Contemporary systems offer few Performance Monitoring Counters (PMCs), allowing only a small fraction of hardware events to be monitored simultaneously. We present new techniques to acquire counts for all available hardware events with high accuracy by multiplexing PMCs across multiple executions of the same program and then carefully reconciling and merging the resulting profiles into a single, coherent profile. We also present a new metric for assessing the similarity of statistical distributions of event counts and show that our execution profiling approach performs significantly better than hardware event multiplexing.
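The reconciliation step could look roughly like the following sketch (our own simplification, not the paper's algorithm): each run measures a different subset of events plus a shared anchor event such as retired instructions, and counts are rescaled by the anchor ratio before being merged into one profile:

    /* Merging per-run event counts into one profile (sketch).
     * runs[r][e] : count of event e in run r (meaningful only if owner[e] == r)
     * anchor[r]  : count of a shared reference event, e.g. retired instructions
     * owner[e]   : index of the run that measured event e                       */
    #define N_EVENTS 6

    void merge_profiles(const double runs[][N_EVENTS], const double anchor[],
                        const int owner[], double merged[N_EVENTS])
    {
        for (int e = 0; e < N_EVENTS; e++) {
            int r = owner[e];
            double scale = anchor[0] / anchor[r];   /* correct run-to-run drift */
            merged[e] = runs[r][e] * scale;
        }
    }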
We take a holistic approach to evaluating the effectiveness of compression in the memory hierarchy, using several real applications with real data and complete runs of representative benchmarks. We introduce a new methodology to evaluate compressibility in both main memory and caches on real machines. Using our toolset, we evaluate a collection of workloads from different domains, such as a university department web server monitored for 24 hours. We analyze different compression properties for both real applications and benchmarks. Our results suggest that compression could be of general use both in main memory and in caches, and across different domains.
We introduce the Coarse-Grain Out-of-Order (CG-OoO) general-purpose processor, designed to achieve close to in-order (InO) processor energy while maintaining out-of-order (OoO) performance. CG-OoO is an energy-performance-proportional architecture. It speculates, fetches, schedules, and commits code at block-level granularity, eliminates unnecessary accesses to energy-consuming tables, and turns large tables into smaller, distributed tables that are cheaper to access. CG-OoO leverages dynamic block-level and instruction-level parallelism and introduces Skipahead, a limited out-of-order scheduling model. CG-OoO closes 58% of the energy gap between the InO and OoO baselines while matching OoO performance, making it 1.9× more efficient than the OoO baseline on the inverse energy-delay product metric.
Reducing the precision of floating-point values can improve performance in computer graphics applications. However, reducing precision levels in a controlled fashion requires support at both the compiler and the microarchitecture level. We propose an automated precision-selection method and a GPU register file organization that can densely store register values at arbitrary precisions. By allowing a small degradation in output quality, our method can remove up to 60% of the floating-point bits in the investigated kernels. Our register file exploits these lower-precision values by packing them into the same register, reducing the register pressure per thread by up to 47%.
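As a small illustration of packing reduced-precision values (we use a bfloat16-like truncation purely for concreteness; the paper's precisions are arbitrary, not fixed to 16 bits), two narrowed floats can share a single 32-bit register slot:

    /* Packing two reduced-precision values into one 32-bit register slot. */
    #include <stdint.h>
    #include <string.h>

    static uint16_t to_bf16(float f)       /* keep sign, exponent, top mantissa bits */
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        return (uint16_t)(bits >> 16);
    }

    static float from_bf16(uint16_t h)     /* widen back for computation */
    {
        uint32_t bits = (uint32_t)h << 16;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    /* Two narrowed values share the register; unpack with from_bf16 as needed. */
    uint32_t pack2(float a, float b)
    {
        return ((uint32_t)to_bf16(a) << 16) | (uint32_t)to_bf16(b);
    }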