The C++ Transactional Memory Technical Specification (TMTS) has not seen widespread adoption, in large part due to its complexity. We conjecture that the proposed TM support is too difficult for programmers to use, too complex for compiler designers to implement and verify, and not industry-proven enough to justify final standardization in its current form. We show that the elimination of support for self-abort, coupled with the use of an "executor" interface to the TM system, can handle a wide range of transactional programs, delivering low instrumentation overhead and scalability and performance on par with the current state of the art.
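As a rough illustration of what an executor-style interface without self-abort might look like, the following is a minimal sketch and not the interface from the paper. The names `TxExecutor`, `execute`, `Account`, and `transfer` are hypothetical, and a single global mutex stands in for a real TM runtime. The point is only that the body is handed to the executor as a callable and has no way to cancel itself, so a failed precondition is handled with ordinary control flow.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// Hypothetical executor: runs a callable as one atomic region.
// Assumption: a global lock replaces the TM runtime for illustration only.
class TxExecutor {
public:
    // Runs `body` atomically and returns its result.
    // There is deliberately no abort/cancel operation exposed to `body`.
    template <typename F>
    auto execute(F&& body) -> decltype(body()) {
        std::lock_guard<std::mutex> guard(lock_);
        return std::forward<F>(body)();
    }

private:
    std::mutex lock_;
};

struct Account {
    long balance;
};

// Moves `amount` between accounts inside one atomic region.
// Instead of self-aborting on insufficient funds, the body simply
// leaves the state unchanged and reports failure to the caller.
bool transfer(TxExecutor& exec, Account& from, Account& to, long amount) {
    return exec.execute([&]() -> bool {
        if (from.balance < amount) {
            return false;  // nothing was modified, so nothing to undo
        }
        from.balance -= amount;
        to.balance += amount;
        return true;
    });
}
```

Because every atomic region goes through `execute`, the same executor object could in principle be backed by a lock, hardware TM, or software TM without changing the calling code.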
Contemporary GPUs support multiprogramming by allowing multiple kernels to run concurrently on the same streaming multiprocessors. Recent studies have demonstrated that such concurrent kernel execution improves both resource utilization and computational throughput. However, significant performance slowdown and unfairness are observed when latency-sensitive kernels co-run with bandwidth-intensive ones. In this paper, we first make a case that such problems cannot be sufficiently solved by managing CTA combinations alone and reveal the fundamental reasons. Then, we propose a coordinated approach for CTA combination and bandwidth partitioning. Our approach significantly improves the performance even compared with the exhaustively searched CTA combination.
This work performs a thorough characterization and analysis of the Nutch web search benchmark, which is based on the popular open source Lucene search library. The paper describes in detail the architecture, functionality, and micro-architectural behaviour of the search engine, and investigates prominent web search research issues. In particular, we study how intra-server index partitioning affects response time and throughput, explore the potential use of low-power servers for web search, and examine the potential causes of performance degradation, variability, and tail latencies.
This paper is an extension of a paper published in PACT-2017. In this work, we propose a novel recompute-based failure safety approach, and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery.
Iteration Point Difference Analysis is a new static analysis framework that can be used to determine the memory coalescing characteristics of parallel loops that target GPU offloading and to ascertain the safety and profitability of loop transformations with the goal of improving their memory access characteristics. This analysis can propagate definitions through control flow, works for non-affine expressions, and is capable of analyzing expressions that reference conditionally-defined values. This analysis framework enables safe and profitable loop transformations. Experimental results demonstrate potential for dramatic performance improvements. This work also demonstrates how architecture-aware compilers improve code portability and reduce programmer effort.
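To make the idea of coalescing characteristics concrete, the sketch below looks at the difference between the element index accessed at adjacent iteration points. It is only a numeric illustration under stated assumptions: the framework described above is static and symbolic, whereas this hypothetical `classify` helper evaluates an index function at run time. The `AccessPattern` categories are also illustrative names, not terminology from the paper.

```cpp
#include <cassert>
#include <functional>

// Illustrative categories for how adjacent iterations touch memory.
enum class AccessPattern { Uniform, Coalesced, Strided, Irregular };

// `index(i)` returns the array element accessed by iteration i.
// Compares index(i + 1) - index(i) over iterations [first, last];
// requires last > first.
AccessPattern classify(const std::function<long(long)>& index,
                       long first, long last) {
    const long delta = index(first + 1) - index(first);
    for (long i = first + 1; i < last; ++i) {
        if (index(i + 1) - index(i) != delta) {
            return AccessPattern::Irregular;  // difference is not constant
        }
    }
    if (delta == 0) {
        return AccessPattern::Uniform;    // every iteration reads one element
    }
    if (delta == 1 || delta == -1) {
        return AccessPattern::Coalesced;  // adjacent iterations, adjacent elements
    }
    return AccessPattern::Strided;        // constant but non-unit stride
}
```

For example, `a[i]` would be classified as coalesced, `a[4 * i]` as strided, and `a[i * i]` as irregular, which is the kind of distinction a compiler could use when deciding whether a loop interchange is worthwhile.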
Polyhedral Compilation for Multi-dimensional Stream Processing
In this paper, we show some potential advantages of using a formalism inspired by Quantum Computing (QC) to evaluate CMRDs with preemptions and avoid the underlying NP-hard problem. The experimental results, obtained with a classic (non-quantum) numerical approach on a selection of Malardalen benchmark programs, display very good accuracy, while the complexity of the evaluation is a low-order polynomial of the number of memory accesses. Whilst it is not yet a full quantum algorithm, we provide a first roadmap on how to reach such an objective in future work.
In this paper, we present the Power Optimised Software Envelope (POSE) model, which allows developers to assess whether power optimisation is worth pursuing for their applications. We first outline the POSE model using the established Energy-Delay Product (EDP) family of metrics. We then provide formulations of our model using the novel Energy-Delay Sum and Energy-Delay Distance metrics, as we believe these metrics are more appropriate for energy-aware optimisation efforts. Finally, we introduce an extension to POSE, named System Summary POSE. System Summary POSE allows us to reason about system-wide scope for energy-aware optimisation independently of any particular application.
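For readers unfamiliar with the Energy-Delay Product family, the following minimal sketch computes the general form E × T^w, where w = 1 gives EDP and w = 2 gives ED²P. The function names and the example figures are hypothetical. The Energy-Delay Sum and Energy-Delay Distance metrics are not shown, because their definitions are not given here.

```cpp
#include <cassert>

// Energy-delay metric E * T^w (w = 1: EDP, w = 2: ED^2P).
// Lower values are better.
double energy_delay_metric(double energy_joules, double time_seconds,
                           int delay_weight) {
    double result = energy_joules;
    for (int k = 0; k < delay_weight; ++k) {
        result *= time_seconds;
    }
    return result;
}

// True when the optimised version scores strictly lower than the baseline
// under the metric with the given delay weight.
bool improves(double base_energy, double base_time,
              double opt_energy, double opt_time, int delay_weight) {
    return energy_delay_metric(opt_energy, opt_time, delay_weight) <
           energy_delay_metric(base_energy, base_time, delay_weight);
}
```

With an assumed baseline of 10 J in 2 s and a candidate of 8 J in 2.25 s, the candidate wins under EDP (18 versus 20) but loses under ED²P (40.5 versus 40), which shows how the choice of metric determines whether a power optimisation counts as worthwhile.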
We herein propose MH cache, a multi-retention STT-RAM based cache management scheme for last-level caches (LLC) that reduces their power consumption in mobile hardware rendering systems. We analyzed the memory access patterns of processes and observed how rendering methods affect process behaviors. We propose a cache management scheme that dynamically measures the write-intensity of each process and exploits it to manage our proposed cache. Our experimental results show that our techniques significantly reduce the LLC power consumption by 33% and 32.2% in single- and quad-core systems, respectively, compared to a full STT-RAM
We introduce Caliper, a technique for accurately estimating the performance interference that occurs in shared servers, overcoming the limitations of prior approaches by leveraging a micro-experiment based technique. In contrast to state-of-the-art approaches that focus on periodically pausing co-running applications to estimate slowdown, Caliper utilizes a strategic phase-triggered technique to capture interference due to co-location. This allows Caliper to orchestrate an accurate and low-overhead interference estimation technique that can be readily deployed in existing production systems. We evaluate Caliper for a wide spectrum of workload scenarios, demonstrating that it seamlessly supports up to 16 applications running simultaneously and outperforms state-of-the-art approaches.
We present the first end-to-end modeling and compilation flow to parallelize hard real-time control applications while fully guaranteeing that real-time requirements are respected. It scales to thousands of data-flow nodes, and has been validated on two production avionics applications. Unlike classical optimizing compilation, it takes non-functional requirements (real-time, resource constraints) as input. To ensure that these requirements are respected, the compiler follows a static resource allocation strategy, from coarse-grain tasks communicating over an interconnection network, all the way to individual variables and memory accesses. It keeps track of timing interferences resulting from mapping decisions in a precise, safe, and scalable way.
In this paper, we propose Elastic-Cache to support both fine- and coarse-grained cache-line management to improve the L1 cache efficiency of GPUs. Specifically, it stores 32-byte words from non-contiguous memory space in a single 128-byte cache line. Furthermore, it neither requires an extra memory structure nor reduces the capacity of the L1 cache for tag storage. To improve the bandwidth utilization of the L1 cache, we further propose Elastic-Plus to issue 32-byte requests in parallel, which can reduce the processing latency of instructions and improve the throughput of GPUs. Our experiments show that Elastic-Cache and Elastic-Plus improve performance by 104% and 131%, respectively.