The C++ Transactional Memory Technical Specification (TMTS) has not seen widespread adoption, in large part due to its complexity. We conjecture that the proposed TM support is too difficult for programmers to use, too complex for compiler designers to implement and verify, and not industry-proven enough to justify final standardization in its current form. We show that eliminating support for self-abort, coupled with the use of an "executor" interface to the TM system, can handle a wide range of transactional programs, delivering low instrumentation overhead with scalability and performance on par with the current state of the art.
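To make the executor idea concrete, here is a minimal, hypothetical sketch (not the paper's actual API) of an executor-style interface: the programmer hands a closure to the runtime, which decides how to run it atomically. Because self-abort is not supported, a simple irrevocable fallback such as a global lock is always a correct execution strategy; the names `tm_execute` and `transfer` are illustrative assumptions.

```python
import threading

# Hypothetical executor-style TM interface: the runtime, not the
# programmer, chooses the execution strategy for the closure.
# With no self-abort, a lock-based irrevocable path is always valid.
_global_tm_lock = threading.Lock()

def tm_execute(txn):
    """Run the closure `txn` atomically and return its result.

    A real runtime might first attempt a hardware transaction or an
    instrumented software path; this sketch shows only the lock-based
    fallback that the no-self-abort design always permits.
    """
    with _global_tm_lock:
        return txn()

# Usage: an atomic transfer that never touches TM internals directly.
accounts = {"a": 100, "b": 50}

def transfer(src, dst, amount):
    def txn():
        accounts[src] -= amount
        accounts[dst] += amount
        return accounts[dst]
    return tm_execute(txn)

new_balance = transfer("a", "b", 30)
```

The key design point is that removing self-abort removes the need for closures to be replayable on abort, which is what keeps instrumentation overhead low in this style of interface.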
Contemporary GPUs support multiprogramming by allowing multiple kernels to run concurrently on the same streaming multiprocessors. Recent studies have demonstrated that such concurrent kernel execution improves both resource utilization and computational throughput. However, significant performance slowdown and unfairness are observed when latency-sensitive kernels co-run with bandwidth-intensive ones. In this paper, we first make the case that such problems cannot be sufficiently solved by managing CTA combinations alone and reveal the fundamental reasons why. Then, we propose a coordinated approach to CTA combination and bandwidth partitioning. Our approach significantly improves performance, even compared with the best CTA combination found by exhaustive search.
Iteration Point Difference Analysis is a new static analysis framework that can be used to determine the memory coalescing characteristics of parallel loops that target GPU offloading and to ascertain the safety and profitability of loop transformations aimed at improving their memory access characteristics. This analysis can propagate definitions through control flow, works for non-affine expressions, and is capable of analyzing expressions that reference conditionally-defined values. This analysis framework enables safe and profitable loop transformations. Experimental results demonstrate potential for dramatic performance improvements. This work also demonstrates how architecture-aware compilers improve code portability and reduce programmer effort.
Polyhedral Compilation for Multi-dimensional Stream Processing
In this paper, we show some potential advantages of using a formalism inspired by Quantum Computing (QC) to evaluate CMRDs with preemptions and avoid the underlying NP-hard problem. The experimental results, obtained with a classic (non-quantum) numerical approach on a selection of Mälardalen benchmark programs, display very good accuracy, while the complexity of the evaluation is a low-order polynomial in the number of memory accesses. While this is not yet a full quantum algorithm, we provide a first roadmap on how to reach that objective in future work.
In this paper, we present the Power Optimised Software Envelope (POSE) model, which allows developers to assess whether power optimisation is worth pursuing for their applications. We first outline the POSE model using the established Energy-Delay Product (EDP) family of metrics. We then provide formulations of our model using the novel Energy-Delay Sum and Energy-Delay Distance metrics, as we believe these metrics are more appropriate for energy-aware optimisation efforts. Finally, we introduce an extension to POSE, named System Summary POSE. System Summary POSE allows us to reason about system-wide scope for energy-aware optimisation independently of any particular application.
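The established ED^nP family that POSE builds on weights energy against delay as E·D^n, with n = 0 (energy alone), n = 1 (EDP), and n = 2 (ED2P). The following sketch illustrates the family on two hypothetical configurations; the Energy-Delay Sum and Energy-Delay Distance metrics introduced by POSE are not reproduced here, and all measurement values below are invented for illustration.

```python
def ednp(energy_j, delay_s, n=1):
    """Generalized energy-delay product E * D^n."""
    return energy_j * delay_s ** n

# Hypothetical measurements: a baseline run and a power-optimised run
# that trades 40% more time for 40% less energy.
baseline = {"energy_j": 100.0, "delay_s": 10.0}
optimised = {"energy_j": 60.0, "delay_s": 14.0}

# Under plain energy (n=0) and EDP (n=1) the optimisation wins; under
# ED2P (n=2), which weights delay quadratically, the ranking reverses.
e0 = ednp(**baseline, n=0), ednp(**optimised, n=0)   # (100.0, 60.0)
e1 = ednp(**baseline, n=1), ednp(**optimised, n=1)   # (1000.0, 840.0)
e2 = ednp(**baseline, n=2), ednp(**optimised, n=2)   # (10000.0, 11760.0)
```

The reversal between n = 1 and n = 2 is exactly the kind of metric sensitivity that motivates reasoning about whether a power optimisation is worth pursuing before undertaking it.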
Many-cores execute a large number of diverse applications concurrently. Inter-application interference can pose a security threat in the form of timing-channel attacks in the on-chip network. The mapping of applications can effectively determine the interference among applications in the on-chip network. In this work, we explore non-interference approaches through run-time mapping at the software and application level. Through run-time mapping, we can maximize utilization of the system without leaking information. The proposed run-time mapping policy requires no router modification, in contrast to the best known competing schemes, and its throughput degradation is, on average, 16% lower than that of the state-of-the-art non-secure baselines.
We herein propose MH cache, a multi-retention STT-RAM based cache management scheme for last-level caches (LLC) to reduce their power consumption in mobile hardware rendering systems. We analyzed the memory access patterns of processes and observed how rendering methods affect process behaviors. We propose a cache management scheme that dynamically measures the write-intensity of each process and exploits it to manage our proposed cache. Our experimental results show that our techniques significantly reduce LLC power consumption, by 33% and 32.2% in single- and quad-core systems, respectively, compared to a full STT-RAM cache.
We introduce Caliper, a technique for accurately estimating the performance interference that occurs in shared servers, overcoming the limitations of prior approaches by leveraging a micro-experiment based technique. In contrast to state-of-the-art approaches that focus on periodically pausing co-running applications to estimate slowdown, Caliper utilizes a strategic phase-triggered technique to capture interference due to co-location. This allows Caliper to orchestrate an accurate and low-overhead interference estimation technique that can be readily deployed in existing production systems. We evaluate Caliper for a wide spectrum of workload scenarios, demonstrating that it seamlessly supports up to 16 applications running simultaneously and outperforms state-of-the-art approaches.
DRAM caches have emerged as an efficient new layer in the memory hierarchy to address the increasing diversity of memory components. This paper first investigates how prior approaches perform with diverse hybrid memory configurations, and observes that no single DRAM cache organization outperforms the others across all the diverse scenarios. This paper proposes a reconfigurable DRAM cache design which can adapt to different hardware configurations and application patterns. Using a sample-based mechanism, the proposed DRAM cache controller dynamically finds the best organization among three candidates and applies it through reconfiguration.
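A sample-based selection mechanism of this kind can be sketched in software as follows. This is a heavily simplified, hypothetical model (the paper's controller operates in hardware): during a sampling window each candidate organization is exercised on a sampled slice of accesses, and the one with the best observed metric, here assumed to be hit rate, is applied for the next epoch. The organization names and statistics are invented for illustration.

```python
def pick_best_organization(samples):
    """Select the candidate with the highest sampled hit rate.

    samples: dict mapping organization name -> (hits, accesses)
    observed during the sampling window.
    """
    def hit_rate(stats):
        hits, accesses = stats
        return hits / accesses if accesses else 0.0
    return max(samples, key=lambda org: hit_rate(samples[org]))

# Hypothetical sampling results for three candidate organizations.
window = {
    "organization-A": (620, 1000),
    "organization-B": (710, 1000),
    "organization-C": (540, 1000),
}
best = pick_best_organization(window)
```

The design choice sampling makes explicit is the trade-off between the cost of exercising the losing candidates during the window and the benefit of running the winning organization for the rest of the epoch.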
We present the first end-to-end modeling and compilation flow to parallelize hard real-time control applications while fully guaranteeing the respect of real-time requirements. It scales to thousands of data-flow nodes, and has been validated on two production avionics applications. Unlike classical optimizing compilation, it takes as input non-functional requirements (real-time, resource constraints). To ensure respect of requirements, the compiler follows a static resource allocation strategy, from coarse grain tasks communicating over an interconnection network, all the way to individual variables and memory accesses. It keeps track of timing interferences resulting from mapping decisions in a precise, safe, and scalable way.
In this paper, we propose Elastic-Cache, which supports both fine- and coarse-grained cache-line management to improve the L1 cache efficiency of GPUs. Specifically, it maps 32-byte words from non-contiguous memory space to a single 128-byte cache line. Furthermore, it neither requires an extra memory structure nor reduces the capacity of the L1 cache for tag storage. To improve the bandwidth utilization of the L1 cache, we further propose Elastic-Plus, which issues 32-byte requests in parallel to reduce the processing latency of instructions and improve the throughput of GPUs. Our experiments show that Elastic-Cache and Elastic-Plus improve performance by 104% and 131%, respectively.