In this paper, we present AT-Com, a scheme to optimize X10 code with place-change operations. AT-Com consists of two inter-related new optimizations: (i) AT-Opt, which minimizes the amount of data serialized and communicated during place-change operations, and (ii) AT-Pruning, which identifies and elides redundant place-change operations and executes the remaining ones in parallel. We have implemented AT-Com in the X10 v2.6.0 compiler and evaluated it on the IMSuite benchmark kernels. Compared to the current X10 compiler, the AT-Com optimized code achieved geometric mean speedups of 18.72x and 17.83x on a four-node (32 cores/node) Intel system and a two-node (16 cores/node) AMD system, respectively.
The C++ Transactional Memory Technical Specification (TMTS) has not seen widespread adoption, in large part due to its complexity. We conjecture that the proposed TM support is too difficult for programmers to use, too complex for compiler designers to implement and verify, and not industry-proven enough to justify final standardization in its current form. We show that eliminating support for self-abort, coupled with an "executor" interface to the TM system, can handle a wide range of transactional programs, delivering low instrumentation overhead with scalability and performance on par with the current state of the art.
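To make the executor idea concrete, here is a minimal Python sketch of what an executor-style interface could look like. The class and method names are ours (not the specification's), and a global fallback lock stands in for a real TM runtime; because there is no self-abort, each submitted closure runs exactly once.

```python
import threading

class TransactionExecutor:
    """Sketch of an executor-style TM interface: the program hands a closure
    to the executor instead of marking atomic regions inline. A real
    implementation would attempt hardware/software transactions and fall back
    to the lock; this sketch always uses the fallback lock."""

    def __init__(self):
        self._fallback = threading.Lock()

    def execute(self, txn, *args):
        with self._fallback:      # stand-in for the TM runtime
            return txn(*args)

# Usage: four threads incrementing a shared counter through the executor.
counter = {"n": 0}

def increment():
    counter["n"] += 1

ex = TransactionExecutor()
threads = [
    threading.Thread(target=lambda: [ex.execute(increment) for _ in range(1000)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the transaction body cannot self-abort, the executor never needs to instrument reads and writes for rollback, which is the source of the low overhead claimed above.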
Locality is a common concept used in many areas of program and system analysis and optimization. It has been defined and measured in many ways. Previous work has focused on individual types of locality; this paper focuses on how they are related. It formalizes their relations by categorizing commonly used definitions into three groups: access locality, timescale locality, and cache locality, and by showing whether and how we can convert between them. Often two different metrics, e.g., access frequency and miss ratio, are related but not equivalent. The formalization shows precisely how they are related and where they differ.
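One such relation between access locality and cache locality can be made concrete: for a fully associative LRU cache, the miss ratio is exactly the fraction of accesses whose LRU stack (reuse) distance exceeds the cache capacity. A small sketch, with a hypothetical trace:

```python
def reuse_distances(trace):
    """LRU stack distance of each access: the number of distinct addresses
    touched since the last access to the same address (inclusive).
    First-time accesses get infinite distance (cold misses)."""
    stack, dists = [], []          # stack[0] is the most recently used
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            dists.append(pos + 1)  # distinct blocks since last use
            stack.pop(pos)
        else:
            dists.append(float("inf"))
        stack.insert(0, addr)
    return dists

def lru_miss_ratio(trace, capacity):
    """Miss ratio of a fully associative LRU cache of the given capacity:
    an access misses iff its reuse distance exceeds the capacity."""
    dists = reuse_distances(trace)
    return sum(1 for d in dists if d > capacity) / len(trace)

trace = ["a", "b", "c", "a", "b", "d", "a"]
# With capacity 3 only the four cold misses remain.
assert lru_miss_ratio(trace, 3) == 4 / 7
```

This is the classic conversion from an access-locality metric (the reuse distance histogram) to a cache-locality metric (miss ratio as a function of cache size), computed here by brute force rather than by the paper's formal machinery.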
Common memory prefetchers are designed to target specific memory access patterns, including spatio-temporal locality, recurring patterns, and irregular patterns. In this paper, we propose a conceptual neural network (NN) prefetcher that dynamically adapts to arbitrary access patterns to capture semantic locality. Leveraging recent advances in machine learning, the proposed NN prefetcher correlates program context with memory accesses using online training, enabling it to tap into previously undetected access patterns. We present an architectural implementation of our prefetcher and evaluate it on SPEC2006, Graph500, and other kernels, showing that it delivers up to 30% speedup on SPEC2006 and up to 4.4x speedup on some kernels.
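As a rough illustration of context-correlated, online-trained prefetching, here is a toy table-based stand-in, not the paper's NN architecture; the class, table structure, and context definition (program counter plus last delta) are our assumptions for the sketch.

```python
from collections import defaultdict, Counter

class ContextDeltaPredictor:
    """Toy online predictor that correlates a context (pc, last delta)
    with the next address delta. Illustrative stand-in for context-based
    prefetching, NOT the NN design from the paper."""

    def __init__(self):
        self.table = defaultdict(Counter)  # context -> observed delta counts
        self.last_addr = {}                # pc -> last address seen
        self.last_delta = {}               # pc -> last delta seen

    def access(self, pc, addr):
        """Record one access; return a predicted next address or None."""
        prediction = None
        if pc in self.last_addr:
            delta = addr - self.last_addr[pc]
            prev_ctx = (pc, self.last_delta.get(pc))
            cur_ctx = (pc, delta)
            # Predict using the most common delta seen for the current context.
            if self.table[cur_ctx]:
                best_delta = self.table[cur_ctx].most_common(1)[0][0]
                prediction = addr + best_delta
            # Online training: attribute the observed delta to the prior context.
            self.table[prev_ctx][delta] += 1
            self.last_delta[pc] = delta
        self.last_addr[pc] = addr
        return prediction

# Usage: a constant-stride stream is learned after a short warm-up.
p = ContextDeltaPredictor()
preds = [p.access(pc=1, addr=a) for a in (0, 8, 16, 24, 32)]
```

A real NN prefetcher would replace the frequency table with a trained model and a richer program context, but the control flow (predict, then update online from the observed outcome) is the same.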
In this work, we build a timing model to capture the parallel characteristics of an RSA public-key cipher implemented on a GPU. We consider optimizations that include Montgomery multiplication and sliding-window exponentiation to implement the cryptographic operations. Our timing model accounts for the complications that parallel execution introduces. Based on this timing model, we launch successful timing attacks on RSA running on a GPU and extract its private key. We also present an effective error detection and correction mechanism. Our results demonstrate that GPU acceleration of RSA is vulnerable to side-channel timing attacks.
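The sliding-window exponentiation the timing model targets can be sketched as follows. This is a textbook left-to-right sliding-window implementation in plain Python, not the GPU kernel from the paper; a real implementation would combine it with Montgomery multiplication, and the window width `w` is our choice.

```python
def sliding_window_pow(base, exp, mod, w=4):
    """Left-to-right sliding-window modular exponentiation.
    Precomputes the odd powers base^1, base^3, ..., base^(2^w - 1) mod mod,
    then scans the exponent bits, absorbing up to w bits per multiply."""
    if exp == 0:
        return 1 % mod
    table = {1: base % mod}
    base_sq = (base * base) % mod
    for i in range(3, 1 << w, 2):
        table[i] = (table[i - 2] * base_sq) % mod
    bits = bin(exp)[2:]
    result, i, n = 1, 0, len(bits)
    while i < n:
        if bits[i] == "0":
            result = (result * result) % mod   # single square for a 0 bit
            i += 1
        else:
            # Longest window of at most w bits starting here and ending in a 1;
            # its value is odd, so it is in the precomputed table.
            j = min(i + w, n)
            while bits[j - 1] == "0":
                j -= 1
            window = int(bits[i:j], 2)
            for _ in range(j - i):
                result = (result * result) % mod
            result = (result * table[window]) % mod
            i = j
    return result
```

The timing leak exploited by such attacks comes from the fact that the sequence of squares and table multiplications depends on the bit pattern of the secret exponent.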
Iteration Point Difference Analysis is a new static analysis framework that can determine the memory-coalescing characteristics of parallel loops targeted at GPU offloading and ascertain the safety and profitability of loop transformations aimed at improving their memory access characteristics. The analysis can propagate definitions through control flow, works for non-affine expressions, and is capable of analyzing expressions that reference conditionally defined values. This framework enables safe and profitable loop transformations. Experimental results demonstrate the potential for dramatic performance improvements. This work also demonstrates how architecture-aware compilers improve code portability and reduce programmer effort.
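The coalescing property the analysis reasons about can be illustrated dynamically. The paper's framework establishes it statically from symbolic iteration point differences; the helper functions below are our hypothetical illustrations of the underlying criterion: adjacent GPU threads should access adjacent elements.

```python
def iteration_point_differences(index_fn, points):
    """Difference of an index expression between adjacent iteration points.
    A constant difference of 1 (in element units) between adjacent threads
    indicates coalesced global-memory accesses."""
    return [index_fn(t + 1) - index_fn(t) for t in points]

def is_coalesced(index_fn, num_threads):
    diffs = iteration_point_differences(index_fn, range(num_threads - 1))
    return all(d == 1 for d in diffs)

ROW = 128  # hypothetical row length of a row-major 2D array

# A[tid][0]: adjacent threads stride by a whole row -> uncoalesced.
assert not is_coalesced(lambda t: t * ROW + 0, 32)

# A[0][tid]: adjacent threads touch adjacent elements -> coalesced.
assert is_coalesced(lambda t: 0 * ROW + t, 32)
```

A loop interchange that turns the first access pattern into the second is exactly the kind of transformation whose safety and profitability the static framework is designed to establish.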
Polyhedral Compilation for Multi-dimensional Stream Processing
In this paper, we show some potential advantages of using a formalism inspired by Quantum Computing (QC) to evaluate CMRDs with preemptions while avoiding the underlying NP-hard problem. The experimental results, obtained with a classic (non-quantum) numerical approach on a selection of Mälardalen benchmark programs, display very good accuracy, while the complexity of the evaluation is a low-order polynomial in the number of memory accesses. While this is not yet a full quantum algorithm, we provide a first roadmap on how to reach that objective in future work.
Many-cores execute a large number of diverse applications concurrently. Inter-application interference in the on-chip network can create a security threat in the form of timing channel attacks. How applications are mapped largely determines the interference among them in the on-chip network. In this work, we explore non-interference approaches based on run-time mapping at the software and application level. Through run-time mapping, we can maximize utilization of the system without leaking information. The proposed run-time mapping policy requires no router modification, in contrast to the best known competing schemes, and its throughput degradation is, on average, 16% lower than that of the state-of-the-art non-secure baselines.
We herein propose MH cache, a multi-retention STT-RAM-based cache management scheme for last-level caches (LLCs) that reduces their power consumption in mobile hardware rendering systems. We analyzed the memory access patterns of processes and observed how rendering methods affect process behavior. We propose a cache management scheme that dynamically measures the write intensity of each process and exploits it to manage the proposed cache. Our experimental results show that our techniques significantly reduce LLC power consumption, by 33% and 32.2% in single- and quad-core systems, respectively, compared to a full STT-RAM cache.
DRAM caches have emerged as an efficient new layer in the memory hierarchy to address the increasing diversity of memory components. This paper first investigates how prior approaches perform across diverse hybrid memory configurations and observes that no single DRAM cache organization outperforms the others in all scenarios. The paper then proposes a reconfigurable DRAM cache design that adapts to different hardware configurations and application patterns. Using a sample-based mechanism, the proposed DRAM cache controller dynamically identifies the best organization among three candidates and applies it through reconfiguration.