Contemporary GPUs support multiprogramming by allowing multiple kernels to run concurrently on the same streaming multiprocessors. Recent studies have demonstrated that such concurrent kernel execution improves both resource utilization and computational throughput. However, significant performance slowdown and unfairness are observed when latency-sensitive kernels co-run with bandwidth-intensive ones. In this paper, we first make the case that these problems cannot be sufficiently solved by managing CTA combinations alone, and we identify the fundamental reasons why. We then propose a coordinated approach to CTA combination and bandwidth partitioning. Our approach significantly improves performance even compared with the exhaustively searched CTA combination.
This work performs a thorough characterization and analysis of the Nutch web search benchmark, which is based on the popular open-source Lucene search library. The paper describes in detail the architecture, functionality, and micro-architectural behaviour of the search engine, and investigates prominent web search research issues. In particular, we study how intra-server index partitioning affects response time and throughput, explore the potential use of low-power servers for web search, and examine the potential causes of performance degradation and variability, as well as the causes of tail latencies.
Decoupling techniques have been proposed to reduce the amount of memory latency exposed to high-performance accelerators as they fetch data. Although decoupled access-execute (DAE) and more recent decoupled data supply approaches offer promising single-threaded performance improvements, little work has considered how to extend them to parallel scenarios. This paper explores the opportunities and challenges of designing parallel, high-performance, resource-efficient decoupled data supply systems. We propose Mercury, a parallel decoupled data supply system that exploits thread-level parallelism for high-throughput data supply with good portability. Additionally, we introduce micro-architectural improvements to the data supply units to efficiently handle long-latency indirect loads.
This paper is an extension of a paper published in PACT-2017. In this work, we propose a novel recompute-based failure-safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we log only enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution-time overheads and improving NVMM write endurance, at the expense of more complex recovery.
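The recompute-on-failure idea above can be illustrated with a minimal sketch. This is not the paper's actual mechanism (which targets NVMM and loop-based kernels at the architecture level); it is a hypothetical analogy in which only a tiny progress marker is persisted instead of a full log, and recovery simply recomputes the iterations that had not completed:

```python
# Hypothetical sketch of recompute-based recovery for a loop-based computation.
# Instead of logging every store, we persist only a small progress marker
# (the last fully completed iteration); after a crash, the uncompleted
# iterations are simply recomputed.

def run_with_recovery(n, out, progress):
    """`progress` is a dict standing in for a small persisted word in NVMM."""
    start = progress.get("last_done", -1) + 1   # recovery: find incomplete work
    for i in range(start, n):
        out[i] = i * i                          # the actual loop body
        progress["last_done"] = i               # tiny persisted marker, not a log

out, progress = {}, {}
run_with_recovery(5, out, progress)   # runs iterations 0..4, then "crashes"
run_with_recovery(8, out, progress)   # restart: recomputes only iterations 5..7
```

The key trade-off the abstract describes is visible even here: almost nothing is written for failure safety during normal execution, but recovery must re-execute work rather than replay a log.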
In this paper, we present the Power Optimised Software Envelope (POSE) model, which allows developers to assess whether power optimisation is worth pursuing for their applications. We first outline the POSE model using the established Energy-Delay Product (EDP) family of metrics. We then provide formulations of our model using the novel Energy-Delay Sum and Energy-Delay Distance metrics, as we believe these metrics are more appropriate for energy-aware optimisation efforts. Finally, we introduce an extension to POSE, named System Summary POSE. System Summary POSE allows us to reason about system-wide scope for energy-aware optimisation independently of any particular application.
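For context, the established EDP family mentioned above weights energy against delay (runtime), with an exponent controlling how strongly delay is penalised; the Energy-Delay Sum and Energy-Delay Distance metrics are defined in the paper itself and are not reproduced here:

```latex
% The ED^nP family: E is energy consumed, D is delay (execution time).
%   n = 0  ->  plain energy,   n = 1  ->  EDP,   n = 2  ->  ED^2P
\mathrm{ED}^{n}\mathrm{P} = E \cdot D^{\,n}
```

Larger values of n bias the metric toward performance, which is why different members of the family lead to different conclusions about whether a power optimisation is worthwhile.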
We introduce Caliper, a technique for accurately estimating the performance interference that occurs in shared servers, overcoming the limitations of prior approaches by leveraging a micro-experiment-based technique. In contrast to state-of-the-art approaches that periodically pause co-running applications to estimate slowdown, Caliper utilizes a strategic phase-triggered technique to capture interference due to co-location. This allows Caliper to orchestrate an accurate and low-overhead interference estimation technique that can be readily deployed in existing production systems. We evaluate Caliper on a wide spectrum of workload scenarios, demonstrating that it seamlessly supports up to 16 applications running simultaneously and outperforms state-of-the-art approaches.
In this paper, we propose Elastic-Cache to support both fine- and coarse-grained cache-line management to improve the L1 cache efficiency of GPUs. Specifically, it stores 32-byte words from non-contiguous memory regions in a single 128-byte cache line. Furthermore, it requires neither an extra memory structure nor a reduction in L1 cache capacity for tag storage. To improve the bandwidth utilization of the L1 cache, we further propose Elastic-Plus, which issues 32-byte requests in parallel to reduce instruction processing latency and improve GPU throughput. Our experiments show that Elastic-Cache and Elastic-Plus improve performance by 104% and 131%, respectively.
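The fine-grained line management described above can be sketched abstractly. This is an illustrative model only, not the paper's actual tag organization (Elastic-Cache specifically avoids extra tag storage, which this toy version does not attempt to replicate): a 128-byte line is treated as four 32-byte sectors, each able to hold a word from a different, non-contiguous 32-byte-aligned address.

```python
# Illustrative model (not the paper's design): a 128-byte cache line split
# into four 32-byte sectors, each caching a word from a possibly
# non-contiguous address. The placement rule below is a hypothetical choice.

SECTOR = 32            # fine-grained word size in bytes
SECTORS_PER_LINE = 4   # 128-byte line / 32-byte sectors

class ElasticLine:
    def __init__(self):
        # one (tag, data) slot per 32-byte sector
        self.slots = [None] * SECTORS_PER_LINE

    def lookup(self, addr):
        tag = addr // SECTOR                    # tag at 32-byte granularity
        entry = self.slots[tag % SECTORS_PER_LINE]
        return entry[1] if entry and entry[0] == tag else None  # hit or miss

    def fill(self, addr, data):
        tag = addr // SECTOR
        self.slots[tag % SECTORS_PER_LINE] = (tag, data)

line = ElasticLine()
line.fill(0x1000, b"a" * 32)   # two non-contiguous 32-byte words...
line.fill(0x5020, b"b" * 32)   # ...coexist in the same 128-byte line
```

A conventional 128-byte line would have wasted three quarters of its capacity on either access; per-sector tags let sparse 32-byte accesses share one line, which is the efficiency gain the abstract claims.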