David Black-Schaffer
Second-level Caches: Not for Instructions
Growing instruction footprints are straining processor front-ends, increasing fetch latency, and causing pipeline stalls. The universal approach to addressing this has been keeping instructions in each level of the cache hierarchy, but a p…
Mark–Scavenge: Waiting for Trash to Take Itself Out
Moving garbage collectors (GCs) typically free memory by evacuating live objects in order to reclaim contiguous memory regions. Evacuation is typically done either during tracing (scavenging), or after tracing when identification of live o…
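Although the abstract is truncated, the contrast it sets up (evacuating during tracing versus after tracing) can be made concrete. Below is a minimal Python sketch, not the paper's algorithm, of a moving collector that traces first and only evacuates afterwards, so reclamation waits until liveness is fully known; all names are illustrative.

def trace(roots):
    # Mark phase: find every object reachable from the roots.
    live, stack = set(), list(roots)
    while stack:
        obj = stack.pop()
        if id(obj) in live:
            continue
        live.add(id(obj))
        stack.extend(obj.fields)       # assumed: objects expose their references
    return live

def evacuate(region, live):
    # Relocation phase: copy survivors out, then reclaim the whole region.
    to_space = [obj for obj in region if id(obj) in live]
    region.clear()                     # contiguous memory reclaimed in one step
    return to_space                    # a real GC would also fix up pointers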
Mutator-Driven Object Placement using Load Barriers
Object placement impacts cache utilisation, which is itself critical for performance. Managed languages offer fewer tools for controlling object placement than unmanaged languages do, because of their abstract view of memory. On the other …
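As a rough illustration of the mechanism the title names (assumptions mine, not the paper's design): a load barrier intercepts every reference load, which gives the runtime a hook to observe what the mutator actually touches and to queue hot objects for relocation into a dense region.

HOT_THRESHOLD = 8          # assumed tuning knob, purely illustrative
hot_queue = []             # objects the GC may later relocate together

def load_barrier(obj, field):
    ref = getattr(obj, field)          # the actual reference load
    if ref is not None:
        ref.access_count = getattr(ref, "access_count", 0) + 1
        if ref.access_count == HOT_THRESHOLD:
            hot_queue.append(ref)      # candidate for mutator-driven placement
    return ref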
Protean: Resource-efficient Instruction Prefetching
Increases in code footprint and control flow complexity have made low-latency instruction fetch challenging. Dedicated Instruction Prefetchers (DIPs) can provide performance gains (up to 5%) for a subset of applications that are poorly ser…
Large-scale Graph Processing on Commodity Systems: Understanding and Mitigating the Impact of Swapping
Graph workloads are critical in many areas. Unfortunately, graph sizes have been increasing faster than DRAM capacity. As a result, large-scale graph processing necessarily falls back to virtual memory paging, resulting in tremendous perfo…
Exploring the Latency Sensitivity of Cache Replacement Policies
With DRAM latencies increasing relative to CPU speeds, the performance of caches has become more important. This has led to increasingly sophisticated replacement policies that require complex calculations to update their replacement metad…
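For context on what replacement-metadata updates look like on the access path, here is a toy LRU set in Python; the sophisticated policies the abstract refers to do substantially more work than this per access, and the sketch only shows where that work sits.

from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()          # tag -> data, in recency order

    def access(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)     # metadata update on every hit
            return "hit"
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict the least-recently-used line
        self.lines[tag] = None
        return "miss"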
Faster Functional Warming with Cache Merging
SMARTS-like sampled hardware simulation techniques achieve good accuracy by simulating many small portions of an application in detail. However, while this reduces the simulation time, it results in extensive cache warming times, as each o…
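A minimal sketch of what functional warming means in SMARTS-style sampling (illustrative only, not this paper's cache-merging technique): between detailed samples, the reference stream is still pushed through a tag-only cache model so its state is warm when detailed simulation resumes, and this warming is what dominates sampling time.

class FunctionalCache:
    def __init__(self, num_sets, ways, line=64):
        self.sets = [[] for _ in range(num_sets)]
        self.num_sets, self.ways, self.line = num_sets, ways, line

    def access(self, addr):
        tag = addr // self.line
        ways = self.sets[tag % self.num_sets]
        if tag in ways:
            ways.remove(tag)
        elif len(ways) >= self.ways:
            ways.pop(0)                     # evict LRU
        ways.append(tag)                    # tag state only, no timing model

cache = FunctionalCache(num_sets=64, ways=8)
for addr in range(0, 1 << 20, 64):          # fast-forward region: warm tags only
    cache.access(addr)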
Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order Cores
Exploiting memory-level parallelism (MLP) is crucial to hide long memory and last-level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exploiting MLP, they deliver poor energy effi…
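As background on what a "slice" is here (my sketch, not the paper's hardware): the backward dependence slice of a load is the chain of instructions that produce its address, and running such slices ahead of the rest of the program is what exposes MLP.

def backward_slice(instrs, load_idx):
    # instrs: list of (dest_reg, [src_regs]); returns indices in the slice.
    needed = set(instrs[load_idx][1])      # registers the load's address needs
    slice_ = {load_idx}
    for i in range(load_idx - 1, -1, -1):
        dest, srcs = instrs[i]
        if dest in needed:                 # producer of a needed register
            slice_.add(i)
            needed.discard(dest)
            needed.update(srcs)
    return sorted(slice_)

# e.g. r1=...; r3=...; r2=r1+4; load [r2] -> slice skips the r3 instruction
prog = [("r1", []), ("r3", []), ("r2", ["r1"]), (None, ["r2"])]
print(backward_slice(prog, 3))             # [0, 2, 3]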
Every walk’s a hit: making page walks single-access cache hits
As memory capacity has outstripped TLB coverage, large data applications suffer from frequent page table walks. We investigate two complementary techniques for addressing this cost: reducing the number of accesses required and reducing the…
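For reference, the cost being attacked is the standard radix walk; with the usual x86-64 4-level layout, every TLB miss takes four dependent memory accesses, one per level. The tiny table below is assumed data, just to make the sketch runnable.

def walk(memory, cr3, vaddr):
    entry = cr3
    for shift in (39, 30, 21, 12):         # PML4, PDPT, PD, PT levels
        index = (vaddr >> shift) & 0x1FF   # 9 index bits per level
        entry = memory[entry + index * 8]  # one memory access per level
    return entry | (vaddr & 0xFFF)         # frame base + page offset

# Map virtual page 0 to physical frame 0x5000 with trivial table bases.
mem = {0x1000: 0x2000, 0x2000: 0x3000, 0x3000: 0x4000, 0x4000: 0x5000}
print(hex(walk(mem, 0x1000, 0x0ABC)))      # 0x5abc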
Freeway to Memory Level Parallelism in Slice-Out-of-Order Cores
Exploiting memory level parallelism (MLP) is crucial to hide long memory and last level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exploiting MLP, they deliver poor energy effi…
Early Address Prediction
Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via r…
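To place the "address prediction and validation" family the abstract mentions (this is a generic stride predictor, not this paper's specific mechanism): predict the next address from the last observed stride, fetch early, and validate against the real address, replaying on a mismatch.

class StridePredictor:
    def __init__(self):
        self.table = {}                    # pc -> (last_addr, stride)

    def predict(self, pc):
        if pc in self.table:
            last, stride = self.table[pc]
            return last + stride           # speculative early address
        return None

    def update(self, pc, addr):
        last, _ = self.table.get(pc, (addr, 0))
        self.table[pc] = (addr, addr - last)

p, pc = StridePredictor(), 0x40
for addr in (100, 108, 116):
    guess = p.predict(pc)
    p.update(pc, addr)
    print(guess, "vs", addr)               # wrong guesses must be replayed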
A Reusable Characterization of the Memory System Behavior of SPEC2017 and SPEC2006
The SPEC CPU Benchmarks are used extensively for evaluating and comparing improvements to computer systems. This ubiquity makes characterization critical for researchers to understand the bottlenecks the benchmarks do and do not expose and…
Raw-Data: A Reusable Characterization Of The Memory System Behavior Of SPEC 2017 And SPEC 2006
This dataset accompanies the ISPASS 2020 extended abstract "Architecturally-Independent and Time-Based Characterization of SPEC CPU 2017" and the TACO paper "A Reusable Characterization of the Memory System Behavior of SPEC2017 and SPEC2006". In…
Page Tables: Keeping them Flat and Hot (Cached)
As memory capacity has outstripped TLB coverage, large data applications suffer from frequent page table walks. We investigate two complementary techniques for addressing this cost: reducing the number of accesses required and reducing the…
Architecturally-Independent and Time-Based Characterization of SPEC CPU 2017
Characterizing the memory behaviour of SPEC CPU benchmarks is critical to analyze bottlenecks in the execution. Unfortunately, most prior characterizations are tied to a particular system (e.g., via performance counters, fixed configuratio…
Modeling and optimizing NUMA effects and prefetching with machine learning
Both NUMA thread/data placement and hardware prefetcher configuration have significant impacts on HPC performance. Optimizing both together leads to a large and complex design space that has previously been impractical to explore at runtim…
Perforated Page: Supporting Fragmented Memory Allocation for Large Pages
The availability of large pages has dramatically improved the efficiency of address translation for applications that use large contiguous regions of memory. However, large pages can be difficult to allocate due to fragmented memory, non-m…
Efficient thread/page/parallelism autotuning for NUMA systems
Current multi-socket systems have complex memory hierarchies with significant Non-Uniform Memory Access (NUMA) effects: memory performance depends on the location of the data and the thread. This complexity means that thread- and data-mapp…
Filter caching for free
Modern processors contain store-buffers to allow stores to retire under a miss, thus hiding store-miss latency. The store-buffer needs to be large (for performance) and searched on every load (for correctness), thereby making it a costly s…
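The cost structure the abstract describes can be seen in a toy model (illustrative; the paper's contribution is getting filter-cache-like reuse out of this already-existing structure): every load associatively searches the buffer, newest store first, so capacity helps performance but every added entry is searched on every load.

class StoreBuffer:
    def __init__(self, entries):
        self.entries = entries
        self.buf = []                      # (addr, value), oldest first

    def store(self, addr, value):
        if len(self.buf) >= self.entries:
            self.buf.pop(0)                # oldest store drains to the cache
        self.buf.append((addr, value))

    def load(self, addr):
        for a, v in reversed(self.buf):    # associative search, newest first
            if a == addr:
                return v                   # store-to-load forwarding hit
        return None                        # miss: go to the L1 cache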
FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors
The number of instructions a processor's instruction queue can examine (depth) and the number it can issue together (width) determine its ability to take advantage of the ILP in an application. Unfortunately, increasing either the width or…
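One way to read "ready-aware scheduling" (details assumed here, not taken from the paper): instructions whose sources are already available at dispatch need no wakeup logic, so they can be steered to a cheap in-order FIFO, leaving the expensive out-of-order instruction queue for the rest.

def steer(instr, ready_regs, fifo, iq):
    if all(src in ready_regs for src in instr["srcs"]):
        fifo.append(instr)     # ready at dispatch: a cheap FIFO suffices
    else:
        iq.append(instr)       # must wait and wake up: needs an IQ entry

fifo, iq = [], []
ready = {"r1", "r2"}
steer({"op": "add", "srcs": ["r1", "r2"]}, ready, fifo, iq)   # -> fifo
steer({"op": "mul", "srcs": ["r7"]}, ready, fifo, iq)         # -> iq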
Freeway: Maximizing MLP for Slice-Out-of-Order Execution
Exploiting memory level parallelism (MLP) is crucial to hide long memory and last level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exploiting MLP, they deliver poor energy effi…
Minimizing Replay under Way-Prediction
Way-predictors are effective at reducing dynamic cache energy by reducing the number of ways accessed, but introduce additional latency for incorrect way-predictions. While previous work has studied the impact of the increased latency for …
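A toy model of the latency effect under study (illustrative only): a correct way-prediction reads a single way, while a wrong one forces a replay that probes the full set and retrains the predictor.

def access(cache_set, predictor, tag):
    way = predictor.get(tag, 0)            # predicted way for this tag
    if cache_set[way] == tag:
        return "hit", 1                    # one way read, minimal energy
    for w, t in enumerate(cache_set):      # replay: full-set lookup
        if t == tag:
            predictor[tag] = w             # retrain the predictor
            return "hit-after-replay", 2   # the extra latency being studied
    return "miss", 2

pred = {}
s = ["A", "B", "C", "D"]                   # tags currently in a 4-way set
print(access(s, pred, "C"))                # ('hit-after-replay', 2)
print(access(s, pred, "C"))                # ('hit', 1)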
Maximizing Limited Resources: a Limit-Based Study and Taxonomy of Out-of-Order Commit
Out-of-order execution is essential for high performance, general-purpose computation, as it can find and execute useful work instead of stalling. However, it is typically limited by the requirement of visibly sequential, atomic instructio…
Behind the Scenes: Memory Analysis of Graphical Workloads on Tile-Based GPUs
Graphics rendering is a complex multi-step process whose data demands typically dominate memory system design in SoCs. GPUs create images by merging many simpler scenes for each frame. For performance, scenes are tiled into parallel tasks …
Understanding the interplay between task scheduling, memory and performance
New programming models have been introduced to help programmers deal with the complexity of large-scale systems, simplifying the coding process and making applications more scalable. Task-based programming is one example that became p…
Exploring Scheduling Effects on Task Performance with TaskInsight
The complex memory hierarchies of today's machines make it very difficult to estimate task execution time: depending on where the data is placed in memory, tasks of the same type may end up having different performance. Mult…
TaskInsight
Recent scheduling heuristics for task-based applications have managed to improve their performance by taking into account memory-related properties such as data locality and cache sharing. However, there is still a general lack of tools that can provi…
Adaptive Cache Warming for Faster Simulations
The use of hardware-based virtualization allows modern simulators to very quickly fast-forward between sample points and regions of interest. This dramatically reduces the simulation time compared to traditional functional forwarding. Howe…
Characterizing Task Scheduling Performance Based on Data Reuse
Over the past years, several scheduling heuristics have been introduced to improve the performance of task-based applications, with schedulers increasingly becoming aware of memory-related bottlenec…
Spatial and Temporal Cache Sharing Analysis in Tasks
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8–11, 2016.