Alberto Ros
A Complexity-Effective Local Delta Prefetcher
Data prefetching is crucial for performance in modern processors because it effectively masks long-latency memory accesses. Over the past decades, numerous data prefetching mechanisms have been proposed, which have continuously reduced the acces…
Flexible Swapping for the Cloud
Memory has become the primary cost driver in cloud data centers. Yet, a significant portion of memory allocated to VMs in public clouds remains unused. To optimize this resource, "cold" memory can be reclaimed from VMs and stored on slower…
Alternate Path μ-op Cache Prefetching
Bounding Speculative Execution of Atomic Regions to a Single Retry
Mutual exclusion has long served as a fundamental construct in parallel programs. Despite a long history of optimizing the lower-level lock and unlock operations used to enforce mutual exclusion, such operations largely dictate performance…
Improved Converted Traces from Rebasing Microarchitectural Research with Industry Traces
Improved converted traces of the paper "Rebasing Microarchitectural Research with Industry Traces", published at the 2023 IEEE International Symposium on Workload Characterization. It includes the CVP-1 traces used in the paper converted w…
On the interactions between ILP and TLP with hardware transactional memory
Hardware implementations of Transactional Memory (HTM) are designed to facilitate efficient thread synchronization in parallel programs, encouraging the use of larger critical sections. By employing optimistic concurrency control to execut…
Rebasing Microarchitectural Research with Industry Traces
Data Artifact: Rebasing Microarchitectural Research with Industry Traces
Data Artifact of the paper "Rebasing Microarchitectural Research with Industry Traces", published at the 2023 IEEE International Symposium on Workload Characterization. It includes the original CVP-1 traces used in the paper. Note: the imp…
Towards faster, greener and easier to program computers
The ERC Consolidator Grant project ECHO (Extending Coherence for Hardware-Driven Optimizations in Multicore Architectures) aims to change the events that occur in multiprocessors such…
Speculative inter-thread store-to-load forwarding in SMT architectures
Applications running on out-of-order cores have benefited for decades from store-to-load forwarding, which accelerates communication of store values to loads of the same thread. Despite threads running on a simultaneous multithreading (SMT) c…
Exploring Instruction Fusion Opportunities in General Purpose Processors
Berti: an Accurate Local-Delta Data Prefetcher
Data prefetching is a technique that plays a crucial role in modern high-performance processors by hiding long latency memory accesses. Several state-of-the-art hardware prefetchers exploit the concept of deltas, defined as the difference …
Free atomics
Do Not Predict – Recompute! How Value Recomputation Can Truly Boost the Performance of Invisible Speculation
Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply…
Compiler-Assisted Compaction/Restoration of SIMD Instructions
All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. Control flow div…
On Value Recomputation to Accelerate Invisible Speculation
Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply…
Boosting Store Buffer Efficiency with Store-Prefetch Bursts
Virtually all processors today employ a store buffer (SB) to hide store latency. However, when the store buffer is full, store latency is exposed to the processor causing pipeline stalls. The default strategies to mitigate these stalls are…
Speculative Enforcement of Store Atomicity
Various memory consistency model implementations (e.g., x86, SPARC) willfully allow a core to see its own stores while they are in limbo, i.e., executed (and perhaps retired) but not yet inserted in memory order. This is known as store-to-…
Regional Out-of-Order Writes in Total Store Order
The store buffer, an essential component in today's processors, is designed to hide memory latency by moving stores off the processor's critical path. Furthermore, under the Total Store Order (TSO) memory model, the store buffer ensures th…
The Entangling Instruction Prefetcher
Prefetching instructions is a fundamental technique for designing high-performance computers. There are three key properties to consider when designing an efficient and effective prefetcher: timeliness, coverage, and accuracy. Timeliness is …
Efficient invisible speculative execution through selective delay and value prediction
Speculative execution, the base on which modern high-performance general-purpose CPUs are built, has recently been shown to enable a slew of security attacks. All these attacks are centered around a common set of behaviors: During specu…
Filter caching for free
Modern processors contain store-buffers to allow stores to retire under a miss, thus hiding store-miss latency. The store-buffer needs to be large (for performance) and searched on every load (for correctness), thereby making it a costly s…
Way Combination for an Adaptive and Scalable Coherence Directory
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, cre…
Ghost loads
Speculative execution is necessary for achieving high performance on modern general-purpose CPUs but, starting with Spectre and Meltdown, it has also been proven to cause severe security flaws. In case of a misspeculation, the architectura…
The Superfluous Load Queue
In an out-of-order core, the load queue (LQ), the store queue (SQ), and the store buffer (SB) are responsible for ensuring: i) correct forwarding of stores to loads and ii) correct ordering among loads (with respect to external stores). Th…
Non-Speculative Store Coalescing in Total Store Order
We present a non-speculative solution for a coalescing store buffer in total store order (TSO) consistency. Coalescing violates TSO with respect to both conflicting loads and conflicting stores, if partial state is exposed to the memory sy…
Mending Fences with Self-Invalidation and Self-Downgrade
Cache coherence protocols based on self-invalidation and self-downgrade have recently seen increased popularity due to their simplicity, potential performance efficiency, and low energy consumption. However, such protocols result in memory…