Dave Dice
Semaphores Augmented with a Waiting Array
Semaphores are a widely used and foundational synchronization and coordination construct for shared-memory multithreaded programming. They are a keystone concept, in the sense that most other synchronization constructs can be implemen…
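The "keystone" claim lends itself to a small illustration. The following is a minimal sketch of our own, not taken from the paper: a mutual-exclusion lock built from a C++20 binary semaphore initialized with one permit.

```cpp
// Sketch: a mutex implemented on top of a binary semaphore.
// Not the paper's construction; just an illustration of semaphores
// serving as a building block for other synchronization primitives.
#include <semaphore>

class SemMutex {
    std::binary_semaphore sem_{1};      // one permit available => lock is free
public:
    void lock()   { sem_.acquire(); }   // take the permit, blocking if it is held
    void unlock() { sem_.release(); }   // return the permit, admitting one waiter
};
```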
Reciprocating Locks
We present "Reciprocating Locks", a novel mutual exclusion locking algorithm, targeting cache-coherent shared memory (CC), that enjoys a number of desirable properties. The doorway arrival phase and the release operation both run in consta…
Exploring Time-Space trade-offs for synchronized in Lilliput
In the context of Project Lilliput, which attempts to reduce the size of the object header in the HotSpot Java Virtual Machine (JVM), we explore a curated set of synchronization algorithms. Each of the algorithms could serve as a potential rep…
FedPerm: Private and Robust Federated Learning by Parameter Permutation
Federated Learning (FL) is a distributed learning paradigm that enables mutually untrusting clients to collaboratively train a common machine learning model. Client data privacy is paramount in FL. At the same time, the model must be prote…
Intra-process Caching and Reuse of Threads
Creating and destroying threads on modern Linux systems incurs high latency even absent concurrency, and fails to scale as we increase concurrency. To address this concern we introduce a process-local cache of idle threads. Specifically, inste…
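The core idea can be conveyed with a small, hedged sketch of our own (assumptions, not the paper's implementation): rather than letting a worker thread terminate when its task finishes, park it in a process-local pool so a later spawn can hand it new work instead of paying for thread creation again.

```cpp
// Sketch: a process-local cache of idle threads. Illustrative only.
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class CachedThreads {
public:
    // Run fn on a cached idle thread if one exists, else on a fresh thread.
    void spawn(std::function<void()> fn) {
        std::lock_guard<std::mutex> lk(m_);
        pending_.push_back(std::move(fn));
        if (idle_ > 0) cv_.notify_one();                 // reuse a parked thread
        else workers_.emplace_back([this] { serve(); }); // fall back to creation
    }
    ~CachedThreads() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
private:
    void serve() {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            if (!pending_.empty()) {
                auto fn = std::move(pending_.front());
                pending_.pop_front();
                lk.unlock();
                fn();                                    // run outside the lock
                lk.lock();
                continue;
            }
            if (stop_) return;
            ++idle_;                                     // park instead of exiting
            cv_.wait(lk, [this] { return stop_ || !pending_.empty(); });
            --idle_;
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> pending_;
    std::vector<std::thread> workers_;
    int idle_ = 0;
    bool stop_ = false;
};
```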
Ready When You Are: Efficient Condition Variables via Delegated Condition Evaluation
Multithreaded applications commonly utilize condition variables for communication between threads. Condition variables allow threads to block and wait until a certain condition holds, and also enable threads to wake up their blocked peers n…
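As a point of reference, the conventional wait/notify idiom the paper starts from looks roughly like the sketch below (standard C++ API, illustrative names; the paper's delegated condition evaluation is not shown):

```cpp
// Sketch of the classic condition-variable pattern.
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool ready = false;          // the "certain condition" threads wait on

void waiter() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return ready; });   // block until the condition holds
    // ... proceed while holding the lock ...
}

void signaler() {
    {
        std::lock_guard<std::mutex> lk(m);
        ready = true;                    // change the condition under the lock
    }
    cv.notify_one();                     // wake a blocked peer
}
```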
Optimizing Inference Performance of Transformers on CPUs
The Transformer architecture revolutionized the field of natural language processing (NLP). Transformers-based models (e.g., BERT) power many important Web services, such as search, translation, question-answering, etc. While enormous rese…
Hemlock: Compact and Scalable Mutual Exclusion
We present Hemlock, a novel mutual exclusion locking algorithm that is extremely compact, requiring just one word per thread plus one word per lock, but which still provides local spinning in most circumstances, high throughput under conte…
Compact Java Monitors
For scope and context, the idea we'll describe below, Compact Java Monitors, is intended as a potential replacement implementation for the "synchronized" construct in the HotSpot JVM. The reader is assumed to be familiar with current HotS…
Scalable range locks for scalable address spaces and beyond
Range locks are a synchronization construct designed to provide multiple threads (or processes) with concurrent access to disjoint parts of a shared resource. Originally conceived in the file system context, range locks are gaining increasin…
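To make the construct concrete, here is a deliberately naive sketch of our own (not the paper's scalable design): threads lock half-open [start, end) intervals and may proceed concurrently whenever their intervals do not overlap.

```cpp
// Sketch: a naive range lock guarded by a single mutex. Illustrative only.
#include <condition_variable>
#include <cstdint>
#include <list>
#include <mutex>

class NaiveRangeLock {
    struct Range { uint64_t start, end; };
    std::mutex m_;
    std::condition_variable cv_;
    std::list<Range> held_;                 // currently locked ranges

    bool overlaps(uint64_t s, uint64_t e) const {
        for (const auto& r : held_)
            if (s < r.end && r.start < e) return true;
        return false;
    }
public:
    void lock(uint64_t start, uint64_t end) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !overlaps(start, end); });
        held_.push_back({start, end});      // admit disjoint ranges concurrently
    }
    void unlock(uint64_t start, uint64_t end) {
        std::lock_guard<std::mutex> lk(m_);
        held_.remove_if([&](const Range& r) {
            return r.start == start && r.end == end;
        });
        cv_.notify_all();                   // let overlapping waiters recheck
    }
};
```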
Fissile Locks
Classic test-and-set (TS) mutual exclusion locks are simple, and enjoy high performance and low latency of ownership transfer under light or no contention. However, they do not scale gracefully under high contention and do not provide any…
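For reference, the classic test-and-set spin lock that serves as the starting point can be sketched as follows (our illustration; Fissile Locks itself is not shown):

```cpp
// Sketch: classic test-and-set (TS) spin lock.
#include <atomic>

class TsLock {
    std::atomic<bool> held_{false};
public:
    void lock() {
        // Spin, repeatedly trying to flip held_ from false to true atomically.
        while (held_.exchange(true, std::memory_order_acquire)) {
            // A real implementation would pause or back off here.
        }
    }
    void unlock() { held_.store(false, std::memory_order_release); }
};
```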
Avoiding Scalability Collapse by Restricting Concurrency
Saturated locks often degrade the performance of a multithreaded application, leading to a so-called scalability collapse problem. This problem arises when a growing number of threads circulating through a saturated lock causes the overall…
Compact NUMA-Aware Locks
Modern multi-socket architectures exhibit non-uniform memory access (NUMA) behavior, where access by a core to data cached locally on a socket is much faster than access to data cached on a remote socket. Prior work offers several efficien…
TWA -- Ticket Locks Augmented with a Waiting Array
The classic ticket lock consists of ticket and grant fields. Arriving threads atomically fetch-and-increment ticket and then wait for grant to become equal to the value returned by the fetch-and-increment primitive, at which point the thre…
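The classic ticket lock described above can be sketched as follows (our illustration; the waiting array that TWA adds is not shown):

```cpp
// Sketch: classic ticket lock with ticket and grant fields.
#include <atomic>
#include <cstdint>

class TicketLock {
    std::atomic<uint64_t> ticket_{0};   // next ticket to hand out
    std::atomic<uint64_t> grant_{0};    // ticket currently allowed to enter
public:
    void lock() {
        // Arriving thread atomically fetch-and-increments ticket ...
        uint64_t my = ticket_.fetch_add(1, std::memory_order_relaxed);
        // ... then waits until grant equals the value it received.
        while (grant_.load(std::memory_order_acquire) != my) { /* spin */ }
    }
    void unlock() {
        // Pass ownership to the next ticket in FIFO order.
        grant_.fetch_add(1, std::memory_order_release);
    }
};
```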
Improving Parallelism in Hardware Transactional Memory
Today’s hardware transactional memory (HTM) systems rely on existing coherence protocols, which implement a requester-wins strategy. This, in turn, leads to poor performance when transactions frequently conflict, causing them to resort to …
Persistent Memory Transactions
This paper presents a comprehensive analysis of performance trade-offs between implementation choices for transaction runtime systems on persistent memory. We compare three implementations of transaction runtimes: undo logging, redo loggin…
Malthusian Locks
Applications running in modern multithreaded environments are sometimes "over-threaded". The excess threads do not improve performance, and in fact may act to degrade performance via "scalability collapse". Often, such software a…
The Influence of Malloc Placement on TSX Hardware Transactional Memory
The hardware transactional memory (HTM) implementation in Intel's i7-4770 "Haswell" processor tracks the transactional read-set in the L1 (level-1), L2 (level-2) and L3 (level-3) caches and the write-set in the L1 cache. Displacement or ev…