Out-of-Vocabulary Sampling Boosts Speculative Decoding
2025-06-02 • Nadav Timor, Jonathan Mamou, Oren Pereg, Hongyang Zhang, David Harel
Speculative decoding relies on fast and accurate drafters. Recent state-of-the-art language models employ larger and larger vocabularies, which significantly slows down drafters. One promising approach to boost the efficiency of speculative decoding is to use drafters with smaller vocabularies. However, existing sampling methods cannot draw out-of-vocabulary tokens, creating a tradeoff between drafters' vocabulary size and acceptance rates. This paper introduces Redistributing Drafter Kernels (RDK), the first out-…
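As background on the tradeoff the abstract describes, here is a minimal sketch of the standard token-level accept/resample rule used in speculative sampling (not RDK itself, which the paper introduces). When the drafter has a smaller vocabulary, every out-of-vocabulary token gets zero draft probability, so it can only ever enter the output through the residual resampling step after a rejection, never as a drafted token. The function name and the zero-padded draft distribution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(p_target, q_draft, drafted_token):
    """Standard token-level speculative sampling check (a generic sketch, not RDK).

    p_target, q_draft: probability vectors over the *target* vocabulary.
    A drafter with a smaller vocabulary assigns q_draft = 0 to every
    out-of-vocabulary token, so such tokens can never be drafted and only
    appear through the residual resampling step below.
    """
    p, q = p_target[drafted_token], q_draft[drafted_token]
    if rng.random() < min(1.0, p / max(q, 1e-12)):
        return drafted_token                        # accepted: keep the drafted token
    residual = np.maximum(p_target - q_draft, 0.0)  # rejected: resample from (p - q)+
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual)
```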
Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
2025-01-31 • Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Gaurav Jain, Roy Schwartz, Moshe Wasserblat, David Harel
Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters, often necessitating the training of a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint.…
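One intuitive way to bridge mismatched vocabularies, in the spirit of (though not identical to) the lossless algorithms the paper presents, is to round-trip drafted tokens through text and re-encode them with the target's tokenizer before verification. The checkpoints and helper below are illustrative placeholders, not the paper's implementation.

```python
from transformers import AutoTokenizer

# Placeholder checkpoints: a small-vocabulary drafter and a large-vocabulary target.
drafter_tok = AutoTokenizer.from_pretrained("gpt2")
target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def redraft_for_target(draft_ids):
    """Decode the drafter's token ids to text, then re-encode with the target's
    tokenizer, so the target model can verify the speculation in its own vocabulary."""
    text = drafter_tok.decode(draft_ids, skip_special_tokens=True)
    return target_tok.encode(text, add_special_tokens=False)
```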
Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models
2024-05-07 • Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz
Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL), the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL. Our experiments with four datasets show that DISCO reache…
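To make the static-versus-dynamic SL distinction concrete, the toy drafting loop below stops speculating when the drafter's own top-token probability drops below a threshold, rather than always drafting a fixed number of tokens. DISCO itself learns when to stop with a trained classifier; the threshold rule here is only a stand-in, and the model and parameter names are assumptions.

```python
import torch

@torch.no_grad()
def draft_tokens(drafter, input_ids, max_lookahead=8, stop_threshold=0.5):
    """Toy dynamic-lookahead drafting loop (a stand-in for a learned stopping rule).

    Instead of always drafting a fixed number of tokens (static SL), stop early
    when the drafter's top-token probability falls below `stop_threshold`.
    `drafter` is assumed to be a Hugging Face causal LM; batch size 1 is assumed.
    """
    drafted = []
    for _ in range(max_lookahead):
        logits = drafter(input_ids).logits[:, -1, :]
        probs = torch.softmax(logits, dim=-1)
        confidence, next_id = probs.max(dim=-1)
        drafted.append(next_id.item())
        input_ids = torch.cat([input_ids, next_id[:, None]], dim=-1)
        if confidence.item() < stop_threshold:
            break  # low confidence: hand the drafted prefix to the target for verification
    return drafted
```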
Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference
2024-05-23 • Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel
This paper introduces distributed speculative inference (DSI), a novel inference algorithm that is provably faster than speculative inference (SI) [leviathan2023, chen2023, miao2024, sun2025, timor2025] and standard autoregressive inference (non-SI). Like other SI algorithms, DSI operates on frozen language models (LMs), requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups over non-SI, but rely on sufficiently fast …
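The toy sketch below illustrates the general idea of speculation parallelism: the target verifies the current block on a worker thread while the drafter already speculates past it, instead of alternating draft and verify serially. The `draft_block` and `verify_block` callables are hypothetical placeholders standing in for drafter and target calls; DSI's actual orchestration and its losslessness and speedup guarantees are given in the paper, not here.

```python
from concurrent.futures import ThreadPoolExecutor

def overlapped_decode(draft_block, verify_block, prompt, num_rounds=4):
    """Toy sketch of overlapping drafting with verification.

    draft_block(text) -> str   : drafter proposes a block of text continuing `text`.
    verify_block(text, block) -> str : target verifies `block` and returns the
                                       accepted text so far (hypothetical helpers).
    """
    prefix = prompt
    block = draft_block(prefix)
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(num_rounds):
            pending = pool.submit(verify_block, prefix, block)  # target verifies block i ...
            lookahead = draft_block(prefix + block)             # ... while the drafter works on block i+1
            accepted = pending.result()                         # text accepted by the target so far
            # Reuse the lookahead only if the whole block was accepted; otherwise the
            # speculation is off the accepted path and must be redrafted.
            block = lookahead if accepted == prefix + block else draft_block(accepted)
            prefix = accepted
    return prefix
```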
Efficient Few-Shot Learning Without Prompts
2022-09-22 • Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, Oren Pereg
Recent few-shot methods, such as parameter-efficient fine-tuning (PEFT) and pattern exploiting training (PET), have achieved impressive results in label-scarce settings. However, they are difficult to employ since they are subject to high variability from manually crafted prompts, and typically require billion-parameter language models to achieve high accuracy. To address these shortcomings, we propose SetFit (Sentence Transformer Fine-tuning), an efficient and prompt-free framework for few-shot fine-tuning of Sen…
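A stripped-down sketch of the prompt-free idea: embed the handful of labeled examples with a Sentence Transformer and fit a lightweight classification head on the embeddings. Full SetFit additionally fine-tunes the encoder contrastively on pairs built from those examples, which this sketch omits; the checkpoint name and toy data are illustrative.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Frozen-encoder variant of the prompt-free recipe: embed few-shot examples,
# then train a small classifier head on the embeddings. No prompts, no LLM.
encoder = SentenceTransformer("paraphrase-mpnet-base-v2")  # illustrative checkpoint

train_texts = ["great film, loved it", "utter waste of time"]  # tiny labeled set (toy data)
train_labels = [1, 0]

head = LogisticRegression().fit(encoder.encode(train_texts), train_labels)

def predict(texts):
    """Embed unlabeled texts and classify them with the fitted head."""
    return head.predict(encoder.encode(texts))

print(predict(["one of the best movies this year"]))
```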