Tal Remez
Discrete Flow Matching
Flow Matching and diffusion models have emerged as powerful generative paradigms for continuous variables such as images and videos, but their application to high-dimensional discrete data, such as language, is still limited. In this…
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
It is a common belief that large language models (LLMs) are better than smaller ones. However, larger models also require significantly more time and compute during inference. This raises the question: what happens when both models ope…
Masked Audio Generation using a Single Non-Autoregressive Transformer
We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT consists of a single-stage, non-autoregressive transformer. During training, we pr…
Code Llama: Open Foundation Models for Code
We release Code Llama, a family of large language models for code based on Llama 2, providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following abil…
EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
Recent work has shown that it is possible to resynthesize high-quality speech based not on text but on low-bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech …
Simple and Controllable Music Generation
We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised …
Textually Pretrained Speech Language Models
Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm start from a pretrained textual language model. We show …
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement
Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subje…
AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify …
More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate h…
Translatotron 2: Robust direct speech-to-speech translation
We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that con…
Translatotron 2: High-quality direct speech-to-speech translation with voice preservation
We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a linguistic decoder, an acoustic synthesizer, and a single attention module that …
Improving On-Screen Sound Separation for Open-Domain Videos with Audio-Visual Self-Attention
We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous …
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioSc…
Shape Correspondence with Isometric and Non-Isometric Deformations
The registration of surfaces with non-rigid deformation, especially non-isometric deformations, is a challenging problem. When applying such techniques to real scans, the problem is compounded by topological and geometric inconsistencies b…
Deep Functional Maps: Structured Prediction for Dense Shape Correspondence
We introduce a new framework for learning dense correspondence between deformable 3D shapes. Existing learning based approaches model shape correspondence as a labelling problem, where each point of a query shape receives a label identifyi…
Efficient Deformable Shape Correspondence via Kernel Matching
We present a method to match three dimensional shapes under non-isometric deformations, topology changes and partiality. We formulate the problem as matching between a set of pair-wise and point-wise descriptors, imposing a continuity prio…
Deep Class Aware Denoising
The increasing demand for high image quality in mobile devices brings forth the need for better computational enhancement techniques, and image denoising in particular. At the same time, the images captured by these devices can be categori…
Deep Convolutional Denoising of Low-Light Images
The Poisson distribution is used to model noise in photon-limited imaging. While canonical examples include relatively exotic types of sensing such as spectral imaging or astronomy, the problem is relevant to regular photography now more than…
Cloud Dictionary: Sparse Coding and Modeling for Point Clouds
With the development of range sensors such as LIDAR and time-of-flight cameras, 3D point cloud scans have become ubiquitous in computer vision applications, the most prominent ones being gesture recognition and autonomous driving. Parsimon…
FPGA system for real-time computational extended depth of field imaging using phase aperture coding
We present a proof-of-concept end-to-end system for computational extended depth of field (EDOF) imaging. The acquisition is performed through a phase-coded aperture implemented by placing a thin wavelength-dependent optical mask inside th…
Image reconstruction from dense binary pixels
Recently, the dense binary pixel Gigavision camera has been introduced, emulating a digital version of photographic film. While it seems to be a promising solution for HDR imaging, its output is not directly usable and requires an image r…
Spatially Coherent Random Forests
Spatially Coherent Random Forest (SCRF) extends Random Forest to create spatially coherent labeling. Each split function in SCRF is evaluated based on a traditional information gain measure that is regularized by a spatial coherency term. …