Jiasen Lu
CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching
Conditional generative modeling aims to learn a conditional data distribution from samples containing data-condition pairs. For this, diffusion and flow-based methods have attained compelling results. These methods use a learned (flow) mod…
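The truncated abstract points at the standard conditional flow-matching setup that the paper builds on. As a rough illustration only (not the CAR-Flow reparameterization itself; `velocity_net` and its signature are hypothetical), a minimal conditional flow-matching loss with a straight-line path looks like:

```python
# Minimal sketch of a plain conditional flow-matching loss (linear interpolation
# path). Illustrates the baseline objective, not CAR-Flow's reparameterization.
import torch

def conditional_flow_matching_loss(velocity_net, x1, cond):
    """x1: batch of target samples; cond: batch of conditioning inputs (hypothetical)."""
    x0 = torch.randn_like(x1)                                   # source sample from a standard Gaussian
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                                # point on the straight-line path
    target_velocity = x1 - x0                                   # velocity of that path
    predicted_velocity = velocity_net(xt, t, cond)              # model's conditional velocity prediction
    return ((predicted_velocity - target_velocity) ** 2).mean() # regress onto the target velocity field
```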
UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervi…
One Diffusion to Generate Them All
We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, lay…
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
Modern language models can process inputs across diverse languages and modalities. We hypothesize that models acquire this capability through learning a shared representation space across heterogeneous data types (e.g., different languages…
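One simple way to probe such a shared space is to compare intermediate hidden states for parallel inputs. The sketch below is hypothetical (the paper's exact probing protocol is not given in the truncated abstract, and `states_lang_a`/`states_lang_b` stand in for per-layer states collected from any multilingual model):

```python
# Hypothetical probe: cosine similarity of mean-pooled hidden states, per layer,
# for two parallel sentences in different languages. High similarity at middle
# layers would be consistent with a shared "hub" representation.
import torch.nn.functional as F

def layerwise_similarity(states_lang_a, states_lang_b):
    """Each input: list of (seq_len, d) tensors, one per layer, for the same meaning."""
    sims = []
    for h_a, h_b in zip(states_lang_a, states_lang_b):
        sims.append(F.cosine_similarity(h_a.mean(dim=0), h_b.mean(dim=0), dim=0).item())
    return sims  # one similarity score per layer
```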
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open o…
SoupLM: Model Integration in Large Language and Multi-Modal Models
Training large language models (LLMs) and multimodal LLMs necessitates significant computing resources, and existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks. For insta…
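The abstract is cut off before it describes the integration recipe; assuming the "soup" refers to weight-space interpolation of checkpoints with a shared architecture (an assumption on my part, not confirmed by the excerpt), a minimal sketch would be:

```python
# Hypothetical sketch of weight-space interpolation between two fine-tuned
# checkpoints; assumes "soup" means parameter averaging over compatible state dicts
# (floating-point parameters with identical keys and shapes).
def interpolate_state_dicts(sd_a, sd_b, alpha=0.5):
    """Linearly interpolate two state dicts: (1 - alpha) * A + alpha * B."""
    return {k: (1.0 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

# usage sketch:
# merged = interpolate_state_dicts(model_a.state_dict(), model_b.state_dict(), alpha=0.3)
# model_a.load_state_dict(merged)
```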
Preserving Identity with Variational Score for General-purpose 3D Editing
We present Piva (Preserving Identity with Variational Score Distillation), a novel optimization-based method for editing images and 3D models based on diffusion models. Specifically, our approach is inspired by the recently proposed method…
Hierarchical Question-Image Co-Attention for Visual Question Answering
A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling where l…
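A simplified sketch of the co-attention idea: a single word-region affinity matrix drives attention over both the question and the image. Dimensions and the max-pooling choice here are illustrative, not the paper's full hierarchical formulation:

```python
# Simplified parallel co-attention between question words and image regions.
import torch
import torch.nn.functional as F

def parallel_co_attention(Q, V, W_b):
    """Q: (T, d) word features, V: (N, d) region features, W_b: (d, d) learned bilinear map."""
    C = torch.tanh(Q @ W_b @ V.T)                          # (T, N) word-region affinity
    region_attn = F.softmax(C.max(dim=0).values, dim=0)    # (N,) attention over image regions
    word_attn = F.softmax(C.max(dim=1).values, dim=0)      # (T,) attention over question words
    attended_image = region_attn @ V                       # (d,) attended visual feature
    attended_question = word_attn @ Q                      # (d,) attended question feature
    return attended_question, attended_image
```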
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action…
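The core idea named in the abstract is mapping every modality into one discrete token sequence. The sketch below is purely illustrative: the vocabulary sizes, offsets, and packing helper are hypothetical, not the paper's actual layout:

```python
# Hypothetical packing of text tokens and discretized image codes into a single
# sequence over a shared vocabulary (the unifying trick behind "tokenize
# inputs and outputs"); audio and action tokens would get further offsets.
TEXT_VOCAB_SIZE = 32000            # text subword ids occupy [0, 32000)
IMAGE_CODEBOOK_SIZE = 8192         # discrete image codes (e.g., from a VQ-style encoder)
IMAGE_OFFSET = TEXT_VOCAB_SIZE     # image codes are shifted past the text range

def pack_example(text_ids, image_codes):
    """Concatenate text tokens and offset image tokens into one token sequence."""
    image_ids = [IMAGE_OFFSET + c for c in image_codes]
    return text_ids + image_ids

# usage sketch: pack_example([5, 17, 93], [4000, 12, 4095])
```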
Multi-Modal Answer Validation for Knowledge-Based VQA
The problem of knowledge-based visual question answering involves answering questions that require external knowledge in addition to the content of the image. Such knowledge typically comes in various forms, including visual, textual, and …
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region…
ASC me to Do Anything: Multi-task Training for Embodied AI
Embodied AI has seen steady progress across a diverse set of independent tasks. While these varied tasks have different end goals, the basic skills required to complete them successfully overlap significantly. In this paper, our goal is to…
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, …
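The abstract mentions a new training objective that learns from multiple modalities; a generic contrastive (InfoNCE-style) loss gives the flavor of such objectives, though the exact MERLOT Reserve formulation differs and the names below are illustrative:

```python
# Generic contrastive matching: each predicted embedding should score highest
# against its own target snippet among in-batch negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(pred, targets, temperature=0.05):
    """pred, targets: (B, d); row i of pred should match row i of targets."""
    pred = F.normalize(pred, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = pred @ targets.T / temperature                    # (B, B) similarity matrix
    labels = torch.arange(pred.shape[0], device=pred.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```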
A Simple Long-Tailed Recognition Baseline via Vision-Language Model
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems. Existing approaches either perform class re-balancing strategies or directly improve network modules to …
Container: Context Aggregation Network
Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers -- originally introduced in natural language processing -- have been increasingly adopted i…
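Convolution and self-attention can both be read as "affinity matrix times values"; a sketch of that unified context-aggregation view, with a learnable mix of a static (convolution-like) and a dynamic (attention-like) affinity, is below. Shapes and the scalar gate are illustrative, not the paper's exact module:

```python
# Context aggregation sketch: output = affinity @ values, where the affinity is
# a learnable mix of a static (input-independent) and a dynamic (input-dependent)
# matrix.
import torch
import torch.nn as nn

class ContextAggregation(nn.Module):
    def __init__(self, num_tokens, dim):
        super().__init__()
        self.static_affinity = nn.Parameter(torch.randn(num_tokens, num_tokens) * 0.02)
        self.to_qk = nn.Linear(dim, 2 * dim)
        self.mix = nn.Parameter(torch.tensor(0.5))   # learnable static/dynamic mixing weight

    def forward(self, x):                            # x: (B, N, d)
        q, k = self.to_qk(x).chunk(2, dim=-1)
        dynamic = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        static = torch.softmax(self.static_affinity, dim=-1)
        affinity = self.mix * dynamic + (1 - self.mix) * static
        return affinity @ x                          # aggregate context with the mixed affinity
```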
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and…
Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data
Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to new tasks, minimizing expensive …
Spatially Aware Multimodal Transformers for TextVQA
12-in-1: Multi-Task Vision and Language Representation Learning
Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tas…
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing…
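In the two-stream design, each stream attends over the other; a minimal sketch of one such co-attentional layer using standard cross-attention is below (sizes illustrative; residual connections, feed-forward blocks, and layer norm omitted):

```python
# Sketch of a co-attentional layer: the vision stream queries the text stream's
# keys/values and vice versa.
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.vision_attends_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attends_vision = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vision_tokens, text_tokens):
        v_out, _ = self.vision_attends_text(vision_tokens, text_tokens, text_tokens)
        t_out, _ = self.text_attends_vision(text_tokens, vision_tokens, vision_tokens)
        return v_out, t_out
```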
Emergence of Compositional Language with Deep Generational Transmission
Recent work has studied the emergence of language among deep reinforcement learning agents that must collaborate to solve a task. Of particular interest are the factors that cause language to be compositional -- i.e., express meaning by co…
Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic unknown environments. This challenging task demands that the agent be aware of which instruction was completed, which inst…
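The title names an auxiliary progress-estimation signal; a hedged sketch of how such an auxiliary term could sit alongside the action loss is below (the combination weight and function names are illustrative, not the paper's exact objective):

```python
# Sketch: action prediction loss plus an auxiliary regression of how far along
# the instruction the agent is (normalized progress in [0, 1]).
import torch.nn.functional as F

def navigation_loss(action_logits, action_targets, progress_pred, progress_targets, lam=0.5):
    action_loss = F.cross_entropy(action_logits, action_targets)
    progress_loss = F.mse_loss(progress_pred, progress_targets)
    return action_loss + lam * progress_loss
```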
Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition
In an open-world setting, it is inevitable that an intelligent agent (e.g., a robot) will encounter visual objects, attributes or relationships it does not recognize. In this work, we develop an agent empowered with visual curiosity, i.e. …
Graph R-CNN for Scene Graph Generation
We propose a novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images. Our model contains a Relation Proposal Network (RePN) that efficiently deals with …
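The abstract names a Relation Proposal Network (RePN) for pruning the quadratic space of object pairs. A hedged sketch of pairwise relatedness scoring followed by top-K selection is below; the two-MLP scoring form and sizes are illustrative:

```python
# Sketch of relation proposal scoring: estimate relatedness for every object
# pair from per-object features and keep the top-K pairs as candidate edges.
import torch
import torch.nn as nn

class RelationProposal(nn.Module):
    def __init__(self, dim=256, hidden=256):
        super().__init__()
        self.subj = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.obj = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self, obj_feats, top_k=64):          # obj_feats: (N, dim)
        n = obj_feats.shape[0]
        scores = torch.sigmoid(self.subj(obj_feats) @ self.obj(obj_feats).T)  # (N, N) relatedness
        scores = scores * (1 - torch.eye(n, device=scores.device))            # ignore self-pairs
        flat_idx = scores.flatten().topk(min(top_k, n * n)).indices
        subjects = torch.div(flat_idx, n, rounding_mode="floor")
        objects = flat_idx % n
        return torch.stack((subjects, objects), dim=1), scores  # candidate (subject, object) pairs
```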
Neural Baby Talk
We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image. Our approach reconciles classical slot filling approaches (that are generally be…
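The slot-filling idea from the abstract (captions grounded in detector outputs) can be illustrated with a toy template filler; the slot-marker format and helper name below are purely hypothetical:

```python
# Toy sketch: a generated caption template contains visual-word slots that are
# filled with category names produced by an object detector.
def fill_caption_template(template_tokens, detections):
    """template_tokens: list of words or '<slot:i>' markers; detections: list of category names."""
    caption = []
    for tok in template_tokens:
        if tok.startswith("<slot:"):
            idx = int(tok[len("<slot:"):-1])
            caption.append(detections[idx])     # ground the slot in a detected object
        else:
            caption.append(tok)
    return " ".join(caption)

# usage sketch:
# fill_caption_template(["a", "<slot:0>", "sitting", "on", "a", "<slot:1>"], ["cat", "couch"])
# -> "a cat sitting on a couch"
```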
Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model
We present a novel training framework for neural sequence models, particularly for grounded dialog generation. The standard training paradigm for these models is maximum likelihood estimation (MLE), or minimizing the cross-entropy of the h…
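The baseline the abstract describes is maximum likelihood estimation over the human response; as a quick reference, that is the familiar token-level cross-entropy (names below are illustrative):

```python
# Standard sequence MLE: cross-entropy of the ground-truth response tokens under
# the model's next-token distribution, ignoring padding positions.
import torch.nn.functional as F

def mle_loss(logits, target_ids, pad_id=0):
    """logits: (B, T, V) next-token scores; target_ids: (B, T) ground-truth response tokens."""
    return F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),
        target_ids.reshape(-1),
        ignore_index=pad_id,
    )
```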