Jiasen Lu
CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching
Conditional generative modeling aims to learn a conditional data distribution from samples containing data-condition pairs. For this, diffusion and flow-based methods have attained compelling results. These methods use a learned (flow) mod…
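The truncated abstract points at the standard conditional flow-matching setup that the paper builds on. As a rough illustration only (not the CAR-Flow reparameterization itself; `velocity_net` and its signature are hypothetical), a minimal conditional flow-matching loss with a straight-line path looks like:

```python
# Minimal sketch of a plain conditional flow-matching loss (linear interpolation
# path). Illustrates the baseline objective, not CAR-Flow's reparameterization.
import torch

def conditional_flow_matching_loss(velocity_net, x1, cond):
    """x1: batch of target samples; cond: batch of conditioning inputs (hypothetical)."""
    x0 = torch.randn_like(x1)                                   # source sample from a standard Gaussian
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                                # point on the straight-line path
    target_velocity = x1 - x0                                   # velocity of that path
    predicted_velocity = velocity_net(xt, t, cond)              # model's conditional velocity prediction
    return ((predicted_velocity - target_velocity) ** 2).mean() # regress onto the target velocity field
```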
UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervi…
One Diffusion to Generate Them All
We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, lay…
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
Modern language models can process inputs across diverse languages and modalities. We hypothesize that models acquire this capability through learning a shared representation space across heterogeneous data types (e.g., different languages…
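One simple way to probe such a shared space is to compare intermediate hidden states for parallel inputs. The sketch below is hypothetical (the paper's exact probing protocol is not given in the truncated abstract, and `states_lang_a`/`states_lang_b` stand in for per-layer states collected from any multilingual model):

```python
# Hypothetical probe: cosine similarity of mean-pooled hidden states, per layer,
# for two parallel sentences in different languages. High similarity at middle
# layers would be consistent with a shared "hub" representation.
import torch.nn.functional as F

def layerwise_similarity(states_lang_a, states_lang_b):
    """Each input: list of (seq_len, d) tensors, one per layer, for the same meaning."""
    sims = []
    for h_a, h_b in zip(states_lang_a, states_lang_b):
        sims.append(F.cosine_similarity(h_a.mean(dim=0), h_b.mean(dim=0), dim=0).item())
    return sims  # one similarity score per layer
```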
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open o…
SoupLM: Model Integration in Large Language and Multi-Modal Models
Training large language models (LLMs) and multimodal LLMs necessitates significant computing resources, and existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks. For insta…
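The abstract is cut off before it describes the integration recipe; assuming the "soup" refers to weight-space interpolation of checkpoints with a shared architecture (an assumption on my part, not confirmed by the excerpt), a minimal sketch would be:

```python
# Hypothetical sketch of weight-space interpolation between two fine-tuned
# checkpoints; assumes "soup" means parameter averaging over compatible state dicts
# (floating-point parameters with identical keys and shapes).
def interpolate_state_dicts(sd_a, sd_b, alpha=0.5):
    """Linearly interpolate two state dicts: (1 - alpha) * A + alpha * B."""
    return {k: (1.0 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

# usage sketch:
# merged = interpolate_state_dicts(model_a.state_dict(), model_b.state_dict(), alpha=0.3)
# model_a.load_state_dict(merged)
```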
Preserving Identity with Variational Score for General-purpose 3D Editing
We present Piva (Preserving Identity with Variational Score Distillation), a novel optimization-based method for editing images and 3D models based on diffusion models. Specifically, our approach is inspired by the recently proposed method…
Hierarchical Question-Image Co-Attention for Visual Question Answering
A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling where l…
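A simplified sketch of the co-attention idea: a single word-region affinity matrix drives attention over both the question and the image. Dimensions and the max-pooling choice here are illustrative, not the paper's full hierarchical formulation:

```python
# Simplified parallel co-attention between question words and image regions.
import torch
import torch.nn.functional as F

def parallel_co_attention(Q, V, W_b):
    """Q: (T, d) word features, V: (N, d) region features, W_b: (d, d) learned bilinear map."""
    C = torch.tanh(Q @ W_b @ V.T)                          # (T, N) word-region affinity
    region_attn = F.softmax(C.max(dim=0).values, dim=0)    # (N,) attention over image regions
    word_attn = F.softmax(C.max(dim=1).values, dim=0)      # (T,) attention over question words
    attended_image = region_attn @ V                       # (d,) attended visual feature
    attended_question = word_attn @ Q                      # (d,) attended question feature
    return attended_question, attended_image
```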
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action…
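The core idea named in the abstract is mapping every modality into one discrete token sequence. The sketch below is purely illustrative: the vocabulary sizes, offsets, and packing helper are hypothetical, not the paper's actual layout:

```python
# Hypothetical packing of text tokens and discretized image codes into a single
# sequence over a shared vocabulary (the unifying trick behind "tokenize
# inputs and outputs"); audio and action tokens would get further offsets.
TEXT_VOCAB_SIZE = 32000            # text subword ids occupy [0, 32000)
IMAGE_CODEBOOK_SIZE = 8192         # discrete image codes (e.g., from a VQ-style encoder)
IMAGE_OFFSET = TEXT_VOCAB_SIZE     # image codes are shifted past the text range

def pack_example(text_ids, image_codes):
    """Concatenate text tokens and offset image tokens into one token sequence."""
    image_ids = [IMAGE_OFFSET + c for c in image_codes]
    return text_ids + image_ids

# usage sketch: pack_example([5, 17, 93], [4000, 12, 4095])
```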
Multi-Modal Answer Validation for Knowledge-Based VQA
The problem of knowledge-based visual question answering involves answering questions that require external knowledge in addition to the content of the image. Such knowledge typically comes in various forms, including visual, textual, and …
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region…
ASC me to Do Anything: Multi-task Training for Embodied AI
Embodied AI has seen steady progress across a diverse set of independent tasks. While these varied tasks have different end goals, the basic skills required to complete them successfully overlap significantly. In this paper, our goal is to…
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, …
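The abstract mentions a new training objective that learns from multiple modalities; a generic contrastive (InfoNCE-style) loss gives the flavor of such objectives, though the exact MERLOT Reserve formulation differs and the names below are illustrative:

```python
# Generic contrastive matching: each predicted embedding should score highest
# against its own target snippet among in-batch negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(pred, targets, temperature=0.05):
    """pred, targets: (B, d); row i of pred should match row i of targets."""
    pred = F.normalize(pred, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = pred @ targets.T / temperature                    # (B, B) similarity matrix
    labels = torch.arange(pred.shape[0], device=pred.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```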
A Simple Long-Tailed Recognition Baseline via Vision-Language Model
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems. Existing approaches either perform class re-balancing strategies or directly improve network modules to …
Container: Context Aggregation Network
Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers -- originally introduced in natural language processing -- have been increasingly adopted i…
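Convolution and self-attention can both be read as "affinity matrix times values"; a sketch of that unified context-aggregation view, with a learnable mix of a static (convolution-like) and a dynamic (attention-like) affinity, is below. Shapes and the scalar gate are illustrative, not the paper's exact module:

```python
# Context aggregation sketch: output = affinity @ values, where the affinity is
# a learnable mix of a static (input-independent) and a dynamic (input-dependent)
# matrix.
import torch
import torch.nn as nn

class ContextAggregation(nn.Module):
    def __init__(self, num_tokens, dim):
        super().__init__()
        self.static_affinity = nn.Parameter(torch.randn(num_tokens, num_tokens) * 0.02)
        self.to_qk = nn.Linear(dim, 2 * dim)
        self.mix = nn.Parameter(torch.tensor(0.5))   # learnable static/dynamic mixing weight

    def forward(self, x):                            # x: (B, N, d)
        q, k = self.to_qk(x).chunk(2, dim=-1)
        dynamic = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        static = torch.softmax(self.static_affinity, dim=-1)
        affinity = self.mix * dynamic + (1 - self.mix) * static
        return affinity @ x                          # aggregate context with the mixed affinity
```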
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and…
Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data
Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to new tasks, minimizing expensive …
Spatially Aware Multimodal Transformers for TextVQA
12-in-1: Multi-Task Vision and Language Representation Learning
Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tas…
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing…
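In the two-stream design, each stream attends over the other; a minimal sketch of one such co-attentional layer using standard cross-attention is below (sizes illustrative; residual connections, feed-forward blocks, and layer norm omitted):

```python
# Sketch of a co-attentional layer: the vision stream queries the text stream's
# keys/values and vice versa.
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.vision_attends_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attends_vision = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vision_tokens, text_tokens):
        v_out, _ = self.vision_attends_text(vision_tokens, text_tokens, text_tokens)
        t_out, _ = self.text_attends_vision(text_tokens, vision_tokens, vision_tokens)
        return v_out, t_out
```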
Emergence of Compositional Language with Deep Generational Transmission
Recent work has studied the emergence of language among deep reinforcement learning agents that must collaborate to solve a task. Of particular interest are the factors that cause language to be compositional -- i.e., express meaning by co…
Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic unknown environments. This challenging task demands that the agent be aware of which instruction was completed, which inst…
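The title names an auxiliary progress-estimation signal; a hedged sketch of how such an auxiliary term could sit alongside the action loss is below (the combination weight and function names are illustrative, not the paper's exact objective):

```python
# Sketch: action prediction loss plus an auxiliary regression of how far along
# the instruction the agent is (normalized progress in [0, 1]).
import torch.nn.functional as F

def navigation_loss(action_logits, action_targets, progress_pred, progress_targets, lam=0.5):
    action_loss = F.cross_entropy(action_logits, action_targets)
    progress_loss = F.mse_loss(progress_pred, progress_targets)
    return action_loss + lam * progress_loss
```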
Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition
In an open-world setting, it is inevitable that an intelligent agent (e.g., a robot) will encounter visual objects, attributes or relationships it does not recognize. In this work, we develop an agent empowered with visual curiosity, i.e. …
Graph R-CNN for Scene Graph Generation
We propose a novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images. Our model contains a Relation Proposal Network (RePN) that efficiently deals with …
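The abstract names a Relation Proposal Network (RePN) for pruning the quadratic space of object pairs. A hedged sketch of pairwise relatedness scoring followed by top-K selection is below; the two-MLP scoring form and sizes are illustrative:

```python
# Sketch of relation proposal scoring: estimate relatedness for every object
# pair from per-object features and keep the top-K pairs as candidate edges.
import torch
import torch.nn as nn

class RelationProposal(nn.Module):
    def __init__(self, dim=256, hidden=256):
        super().__init__()
        self.subj = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.obj = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self, obj_feats, top_k=64):          # obj_feats: (N, dim)
        n = obj_feats.shape[0]
        scores = torch.sigmoid(self.subj(obj_feats) @ self.obj(obj_feats).T)  # (N, N) relatedness
        scores = scores * (1 - torch.eye(n, device=scores.device))            # ignore self-pairs
        flat_idx = scores.flatten().topk(min(top_k, n * n)).indices
        subjects = torch.div(flat_idx, n, rounding_mode="floor")
        objects = flat_idx % n
        return torch.stack((subjects, objects), dim=1), scores  # candidate (subject, object) pairs
```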
Neural Baby Talk
We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image. Our approach reconciles classical slot filling approaches (that are generally be…
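The slot-filling idea from the abstract (captions grounded in detector outputs) can be illustrated with a toy template filler; the slot-marker format and helper name below are purely hypothetical:

```python
# Toy sketch: a generated caption template contains visual-word slots that are
# filled with category names produced by an object detector.
def fill_caption_template(template_tokens, detections):
    """template_tokens: list of words or '<slot:i>' markers; detections: list of category names."""
    caption = []
    for tok in template_tokens:
        if tok.startswith("<slot:"):
            idx = int(tok[len("<slot:"):-1])
            caption.append(detections[idx])     # ground the slot in a detected object
        else:
            caption.append(tok)
    return " ".join(caption)

# usage sketch:
# fill_caption_template(["a", "<slot:0>", "sitting", "on", "a", "<slot:1>"], ["cat", "couch"])
# -> "a cat sitting on a couch"
```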
Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model
We present a novel training framework for neural sequence models, particularly for grounded dialog generation. The standard training paradigm for these models is maximum likelihood estimation (MLE), or minimizing the cross-entropy of the h…
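The baseline the abstract describes is maximum likelihood estimation over the human response; as a quick reference, that is the familiar token-level cross-entropy (names below are illustrative):

```python
# Standard sequence MLE: cross-entropy of the ground-truth response tokens under
# the model's next-token distribution, ignoring padding positions.
import torch.nn.functional as F

def mle_loss(logits, target_ids, pad_id=0):
    """logits: (B, T, V) next-token scores; target_ids: (B, T) ground-truth response tokens."""
    return F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),
        target_ids.reshape(-1),
        ignore_index=pad_id,
    )
```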