Yushi Hu
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training re…
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation
Despite the unprecedented progress in the field of 3D generation, current systems still often fail to produce high-quality 3D assets that are visually appealing and geometrically and semantically consistent across multiple viewpoints. To e…
Acoustic Span Embeddings For Multilingual Query-By-Example Search
Query-by-example (QbE) speech search is the task of matching spoken queries to utterances within a search collection. In low- or zero-resource settings, QbE search is often addressed with approaches based on dynamic time warping (DTW). Rec…
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such act…
BLINK: Multimodal Large Language Models Can See but Not Perceive
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative dep…
Training Language Models to Generate Text with Citations via Fine-grained Rewards
While recent Large Language Models (LLMs) have proven useful in answering user queries, they are prone to hallucination, and their responses often lack credibility due to missing references to reliable sources. An intuitive solution to the…
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by deco…
DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback
Despite their widespread success, Text-to-Image models (T2I) still struggle to produce images that are both aesthetically pleasing and faithful to the user's input text. We introduce DreamSync, a model-agnostic training algorithm by desig…
Two Watts is All You Need: Enabling In-Detector Real-Time Machine Learning for Neutrino Telescopes Via Edge Computing
The use of machine learning techniques has significantly increased the physics discovery potential of neutrino telescopes. In the upcoming years, we are expecting upgrades of currently existing detectors and new telescopes with novel experi…
Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation
Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundational models to automatically gene…
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
QA-Feedback, the dataset used in the paper.
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
Language models (LMs) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs. Reinforcement learning from human feedback (RLHF) - where human preference judgments on LM outputs are tra…
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
Despite thousands of researchers, engineers, and artists actively working on improving text-to-image generation models, systems often fail to produce images that accurately align with the text inputs. We introduce TIFA (Text-to-Image Faith…
One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, Tao Yu. Findings of the Association for Computational Linguistics: ACL 2023. 2023.
We introduce INSTRUCTOR, a new method for computing text embeddings given task instructions: every text input is embedded together with instructions explaining the use case (e.g., task and domain descriptions). Unlike encoders from prior w…
PromptCap: Prompt-Guided Task-Aware Image Captioning
Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to yield the correct answer. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their st…
Binding Language Models in Symbolic Languages
Though end-to-end neural approaches have recently been dominating NLP tasks in both performance and ease-of-use, they lack interpretability and robustness. We propose Binder, a training-free neural-symbolic framework that maps the task inp…
Unsupervised Learning of Hierarchical Conversation Structure
Human conversations can evolve in many different ways, creating challenges for automatic understanding and summarization. Goal-oriented conversations often have meaningful sub-dialogue structure, but it can be highly domain-dependent. This…
In-Context Learning for Few-Shot Dialogue State Tracking
Collecting and annotating task-oriented dialogues is time-consuming and costly; thus, zero- and few-shot learning could greatly benefit dialogue state tracking (DST). In this work, we propose an in-context learning (ICL) framework for zero-…
A Collection of Creature Restoration Inaccuracies in the Jurassic Park Franchise and Their Implications
The Jurassic Park franchise is one of the highest-grossing media franchises of all time. Here, I show that many representations of dinosaurs and other palaeo-organisms in the franchise are inaccurate, being vastly different from their real…
Freestanding Ferroelectric Bubble Domains
Bubble‐like domains, typically a precursor to the electrical skyrmions, arise in ultrathin complex oxide ferroelectric–dielectric–ferroelectric heterostructures epitaxially clamped with flat substrates. Here, it is reported that these spec…
Multilingual Jointly Trained Acoustic and Written Word Embeddings
Acoustic word embeddings (AWEs) are vector representations of spoken word segments. AWEs can be learned jointly with embeddings of character sequences, to generate phonetically meaningful embeddings of written words, or acoustically ground…
Adaptive multiple-master strategy for power management of distributed generators in an islanded distribution subsystem
Power management related to the application of power system islanding is investigated. In an islanded power system, without the utility grid to regulate the island voltage and frequency, distributed generators (DGs) must switch their cont…
A simple setup to measure muon lifetime and electron energy spectrum of muon decay and its Monte Carlo simulation
We designed a simple setup to measure the muon lifetime and the electron energy spectra of muon decay. A low-cost coincidence circuit was designed to select the signals of muon decay events detected by a plastic scintillator detector. It …