William Brannon
Bridging the Data Provenance Gap Across Text, Speech and Video
Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and f…
On the Relationship between Truth and Political Bias in Language Models
Language model alignment research often attempts to ensure that models are not only helpful and harmless, but also truthful and unbiased. However, optimizing these objectives simultaneously can obscure how improving one aspect might impact…
Consent in Crisis: The Rapid Decline of the AI Data Commons
General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the …
Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?
New capabilities in foundation models are owed in large part to massive, widely-sourced, and under-documented training data collections. Existing practices in data collection have led to challenges in tracing authenticity, verifying consen…
Data Authenticity, Consent, and Provenance for AI Are All Broken: What Will It Take to Fix Them?
New AI capabilities are owed in large part to massive, widely sourced, and underdocumented training data collections. Dubious collection practices have spurred crises in data transparency, authenticity, consent, privacy, representation, bi…
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and unders…
ConGraT: Self-Supervised Contrastive Pretraining for Joint Graph and Text Embeddings
Learning on text-attributed graphs (TAGs), in which nodes are associated with one or more texts, has been the subject of much recent work. However, most approaches tend to make strong assumptions about the downstream task of interest, are …
Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing
We investigate how humans perform the task of dubbing video content from one language into another, leveraging a novel corpus of 319.57 hours of video from 54 professionally produced titles. This is the first such large-scale study we are …