Dan Busbridge
Scaling Laws for Optimal Data Mixtures
Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on…
How PARTs assemble into wholes: Learning the relative composition of images
The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-superv…
Distillation Scaling Laws
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enab…
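As background to the teacher–student setup this abstract refers to, a minimal sketch of the standard knowledge-distillation objective is shown below; it is not the paper's scaling law, and the temperature and mixing weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard knowledge-distillation objective (background only, not the paper's law).

    Blends cross-entropy on the labels with a temperature-softened KL term that
    pulls the student's distribution towards the teacher's.
    """
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the soft-target term stays comparable across temperatures
    return (1 - alpha) * ce + alpha * kl
```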
Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection
A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i)…
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the…
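The abstract contrasts stored parameters with per-token compute; the sketch below illustrates how the two diverge in a simple top-k MoE feed-forward block. The layer sizes and expert counts are illustrative assumptions, not the paper's configurations.

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k):
    """Total vs per-token-active parameters for a simple top-k MoE FFN block.

    Each expert is a two-matrix FFN (d_model -> d_ff -> d_model); a router picks
    top_k experts per token, so only that fraction of expert weights is used
    per forward token.
    """
    per_expert = 2 * d_model * d_ff      # up- and down-projection matrices
    total = n_experts * per_expert       # parameters stored in the layer
    active = top_k * per_expert          # parameters touched per token
    return total, active

# Illustrative numbers only:
total, active = moe_ffn_params(d_model=2048, d_ff=8192, n_experts=64, top_k=2)
print(f"total={total/1e9:.2f}B, active per token={active/1e9:.3f}B")
```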
Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between ke…
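To make the softmax-versus-sigmoid contrast concrete, here is a minimal sketch of both weightings; the sequence-length-dependent bias in the sigmoid variant is a hedged assumption loosely following the paper's discussion, and the full recipe of the recommended variant is not reproduced.

```python
import math
import torch

def softmax_attention(q, k, v):
    """Standard attention: each query's weights over the keys sum to one."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

def sigmoid_attention(q, k, v):
    """Sigmoid variant: each weight is squashed independently, no row normalisation.

    The -log(n) bias (n = number of keys) is an assumption intended to keep the
    initial output scale comparable to softmax attention.
    """
    n = k.shape[-2]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.sigmoid(scores - math.log(n)) @ v
```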
Poly-View Contrastive Learning
Contrastive learning typically matches pairs of related views among a number of unrelated negative views. Views can be generated (e.g. by augmentations) or be observed. We investigate matching when there are more than two related views whi…
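The two-view pairing described here is usually implemented with an InfoNCE objective; a minimal sketch of that baseline follows (the paper's poly-view generalisation to more than two related views is not reproduced).

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Two-view InfoNCE: each embedding in z1 must pick out its partner in z2
    from among the other (unrelated) samples in the batch."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, targets)      # positive pairs lie on the diagonal
```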
Bootstrap Your Own Variance
Understanding model uncertainty is important for many applications. We propose Bootstrap Your Own Variance (BYOV), combining Bootstrap Your Own Latent (BYOL), a negative-free Self-Supervised Learning (SSL) algorithm, with Bayes by Backprop…
REALM: Robust Entropy Adaptive Loss Minimization for Improved Single-Sample Test-Time Adaptation
Fully-test-time adaptation (F-TTA) can mitigate performance loss due to distribution shifts between train and test data (1) without access to the training data, and (2) without knowledge of the model training procedure. In online F-TTA, a …
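For orientation, below is a minimal sketch of the plain entropy-minimisation step that single-sample F-TTA methods build on; REALM's robust reweighting of this loss is not shown, and the optimiser setup is assumed.

```python
import torch

def entropy_adaptation_step(model, x, optimizer):
    """One online test-time adaptation step on a single incoming sample.

    Plain entropy minimisation (the baseline this line of work improves on):
    adapt the model so its prediction on the sample becomes more confident.
    """
    probs = torch.softmax(model(x), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return probs.detach()
```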
How to Scale Your EMA
Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in …
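Since the abstract is truncated before the rule itself, the sketch below shows the momentum-exponentiation rule for an exponential moving average under a batch-size scaling of kappa, treated here as an assumption about the paper's prescription.

```python
def scale_ema_momentum(rho, kappa):
    """Scale an EMA momentum when the batch size is scaled by kappa.

    With kappa times fewer optimisation steps per epoch, exponentiating the
    momentum keeps the EMA's effective averaging horizon (in epochs) roughly
    fixed: rho_hat = rho ** kappa (assumed rule).
    """
    return rho ** kappa

# Example: base momentum 0.999 at batch size 512, moving to batch size 4096 (kappa = 8).
rho_hat = scale_ema_momentum(0.999, kappa=8)   # ~0.992
```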
The Role of Entropy and Reconstruction in Multi-View Self-Supervised Learning
The mechanisms behind the success of multi-view self-supervised learning (MVSSL) are not yet fully understood. Contrastive MVSSL methods have been studied through the lens of InfoNCE, a lower bound of the Mutual Information (MI). However, …
DUET: 2D Structured and Approximately Equivariant Representations
Multiview Self-Supervised Learning (MSSL) is based on learning invariances with respect to a set of input transformations. However, invariance partially or totally removes transformation-related information from the representations, which …
Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attent…
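The quantity being tracked is straightforward to compute from the attention probability matrix; a minimal sketch follows, with the tensor layout assumed.

```python
import torch

def attention_entropy(attn_probs, eps=1e-12):
    """Mean entropy of the attention distribution, per head.

    attn_probs: (batch, heads, queries, keys), rows already softmax-normalised.
    Low values mean each query concentrates on very few keys (entropy collapse).
    """
    entropy = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # (batch, heads, queries)
    return entropy.mean(dim=(0, 2))                                  # one value per head
```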
Elastic Weight Consolidation Improves the Robustness of Self-Supervised Learning Methods under Transfer
Self-supervised representation learning (SSL) methods provide an effective label-free initial condition for fine-tuning downstream tasks. However, in numerous realistic scenarios, the downstream task might be biased with respect to the tar…
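For reference, a minimal sketch of the standard elastic weight consolidation penalty named in the title, applied during fine-tuning; the dictionary layout and regularisation strength are placeholders.

```python
import torch

def ewc_penalty(model, ref_params, fisher, lam=1.0):
    """Standard EWC regulariser: quadratically anchor parameters to their
    pretrained values, weighted by a diagonal Fisher information estimate.

    ref_params / fisher: dicts keyed by parameter name, matching the model's
    parameter shapes; lam controls how strongly the pretrained solution is kept.
    """
    loss = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - ref_params[name]) ** 2).sum()
    return lam * loss
```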
Position Prediction as an Effective Pretraining Strategy
Transformers have gained increasing popularity in a wide range of applications, including Natural Language Processing (NLP), Computer Vision and Speech Recognition, because of their powerful representational capacity. However, harnessing t…
The Impact of Explanations on Layperson Trust in Artificial Intelligence–Driven Symptom Checker Apps: Experimental Study
Background: Artificial intelligence (AI)–driven symptom checkers are available to millions of users globally and are advocated as a tool to deliver health care more efficiently. To achieve the promoted benefits of a symptom checker, laypeop…
Do Self-Supervised and Supervised Methods Learn Similar Visual Representations?
Despite the success of a number of recent techniques for visual self-supervised deep learning, there has been limited investigation into the representations that are ultimately learned. By leveraging recent advances in the comparison of ne…
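One widely used representation-comparison measure in this line of work is linear centered kernel alignment; the truncated abstract does not confirm the exact tools used, so the sketch below is illustrative rather than a statement of the paper's method.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two sets of activations.

    X: (n_samples, d1), Y: (n_samples, d2) features for the same inputs.
    Returns a value in [0, 1]; 1 means the representations match up to an
    orthogonal transform and isotropic scaling.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```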
Stochastic Contrastive Learning
While state-of-the-art contrastive Self-Supervised Learning (SSL) models produce results competitive with their supervised counterparts, they lack the ability to infer latent variables. In contrast, prescribed latent variable (LV) models e…
Evaluating the fairness of fine-tuning strategies in self-supervised learning
In this work we examine how fine-tuning impacts the fairness of contrastive Self-Supervised Learning (SSL) models. Our findings indicate that Batch Normalization (BN) statistics play a crucial role, and that updating only the BN statistics…
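A minimal PyTorch sketch of the "update only the BN statistics" strategy mentioned here: running statistics are refreshed on downstream data while every learnable weight stays frozen. The data loader and device handling are assumed.

```python
import torch

@torch.no_grad()
def refresh_bn_statistics(model, loader, device="cpu"):
    """Update only BatchNorm running statistics on downstream data.

    All learnable parameters stay frozen; putting the model in train() mode makes
    BN layers re-estimate their running mean/variance from the new distribution.
    """
    for p in model.parameters():
        p.requires_grad_(False)
    model.train()
    for x, _ in loader:
        model(x.to(device))   # the forward pass alone updates BN buffers
    model.eval()
```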
Neural Temporal Point Processes For Modelling Electronic Health Records
The modelling of Electronic Health Records (EHRs) has the potential to drive more efficient allocation of healthcare resources, enabling early intervention strategies and advancing personalised healthcare. However, EHRs are challenging to …
Learning medical triage from clinicians using Deep Q-Learning
Medical Triage is of paramount importance to healthcare systems, allowing for the correct orientation of patients and allocation of the necessary resources to treat them adequately. While reliable decision-tree methods exist to triage pati…
Correlations between Word Vector Sets
Similarity measures based purely on word embeddings are comfortably competing with much more sophisticated deep learning and expert-engineered systems on unsupervised semantic textual similarity (STS) tasks. In contrast to commonly used ge…
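As a concrete instance of the simple embedding-based measures this abstract refers to, here is the mean-vector cosine baseline; the paper's correlation-based measures are not reproduced, and the word vectors are assumed to be given.

```python
import numpy as np

def mean_vector_similarity(sent1_vecs, sent2_vecs):
    """Baseline STS score: cosine similarity between averaged word vectors.

    sent*_vecs: (n_words, dim) arrays of word embeddings, one per sentence.
    """
    a = np.asarray(sent1_vecs).mean(axis=0)
    b = np.asarray(sent2_vecs).mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```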
Neural Language Priors
The choice of sentence encoder architecture reflects assumptions about how a sentence's meaning is composed from its constituent words. We examine the contribution of these architectures by holding them randomly initialised and fixed, effe…
Relational Graph Attention Networks
We investigate Relational Graph Attention Networks, a class of models that extends non-relational graph attention mechanisms to incorporate relational information, opening up these methods to a wider variety of problems. A thorough evaluat…
Correlations between Word Vector Sets (EMNLP-IJCNLP 2019)
Vitalii Zhelezniak, April Shen, Daniel Busbridge, Aleksandar Savkov, Nils Hammerla. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Proce…
Decoding Decoders: Finding Optimal Representation Spaces for Unsupervised Similarity Tasks
Experimental evidence indicates that simple models outperform complex deep networks on many unsupervised similarity tasks. We provide a simple yet rigorous explanation for this behaviour by introducing the concept of an optimal representat…
Supersymmetric model building with Dirac gauginos
With the Large Hadron Collider about to start its second run, we are in an era of high-energy collider physics. The discovery of a Standard Model-like Higgs boson with a mass of 125 GeV is a fantastic achievement, but the non-observatio…