Stephen Pfohl
Opening the ‘black box’ of the silent phase evaluation for artificial intelligence: a scoping review and critical analysis
‘Silent’ evaluation refers to the prospective, non-interventional testing of artificial intelligence (AI) model performance in the intended clinical setting without affecting patient care or institutional operations. The silent evaluation …
Reflections from Research Roundtables at the Conference on Health, Inference, and Learning (CHIL) 2025
The 6th Annual Conference on Health, Inference, and Learning (CHIL 2025), hosted by the Association for Health Learning and Inference (AHLI), was held in person on June 25-27, 2025, at the University of California, Berkeley, in Berkeley, C…
Nteasee: Understanding Needs in AI for Health in Africa - A Mixed-Methods Study of Expert and General Population Perspectives
Decoding Safety Feedback from Diverse Raters: A Data-driven Lens on Responsiveness to Severity
Ensuring the safety of Generative AI requires a nuanced understanding of pluralistic viewpoints. In this paper, we introduce a novel data-driven approach for interpreting granular ratings in pluralistic datasets. Specifically, we address t…
Toward expert-level medical question answering with large language models
Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a 'passing' score in United States Medical Licensing Examination style questions. However, challenges remain in long-for…
Moving Beyond the Benchmarks: Five Foundational Principles for Meaningful AI Evaluation in Healthcare
Tackling algorithmic bias and promoting transparency in health datasets: the STANDING Together consensus recommendations
A toolbox for surfacing health equity harms and biases in large language models
Nteasee: Understanding Needs in AI for Health in Africa -- A Mixed-Methods Study of Expert and General Population Perspectives
Artificial Intelligence (AI) for health has the potential to significantly change and improve healthcare. However, in most African countries, identifying culturally and contextually attuned approaches for deploying these solutions is not we…
A Causal Perspective on Label Bias
Predictive models developed with machine learning techniques are commonly used to inform decision making and resource allocation in high-stakes contexts, such as healthcare and public health. One means through which this practice may propa…
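As a toy illustration of the label-bias problem sketched above (not the paper's causal analysis), the following Python snippet simulates an observed label that under-records the true outcome at a group-dependent rate, and shows that a model fit to that label underestimates risk for the affected group. The variable names, recording rates, and logistic-regression choice are all hypothetical.

    # Hypothetical sketch of label bias: the observed label under-records the
    # true outcome at group-dependent rates, so a model fit to the observed
    # label systematically underestimates risk for the under-recorded group.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 50_000
    group = rng.integers(0, 2, size=n)            # group membership, 0 or 1
    x = rng.normal(size=(n, 3))                   # covariates
    true_risk = 1 / (1 + np.exp(-(x[:, 0] + 0.5 * x[:, 1])))
    y_true = rng.binomial(1, true_risk)           # true outcome

    # Group 1's positives are recorded only 60% of the time (differential
    # measurement error); group 0's are always recorded. Rates are made up.
    record_prob = np.where(group == 1, 0.6, 1.0)
    y_obs = y_true * rng.binomial(1, record_prob)

    model = LogisticRegression().fit(np.column_stack([x, group]), y_obs)
    pred = model.predict_proba(np.column_stack([x, group]))[:, 1]

    for g in (0, 1):
        mask = group == g
        print(f"group {g}: true outcome rate {y_true[mask].mean():.3f}, "
              f"mean predicted risk {pred[mask].mean():.3f}")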
A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step towar…
Health equity assessment of machine learning performance (HEAL): a framework and dermatology AI model case study
Proxy Methods for Domain Adaptation
We study the problem of domain adaptation under distribution shift, where the shift is due to a change in the distribution of an unobserved, latent variable that confounds both the covariates and the labels. In this setting, neither the co…
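To make the setting concrete, here is a minimal synthetic data-generating process, assuming a binary latent variable U whose marginal distribution shifts between source and target while the mechanisms generating X and Y from U stay fixed. It illustrates the problem setup only, not the paper's proxy-based method, and all numbers are arbitrary.

    # Illustrative latent-shift setting: a hidden U confounds both X and Y,
    # and only the marginal distribution of U changes across domains, so
    # neither the covariate-shift nor the label-shift assumption holds.
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_domain(p_u, n=100_000):
        u = rng.binomial(1, p_u, size=n)                 # latent subgroup
        x = rng.normal(loc=2.0 * u, scale=1.0, size=n)   # covariate depends on U
        y = rng.binomial(1, np.where(u == 1, 0.8, 0.2))  # label depends on U
        return u, x, y

    u_s, x_s, y_s = sample_domain(p_u=0.2)   # source: U=1 is rare
    u_t, x_t, y_t = sample_domain(p_u=0.7)   # target: U=1 is common

    # P(Y=1 | X near 1) differs across domains because the prior over U shifts.
    bin_s = (x_s > 0.5) & (x_s < 1.5)
    bin_t = (x_t > 0.5) & (x_t < 1.5)
    print("source P(Y=1 | X near 1):", round(y_s[bin_s].mean(), 3))
    print("target P(Y=1 | X near 1):", round(y_t[bin_t].mean(), 3))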
The Case for Globalizing Fairness: A Mixed Methods Study on Colonialism, AI, and Health in Africa
With growing application of machine learning (ML) technologies in healthcare, there have been calls for developing techniques to understand and mitigate biases these systems may exhibit. Fairness considerations in the development of ML-ba…
An intentional approach to managing bias in general purpose embedding models
Advances in machine learning for health care have raised concerns in the research community about bias, specifically the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that me…
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phen…
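The snippet below is a minimal, hypothetical sketch of reward-model ensembling for best-of-n selection. The "reward models" are toy stand-in functions rather than the paper's trained models; the point is only the aggregation step, contrasting a mean over ensemble members with a more conservative minimum.

    # Toy best-of-n selection with an ensemble of stand-in reward models.
    from typing import Callable, List

    RewardModel = Callable[[str], float]

    def ensemble_score(response: str, models: List[RewardModel],
                       how: str = "mean") -> float:
        scores = [m(response) for m in models]
        if how == "mean":
            return sum(scores) / len(scores)
        if how == "min":      # conservative: trust the most pessimistic member
            return min(scores)
        raise ValueError(how)

    def best_of_n(candidates: List[str], models: List[RewardModel],
                  how: str = "mean") -> str:
        return max(candidates, key=lambda r: ensemble_score(r, models, how))

    # Each toy member prefers longer responses but penalizes a different
    # surface pattern it has learned to distrust.
    toy_models = [
        lambda r: len(r) - 10.0 * r.count("!"),
        lambda r: len(r) - 10.0 * r.lower().count("definitely"),
        lambda r: float(len(r)),
    ]
    candidates = ["A short, careful answer.",
                  "Definitely the best answer!!!" * 3]
    print(best_of_n(candidates, toy_models, how="mean"))
    print(best_of_n(candidates, toy_models, how="min"))

With these toy members, mean aggregation still selects the exploitative candidate while the minimum does not, loosely mirroring the mitigate-but-not-eliminate framing in the title.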
The value of standards for health datasets in artificial intelligence-based applications
Self-supervised machine learning using adult inpatient data produces effective models for pediatric clinical prediction tasks
Objective: Development of electronic health records (EHR)-based machine learning models for pediatric inpatients is challenged by limited training data. Self-supervised learning using adult data may be a promising approach to creating robus…
Publisher Correction: Large language models encode clinical knowledge
Large language models encode clinical knowledge
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks.…
Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2022 Symposium
The second Machine Learning for Health (ML4H) symposium was held both virtually and in-person on November 28, 2022, in New Orleans, Louisiana, USA (Parziale et al., 2022). The symposium included research roundtable sessions to foster discu…
Towards Expert-Level Medical Question Answering with Large Language Models
Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicia…
EHR foundation models improve robustness in the presence of temporal distribution shift
Temporal distribution shift negatively impacts the performance of clinical prediction models over time. Pretraining foundation models using self-supervised learning on electronic health records (EHR) may be effective in acquiring informati…
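The sketch below illustrates the evaluation pattern implied here: train on earlier years and measure discrimination on each later year as the feature-outcome relationship drifts. It uses synthetic data and a plain logistic regression; the columns, drift mechanism, and model are placeholders, and the paper's self-supervised EHR pretraining is not reproduced.

    # Synthetic temporal-shift evaluation: fit on 2012-2015, track AUROC by year.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    years = np.arange(2012, 2022)
    frames = []
    for year in years:
        n = 2_000
        x = rng.normal(size=(n, 5))
        drift = 0.1 * (year - 2012)            # feature-outcome relationship drifts
        logits = (1.0 - drift) * x[:, 0] + drift * x[:, 1]
        y = rng.binomial(1, 1 / (1 + np.exp(-logits)))
        df = pd.DataFrame(x, columns=[f"feat_{i}" for i in range(5)])
        df["year"], df["label"] = year, y
        frames.append(df)
    data = pd.concat(frames, ignore_index=True)

    features = [c for c in data.columns if c.startswith("feat_")]
    train = data[data.year <= 2015]
    model = LogisticRegression().fit(train[features], train.label)

    for year in years[years > 2015]:
        test = data[data.year == year]
        auc = roc_auc_score(test.label, model.predict_proba(test[features])[:, 1])
        print(f"{year}: AUROC = {auc:.3f}")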
Large Language Models Encode Clinical Knowledge
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledg…
Adapting to Latent Subgroup Shifts via Concepts and Proxies
We address the problem of unsupervised domain adaptation when the source domain differs from the target domain because of a shift in the distribution of a latent subgroup. When this subgroup confounds all observed data, neither covariate s…
Tackling bias in AI health datasets through the STANDING Together initiative
Considerations in the reliability and fairness audits of predictive models for advance care planning
Multiple reporting guidelines for artificial intelligence (AI) models in healthcare recommend that models be audited for reliability and fairness. However, there is a gap in operational guidance for performing reliability and fairness audi…
Considerations in the Reliability and Fairness Audits of Predictive Models for Advance Care Planning
Multiple reporting guidelines for artificial intelligence (AI) models in healthcare recommend that models be audited for reliability and fairness. However, there is a gap in operational guidance for performing reliability and fairness audi…
EHR Foundation Models Improve Robustness in the Presence of Temporal Distribution Shift
Background: Temporal distribution shift negatively impacts the performance of clinical prediction models over time. Pretraining foundation models using self-supervised learning on electronic health records (EHR) may be effective in acquirin…
Evaluating algorithmic fairness in the presence of clinical guidelines: the case of atherosclerotic cardiovascular disease risk estimation
Objectives: The American College of Cardiology and the American Heart Association guidelines on primary prevention of atherosclerotic cardiovascular disease (ASCVD) recommend using 10-year ASCVD risk estimation models to initiate statin tre…
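As a hedged illustration of a guideline-aware fairness check, the snippet below stratifies sensitivity, specificity, and a crude calibration ratio by group at a 7.5% predicted-risk treatment threshold. The data are synthetic placeholders and the threshold is used purely for illustration, not as a statement of the paper's analysis.

    # Group-stratified evaluation of a risk model at a clinical decision
    # threshold, on synthetic data.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 20_000
    group = rng.choice(["A", "B"], size=n)
    true_risk = np.clip(rng.beta(2, 20, size=n) + (group == "B") * 0.02, 0, 1)
    outcome = rng.binomial(1, true_risk)
    predicted = np.clip(true_risk + rng.normal(0, 0.02, size=n), 0, 1)

    threshold = 0.075            # illustrative treatment-consideration cut-off
    for g in ("A", "B"):
        m = group == g
        treat = predicted[m] >= threshold
        tp = (treat & (outcome[m] == 1)).sum()
        fn = (~treat & (outcome[m] == 1)).sum()
        fp = (treat & (outcome[m] == 0)).sum()
        tn = (~treat & (outcome[m] == 0)).sum()
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        calib = predicted[m].mean() / outcome[m].mean()   # mean predicted / observed
        print(f"group {g}: sensitivity={sens:.3f} specificity={spec:.3f} "
              f"calibration ratio={calib:.3f}")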