Evaluating large language models in biomedical data science challenges through a classroom experiment
· 2025
· Open Access
· DOI: https://doi.org/10.1073/pnas.2521062122
· OpenAlex: W4417251398
Large language models (LLMs) have shown remarkable capabilities in algorithm design, but their effectiveness at solving data science challenges in real-world settings remains poorly understood. We conducted a classroom experiment in which graduate students used LLMs to solve biomedical data science challenges on Kaggle, focusing on tabular data prediction. While their submissions did not top the leaderboards, their prediction scores were often close to those of leading human participants. LLMs frequently recommended gradient boosting methods, which were associated with better performance. Among prompting strategies, self-refinement, in which the LLM improves its own initial solution, was the most effective, a result validated using additional LLMs. While LLMs are capable of handling more complex data science tasks beyond tabular data prediction, their performance on such tasks is substantially worse. These findings demonstrate that LLMs have the potential to design competitive machine learning solutions, even when used by nonexperts.
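To make the gradient-boosting finding concrete, the sketch below shows the kind of baseline the LLMs often recommended for tabular Kaggle tasks, built with scikit-learn. It is a minimal illustration under stated assumptions: the file name `train.csv`, the label column `target`, the hyperparameters, and the AUC metric are placeholders, not details taken from the study.

```python
# Minimal gradient-boosting baseline of the kind the LLMs often
# recommended for tabular prediction tasks. File name, label column,
# hyperparameters, and metric are illustrative placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")        # hypothetical Kaggle training file
X = train.drop(columns=["target"])      # hypothetical label column
y = train["target"]

model = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.05, max_depth=3, random_state=0
)

# Estimate leaderboard-style performance with 5-fold cross-validation.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```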
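The self-refinement strategy the abstract identifies as most effective can be sketched as a short loop: the LLM drafts a solution, critiques its own draft, and then revises it. The code below is an assumed, minimal rendering of that pattern, not the study's protocol; `ask_llm` is a hypothetical wrapper around whatever chat-completion API is available, and the prompts and round count are illustrative.

```python
# Sketch of self-refinement prompting: draft, self-critique, revise.
# `ask_llm` is a hypothetical placeholder for any LLM chat API; the
# prompts and number of rounds are assumptions, not the study's setup.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM API of choice here")

def self_refine(task_description: str, rounds: int = 2) -> str:
    # Initial draft solution.
    solution = ask_llm(
        f"Write Python code to solve this data science task:\n{task_description}"
    )
    for _ in range(rounds):
        # Ask the model to critique its own draft.
        critique = ask_llm(
            "Review the following solution for bugs, data leakage, and "
            f"modeling weaknesses:\n{solution}"
        )
        # Ask the model to revise the draft in light of its critique.
        solution = ask_llm(
            f"Task:\n{task_description}\n\nCurrent solution:\n{solution}\n\n"
            f"Critique:\n{critique}\n\nReturn an improved solution."
        )
    return solution
```

The design point is simply that the refinement signal comes from the model itself rather than from a human or a held-out score, which is what distinguishes this strategy from the other prompting approaches the study compared.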