Evaluating large language models in biomedical data science challenges through a classroom experiment
· 2025
· Open Access
· DOI: https://doi.org/10.1073/pnas.2521062122
· OpenAlex: W4417251398
Large language models (LLMs) have shown remarkable capabilities in algorithm design, but their effectiveness at solving data science challenges in real-world settings remains poorly understood. We conducted a classroom experiment in which graduate students used LLMs to solve biomedical data science challenges on Kaggle, focusing on tabular data prediction. While their submissions did not top the leaderboards, their prediction scores were often close to those of leading human participants. LLMs frequently recommended gradient boosting methods, which were associated with better performance. Among prompting strategies, self-refinement, in which the LLM improves its own initial solution, was the most effective, a result validated using additional LLMs. While LLMs are capable of handling more complex data science tasks beyond tabular data prediction, their performance on such tasks is substantially worse. These findings demonstrate that LLMs have the potential to design competitive machine learning solutions, even when used by nonexperts.
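To make the gradient-boosting finding concrete, the sketch below shows the kind of baseline the LLMs often recommended for tabular Kaggle tasks, built with scikit-learn. It is a minimal illustration under stated assumptions: the file name `train.csv`, the label column `target`, the hyperparameters, and the AUC metric are placeholders, not details taken from the study.

```python
# Minimal gradient-boosting baseline of the kind the LLMs often
# recommended for tabular prediction tasks. File name, label column,
# hyperparameters, and metric are illustrative placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")        # hypothetical Kaggle training file
X = train.drop(columns=["target"])      # hypothetical label column
y = train["target"]

model = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.05, max_depth=3, random_state=0
)

# Estimate leaderboard-style performance with 5-fold cross-validation.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```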
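The self-refinement strategy the abstract identifies as most effective can be sketched as a short loop: the LLM drafts a solution, critiques its own draft, and then revises it. The code below is an assumed, minimal rendering of that pattern, not the study's protocol; `ask_llm` is a hypothetical wrapper around whatever chat-completion API is available, and the prompts and round count are illustrative.

```python
# Sketch of self-refinement prompting: draft, self-critique, revise.
# `ask_llm` is a hypothetical placeholder for any LLM chat API; the
# prompts and number of rounds are assumptions, not the study's setup.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM API of choice here")

def self_refine(task_description: str, rounds: int = 2) -> str:
    # Initial draft solution.
    solution = ask_llm(
        f"Write Python code to solve this data science task:\n{task_description}"
    )
    for _ in range(rounds):
        # Ask the model to critique its own draft.
        critique = ask_llm(
            "Review the following solution for bugs, data leakage, and "
            f"modeling weaknesses:\n{solution}"
        )
        # Ask the model to revise the draft in light of its critique.
        solution = ask_llm(
            f"Task:\n{task_description}\n\nCurrent solution:\n{solution}\n\n"
            f"Critique:\n{critique}\n\nReturn an improved solution."
        )
    return solution
```

The design point is simply that the refinement signal comes from the model itself rather than from a human or a held-out score, which is what distinguishes this strategy from the other prompting approaches the study compared.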