arXiv (Cornell University)
On the redundancy in large material datasets: efficient and robust learning with less data
April 2023 • Kangming Li, Daniel Persaud, Kamal Choudhary, Brian DeCost, Michael Greenwood, Jason Hattrick‐Simpers
Extensive efforts to gather materials data have largely overlooked potential data redundancy. In this study, we present evidence of a significant degree of redundancy across multiple large datasets for various material properties, by revealing that up to 95 % of data can be safely removed from machine learning training with little impact on in-distribution prediction performance. The redundant data is related to over-represented material types and does not mitigate the severe performance degradation on out-of-dist…
Redundancy (Engineering)
Computer Science
Machine Learning
Training, Validation, And Test Data Sets
Artificial Intelligence
Data Mining
Chemistry
Biochemistry