Should we worry about data leakage in materials property prediction?

Short answer: yes. Composition overlap between train and test sets can make a model look more robust than it is, especially when crystal prototypes are near-duplicates. We recently reran a published benchmark with composition-family splits and observed performance drops of 30-50%, depending on the target property.
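
For anyone who wants to try this on their own data, here is a minimal sketch of what a composition-family split could look like. It is not the protocol from the benchmark rerun; it assumes a "family" means the chemical system (the set of elements in the formula), and the function names `chemical_system` and `family_split` are made up for illustration. It also does nothing about near-duplicate crystal prototypes, which would need structure information.

```python
import random
import re
from collections import defaultdict


def chemical_system(formula):
    """Return a key such as 'Fe-O' for 'Fe2O3'.

    Assumption: a composition family is the set of elements present,
    ignoring stoichiometry. Other definitions are possible.
    """
    elements = set(re.findall(r"[A-Z][a-z]?", formula))
    return "-".join(sorted(elements))


def family_split(formulas, test_fraction=0.2, seed=0):
    """Split indices so no chemical system appears in both train and test."""
    groups = defaultdict(list)
    for i, formula in enumerate(formulas):
        groups[chemical_system(formula)].append(i)

    # Shuffle whole families, not individual entries.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Fill the test set family by family until it reaches the target size.
    target = test_fraction * len(formulas)
    train, test = [], []
    for key in keys:
        if len(test) < target:
            test.extend(groups[key])
        else:
            train.extend(groups[key])
    return sorted(train), sorted(test)


formulas = ["Fe2O3", "Fe3O4", "FeO", "NaCl", "LiCoO2",
            "Li2CoO3", "SiO2", "TiO2", "MgO", "Al2O3"]
train_idx, test_idx = family_split(formulas, test_fraction=0.2, seed=0)
```

Because whole families are assigned at once, the test set can end up somewhat larger than `test_fraction` when a big family is drawn. If you already use scikit-learn, passing the same family keys as `groups` to `GroupShuffleSplit` should give a comparable split.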

Leakage is not always malicious; many datasets were never designed for ML benchmarking. But if we do not define clear split protocols, we cannot compare papers meaningfully. I would love to see a community-maintained suite of leakage-resistant evaluation splits.

Comments

Composition-family splits should be mandatory in benchmark papers now. Random splits are no longer defensible for many targets.