Identifying and Avoiding Data Leakage Issues When Using Machine Learning in Organizational Research
Machine learning (ML) is increasingly used in organizational research (Putka et al., 2018). However, researcher degrees of freedom and a lack of knowledge regarding best practices for ML may reduce replicability (Epskamp, 2019; Wenzel & Quaquebeke, 2017). One of the most common, yet subtle, problems in ML is data leakage (Kaufman et al., 2012; Yarkoni & Westfall, 2017). Data leakage occurs when information about the outcome variable “leaks” into the model training process, even though that information will not be available when the trained model is later used to make predictions on new data (Ambroise & McLachlan, 2002). Researchers in other fields of psychology have demonstrated that the apparent success of ML is sometimes illusory because of leakage (e.g., Shim et al., 2021). Data leakage often happens unexpectedly and can be hard to detect (Hastie et al., 2009). The current project describes two overarching forms of leakage, uses simulated data to demonstrate how leakage produces upwardly biased estimates of model performance, and delineates the likely manifestations of leakage and how to prevent them.
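To make the upward bias concrete, the sketch below is a minimal, hypothetical illustration (in Python with scikit-learn and NumPy; it is not drawn from the project's own analyses) of one classic leakage pattern akin to the selection bias described by Ambroise and McLachlan (2002): selecting predictors using the full sample, including held-out observations, before cross-validation. Because the predictors are pure noise, an honest evaluation should yield accuracy near chance, whereas the leaky workflow appears to perform much better.

```python
# Minimal sketch of leakage via feature selection performed before
# cross-validation (all names and parameter values are illustrative).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1000))   # 1,000 noise predictors
y = rng.integers(0, 2, size=100)   # binary outcome unrelated to X

# Leaky workflow: feature selection "sees" the outcomes of the
# observations that will later serve as held-out test cases.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky_acc = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Correct workflow: feature selection is refit inside each training fold,
# so held-out observations never inform the selection step.
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
honest_acc = cross_val_score(pipe, X, y, cv=5).mean()

print(f"Leaky CV accuracy:  {leaky_acc:.2f}")   # typically well above 0.50
print(f"Honest CV accuracy: {honest_acc:.2f}")  # typically near 0.50
```

The only difference between the two estimates is whether the preprocessing step was confined to the training folds, which is the kind of subtle, easily overlooked choice through which leakage inflates apparent model performance.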