Built In: A Solution to Leakage in Applied Machine Learning
In 2017, Andrew Ng, a widely recognized expert in machine learning, helped publish a paper in which he and his team used a deep learning model to detect pneumonia from chest X-ray images. In the initial publication, they inadvertently reported overly optimistic results because they didn’t properly account for the fact that some patients appeared more than once in the data set (several had more than one X-ray available). Although the researchers corrected the issue after Nick Roberts pointed it out, it goes to show that even experts and trained practitioners can fall victim to one of the biggest challenges in applied machine learning: leakage.
In essence, data leakage (referred to simply as leakage from this point on) refers to flaws in a machine learning pipeline that lead to overly optimistic results. Although leakage can be hard to detect, even for experts, “too good to be true” performance is often a dead giveaway! Leakage is also a broad term, but Sayash Kapoor and Arvind Narayanan have created a taxonomy of several different types, as well as model info sheets to help avoid leakage in practice.
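The pneumonia example points to one common remedy: when the same patient (or user, or site) contributes multiple rows, split the data by group rather than by row, so no group straddles the train/test boundary. Here is a minimal sketch using scikit-learn’s GroupShuffleSplit; the data below is synthetic stand-in data, not the study’s, and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: several X-rays per patient, so the same
# patient_id can appear on multiple rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))                # stand-in image features
y = rng.integers(0, 2, size=1000)              # stand-in pneumonia labels
patient_ids = rng.integers(0, 300, size=1000)  # ~300 patients, with repeats

# A naive random split can put X-rays from the same patient in both
# train and test, leaking patient-specific signal into the evaluation.
# Grouping the split by patient keeps each patient on one side only.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# Sanity check: no patient should appear in both splits.
overlap = set(patient_ids[train_idx]) & set(patient_ids[test_idx])
assert not overlap, f"leaked patients: {overlap}"
```

The same idea extends to cross-validation (GroupKFold in scikit-learn), and the closing assertion is a cheap, worthwhile check to leave in any pipeline where rows are not truly independent.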