News

Built In: A Solution to Leakage in Applied Machine Learning

In 2017, Andrew Ng, a widely recognized expert in machine learning, co-authored a paper in which he and his team used a deep learning model to detect pneumonia from chest X-ray images. The initial publication reported overly optimistic results because the team didn’t properly account for the fact that some patients appeared more than once in the data set (several had more than one X-ray available). Although the researchers corrected the issue after Nick Roberts pointed it out, the episode shows that even experts and trained practitioners can fall victim to one of the biggest challenges in applied machine learning: leakage.
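One common way to guard against this kind of patient overlap is to split the data by patient rather than by image, so that all X-rays from one person land on the same side of the train/test boundary. The following is a minimal sketch of that idea and not the researchers' actual correction; the function names, the `(patient_id, image_id)` record layout, and the toy data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, image_id) records so no patient is in both sets.

    Hypothetical helper for illustration; the record layout is an assumption.
    """
    # Group every image under the patient it belongs to.
    by_patient = defaultdict(list)
    for patient_id, image_id in records:
        by_patient[patient_id].append((patient_id, image_id))

    # Shuffle patients (not images) so the split is made at the patient level.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [r for p in patients if p not in test_patients for r in by_patient[p]]
    test = [r for p in patients if p in test_patients for r in by_patient[p]]
    return train, test


def shared_patients(train, test):
    """Return the set of patient IDs that appear in both splits."""
    return {p for p, _ in train} & {p for p, _ in test}


# Toy data (made up): patients p1 and p3 each have two X-rays.
records = [
    ("p1", "xray_a"), ("p1", "xray_b"),
    ("p2", "xray_c"),
    ("p3", "xray_d"), ("p3", "xray_e"),
    ("p4", "xray_f"),
    ("p5", "xray_g"),
]

train, test = split_by_patient(records)
overlap = shared_patients(train, test)
print("patients in both splits:", overlap)
```

A naive split that shuffles individual images could place `xray_a` in training and `xray_b` in testing, letting the model be evaluated on a patient it has already seen. Grouping by patient first rules that out by construction.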

In essence, data leakage (referred to simply as leakage from this point on) refers to flaws in a machine learning pipeline that lead to overly optimistic results. Although leakage can be hard to detect, even for experts, “too good to be true” performance is often a dead giveaway! Leakage is also a broad term, but Sayash Kapoor and Arvind Narayanan have created a taxonomy of several different types, as well as a model info sheet to help practitioners avoid leakage.
