6 MIN READ
Blog

Introducing Labrea: Streamlining Python Development with Datasets and Options

By: Austin Warner, Lead Data Scientist
By: Michael Stoepel, Director, Data Science

Labrea is a Python library that exposes a functional, declarative approach to organizing your Python codebase, with a focus on readability, composability, and ease of refactoring. The name is an homage to the 2006 paper "Out of the Tar Pit" and a reference to the La Brea Tar Pits in Los Angeles, Calif. The library seeks to provide a practical implementation of some of the ideas presented by Ben Moseley and Peter Marks in that paper.

Background 

Labrea was created in 2022 when data scientists working on 84.51°'s Best Customer Communications program (BCC) were looking to redesign their codebase to make it more adaptable to changing business requests. A key issue we had identified was that when refactoring our code (e.g. when a function takes a new argument), it was very difficult to identify all of the other places in our codebase that needed to change as a result. We also found that, although we followed functional programming best practices that promote code reusability, the boilerplate needed to enable that reuse cluttered our codebase and slowed development.

To solve these issues, Labrea introduces two new concepts, datasets and options. Combined with functions, they allow us to write Python code that is simple, expressive and flexible, without the typical headaches of refactoring or large amounts of boilerplate. We will introduce these concepts through an example, converting a small program that creates a scatterplot into the equivalent program using Labrea and highlighting how the conversion helps us avoid some typical pain points.

An example 

At 84.51°, we think a lot about retail data – transactions, products, customers, etc. In our example, we will create a scatterplot showing how one metric for customer behavior (e.g. sales) relates to another (e.g. baskets). 

NOTE: The code, data, and field names in this example are fabricated for this article.

The initial ask 

Your boss comes to you and says that they would like to see a visualization that compares shopper frequency versus their typical basket size in dollars spent. They have a suspicion that shoppers tend to fall into one of two groups – either those who shop frequently in small baskets, or those who shop infrequently in large baskets. This is the data you have available to you, where each row represents a single basket for a customer. 

[Image: sample of the transaction data, one row per customer basket]

You land on a scatterplot where each point is a customer, the X-axis is the number of baskets (transactions) and the Y-axis is the median basket size in dollars spent (sales).

Without Labrea

A typical way to complete this task would look like this:

[Images: code listing for the program written without Labrea]
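
As a rough illustration of this function-by-function style, here is a minimal sketch using only the standard library. The sample data, the field names (customer_id, transaction_id, sales), and the function names are assumptions made for this sketch, and the plotting step is reduced to producing the points a plotting library would draw.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Hypothetical sample data: one row per basket (field names are assumptions).
SAMPLE_CSV = """customer_id,transaction_id,sales
A,1,10.00
A,2,30.00
A,3,20.00
B,4,100.00
"""


def load_transactions(source):
    """Read basket rows from a file-like object into a list of dicts."""
    return [
        {"customer_id": row["customer_id"], "sales": float(row["sales"])}
        for row in csv.DictReader(source)
    ]


def customer_txn_metrics(txns):
    """Per customer: number of baskets and median basket size in dollars."""
    sales_by_customer = defaultdict(list)
    for txn in txns:
        sales_by_customer[txn["customer_id"]].append(txn["sales"])
    return {
        customer: {"baskets": len(sales), "median_sales": median(sales)}
        for customer, sales in sales_by_customer.items()
    }


def scatter_points(metrics, x, y):
    """Pick the two fields to plot; a plotting library would draw these."""
    return sorted((m[x], m[y]) for m in metrics.values())


# Each step's output is passed by hand to the next step.
txns = load_transactions(io.StringIO(SAMPLE_CSV))
metrics = customer_txn_metrics(txns)
points = scatter_points(metrics, x="baskets", y="median_sales")
print(points)
```

Note that the caller is responsible for wiring every step together, which is the part that grows harder to maintain as parameters are added.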

This seems pretty straightforward, right? We write some functions and call them in sequence. The downsides of this approach are not apparent yet, but they will become clear as we make incremental changes to this program. Next, we will show how the same program could be written with Labrea, and then demonstrate how it is more flexible to future changes.

With Labrea

Here is the same program written with Labrea. We will walk through the pieces step-by-step to explain how it works.

[Image: code listing for the program written with Labrea]

Our first dataset is transactions, which represents the tabular transaction data. It has one parameter, path, whose value we declare should come from the option TRANSACTIONS.PATH. The dotted name refers to a nested location within the dictionary of options we provide. If we want to look at the transactions, we can call the dataset like a function, passing a single dictionary of options that includes TRANSACTIONS.PATH. Labrea will automatically retrieve that option, pass it in as the path argument, and run the function body.
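
The dotted lookup can be pictured as a walk through nested dictionaries. The helper below is a hypothetical illustration written for this example, not Labrea's own code, and the file name is made up.

```python
from functools import reduce


def resolve_option(options, dotted_key):
    """Walk a nested dict one key at a time, e.g. 'TRANSACTIONS.PATH'."""
    try:
        return reduce(lambda node, key: node[key], dotted_key.split("."), options)
    except (KeyError, TypeError):
        # Missing key, or a non-dict value reached before the path ended.
        raise KeyError(f"option {dotted_key!r} was not provided") from None


# Hypothetical options dictionary with one nested entry.
options = {"TRANSACTIONS": {"PATH": "transactions.csv"}}
path = resolve_option(options, "TRANSACTIONS.PATH")
print(path)
```

Raising a descriptive error for a missing option is a design choice of this sketch; it makes a misspelled key easier to spot than a bare lookup failure would.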

[Image: calling the transactions dataset with an options dictionary]

This might seem like a somewhat roundabout way of providing an argument to a function, but the power comes when we create our next dataset, customer_txn_metrics. This dataset has one parameter, txns, which we say is our transactions table. When we call customer_txn_metrics, it automatically evaluates the transactions dataset using the provided options dictionary, passes the result as the txns argument, and runs the function body.
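
To make the mechanics concrete, here is a toy stand-in for the idea, written from scratch with the standard library. Nothing in it is imported from Labrea, and the Option and Dataset classes, the dataset decorator, and the simplified metrics are all assumptions of this sketch rather than the library's implementation.

```python
import inspect


class Option:
    """Stand-in marker: 'fill this parameter from the options dictionary'."""

    def __init__(self, dotted_key):
        self.dotted_key = dotted_key

    def evaluate(self, options):
        value = options
        for key in self.dotted_key.split("."):
            value = value[key]
        return value


class Dataset:
    """Stand-in wrapper: resolves each parameter's default, then calls func."""

    def __init__(self, func):
        self.func = func

    def evaluate(self, options):
        params = inspect.signature(self.func).parameters
        # Every default is an Option or a Dataset; both expose evaluate().
        kwargs = {name: p.default.evaluate(options) for name, p in params.items()}
        return self.func(**kwargs)

    def __call__(self, options):
        return self.evaluate(options)


def dataset(func):
    return Dataset(func)


@dataset
def transactions(path=Option("TRANSACTIONS.PATH")):
    # A real version would read the file at `path`; rows are inlined here.
    return [
        {"customer_id": "A", "sales": 10.0, "source": path},
        {"customer_id": "A", "sales": 30.0, "source": path},
    ]


@dataset
def customer_txn_metrics(txns=transactions):
    # Simplified metrics for the sketch: basket count and total sales.
    return {"baskets": len(txns), "total_sales": sum(t["sales"] for t in txns)}


# One options dictionary in; the dependency chain is resolved automatically.
result = customer_txn_metrics({"TRANSACTIONS": {"PATH": "transactions.csv"}})
print(result)
```

The caller never mentions transactions or path: declaring the dependency as a default value is enough for the wrapper to evaluate it first.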

[Image: calling the customer_txn_metrics dataset]

Finally, our third dataset, plot, takes the customer_txn_metrics dataset and the PLOT.X and PLOT.Y options as input, and creates a scatter plot.

[Image: the plot dataset in use]

So, to summarize, options are parameters that come from the single input, the options dictionary. Datasets are computations that can take options or other datasets as inputs.

This might not seem to be very helpful yet – which makes sense! This is a small program, so it is very easy to write in the traditional way. The issues appear when we make incremental changes. What's important to note, though, is that it was about as easy to write this simple example using Labrea as it was without Labrea. A central goal of Labrea is to have minimal up-front cost while greatly reducing the time and energy required as a project grows.

Change 1 - Date Filtering

You show your boss this scatterplot and they love it. They wonder if the holidays might have any impact on that behavior and ask if you can create the same plot for only November and December. You realize that this type of request will probably happen again, so you add some extra parameters for date filtering.

Without Labrea

[Images: date filtering added without Labrea]

With Labrea

[Images: date filtering added with Labrea]

Without Labrea, when we add the extra parameters, we must find all the places in our code that we need to add those parameters. In a small example that can be easy, but as a project grows it can be incredibly difficult to keep track of all the places where those changes are required. With Labrea, all we need to do is add an extra option to one of our datasets, and the rest of our codebase can be completely unchanged. All we do is update our options dictionary and Labrea handles the rest. Suddenly, something that can be a headache is trivial.
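
The core of that claim can be approximated by hand: if every step receives the same options dictionary, only the step that cares about a new option has to change. The sketch below is a hand-rolled approximation of that idea, not Labrea itself. The rows, the START_DATE and END_DATE option names, and the function names are assumptions.

```python
from datetime import date

# Hypothetical rows; in the article these come from a CSV file.
ROWS = [
    {"customer_id": "A", "date": date(2023, 10, 30), "sales": 10.0},
    {"customer_id": "A", "date": date(2023, 11, 24), "sales": 30.0},
    {"customer_id": "B", "date": date(2023, 12, 20), "sales": 50.0},
]


def transactions(options):
    """Only this step knows about the (optional) date window."""
    cfg = options.get("TRANSACTIONS", {})
    start = cfg.get("START_DATE", date.min)
    end = cfg.get("END_DATE", date.max)
    return [row for row in ROWS if start <= row["date"] <= end]


def basket_counts(options):
    """Downstream step: its code is unchanged by the new date options."""
    counts = {}
    for row in transactions(options):
        counts[row["customer_id"]] = counts.get(row["customer_id"], 0) + 1
    return counts


# No date options: every row is kept.
all_year = basket_counts({})

# Same code, different options: November and December only.
holidays = basket_counts({
    "TRANSACTIONS": {
        "START_DATE": date(2023, 11, 1),
        "END_DATE": date(2023, 12, 31),
    }
})
print(all_year, holidays)
```

Because the date bounds default to the widest possible window, existing callers that never supply them keep working.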

Change 2 - New Data Source

Your boss loves this plot even more. They ask whether you can turn it into a dashboard so they can check it regularly.

So far this has all been built on a CSV file on your laptop, but that won't work if this is going to be an actual dashboard since your boss will want to see the most recent data as it enters your database. You decide to add the ability to read the database as well as the CSV.

Without Labrea

[Images: loading transactions from a CSV file or the database without Labrea]

Now you have two ways to load the transactions: via a CSV or via the database. This makes it easy to write and iterate on the code on your laptop. When you are ready to deploy it to a dashboard, just change the entries in the options to point to the database rather than the CSV file on your laptop.
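
A hand-written version of this branching might look like the sketch below. It uses an in-memory CSV and an in-memory SQLite database as stand-ins for the laptop file and the production database. The table name, column names, and function signature are assumptions.

```python
import csv
import io
import sqlite3


def load_transactions(source, csv_file=None, connection=None):
    """Branch on the source by hand; each new source adds another branch."""
    if source == "csv":
        return [
            (row["customer_id"], float(row["sales"]))
            for row in csv.DictReader(csv_file)
        ]
    elif source == "database":
        cursor = connection.execute("SELECT customer_id, sales FROM transactions")
        return [(customer_id, float(sales)) for customer_id, sales in cursor]
    else:
        raise ValueError(f"unknown source: {source!r}")


# In-memory stand-ins for the laptop CSV and the production database.
csv_file = io.StringIO("customer_id,sales\nA,10.0\n")
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE transactions (customer_id TEXT, sales REAL)")
connection.execute("INSERT INTO transactions VALUES ('B', 25.0)")

from_csv = load_transactions("csv", csv_file=csv_file)
from_db = load_transactions("database", connection=connection)
print(from_csv, from_db)
```

The function signature now carries parameters for every source, even though any single call uses only some of them.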

With Labrea

Labrea has the idea of dataset overloads for this exact purpose. Datasets can have multiple implementations (overloads), and we can use the options dictionary to decide which one to use. Here is how we would add the same logic using an overload.

[Images: the transactions dataset with a database overload]

As before, you have two ways to load the transactions: via a CSV or via the database. The difference is that the options dictionary alone decides which implementation runs, so you can write and iterate on the code on your laptop, then point the options at the database when you are ready to deploy the dashboard.
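
The dispatch idea behind overloads can be sketched as a small registry that maps an alias to an implementation and picks one based on an option. This is a toy written for illustration, not Labrea's overload mechanism. The Overloadable class, the SOURCE, CSV_PATH and TABLE option names, and the aliases are all hypothetical.

```python
class Overloadable:
    """Toy stand-in: several implementations, one chosen by an option."""

    def __init__(self, dispatch_key, default_alias):
        self.dispatch_key = dispatch_key  # top-level key in the options dict
        self.default_alias = default_alias
        self.implementations = {}

    def overload(self, alias):
        """Decorator that registers a function under the given alias."""
        def register(func):
            self.implementations[alias] = func
            return func
        return register

    def __call__(self, options):
        alias = options.get(self.dispatch_key, self.default_alias)
        return self.implementations[alias](options)


transactions = Overloadable(dispatch_key="SOURCE", default_alias="CSV")


@transactions.overload("CSV")
def transactions_from_csv(options):
    # Placeholder body: a real version would read the CSV file.
    return f"rows read from {options['CSV_PATH']}"


@transactions.overload("DATABASE")
def transactions_from_database(options):
    # Placeholder body: a real version would query the database.
    return f"rows read from table {options['TABLE']}"


# Local development: the default CSV implementation is used.
local = transactions({"CSV_PATH": "transactions.csv"})

# Deployment: only the options change, not the calling code.
deployed = transactions({"SOURCE": "DATABASE", "TABLE": "transactions"})
print(local)
print(deployed)
```

Adding a third source means registering one more function; no existing branch or call site has to be edited.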

Takeaways

We believe that Labrea makes this code easier to read and maintain. Firstly, all of the parameters live in one place (the options dictionary), and it is easy to see where each dataset gets its inputs from (other datasets or options). Secondly, we can freely add new inputs to a dataset at any time without worry, as it has no impact on the other datasets in our program, which makes refactoring a breeze. Thirdly, the ability to overload a dataset lets us avoid writing long strings of if/else checks and focus more on the actual business problem.

This has been a light introduction to Labrea, what it does, and how it can simplify your Python code. While this was a toy example, we want to note that Labrea is used at 84.51° both for simple projects that can fit in one notebook and for large, complex codebases like the one used for targeting BCC campaigns. While it won't be a perfect fit for every project, we believe that many people writing Python code (within and outside 84.51° alike) can benefit from Labrea. We hope you give it a try!

Austin Warner, Lead Data Scientist
Austin Warner is a lead data scientist at 84.51°. As a technical lead, he is responsible for the targeting and allocation framework behind Kroger’s Best Customer Communications program.
Michael Stoepel, Director, Data Science
Michael Stoepel is a director of data science at 84.51°. He leads technical data science across the Personalization & Loyalty Strategy teams by providing tech mentorship and innovative thought leadership.
