Background

I have been working on a really fun & interesting binary classification model at work. We have a training process that runs weekly, but our batch prediction process (let's call it the "test-time inference" process, as Rachel Thomas does here) runs multiple times per day. Much of the feature data changes intraday, so I have to be sure that the test-time inference data we feed into our weekly trained model (a pickle file) has the same characteristics as the data it trained on. I have developed a BigQuery script using BQML that runs every time my test-time inference process runs, to check that my "training" data looks similar enough to my "test-time inference" data. If a simple model can tell the two apart, you have an issue, and you should fail/stop your process!

LEFT SIDE OF IMAGE BELOW: When we learn about machine learning, we typically do the left side of the figure below. We get the data, we add features, we train, we hold out some data, and we see how we do on that holdout set. What we don't learn about is how we should SAVE that model file down somewhere so that a "test-inference" process can pick it up later.

RIGHT SIDE OF IMAGE BELOW: Before my process grabs the model file to make predictions on my test-inference data, I need to be sure that this data looks similar enough to the data that the model trained on. If you do not check this, you run the risk of degraded accuracy. I will explain below how I do this in BigQuery.


Simple BQML Script & Model to Alert You To Data Issues

I have been wrangling data and telling stories with it for over 13 years. I love BigQuery. I love how fast & powerful it is. I also love the work they are putting into BQML.

For this demonstration, I will pull in a BigQuery public dataset -- daily liquor sales in Iowa. We can pretend that we work at a government agency in Iowa and our boss needs us to predict liquor sales by THE GALLON haha. Not realistic, but fun to think about.
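If you want a quick peek at the raw data first, the table lives at `bigquery-public-data.iowa_liquor_sales.sales`; something like this works (the columns selected below come from that public table):

```sql
-- Quick peek at the Iowa liquor sales public dataset
SELECT
  date,
  county,
  category_name,
  bottles_sold,
  sale_dollars,
  volume_sold_gallons  -- our boss wants gallons, remember
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 10;
```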

The script below will roughly follow these steps:

  1. Build a table
  2. Split the table into chunks for train / test / validation / test-inference. In real life, you'll have a process that updates the test-inference dataset. The "train / test / validation" chunks are the left side of the image above; "test-inference" is the right side.
  3. Grab 10k records from train and from test-inference.
  4. Build simple BQML model to predict which records are from "train" (0) or from "test-inference" (1)
  5. Look at the AUC of the simple model to determine if there are data issues, and fail the query if the AUC is above your threshold. I have found that .6 or .65 works as a good threshold. An AUC of .5 is basically random guessing, which implies that the model cannot discern which records are train and which are test-inference.
  6. If there is an issue, return the features that are contributing to it.

Note: At the end of this post, you'll find links to a Colab notebook where you can run the code yourself.

Steps 1 & 2: Query, split the data
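The full query is in the Colab notebook; a minimal sketch of the idea looks like this (the project/dataset/table names are placeholders, and the date cutoff is made up for the demo):

```sql
-- Step 1: build a daily feature table from the public data
CREATE OR REPLACE TABLE `your_project.your_dataset.liquor_features` AS
SELECT
  date,
  county,
  category_name,
  SUM(bottles_sold)        AS bottles_sold,
  SUM(sale_dollars)        AS sale_dollars,
  SUM(volume_sold_gallons) AS volume_sold_gallons,
  -- Step 2: assign each row to a chunk; hashing the date keeps the split stable
  CASE
    WHEN date >= '2020-01-01' THEN 'test_inference'  -- newest data = right side of image
    WHEN MOD(ABS(FARM_FINGERPRINT(CAST(date AS STRING))), 10) < 8 THEN 'train'
    WHEN MOD(ABS(FARM_FINGERPRINT(CAST(date AS STRING))), 10) = 8 THEN 'validation'
    ELSE 'test'
  END AS split
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY date, county, category_name;
```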

The output from the query above will help you get a feel for the data.


Steps 3 & 4: Create model in BQML
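Again, the real thing is in the notebook; here is a sketch using the placeholder names from above. One detail that matters: drop `date` (or anything else that trivially identifies the split) from the features, otherwise the model separates the two sets for free. Note the `enable_global_explain` option, which we will need in step 6:

```sql
-- Steps 3 & 4: label 10k random 'train' rows as 0 and 10k random
-- 'test_inference' rows as 1, then fit a quick logistic regression
CREATE OR REPLACE MODEL `your_project.your_dataset.train_vs_inference`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['is_test_inference'],
  enable_global_explain = TRUE  -- lets us run ML.GLOBAL_EXPLAIN in step 6
) AS
(
  SELECT * EXCEPT (split, date), 0 AS is_test_inference
  FROM `your_project.your_dataset.liquor_features`
  WHERE split = 'train'
  ORDER BY RAND()
  LIMIT 10000
)
UNION ALL
(
  SELECT * EXCEPT (split, date), 1 AS is_test_inference
  FROM `your_project.your_dataset.liquor_features`
  WHERE split = 'test_inference'
  ORDER BY RAND()
  LIMIT 10000
);
```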


Step 5: Check AUC, fail if above threshold
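ML.EVALUATE gives us `roc_auc` directly, and BigQuery's ERROR() function lets a plain SELECT abort the whole job, which is exactly the fail/stop behavior we want. A sketch, using the same placeholder model name:

```sql
-- Step 5: fail the query (and therefore the pipeline) if the drift
-- model can tell train and test-inference apart too well
SELECT
  IF(
    roc_auc > 0.65,
    ERROR(CONCAT('Data issue! train vs test-inference AUC = ',
                 CAST(roc_auc AS STRING))),
    CONCAT('All good, AUC = ', CAST(roc_auc AS STRING))
  ) AS drift_check
FROM ML.EVALUATE(MODEL `your_project.your_dataset.train_vs_inference`);
```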


Step 6: Return top contributing features
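Because we trained with `enable_global_explain = TRUE`, we can ask BQML which features did the most work separating the two sets; those are your prime suspects:

```sql
-- Step 6: rank features by how much they contributed to telling
-- train apart from test-inference
SELECT *
FROM ML.GLOBAL_EXPLAIN(MODEL `your_project.your_dataset.train_vs_inference`)
ORDER BY attribution DESC;
```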

If the AUC is below .65, then there are no issues and you can skip this step.


In Closing: Sleep well, David -- BQML checked, your data is good to go

I have implemented this process at work, and I honestly sleep better knowing that I have this failsafe in place. We have to pull from more than 25 different sources for 120 or so features. If we mess something up, or if an ETL process fails upstream before the data gets to our process, it is wonderful to know right away instead of watching performance plummet or bounce around a day or two later.

Please reach out on Twitter or LinkedIn if you have any questions.

If you would like to look at the full script, please follow this link to run it in Colab. Here is a link to a .sql file to copy/paste into the BQ console.