Now you are ready for the fun part!
In this tutorial, we will explore the problem of churn classification using Unbox.
Let’s say that we have an online platform with lots of active users. We know for a fact that some users love our platform and intend to continue using it indefinitely. However, after some time, other users exit our platform, never to come back, i.e., they churn.
The idea is that by observing some of the users’ characteristics, such as age, gender, and geography, we can train an ML model that predicts whether a given user will be retained or will exit. This binary classifier can be quite useful for different teams inside our organization and, if our model is good enough, we can act in time to retain users who are likely to churn and keep enjoying a healthy growth rate.
As a data scientist or ML engineer, it’s all in your hands now.
Let’s train a model to see what happens.
To make your life easier, here is the link to a Colab notebook where you have everything you’ll need to follow this tutorial.
We are going to use the open-source Churn Modelling dataset available on Kaggle. We also took the liberty of writing all the code that loads the dataset, applies a one-hot-encoding to the categorical features, splits the dataset into training and validation sets, and trains a gradient boosting classifier (which is our model of choice). We added comments on the notebook to guide you throughout this process.
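For orientation, the steps the notebook performs can be sketched roughly as follows. This is not the notebook's code: it uses a small synthetic data frame instead of the Kaggle dataset, `pd.get_dummies` for the one-hot encoding, and a made-up rule to generate labels. Only the `Exited` label column name comes from this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (assumption: the notebook loads the Kaggle CSV instead).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Age": rng.integers(18, 80, size=n),
    "Geography": rng.choice(["France", "Germany", "Spain"], size=n),
    "Gender": rng.choice(["Female", "Male"], size=n),
})
# Made-up rule so that the label is learnable: older users churn more often.
df["Exited"] = ((df["Age"] + rng.normal(0, 10, size=n)) > 55).astype(int)

# One-hot encode the categorical features.
X = pd.get_dummies(df.drop(columns="Exited"), columns=["Geography", "Gender"], dtype=int)
y = df["Exited"]

# Split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Train a gradient boosting classifier and evaluate it on the validation set.
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
val_accuracy = clf.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.2f}")
```

The accuracy printed here says nothing about the real dataset; the sketch only shows the shape of the pipeline.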
Running the notebook cells
Please run the notebook cells up to the point where we evaluate the model’s performance on the validation set. How is our model doing? Do you see the accuracy?
Our model’s accuracy on the validation set is approximately 86%. Pretty good, huh?
Despite their popularity, aggregate metrics, such as accuracy, can be very misleading. They are a good first metric to look at, but they do little to answer questions such as:
- How does our model perform for different user groups? For example, what’s the performance for users aged between 25-35? What about for users from different countries?
- Are there common errors our model is making that could be easily fixed if we had a little bit more data?
- Are there biases hidden in our model?
- Why is our model predicting a user will churn? Is it doing something reasonable or simply over-indexing on certain features?
The list of questions we can ask is virtually infinite and staring at the accuracy won’t get us very far. Furthermore, notice that from a business perspective, the answers to these questions might be very relevant, so you need to be confident that your model is coherent enough to answer them.
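To make the first question concrete, here is a small sketch of how accuracy can be broken down by user group with pandas. The data frame, its column names (`Geography`, `label`, `prediction`), and its values are made up for illustration; they are not taken from the notebook.

```python
import pandas as pd

# Hypothetical validation results: true labels and model predictions per user.
results = pd.DataFrame({
    "Geography": ["France", "France", "France", "France",
                  "Germany", "Germany", "Spain", "Spain"],
    "label":      [0, 0, 1, 0, 1, 1, 0, 1],
    "prediction": [0, 0, 1, 0, 0, 0, 0, 1],
})
results["correct"] = results["label"] == results["prediction"]

# Aggregate accuracy versus accuracy within each group.
overall_accuracy = results["correct"].mean()
accuracy_by_group = results.groupby("Geography")["correct"].mean()

print(f"Overall accuracy: {overall_accuracy:.2f}")
print(accuracy_by_group)
```

In this toy example the aggregate accuracy is 0.75, yet every prediction for the users from Germany is wrong, which is exactly the kind of pattern a single number hides.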
The only way to start getting the answers we need before we ship the churn model is by systematically conducting error analysis.
The first step is giving the model and the validation set a new home: the Unbox platform. To upload models and datasets to Unbox, we are going to use our API. You will be modifying the notebook we provided to call the API and auto-magically load and deploy the dataset and the model.
When you call our API, it is critical that we know who is calling us, so that we can upload the model and dataset to the correct Unbox account.
Therefore, before uploading anything to Unbox, you need to instantiate the client with your API key.
Instantiating the client
Create a new cell in the notebook we provided, right after the model evaluation part. In that cell, we will instantiate the Unbox client, and you will replace 'YOUR_API_KEY_HERE' with your API key.
    import unboxapi

    client = unboxapi.UnboxClient('YOUR_API_KEY_HERE')
If you don’t know what your API key is, or if you get a ModuleNotFoundError when trying to import unboxapi, check out the installation part of the tutorial and verify that unboxapi is successfully installed.
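If you would rather not paste the key into a notebook you might share, one option is to read it from an environment variable. This is a sketch, not part of the Unbox API: the variable name `UNBOX_API_KEY` and the helper `load_api_key` are hypothetical.

```python
import os


def load_api_key(env_var: str = "UNBOX_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The default variable name is a hypothetical choice for this sketch.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return api_key


# Usage in the notebook, once the variable is set:
# import unboxapi
# client = unboxapi.UnboxClient(load_api_key())
```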
Now that we have instantiated the Unbox client with the correct API key, let’s briefly talk about uploading the model.
The gradient boosting classifier we trained in the notebook is a scikit-learn model.
Frameworks we currently support: TensorFlow, scikit-learn, PyTorch, HuggingFace, FastText, Rasa, and XGBoost.
Reach out and let us know if you use a different framework!
To be able to upload our model to Unbox, we first need to package it into a predict_proba function. This function needs to receive the model object and the model’s input as arguments, and it should output an array-like with class probabilities. There are other optional arguments you can use to apply any necessary transformations, but as long as the predict_proba function receives the model and its inputs as arguments and outputs class probabilities, it is compatible with Unbox.
For scikit-learn models, this is basically a wrapper around the predict_proba method, which receives an array-like of shape (n_samples, n_features) as an input and outputs an array-like with class probabilities of shape (n_samples, n_classes).
Therefore, in our case, the predict function simply looks like this:
    import numpy as np
    import pandas as pd

    def predict_proba(model, input_features: np.ndarray, col_names: list, one_hot_encoder, encoders):
        """Convert the raw input_features into one-hot encoded features
        using our one hot encoder and each feature's encoder.
        """
        # Rebuild a data frame so the encoder can look columns up by name.
        df = pd.DataFrame(input_features, columns=col_names)
        encoded_df = one_hot_encoder(df, encoders)
        return model.predict_proba(encoded_df.to_numpy())
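Before uploading, it can help to check locally that a function of this kind really returns class probabilities. The sketch below is not part of unboxapi: `StubModel`, `simple_predict_proba`, and `check_class_probabilities` are hypothetical names, and the stub stands in for the trained classifier so that the example runs on its own.

```python
import numpy as np


class StubModel:
    """Hypothetical stand-in for a trained binary classifier."""

    def predict_proba(self, features: np.ndarray) -> np.ndarray:
        # Squash the first feature into (0, 1) and treat it as P(class 1).
        p_exit = 1.0 / (1.0 + np.exp(-features[:, 0]))
        return np.column_stack([1.0 - p_exit, p_exit])


def simple_predict_proba(model, input_features: np.ndarray) -> np.ndarray:
    """Minimal wrapper: receives the model and its inputs, returns class probabilities."""
    return model.predict_proba(input_features)


def check_class_probabilities(probs, n_samples: int, n_classes: int) -> None:
    """Raise ValueError if probs is not a valid (n_samples, n_classes) probability array."""
    probs = np.asarray(probs, dtype=float)
    if probs.shape != (n_samples, n_classes):
        raise ValueError(f"expected shape {(n_samples, n_classes)}, got {probs.shape}")
    if np.any(probs < 0) or np.any(probs > 1):
        raise ValueError("probabilities must lie in [0, 1]")
    if not np.allclose(probs.sum(axis=1), 1.0):
        raise ValueError("each row must sum to 1")


sample = np.array([[0.0, 1.0], [2.0, 3.0], [-2.0, 5.0]])
probs = simple_predict_proba(StubModel(), sample)
check_class_probabilities(probs, n_samples=3, n_classes=2)
print(probs.shape)
```

The same check can be pointed at your own function by passing a few rows of the validation set and the number of classes you expect.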
Now that we have our model’s predict function, we are ready to upload it to Unbox. The model upload is done with the add_model method:

    from unboxapi.tasks import TaskType
    from unboxapi.models import ModelType

    model = client.add_model(
        function=predict_proba,
        model=sklearn_model,
        model_type=ModelType.sklearn,
        task_type=TaskType.TabularClassification,
        class_names=class_names,
        name='Churn Classifier',
        description='this is my churn classification model',
        feature_names=feature_names,
        train_sample_df=training_set[:3000],
        train_sample_label_column_name='Exited',
        categorical_features_map=categorical_map,
        col_names=feature_names,
        one_hot_encoder=data_encode_one_hot,
        encoders=encoders,
    )
    model.to_dict()
There are other optional arguments you can pass when uploading a model, but the above is enough for our purposes. For a complete reference on the add_model method, check our API reference page.
It’s time to upload our dataset as well.
In our example, the validation set is a single pandas data frame. That’s the data frame that we will upload, which can be done with the add_dataframe method:

    from unboxapi.tasks import TaskType

    dataset = client.add_dataframe(
        df=validation_set,
        class_names=class_names,
        label_column_name='Exited',
        name="Churn Validation",
        description='this is my churn dataset',
        task_type=TaskType.TabularClassification,
        feature_names=feature_names,
        categorical_features_map=categorical_map,
    )
    dataset.to_dict()
There are other optional arguments you can pass when uploading a dataset. For a complete reference on the add_dataframe method, check our API reference page.
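Since the upload relies on the data frame, the label column, the class names, and the feature names agreeing with each other, a quick local check before calling the API can save a round trip. The helper below is a hypothetical sketch (it is not part of unboxapi), and it assumes that labels are stored as integer indices into `class_names`, which this tutorial does not state explicitly. The toy data frame and the class names are made up.

```python
import pandas as pd


def find_dataset_issues(df: pd.DataFrame, label_column_name: str,
                        class_names: list, feature_names: list) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    issues = []
    missing = [name for name in feature_names if name not in df.columns]
    if missing:
        issues.append(f"missing feature columns: {missing}")
    if label_column_name not in df.columns:
        issues.append(f"missing label column: {label_column_name!r}")
    else:
        # Assumption: labels are integer indices into class_names.
        valid = set(range(len(class_names)))
        unexpected = sorted(set(df[label_column_name].unique().tolist()) - valid)
        if unexpected:
            issues.append(f"labels outside 0..{len(class_names) - 1}: {unexpected}")
    return issues


# Tiny made-up validation set for illustration.
toy_validation_set = pd.DataFrame({
    "Age": [42, 35, 58],
    "Geography": ["France", "Spain", "Germany"],
    "Exited": [1, 0, 2],  # 2 is deliberately invalid for two classes
})
issues = find_dataset_issues(
    toy_validation_set,
    label_column_name="Exited",
    class_names=["Retained", "Exited"],
    feature_names=["Age", "Geography", "Gender"],
)
print(issues)
```

The toy data is built to trip two checks: the `Gender` column is absent and one label falls outside the two classes.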
After following the previous steps, if you log in to Unbox, you should be able to see the model and the dataset that you just uploaded.
Click on Models under Registry, on the sidebar, to check if our Churn Classifier is indeed there.
Click on Datasets under Registry, on the sidebar, to check if the Churn Validation set is indeed there.
If both are there, you are good to move on to the next part of the tutorial!
If you encountered errors while running the previous steps, here are some common issues worth double-checking:
- check if you installed the most recent version of unboxapi. The current version is 0.0.2. You can check which version you have installed by opening your shell and typing:
$ pip show unboxapi
- verify that you imported TaskType and that you are passing the correct model type and task type as arguments;
- verify that you are passing all other arguments correctly, as in the code samples we provided.
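The installed version mentioned in the first item can also be checked from inside the notebook with the standard library. `installed_version` is a hypothetical helper name, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("unboxapi")
if version is None:
    print("unboxapi is not installed")
else:
    print(f"unboxapi {version} is installed")
```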
If you need a more comprehensive reference on the API methods, feel free to check out our API reference page.