Uploading your first model and dataset

It's time to give your models and datasets a new home: Unbox

Now you are ready for the fun part!

In this tutorial, we will explore the problem of churn classification using Unbox.

Let’s say that we have an online platform with lots of active users. We know for a fact that some users love our platform and intend to continue using it indefinitely. Other users, however, eventually leave our platform never to come back, i.e., they churn.

The idea is that by observing some of the users’ characteristics, such as age, gender, and geography, we can train an ML model that predicts whether a given user will be retained or will churn. This binary classifier can be quite useful for different teams inside our organization. Hopefully, if our model is good enough, we can act in time to retain the users who are likely to churn, thus continuing to enjoy a healthy growth rate.

As a data scientist or ML engineer, it’s all in your hands now.

Let’s train a model to see what happens.

Training the model

To make your life easier, here is the link to a Colab notebook where you have everything you’ll need to follow this tutorial.

We are going to use the open-source Churn Modelling dataset available on Kaggle. We also took the liberty of writing all the code that loads the dataset, applies one-hot encoding to the categorical features, splits the dataset into training and validation sets, and trains a gradient boosting classifier (which is our model of choice). We added comments in the notebook to guide you through this process.
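If you want a bird’s-eye view of those steps before running the notebook, here is a minimal sketch. The file name, the identifier columns we drop, and the use of pd.get_dummies are assumptions for illustration; the notebook uses its own encoding helpers, so defer to it for the exact code.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Load the Kaggle Churn Modelling dataset (file name assumed)
data = pd.read_csv('Churn_Modelling.csv')

# Drop identifier columns with no predictive signal (column names assumed)
data = data.drop(columns=['RowNumber', 'CustomerId', 'Surname'])

# Split first, so both sets keep human-readable (pre-encoding) feature values
training_set, validation_set = train_test_split(data, test_size=0.2, random_state=42)

# One-hot encode the categorical features for training;
# pd.get_dummies stands in for the notebook's encoding helpers
x_train = pd.get_dummies(training_set.drop(columns=['Exited']), columns=['Geography', 'Gender'])
y_train = training_set['Exited']  # 1 = exited (churned), 0 = retained

# Train our model of choice: a gradient boosting classifier
sklearn_model = GradientBoostingClassifier().fit(x_train, y_train)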

👍

Running the notebook cells

Please run the notebook cells up to the point where we evaluate the model’s performance on the validation set. How is our model doing? Do you see the accuracy?

Our model’s accuracy on the validation set is approximately 86%. Pretty good, huh?
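If you are wondering where that number comes from, the check boils down to a couple of lines; a minimal sketch, reusing the names from the training sketch above:

# Encode the validation features the same way as the training features,
# then compute the mean accuracy on the validation set
x_val = pd.get_dummies(validation_set.drop(columns=['Exited']), columns=['Geography', 'Gender'])
y_val = validation_set['Exited']

print(f'Validation accuracy: {sklearn_model.score(x_val, y_val):.2%}')  # roughly 86%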

Despite their popularity, aggregate metrics, such as accuracy, can be very misleading. They are a good first metric to look at, but they do little to answer questions such as:

  • How does our model perform for different user groups? For example, what’s the performance for users aged between 25-35? What about for users from different countries?
  • Are there common errors our model is making that could be easily fixed if we had a little bit more data?
  • Are there biases hidden in our model?
  • Why is our model predicting a user will churn? Is it doing something reasonable or simply over-indexing on certain features?

The list of questions we can ask is virtually infinite, and staring at the accuracy won’t get us very far. Furthermore, from a business perspective, the answers to these questions can be highly relevant, so you need to be confident in how your model behaves before relying on it.
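To make the first question concrete, here is what answering it by hand looks like for a single slice, reusing x_val and y_val from the sketch above (Age is assumed to be a numeric column that survives the encoding):

# Score the model on users aged 25-35 only
mask = (validation_set['Age'] >= 25) & (validation_set['Age'] <= 35)
cohort_accuracy = sklearn_model.score(x_val[mask], y_val[mask])
print(f'Accuracy for users aged 25-35: {cohort_accuracy:.2%}')

Repeating this for every slice, every feature, and every question quickly becomes unmanageable.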

The only way to start getting the answers we need before we ship the churn model is by systematically conducting error analysis.

The first step is giving the model and the validation set a new home: the Unbox platform. To upload models and datasets to Unbox, we are going to use our API. You will be modifying the notebook we provided to call the API and auto-magically load and deploy the dataset and the model.

Instantiating the client

When you call our API, it is critical that we know who is calling us, so that we can upload the model and dataset to the correct Unbox account.

Therefore, before uploading anything to Unbox, you need to instantiate the client with your API key.

👍

Instantiating the client

Create a new cell in the notebook we provided, right after the model evaluation part. In that cell, we will instantiate the Unbox client; replace ‘YOUR_API_KEY_HERE’ with your API key.

import unboxapi

client = unboxapi.UnboxClient('YOUR_API_KEY_HERE')

If you don’t know what your API key is, or if you get a ModuleNotFoundError when trying to import unboxapi, check out the installation part of the tutorial and verify that unboxapi is installed successfully.
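If the module is missing, installing it from PyPI should fix the import error (see the installation part of the tutorial for details):

$ pip install unboxapi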

Uploading the model

Now that we have instantiated the Unbox client with the correct API key, let’s briefly talk about uploading the model.

The gradient boosting classifier we trained in the notebook is a scikit-learn model. Currently, we support models from the following frameworks:

🛠️

Reach out

Frameworks we currently support: TensorFlow, scikit-learn, PyTorch, HuggingFace, FastText, Rasa, and XGBoost.
Let us know if you use a different framework!
[email protected]

To upload our model to Unbox, we first need to package it into a predict_proba function. This function receives the model object and the model’s inputs as arguments and outputs an array-like of class probabilities. There are other optional arguments you can use to apply any necessary transformations, but as long as predict_proba receives the model and its inputs as arguments and outputs class probabilities, it is compatible with Unbox.

For scikit-learn models, this is basically a wrapper around the predict_proba method, which receives an array-like of shape (n_samples, n_features) as input and outputs an array-like of class probabilities of shape (n_samples, n_classes).

Therefore, in our case, the predict_proba function simply looks like this:

import numpy as np
import pandas as pd

def predict_proba(model, input_features: np.ndarray, col_names: list, one_hot_encoder, encoders):
    """Convert the raw input_features into one-hot encoded features
    using our one-hot encoder and each feature's encoder."""
    # Rebuild a data frame so the encoder can look features up by name
    df = pd.DataFrame(input_features, columns=col_names)
    encoded_df = one_hot_encoder(df, encoders)

    # Returns an array-like of shape (n_samples, n_classes)
    return model.predict_proba(encoded_df.to_numpy())
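Before uploading, it is worth calling the function once locally to confirm that it returns class probabilities of the expected shape. A minimal sanity check, assuming validation_set holds the raw (pre-encoding) validation data frame and reusing the feature_names, data_encode_one_hot, and encoders objects from the notebook:

# The output should have shape (n_samples, n_classes), i.e., (5, 2) here
probas = predict_proba(
    sklearn_model,
    validation_set[feature_names].to_numpy()[:5],
    feature_names,
    data_encode_one_hot,
    encoders,
)
print(probas.shape)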

Now that we have our model’s predict_proba function, we are ready to upload it to Unbox. The model upload is done with the add_model method.

from unboxapi.tasks import TaskType
from unboxapi.models import ModelType

model = client.add_model(
    function=predict_proba, 
    model=sklearn_model,
    model_type=ModelType.sklearn,
    task_type=TaskType.TabularClassification,
    class_names=class_names,
    name='Churn Classifier',
    description='this is my churn classification model',
    feature_names=feature_names,
    train_sample_df=training_set[:3000],
    train_sample_label_column_name='Exited',
    categorical_features_map=categorical_map,
    col_names=feature_names,
    one_hot_encoder=data_encode_one_hot,
    encoders=encoders,
)

model.to_dict()

There are other optional arguments you can pass when uploading a model, but the above is enough for our purposes. For a complete reference on the add_model method, check our API reference page.

Uploading the dataset

It’s time to upload our dataset as well.

In our example, the validation set is a single pandas data frame. That’s the data frame that we will upload, which can be done with the add_dataframe method:

from unboxapi.tasks import TaskType

dataset = client.add_dataframe(
    df=validation_set,
    class_names=class_names,
    label_column_name='Exited',
    name="Churn Validation",
    description='this is my churn dataset',
    task_type=TaskType.TabularClassification,
    feature_names=feature_names,
    categorical_features_map=categorical_map,
)

dataset.to_dict()

There are other optional arguments you can pass when uploading a dataset. For a complete reference on the add_dataframe method, check our API reference page.

Verifying the upload

After following the previous steps, if you log in to Unbox, you should be able to see the model and the dataset that you just uploaded.

Click Models under Registry in the sidebar to check that our Churn Classifier is indeed there.

Click Datasets under Registry in the sidebar to check that the Churn Validation set is indeed there.

If both are there, you are good to move on to the next part of the tutorial!

Something went wrong with the upload?

If you encountered errors while running the previous steps, here are some common issues worth double-checking:

  • check that you have installed the most recent version of unboxapi. The current version is 0.0.2. You can check which version you have installed by opening your shell and typing:
$ pip show unboxapi
  • verify that you imported ModelType and TaskType and that you are passing the correct model type and task type as arguments;
  • verify that you are passing all other arguments correctly, as in the code samples we provided.

If you need a more comprehensive reference on the API methods, feel free to check out our API reference page.
