Uploading your first model and dataset

It's time to give your models and datasets a new home: Unbox

Now you are ready for the fun part!

In this tutorial, we will explore the development of a chatbot for banking using Unbox.

Let’s say we work at a bank. When clients have questions, they send messages via chat to customer support. The customer support department would benefit greatly from a chatbot that either responds automatically to frequently asked questions or, at the very least, automatically labels the kind of inquiry clients are making, so that messages can be directed to the correct person within the bank.

Equipped with your ML knowledge, you want to train a model that categorizes a client message by the kind of question it is asking. For example, a message such as “I ordered my card a couple of weeks ago and haven’t received it yet. When can I expect it?” belongs to the class card_delivery_estimate. A message like “Tell me how to reset the passcode” belongs to the class passcode_forgotten. There are many more classes, as the clients’ inquiries can be quite diverse.

This multi-class classifier can be quite useful for different teams inside our organization and hopefully, if our model is good enough, we can improve the bank’s customer support rating.

As a data scientist or ML engineer, it’s all in your hands now.

Let’s train a model to see what happens.

Training the model

To make your life easier, here is the link to a Colab notebook with everything you’ll need to follow this tutorial.

We are going to use the 'banking77' dataset available on Hugging Face, which contains excerpts of banking chat messages that fall into one of 77 categories. We also took the liberty of writing all the code that loads the dataset, tokenizes the messages, splits the data into training and validation sets, and trains a logistic regression (which is our model of choice). We added comments in the notebook to guide you through this process.
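The notebook already contains all of this code, but for orientation, the overall flow looks roughly like the sketch below. It is a minimal approximation, assuming the datasets and scikit-learn packages and using a TF-IDF vectorizer as a stand-in for the notebook’s tokenization step; the exact column names and preprocessing may differ from what you’ll find in the Colab.

from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Load the banking77 dataset from Hugging Face
data = load_dataset("banking77", split="train")

# Split the messages and their labels into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    data["text"], data["label"], test_size=0.2, random_state=42
)

# Vectorize the messages and fit a logistic regression on top
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)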

👍

Running the notebook cells

Please run the notebook cells up to the point where we evaluate the model’s performance on the validation set. How is our model doing? Do you see the accuracy?

Our model’s accuracy on the validation set is almost 85%. Pretty good, huh?
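For reference, the accuracy check in the notebook boils down to something like this (a sketch, reusing the hypothetical model, val_texts, and val_labels names from the training sketch above):

from sklearn.metrics import accuracy_score

# Score the trained model on the held-out validation messages
val_predictions = model.predict(val_texts)
print(f"Validation accuracy: {accuracy_score(val_labels, val_predictions):.3f}")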

Despite their popularity, aggregate metrics, such as accuracy, can be very misleading. They are a good first metric to look at, but they help little to answer questions such as:

  • How does our model perform for different groups of the data? For example, what’s the performance for messages where the overarching theme is the user’s credit card? What about for messages from users who complain about refunds? (This kind of slicing is sketched below.)
  • Are there common errors our model is making that could be easily fixed if we had a little bit more data?
  • Are there biases hidden in our model?
  • Why is our model making predictions like this? Is it doing something reasonable or simply over-indexing to certain tokens and stopwords?

The list of questions we can ask is virtually infinite, and staring at the accuracy won’t get us very far. Furthermore, notice that from a business perspective the answers to these questions can be very relevant, so you need to be confident that your model behaves coherently enough to answer them.
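To make the first of those questions concrete, here is a rough sketch of what slicing accuracy by theme might look like by hand. It assumes the hypothetical val_texts, val_labels, and model names from the earlier sketches, plus the label_list of class names used later in this tutorial; the theme_of helper is made up purely for illustration.

import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical helper that maps a class name to an overarching theme
def theme_of(class_name):
    if "card" in class_name:
        return "card"
    if "refund" in class_name:
        return "refund"
    return "other"

results = pd.DataFrame({
    "label": val_labels,
    "prediction": model.predict(val_texts),
})
results["theme"] = [theme_of(label_list[code]) for code in results["label"]]

# Aggregate accuracy can hide slices where the model does much worse
for theme, group in results.groupby("theme"):
    print(theme, accuracy_score(group["label"], group["prediction"]))

Doing this by hand for every slice, error pattern, and potential bias quickly becomes unwieldy.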

The only way to start getting the answers we need before we ship the model is by systematically conducting error analysis.

The first step is giving the model and the validation set a new home: the Unbox platform. To upload models and datasets to Unbox, we are going to use our API. You will be modifying the notebook we provided to call the API and auto-magically load and deploy the dataset and the model.

Instantiating the client

When you call our API, it is critical that we know who is calling us, so that we can upload the model and dataset to the correct Unbox account.

Therefore, before uploading anything to Unbox, you need to instantiate the client with your API key.

👍

Instantiating the client

Create a new cell in the notebook we provided, right after the model evaluation part. In that cell, we will instantiate the Unbox client, and you will replace 'YOUR_API_KEY_HERE' with your API key.

import unboxapi

client = unboxapi.UnboxClient('YOUR_API_KEY_HERE')

If you don’t know your API key, or if you get a ModuleNotFoundError when trying to import unboxapi, check out the installation part of the tutorial and verify that unboxapi is installed correctly.

Uploading the model

Now that we have instantiated the Unbox client with the correct API key, let’s briefly talk about uploading the model.

The logistic regression we trained in the notebook is a scikit-learn model. Currently, we support models from the following frameworks:

🛠️

Reach out

Frameworks we currently support: TensorFlow, Scikit-learn, PyTorch, HuggingFace, FastText, Rasa, and XGBoost.

Let us know if you use a different framework!
[email protected]

To be able to upload our model to Unbox, we first need to package it into a predict function. This function needs to receive the model object and the model’s input as arguments, and it should output an array-like of class probabilities. There are other optional arguments you can use to apply any necessary transformations, but as long as the function receives the model and its inputs as arguments and outputs class probabilities, it is compatible with Unbox.

For scikit-learn models, this is basically a wrapper around the predict_proba method, which receives an array-like of shape (n_samples, n_features) as input and outputs an array-like of class probabilities of shape (n_samples, n_classes).

Therefore, in our case, the predict function simply looks like this:

def predict_function(model, text_list):
    return model.predict_proba(text_list)
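As a quick sanity check, you can call this function on a raw message before uploading anything (assuming the model and label_list variables from the notebook, and that the output is an array-like of probabilities):

# Class probabilities for a single message
probabilities = predict_function(model, ["Tell me how to reset the passcode"])

# Map the highest-probability column back to a class name
print(label_list[probabilities[0].argmax()])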

Now that we have our model’s predict function, we are ready to upload it to Unbox. The model upload is done with the add_model method.

from unboxapi.tasks import TaskType
from unboxapi.models import ModelType

unbox_model = client.add_model(
    function=predict_function, 
    model=model,
    model_type=ModelType.sklearn,
    task_type=TaskType.TextClassification,
    class_names=label_list,
    name="Banking Classifier",
    description="this is my sklearn banking model"
)

unbox_model.to_dict()

There are other optional arguments you can pass when uploading a model, but the above is enough for our purposes. For a complete reference on the add_model method, check our API reference page.

Uploading the dataset

It’s time to upload our dataset as well.

In our example, the validation set is a single pandas data frame. That’s the data frame that we will upload, which can be done with the add_dataframe method:

from unboxapi.tasks import TaskType

dataset = client.add_dataframe(
    df=validation_set,
    class_names=label_list,
    label_column_name="label_code",
    text_column_name="text",
    task_type=TaskType.TextClassification,
    name="Banking Validation",
    description="my banking validation dataset"
)

dataset.to_dict()
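For reference, the validation_set passed above only needs the two columns named in the call. A minimal sketch of how such a data frame could be assembled (using the hypothetical val_texts and val_labels from earlier) is:

import pandas as pd

# A minimal validation data frame with the columns referenced in add_dataframe
validation_set = pd.DataFrame({
    "text": val_texts,          # the raw chat messages
    "label_code": val_labels,   # integer class ids indexing into label_list
})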

There are other optional arguments you can pass when uploading a dataset. For a complete reference on the add_dataframe method, check our API reference page.

Verifying the upload

After following the previous steps, if you log in to Unbox, you should be able to see the model and the dataset that you just uploaded.

Click on Models under Registry, in the sidebar, to check if our Banking Classifier model is indeed there.

Click on Datasets under Registry, in the sidebar, to check if the Banking Validation dataset is indeed there.

If both are there, you are good to move on to the next part of the tutorial!

Something went wrong with the upload?

If you encountered errors while running the previous steps, here are some common issues worth double-checking:

  • check if you installed the most recent version of unboxapi. The current version is 0.0.2. You can check which version you have installed by opening your shell and typing:
$ pip show unboxapi
  • verify that you imported ModelType and TaskType and that you are passing the correct model type and task type as arguments;
  • verify that you are passing all other arguments correctly, as in the code samples we provided.

If you need a more comprehensive reference on the API methods, feel free to check out our API reference page.

