Evaluate your models

On this page, we explain how you can use the Picsellia Python SDK and the platform to evaluate your models and identify edge cases easily.

Evaluation and validation are mandatory steps when creating AI models. It is important to analyze the performance of your model on as much data as you can, both to obtain relevant metrics and to identify edge cases and see how the model handles them.

However, performing these steps and storing all the important information they produce is complicated when you don't have a nice, interactive way to visualize and explore your results. Fortunately, we have developed a custom interface, accessible from your experiment dashboard, that lets you do all of this easily!

The interface

Here is what the validation interface looks like:

Let's look at each element independently.

Confusion matrix

A common way to quickly visualize performance is the confusion matrix. It tells you how many times a class has been predicted correctly (True Positives) or instead of another class (False Positives), and which classes have the most undetected objects (False Negatives).

The z-axis, represented by the intensity of the color, indicates the number of objects predicted in each case.

What is particular about our confusion matrix is that you can click on any cell to display all the images containing the objects counted in that cell.

For example, if I click on the top-left cell of my matrix, all the images where cars have been correctly predicted as cars are displayed in the table below.

Image list

This is where you can see the details of what has been predicted on each image of your dataset and compare it to your ground truth.

This list changes when you click on the confusion matrix (see above) so you can filter and dig through your dataset easily.

For each image, in the 'Labels' column, the colored squares tell you how many objects of each class are present in the ground truth (annotations). The grey square tells you how many objects were not detected during evaluation (False Negatives).

In the 'Predictions' column, the colored squares tell you how many objects have been correctly detected by your model (True Positives) for each class. The red square tells you how many false detections your model has made, which can be either classification errors or detection errors (False Positives).

If you click on the green 'Visualize' button, you will be able to compare your ground truth and the predictions of your model.

Predictions class split

This bar chart tells you how many predictions have been made for each class.

Evaluation metrics

This section of the page is the place where you can store any additional information regarding the whole evaluation process (such as your metrics).

How to log to the interface?

Now, we will learn how to send the information from your evaluation process to Picsellia so that everything is displayed as above and is fully interactive.

We will use the Picsellia Python SDK for the rest of the tutorial. If you don't know what it is or how to use it, we invite you to check its documentation first :)

First, we will initialize the Client and check out an experiment that we assume has a dataset attached (if not, you will have nothing to compare your evaluation with).

If you don't have your API token stored in your environment variables as I do, you can specify it when initializing the client.

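from picsellia.client import Client
# Token of the project that contains the experiment
project_token = '3a6a8f6c-ba63-451a-8b30-318426bc1755'
experiment_name = 'test-eval'
# The API token is read from the environment variables here
experiment = Client.Experiment(project_token=project_token)
# Check out the experiment that will receive the evaluation data
exp = experiment.checkout(experiment_name)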

First, we have to log our confusion matrix. It is a simple array where each cell corresponds to the number of objects detected (or not) for each class.

The format of the data to send to Picsellia is the following (note that for our example, we have two classes in our dataset, so our confusion matrix is a 3x3 matrix: one row and one column per class, plus an extra row and column to cover false positives and false negatives).

confusion = {
    'categories': ["car", "pedestrian"],
    'values': [
        [33., 0., 13.],
        [ 2., 7., 16.],
        [10., 0., 0.]
    ]
}

Then we just have to log this object to Picsellia with the name 'confusion-matrix' and the type 'heatmap'.

exp.log('confusion-matrix', confusion, 'heatmap')
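If you do not already have such a matrix, the following minimal sketch shows one way to build it from matched pairs of ground-truth and predicted classes. The build_confusion helper and its matches input are hypothetical (they are not part of the Picsellia SDK), and the orientation used here (rows for ground truth, columns for predictions, last row and column for "no object") is an assumption that you should check against what your dashboard displays.

```python
def build_confusion(labels, matches):
    """Build a confusion dictionary from (gt_index, pred_index) pairs.

    Assumption: None stands for "no object", i.e. a missed ground-truth
    object (pred_index is None) or a spurious detection (gt_index is None).
    """
    n = len(labels)
    # (n + 1) x (n + 1) grid: the extra row/column holds the "no object" cases
    values = [[0.0] * (n + 1) for _ in range(n + 1)]
    for gt, pred in matches:
        row = n if gt is None else gt
        col = n if pred is None else pred
        values[row][col] += 1.0
    return {'categories': list(labels), 'values': values}


# Toy matching results: two correct cars, one correct pedestrian,
# one missed car, one spurious pedestrian, one pedestrian seen as a car.
matches = [(0, 0), (0, 0), (1, 1), (0, None), (None, 1), (1, 0)]
confusion_example = build_confusion(["car", "pedestrian"], matches)
```

The resulting dictionary has the same two keys as the one above, so it can be logged in the same way.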

Now you should already see your confusion matrix in your experiment dashboard, like this:

OK, that was the easy part. Now we have to send everything needed to display the evaluation as in the interface above.

Let's take a step back to see how we need to format our evaluation data to send it to Picsellia.

data = {
    'categories': ["1", "2"],
    'labels': ["car", "pedestrian"],
    'total_objects': [22, 10],
    'per_image_eval': [
        {
            'filename': '9999953_00000_d_0000058.jpg',
            'nb_objects': [9.0, 0.0],
            'nb_preds': [10.0, 0.0],
            'TP': [9.0, 0.0],
            'FP': [1.0, 0.0],
            'FN': [0.0, 0.0],
            'bbox': [
                [0.5110850930213928, 0.0, 0.7850839495658875, 0.06849199533462524],
                [0.0, 0.1792403906583786, 0.07107414305210114, 0.28744614124298096],
                # ... one box per prediction (the remaining boxes are omitted here)
            ],
            'classes': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
        },
        {
            'filename': '9999967_00000_d_0000018.jpg',
            # ... same keys as the first image
        },
        {
            'filename': '9999942_00000_d_0000234.jpg',
            # ... same keys as the first image
        },
    ],
    'filename_matrix': [
        # ... filenames for each cell of the confusion matrix (omitted here)
    ],
    'metrics': {
        'iou': 0.5,
        'confidence': 0.5,
    },
}

There is a lot of information in this dictionary, so let's break it down and explain each key.

For each attribute, we indicate the type and the dimensions in parentheses. n is the number of classes (in our example above, n=2).

  • categories (list, (1,n)), categorical labels (should correspond to your labelmap)

  • labels (list, (1,n)), list of your label names

  • total_objects (list, (1,n)), total number of predicted objects for each class

  • per_image_eval (list, (1, #images)), list of evaluation dictionaries for each image

    • filename (str) name of the image evaluated (must correspond to the name in the platform)

    • nb_objects (list, (1,n)) number of objects in the ground truth for each class

    • nb_preds (list, (1,n)) number of predicted objects for each class

    • TP (list, (1,n)) number of True Positives (objects predicted correctly) for each class

    • FP (list, (1,n)) number of False Positives (objects predicted wrongly) for each class

    • FN (list, (1,n)) number of False Negatives (objects not detected) for each class

    • bbox (array, (#predictions, 4)) list of bounding-box coordinates for each prediction; the coordinates must be normalized by the dimensions of the image and are ordered this way:

      • [ymin, xmin, ymax, xmax]

    • classes (list, (1, #predictions)) predicted classes for each bounding-box in the bbox list

  • filename_matrix (array, (n+1, n+1)) list of filenames of the images that have objects corresponding to each cell of your confusion matrix (the shape is (n+1, n+1) to cover the cases of false positives and false negatives)

  • metrics (dict) any additional evaluation metrics that you would like to display in Picsellia, each key is the name of the metric you want to display and the value is the value of the metric
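To make the per-image fields more concrete, here is a minimal sketch of how one entry of per_image_eval could be assembled. The helpers normalize_box and make_image_eval are hypothetical (they are not SDK functions). The sketch assumes that class ids start at 1, as in the example above, that boxes come in pixels as (xmin, ymin, xmax, ymax), and that each correct prediction matches exactly one ground-truth object, so that FP = nb_preds - TP and FN = nb_objects - TP.

```python
def normalize_box(pixel_box, width, height):
    """Convert (xmin, ymin, xmax, ymax) in pixels to normalized
    [ymin, xmin, ymax, xmax], the order described in the list above."""
    xmin, ymin, xmax, ymax = pixel_box
    return [ymin / height, xmin / width, ymax / height, xmax / width]


def make_image_eval(filename, n_classes, gt_classes, predictions):
    """Assemble one per-image evaluation dictionary.

    gt_classes: class id of every ground-truth object in the image.
    predictions: list of (class_id, normalized_box, is_correct) tuples.
    """
    nb_objects = [0.0] * n_classes
    nb_preds = [0.0] * n_classes
    tp = [0.0] * n_classes
    for class_id in gt_classes:
        nb_objects[class_id - 1] += 1
    for class_id, _box, is_correct in predictions:
        nb_preds[class_id - 1] += 1
        if is_correct:
            tp[class_id - 1] += 1
    return {
        'filename': filename,
        'nb_objects': nb_objects,
        'nb_preds': nb_preds,
        'TP': tp,
        'FP': [p - t for p, t in zip(nb_preds, tp)],
        'FN': [o - t for o, t in zip(nb_objects, tp)],
        'bbox': [box for _cls, box, _ok in predictions],
        'classes': [cls for cls, _box, _ok in predictions],
    }


# Toy image: two cars and one pedestrian annotated, three predictions.
box = normalize_box((10, 20, 50, 60), width=100, height=200)
entry = make_image_eval(
    'example.jpg', 2, [1, 1, 2],
    [(1, box, True), (1, box, False), (2, box, True)],
)
```

Each dictionary returned by make_image_eval would be one element of the per_image_eval list.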


Let's take a breath 😌 The process and the amount of data to send can seem a bit tedious, but your effort will be rewarded by a wonderful evaluation visualization that you will be proud of!
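The filename_matrix field is probably the least obvious one, so here is a minimal sketch of how it could be built. It assumes that each cell holds a list of filenames and that the matrix uses the same layout as the confusion matrix, with rows for ground truth, columns for predictions, and the last row and column for "no object". Both points are assumptions, and build_filename_matrix is a hypothetical helper, not an SDK function.

```python
def build_filename_matrix(n_classes, matches):
    """Group image filenames by confusion-matrix cell.

    matches: iterable of (filename, gt_index, pred_index) tuples, where
    None stands for "no object" (missed or spurious detection).
    """
    n = n_classes
    # One independent list of filenames per cell of the (n + 1) x (n + 1) grid
    matrix = [[[] for _ in range(n + 1)] for _ in range(n + 1)]
    for filename, gt, pred in matches:
        row = n if gt is None else gt
        col = n if pred is None else pred
        # Keep each image only once per cell
        if filename not in matrix[row][col]:
            matrix[row][col].append(filename)
    return matrix


toy_matches = [
    ('a.jpg', 0, 0),        # car predicted as car
    ('a.jpg', 0, 0),        # a second correct car in the same image
    ('b.jpg', 0, None),     # missed car
    ('b.jpg', None, 1),     # spurious pedestrian
]
filename_matrix_example = build_filename_matrix(2, toy_matches)
```

With this layout, clicking on a cell of the confusion matrix maps directly to the list stored at the same row and column.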

Now you just have to log the data dictionary with the name AND the type set to 'evaluation' so that it is processed correctly by the platform:

exp.log('evaluation', data, 'evaluation')
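Shape mistakes in this dictionary are easy to make, so a quick sanity check before calling exp.log can save a round trip. The sketch below only encodes the dimensions listed in the field descriptions above. check_evaluation is a hypothetical helper, not an SDK function, and the platform may apply other rules that are not covered here.

```python
def check_evaluation(data):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    n = len(data['categories'])
    for key in ('labels', 'total_objects'):
        if len(data[key]) != n:
            problems.append(f"{key} should have {n} entries")
    for item in data['per_image_eval']:
        name = item['filename']
        for key in ('nb_objects', 'nb_preds', 'TP', 'FP', 'FN'):
            if len(item[key]) != n:
                problems.append(f"{name}: {key} should have {n} entries")
        if len(item['bbox']) != len(item['classes']):
            problems.append(f"{name}: bbox and classes differ in length")
        for box in item['bbox']:
            if len(box) != 4 or not all(0.0 <= v <= 1.0 for v in box):
                problems.append(f"{name}: {box} is not 4 normalized values")
    matrix = data['filename_matrix']
    if len(matrix) != n + 1 or any(len(row) != n + 1 for row in matrix):
        problems.append(f"filename_matrix should be {n + 1}x{n + 1}")
    return problems


# Small, made-up dictionary with one image and one correct car prediction
sample = {
    'categories': ["1", "2"],
    'labels': ["car", "pedestrian"],
    'total_objects': [1, 0],
    'per_image_eval': [{
        'filename': 'img.jpg',
        'nb_objects': [1.0, 0.0], 'nb_preds': [1.0, 0.0],
        'TP': [1.0, 0.0], 'FP': [0.0, 0.0], 'FN': [0.0, 0.0],
        'bbox': [[0.1, 0.1, 0.3, 0.5]],
        'classes': [1],
    }],
    'filename_matrix': [[['img.jpg'], [], []], [[], [], []], [[], [], []]],
    'metrics': {'iou': 0.5},
}
problems = check_evaluation(sample)
```

An empty list means the dictionary matches the documented shapes; any message in the list points to the key to fix.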

Now, if we go to the 'eval' tab in our experiment dashboard, we should see something like this:

Everything looks correct 🎉 As we performed the evaluation on three images, we only have three rows in our table, but now you know how to do the whole process!

The good thing is that, once you have done it, you automatically get this awesome interactive visualization for each of your experiments, and that's the magic of Picsellia 🥑