Pytorch tutorial Deep Learning with Python


When I wrote this blog post (this Pytorch tutorial), I remembered the challenge I set for myself at the beginning of the year to learn deep learning, I did not even know Python at the time. What makes things difficult is not necessarily the complexity of the concepts, but it starts with questions like: What framework to use for deep learning? Which activation function should I choose? Which cost function is best suited for my problem?

My personal experience has been to study the PyTorch framework and especially for the theory the online course made available for free by Yan LeCun whom I thank (link here . I still have to learn and work in the field but through this blog post, I would like to share and give you an overview of what I have learned about deep learning this year.

Deep Learning: Which activation and cost function to choose?

The objective of this article is to sweep through this central topic in DeepLearning of choosing the activation and cost (loss) function according to the problem you are looking to solve by means of a DeepLearning algorithm.

We are talking about the activation function of the last layer of your model, that is to say the one that gives you the result.

This result is used in the algorithm which checks for each learning the difference between the result predicted by the neural network and the real result (gradient descent algorithm), for this we apply to the result a cost function (loss function) which represents this difference and which you seek to minimize as you practice in order to reduce the error. (See this article here to learn more :

Source :
L’algorithe de descente de Gradient permet de trouver le minimum de n’importe quelle fonction convexe pas à pas. – Pytorch tutorial

As a reminder, the machine learns by minimizing the cost function, iteratively by successive training steps, the result of the cost function and taken into account for the adjustment of the parameters of the neurons (weight and bias for example for linear layers) .

This choice of activation function on the last layer and the cost function to be minimized (loss function) is therefore crucial since these two elements combined will determine how you are going to solve a problem using your neural network. – Pytorch tutorial

This article also presents simple examples that use the Pytorch framework in my opinion a very good tool for machine learning.

The question here to ask is the following:

  • Am I looking to create a model that performs binary classification? (In this case you are trying to predict a probability that the result of an entry will be 0 or 1)
    Am I looking to calculate / predict a numeric value with my neural network? (In this case you are trying to predict a decimal value for example at the output that corresponds to your input)
  • Am I looking to create a model that performs single or multiple label classification for a set of classes? (In this case, you are trying to predict for each output class the probability that an input matches this class)
  • Am I looking to create a model that searches for multiple classes within a set of possible classes? (In this case, you are trying to predict for each class at the exit its attendance rate at the entrance).

Binary classification problem:

You are trying to predict by means of your DeepLearning algorithm whether a result is true or false, and more precisely it is a probability that a result of 1 or that of a result of 0 that you will get.

The output size of your neural network is 1 (final layer) and you seek to obtain a result between 0 and 1 which will be assimilated to the probability that the result is 1 (example at the output of the network if you obtain 0.65 this will correspond to 65% chance that the result is true).
Application example: Predicting the probability that the home team in a football match will win. The closer the value is to 1 the more the home team has a chance of winning, and conversely the closer the score is to 0 the more chance the home team has of losing.
It is possible to qualify this result using a threshold, for example by admitting that if the output is> 0.5 then the result is true if not false.

Final activation function (case of binary classification):

The final activation function should return a result between 0 and 1, the correct choice in this case may be the sigmoid function. The sigmoid function will easily translate a result between 0 and 1 and therefore is ideal for translating the probability that we are trying to predict.

If you want to plot the sigmoid function in Python here is some code that should help you (see also :

import math
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(x):
    a = []
    for item in x:
    return a
x = np.arange(-10., 10., 0.2)
sig = sigmoid(x)
Tracé de la fonction sigmoid, fonction utilisée comme fonction d’activation finale dans le cas d’un algorithme de Deep Learning – Pytorch tutorial
The cost function – Loss function (case of binary classification):

You have to determine during training the difference between the probability that the model predicts (translated via the final sigmoid function) and the true and known response (0 or 1). The function to use is Binary Cross Entrropy because this function allows to calculate the difference between 2 probability.
The optimizer will then use this result to adjust the weights and biases in your model (or other parameters depending on the architecture of your model).

Example :

In this example I will create a neural network with 1 linear layer and a final sigmoid activation function.

First, I perform all the imports that we will use in this post:

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from collections import OrderedDict
import math
from random import randrange
import torch 
from torch.autograd import Variable
import torch.nn as nn 
from torch.autograd import Function 
from torch.nn.parameter import Parameter 
from torch import optim 
import torch.nn.functional as F 
from torchvision import datasets, transforms
from echoAI.Activation.Torch.bent_id import BentID
from sklearn.model_selection import train_test_split
import sklearn.datasets
from sklearn.metrics import accuracy_score
import hiddenlayer as hl
import warnings
from torchviz import make_dot, make_dot_from_trace

The following lines allow you not to display the warnings in python:

import warnings

I will generate 1000 samples generated by the sklearn library allowing me to have a set of tests for a binary classification problem. The test set covers 2 decimal inputs and a binary output (0 or 1). I also display the result as a point cloud:

inputData,outputData = sklearn.datasets.make_moons(1000,noise=0.3) 
plt.scatter(inputData[:,0],inputData[:,1],s=40,c=outputData,cmap ="Spectral"))
Creation of 1000 samples with a result of 0 or 1 via sklearn

The test set generated here corresponds to a data set with 2 classes (class 0 and 1).

The goal of my neural network is therefore a binary classification of the input.

I will take 20% of this test set as test data and the remaining 80% as training data. I split the test set again to create a validation set on the training dataset.

X_train, X_test, y_train, y_test = train_test_split(inputData, outputData, test_size=0.20, random_state=42)
Input Data
Output data

In my example I’m going to create a network with any architecture what interests us is to show how the last layer works:

fc1 = nn.Linear(2, 2)
fc2 = nn.Linear(2, 1)

model = nn.Sequential(OrderedDict([
                      ('lin1', fc1),
                      ('BentID1', BentID()),
                      ('lin2', fc2),
                      ('sigmoid', nn.Sigmoid())
model = model.double()

I added a line to transform the pytorch model to use the double type. This avoids an error of the type:

Expected object of scalar type Double but got scalar type Float for argument

Also it will be necessary to use dtype = torch.double when creating torch.tensor at each training.

As indicated in my documents above the cost function is the Binary Cross Entropy function in this case (BCELoss : :

loss = nn.BCELoss()

I am using the Adam optimizer with a learning rate of 0.01:
Deep learning optimizers are algorithms or methods used to modify attributes of your neural network such as weights and bias in such a way that the loss function is minimized.

The learning rate is a hyperparameter that controls how much the model should be changed in response to the estimated error each time the model weights are updated.

Here I am using the Adam optimization algorithm. In Deep Learning Adam is a stochastic gradient descent method that calculates individual adaptive learning rates for different parameters from estimates of the first and second order moments of the gradients.
More information on optimizing Adam:
and in PyToch doc :

optimizer = optim.Adam(model.parameters(), lr=0.01)

I previously split the training data a second time to take 20% of the data as validation data. They allow us at each training to validate that we are not doing over-training (over-adjustment, or over-interpretation, or simply in English overfitting).

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.20, random_state=42)

In this example I have chosen to implement the EarlyStopping algorithm with a patience of 5. This means that if the cost function of the validation data increases during 15 training sessions (ie the distance between the prediction and the true data). After 15 training sessions with an increasing distance, we reload the previous data which represents the configuration of the neural network producing a distance of the minimum cost function (in the case of our example this consists in reloading the weight and the bias for the 2 layers linear). As long as the function decreases, the configuration of the neural network is saved (weight and bias of the linear layers in our example).

Here I have chosen to display the weights and bias of the 2 linear layers every 10 workouts, to give you an idea of how the Adam optimization algorithm works:

I used to display various graphs around training and validation metrics. The graphics are thus refreshed during use.

history1 = hl.History()
canvas1 = hl.Canvas()

Here is the main code of the learning loop:

bestvalidloss = np.inf
saveWeight1 = fc1.weight 
saveWeight2 = fc2.weight 
saveBias1 = fc1.bias
saveBias2 = fc2.bias
wait = 0
for epoch in range(1000):
    result = model(torch.tensor(X_train,dtype=torch.double))
    lossoutput = loss(result, torch.tensor(y_train,dtype=torch.double))
    print("EPOCH " + str(epoch) + "- train loss = " + str(lossoutput.item()))
        print("*************** PARAMETERS WEIGHT & BIAS *********************")
        print("weight linear1= " + str(fc1.weight))
        print('Bias linear1=' + str(fc1.bias))
        print("weight linear2= " + str(fc2.weight))
        print('bias linear2=' + str(fc2.bias))
    validpred = model(torch.tensor(X_valid,dtype=torch.double))
    validloss = loss(validpred, torch.tensor(y_valid,dtype=torch.double))
    print("EPOCH " + str(epoch) + "- valid loss = " + str(validloss.item()))
    # Store and plot train and valid loss.
    history1.log(epoch, trainloss=lossoutput.item(),validloss=validloss.item())
    canvas1.draw_plot([history1["trainloss"], history1["validloss"]])
    if(validloss.item() < bestvalidloss):
        bestvalidloss = validloss.item()
        saveWeight1 = fc1.weight 
        saveWeight2 = fc2.weight 
        saveBias1 = fc1.bias
        saveBias2 = fc2.bias
        wait = 0
        wait += 1
        if(wait > 15):
            #Restauration des mailleurs parametre et early stopping
            fc1.weight = saveWeight1
            fc2.weight = saveWeight2
            fc1.bias = saveBias1
            fc2.bias = saveBias2
            print("stop valid because loss is increasing (EARLY STOPPING) afte EPOCH=" + str(epoch))
            print("BEST VALID LOSS = " + str(bestvalidloss))
            print("BEST PARAMETERS WHEIGHT AND BIAS = ")
            print("FIRST LINEAR : WEIGHT=" + str(saveWeight1) + " BIAS = " + str(saveBias1))
            print("SECOND LINEAR : WEIGHT=" + str(saveWeight2) + " BIAS = " + str(saveBias2))

Here is an overview of the result, training stops after about 200 epochs (an “epoch” is a term used in machine learning to refer to a passage of the complete training data set). In my example I didn’t use a batch to load the data so the epoch count is the iteration count.

Evolution of the cost function on the training test and the validation test

Here’s a look at the prediction results:

result = model(torch.tensor(X_test,dtype=torch.double))
plt.scatter(X_test[:,0],X_test[:,1],s=40,c=y_test,cmap ="Spectral"))
plt.scatter(X_test[:,0],X_test[:,1],s=40,,cmap ="Spectral"))
Results: The first graph shows the “true” answer the second shows the predicted answer (the color varies according to the predicted probability) – Pytorch tutorial

To display the architecture of the neural network I used the hidden layer library which requires the installation of graphviz to avoid the following error:

RuntimeError: Make sure the Graphviz executables are on your system's path

To resolve this error see

On Mac OS X, for example, I installed via Homebrew:

brew install graphviz

Here is an overview of the hidden layer graph:

# HiddenLayer graph
hl.build_graph(model, torch.zeros(2,dtype=torch.double))
Hidden Layer architecture pour le classifier binaire – Pytorch tutorial

I also used PyTorchViz to display a graph of Pytorch operations during the forward of an entry and thus we can clearly see what will happen pass during the backward (ie the application the calculation of the dols / dx differential for all the parameters x of the network which has requires_grad = True).

make_dot(model(torch.ones(2,dtype=torch.double)), params=dict(model.named_parameters()))
Architecture diagram of the neural network created with PyTorch – Pytorch tutorial

Regression problem (numeric / decimal value calculation):

In this case, you are trying to predict a continuous numerical quantity using your DeepLearning algorithm.
In this case, the gradient descent algorithm consists in comparing the difference between the numerical value predicted by the network and the true value.
The output size of the network will be 1 since there is only one value to predict.

Final activation function (decimal or numeric value case):

The activation function to use in this case depends on the range in which your data is located.

For a data between -infinite and + infinite then you can use a linear function at the output of your network.

Linear function graph function. The particularity is that this linear function (of the type y = ax + b) contains learnable parameters (weight and bias which are modified by the optimizer over the training sessions, i.e. y = weight * x + bias).

Source Linear functions.

You can also use the Relu function if your value to predict is strictly positive
the output of ReLu is the maximum value between zero and the input value. An output is zero when the input value is negative and the input value when the input is positive.
Note that unlike the rectified linear activation function Relu does not have an adjustable parameter (learnable).

Fonction d’activation ReLu Source =

Another possible function is PReLu (parametric linear rectification unit):

PReLu activation function

Also another possible final activation function is Bent identity.

Bent identity activation function Source

Finally I recommend the possible use of Parametric Soft Exponential, I use the echoAi implementation because it is not natively in PyTorch

Soft Exponential
The cost function (regression case, calculation of numerical value):

The cost function which makes it possible to determine the distance between the predicted value and the real value is the average of the squared distance between the 2 predictions. The function to use is mean squared error (MSE):

Here is a simple example that illustrates the use of each of the final functions:

I started off with the example of house prices which I then adapted to test with each of the functions.

Example: Loading test data

First, I import the necessary libraries:

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from collections import OrderedDict
import math
from random import randrange
import torch 
from torch.autograd import Variable
import torch.nn as nn 
from torch.autograd import Function 
from torch.nn.parameter import Parameter 
from torch import optim 
import torch.nn.functional as F 
from torchvision import datasets, transforms
from echoAI.Activation.Torch.bent_id import BentID
from echoAI.Activation.Torch.soft_exponential import SoftExponential
from sklearn.model_selection import train_test_split
import sklearn.datasets
from sklearn.metrics import accuracy_score
import hiddenlayer as hl
import warnings
from torchviz import make_dot, make_dot_from_trace
from sklearn.datasets import make_regression
from torch.autograd import Variable

I create the dataset which is quite simple and display it:

house_prices_array = [30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140]
house_price_np = np.array(house_prices_array, dtype=np.float32)
house_price_np = house_price_np.reshape(-1,1)
house_price_tensor = Variable(torch.from_numpy(house_price_np))
house_size = [ 7.5, 7, 6.5, 6.0, 5.5, 5.0, 4.5,3.5,3.2,2.8,3.0,2.5]
house_size_np = np.array(house_size, dtype=np.float32)
house_size_np = house_size_np.reshape(-1, 1)
house_size_tensor = Variable(torch.from_numpy(house_size_np))
import matplotlib.pyplot as plt
plt.scatter(house_prices_array, house_size_np)
plt.xlabel("House Price $")
plt.ylabel("House Sizes")
plt.title("House Price $ VS House Size")

In the example, we see that the function to find is close to
f (x) = – 0.05 * x + 9
Example: – 0.05 * 40 + 9 = 7 and -0.05 * 30 + 9 = 7.5

I set the bias and the weight to -0.01 and 8 to limit the training time.

Linear activation function (Solving regression problem):
fc1 = nn.Linear(1, 1)

model = nn.Sequential(OrderedDict([
                       ('lin', fc1)

model = model.float()

I declare MSELoss and an Adam optimizer tuned with a learning rate 0.01

loss = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

Here is the training loop:

history1 = hl.History()
canvas1 = hl.Canvas()
for epoch in range(100):
    result = model(house_price_tensor)
    lossoutput = loss(result, house_size_tensor)
    print("EPOCH " + str(epoch) + "- train loss = " + str(lossoutput.item()))
    history1.log(epoch, trainloss=lossoutput.item())

After a hundred training sessions we obtain an MSELoss = 0.13

If I display the weights and biases found by the model I get in my example:

print("weight" + str(fc1.weight))
print("bias" + str(fc1.bias))

The network therefore executes here the function f (x) = -0.0408 * x + 8.1321

I then display the predicted result to compare it with the actual result:

result = model(house_price_tensor)
plt.scatter(house_prices_array, house_size_np)

Here is a preview of the architecture diagram for a simple Pytorch neuron network with Linear function, we see the multiplication operator and the addition operator to execute y = ax + b.

hl.build_graph(model, torch.ones(1,dtype=torch.float))
Diagramme of the neural network pytorch – Pytorch tutorial

ReLu activation function (Regression problem solving):

ReLu is not an activation function with “learnable” parameters (modified by the optimizer) so I add a linear layer upstream for the test:

The result is identical, which is logical since in my case reLu only passes the result which is strictly positive

In the previous example, a ReLu layer must be added at the output after the linear layer:

fc1 = nn.Linear(1, 1)

model = nn.Sequential(OrderedDict([
                       ('lin', fc1),

model = model.float()
Model architecture with linear layer + ReLu layer

The same for PrRelu in my case:
These two functions simply make a flat pass between the linear function and the output. ReLu remains interesting if you want to set to 0 all the outputs concerning a negative input.

Bent Identity Function (Regression Problem Solving):

Bent identity there is no learning parameter but the function does not just forward the input it applies the curved identity function to it which slightly modifies the result:

fc1 = nn.Linear(1, 1)

model = nn.Sequential(OrderedDict([
                       ('lin', fc1),

model = model.float()

In our case after 100 training sessions the network found the following weights for the linear function upstream of bent identity the final loss is 0.78 so less good than with the linear function (the network converges less quickly):

In our case, the function applied by the network will therefore be:

-0.0481 ((math.sqt(x**2 + 1) -1)/2 + x) + 7.7386

Result obtained with Bent Identity:

hl.build_graph(model, torch.ones(1,dtype=torch.float))

I display the architecture of the Pytorch network which is now linear layer + bent identity layer :

Also :

make_dot(model(torch.ones(1,dtype=torch.float)), params=dict(model.named_parameters()))
Soft Exponential – (Regression problem solving)

Soft Exponential has a trainable alpha parameter.We are going to place a linear function upstream and softExponential in output and check the approximation performed in this case:

fc1 = nn.Linear(1, 1)
fse = SoftExponential(1,torch.tensor([-0.9],dtype=torch.float))
model = nn.Sequential(OrderedDict([
                       ('lin', fc1),

model = model.float()

After 100 training sessions the results on the linear layer parameters and the soft exponential layer parameters are as follows:

print("weight" + str(fc1.weight))
print("bias" + str(fc1.bias))
print("fse alpha" + str(fse.alpha))

The result is not really good since the approximation is not done in the right direction,

We therefore notice that we could invert the function in our case to define a Soft Exponential customize activation function with a bias criterion and that is what we will do here.

Custom function – (Regression problem solving)

We declare a custom activation function to PyTorch compared to the original soft exponential I added the beta bias criterion as well as the torch.div inversion (1, …)

class SoftExponential2(nn.Module):
    def __init__(self, in_features, alpha=None,beta=None):
        super(SoftExponential2, self).__init__()
        self.in_features = in_features
        # initialize alpha
        if alpha is None:
            self.alpha = Parameter(torch.tensor(0.0))  # create a tensor out of alpha
            self.alpha = Parameter(torch.tensor(alpha))  # create a tensor out of alpha
        if beta is None:
            self.beta = Parameter(torch.tensor(0.0))  # create a tensor out of alpha
            self.beta = Parameter(torch.tensor(beta))  # create a tensor out of alpha

        self.alpha.requiresGrad = True  # set requiresGrad to true!
        self.beta.requiresGrad = True  # set requiresGrad to true!

    def forward(self, x):
        if self.alpha == 0.0:
            return x

        if self.alpha < 0.0:
            return torch.add(torch.div(1,(-torch.log(1 - self.alpha * (x + self.alpha)) / self.alpha)),self.beta)

        if self.alpha > 0.0:
            return torch.add(torch.div(1,((torch.exp(self.alpha * x) - 1) / self.alpha + self.alpha)),self.beta)

We now have 2 parameters that can be trained in this custom function in Pytorch.

By also lowering the learning rate to 0.01 after 100 training sessions and initializing alpha = 0 .1 and beta = 0.7 I arrive at a loss <5

fse = SoftExponential2(1,torch.tensor([0.1],dtype=torch.float),torch.tensor([7.0],dtype=torch.float))

model = nn.Sequential(OrderedDict([

model = model.float()
print("fse alpha" + str(fse.alpha))
print("fse beta" + str(fse.beta))

Network architecture applied for my custom soft exponentialt inversion function with addition of a bias criterion.

Categorization problem (predict a class among several classes possible) – single-label classifier with pytorch

You are looking to predict the probability that your entry will match a single class and exit betting multiple possible classes.
For example you want to predict the type of vehicle on an image allowed the classes: car, truck, train

The dimensioning of your output network corresponds to n neurons for n classes. And for each output neurons corresponding to a possible class to be predicted, we will obtain a probability between 0 and 1 which will represent the probability that the input corresponds to this class at the output. So for each class this comes down to solving a binary classification problem in part 1 but in addition to which it must be considered that an input can only correspond to a single output class.

Activation function Categorization problem (predict a class among several possible classes):

The activation function to be used on the final layer is a Softfmax function with n dimensions corresponding to the number of classes to be predicted.
Softmax is a mathematical function that converts a vector of numbers (tensor) into a vector of probabilities, where the probabilities of each value are proportional to the relative scale of each value in the vector.


In other words, for an output tensor it will return a probability for each class by scaling each of them so that their sum is equal to one.

Cost function – Categorization problem (predict a class among several possible classes):

The cost function to be used is close to that used in the case of binary classification but with this notion of probability vector.
Cross Entropy will evaluate the difference between 2 probability distributions. We will therefore use it to compare the predicted value and the true value.


Based on the tutorial and the data set on this page we have:

For info if you get the following error:

ImportError: IProgress not found. Please update jupyter and ipywidgets. See

You need to setup Iprogress on jupyter lab

pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension

In our example we will first retrieve the input dataset
The dataset used is CIFAR-10, more information here:

This is already integrated with Pytorch to allow us to perform certain tests.

Exemple (problème de catégorisation) :

Preparation of the dataset:

import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose(
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader =, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader =, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

We then define a function which allows to visualize the imshow image and we display for each image the unique label associated with it:

import matplotlib.pyplot as plt
import numpy as np

# functions to show an image

def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))

# get some random training images
dataiter = iter(trainloader)
images, labels =

# show images
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))

Create a convolutional neural network (more information here:

A convolutional neural network (ConvNet or CNN) is a DeepLearning algorithm that can take an image as input, it sets a score (weight and bias which are learnable parameters) to various aspects / objects of the image and be able to differentiate one from the other.


The architecture of a convolutional neuron network is close to that of the model of neuron connectivity in the human brain and was inspired by the organization of the visual cortex.

Individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. A collection of these fields overlap to cover the entire visual area.

In our case we are going to use with Pytorch a Conv2d layer and a pooling layer. : The Conv2d layer is the 2D convolution layer.

Source :

A filter in a conv2D layer has a height and a width. They are often smaller than the input image. This filter therefore moves over the entire image during training (this area is called the receptive field).

The Max Pooling layer is a sampling process. The objective is to sub-sample an input representation (image for example), by reducing its size and by making assumptions on the characteristics contained in the grouped sub-regions.

In my example with PyTorch the declaration is made :

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

Define the criterion of the cost function with CrossEntropyLoss, here we use the SGD opimizer rather than adam (more info here on the comparison between optimizer

import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

The training loop:

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

The Pytorch neuron network is saved

PATH = './cifar_net.pth', PATH)

We load a test and use the network to predict the outcome.

dataiter = iter(testloader)
images, labels =

# print images
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))

In this example the “true” labels are “Cat, ship, ship, plane”, we then launch the network to make a prediction:

net = Net()
outputs = net(images)
_, predicted = torch.max(outputs, 1)
print('Predicted: ', ' '.join('%5s' % classes[predicted[j]]
                              for j in range(4)))

Here is an overview of the architecture of this convolution network.

import hiddenlayer as hl
from torchviz import make_dot, make_dot_from_trace
make_dot(net(images), params=dict(net.named_parameters()))

Categorization problem (predict several class among several classes possible) – multiple-label classifier with pytorch – Pytorch tutorial

Overall, it is about predicting several probabilities for each of the classes to indicate their probabilities of presence in the entry. One possible use is to indicate the presence of an object in an image.

The problem then comes back to a problem of binary classification for n classes.

The final activation function is sigmoid and the loss function is Binary cross entropy

This internet example perfectly illustrates the use of BCELoss in the case of the prediction of several classes among several possible classes.

In this example: The image dataset used is the CelebFaces Large Scale Attribute Dataset (CelebA).

In this data dataset there are 200K images with 40 different class labels and each image has a different background footprint and there are a lot of different variations making it difficult for a model to classify each label effectively class.

I suggest you follow the tutorial from the article

  • Step1: Download the file hard site (it is in a google drive)
  • Step2: Download the list_attr_celeba.txt file which contains the annotations for each image on the same site
  • Step3: Opening the file annotations you can see the 40 labels: 5_o_Clock_Shadow Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Big_Lips Big_Nose Black_Hair Blond_Hair Blurry Brown_Hair Bushy_Eyebrows Chubby Double_Chin Eyeglasses Goatee Gray_Hair Heavy_Makeup High_Cheekbones Male Mouth_Slightly_Open Mustache Narrow_Eyes No_Beard Oval_Face Pale_Skin Pointy_Nose Receding_Hairline Rosy_Cheeks Sideburns Smiling Straight_Hair Wavy_Hair Wearing_Earrings Wearing_Hat Wearing_Lipstick Wearing_Necklace Wearing_Necktie Young
    For each of the images present in the test set there is a value of 1 or -1 specifying for image whether the class is present in the image.
    The objective here will be to predict for each of the classes the probability of its presence in the image.
  • Step4: you can load the notebook which is present on

Here is the results:

Load of the dataset :

Use of the imshow function (see previous example for single label prediction)

Architecture of the convolutional neuron network:

import torch.nn.functional as F
class MultiClassifier(nn.Module):
    def __init__(self):
        super(MultiClassifier, self).__init__()
        self.ConvLayer1 = nn.Sequential(
            nn.Conv2d(3, 64, 3), # 3, 256, 256
            nn.MaxPool2d(2), # op: 16, 127, 127
            nn.ReLU(), # op: 64, 127, 127
        self.ConvLayer2 = nn.Sequential(
            nn.Conv2d(64, 128, 3), # 64, 127, 127   
            nn.MaxPool2d(2), #op: 128, 63, 63
            nn.ReLU() # op: 128, 63, 63
        self.ConvLayer3 = nn.Sequential(
            nn.Conv2d(128, 256, 3), # 128, 63, 63
            nn.MaxPool2d(2), #op: 256, 30, 30
            nn.ReLU() #op: 256, 30, 30
        self.ConvLayer4 = nn.Sequential(
            nn.Conv2d(256, 512, 3), # 256, 30, 30
            nn.MaxPool2d(2), #op: 512, 14, 14
            nn.ReLU(), #op: 512, 14, 14
        self.Linear1 = nn.Linear(512 * 14 * 14, 1024)
        self.Linear2 = nn.Linear(1024, 256)
        self.Linear3 = nn.Linear(256, 40)
    def forward(self, x):
        x = self.ConvLayer1(x)
        x = self.ConvLayer2(x)
        x = self.ConvLayer3(x)
        x = self.ConvLayer4(x)
        x = x.view(x.size(0), -1)
        x = self.Linear1(x)
        x = self.Linear2(x)
        x = self.Linear3(x)
        return F.sigmoid(x)

An overview of the network architecture:

SUMMARY – Pytorch tutorial :

Here is a summary of my pytorch tutorial : sheet that I created to allow you to choose the right activation function and the right cost function more quickly according to your problem to be solved.

Pytorch tutorial : THE 5 BEST DEEP LEARNING LINKS : Dree courses Yann LeCun : Official web site PyToch : Advanced on deep learning : Free MOOC

Pytorch tutorial – internal links