When I wrote this blog post (this Pytorch tutorial), I remembered the challenge I set for myself at the beginning of the year to learn deep learning, I did not even know Python at the time. What makes things difficult is not necessarily the complexity of the concepts, but it starts with questions like: What framework to use for deep learning? Which activation function should I choose? Which cost function is best suited for my problem?
My personal experience has been to study the PyTorch framework and especially for the theory the online course made available for free by Yan LeCun whom I thank (link here Website fr.wikipedia.org) . I still have to learn and work in the field but through this blog post, I would like to share and give you an overview of what I have learned about deep learning this year.
Deep Learning: Which activation and cost function to choose?
The objective of this article is to sweep through this central topic in DeepLearning of choosing the activation and cost (loss) function according to the problem you are looking to solve by means of a DeepLearning algorithm.
We are talking about the activation function of the last layer of your model, that is to say the one that gives you the result.
This result is used in the algorithm which checks for each learning the difference between the result predicted by the neural network and the real result (gradient descent algorithm), for this we apply to the result a cost function (loss function) which represents this difference and which you seek to minimize as you practice in order to reduce the error. (See this article here to learn more : Website machinelearnia.com)
As a reminder, the machine learns by minimizing the cost function, iteratively by successive training steps, the result of the cost function and taken into account for the adjustment of the parameters of the neurons (weight and bias for example for linear layers) .
This choice of activation function on the last layer and the cost function to be minimized (loss function) is therefore crucial since these two elements combined will determine how you are going to solve a problem using your neural network.
This article also presents simple examples that use the Pytorch framework in my opinion a very good tool for machine learning.
The question here to ask is the following:
- Am I looking to create a model that performs binary classification? (In this case you are trying to predict a probability that the result of an entry will be 0 or 1)
Am I looking to calculate / predict a numeric value with my neural network? (In this case you are trying to predict a decimal value for example at the output that corresponds to your input)
- Am I looking to create a model that performs single or multiple label classification for a set of classes? (In this case, you are trying to predict for each output class the probability that an input matches this class)
- Am I looking to create a model that searches for multiple classes within a set of possible classes? (In this case, you are trying to predict for each class at the exit its attendance rate at the entrance).
Binary classification problem:
You are trying to predict by means of your DeepLearning algorithm whether a result is true or false, and more precisely it is a probability that a result of 1 or that of a result of 0 that you will get.
The output size of your neural network is 1 (final layer) and you seek to obtain a result between 0 and 1 which will be assimilated to the probability that the result is 1 (example at the output of the network if you obtain 0.65 this will correspond to 65% chance that the result is true).
Application example: Predicting the probability that the home team in a football match will win. The closer the value is to 1 the more the home team has a chance of winning, and conversely the closer the score is to 0 the more chance the home team has of losing.
It is possible to qualify this result using a threshold, for example by admitting that if the output is> 0.5 then the result is true if not false.
Final activation function (case of binary classification):
The final activation function should return a result between 0 and 1, the correct choice in this case may be the sigmoid function. The sigmoid function will easily translate a result between 0 and 1 and therefore is ideal for translating the probability that we are trying to predict.
If you want to plot the sigmoid function in Python here is some code that should help you (see alsoWebsite squall0032.tumblr.com) :
import math import numpy as np import matplotlib.pyplot as plt def sigmoid(x): a =  for item in x: a.append(1/(1+math.exp(-item))) return a x = np.arange(-10., 10., 0.2) sig = sigmoid(x) plt.plot(x,sig) plt.show()
The cost function – Loss function (case of binary classification):
You have to determine during training the difference between the probability that the model predicts (translated via the final sigmoid function) and the true and known response (0 or 1). The function to use is Binary Cross Entrropy because this function allows to calculate the difference between 2 probability.
The optimizer will then use this result to adjust the weights and biases in your model (or other parameters depending on the architecture of your model).
In this example I will create a neural network with 1 linear layer and a final sigmoid activation function.
First, I perform all the imports that we will use in this post:
import numpy as np import pandas as pd import matplotlib.pyplot as plt from collections import OrderedDict import math from random import randrange import torch from torch.autograd import Variable import torch.nn as nn from torch.autograd import Function from torch.nn.parameter import Parameter from torch import optim import torch.nn.functional as F from torchvision import datasets, transforms from echoAI.Activation.Torch.bent_id import BentID from sklearn.model_selection import train_test_split import sklearn.datasets from sklearn.metrics import accuracy_score import hiddenlayer as hl import warnings warnings.filterwarnings("ignore") from torchviz import make_dot, make_dot_from_trace
The following lines allow you not to display the warnings in python:
import warnings warnings.filterwarnings("ignore")
I will generate 1000 samples generated by the sklearn library allowing me to have a set of tests for a binary classification problem. The test set covers 2 decimal inputs and a binary output (0 or 1). I also display the result as a point cloud:
inputData,outputData = sklearn.datasets.make_moons(1000,noise=0.3) plt.scatter(inputData[:,0],inputData[:,1],s=40,c=outputData,cmap = plt.cm.get_cmap("Spectral"))
The test set generated here corresponds to a data set with 2 classes (class 0 and 1).
The goal of my neural network is therefore a binary classification of the input.
I will take 20% of this test set as test data and the remaining 80% as training data. I split the test set again to create a validation set on the training dataset.
X_train, X_test, y_train, y_test = train_test_split(inputData, outputData, test_size=0.20, random_state=42)
In my example I’m going to create a network with any architecture what interests us is to show how the last layer works:
- input layer: linear activation function
- hidden layers: activation function: bent identity
- activation function that I import from the EchoAI package which implements additional activation functions for Pytorch (Website en.wikipedia.org and Website en.wikipedia.org).
linear activation function: output layer with an activation function chosen sigmoid as explained above, the objective is to reduce the result to a probability that the input is of class 0 or 1. (Website pytorch.org)
fc1 = nn.Linear(2, 2) fc2 = nn.Linear(2, 1) model = nn.Sequential(OrderedDict([ ('lin1', fc1), ('BentID1', BentID()), ('lin2', fc2), ('sigmoid', nn.Sigmoid()) ])) model = model.double()
I added a line to transform the pytorch model to use the double type. This avoids an error of the type:
Expected object of scalar type Double but got scalar type Float for argument
Also it will be necessary to use dtype = torch.double when creating torch.tensor at each training.
As indicated in my documents above the cost function is the Binary Cross Entropy function in this case (BCELoss : Website pytorch.org) :
loss = nn.BCELoss()
I am using the Adam optimizer with a learning rate of 0.01:
Deep learning optimizers are algorithms or methods used to modify attributes of your neural network such as weights and bias in such a way that the loss function is minimized.
The learning rate is a hyperparameter that controls how much the model should be changed in response to the estimated error each time the model weights are updated.
Here I am using the Adam optimization algorithm. In Deep Learning Adam is a stochastic gradient descent method that calculates individual adaptive learning rates for different parameters from estimates of the first and second order moments of the gradients.
More information on optimizing Adam: Website machinelearningmastery.com
and in PyToch doc : Website pytorch.org
optimizer = optim.Adam(model.parameters(), lr=0.01)
I previously split the training data a second time to take 20% of the data as validation data. They allow us at each training to validate that we are not doing over-training (over-adjustment, or over-interpretation, or simply in English overfitting).
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.20, random_state=42)
In this example I have chosen to implement the EarlyStopping algorithm with a patience of 5. This means that if the cost function of the validation data increases during 15 training sessions (ie the distance between the prediction and the true data). After 15 training sessions with an increasing distance, we reload the previous data which represents the configuration of the neural network producing a distance of the minimum cost function (in the case of our example this consists in reloading the weight and the bias for the 2 layers linear). As long as the function decreases, the configuration of the neural network is saved (weight and bias of the linear layers in our example).
Here I have chosen to display the weights and bias of the 2 linear layers every 10 workouts, to give you an idea of how the Adam optimization algorithm works:
I used Website github.com to display various graphs around training and validation metrics. The graphics are thus refreshed during use.
history1 = hl.History() canvas1 = hl.Canvas()
Here is the main code of the learning loop:
bestvalidloss = np.inf saveWeight1 = fc1.weight saveWeight2 = fc2.weight saveBias1 = fc1.bias saveBias2 = fc2.bias wait = 0 for epoch in range(1000): model.train() optimizer.zero_grad() result = model(torch.tensor(X_train,dtype=torch.double)) lossoutput = loss(result, torch.tensor(y_train,dtype=torch.double)) lossoutput.backward() optimizer.step() print("EPOCH " + str(epoch) + "- train loss = " + str(lossoutput.item())) if((epoch+1)%10==0): #AFFICHE LES PARAMETRES DU RESEAU TOUT LES 10 entrainements print("*************** PARAMETERS WEIGHT & BIAS *********************") print("weight linear1= " + str(fc1.weight)) print('Bias linear1=' + str(fc1.bias)) print("weight linear2= " + str(fc2.weight)) print('bias linear2=' + str(fc2.bias)) print("**************************************************************") model.eval() validpred = model(torch.tensor(X_valid,dtype=torch.double)) validloss = loss(validpred, torch.tensor(y_valid,dtype=torch.double)) print("EPOCH " + str(epoch) + "- valid loss = " + str(validloss.item())) # Store and plot train and valid loss. history1.log(epoch, trainloss=lossoutput.item(),validloss=validloss.item()) canvas1.draw_plot([history1["trainloss"], history1["validloss"]]) if(validloss.item() < bestvalidloss): bestvalidloss = validloss.item() # saveWeight1 = fc1.weight saveWeight2 = fc2.weight saveBias1 = fc1.bias saveBias2 = fc2.bias wait = 0 else: wait += 1 if(wait > 15): #Restauration des mailleurs parametre et early stopping fc1.weight = saveWeight1 fc2.weight = saveWeight2 fc1.bias = saveBias1 fc2.bias = saveBias2 print("##############################################################") print("stop valid because loss is increasing (EARLY STOPPING) afte EPOCH=" + str(epoch)) print("BEST VALID LOSS = " + str(bestvalidloss)) print("BEST PARAMETERS WHEIGHT AND BIAS = ") print("FIRST LINEAR : WEIGHT=" + str(saveWeight1) + " BIAS = " + str(saveBias1)) print("SECOND LINEAR : WEIGHT=" + str(saveWeight2) + " BIAS = " + str(saveBias2)) print("##############################################################") break else: continue
Here is an overview of the result, training stops after about 200 epochs (an « epoch » is a term used in machine learning to refer to a passage of the complete training data set). In my example I didn’t use a batch to load the data so the epoch count is the iteration count.
Evolution of the cost function on the training test and the validation test
Here’s a look at the prediction results:
result = model(torch.tensor(X_test,dtype=torch.double)) plt.scatter(X_test[:,0],X_test[:,1],s=40,c=y_test,cmap = plt.cm.get_cmap("Spectral")) plt.title("True") plt.colorbar() plt.show() plt.scatter(X_test[:,0],X_test[:,1],s=40,c=result.data.numpy(),cmap = plt.cm.get_cmap("Spectral")) plt.title("predicted") plt.colorbar() plt.show()
To display the architecture of the neural network I used the hidden layer library which requires the installation of graphviz to avoid the following error:
RuntimeError: Make sure the Graphviz executables are on your system's path
To resolve this error see Website graphviz.org
On Mac OS X, for example, I installed via Homebrew:
brew install graphviz
Here is an overview of the hidden layer graph:
# HiddenLayer graph hl.build_graph(model, torch.zeros(2,dtype=torch.double))
I also used PyTorchViz Website github.com to display a graph of Pytorch operations during the forward of an entry and thus we can clearly see what will happen pass during the backward (ie the application the calculation of the dols / dx differential for all the parameters x of the network which has requires_grad = True).
Regression problem (numeric / decimal value calculation):
In this case, you are trying to predict a continuous numerical quantity using your DeepLearning algorithm.
In this case, the gradient descent algorithm consists in comparing the difference between the numerical value predicted by the network and the true value.
The output size of the network will be 1 since there is only one value to predict.
Final activation function (decimal or numeric value case):
The activation function to use in this case depends on the range in which your data is located.
For a data between -infinite and + infinite then you can use a linear function at the output of your network.
Linear function graph function. Website pytorch.org. The particularity is that this linear function (of the type y = ax + b) contains learnable parameters (weight and bias which are modified by the optimizer over the training sessions, i.e. y = weight * x + bias).
You can also use the Relu function if your value to predict is strictly positive
the output of ReLu is the maximum value between zero and the input value. An output is zero when the input value is negative and the input value when the input is positive.
Note that unlike the rectified linear activation function Relu does not have an adjustable parameter (learnable).
Another possible function is PReLu (parametric linear rectification unit):
Also another possible final activation function is Bent identity.
Finally I recommend the possible use of Parametric Soft Exponential, I use the echoAi implementation because it is not natively in PyTorch
The cost function (regression case, calculation of numerical value):
The cost function which makes it possible to determine the distance between the predicted value and the real value is the average of the squared distance between the 2 predictions. The function to use is mean squared error (MSE): Website pytorch.org
Here is a simple example that illustrates the use of each of the final functions:
I started off with the example of house prices which I then adapted to test with each of the functions.
Example: Loading test data
First, I import the necessary libraries:
import numpy as np import pandas as pd import matplotlib.pyplot as plt from collections import OrderedDict import math from random import randrange import torch from torch.autograd import Variable import torch.nn as nn from torch.autograd import Function from torch.nn.parameter import Parameter from torch import optim import torch.nn.functional as F from torchvision import datasets, transforms from echoAI.Activation.Torch.bent_id import BentID from echoAI.Activation.Torch.soft_exponential import SoftExponential from sklearn.model_selection import train_test_split import sklearn.datasets from sklearn.metrics import accuracy_score import hiddenlayer as hl import warnings warnings.filterwarnings("ignore") from torchviz import make_dot, make_dot_from_trace from sklearn.datasets import make_regression from torch.autograd import Variable
I create the dataset which is quite simple and display it:
house_prices_array = [30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140] house_price_np = np.array(house_prices_array, dtype=np.float32) house_price_np = house_price_np.reshape(-1,1) house_price_tensor = Variable(torch.from_numpy(house_price_np)) house_size = [ 7.5, 7, 6.5, 6.0, 5.5, 5.0, 4.5,3.5,3.2,2.8,3.0,2.5] house_size_np = np.array(house_size, dtype=np.float32) house_size_np = house_size_np.reshape(-1, 1) house_size_tensor = Variable(torch.from_numpy(house_size_np)) import matplotlib.pyplot as plt plt.scatter(house_prices_array, house_size_np) plt.xlabel("House Price $") plt.ylabel("House Sizes") plt.title("House Price $ VS House Size") plt.show()
In the example, we see that the function to find is close to
f (x) = – 0.05 * x + 9
Example: – 0.05 * 40 + 9 = 7 and -0.05 * 30 + 9 = 7.5
I set the bias and the weight to -0.01 and 8 to limit the training time.
Linear activation function (Solving regression problem):
fc1 = nn.Linear(1, 1) model = nn.Sequential(OrderedDict([ ('lin', fc1) ])) model = model.float() fc1.weight.data.fill_(-0.01) fc1.bias.data.fill_(8)
I declare MSELoss and an Adam optimizer tuned with a learning rate 0.01
loss = nn.MSELoss() optimizer = optim.Adam(model.parameters(), lr=0.01)
Here is the training loop:
history1 = hl.History() canvas1 = hl.Canvas() for epoch in range(100): model.train() optimizer.zero_grad() result = model(house_price_tensor) lossoutput = loss(result, house_size_tensor) lossoutput.backward() optimizer.step() print("EPOCH " + str(epoch) + "- train loss = " + str(lossoutput.item())) history1.log(epoch, trainloss=lossoutput.item()) canvas1.draw_plot([history1["trainloss"]])
After a hundred training sessions we obtain an MSELoss = 0.13
If I display the weights and biases found by the model I get in my example:
print("weight" + str(fc1.weight)) print("bias" + str(fc1.bias))
The network therefore executes here the function f (x) = -0.0408 * x + 8.1321
I then display the predicted result to compare it with the actual result:
result = model(house_price_tensor) plt.scatter(house_prices_array, house_size_np) plt.title("True") plt.show() plt.scatter(house_prices_array,result.detach().numpy()) plt.title("predicted") plt.show()
Here is a preview of the architecture diagram for a simple Pytorch neuron network with Linear function, we see the multiplication operator and the addition operator to execute y = ax + b.
ReLu activation function (Regression problem solving):
ReLu is not an activation function with « learnable » parameters (modified by the optimizer) so I add a linear layer upstream for the test:
The result is identical, which is logical since in my case reLu only passes the result which is strictly positive
In the previous example, a ReLu layer must be added at the output after the linear layer:
fc1 = nn.Linear(1, 1) model = nn.Sequential(OrderedDict([ ('lin', fc1), ('relu',nn.ReLU()) ])) model = model.float() fc1.weight.data.fill_(-0.01) fc1.bias.data.fill_(8)
The same for PrRelu in my case:
These two functions simply make a flat pass between the linear function and the output. ReLu remains interesting if you want to set to 0 all the outputs concerning a negative input.
Bent Identity Function (Regression Problem Solving):
Bent identity there is no learning parameter but the function does not just forward the input it applies the curved identity function to it which slightly modifies the result:
fc1 = nn.Linear(1, 1) model = nn.Sequential(OrderedDict([ ('lin', fc1), ('BentID',BentID()) ])) model = model.float() fc1.weight.data.fill_(-0.01) fc1.bias.data.fill_(8)
In our case after 100 training sessions the network found the following weights for the linear function upstream of bent identity the final loss is 0.78 so less good than with the linear function (the network converges less quickly):
In our case, the function applied by the network will therefore be:
-0.0481 ((math.sqt(x**2 + 1) -1)/2 + x) + 7.7386
Result obtained with Bent Identity:
I display the architecture of the Pytorch network which is now linear layer + bent identity layer :
Soft Exponential – (Regression problem solving)
Soft Exponential has a trainable alpha parameter.We are going to place a linear function upstream and softExponential in output and check the approximation performed in this case:
fc1 = nn.Linear(1, 1) fse = SoftExponential(1,torch.tensor([-0.9],dtype=torch.float)) model = nn.Sequential(OrderedDict([ ('lin', fc1), ('SoftExponential',fse) ])) model = model.float() fc1.weight.data.fill_(-0.01) fc1.bias.data.fill_(8)
After 100 training sessions the results on the linear layer parameters and the soft exponential layer parameters are as follows:
print("weight" + str(fc1.weight)) print("bias" + str(fc1.bias)) print("fse alpha" + str(fse.alpha))
The result is not really good since the approximation is not done in the right direction,
We therefore notice that we could invert the function in our case to define a Soft Exponential customize activation function with a bias criterion and that is what we will do here.
Custom function – (Regression problem solving)
We declare a custom activation function to PyTorch compared to the original soft exponential I added the beta bias criterion as well as the torch.div inversion (1, …)
class SoftExponential2(nn.Module): def __init__(self, in_features, alpha=None,beta=None): super(SoftExponential2, self).__init__() self.in_features = in_features # initialize alpha if alpha is None: self.alpha = Parameter(torch.tensor(0.0)) # create a tensor out of alpha else: self.alpha = Parameter(torch.tensor(alpha)) # create a tensor out of alpha if beta is None: self.beta = Parameter(torch.tensor(0.0)) # create a tensor out of alpha else: self.beta = Parameter(torch.tensor(beta)) # create a tensor out of alpha self.alpha.requiresGrad = True # set requiresGrad to true! self.beta.requiresGrad = True # set requiresGrad to true! def forward(self, x): if self.alpha == 0.0: return x if self.alpha < 0.0: return torch.add(torch.div(1,(-torch.log(1 - self.alpha * (x + self.alpha)) / self.alpha)),self.beta) if self.alpha > 0.0: return torch.add(torch.div(1,((torch.exp(self.alpha * x) - 1) / self.alpha + self.alpha)),self.beta)
We now have 2 parameters that can be trained in this custom function in Pytorch.
By also lowering the learning rate to 0.01 after 100 training sessions and initializing alpha = 0 .1 and beta = 0.7 I arrive at a loss <5
fse = SoftExponential2(1,torch.tensor([0.1],dtype=torch.float),torch.tensor([7.0],dtype=torch.float)) model = nn.Sequential(OrderedDict([ ('SoftExponential',fse) ])) model = model.float() fc1.weight.data.fill_(-0.01) fc1.bias.data.fill_(8)
print("fse alpha" + str(fse.alpha)) print("fse beta" + str(fse.beta))
Categorization problem (predict a class among several classes possible) – single-label classifier with pytorch
You are looking to predict the probability that your entry will match a single class and exit betting multiple possible classes.
For example you want to predict the type of vehicle on an image allowed the classes: car, truck, train
The dimensioning of your output network corresponds to n neurons for n classes. And for each output neurons corresponding to a possible class to be predicted, we will obtain a probability between 0 and 1 which will represent the probability that the input corresponds to this class at the output. So for each class this comes down to solving a binary classification problem in part 1 but in addition to which it must be considered that an input can only correspond to a single output class.
Activation function Categorization problem (predict a class among several possible classes):
The activation function to be used on the final layer is a Softfmax function with n dimensions corresponding to the number of classes to be predicted.
Softmax is a mathematical function that converts a vector of numbers (tensor) into a vector of probabilities, where the probabilities of each value are proportional to the relative scale of each value in the vector.
In other words, for an output tensor it will return a probability for each class by scaling each of them so that their sum is equal to one.
Cost function – Categorization problem (predict a class among several possible classes):
The cost function to be used is close to that used in the case of binary classification but with this notion of probability vector.
Cross Entropy will evaluate the difference between 2 probability distributions. We will therefore use it to compare the predicted value and the true value.
Based on the tutorial and the data set on this page we have:Website pytorch.org
For info if you get the following error:
ImportError: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
You need to setup Iprogress on jupyter lab
pip install ipywidgets jupyter nbextension enable --py widgetsnbextension
In our example we will first retrieve the input dataset
The dataset used is CIFAR-10, more information here:
This is already integrated with Pytorch to allow us to perform certain tests.
Exemple (problème de catégorisation) :
Preparation of the dataset:
import torch import torchvision import torchvision.transforms as transforms transform = transforms.Compose( [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform) trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2) testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform) testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2) classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
We then define a function which allows to visualize the imshow image and we display for each image the unique label associated with it:
import matplotlib.pyplot as plt import numpy as np # functions to show an image def imshow(img): img = img / 2 + 0.5 # unnormalize npimg = img.numpy() plt.imshow(np.transpose(npimg, (1, 2, 0))) plt.show() # get some random training images dataiter = iter(trainloader) images, labels = dataiter.next() # show images imshow(torchvision.utils.make_grid(images)) # print labels print(' '.join('%5s' % classes[labels[j]] for j in range(4)))
Create a convolutional neural network (more information here: Website towardsdatascience.com)
A convolutional neural network (ConvNet or CNN) is a DeepLearning algorithm that can take an image as input, it sets a score (weight and bias which are learnable parameters) to various aspects / objects of the image and be able to differentiate one from the other.
The architecture of a convolutional neuron network is close to that of the model of neuron connectivity in the human brain and was inspired by the organization of the visual cortex.
Individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. A collection of these fields overlap to cover the entire visual area.
In our case we are going to use with Pytorch a Conv2d layer and a pooling layer.
Website pytorch.org : The Conv2d layer is the 2D convolution layer.
A filter in a conv2D layer has a height and a width. They are often smaller than the input image. This filter therefore moves over the entire image during training (this area is called the receptive field).
The Max Pooling layer is a sampling process. The objective is to sub-sample an input representation (image for example), by reducing its size and by making assumptions on the characteristics contained in the grouped sub-regions.
In my example with PyTorch the declaration is made :
import torch.nn as nn import torch.nn.functional as F class Net(nn.Module): def __init__(self): super(Net, self).__init__() self.conv1 = nn.Conv2d(3, 6, 5) self.pool = nn.MaxPool2d(2, 2) self.conv2 = nn.Conv2d(6, 16, 5) self.fc1 = nn.Linear(16 * 5 * 5, 120) self.fc2 = nn.Linear(120, 84) self.fc3 = nn.Linear(84, 10) def forward(self, x): x = self.pool(F.relu(self.conv1(x))) x = self.pool(F.relu(self.conv2(x))) x = x.view(-1, 16 * 5 * 5) x = F.relu(self.fc1(x)) x = F.relu(self.fc2(x)) x = self.fc3(x) return x net = Net()
Define the criterion of the cost function with CrossEntropyLoss, here we use the SGD opimizer rather than adam (more info here on the comparison between optimizer Website ruder.io)
import torch.optim as optim criterion = nn.CrossEntropyLoss() optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
The training loop:
for epoch in range(2): # loop over the dataset multiple times running_loss = 0.0 for i, data in enumerate(trainloader, 0): # get the inputs; data is a list of [inputs, labels] inputs, labels = data # zero the parameter gradients optimizer.zero_grad() # forward + backward + optimize outputs = net(inputs) loss = criterion(outputs, labels) loss.backward() optimizer.step() # print statistics running_loss += loss.item() if i % 2000 == 1999: # print every 2000 mini-batches print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000)) running_loss = 0.0 print('Finished Training')
The Pytorch neuron network is saved
PATH = './cifar_net.pth' torch.save(net.state_dict(), PATH)
We load a test and use the network to predict the outcome.
dataiter = iter(testloader) images, labels = dataiter.next() # print images imshow(torchvision.utils.make_grid(images)) print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))
In this example the « true » labels are « Cat, ship, ship, plane », we then launch the network to make a prediction:
net = Net() net.load_state_dict(torch.load(PATH)) outputs = net(images) _, predicted = torch.max(outputs, 1) print('Predicted: ', ' '.join('%5s' % classes[predicted[j]] for j in range(4)))
Here is an overview of the architecture of this convolution network.
import hiddenlayer as hl from torchviz import make_dot, make_dot_from_trace make_dot(net(images), params=dict(net.named_parameters()))
Categorization problem (predict several class among several classes possible) – multiple-label classifier with pytorch – Pytorch tutorial
Overall, it is about predicting several probabilities for each of the classes to indicate their probabilities of presence in the entry. One possible use is to indicate the presence of an object in an image.
The problem then comes back to a problem of binary classification for n classes.
The final activation function is sigmoid and the loss function is Binary cross entropy
This internet example perfectly illustrates the use of BCELoss in the case of the prediction of several classes among several possible classes.
In this example: The image dataset used is the CelebFaces Large Scale Attribute Dataset (CelebA).
In this data dataset there are 200K images with 40 different class labels and each image has a different background footprint and there are a lot of different variations making it difficult for a model to classify each label effectively class.
I suggest you follow the tutorial from the articleWebsite medium.com:
- Step1: Download the file img_align_celeba.zip hard site Website mmlab.ie.cuhk.edu.hk (it is in a google drive)
- Step2: Download the list_attr_celeba.txt file which contains the annotations for each image on the same site Website mmlab.ie.cuhk.edu.hk
- Step3: Opening the file annotations you can see the 40 labels: 5_o_Clock_Shadow Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Big_Lips Big_Nose Black_Hair Blond_Hair Blurry Brown_Hair Bushy_Eyebrows Chubby Double_Chin Eyeglasses Goatee Gray_Hair Heavy_Makeup High_Cheekbones Male Mouth_Slightly_Open Mustache Narrow_Eyes No_Beard Oval_Face Pale_Skin Pointy_Nose Receding_Hairline Rosy_Cheeks Sideburns Smiling Straight_Hair Wavy_Hair Wearing_Earrings Wearing_Hat Wearing_Lipstick Wearing_Necklace Wearing_Necktie Young
For each of the images present in the test set there is a value of 1 or -1 specifying for image whether the class is present in the image.
The objective here will be to predict for each of the classes the probability of its presence in the image.
- Step4: you can load the notebook which is present on Website github.com
Here is the results:
Load of the dataset :
Use of the imshow function (see previous example for single label prediction)
Architecture of the convolutional neuron network:
import torch.nn.functional as F class MultiClassifier(nn.Module): def __init__(self): super(MultiClassifier, self).__init__() self.ConvLayer1 = nn.Sequential( nn.Conv2d(3, 64, 3), # 3, 256, 256 nn.MaxPool2d(2), # op: 16, 127, 127 nn.ReLU(), # op: 64, 127, 127 ) self.ConvLayer2 = nn.Sequential( nn.Conv2d(64, 128, 3), # 64, 127, 127 nn.MaxPool2d(2), #op: 128, 63, 63 nn.ReLU() # op: 128, 63, 63 ) self.ConvLayer3 = nn.Sequential( nn.Conv2d(128, 256, 3), # 128, 63, 63 nn.MaxPool2d(2), #op: 256, 30, 30 nn.ReLU() #op: 256, 30, 30 ) self.ConvLayer4 = nn.Sequential( nn.Conv2d(256, 512, 3), # 256, 30, 30 nn.MaxPool2d(2), #op: 512, 14, 14 nn.ReLU(), #op: 512, 14, 14 nn.Dropout(0.2) ) self.Linear1 = nn.Linear(512 * 14 * 14, 1024) self.Linear2 = nn.Linear(1024, 256) self.Linear3 = nn.Linear(256, 40) def forward(self, x): x = self.ConvLayer1(x) x = self.ConvLayer2(x) x = self.ConvLayer3(x) x = self.ConvLayer4(x) x = x.view(x.size(0), -1) x = self.Linear1(x) x = self.Linear2(x) x = self.Linear3(x) return F.sigmoid(x)
An overview of the network architecture:
SUMMARY – Pytorch tutorial :
Here is a summary of my pytorch tutorial : sheet that I created to allow you to choose the right activation function and the right cost function more quickly according to your problem to be solved.
Pytorch tutorial : THE 5 BEST DEEP LEARNING LINKS
Website fr.wikipedia.org : Dree courses Yann LeCun
Website pytorch.org : Official web site PyToch
Website machinelearningmastery.com : Advanced on deep learning
Website www.fun-mooc.fr : Free MOOC