- What Is a Gradient?
- Common Functions and Their Gradients
- Activation Functions and Their Gradients
- Loss Functions and Their Gradients
- Conclusion
What Is a Gradient?
A gradient describes how a function changes at a specific point. In a multi-dimensional space it is a vector that points in the direction of steepest ascent, and its magnitude gives the rate of increase in that direction. Gradients let us locate minima or maxima of functions, which is crucial in machine learning because they drive optimization.
Key Concepts:
- Derivative: The slope of a one-dimensional function.
- Partial Derivative: The rate of change with respect to one variable in a multi-variable function.
- Gradient: A vector of partial derivatives indicating the direction in which a function grows the fastest.
Code Example:
import torch
# Example: compute the gradient of y = x^2
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
y.backward()
print(x.grad) # dy/dx = 2x = 4
tensor([4.])
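The same mechanism extends to functions of several variables, where the gradient is the vector of partial derivatives. A minimal sketch (the function and values are illustrative, not from the example above):
v = torch.tensor([2.0, 1.0], requires_grad=True)
f = v[0] ** 2 + 3 * v[1]   # f(v0, v1) = v0^2 + 3*v1
f.backward()
print(v.grad)              # (df/dv0, df/dv1) = (2*v0, 3) = (4, 3)
tensor([4., 3.])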
Gradient descent is a common optimization algorithm that updates parameters in the negative direction of the gradient to reduce the value of the loss function.
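As a rough sketch of the update rule (the learning rate of 0.1 and the step count are assumed values, not taken from the text above), gradient descent on y = x² repeatedly moves x against its gradient:
x = torch.tensor([2.0], requires_grad=True)
lr = 0.1                      # assumed learning rate
for _ in range(50):
    y = x ** 2
    y.backward()              # compute dy/dx
    with torch.no_grad():
        x -= lr * x.grad      # update: x <- x - lr * dy/dx
    x.grad.zero_()            # clear the accumulated gradient
print(x)                      # x has moved close to the minimum at 0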
Common Functions and Their Gradients
In machine learning, model predictions are usually outputs of parameterized functions. Understanding common functions and their gradient derivations is fundamental for optimization.
Function Examples:
Linear function:
- Function: y = w·x + b
- Gradient: ∂y/∂w = x, ∂y/∂b = 1
Quadratic function (the square of the linear function, as in the code below):
- Function: y = (w·x + b)²
- Gradient: ∂y/∂w = 2(w·x + b)·x, ∂y/∂b = 2(w·x + b)
Code Example:
# Compute gradients using PyTorch autograd
w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([3.0])
y = (w * x + b) ** 2
y.backward()
print(w.grad) # ∂y/∂w = 2(wx + b)·x = 2·7·3 = 42
print(b.grad) # ∂y/∂b = 2(wx + b) = 2·7 = 14
tensor([42.])
tensor([14.])
Activation Functions and Their Gradients
Activation functions introduce nonlinearity into neural networks. Common examples include Sigmoid, Tanh, and ReLU. Their gradients determine how error signals flow backward through the network, and therefore how well it can learn.
Activation Functions and Gradients:
Sigmoid:
- Function: σ(x) = 1 / (1 + e^(−x))
- Gradient: σ′(x) = σ(x)·(1 − σ(x))
Tanh:
- Function: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
- Gradient: tanh′(x) = 1 − tanh²(x)
ReLU:
- Function: ReLU(x) = max(0, x)
- Gradient: 1 if x > 0, otherwise 0
Code Example:
import torch.nn.functional as F
x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
# Activation functions
y_sigmoid = torch.sigmoid(x)
y_relu = F.relu(x)
# Compute gradients
y_sigmoid.sum().backward()
print(x.grad) # Sigmoid gradient
x.grad.zero_() # reset gradients
y_relu.sum().backward()
print(x.grad) # ReLU gradient
tensor([0.1966, 0.2500, 0.1966])
tensor([0., 0., 1.])
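Tanh is not exercised in the snippet above; a small sketch of its gradient, which should match 1 − tanh²(x):
x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
y_tanh = torch.tanh(x)
y_tanh.sum().backward()
print(x.grad) # Tanh gradient: 1 - tanh(x)^2
tensor([0.4200, 1.0000, 0.4200])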
Loss Functions and Their Gradients
Loss functions measure the difference between predictions and ground-truth labels. Common ones include Mean Squared Error (MSE) and Cross-Entropy.
Typical Loss Functions:
Mean Squared Error (MSE):
- Formula: L = (1/n) · Σᵢ (ŷᵢ − yᵢ)²
- Gradient: ∂L/∂ŷᵢ = (2/n) · (ŷᵢ − yᵢ)
Cross-Entropy (CE) (multiclass):
- Formula: L = −Σᵢ yᵢ · log(pᵢ), where p = softmax(ŷ) and ŷ are the logits
- Gradient: ∂L/∂ŷᵢ = pᵢ − yᵢ
Code Example:
# MSE loss gradient: with mean reduction, dL/dy_pred = 2 * (y_pred - y_true) / n
y_true = torch.tensor([1.0, 0.0])
y_pred = torch.tensor([0.9, 0.1], requires_grad=True)
loss = F.mse_loss(y_pred, y_true)
loss.backward()
print(y_pred.grad)
# Cross-entropy loss gradient: cross_entropy treats y_pred as logits and applies
# softmax internally; probability targets like y_true require PyTorch >= 1.10
y_pred.grad.zero_()
loss = F.cross_entropy(y_pred, y_true)
loss.backward()
print(y_pred.grad)
tensor([-0.1000, 0.1000])
tensor([-0.3100, 0.3100])
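As a check (this verification is not part of the original snippet), computing the formulas above by hand reproduces the autograd results:
print(2 * (y_pred.detach() - y_true) / 2)               # MSE gradient: 2*(y_pred - y_true)/n
print(torch.softmax(y_pred.detach(), dim=0) - y_true)   # CE gradient: softmax(logits) - y_true
tensor([-0.1000, 0.1000])
tensor([-0.3100, 0.3100])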
Conclusion
Understanding gradients and loss functions is essential for optimizing deep learning models. With automatic differentiation tools provided by frameworks like PyTorch, we can efficiently compute gradients and update parameters, forming the foundation for building complex neural networks.
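For illustration, a minimal training step that ties these pieces together (the layer shape, dummy data, learning rate, and the choice of nn.Linear with SGD are assumptions for the sketch, not part of the text above):
import torch.nn as nn

model = nn.Linear(3, 1)                                   # assumed single linear layer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # assumed learning rate
x = torch.randn(8, 3)                                     # dummy inputs
y = torch.randn(8, 1)                                     # dummy targets
pred = model(x)                                           # forward pass
loss = F.mse_loss(pred, y)                                # compute the loss
optimizer.zero_grad()                                     # clear old gradients
loss.backward()                                           # backpropagate
optimizer.step()                                          # gradient-descent update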
