In this article, we will build an Autograd engine and a neural network library that handle N-dimensional arrays. Autograd is a tool for computing derivatives automatically. It tracks operations on values with gradients enabled and builds a dynamic computational graph, that is, a graph without cycles. Input values serve as the leaves of the graph, while output values act as its roots. Gradients are computed by traversing the graph from root to leaf, applying the chain rule to multiply gradients at each step.
Neural networks are complex mathematical functions that are adjusted through a process called training to produce the desired output. Backpropagation is a key algorithm used in this training process. It computes gradients, which represent how the loss changes for small adjustments in the weights and biases. These gradients, scaled by a learning rate, are then used to update the weights and biases so as to reduce the overall loss and train the neural network. The same gradient-based adjustment is used in fine-tuning, which carefully adapts the parameters of a pre-trained model to a specific task or dataset. The training process occurs iteratively and involves calculating many gradients, so a computation graph is constructed to store these gradient functions.
Andrej Karpathy's Micrograd and his video tutorial about building Micrograd, from which I'll take a few examples, served as inspiration for this article. However, this Autograd engine will accept N-dimensional arrays, whereas Micrograd accepts scalar values only.
This article assumes a basic understanding of Python programming, high-school calculus, and neural networks, but I'll provide various teaching aids for those who may not have that background, including line-by-line Python code explanations and visualizations of the output. The article also includes an interactive area to explore derivatives, using concepts from calculus. For a comprehensive understanding, I recommend watching 3Blue1Brown's video series on neural networks and backpropagation. In the final video, he covers the chain rule, which is the mathematical foundation of backpropagation. Additionally, Jay Alammar's article on neural network basics is highly recommended.
We will explore the building blocks of neural networks and their training process. The journey begins with the basics of derivatives, covering various examples and methods. We then delve into backpropagation, focusing on how to perform it manually and programmatically. Next, we create an autograd class from scratch and use it to train a neural network on a dataset, which leads to the development of a simple neural network library built on our autograd class.
Let's dive right in and begin with the fundamentals.
```python
def f(x):
    return 3*x**2 - 4*x + 5

f(3.0)

h = 0.001
x = 3.0
(f(x + h) - f(x))/h
```

- `def f(x):`: This line defines a function named `f` that takes a single input parameter `x`. The function calculates and returns the result of the equation \( 3x^2 - 4x + 5 \).
- `return 3*x**2 - 4*x + 5`: This line specifies the expression that the function `f` evaluates. It calculates the result of the quadratic expression \( 3x^2 - 4x + 5 \) and returns that value.
- `f(3.0)`: This line calls the function `f` with an input value of `3.0`. It evaluates \( 3x^2 - 4x + 5 \) with \( x \) set to \( 3.0 \) and returns the result.
- `h = 0.001`: This line assigns the value `0.001` to the variable `h`. In this context, `h` represents a small increment used to approximate the derivative of the function.
- `x = 3.0`: This line assigns the value `3.0` to the variable `x`. This is the point at which we want to evaluate the derivative of the function.
- `(f(x + h) - f(x))/h`: This line calculates the approximate derivative of the function `f` at the point \( x \) using the finite difference method. It evaluates the expression \( \frac{f(x + h) - f(x)}{h} \), where \( f(x + h) \) is the value of the function at \( x + h \), \( f(x) \) is the value of the function at \( x \), and \( h \) is a small increment. The numerator is the rise between two nearby points on the function, and dividing it by the run \( h \) gives the slope, which approximates the derivative.
In summary, this code defines a function `f` that represents a quadratic equation. It then sets the values of `h` and `x`, representing the small increment and the point at which to evaluate the derivative, respectively. Finally, it calculates the approximate derivative of the function at `x` using the finite difference method and returns the result. Analytically, \( f'(x) = 6x - 4 \), so at \( x = 3.0 \) the result should be close to 14.
```python
h = 0.0001

# inputs
a = 2.0
b = -3.0
c = 10.0

d1 = a * b + c
a += h
d2 = a * b + c

print('d1: ', d1)
print('a: ', a)
print('d2: ', d2)
print('slope: ', (d2 - d1)/h)
```

- `h = 0.0001`: This line assigns the value `0.0001` to the variable `h`. In this code, `h` is referred to as the "slope," but it is actually the small increment used for approximating the derivative.
- `a = 2.0`, `b = -3.0`, `c = 10.0`: These lines assign specific values to the variables `a`, `b` and `c`. These values are the inputs for the expression we are working with.
- `d1 = a * b + c`: This line calculates the value of the expression \( a * b + c \) and assigns it to the variable `d1`.
- `a += h`: This line increments the value of `a` by adding `h` to it, producing the new value of `a` used in the next line.
- `d2 = a * b + c`: This line recalculates the expression \( a * b + c \) using the updated value of `a` and assigns it to the variable `d2`.
- `print('d1: ', d1)`, `print('a: ', a)`, `print('d2: ', d2)`, `print('slope: ', (d2 - d1)/h)`: These lines print the values of `d1`, `a`, `d2` and the slope (approximated derivative) calculated as \( \frac{d2 - d1}{h} \).
In summary, this code sets the small increment (`h`) and the inputs (`a`, `b` and `c`). It then performs calculations using these values to find `d1` and `d2`, representing the expression \( a * b + c \) at two different values of `a`. Finally, it prints the values of `d1`, `a`, `d2`, and the slope calculated using the finite difference method. Since \( \frac{\partial}{\partial a}(a * b + c) = b \), the printed slope should be approximately `-3.0`.
Now let's move from these numerical approximations to building the autograd engine itself. The purpose of the `Value` class is to implement automatic differentiation, which allows for computing gradients of mathematical expressions with respect to their inputs. The class represents a value or variable in a computational graph, where each value can be connected to other values through mathematical operations. We will go through the `Value` class and meticulously analyze each line of the code to provide an explanation of its purpose and functionality.

The `__init__` method initializes an instance of the `Value` class. It takes in a `data` parameter, which can be a NumPy array or any data that can be converted into a NumPy array. If `data` is already a NumPy array, it is stored directly. Otherwise, it is converted into a NumPy array using `np.array(data)`. The `grad` attribute is initialized as an array of zeros with the same shape as the `data` array. This attribute is used to store the gradient (derivative) of the node with respect to some output. The `_backward` attribute is a placeholder for the backward function, which is used for backpropagation in neural networks. It is initially set to a lambda function that does nothing. The `_prev` attribute is a set that keeps track of the previous nodes (parents) in the computation graph. This is used to determine the dependencies between nodes during backpropagation. The `_op` attribute stores the operation associated with the current node. This can be used to identify the type of operation performed on the node, such as addition, multiplication, etc. The `label` attribute is an optional label for the node, which can be used for identification or debugging purposes.
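Here is a minimal sketch of what this constructor can look like. The exact constructor signature is my assumption; I'm borrowing Micrograd's `_children` convention for passing the parent nodes:

```python
import numpy as np

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        # Store the data as a NumPy array, converting it if necessary
        self.data = data if isinstance(data, np.ndarray) else np.array(data)
        # Gradient of this node, initialized to zeros of the same shape
        self.grad = np.zeros_like(self.data, dtype=np.float64)
        # Placeholder backward function; each operation replaces it
        self._backward = lambda: None
        # Parent nodes in the computation graph
        self._prev = set(_children)
        # The operation that produced this node, e.g. '+' or '*'
        self._op = _op
        # Optional label for identification or debugging
        self.label = label
```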
The `__add__` method is overridden in the `Value` class to define the behavior of the addition operation (`+`) between two `Value` objects. This method allows adding two `Value` instances together. In the `__add__` method, the `other` parameter is checked to see if it is already a `Value` object. If not, it is converted into a `Value` object using `Value(other)`. A new `Value` instance called `out` is created to represent the result of the addition operation. The data of `out` is obtained by adding the data of `self` (the current instance) and the data of `other`. The parents of `out` are set as `self` and `other`, and the operation associated with `out` is set as `'+'`. A function called `_backward` is defined within the `__add__` method. This function is responsible for calculating the gradients using the chain rule and updating the gradients of the operands (`self` and `other`). The gradient of `self` is updated by adding the element-wise multiplication of \( 1.0 \) and the gradient of `out`. Similarly, the gradient of `other` is updated by adding the element-wise multiplication of \( 1.0 \) and the gradient of `out`. The `_backward` function is then assigned to the `_backward` attribute of the `out` instance, effectively replacing the placeholder backward function with the actual backward function. Finally, the `out` instance representing the result of the addition operation is returned. In summary, the `__add__` method allows adding two `Value` instances together and defines the necessary computations for calculating gradients during backpropagation.
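A sketch of the method consistent with that description (assumed to live inside the `Value` class above):

```python
def __add__(self, other):
    # Promote plain numbers/arrays to Value
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        # The local derivative of addition is 1 for both operands
        self.grad += 1.0 * out.grad
        other.grad += 1.0 * out.grad
    out._backward = _backward

    return out
```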
The `__mul__` method is overridden in the `Value` class to define the behavior of the multiplication operation (`*`) between two `Value` objects. This method allows multiplying two `Value` instances together. As in the `__add__` method, the `other` parameter is checked to determine if it is already a `Value` object. If not, it is converted into a `Value` object using `Value(other)`. A new `Value` instance called `out` is created to represent the result of the multiplication operation. The data of `out` is obtained by element-wise multiplying the data of `self` (the current instance) and the data of `other`. The parents of `out` are set as `self` and `other`, and the operation associated with `out` is set as `'*'`. A function called `_backward` is defined within the `__mul__` method. This function calculates the gradients using the chain rule and updates the gradients of the operands (`self` and `other`). The gradient of `self` is updated by adding the element-wise multiplication of `other.data` and the gradient of `out`. Similarly, the gradient of `other` is updated by adding the element-wise multiplication of `self.data` and the gradient of `out`. The `_backward` function is assigned to the `_backward` attribute of the `out` instance, replacing the placeholder backward function. Finally, the `out` instance representing the result of the multiplication operation is returned. In summary, the `__mul__` method allows multiplying two `Value` instances together and defines the necessary computations for calculating gradients during backpropagation.
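A sketch consistent with the description:

```python
def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        # d(a*b)/da = b and d(a*b)/db = a, scaled by the upstream gradient
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward

    return out
```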
The `__pow__` method is overridden in the `Value` class to define the behavior of the power operation (`**`) between two `Value` objects. This method allows raising a `Value` instance to a scalar exponent or element-wise raising a `Value` instance to another `Value` instance. If the `other` parameter is a scalar (integer or float), an element-wise power operation is performed. The `self.data` is raised to the power of `other`, and the result is stored in `out_data`. A new `Value` instance called `out` is created with `out_data` as the data, `self` as the parent, and `f'**{other}'` as the operation label. A `_backward` function is defined for the scalar exponent case. It calculates the gradient using the chain rule and updates the gradient of `self`. The gradient update involves multiplying `other` with `np.power(self.data, other - 1)` and then multiplying the result with the gradient of `out`. The computed gradient is added to `self.grad`. If the `other` parameter is a `Value` instance, an element-wise power operation is performed. The `self.data` is raised to the power of `other.data`, and the result is stored in `out_data`. A new `Value` instance called `out` is created with `out_data` as the data, `self` and `other` as parents, and `'**'` as the operation label. A `_backward` function is defined for the `Value` exponent case. It calculates the gradients using the chain rule and updates the gradients of `self` and `other`. The gradient update for `self` involves multiplying `other.data` with `np.power(self.data, other.data - 1)` and then multiplying the result with the gradient of `out`. The computed gradient is added to `self.grad`. The gradient update for `other` follows from \( \frac{\partial}{\partial y} x^y = x^y \ln x \): it multiplies `out.data * np.log(self.data)` with the gradient of `out` and adds the result to `other.grad`. If the `other` parameter is neither a scalar nor a `Value` instance, a `TypeError` is raised to indicate unsupported operand types. In summary, the `__pow__` method allows raising a `Value` instance to a scalar exponent or element-wise raising a `Value` instance to another `Value` instance. It defines the necessary computations for calculating gradients during backpropagation in each case.
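A sketch covering both cases. Note the `out.data` factor in the gradient for a `Value` exponent, which follows from \( \frac{\partial}{\partial y} x^y = x^y \ln x \); the sketch also assumes positive bases so that `np.log(self.data)` is defined:

```python
def __pow__(self, other):
    if isinstance(other, (int, float)):
        # Scalar exponent: element-wise power with a fixed exponent
        out = Value(np.power(self.data, other), (self,), f'**{other}')

        def _backward():
            # d(x**n)/dx = n * x**(n-1)
            self.grad += other * np.power(self.data, other - 1) * out.grad
        out._backward = _backward
    elif isinstance(other, Value):
        # Value exponent: element-wise power between two graph nodes
        out = Value(np.power(self.data, other.data), (self, other), '**')

        def _backward():
            # d(x**y)/dx = y * x**(y-1)
            self.grad += other.data * np.power(self.data, other.data - 1) * out.grad
            # d(x**y)/dy = x**y * ln(x); requires self.data > 0
            other.grad += out.data * np.log(self.data) * out.grad
        out._backward = _backward
    else:
        raise TypeError(f"unsupported operand type(s) for **: 'Value' and '{type(other).__name__}'")
    return out
```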
Several additional methods are defined in the `Value` class. The `__radd__` method allows performing right addition with a `Value` instance. It returns the result of adding `self` and `other` using the `+` operator. The `__rmul__` method allows performing right multiplication with a `Value` instance. It returns the result of multiplying `self` by `other` using the `*` operator. The `__truediv__` method allows performing true division with a `Value` instance. It returns the result of dividing `self` by `other` using the `/` operator. This is achieved by multiplying `self` by `other` raised to the power of \( -1 \). The `__neg__` method allows performing negation of the `Value` instance. It returns the negation of `self` by multiplying it with \( -1 \). The `__sub__` method allows performing subtraction of a `Value` instance. It returns the result of subtracting `other` from `self` using the `-` operator. This is achieved by adding `self` to the negation of `other`. In summary, these methods provide convenient arithmetic operations on `Value` instances and allow for a more expressive and intuitive usage of the `Value` class.
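A sketch of these convenience methods:

```python
def __radd__(self, other):  # other + self
    return self + other

def __rmul__(self, other):  # other * self
    return self * other

def __truediv__(self, other):  # self / other == self * other**-1
    return self * other**-1

def __neg__(self):  # -self
    return self * -1

def __sub__(self, other):  # self - other == self + (-other)
    return self + (-other)
```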
The `exp` method is defined in the `Value` class to compute the element-wise exponential of a `Value` instance. It starts by taking the `self.data` array and assigning it to the variable `x`. A new `Value` instance called `out` is created with the exponential of `x` as the data. The parents of `out` are set to `(self,)`, indicating that `self` is the parent node. The operation label is set as `'exp'` to represent the exponential operation. A function called `_backward` is defined to calculate the gradients using the chain rule. In this case, the gradient of `self` is updated by adding the element-wise multiplication of `out.data` and the gradient of `out`, since the derivative of \( e^x \) is \( e^x \) itself. The `_backward` function is assigned to the `_backward` attribute of the `out` instance, replacing the placeholder backward function. Finally, the `out` instance representing the result of the exponential operation is returned. In summary, the `exp` method computes the element-wise exponential of a `Value` instance and defines the necessary computations for calculating gradients during backpropagation.
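A sketch of the `exp` method:

```python
def exp(self):
    x = self.data
    out = Value(np.exp(x), (self,), 'exp')

    def _backward():
        # The derivative of e**x is e**x, which is exactly out.data
        self.grad += out.data * out.grad
    out._backward = _backward

    return out
```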
The `tanh` method is defined in the `Value` class to compute the element-wise hyperbolic tangent of a `Value` instance. It starts by taking the `self.data` array and assigning it to the variable `x`. The hyperbolic tangent of `x` is calculated using `np.tanh(x)` and assigned to the variable `t`. A new `Value` instance called `out` is created with `t` as the data. The parent of `out` is set to `(self,)`, indicating that `self` is the parent node. The operation label is set as `'tanh'` to represent the hyperbolic tangent operation. A function called `_backward` is defined to calculate the gradients using the chain rule. In this case, the gradient of `self` is updated by adding the element-wise multiplication of `(1 - t**2)` and the gradient of `out`. The expression `(1 - t**2)` corresponds to the derivative of the hyperbolic tangent function. The `_backward` function is assigned to the `_backward` attribute of the `out` instance, replacing the placeholder backward function. Finally, the `out` instance representing the result of the hyperbolic tangent operation is returned. In summary, the `tanh` method computes the element-wise hyperbolic tangent of a `Value` instance and defines the necessary computations for calculating gradients during backpropagation.
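A sketch of the `tanh` method:

```python
def tanh(self):
    x = self.data
    t = np.tanh(x)
    out = Value(t, (self,), 'tanh')

    def _backward():
        # d(tanh(x))/dx = 1 - tanh(x)**2
        self.grad += (1 - t**2) * out.grad
    out._backward = _backward

    return out
```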
The `backward` method is defined in the `Value` class to perform backpropagation and compute gradients for the computational graph. It starts by creating an empty list called `topo` to store the topological order of nodes and a set called `visited` to keep track of visited nodes. A helper function called `build_topo` is defined to recursively build the topological order of nodes starting from a given node `v`. The function checks if the node `v` has been visited before. If not, it adds `v` to the `visited` set and recursively calls `build_topo` for each child node of `v`. After visiting all child nodes, it appends `v` to the `topo` list. This ensures that nodes are added to the `topo` list in a topological order, i.e., parents before children. The `build_topo` function is invoked with `self` as the starting node to build the topological order of nodes. The gradient of the output node is then set to `np.ones_like(self.data)`. This initializes the gradient for backpropagation. The nodes are then processed in reverse topological order using the `reversed` function on the `topo` list. For each node, it calls the `_backward` function associated with that node. The `_backward` function was assigned in the respective arithmetic and mathematical operation methods, and it calculates and updates the gradients of the node's parents based on the chain rule. Processing the nodes in this order ensures that the gradients are calculated correctly and efficiently. In summary, the `backward` method performs backpropagation by traversing the computational graph in reverse topological order and calling the `_backward` function for each node to compute the gradients.
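A sketch of the `backward` method consistent with that description:

```python
def backward(self):
    # Build a topological ordering of the graph, parents before children
    topo = []
    visited = set()

    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # Seed the output gradient with ones, then apply the chain rule
    # to every node in reverse topological order
    self.grad = np.ones_like(self.data)
    for node in reversed(topo):
        node._backward()
```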
The `__repr__` method is overridden in the `Value` class to provide a string representation of a `Value` instance. It returns the `data` attribute of the `Value` instance, using f-string formatting to construct the string representation in the format `Value(data=<data>)`, where `<data>` is replaced with the actual value of `self.data`. This representation is shown for a `Value` instance when it is printed or converted to a string. In summary, the `__repr__` method provides a string representation of a `Value` instance, making it easier to understand and debug the object when printed or converted to a string.
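In code, this is a one-line method:

```python
def __repr__(self):
    return f"Value(data={self.data})"
```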
Since the `Value` class performs computation on expressions that can potentially be large, depending on the inputs, it would be beneficial to have a visually appealing representation of these expressions. This visualization allows us to better understand how the results of these expressions look, facilitating comprehension and analysis. First, we define a function called `trace` that takes a `root` node as input and returns a set of nodes and edges that represent a trace of the computational graph rooted at the `root` node. The function initializes two empty sets, `nodes` and `edges`, to store the nodes and edges of the graph, respectively. A helper function called `build` is defined to recursively build the set of nodes and edges starting from a given node `v`. The function checks if the node `v` is already in the `nodes` set. If not, it adds `v` to the `nodes` set and proceeds to iterate over the previous nodes (parents) of `v`. For each previous node (`child`), it adds an edge from `child` to `v` in the `edges` set and recursively calls `build` on the previous node (`child`). The `build` function is invoked initially with the `root` node to start building the set of nodes and edges. Finally, the function returns the sets of nodes (`nodes`) and edges (`edges`). In summary, the `trace` function traces the computational graph rooted at a given node and returns the set of nodes and edges that represent the graph.
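A sketch of the `trace` function:

```python
def trace(root):
    nodes, edges = set(), set()

    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges
```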
Next, we define a function called `draw_dot` that takes a `root` node as input and generates a visualization of the computational graph rooted at the `root` node using the Graphviz library. It creates a `Digraph` object named `dot` with the format set to SVG and the graph attribute `rankdir` set to `'LR'` to arrange the nodes from left to right. The `trace` function is called to retrieve the set of nodes and edges that represent the computational graph rooted at the `root` node. The function then iterates over the `nodes` set. For each node `n`, it assigns a unique ID (`uid`) based on the node's ID using `str(id(n))`. It converts the `data` and `grad` arrays of the node into string representations using `np.array2string` with specified precision, separator, and suppression of small values. It creates the label for the node by combining the node's label, data string, and gradient string. The node is added to the graph with the unique ID, label, and shape `'record'` to visualize the node as a record-shaped box. If the node was produced by an operation (`n._op`), it adds an additional node to the graph with a unique ID based on the node's ID and operation, and labels it with the operation. An edge is added from the operation node to the current node (`n`) in the graph. The function then iterates over the edges in the `edges` set. For each edge (`n1`, `n2`), it adds an edge from the ID of the first node (`n1`) to the ID of the second node (`n2`) concatenated with the operation name (`n2._op`). Finally, the function returns the graph object (`dot`). In summary, the `draw_dot` function generates a visualization of the computational graph rooted at a given node by creating nodes and edges in a `Digraph` object using Graphviz. The nodes represent the `Value` instances in the graph, and the edges represent the dependencies between the nodes based on their parent-child relationships. The labels of the nodes display information such as the node's label, data, and gradient.
With the visualization in place, we can propagate gradients backward from an output node `o` to determine how each input impacts the result of the hyperbolic tangent function (tanh). Next, we import the `random` module, which is used for generating random numbers. The `Neuron` class is then defined, representing a single neuron in a neural network.
The `__init__` method initializes the neuron with random weights and a bias. The number of input connections to the neuron is specified by `nin`. The weights are randomly initialized using the `random.uniform` function between -1 and 1, and stored in a list `self.w`. The bias is also randomly initialized and stored in `self.b`. The `__call__` method implements the behavior of the neuron. Given an input `x`, it calculates the weighted sum of inputs multiplied by weights, adds the bias, applies the hyperbolic tangent function to the result (activation), and returns the output of the neuron. The `parameters` method returns the parameters of the neuron, which include the weights (`self.w`) and the bias (`self.b`). In summary, the `Neuron` class represents a single neuron in a neural network. It has methods for initialization, computing the output of the neuron, and retrieving the neuron's parameters. The neuron's weights and bias are randomly initialized within a specified range. The neuron computes its output by calculating the weighted sum of inputs, adding the bias, and applying the hyperbolic tangent function to the result.
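A sketch of the `Neuron` class. Writing the weighted sum with `sum(..., self.b)` folds the bias into the accumulation; this is one implementation choice, not the only option:

```python
import random

class Neuron:
    def __init__(self, nin):
        # One random weight per input connection, plus a bias, all in [-1, 1]
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # Weighted sum of inputs plus bias, passed through tanh
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]
```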
The `Layer` class is defined, representing a layer of neurons in a neural network. The `__init__` method initializes a layer with a specified number of input neurons (`nin`) and output neurons (`nout`). It creates a list of `nout` neurons by using a list comprehension. Each neuron in the list is created by calling the `Neuron` class with `nin` as the number of input connections. The `__call__` method computes the output of each neuron in the layer given an input `x`. It iterates over the neurons in the layer and calls each neuron with `x`, collecting the outputs in a list named `outs`. If there is only one output neuron, it returns the output directly; otherwise, it returns the list of outputs. The `parameters` method returns the parameters of all neurons in the layer. It achieves this by iterating over the neurons in the layer, calling the `parameters` method of each neuron, and flattening the parameters into a single list using a list comprehension. In summary, the `Layer` class represents a layer of neurons initialized with a specified number of input and output connections: its `__call__` method computes each neuron's output for a given input, and its `parameters` method gathers the parameters of all its neurons into a single flat list.
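A sketch of the `Layer` class:

```python
class Layer:
    def __init__(self, nin, nout):
        # nout independent neurons, each with nin input connections
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        # A single output is returned directly rather than as a list
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        # Flatten every neuron's parameters into one list
        return [p for neuron in self.neurons for p in neuron.parameters()]
```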
The `MLP` (Multi-Layer Perceptron) class is defined, representing a multi-layer neural network. The `__init__` method initializes an MLP with the specified number of input neurons (`nin`) and a list of output neurons for each layer (`nouts`). It creates a list `sz` that contains the sizes of all layers, starting with the input layer size (`nin`) followed by the output sizes of each layer (`nouts`). Each layer in the MLP is created by using a list comprehension and calling the `Layer` class. The size of each layer is determined by `sz[i]` as the number of input neurons and `sz[i+1]` as the number of output neurons. The `__call__` method performs forward propagation through the MLP. It iterates over the layers in the MLP and sequentially applies each layer to the input `x`. The output of each layer becomes the input for the next layer. Finally, it returns the final output of the MLP. The `parameters` method returns the parameters of all layers in the MLP. It achieves this by iterating over the layers in the MLP, calling the `parameters` method of each layer, and flattening the parameters into a single list using a list comprehension. In summary, the `MLP` class represents a multi-layer perceptron built from `Layer` objects: its `__call__` method forward-propagates an input through the layers in sequence, and its `parameters` method collects the parameters of all layers into a single flat list.
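A sketch of the `MLP` class:

```python
class MLP:
    def __init__(self, nin, nouts):
        # Layer sizes: the input size followed by each layer's output size
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        # Forward propagation: each layer's output feeds the next layer
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```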
First, we set the input values `x` to `[2.0, 3.0, -1.0]`. An `MLP` object named `n` is created. It is initialized with 3 input neurons and a list `[4, 4, 1]` representing the number of neurons in each hidden layer and the output layer. Therefore, the MLP has two hidden layers, each with 4 neurons, and 1 output neuron. The `__call__` method of the `MLP` object `n` is called with the input values `x`. This triggers the forward propagation of the input through all layers of the MLP. The `__call__` method sequentially applies each layer in the MLP to the input: it iterates over the layers stored in the `self.layers` attribute of the MLP, and the output of each layer becomes the input for the next layer.
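For example:

```python
x = [2.0, 3.0, -1.0]
n = MLP(3, [4, 4, 1])  # 3 inputs, two hidden layers of 4 neurons, 1 output
n(x)                   # forward pass through all layers
```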
The training examples are defined in the list `xs`. Each training example is represented as a list of input values. In this case, there are four training examples, each with three input values. The desired target values are defined in the list `ys`. Each target value corresponds to a training example in `xs`. In this case, there are four target values: `xs[0]` corresponds to `ys[0]`, `xs[1]` corresponds to `ys[1]`, and so on. The correspondence between the training examples and their target values is indicated through a comment in the code. These data are typically used for training a machine learning model, where the model learns to map the input values in `xs` to the corresponding target values in `ys`.
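For concreteness, a dataset of that shape can look like this (the specific numbers are illustrative; they mirror the toy dataset used in Karpathy's tutorial):

```python
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # desired targets: ys[i] is the target for xs[i]
```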
The training loop runs for 20 iterations using the `range(20)` function. In each iteration, the forward pass applies the MLP `n` to each input in the `xs` list using a list comprehension. This generates a list of predicted outputs `ypred` corresponding to each input in `xs`. The loss is computed from each predicted output `yout` and the corresponding target value `ygt` using a generator expression and the `sum()` function. The gradients are then reset by iterating over the parameters returned by the `parameters()` method and setting `p.grad = 0.0` for each parameter `p`. Backpropagation is performed by calling the `backward()` method of the loss. This method calculates the gradients using automatic differentiation and the chain rule. Each parameter's `p.data` is then updated by subtracting `0.5 * p.grad` from its current value, where 0.5 acts as the learning rate. Finally, the iteration number `k` and the value of the loss are printed using the `print()` function. After training, each input `x` in the `xs` list is passed through the MLP object `n`. This applies the MLP's forward propagation to each input and produces a list of predicted outputs, stored in the `ypred` list.
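A sketch of the training loop, assuming a sum-of-squared-errors loss (the text above does not name the loss; squared error is the choice used in the Micrograd tutorial) and a learning rate of 0.5:

```python
for k in range(20):
    # Forward pass: predict an output for every training example
    ypred = [n(x) for x in xs]
    # Sum of squared errors between predictions and targets
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # Zero the gradients of all parameters before backpropagation
    for p in n.parameters():
        p.grad = 0.0
    loss.backward()

    # Gradient descent: nudge each parameter against its gradient
    for p in n.parameters():
        p.data -= 0.5 * p.grad

    print(k, loss.data)

ypred  # predictions after training
```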
The complete code is available as an engine.py file that contains the `Value` class, and a library.py file that includes the code for the neural network library. The library.py file imports the `Value` class from the engine.py file. Furthermore, I have included a Jupyter notebook file, which you can export to an editor and experiment with.