In this article, we will build an Autograd engine and a neural network library that operate on N-dimensional arrays. Autograd is a tool for computing derivatives automatically. It tracks operations on values with gradients enabled and builds a dynamic computational graph, that is, a directed graph without cycles. Input values serve as the leaves of the graph, while output values act as its roots. Gradients are computed by traversing the graph from root to leaf, applying the chain rule to multiply gradients at each step.
Neural networks are complex mathematical functions whose parameters are adjusted through a process called training to produce the desired output. Backpropagation is a key algorithm in this training process. It computes gradients, which represent how the loss changes for small adjustments to the weights and biases. These gradients, scaled by a learning rate, are then used to update the weights and biases so as to reduce the overall loss and train the neural network. The same mechanism drives fine-tuning, where the parameters of a pre-trained model are carefully adjusted to adapt it to a specific task or dataset. Training happens iteratively and involves computing many gradients, so a computation graph is constructed to store these gradient functions.
Andrej Karpathy's Micrograd and his video tutorial on building Micrograd, from which I will borrow a few examples, served as the inspiration for this article. However, this Autograd engine will accept N-dimensional arrays, whereas Micrograd accepts only scalar values.
I assume a basic understanding of Python programming, high school calculus, and neural networks, but I'll provide several aids for readers who may not have that background, including line-by-line explanations of the Python code and visualizations of the output. The article also includes an interactive area for exploring derivatives using concepts from calculus. For a more comprehensive grounding, I recommend watching 3Blue1Brown's video series on neural networks and backpropagation; in the final video, he covers the chain rule, which is the mathematical foundation of backpropagation. Jay Alammar's article on neural network basics is also highly recommended.
We'll explore the building blocks of neural networks and their training process, beginning with the basics of derivatives and working through various examples and methods. We'll then delve into backpropagation, learning how to perform it both manually and programmatically. Finally, we'll create an autograd class from scratch, use it to train a neural network on a dataset, and develop a simple neural network library on top of our autograd class.
Let's dive right in and begin with the fundamentals.
Let's look at the code line by line:

- def f(x): : This line defines a function named f that takes a single input parameter x.
- return 3*x**2 - 4*x + 5 : This line specifies the expression the function f evaluates. It calculates the result of the quadratic expression \( 3 * x^2 - 4 * x + 5 \) and returns that value.
- f(3.0) : This line calls the function f with an input value of 3.0, evaluating \( 3 * x^2 - 4 * x + 5 \) with \( x \) set to \( 3.0 \).
- h = 0.001 : This line assigns the value 0.001 to the variable h. In this context, h is a small increment used to approximate the derivative of the function.
- x = 3.0 : This line assigns the value 3.0 to the variable x, the point at which we want to evaluate the derivative.
- (f(x + h) - f(x))/h : This line calculates the approximate derivative of the function f at the point \( x \) using the finite difference method. It evaluates \( \frac{f(x + h) - f(x)}{h} \), where \( f(x + h) \) is the value of the function at \( x + h \), \( f(x) \) is the value at \( x \), and \( h \) is a small increment. The rise \( f(x + h) - f(x) \) between two nearby points is divided by the run \( h \), giving the slope that approximates the derivative.
In summary, this code defines a function f that represents a quadratic equation. It then sets the values of h and x, representing the small increment and the point at which to evaluate the derivative, respectively. Finally, it calculates the approximate derivative of the function at x using the finite difference method and returns the result.
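Reassembled from the description above, a minimal sketch of this experiment looks like this (the expected values in the comments follow from the exact derivative \( 6x - 4 \)):

```python
def f(x):
    return 3*x**2 - 4*x + 5

print(f(3.0))                 # 3*9 - 12 + 5 = 20.0

h = 0.001                     # small increment
x = 3.0                       # point at which to approximate the derivative
print((f(x + h) - f(x)) / h)  # ~14.003; the exact derivative 6x - 4 at x = 3 is 14
```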
Next, an expression with several inputs:

- h = 0.0001 : This line assigns the value 0.0001 to the variable h, the small increment used for approximating the derivative (the printed quantity at the end is the "slope").
- a = 2.0, b = -3.0, c = 10.0 : These lines assign specific values to the variables a, b, and c, the inputs of the expression we are working with.
- d1 = a * b + c : This line calculates the value of the expression \( a * b + c \) and assigns it to the variable d1.
- a += h : This line increments the value of a by h, producing the nudged value of a used in the next line.
- d2 = a * b + c : This line recalculates the expression \( a * b + c \) with the updated value of a and assigns it to the variable d2.
- print('d1: ', d1), print('a: ', a), print('d2: ', d2), print('slope: ', (d2 - d1)/h) : These lines print the values of d1, a, d2, and the slope (approximated derivative) calculated as \( \frac{d2 - d1}{h} \).
In summary, this code sets up the increment (h) and the inputs (a, b, and c). It then performs calculations using these values to find d1 and d2, representing the expression \( a * b + c \) at different values of a. Finally, it prints the values of d1, a, d2, and the slope calculated using the finite difference method.
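Put together, the snippet reads as follows; the slope approximates the partial derivative of the expression with respect to a, which is exactly b:

```python
h = 0.0001
a, b, c = 2.0, -3.0, 10.0

d1 = a*b + c        # -6.0 + 10.0 = 4.0
a += h              # nudge a by the small increment h
d2 = a*b + c        # recompute with the updated a

print('d1: ', d1)
print('a: ', a)
print('d2: ', d2)
print('slope: ', (d2 - d1)/h)   # ~ -3.0, i.e. b, the derivative with respect to a
```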
The purpose of the Value class is to implement automatic differentiation, which allows for computing gradients of mathematical expressions with respect to their inputs. The class represents a value or variable in a computational graph, where each value can be connected to other values through mathematical operations. Let's go through the Value class and analyze each line of code to explain its purpose and functionality.

The __init__ method initializes an instance of the Value class. It takes in a data parameter, which can be a NumPy array or any data that can be converted into one; if data is already a NumPy array it is stored directly, otherwise it is converted using np.array(data). The grad attribute is initialized as an array of zeros with the same shape as the data array and stores the gradient (derivative) of the node with respect to some output. The _backward attribute is a placeholder for the backward function used during backpropagation; it is initially set to a lambda function that does nothing. The _prev attribute is a set that keeps track of the previous nodes (parents) in the computation graph, which determines the dependencies between nodes during backpropagation. The _op attribute stores the operation associated with the current node, such as addition or multiplication. The label attribute is an optional label for the node, useful for identification or debugging.

The __add__ method is overridden in the Value class to define the behavior of the addition operation (+) between two Value objects. Inside __add__, the other parameter is checked to see whether it is already a Value object; if not, it is converted into one using Value(other). A new Value instance called out is created to represent the result of the addition: its data is obtained by adding the data of self (the current instance) and the data of other, its parents are set as self and other, and its operation is set as '+'. A function named _backward is defined within __add__; it calculates the gradients using the chain rule and updates the gradients of the operands. The gradient of self is updated by adding the element-wise multiplication of \( 1.0 \) and the gradient of out, and the gradient of other is updated in the same way. This _backward function is then assigned to the _backward attribute of the out instance, replacing the placeholder, and out is returned.

The __mul__ method is overridden to define the behavior of the multiplication operation (*) between two Value objects. As in __add__, the other parameter is converted into a Value object if necessary. A new Value instance called out is created to represent the result of the multiplication: its data is obtained by element-wise multiplying the data of self and the data of other, its parents are set as self and other, and its operation is set as '*'. The _backward function defined within __mul__ updates the gradient of self by adding the element-wise multiplication of other.data and the gradient of out, and updates the gradient of other by adding the element-wise multiplication of self.data and the gradient of out. The _backward function is assigned to the _backward attribute of out, which is then returned.

The __pow__ method is overridden to define the behavior of the power operation (**). It allows raising a Value instance to a scalar exponent or element-wise raising a Value instance to another Value instance. If the other parameter is a scalar (integer or float), an element-wise power operation is performed: self.data is raised to the power of other, the result is stored in out_data, and a new Value instance called out is created with out_data as the data, self as the parent, and f'**{other}' as the operation label. The _backward function for the scalar case calculates the gradient using the chain rule: other is multiplied by np.power(self.data, other - 1), the result is multiplied by the gradient of out, and the computed gradient is added to self.grad. If the other parameter is a Value instance, an element-wise power operation is performed with self.data raised to the power of other.data, and a new Value instance called out is created with out_data as the data, self and other as parents, and '**' as the operation label. The _backward function for this case updates both operands: the gradient of self is incremented by other.data multiplied by np.power(self.data, other.data - 1) and the gradient of out, while the gradient of other is incremented by np.log(self.data) multiplied by the gradient of out. If the other parameter is neither a scalar nor a Value instance, a TypeError is raised to indicate unsupported operand types.

Several helper methods make the Value class more expressive and intuitive to use. The __radd__ method allows right addition by a Value instance: it returns the result of adding self and other using the + operator. The __rmul__ method allows right multiplication by a Value instance: it returns the result of multiplying self by other using the * operator. The __truediv__ method allows true division by a Value instance: it returns the result of dividing self by other using the / operator, achieved by multiplying self by other raised to the power of \( -1 \).
The __neg__ method allows negation of a Value instance: it returns the negation of self by multiplying it with \( -1 \). The __sub__ method allows subtraction of a Value instance: it returns the result of subtracting other from self using the - operator, achieved by adding self to the negation of other.

The exp method is defined in the Value class to compute the element-wise exponential of a Value instance. It starts by taking the self.data array and assigning it to the variable x. A new Value instance called out is created with the exponential of x as its data. The parents of out are set to (self,), indicating that self is the parent node, and the operation label is set as 'exp'. A function named _backward is defined to calculate the gradients using the chain rule: the gradient of self is updated by adding the element-wise multiplication of out.data and the gradient of out, since the derivative of \( e^x \) is \( e^x \) itself. The _backward function is assigned to the _backward attribute of out, and out is returned.

The tanh method is defined to compute the element-wise hyperbolic tangent of a Value instance. It takes the self.data array, assigns it to the variable x, computes np.tanh(x), and assigns the result to the variable t. A new Value instance called out is created with t as the data, (self,) as the parent, and 'tanh' as the operation label. The _backward function updates the gradient of self by adding the element-wise multiplication of (1 - t**2) and the gradient of out; the factor (1 - t**2) is the derivative of the hyperbolic tangent function. The _backward function is assigned to out, and out is returned.

The backward method is defined in the Value class to perform backpropagation and compute gradients for the computational graph. It creates an empty list called topo to store the topological order of nodes and a set called visited to keep track of visited nodes. A helper function build_topo recursively builds the topological order starting from a given node v: if v has not been visited, it is added to the visited set, build_topo is called for each child node of v, and after visiting all children, v is appended to the topo list. This ensures that nodes are added to topo in topological order, i.e., parents before children. The build_topo function is invoked with self as the starting node. The gradient of the output node is then set to np.ones_like(self.data), which initializes the gradient for backpropagation. Finally, the method iterates over the nodes in reverse order using the reversed function on the topo list and calls the _backward function associated with each node. Each _backward function was assigned in the respective arithmetic or mathematical operation method, and it calculates and updates the gradients of the node's parents based on the chain rule. By traversing the computational graph in reverse topological order, the backward method ensures that the gradients are calculated correctly and efficiently.

The __repr__ method is overridden in the Value class to provide a string representation of a Value instance. It uses f-string formatting to construct the representation in the format Value(data=...), where ... is replaced with the actual value of self.data. This makes the object easier to understand and debug when it is printed or converted to a string.
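The following condensed sketch gathers the pieces described above into one place. To keep it short, __pow__, exp, and the convenience operators (__radd__, __rmul__, __neg__, __sub__, __truediv__) are omitted; they follow exactly the same pattern of creating an out node and attaching a _backward closure:

```python
import numpy as np

class Value:
    """A node in the computational graph holding an N-dimensional array."""

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data if isinstance(data, np.ndarray) else np.array(data)
        self.grad = np.zeros_like(self.data, dtype=float)  # gradient w.r.t. some output
        self._backward = lambda: None   # placeholder backward function
        self._prev = set(_children)     # parent nodes in the graph
        self._op = _op                  # operation that produced this node
        self.label = label              # optional label for debugging

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # d(out)/d(self) = d(out)/d(other) = 1; apply the chain rule
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = np.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t**2) * out.grad  # derivative of tanh is 1 - tanh^2
        out._backward = _backward
        return out

    def backward(self):
        # build a topological order of the graph, then apply the chain rule in reverse
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = np.ones_like(self.data, dtype=float)  # seed the output gradient
        for node in reversed(topo):
            node._backward()

    def __repr__(self):
        return f"Value(data={self.data})"
```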
Since the Value class performs computation on expressions that can become large depending on the inputs, it is useful to have a visually appealing representation of these expressions. This visualization lets us better understand how the results of these expressions are built up, facilitating comprehension and analysis.

First we define a function named trace that takes a root node as input and returns a set of nodes and edges that represent a trace of the computational graph rooted at that node. It creates two empty sets, nodes and edges, to store the nodes and edges of the graph, respectively. A helper function build recursively builds these sets starting from a given node v: if v is not already in the nodes set, it is added, and for each previous node (parent) of v, an edge from that parent to v is added to the edges set and build is called on the parent. The build function is invoked initially with the root node, and the function returns the populated sets of nodes and edges.

Next we define a function named draw_dot that takes a root node and generates a visualization of the computational graph rooted at it using the Graphviz library. It creates a Digraph object named dot with the format set to SVG and the graph attribute rankdir set to 'LR' to arrange the nodes from left to right. The trace function is called to retrieve the set of nodes and edges of the graph. The function then iterates over the nodes set: for each node n, it assigns a unique ID (uid) based on the node's ID using str(id(n)), converts the data and grad arrays into string representations using np.array2string with a specified precision, separator, and suppression of small values, and builds the node's label by combining the node's label, data string, and gradient string. The node is added to the graph with the unique ID, label, and shape 'record', so it is drawn as a record-shaped box. If the node has an associated operation (n._op), an additional node is added to the graph with a unique ID based on the node's ID and operation, labeled with the operation, and an edge is added from the operation node to the current node. Finally, the function iterates over the edges set: for each edge (n1, n2), it adds an edge from the ID of the first node (n1) to the ID of the second node (n2) concatenated with the operation name (n2._op), and returns the Digraph object (dot).

In short, draw_dot generates a visualization of the computational graph rooted at a given node. The nodes represent the Value instances in the graph, the edges represent the dependencies between nodes based on their parent-child relationships, and each node's label displays information such as its label, data, and gradient.
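A sketch of the two functions as described, assuming the Value class above and the graphviz Python package:

```python
from graphviz import Digraph
import numpy as np

def trace(root):
    # collect all nodes and edges reachable from the root of the graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})  # left-to-right layout
    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        data_str = np.array2string(n.data, precision=4, separator=',', suppress_small=True)
        grad_str = np.array2string(n.grad, precision=4, separator=',', suppress_small=True)
        # record-shaped box showing the node's label, data, and gradient
        dot.node(name=uid, label=f"{{ {n.label} | data {data_str} | grad {grad_str} }}",
                 shape='record')
        if n._op:
            # a separate node for the operation, connected to its result
            dot.node(name=uid + n._op, label=n._op)
            dot.edge(uid + n._op, uid)
    for n1, n2 in edges:
        # connect each parent to the operation node of its child
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    return dot
```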
Next, we propagate gradients from the output o backward through the graph to determine how each input affects the result of the hyperbolic tangent function (tanh).
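As a concrete example, consider a single neuron-like expression ending in tanh; the input values here are illustrative, in the spirit of Karpathy's tutorial:

```python
# inputs, weights, and bias of a tiny neuron-like expression (illustrative values)
x1 = Value(2.0, label='x1'); x2 = Value(0.0, label='x2')
w1 = Value(-3.0, label='w1'); w2 = Value(1.0, label='w2')
b = Value(6.8813735870195432, label='b')

n = x1*w1 + x2*w2 + b     # weighted sum of inputs plus bias
o = n.tanh(); o.label = 'o'

o.backward()              # populate .grad on every node via the chain rule
print(x1.grad)            # d(o)/d(x1) = w1 * (1 - o^2)
draw_dot(o)               # render the full graph with data and grad at each node
```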
The code first imports the random module, which is used for generating random numbers. A Neuron class is then defined, representing a single neuron in a neural network.
The __init__ method initializes the neuron with random weights and a bias. The number of input connections to the neuron is specified by nin. The weights are randomly initialized using the random.uniform function between -1 and 1 and stored in a list self.w; the bias is also randomly initialized and stored in self.b. The __call__ method implements the behavior of the neuron: given an input x, it calculates the weighted sum of the inputs multiplied by the weights, adds the bias, applies the hyperbolic tangent function to the result (the activation), and returns the output of the neuron. The parameters method returns the parameters of the neuron, namely the weights (self.w) and the bias (self.b).
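A minimal sketch of the Neuron class as described, built on the Value class defined earlier:

```python
import random

class Neuron:
    def __init__(self, nin):
        # one random weight per input connection, plus a random bias, all in [-1, 1]
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # weighted sum of inputs plus bias, passed through the tanh activation
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]
```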
Next, a Layer class is defined, representing a layer of neurons in a neural network. The __init__ method initializes a layer with a specified number of input neurons (nin) and output neurons (nout). It creates a list of nout neurons using a list comprehension, each created by calling the Neuron class with nin as the number of input connections. The __call__ method computes the output of each neuron in the layer for a given input x: it iterates over the neurons, calls each one with x, and collects the outputs in a list named outs. If there is only one output neuron, the output is returned directly; otherwise, the list of outputs is returned. The parameters method returns the parameters of all neurons in the layer by iterating over the neurons, calling each neuron's parameters method, and flattening the results into a single list using a list comprehension.
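A corresponding sketch of the Layer class:

```python
class Layer:
    def __init__(self, nin, nout):
        # nout independent neurons, each receiving nin inputs
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs  # unwrap a single output

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]
```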
Finally, an MLP (Multi-Layer Perceptron) class is defined, representing a multi-layer neural network. The __init__ method initializes an MLP with the specified number of input neurons (nin) and a list of output sizes for each layer (nouts). It creates a list sz containing the sizes of all layers, starting with the input layer size (nin) followed by the output sizes of each layer (nouts). Each layer in the MLP is created using a list comprehension that calls the Layer class with sz[i] as the number of input neurons and sz[i+1] as the number of output neurons. The __call__ method performs forward propagation through the MLP: it iterates over the layers and sequentially applies each one to the input x, with the output of each layer becoming the input for the next, and returns the final output of the MLP. The parameters method returns the parameters of all layers in the MLP by iterating over the layers, calling each layer's parameters method, and flattening the results into a single list using a list comprehension.
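And a sketch of the MLP class tying the layers together:

```python
class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts  # e.g. MLP(3, [4, 4, 1]) -> layer sizes [3, 4, 4, 1]
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)    # the output of each layer feeds the next
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```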
The code sets the input values x to [2.0, 3.0, -1.0]. An MLP object named n is created and initialized with 3 input neurons and the list [4, 4, 1], which specifies the number of neurons in each layer. The MLP therefore has two hidden layers with 4 neurons each and an output layer with a single neuron. Calling the __call__ method of the MLP object n with the input values x triggers the forward propagation of the input through all layers of the MLP.
The __call__ method sequentially applies each layer in the MLP to the input: it iterates over the layers stored in the self.layers attribute, and the output of each layer becomes the input for the next layer. The training examples are defined in a list named xs, where each training example is a list of input values; in this case, there are four training examples, each with three input values. The desired target values are defined in a list named ys, with each target value corresponding to a training example in xs: xs[0] corresponds to ys[0], xs[1] corresponds to ys[1], and so on. These data are typically used for training a machine learning model, where the model learns to map the input values in xs to the corresponding target values in ys.
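The setup then looks like this; the particular numbers in xs and ys are illustrative, matching the tiny dataset used in Karpathy's Micrograd demo:

```python
x = [2.0, 3.0, -1.0]
n = MLP(3, [4, 4, 1])   # 3 inputs, two hidden layers of 4 neurons, 1 output
n(x)                    # forward pass through all layers

xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # desired target for each training example
```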
The training loop runs for 20 iterations using the range(20) function. In each iteration, the MLP n is applied to each input in the xs list using a list comprehension, which performs forward propagation and generates a list of predicted outputs ypred corresponding to each input in xs. The loss is calculated as the sum of squared differences between each predicted output yout and the corresponding target value ygt, using a generator expression and the sum() function. Before backpropagation, the gradients of all parameters are reset by iterating over the parameters returned by the parameters() method and setting p.grad = 0.0 for each parameter p. The gradients are then computed by calling the backward() method of the loss, which applies automatic differentiation and the chain rule. Each parameter's p.data is updated by subtracting 0.5 * p.grad from its current value, where 0.5 is the learning rate. Finally, the iteration index k and the value of the loss are printed using the print() function.
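A sketch of the training loop as described, assuming the full Value class (including the subtraction, power, and reflected-addition operators covered earlier):

```python
for k in range(20):
    # forward pass: predict an output for every training example
    ypred = [n(x) for x in xs]
    # loss: sum of squared differences between targets and predictions
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # zero the gradients before backpropagation
    for p in n.parameters():
        p.grad = 0.0

    # backward pass: compute gradients for every parameter
    loss.backward()

    # gradient descent update with a learning rate of 0.5
    for p in n.parameters():
        p.data -= 0.5 * p.grad

    print(k, loss.data)
```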
The complete code is organized into an engine.py file containing the Value class, and a library.py file that includes the code for the neural network library. The library.py file imports the Value class from the engine.py file. Furthermore, I have included a Jupyter notebook file, which you can open in an editor and experiment with.