Applying the Chain Rule
00:00 To review, your network has two layers. The error function is x squared, but x is itself the result of another function: the difference between the prediction and the expected value, as you saw in the previous lesson.
00:15 When the input of one function is the result of another, it’s called function composition. In the previous lesson, you saw how to use the derivative to reduce the error function.
00:25 But since the error function is a composition of other functions, you must use the chain rule to take its derivative and reduce the error. With the chain rule, you take the partial derivative of each function and then multiply them together.
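For instance, here’s a minimal sketch of that composition in Python, using illustrative names that aren’t from the lesson’s code:

```python
def error(prediction, target):
    # Outer function: x squared, where x is the difference below
    return (prediction - target) ** 2

def derror_dprediction(prediction, target):
    # Chain rule: d(x**2)/dx = 2x, and x = prediction - target,
    # so the derivative of the error with respect to the prediction is 2 * x
    return 2 * (prediction - target)

print(error(0.8, 1.0))               # about 0.04
print(derror_dprediction(0.8, 1.0))  # about -0.4
```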
00:40 This isn’t as complex as it might sound for the simple functions in this course. It’s easier to see in a diagram. To take the derivative of the error function, you’ll need partial derivatives.
00:53 Take the partial derivative of the error with respect to the prediction, then take the partial derivative of the prediction with respect to the first layer, and finally, the partial derivative of the first layer with respect to the weights.
01:08 Then take all the partial derivatives and multiply them together. This product gives you the derivative of the error with respect to the weights. Perhaps you noticed that you started at the error and worked backward to the weights. This is called a backward pass, and the algorithm is called backpropagation.
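As a rough sketch of that multiplication (the argument names are placeholders, not the lesson’s code), the backward pass for the weights boils down to a single product:

```python
def chain_rule_product(derror_dprediction, dprediction_dlayer1, dlayer1_dweights):
    # Backward pass: multiply the partial derivatives collected from the
    # error back to the weights to get the derivative of the error
    # with respect to the weights
    return derror_dprediction * dprediction_dlayer1 * dlayer1_dweights
```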
01:29 Here’s how to update the bias. It’s the same algorithm as for the weights, just with different variables. So instead of taking the derivative of the error with respect to the weights, you take the derivative of the error with respect to the bias, as seen in this diagram.
01:45 The error function is x squared, and the derivative, as you’ve seen, is 2x. For the next partial derivative, you’ll take a step in reverse and compute the partial derivative of the prediction with respect to the layer.
01:58 This is the derivative of the sigmoid function. For this course, just accept that it’s the sigmoid multiplied by one minus the sigmoid. Finally, you can take the partial derivative of the layer with respect to the bias.
02:14 If you multiply them, you’ll get the derivative of the error with respect to the bias, and you’ll use this value to update the bias to reduce the error.
02:25 Here’s what it looks like in Python. First, this function computes the derivative of the sigmoid function.
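The code isn’t reproduced in this transcript, but a minimal sketch of such a function, with assumed names like sigmoid and sigmoid_deriv, could look like this:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    # Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))
    return sigmoid(x) * (1 - sigmoid(x))
```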
02:33 The partial derivative of the error with respect to the prediction is 2x, but x is the difference of the prediction and the target. The derivative of the prediction with respect to the layer is the derivative of the sigmoid, which accepts the output of the layer. And the derivative of the layer with respect to the bias is the constant 1.
02:54 Multiply them together for the derivative of the error with respect to the bias and subtract that from the bias. And, of course, do the same with the weights and the derivative of the error with respect to the weights. You’ll implement that in the next lesson as you write a class to build a neural network.
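Putting the pieces together, here’s a self-contained sketch of the bias and weight updates. The input, weight, and target values are placeholders, and the variable names are assumptions rather than the lesson’s exact code:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Placeholder inputs, just to make the sketch runnable
input_vector = np.array([2.0, 1.5])
weights = np.array([1.45, -0.66])
bias = np.array([0.0])
target = 0

# Forward pass
layer_1 = np.dot(input_vector, weights) + bias
prediction = sigmoid(layer_1)

# Backward pass: partial derivatives from the error back to the bias
derror_dprediction = 2 * (prediction - target)
dprediction_dlayer1 = sigmoid_deriv(layer_1)
dlayer1_dbias = 1

# Multiply them and subtract the result from the bias to reduce the error
derror_dbias = derror_dprediction * dprediction_dlayer1 * dlayer1_dbias
bias = bias - derror_dbias

# The same pattern applies to the weights
dlayer1_dweights = input_vector
derror_dweights = derror_dprediction * dprediction_dlayer1 * dlayer1_dweights
weights = weights - derror_dweights
```

Running this once nudges the bias and weights in the direction that reduces the error; the class you write in the next lesson repeats these updates over many iterations.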
Ahmed Ahmed Elnaghy on Feb. 24, 2023
I didn’t get it. I need more explanation about backpropagation or the chain rule, I mean step by step, doing it by hand or with an example. Thanks for your kindness.