Using the Chain Rule to calculate derivatives
Parameters: w1, w2, w3, w4, b1, b2, b3.
Goal: estimate b3. Assume we already have the optimal values for every parameter except the last bias term, b3.
The activation function is the softplus function:
f(x) = log(1 + e^x)
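The softplus definition above translates directly into code; a minimal sketch using Python's standard math module:

```python
import math

def softplus(x):
    # Softplus activation: f(x) = log(1 + e^x)
    return math.log(1.0 + math.exp(x))
```

Note that softplus(0) = log(2), which is useful as a sanity check.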
b3 is initialized to 0.
Quantify the differences using residuals:
Residual = Observed − Predicted
The sum of the squared residuals (SSR):
SSR = Σ Residual²
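The SSR above is a simple sum over the data points; a minimal sketch (the observed/predicted values in the test are hypothetical, for illustration only):

```python
def ssr(observed, predicted):
    # Sum of squared residuals: sum((Observed_i - Predicted_i)^2)
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))
```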
Plugging the derivatives into Gradient Descent to optimize parameters
Calculate dSSR/db3 with the chain rule:
dSSR/db3 = dSSR/dPredicted × dPredicted/db3
Predicted=const+b3⇒dPredicted/db3=1
dSSR/dPredicted=∑−2(Observed−Predicted)
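Plugging dSSR/db3 = Σ −2(Observed − Predicted) into gradient descent can be sketched as follows. Here "const" stands in for the rest of the network's output for each data point, as in the notes; the observed targets, const values, learning rate, and iteration count are hypothetical choices for illustration:

```python
# Minimal gradient-descent sketch for the single unknown bias b3.
observed = [0.0, 1.0, 0.0]    # hypothetical targets
const = [0.3, 1.2, -0.1]      # hypothetical "rest of network" outputs

b3 = 0.0                      # initial value from the notes
learning_rate = 0.1
for _ in range(1000):
    predicted = [c + b3 for c in const]
    # dSSR/db3 = sum(-2 * (Observed - Predicted)), since dPredicted/db3 = 1
    grad = sum(-2.0 * (o - p) for o, p in zip(observed, predicted))
    b3 -= learning_rate * grad
```

At convergence b3 equals the mean of (observed − const), which is the value that minimizes the SSR for this one-parameter problem.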
Apply to multiple parameters simultaneously
Now w3, w4, and b3 are all unknown.
Initial values: b3 = 0; w3 and w4 are initialized randomly.
Fancy Notation
x1,i = input_i × w1 + b1; x2,i = input_i × w2 + b2
y1,i = f(x1,i); y2,i = f(x2,i)
Predicted_i = y1,i × w3 + y2,i × w4 + b3
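The notation above describes one forward pass through a network with two hidden softplus nodes; a minimal sketch:

```python
import math

def softplus(x):
    # f(x) = log(1 + e^x)
    return math.log(1.0 + math.exp(x))

def forward(inp, w1, b1, w2, b2, w3, w4, b3):
    # x_{1,i} = input_i * w1 + b1 ; x_{2,i} = input_i * w2 + b2
    x1 = inp * w1 + b1
    x2 = inp * w2 + b2
    # y_{j,i} = f(x_{j,i})
    y1 = softplus(x1)
    y2 = softplus(x2)
    # Predicted_i = y1 * w3 + y2 * w4 + b3
    return y1 * w3 + y2 * w4 + b3
```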
dSSR/dw3=dSSR/dPredicted×dPredicted/dw3
Predicted=const+y1,iw3⇒dPredicted/dw3=y1,i
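Combining the two factors, dSSR/dw3 = Σ −2(Observed − Predicted) × y1,i. A minimal sketch of that gradient (the parameter values in the test are hypothetical):

```python
import math

def softplus(x):
    return math.log(1.0 + math.exp(x))

def dssr_dw3(inputs, observed, w1, b1, w2, b2, w3, w4, b3):
    grad = 0.0
    for inp, obs in zip(inputs, observed):
        y1 = softplus(inp * w1 + b1)
        y2 = softplus(inp * w2 + b2)
        pred = y1 * w3 + y2 * w4 + b3
        # dSSR/dw3 = dSSR/dPredicted * dPredicted/dw3 = -2(obs - pred) * y1
        grad += -2.0 * (obs - pred) * y1
    return grad
```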
Apply to w1,b1,w2,b2
dSSR/dw1 = dSSR/dPredicted × dPredicted/dy1 × dy1/dx1 × dx1/dw1
dSSR/dPredicted is already known from before.
dPredicted/dy1 = w3
dy1/dx1 = e^x1 / (1 + e^x1)
dx1/dw1 = input_i
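Multiplying all four factors gives the full chain-rule gradient for w1. A minimal sketch, checked against a finite-difference approximation in the test (all parameter values there are hypothetical):

```python
import math

def softplus(x):
    # f(x) = log(1 + e^x)
    return math.log(1.0 + math.exp(x))

def d_softplus(x):
    # dy/dx = e^x / (1 + e^x), the derivative of softplus
    return math.exp(x) / (1.0 + math.exp(x))

def dssr_dw1(inputs, observed, w1, b1, w2, b2, w3, w4, b3):
    grad = 0.0
    for inp, obs in zip(inputs, observed):
        x1 = inp * w1 + b1
        y1 = softplus(x1)
        y2 = softplus(inp * w2 + b2)
        pred = y1 * w3 + y2 * w4 + b3
        # dSSR/dw1 = dSSR/dPred * dPred/dy1 * dy1/dx1 * dx1/dw1
        #          = -2(obs - pred) *   w3   * f'(x1)  * input_i
        grad += -2.0 * (obs - pred) * w3 * d_softplus(x1) * inp
    return grad
```

The same pattern gives dSSR/dw2, dSSR/db1, and dSSR/db2 by swapping in the matching path through the network.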