COMPUTING GRADIENTS FOR THE SAKE OF IT
i always struggled to remember the formulas for the gradient of a particular layer in a transformer. recently, while studying distributed training, i realized that i could not move forward without a crystal clear understanding of how gradients flow through the network
so here is a blog post that will teach you all you need to be a wizard of gradients
Rule 1
suppose we want to find the gradient dL/dW, where L is the loss and W is a weight matrix. the rule says that the shape of dL/dW is always equal to the shape of W. memorize this.
x→[YOUR LAYER (f)]→y→[…Rest of Network…]→L
clearly x is the input here and f is the layer we care about, wow it rhymes
y is the output of the layer, y = f(x), and L is the final loss.
what we are interested in is dL/dX and dL/dW, where W stands for whatever learnable parameters f has (a weight matrix, for example)
Memorize
dL/dX = dL/dY * dY/dX
generally f comes in two types: weight matrices and element-wise operations. we are going to look at element-wise operations first
an “element-wise” operation means the math happens to each number in the matrix independently. none of the numbers “talk” to each other.
Examples:
Y = X + Z (matrix addition)
Y = ReLU(X) (activation)
Y = X^2 (square every element)
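Rule for Element Wise Operations (memorize this)
dL/dX = dL/dY ⊙ dY/dX
here dY/dX is the derivative of f taken element by element, so dL/dY, dY/dX and dL/dX all have the same shape as X
you might be wondering what the circle with a dot inside it (⊙) means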
that is the hadamard product (element-wise multiplication). it means: multiply the top-left of a with the top-left of b, top-right with top-right, etc. no fancy row-column dot products here. just simple multiplication.
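to make the element-wise rule concrete, here is a minimal numpy sketch for Y = X^2, reading the rule as dL/dX = dL/dY ⊙ dY/dX. the names `square_backward`, `upstream` and the toy loss are made up for illustration, they are not from any library. the last lines compare one entry against a finite-difference estimate.

```python
import numpy as np

def square_backward(x, grad_y):
    # y = x**2 is element-wise, so the local derivative dY/dX = 2*x
    # has the same shape as x
    local_grad = 2.0 * x
    # hadamard product: plain element-wise multiply, no row-column dot products
    return grad_y * local_grad

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3))
# hypothetical stand-in for dL/dY coming from the rest of the network
upstream = rng.normal(size=x.shape)

def toy_loss(x):
    # toy loss chosen so that dL/dY is exactly `upstream`
    return np.sum(upstream * x**2)

grad_x = square_backward(x, upstream)

# finite-difference check on a single entry of x
eps = 1e-6
x_bumped = x.copy()
x_bumped[0, 1] += eps
numeric = (toy_loss(x_bumped) - toy_loss(x)) / eps
print(grad_x[0, 1], numeric)  # the two numbers should agree closely
```

notice that `grad_x` has the same shape as `x`, which is Rule 1 applied to an input instead of a weight.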
example 1
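Y = ReLU(X) (rule: if x>0, keep it. if x≤0, set it to 0)
so dY/dX = 1 if x>0 and 0 if x≤0 (the derivative at exactly x=0 is undefined, so it is set to 0 by convention)
what does this mean philosophically? if x>0, pass my gradients (y speaking) through as they are. if not, stop and make the gradients 0 (those inputs did not contribute to the loss, so nothing upstream of them should change because of them)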
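the ReLU case is small enough to check by eye. a minimal numpy sketch, with hypothetical helper names `relu_forward` and `relu_backward` and made-up numbers:

```python
import numpy as np

def relu_forward(x):
    # keep positive entries, zero out the rest
    return np.maximum(x, 0.0)

def relu_backward(x, grad_y):
    # mask is 1 where x > 0 and 0 elsewhere (x == 0 gets 0 by convention)
    mask = (x > 0).astype(x.dtype)
    # element-wise multiply: the gradient passes where x > 0, is blocked otherwise
    return grad_y * mask

x = np.array([[-1.5, 0.0, 2.0],
              [3.0, -0.5, 0.1]])
grad_y = np.array([[10.0, 20.0, 30.0],
                   [40.0, 50.0, 60.0]])

grad_x = relu_backward(x, grad_y)
print(grad_x)
# [[ 0.  0. 30.]
#  [40.  0. 60.]]
```

the backward pass needs the original input x (or at least the mask), which is why frameworks keep activations around after the forward pass.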
example 2
Residual Connection Y = X + Z
dY/dX = dY/dZ = 1
so the gradient flows unchanged from Y to both X and Z: dL/dX = dL/dZ = dL/dY
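a minimal numpy sketch of the addition case. `add_backward` is a hypothetical helper name, and the branch Z = 3X in the second half is a made-up stand-in for whatever sublayer produces Z in a real residual block:

```python
import numpy as np

def add_backward(grad_y):
    # dY/dX = dY/dZ = 1, so both inputs receive the upstream gradient unchanged
    return grad_y.copy(), grad_y.copy()

grad_y = np.array([[0.5, -1.0],
                   [2.0, 0.0]])
grad_x, grad_z = add_backward(grad_y)

# in a residual block Z is itself computed from X. with the toy branch Z = 3 * X,
# X is reached by two paths, and the gradients from the two paths add up
branch_scale = 3.0
grad_x_total = grad_x + branch_scale * grad_z
print(grad_x_total)  # equals 4 * grad_y for this toy branch
```

the skip path always hands X the full dL/dY, no matter what the branch does to its share of the gradient.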
note: we talked about ReLU above and only calculated dY/dX and not dY/dW because ReLU does not have any parameters to tune. if we were using GeGLU, which has weight matrices inside it, we would also have calculated dY/dW because we want its parameters to learn too.
now that we have covered element-wise operations, let's move on to matrix multiplications
Y = XW
we want to learn two things dL/dX and dL/dW
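so here are the formulas, please memorize
dL/dX = (dL/dY) W^T
dL/dW = X^T (dL/dY)
these are ordinary matrix multiplications. with X of shape (n, d_in) and W of shape (d_in, d_out), dL/dY has shape (n, d_out), so dL/dX comes out as (n, d_in) and dL/dW as (d_in, d_out), exactly what Rule 1 demands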
you are now all set for calculating any gradient in transformers
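for Y = XW the standard results are dL/dX = (dL/dY) W^T and dL/dW = X^T (dL/dY). here is a minimal numpy sketch that checks both the shapes and the values. the helper `linear_backward`, the sizes and the toy loss are assumptions made for the check, not anything from a framework:

```python
import numpy as np

def linear_backward(x, w, grad_y):
    grad_x = grad_y @ w.T  # (n, d_out) @ (d_out, d_in) -> (n, d_in)
    grad_w = x.T @ grad_y  # (d_in, n) @ (n, d_out) -> (d_in, d_out)
    return grad_x, grad_w

rng = np.random.default_rng(0)
n, d_in, d_out = 4, 3, 5
x = rng.normal(size=(n, d_in))
w = rng.normal(size=(d_in, d_out))
# hypothetical stand-in for dL/dY coming from the rest of the network
upstream = rng.normal(size=(n, d_out))

def toy_loss(x, w):
    # toy loss chosen so that dL/dY is exactly `upstream`
    return np.sum(upstream * (x @ w))

grad_x, grad_w = linear_backward(x, w, upstream)

# finite-difference checks on one entry of w and one entry of x
eps = 1e-6
w_bumped = w.copy()
w_bumped[1, 2] += eps
numeric_w = (toy_loss(x, w_bumped) - toy_loss(x, w)) / eps

x_bumped = x.copy()
x_bumped[0, 1] += eps
numeric_x = (toy_loss(x_bumped, w) - toy_loss(x, w)) / eps

print(grad_w.shape == w.shape, grad_x.shape == x.shape)
print(grad_w[1, 2], numeric_w)
print(grad_x[0, 1], numeric_x)
```

a handy trick if you forget the formulas: the shapes only fit together one way. for a transformer input of shape (b, t, d), one option is to reshape it to (b*t, d) first so the same two lines apply.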
but let's cover one special hard case: the gradient of the loss with respect to the logits
Softmax
Z = logits (shape: b, t, v where v is the vocab size)
P = softmax(Z) (probabilities, taken over the vocab dimension)
L = -log(P_correct_token) (cross-entropy loss for one position)
deriving the softmax gradient is messy (lots of jacobian matrices), but the final result is elegantly simple
dL/dZ = P - Y, where Y is the one-hot encoding of the correct token
and from here it is all matrix multiplications and element-wise operations, which we have already covered. so no worries.
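if you do not want to take P - Y on faith, here is a minimal numpy sketch that checks it against a finite-difference estimate. the sizes, the random targets and the helper names are made up, and the loss is summed over all b*t positions (an assumption: if you average instead, divide the gradient by b*t):

```python
import numpy as np

def softmax(z):
    # subtract the row max for numerical stability, the result is unchanged
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(z, targets):
    # -log(probability of the correct token), summed over all positions
    p = softmax(z)
    b, t, _ = z.shape
    picked = p[np.arange(b)[:, None], np.arange(t)[None, :], targets]
    return -np.log(picked).sum()

rng = np.random.default_rng(0)
b, t, v = 2, 3, 5
z = rng.normal(size=(b, t, v))
targets = rng.integers(0, v, size=(b, t))

p = softmax(z)
y = np.eye(v)[targets]  # one-hot targets, shape (b, t, v)
grad_z = p - y          # the claimed gradient dL/dZ

# finite-difference check on a single logit
eps = 1e-6
z_bumped = z.copy()
z_bumped[1, 2, 3] += eps
numeric = (cross_entropy(z_bumped, targets) - cross_entropy(z, targets)) / eps
print(grad_z[1, 2, 3], numeric)  # the two numbers should agree closely
```

each row of `grad_z` sums to zero, because both P and Y sum to 1 over the vocab. the correct token gets pushed up and every other token gets pushed down in proportion to its probability.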

