COMPUTING GRADIENTS FOR THE SAKE OF IT

Dec 07, 2025

i always struggled to remember the formulas for the gradient of a particular layer in the transformer. recently, while studying distributed training, i realized that i cannot move forward without a crystal clear understanding of how gradients flow through the network

so here is a blog which will teach you all you need to be a wizard of gradients

Rule 1

suppose we want to find the gradient dL/dW, where L is the loss and W is a weight matrix. the rule says that the shape of dL/dW is always equal to the shape of W, because L is a single number and we get one partial derivative per entry of W. memorize this.

x→[YOUR LAYER (f)]→y→[…Rest of Network…]→L

clearly x is the input here, f(x) is the layer about which we care, wow it rhymes

y is the output of the layer, y = f(x) and L is the final loss.

what we are interested in is dL/dX and dL/dW, where W stands for the learnable parameters of f. that can be a weight matrix, or the parameters of some other function

Memorize

dL/dX = dL/dY * dY/dX
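to make the chain rule concrete, here is a tiny scalar sketch. the layer f(x) = x^2 and the "rest of the network" 3*y are toy choices of mine, not anything from a real transformer. it multiplies the upstream gradient dL/dY by the local derivative dY/dX and compares the result against a finite-difference estimate.

```python
# toy setup (assumption): layer f(x) = x^2, rest of network L = 3 * y

def f(x):
    return x ** 2

def rest_of_network(y):
    return 3.0 * y

x = 2.0
y = f(x)

dL_dy = 3.0          # upstream gradient: derivative of 3*y wrt y
dy_dx = 2.0 * x      # local derivative of x^2
dL_dx = dL_dy * dy_dx

# finite-difference check: nudge x a little and see how L moves
eps = 1e-6
numeric = (rest_of_network(f(x + eps)) - rest_of_network(f(x - eps))) / (2 * eps)

print(dL_dx, numeric)
```

the two numbers should agree up to floating point noise, which is a handy way to check any backward formula you write by hand.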

generally f is one of two types: weight matrices or element-wise operations. we are going to look at element-wise operations first

an “element-wise” operation means the math happens to each number in the matrix independently. none of the numbers “talk” to each other.

Examples:

Y = X + Z
(Matrix Addition)

Y = ReLU(X)
(Activation)

Y = X^2
(Square every element)

Rule for Element Wise Operations (memorize this)

\( \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \odot \text{LocalDerivative}(X) \)

you might be wondering what the circle with a dot inside it means

that is the hadamard product (element-wise multiplication). it means: multiply the top-left of a with the top-left of b, top-right with top-right, etc. no fancy row-column dot products here. just simple multiplication.
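here is a small numpy sketch of the difference between the hadamard product and a regular matrix multiplication, followed by the element-wise rule applied to Y = X^2 (local derivative 2X). the matrices and the pretend upstream gradient are made-up values for illustration.

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[10.0, 20.0],
              [30.0, 40.0]])

hadamard = a * b   # element-wise: each entry times the entry in the same position
matmul = a @ b     # row-by-column dot products, shown only for contrast

print(hadamard)    # entries: 10, 40 / 90, 160
print(matmul)      # entries: 70, 100 / 150, 220

# element-wise rule for Y = X^2: dL/dX = dL/dY (hadamard) 2X
X = a
grad_Y = b                  # pretend this is dL/dY coming from the rest of the network
grad_X = grad_Y * (2 * X)   # same shape as X

print(grad_X)      # entries: 20, 80 / 180, 320
```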

example 1

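Y = ReLU(X) (rule: if x>0, keep it. if x≤0, set to 0)

so dY/dX = 1 if x>0, or 0 if x≤0 (the derivative is not really defined at exactly 0, so by convention we use 0 there)

what does this mean philosophically? y is speaking: where x>0, pass my gradient through as it is. where x≤0, stop it and make the gradient 0 (those inputs did not contribute to the loss, so nothing behind them should change because of them)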
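a minimal numpy sketch of the relu backward pass. the function names relu_forward and relu_backward are just names i picked for illustration, and the upstream gradient is a made-up constant.

```python
import numpy as np

def relu_forward(x):
    return np.maximum(x, 0.0)

def relu_backward(x, grad_y):
    # local derivative: 1 where x > 0, 0 elsewhere (0 at x == 0 by convention)
    local = (x > 0).astype(x.dtype)
    # element-wise rule: dL/dX = dL/dY (hadamard) local derivative
    return grad_y * local

x = np.array([[-1.0, 2.0],
              [0.0, 3.0]])
grad_y = np.full_like(x, 0.5)   # pretend upstream gradient dL/dY

y = relu_forward(x)
grad_x = relu_backward(x, grad_y)

print(y)        # negatives and zero become 0, positives are kept
print(grad_x)   # 0.5 where x > 0, 0 everywhere else
```

notice that the backward pass needs the input x (or at least the mask of where it was positive), which is why frameworks keep activations around after the forward pass.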

example 2

Residual Connection Y = X + Z

dY/dX = dY/dZ = 1

so the gradient flows unchanged from Y to both X and Z, that is dL/dX = dL/dZ = dL/dY
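as a quick sketch, the backward pass of an addition just hands the same upstream gradient to both inputs. add_backward is a hypothetical helper name and the numbers are made up.

```python
import numpy as np

def add_backward(grad_y):
    # Y = X + Z: the local derivative is 1 for both inputs,
    # so X and Z each receive dL/dY unchanged
    return grad_y, grad_y

grad_y = np.array([[0.1, -0.2],
                   [0.3, 0.4]])   # pretend upstream gradient dL/dY

grad_x, grad_z = add_backward(grad_y)

print(grad_x)
print(grad_z)
```

this is the reason residual connections are so friendly to training: the gradient reaches the earlier layers without being scaled by anything on the skip path.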

note: we talked about relu above and only calculated dY/dX and not dY/dW because ReLU does not have any parameters to tune. if we were using GeGLU, which has learnable projection weights, we would also have calculated dY/dW because we want its parameters to learn too.

now that we have completed activation functions, let's move on to matrix multiplications

Y = XW

we want to compute two things: dL/dX and dL/dW

so here are the formulas, please memorize

\(\frac{\partial L}{\partial W} = X^{\top} \cdot \frac{\partial L}{\partial Y}\)

\(\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \cdot W^{\top}\)
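a hedged numpy sketch to check both formulas, along with Rule 1 about shapes. the sizes, the random inputs and the toy loss are my own assumptions: the loss is picked so that dL/dY is exactly a known matrix G, and numerical_grad is a hypothetical helper that estimates gradients by finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # (batch, d_in)
W = rng.normal(size=(3, 2))   # (d_in, d_out)
G = rng.normal(size=(4, 2))   # stands in for dL/dY, same shape as Y = XW

def loss(X, W):
    # toy loss chosen so that dL/dY is exactly G
    return np.sum((X @ W) * G)

# the two formulas from above
grad_W = X.T @ G   # (3, 4) @ (4, 2) -> (3, 2), same shape as W
grad_X = G @ W.T   # (4, 2) @ (2, 3) -> (4, 3), same shape as X

def numerical_grad(f, A, eps=1e-6):
    # central finite differences, perturbing one entry of A at a time
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + eps
        plus = f()
        A[idx] = old - eps
        minus = f()
        A[idx] = old   # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

num_W = numerical_grad(lambda: loss(X, W), W)
num_X = numerical_grad(lambda: loss(X, W), X)

print(np.allclose(grad_W, num_W, atol=1e-5))
print(np.allclose(grad_X, num_X, atol=1e-5))
```

a trick for remembering where the transpose goes: it is the only arrangement where the shapes line up so that dL/dW comes out shaped like W and dL/dX comes out shaped like X.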

you are now set up to calculate most of the gradients in a transformer

but let's cover one special hard case: the gradient of the loss wrt the logits

Softmax

z = logits (shape: b, t, v, where v is the vocab size)

p = softmax(z) (probabilities, same shape as z)

L = -log(p_correct_token) (the loss for one token)

deriving the gradient through softmax is messy (lots of jacobian matrices), but for softmax followed by this loss the final result is elegantly simple

dL/dZ = P - Y (Y is the one-hot vector of the correct token)
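here is a small numpy sketch that checks this result for a single token. the logits, the vocab size of 4 and the target index are made-up values, and the finite-difference loop is only there as a sanity check.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(z, target):
    # loss for one token: -log of the probability of the correct token
    return -np.log(softmax(z)[target])

z = np.array([2.0, 1.0, 0.1, -1.0])   # logits for one token, v = 4 (assumed)
target = 1                            # index of the correct token (assumed)

p = softmax(z)
y = np.zeros_like(z)
y[target] = 1.0                       # one-hot vector
grad_z = p - y                        # the formula from above

# finite-difference check, one logit at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    z_plus = z.copy()
    z_plus[i] += eps
    z_minus = z.copy()
    z_minus[i] -= eps
    numeric[i] = (cross_entropy(z_plus, target) - cross_entropy(z_minus, target)) / (2 * eps)

print(np.allclose(grad_z, numeric, atol=1e-6))
```

one thing to watch out for: this is the gradient for a single token. if the loss is averaged over all b*t tokens, each token's P - Y also gets divided by the number of tokens.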

and from here it is all matrix multiplications and activation functions, which we have already covered. so no worries.

ayush’s Substack
