<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[ayush’s Substack]]></title><description><![CDATA[My personal Substack]]></description><link>https://goyalayus.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!fw-Q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9046c7a4-853e-4d19-b5ad-89484997b678_144x144.png</url><title>ayush’s Substack</title><link>https://goyalayus.substack.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 17 Jun 2026 19:08:32 GMT</lastBuildDate><atom:link href="https://goyalayus.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[ayush goyal]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[goyalayus@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[goyalayus@substack.com]]></itunes:email><itunes:name><![CDATA[ayush goyal]]></itunes:name></itunes:owner><itunes:author><![CDATA[ayush goyal]]></itunes:author><googleplay:owner><![CDATA[goyalayus@substack.com]]></googleplay:owner><googleplay:email><![CDATA[goyalayus@substack.com]]></googleplay:email><googleplay:author><![CDATA[ayush goyal]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Inference ]]></title><description><![CDATA[notes for myself]]></description><link>https://goyalayus.substack.com/p/inference</link><guid isPermaLink="false">https://goyalayus.substack.com/p/inference</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Wed, 10 Dec 2025 14:22:10 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!lMh_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>KV Caching </p><p>core logic (matrix dimensions):</p><p>goal: generate one new token at a time.<br>input: most recent token, shape [b, 1, n].<br>b: batch size, n: embedding dim, t: current sequence length.</p><p>step-by-step for token t+1:</p><ul><li><p>projection:</p><ul><li><p>input [b, 1, n] &#215; weight matrix W_qkv [n, 3n] (shared across the batch) &#8594; [b, 1, 3n]</p></li><li><p>split into:</p><ul><li><p>q_new [b, 1, n]</p></li><li><p>k_new [b, 1, n]</p></li><li><p>v_new [b, 1, n]</p></li></ul></li></ul></li><li><p>use cache:</p><ul><li><p>retrieve K_cache [b, t, n], V_cache [b, t, n]</p></li><li><p>append new:</p><ul><li><p>K_total = concat(K_cache, k_new) &#8594; [b, t+1, n]</p></li><li><p>V_total = concat(V_cache, v_new) &#8594; [b, t+1, n]</p></li></ul></li></ul></li><li><p>attention calculation:</p><ul><li><p>scores: q_new @ K_total^T &#8594; [b, 1, t+1]</p></li><li><p>weights: softmax(scores / sqrt(n)) &#8594; [b, 1, t+1]</p></li><li><p>output: weights @ V_total &#8594; [b, 1, n]</p></li></ul></li></ul><p>result: output [b, 1, n], passed to next layer for next token prediction.</p><p></p><blockquote><p>if you ever forget why we can cache k and v, and why that is equivalent to recomputing full attention, remember the row-wise view of matrix multiplication: each output row of q @ k^t depends only on its own query row. the new token&#8217;s row needs only q_new plus all keys and values, and with causal masking the k and v of earlier tokens never change.</p></blockquote><p></p><div><hr></div><h3><strong>Arithmetic Intensity Analysis of Transformer Operations (on NVIDIA H100)</strong></h3><p>    Machine Balance (AI_knee): ~295 FLOPs / Byte</p><p>    If AI &gt; 295: Compute-Bound (Limited by 989 TFLOP/s)</p><p>    If AI &lt; 295: Memory-Bound (Limited by 3.35 TB/s)</p><p>    B: Batch Size</p><p>    S: Sequence Length</p><p>    D: Model Hidden Dimension</p><p>    F: MLP Intermediate Dimension (typically 
~4D)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lMh_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lMh_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png 424w, https://substackcdn.com/image/fetch/$s_!lMh_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png 848w, https://substackcdn.com/image/fetch/$s_!lMh_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png 1272w, https://substackcdn.com/image/fetch/$s_!lMh_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lMh_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png" width="1456" height="529" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:529,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:95846,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/181160799?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lMh_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png 424w, https://substackcdn.com/image/fetch/$s_!lMh_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png 848w, https://substackcdn.com/image/fetch/$s_!lMh_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png 1272w, https://substackcdn.com/image/fetch/$s_!lMh_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c3e6297-9698-41b8-97a5-02d969fe7514_1558x566.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>kv cache memory calculation</p><p>calculation per layer:</p><ul><li><p>memory for keys: b &#215; s &#215; h &#215; d_k &#215; 2 bytes</p></li><li><p>memory for values: b &#215; s &#215; h &#215; d_k &#215; 2 bytes</p></li><li><p>total per layer: b &#215; s &#215; h &#215; 2 &#215; d_k &#215; 2 bytes = b &#215; s &#215; 2 &#215; d &#215; 2 bytes (since h &#215; d_k = d)</p></li></ul><p>total for entire model:</p><ul><li><p>kv cache size (bytes) = b &#215; s &#215; l &#215; 2 &#215; d &#215; 2</p></li></ul><p>example: llama 2 13b</p><ul><li><p>l = 40 layers, d = 5120, b = 1, s = 4096</p></li><li><p>cache size = 1 &#215; 4096 &#215; 40 &#215; 2 &#215; 5120 &#215; 2 = 3,355,443,200 bytes &#8776; 3.36 gb</p></li><li><p>this is in addition to ~26 gb for model weights</p></li></ul><p>kv cache size is proportional to the 
number of key and value vectors stored. standard mha is inefficient.</p><div><hr></div><p>revisiting multi-head attention (mha)</p><p><br>structure: in mha, the model&#8217;s hidden dimension d is split among n attention heads. each head operates independently with its own query, key, and value projection weights (w_q, w_k, w_v).<br></p><p>implication for kv cache: if a model has n query heads, it also has n key heads and n value heads. for every token, we must compute and store n distinct key vectors and n distinct value vectors.<br></p><p>visualization:<br>query heads: [q1] [q2] [q3] [q4] [q5] [q6] [q7] [q8]<br>key heads: [k1] [k2] [k3] [k4] [k5] [k6] [k7] [k8]<br>value heads: [v1] [v2] [v3] [v4] [v5] [v6] [v7] [v8]<br></p><p>the number of k/v heads is equal to the number of q heads.</p><p><strong>multi-query attention (mqa)</strong><br>mqa is based on a simple observation: the model might not need the full expressive power of n distinct key and value heads.<br></p><p>structure: mqa maintains n query heads but uses only a single key head and a single value head.</p><p>these single k/v heads are shared across all n query heads.<br>implication for kv cache: for every token, we compute n q vectors, but only one k vector and one v vector.<br></p><p>visualization:<br>query heads: [q1] [q2] [q3] [q4] [q5] [q6] [q7] [q8]<br>key head: +-------------------[ k ]-------------------+<br>value head: +-------------------[ v ]-------------------+<br></p><p>benefit: the size of the kv cache is reduced by a factor of n (the number of heads). for a model with 40 heads, this is a 40x reduction in cache size and a 40x reduction in the amount of data that needs to be read from hbm during the decode step&#8217;s attention calculation. this directly improves tpot (time-per-output-token).</p><p><strong>grouped-query attention (gqa)</strong><br>gqa is an interpolation between the extremes of mha and mqa. 
it recognizes that while mqa offers huge savings, the quality degradation from sharing a single k/v head might be too severe.<br></p><p>structure: gqa maintains n query heads, but groups them. each group of g query heads shares a single key/value head pair. the total number of k/v heads is n/g.<br>implication for kv cache: it offers a tunable trade-off.<br></p><p>if g=1, you have n/1 = n k/v heads, which is identical to mha.<br>if g=n, you have n/n = 1 k/v head, which is identical to mqa.<br></p><p>visualization (n=8, g=4):<br>query heads: [q1] [q2] [q3] [q4] | [q5] [q6] [q7] [q8]<br>key heads: +------[ k1 ]------+ | +------[ k2 ]------+<br>value heads: +------[ v1 ]------+ | +------[ v2 ]------+<br></p><p>benefit: reduces kv cache size by a factor of g relative to mha. llama 2 70b, for instance, uses gqa (8 k/v heads for 64 query heads) to manage its kv cache size while maintaining high quality; the smaller 7b and 13b models keep standard mha.<br></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8zdP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc290f1f2-f789-416f-a7b7-d8abbba6a95d_835x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8zdP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc290f1f2-f789-416f-a7b7-d8abbba6a95d_835x512.png 424w, https://substackcdn.com/image/fetch/$s_!8zdP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc290f1f2-f789-416f-a7b7-d8abbba6a95d_835x512.png 848w, https://substackcdn.com/image/fetch/$s_!8zdP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc290f1f2-f789-416f-a7b7-d8abbba6a95d_835x512.png 1272w, 
https://substackcdn.com/image/fetch/$s_!8zdP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc290f1f2-f789-416f-a7b7-d8abbba6a95d_835x512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8zdP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc290f1f2-f789-416f-a7b7-d8abbba6a95d_835x512.png" width="835" height="512" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c290f1f2-f789-416f-a7b7-d8abbba6a95d_835x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:512,&quot;width&quot;:835,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57274,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/181160799?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc290f1f2-f789-416f-a7b7-d8abbba6a95d_835x512.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8zdP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc290f1f2-f789-416f-a7b7-d8abbba6a95d_835x512.png 424w, https://substackcdn.com/image/fetch/$s_!8zdP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc290f1f2-f789-416f-a7b7-d8abbba6a95d_835x512.png 848w, https://substackcdn.com/image/fetch/$s_!8zdP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc290f1f2-f789-416f-a7b7-d8abbba6a95d_835x512.png 
1272w, https://substackcdn.com/image/fetch/$s_!8zdP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc290f1f2-f789-416f-a7b7-d8abbba6a95d_835x512.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>multi-head latent attention (mla)</strong><br>mla (used in the deepseek-v2 model) takes a different approach to dimension reduction. 
instead of reducing the number of k/v heads, it reduces the dimension of each k/v vector.<br>structure: mla projects the full-dimension key and value vectors (n_h &#215; d_k, the number of heads times the per-head dimension) down to a much smaller compressed or &#8220;latent&#8221; dimension, c, using a compression matrix.<br>example (deepseek-v2):<br>full k/v dimension: 16384<br>compressed latent dimension c: 512<br>cache reduction: 16384 / 512 = 32x</p><p>the data flow for mla is as follows:</p><ul><li><p>input: token embedding x, shape (1, d)</p></li><li><p>k-projection: x is multiplied by the key weight matrix w_k to get the full-dimensional key k_full (1, d)</p></li><li><p>compression: k_full is then multiplied by a new compression matrix w_compress (d, c) to get the compressed key k_latent (1, c)</p></li><li><p>storage: only k_latent is stored in the kv cache; k_full is discarded</p></li><li><p>attention: when a new query q_full is generated, it is also compressed using w_compress to get q_latent (1, c)</p></li><li><p>scores: attention scores are computed in the latent space: q_latent @ k_cache_latent^t</p></li><li><p>decompression: the attention output (c-dimensional) is projected back to the full model dimension d using a decompression matrix w_decompress (c, d)</p></li></ul><p>a wrinkle is that compression can interfere with positional encodings like rope, which work on the full-dimensional vectors. to handle this, deepseek-v2 caches an extra 64-dimensional rope-carrying key next to the latent, making the cached size per token 512 + 64 = 576. 
this allows efficient cache storage while preserving the ability to use rotary positional encodings.</p><p><strong>cross-layer attention (cla)</strong><br>the idea: gqa shares key/value vectors across heads in the same layer. cla extends this by sharing key/value vectors across multiple consecutive layers.<br>mechanism:</p><ul><li><p>layer 10 computes and caches its own k and v vectors.</p></li><li><p>layer 11 skips k/v computation and reuses the cache from layer 10 for its attention.</p></li><li><p>layer 12 does the same.</p></li></ul><p>benefit: memory savings are substantial: sharing one kv cache across 3 layers reduces kv cache memory by 3x.<br>trade-off: the model loses expressive power, as the sharing layers cannot learn their own specialized keys and values. the model must be trained from scratch with this architecture.</p><p><strong>local attention (sliding window attention)</strong><br>the idea: for many tasks, only the local context matters. long-range dependencies are often negligible.<br>mechanism: each token only attends to a fixed window of recent tokens (e.g., last 512 or 4096).<br>visualization:</p><p>full attention: token 8000 attends to all previous tokens.</p><p>local attention: token 8000 attends only to the last 512 tokens.<br>benefit - compute: the q @ k^t matrix multiplication shrinks from s &#215; s to s &#215; w scores for sequence length s and window w, speeding up prefill for long sequences.<br>benefit - cache: kv cache size is fixed (independent of sequence length). the oldest tokens are evicted as new ones arrive, enabling arbitrarily long sequences with a bounded cache.<br>cache size (local): b &#215; w &#215; l &#215; 2 &#215; d &#215; 2 bytes, where w is the window size<br>problem: the model becomes short-sighted: it cannot access distant context, hurting tasks that need long-range dependencies.<br>solution - hybrid layers: most layers use local attention, but every nth layer uses full attention. 
balances efficiency with long-range coherence.</p><blockquote><p>tell me what the dimensions of value would be, and how </p></blockquote><div><hr></div><p><strong>the static batching problem</strong></p><p>static batching (batched scheduling):</p><ul><li><p>the inference server collects b user requests, pads them to the length of the longest request, and processes them as one dense tensor.</p></li><li><p>processing starts only when the batch is full or a timeout is reached.</p></li></ul><p>example:<br>three requests:</p><ul><li><p>a: &#8220;the cat sat on the&#8221; (5)</p></li><li><p>b: &#8220;summarize this article&#8221; (3)</p></li><li><p>c: &#8220;translate to french: hello&#8221; (4)</p></li></ul><p>padded batch (p = padding token):</p><p><code>[ the,  cat,   sat,   on,  the   ]<br>[ summ, this, article, p,   p     ]<br>[ tran, to, french, hello, p      ]<br></code></p><p>forms a (b=3, s=5) tensor for efficient gpu processing.</p><p>inefficiencies of static batching:</p><p>a. throughput inefficiency (padding):</p><ul><li><p>the gpu computes on padding tokens, which is wasted computation.</p></li><li><p>in the example, 3/15 slots (20%) are padding. real workloads often see 50&#8211;70% waste.</p></li><li><p>reduces max throughput; wastes compute and kv cache memory.</p></li></ul><p>b. 
latency inefficiency (head-of-line blocking):</p><ul><li><p>short, fast requests wait for longer ones in the queue.</p></li><li><p>requests arriving early wait for the batch to fill or for long requests to finish.</p></li><li><p>consequences:</p><ul><li><p>high time-to-first-token (ttft): early requests delayed by 100&#8211;500ms.</p></li><li><p>poor fairness: fast requests wait behind slow ones, hurting user experience.</p></li></ul></li></ul><h3><strong>How padding really works</strong></h3><p><strong>the attention mask</strong></p><p>the attention mask is the key mechanism that prevents the model from attending to padding tokens.</p><ul><li><p>before softmax, a mask matrix is added to the attention scores (q @ k^t).</p></li><li><p>the mask contains 0 where attention is allowed, and a large negative number (e.g., -1e9) at padding positions.</p></li><li><p>after softmax, the attention weights for padding positions become effectively zero.</p></li><li><p>result: no token attends to padding tokens, so they are invisible to attention.</p></li></ul><p><strong>masking during loss</strong></p><p>example: batch of two sequences</p><ul><li><p>sequence 1: [the, cat, sat, &lt;eos&gt;] (length 4)</p></li><li><p>sequence 2: [hello, world, &lt;eos&gt;, &lt;pad&gt;] (length 4 after padding)</p></li></ul><p>prediction targets (y values)</p><ul><li><p>for each input, the target is the next token in its own sequence.</p></li><li><p>after &lt;eos&gt;, there is no meaningful target: the sequence is over.</p></li></ul><p>why masked loss is necessary</p><ul><li><p>the model produces logits at every position, even after &lt;eos&gt; and at &lt;pad&gt;.</p></li><li><p>if we don&#8217;t mask the loss at these positions, the model would be forced to predict something after the sequence ends.</p></li><li><p>this would teach nonsense patterns, corrupting language understanding.</p></li></ul><p>solution: the loss at positions whose input is &lt;eos&gt; or &lt;pad&gt; is masked out, so those positions contribute zero gradient. predicting &lt;eos&gt; as the final target is still trained.</p><ul><li><p>these positions are ignored 
during training.</p></li><li><p>the network does not learn from them, preserving correct language structure.</p></li></ul><blockquote><p>note: in static batching you cannot run prefill and decode together, because the single-token decode requests would have to be padded up to the length of the prefill requests, which is a massive amount of padding.</p></blockquote><p></p><p><strong>Continuous batching</strong> </p><p>instead of waiting for a pre-fixed batch of user requests, we schedule per forward pass (one iteration generates one token for each running request). within one forward pass the batch and sequence lengths do not change, but from one forward pass to the next they can change as requests arrive and finish. in a single forward pass we want many requests of both types, prefill and decode. a decode request has a sequence length of 1, but prefill requests have varying sequence lengths, so with a (batch, sequence, dimension) tensor even a single forward pass looks impossible.<br><br>you cannot build a tensor of shape (batch, varying sequence length, dimension) and do a matrix multiplication on it. so instead we flatten batch and sequence into one axis, i.e., concatenate the tokens of all requests.</p><p>suppose there is one prefill request of 500 tokens and 4 decode requests: the input becomes a matrix of size (504, n), and multiplying it with an (n, 4n) weight matrix is now possible.<br><br>now let&#8217;s look at the attention mechanism.</p><p>generating the q, k, v vectors of every token in parallel still works, so that is not an issue, but</p><p>q @ k^t is hard to do, because the number of keys (the sequence dimension of k) differs for every prefill and decode request, so we cannot batch them. 
so we have to run them in parallel, and doing that efficiently requires a custom cuda kernel.</p><p>writing that custom cuda kernel raises a memory-layout issue, and to solve it we introduce </p><h3><strong>paged attention</strong></h3><p>without pagedattention, each new request requires a contiguous block of gpu vram for its full kv cache:</p><ul><li><p>request a (max_len=1024): allocate 1024-token block</p></li><li><p>request b (max_len=2048): allocate 2048-token block</p></li><li><p>request c (max_len=1024): allocate another 1024-token block</p></li></ul><p>this causes two major problems:</p><p>internal fragmentation:</p><ul><li><p>if request a only uses 50 tokens, the other 974 token-slots are allocated but never used, so they are wasted.</p></li></ul><p>external fragmentation:</p><ul><li><p>when request b finishes, its 2048-token block is freed.</p></li><li><p>a new request d needs 2049 tokens. even if total free memory is enough, if no single contiguous block is large enough, the request fails.</p></li></ul><p>this fragmentation makes memory management inefficient and the scheduler complex.</p><p>pagedattention: the solution</p><p>pagedattention applies the idea of paging from os virtual memory:</p><ul><li><p>physical memory: gpu kv cache is split into many small, fixed-size blocks (pages), e.g., 16 tokens per block.</p></li><li><p>logical view: each sequence sees its kv cache as a continuous sequence.</p></li><li><p>page table: for each request, a lookup table maps logical blocks to physical blocks in memory.</p></li></ul><p>since any free block can serve any request, external fragmentation disappears, and internal fragmentation is limited to the last, partially filled block of each sequence.</p><div><hr></div><h3><strong>Speculative Sampling</strong></h3><p>step 1: drafting (fast &amp; sequential)</p><ul><li><p>a small, fast draft model (m_draft) runs autoregressively and generates k candidate tokens.</p></li><li><p>example: draft predicts [&#8220;the&#8221;, &#8220;quick&#8221;, &#8220;brown&#8221;, &#8220;fox&#8221;]. 
this is fast due to the model&#8217;s small size.</p></li></ul><p>step 2: verification (fast &amp; parallel)</p><ul><li><p>the large target model (m_target) takes the original context plus the entire draft as input.</p></li><li><p>one forward pass produces logits, and hence a probability distribution, for each draft position.</p></li></ul><p>step 3: acceptance/rejection (rejection sampling)</p><ul><li><p>compare draft and target probabilities token by token: a draft token x is accepted with probability min(1, p_target(x) / p_draft(x)).</p></li><li><p>token 1 (&#8220;the&#8221;): if p_target(&#8220;the&#8221;) is at least p_draft(&#8220;the&#8221;), it is always accepted; otherwise it is accepted with probability p_target / p_draft.</p></li><li><p>token 2 (&#8220;quick&#8221;): if the previous token was accepted, apply the same check to p_target(&#8220;quick&#8221; | &#8220;the&#8221;).</p></li><li><p>token 3 (&#8220;brown&#8221;): if the target puts most of its probability on &#8220;red&#8221;, &#8220;brown&#8221; is likely to be rejected.</p></li><li><p>chain breaks: any rejection discards the rest of the draft.</p></li></ul><p>step 4: correction and resumption</p><ul><li><p>keep accepted tokens (e.g., [&#8220;the&#8221;, &#8220;quick&#8221;]).</p></li><li><p>at the rejection point, sample a corrected token from the residual distribution, max(0, p_target - p_draft) renormalized (e.g., &#8220;red&#8221;). this residual step is what keeps the output distribution identical to the target model&#8217;s.</p></li><li><p>resume generation from the corrected token. 
draft model generates a new draft from there.</p></li></ul><p>performance gain</p><ul><li><p>one draft and one verification step can yield multiple tokens.</p></li><li><p>speedup depends on acceptance rate: better draft models = longer accepted sequences = greater speedup.</p></li><li><p>method is lossless: final sequence matches target model&#8217;s distribution exactly.</p></li><li><p>not an approximation&#8212;just a faster generation method.</p></li></ul><div><hr></div><h3>Distillation</h3><p>inputs: standard text dataset (e.g., &#8220;the quick brown...&#8221;).</p><p>forward pass (student):</p><ul><li><p>text is fed into the student model.</p></li><li><p>produces logits for next token (e.g., fox: 5.0, dog: 2.0, car: -10.0).</p></li></ul><p>forward pass (teacher):</p><ul><li><p>same text is fed into the frozen teacher model.</p></li><li><p>produces its own logits (e.g., fox: 10.0, dog: 8.0, car: -20.0).</p></li></ul><p>loss function (the &#8220;training&#8221; part):</p><ul><li><p>standard supervised learning: loss checks if model predicted the correct single word.</p></li><li><p>distillation: loss (kl-divergence) checks if student&#8217;s probability distribution matches teacher&#8217;s.</p></li><li><p>this is &#8220;curve fitting.&#8221;</p></li></ul><p>step 1: softmax with temperature</p><ul><li><p>divide logits by temperature t before softmax.</p></li><li><p>high t smooths the distribution, making small probabilities more visible.</p></li><li><p>p_student = softmax(student_logits / t)</p></li><li><p>p_teacher = softmax(teacher_logits / t)</p></li></ul><p>step 2: calculate divergence</p><ul><li><p>loss = kl_divergence(p_teacher, p_student)</p></li></ul><p>step 3: backpropagation</p><ul><li><p>gradients of loss are calculated w.r.t. 
student&#8217;s weights.</p></li><li><p>optimizer updates student to match teacher&#8217;s output curve.</p></li><li><p>teacher weights remain frozen.</p><div><hr></div></li></ul><p>in weight quantization, weights are stored in a low-precision format (e.g., int8 or int4), but for the calculations they are typically converted back to bf16</p><p>pruning removes parts of the model (individual weights, heads, or whole layers) based on how little they contribute, e.g., how active they are on a sample dataset</p>]]></content:encoded></item><item><title><![CDATA[BROADCASTING]]></title><description><![CDATA[notes for myself]]></description><link>https://goyalayus.substack.com/p/broadcasting</link><guid isPermaLink="false">https://goyalayus.substack.com/p/broadcasting</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Tue, 09 Dec 2025 14:19:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1wep!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adeeb19-871b-44b0-ad3e-2fe825178e49_720x744.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1wep!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adeeb19-871b-44b0-ad3e-2fe825178e49_720x744.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1wep!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adeeb19-871b-44b0-ad3e-2fe825178e49_720x744.jpeg 424w, https://substackcdn.com/image/fetch/$s_!1wep!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adeeb19-871b-44b0-ad3e-2fe825178e49_720x744.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!1wep!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adeeb19-871b-44b0-ad3e-2fe825178e49_720x744.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!1wep!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adeeb19-871b-44b0-ad3e-2fe825178e49_720x744.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1wep!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adeeb19-871b-44b0-ad3e-2fe825178e49_720x744.jpeg" width="720" height="744" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7adeeb19-871b-44b0-ad3e-2fe825178e49_720x744.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:744,&quot;width&quot;:720,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73613,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/181129003?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa6fe2e6-4e69-4e4f-83f4-f1716aa61ace_720x1600.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1wep!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adeeb19-871b-44b0-ad3e-2fe825178e49_720x744.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!1wep!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adeeb19-871b-44b0-ad3e-2fe825178e49_720x744.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1wep!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adeeb19-871b-44b0-ad3e-2fe825178e49_720x744.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!1wep!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adeeb19-871b-44b0-ad3e-2fe825178e49_720x744.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>given a contiguous (row-major) tensor with shape (d0, d1, &#8230;, d(n-1)), the stride s(k) for dimension k is the product of the sizes of all subsequent dimensions, i.e. how many elements you skip in storage when index k increases by 1.</p><p>for the last dimension (k = n-1):<br>s(n-1) = 1</p><p>for any other dimension k (counting backwards from n-2 down to 0):<br>s(k) = s(k+1) * d(k+1)</p><p>the explicit formula:<br>s(k) = product from j = k+1 to n-1 of d(j)</p><p><strong>The Memory Offset Formula (Linear Index)</strong></p><p>given a specific logical index (i0, i1, ..., i(n-1)) and strides (s0, s1, ..., s(n-1)), the offset is the sum over k of i(k) &#215; s(k):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{offset} = \\sum_{k=0}^{n-1} (i_k \\times s_k)\n&quot;,&quot;id&quot;:&quot;XQMDQSCDMX&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>contiguous vs non-contiguous memory:</strong></p><p>a tensor is contiguous if its elements are stored in memory in the order they&#8217;re traversed when incrementing indices from right to left (axis -1 moves fastest).</p><p>example:<br>matrix a, shape (2, 3), storage: a flat buffer of 6 elements, strides: (3, 1).<br>after transpose b = a.t(), shape (3, 2), storage unchanged, strides: (1, 3).<br>iterating along axis 1 of b now jumps by 3 in memory instead of 1, so the traversal is not sequential. this is non-contiguous.</p><p>why it matters:<br>operations like .view() need strides that are compatible with the new shape, which in practice usually means contiguous memory. on a non-contiguous tensor such as b, a call like b.view(6) fails. 
use .contiguous() to copy data into contiguous storage.</p><p><strong>Broadcasting is &#8220;Virtual&#8221; Memory</strong></p><p>broadcasting uses strides to repeat data without copying.<br>take a vector v, shape (1, 3), storage: 3 elements (the first one is 10), strides: (3, 1).<br>to broadcast to shape (4, 3), pytorch creates a view:<br>shape: (4, 3), strides: (0, 1).</p><p>analyze stride 0:<br>offset = (i0 &#215; s0) + (i1 &#215; s1)<br>for row 0, column 0: offset = (0 &#215; 0) + (0 &#215; 1) = 0 (value: 10)<br>for row 3, column 0: offset = (3 &#215; 0) + (0 &#215; 1) = 0 (value: 10)</p><p>stride 0 means moving along axis 0 doesn&#8217;t advance the pointer. the tensor appears to have 4 rows, but it re-reads the same row.<br>mental model: broadcasting isn&#8217;t copying or stretching. it sets the stride to 0 to freeze the pointer for that axis.</p><p></p><p><strong>checking if two tensors are compatible for broadcasting</strong></p><p>step 1 &#8594; write their shapes stacked vertically, right-aligned (so the last dimensions line up)</p><p>for each vertically stacked pair of dimensions, check that either they are equal, one of them is 1, or one of them does not exist</p><p>if any pair violates this, we cannot broadcast</p><p>for example</p><p>(1, 3, 2) and (2,): axis -1 is 2 vs 2, the other axes exist only in the first shape<br><br>matches, result (1, 3, 2)<br><br>(1, 3, 2) and (1, 2, 3): axis -1 is 2 vs 3<br><br>does not match<br><br>(3, 1, 2) and (4, 2): axis -1 is 2 vs 2, axis -2 is 1 vs 4 (the 1 expands), axis -3 exists only in the first shape<br><br>matches, result (3, 4, 2)</p><p></p><p><strong>one more way</strong></p><p>scenario a: original (5, 6)<br>target: (5, 6, 10)<br>input: (5, 6)<br>alignment (right-aligned):<br>ax-3 ax-2 ax-1<br>tgt: 5 6 10<br>inp: (missing) 5 6<br>result: crash. ax-1: 10 vs 6. mismatch.</p><p>scenario b: reshaped (5, 6, 1)<br>target: (5, 6, 10)<br>input: (5, 6, 1)<br>alignment:<br>ax-3 ax-2 ax-1<br>tgt: 5 6 10<br>inp: 5 6 1<br>result: compatible. ax-1: 10 vs 1. expansion.</p><p></p><blockquote><p>you can also broadcast a scalar: it behaves like a tensor whose every dimension is 1, so I won&#8217;t go into depth on that</p></blockquote><p></p><div><hr></div><p>the element-wise product (hadamard product)<br>symbol: * or torch.mul()<br>logic: pure broadcasting.<br>rule: align dimensions right-to-left. 1s expand. 
multiply cell-by-cell.<br>result shape: max size at each dimension.<br>example: (3, 4) * (3, 4) &#8594; (3, 4)<br>c(i,j) = a(i,j) &#215; b(i,j)</p><p>the matrix product (matmul)<br>symbol: @ or torch.matmul()<br>logic: linear algebra + broadcasting.<br>rule: split dimensions into batch (all except last two) and matrix (last two).</p><p>step 1: matrix compatibility<br>a: (..., n, k)<br>b: (..., k, m)<br>inner dimension k must match. result: (n, m)</p><p>step 2: batch compatibility<br>broadcast remaining dimensions (right-to-left, 1s expand).</p><p>example:<br>a: (3, 4), b: (4, 5)<br>a * b &#8594; error (4 vs 5 on the last axis)<br>a @ b &#8594; (3, 5)</p><p>high-dimensional matmul:<br>a: (10, 1, 3, 4)<br>b: (1, 20, 4, 5)<br>matrix core: (3, 4) @ (4, 5) &#8594; (3, 5)<br>batch: (10, 1) and (1, 20) &#8594; broadcast to (10, 20)<br>final shape: (10, 20, 3, 5)</p><div><hr></div><blockquote><p>note: we cannot matmul (10, 3, 5) with (10, 1, 8). the inner dimensions (5 vs 1) must match exactly, because broadcasting applies only to the batch dimensions, never to the last two (matrix) dimensions</p></blockquote><div><hr></div><p>case a: vector @ matrix<br>vector v: (4,)<br>matrix m: (4, 5)<br>hidden process:</p><ul><li><p>prepend 1: v becomes (1, 4)</p></li><li><p>multiply: (1, 4) @ (4, 5) &#8594; (1, 5)</p></li><li><p>squeeze: remove 1 &#8594; (5,)</p></li></ul><p>case b: matrix @ vector<br>matrix m: (3, 4)<br>vector v: (4,)<br>hidden process:</p><ul><li><p>append 1: v becomes (4, 1)</p></li><li><p>multiply: (3, 4) @ (4, 1) &#8594; (3, 1)</p></li><li><p>squeeze: remove 1 &#8594; (3,)</p></li></ul><div><hr></div><p>broadcasting is automatic shape manipulation. for manual geometry changes, use view, reshape, and permute.</p><p>permute(*dims): reorders axes. changes stride order, not data. result is usually non-contiguous. example: (a, b, c) &#8594; (c, a, b). you can think of it as a generalized version of transpose.</p><p>view(*shape): reshapes tensor without copying data. needs strides compatible with the new shape, which contiguous tensors always have. fails if memory is scrambled (e.g. 
after permute).</p><p>reshape(*shape): like view, but if the tensor&#8217;s layout does not allow a view, it silently copies the data into contiguous storage first. works whenever the element count matches, but may be slower.</p><div><hr></div><p>unsqueeze(dim): adds a dimension of size 1 at the specified axis. pure metadata operation, no data copied. used for explicit alignment, e.g., (3,) &#8594; (3, 1).</p><p>squeeze(dim=None): removes dimensions of size 1. if dim given, only removes that axis if size is 1. otherwise, removes all singleton dimensions. also a metadata-only operation.</p><p>expand(*shape): manually expands dimensions of size 1 to larger sizes, using stride 0, so no data is copied. new shape must match or expand from original, and only 1s can be expanded. expands are views, not copies.</p><p>expand vs repeat: expand is virtual (stride 0), repeat is physical (copies data).</p>]]></content:encoded></item><item><title><![CDATA[MY COMPLETE NOTES ON DISTRIBUTED TRAINING]]></title><description><![CDATA[these are just notes for myself, it might feel ai written because some part of it is.]]></description><link>https://goyalayus.substack.com/p/my-complete-notes-on-distributed</link><guid isPermaLink="false">https://goyalayus.substack.com/p/my-complete-notes-on-distributed</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Mon, 08 Dec 2025 11:05:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VFLu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3109820e-71d3-49d1-97d6-75c37d66d6f5_628x745.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>these are just notes for myself, it might feel ai written because some part of it is.</p><div><hr></div><p>intranode communication is faster than internode communication</p><div><hr></div><p><strong>Basic Operations</strong></p><p><strong>Broadcast</strong>: GPU 0 starts with the data (e.g., model weights) and sends a copy to GPU 1, 2, and 3 so that all GPUs end up 
with identical data.</p><p><strong>Scatter</strong>: GPU 0 starts with a list of data items [A, B, C, D], splits it into chunks, keeps A, sends B to GPU 1, C to GPU 2, and D to GPU 3, so the data is partitioned and each GPU owns a slice.</p><p><strong>Gather</strong>: Each GPU has a partial result (GPU 0: A, GPU 1: B, GPU 2: C, GPU 3: D), everyone sends their piece to GPU 0, and GPU 0 ends up with the full list [A, B, C, D] while others still only have their own piece.</p><p><strong>All-Gather</strong>: Each GPU starts with a piece (A, B, C, or D), everyone sends their piece to everyone else, and every GPU ends up with the full list [A, B, C, D], but this is expensive because it moves a lot of data.</p><p><strong>Reduce</strong>: Each GPU starts with a number (GPU 0: 1, GPU 1: 2, GPU 2: 3, GPU 3: 4), everyone sends their number to GPU 0, GPU 0 applies an operation like sum to get 1+2+3+4=10, so GPU 0 holds the reduced result (10) while others keep their original numbers.</p><blockquote><p>it&#8217;s basically a gather followed by a sum, avg, or max at the end. why does it get a separate name? because if we know the values will be reduced, partial results can be combined along the way (e.g. along a ring or a tree), which is much cheaper than collecting everything on one gpu first</p></blockquote><p><strong>All-Reduce</strong>: suppose you have vectors of size 4 on 4 separate gpus</p><p>gpu 1 &#8594; A0 A1 A2 A3<br>gpu 2 &#8594; B0 B1 B2 B3</p><p>gpu 3 &#8594; C0 C1 C2 C3<br>gpu 4 &#8594; D0 D1 D2 D3<br><br>now you want each gpu to have four elements [sigma a, sigma b, sigma c, sigma d].<br>each gpu first reduces its own vector locally, and then you perform an all-gather.</p><p>but what if you have</p><p>gpu 1 &#8594; A0 B0 C0 D0<br>gpu 2 &#8594; A1 B1 C1 D1</p><p>gpu 3 &#8594; A2 B2 C2 D2</p><p>gpu 4 &#8594; A3 B3 C3 D3</p><p>and you want each gpu to have four elements [sigma a, sigma b, sigma c, sigma d]</p><p>one approach is for gpu 1 to send B0, C0, D0 to the gpus responsible for those sums and to receive A1, A2, A3 from them,<br>and similarly for all other gpus, but this 
approach is inefficient, because every gpu talks to every other gpu at once, congesting the network.<br><br>alternate approach: ring all-reduce, whose first phase is a ring reduce-scatter</p><p>Initial State (Time = 0):</p><p>GPU 0: [A0, B0, C0, D0]<br>GPU 1: [A1, B1, C1, D1]<br>GPU 2: [A2, B2, C2, D2]<br>GPU 3: [A3, B3, C3, D3]</p><p>Step 1 of 3 (every GPU sends one chunk to the next GPU in the ring, which adds it to its own copy: GPU 0 sends D0, GPU 1 sends A1, GPU 2 sends B2, GPU 3 sends C3):</p><p>GPU 0: [A0, B0, C0+C3, D0]<br>GPU 1: [A1, B1, C1, D1+D0]<br>GPU 2: [A2+A1, B2, C2, D2]<br>GPU 3: [A3, B3+B2, C3, D3]</p><p>Step 2 of 3 (every GPU forwards the partial sum it just formed to the next GPU):</p><p>GPU 0: [A0, B0+B3+B2, C0+C3, D0]<br>GPU 1: [A1, B1, C1+C0+C3, D1+D0]<br>GPU 2: [A2+A1, B2, C2, D2+D1+D0]<br>GPU 3: [A3+A2+A1, B3+B2, C3, D3]</p><p>Step 3 of 3 (final step of Phase 1):</p><p>Send:<br>GPU 0 sends B0+B3+B2 to GPU 1.<br>GPU 1 sends C1+C0+C3 to GPU 2.<br>GPU 2 sends D2+D1+D0 to GPU 3.<br>GPU 3 sends A3+A2+A1 to GPU 0.</p><p>Receive and add:<br>GPU 0 receives A3+A2+A1 and adds it to A0. Result: A_sum.<br>GPU 1 receives B0+B3+B2 and adds it to B1. Result: B_sum.<br>GPU 2 receives C1+C0+C3 and adds it to C2. Result: C_sum.<br>GPU 3 receives D2+D1+D0 and adds it to D3. 
Result: D_sum.</p><p>End of Phase 1: <strong>Reduce-Scatter</strong> is complete.</p><p>GPU 0 now holds only A_sum.<br>GPU 1 now holds only B_sum.<br>GPU 2 now holds only C_sum.<br>GPU 3 now holds only D_sum.</p><p>The other chunks can be discarded.</p><p>or we can now perform an all-gather so that every gpu has all four sums. reduce-scatter followed by all-gather is called ring all-reduce.</p><div><hr></div><p><strong>Memory</strong></p><ul><li><p>Static memory has three parts: weights, gradients, and optimizer states.</p></li><li><p>With mixed precision (BF16 weights, Adam in FP32), you pay about 16 bytes per parameter:</p><ul><li><p>2 bytes: weights (BF16)</p></li><li><p>2 bytes: gradients (BF16)</p></li><li><p>12 bytes: Adam (FP32 master weights + momentum + variance)</p></li></ul></li><li><p>For a 7B model, that is roughly 112 GB, even though the raw weights are only about 14 GB.</p></li><li><p>Dynamic memory comes from activations saved for backprop.</p></li><li><p>Activation memory scales with sequence length, batch size, hidden size, and number of layers.</p></li><li><p>activation memory per layer &#8776; s&#8901;b&#8901;h&#8901;(constant), with s = sequence length, b = batch size, h = hidden size; multiply by the number of layers for the total.</p></li></ul><div><hr></div><h4>Activation Checkpointing</h4><p>A standard Transformer block (simplified) looks like this:</p><p>Input (x0) -&gt; [LayerNorm 1] -&gt; (x1) -&gt; [Self-Attention] -&gt; (x2) -&gt; [Residual Add: x0+x2] -&gt; (x3) -&gt; [LayerNorm 2] -&gt; (x4) -&gt; [MLP] -&gt; (x5) -&gt; [Residual Add: x3+x5] -&gt; Output (x6)</p><p>we store only the block boundaries (x0 and x6), discard the intermediates x1 to x5, and recompute them from x0 during backpropagation to save memory.</p><p>note: this is only for training. we do not do this in inference, where we cache k and v and delete everything else anyway.</p><div><hr></div><h3>Data Parallelism</h3><ul><li><p>Split batch: 32 images &#8594; 8 per GPU (GPUs 0&#8211;3).</p></li><li><p>Forward: Each GPU runs forward pass on its 8 images with its local copy of the model.</p></li><li><p>Backward: Each GPU computes gradients from its own 8 images 
(gradients differ across GPUs).</p></li><li><p>All-Reduce: Use All-Reduce on gradients to average them across GPUs so every GPU has the same Grad_avg.</p></li><li><p>Optimizer step: Each GPU updates its local weights with the same rule: W_new = W_old &#8722; lr &#215; Grad_avg, keeping all model replicas in sync.</p></li><li><p>Data Parallel (DDP) assumes each GPU can store a full copy of the model, its gradients, and optimizer states.</p></li><li><p>For big LLMs, static training memory is huge: a 7B model can need ~112 GB (weights + gradients + Adam states).</p></li><li><p>With DDP, every GPU must hold this full 112 GB copy.</p></li><li><p>Even if you have 4&#215;80GB GPUs, each individual GPU still can&#8217;t fit 112 GB, so DDP will run out of memory and crash.</p></li></ul><div><hr></div><h3>ZeRO Stage 2</h3><p>why stage two? stage one shards only the optimizer states, and stage two adds gradient sharding on top of that, so stage one is not worth studying separately. consider this particular setup</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VFLu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3109820e-71d3-49d1-97d6-75c37d66d6f5_628x745.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VFLu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3109820e-71d3-49d1-97d6-75c37d66d6f5_628x745.png 424w, https://substackcdn.com/image/fetch/$s_!VFLu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3109820e-71d3-49d1-97d6-75c37d66d6f5_628x745.png 848w, 
https://substackcdn.com/image/fetch/$s_!VFLu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3109820e-71d3-49d1-97d6-75c37d66d6f5_628x745.png 1272w, https://substackcdn.com/image/fetch/$s_!VFLu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3109820e-71d3-49d1-97d6-75c37d66d6f5_628x745.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VFLu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3109820e-71d3-49d1-97d6-75c37d66d6f5_628x745.png" width="628" height="745" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3109820e-71d3-49d1-97d6-75c37d66d6f5_628x745.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:745,&quot;width&quot;:628,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:578827,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/180771042?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02820dd0-0fd3-4c9a-a491-d91cd965a2ea_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VFLu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3109820e-71d3-49d1-97d6-75c37d66d6f5_628x745.png 424w, https://substackcdn.com/image/fetch/$s_!VFLu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3109820e-71d3-49d1-97d6-75c37d66d6f5_628x745.png 
848w, https://substackcdn.com/image/fetch/$s_!VFLu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3109820e-71d3-49d1-97d6-75c37d66d6f5_628x745.png 1272w, https://substackcdn.com/image/fetch/$s_!VFLu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3109820e-71d3-49d1-97d6-75c37d66d6f5_628x745.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>this is a two-layer neural net being trained on two separate gpus.</p><p>now notice that gradient 2 and optimizer 2 are 
missing from GPU 1 and vice versa. why? to save space. but then how will we train this?</p><p>look at the backward pass:<br>GPU 1 &#8594; we calculate gradients for both W1 and W2<br>GPU 0 &#8594; we calculate gradients for both W1 and W2</p><p>now we do a ring reduce-scatter, after which GPU 1 has the average gradient for W1 and GPU 0 has the average gradient for W2 (all other gradients are deleted, for example GPU 1 deletes its gradient for W2). each gpu updates its own weights with the optimizer state it holds, and then they share the updated weights (GPU 1 shares W1 and GPU 0 shares W2) through an all-gather.</p><p>basically, as soon as the gradients for W2 are calculated, they are reduced onto the gpu that owns W2 (a reduce, not a full all-reduce), and the copies of the W2 gradients on all other gpus are thrown away. the owning gpu uses the averaged gradient to update W2.<br><br>instead of storing all the gradients for all the weights, each gpu needs to hold the full gradients for only a single layer at a time, plus the averaged gradients for the shard it owns.</p><div><hr></div><p>what if we are training llama 70b? its weights need about 140 GB in bf16 (70B &#215; 2 bytes), which does not fit on a single GPU, so we have to distribute the weights too. but how?</p><h2>Stage-3 / FSDP</h2><p>the forward pass starts at layer i. the weights for layer i are split into pieces, with each gpu holding one piece. at this point, no gpu can do any calculation because each only has part of the weights.</p><p>the system then does an all-gather operation on the weights. each gpu sends its piece to all others. for a short time, every gpu has the complete set of weights in memory.</p><p>now, each gpu runs the forward pass using the full weights to compute the output.</p><p>as soon as the calculation is done, each gpu deletes the full weights from memory. they only keep their original small piece. this brings memory usage back down.</p><p>later, during the backward pass, the gradients need to be calculated for layer i. 
to do this, the full weights are needed again, but they were deleted.</p><p>so, the system does another all-gather on the weights. each gpu sends its piece again, and every gpu reconstructs the full weights in memory.</p><p>each gpu then computes the gradients using the full weights.</p><p>after this, the gradients are reduced and scattered. each gpu gets only the part of the gradients that matches its original weight shard.</p><p>finally, each gpu deletes the full weights and the full gradient vector from memory, keeping only its own small piece.</p><blockquote><p>the sharded master weights are typically stored in fp32 and cast to fp16/bf16 for transmission and calculation</p></blockquote><h3>Tensor Parallelism</h3><p>let&#8217;s just say that if we were doing fsdp to save memory, then we are doing tensor parallelism to split the compute of a single layer across gpus</p><blockquote><p>if you do not know how block matrix multiplication works check it out <a href="https://open.substack.com/pub/goyalayus/p/block-matrice-multiplication?r=2fa55s&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=false">here</a></p></blockquote><p>I want you to forget all about data parallelism for now; we are training on a single batch of data</p><p>the multi-layer perceptron (mlp) in a transformer is made of two linear layers with an activation function in between. for layer a, the math is y = x * a (a matrix product). the input x has size [1, h], and the weight matrix a has size [h, 4h]. the output y has size [1, 4h].</p><p>when using two gpus, the weight matrix a is split along the columns. gpu 0 stores the left half of a, called a_left, and gpu 1 stores the right half, called a_right. both gpus have the full input x. each gpu computes its part: gpu 0 calculates y0 = x * a_left, and gpu 1 calculates y1 = x * a_right. the result y is split between the two gpus, with each holding half.</p><p>next, the activation function gelu is applied. since gelu works on each number independently, each gpu applies it to its own part of y. 
gpu 0 gets y0&#8217; = gelu(y0), and gpu 1 gets y1&#8217; = gelu(y1). no communication is needed, and the output remains split.</p><p>for layer b, the math is z = y&#8217; * b. here, y&#8217; is split, with each gpu holding half. the weight matrix b has size [4h, h], and is split along the rows. gpu 0 stores the top half, b_top, and gpu 1 stores the bottom half, b_bottom. each gpu computes a partial result: gpu 0 calculates z_partial_0 = y0&#8217; * b_top, and gpu 1 calculates z_partial_1 = y1&#8217; * b_bottom. both partial results have size [1, h].</p><p>to get the final output z, the two partial results must be added together. this is done using all-reduce (sum). both gpus send their partial result to each other, add them up, and now both have the complete output z. after all-reduce, both gpus hold the same final tensor z.</p><p>this is why tensor parallelism is <strong>in practice kept on NVLink (intra-node)</strong>: its all-reduce is blocking, since the next layer cannot start until the sum is complete, so it needs the fastest link. in fsdp, by contrast, the all-gather for upcoming layers can run in the background while the current computation is going on.<br><br>how will this work with fsdp weight sharding?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ldcx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a70dd3f-b412-436c-b830-953c3ed1a76a_1071x508.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ldcx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a70dd3f-b412-436c-b830-953c3ed1a76a_1071x508.png 424w, 
https://substackcdn.com/image/fetch/$s_!ldcx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a70dd3f-b412-436c-b830-953c3ed1a76a_1071x508.png 848w, https://substackcdn.com/image/fetch/$s_!ldcx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a70dd3f-b412-436c-b830-953c3ed1a76a_1071x508.png 1272w, https://substackcdn.com/image/fetch/$s_!ldcx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a70dd3f-b412-436c-b830-953c3ed1a76a_1071x508.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ldcx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a70dd3f-b412-436c-b830-953c3ed1a76a_1071x508.png" width="1071" height="508" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a70dd3f-b412-436c-b830-953c3ed1a76a_1071x508.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:508,&quot;width&quot;:1071,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42222,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/180771042?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a70dd3f-b412-436c-b830-953c3ed1a76a_1071x508.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ldcx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a70dd3f-b412-436c-b830-953c3ed1a76a_1071x508.png 424w, https://substackcdn.com/image/fetch/$s_!ldcx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a70dd3f-b412-436c-b830-953c3ed1a76a_1071x508.png 848w, https://substackcdn.com/image/fetch/$s_!ldcx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a70dd3f-b412-436c-b830-953c3ed1a76a_1071x508.png 1272w, https://substackcdn.com/image/fetch/$s_!ldcx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a70dd3f-b412-436c-b830-953c3ed1a76a_1071x508.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><br></p><div><hr></div><p><strong>now let&#8217;s see how we can combine tensor parallelism and fsdp</strong></p><p>the hardware is set up as a 2d grid: 4 nodes, each with 8 gpus. inside each node, gpus are connected by nvlink (intra-node). nodes are connected by ethernet or infiniband (inter-node). tensor parallelism (tp) runs horizontally (inside a node), and fsdp (data parallelism) runs vertically (across nodes).</p><p>a weight matrix w is split in two steps. first, tensor parallelism splits w into 8 vertical column blocks, called wtp0 to wtp7. in pure tp, each gpu in a node would hold one full column block. but with fsdp, each column block is further split across the 4 nodes. for example, wtp0 is split into 4 pieces: node 1 gpu 0 holds chunk 1, node 2 gpu 0 holds chunk 2, and so on. each gpu ends up with 1/32 of the total weights.</p><p>for the forward pass, each gpu must compute its part of the matrix multiply. say node 1 gpu 0 wants to compute ylocal = x * wtp0. but it only has one chunk of wtp0. so, it does an all-gather across the vertical (inter-node) axis: node 1 gpu 0 talks to node 2 gpu 0, node 3 gpu 0, and node 4 gpu 0 to get all chunks of wtp0. now it has the full wtp0 in memory.</p><p>next, it computes ylocal = x * wtp0 locally, which is one column slice of the layer output. each gpu in node 1 does the same for its own column block. as in the mlp example above, these slices feed the following row-split layer, where each gpu produces a partial result, and those partial results are summed with an all-reduce horizontally (intra-node) over nvlink. 
now every gpu in node 1 has the full output z.</p><h3>Sequence Parallelism</h3><p><strong>The Problem</strong></p><p>each gpu holds a full copy of the activation tensor after the row-parallel linear layer, even though each gpu only computed a partial sum before the all-reduce.</p><p>with sequence length 1,000,000, hidden size 8,192, and bf16 (2 bytes per value), the tensor shape is [1,000,000, 8,192] and takes about 16 gb per gpu.</p><p>after all-reduce, every gpu has the same final tensor z_total, so all 8 gpus store identical 16 gb copies. total memory used is 128 gb, but only 16 gb is unique data.</p><p>for layernorm and dropout, each gpu processes its own copy of z_total. the work is not parallelized; every gpu does the same math and produces the same result.</p><p>this wastes both memory and compute, since the same operations are repeated on every gpu instead of being split across them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cEVW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417375dd-c1bb-4fa0-bc58-205f19b40803_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cEVW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417375dd-c1bb-4fa0-bc58-205f19b40803_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!cEVW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417375dd-c1bb-4fa0-bc58-205f19b40803_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!cEVW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417375dd-c1bb-4fa0-bc58-205f19b40803_2816x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cEVW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417375dd-c1bb-4fa0-bc58-205f19b40803_2816x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cEVW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417375dd-c1bb-4fa0-bc58-205f19b40803_2816x1536.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/417375dd-c1bb-4fa0-bc58-205f19b40803_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5121815,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/180771042?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417375dd-c1bb-4fa0-bc58-205f19b40803_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cEVW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417375dd-c1bb-4fa0-bc58-205f19b40803_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!cEVW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417375dd-c1bb-4fa0-bc58-205f19b40803_2816x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!cEVW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417375dd-c1bb-4fa0-bc58-205f19b40803_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!cEVW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417375dd-c1bb-4fa0-bc58-205f19b40803_2816x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><br><strong>The Solution</strong></p><p>at this stage, every gpu has a partial sum tensor of size 16 gb, matching the full sequence 
length.</p><p>to avoid storing 16 gb on every gpu, use reduce-scatter instead of all-reduce. reduce-scatter sums the partials and then splits the result along the sequence dimension, giving each gpu only 1/8th of the tokens. each gpu now holds a correct, fully summed shard of 2 gb.</p><p>for layernorm, each gpu processes its own shard of 125,000 tokens. since layernorm acts per token, each gpu can compute its part independently, keeping memory at 2 gb per gpu.</p><p>when the next layer needs the full sequence, call all-gather. each gpu sends its shard to all others, temporarily reconstructing the full 16 gb tensor. after the matrix multiply, discard the full tensor and return to the sharded state.</p><blockquote><p>fsdp does not shard the layernorm activations, so sequence parallelism is its own technique: it works alongside tp the same way tp and fsdp work together</p></blockquote><div><hr></div><h3>Pipeline Parallelism</h3><p>until now we have been sharding the weights, gradients, etc. within each layer, but every gpu still has to compute all the layers.</p><p>in pipeline parallelism we split the model by depth instead: each gpu (a stage) holds its own group of consecutive layers.</p><p>we then run micro-batches through the stages, accumulate their gradients, and update the weights after the full batch has been completed. 
(doing a reduce-scatter after every micro-batch would be very inefficient)</p><p>when you combine all of these, you get 3d parallelism.</p><p>let&#8217;s see it with an example. our setup:</p><ol><li><p>2 Stages: S0, S1 (to keep it minimal).</p></li><li><p>3 Micro-batches: mb1, mb2, mb3.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mNBH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad0cf91-84d6-459e-b98e-b0460ca6d26a_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mNBH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad0cf91-84d6-459e-b98e-b0460ca6d26a_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!mNBH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad0cf91-84d6-459e-b98e-b0460ca6d26a_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!mNBH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad0cf91-84d6-459e-b98e-b0460ca6d26a_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!mNBH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad0cf91-84d6-459e-b98e-b0460ca6d26a_2816x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mNBH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad0cf91-84d6-459e-b98e-b0460ca6d26a_2816x1536.png" width="728" height="397" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ad0cf91-84d6-459e-b98e-b0460ca6d26a_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:5480574,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/180771042?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad0cf91-84d6-459e-b98e-b0460ca6d26a_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mNBH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad0cf91-84d6-459e-b98e-b0460ca6d26a_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!mNBH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad0cf91-84d6-459e-b98e-b0460ca6d26a_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!mNBH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad0cf91-84d6-459e-b98e-b0460ca6d26a_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!mNBH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ad0cf91-84d6-459e-b98e-b0460ca6d26a_2816x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><blockquote><p>with this you can imagine how all the things we talked about work together to create 3d parallelism, 
combining fsdp, tensor parallel, sequence parallel and pipeline parallel</p></blockquote>]]></content:encoded></item><item><title><![CDATA[Block Matrice Multiplication]]></title><description><![CDATA[notes for myself]]></description><link>https://goyalayus.substack.com/p/block-matrice-multiplication</link><guid isPermaLink="false">https://goyalayus.substack.com/p/block-matrice-multiplication</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Sun, 07 Dec 2025 12:48:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fw-Q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9046c7a4-853e-4d19-b5ad-89484997b678_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>notes for myself</p><p>suppose you want to compute C = AB</p><p>if A is split into an m x n grid of blocks, then B must be split into an n x p grid of blocks</p><p>C then has an m x p grid of blocks</p><p>and each block of C is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;C_{ij} = \\sum_{k=1}^{n} A_{ik} B_{kj}\n&quot;,&quot;id&quot;:&quot;IQSIJOPSYU&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>A_ik is a block. think of it as treating blocks as numbers and computing exactly as we do in ordinary matrix multiplication<br><br>very very important:<br>in every product A_ik B_kj the two blocks must have compatible shapes: if A_ik is a x b, then B_kj must be b x c. 
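</p><p>a minimal sketch of the block rule in plain python (not from the original notes; the helper names matmul, add, block_matmul and assemble are made up for illustration). it splits a 3 x 3 A and a 3 x 2 B into uneven blocks and checks that multiplying block by block gives the same C as the ordinary product:</p>
<pre>
```python
def matmul(a, b):
    """Ordinary matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def block_matmul(a_blocks, b_blocks):
    """C_ij = sum over k of A_ik B_kj, treating every block like a number."""
    m, n, p = len(a_blocks), len(b_blocks), len(b_blocks[0])
    c_blocks = []
    for i in range(m):
        row = []
        for j in range(p):
            acc = matmul(a_blocks[i][0], b_blocks[0][j])
            for k in range(1, n):
                acc = add(acc, matmul(a_blocks[i][k], b_blocks[k][j]))
            row.append(acc)
        c_blocks.append(row)
    return c_blocks


def assemble(blocks):
    """Glue a grid of blocks back into one flat matrix."""
    rows = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            rows.append([x for blk in block_row for x in blk[r]])
    return rows


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0], [0, 1], [2, 3]]

# A as a 2 x 2 grid: rows split 1|2, columns split 2|1 (blocks are not the same shape).
A_blocks = [[[[1, 2]], [[3]]],
            [[[4, 5], [7, 8]], [[6], [9]]]]
# B as a 2 x 2 grid: rows split 2|1 (must match A's column split), columns split 1|1.
B_blocks = [[[[1], [0]], [[0], [1]]],
            [[[2]], [[3]]]]

print(assemble(block_matmul(A_blocks, B_blocks)) == matmul(A, B))
```
</pre>
<p>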
the blocks need not be identical in shape.<br><br></p>]]></content:encoded></item><item><title><![CDATA[COMPUTING GRADIENTS FOR THE SAKE OF IT]]></title><description><![CDATA[.]]></description><link>https://goyalayus.substack.com/p/computing-gradients-for-the-sake</link><guid isPermaLink="false">https://goyalayus.substack.com/p/computing-gradients-for-the-sake</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Sun, 07 Dec 2025 07:36:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fw-Q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9046c7a4-853e-4d19-b5ad-89484997b678_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>i always struggled to remember the formulas for the gradient of a particular layer in the transformer. recently, while studying distributed training, i realized that i cannot move forward without a crystal clear understanding of how gradients flow through the network</p><p>so here is a blog which will teach you all you need to be a wizard of gradients</p><p>Rule 1</p><p>suppose we want to find the gradient dL/dW, where L is the loss and W is a weight matrix. the rule says that the shape of dL/dW is always equal to the shape of W. memorize this.</p><div><hr></div><p>x&#8594;[YOUR LAYER (f)]&#8594;y&#8594;[&#8230;Rest of Network&#8230;]&#8594;L</p><p>clearly x is the input here, f(x) is the layer about which we care, wow it rhymes</p><p>y is the output of the layer, y = f(x) and L is the final loss.</p><p>what we are interested in is dL/dX and dL/dF, where F can be a weight matrix or some other function with learnable parameters</p><p>Memorize</p><p>dL/dX = dL/dY * dY/dX</p><div><hr></div><p>generally f is one of two types: a weight matrix multiplication or an element-wise operation. 
we are going to look at <strong>element-wise</strong> operations for now</p><p>an &#8220;element-wise&#8221; operation means the math happens to each number in the matrix independently. none of the numbers &#8220;talk&#8221; to each other.<br></p><p>Examples:</p><p>Y = X + Z<br>(Matrix Addition)</p><p>Y = ReLU(X)<br>(Activation)</p><p>Y = X^2<br>(Square every element)</p><p>Rule for Element-Wise Operations (memorize this)</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\frac{\\partial L}{\\partial X}\n=\n\\frac{\\partial L}{\\partial Y}\n\\odot \\text{LocalDerivative}(X)\n\n&quot;,&quot;id&quot;:&quot;RROHXURFNC&quot;}" data-component-name="LatexBlockToDOM"></div><p>you might be wondering what the circle with a dot inside it means</p><p>that is the hadamard product (element-wise multiplication). it means: multiply the top-left of a with the top-left of b, top-right with top-right, etc. no fancy row-column dot products here. just simple multiplication.</p><div><hr></div><p>example 1 </p><p>Y = ReLU(x)  (rule: if x&gt;0, keep it. if x&#8804;0, set to 0)</p><p>so dY/dX = 1 if x&gt;0, and 0 if x&#8804;0</p><p>what does this mean philosophically? 
if x&gt;0, pass my gradients (y speaking) through as they are; if not, stop and set the gradients to 0 (do not change those weights, they did not contribute to the loss)</p><p>example 2</p><p>Residual Connection Y = X + Z </p><p>dY/dX = dY/dZ = 1</p><p>so the gradients flow unchanged from Y to both X and Z</p><p>note: we talked about relu above and only calculated dY/dX and not dY/dW because ReLU does not have any parameters to tune. if we were using GeGLU we would have also calculated dY/dW, because we also want its parameters to learn.</p><div><hr></div><p>now that we have completed activation functions, let&#8217;s move on to matrix multiplications</p><p>Y = XW</p><p>we want to learn two things: dL/dX and dL/dW</p><p>so here are the formulas, please memorize </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial L}{\\partial W}\n=\nX^{\\top} \\cdot \\frac{\\partial L}{\\partial Y}&quot;,&quot;id&quot;:&quot;UHBJTXXJQW&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial L}{\\partial X} = \\frac{\\partial L}{\\partial Y} \\cdot W^{\\top}&quot;,&quot;id&quot;:&quot;QHLMQFFKCH&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>you are now all set for calculating any gradient in transformers </p><p>but let&#8217;s cover one special hard case: the gradient of the loss wrt the logits</p><div><hr></div><p>Softmax</p><p>z = logits<br>(shape: b, t, v where v is vocab size).</p><p>p = softmax(z)<br>(probabilities).</p><p>l = -log(p_correct_token).</p><p>deriving the softmax gradient is messy (lots of jacobian matrices), but the final result is elegantly simple</p><p>dL/dZ = P - Y, where Y is the one-hot vector of the correct token</p><p>and from here it is all matrix multiplications and activation functions, which we have already covered. 
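</p><p>the softmax result can be sanity-checked numerically. here is a minimal sketch in plain python (not from the original post; the function names and the example logits are made up for illustration) that compares p minus one-hot against a central finite-difference estimate for the logits of a single token:</p>
<pre>
```python
import math


def softmax(z):
    """Numerically stable softmax over one vector of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss(z, target):
    """Cross-entropy for one token: -log(p_correct_token)."""
    return -math.log(softmax(z)[target])


def analytic_grad(z, target):
    """dL/dZ = P - Y, with Y the one-hot vector of the correct token."""
    p = softmax(z)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]


def numeric_grad(z, target, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grads = []
    for i in range(len(z)):
        up = list(z)
        down = list(z)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, target) - loss(down, target)) / (2 * eps))
    return grads


z = [2.0, -1.0, 0.5, 0.0]   # example logits for a vocab of size 4
target = 2                  # index of the correct token

max_gap = max(abs(a - n)
              for a, n in zip(analytic_grad(z, target), numeric_grad(z, target)))
print(max_gap)
```
</pre>
<p>the printed gap should be tiny, limited only by floating point and finite-difference error. 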
so no worries.</p>]]></content:encoded></item><item><title><![CDATA[CUDA Kernels Zero to One]]></title><description><![CDATA[learn basics of gpu programming and writing custom cuda kernels]]></description><link>https://goyalayus.substack.com/p/speak-fluent-cuda</link><guid isPermaLink="false">https://goyalayus.substack.com/p/speak-fluent-cuda</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Sat, 29 Nov 2025 04:57:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!st9M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Throughput and Latency</strong></p><p>throughput is how many operations we can do per second, and latency is how long one operation takes. a cpu optimizes for latency, while a gpu optimizes for throughput</p><p>the two primary components on a chip are ALUs (units of computation) and Control/Caches.</p><p>a cpu spends its chip area on control and caches to decrease latency, while a gpu spends it on ALUs to increase throughput</p><p>The NVIDIA A100 GPU has <strong>6,912</strong> FP32 ALUs</p><p>Because the GPU gives up most of the complex caches and control logic, individual GPU threads are actually <strong>slower</strong> (higher latency) than CPU threads. If a GPU thread needs to fetch data from memory, it waits a long time (relative to the clock speed). but the gpu hides this latency with parallelism: while some threads wait, others run</p><h2>physical structure of a GPU</h2><p>a gpu is divided into many SMs (Streaming Multiprocessors) </p><p>Each SM is a self-contained processor. An SM has its own instruction scheduler, its own registers (fast local memory), and its own shared memory (a small, fast on-chip scratchpad)</p><p>Because SMs are independent, hardware scaling is straightforward. 
If NVIDIA wants to make a faster GPU, they print more SMs on the silicon.</p><ul><li><p><strong>NVIDIA A100:</strong> Has <strong>108 SMs</strong>.</p></li><li><p><strong>NVIDIA H100:</strong> Has <strong>132 SMs</strong>.</p></li></ul><p>Inside every SM, there are the actual execution units that perform the math. We call these <strong>ALUs</strong> (Arithmetic Logic Units) or &#8220;Cores.&#8221;</p><p>In modern NVIDIA architectures (like Ampere, used in the A100), an SM contains several types of specialized cores:</p><ul><li><p>FP64 cores: 32 per SM; used for scientific double-precision work</p></li><li><p>FP32 cores: 64 per SM, 108 SMs &#8594; 6912 total; used for normal 32-bit math</p></li><li><p>Tensor Cores: 4 per SM; do fast matrix multiply-accumulate on small tiles; give major ML speedup</p></li><li><p>SFUs: handle sin/cos/exp/log; slower and fewer than FP32 cores</p></li><li><p>Load/Store units: handle memory read/write address calculation</p></li></ul><h2><strong>The Memory Hierarchy (Physical)</strong></h2><ul><li><p>global memory (hbm): huge, high-latency, high-bandwidth storage; where all tensors initially sit; slowest level but largest</p></li><li><p>l2 cache: on-die, medium size, shared by all sms; faster than global; auto-caches data so repeated accesses don&#8217;t go back to hbm</p></li><li><p>shared memory: tiny, per-sm, extremely fast; manually controlled; used for blocking/tiling so threads can reuse data cheaply</p></li><li><p>registers: per-thread storage for immediate operands; fastest; extremely limited and heavily partitioned across threads</p></li></ul><p>tensor cores only reach their 312 TFLOP/s peak when fed from shared memory, because global memory (~1.5 tb/s) is too slow. the fix is to fetch each tile once from global memory, stash it in shared memory (~19 tb/s), and reuse it repeatedly</p><div><hr></div><h2>Virtual Mapping</h2><p>cuda gives you a virtual grid&#8594;block&#8594;thread model while the hardware dynamically maps the many 
launched blocks onto whatever sms are free, so the same kernel runs unchanged on gpus with 80, 108, or 132 sms</p><p>more sms just means more blocks execute simultaneously, increasing throughput with zero code changes</p><p>a <strong>gpu kernel</strong> is a function definition, but launching it spawns a massive number of parallel instances</p><p><strong>thread</strong>: smallest execution unit; runs kernel code independently; owns its registers and instruction pointer; roughly maps to an alu</p><p><strong>block</strong>: group of threads that can share data via shared memory and synchronize (wait for each other at a barrier); limited in size (how many threads a block can have) because an sm has limited shared memory and registers; the scheduler assigns each block as a whole to one sm. (a block never spans several sms, but one sm can hold and run several blocks at the same time if it has the resources)</p><p><strong>grid</strong>: all blocks of a kernel launch; blocks are independent and do not synchronize with each other; the scheduler spreads them over all sms to utilize the whole gpu</p><p>first, let&#8217;s look at some facts</p><p>when you launch a cuda kernel, you define how many blocks your grid will have.</p><p>and you define how many threads each block will have. (these two are totally in your hands)</p><p>but how do you properly choose these two launch parameters? 
we will go into it later in this blog post</p><p>fact #1: there are limits on threads per block (commonly 1024) and threads per sm (commonly 1024&#8211;2048, depending on the architecture), because each sm has finite registers, shared memory, and scheduling slots; once these resources are used up, no additional threads can be scheduled on that sm</p><p><strong>now let&#8217;s see how many blocks can be allotted to a particular sm and how we calculate that</strong></p><ul><li><p>say an sm has these hard caps: 2048 threads, 192 kb shared memory, 65536 registers</p></li><li><p>each block consumes some portion of each resource; the number of blocks that fit is limited by whichever resource runs out first</p></li><li><p>scenario a (thread-limited): 256-thread block, no shared memory &#8594; 2048/256 = 8 blocks can fit; high occupancy. so when one block is waiting for memory the sm can run another block</p></li><li><p>scenario b (shared-memory-limited, 192 kb available): 256-thread block, 64 kb shared memory &#8594; 192/64 = 3 blocks fit; thread capacity goes unused because shared memory is the bottleneck</p></li><li><p>scenario c (register-limited, 65536 registers available): 256 threads &#215; 255 registers = 65280 registers &#8594; only one block fits; low occupancy and poor latency hiding</p></li></ul><h2>Coordinate System</h2><p>You launch a grid of 100 blocks, each with 256 threads. Total threads: 100&#215;256=25,600. All threads run the same kernel code, so each thread must compute a unique global index to know which element (pixel, array entry, etc.) 
it should process.</p><p>CUDA provides built-in variables inside a kernel (<code>__global__ void MyKernel(...)</code>):</p><ul><li><p><code>threadIdx.x</code>: Thread index within the block, 0 &#8230; blockDim.x &#8722; 1.</p></li><li><p><code>blockIdx.x</code>: Block index within the grid, 0 &#8230; gridDim.x &#8722; 1.</p></li><li><p><code>blockDim.x</code>: Number of threads per block (constant for that launch).</p></li></ul><h4>Global index</h4><p>Standard 1D mapping:</p><p><code>int idx = blockIdx.x * blockDim.x + threadIdx.x;<br></code></p><p>Examples with <code>blockDim.x = 256</code>:</p><ul><li><p>Block 0, Thread 0:  <code>idx = 0 * 256 + 0   = 0</code></p></li><li><p>Block 0, Thread 255: <code>idx = 0 * 256 + 255 = 255</code></p></li><li><p>Block 1, Thread 0:  <code>idx = 1 * 256 + 0   = 256</code></p></li><li><p>Block 1, Thread 5:  <code>idx = 1 * 256 + 5   = 261</code></p></li></ul><p>Each thread computes this once and then operates on <code>data[idx]</code>.</p><h4>Handling non-multiple sizes</h4><p>For N=1000 elements and <code>blockDim.x = 256</code>, you need 4 blocks (1000/256 rounded up):</p><ul><li><p>4&#215;256=1024 threads.</p></li><li><p>Valid indices: <code>0</code> to <code>999</code>.</p></li><li><p>Extra threads: <code>1000</code> to <code>1023</code> (must do nothing).</p></li></ul><p>Standard pattern:</p><p><code>__global__ void MyKernel(float* data, int N) {<br>    // global index of this thread across the whole grid<br>    int idx = blockIdx.x * blockDim.x + threadIdx.x;<br>    // guard: threads past the end of the array do nothing<br>    if (idx &lt; N) {<br>        data[idx] *= 2.0f;<br>    }<br>}<br></code></p><p>Threads with <code>idx &gt;= N</code> skip the work, so the kernel is safe for any N.</p><blockquote><p>suppose we launch 100 blocks on a gpu with only 10 sms, and each sm can only hold 2 blocks. so how will our code run? the gpu will first run 20 blocks on those sms and keep the remaining 80 in a queue. 
as running blocks finish, queued blocks take their place, and so on until all 100 have run.</p></blockquote><p></p><h2><strong>The Memory Bottleneck</strong></h2><p>In normal computer science, we talk a lot about Big O notation: O(N), O(N^2), and so on.<br>Big O tells us how the number of operations grows as the input size grows.</p><p>On GPUs, this is not enough.</p><p>For GPU speed, the main limit is often not how many math operations you do.<br>The main limit is how many bytes you move to and from memory.</p><p>So instead of only counting operations, you must also count <strong>bytes moved</strong>.</p><h2>Arithmetic Intensity</h2><p>The key metric for GPU performance is called <strong>Arithmetic Intensity</strong>.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Arithmetic Intensity}=\\frac{\\text{Floating Point Operations (FLOPs)}}{\\text{Bytes Read/Written from Global Memory}}\n&quot;,&quot;id&quot;:&quot;OIGORAONIE&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>This number tells you:</p><blockquote><p>For each byte read or written from global memory, how many floating point operations do you do?</p></blockquote><p>Higher intensity means you do more math per byte of data moved.<br>Lower intensity means you move a lot of data but do very little math with it.</p><div><hr></div><h2>Example 1: Element-wise Addition</h2><p>Take a simple vector addition:</p><p>C[i] = A[i] + B[i]</p><p>For each element i:</p><ul><li><p>Math:</p><ul><li><p>1 addition &#8658; 1 FLOP.</p></li></ul></li><li><p>Memory:</p><ul><li><p>Read A[i]: 4 bytes (assuming 32-bit float).</p></li><li><p>Read B[i]: 4 bytes.</p></li><li><p>Write C[i]: 4 bytes.</p></li></ul></li><li><p>Total data per element:</p><ul><li><p>4+4+4=12 bytes.</p></li></ul></li></ul><p>So the arithmetic intensity is:</p><p>Intensity=1 FLOP/12 Bytes&#8776;0.083 FLOPs per byte</p><p>This is a very low number.<br>You do very little math for each byte you move.</p><div><hr></div><h2>Example 2: Matrix 
Multiplication (N&#215;N)</h2><p>Now look at matrix multiplication of two N&#215;N matrices.</p><ul><li><p>Math:</p><ul><li><p>Roughly 2N^3 FLOPs.</p></li></ul></li><li><p>Memory:</p><ul><li><p>Read matrix A: N^2 elements.</p></li><li><p>Read matrix B: N^2 elements.</p></li><li><p>Write matrix C: N^2 elements.</p></li><li><p>Total elements: 3N^2.</p></li><li><p>If each element is 4 bytes: 3N^2&#215;4 bytes.</p></li></ul></li></ul><p>Arithmetic intensity:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;I=\\frac{2N^{3}}{3N^{2}\\cdot 4}\n=\\frac{2N^{3}}{12N^{2}}\n=\\frac{N}{6}\n&quot;,&quot;id&quot;:&quot;QRBAVEFYXF&quot;}" data-component-name="LatexBlockToDOM"></div><p>So intensity grows linearly with N.</p><p>For large N this is a very high intensity (assuming each matrix is moved to or from global memory only once).<br>You do a lot of math for each byte you move.</p><div><hr></div><h2>Why this matters on a real GPU (A100)</h2><p>Take an NVIDIA A100 GPU:</p><ul><li><p>Compute speed: about 312,000 GFLOP/s (tensor operations).</p></li><li><p>Memory bandwidth: about 1,555 GB/s.</p></li></ul><p>The hardware ratio is:</p><p>312,000/1,555 &#8776; 200</p><p>This means:</p><blockquote><p>To fully use the compute units, your kernel should do about 200 FLOPs for every 1 byte of global memory traffic.</p></blockquote><p>This is the <strong>golden rule</strong> for this GPU.</p><p>Now compare our two examples:</p><ul><li><p><strong>Vector Add</strong> (intensity &#8776;0.083):</p><ul><li><p>Much lower than 200.</p></li><li><p>The kernel will <strong>wait on memory</strong> most of the time.</p></li><li><p>The compute units will sit idle.</p></li><li><p>This kernel is <strong>memory bound</strong>.</p></li></ul></li><li><p><strong>Matrix Multiply</strong> with N=2048 (intensity = 2048/6 &#8776;341):</p><ul><li><p>Higher than 200.</p></li><li><p>Memory bandwidth is enough to feed the compute units.</p></li><li><p>The compute units can run at or near full speed.</p></li><li><p>This kernel is <strong>compute 
bound</strong>.</p></li></ul></li></ul><p>So, when you design GPU kernels for deep learning, you must ask:</p><ul><li><p>How many FLOPs does this kernel do?</p></li><li><p>How many bytes from global memory does it read and write?</p></li><li><p>What is the arithmetic intensity, and how does it compare to the GPU&#8217;s hardware ratio?</p></li></ul><p>If your intensity is too low, the GPU will be limited by memory bandwidth, not compute.<br>If your intensity is high enough, you can use the full compute power of the GPU.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Performance} = I \\times B\n&quot;,&quot;id&quot;:&quot;DUYATFOYUD&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Ops/Sec} = (\\text{Ops/Byte}) \\times (\\text{Bytes/Sec})\n&quot;,&quot;id&quot;:&quot;TBUEKJSOMK&quot;}" data-component-name="LatexBlockToDOM"></div><p>note: B (bytes/second) is fixed by the hardware, and performance cannot exceed the gpu&#8217;s compute peak, so<br><br></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n  \\text{Performance} = \\min(\\text{GPU Max Performance}, I \\times B)\n  &quot;,&quot;id&quot;:&quot;KELRZIKHJP&quot;}" data-component-name="LatexBlockToDOM"></div><p><em><strong>there is a very interesting graph which we can analyze here</strong></em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!st9M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!st9M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png 424w, https://substackcdn.com/image/fetch/$s_!st9M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png 848w, https://substackcdn.com/image/fetch/$s_!st9M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png 1272w, https://substackcdn.com/image/fetch/$s_!st9M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!st9M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png" width="540" height="365.80645161290323" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:930,&quot;resizeWidth&quot;:540,&quot;bytes&quot;:412725,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/179833744?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!st9M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png 424w, https://substackcdn.com/image/fetch/$s_!st9M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png 848w, https://substackcdn.com/image/fetch/$s_!st9M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png 1272w, https://substackcdn.com/image/fetch/$s_!st9M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc10c6a00-506b-4e69-bc03-b107d31f2fb1_930x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>The x-axis here is the matrix size. Remember that the arithmetic intensity of matrix multiplication is proportional to the matrix size.</p><p>So at first you see a roughly linear rise, and later the curve flattens out because the GPU hits its peak performance.</p><p>There is a lot more to this curve, which we will cover in later concepts, so let&#8217;s move on!</p><div><hr></div><p><br>Machine learning is full of matrix multiplication.<br>This is not because neural networks &#8220;prefer&#8221; matrices in some deep way.</p><p>The real reason is simple:</p><blockquote><p>Matrix multiplication is one of the few operations that can beat the memory bottleneck on GPUs.</p></blockquote><h2>Element-wise operation (ReLU, addition, etc.)</h2><p>Task: Take a vector of length N.<br>For each element: read it, change it, write it back.</p><ul><li><p>Math operations:</p><ul><li><p>You touch each element once.</p></li><li><p>Total math &#8776; N operations.</p></li></ul></li><li><p>Memory accesses:</p><ul><li><p>Read each element: N.</p></li><li><p>Write each element: N.</p></li><li><p>Total &#8776; 2N accesses.</p></li></ul></li><li><p>Intensity:</p><p>Intensity = N / 2N = 0.5 operations per memory access</p></li><li><p>Scaling:</p><ul><li><p>If you make the vector 1000 times bigger, the intensity remains the same.</p></li></ul></li></ul><p>So the intensity stays <strong>constant</strong> as N grows.<br>You never reach a point where you do &#8220;a lot&#8221; of math per memory access.</p><p></p><div><hr></div><h2><strong>Global Memory Optimization</strong></h2><p>In C++ code, memory access looks simple:</p><p><code>float x = data[i];<br></code></p><p>It feels like the GPU goes to 
memory, reads that one float (4 bytes), and comes back.<br><br>This is not what really happens.</p><p>Global memory (DRAM) is not designed to read single bytes or single floats efficiently.<br><br>It is designed to read data in fixed-size chunks.</p><p>The physical wires between the GPU and the memory chips are called the memory bus.<br><br>Data moves over this bus in fixed-size units called transactions (closely related to cache lines).</p><ul><li><p>On modern NVIDIA GPUs:</p><ul><li><p>Minimum transaction size is usually 32 bytes.</p></li><li><p>For high-throughput loads, the effective unit is often 128 bytes.</p></li></ul></li></ul><p>So if your code needs 4 bytes, the hardware does this:</p><ul><li><p>The memory controller reads a 32-byte (or 128-byte) block from DRAM.</p></li><li><p>It sends that block over the memory bus.</p></li><li><p>You only use 4 bytes from that block.</p></li><li><p>The rest of the bytes are unused for that instruction.</p></li></ul><p>You still pay:</p><ul><li><p>The latency cost for the full transaction size.</p></li><li><p>The bandwidth cost for the full transaction size.</p></li></ul><p>If you only needed 4 bytes from a 32-byte transaction, you used:</p><p>4/32=12.5% of the data</p><p>This is low efficiency.</p><p>GPUs are designed around warps: groups of 32 threads that run together.<br><br>The memory controller expects that threads in the same warp will access nearby addresses.</p><p>If thread 0 reads address X, the hardware expects thread 1 to read X+4, thread 2 to read X+8, and so on.<br><br>If this happens, the hardware can bundle these requests into a small number of large transactions.</p><div><hr></div><h2>Warp memory requests</h2><p>When a warp runs a load instruction (for example, a global load), the hardware does the following:</p><ul><li><p>It looks at the memory addresses requested by all 32 active threads in the warp.</p></li><li><p>It finds how many 128-byte transactions are needed to cover all these 
addresses.</p></li><li><p>It issues that many transactions to DRAM.</p></li></ul><p>The number of transactions depends on how the threads access memory.</p><div><hr></div><h2>Scenario A: Sequential (coalesced) access &#8211; ideal</h2><ul><li><p>Thread 0 requests address 0</p></li><li><p>Thread 1 requests address 4</p></li><li><p>Thread 2 requests address 8</p></li><li><p>...</p></li><li><p>Thread 31 requests address 124</p></li></ul><p>Here:</p><ul><li><p>The addresses go from byte 0 to byte 127.</p></li><li><p>This is exactly 128 bytes of contiguous memory.</p></li></ul><p>The memory controller sees that all 32 requests fall inside one 128-byte block.</p><ul><li><p>It issues a single 128-byte transaction.</p></li><li><p>All 32 loads are served by this one transaction.</p></li></ul><p>Result:</p><ul><li><p>The bus moves 128 useful bytes.</p></li><li><p>All 128 bytes are used by the program.</p></li></ul><p>Efficiency: 100%.</p><p>This is the best case.</p><div><hr></div><h2>Scenario B: Strided or scattered access &#8211; worst case</h2><ul><li><p>Thread 0 requests address 0</p></li><li><p>Thread 1 requests address 1000</p></li><li><p>Thread 2 requests address 2000</p></li><li><p>...</p></li></ul><p>Now the addresses are far apart.</p><ul><li><p>They do not fit into one 128-byte block.</p></li><li><p>They are in different regions of memory.</p></li></ul><p>The memory controller cannot group these into one transaction.</p><p>So it must:</p><ul><li><p>Issue 1 transaction for thread 0&#8217;s address.</p></li><li><p>Issue 1 transaction for thread 1&#8217;s address.</p></li><li><p>...</p></li><li><p>Issue 1 transaction for each of the 32 threads.</p></li></ul><p>Total:</p><ul><li><p>32 separate transactions.</p></li><li><p>Each transaction transfers at least 32 bytes.</p></li></ul><p>So:</p><ul><li><p>Total data moved on the bus: 32&#215;32=1024 bytes.</p></li><li><p>Useful data actually used by your code: 32&#215;4=128 
bytes.</p></li></ul><p>Efficiency:</p><p>Efficiency=128/1024=12.5%</p><p>That means you get only a small fraction of the possible bandwidth.<br><br>The rest is wasted on unused bytes.</p><p></p><p><strong>Example: reading elements from a matrix</strong></p><ul><li><p>The matrix is 32&#215;32 floats stored in row-major order, so memory is laid out linearly, row by row.</p></li><li><p>Row 0 has 32 floats, then row 1 has 32 floats, and so on.</p></li><li><p>Case 1: reading a row with 32 threads</p><ul><li><p>Each thread reads one element from the same row.</p></li><li><p>Addresses increase by 4 bytes from one thread to the next.</p></li><li><p>All 32 addresses fall inside one contiguous 128-byte range.</p></li><li><p>The hardware combines them into one memory transaction.</p></li><li><p>Result: highest possible memory efficiency.</p></li></ul></li><li><p>Case 2: reading a column with 32 threads</p><ul><li><p>Each thread reads one element from the same column.</p></li><li><p>The element in the next row is 128 bytes ahead, because each row is 32 floats.</p></li><li><p>The addresses are 0, 128, 256, &#8230; one per thread.</p></li><li><p>The addresses are far apart and not in one contiguous segment.</p></li><li><p>The hardware cannot merge them.</p></li><li><p>Result: 32 separate memory transactions and much lower efficiency.</p></li></ul></li></ul><blockquote><p>But what happens if I read half of the elements from one row and the other half from the next row, and the elements are contiguous in memory?</p></blockquote><ul><li><p>The GPU checks only the final memory addresses requested by the warp, not which row they belong to.</p></li><li><p>If the 32 threads read addresses that form one contiguous 128-byte range, the access counts as coalesced.</p></li><li><p>It does not matter whether those addresses are from one row or split across two rows.</p></li><li><p>As long as that contiguous range also sits inside a single aligned 128-byte segment (see Memory Alignment below), the GPU still issues one transaction.</p></li></ul><h3><strong>Memory Alignment</strong></h3><p>We established that the GPU reads memory in 128-byte transactions. 
However, these transactions are not arbitrary windows. They are aligned to specific boundaries.</p><p><strong>The Grid:</strong><br>Memory is like a grid of 128-byte slots.</p><ul><li><p>Slot 0: Addresses 0 to 127.</p></li><li><p>Slot 1: Addresses 128 to 255.</p></li><li><p>Slot 2: Addresses 256 to 383.</p></li></ul><p>The GPU can fetch &#8220;Slot 0&#8221; or &#8220;Slot 1&#8221;. It generally cannot fetch &#8220;The 128 bytes starting at Address 4.&#8221;</p><p><strong>The Problem: Misaligned Data</strong></p><p>Imagine you have a perfectly coalesced access pattern (32 threads reading 128 sequential bytes).</p><ul><li><p><strong>Case A (Aligned):</strong> The array starts at Address 0.</p><ul><li><p>The request covers 0 to 127.</p></li><li><p>This falls exactly into Slot 0.</p></li><li><p><strong>Cost:</strong> 1 Transaction.</p></li></ul></li><li><p><strong>Case B (Misaligned):</strong> The array starts at Address 4 (maybe you offset the pointer).</p><ul><li><p>Thread 0 reads Address 4.</p></li><li><p>Thread 31 reads Addresses 128 to 131.</p></li><li><p><strong>The Request:</strong> Bytes 4 to 131.</p></li><li><p><strong>The Mismatch:</strong> This range crosses the boundary. It touches the end of Slot 0 (4-127) and the beginning of Slot 1 (128-131).</p></li><li><p><strong>Hardware Action:</strong> The Memory Controller must fetch <strong>Slot 0 AND Slot 1</strong>.</p></li><li><p><strong>Cost:</strong> 2 Transactions.</p></li></ul></li></ul><h2>On-Chip Memory and Tiling</h2><p>We established earlier that Matrix Multiplication is <strong>Compute Bound</strong> (intensity &#8776; N/6). 
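</p><p>As a quick sanity check on that N/6 figure, here is a minimal Python sketch. It assumes float32 elements and the ideal case where A, B and C each cross global memory exactly once. The helper names (<code>ideal_matmul_intensity</code>, <code>attainable_flops</code>) are made up for illustration, and the A100 numbers are the rounded ones quoted above, not measured values.</p><pre><code># Sketch only: float32 elements, and A, B, C each move through global memory once.
A100_FLOPS = 312_000e9      # FLOP/s, rounded peak quoted in these notes
A100_BANDWIDTH = 1_555e9    # bytes/s, rounded bandwidth quoted in these notes


def ideal_matmul_intensity(n, bytes_per_element=4):
    """FLOPs per byte for an N x N matmul with perfect data reuse."""
    flops = 2 * n ** 3                              # one multiply and one add per term
    bytes_moved = 3 * n ** 2 * bytes_per_element    # read A, read B, write C
    return flops / bytes_moved                      # equals n / 6 for float32


def attainable_flops(intensity):
    """Roofline estimate: min(peak compute, intensity * bandwidth)."""
    return min(A100_FLOPS, intensity * A100_BANDWIDTH)


for n in (256, 1024, 2048, 4096):
    i = ideal_matmul_intensity(n)
    share = attainable_flops(i) / A100_FLOPS
    print(f"N={n}: intensity={i:.1f} FLOPs/byte, {share:.0%} of peak")
</code></pre><p>For N = 2048 the intensity comes out at about 341 FLOPs per byte, above the hardware ratio of roughly 200, which is why the ideal kernel counts as compute bound.</p><p>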
</p><p>However, that math assumes you only read each number from Global Memory <em>once</em>.</p><p><br>If you write a naive kernel where every thread reads its own data from Global Memory over and over again, you are <strong>Memory Bound</strong>.</p><p>Let us prove this mathematically.</p><p>We are multiplying two N&#215;N matrices.</p><p>We want to calculate one single element of the output matrix: C[row][col].</p><p>The formula:</p><p>C[row][col] = sum from k = 0 to N-1 of (A[row][k] * B[k][col])</p><p>So we must calculate the dot product of row <code>row</code> of A and column <code>col</code> of B.</p><p>Scenario 1: The Naive Kernel (No Shared Memory)</p><p>In this kernel, every thread is responsible for calculating one element of C.</p><p>The kernel logic (pseudo):</p><ul><li><p>Thread calculates C[y][x]</p></li><li><p>Initialize sum = 0.0</p></li><li><p>For k from 0 to N-1:</p><ul><li><p>Read A[y][k] from global memory</p></li><li><p>Read B[k][x] from global memory</p></li><li><p>Compute sum += A[y][k] * B[k][x]</p></li></ul></li><li><p>Write C[y][x] = sum</p></li></ul><p>Math analysis (for the whole grid):</p><p>Total threads:<br>N&#215;N threads (one per output element).</p><p>Operations per thread:</p><ul><li><p>The loop runs N times.</p></li><li><p>Each iteration does 1 multiply + 1 add = 2 FLOPs.</p></li></ul><p>Total FLOPs:</p><ul><li><p>Per thread: 2N</p></li><li><p>All threads: (N&#215;N threads) &#215; (2N FLOPs per thread) = 2N^3 FLOPs.</p></li><li><p>This matches the theoretical complexity of matrix multiplication.</p></li></ul><p>Memory accesses per thread:</p><ul><li><p>The loop runs N times.</p></li><li><p>In each iteration, we read A (4 bytes) and B (4 bytes) from global memory.</p></li></ul><p>Reads per thread:<br>2N reads.</p><p>Total global reads (all threads):<br>(N&#215;N threads) &#215; (2N reads) = 2N^3 reads.</p><p>Bytes transferred:</p><ul><li><p>Each read is 4 bytes.</p></li><li><p>Total bytes = 2N^3 &#215; 4 bytes = 8N^3 bytes.</p></li></ul><p>Arithmetic 
intensity:<br>Intensity = (total FLOPs) / (total bytes)<br>= (2N^3 FLOPs) / (8N^3 bytes)<br>= 0.25 FLOPs per byte.</p><p>Verdict:</p><ul><li><p>Arithmetic intensity is 0.25 FLOPs per byte.</p></li><li><p>This is very low.</p></li><li><p>An NVIDIA A100 GPU needs about 200 FLOPs per byte to reach peak compute performance.</p></li><li><p>So this naive kernel will run at around 0.25/200 &#8776; 0.125% of the GPU&#8217;s peak performance.</p></li></ul><p>So the intensity does not grow with N at all, and we have a big problem.</p><p><strong>Shared memory</strong> comes to save us. <br>It is <strong>Programmable</strong>. Unlike a CPU&#8217;s L1/L2 cache, which is managed automatically by the hardware, Shared Memory does nothing unless you write code to load data into it.<br></p><p><strong>Variable Scope</strong><br>When you declare a variable in CUDA, the keyword determines where it lives:</p><ol><li><p>float x; (Local Variable)</p><ul><li><p><strong>Memory:</strong> Register.</p></li><li><p><strong>Scope:</strong> Private to <strong>one thread</strong>. Thread A cannot see Thread B&#8217;s x.</p></li></ul></li><li><p>__shared__ float tile[256]; (Shared Variable)</p><ul><li><p><strong>Memory:</strong> Shared Memory (SRAM).</p></li><li><p><strong>Scope:</strong> Visible to <strong>all threads in the same Block</strong>.</p></li><li><p><strong>Lifetime:</strong> Exists as long as the Block is running. Once the Block finishes, this memory is released and its contents are gone.</p></li><li><p><strong>Isolation:</strong> Block 0 cannot see Block 1&#8217;s shared memory.</p></li></ul></li></ol><p><strong>The Danger: Race Conditions</strong><br>Since all threads in a block (say 256 of them) see the same tile array, they can crash into each other.</p><ul><li><p>Thread 0 tries to write to tile[0].</p></li><li><p>Thread 1 tries to read from tile[0] at the same time.</p></li><li><p><strong>Result:</strong> Undefined behavior. 
Garbage data.</p></li></ul><p><strong>The Solution: __syncthreads()</strong><br>This is the most important function in Shared Memory programming.</p><ul><li><p><strong>Action:</strong> It creates a barrier.</p></li><li><p><strong>Rule:</strong> &#8220;No thread in the block is allowed to proceed past this line of code until <strong>ALL</strong> threads in the block have reached this line.&#8221;</p></li></ul><h4><strong>The General Pattern</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!md9l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F048e94d1-f80d-4305-9459-deb67b591775_2948x2464.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!md9l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F048e94d1-f80d-4305-9459-deb67b591775_2948x2464.png 424w, https://substackcdn.com/image/fetch/$s_!md9l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F048e94d1-f80d-4305-9459-deb67b591775_2948x2464.png 848w, https://substackcdn.com/image/fetch/$s_!md9l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F048e94d1-f80d-4305-9459-deb67b591775_2948x2464.png 1272w, https://substackcdn.com/image/fetch/$s_!md9l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F048e94d1-f80d-4305-9459-deb67b591775_2948x2464.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!md9l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F048e94d1-f80d-4305-9459-deb67b591775_2948x2464.png" width="1456" height="1217" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/048e94d1-f80d-4305-9459-deb67b591775_2948x2464.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1217,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:644157,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/179833744?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F048e94d1-f80d-4305-9459-deb67b591775_2948x2464.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!md9l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F048e94d1-f80d-4305-9459-deb67b591775_2948x2464.png 424w, https://substackcdn.com/image/fetch/$s_!md9l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F048e94d1-f80d-4305-9459-deb67b591775_2948x2464.png 848w, https://substackcdn.com/image/fetch/$s_!md9l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F048e94d1-f80d-4305-9459-deb67b591775_2948x2464.png 1272w, https://substackcdn.com/image/fetch/$s_!md9l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F048e94d1-f80d-4305-9459-deb67b591775_2948x2464.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Tiling</h2><p><strong>The goal</strong></p><p>We want to calculate C = A &#215; B.<br>The output matrix C has size N&#215;N.<br>We divide C into small square blocks of size TILE_SIZE &#215; TILE_SIZE.<br>We choose TILE_SIZE = 32.</p><p>We focus on one block of C, call it Block_C.<br>This block covers rows 0&#8230;31 and columns 0&#8230;31 of the output.<br>One CUDA thread block (with 32&#215;32 = 1024 threads) is assigned to this Block_C.<br>The goal of this thread block is to compute all 32&#215;32 elements of Block_C.</p><p>Step 2: The data dependency</p><p>To calculate 
Block_C (rows 0&#8211;31, cols 0&#8211;31), we need the following data:</p><p>From matrix A:</p><ul><li><p>We need rows 0 to 31.</p></li><li><p>For each of these rows, we need all columns from 0 to N&#8722;1.</p></li><li><p>So we need a strip of size 32&#215;N from A.</p></li></ul><p>From matrix B:</p><ul><li><p>We need columns 0 to 31.</p></li><li><p>For each of these columns, we need all rows from 0 to N&#8722;1.</p></li><li><p>So we need a strip of size N&#215;32 from B.</p></li></ul><p>Visualization example</p><p>Assume A and B are 2048&#215;2048.<br>To compute our small 32&#215;32 output block, we need:</p><ul><li><p>From A: a horizontal strip of size 32&#215;2048.</p></li><li><p>From B: a vertical strip of size 2048&#215;32.</p></li></ul><p>Shared memory limitation</p><p>We cannot fit these large strips into shared memory.<br>For A&#8217;s strip: 32&#215;2048 floats = 65536 floats.<br>Each float is 4 bytes, so size = 65536 &#215; 4 bytes = 262144 bytes = 256 KB (and B&#8217;s strip is just as large).<br>By default a thread block can use about 48 KB of shared memory on many GPUs (an A100 SM offers up to 164 KB).<br>Therefore, a full 32&#215;2048 strip does not fit into shared memory.<br><br><strong>Insight</strong></p><p>Let&#8217;s break the matrix multiply down even further.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;C_{ij} = \\sum_{k=0}^{N-1} A_{ik} B_{kj}\n&quot;,&quot;id&quot;:&quot;VWNSDNELXJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>The sum above can be split into N/32 phases of 32 terms each:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;C_{ij} = \\sum_{ph=0}^{(N/32)-1} \\left( \\sum_{k_{local}=0}^{31} \nA_{i,\\,(ph\\cdot32 + k_{local})}\\, B_{(ph\\cdot32 + k_{local}),\\, j} \\right)\n&quot;,&quot;id&quot;:&quot;PCILRZDFTF&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><blockquote><p>This assumes that N is a multiple of 32; we will relax that assumption soon.</p></blockquote><p><strong>Our goal</strong>: a thread block must compute a 32&#215;32 
tile of C.<br>Inside this tile, the local row index i goes from 0 to 31 and the local column index j goes from 0 to 31.</p><p>We now trace a single thread: the one responsible for C[0][0] within this tile.<br>For this thread, i = 0 and j = 0.</p><p>Its job is to calculate:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\nC_{0,0}=\\sum_{ph=0}^{\\frac{N}{32}-1}\\left(\\sum_{k_{\\text{local}}=0}^{31}\nA_{0,\\,(ph\\cdot 32 + k_{\\text{local}})}\\;B_{(ph\\cdot 32 + k_{\\text{local}}),\\,0}\\right)\n\n&quot;,&quot;id&quot;:&quot;DFARRDQVEA&quot;}" data-component-name="LatexBlockToDOM"></div><p>The outer loop (the phases ph)</p><p>The for loop in the tiled kernel iterates over the outer sum </p><p></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{ph=0}^{\\frac{N}{32}-1}\n&quot;,&quot;id&quot;:&quot;IGJRFMXDOV&quot;}" data-component-name="LatexBlockToDOM"></div><p><br>Let us step through it.</p><p>Phase 0 (ph = 0)</p><p>In this phase, the thread needs to compute the first partial sum:</p><p>PartialSum_0 =</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{k_{\\text{local}}=0}^{31}\nA_{0,\\,(0\\cdot 32 + k_{\\text{local}})}\\;\nB_{(0\\cdot 32 + k_{\\text{local}}),\\,0}\n&quot;,&quot;id&quot;:&quot;KHBGTVJSUX&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Simplified:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{k_{\\text{local}}=0}^{31}\nA_{0,\\,k_{\\text{local}}}\\;\nB_{k_{\\text{local}},\\,0}\n&quot;,&quot;id&quot;:&quot;XXMXQCIPVS&quot;}" data-component-name="LatexBlockToDOM"></div><p>Data needed in phase 0</p><p>For this computation, thread (0,0) needs row 0 of A from columns 0 to 31 and column 0 of B from rows 0 to 31.<br>Thread (0,1) needs row 0 of A and column 1 of B.<br>Thread (1,0) needs row 1 of A and column 0 of B.<br>In fact, all threads in the thread block need access to the same 32&#215;32 block 
of A (rows 0&#8211;31, cols 0&#8211;31) and the same 32&#215;32 block of B (rows 0&#8211;31, cols 0&#8211;31).</p><p>Collaborative load for phase 0</p><p>Instead of each thread loading all the data it individually needs, the threads cooperate to load these tiles into shared memory:</p><ul><li><p>Each thread loads one element from A and one element from B into shared memory.</p></li><li><p>Thread (0,0) loads A[0][0] into As[0][0] and B[0][0] into Bs[0][0].<br>Thread (5,10) loads A[5][10] into As[5][10] and B[5][10] into Bs[5][10].</p></li><li><p>This continues similarly for all 1024 threads in the 32&#215;32 block.</p></li></ul><p>A barrier (such as __syncthreads()) is then used so that all threads wait until the entire tile has been loaded into shared memory.<br>Now the entire 32&#215;32 tiles of A and B needed for this phase are cached in shared memory.</p><p>Computation of the inner sum in phase 0</p><p>Now thread (0,0) runs its inner summation <code>&#8721; k_local = 0 to 31</code> using As and Bs in shared memory, which is fast.<br>Using these tiles, it computes PartialSum_0 and adds it to its running result for C[0][0].</p><p>Phase 1 (ph = 1)</p><p>In the next phase, the thread must compute the second partial sum:</p><p>PartialSum_1 =</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{k_{\\text{local}}=0}^{31}\nA_{0,\\,(1\\cdot 32 + k_{\\text{local}})}\\;\nB_{(1\\cdot 32 + k_{\\text{local}}),\\,0}\n&quot;,&quot;id&quot;:&quot;CVUMJTMXHB&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Simplified:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{k_{\\text{local}}=0}^{31}\nA_{0,\\,(32 + k_{\\text{local}})}\\;\nB_{(32 + k_{\\text{local}}),\\,0}\n&quot;,&quot;id&quot;:&quot;VGSWKMWAUD&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Data needed in phase 1</p><p>Now thread (0,0) needs A&#8217;s data from row 0, columns 32 to 63, and B&#8217;s data from column 0, rows 32 to 
63.<br>Again, every thread in the thread block needs this new 32&#215;32 chunk of A and B, but with the appropriate offset for its own row and column.</p><p>Collaborative load for phase 1</p><p>The threads repeat the collaborative loading process, but now with an offset in the global indices:</p><ul><li><p>Thread (0,0) loads A[0][32] into As[0][0] and B[32][0] into Bs[0][0].<br>Thread (5,10) loads A[5][42] into As[5][10] and B[37][10] into Bs[5][10].</p></li><li><p>All threads in the block similarly load one element of the current tile of A and one element of the current tile of B.</p></li></ul><p>Another synchronization (__syncthreads()) ensures all data for this new phase is present in shared memory before computation proceeds.</p><p>Computation of the inner sum in phase 1</p><p>Thread (0,0) again runs its inner loop over k_local = 0&#8230;31, this time using the newly loaded tiles in As and Bs.<br>It computes PartialSum_1 and adds it to its running total for C[0][0].<br>After all phases ph from 0 to (N/32)&#8722;1 are processed in this way, the thread has accumulated the full value of C[0][0].</p><p>Here is the code implementation:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h6aH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f8c24d-8afe-46ce-bc20-f3520731b7f3_3680x6064.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h6aH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f8c24d-8afe-46ce-bc20-f3520731b7f3_3680x6064.png 424w, 
https://substackcdn.com/image/fetch/$s_!h6aH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f8c24d-8afe-46ce-bc20-f3520731b7f3_3680x6064.png 848w, https://substackcdn.com/image/fetch/$s_!h6aH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f8c24d-8afe-46ce-bc20-f3520731b7f3_3680x6064.png 1272w, https://substackcdn.com/image/fetch/$s_!h6aH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f8c24d-8afe-46ce-bc20-f3520731b7f3_3680x6064.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h6aH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f8c24d-8afe-46ce-bc20-f3520731b7f3_3680x6064.png" width="1456" height="2399" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1f8c24d-8afe-46ce-bc20-f3520731b7f3_3680x6064.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2399,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1673990,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/179833744?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f8c24d-8afe-46ce-bc20-f3520731b7f3_3680x6064.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!h6aH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f8c24d-8afe-46ce-bc20-f3520731b7f3_3680x6064.png 424w, https://substackcdn.com/image/fetch/$s_!h6aH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f8c24d-8afe-46ce-bc20-f3520731b7f3_3680x6064.png 848w, https://substackcdn.com/image/fetch/$s_!h6aH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f8c24d-8afe-46ce-bc20-f3520731b7f3_3680x6064.png 1272w, https://substackcdn.com/image/fetch/$s_!h6aH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f8c24d-8afe-46ce-bc20-f3520731b7f3_3680x6064.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Let&#8217;s look at the efficiency:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DGfo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b128d0a-5e87-42c8-9d53-b250760b28ce_1520x361.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DGfo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b128d0a-5e87-42c8-9d53-b250760b28ce_1520x361.png 424w, https://substackcdn.com/image/fetch/$s_!DGfo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b128d0a-5e87-42c8-9d53-b250760b28ce_1520x361.png 848w, https://substackcdn.com/image/fetch/$s_!DGfo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b128d0a-5e87-42c8-9d53-b250760b28ce_1520x361.png 1272w, https://substackcdn.com/image/fetch/$s_!DGfo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b128d0a-5e87-42c8-9d53-b250760b28ce_1520x361.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DGfo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b128d0a-5e87-42c8-9d53-b250760b28ce_1520x361.png" width="1456" height="346" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b128d0a-5e87-42c8-9d53-b250760b28ce_1520x361.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:346,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58628,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/179833744?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b128d0a-5e87-42c8-9d53-b250760b28ce_1520x361.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DGfo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b128d0a-5e87-42c8-9d53-b250760b28ce_1520x361.png 424w, https://substackcdn.com/image/fetch/$s_!DGfo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b128d0a-5e87-42c8-9d53-b250760b28ce_1520x361.png 848w, https://substackcdn.com/image/fetch/$s_!DGfo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b128d0a-5e87-42c8-9d53-b250760b28ce_1520x361.png 1272w, https://substackcdn.com/image/fetch/$s_!DGfo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b128d0a-5e87-42c8-9d53-b250760b28ce_1520x361.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>How did we derive these values for the last tiled kernel part? That&#8217;s your homework.</p><p>You might also be thinking that this does not scale with N, so it is not efficient. 
There is another optimization that makes that possible, but we are not going to cover it here.</p><h2>Warps</h2><p>So far, we have been using a convenient simplification: we imagined that 1024 threads in a block all run independently and in parallel. </p><p>Now, we will look at how the SM <em>actually</em> executes these threads. This will explain many of the strange performance artifacts you see in real-world GPU code.</p><p>The hardware scheduler inside the SM does not see 1024 individual threads. It sees groups of 32.</p><p>A <strong>Warp</strong> is a group of <strong>32 consecutively numbered threads</strong> that execute in <strong>lockstep</strong>.</p><ul><li><p>In a block of 256 threads:</p><ul><li><p>Threads with threadIdx.x from <strong>0 to 31</strong> form <strong>Warp 0</strong>.</p></li><li><p>Threads with threadIdx.x from <strong>32 to 63</strong> form <strong>Warp 1</strong>.</p></li><li><p>...and so on.</p></li><li><p>A block of 256 threads is composed of 8 Warps (256 / 32).</p></li></ul></li></ul><p>The number 32 is a hardware design choice by NVIDIA, and it is fundamental to the architecture.</p><ol><li><p>The SM&#8217;s instruction scheduler fetches <strong>one</strong> instruction from memory (e.g., ADD R1, R2, R3).</p></li><li><p>It issues this single instruction to a Warp.</p></li><li><p><strong>All 32 threads</strong> in that Warp execute that exact same instruction at the exact same time.</p></li></ol><p>They operate on different data because each thread has its own private registers (R1 for Thread 0 is different from R1 for Thread 1), but the operation itself is identical for all of them.</p><blockquote><p>You might be wondering: the code we wrote in the tiling part had both an addition and a multiplication in a single statement. 
In fact, some operations, such as A&#215;B + C, are implemented as a single fused multiply-add (FMA) instruction on GPU cores.</p></blockquote><h3>Control Divergence</h3><p>We&#8217;ve established the core rule of SIMT (Single Instruction, Multiple Threads): all 32 threads in a Warp must execute the same instruction at the same time.</p><p><strong>The Problem:</strong> What happens when the code has an if/else statement?</p><p><strong>The Setup:</strong><br>Let&#8217;s analyze a simple kernel with the following parameters.</p><ul><li><p><strong>Block Size:</strong> 256 threads.</p></li><li><p><strong>This means:</strong> 8 Warps (Warp 0 to Warp 7).</p></li></ul><p>Consider this code:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!50sQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d392874-e16c-438a-96c6-3b3b94165b82_2080x1384.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!50sQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d392874-e16c-438a-96c6-3b3b94165b82_2080x1384.png 424w, https://substackcdn.com/image/fetch/$s_!50sQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d392874-e16c-438a-96c6-3b3b94165b82_2080x1384.png 848w, https://substackcdn.com/image/fetch/$s_!50sQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d392874-e16c-438a-96c6-3b3b94165b82_2080x1384.png 1272w, https://substackcdn.com/image/fetch/$s_!50sQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d392874-e16c-438a-96c6-3b3b94165b82_2080x1384.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!50sQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d392874-e16c-438a-96c6-3b3b94165b82_2080x1384.png" width="532" height="354.0576923076923" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d392874-e16c-438a-96c6-3b3b94165b82_2080x1384.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:969,&quot;width&quot;:1456,&quot;resizeWidth&quot;:532,&quot;bytes&quot;:182270,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/179833744?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d392874-e16c-438a-96c6-3b3b94165b82_2080x1384.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!50sQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d392874-e16c-438a-96c6-3b3b94165b82_2080x1384.png 424w, https://substackcdn.com/image/fetch/$s_!50sQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d392874-e16c-438a-96c6-3b3b94165b82_2080x1384.png 848w, https://substackcdn.com/image/fetch/$s_!50sQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d392874-e16c-438a-96c6-3b3b94165b82_2080x1384.png 1272w, 
https://substackcdn.com/image/fetch/$s_!50sQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d392874-e16c-438a-96c6-3b3b94165b82_2080x1384.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Let&#8217;s trace what the hardware does, Warp by Warp.</p><div><hr></div><p><strong>Trace 1: Warp 1 through Warp 7 (Threads 32 to 255)</strong></p><ul><li><p>Consider any thread in these Warps, for example, Thread 40.</p></li><li><p>Its tid is 40.</p></li><li><p>The condition 40 &lt; 16 is <strong>false</strong>.</p></li><li><p><strong>Every single 
thread</strong> in Warp 1, 2, 3, 4, 5, 6, and 7 evaluates the condition to false.</p></li><li><p><strong>Result:</strong> They all agree. The hardware simply skips Path A and executes the instructions for Path B.</p></li><li><p><strong>Performance:</strong> Perfect. No time is wasted. This is <strong>Uniform Control Flow</strong>.</p></li></ul><div><hr></div><p><strong>Trace 2: Warp 0 (Threads 0 to 31)</strong><br>This is where the problem occurs.</p><ul><li><p><strong>Threads 0 to 15:</strong></p><ul><li><p>Their tid is less than 16. The condition is <strong>true</strong>.</p></li><li><p>They want to take <strong>Path A</strong>.</p></li></ul></li><li><p><strong>Threads 16 to 31:</strong></p><ul><li><p>Their tid is 16 or greater. The condition is <strong>false</strong>.</p></li><li><p>They want to take <strong>Path B</strong>.</p></li></ul></li></ul><p>The Warp has a disagreement. The hardware cannot execute Path A and Path B at the same time. It must <strong>serialize</strong> them.</p><p><strong>The Hardware&#8217;s Solution (Predicate Registers):</strong></p><ol><li><p><strong>Evaluate:</strong> All 32 threads evaluate the condition tid &lt; 16.</p></li><li><p><strong>Masking:</strong> A &#8220;mask&#8221; is created for the Warp.</p><ul><li><p>Threads 0-15 are marked as &#8220;active.&#8221;</p></li><li><p>Threads 16-31 are marked as &#8220;inactive.&#8221;</p></li></ul></li><li><p><strong>Execute Path A:</strong> The hardware executes the instruction data[tid] = 100.0f;.</p><ul><li><p>Only the &#8220;active&#8221; threads (0-15) are allowed to actually write to memory.</p></li><li><p>The &#8220;inactive&#8221; threads (16-31) do nothing. They are <strong>stalled</strong>. 
They waste this clock cycle.</p></li></ul></li><li><p><strong>Invert Mask:</strong> The hardware flips the mask.</p><ul><li><p>Threads 0-15 are now &#8220;inactive.&#8221;</p></li><li><p>Threads 16-31 are now &#8220;active.&#8221;</p></li></ul></li><li><p><strong>Execute Path B:</strong> The hardware executes the instruction data[tid] = 200.0f;.</p><ul><li><p>Only the &#8220;active&#8221; threads (16-31) write to memory.</p></li><li><p>Threads 0-15 are now stalled.</p></li></ul></li></ol><p><strong>The Cost:</strong></p><ul><li><p>A uniform Warp would have taken 1 cycle (executing either Path A or Path B).</p></li><li><p>Our diverged Warp took <strong>2 cycles</strong> (executing Path A, then Path B).</p></li><li><p><strong>Performance on Warp 0 was cut in half.</strong></p></li></ul><p>This phenomenon is called <strong>Control Divergence</strong>.</p><p>Divergence only happens when threads <strong>within the same Warp</strong> disagree on which path to take.</p><p>Divergence between different Warps is fine (Warp 0 can take Path A while Warp 1 takes Path B).</p><p>The performance penalty grows with the number of divergent paths, because the paths are executed one after another. In this simplified model, an if/else costs up to 2x, and a switch statement with 4 paths could cost up to 4x.</p><p><strong>Rule:</strong> To write fast code, try to structure your data and logic so that threads in a Warp agree on control flow. For example, if you are processing different types of particles, sort them by type first so that Warp 0 only processes &#8220;Type A&#8221; particles and Warp 1 only processes &#8220;Type B&#8221;.</p><h3><strong>Wave Quantization</strong></h3><h4><strong>The Setup</strong></h4><p>Let&#8217;s use a concrete example.</p><ul><li><p><strong>GPU:</strong> An NVIDIA A100 with <strong>108 SMs</strong>.</p></li><li><p><strong>Kernel:</strong> A Tiled Matrix Multiplication.</p></li><li><p><strong>Workload:</strong> We are calculating a large output matrix. 
This work is divided into many independent <strong>Thread Blocks</strong>.</p></li><li><p><strong>Key Fact:</strong> Each Thread Block is a single &#8220;unit of work&#8221; that must be scheduled to an SM.</p></li></ul><p>The GPU processes work in <strong>waves</strong>. Assuming, for simplicity, that each SM runs one Block at a time, it processes 108 Blocks in the first wave, then the next 108 Blocks, and so on.</p><h4><strong>The &#8220;Good&#8221; Case: Perfect Divisibility</strong></h4><p>Imagine our matrix dimensions are such that we need to launch exactly <strong>216 Thread Blocks</strong>.</p><ul><li><p><strong>Wave 1:</strong> The GPU scheduler assigns Blocks 0-107 to the 108 SMs. All SMs are 100% busy.</p></li><li><p><strong>Wave 2:</strong> As soon as the first wave finishes, the scheduler assigns Blocks 108-215 to the 108 SMs. All SMs are 100% busy again.</p></li><li><p><strong>Result:</strong> The GPU is fully occupied for the entire duration. Performance is at its peak for that algorithm.</p></li></ul><h4><strong>The &#8220;Bad&#8221; Case: The Tail Effect</strong></h4><p>Now, let&#8217;s say we increase a matrix dimension by just <strong>one element</strong>.<br>This tiny change causes the number of required Thread Blocks to become <strong>217</strong>.</p><ul><li><p><strong>Wave 1:</strong> The scheduler assigns Blocks 0-107. All 108 SMs are busy.</p></li><li><p><strong>Wave 2:</strong> The scheduler assigns Blocks 108-215. 
All 108 SMs are busy.</p></li><li><p><strong>Wave 3 (The &#8220;Tail Wave&#8221;):</strong></p><ul><li><p>The scheduler has only <strong>one Block left</strong> to run (Block #216).</p></li><li><p>It assigns this block to <strong>SM 0</strong>.</p></li><li><p><strong>What are the other 107 SMs doing?</strong> <strong>Nothing.</strong> They are completely idle.</p></li><li><p>The entire GPU, a multi-thousand-dollar accelerator, is sitting and waiting for one SM to finish this single, final block of work.</p></li></ul></li></ul><p><strong>The Performance Impact:</strong><br>The total time taken is the time for Wave 1 + Wave 2 + Wave 3.<br>Even though Wave 3 has almost no work, it still takes the same amount of time as a full wave.</p><ul><li><p><strong>Time for 216 Blocks:</strong> ~2 units.</p></li><li><p><strong>Time for 217 Blocks:</strong> ~3 units.</p></li></ul><p>We added 1/216 (~0.5%) more work, but the runtime increased by 1/2 (50%).<br>This causes a massive, sudden drop in achieved TFLOP/s.</p><h2><strong>Reductions</strong></h2><p>A reduction turns many input elements into one output value using an associative binary operator like +, max, or min. For summation, this means combining all elements of an array into a single scalar.</p><h4>Sequential vs parallel</h4><p>A naive sum on the CPU uses a single loop:</p><p>Initialize sum = 0.</p><p>For i from 0 to N&#8722;1, do sum = sum + data[i].</p><p>Each iteration depends on the previous one because the new sum uses the old sum. 
This makes the loop inherently sequential, so a single GPU thread doing this loop would not exploit parallelism.</p><h4>Tree reduction idea</h4><p>Addition is associative, so the grouping of terms does not change the result:</p><ul><li><p>Sequential: (((a0 + a1) + a2) + a3) + ...</p></li><li><p>Tree: (a0 + a1) and (a2 + a3) can be computed in parallel, then their partial sums added, and so on.</p></li></ul><p>With 8 elements and 4 threads:</p><ul><li><p>Step 1 (stride = 4): 4 threads each add pairs 4 apart, producing 4 partial sums in the first half of the array.</p></li><li><p>Step 2 (stride = 2): 2 threads add those partial sums, producing 2 values.</p></li><li><p>Step 3 (stride = 1): 1 thread adds the last two values, leaving the final result in element 0.</p></li></ul><p>The number of active threads halves every step, giving a tree of depth log2(N) instead of doing all the additions in one sequential chain.</p><h4>CUDA shared-memory implementation</h4><p>Within a block, the algorithm is:</p><p>Load: All threads cooperatively copy one element each from global memory into a shared-memory array sdata.</p><p>Sync: Call __syncthreads() so every thread sees all loaded values.</p><p><strong>Iterative Reduction:</strong> We loop, cutting the number of active threads in half each time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5JMv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa088b0c0-d3c9-4509-9268-1b714547720a_2948x2192.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5JMv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa088b0c0-d3c9-4509-9268-1b714547720a_2948x2192.png 424w, 
https://substackcdn.com/image/fetch/$s_!5JMv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa088b0c0-d3c9-4509-9268-1b714547720a_2948x2192.png 848w, https://substackcdn.com/image/fetch/$s_!5JMv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa088b0c0-d3c9-4509-9268-1b714547720a_2948x2192.png 1272w, https://substackcdn.com/image/fetch/$s_!5JMv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa088b0c0-d3c9-4509-9268-1b714547720a_2948x2192.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5JMv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa088b0c0-d3c9-4509-9268-1b714547720a_2948x2192.png" width="1456" height="1083" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a088b0c0-d3c9-4509-9268-1b714547720a_2948x2192.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1083,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:659621,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/179833744?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa088b0c0-d3c9-4509-9268-1b714547720a_2948x2192.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!5JMv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa088b0c0-d3c9-4509-9268-1b714547720a_2948x2192.png 424w, https://substackcdn.com/image/fetch/$s_!5JMv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa088b0c0-d3c9-4509-9268-1b714547720a_2948x2192.png 848w, https://substackcdn.com/image/fetch/$s_!5JMv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa088b0c0-d3c9-4509-9268-1b714547720a_2948x2192.png 1272w, https://substackcdn.com/image/fetch/$s_!5JMv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa088b0c0-d3c9-4509-9268-1b714547720a_2948x2192.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Final Write:</strong> After the loop, sdata[0] holds the sum for the entire block. Thread 0 is responsible for writing this partial sum back to a global results array.</p><p>What if the number of elements we want to add is greater than the number of threads available to us?<br>Suppose we want to sum 200 million numbers.</p><p>Instead of each thread processing <strong>one</strong> element, each thread processes <strong>many</strong> elements.</p><ol><li><p><strong>Launch a Fixed Grid:</strong> We launch enough blocks to saturate the GPU (e.g., 1000 blocks of 256 threads = 256,000 threads total).</p></li><li><p><strong>Looping:</strong> These 256,000 threads act like a swarm of &#8220;pac-men.&#8221; They eat the first 256,000 numbers. Then they all jump forward by the grid size and eat the next 256,000 numbers. They repeat this until the 200 million numbers are consumed.</p></li><li><p><strong>Local Accumulation:</strong> While looping, each thread maintains a private running total in a <strong>register</strong>. When the loop ends, each thread writes its total into sdata, and the block finishes with the tree reduction described above.</p></li></ol><h4>Let&#8217;s calculate the arithmetic intensity of this operation</h4><p>Reads: We read N floats to populate sdata.<br>Bytes Read = 4 * N.<br>Writes: We write one single float (the final sum) back.<br>Bytes Written = 4.<br>Total Global Bytes: 4N + 4. For large N, this is approximately 4N.<br><br>The reduction performs N-1 additions to sum N numbers.<br>For example, to sum 8 numbers, you do 4 adds, then 2 adds, then 1 add. Total = 7 adds.<br>Total FLOPs: N - 1. 
For large N, this is approximately N.</p><p>I = N / (4N) = 1/4 FLOP per byte, far below the ~295 FLOPs/byte machine balance of the H100, so a reduction is firmly memory-bound.</p><h2>Atomic Operations</h2><p>If you try to do parallel reduction by having all threads add to a single shared counter (like global_sum), you can run into a <strong>race condition</strong>. Here&#8217;s what happens:</p><ul><li><p>You have a memory location (global_sum), initialized to 0.</p></li><li><p>Thread A reads its value (for example, data is 5).</p></li><li><p>Thread B reads its value (for example, data is 10).</p></li><li><p>Both threads try to add their own value to global_sum at the same time.</p></li></ul><p>The problem:</p><ul><li><p>Thread A reads global_sum (0).</p></li><li><p>Thread B reads global_sum (0).</p></li><li><p>Thread A calculates 0 + 5 = 5.</p></li><li><p>Thread B calculates 0 + 10 = 10.</p></li><li><p>Thread A writes 5 to global_sum.</p></li><li><p>Thread B writes 10 to global_sum.</p></li><li><p>Final result: global_sum is 10 (not 15). Thread A&#8217;s addition was overwritten.</p></li></ul><p><strong>Why did this happen?</strong></p><ul><li><p>Both threads read the same starting value before either finished updating global_sum.</p></li><li><p>They &#8220;race&#8221; each other: whichever thread writes last &#8220;wins,&#8221; and the other&#8217;s update is lost.</p></li></ul><p><strong>The Solution: Atomic Operations</strong><br>Hardware provides &#8220;atomic&#8221; instructions, which guarantee the entire sequence (read, modify, write) is performed without interruption.</p><p>In CUDA, the function is atomicAdd(address, value), where address is a pointer to the value being updated.</p><p>Conceptually, this is what happens when you use atomicAdd:</p><ul><li><p>The hardware locks the memory location so no other thread can read or write until it&#8217;s done.</p></li><li><p>It reads the current value.</p></li><li><p>It adds the new value.</p></li><li><p>It writes the new value back.</p></li><li><p>It unlocks the location.</p></li></ul><p>If another thread tries to do atomicAdd at the same address while it&#8217;s locked, it must 
wait.</p><p><strong>A Simple Kernel using Atomics:</strong><br>Each thread just does:</p><pre><code>if (i &lt; N) {
    atomicAdd(global_sum, data[i]);
}</code></pre><p><strong>Is it correct?</strong><br>Yes, the result will be correct (for floats, up to small rounding differences, since the order of the additions is not deterministic).</p><p><strong>What&#8217;s the problem?</strong><br>It&#8217;s <em>very</em> slow if many threads do atomicAdd on the same memory address at the same time:</p><ul><li><p>Thousands of threads try to add to global_sum at once.</p></li><li><p>Only one can succeed at a time; the others wait.</p></li><li><p>This serializes the computation: the GPU is idle most of the time, waiting for each atomic operation to finish.</p></li><li><p>As a result, parallelism is destroyed and performance drops.</p></li></ul><p><strong>When to use atomics:</strong></p><ul><li><p>They&#8217;re great if collisions are rare, for example histogram updates where most threads update different bins.</p></li><li><p>They&#8217;re useful for debugging or for writing a very simple, baseline-correct algorithm.</p></li><li><p>For high performance with frequent updates to the same memory address, you should use methods like tree reduction with shared memory.</p></li></ul><h2>Softmax</h2><p>In practice, the values e^x_i can become enormous, leading to floating-point overflow (infinity). 
To prevent this, we first find the maximum value in the vector; let&#8217;s call it m = max_j(x_j).<br><br>The formula becomes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{softmax}(x_i) = \\frac{e^{x_i - m}}{\\sum_{j=1}^{N} e^{x_j - m}}&quot;,&quot;id&quot;:&quot;LDFHTOYSFD&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p><strong>Kernel 1 (Find Max)</strong>:<br>Read: The entire row from Global Memory (N elements).<br>Work: Perform a tree reduction to find the max.<br>Write: The single max value back to Global Memory.<br>Memory Traffic: ~4N bytes</p><p><br><strong>Kernel 2 (Exponentiate):</strong><br>Read: The entire row again, and the max value (N+1 elements).<br>Work: Compute e^(x_i &#8722; m) for each element.<br>Write: The entire exponentiated row to a new temporary buffer in Global Memory.<br>Memory Traffic: ~8N bytes</p><p><br><strong>Kernel 3 (Sum):</strong><br>Read: The entire temporary row from Kernel 2 (N elements).<br>Work: Perform another tree reduction to find the sum.<br>Write: The single sum value back to Global Memory.<br>Memory Traffic: ~4N bytes</p><p><br><strong>Kernel 4 (Divide):</strong><br>Read: The temporary row from Kernel 2 again, and the sum value (N+1 elements).<br>Work: Perform the final division.<br>Write: The final result row to the output matrix in Global Memory.<br>Memory Traffic: ~8N bytes</p><p></p><p>But this is quite inefficient: the four kernels move about 24N bytes in total. How can we make it efficient? What if we fuse everything into a single kernel?</p><p>In the fused kernel, you load the row into the SM&#8217;s shared memory once and store m there as well. You then do not need to fetch x_i from global memory again; you can overwrite each value in place with e^(x_i &#8722; m).<br>Still in the same kernel, you compute the sum of e^(x_i &#8722; m) and keep it in shared memory, 
and then divide each e^(xi-m) by that sum to get the softmax</p><p>this works fine until N gets too big</p><p><strong>A100 Shared Memory:</strong> ~164 KB per SM</p><p>128k context window, size of one row: 128,000 &#215; 4 bytes = 512 KB</p><p>now you might be thinking &#8220;but we do not need to do the whole reduction on a single SM&#8221;</p><p>actually you do need to, or else the partial results have to go through global memory, which makes everything much slower</p><p>cool, so now how do we do such a large softmax on a single SM?</p><p>the answer is <strong>Tiling</strong></p><p>you break the whole array of xi into small chunks that fit in shared memory, and for each chunk you calculate its local maximum m_local and </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{i} e^{x_i - m_{\\text{local}}}&quot;,&quot;id&quot;:&quot;BBRRELSGJF&quot;}" data-component-name="LatexBlockToDOM"></div><p>on a single SM. using these per-chunk values we can reconstruct the original softmax. how? that&#8217;s homework. use some brains</p><p></p><p>so this completes the basics of writing CUDA kernels</p>]]></content:encoded></item><item><title><![CDATA[Mathematics behind Exploding and Vanishing Gradients]]></title><description><![CDATA[.]]></description><link>https://goyalayus.substack.com/p/mathematics-behind-exploding-and</link><guid isPermaLink="false">https://goyalayus.substack.com/p/mathematics-behind-exploding-and</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Sun, 16 Nov 2025 06:17:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fw-Q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9046c7a4-853e-4d19-b5ad-89484997b678_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>have you ever noticed that in a deep learning codebase, when initializing weights, we sample them from a normal distribution with mean zero and variance one but then divide each weight by the square root of the embedding dimension 
of the inputs (formally, this is a fan-in-only variant of Xavier initialization; Xavier/Glorot itself uses a variance of 2 / (number of inputs + number of outputs))? why?</p><p>what is the issue with plain Normal(0, 1) initialization? the problem is that it causes exploding activations and vanishing gradients across multiple layers, in the forward pass and during backprop.</p><p>we are going to show statistically how dividing by the square root of the embedding dimension helps</p><p></p><div><hr></div><p>Let&#8217;s consider a single neuron in one layer.</p><p>Its output y is a linear combination of its inputs x and weights w.</p><p>y = w1*x1 + w2*x2 + ... + wn*xn</p><p>Here, n is the number of input dimensions (let&#8217;s call it n_in).</p><p>We will make three simple assumptions about our inputs and weights.</p><ol><li><p>The inputs x are normalized. mean(x) = 0 and variance(x) = 1.</p></li><li><p>The weights w are initialized with mean(w) = 0 and variance(w) = 1. This is the Normal(0, 1) case we want to analyze.</p></li><li><p>Inputs xi and weights wi are all independent of each other.</p></li></ol><p>Our goal is to find the variance of the output y. A stable network should have outputs with a variance close to 1. If the variance keeps increasing layer by layer, the outputs will explode to huge values.</p><p>Let&#8217;s calculate the variance of y.</p><p>Var(y) = Var(w1*x1 + w2*x2 + ... + wn*xn)</p><p>A property of variance is that for independent variables, the variance of a sum is the sum of their variances.</p><p>Var(y) = Var(w1*x1) + Var(w2*x2) + ... + Var(wn*xn)</p><p>Another property of variance for two independent variables A and B with zero mean is: Var(A*B) = Var(A) * Var(B). This follows from Var(A*B) = E[A^2] * E[B^2] - (E[A] * E[B])^2, where the second term is zero and E[A^2] = Var(A) when the mean is zero.</p><p>Applying this:</p><p>Var(y) = (Var(w1)*Var(x1)) + (Var(w2)*Var(x2)) + ... + (Var(wn)*Var(xn))</p><p>We assumed Var(wi) = 1 and Var(xi) = 1 for all i.</p><p>Var(y) = (1 * 1) + (1 * 1) + ... + (1 * 1)<br>Var(y) = 1 + 1 + ... 
+ 1 (n_in times)<br>Var(y) = n_in</p><p>This is the problem.</p><p>The variance of the output of the layer is equal to the number of input dimensions.</p><p>If the input dimension n_in is 512, the variance of the output is 512. The standard deviation is sqrt(512), which is about 22.</p><p>To understand what this means, we need to know what standard deviation represents. Standard deviation is a measure of how spread out numbers are from their mean. Our mean is zero. For a normal distribution, about 68% of all values lie within one standard deviation of the mean. About 95% lie within two standard deviations.</p><p>Our inputs had a mean of 0 and a standard deviation of 1. So most input values were between -2 and 2.<br>Our outputs now have a mean of 0 and a standard deviation of 22. So most output values will be between -44 and 44.</p><p>The typical magnitude of an output value is now 22 times larger than the typical magnitude of an input value.</p><p>Pass this through a few layers. Let&#8217;s prove how the variance explodes.</p><ul><li><p><strong>Layer 1:</strong></p><ul><li><p>Input to Layer 1: x_l1. Var(x_l1) = 1.</p></li><li><p>Output of Layer 1: y_l1.</p></li><li><p>As we proved, Var(y_l1) = n_in * Var(w_l1) * Var(x_l1) = 512 * 1 * 1 = 512.</p></li></ul></li><li><p><strong>Layer 2:</strong></p><ul><li><p>The input to Layer 2 is the output of Layer 1. So, x_l2 = y_l1.</p></li><li><p>The variance of the input to Layer 2 is Var(x_l2) = Var(y_l1) = 512.</p></li><li><p>The output of Layer 2 is y_l2. We use the same formula for variance.</p></li><li><p>Var(y_l2) = n_in * Var(w_l2) * Var(x_l2)</p></li><li><p>The weights of Layer 2 are also initialized from Normal(0, 1), so Var(w_l2) = 1.</p></li><li><p>Var(y_l2) = 512 * 1 * 512 = 262,144.</p></li></ul></li></ul><p>The variance of the output from Layer 2 is (n_in)^2. 
The standard deviation is n_in = 512.</p><p>After just two layers, the typical magnitude of the output values is 512 times larger than the original input values. After a third layer, the variance would be (n_in)^3, a standard deviation of about 11,585.</p><p>The numbers explode. These large outputs are then passed to an activation function (like tanh or sigmoid). These functions saturate for large inputs (tanh outputs approach -1 or 1, sigmoid outputs approach 0 or 1).</p><p>When the activation function is saturated, its gradient is almost zero. During backpropagation, these near-zero gradients are multiplied back through the network, so the weights barely get updated and the network does not learn.</p><p>Now, let&#8217;s fix it.</p><p>Our goal is to make the output variance Var(y) equal to 1.</p><p>We saw that: Var(y) = n_in * Var(w) * Var(x)</p><p>Assuming Var(x) = 1, we have: Var(y) = n_in * Var(w)</p><p>We want Var(y) = 1.</p><p>So, 1 = n_in * Var(w)</p><p>This means we must choose our weights w to have a variance of:</p><p>Var(w) = 1 / n_in</p><p>How do we get a random variable with Var(w) = 1 / n_in?</p><p>We start with a variable W_standard from a standard normal distribution Normal(0, 1). It has Var(W_standard) = 1.<br>If we scale this variable by a constant c, the new variance is Var(c * W_standard) = c^2 * Var(W_standard) = c^2.</p><p>We want the new variance to be 1 / n_in.<br>So, c^2 = 1 / n_in<br>c = 1 / sqrt(n_in)</p><p>This is the solution. We initialize weights not from Normal(0, 1), but by taking a sample from Normal(0, 1) and then dividing it by sqrt(n_in).</p><p>This makes the variance of the weights equal to 1 / n_in. Let&#8217;s re-calculate the output variance with this new weight initialization.</p><p>Var(y) = n_in * Var(w)<br>Var(y) = n_in * (1 / n_in)<br>Var(y) = 1</p><p>Now, the output of the layer has the same variance as the input. The variance does not explode or vanish from layer to layer. The inputs to activation functions stay in a range where their gradients are non-zero, and the network can learn effectively. 
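</p><p>A quick way to sanity-check both results is a small Monte Carlo estimate. This is a minimal sketch in plain Python, assuming unit-variance Normal inputs as in the derivation; <code>output_variance</code> and its parameters are names made up for this example:</p><pre><code>import math
import random


def output_variance(n_in, weight_scale, trials=1000, seed=0):
    # Estimate Var(y) for y = sum_i w_i * x_i, where x_i ~ Normal(0, 1)
    # and w_i = weight_scale * Normal(0, 1).
    rng = random.Random(seed)
    ys = []
    for _ in range(trials):
        y = 0.0
        for _ in range(n_in):
            w = weight_scale * rng.gauss(0.0, 1.0)
            x = rng.gauss(0.0, 1.0)
            y += w * x
        ys.append(y)
    mean = sum(ys) / trials
    return sum((y - mean) ** 2 for y in ys) / trials


if __name__ == "__main__":
    n_in = 512
    print("Normal(0, 1) weights:  ", output_variance(n_in, 1.0))
    print("scaled by 1/sqrt(n_in):", output_variance(n_in, 1.0 / math.sqrt(n_in)))
</code></pre><p>With n_in = 512, the first estimate should land near 512 and the second near 1, up to sampling noise.</p><p>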
The same logic applies to the gradients during the backward pass, where the sum runs over the layer&#8217;s outputs instead of its inputs, so they stay stable as well when the two sizes are similar.</p>]]></content:encoded></item><item><title><![CDATA[Everyone is rewriting their C++ code into Rust. Why? No, It's Not Performance]]></title><description><![CDATA[.]]></description><link>https://goyalayus.substack.com/p/everyone-is-rewriting-their-c-code</link><guid isPermaLink="false">https://goyalayus.substack.com/p/everyone-is-rewriting-their-c-code</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Mon, 13 Oct 2025 21:45:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xVyj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>in systems languages like C and C++ you manage memory yourself (heap allocations in particular), and there is no garbage collector like in languages such as JavaScript and Go<br><br>but managing memory yourself has one big issue: it can cause a lot of bugs, because it&#8217;s very hard to write memory-safe code.</p><p>and most of these bugs are not caught by the compiler, 
so you only hit them at runtime.</p><p>Rust solves this: it reports most of these errors at compile time.</p><p>The performance of Rust and C++ is roughly the same, so speed is not the reason.</p><p><strong>in today&#8217;s article we will go over various ways in which you can write buggy C++ code and how to prevent them</strong></p><p>There are three major categories of memory bugs:</p><ol><li><p>use-after-free</p></li><li><p>memory-leak</p></li><li><p>double-free</p></li></ol><h3><strong>Use-After-Free</strong></h3><p>When you call <code>free()</code> (or <code>delete</code>) on a pointer, the memory it points to is released, but the pointer still holds the old address. Dereferencing it afterwards is undefined behavior: it may crash, or it may silently read or corrupt whatever now lives at that address.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xVyj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xVyj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png 424w, https://substackcdn.com/image/fetch/$s_!xVyj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png 848w, https://substackcdn.com/image/fetch/$s_!xVyj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!xVyj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xVyj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png" width="1456" height="528" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:528,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182350,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/176076620?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xVyj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png 424w, https://substackcdn.com/image/fetch/$s_!xVyj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png 848w, https://substackcdn.com/image/fetch/$s_!xVyj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!xVyj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24858a95-b711-45e8-a24a-b0bd7a02f417_2824x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>Memory Leaks</strong></h3><p>you typically allocate memory using a pointer inside a local scope (usually inside functions), or indirectly by creating an object that contains a pointer as a member. 
When the scope ends, the pointer itself (being a stack variable) is destroyed, but the memory it was pointing to is <strong>not</strong> automatically freed. If you don&#8217;t call <code>free()</code> or <code>delete</code> before the scope ends, the allocated memory remains reserved, but you have no way to access it anymore &#8212; this is a <strong>memory leak</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yaJO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44612672-3855-43f2-95f4-0af310f07947_2528x1292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yaJO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44612672-3855-43f2-95f4-0af310f07947_2528x1292.png 424w, https://substackcdn.com/image/fetch/$s_!yaJO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44612672-3855-43f2-95f4-0af310f07947_2528x1292.png 848w, https://substackcdn.com/image/fetch/$s_!yaJO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44612672-3855-43f2-95f4-0af310f07947_2528x1292.png 1272w, https://substackcdn.com/image/fetch/$s_!yaJO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44612672-3855-43f2-95f4-0af310f07947_2528x1292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yaJO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44612672-3855-43f2-95f4-0af310f07947_2528x1292.png" width="1456" height="744" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44612672-3855-43f2-95f4-0af310f07947_2528x1292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:744,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:198071,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/176076620?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44612672-3855-43f2-95f4-0af310f07947_2528x1292.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yaJO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44612672-3855-43f2-95f4-0af310f07947_2528x1292.png 424w, https://substackcdn.com/image/fetch/$s_!yaJO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44612672-3855-43f2-95f4-0af310f07947_2528x1292.png 848w, https://substackcdn.com/image/fetch/$s_!yaJO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44612672-3855-43f2-95f4-0af310f07947_2528x1292.png 1272w, https://substackcdn.com/image/fetch/$s_!yaJO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44612672-3855-43f2-95f4-0af310f07947_2528x1292.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Double free</h3><p>it&#8217;s basically a special case of <strong>use-after-free </strong></p><p>A double-free bug occurs when you call <code>free()</code> (or <code>delete</code>) on the same pointer more than once. 
This typically happens when multiple functions assume they &#8220;own&#8221; the pointer and each one tries to free it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0_nt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e180490-9262-4fb4-8be1-5cea2cc124b9_2892x2284.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0_nt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e180490-9262-4fb4-8be1-5cea2cc124b9_2892x2284.png 424w, https://substackcdn.com/image/fetch/$s_!0_nt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e180490-9262-4fb4-8be1-5cea2cc124b9_2892x2284.png 848w, https://substackcdn.com/image/fetch/$s_!0_nt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e180490-9262-4fb4-8be1-5cea2cc124b9_2892x2284.png 1272w, https://substackcdn.com/image/fetch/$s_!0_nt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e180490-9262-4fb4-8be1-5cea2cc124b9_2892x2284.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0_nt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e180490-9262-4fb4-8be1-5cea2cc124b9_2892x2284.png" width="1456" height="1150" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e180490-9262-4fb4-8be1-5cea2cc124b9_2892x2284.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1150,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:425441,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/176076620?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e180490-9262-4fb4-8be1-5cea2cc124b9_2892x2284.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0_nt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e180490-9262-4fb4-8be1-5cea2cc124b9_2892x2284.png 424w, https://substackcdn.com/image/fetch/$s_!0_nt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e180490-9262-4fb4-8be1-5cea2cc124b9_2892x2284.png 848w, https://substackcdn.com/image/fetch/$s_!0_nt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e180490-9262-4fb4-8be1-5cea2cc124b9_2892x2284.png 1272w, https://substackcdn.com/image/fetch/$s_!0_nt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e180490-9262-4fb4-8be1-5cea2cc124b9_2892x2284.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>there are some other interesting ways through which you can run into these memory errors</p><h3>Rule of Three</h3><p>if a class needs a user-defined destructor, a user-defined copy constructor, or a user-defined copy assignment operator, it almost certainly needs all three. otherwise the compiler-generated copy just duplicates the raw pointer, and both objects end up freeing the same memory</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MCTC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffabd5755-b161-419b-96be-299c9490baab_3680x5884.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!MCTC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffabd5755-b161-419b-96be-299c9490baab_3680x5884.png 424w, https://substackcdn.com/image/fetch/$s_!MCTC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffabd5755-b161-419b-96be-299c9490baab_3680x5884.png 848w, https://substackcdn.com/image/fetch/$s_!MCTC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffabd5755-b161-419b-96be-299c9490baab_3680x5884.png 1272w, https://substackcdn.com/image/fetch/$s_!MCTC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffabd5755-b161-419b-96be-299c9490baab_3680x5884.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MCTC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffabd5755-b161-419b-96be-299c9490baab_3680x5884.png" width="1456" height="2328" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fabd5755-b161-419b-96be-299c9490baab_3680x5884.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2328,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1766571,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/176076620?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffabd5755-b161-419b-96be-299c9490baab_3680x5884.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MCTC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffabd5755-b161-419b-96be-299c9490baab_3680x5884.png 424w, https://substackcdn.com/image/fetch/$s_!MCTC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffabd5755-b161-419b-96be-299c9490baab_3680x5884.png 848w, https://substackcdn.com/image/fetch/$s_!MCTC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffabd5755-b161-419b-96be-299c9490baab_3680x5884.png 1272w, https://substackcdn.com/image/fetch/$s_!MCTC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffabd5755-b161-419b-96be-299c9490baab_3680x5884.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>Iterator Invalidation</strong></h3><p>An <strong>iterator</strong> is an object that acts like a pointer to an element within a container. For a std::vector, an iterator might be implemented as a simple raw pointer to an element in the vector&#8217;s internal, contiguous array.</p><p>Certain modifications to a container can force it to reallocate its internal storage. The most common example is adding an element to a std::vector that has reached its capacity.</p><ol><li><p>The vector allocates a new, larger block of memory on the heap.</p></li><li><p>It copies (or moves) all existing elements from the old memory block to the new one.</p></li><li><p>It deallocates the old memory block.</p></li></ol><p>At this point, any iterators that were pointing to elements in the old memory block are now <strong>invalidated</strong>. They have become dangling pointers. Dereferencing them is a use-after-free, resulting in Undefined Behavior.</p><p>In a single-threaded context, this is a common but often manageable bug. In a multi-threaded context, it becomes a nightmare. 
If one thread is iterating over a shared vector while another thread concurrently adds an element to it, the first thread&#8217;s iterator can be invalidated mid-loop, leading to crashes or silent data corruption that are extremely difficult to reproduce and debug.</p><p></p><p>C++ has tried to address these problems with <strong>smart pointers</strong></p><p>there are two main types of smart pointers</p><ol><li><p>unique pointers</p></li><li><p>shared pointers</p></li></ol><h4><strong>Unique Pointers</strong></h4><p>a unique pointer (<code>std::unique_ptr</code>) owns the memory it points to: when it goes out of the scope in which it was defined, delete is called on that memory automatically, so the issue of memory leaks is partially solved ( i will explain why partially ahead )<br><br>now if you want to pass that pointer to a function, there are two options</p><p><strong>Transfer the ownership</strong></p><p>you move the pointer into the function (with <code>std::move</code>). the original pointer becomes null, the function&#8217;s parameter now owns the memory, and the memory is deleted automatically once that scope ends</p><p><strong>Borrowing</strong></p><p>you pass a reference or the raw pointer instead. ownership stays with the caller, so delete is not called when the function returns</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i4Yv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e341f25-b886-445b-b1f5-f2f4c20ca6df_3680x6692.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i4Yv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e341f25-b886-445b-b1f5-f2f4c20ca6df_3680x6692.png 424w, 
https://substackcdn.com/image/fetch/$s_!i4Yv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e341f25-b886-445b-b1f5-f2f4c20ca6df_3680x6692.png 848w, https://substackcdn.com/image/fetch/$s_!i4Yv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e341f25-b886-445b-b1f5-f2f4c20ca6df_3680x6692.png 1272w, https://substackcdn.com/image/fetch/$s_!i4Yv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e341f25-b886-445b-b1f5-f2f4c20ca6df_3680x6692.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i4Yv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e341f25-b886-445b-b1f5-f2f4c20ca6df_3680x6692.png" width="1456" height="2648" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e341f25-b886-445b-b1f5-f2f4c20ca6df_3680x6692.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2648,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2255643,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/176076620?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e341f25-b886-445b-b1f5-f2f4c20ca6df_3680x6692.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!i4Yv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e341f25-b886-445b-b1f5-f2f4c20ca6df_3680x6692.png 424w, https://substackcdn.com/image/fetch/$s_!i4Yv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e341f25-b886-445b-b1f5-f2f4c20ca6df_3680x6692.png 848w, https://substackcdn.com/image/fetch/$s_!i4Yv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e341f25-b886-445b-b1f5-f2f4c20ca6df_3680x6692.png 1272w, https://substackcdn.com/image/fetch/$s_!i4Yv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e341f25-b886-445b-b1f5-f2f4c20ca6df_3680x6692.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>it still doesn&#8217;t solve the problem completely. suppose that while borrowing the raw pointer you accidentally call delete on it: the unique pointer still thinks it owns the memory, so any later use is a use-after-free, and the unique pointer will delete the same memory a second time when it goes out of scope</p><h4><strong>Shared Pointers</strong></h4><p>unique pointers are great, but what if several parts of the program need to share ownership of the same object? you can&#8217;t do that with them, because a unique pointer has exactly one owner</p><p>so we introduce shared pointers (shared_ptr)</p><p>conceptually, a shared_ptr points to a memory block which is a struct with two members: 1. a pointer to the actual data  2. a reference counter</p><p>when you initialize a shared_ptr the counter is 1</p><p>the managed object is only destroyed when the counter goes to zero</p><p>when you copy the shared_ptr (for example by passing it to a function by value) the counter is incremented, and when a copy goes out of scope the counter is decremented</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uFMt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ac1d7e-87b4-44a6-80b1-8dec75cb2b22_3680x8764.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uFMt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ac1d7e-87b4-44a6-80b1-8dec75cb2b22_3680x8764.png 424w, 
https://substackcdn.com/image/fetch/$s_!uFMt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ac1d7e-87b4-44a6-80b1-8dec75cb2b22_3680x8764.png 848w, https://substackcdn.com/image/fetch/$s_!uFMt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ac1d7e-87b4-44a6-80b1-8dec75cb2b22_3680x8764.png 1272w, https://substackcdn.com/image/fetch/$s_!uFMt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ac1d7e-87b4-44a6-80b1-8dec75cb2b22_3680x8764.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uFMt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ac1d7e-87b4-44a6-80b1-8dec75cb2b22_3680x8764.png" width="1456" height="3467" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01ac1d7e-87b4-44a6-80b1-8dec75cb2b22_3680x8764.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3467,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2689692,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/176076620?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ac1d7e-87b4-44a6-80b1-8dec75cb2b22_3680x8764.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!uFMt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ac1d7e-87b4-44a6-80b1-8dec75cb2b22_3680x8764.png 424w, https://substackcdn.com/image/fetch/$s_!uFMt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ac1d7e-87b4-44a6-80b1-8dec75cb2b22_3680x8764.png 848w, https://substackcdn.com/image/fetch/$s_!uFMt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ac1d7e-87b4-44a6-80b1-8dec75cb2b22_3680x8764.png 1272w, https://substackcdn.com/image/fetch/$s_!uFMt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ac1d7e-87b4-44a6-80b1-8dec75cb2b22_3680x8764.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><br>but shared_ptr has a major drawback: cyclic dependencies. if two objects hold a shared_ptr to each other, neither counter can ever reach zero, so the memory is never freed</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yWTH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5af07d84-1b70-42c6-845b-98baa87f372e_3680x6604.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yWTH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5af07d84-1b70-42c6-845b-98baa87f372e_3680x6604.png 424w, https://substackcdn.com/image/fetch/$s_!yWTH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5af07d84-1b70-42c6-845b-98baa87f372e_3680x6604.png 848w, https://substackcdn.com/image/fetch/$s_!yWTH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5af07d84-1b70-42c6-845b-98baa87f372e_3680x6604.png 1272w, https://substackcdn.com/image/fetch/$s_!yWTH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5af07d84-1b70-42c6-845b-98baa87f372e_3680x6604.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!yWTH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5af07d84-1b70-42c6-845b-98baa87f372e_3680x6604.png" width="1456" height="2613" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5af07d84-1b70-42c6-845b-98baa87f372e_3680x6604.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2613,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1897210,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/176076620?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5af07d84-1b70-42c6-845b-98baa87f372e_3680x6604.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yWTH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5af07d84-1b70-42c6-845b-98baa87f372e_3680x6604.png 424w, https://substackcdn.com/image/fetch/$s_!yWTH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5af07d84-1b70-42c6-845b-98baa87f372e_3680x6604.png 848w, https://substackcdn.com/image/fetch/$s_!yWTH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5af07d84-1b70-42c6-845b-98baa87f372e_3680x6604.png 1272w, https://substackcdn.com/image/fetch/$s_!yWTH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5af07d84-1b70-42c6-845b-98baa87f372e_3680x6604.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><br>this can be solved by making one side of the cycle another type of smart pointer called weak_ptr, though it&#8217;s not very elegant</p><ul><li><p>A weak_ptr is created from a shared_ptr.</p></li><li><p>It &#8220;observes&#8221; the object but does <strong>not</strong> contribute to the reference count. It does not own the object.</p></li><li><p>Because it doesn&#8217;t own the object, the object can be deleted out from under it. Therefore, you cannot dereference a weak_ptr directly.</p></li><li><p>To access the object, you must call the .lock() method on the weak_ptr. 
This method checks if the object still exists. If it does, .lock() returns a new shared_ptr to it (atomically incrementing the ref count), which you can then safely use. If the object has already been deleted, .lock() returns a null shared_ptr.</p></li></ul><p></p><div><hr></div><p>that&#8217;s it for this blog, now I am going to learn rust. </p><p>hopefully another blog on rust will be coming out soon</p>]]></content:encoded></item><item><title><![CDATA[the hitchhiker’s guide to working with Audio Data]]></title><description><![CDATA[audio data has two parts amplitude and frequency.]]></description><link>https://goyalayus.substack.com/p/the-hitchhikers-guide-to-working</link><guid isPermaLink="false">https://goyalayus.substack.com/p/the-hitchhikers-guide-to-working</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Wed, 24 Sep 2025 18:01:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fw-Q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9046c7a4-853e-4d19-b5ad-89484997b678_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>audio data has two parts: <strong>amplitude</strong> and <strong>frequency.</strong></p><p>let&#8217;s see how audio data is stored in your wav files</p><p>it is mostly a list of integers representing the amplitude of the sound over time.</p><p>sampling rate: how many times per second the amplitude is measured and stored in the list of integers. 
it&#8217;s generally 44,100 Hz, meaning the signal is sampled about 44k times per second</p><p>the other is bit depth: how precisely each amplitude value is stored. commonly it&#8217;s 16 bit, meaning each sample can range from &#8722;32,768 to +32,767</p><blockquote><p>this format, while okay, is not that great for machine learning models.</p></blockquote><p>we try to turn this 1d data into 2d using the fourier transform</p><p>a fourier transform tells us how strong different frequency bands (for example 0&#8211;1 kHz, 1&#8211;2 kHz, 10&#8211;20 kHz, etc.) are in the signal</p><p>applying it once on the whole audio only gives the overall frequency distribution (no information about when a frequency occurred) &#8594; still 1d</p><p>to capture time-specific frequency information, we split the audio into short chunks (commonly 20&#8211;30 ms, e.g. 25 ms)</p><p>then we run the fourier transform on each chunk &#8594; this is called the short-time fourier transform (STFT)</p><p>stacking these results gives us a 2d representation:</p><ul><li><p>one axis = time (chunks)</p></li><li><p>one axis = frequency</p></li><li><p>values in the grid = amplitude/energy for each frequency at each time</p></li></ul><p>this 2d representation is often visualized as a spectrogram</p><p>example:</p><ul><li><p>pure 440 Hz tone &#8594; spectrogram shows a straight line at 440 Hz across time</p></li><li><p>speech &#8594; spectrogram shows changing frequency bands (formants) over time</p></li></ul><h4>mel-spectrogram</h4><p>mel-spectrogram is derived from the spectrogram by mapping frequencies onto the mel scale</p><p>mel scale is a perceptual scale where equal steps sound equally spaced to human ears</p><p>humans perceive pitch roughly logarithmically (we notice differences at low frequencies more than at high)</p><p>so instead of linear frequency bins, mel-spectrogram compresses high-frequency ranges and expands low ones</p><p>result = representation closer to how humans actually hear sound</p><p>widely used 
in speech recognition and music analysis</p><h4><strong>mel-frequency cepstral coefficients (mfccs)</strong></h4><ul><li><p>derived from the log of the mel-spectrogram by applying a discrete cosine transform (dct) to decorrelate frequency bands</p></li><li><p>captures the overall spectral envelope (timbre/shape of sound) rather than fine detail</p></li><li><p>commonly used in speech recognition because the spectral envelope carries much of what distinguishes one phoneme from another</p></li><li><p>typical dimension: 12&#8211;13 coefficients per frame (sometimes + energy + delta + delta-delta &#8594; ~39 features per frame)</p></li></ul><h4>chroma features</h4><ul><li><p>group the spectrum into 12 bins, one for each pitch class (c, c#, d &#8230; b), regardless of octave</p></li><li><p>captures harmonic and melodic characteristics, useful for music analysis (e.g., chord detection, key recognition)</p></li><li><p>ignores exact frequency scale, focuses on pitch class energy distribution</p></li><li><p>typical dimension: 12 features per frame</p></li></ul><p></p><p>mel-spectrograms, mfccs and chroma features are all 2d arrays (time by feature), much like images, so CNNs work very well on them.</p>]]></content:encoded></item><item><title><![CDATA[the hitchhiker’s guide to golang concurrency ]]></title><description><![CDATA[Go has special Threads called Go-Routines.]]></description><link>https://goyalayus.substack.com/p/the-hitchhikers-guide-to-golang-concurrency</link><guid isPermaLink="false">https://goyalayus.substack.com/p/the-hitchhikers-guide-to-golang-concurrency</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Wed, 24 Sep 2025 07:48:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yj8J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Go has lightweight threads called goroutines.</p><p>they are 
different from OS threads in that they are very lightweight: each starts with a stack of about 2 KB that can grow and shrink, compared with the 1-8 MB stack typical of an OS thread</p><p><strong>syntax</strong></p><pre><code>package main

import (
&#9;"fmt"
&#9;"time"
)

func sayHello() {
&#9;fmt.Println("Hello from the sayHello goroutine!")
}

func main() {
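&#9;// launch sayHello in a new goroutine; the go statement returns immediately
&#9;go sayHello()

&#9;fmt.Println("Hello from the main goroutine.")

&#9;// crude pause so the other goroutine gets a chance to run before main returns
&#9;time.Sleep(100 * time.Millisecond)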
}</code></pre><p>what happens if you remove time.Sleep at the end? most likely the sayHello function will never print.<br><br>think of it like this: the main function is run by the main goroutine, and it is not in the habit of waiting for other goroutines. when main returns, the program exits, even if other goroutines are still running.<br></p><p>Using time.Sleep is a fragile hack. We need a deterministic synchronization mechanism, and that is what <em><strong>sync.WaitGroup</strong></em> provides.</p><p>It's essentially a concurrent counter that allows a goroutine (usually the main goroutine) to block until a collection of other goroutines have finished their tasks.</p><p>A WaitGroup has three core methods:</p><ol><li><p><strong>Add(delta int):</strong> This increments the WaitGroup's internal counter by delta.</p><ul><li><p>You call this <em>before</em> you launch the goroutine(s). If you have N goroutines to wait for, you would call wg.Add(N).</p></li></ul></li><li><p><strong>Done():</strong> This decrements the WaitGroup's counter by one.</p><ul><li><p>This is called by the worker goroutine itself, typically as the last action before it returns, often using defer. 
It's a signal from the goroutine saying, "I have finished my work."</p></li></ul></li><li><p><strong>Wait():</strong> This blocks the goroutine that calls it until the WaitGroup's internal counter becomes zero.</p><ul><li><p>This is called by the goroutine that needs to wait (e.g., the main goroutine).</p></li></ul></li></ol><p>let&#8217;s see this in code</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yj8J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yj8J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png 424w, https://substackcdn.com/image/fetch/$s_!yj8J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png 848w, https://substackcdn.com/image/fetch/$s_!yj8J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png 1272w, https://substackcdn.com/image/fetch/$s_!yj8J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yj8J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png" width="628" height="778.5302197802198" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1805,&quot;width&quot;:1456,&quot;resizeWidth&quot;:628,&quot;bytes&quot;:663359,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yj8J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png 424w, https://substackcdn.com/image/fetch/$s_!yj8J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png 848w, https://substackcdn.com/image/fetch/$s_!yj8J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png 1272w, https://substackcdn.com/image/fetch/$s_!yj8J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51fb0404-d409-4f72-b111-e8cc5e293bab_2804x3476.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Execution Flow and Output:</p><pre><code>Main: Starting workers...
Main: Waiting for workers to finish...
Worker 3 starting
Worker 5 starting
Worker 1 starting
Worker 4 starting
Worker 2 starting
Worker 3 finished
Worker 5 finished
Worker 1 finished
Worker 4 finished
Worker 2 finished
Main: All workers finished. Exiting.</code></pre><p><br>now you might be wondering why the output is not sequential. that is because there is a difference between starting a goroutine and actually running it.<br>the go statements execute sequentially, and each one places a new goroutine into a <strong>runnable queue. </strong>but the scheduler decides when each goroutine actually runs, so the order in which they print is not guaranteed</p><h3>Channels</h3><p>channels are a way for goroutines to communicate with each other. they could also communicate by reading and writing shared global variables, but Go&#8217;s philosophy discourages that and encourages communicating over channels instead.<br><br>You declare a channel using the chan keyword followed by the type of data it will carry.</p><pre><code>var myIntChannel chan int     // A channel that carries integers
var myStringChannel chan string // A channel that carries strings
var myStructChannel chan MyStruct // A channel that carries values of type MyStruct</code></pre><p>Like maps and slices, a channel is a reference type. Its zero value is nil. Before you can use a channel, you must initialize it with the built-in make() function.</p><pre><code>myIntChannel = make(chan int) </code></pre><h4>Unbuffered Channels and Rendezvous</h4><p>An unbuffered channel is created with a capacity of zero. This is the default.</p><pre><code>ch := make(chan int) // or make(chan int, 0)</code></pre><p>An unbuffered channel has a unique and powerful synchronization property: it forces a rendezvous.</p><ul><li><p>A send operation on an unbuffered channel will block until another goroutine is ready to receive from that same channel.</p></li><li><p>Likewise, a receive operation will block until another goroutine performs a send.</p></li></ul><p>let&#8217;s look at this through a concrete code example </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WGgL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7664643c-11bb-41ca-b92d-421a315245ab_2692x2936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WGgL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7664643c-11bb-41ca-b92d-421a315245ab_2692x2936.png 424w, https://substackcdn.com/image/fetch/$s_!WGgL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7664643c-11bb-41ca-b92d-421a315245ab_2692x2936.png 848w, 
https://substackcdn.com/image/fetch/$s_!WGgL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7664643c-11bb-41ca-b92d-421a315245ab_2692x2936.png 1272w, https://substackcdn.com/image/fetch/$s_!WGgL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7664643c-11bb-41ca-b92d-421a315245ab_2692x2936.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WGgL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7664643c-11bb-41ca-b92d-421a315245ab_2692x2936.png" width="540" height="588.9560439560439" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7664643c-11bb-41ca-b92d-421a315245ab_2692x2936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1588,&quot;width&quot;:1456,&quot;resizeWidth&quot;:540,&quot;bytes&quot;:484118,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7664643c-11bb-41ca-b92d-421a315245ab_2692x2936.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WGgL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7664643c-11bb-41ca-b92d-421a315245ab_2692x2936.png 424w, 
https://substackcdn.com/image/fetch/$s_!WGgL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7664643c-11bb-41ca-b92d-421a315245ab_2692x2936.png 848w, https://substackcdn.com/image/fetch/$s_!WGgL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7664643c-11bb-41ca-b92d-421a315245ab_2692x2936.png 1272w, https://substackcdn.com/image/fetch/$s_!WGgL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7664643c-11bb-41ca-b92d-421a315245ab_2692x2936.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>The Transactional View</h4><p>The Go runtime acts as a mediator for the unbuffered channel.</p><ol><li><p>Sender Arrives First:</p><ul><li><p>Goroutine A executes ch &lt;- "hello".</p></li><li><p>The runtime sees there is no receiver waiting.</p></li><li><p>The runtime suspends Goroutine A. It is now blocked, waiting for a partner.</p></li></ul></li><li><p>Receiver Arrives:</p><ul><li><p>Goroutine B executes msg := &lt;-ch.</p></li><li><p>The runtime sees there is a sender (Goroutine A) waiting.</p></li><li><p>The Rendezvous Transaction Begins:</p><ul><li><p>The runtime takes the value ("hello") directly from the sending goroutine (A).</p></li><li><p>It passes this value to the receiving goroutine (B).</p></li><li><p>The value is assigned to the variable msg.</p></li><li><p>Now that the transaction is complete, the runtime marks Goroutine A as runnable again. Goroutine B never had to block in this scenario, so it simply keeps running.</p></li></ul></li><li><p>The Transaction Ends.</p></li></ul></li><li><p>Execution Continues:</p><ul><li><p>At this point, the statement msg := &lt;-ch is complete in Goroutine B, which can now move on to its next statement, fmt.Printf.</p></li><li><p>Simultaneously, the statement ch &lt;- "hello" is complete in Goroutine A. 
Goroutine A can now move on to its next statement.</p><p></p></li></ul></li></ol><blockquote><p><em>The order can also be reversed: if execution reaches the receive before the send, the receiving goroutine (here, the main goroutine) blocks until another goroutine performs the send.</em></p></blockquote><p></p><h4>Buffered Channels and Decoupling</h4><p>A <strong>buffered channel</strong> is created with a capacity greater than zero.</p><pre><code>ch := make(chan int, 3) // A channel that can hold up to 3 integers</code></pre><p>A buffered channel decouples the sender and receiver.</p><ul><li><p>A <strong>send</strong> operation on a buffered channel will only block if the channel's buffer is <strong>full</strong>. If there is space, the send completes immediately, and the value is stored in the channel's buffer.</p></li><li><p>A <strong>receive</strong> operation will only block if the channel's buffer is <strong>empty</strong>.</p></li></ul><p>This allows the sender and receiver to work at different paces, as long as the buffer doesn't fill up or empty out.</p><p></p><h4>Deadlocks with Channels</h4><p>A deadlock occurs when all goroutines in a program are blocked, waiting for something that can never happen. The Go runtime detects this situation and aborts the program with "fatal error: all goroutines are asleep - deadlock!".</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L3p8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ab5239-1caf-49de-8c28-417a48adb119_3184x1292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L3p8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ab5239-1caf-49de-8c28-417a48adb119_3184x1292.png 424w, 
https://substackcdn.com/image/fetch/$s_!L3p8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ab5239-1caf-49de-8c28-417a48adb119_3184x1292.png 848w, https://substackcdn.com/image/fetch/$s_!L3p8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ab5239-1caf-49de-8c28-417a48adb119_3184x1292.png 1272w, https://substackcdn.com/image/fetch/$s_!L3p8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ab5239-1caf-49de-8c28-417a48adb119_3184x1292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L3p8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ab5239-1caf-49de-8c28-417a48adb119_3184x1292.png" width="1456" height="591" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08ab5239-1caf-49de-8c28-417a48adb119_3184x1292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:591,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:266807,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ab5239-1caf-49de-8c28-417a48adb119_3184x1292.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!L3p8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ab5239-1caf-49de-8c28-417a48adb119_3184x1292.png 424w, https://substackcdn.com/image/fetch/$s_!L3p8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ab5239-1caf-49de-8c28-417a48adb119_3184x1292.png 848w, https://substackcdn.com/image/fetch/$s_!L3p8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ab5239-1caf-49de-8c28-417a48adb119_3184x1292.png 1272w, https://substackcdn.com/image/fetch/$s_!L3p8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ab5239-1caf-49de-8c28-417a48adb119_3184x1292.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>close(channel)</h4><ul><li><p>Signals that no more values will ever be sent on this channel.</p></li><li><p>It is a final "goodbye" from the sender(s).</p></li></ul><p><strong>Properties of a Closed Channel:</strong></p><ol><li><p>Sending to a closed channel will cause a <strong>panic</strong>. This is a strict rule: once you say you're done sending, you must be done.</p></li><li><p>Receiving from a closed channel <strong>never blocks</strong>. It immediately returns a value.</p><ul><li><p>If there are values still in the buffer, it returns them one by one.</p></li><li><p>Once the buffer is empty, any subsequent receives will immediately return the <strong>zero value</strong> for the channel's type (e.g., 0 for int, "" for string, nil for pointers).</p></li></ul></li></ol><p>How does a receiver know the difference between a legitimate zero value and a zero value from a closed, empty channel? The receive operator has a special two-variable form:</p><pre><code>value, ok := &lt;-channel</code></pre><ul><li><p>ok is a boolean.</p></li><li><p>If ok is true, the value was a legitimate value sent on the channel.</p></li><li><p>If ok is false, it means the channel is closed and empty. The value will be the zero value for its type.</p></li></ul><p>This brings us to the most elegant way to receive all values from a channel until it is closed: a <strong>for...range loop</strong>.</p><pre><code>for item := range channel {
    // This loop will automatically receive values from the channel
    // and assign them to 'item'.
    
    // The loop will automatically break when the channel is closed
    // and all values have been drained from its buffer.
}</code></pre><p>Let&#8217;s see all of this in action in a code sample:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LJBS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0031270-3325-4a38-847b-157cf1547e12_2528x3452.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LJBS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0031270-3325-4a38-847b-157cf1547e12_2528x3452.png 424w, https://substackcdn.com/image/fetch/$s_!LJBS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0031270-3325-4a38-847b-157cf1547e12_2528x3452.png 848w, https://substackcdn.com/image/fetch/$s_!LJBS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0031270-3325-4a38-847b-157cf1547e12_2528x3452.png 1272w, https://substackcdn.com/image/fetch/$s_!LJBS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0031270-3325-4a38-847b-157cf1547e12_2528x3452.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LJBS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0031270-3325-4a38-847b-157cf1547e12_2528x3452.png" width="502" height="685.4230769230769" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0031270-3325-4a38-847b-157cf1547e12_2528x3452.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1988,&quot;width&quot;:1456,&quot;resizeWidth&quot;:502,&quot;bytes&quot;:625163,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0031270-3325-4a38-847b-157cf1547e12_2528x3452.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LJBS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0031270-3325-4a38-847b-157cf1547e12_2528x3452.png 424w, https://substackcdn.com/image/fetch/$s_!LJBS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0031270-3325-4a38-847b-157cf1547e12_2528x3452.png 848w, https://substackcdn.com/image/fetch/$s_!LJBS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0031270-3325-4a38-847b-157cf1547e12_2528x3452.png 1272w, https://substackcdn.com/image/fetch/$s_!LJBS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0031270-3325-4a38-847b-157cf1547e12_2528x3452.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>If you look carefully, there is a race between the producer function and the for&#8230;range loop: either one may run first.<br><br>What if the loop wins? Then the jobs channel is still empty and, as noted above, receiving from an empty, still-open channel blocks. The loop simply waits until the producer sends a value.</p><h4>Directional Channels (chan&lt;- and &lt;-chan) for API Safety</h4><ul><li><p>&lt;-chan T: A <strong>receive-only</strong> channel of type T. You can only receive from it (val := &lt;-ch). You cannot send to it or close it.</p></li><li><p>chan&lt;- T: A <strong>send-only</strong> channel of type T. You can send to it (ch &lt;- val). 
You cannot receive from it, but you can still close it: closing is the sender's job, so close() is allowed on a send-only channel and rejected by the compiler on a receive-only one.</p></li></ul><p>This is extremely useful for writing clear APIs.</p><ul><li><p>A function that produces data should accept a <strong>send-only</strong> channel as an argument.</p></li><li><p>A function that consumes data should accept a <strong>receive-only</strong> channel as an argument.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ufBb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66027105-6c0a-4e26-9d0d-ff1224f2646b_2344x2464.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ufBb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66027105-6c0a-4e26-9d0d-ff1224f2646b_2344x2464.png 424w, https://substackcdn.com/image/fetch/$s_!ufBb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66027105-6c0a-4e26-9d0d-ff1224f2646b_2344x2464.png 848w, https://substackcdn.com/image/fetch/$s_!ufBb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66027105-6c0a-4e26-9d0d-ff1224f2646b_2344x2464.png 1272w, https://substackcdn.com/image/fetch/$s_!ufBb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66027105-6c0a-4e26-9d0d-ff1224f2646b_2344x2464.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ufBb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66027105-6c0a-4e26-9d0d-ff1224f2646b_2344x2464.png" width="1456" height="1531" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66027105-6c0a-4e26-9d0d-ff1224f2646b_2344x2464.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1531,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:404110,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66027105-6c0a-4e26-9d0d-ff1224f2646b_2344x2464.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ufBb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66027105-6c0a-4e26-9d0d-ff1224f2646b_2344x2464.png 424w, https://substackcdn.com/image/fetch/$s_!ufBb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66027105-6c0a-4e26-9d0d-ff1224f2646b_2344x2464.png 848w, https://substackcdn.com/image/fetch/$s_!ufBb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66027105-6c0a-4e26-9d0d-ff1224f2646b_2344x2464.png 1272w, https://substackcdn.com/image/fetch/$s_!ufBb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66027105-6c0a-4e26-9d0d-ff1224f2646b_2344x2464.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Common Channel Use Cases</h3><p>This section is about patterns: recipes for structuring concurrent code. We'll look at three of the most fundamental: Worker Pools, Fan-in/Fan-out, and Pipelines.</p><h4>Worker Pools</h4><p>The idea is to control the level of concurrency for a set of tasks. Instead of launching a new goroutine for every single task (which could be thousands or millions), you launch a fixed number of persistent "worker" goroutines. 
You then feed tasks to these workers via a channel.</p><h4>Real-World Example: Concurrent Thumbnail Generator</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Z_9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df2f180-a017-4ef3-8977-f4a9c9d4269f_3680x8672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Z_9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df2f180-a017-4ef3-8977-f4a9c9d4269f_3680x8672.png 424w, https://substackcdn.com/image/fetch/$s_!1Z_9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df2f180-a017-4ef3-8977-f4a9c9d4269f_3680x8672.png 848w, https://substackcdn.com/image/fetch/$s_!1Z_9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df2f180-a017-4ef3-8977-f4a9c9d4269f_3680x8672.png 1272w, https://substackcdn.com/image/fetch/$s_!1Z_9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df2f180-a017-4ef3-8977-f4a9c9d4269f_3680x8672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Z_9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df2f180-a017-4ef3-8977-f4a9c9d4269f_3680x8672.png" width="1456" height="3431" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3df2f180-a017-4ef3-8977-f4a9c9d4269f_3680x8672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3431,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2214112,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df2f180-a017-4ef3-8977-f4a9c9d4269f_3680x8672.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1Z_9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df2f180-a017-4ef3-8977-f4a9c9d4269f_3680x8672.png 424w, https://substackcdn.com/image/fetch/$s_!1Z_9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df2f180-a017-4ef3-8977-f4a9c9d4269f_3680x8672.png 848w, https://substackcdn.com/image/fetch/$s_!1Z_9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df2f180-a017-4ef3-8977-f4a9c9d4269f_3680x8672.png 1272w, https://substackcdn.com/image/fetch/$s_!1Z_9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3df2f180-a017-4ef3-8977-f4a9c9d4269f_3680x8672.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Fan-in / Fan-out</h4><p>Fan-out means several goroutines read from the same channel so the work is spread across them; fan-in means merging the results from several channels into one. I honestly do not fully understand this pattern yet, so I am just pasting the code snippet here.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AD8B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F194d4421-be31-4bd8-b67b-6bea1f919d21_2856x5612.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AD8B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F194d4421-be31-4bd8-b67b-6bea1f919d21_2856x5612.png 424w, 
https://substackcdn.com/image/fetch/$s_!AD8B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F194d4421-be31-4bd8-b67b-6bea1f919d21_2856x5612.png 848w, https://substackcdn.com/image/fetch/$s_!AD8B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F194d4421-be31-4bd8-b67b-6bea1f919d21_2856x5612.png 1272w, https://substackcdn.com/image/fetch/$s_!AD8B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F194d4421-be31-4bd8-b67b-6bea1f919d21_2856x5612.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AD8B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F194d4421-be31-4bd8-b67b-6bea1f919d21_2856x5612.png" width="1456" height="2861" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/194d4421-be31-4bd8-b67b-6bea1f919d21_2856x5612.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2861,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1099964,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F194d4421-be31-4bd8-b67b-6bea1f919d21_2856x5612.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!AD8B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F194d4421-be31-4bd8-b67b-6bea1f919d21_2856x5612.png 424w, https://substackcdn.com/image/fetch/$s_!AD8B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F194d4421-be31-4bd8-b67b-6bea1f919d21_2856x5612.png 848w, https://substackcdn.com/image/fetch/$s_!AD8B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F194d4421-be31-4bd8-b67b-6bea1f919d21_2856x5612.png 1272w, https://substackcdn.com/image/fetch/$s_!AD8B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F194d4421-be31-4bd8-b67b-6bea1f919d21_2856x5612.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Pipelines</h4><p>A pipeline is a chain of processing stages connected by channels. Each stage is a goroutine that:</p><ol><li><p>Receives values from an upstream channel.</p></li><li><p>Performs some function on that value.</p></li><li><p>Sends the result to a downstream channel.</p></li></ol><p>This creates a concurrent assembly line. It's a very powerful and elegant way to structure data processing tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qXFc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1aeff2-c847-4429-9847-8b1191757ee0_2080x3544.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qXFc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1aeff2-c847-4429-9847-8b1191757ee0_2080x3544.png 424w, https://substackcdn.com/image/fetch/$s_!qXFc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1aeff2-c847-4429-9847-8b1191757ee0_2080x3544.png 848w, https://substackcdn.com/image/fetch/$s_!qXFc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1aeff2-c847-4429-9847-8b1191757ee0_2080x3544.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qXFc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1aeff2-c847-4429-9847-8b1191757ee0_2080x3544.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qXFc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1aeff2-c847-4429-9847-8b1191757ee0_2080x3544.png" width="422" height="719.0810439560439" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f1aeff2-c847-4429-9847-8b1191757ee0_2080x3544.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2481,&quot;width&quot;:1456,&quot;resizeWidth&quot;:422,&quot;bytes&quot;:534694,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1aeff2-c847-4429-9847-8b1191757ee0_2080x3544.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qXFc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1aeff2-c847-4429-9847-8b1191757ee0_2080x3544.png 424w, https://substackcdn.com/image/fetch/$s_!qXFc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1aeff2-c847-4429-9847-8b1191757ee0_2080x3544.png 848w, 
https://substackcdn.com/image/fetch/$s_!qXFc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1aeff2-c847-4429-9847-8b1191757ee0_2080x3544.png 1272w, https://substackcdn.com/image/fetch/$s_!qXFc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1aeff2-c847-4429-9847-8b1191757ee0_2080x3544.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>The <em><strong>select</strong></em> Statement</h3><p>A select statement blocks until one of its cases can run, then it executes 
that case. If multiple cases are ready at the same time, it chooses one at <strong>random</strong> to execute. This randomness is important because it ensures fairness and prevents a "busy" channel from always starving out another channel.</p><h4>Waiting on Multiple Channels</h4><p>This is the primary use case for select. Imagine a goroutine that needs to process work coming from two different producers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vloI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e8ea86-8d02-4be7-8d48-7a5d2a6bf3f9_3368x3904.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vloI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e8ea86-8d02-4be7-8d48-7a5d2a6bf3f9_3368x3904.png 424w, https://substackcdn.com/image/fetch/$s_!vloI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e8ea86-8d02-4be7-8d48-7a5d2a6bf3f9_3368x3904.png 848w, https://substackcdn.com/image/fetch/$s_!vloI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e8ea86-8d02-4be7-8d48-7a5d2a6bf3f9_3368x3904.png 1272w, https://substackcdn.com/image/fetch/$s_!vloI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e8ea86-8d02-4be7-8d48-7a5d2a6bf3f9_3368x3904.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vloI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e8ea86-8d02-4be7-8d48-7a5d2a6bf3f9_3368x3904.png" width="1456" height="1688" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31e8ea86-8d02-4be7-8d48-7a5d2a6bf3f9_3368x3904.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1688,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:726496,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e8ea86-8d02-4be7-8d48-7a5d2a6bf3f9_3368x3904.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vloI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e8ea86-8d02-4be7-8d48-7a5d2a6bf3f9_3368x3904.png 424w, https://substackcdn.com/image/fetch/$s_!vloI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e8ea86-8d02-4be7-8d48-7a5d2a6bf3f9_3368x3904.png 848w, https://substackcdn.com/image/fetch/$s_!vloI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e8ea86-8d02-4be7-8d48-7a5d2a6bf3f9_3368x3904.png 1272w, https://substackcdn.com/image/fetch/$s_!vloI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e8ea86-8d02-4be7-8d48-7a5d2a6bf3f9_3368x3904.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A select statement normally blocks. 
However, you can make it non-blocking by adding a default case, which runs immediately when no other case is ready.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rn9l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa625d268-95af-4cb9-a989-25182b77668d_3368x2824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rn9l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa625d268-95af-4cb9-a989-25182b77668d_3368x2824.png 424w, https://substackcdn.com/image/fetch/$s_!Rn9l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa625d268-95af-4cb9-a989-25182b77668d_3368x2824.png 848w, https://substackcdn.com/image/fetch/$s_!Rn9l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa625d268-95af-4cb9-a989-25182b77668d_3368x2824.png 1272w, https://substackcdn.com/image/fetch/$s_!Rn9l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa625d268-95af-4cb9-a989-25182b77668d_3368x2824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rn9l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa625d268-95af-4cb9-a989-25182b77668d_3368x2824.png" width="1456" height="1221" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a625d268-95af-4cb9-a989-25182b77668d_3368x2824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1221,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:504162,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa625d268-95af-4cb9-a989-25182b77668d_3368x2824.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rn9l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa625d268-95af-4cb9-a989-25182b77668d_3368x2824.png 424w, https://substackcdn.com/image/fetch/$s_!Rn9l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa625d268-95af-4cb9-a989-25182b77668d_3368x2824.png 848w, https://substackcdn.com/image/fetch/$s_!Rn9l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa625d268-95af-4cb9-a989-25182b77668d_3368x2824.png 1272w, https://substackcdn.com/image/fetch/$s_!Rn9l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa625d268-95af-4cb9-a989-25182b77668d_3368x2824.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The most powerful and common pattern is combining for and select. 
This creates a goroutine that acts like a server, continuously processing events from multiple channels until it receives a signal to stop.<br></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XUMZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50d897-bf5c-4fa8-bd2c-3e6732a2fbad_3368x3992.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XUMZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50d897-bf5c-4fa8-bd2c-3e6732a2fbad_3368x3992.png 424w, https://substackcdn.com/image/fetch/$s_!XUMZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50d897-bf5c-4fa8-bd2c-3e6732a2fbad_3368x3992.png 848w, https://substackcdn.com/image/fetch/$s_!XUMZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50d897-bf5c-4fa8-bd2c-3e6732a2fbad_3368x3992.png 1272w, https://substackcdn.com/image/fetch/$s_!XUMZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50d897-bf5c-4fa8-bd2c-3e6732a2fbad_3368x3992.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XUMZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50d897-bf5c-4fa8-bd2c-3e6732a2fbad_3368x3992.png" width="1456" height="1726" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa50d897-bf5c-4fa8-bd2c-3e6732a2fbad_3368x3992.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1726,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:772480,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50d897-bf5c-4fa8-bd2c-3e6732a2fbad_3368x3992.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XUMZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50d897-bf5c-4fa8-bd2c-3e6732a2fbad_3368x3992.png 424w, https://substackcdn.com/image/fetch/$s_!XUMZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50d897-bf5c-4fa8-bd2c-3e6732a2fbad_3368x3992.png 848w, https://substackcdn.com/image/fetch/$s_!XUMZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50d897-bf5c-4fa8-bd2c-3e6732a2fbad_3368x3992.png 1272w, https://substackcdn.com/image/fetch/$s_!XUMZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa50d897-bf5c-4fa8-bd2c-3e6732a2fbad_3368x3992.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Note: instead of the final sleep we could have used wait groups, <strong>or</strong> a neat trick that simulates wait groups using only channels.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!50Ok!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08debae6-f127-46a9-97e8-793bdd9f0029_3368x4172.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!50Ok!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08debae6-f127-46a9-97e8-793bdd9f0029_3368x4172.png 424w, 
https://substackcdn.com/image/fetch/$s_!50Ok!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08debae6-f127-46a9-97e8-793bdd9f0029_3368x4172.png 848w, https://substackcdn.com/image/fetch/$s_!50Ok!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08debae6-f127-46a9-97e8-793bdd9f0029_3368x4172.png 1272w, https://substackcdn.com/image/fetch/$s_!50Ok!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08debae6-f127-46a9-97e8-793bdd9f0029_3368x4172.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!50Ok!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08debae6-f127-46a9-97e8-793bdd9f0029_3368x4172.png" width="1456" height="1804" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08debae6-f127-46a9-97e8-793bdd9f0029_3368x4172.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1804,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:858638,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08debae6-f127-46a9-97e8-793bdd9f0029_3368x4172.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!50Ok!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08debae6-f127-46a9-97e8-793bdd9f0029_3368x4172.png 424w, https://substackcdn.com/image/fetch/$s_!50Ok!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08debae6-f127-46a9-97e8-793bdd9f0029_3368x4172.png 848w, https://substackcdn.com/image/fetch/$s_!50Ok!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08debae6-f127-46a9-97e8-793bdd9f0029_3368x4172.png 1272w, https://substackcdn.com/image/fetch/$s_!50Ok!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08debae6-f127-46a9-97e8-793bdd9f0029_3368x4172.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Timeouts and Tickers</h2><p>Often, a goroutine can't afford to wait forever on a channel operation. You might be making a network request or waiting for a job from a queue, and you need to give up after a certain amount of time.</p><p>The <strong>time.After</strong> function is the right tool for this: it returns a channel that receives the current time once the given duration has elapsed.</p><p>When you use that channel as a case in a select statement, the select stops waiting as soon as the timer fires, which gives you a timeout.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0oec!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7552099b-17ff-4c18-a2e7-3865564698b7_3368x3812.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0oec!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7552099b-17ff-4c18-a2e7-3865564698b7_3368x3812.png 424w, https://substackcdn.com/image/fetch/$s_!0oec!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7552099b-17ff-4c18-a2e7-3865564698b7_3368x3812.png 848w, https://substackcdn.com/image/fetch/$s_!0oec!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7552099b-17ff-4c18-a2e7-3865564698b7_3368x3812.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0oec!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7552099b-17ff-4c18-a2e7-3865564698b7_3368x3812.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0oec!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7552099b-17ff-4c18-a2e7-3865564698b7_3368x3812.png" width="1456" height="1648" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7552099b-17ff-4c18-a2e7-3865564698b7_3368x3812.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1648,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:809054,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7552099b-17ff-4c18-a2e7-3865564698b7_3368x3812.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0oec!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7552099b-17ff-4c18-a2e7-3865564698b7_3368x3812.png 424w, https://substackcdn.com/image/fetch/$s_!0oec!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7552099b-17ff-4c18-a2e7-3865564698b7_3368x3812.png 848w, 
https://substackcdn.com/image/fetch/$s_!0oec!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7552099b-17ff-4c18-a2e7-3865564698b7_3368x3812.png 1272w, https://substackcdn.com/image/fetch/$s_!0oec!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7552099b-17ff-4c18-a2e7-3865564698b7_3368x3812.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Output:</strong></p><pre><code>Scenario 1: Waiting for result with a 3-second timeout...
   -&gt; Received result: result 1

Scenario 2: Waiting for result with a 1-second timeout...
   -&gt; Timed out!</code></pre><h3>time.Ticker for Periodic Tasks</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VMyq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8641d9b-6cbb-4ecc-a489-654d04afd909_3368x3004.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VMyq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8641d9b-6cbb-4ecc-a489-654d04afd909_3368x3004.png 424w, https://substackcdn.com/image/fetch/$s_!VMyq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8641d9b-6cbb-4ecc-a489-654d04afd909_3368x3004.png 848w, https://substackcdn.com/image/fetch/$s_!VMyq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8641d9b-6cbb-4ecc-a489-654d04afd909_3368x3004.png 1272w, https://substackcdn.com/image/fetch/$s_!VMyq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8641d9b-6cbb-4ecc-a489-654d04afd909_3368x3004.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VMyq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8641d9b-6cbb-4ecc-a489-654d04afd909_3368x3004.png" width="1456" height="1299" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8641d9b-6cbb-4ecc-a489-654d04afd909_3368x3004.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1299,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:557432,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8641d9b-6cbb-4ecc-a489-654d04afd909_3368x3004.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VMyq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8641d9b-6cbb-4ecc-a489-654d04afd909_3368x3004.png 424w, https://substackcdn.com/image/fetch/$s_!VMyq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8641d9b-6cbb-4ecc-a489-654d04afd909_3368x3004.png 848w, https://substackcdn.com/image/fetch/$s_!VMyq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8641d9b-6cbb-4ecc-a489-654d04afd909_3368x3004.png 1272w, https://substackcdn.com/image/fetch/$s_!VMyq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8641d9b-6cbb-4ecc-a489-654d04afd909_3368x3004.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Graceful Shutdown and Cancellation with context</h3><p><strong>The Problem It Solves:</strong></p><p>Imagine you have a web server. A user sends an HTTP request.</p><ol><li><p>Your server handler starts a goroutine to handle the request.</p></li><li><p>This goroutine makes a call to a database.</p></li><li><p>It also makes a call to a microservice.</p></li><li><p>The microservice itself might make other calls.</p></li></ol><p>Now, what happens if the user closes their browser? The initial HTTP request is cancelled. All the downstream work being done by the database and microservice calls is now pointless. 
We need a way to tell all the goroutines involved in this request, &#8220;Stop your work, the result is no longer needed.&#8221;</p><p>The context package provides this cancellation signal.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cPe_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1c1340-5bbe-4c09-9246-62a3a7cf8f50_3364x4172.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cPe_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1c1340-5bbe-4c09-9246-62a3a7cf8f50_3364x4172.png 424w, https://substackcdn.com/image/fetch/$s_!cPe_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1c1340-5bbe-4c09-9246-62a3a7cf8f50_3364x4172.png 848w, https://substackcdn.com/image/fetch/$s_!cPe_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1c1340-5bbe-4c09-9246-62a3a7cf8f50_3364x4172.png 1272w, https://substackcdn.com/image/fetch/$s_!cPe_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1c1340-5bbe-4c09-9246-62a3a7cf8f50_3364x4172.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cPe_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1c1340-5bbe-4c09-9246-62a3a7cf8f50_3364x4172.png" width="1456" height="1806" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da1c1340-5bbe-4c09-9246-62a3a7cf8f50_3364x4172.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1806,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1089962,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1c1340-5bbe-4c09-9246-62a3a7cf8f50_3364x4172.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cPe_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1c1340-5bbe-4c09-9246-62a3a7cf8f50_3364x4172.png 424w, https://substackcdn.com/image/fetch/$s_!cPe_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1c1340-5bbe-4c09-9246-62a3a7cf8f50_3364x4172.png 848w, https://substackcdn.com/image/fetch/$s_!cPe_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1c1340-5bbe-4c09-9246-62a3a7cf8f50_3364x4172.png 1272w, https://substackcdn.com/image/fetch/$s_!cPe_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1c1340-5bbe-4c09-9246-62a3a7cf8f50_3364x4172.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Error Handling in Concurrent Code</h3><p>In a standard sequential program, error handling is straightforward: a function returns an error, and the caller immediately checks if err != nil.</p><p>In a concurrent program, this breaks down. When you launch a goroutine with go myFunction(), you can&#8217;t get a return value. So how does the goroutine report back if it fails? What if you launch 100 goroutines and need to know if <em>any</em> of them failed?</p><p>Simply logging the error from within the goroutine is not enough. 
The main goroutine, which launched the work, often needs to know about the failure to decide on a course of action (e.g., cancel other workers, retry the operation, or exit).</p><p>We need a pattern to propagate errors <em>out</em> of goroutines and aggregate them for the calling function.</p><h4>5.2.1: Using a Dedicated Error Channel</h4><p>A common, idiomatic way to solve this in Go is to use the same tool we use for data: a channel. We can create a dedicated channel just for passing error values.</p><p><strong>The Pattern:</strong></p><ol><li><p>Create a buffered channel for errors. The buffer size is typically the number of workers, so that no worker will block when trying to send an error (assuming each worker sends at most one).</p></li><li><p>Pass this error channel to each worker goroutine.</p></li><li><p>Inside the worker, if an error occurs, send it into the error channel instead of just logging it. If no error occurs, the worker does <em>not</em> send anything.</p></li><li><p>The main goroutine, after launching the workers, must have a way to collect and process these errors.</p></li></ol><h4>5.2.2: Combining WaitGroup and Error Aggregation</h4><p>The main goroutine cannot simply range over the error channel: a range loop only ends once the channel is closed, and so far nothing closes it. It also cannot read a fixed N errors (one per worker), because workers that succeed send nothing; if only one worker fails, the second receive blocks forever and the program deadlocks.</p><p>The correct pattern is to use a sync.WaitGroup to know when all workers have finished <em>their attempts</em>. Then, and only then, can we safely close the error channel and read whatever errors were sent.</p><p><strong>The Full Pattern:</strong></p><ol><li><p>Create a tasks channel and an errs channel.</p></li><li><p>Create a sync.WaitGroup.</p></li><li><p>Start a fixed number of workers. In each worker:</p><ul><li><p>defer wg.Done().</p></li><li><p>Process tasks.</p></li><li><p>If a task fails, send the error to the errs channel.</p></li></ul></li><li><p>Start a separate &#8220;closer&#8221; goroutine. 
This goroutine&#8217;s only job is to wg.Wait() and then close(errs). This is the key to breaking the deadlock.</p></li><li><p>The main goroutine is now free to range over the errs channel. This loop will block until the closer goroutine closes the channel, at which point it will process any received errors and then terminate.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Sq_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d63834a-e1ba-455a-81fd-5b73f915c5fa_3292x6784.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Sq_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d63834a-e1ba-455a-81fd-5b73f915c5fa_3292x6784.png 424w, https://substackcdn.com/image/fetch/$s_!0Sq_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d63834a-e1ba-455a-81fd-5b73f915c5fa_3292x6784.png 848w, https://substackcdn.com/image/fetch/$s_!0Sq_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d63834a-e1ba-455a-81fd-5b73f915c5fa_3292x6784.png 1272w, https://substackcdn.com/image/fetch/$s_!0Sq_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d63834a-e1ba-455a-81fd-5b73f915c5fa_3292x6784.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Sq_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d63834a-e1ba-455a-81fd-5b73f915c5fa_3292x6784.png" width="1456" height="3000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d63834a-e1ba-455a-81fd-5b73f915c5fa_3292x6784.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3000,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1520222,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/174248484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d63834a-e1ba-455a-81fd-5b73f915c5fa_3292x6784.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0Sq_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d63834a-e1ba-455a-81fd-5b73f915c5fa_3292x6784.png 424w, https://substackcdn.com/image/fetch/$s_!0Sq_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d63834a-e1ba-455a-81fd-5b73f915c5fa_3292x6784.png 848w, https://substackcdn.com/image/fetch/$s_!0Sq_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d63834a-e1ba-455a-81fd-5b73f915c5fa_3292x6784.png 1272w, https://substackcdn.com/image/fetch/$s_!0Sq_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d63834a-e1ba-455a-81fd-5b73f915c5fa_3292x6784.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>That&#8217;s it for this post. To be honest, I&#8217;m not an expert on Go concurrency yet, but I&#8217;ll keep learning as much as I can.</p>]]></content:encoded></item><item><title><![CDATA[Thread Synchronization in C]]></title><description><![CDATA[I don&#8217;t write for absolute beginners, so this post is for someone who knows at a high level how threads work but does not know the low-level details]]></description><link>https://goyalayus.substack.com/p/thread-synchronization-in-c</link><guid isPermaLink="false">https://goyalayus.substack.com/p/thread-synchronization-in-c</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Mon, 22 Sep 2025 10:04:00 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!Gd2d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85dce67-234b-4d91-a029-f889ea84d114_941x410.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I don&#8217;t write for absolute beginners, so this post is for someone who knows at a high level how threads work but does not know the low-level details.</p><p>some facts on threads first</p><ul><li><p>A thread is the smallest unit of execution that the operating system can schedule onto a CPU</p></li><li><p>A process can have one or more threads; they share the same memory space but run independently</p></li><li><p>Each thread has its own stack (function calls, local variables) but shares heap, globals, and resources with other threads of the same process</p></li><li><p>Threads let multiple tasks run seemingly at the same time within a single program</p></li><li><p>Concurrency comes from starting multiple threads so that while one is waiting (for I/O, a lock, etc.), another can use the CPU</p></li><li><p>On multi-core CPUs, different threads can actually run in parallel on different cores</p></li><li><p><strong>The downside: since threads share memory, you must use synchronization (locks, semaphores, etc.) to prevent data corruption when multiple threads access the same data</strong></p></li></ul><p>the last point is what this whole article is about, so let&#8217;s lock in &#128170;</p><p></p><h3><strong>Race Condition</strong></h3><p>Consider a simple, common operation: incrementing a shared counter. 
This appears atomic at a high level, but at the machine instruction level, it's typically a sequence of three distinct operations:</p><ol><li><p><strong>Read:</strong> Load the current value of the counter into a CPU register.</p></li><li><p><strong>Modify:</strong> Increment the value in the register.</p></li><li><p><strong>Write:</strong> Store the new value from the register back into memory.</p></li></ol><p>Now, imagine two threads, Thread A and Thread B, both attempting to increment the same shared integer counter which is initially 0.</p><p><strong>Scenario 1: Desired Outcome (Sequential Execution)</strong></p><ul><li><p><strong>Thread A:</strong></p><ul><li><p>Reads counter (value is 0)</p></li><li><p>Increments register (register value is 1)</p></li><li><p>Writes register to counter (counter is now 1)</p></li></ul></li><li><p><strong>Thread B:</strong></p><ul><li><p>Reads counter (value is 1)</p></li><li><p>Increments register (register value is 2)</p></li><li><p>Writes register to counter (counter is now 2)</p></li></ul></li></ul><p>Final counter value: 2. This is correct.</p><p><strong>Scenario 2: Race Condition Leading to Incorrect Outcome</strong></p><ul><li><p><strong>Thread A:</strong> Reads counter (value is 0). <em>Context switch.</em></p></li><li><p><strong>Thread B:</strong></p><ul><li><p>Reads counter (value is 0)</p></li><li><p>Increments register (register value is 1)</p></li><li><p>Writes register to counter (counter is now 1). <em>Context switch.</em></p></li></ul></li><li><p><strong>Thread A:</strong></p><ul><li><p>Increments register (register value, which was 0 from its initial read, becomes 1)</p></li><li><p>Writes register to counter (counter is now 1).</p></li></ul></li></ul><p>Final counter value: 1. This is incorrect. Both threads executed their increment operation, but the shared counter only reflects one increment because Thread A overwrote Thread B's update with an outdated value. 
This is a classic example of a "lost update."</p><pre><code>#include &lt;stdio.h&gt;
#include &lt;pthread.h&gt;

int shared_counter = 0;  // shared by both threads, intentionally unprotected

void* increment_function(void* arg) {
    int num_iterations = *((int*)arg);
    for (int i = 0; i &lt; num_iterations; i++) {
        shared_counter = shared_counter + 1;  // not atomic: read, modify, write
    }
    return NULL;
}

int main(void) {
    pthread_t thread1, thread2;
    int iterations_per_thread = 100000;

    pthread_create(&amp;thread1, NULL, increment_function, &amp;iterations_per_thread);
    pthread_create(&amp;thread2, NULL, increment_function, &amp;iterations_per_thread);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    printf("Expected: %d\n", 2 * iterations_per_thread);
    printf("Actual: %d\n", shared_counter);

    return 0;
}</code></pre><h3>Atomicity</h3><p>From the perspective of any other thread, an atomic operation is either not yet started or fully complete; there is no observable intermediate state.</p><p>As we saw, the C statement shared_counter = shared_counter + 1; is not atomic. It is composed of multiple machine instructions (read, modify, write). A race condition occurs precisely because another thread can run <em>in the middle</em> of this non-atomic sequence, either because the first thread is preempted or because the other thread is executing in parallel on a different core.</p><p>If we could make that increment operation atomic, the race condition would disappear.</p><p><strong>How is Atomicity Achieved?</strong></p><ol><li><p><strong>Hardware Support:</strong> Modern CPUs provide special atomic instructions. These instructions are guaranteed by the hardware to execute as a single indivisible step. Common examples include:</p><ul><li><p><strong>Test-and-Set:</strong> Atomically writes a value to a memory location and returns its old value.</p></li><li><p><strong>Fetch-and-Add:</strong> Atomically increments a value in memory and returns the old value.</p></li><li><p><strong>Compare-and-Swap (CAS):</strong> Atomically compares the contents of a memory location to a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is a cornerstone of many advanced lock-free algorithms.</p></li></ul><p>In C, using the &lt;stdatomic.h&gt; header (since C11), you can perform these operations. For example, atomic_fetch_add(&amp;shared_atomic_counter, 1); would perform an atomic increment, provided shared_atomic_counter is declared with an atomic type such as atomic_int.</p></li><li><p><strong>Software Implementation (via Locks):</strong> The more common way to achieve atomicity for a <em>sequence</em> of operations is by using synchronization primitives. 
By acquiring a lock before a sequence of instructions (the critical section) and releasing it afterward, we create a block of code that is <em>effectively</em> atomic from the perspective of other threads that use the same lock.</p></li></ol><h3>CPU-Level Reordering</h3><pre><code>#include &lt;stdio.h&gt;  // for printf in consumer()

// Shared global variables
int data = 0;
int is_ready = 0;

// Thread 1 (Producer)
void producer(void) {
    data = 42;      // Write 1
    is_ready = 1;   // Write 2
}

// Thread 2 (Consumer)
void consumer(void) {
    if (is_ready == 1) { // Read 1
        // The programmer expects that if is_ready is 1,
        // then the write to 'data' must have happened first.
        printf("Data is %d\n", data); // Read 2
    }
}</code></pre><p>On a weakly-ordered system, it is possible for the CPU or compiler to reorder the writes in producer. The is_ready = 1 write might become globally visible to the consumer thread <em>before</em> the data = 42 write does. The consumer would then read is_ready == 1, proceed, and print "Data is 0". This is a subtle but critical form of race condition caused by memory reordering.</p><p><strong>Axiom 1: Private Caches</strong></p><ul><li><p>Each CPU core has its own private L1/L2 cache.</p></li><li><p>A write operation by a thread on Core A updates Core A's private cache first.</p></li><li><p>This write is <strong>not</strong> instantly visible to a thread on Core B. It becomes visible only after the cache coherence protocol propagates the change. This propagation is not instantaneous.</p></li></ul><p><strong>Axiom 2: Store Buffers</strong></p><ul><li><p>To avoid stalling on slow memory writes, each core has a "store buffer." When a thread performs a write, the data is often placed in this buffer first. The CPU can then continue executing other instructions while the memory system works on committing the buffered write to the cache in the background.</p></li><li><p>A write sitting in Core A's store buffer is completely invisible to Core B.</p></li></ul><p><strong>Axiom 3: Instruction Reordering</strong></p><ul><li><p>To maximize performance, the compiler and the CPU are permitted to reorder instructions.</p></li><li><p><strong>Rule 3a (Load/Load and Store/Store Reordering):</strong> Weak architectures (like ARM) can reorder two read operations or two write operations. Stronger architectures (like x86) generally do not reorder writes with other writes.</p></li><li><p><strong>Rule 3b (Store/Load Reordering):</strong> Nearly all architectures (including x86) allow a write to be reordered with a subsequent, independent read. This is the most common and problematic source of reordering. 
The CPU can speculatively execute a read before a preceding write has been fully committed from its store buffer.</p></li></ul><p><strong>Axiom 4: The Programmer's View vs. Reality</strong></p><ul><li><p>The order of instructions in your C code is merely a suggestion to the machine.</p></li><li><p>The final order of execution is a complex combination of compiler reordering and CPU runtime reordering, governed by the axioms above.</p></li></ul><p>With these axioms, let's analyze the execution flow.</p><div><hr></div><h3>Execution Flow: WITHOUT Locks</h3><p><strong>Shared Data:</strong><br>int data = 0;<br>int is_ready = 0;</p><p><strong>Producer Code (Thread P on Core A):</strong></p><ol><li><p>data = 42;</p></li><li><p>is_ready = 1;</p></li></ol><p><strong>Consumer Code (Thread C on Core B):</strong></p><ol><li><p>if (is_ready == 1)</p></li><li><p>print(data);</p></li></ol><p>Let's trace a possible, and perfectly valid, sequence of events according to our axioms.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gd2d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85dce67-234b-4d91-a029-f889ea84d114_941x410.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gd2d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85dce67-234b-4d91-a029-f889ea84d114_941x410.png 424w, https://substackcdn.com/image/fetch/$s_!Gd2d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85dce67-234b-4d91-a029-f889ea84d114_941x410.png 848w, 
https://substackcdn.com/image/fetch/$s_!Gd2d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85dce67-234b-4d91-a029-f889ea84d114_941x410.png 1272w, https://substackcdn.com/image/fetch/$s_!Gd2d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85dce67-234b-4d91-a029-f889ea84d114_941x410.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gd2d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85dce67-234b-4d91-a029-f889ea84d114_941x410.png" width="941" height="410" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d85dce67-234b-4d91-a029-f889ea84d114_941x410.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:410,&quot;width&quot;:941,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61958,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/173587543?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85dce67-234b-4d91-a029-f889ea84d114_941x410.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gd2d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85dce67-234b-4d91-a029-f889ea84d114_941x410.png 424w, https://substackcdn.com/image/fetch/$s_!Gd2d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85dce67-234b-4d91-a029-f889ea84d114_941x410.png 
848w, https://substackcdn.com/image/fetch/$s_!Gd2d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85dce67-234b-4d91-a029-f889ea84d114_941x410.png 1272w, https://substackcdn.com/image/fetch/$s_!Gd2d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85dce67-234b-4d91-a029-f889ea84d114_941x410.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This scenario demonstrates a <strong>producer-side reordering</strong> problem, made possible by the Store Buffer (Axiom 2). 
A similar scenario could be constructed for a <strong>consumer-side reordering</strong> (load/load reordering, Rule 3a) where the consumer reads data before it reads is_ready. The result is the same: incorrect behavior.</p><h3>Execution Flow: WITH Locks</h3><p>Now, let's introduce a new set of rules that apply when we use a mutex.</p><p><strong>The Axioms of a Mutex</strong></p><p><strong>Axiom 5: Mutual Exclusion</strong></p><ul><li><p>Only one thread can hold a given mutex at any one time. A thread calling lock() on a mutex that is already held will block (wait) until the mutex is released by its current owner.</p></li></ul><p><strong>Axiom 6: The Memory Barrier Handshake</strong></p><ul><li><p>A mutex_unlock operation creates a <strong>release barrier</strong>. This acts as a command: "Commit all my buffered writes now. Ensure all my previous memory operations are visible to the rest of the system <em>before</em> the lock is actually released."</p></li><li><p>A mutex_lock operation creates an <strong>acquire barrier</strong>. This acts as a command: "Before I proceed, I must synchronize my local view of memory with the latest changes from whoever last released this lock. Invalidate my stale cache entries."</p></li><li><p>These two barriers work together. The release <em>publishes</em> changes; the acquire <em>receives</em> them. This creates a <strong>"happens-before"</strong> relationship. 
The unlock on Thread A <em>happens-before</em> the subsequent lock on Thread B.</p></li></ul><p><strong>Producer Code (Thread P on Core A):</strong></p><ol><li><p>lock(m);</p></li><li><p>data = 42;</p></li><li><p>is_ready = 1;</p></li><li><p>unlock(m);</p></li></ol><p><strong>Consumer Code (Thread C on Core B):</strong></p><ol><li><p>lock(m);</p></li><li><p>if (is_ready == 1)</p></li><li><p>print(data);</p></li><li><p>unlock(m);</p></li></ol><p>Let's trace the execution flow with these new, powerful rules.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0693!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b3bbdf-501d-44e7-8259-899d23f03d3c_941x410.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0693!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b3bbdf-501d-44e7-8259-899d23f03d3c_941x410.png 424w, https://substackcdn.com/image/fetch/$s_!0693!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b3bbdf-501d-44e7-8259-899d23f03d3c_941x410.png 848w, https://substackcdn.com/image/fetch/$s_!0693!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b3bbdf-501d-44e7-8259-899d23f03d3c_941x410.png 1272w, https://substackcdn.com/image/fetch/$s_!0693!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b3bbdf-501d-44e7-8259-899d23f03d3c_941x410.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0693!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b3bbdf-501d-44e7-8259-899d23f03d3c_941x410.png" width="941" height="410" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61b3bbdf-501d-44e7-8259-899d23f03d3c_941x410.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:410,&quot;width&quot;:941,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57383,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/173587543?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b3bbdf-501d-44e7-8259-899d23f03d3c_941x410.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0693!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b3bbdf-501d-44e7-8259-899d23f03d3c_941x410.png 424w, https://substackcdn.com/image/fetch/$s_!0693!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b3bbdf-501d-44e7-8259-899d23f03d3c_941x410.png 848w, https://substackcdn.com/image/fetch/$s_!0693!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b3bbdf-501d-44e7-8259-899d23f03d3c_941x410.png 1272w, https://substackcdn.com/image/fetch/$s_!0693!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b3bbdf-501d-44e7-8259-899d23f03d3c_941x410.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><h4>Deadlock</h4><p>A <strong>deadlock</strong> is a state in which two or more threads are blocked forever, each waiting for a resource held by the other. The most common cause of deadlock with mutexes is acquiring multiple locks in an inconsistent order.</p><p><strong>The Deadly Embrace</strong></p><p>Imagine two threads, Thread A and Thread B, and two mutexes, M1 and M2.</p><ol><li><p><strong>Thread A</strong> locks M1.</p></li><li><p><strong>Thread B</strong> locks M2.</p></li><li><p><strong>Thread A</strong> tries to lock M2, but it's held by Thread B. 
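Thread A blocks.</p></li><li><p><strong>Thread B</strong> tries to lock M1, but it's held by Thread A. Thread B blocks.</p></li></ol><p>Neither thread can ever make progress. The standard fix is to make every thread acquire the mutexes in one global order, for example by memory address:</p><pre><code>#include &lt;pthread.h&gt;

pthread_mutex_t M1 = PTHREAD_MUTEX_INITIALIZER, M2 = PTHREAD_MUTEX_INITIALIZER;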

// Transferring a resource from an object protected by M1
// to an object protected by M2.

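void transfer_resource() {
    // Enforce a global lock order: lock the mutex with the lower memory address first.
    // Compare as uintptr_t (needs &lt;stdint.h&gt;): relational comparison of pointers
    // to unrelated objects is undefined behavior in standard C.
    if ((uintptr_t)&amp;M1 &lt; (uintptr_t)&amp;M2) {
        pthread_mutex_lock(&amp;M1);
        pthread_mutex_lock(&amp;M2);
    } else {
        pthread_mutex_lock(&amp;M2);
        pthread_mutex_lock(&amp;M1);
    }
    // ... move the resource while both locks are held ...
    pthread_mutex_unlock(&amp;M2);
    pthread_mutex_unlock(&amp;M1);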
}</code></pre><p><strong>"Any thread that needs to lock both M1 and M2 must </strong><em><strong>always</strong></em><strong> acquire them in the same global order: here, the mutex at the lower address first."</strong></p><h3>Semaphores</h3><p>A <strong>semaphore</strong> is a synchronization primitive that is a generalization of a mutex. It was invented by the legendary Dutch computer scientist Edsger W. Dijkstra in the mid-1960s.</p><p>A semaphore is essentially an integer counter that is manipulated by two atomic operations:</p><ol><li><p><strong>P()</strong> (from the Dutch <em>proberen</em>, "to test" or "to probe"). Also known as wait() or down().</p></li><li><p><strong>V()</strong> (from <em>verhogen</em>, "to increment"). Also known as post(), signal(), or up().</p></li></ol><p>While the counter is positive, it tracks the number of available "permits" or "resources". In the formulation below it may also go negative, in which case its magnitude is the number of blocked threads. Many real implementations instead keep the value non-negative and block while it is zero; the observable behavior is the same.</p><p>The logic of the two operations is as follows:</p><ul><li><p><strong>wait(semaphore):</strong></p><ol><li><p>Atomically decrement the semaphore's internal counter.</p></li><li><p>If the counter's value becomes negative, the calling thread is blocked and placed into a waiting queue.</p></li><li><p>If the counter's value is zero or positive, the thread continues execution.</p></li></ol></li><li><p><strong>post(semaphore):</strong></p><ol><li><p>Atomically increment the semaphore's internal counter.</p></li><li><p>If there are any threads blocked waiting on this semaphore (i.e., the counter was negative before the increment), wake one of them up.</p></li></ol></li></ul><p><strong>Key Difference from a Mutex:</strong><br>A mutex has the concept of "ownership". The thread that locks it must be the one to unlock it. A semaphore is a signaling mechanism; it has no concept of ownership. 
<strong>Any thread can post to a semaphore, even if it never called wait.</strong> This makes it a powerful tool for more complex synchronization scenarios beyond simple mutual exclusion.</p><p>There are two main types of semaphores:</p><ul><li><p><strong>Binary Semaphore:</strong> The counter is initialized to 1. It can only be 1 (resource available) or 0 (resource unavailable). It behaves almost identically to a mutex, except that it has no owner. If you wait when it's 1, it becomes 0 and you proceed. If you wait when it's 0, you block. post changes it from 0 to 1 and wakes a waiting thread, if there is one.</p></li><li><p><strong>Counting Semaphore:</strong> The counter can be initialized to any non-negative integer N. This allows up to N threads to pass the wait operation without blocking. It is used to guard a pool of N identical resources.</p></li></ul><blockquote><p>example use case: suppose a function has to run on a GPU, but the GPU can only serve 4 such calls at a time. you want at most 4 threads inside that function at once, and a counting semaphore initialized to 4 gives you exactly that.</p></blockquote><h3>Spinlocks</h3><p>A <strong>spinlock</strong> is a type of lock that avoids blocking in the traditional sense. Instead of yielding the CPU when the lock is unavailable, the thread enters a tight loop, repeatedly checking the lock's status until it becomes free. This action of looping is called "spinning." The thread remains active and scheduled, consuming 100% of its CPU core while it waits.</p><h4>When and Why to Use Spinlocks</h4><p>The decision to use a spinlock versus a mutex is a pure performance trade-off, hinging on the expected wait time for the lock.</p><ul><li><p><strong>Cost of a Mutex:</strong> The overhead of a context switch when the lock is contended. This is a relatively high, fixed cost. It involves saving the thread's state, updating scheduler data structures, and then later restoring the thread's state. 
This can take thousands of CPU cycles.</p></li><li><p><strong>Cost of a Spinlock:</strong> The CPU cycles wasted while spinning. This cost is variable and directly proportional to how long the lock is held by another thread.</p></li></ul><p>Use a <strong>spinlock</strong> when the critical section is <strong>very short</strong>, so that the expected spin time is less than the cost of a context switch.</p><p>Spinlocks are only effective on multi-core systems. On a single-core system, they are a disaster. If a thread starts spinning on a single-core machine, it is consuming the <em>only</em> available CPU, so the thread that holds the lock cannot run to release it until the spinner is preempted.</p><h3>Condition Variables</h3><p>A <strong>condition variable</strong> is a synchronization primitive that allows threads to block (wait) until a particular condition, related to shared data, becomes true.</p><p>Let's look at the problem they solve through an example:</p><pre><code>// BROKEN AND INEFFICIENT "SOLUTION"
pthread_mutex_lock(&amp;queue_mutex);
while (is_queue_empty(&amp;my_queue)) {
    // What do we do here? If we just loop, we are holding the lock
    // and spinning, preventing the producer from ever getting the lock
    // to add an item. This is a deadlock.
}
// consume item
pthread_mutex_unlock(&amp;queue_mutex);</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!smHt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0876f2d-7efb-40bf-8cc4-4c09e97976fc_691x644.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!smHt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0876f2d-7efb-40bf-8cc4-4c09e97976fc_691x644.png 424w, https://substackcdn.com/image/fetch/$s_!smHt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0876f2d-7efb-40bf-8cc4-4c09e97976fc_691x644.png 848w, https://substackcdn.com/image/fetch/$s_!smHt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0876f2d-7efb-40bf-8cc4-4c09e97976fc_691x644.png 1272w, https://substackcdn.com/image/fetch/$s_!smHt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0876f2d-7efb-40bf-8cc4-4c09e97976fc_691x644.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!smHt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0876f2d-7efb-40bf-8cc4-4c09e97976fc_691x644.png" width="481" height="448.2836468885673" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0876f2d-7efb-40bf-8cc4-4c09e97976fc_691x644.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:644,&quot;width&quot;:691,&quot;resizeWidth&quot;:481,&quot;bytes&quot;:69468,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/173587543?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0876f2d-7efb-40bf-8cc4-4c09e97976fc_691x644.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!smHt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0876f2d-7efb-40bf-8cc4-4c09e97976fc_691x644.png 424w, https://substackcdn.com/image/fetch/$s_!smHt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0876f2d-7efb-40bf-8cc4-4c09e97976fc_691x644.png 848w, https://substackcdn.com/image/fetch/$s_!smHt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0876f2d-7efb-40bf-8cc4-4c09e97976fc_691x644.png 1272w, https://substackcdn.com/image/fetch/$s_!smHt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0876f2d-7efb-40bf-8cc4-4c09e97976fc_691x644.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>how condition variables solve this</strong></p><ol><li><p>A thread acquires a mutex to protect the shared data.</p></li><li><p>It checks if a condition is true.</p></li><li><p>If the condition is <strong>false</strong>, the thread atomically <strong>unlocks the mutex and goes to sleep</strong> on the condition variable. 
This is the magic step.</p></li><li><p>Another thread (the producer/modifier) acquires the mutex, changes the state of the shared data, and potentially makes the condition true.</p></li><li><p>The producer thread then <strong>signals</strong> the condition variable, which wakes up one or more of the waiting threads.</p></li><li><p>A woken-up thread then <strong>re-acquires the mutex</strong> and checks the condition again before proceeding.</p></li></ol><h3>Read-Write Locks</h3><ul><li><p>A <strong>shared data structure</strong> that is <em>read often</em> but <em>written rarely</em> doesn&#8217;t need every access to be exclusive.</p></li><li><p>A <strong>standard mutex</strong> forces both reads and writes to wait for one another, so even purely <em>read-only threads are serialized</em>, creating needless slowdown.</p></li><li><p>A <strong>read-write lock (rwlock)</strong> fixes this by offering<br>&#8211; <strong>read (shared) lock:</strong> many threads can read <em>simultaneously</em> because they don&#8217;t change the data.<br>&#8211; <strong>write (exclusive) lock:</strong> a writer gains sole access, ensuring updates stay consistent.</p></li></ul><p>The rules of a read-write lock are:</p><ol><li><p><strong>Multiple Readers Allowed:</strong> Any number of threads can hold a read lock on the resource simultaneously.</p></li><li><p><strong>Exclusive Writer:</strong> Only one thread can hold a write lock.</p></li><li><p><strong>Writers Exclude Everyone:</strong> If a thread holds a write lock, no other thread (neither reader nor writer) can acquire any lock.</p></li><li><p><strong>Readers Exclude Writers:</strong> If any thread holds a read lock, any thread attempting to acquire a write lock must wait until all readers have released their locks.</p></li></ol><p><strong>what if a stream of new readers keeps arriving while a writer is waiting?</strong></p><p>This leads to two common implementation strategies, which address the "reader-writer 
problem."</p><p><strong>Reader-Preference:</strong></p><ul><li><p><strong>Policy:</strong> If the lock is held by readers, an incoming reader is granted access immediately. A waiting writer will only get the lock when <em>all</em> readers (including the newly arrived ones) have finished. A steady stream of readers can therefore starve writers.</p></li></ul><p><strong>Writer-Preference:</strong></p><ul><li><p><strong>Policy:</strong> If a writer is waiting for the lock, any newly arriving readers will be blocked. They will be queued up behind the waiting writer. The writer gets the lock as soon as the current readers finish. A steady stream of writers can, in turn, starve readers.</p></li></ul><h3>Barriers</h3><p>A <strong>barrier</strong> is a synchronization primitive that forces a group of participating threads to all wait at a specific point (the "barrier point") until every thread in the group has reached that point. Once the last thread arrives, all threads are released simultaneously and can proceed with their next phase of computation.</p><p>Barriers are ideal for algorithms that are broken down into distinct phases or stages, where the results of one phase are needed by all threads before the next phase can begin.</p><h3>Atomics</h3><p>Starting with the C11 standard, the C language gained native support for atomic operations. This is a massive improvement over relying on platform-specific inline assembly or compiler intrinsics. The features are primarily defined in the &lt;stdatomic.h&gt; header.</p><p>An <strong>atomic operation</strong> is one that is guaranteed to execute indivisibly. When a thread performs an atomic operation on a variable, no other thread can see the variable in a half-modified state. The entire operation (read, modify, and write) completes as a single, uninterruptible step.</p><h3>C11 _Atomic Types and Operations</h3><p>You can make any integer type, pointer, or struct/union (of a certain size) atomic by using the _Atomic keyword (or the more convenient atomic_int, atomic_bool, etc. 
type aliases from &lt;stdatomic.h&gt;).</p><p>Once a variable is declared as atomic, the standard operators such as ++, +=, and = still work on it, and C11 performs them atomically with sequentially consistent ordering. Use the explicit functions from &lt;stdatomic.h&gt; (atomic_load_explicit, atomic_store_explicit, atomic_fetch_add_explicit, and so on) when you want to choose a weaker memory order.</p><p></p><p>that&#8217;s it for this blog guys, next we will be diving deeper into Golang concurrency, until then bbye.</p>]]></content:encoded></item><item><title><![CDATA[Ensuring A Stable Pre-Train]]></title><description><![CDATA[there are majorly two reasons for unstable training of your llm]]></description><link>https://goyalayus.substack.com/p/ensuring-a-stable-pre-train</link><guid isPermaLink="false">https://goyalayus.substack.com/p/ensuring-a-stable-pre-train</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Fri, 29 Aug 2025 15:23:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fw-Q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9046c7a4-853e-4d19-b5ad-89484997b678_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>there are two major reasons for unstable training of your llm: over-fitting and exploding/vanishing gradients</p><p><strong>over-fitting</strong> &#8212;&gt; when the model starts memorizing the training data instead of learning patterns that generalize.</p><p><strong>Exploding Gradients</strong> &#8212;&gt; backprop computes gradients as a product of many per-layer factors. when a lot of those factors have magnitude &gt; 1, the product grows exponentially with depth, overflows to inf/NaN, and the training crashes</p><p><strong>Vanishing Gradients</strong> &#8212;&gt; the same mechanism as Exploding Gradients, just with factors of magnitude &lt; 1, so the product shrinks towards zero and the early layers stop learning</p><p>Surprisingly, the solutions are simple and familiar; you just may not have noticed what they are used for.</p><h4><strong>over-fitting</strong></h4><p><strong>Dropout Layer</strong></p><ol><li><p>For a given layer's output (or input to the next layer), you apply a Dropout layer.</p></li><li><p>During each forward pass of a training step, each neuron (or activation) in that layer's output has a probability p (e.g., 
p=0.1) of being <strong>randomly set to zero</strong>.</p></li><li><p>The surviving neurons are <strong>scaled up</strong> by a factor of 1 / (1 - p). This is called "inverted dropout" and it ensures that the expected sum of the outputs remains the same, which keeps the learning dynamics stable.</p></li></ol><p>Dropout prevents a neuron from becoming overly dependent on the presence of a few specific other neurons.</p><p><strong>Weight Decay (L2 Regularization)</strong></p><p>L_total = L_main + (&#955;/2) * &#931;(w^2)</p><ul><li><p>&#931;(w^2) is the sum of the squares of all the individual weight parameters w in the entire model. This is the <strong>squared L2 norm</strong> of the weight vector.</p></li><li><p>&#955; (lambda) is a hyperparameter called the <strong>weight decay rate</strong>.</p></li></ul><p>By adding this term to the loss, we are telling the optimizer: "Minimize the prediction error, BUT ALSO try to keep the weights as small as possible."</p><p>Large weights are often a sign of overfitting. They mean the model is making very sharp, specific decisions based on small changes in the input.</p><h4>Exploding/Vanishing Gradients</h4><p><strong>Layer Normalization</strong></p><p>Normalization layers are inserted between other layers (typically after a linear layer and before an activation function) and their sole job is to rescale their input tensor to have a <strong>standard, predictable distribution</strong>, usually with a mean of 0 and a variance of 1. This keeps the numbers flowing through the network in a "well-behaved" range.</p><p>All normalization layers also introduce two learnable parameters, gamma (scale) and beta (shift), after normalizing. This allows the network to learn if, for some reason, a different mean and variance is actually optimal for the next layer. 
y_out = gamma * y_normalized + beta</p><p><strong>RMSNorm &#8212;&gt;</strong> a simpler, cheaper variant of LayerNorm: it skips the mean subtraction and divides only by the root mean square</p><ul><li><p><strong>LayerNorm:</strong> y = (x - mean(x)) / sqrt(variance(x) + eps)</p></li><li><p><strong>RMSNorm:</strong> y = x / sqrt(mean(x^2) + eps)</p></li></ul><p></p>]]></content:encoded></item><item><title><![CDATA[You see Neural Nets Wrong]]></title><description><![CDATA[Deriving and Memorizing Forward Pass in Neural Networks]]></description><link>https://goyalayus.substack.com/p/you-see-neural-nets-wrong</link><guid isPermaLink="false">https://goyalayus.substack.com/p/you-see-neural-nets-wrong</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Fri, 29 Aug 2025 13:55:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8HBB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>After learning Neural Networks, I would always forget how exactly they worked. Now I realize why.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MNd8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a59293-eed9-4461-ab22-112c73a3e3c3_205x246.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MNd8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a59293-eed9-4461-ab22-112c73a3e3c3_205x246.png 424w, https://substackcdn.com/image/fetch/$s_!MNd8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a59293-eed9-4461-ab22-112c73a3e3c3_205x246.png 848w, 
https://substackcdn.com/image/fetch/$s_!MNd8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a59293-eed9-4461-ab22-112c73a3e3c3_205x246.png 1272w, https://substackcdn.com/image/fetch/$s_!MNd8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a59293-eed9-4461-ab22-112c73a3e3c3_205x246.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MNd8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a59293-eed9-4461-ab22-112c73a3e3c3_205x246.png" width="205" height="246" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67a59293-eed9-4461-ab22-112c73a3e3c3_205x246.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:246,&quot;width&quot;:205,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Neural network (machine learning ...&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Neural network (machine learning ..." title="Neural network (machine learning ..." 
srcset="https://substackcdn.com/image/fetch/$s_!MNd8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a59293-eed9-4461-ab22-112c73a3e3c3_205x246.png 424w, https://substackcdn.com/image/fetch/$s_!MNd8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a59293-eed9-4461-ab22-112c73a3e3c3_205x246.png 848w, https://substackcdn.com/image/fetch/$s_!MNd8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a59293-eed9-4461-ab22-112c73a3e3c3_205x246.png 1272w, https://substackcdn.com/image/fetch/$s_!MNd8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a59293-eed9-4461-ab22-112c73a3e3c3_205x246.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>it's because of this image: this representation is flawed.<br>1. it forces you to think in scalars, and becomes very confusing when you carry this mental model over to vectors</p><p>2. it does not show where the learnable parameters live, so never think in terms of this image.</p><p>here is a better way to think </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8HBB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8HBB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png 424w, https://substackcdn.com/image/fetch/$s_!8HBB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png 848w, https://substackcdn.com/image/fetch/$s_!8HBB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png 1272w, https://substackcdn.com/image/fetch/$s_!8HBB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8HBB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png" width="1456" height="1304" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1304,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:705138,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/172244088?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8HBB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png 424w, https://substackcdn.com/image/fetch/$s_!8HBB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png 848w, https://substackcdn.com/image/fetch/$s_!8HBB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png 1272w, https://substackcdn.com/image/fetch/$s_!8HBB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9632c657-5377-4510-bb82-7fc0b15e5acb_3680x3296.png 
1456w" sizes="100vw"></picture></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[So, You Want To Make A Dating App...]]></title><description><![CDATA[Our efforts for an app, which apparently never made it to the 'peepl']]></description><link>https://goyalayus.substack.com/p/so-you-want-to-make-a-dating-app</link><guid isPermaLink="false">https://goyalayus.substack.com/p/so-you-want-to-make-a-dating-app</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Wed, 20 Aug 2025 07:07:35 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!0yAA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf8128ba-9ec3-42b2-afb2-71d35fa61d63_720x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>March 2025. I am sitting in an investor meeting, trying to convince him to invest in my dating app. This is my 4th meeting. After half an hour, he says &#8220;yes&#8221;, and at that exact moment I know I am shutting down this idea.</p><p>So how did I get here? Back in August 2024, I was friends with <a href="https://x.com/Sikandar_lx">Sik</a> and <a href="https://x.com/4rpGh0st">Arnav</a>. We all wanted to work on a startup, but we could not find a good idea. One fine day we thought: &#8220;Why are these dating apps so costly? All they have to do is run some servers. They have hired all those unnecessary people, and we think that&#8217;s the reason for their cost.&#8221;</p><blockquote><p>So we started building a cheap dating app.</p></blockquote><p>We soon realized that this was not a good idea.</p><p>In dating apps, only two things matter <strong>for short-term success</strong>.</p><p><strong>For Men</strong></p><ol><li><p>Why will people download your app?</p></li><li><p>Why will they stick?</p></li></ol><p>In dating apps, men stick when they are getting matches, and men download your app when &#8220;they think they <strong>will</strong> get matches&#8221;.</p><p><strong>For Women</strong></p><ol><li><p>Quality of profiles</p></li></ol><p>But for long-term success only one thing matters:</p><p><strong>the number of matches formed on your app</strong>, which depends on the number of women on the app and the quality of male profiles. 
<br><br>Interestingly, the number of women on your app also depends on the quality of male profiles and on how good your marketing is.</p><p>In short, the cheap dating app would make people feel like they will get matches, because they can swipe more after buying premium. But the result would be women getting a lot of likes and swipes from a lot of low-quality profiles, which would drive women away from the app.</p><div><hr></div><p>We soon understood that the idea was not good enough, so we started doing more research: talking to women, approaching them through LinkedIn (which a lot of them told us was weird, but women do not reply on WhatsApp to a request like this from a stranger).</p><p>After a lot of failed and stupid insights, we finally arrived at our two main insights.</p><blockquote><p>1.</p></blockquote><p>The affluent class of India (India 1: 120 million people / 30 million households) keeps switching dating apps, because after some years India 2 arrives on the app and the experience is ruined.</p><p>So we would segregate profiles into India 1 and India 2 at onboarding itself and only allow matching within each group.</p><blockquote><p>2.</p></blockquote><p>We would show you detailed insights on your profile, so that even if you are not getting matches, you would still get some response from the app. 
</p><p>twitter and substack do this very nicely, to prevent posters from getting demotivated</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TRTh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87fbc93f-e644-4adf-8338-8ee1cd2eb2db_540x693.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TRTh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87fbc93f-e644-4adf-8338-8ee1cd2eb2db_540x693.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TRTh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87fbc93f-e644-4adf-8338-8ee1cd2eb2db_540x693.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TRTh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87fbc93f-e644-4adf-8338-8ee1cd2eb2db_540x693.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TRTh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87fbc93f-e644-4adf-8338-8ee1cd2eb2db_540x693.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TRTh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87fbc93f-e644-4adf-8338-8ee1cd2eb2db_540x693.jpeg" width="540" height="693" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87fbc93f-e644-4adf-8338-8ee1cd2eb2db_540x693.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:693,&quot;width&quot;:540,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43123,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://goyalayus.substack.com/i/171375066?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F285c9fb0-5c08-4658-bfaf-7d561a13a593_540x1200.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TRTh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87fbc93f-e644-4adf-8338-8ee1cd2eb2db_540x693.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TRTh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87fbc93f-e644-4adf-8338-8ee1cd2eb2db_540x693.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TRTh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87fbc93f-e644-4adf-8338-8ee1cd2eb2db_540x693.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TRTh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87fbc93f-e644-4adf-8338-8ee1cd2eb2db_540x693.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p>3.</p></blockquote><p>That&#8217;s a secret. I am going to use it in my other apps too, lol.</p><div><hr></div><p>The idea sounds good, right? We even had the funds. So why did we shut it down?</p><p>To be honest, I still do not have a concrete answer. It&#8217;s mostly vibes, but here are some facts:</p><ol><li><p>Right now no one is willing to invest in social apps in India, so there is a funding winter in the social space.</p></li><li><p>We live at IIT Roorkee, and nobody here uses dating apps. So to test a dating app you have to launch in Delhi or Bangalore, and to market there you need to live there, which is costly.</p></li><li><p>If we were in Delhi and had friends who used dating apps, it would have been very easy to learn what the real problems are. 
<strong>It is entirely possible that those two insights have no basis in reality and nobody needs them.</strong></p></li><li><p>Subscriptions in India don&#8217;t make money. Netflix has a revenue of a mere 37 crores in India.</p></li><li><p>We are not marketers or filmmakers. We are tech people. Without crazy marketing, the CAC is too high to operate on.</p></li><li><p>I personally do not have time. I want to get rich fast. I hate working (even though I have worked almost all day long for the past 4 years), and social apps take around 10 years to really become something, so the time horizons are very long.</p></li></ol><p>Having said all of this, this is the most exciting space to work in.</p><p>Nobody wakes up excited to build a B2B SaaS, but <strong>this</strong>? I can&#8217;t even express the adrenaline rush.</p><p>I did not want this post to be too long, so there are a lot of things I did not include.</p><blockquote><p>If you have any questions, do hit me up at https://x.com/goyalayus</p></blockquote>]]></content:encoded></item><item><title><![CDATA[How B2C Will Play Out After LLMs]]></title><description><![CDATA[Note: This post is about B2C. I have very little experience with B2B, but I will still include a section with my views at the end.]]></description><link>https://goyalayus.substack.com/p/how-b2c-will-play-out-after-llms</link><guid isPermaLink="false">https://goyalayus.substack.com/p/how-b2c-will-play-out-after-llms</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Tue, 19 Aug 2025 13:55:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fw-Q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9046c7a4-853e-4d19-b5ad-89484997b678_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Note</strong>: This post is about B2C. I have very little experience with B2B, but I will still include a section with my views at the end.</p><div><hr></div><p>There will be three 
components in the B2C world:</p><ol><li><p>The interface to interact with LLMs</p></li><li><p>The LLMs</p></li><li><p>Tools LLMs will use</p></li></ol><p>We will dive into each layer one by one.</p><p><strong>The Interface to Interact with LLMs</strong></p><p>There will be some standard interfaces for interacting with AI.<br>For example: your laptop AI assistant, Meta glasses, your mobile phone AI assistant, the robot in your house.</p><p>The thing to note is that a lot of big players will be playing here, and it will get commoditized, just as mobile phones and laptops have been. You will have the Xiaomi, Samsung and Apple of home robots.</p><p>Winning in this market is super hard. Getting acquired is possible.</p><p><strong>The LLMs</strong></p><p>These are the LLMs that the interface providers will use.</p><p>Why do I say LLMs and not LLM? Because, just as there will be many interfaces, there will be many LLMs, but the reason is quite different.</p><p>LLMs can memorize only about 3.6 bits per parameter, which is very little memory. This is why LLMs are forced to generalize rather than remember, and it is also why LLMs cannot generate interesting opinions: they are made to generalize.</p><p>So generalized LLMs cannot work great in specialized domains. There is always a chance to fine-tune an LLM and make a superior domain-specialized model.</p><p>That is why I say LLMs, plural, and it is the reason I believe that fine-tuning is here to stay.</p><p><strong>Tools LLMs will use</strong></p><p>Many people do not know that Perplexity uses a <strong>search engine</strong> underneath to do its searches. No, you cannot search the web with just an agent or an LLM; it has to use a search engine underneath. These are the tools LLMs need to use.</p><p>A big way this tool usage is being made feasible is <strong>MCP. 
</strong>Harkirat Singh has a great video on <a href="https://www.youtube.com/watch?v=1iJ34tTjwwo&amp;pp=ygUYaGFya2lyYXQgc2luZ2ggbWNwIHZpZGVv">MCP</a>.</p><p>Another example: if an AI had to book an Airbnb for you, it would have to use the Airbnb MCP. So now Airbnb becomes a tool.</p><p>I personally think that the money is to be made in the tools.</p><h4><strong>B2B</strong></h4><p>I think B2B products will become tools and expose an MCP endpoint.</p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[Paras Chopra Saved My Failed Startup]]></title><description><![CDATA[My notes from Paras Chopra's Mental Models]]></description><link>https://goyalayus.substack.com/p/paras-chopra-saved-my-failed-startup</link><guid isPermaLink="false">https://goyalayus.substack.com/p/paras-chopra-saved-my-failed-startup</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Tue, 19 Aug 2025 13:27:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fw-Q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9046c7a4-853e-4d19-b5ad-89484997b678_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>About 6 months back, I joined Twitter.</p><p>Twitter is weird. The first 2-3 times you use it, it sucks. Then one day you open the app and it becomes addictive.</p><p>When I first used Twitter, Paras would appear a lot in my feed. I thought he was an influencer who gives gyan and has no substance.</p><p>But then I heard the news that he sold Wingify for 300 million, and just out of curiosity I searched for his writing on the internet (idk why I did it tbh) and found his blog.</p><p>I was just exploring and stumbled upon the mental models for entrepreneurs <a href="https://invertedpassion.com/free-book-mental-models-for-startup-founders/">section</a>, and it was just too good, almost Paul Graham-level writing. 
I completed all the articles in one sitting.</p><p>Then, some days back, I was lost. I had just closed my second startup, didn&#8217;t know what to do, and was just writing tech articles on Substack, so I thought I would give those essays a re-read.</p><blockquote><p>But this time I made notes, and below are my notes on all his mental model essays.</p></blockquote><div><hr></div><ul><li><p>what&#8217;s not going to change in the next 5 years</p></li><li><p>don&#8217;t go by what people say but by what they do. Spirit Airlines is the most hated airline but also the most profitable. this shows that, for people, saving money &gt; comfort</p></li><li><p>desires remain the same but solutions keep changing; startups with innovative solutions replace the ones with old solutions. a toothpaste company is competing not only with other toothpaste companies but also with the mouth freshener company</p></li><li><p>Researching why customers are doing what they&#8217;re doing can provide deep insights into their desires</p></li><li><p>look for instances where customers are innovating by themselves by modifying or re-imagining existing products</p></li><li><p>Salesforce did not create the desire for CRMs, it just served it better, faster, cheaper</p></li><li><p>rather than starting with an idea and then doing research, it&#8217;s much better to start with a blank slate and start observing customer behavior and trends. Sooner or later, you&#8217;ll find yourself full of bright ideas that are derived from actual customer behavior</p></li><li><p>Having either no competitors or too many competitors is a clear sign of wrong timing. there is a window where you can see that the first mover&#8217;s product is not good, learn from what early adopters liked and disliked, and launch a much better product in the market</p></li><li><p>cultural startups (you have to spot new desires, which are derivatives of old desires): electric cars, once people started caring about the environment. 
cultural shift vs technical startups (Google): existing desires served better, faster, cheaper. there is always a spectrum; Uber spotted both</p></li><li><p>a narrow market also allows for laser-sharp distribution and marketing</p></li><li><p>aiming for a fully built out product on day 1 is a recipe for disaster</p></li><li><p>the Apple Lisa is the greatest example of this</p></li><li><p>start with a simple product that excels only on one (or a few) aspects that customers care about</p></li><li><p>an improvement in one aspect will only be appreciated by the segment that cares about it and get ignored by everyone else in the market. For example, if the customers in a particular segment are price-insensitive, your discounts won&#8217;t work on them. In your mind, a discount should clearly work but for a certain segment of customers, it may actually decrease the appeal of your product for them</p></li><li><p>An improvement over existing solutions is not an improvement unless a large enough customer segment cares about it</p></li><li><p>People have busy lives and they usually don&#8217;t think much about the products and services they use in their lives. It&#8217;s a myth that people are on a constant lookout to (marginally) improve their lives. the improvement has to be at least 2x</p></li><li><p>rather than dive headlong into development, it&#8217;s wise to spend at least a few weeks researching the competition, customer needs, pricing, distribution channels, and development challenges</p></li><li><p>But don&#8217;t be too patient. Although it doesn&#8217;t happen often, it&#8217;s possible to do too much research. Entrepreneurship requires a dash of ignorance about the difficulty of solving some problems in the market. Do too much research and you may get dissuaded to even take the first step</p></li><li><p>In my experience, for most software / consumer products, research on the order of weeks is enough. 
After all that research, if you&#8217;re still excited about the opportunity, go for it</p></li><li><p>Moreover, getting several things right in one go is always significantly more difficult than getting one thing right. Therefore, an entrepreneur should strive to have clarity on what few aspects of the total solution delivered to customers need originality and everything else should be borrowed from current best practices</p></li><li><p>smartphones have become one of the biggest industries because a self-supporting ecosystem has emerged around them. Without smartphones, many useful apps like Uber or Google Maps wouldn&#8217;t have been possible, but without such apps, smartphones would have limited use and likely wouldn&#8217;t be as big as they are today</p></li><li><p>similarly, Microsoft also took advantage of this and took care of its ecosystem</p></li><li><p>For truly maximizing value, the entrepreneur needs to push for the growth of the entire ecosystem (and not just her own company)</p></li><li><p>switching costs: historical data storage, habits and familiarity, sunk cost, customized solutions</p></li><li><p>Your competitors are just like you: smart and hard-working</p></li><li><p>Introspect and list down all your unfair advantages and then use all of them. Each unfair advantage matters and all of them add up to providing a significant deterrent to competitors keen on copying what you&#8217;ve built</p></li><li><p>When you are starting out, you may not have unfair advantages. If you know how to code, so do many other people. Lacking an unfair advantage, you can succeed only via good market timing and luck. As someone with no unfair advantage over others, be ready to fail multiple times to give luck and timing a chance to be in your favor</p></li><li><p>A word of caution. For an advantage to be unfair, it has to be both: an advantage and unfair. 
Due to our confirmation bias, it&#8217;s easy to fall into the trap of confusing non-advantages with advantages and common knowledge as something exclusively known to you. You must be totally honest about what&#8217;s an unfair advantage and what&#8217;s not. It&#8217;ll help you to take feedback on this from someone else</p></li><li><p>&#8220;If everything you do needs to work on a three-year time horizon, then you&#8217;re competing against a lot of people,&#8221;</p></li><li><p>Other competitive advantages could be doing what&#8217;s unsexy or boring (like waste management, tax, and accounting, or mainframes)</p></li><li><p>Assume most people are lazy but market to those who aren&#8217;t</p></li><li><p>All new products compete with Instagram for attention</p></li><li><p>When it comes to costs, it is important to understand that people are loss averse. They&#8217;d much rather not lose what they have than gain something new</p></li><li><p>If you make a big ask from them before they&#8217;ve gotten enough value from you, they&#8217;ll drop off. 
If you deliver value first and then ask for something later, they&#8217;ll oblige</p></li><li><p>What people pay for something is determined by its perceived alternatives</p></li><li><p>Consumers want to conform, companies want to differentiate</p></li><li><p>Consumers hate getting sold to, companies love it</p></li><li><p>Consumers want stuff for free, companies want to get paid</p></li><li><p>Your 30 second pitch shouldn&#8217;t be about you but about the user</p></li><li><p>Get press by giving journalists something surprising</p></li><li><p>Just like different customer segments have different needs, different talent segments have different drives</p></li><li><p>Exceptional people never look for jobs; jobs look for them</p></li><li><p>If you feel people will work for you because of the salary you&#8217;re offering, you&#8217;ll end up attracting only those kinds of people who want a safe, stress-free job that pays a regular salary</p></li></ul><p></p><blockquote><p>Follow me for more tech and startup blogs</p></blockquote>]]></content:encoded></item><item><title><![CDATA[How Multimodal LLMs Work]]></title><description><![CDATA[To be honest, it&#8217;s really simple, but please first read my previous posts on the Transformer architecture and tokenizers.]]></description><link>https://goyalayus.substack.com/p/how-multi-model-llms-work</link><guid isPermaLink="false">https://goyalayus.substack.com/p/how-multi-model-llms-work</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Tue, 19 Aug 2025 12:56:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fw-Q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9046c7a4-853e-4d19-b5ad-89484997b678_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>To be honest, it&#8217;s really simple, but please first read my previous posts on <a 
href="https://goyalayus.substack.com/p/tell-me-how-you-work-gpt?r=2fa55s">Transformer</a> Architecture and <a href="https://goyalayus.substack.com/p/how-llm-tokenizers-work?r=2fa55s">Tokenizers</a>.</p><p>For many computer vision tasks, images are resized to a standard square dimension before being fed into the model. A very common size, inherited from famous datasets like ImageNet, is <strong>224x224 pixels</strong>.</p><p>A Transformer model doesn't look at one pixel at a time. That would be computationally far too expensive. Instead, it breaks the image down into a grid of smaller, non-overlapping square "patches." A standard patch size used in many ViT (Vision Transformer) models is <strong>16x16 pixels</strong>.</p><p>Now, how many of these 16x16 patches fit into the 224x224 image?</p><ul><li><p><strong>Along the width:</strong> How many 16-pixel patches fit into 224 pixels?<br>224 / 16 = 14 patches</p></li><li><p><strong>Along the height:</strong> How many 16-pixel patches fit into 224 pixels?<br>224 / 16 = 14 patches</p></li></ul><p>This creates a grid that is <strong>14 patches wide by 14 patches tall</strong>.</p><p>To get the total number of patches, you multiply the number of patches along the width by the number along the height:</p><p>14 patches &#215; 14 patches = 196 patches</p><p>Now suppose there are 50k tokens in the GPT-2 vocab. We create 8k more reserved tokens for images (IDs 0 to 8k) and shift the text tokens so that they now occupy 8k to 58k.</p><p>So when you pass text + an image to GPT, it receives the text tokens (each in the 8k to 58k range) plus these extra 196 image tokens (each of which can be anything between 0 and 8k).</p><p>The LLM is then fine-tuned on a massive text + image dataset.</p><div><hr></div><p>Now we will see how it is decided which patch gets which token.</p><p>We train a separate model (generally a CNN) to do one thing: reconstruct the patch from the original patch. The loss is calculated from how well it is able to 
reconstruct the image. In the middle of this process, we make it output a token between 0 and 8k (I know this feels a little vague; it is actually done through codebooks, which you can read about online or ask GPT about).</p><blockquote><p>That&#8217;s it for this essay. Almost all multimodal LLMs work in similar ways. Subscribe for more of these &#8220;How Things Really Work&#8221; lectures.</p></blockquote><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[How LLM Tokenizers Work]]></title><description><![CDATA[Understanding Byte-Pair Encoding (BPE)]]></description><link>https://goyalayus.substack.com/p/how-llm-tokenizers-work</link><guid isPermaLink="false">https://goyalayus.substack.com/p/how-llm-tokenizers-work</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Tue, 19 Aug 2025 10:44:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fw-Q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9046c7a4-853e-4d19-b5ad-89484997b678_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This blog was written before I used Substack to write.</p><p>You can read it <a href="https://goyalayus.github.io/blog/how-llm-tokenizers-work-bpe.html">here</a> on my personal website, and I promise a great experience.</p>]]></content:encoded></item><item><title><![CDATA[Distribution for PMF]]></title><description><![CDATA[This blog was written before I used Substack to write.]]></description><link>https://goyalayus.substack.com/p/distribution-for-pmf</link><guid isPermaLink="false">https://goyalayus.substack.com/p/distribution-for-pmf</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Tue, 19 Aug 2025 10:42:32 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!fw-Q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9046c7a4-853e-4d19-b5ad-89484997b678_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This blog was written before I used Substack to write.</p><p>You can read it <a href="https://goyalayus.github.io/blog/distribution-for-pmf.html">here</a> on my personal website, and I promise a great experience.</p>]]></content:encoded></item><item><title><![CDATA[2 Startups]]></title><description><![CDATA[Why Startup Advice Sometimes Feels Contradictory]]></description><link>https://goyalayus.substack.com/p/2-startups</link><guid isPermaLink="false">https://goyalayus.substack.com/p/2-startups</guid><dc:creator><![CDATA[ayush goyal]]></dc:creator><pubDate>Tue, 19 Aug 2025 10:41:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fw-Q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9046c7a4-853e-4d19-b5ad-89484997b678_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This blog was written before I used Substack to write.</p><p>You can read it <a href="https://goyalayus.github.io/blog/two-startups.html">here</a> on my personal website, and I promise a great experience.</p>]]></content:encoded></item></channel></rss>