How Multimodal LLMs Work
To be honest, it's really simple, but please read my previous posts on Transformer Architecture and Tokenizers first.
For many computer vision tasks, images are resized to a standard square dimension before being fed into the model. A very common size, inherited from famous datasets like ImageNet, is 224x224 pixels.
A Transformer model doesn't look at one pixel at a time. That would be computationally prohibitive. Instead, it breaks the image down into a grid of smaller, non-overlapping square "patches." A standard patch size used in many ViT (Vision Transformer) models is 16x16 pixels.
Now, how many of these 16x16 patches fit into the 224x224 image?
Along the width: 224 / 16 = 14 patches
Along the height: 224 / 16 = 14 patches
This creates a grid that is 14 patches wide by 14 patches tall.
To get the total number of patches, you multiply the number of patches along the width by the number along the height:
14 patches × 14 patches = 196 patches
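To make the arithmetic concrete, here's a minimal sketch in plain NumPy (the dummy array and names are mine, purely for illustration) that slices a 224x224 image into 16x16 patches and counts them:

```python
import numpy as np

# In practice the image is first resized to 224x224 (e.g. with PIL)
image = np.zeros((224, 224, 3))  # dummy 224x224 RGB image

PATCH = 16  # patch size used by many ViT models

# Slice the image into a 14x14 grid of non-overlapping 16x16 patches
patches = [
    image[y:y + PATCH, x:x + PATCH]
    for y in range(0, image.shape[0], PATCH)
    for x in range(0, image.shape[1], PATCH)
]

print(len(patches))  # 196 = 14 * 14
```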
Now suppose there are 50k tokens in the GPT-2 vocabulary. We create 8k more reserved tokens (IDs 0-8k) for images, and the text tokens now occupy IDs 8k-58k.
So when you pass text + an image to GPT, it receives the text tokens (IDs 8k-58k) plus these extra 196 image tokens (each of which can be any ID between 0 and 8k).
The LLM is then fine-tuned on a massive text + image dataset.
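To picture the token layout described above, here's a minimal sketch (the function and constants are my own illustration, not any model's actual API) of how the two ID ranges could be combined into one sequence:

```python
IMAGE_VOCAB = 8_000   # reserved image-token IDs: 0 .. 7_999
TEXT_VOCAB = 50_000   # GPT-2-sized text vocabulary

def build_sequence(image_tokens, text_tokens):
    """Combine image tokens (IDs 0..7999) with text tokens, shifting the
    text IDs up by IMAGE_VOCAB so the two ranges don't collide."""
    assert all(0 <= t < IMAGE_VOCAB for t in image_tokens)
    shifted_text = [t + IMAGE_VOCAB for t in text_tokens]  # now 8_000 .. 57_999
    return image_tokens + shifted_text

seq = build_sequence(image_tokens=[5, 4321, 7999], text_tokens=[0, 42, 49_999])
print(seq)  # [5, 4321, 7999, 8000, 8042, 57999]
```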
Now let's see how it's decided which patch gets which token.
We train a separate model (it's generally a CNN) that is trained to do one thing: reconstruct the patch from the initial patch. The loss is calculated against how well it is able to reconstruct the image, and in the middle of this process we make it output a token between 0 and 8k. (I know it feels a little vague; in practice this is done through codebooks, you can read about them online or ask GPT about them.)
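Here's a minimal sketch of the codebook idea (in the spirit of VQ-VAE; the sizes and names are my own, and a real codebook is learned during training, not random): each patch's embedding gets snapped to the ID of its nearest codebook vector, and that ID becomes the patch's image token.

```python
import numpy as np

CODEBOOK_SIZE = 8_000   # matches the 8k reserved image tokens
EMBED_DIM = 64          # dimensionality of each codebook vector (my choice)

rng = np.random.default_rng(0)
# In a real model this codebook is learned jointly with the reconstruction loss
codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))

def quantize(patch_embedding):
    """Return the ID of the codebook vector nearest to this patch's
    embedding; that ID (0..7999) is the patch's image token."""
    dists = np.linalg.norm(codebook - patch_embedding, axis=1)
    return int(np.argmin(dists))

token = quantize(rng.normal(size=EMBED_DIM))
print(token)  # some ID in 0..7999
```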
That's it for this essay; almost all multimodal LLMs work in similar ways. Subscribe for more of these "How Things Really Work" lectures.

