nico-x committed
Commit b54146b · 0 Parent(s)

codebase without model

Files changed (16)
  1. .gitignore +19 -0
  2. README.md +54 -0
  3. app.py +8 -0
  4. app/gradio_app.py +85 -0
  5. app/preprocess.py +26 -0
  6. dataset.py +72 -0
  7. eval.py +73 -0
  8. launch.py +8 -0
  9. model/decoder.py +125 -0
  10. model/encoder.py +93 -0
  11. model/feature_extractor.py +31 -0
  12. model/model.py +46 -0
  13. requirements.txt +6 -0
  14. task.txt +48 -0
  15. train.py +65 -0
  16. utils/tokenizer.py +33 -0
.gitignore ADDED
# Byte-compiled / cache
__pycache__/
*.py[cod]
*.so
*.ipynb_checkpoints

# Virtual environments
.venv/
env/
venv/

# System files
.DS_Store

# PyTorch checkpoints
*.pt

# Gradio session files
gradio_cached_examples/

README.md ADDED
# Transformer MNIST 2×2 — Image-to-Sequence Prediction

This project implements a minimal Transformer-based model that takes a 2×2 grid of MNIST digits as input and autoregressively predicts the corresponding 4-digit sequence. It serves as a practical deep dive into the inner workings of the Transformer architecture and basic multimodality concepts, combining vision (image patches) with language modeling (digit sequences).

## 1. Project Overview

The goal is to understand how a vanilla Transformer encoder-decoder can be applied to a simple multimodal task: mapping an image input to a discrete token sequence. This project focuses on building each architectural component from scratch and wiring them together cleanly.

## 2. Task Definition

- **Input:** a 2×2 grid composed of 4 random MNIST digits, forming a 56×56 grayscale image.
- **Output:** the 4-digit sequence corresponding to the digits in the grid (top-left → bottom-right).
- **Modeling approach:** sequence-to-sequence using an autoregressive decoder with special `<start>` and `<finish>` tokens (see the sketch below).
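
For illustration, here is a minimal sketch of the sequence format, using the helpers defined in `utils/tokenizer.py` (run from the repo root):

```python
# Sketch: how the 4 grid digits (TL, TR, BL, BR) become decoder input/target sequences
from utils.tokenizer import prepare_decoder_labels

labels = [3, 1, 4, 7]  # digits read top-left, top-right, bottom-left, bottom-right
decoder_input, decoder_target = prepare_decoder_labels(labels)
print(decoder_input)   # [10, 3, 1, 4, 7]  ->  [<start>, 3, 1, 4, 7]
print(decoder_target)  # [3, 1, 4, 7, 11]  ->  [3, 1, 4, 7, <finish>]
```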

## 3. Model Architecture

The model follows a clean encoder-decoder Transformer architecture (a shape walkthrough follows the list):

- **Feature Extractor:** splits the 56×56 image into 16 non-overlapping patches of 14×14 pixels and projects each to a 64-dimensional embedding.
- **Transformer Encoder:** processes the 16 patch embeddings using standard multi-head self-attention, positional embeddings, and MLP blocks.
- **Transformer Decoder:** autoregressively predicts the digit sequence:
  - Uses masked self-attention over token embeddings
  - Attends to encoder output via cross-attention
  - Outputs a sequence of logits over a vocabulary of size 13 (digits 0–9, `<start>`, `<finish>`, and one unused slot)
- **Tokenizer:** handles token ↔ digit conversions and input preparation.
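
As a quick sanity check on how the pieces fit together, here is a minimal shape walkthrough mirroring the `__main__` test in `model/model.py` (random tensors, no real data):

```python
import torch
from model.model import ImageToDigitTransformer

model = ImageToDigitTransformer(vocab_size=13)
images = torch.randn(2, 1, 56, 56)            # batch of 2 stitched 2x2 grids
decoder_input = torch.randint(0, 13, (2, 5))  # [<start>, d1, d2, d3, d4] per sample

logits = model(images, decoder_input)         # feature extractor -> encoder -> decoder
print(logits.shape)                           # torch.Size([2, 5, 13])
```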

## 4. Training Setup

- **Dataset:** MNIST, wrapped into a custom `MNIST_2x2` PyTorch dataset that returns the stitched image and 4-digit target.
- **Batch size:** 64
- **Epochs:** 10
- **Loss:** `CrossEntropyLoss` over vocabulary tokens, applied per position (see the sketch after this list)
- **Optimizer:** Adam
- **Hardware:** Apple M4 with `mps` acceleration
- **Logging:** `tqdm` per-batch loss tracking for clear training progress
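
Because the decoder emits a distribution over the vocabulary at every position, the loss flattens the batch and sequence dimensions before applying cross-entropy. A minimal sketch of the pattern used in `train.py`:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 13
logits = torch.randn(64, 5, VOCAB_SIZE)          # (batch, seq_len, vocab) from the model
targets = torch.randint(0, VOCAB_SIZE, (64, 5))  # (batch, seq_len) target token ids

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits.view(-1, VOCAB_SIZE), targets.view(-1))  # one term per token position
```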

## 5. Evaluation

Evaluation is done on the held-out MNIST test set using greedy decoding (sketched in code below):

- Starts with the `<start>` token
- Predicts one token at a time (no teacher forcing)
- Stops after 4 tokens or if `<finish>` is predicted
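
The greedy loop is a condensed version of `eval.py`; the sketch below runs it with an untrained model and a random image just to show the mechanics:

```python
import torch
from model.model import ImageToDigitTransformer
from utils.tokenizer import START, FINISH

model = ImageToDigitTransformer(vocab_size=13).eval()
image = torch.randn(1, 1, 56, 56)                   # stand-in for a stitched 2x2 grid

decoded = [START]
for _ in range(4):                                  # at most 4 digits
    input_ids = torch.tensor(decoded).unsqueeze(0)  # tokens generated so far, shape (1, T)
    with torch.no_grad():
        logits = model(image, input_ids)            # (1, T, vocab)
    next_token = torch.argmax(logits[:, -1, :], dim=-1).item()
    decoded.append(next_token)
    if next_token == FINISH:
        break
print(decoded[1:])                                  # predicted token ids (without <start>)
```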

### Evaluation Metrics

- **Sequence accuracy:** % of samples where all 4 digits are predicted correctly
- **Per-digit accuracy:** % of individual digits predicted correctly across all positions

### Final results after 10 epochs of training

- **Training loss at epoch 10:** 0.0101
- **Sequence accuracy:** 93.77% on the held-out test set
- **Per-digit accuracy:** 98.43% on the held-out test set

app.py ADDED
# entrypoint for HuggingFace Space

import sys
sys.path.append('.')

from app.gradio_app import demo

demo.launch()

app/gradio_app.py ADDED
import sys
sys.path.append('.')

import gradio as gr
import torch
import numpy as np
from PIL import Image, ImageDraw

from model.model import ImageToDigitTransformer
from utils.tokenizer import START, FINISH, decode
from app.preprocess import preprocess_canvases

# Load model
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = ImageToDigitTransformer(vocab_size=13).to(device)
model.load_state_dict(torch.load("checkpoints/transformer_mnist.pt", map_location=device))
model.eval()

def split_into_quadrants(image):
    """Split a PIL Image or numpy array into 4 quadrants (TL, TR, BL, BR)."""
    if isinstance(image, np.ndarray):
        image = Image.fromarray(image)
    w, h = image.size
    return [
        np.array(image.crop((0, 0, w // 2, h // 2))),
        np.array(image.crop((w // 2, 0, w, h // 2))),
        np.array(image.crop((0, h // 2, w // 2, h))),
        np.array(image.crop((w // 2, h // 2, w, h))),
    ]

def predict_digit_sequence(editor_data):
    """Predicts 4-digit sequence from 2×2 canvas image."""
    if editor_data is None or "composite" not in editor_data:
        return "No image provided."
    img = editor_data["composite"]
    quadrants = split_into_quadrants(img)
    image_tensor = preprocess_canvases(quadrants).to(device)

    decoded = [START]
    for _ in range(4):
        input_ids = torch.tensor(decoded, dtype=torch.long).unsqueeze(0).to(device)
        with torch.no_grad():
            logits = model(image_tensor, input_ids)
        next_token = torch.argmax(logits[:, -1, :], dim=-1).item()
        decoded.append(next_token)
        if next_token == FINISH:
            break

    pred = decoded[1:]
    return "".join(decode(pred[:4]))

def create_black_canvas(size=(800, 800)):
    """Create a black canvas with a 2×2 light gray grid overlay."""
    img = Image.new("L", size, color=0)
    draw = ImageDraw.Draw(img)
    w, h = size
    draw.line([(w // 2, 0), (w // 2, h)], fill=128, width=2)
    draw.line([(0, h // 2), (w, h // 2)], fill=128, width=2)
    return img

# === UI ===
canvas_size = 800

with gr.Blocks() as demo:
    gr.Markdown("## Draw 4 digits in a 2×2 grid using a white brush")

    canvas = gr.ImageEditor(
        label="White brush only on black canvas (no uploads)",
        value=create_black_canvas(),
        image_mode="L",
        height=canvas_size,
        width=canvas_size,
        sources=[],  # disables uploads
        type="pil",
        brush=gr.Brush(colors=["#FFFFFF"], default_color="#FFFFFF", default_size=15, color_mode="fixed")
    )

    predict_btn = gr.Button("Predict")
    clear_btn = gr.Button("Erase")
    output = gr.Textbox(label="Predicted 4-digit sequence", interactive=True)

    predict_btn.click(fn=predict_digit_sequence, inputs=[canvas], outputs=[output])
    clear_btn.click(fn=lambda: create_black_canvas(), inputs=[], outputs=[canvas])

demo.launch()

app/preprocess.py ADDED
import numpy as np
import torch
from PIL import Image

def preprocess_canvases(images):
    """
    Takes a list of 4 RGBA images (top-left, top-right, bottom-left, bottom-right),
    resizes to 28x28, converts to grayscale, stitches to a (1, 1, 56, 56) tensor.
    """
    assert len(images) == 4, "Expected 4 images"

    digits = []
    for img in images:
        img = Image.fromarray(img).convert("L")  # convert to grayscale
        img = img.resize((28, 28))
        img = np.array(img).astype(np.float32) / 255.0  # scale to [0, 1]
        digits.append(img)

    top = np.hstack([digits[0], digits[1]])
    bottom = np.hstack([digits[2], digits[3]])
    grid = np.vstack([top, bottom])  # shape (56, 56)

    # Normalize like MNIST
    grid = (grid - 0.1307) / 0.3081
    grid = torch.tensor(grid).unsqueeze(0).unsqueeze(0)  # shape (1, 1, 56, 56)
    return grid

dataset.py ADDED
import torch
from torch.utils.data import Dataset
from torchvision import datasets, transforms

from utils.tokenizer import prepare_decoder_labels, encode, decode

class MNIST_2x2(Dataset):
    def __init__(self, base_dataset, transform=None, seed=42):
        self.base_dataset = base_dataset
        self.transform = transform
        self.length = len(base_dataset)

        torch.manual_seed(seed)
        self.index_map = [
            torch.randint(0, self.length, (4,))
            for _ in range(self.length)
        ]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        indices = self.index_map[idx]

        images = [self.base_dataset[i][0] for i in indices]
        top_row = torch.cat([images[0], images[1]], dim=2)
        bottom_row = torch.cat([images[2], images[3]], dim=2)
        grid_image = torch.cat([top_row, bottom_row], dim=1)

        labels = [self.base_dataset[i][1] for i in indices]
        decoder_input_ids, decoder_target_ids = prepare_decoder_labels(labels)
        decoder_input = torch.tensor(decoder_input_ids, dtype=torch.long)
        decoder_target = torch.tensor(decoder_target_ids, dtype=torch.long)

        return grid_image, decoder_input, decoder_target

# test the dataset and visualize a few samples
if __name__ == "__main__":

    import matplotlib.pyplot as plt

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])

    mnist_train = datasets.MNIST('./data', train=True, download=True, transform=transform)
    mnist_test = datasets.MNIST('./data', train=False, download=True, transform=transform)

    train_dataset = MNIST_2x2(mnist_train, seed=42)
    test_dataset = MNIST_2x2(mnist_test, seed=42)


    def show_grid_image(grid_tensor, decoder_target):
        # Undo normalization for visualization
        img = grid_tensor.clone()
        img = img * 0.3081 + 0.1307
        img = img.squeeze().numpy()

        # Decode token IDs into digit strings
        digits = decode(decoder_target.tolist()[:-1])  # Remove <finish> for display
        label_str = " ".join(digits)

        plt.imshow(img, cmap="gray")
        plt.title(f"Digits: {label_str}")
        plt.axis("off")
        plt.show()

    # Visualize a few samples
    for i in range(3):
        grid_image, decoder_input, decoder_target = train_dataset[i]
        show_grid_image(grid_image, decoder_target)

eval.py ADDED
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm import tqdm

from dataset import MNIST_2x2
from model.model import ImageToDigitTransformer
from utils.tokenizer import START, FINISH, decode

# device
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

# config
VOCAB_SIZE = 13
MAX_LEN = 5  # length of decoder input: [<start>, d1, d2, d3, d4]
SEQ_LEN = 4  # number of predicted digits

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

mnist_test = datasets.MNIST('./data', train=False, download=True, transform=transform)
test_dataset = MNIST_2x2(mnist_test, seed=42)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)

model = ImageToDigitTransformer(vocab_size=VOCAB_SIZE).to(device)
model.load_state_dict(torch.load("checkpoints/transformer_mnist.pt", map_location=device))
model.eval()

# Evaluation Loop
correct_sequences = 0
digit_correct = 0
digit_total = 0

with torch.no_grad():
    loop = tqdm(test_loader, desc="Evaluating", leave=False)

    for image, _, target_ids in loop:
        image = image.to(device)
        target_ids = target_ids.squeeze(0).tolist()[:-1]  # remove <finish>

        decoded = [START]
        for _ in range(SEQ_LEN):
            input_ids = torch.tensor(decoded, dtype=torch.long).unsqueeze(0).to(device)
            logits = model(image, input_ids)
            next_token = torch.argmax(logits[:, -1, :], dim=-1).item()
            decoded.append(next_token)
            if next_token == FINISH:
                break

        pred = decoded[1:][:SEQ_LEN]
        target = target_ids

        if pred == target:
            correct_sequences += 1
        digit_correct += sum(p == t for p, t in zip(pred, target))
        digit_total += len(target)

        seq_acc = 100.0 * correct_sequences / (digit_total // SEQ_LEN)
        digit_acc = 100.0 * digit_correct / digit_total
        loop.set_postfix(seq_acc=f"{seq_acc:.2f}%", digit_acc=f"{digit_acc:.2f}%")


# final results
total_samples = len(test_loader)
seq_acc = 100.0 * correct_sequences / total_samples
digit_acc = 100.0 * digit_correct / digit_total

print(f"\nFinal Evaluation Results:")
print(f"  Sequence accuracy: {seq_acc:.2f}%")
print(f"  Per-digit accuracy: {digit_acc:.2f}%")

launch.py ADDED
# entrypoint for HuggingFace Space

import sys
sys.path.append('.')

from app.gradio_app import demo

demo.launch()

model/decoder.py ADDED
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class DecoderLayer(nn.Module):
    def __init__(self, d_model=64, n_heads=4, ff_dim=128):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        assert d_model % n_heads == 0, "d_model must be divisible by number of heads"

        # Self-attention: Q, K, V from decoder input
        self.self_attn_proj = nn.Linear(d_model, 3 * d_model)

        # Cross-attention: Q from decoder input, K/V from encoder output
        self.cross_attn_q = nn.Linear(d_model, d_model)
        self.cross_attn_kv = nn.Linear(d_model, 2 * d_model)

        # Output projections
        self.self_out = nn.Linear(d_model, d_model)
        self.cross_out = nn.Linear(d_model, d_model)

        # Feedforward MLP
        self.mlp = nn.Sequential(
            nn.Linear(d_model, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, d_model)
        )

        # LayerNorms
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out):
        """
        x: (B, T, D) - decoder input embeddings
        enc_out: (B, N, D) - encoder outputs (image patch representations)
        Returns: (B, T, D)
        """
        B, T, D = x.shape
        _, N, _ = enc_out.shape

        # Masked Self-Attention
        x_norm = self.norm1(x)
        qkv = self.self_attn_proj(x_norm).reshape(B, T, 3, self.n_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, B, n_heads, T, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]

        attn_scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)  # (B, n_heads, T, T)

        # Causal mask: prevent attention to future positions
        mask = torch.tril(torch.ones(T, T, device=x.device)).unsqueeze(0).unsqueeze(0)  # (1, 1, T, T)
        attn_scores = attn_scores.masked_fill(mask == 0, float("-inf"))

        attn_weights = F.softmax(attn_scores, dim=-1)
        attn_out = attn_weights @ v  # (B, n_heads, T, head_dim)
        attn_out = attn_out.transpose(1, 2).reshape(B, T, D)
        attn_out = self.self_out(attn_out)
        x = x + attn_out  # Residual

        # Cross-Attention
        x_norm = self.norm2(x)
        q = self.cross_attn_q(x_norm).reshape(B, T, self.n_heads, self.head_dim).transpose(1, 2)  # (B, n_heads, T, head_dim)
        kv = self.cross_attn_kv(enc_out).reshape(B, N, 2, self.n_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]  # (B, n_heads, N, head_dim)

        cross_scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)  # (B, n_heads, T, N)
        cross_weights = F.softmax(cross_scores, dim=-1)
        cross_out = cross_weights @ v  # (B, n_heads, T, head_dim)
        cross_out = cross_out.transpose(1, 2).reshape(B, T, D)
        cross_out = self.cross_out(cross_out)
        x = x + cross_out  # Residual

        # Feedforward
        x_norm = self.norm3(x)
        x = x + self.mlp(x_norm)  # Residual

        return x


# implement the entire decoder

class TransformerDecoder(nn.Module):
    def __init__(self, vocab_size=13, max_len=5, d_model=64, n_heads=4, ff_dim=128, depth=2):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = nn.Parameter(torch.randn(1, max_len, d_model))  # (1, 5, 64)

        self.layers = nn.ModuleList([
            DecoderLayer(d_model=d_model, n_heads=n_heads, ff_dim=ff_dim)
            for _ in range(depth)
        ])

        self.output_proj = nn.Linear(d_model, vocab_size)  # Final projection to logits

    def forward(self, decoder_input_ids, encoder_output):
        """
        decoder_input_ids: (B, T) token IDs
        encoder_output: (B, N, d_model) from image encoder
        returns: logits over vocab, shape (B, T, vocab_size)
        """
        x = self.token_embedding(decoder_input_ids)   # (B, T, d_model)
        x = x + self.pos_embedding[:, :x.size(1), :]  # Add positional embedding

        for layer in self.layers:
            x = layer(x, encoder_output)  # (B, T, d_model)

        logits = self.output_proj(x)  # (B, T, vocab_size)
        return logits


# quick test

if __name__ == "__main__":
    decoder = TransformerDecoder()
    decoder_input = torch.randint(0, 13, (4, 5))  # (B=4, T=5)
    encoder_out = torch.randn(4, 16, 64)          # (B=4, N=16, D=64)

    logits = decoder(decoder_input, encoder_out)
    print("Logits shape:", logits.shape)  # (4, 5, 13)

model/encoder.py ADDED
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class EncoderLayer(nn.Module):
    def __init__(self, d_model=64, n_heads=4, ff_dim=128):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        # attention projections
        self.qkv_proj = nn.Linear(d_model, d_model * 3)  # efficient way of projecting to q, k, v
        self.out_proj = nn.Linear(d_model, d_model)

        # FF MLP
        self.mlp = nn.Sequential(
            nn.Linear(d_model, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, d_model)
        )

        # layernorms
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        B, N, D = x.shape

        # multi-head attention
        x_norm = self.norm1(x)
        qkv = self.qkv_proj(x_norm)
        qkv = qkv.reshape(B, N, 3, self.n_heads, self.head_dim).permute(2, 0, 3, 1, 4)  # qkv: (3, B, n_heads, N, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]

        attn_scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)  # (B, n_heads, N, N)
        attn_weights = F.softmax(attn_scores, dim=-1)
        attn_output = attn_weights @ v  # (B, n_heads, N, head_dim)

        attn_output = attn_output.transpose(1, 2).reshape(B, N, D)  # (B, N, D)
        attn_output = self.out_proj(attn_output)
        x = x + attn_output  # Residual connection

        # === Feedforward ===
        x_norm = self.norm2(x)
        x = x + self.mlp(x_norm)  # Residual

        return x


class TransformerEncoder(nn.Module):
    def __init__(self, depth=4, d_model=64, n_heads=4, ff_dim=128, num_patches=16):
        super().__init__()

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches, d_model))  # (1, 16, 64)
        self.layers = nn.ModuleList([
            EncoderLayer(d_model=d_model, n_heads=n_heads, ff_dim=ff_dim)
            for _ in range(depth)
        ])

    def forward(self, x):
        """
        x: Tensor of shape (B, num_patches, d_model)
        returns: Tensor of same shape (B, num_patches, d_model)
        """
        x = x + self.pos_embedding

        for layer in self.layers:
            x = layer(x)

        return x


# simple testing of dimensions
if __name__ == "__main__":
    B = 4   # batch size
    N = 16  # number of patches
    D = 64  # embedding dim

    dummy_input = torch.randn(B, N, D)

    print("Testing EncoderLayer...")
    layer = EncoderLayer(d_model=D, n_heads=4, ff_dim=128)
    out = layer(dummy_input)
    print("EncoderLayer output shape:", out.shape)  # (B, N, D) torch.Size([4, 16, 64])

    print("Testing TransformerEncoder...")
    encoder = TransformerEncoder(depth=3, d_model=D, n_heads=4, ff_dim=128, num_patches=N)
    out = encoder(dummy_input)
    print("TransformerEncoder output shape:", out.shape)  # (B, N, D) torch.Size([4, 16, 64])

model/feature_extractor.py ADDED
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, patch_size=14, emb_dim=64):
        super().__init__()
        self.patch_size = patch_size
        self.emb_dim = emb_dim
        self.proj = nn.Linear(patch_size * patch_size, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x: Tensor of shape (B, 1, 56, 56)
        returns patch_embeddings of shape (B, 16, emb_dim)
        """
        B, C, H, W = x.shape
        patches = x.unfold(2, self.patch_size, self.patch_size).unfold(3, self.patch_size, self.patch_size)
        patches = patches.contiguous().view(B, -1, self.patch_size * self.patch_size)
        patch_embeddings = self.proj(patches)

        return patch_embeddings


if __name__ == "__main__":
    feature_extractor = FeatureExtractor()
    dummy_input = torch.randn(8, 1, 56, 56)
    out = feature_extractor(dummy_input)

    print(out.shape)  # should expect (8, 16, 64)

model/model.py ADDED
import torch
import torch.nn as nn
from .feature_extractor import FeatureExtractor
from .encoder import TransformerEncoder
from .decoder import TransformerDecoder

class ImageToDigitTransformer(nn.Module):
    def __init__(self, vocab_size=13, d_model=64, n_heads=4, ff_dim=128,
                 encoder_depth=4, decoder_depth=2, num_patches=16, max_seq_len=5):
        super().__init__()

        self.feature_extractor = FeatureExtractor(patch_size=14, emb_dim=d_model)
        self.encoder = TransformerEncoder(
            depth=encoder_depth,
            d_model=d_model,
            n_heads=n_heads,
            ff_dim=ff_dim,
            num_patches=num_patches
        )
        self.decoder = TransformerDecoder(
            vocab_size=vocab_size,
            max_len=max_seq_len,
            d_model=d_model,
            n_heads=n_heads,
            ff_dim=ff_dim,
            depth=decoder_depth
        )

    def forward(self, image_tensor, decoder_input_ids):
        """
        image_tensor: (B, 1, 56, 56)
        decoder_input_ids: (B, 5)
        Returns:
            logits: (B, 5, vocab_size)
        """
        patch_embeddings = self.feature_extractor(image_tensor)   # (B, 16, 64)
        encoder_output = self.encoder(patch_embeddings)           # (B, 16, 64)
        logits = self.decoder(decoder_input_ids, encoder_output)  # (B, 5, 13)
        return logits

if __name__ == '__main__':
    model = ImageToDigitTransformer()
    img = torch.randn(4, 1, 56, 56)
    tokens = torch.randint(0, 13, (4, 5))
    logits = model(img, tokens)
    print(logits.shape)  # Expected: (4, 5, 13)

requirements.txt ADDED
torch
torchvision
gradio
numpy
Pillow
tqdm

task.txt ADDED
# MNIST multimodal transformer - Overview

## Task 1
Our goal is to build and train a multimodal transformer from scratch.
The task is to predict the sequence of digits from an image composed of 2x2 MNIST images tiled together.
The transformer should be able to predict the labels from top left to top right, bottom left, bottom right.

## Outcome

- a clean, minimal, and well organized project folder structure
- clean and minimal PyTorch code, well organized across dataset, dataloader and model classes
- clear evaluation metrics

# Execution

## Dataset

Create a dataset class that returns a single example of (a sketch of one returned example follows the list):
- an image made of 2x2 MNIST images picked at random (from the training split) and stitched together
- the 4 labels organized in top-left, top-right, bottom-left, bottom-right order
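
A minimal sketch of a single returned example, using the MNIST_2x2 class from dataset.py and a normalized torchvision MNIST split:

```python
from torchvision import datasets, transforms
from dataset import MNIST_2x2

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])
mnist_train = datasets.MNIST('./data', train=True, download=True, transform=transform)

grid_image, decoder_input, decoder_target = MNIST_2x2(mnist_train, seed=42)[0]
print(grid_image.shape)      # torch.Size([1, 56, 56]) -- stitched 2x2 grid
print(decoder_input.shape)   # torch.Size([5])  -> [<start>, d1, d2, d3, d4]
print(decoder_target.shape)  # torch.Size([5])  -> [d1, d2, d3, d4, <finish>]
```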

## Model

Create an encoder-decoder transformer architecture for this task. The architecture is made of three main elements:
- Feature extractor
- Encoder
- Decoder

### Feature extractor

Each image is cut into 16 patches of dim 14x14px (given my stitched 2x2 image is now 56x56 pixels)
and linearly projected to a dimension of 64, which is the constant latent vector size D for the encoder.
These represent the image embeddings that are fed as input to the encoder block.
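
A minimal sketch of this patching step, mirroring model/feature_extractor.py (names here are illustrative):

```python
import torch
import torch.nn as nn

patch_size, emb_dim = 14, 64
proj = nn.Linear(patch_size * patch_size, emb_dim)  # linear projection to D = 64

x = torch.randn(1, 1, 56, 56)                       # stitched 2x2 MNIST grid
patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, -1, patch_size * patch_size)  # (1, 16, 196)
patch_embeddings = proj(patches)                    # (1, 16, 64), the encoder input
print(patch_embeddings.shape)
```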

### Encoder

It should closely follow the vanilla "Attention Is All You Need" implementation, similarly to the ViT (Vision Transformer) paper:

- positional embeddings are added to the patch embeddings to retain positional information, using standard learnable 1D position embeddings
- the encoder then consists of alternating layers of multi-headed self-attention and MLP blocks
- layernorm is applied before every attention block and MLP block
- residual connections are applied after every block

The output of the encoder is a set of encoded representations of the image patches (16x64), as checked below.
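
A quick check of that contract, using the TransformerEncoder from model/encoder.py with random patch embeddings (a sketch, not part of training):

```python
import torch
from model.encoder import TransformerEncoder

encoder = TransformerEncoder(depth=4, d_model=64, n_heads=4, ff_dim=128, num_patches=16)
patch_embeddings = torch.randn(1, 16, 64)  # 16 patch embeddings of size 64
encoded = encoder(patch_embeddings)
print(encoded.shape)                       # torch.Size([1, 16, 64]) -- one encoded vector per patch
```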

### Decoder

train.py ADDED
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm import tqdm
import os

from dataset import MNIST_2x2
from model.model import ImageToDigitTransformer

# Use MPS if available (Apple Silicon)
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

# Config
BATCH_SIZE = 64
EPOCHS = 10
LR = 1e-3
VOCAB_SIZE = 13

# Transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Dataset & DataLoader
train_base = datasets.MNIST('./data', train=True, download=True, transform=transform)
train_dataset = MNIST_2x2(train_base, seed=42)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

# Model, Loss, Optimizer
model = ImageToDigitTransformer(vocab_size=VOCAB_SIZE).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

# Training Loop
model.train()
for epoch in range(EPOCHS):
    total_loss = 0.0
    loop = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS}", leave=False)

    for images, dec_input, dec_target in loop:
        images = images.to(device)
        dec_input = dec_input.to(device)
        dec_target = dec_target.to(device)

        logits = model(images, dec_input)
        loss = loss_fn(logits.view(-1, VOCAB_SIZE), dec_target.view(-1))

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()

        # Update tqdm every batch
        loop.set_postfix(batch_loss=loss.item())

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch + 1}/{EPOCHS} - Loss: {avg_loss:.4f}")

# save weights
os.makedirs("checkpoints", exist_ok=True)
torch.save(model.state_dict(), "checkpoints/transformer_mnist.pt")
print("Model saved to checkpoints/transformer_mnist.pt")

utils/tokenizer.py ADDED
token_to_id = {
    str(i): i for i in range(10)
}
token_to_id["<start>"] = 10
token_to_id["<finish>"] = 11

id_to_token = {v: k for k, v in token_to_id.items()}

START = token_to_id["<start>"]
FINISH = token_to_id["<finish>"]
# I don't need padding or a pad token because the input is a fixed-length sequence of 5

def encode(label_list):
    return [token_to_id[str(d)] for d in label_list]

def decode(token_ids):
    return [id_to_token[t] for t in token_ids]

def prepare_decoder_labels(labels):
    """
    Prepare decoder input and target sequences for training.
    Input labels: [7, 7, 6, 9]
    Output:
        decoder_input = [<start>, 7, 7, 6, 9]
        decoder_target = [7, 7, 6, 9, <finish>]
    """
    token_ids = encode(labels)
    decoder_input = [START] + token_ids
    decoder_target = token_ids + [FINISH]
    return decoder_input, decoder_target