import streamlit as st


def render_report():
    st.title("Group 5: Term Project Report")

    # Title Page Information
    st.markdown("""
**Course:** CSE 555 — Introduction to Pattern Recognition

**Authors:** Saksham Lakhera and Ahmed Zaher

**Date:** July 2025
""")

    # Abstract
    st.header("Abstract")
    st.subheader("NLP Engineering Perspective")
    st.markdown("""
This project addresses the challenge of improving recipe recommendation systems through
advanced semantic search capabilities using transformer-based language models. Traditional
keyword-based search methods often fail to capture the nuanced relationships between
ingredients, cooking techniques, and user preferences in culinary contexts.

Our approach leverages BERT (Bidirectional Encoder Representations from Transformers)
fine-tuning on a custom recipe dataset to develop a semantic understanding of culinary
content. We preprocessed and structured a subset of 15,000 recipes into standardized
sequences organized by food categories (proteins, vegetables, legumes, etc.) to create
training data optimized for the BERT architecture. The model was fine-tuned to learn
contextual embeddings that capture semantic relationships between ingredients and tags.
At inference time we generate embeddings for all recipes in our dataset and perform
cosine-similarity retrieval to produce the top-K most relevant recipes for a user query.
""")

    # Introduction
    st.header("Introduction")
    st.markdown("""
This term project serves primarily as an educational exercise aimed at giving students
end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic
recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can
substantially improve retrieval quality over simple keyword matching.

**Key Contributions:**
- A cleaned, category-labelled recipe subset of 15,000 recipes
- Training scripts that yield domain-adapted contextual embeddings
- A production-ready retrieval service that returns the top-K most relevant recipes
- Comparative evaluation against classical baselines
""")

    # Dataset and Preprocessing
    st.header("Dataset and Pre-processing")
    st.subheader("Data Sources")
    st.markdown("""
The project draws from two CSV files:
- **Raw_recipes.csv** – 231,637 rows, one per recipe with columns:
  *id, name, ingredients, tags, minutes, steps, description, n_steps, n_ingredients*
- **Raw_interactions.csv** – user feedback containing *recipe_id, user_id, rating (1-5), review text*
""")

    st.subheader("Corpus Filtering and Subset Selection")
    st.markdown("""
1. **Invalid rows removed** – recipes with empty ingredient lists, missing tags, or fewer than three total tags
2. **Random sampling** – 15,000 recipes selected for NLP fine-tuning
3. **Positive/negative pairs** – generated for contrastive learning using ratings and tag similarity
4. **Train/test split** – 80/20 stratified split (12,000/3,000 pairs)
""")
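
    st.markdown("""
The snippet below sketches the filtering, pair-construction, and split steps in code.
It is illustrative only: the tag-overlap threshold, the one-random-partner-per-recipe
pairing, and the omission of the ratings signal are simplifying assumptions, not the
exact logic of our training scripts.
""")
    st.code(
        '''\
import ast
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

recipes = pd.read_csv("Raw_recipes.csv")

# Parse the stringified list columns, then drop invalid rows:
# empty ingredient lists, missing tags, or fewer than three tags.
recipes["ingredients"] = recipes["ingredients"].apply(ast.literal_eval)
recipes["tags"] = recipes["tags"].fillna("[]").apply(ast.literal_eval)
recipes = recipes[recipes["ingredients"].apply(len) > 0]
recipes = recipes[recipes["tags"].apply(len) >= 3]

# Random subset of 15,000 recipes used for fine-tuning.
subset = recipes.sample(n=15_000, random_state=42).reset_index(drop=True)

def tag_overlap_label(tags_a, tags_b, min_shared_tags=3):
    """Label a pair positive (1) or negative (0) by tag overlap (threshold assumed)."""
    return int(len(set(tags_a) & set(tags_b)) >= min_shared_tags)

# Pair every recipe with one random partner (ratings signal omitted for brevity).
rng = np.random.default_rng(42)
partners = rng.permutation(len(subset))
pairs = list(zip(subset["id"], subset["id"].iloc[partners]))
labels = [
    tag_overlap_label(subset["tags"].iloc[i], subset["tags"].iloc[j])
    for i, j in enumerate(partners)
]

# 80/20 stratified split: 12,000 training pairs, 3,000 test pairs.
train_pairs, test_pairs, train_y, test_y = train_test_split(
    pairs, labels, test_size=0.2, stratify=labels, random_state=42
)
''',
        language="python",
    )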

    st.subheader("Text Pre-processing Pipeline")
    st.markdown("""
- **Lower-casing & punctuation removal** – normalized to lowercase, special characters stripped
- **Stop-descriptor removal** – culinary modifiers (*fresh, chopped, minced*) and measurements removed
- **Ingredient ordering** – re-ordered into the sequence **protein → vegetables → grains → dairy → other**
- **Tag normalization** – mapped to six canonical slots: *cuisine, course, main-ingredient, dietary, difficulty, occasion*
- **Tokenization** – standard *bert-base-uncased* WordPiece tokenizer, sequences truncated/padded to 128 tokens
""")

    # Methodology
    st.header("Methodology")
    st.subheader("Model Architecture")
    st.markdown("""
- **Base Model:** `bert-base-uncased` checkpoint
- **Additional Layers:** single linear classification layer (768 → 1) with dropout (p = 0.1)
- **Training Objective:** triplet-margin loss with a margin of 1.0
""")

    st.subheader("Hyperparameters")
    col1, col2 = st.columns(2)
    with col1:
        st.markdown("""
- **Batch size:** 8
- **Max sequence length:** 128 tokens
- **Learning rate:** 2 × 10⁻⁵
- **Weight decay:** 0.01
""")
    with col2:
        st.markdown("""
- **Optimizer:** AdamW
- **Epochs:** 3
- **Hardware:** Google Colab A100 GPU (40 GB VRAM)
- **Training time:** ~75 minutes per run
""")

    # Mathematical Formulations
    st.header("Mathematical Formulations")
    st.subheader("Query Embedding and Similarity Calculation")
    st.latex(r"""
\text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) =
\frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\|\,\|\hat{r}_i\|}
""")
    st.markdown("Where $\\hat{q}$ is the BERT embedding of the query and $\\hat{r}_i$ is the embedding of the $i$-th recipe.")

    st.subheader("Final Score Calculation")
    st.latex(r"""
\text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
""")
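
    st.markdown("""
The snippet below sketches how the two formulas above are applied at inference time.
The mean-pooling strategy, the checkpoint path `fine-tuned-recipe-bert`, and the
placeholder recipe/popularity data are assumptions for illustration; the report fixes
only the cosine similarity and the 0.6 / 0.4 weighting.
""")
    st.code(
        '''\
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Hypothetical local path to the fine-tuned bert-base-uncased checkpoint.
MODEL_DIR = "fine-tuned-recipe-bert"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained(MODEL_DIR)
model.eval()

def embed(texts, max_length=128):
    """Return unit-normalized, mean-pooled BERT embeddings for a list of strings."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens
    return F.normalize(pooled, dim=-1).numpy()

# Assumed inputs: pre-processed recipe sequences and a popularity signal in [0, 1].
recipe_texts = [
    "chicken penne tomato parmesan | italian main-dish pasta easy",
    "beef steak garlic butter | american main-dish dinner",
]
popularity = np.array([0.74, 0.88])

recipe_vecs = embed(recipe_texts)            # in practice precomputed for every recipe
query_vec = embed(["beef steak dinner"])[0]

similarity = recipe_vecs @ query_vec         # cosine similarity (unit vectors)
score = 0.6 * similarity + 0.4 * popularity  # blended final score from the report
top_k = np.argsort(score)[::-1][:10]         # indices of the top-K recipes
''',
        language="python",
    )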

    # Results
    st.header("Results")
    st.subheader("Training and Validation Loss")
    results_data = {
        "Run": [1, 2, 3, 4],
        "Configuration": [
            "Raw, no cleaning/ordering",
            "Cleaned text, unordered",
            "Cleaned text + dropout",
            "Cleaned text + dropout + ordering",
        ],
        "Epoch-3 Train Loss": [0.0065, 0.0023, 0.0061, 0.0119],
        "Validation Loss": [0.1100, 0.0000, 0.0118, 0.0067],
    }
    st.table(results_data)
    st.markdown("""
**Key Finding:** Run 4 (cleaned text + dropout + ordering) achieved the best balance
between low validation loss and meaningful retrieval quality.
""")

    st.subheader("Qualitative Retrieval Examples")
    st.markdown("""
**Query: "beef steak dinner"**
- Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
- Run 4 (Final): *grilled garlic steak dinner*, *classic beef steak au poivre*

**Query: "chicken italian pasta"**
- Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
- Run 4 (Final): *creamy tuscan chicken pasta*, *italian chicken penne bake*

**Query: "vegetarian salad healthy"**
- Run 1 (Raw): (irrelevant hits)
- Run 4 (Final): *kale quinoa power salad*, *superfood spinach & berry salad*
""")

    # Discussion and Conclusion
    st.header("Discussion and Conclusion")
    st.markdown("""
The experimental evidence underscores the importance of disciplined pre-processing when
adapting large language models to niche domains. The breakthrough came with
**ingredient ordering** (protein → vegetables → grains → dairy → other), which supplied
consistent positional signals.

**Key Achievements:**
- End-to-end recipe recommendation system with semantic search
- Sub-second latency across 231k recipes
- Meaningful semantic understanding of culinary content
- Reproducible blueprint for domain-specific NLP applications

**Limitations:**
- Fine-tuning subset is relatively small (15,000 recipes) compared to public corpora
- Minimal hyperparameter search conducted
- Single-machine deployment tested
""")

    # Technical Specifications
    st.header("Technical Specifications")
    col1, col2 = st.columns(2)
    with col1:
        st.markdown("""
**Dataset:**
- Total Recipes: 231,637
- Training Set: 15,000 recipes
- Average Tags per Recipe: ~6
- Ingredients per Recipe: 3-20
""")
    with col2:
        st.markdown("""
**Infrastructure:**
- Python 3.10
- PyTorch 2.1 (CUDA 11.8)
- Transformers 4.38
- Google Colab A100 GPU
""")

    # References
    st.header("References")
    st.markdown("""
[1] Vaswani et al., "Attention Is All You Need," NeurIPS, 2017.

[2] Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL-HLT, 2019.

[3] Reimers and Gurevych, "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks," EMNLP-IJCNLP, 2019.

[4] Hugging Face, "BERT Model Documentation," 2024.
""")

    st.markdown("---")
    st.markdown("© 2025 CSE 555 Term Project. All rights reserved.")


# Render the report
render_report()