Qwen2.5-Omni is soooo good that people build multimodal reasoning models off of it
> KE-Team/Ke-Omni-R-3B is an open-source audio reasoning model, SOTA on the benchmark average, based on Qwen/Qwen2.5-Omni-3B
> Haoz0206/Omni-R1 is a video reasoning model with pixel-level grounding (see below), and it's super competitive, based on Qwen/Qwen2.5-Omni-7B
NEW: Real-time conversational AI models can now run 100% locally in your browser!
- Privacy by design (no data leaves your device)
- Completely free... forever
- Zero installation required, just visit a website
- Blazingly fast WebGPU-accelerated inference
For those interested, here's how it works:
- Silero VAD for voice activity detection
- Whisper for speech recognition
- SmolLM2-1.7B for text generation
- Kokoro for text-to-speech
Powered by Transformers.js and ONNX Runtime Web! I hope you like it!
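If you want to poke at the same flow outside the browser, here's a rough Python sketch of just the ASR -> LLM steps with the transformers pipeline API. The demo itself runs ONNX builds of these models through Transformers.js; the checkpoint ids, audio path, and the skipped VAD/TTS stages here are placeholders.

```python
# Rough sketch of the ASR -> LLM part of the pipeline with the transformers pipeline API.
# The demo runs ONNX builds of these models in the browser via Transformers.js;
# model ids are placeholders, and the VAD (Silero) and TTS (Kokoro) stages are omitted.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-base.en")
llm = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

# 1. transcribe the user's utterance (VAD decides when the utterance ends)
user_text = asr("utterance.wav")["text"]

# 2. generate a reply using the chat format
messages = [{"role": "user", "content": user_text}]
reply = llm(messages, max_new_tokens=128)[0]["generated_text"][-1]["content"]

print(reply)  # 3. a TTS model (Kokoro in the demo) would speak this out loud
```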
vision LMs are saturating benchmarks, so we built vibe eval
> compare different models with refreshed in-the-wild examples across different categories
> submit your favorite model for eval
no numbers -- just vibes!
emerging trend: models that can understand image + text and generate image + text
don't miss out:
> MMaDA: single 8B diffusion model aligned with CoT (reasoning!) + UniGRPO (Gen-Verse/MMaDA)
> BAGEL: 7B MoT model based on Qwen2.5, SigLIP-so-400M, Flux VAE (ByteDance-Seed/BAGEL)
both by ByteDance!
multimodal
> new moondream (VLM) is out: it's a 4-bit quantized (with QAT) version of moondream-2b, runs on 2.5GB VRAM at 184 tps with only a 0.6% drop in accuracy (OS) -- rough usage sketch after this list
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text. they also released Dolphin, a document parsing VLM (OS)
> Google DeepMind dropped MedGemma at I/O, a VLM that can interpret medical scans, and Gemma 3n, an omni model with competitive LLM performance
> MMaDA is a new 8B diffusion language model that can generate images and text
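For the moondream item above, this is roughly how moondream checkpoints are queried through transformers with remote code. The repo id and the caption()/query() helpers follow the standard moondream2 release, so treat them as assumptions and check the new 4-bit checkpoint's card for its exact id.

```python
# Rough sketch (assumptions: repo id and the caption()/query() helpers exposed by
# moondream's remote code; the new 4-bit QAT checkpoint may live under a different id)
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",   # stand-in for the new 4-bit release
    trust_remote_code=True,
    device_map="auto",
)

image = Image.open("street.jpg")
print(model.caption(image, length="short")["caption"])
print(model.query(image, "How many cars are parked?")["answer"])
```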
LLMs
> Mistral released Devstral, a 24B coding assistant (OS)
> Fairy R1-32B is a new reasoning model -- a distilled version of DeepSeek-R1-Distill-Qwen-32B (OS)
> NVIDIA released AceReason-Nemotron-14B, a new 14B math and code reasoning model
> sarvam-m is a new Indic LM with a hybrid thinking mode, based on Mistral Small (OS)
> samhitika-0.0.1 is a new Sanskrit corpus (BookCorpus translated with Gemma3-27B)
image generation
> MTVCrafter is a new human motion animation generator
> first reasoning model for robotics
> based on Qwen2.5-VL-7B, use with Hugging Face transformers or vLLM -- rough transformers sketch below
> comes with SFT & alignment datasets and a new benchmark
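Since no usage snippet is quoted here, here's a rough transformers sketch against the Qwen2.5-VL-7B-Instruct base it builds on. The checkpoint id, prompt, and image path are placeholders; swap in the robotics model's own repo id.

```python
# Rough sketch against the Qwen2.5-VL-7B-Instruct base (placeholder checkpoint id,
# prompt, and image path; swap in the robotics model's own repo id)
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

image = Image.open("scene.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Which object should the gripper pick up first, and why?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# strip the prompt tokens before decoding
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```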
LLMs
> Alibaba Qwen released WorldPM-72B, a new World Preference Model trained on 15M preference samples (OS)
> II-Medical-8B is a new 8B LLM for medical reasoning by Intelligent-Internet
> TRAIL is a new dataset by Patronus for trace error reasoning for agents (OS)
Multimodal
> Salesforce Research released BLIP3o, a new any-to-any model with image-text input and image-text output; it's based on an image encoder, a text decoder and a DiT, and comes in 8B
> They also released pre-training and fine-tuning datasets
> MMMG is a multimodal generation benchmark for image, audio, text (interleaved)
Image Generation
> Alibaba Wan-AI released Wan2.1-VACE, a video foundation model for image- and text-to-video, video-to-audio and more tasks; comes in 1.3B and 14B (OS)
> ZuluVision released MoviiGen1.1, a new cinematic video generation model based on Wan 2.1 14B (OS)
> multimodalart released isometric-skeumorphic-3d-bnb, an isometric 3D asset generator (AirBnB-style assets) based on Flux
> LTX-Video-0.9.7-distilled is a new real-time video generation (text- and image-to-video) model by Lightricks -- rough diffusers sketch after this list
> Hidream_t2i_human_preference is a new text-to-image preference dataset by Rapidata with 195k human responses from 38k annotators
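For the LTX-Video item above, a rough text-to-video sketch with diffusers' LTXPipeline. It points at the base Lightricks/LTX-Video repo with generic settings, so the repo id and step/frame counts for the 0.9.7 distilled weights are assumptions to adjust.

```python
# Rough sketch with diffusers' LTXPipeline (assumption: base Lightricks/LTX-Video
# checkpoint; the 0.9.7-distilled weights may ship under a different repo id and
# need far fewer steps)
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")

video = pipe(
    prompt="a slow dolly shot through a rainy neon-lit street at night",
    negative_prompt="low quality, worst quality, jittery motion",
    width=704,
    height=480,
    num_frames=121,
    num_inference_steps=40,
).frames[0]

export_to_video(video, "street.mp4", fps=24)
```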
Audio
> stabilityai released stable-audio-open-small, a new text-to-audio model
> TEN-framework released ten-vad, a voice activity detection model (OS)
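For reference, text-to-audio with diffusers' StableAudioPipeline looks like this against the earlier stable-audio-open-1.0 checkpoint; whether the new small model loads through the same pipeline (rather than stable-audio-tools) is an assumption to verify on its model card.

```python
# Rough sketch with diffusers' StableAudioPipeline, shown against stable-audio-open-1.0;
# whether stable-audio-open-small loads the same way (vs. stable-audio-tools) is an assumption
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

audio = pipe(
    prompt="gentle rain falling on a tin roof",
    num_inference_steps=100,
    audio_end_in_s=10.0,
).audios[0]

# audios[0] is (channels, samples); soundfile expects (samples, channels)
sf.write("rain.wav", audio.T.float().cpu().numpy(), pipe.vae.sampling_rate)
```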
We just shipped a blog post covering everything latest on vision language models, including:
> GUI agents, agentic VLMs, omni models
> multimodal RAG
> video LMs
> smol models
..and more! https://huggingface.co/blog/vlms-2025
Qwen made it rain! They released Qwen3: new dense and MoE models ranging from 0.6B to 235B, as well as Qwen2.5-Omni, an any-to-any model in 3B and 7B!
> Microsoft AI released Phi-4 reasoning models (that also come in mini and plus sizes)
> NVIDIA released new CoT reasoning datasets
> ByteDance released UI-TARS-1.5, a native multimodal UI parsing agentic model
> Meta released EdgeTAM, an on-device object tracking model (SAM2 variant)
> NVIDIA released parakeet-tdt-0.6b-v2, a smol 600M automatic speech recognition model
> Nari released Dia, a 1.6B text-to-speech model
> Moonshot AI released Kimi Audio, a new audio understanding, generation, and conversation model
> JetBrains released Mellum models in base and SFT for coding
> Tesslate released UIGEN-T2-7B, a new text-to-frontend-code model