-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 29 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
Collections including paper arxiv:2410.15316
-
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Paper • 2410.15316 • Published • 12 -
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
Paper • 2410.19168 • Published • 21 -
LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
Paper • 2504.19838 • Published • 22
-
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
Paper • 2405.18503 • Published • 9 -
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
Paper • 2405.20289 • Published • 11 -
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
Paper • 2406.02897 • Published • 16 -
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
Paper • 2406.03344 • Published • 21
-
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation
Paper • 2502.20583 • Published • 13 -
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Paper • 2410.15316 • Published • 12 -
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Paper • 2503.01710 • Published • 6
-
Dailypapershackernews
-
Prithvi WxC: Foundation Model for Weather and Climate
Paper • 2409.13598 • Published • 45 -
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
Paper • 2410.05262 • Published • 11 -
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Paper • 2410.15316 • Published • 12