The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published 1 day ago • 16
Article Tiny Agents in Python: a MCP-powered agent in ~70 lines of code By celinah and 3 others • 15 days ago • 122
Common Pile v0.1 Raw Data Collection 8TB of public domain and openly licensed text • 30 items • Updated about 13 hours ago • 3
Common Pile v0.1 Collection All resources related to Common Pile v0.1, an 8TB dataset of public domain and openly licensed text • 4 items • Updated about 11 hours ago • 10
BioReason Collection BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model • 3 items • Updated 5 days ago • 8
StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation Paper • 2505.10292 • Published 22 days ago • 3
olmOCR Collection olmOCR is a document recognition pipeline for efficiently converting documents into plain text. olmocr.allenai.org • 3 items • Updated 18 days ago • 114
Article The Transformers Library: standardizing model definitions By lysandre and 3 others • 23 days ago • 110
Article Blazingly fast whisper transcriptions with Inference Endpoints By mfuntowicz and 5 others • 25 days ago • 67
Annif models Collection Annif models for text classification and subject indexing. FintoAI prefixed models are in use at Finto AI: https://ai.finto.fi • 6 items • Updated Feb 13 • 3
Eynollah models Collection Eynollah models for document image processing and layout analysis tasks. • 14 items • Updated Mar 27 • 3
YOLOv8 Datasets Collection This collection contains all our datasets for YOLOv8 Object detection trainings. • 1 item • Updated Aug 20, 2024 • 1
YOLOv8 Models Collection This collection includes models designed for Object detection using YOLOv8. • 1 item • Updated Aug 20, 2024 • 1
Datasets ATR line-level Collection This collection contains all our datasets for Automatic Text Recognition on line images. • 12 items • Updated Mar 14, 2024 • 4
SpaCy Collection This collection includes models designed for Named Entity Recognition. • 3 items • Updated Mar 13, 2024 • 1
Doc-UFCN Collection This Doc-UFCN collection contains models designed to run various DLA tasks such as text line detection or page segmentation. • 4 items • Updated Mar 13, 2024 • 3