Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Abstract
This paper presents two contributions. First, it introduces a novel dataset of 19th-century Latin American press texts, which addresses the lack of specialized corpora for historical and linguistic analysis in this region. Second, it introduces a framework for OCR error correction and linguistic surface form detection in digitized corpora that uses a Large Language Model. The framework is adaptable to various contexts and is applied here to the newly created dataset.