DIDA

Swedish historical handwritten document

DIDA: The largest historical handwritten digit dataset with 250k digits

DIDA is a new image-based historical handwritten digit dataset collected from Swedish historical handwritten document images dated between 1800 and 1940. It is the largest historical handwritten digit dataset, introduced to the Optical Character Recognition (OCR) community to help researchers test their handwritten character recognition methods. To generate DIDA, 250,000 single digits and 100,000 multi-digits were cropped from 75,000 different document images. The dataset has several unique characteristics, as explained below:

The DIDA single-digit dataset has 250,000 handwritten digit samples in 10 classes, 0 to 9, and each class contains 20,000-25,000 single-digit images. To the best of our knowledge, it is the largest dataset to present historical handwritten single-digit samples in RGB color space with their original sizes and appearances (a). This contrasts with existing publicly available handwritten digit datasets (e.g. MNIST (b)), where the digit images are size-normalized, denoised, and cleaned.
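After downloading the single-digit data, the per-class figures quoted above can be sanity-checked with a short script. The following is a minimal sketch, not part of the DIDA release: it assumes the images are unpacked into one subfolder per digit class (`0` to `9`) and that they use common image file extensions. Both are assumptions about the layout, and the function names are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Assumption: one subfolder per digit class, named "0" ... "9".
# The actual DIDA archive layout may differ; adjust as needed.
DIGITS = {str(d) for d in range(10)}
# Assumption: image files use one of these extensions.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_per_class(root):
    """Return a Counter mapping digit label (int) to number of image files."""
    counts = Counter()
    for class_dir in Path(root).iterdir():
        # Skip anything that is not a folder named with a single digit.
        if not (class_dir.is_dir() and class_dir.name in DIGITS):
            continue
        counts[int(class_dir.name)] = sum(
            1
            for p in class_dir.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def classes_outside_range(counts, low=20_000, high=25_000):
    """List the digit classes whose sample count falls outside [low, high]."""
    return [d for d in range(10) if not low <= counts.get(d, 0) <= high]
```

Calling `count_per_class` on the unpacked folder and passing the result to `classes_outside_range` should return an empty list if every class holds between 20,000 and 25,000 images, as described above.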

DIDA vs MNIST

The 12k, 30k, and 58k digit string datasets are generated from different document images.

DIGITNET Model and Weights

DIGITNET and Weights

A Swedish Historical Handwritten Character Dataset with 116,000 characters, 30,000 Swedish names, and 1,000 region names

CArDIS Dataset

If you use any of these datasets, please cite:

Reference:

BibTeX: