Synthetic text images for North, South, Lule and Inari Sámi

This dataset contains synthetic line images intended for training OCR models for North, South, Lule and Inari Sámi. Clean line images are rendered with Pillow and then distorted with Augraphy.

The text in this dataset comes from Giellatekno’s corpus.

The dataset is split randomly by file: 71 % of the files (307387 lines) are in the training split, 9 % of the files (40765 lines) are in the validation split and 20 % of the files (84534 lines) are in the test split. Each split has its own unique set of typefaces and text/background colors.
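
A file-level random split of this kind can be sketched as follows. The function name `split_by_file`, the seed and the example file names are hypothetical; the actual procedure used for the dataset is described in its documentation file and may differ.

```python
import random


def split_by_file(file_names, fractions=(0.71, 0.09, 0.20), seed=0):
    """Shuffle files and cut the list into train/validation/test groups."""
    # Sort first so the result depends only on the seed, not on input order.
    files = sorted(file_names)
    random.Random(seed).shuffle(files)
    n_train = round(fractions[0] * len(files))
    n_val = round(fractions[1] * len(files))
    return {
        "train": files[:n_train],
        "validation": files[n_train:n_train + n_val],
        # The test split takes whatever remains after the first two cuts.
        "test": files[n_train + n_val:],
    }


# Hypothetical file names, used only to exercise the function.
splits = split_by_file([f"file_{i:03d}.txt" for i in range(100)])
```

Because whole files are assigned to a split, every line from a given source file ends up in the same split, which is why the line counts above do not match the file percentages exactly.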
See the documentation file for more information.


Download metadata (CMDI XML): https://www.nb.no/sprakbanken/oai?verb=GetRecord&identifier=oai:nb.no:sbr-101&metadataPrefix=cmdi

dc:type	toolService
dc:title	Synthetic text images for North, South, Lule and Inari Sámi
dc:identifier	oai:nb.no:sbr-101
dc:description	This dataset contains synthetic line images meant for fitting OCR models for North, South, Lule and Inari Sámi. Clean line images are created using Pillow and they are subsequently distorted using Augraphy. The text in this dataset comes from Giellatekno's corpus. The dataset is split randomly by file so 71 % of the files (307387 lines) are in the training split, 9 % of the files (40765 lines) are in the validation split and 20 % of the files (84534 lines) are in the test split. Each split has a unique set of typefaces and text/background colors. See the documentation file for more information.
dc:publisher
dc:format	downloadable
dc:date	2024-10-01
dc:date	2025-01-28
dc:rights	Public
dc:rights	Creative Commons (CC)
dc:rights	Creative_Commons-BY (CC-BY)
dc:rights	https://creativecommons.org/licenses/by/3.0/
dc:creator	National Library of Norway
dc:lang
