Synthetic text images for North, South, Lule and Inari Sámi
Extended metadata
- resource Common Info
- resource Type: toolService
- identification Info
- resource Name: Syntetiske tekstbilder for nord-, sør-, lule- og inaresamisk
- resource Name: Synthetic text images for North, South, Lule and Inari Sámi
- description: This dataset contains synthetic line images that can be used to fine-tune OCR models for North, South, Lule and Inari Sámi. The images are produced by creating 'clean' line images and adding noise with Augraphy. The text in the dataset comes from Giellatekno's corpus. The dataset is split randomly so that 71% of the files (307387 lines) are in the training split, 9% of the files (40765 lines) are in the validation split and 20% of the files (84534 lines) are in the test split. Each split has its own unique set of typefaces and text and background colors. See the documentation file for more information.
- description: This dataset contains synthetic line images meant for fine-tuning OCR models for North, South, Lule and Inari Sámi. Clean line images are created using Pillow and are subsequently distorted using Augraphy. The text in this dataset comes from Giellatekno's corpus. The dataset is split randomly by file, so 71% of the files (307387 lines) are in the training split, 9% of the files (40765 lines) are in the validation split and 20% of the files (84534 lines) are in the test split. Each split has a unique set of typefaces and text/background colors. See the documentation file for more information.
- url:
- P I D: hdl:21.11146/101
- identifier: sbr-101
- distribution Info
- licence Info
- user Category: Public
- distribution Access Medium: downloadable
- download Location:
- attribution Text: Please cite: (1) Enstad T, Trosterud T, Røsok MI, Beyer Y, Roald M. 'Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway.' Accepted for publication in Proceedings of the 25th Nordic Conference on Computational Linguistics (NoDaLiDa) 2025; (2) SIKOR UiT The Arctic University of Norway and the Norwegian Saami Parliament's Saami text collection, Version 01.12.2021 [Data set]. (Note that the SIKOR dataset, the source of the Sámi text in the images, is licensed under CC-BY 3.0.)
- licence
- licence Family: Creative Commons (CC)
- licence Name: Creative_Commons-BY (CC-BY)
- licence Url:
- conditions Of Use: BY
- licensor:
- actor Info
- actor Type: organization
- role: Licensor
- organization Info
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info
- email:
- url:
- licence Info
- contact
- actor Info
- actor Type: organization
- role: Contact
- organization Info
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info
- email:
- url:
- actor Info
- metadata Info
- metadata Creation Date: 28.01.2025
- metadata Language Name: Norwegian Bokmål
- metadata Language Name: English
- metadata Language Id: nb
- metadata Language Id: en
- metadata Last Date Updated: 28.01.2025
- metadata Creator
- actor Info
- actor Type: organization
- role: Metadata Creator
- organization Info
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info
- email:
- url:
- actor Info
- resource Creation Info
- creation Start Date: 01.10.2024
- creation End Date: 28.01.2025
- resource Creator
- actor Info
- actor Type: organization
- role: Resource Creator
- organization Info
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info
- email:
- url:
- actor Info
- tool Info
- description: Synthetic text images for Sámi languages
- input Info
- media Type: image
- output Info
- media Type: text
- Service
- Name: Synthetic images for Sámi Languages
- Service Description Location:
- Location:
- Operations:
- Operation
- Name: OCR
- Output:
- Parameter Group
- Parameters:
- Parameter
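The description above says the dataset is split randomly by file into roughly 71% training, 9% validation and 20% test. A minimal sketch of such a file-level split follows. It is an illustration under stated assumptions, not the procedure actually used for this dataset: the function name `split_files`, the seed and the file names are all hypothetical.

```python
import random


def split_files(paths, fractions=(0.71, 0.09, 0.20), seed=0):
    """Shuffle file paths and cut them into train/validation/test lists.

    The fractions mirror the 71/9/20 proportions quoted in the description;
    the seed is an arbitrary, hypothetical choice.
    """
    shuffled = sorted(paths)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # whatever remains goes to test
    }


# Hypothetical file names, for illustration only.
files = [f"text_{i:03d}.txt" for i in range(100)]
splits = split_files(files)
print({name: len(members) for name, members in splits.items()})
```

Because whole files are assigned to a split, every line from a given file lands in the same split, which is why the line counts in the description do not match the file percentages exactly.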
dc:type | toolService |
dc:title | Synthetic text images for North, South, Lule and Inari Sámi |
dc:identifier | |
dc:description | This dataset contains synthetic line images meant for fine-tuning OCR models for North, South, Lule and Inari Sámi. Clean line images are created using Pillow and are subsequently distorted using Augraphy. The text in this dataset comes from Giellatekno's corpus. The dataset is split randomly by file, so 71% of the files (307387 lines) are in the training split, 9% of the files (40765 lines) are in the validation split and 20% of the files (84534 lines) are in the test split. Each split has a unique set of typefaces and text/background colors. See the documentation file for more information. |
dc:publisher | |
dc:format | downloadable |
dc:date | 2024-10-01 |
dc:date | 2025-01-28 |
dc:rights | Public |
dc:rights | Creative Commons (CC) |
dc:rights | Creative_Commons-BY (CC-BY) |
dc:rights | |
dc:creator | National Library of Norway |
dc:lang | |
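The description states that clean line images are created with Pillow and then distorted with Augraphy. The sketch below illustrates only the first step and is not the dataset's actual generation code. The function names, canvas size, colors and sample text are hypothetical, and the `degrade` function is a simple blur used as a stand-in for the distortion step; it does not call Augraphy.

```python
from PIL import Image, ImageDraw, ImageFilter, ImageFont


def render_line(text, size=(400, 40), text_color=(20, 20, 20),
                background=(245, 240, 225)):
    """Draw one line of text on a plain background and return the image."""
    image = Image.new("RGB", size, background)
    draw = ImageDraw.Draw(image)
    # Pillow's built-in font keeps the sketch self-contained. Full coverage of
    # Sámi letters generally calls for a TrueType font loaded with
    # ImageFont.truetype(), so the sample text here is ASCII only.
    draw.text((10, 10), text, fill=text_color, font=ImageFont.load_default())
    return image


def degrade(image):
    """Stand-in for the distortion step: a light Gaussian blur, NOT Augraphy."""
    return image.filter(ImageFilter.GaussianBlur(radius=1))


clean = render_line("Synthetic line image")
noisy = degrade(clean)
print(clean.size, noisy.size)
```

Varying the font, `text_color` and `background` per split would correspond to the description's note that each split has its own set of typefaces and text/background colors.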