Translation Memory from Doffin
Extended metadata
- resource Common Info:
- resource Type: corpus
- identification Info:
- resource Name: Translation Memory from Doffin
- resource Name: Omsetjingsminne frå Doffin
- description: This corpus contains data from Doffin, the Norwegian web-based database for notices of public procurement and procurement in the utility sector, managed by the Norwegian Agency for Public and Financial Management. The Language Bank received the data as an XML database dump consisting of 41,143 document pairs (original and translation). Of these, 40,631 were translations from Norwegian to English, and only those are included in the corpus. Of the originally Norwegian documents, 39,893 were in Norwegian Bokmål and 736 in Norwegian Nynorsk. Originals and translations were first aligned at document level using an internal document identifier; the sentences were then extracted with the NLTK Punkt Sentence Tokenizer and aligned with Hunalign. Exact duplicate translations were discarded. This yielded 293,649 translation units (TUs) for Norwegian Bokmål to English and 6,342 TUs for Norwegian Nynorsk to English. A TU is a translation pair consisting of an original text and its aligned translation. It usually corresponds to a more or less meaningful linguistic unit, typically a sentence or a heading, but may also consist of a single word or several clauses. The translation units for the two language pairs are distributed as two separate files, both in TMX 1.4 format (an XML-based format).
- description: This corpus contains data from Doffin, the national notification database for public procurement, managed by the Norwegian Agency for Public and Financial Management (DFØ). The Language Bank (Språkbanken) received the data as a dump of an XML database. The dump consisted of 41,143 document pairs (originals and translations). 40,631 of these were translations from Norwegian to English. Only these are included in the corpus. Of the originally Norwegian documents, 39,893 are in Bokmål and 736 in Nynorsk. Original and translation were first aligned at document level using an internal document identifier; the sentences were then identified with the NLTK Punkt Sentence Tokenizer and aligned using Hunalign. Duplicate translations (exact duplicates) were discarded. In total we found 293,649 translation units (TUs) for Bokmål to English and 6,342 TUs for Nynorsk to English. A TU is a translation pair with an original text and an aligned translation, and usually corresponds to a more or less meaningful linguistic unit, typically a sentence, a heading or similar. A TU may also consist of a single word or several sentences. The translation units for Bokmål and Nynorsk are distributed as two separate files, both in TMX 1.4 format (a variant of XML).
- url: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-63/
- PID: hdl:21.11146/63
- identifier: sbr-63
- distribution Info:
- licence Info:
- user Category: Public
- distribution Access Medium: downloadable
- download Location: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-63/
- licence:
- licence Family: Creative Commons (CC)
- licence Name: Creative_Commons-ZERO (CC-ZERO)
- licence Url: https://creativecommons.org/publicdomain/zero/1.0/
- licensor:
- actor Info:
- actor Type: organization
- role: Licensor
- organization Info:
- organization Name: Norwegian Agency for Public and Financial Management
- organization Name: Direktoratet for Forvaltning og Økonomistyring
- organization Short Name: DFØ
- department Name: Doffin
- distribution Rights Holder:
- actor Info:
- actor Type: organization
- role: Distribution Rights Holder
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info:
- actor Type: organization
- role: Contact
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- actor Info:
- actor Type: person
- role: Metadata Creator
- person Info:
- surname: Lindstad
- given Name: Arne Martinus
- affiliation:
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- actor Info:
- actor Type: organization
- role: Resource Creator
- organization Info:
- organization Name: Norwegian Agency for Public and Financial Management
- organization Name: Direktoratet for Forvaltning og Økonomistyring
- organization Short Name: DFØ
- department Name: Doffin
- corpus Info:
- corpus Type: Written Corpus
- corpus Part Info:
- media Type: text
- corpus Text Info:
- text Format Info:
- mime Type: application/x-tmx+xml
- size Per Text Format:
- size Info:
- size: 299991
- size Unit: units
- size Info:
- size: 2
- size Unit: files
- character Encoding Info:
- character Encoding: UTF-8
- corpus Part General Info:
- linguality Info:
- linguality Type: multilingual
- multilinguality Type: parallel
- multilinguality Type Details: translation memory
- language Info:
- language Id: nb
- language Name: Norwegian Bokmål
- language Info:
- language Id: nn
- language Name: Norwegian Nynorsk
- language Info:
- language Id: en
- language Name: English
- modality Info:
- modality Type: writtenLanguage
- size Info:
- size: 299991
- size Unit: units
- size Info:
- size: 2
- size Unit: files
- annotation Info:
- annotation Type: alignment
- segmentation Level: sentence
- annotation Mode: automatic
- annotator:
- actor Info:
- actor Type: person
- role: Resource Annotator
- person Info:
- surname: Birkenes
- given Name: Magnus Breder
- affiliation:
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
dc:type | corpus |
dc:title | Translation Memory from Doffin |
dc:identifier | oai:nb.no:sbr-63 |
dc:description | This corpus contains data from Doffin, the Norwegian web-based database for notices of public procurement and procurement in the utility sector, managed by the Norwegian Agency for Public and Financial Management. The Language Bank received the data as an XML database dump consisting of 41,143 document pairs (original and translation). Of these, 40,631 were translations from Norwegian to English, and only those are included in the corpus. Of the originally Norwegian documents, 39,893 were in Norwegian Bokmål and 736 in Norwegian Nynorsk. Originals and translations were first aligned at document level using an internal document identifier; the sentences were then extracted with the NLTK Punkt Sentence Tokenizer and aligned with Hunalign. Exact duplicate translations were discarded. This yielded 293,649 translation units (TUs) for Norwegian Bokmål to English and 6,342 TUs for Norwegian Nynorsk to English. A TU is a translation pair consisting of an original text and its aligned translation. It usually corresponds to a more or less meaningful linguistic unit, typically a sentence or a heading, but may also consist of a single word or several clauses. The translation units for the two language pairs are distributed as two separate files, both in TMX 1.4 format (an XML-based format). |
dc:format | downloadable |
dc:date | 2020-11-04 |
dc:rights | Public |
dc:rights | Creative Commons (CC) |
dc:rights | Creative_Commons-ZERO (CC-ZERO) |
dc:rights | https://creativecommons.org/publicdomain/zero/1.0/ |
dc:creator | Norwegian Agency for Public and Financial Management |
dc:creator | Magnus Breder Birkenes |
dc:lang | Norwegian Bokmål |
dc:lang | Norwegian Nynorsk |
dc:lang | English |
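
Since the translation units are distributed in TMX 1.4, a file can in principle be read with a standard XML parser. The following is a minimal sketch in Python using only the standard library. It assumes the usual TMX layout of `<tu>` elements containing `<tuv xml:lang="...">` elements with a `<seg>` child. The function name `read_translation_units`, the sample content and the language codes `nb` and `en` in the sample are illustrative assumptions, not taken from the distributed files.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_translation_units(source):
    """Yield one {language code: segment text} dict per <tu> element.

    `source` is a file name or a binary file-like object holding TMX data.
    (Hypothetical helper, written for illustration only.)
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                unit[tuv.get(XML_LANG, "")] = "".join(seg.itertext()).strip()
        yield unit
        # Drop the processed element's contents to keep memory use low
        # when streaming through a file with many translation units.
        elem.clear()


# Tiny hand-written sample (hypothetical content, not from the corpus).
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="nb" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="nb"><seg>Kunngjering av konkurranse</seg></tuv>
      <tuv xml:lang="en"><seg>Contract notice</seg></tuv>
    </tu>
  </body>
</tmx>
"""

units = list(read_translation_units(io.BytesIO(SAMPLE_TMX)))
```

To process a downloaded file, pass its path instead of the in-memory sample. The element is cleared after each unit because the two files together hold 299,991 units (293,649 + 6,342), so streaming is likely preferable to loading the whole tree.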