This corpus contains n-grams derived from a 290-million-word corpus of Danish news text from the newspapers Berlingske Tidende, Ekstrabladet and Politiken. The text covers the period 1995-1999. The source corpus was originally developed by Nordic Language Technology (NST) between 1997 and 2003. The n-grams were generated by Uni Research for the National Library and the Language Bank.
Sequences of one to six words have been generated (i.e., unigrams, bigrams, trigrams, 4-grams, 5-grams and 6-grams) and sorted both by frequency and alphabetically. For convenience, a collection of the 1000 most frequent n-grams of all types listed above is also available as a separate download.
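To illustrate what such n-gram lists contain, here is a minimal Python sketch that counts word sequences of length one to six and sorts them by frequency and alphabetically. It is not the procedure Uni Research used. The whitespace tokenisation, the toy sentence, and the function names (count_ngrams, by_frequency, alphabetically) are assumptions made for this example only.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def by_frequency(counts, limit=None):
    """Sort by descending count; ties are broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked if limit is None else ranked[:limit]


def alphabetically(counts):
    """Sort n-grams in plain lexicographic order of their tokens."""
    return sorted(counts.items())


# Toy input. Splitting on whitespace is an assumption of this sketch;
# the tokenisation used for the actual corpus is not described here.
tokens = "det er en god dag og det er en fin dag".split()

# One frequency table per n-gram order, from unigrams (1) to 6-grams (6).
tables = {n: count_ngrams(tokens, n) for n in range(1, 7)}

# Analogue of the "most frequent" download, with a limit of 3 instead of 1000.
top_bigrams = by_frequency(tables[2], limit=3)
for ngram, count in top_bigrams:
    print(" ".join(ngram), count)
```

For a corpus of this size, the tables would not fit comfortably in memory in this form; the sketch only shows the shape of the data, not a scalable implementation.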
Extended metadata
resource Common Info:
resource Type: corpus
identification Info:
resource Name: NST N-gram – Danish News Text
resource Name: NST N-gram – dansk nyhendetekst
description: This corpus contains n-grams derived from a 290-million-word corpus of Danish news text from the newspapers Berlingske Tidende, Ekstrabladet and Politiken. The text covers the period 1995-1999. The source corpus was originally developed by Nordic Language Technology (NST) between 1997 and 2003. The n-grams were generated by Uni Research for the National Library and the Language Bank.
Sequences of one to six words have been generated (i.e., unigrams, bigrams, trigrams, 4-grams, 5-grams and 6-grams) and sorted both by frequency and alphabetically. For convenience, a collection of the 1000 most frequent n-grams of all types listed above is also available as a separate download.
description: This corpus contains Danish n-grams drawn from a 290-million-word corpus of Danish news text from the newspapers Berlingske Tidende, Ekstrabladet and Politiken. The newspapers date from the period 1995-1999. The corpus was originally developed by Nordisk Språkteknologi (NST) in the period 1997-2003. The n-grams were created by Uni Research for Nasjonalbiblioteket and Språkbanken.
Sequences of one to six words have been generated (unigrams, bigrams, trigrams, 4-grams, 5-grams and 6-grams) and then sorted alphabetically and by frequency. A simplified version containing the 1000 most frequent n-grams of all types mentioned above has also been made available for download.