List of text corpora

The following is a list of text corpora in various languages. "Text corpora" is the plural of "text corpus". A text corpus is a large, structured collection of texts, nowadays usually stored and processed electronically. Text corpora are used for statistical analysis and hypothesis testing, for example to check how often a word or construction occurs or to validate linguistic rules within a specific language territory.
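As a rough illustration of "checking occurrences", the sketch below counts word frequencies in a tiny in-memory corpus. The three sentences, the simple regular-expression tokenizer, and the function names (tokenize, word_frequencies, relative_frequency) are assumptions made for this example only; they are not taken from any of the corpora listed here, which are far larger and usually come with their own query tools.

```python
import re
from collections import Counter

# Hypothetical toy corpus (an assumption for illustration only).
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A bird sang.",
]

def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

def relative_frequency(counts, word):
    """Return the share of all tokens in the corpus that are `word`."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

freqs = word_frequencies(corpus)
print(freqs.most_common(2))
print(relative_frequency(freqs, "the"))
```

Relative frequencies, rather than raw counts, are what make figures comparable between corpora of different sizes.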

English language

European languages

Middle Eastern languages

East Asian languages

Parallel corpora of diverse languages

Comparable corpora

See also

This article is issued from Wikipedia, version of 11/15/2016. The text is available under the Creative Commons Attribution/Share Alike license, but additional terms may apply to the media files.