# Datasets
The framework provides 2241 datasets from 62 sources, covering 164 language identifiers. Two of these identifiers, `code` and `multi`, are categories rather than individual languages. In the order of the table below, the identifiers are: Afrikaans, Amharic, Aragonese, Arabic, Egyptian Arabic, Assamese, Asturian, Avaric, Azerbaijani, South Azerbaijani, Bashkir, Belarusian, Bulgarian, Bihari, Bengali, Tibetan, Bishnupriya, Breton, Bosnian, Russia Buriat, Catalan, Chechen, Cebuano, Central Kurdish, `code`, Czech, Chuvash, Welsh, Danish, German, Lower Sorbian, Dhivehi, Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Western Frisian, Irish, Scottish Gaelic, Galician, Guaraní, Goan Konkani, Swiss German, Gujarati, Hausa, Hebrew, Hindi, Croatian, Upper Sorbian, Haitian, Hungarian, Armenian, Interlingua, Indonesian, Interlingue, Igbo, Iloko, Ido, Icelandic, Italian, Japanese, Lojban, Javanese, Georgian, Kazakh, Khmer, Kannada, Korean, Karachay-Balkar, Kurdish, Komi, Cornish, Kirghiz, Latin, Luxembourgish, Lezghian, Limburgish, Lombard, Lao, Lithuanian, Latvian, Maithili, Malagasy, Eastern Mari, Minangkabau, Macedonian, Malayalam, Mongolian, Marathi, Western Mari, Malay, Maltese, `multi`, Mirandese, Burmese, Mazanderani, Nahuatl, Low German, Nepali, Newari, Dutch, Norwegian Nynorsk, Norwegian, Chichewa, Occitan, Oromo, Oriya, Ossetian, Panjabi, Polish, Piedmontese, Western Panjabi, Pashto, Portuguese, Quechua, Romanian, Russian, Kinyarwanda, Sanskrit, Yakut, Sindhi, Serbo-Croatian, Sinhalese, Slovak, Slovene, Shona, Somali, Albanian, Serbian, Southern Sotho, Sundanese, Swedish, Swahili, Tamil, Telugu, Tajik, Thai, Tigrinya, Turkmen, Tagalog, Turkish, Tatar, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Volapük, Walloon, Waray, Wu Chinese, Emiliano-Romagnolo (`x-eml`), Kalmyk, Xhosa, Mingrelian, Yiddish, Yoruba, Chinese, Zulu.

## Languages

| language | reported_tokens | 
|---|---|
| af | N/A | 
| am | N/A | 
| an | N/A | 
| ar | N/A | 
| arz | N/A | 
| as | N/A | 
| ast | N/A | 
| av | N/A | 
| az | N/A | 
| azb | N/A | 
| ba | N/A | 
| be | N/A | 
| bg | 13 B | 
| bh | N/A | 
| bn | N/A | 
| bo | N/A | 
| bpy | N/A | 
| br | N/A | 
| bs | N/A | 
| bxr | N/A | 
| ca | 4 B | 
| ce | N/A | 
| ceb | N/A | 
| ckb | N/A | 
| code | 250 B | 
| cs | 21 B | 
| cv | N/A | 
| cy | N/A | 
| da | 11 B | 
| de | 26 B | 
| dsb | N/A | 
| dv | N/A | 
| el | 24 B | 
| en | 117 B | 
| eo | N/A | 
| es | 20 B | 
| et | 5 B | 
| eu | 982 M | 
| fa | N/A | 
| fi | 9 B | 
| fr | 60 B | 
| fy | N/A | 
| ga | 669 M | 
| gd | N/A | 
| gl | 36 M | 
| gn | N/A | 
| gom | N/A | 
| gsw | N/A | 
| gu | N/A | 
| ha | N/A | 
| he | N/A | 
| hi | N/A | 
| hr | 8 B | 
| hsb | N/A | 
| ht | N/A | 
| hu | 12 B | 
| hy | N/A | 
| ia | N/A | 
| id | N/A | 
| ie | N/A | 
| ig | N/A | 
| ilo | N/A | 
| io | N/A | 
| is | N/A | 
| it | 14 B | 
| ja | N/A | 
| jbo | N/A | 
| jv | N/A | 
| ka | N/A | 
| kk | N/A | 
| km | N/A | 
| kn | N/A | 
| ko | N/A | 
| krc | N/A | 
| ku | N/A | 
| kv | N/A | 
| kw | N/A | 
| ky | N/A | 
| la | N/A | 
| lb | N/A | 
| lez | N/A | 
| li | N/A | 
| lmo | N/A | 
| lo | N/A | 
| lt | 5 B | 
| lv | 4 B | 
| mai | N/A | 
| mg | N/A | 
| mhr | N/A | 
| min | N/A | 
| mk | N/A | 
| ml | N/A | 
| mn | N/A | 
| mr | N/A | 
| mrj | N/A | 
| ms | N/A | 
| mt | 4 B | 
| multi | N/A | 
| mwl | N/A | 
| my | N/A | 
| mzn | N/A | 
| nah | N/A | 
| nds | N/A | 
| ne | N/A | 
| new | N/A | 
| nl | 26 B | 
| nn | 301 M | 
| no | 5 B | 
| ny | N/A | 
| oc | N/A | 
| om | N/A | 
| or | N/A | 
| os | N/A | 
| pa | N/A | 
| pl | 25 B | 
| pms | N/A | 
| pnb | N/A | 
| ps | N/A | 
| pt | 24 B | 
| qu | N/A | 
| ro | 9 B | 
| ru | N/A | 
| rw | N/A | 
| sa | N/A | 
| sah | N/A | 
| sd | N/A | 
| sh | 58 k | 
| si | N/A | 
| sk | 18 B | 
| sl | 9 B | 
| sn | N/A | 
| so | N/A | 
| sq | N/A | 
| sr | 3 B | 
| st | N/A | 
| su | N/A | 
| sv | 13 B | 
| sw | N/A | 
| ta | N/A | 
| te | N/A | 
| tg | N/A | 
| th | N/A | 
| ti | N/A | 
| tk | N/A | 
| tl | N/A | 
| tr | N/A | 
| tt | N/A | 
| ug | N/A | 
| uk | 11 B | 
| ur | N/A | 
| uz | N/A | 
| vi | N/A | 
| vo | N/A | 
| wa | N/A | 
| war | N/A | 
| wuu | N/A | 
| x-eml | N/A | 
| xal | N/A | 
| xh | N/A | 
| xmf | N/A | 
| yi | N/A | 
| yo | N/A | 
| zh | N/A | 
| zu | N/A | 
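
The `reported_tokens` cells are abbreviated strings rather than plain numbers, so they need converting before they can be summed or compared. The following is a minimal sketch of one way to do that. It assumes that `k`, `M`, and `B` stand for thousand, million, and billion, and that `N/A` means no count was reported; the page itself does not define these units. The function name `parse_reported_tokens` is hypothetical and not part of the framework.

```python
from typing import Optional

# Assumed meaning of the unit suffixes (not defined on this page).
_SUFFIX_MULTIPLIERS = {"k": 10**3, "M": 10**6, "B": 10**9}


def parse_reported_tokens(value: str) -> Optional[int]:
    """Convert a cell such as '13 B' to an integer, or None for 'N/A'."""
    text = value.strip()
    if text == "N/A":
        return None
    number, suffix = text.split()
    return int(number) * _SUFFIX_MULTIPLIERS[suffix]


# A few rows copied from the language table.
rows = {"bg": "13 B", "eu": "982 M", "sh": "58 k", "af": "N/A"}
parsed = {lang: parse_reported_tokens(cell) for lang, cell in rows.items()}

# Sum only the rows that report a count.
known_total = sum(count for count in parsed.values() if count is not None)
print(parsed)
print(known_total)
```

Because the published figures appear to be rounded, totals computed this way are approximate at best.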

## Data sources

| source_id | reported_tokens | 
|---|---|
| curlicat | 410 M | 
| macocu | 23 B | 
| redpajama | 46 B | 
| wura | N/A | 
| wikihow | 2 M | 
| pes2o | 42 B | 
| proof_pile | 8 B | 
| pile_of_law | N/A | 
| math_amps | N/A | 
| edgarcorpus | 7 B | 
| bulgarian_news | 283 M | 
| bulnc | 567 M | 
| openlegaldata | 10 B | 
| dewac | 2 B | 
| ga_bilingual_legistation | 4 M | 
| ga_universal_dependencies | 3 M | 
| hrwac | 1 B | 
| styria_news | 409 M | 
| croatian_news_engri | 695 M | 
| itwac | 2 B | 
| korpus_malti | 366 M | 
| sonar | 500 M | 
| cc_gigafida | 127 M | 
| academic_slovene_kas | 1 B | 
| slwac_web | 1 B | 
| sk_court_decisions | 11 B | 
| sk_laws | 45 M | 
| syn_v9 | 5 B | 
| cs_en_parallel | N/A | 
| danish_gigaword | 1 B | 
| danewsroom | 472 M | 
| dk_clarin | 441 M | 
| cabernet | 712 M | 
| norwegian_cc | 5 B | 
| pl_nkjp | 1 M | 
| pl_parliamentary_corpus | 671 M | 
| parlamento_pt | 819 M | 
| brwac | 3 B | 
| seimas_lt_en | 48 k | 
| state_related_latvian_web | 1 M | 
| greek_legal_code | 45 M | 
| greek_web_corpus | 3 B | 
| estonian_reference_corpus | 175 M | 
| enc2021 | N/A | 
| ekspress | N/A | 
| euscrawl | 846 M | 
| spanish_legal | 3 B | 
| ylenews | N/A | 
| sv_gigaword | 1 B | 
| srpkor | N/A | 
| marcell_legislative_subcorpus_v2 | 31 M | 
| uk_laws | 579 M | 
| eurlex | 121 B | 
| legal_mc4 | 29 B | 
| wiki | 12 B | 
| wikibooks | 353 M | 
| wikiquote | 268 M | 
| wikinews | 79 M | 
| wikisource | 2 B | 
| wikivoyage | 132 M | 
| colossal_oscar | 154 B | 
| starcoder | 250 B | 

This page is automatically generated.