# Datasets
The framework provides 2241 datasets from 62 sources in 164 languages. Two of the entries counted as languages, Code and Multi, are not individual natural languages. The languages are as follows: Afrikaans, Amharic, Aragonese, Arabic, Egyptian Arabic, Assamese, Asturian, Avaric, Azerbaijani, South Azerbaijani, Bashkir, Belarusian, Bulgarian, Bihari, Bengali, Tibetan, Bishnupriya, Breton, Bosnian, Russia Buriat, Catalan, Chechen, Cebuano, Central Kurdish, Code, Czech, Chuvash, Welsh, Danish, German, Lower Sorbian, Dhivehi, Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Western Frisian, Irish, Gaelic, Galician, Guaraní, Goan Konkani, Swiss German, Gujarati, Hausa, Hebrew, Hindi, Croatian, Upper Sorbian, Haitian, Hungarian, Armenian, Interlingua, Indonesian, Interlingue, Igbo, Iloko, Ido, Icelandic, Italian, Japanese, Lojban, Javanese, Georgian, Kazakh, Khmer, Kannada, Korean, Karachay-Balkar, Kurdish, Komi, Cornish, Kirghiz, Latin, Luxembourgish, Lezghian, Limburgish, Lombard, Lao, Lithuanian, Latvian, Maithili, Malagasy, Eastern Mari, Minangkabau, Macedonian, Malayalam, Mongolian, Marathi, Western Mari, Malay, Maltese, Multi, Mirandese, Burmese, Mazanderani, Nahuatl, Low German, Nepali, Newari, Dutch, Norwegian Nynorsk, Norwegian, Chichewa, Occitan, Oromo, Oriya, Ossetian, Panjabi, Polish, Piedmontese, Western Panjabi, Pashto, Portuguese, Quechua, Romanian, Russian, Kinyarwanda, Sanskrit, Yakut, Sindhi, Serbo-Croatian, Sinhalese, Slovak, Slovene, Shona, Somali, Albanian, Serbian, Southern Sotho, Sundanese, Swedish, Swahili, Tamil, Telugu, Tajik, Thai, Tigrinya, Turkmen, Tagalog, Turkish, Tatar, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Volapük, Walloon, Waray, Wu Chinese, Emilian-Romagnol, Kalmyk, Xhosa, Mingrelian, Yiddish, Yoruba, Chinese, Zulu.

## Languages

| language | reported_tokens |
|---|---|
| af | N/A |
| am | N/A |
| an | N/A |
| ar | N/A |
| arz | N/A |
| as | N/A |
| ast | N/A |
| av | N/A |
| az | N/A |
| azb | N/A |
| ba | N/A |
| be | N/A |
| bg | 13 B |
| bh | N/A |
| bn | N/A |
| bo | N/A |
| bpy | N/A |
| br | N/A |
| bs | N/A |
| bxr | N/A |
| ca | 4 B |
| ce | N/A |
| ceb | N/A |
| ckb | N/A |
| code | 250 B |
| cs | 21 B |
| cv | N/A |
| cy | N/A |
| da | 11 B |
| de | 26 B |
| dsb | N/A |
| dv | N/A |
| el | 24 B |
| en | 117 B |
| eo | N/A |
| es | 20 B |
| et | 5 B |
| eu | 982 M |
| fa | N/A |
| fi | 9 B |
| fr | 60 B |
| fy | N/A |
| ga | 669 M |
| gd | N/A |
| gl | 36 M |
| gn | N/A |
| gom | N/A |
| gsw | N/A |
| gu | N/A |
| ha | N/A |
| he | N/A |
| hi | N/A |
| hr | 8 B |
| hsb | N/A |
| ht | N/A |
| hu | 12 B |
| hy | N/A |
| ia | N/A |
| id | N/A |
| ie | N/A |
| ig | N/A |
| ilo | N/A |
| io | N/A |
| is | N/A |
| it | 14 B |
| ja | N/A |
| jbo | N/A |
| jv | N/A |
| ka | N/A |
| kk | N/A |
| km | N/A |
| kn | N/A |
| ko | N/A |
| krc | N/A |
| ku | N/A |
| kv | N/A |
| kw | N/A |
| ky | N/A |
| la | N/A |
| lb | N/A |
| lez | N/A |
| li | N/A |
| lmo | N/A |
| lo | N/A |
| lt | 5 B |
| lv | 4 B |
| mai | N/A |
| mg | N/A |
| mhr | N/A |
| min | N/A |
| mk | N/A |
| ml | N/A |
| mn | N/A |
| mr | N/A |
| mrj | N/A |
| ms | N/A |
| mt | 4 B |
| multi | N/A |
| mwl | N/A |
| my | N/A |
| mzn | N/A |
| nah | N/A |
| nds | N/A |
| ne | N/A |
| new | N/A |
| nl | 26 B |
| nn | 301 M |
| no | 5 B |
| ny | N/A |
| oc | N/A |
| om | N/A |
| or | N/A |
| os | N/A |
| pa | N/A |
| pl | 25 B |
| pms | N/A |
| pnb | N/A |
| ps | N/A |
| pt | 24 B |
| qu | N/A |
| ro | 9 B |
| ru | N/A |
| rw | N/A |
| sa | N/A |
| sah | N/A |
| sd | N/A |
| sh | 58 k |
| si | N/A |
| sk | 18 B |
| sl | 9 B |
| sn | N/A |
| so | N/A |
| sq | N/A |
| sr | 3 B |
| st | N/A |
| su | N/A |
| sv | 13 B |
| sw | N/A |
| ta | N/A |
| te | N/A |
| tg | N/A |
| th | N/A |
| ti | N/A |
| tk | N/A |
| tl | N/A |
| tr | N/A |
| tt | N/A |
| ug | N/A |
| uk | 11 B |
| ur | N/A |
| uz | N/A |
| vi | N/A |
| vo | N/A |
| wa | N/A |
| war | N/A |
| wuu | N/A |
| x-eml | N/A |
| xal | N/A |
| xh | N/A |
| xmf | N/A |
| yi | N/A |
| yo | N/A |
| zh | N/A |
| zu | N/A |

## Data sources

| source_id | reported_tokens |
|---|---|
| curlicat | 410 M |
| macocu | 23 B |
| redpajama | 46 B |
| wura | N/A |
| wikihow | 2 M |
| pes2o | 42 B |
| proof_pile | 8 B |
| pile_of_law | N/A |
| math_amps | N/A |
| edgarcorpus | 7 B |
| bulgarian_news | 283 M |
| bulnc | 567 M |
| openlegaldata | 10 B |
| dewac | 2 B |
| ga_bilingual_legistation | 4 M |
| ga_universal_dependencies | 3 M |
| hrwac | 1 B |
| styria_news | 409 M |
| croatian_news_engri | 695 M |
| itwac | 2 B |
| korpus_malti | 366 M |
| sonar | 500 M |
| cc_gigafida | 127 M |
| academic_slovene_kas | 1 B |
| slwac_web | 1 B |
| sk_court_decisions | 11 B |
| sk_laws | 45 M |
| syn_v9 | 5 B |
| cs_en_parallel | N/A |
| danish_gigaword | 1 B |
| danewsroom | 472 M |
| dk_clarin | 441 M |
| cabernet | 712 M |
| norwegian_cc | 5 B |
| pl_nkjp | 1 M |
| pl_parliamentary_corpus | 671 M |
| parlamento_pt | 819 M |
| brwac | 3 B |
| seimas_lt_en | 48 k |
| state_related_latvian_web | 1 M |
| greek_legal_code | 45 M |
| greek_web_corpus | 3 B |
| estonian_reference_corpus | 175 M |
| enc2021 | N/A |
| ekspress | N/A |
| euscrawl | 846 M |
| spanish_legal | 3 B |
| ylenews | N/A |
| sv_gigaword | 1 B |
| srpkor | N/A |
| marcell_legislative_subcorpus_v2 | 31 M |
| uk_laws | 579 M |
| eurlex | 121 B |
| legal_mc4 | 29 B |
| wiki | 12 B |
| wikibooks | 353 M |
| wikiquote | 268 M |
| wikinews | 79 M |
| wikisource | 2 B |
| wikivoyage | 132 M |
| colossal_oscar | 154 B |
| starcoder | 250 B |
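
The `reported_tokens` cells in the tables above are abbreviated strings such as `13 B`, `982 M`, `58 k`, or `N/A`. The following is a minimal sketch of how they could be converted to integers for sorting or summing. It assumes that `k`, `M`, and `B` mean thousand, million, and billion; this page does not define the suffixes. The function names `parse_reported_tokens` and `parse_table_row` are hypothetical helpers and are not part of the framework.

```python
from typing import Optional, Tuple

# Assumed multipliers: the page itself does not define these suffixes.
_SUFFIXES = {"k": 10**3, "M": 10**6, "B": 10**9}


def parse_reported_tokens(value: str) -> Optional[int]:
    """Convert a cell such as '13 B' to an integer, or None for 'N/A'."""
    value = value.strip()
    if value == "N/A":
        return None
    number, suffix = value.split()
    return int(number) * _SUFFIXES[suffix]


def parse_table_row(row: str) -> Tuple[str, Optional[int]]:
    """Split a two-column Markdown table row into (identifier, token count)."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    name, tokens = cells
    return name, parse_reported_tokens(tokens)


# A few rows copied from the data sources table.
rows = ["| eurlex | 121 B |", "| seimas_lt_en | 48 k |", "| wura | N/A |"]
parsed = dict(parse_table_row(row) for row in rows)
```

Because the cells appear to hold rounded figures, the resulting integers should be treated as approximate magnitudes rather than exact token counts.
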

This page is automatically generated.