Skip to content

Available datasets

# Datasets

The framework provides 2241 datasets from 62 sources in 164 languages. The languages are as follows: Afrikaans, Amharic, Aragonese, Arabic, Arz, Assamese, Ast, Avaric, Azerbaijani, Azb, Bashkir, Belarusian, Bulgarian, Bihari, Bengali, Tibetan, Bpy, Breton, Bosnian, Bxr, Catalan, Chechen, Ceb, Ckb, Code, Czech, Chuvash, Welsh, Danish, German, Dsb, Dhivehi, Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Western Frisian, Irish, Gaelic, Galician, Guaraní, Gom, Gsw, Gujarati, Hausa, Hebrew, Hindi, Croatian, Hsb, Haitian, Hungarian, Armenian, Interlingua, Indonesian, Interlingue, Igbo, Ilo, Ido, Icelandic, Italian, Japanese, Jbo, Javanese, Georgian, Kazakh, Khmer, Kannada, Korean, Krc, Kurdish, Komi, Cornish, Kirghiz, Latin, Luxembourgish, Lez, Limburgish, Lmo, Lao, Lithuanian, Latvian, Mai, Malagasy, Mhr, Min, Macedonian, Malayalam, Mongolian, Marathi, Mrj, Malay, Maltese, Multi, Mwl, Burmese, Mzn, Nah, Nds, Nepali, New, Dutch, Norwegian Nynorsk, Norwegian, Chichewa, Occitan, Oromo, Oriya, Ossetian, Panjabi, Polish, Pms, Pnb, Pashto, Portuguese, Quechua, Romanian, Russian, Kinyarwanda, Sanskrit, Sah, Sindhi, Serbo-Croatian, Sinhalese, Slovak, Slovene, Shona, Somali, Albanian, Serbian, Southern Sotho, Sundanese, Swedish, Swahili, Tamil, Telugu, Tajik, Thai, Tigrinya, Turkmen, Tagalog, Turkish, Tatar, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Volapük, Walloon, War, Wuu, X-Eml, Xal, Xhosa, Xmf, Yiddish, Yoruba, Chinese, Zu

Languages

Tokens by language

language reported_tokens
af N/A
am N/A
an N/A
ar N/A
arz N/A
as N/A
ast N/A
av N/A
az N/A
azb N/A
ba N/A
be N/A
bg 13 B
bh N/A
bn N/A
bo N/A
bpy N/A
br N/A
bs N/A
bxr N/A
ca 4 B
ce N/A
ceb N/A
ckb N/A
code 250 B
cs 21 B
cv N/A
cy N/A
da 11 B
de 26 B
dsb N/A
dv N/A
el 24 B
en 117 B
eo N/A
es 20 B
et 5 B
eu 982 M
fa N/A
fi 9 B
fr 60 B
fy N/A
ga 669 M
gd N/A
gl 36 M
gn N/A
gom N/A
gsw N/A
gu N/A
ha N/A
he N/A
hi N/A
hr 8 B
hsb N/A
ht N/A
hu 12 B
hy N/A
ia N/A
id N/A
ie N/A
ig N/A
ilo N/A
io N/A
is N/A
it 14 B
ja N/A
jbo N/A
jv N/A
ka N/A
kk N/A
km N/A
kn N/A
ko N/A
krc N/A
ku N/A
kv N/A
kw N/A
ky N/A
la N/A
lb N/A
lez N/A
li N/A
lmo N/A
lo N/A
lt 5 B
lv 4 B
mai N/A
mg N/A
mhr N/A
min N/A
mk N/A
ml N/A
mn N/A
mr N/A
mrj N/A
ms N/A
mt 4 B
multi N/A
mwl N/A
my N/A
mzn N/A
nah N/A
nds N/A
ne N/A
new N/A
nl 26 B
nn 301 M
no 5 B
ny N/A
oc N/A
om N/A
or N/A
os N/A
pa N/A
pl 25 B
pms N/A
pnb N/A
ps N/A
pt 24 B
qu N/A
ro 9 B
ru N/A
rw N/A
sa N/A
sah N/A
sd N/A
sh 58 k
si N/A
sk 18 B
sl 9 B
sn N/A
so N/A
sq N/A
sr 3 B
st N/A
su N/A
sv 13 B
sw N/A
ta N/A
te N/A
tg N/A
th N/A
ti N/A
tk N/A
tl N/A
tr N/A
tt N/A
ug N/A
uk 11 B
ur N/A
uz N/A
vi N/A
vo N/A
wa N/A
war N/A
wuu N/A
x-eml N/A
xal N/A
xh N/A
xmf N/A
yi N/A
yo N/A
zh N/A
zu N/A

Data sources

Tokens by source

source_id reported_tokens
curlicat 410 M
macocu 23 B
redpajama 46 B
wura N/A
wikihow 2 M
pes2o 42 B
proof_pile 8 B
pile_of_law N/A
math_amps N/A
edgarcorpus 7 B
bulgarian_news 283 M
bulnc 567 M
openlegaldata 10 B
dewac 2 B
ga_bilingual_legistation 4 M
ga_universal_dependencies 3 M
hrwac 1 B
styria_news 409 M
croatian_news_engri 695 M
itwac 2 B
korpus_malti 366 M
sonar 500 M
cc_gigafida 127 M
academic_slovene_kas 1 B
slwac_web 1 B
sk_court_decisions 11 B
sk_laws 45 M
syn_v9 5 B
cs_en_parallel N/A
danish_gigaword 1 B
danewsroom 472 M
dk_clarin 441 M
cabernet 712 M
norwegian_cc 5 B
pl_nkjp 1 M
pl_parliamentary_corpus 671 M
parlamento_pt 819 M
brwac 3 B
seimas_lt_en 48 k
state_related_latvian_web 1 M
greek_legal_code 45 M
greek_web_corpus 3 B
estonian_reference_corpus 175 M
enc2021 N/A
ekspress N/A
euscrawl 846 M
spanish_legal 3 B
ylenews N/A
sv_gigaword 1 B
srpkor N/A
marcell_legislative_subcorpus_v2 31 M
uk_laws 579 M
eurlex 121 B
legal_mc4 29 B
wiki 12 B
wikibooks 353 M
wikiquote 268 M
wikinews 79 M
wikisource 2 B
wikivoyage 132 M
colossal_oscar 154 B
starcoder 250 B

This page is automatically generated.