Dataset Card for "oscar"

Dataset Summary

OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated form.

The version here is the original OSCAR 2019 release: https://oscar-project.org/post/oscar-2019/

More recent versions are available from the oscar-corpus organization on the Hub.

Supported Tasks and Leaderboards

OSCAR is mainly intended for pretraining language models and word representations.

Languages

All the data is distributed by language, and both the original and the deduplicated versions are available. The corpus covers 166 different languages. The table in the subsection Data Splits Sample Size provides the language code for each subcorpus, as well as the number of words (space-separated tokens), the number of lines, and the size for both the original and the deduplicated versions of OSCAR.
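As a rough illustration of how the word and line counts mentioned above could be reproduced for a single document, here is a minimal sketch. The function name `count_words_and_lines` is hypothetical, and splitting on any whitespace is an assumption; the exact tokenization used to produce the published statistics may differ.

```python
def count_words_and_lines(text: str) -> tuple[int, int]:
    """Return (words, lines) for the text of one document."""
    # Assumption: a "word" is any run of non-whitespace characters,
    # which approximates the card's "space-separated tokens".
    words = len(text.split())
    # Lines are counted by splitting on line boundaries.
    lines = len(text.splitlines())
    return words, lines


# Hypothetical two-line sample built from words in the Afrikaans example.
sample = "aanlyn markte as gevolg\nvan ons voortgesette"
print(count_words_and_lines(sample))
```

Summing these per-document counts over a whole subcorpus would give figures comparable in spirit to the table, though not necessarily identical to it.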

Dataset Structure

We show detailed information for all the configurations of the dataset.

Data Instances

Data and size information for each language (deduplicated)

unshuffled_deduplicated_af

  • Size of downloaded dataset files: 65.99 MB
  • Size of the generated dataset: 172.30 MB
  • Total amount of disk used: 238.29 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel"
}
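The configuration names and the record layout shown in this section follow a regular pattern, which the sketch below makes explicit. Both helper names are hypothetical, and the `original` variant of the configuration name is an assumption inferred from the statement that the data is distributed in both original and deduplicated form; only the `deduplicated` names appear in this section.

```python
def config_name(lang: str, deduplicated: bool = True) -> str:
    """Build a configuration name such as 'unshuffled_deduplicated_af'."""
    # Assumption: the non-deduplicated variant is named "original".
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang}"


def is_valid_record(record: dict) -> bool:
    """Check that a record has exactly an integer 'id' and a string 'text'."""
    return (
        set(record) == {"id", "text"}
        and isinstance(record["id"], int)
        and isinstance(record["text"], str)
    )


# A shortened copy of the Afrikaans example shown above.
example = {"id": 0, "text": "aanlyn markte as gevolg van ons voortgesette"}
print(config_name("af"), is_valid_record(example))
```

A name built this way is what one would typically pass as the configuration argument when loading the dataset; the exact loading call depends on the version of the datasets library in use.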

unshuffled_deduplicated_als

  • Size of downloaded dataset files: 1.26 MB
  • Size of the generated dataset: 2.96 MB
  • Total amount of disk used: 4.22 MB
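The three figures listed for each configuration appear to be related: the total disk usage matches the sum of the downloaded and generated sizes (here 1.26 MB + 2.96 MB = 4.22 MB). The sketch below checks that arithmetic. The helper names are hypothetical, and it handles only sizes given in the same unit, since the card does not state how MB and GB are converted.

```python
from decimal import Decimal


def parse_size(size: str) -> tuple[Decimal, str]:
    """Split a string such as '1.26 MB' into (Decimal('1.26'), 'MB')."""
    value, unit = size.split()
    return Decimal(value), unit


def total_disk_used(downloaded: str, generated: str) -> str:
    """Add two sizes expressed in the same unit and format the result."""
    d_value, d_unit = parse_size(downloaded)
    g_value, g_unit = parse_size(generated)
    if d_unit != g_unit:
        # Unit conversion is deliberately left out of this sketch.
        raise ValueError("this sketch only adds sizes given in the same unit")
    return f"{d_value + g_value} {d_unit}"


print(total_disk_used("1.26 MB", "2.96 MB"))
```

Because the published figures are each rounded to two decimals, a few listed totals differ from the computed sum in the last digit.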

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"De Nazionalpark hät e Flächi vo 170,3 km² und isch dodemit s grösti Naturschutzgebiet vo de Schwiz. Er ligt uf em Gebiet vo de ..."
}

unshuffled_deduplicated_am

  • Size of downloaded dataset files: 61.35 MB
  • Size of the generated dataset: 216.15 MB
  • Total amount of disk used: 277.50 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"አየር መንገዱ ከአዲስ አበባ ወደ ሮም ጣሊያን በማምራት ላይ በነበረበት ጊዜ ረዳት አብራሪው የጉዞውን አቅጣጫ በመቀየር ጄኔቭ አውሮፓላን ማረፊያ በማሳረፍ እጁን ለፖሊስ ሰጥቷል።\\nየኢትዮጵያ መንግስት የ..."
}

unshuffled_deduplicated_an

  • Size of downloaded dataset files: 0.14 MB
  • Size of the generated dataset: 0.85 MB
  • Total amount of disk used: 0.99 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"واااااااأسفاه الأمم تفتخر ب 0 أمي ووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووو..."
}

unshuffled_deduplicated_ar

  • Size of downloaded dataset files: 9.67 GB
  • Size of the generated dataset: 33.57 GB
  • Total amount of disk used: 43.23 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"مرحبا بك عزيز الزائر نتمنى لك أوقاتاً سعيدة معنا وأن نزداد شرفا بخدمتك ولا تنسى التسجيل معنا لتستفيد بكل جديد\\nأهلا وسهلا بك زا..."
}

unshuffled_deduplicated_arz

  • Size of downloaded dataset files: 10.02 MB
  • Size of the generated dataset: 35.91 MB
  • Total amount of disk used: 45.94 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"بنى عجل : قبيلة من عجل بن لجيم بن صعب بن على بن بكر بن وائل انتقل اغلبهم الى البصرة فى العراق و اصفهان و خراسان فى ايران و اذرب..."
}

unshuffled_deduplicated_as

  • Size of downloaded dataset files: 15.51 MB
  • Size of the generated dataset: 74.07 MB
  • Total amount of disk used: 89.58 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"আমি, এই সংগঠনৰ সদস্য সকলে একেলগ হৈ অসমকে ধৰি ভাৰতৰ উত্তৰ পূৰ্বাঞ্চলৰ অমূল্য কলা-সাংস্কৃতিক সম্পদৰাজি বৃহত্তৰ অষ্ট্ৰেলিয়াৰ সন্মু..."
}

unshuffled_deduplicated_ast

  • Size of downloaded dataset files: 0.86 MB
  • Size of the generated dataset: 2.17 MB
  • Total amount of disk used: 3.03 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"The Killers llanzaron el so álbum debú, Hot Fuss, en xunu de 2004 nel Reinu Xuníu, al traviés de la discográfica Lizard King, y..."
}

unshuffled_deduplicated_av

  • Size of downloaded dataset files: 0.07 MB
  • Size of the generated dataset: 0.34 MB
  • Total amount of disk used: 0.41 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Жинда малъараб ва божизе бегьулеб рагІудаса кьуризе бегьуларо гьев. Гьес насихІат гьабизе кколелъул бацІцІадаб диналъул рахъалъ..."
}

unshuffled_deduplicated_az

  • Size of downloaded dataset files: 521.74 MB
  • Size of the generated dataset: 1.53 GB
  • Total amount of disk used: 2.05 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"AZTV-Artıq 7 ildir ki, Abşeron rayonu dotasiya almadan bütün xərclərini yerli daxilolmalar hesabına maliyyələşdirir.\\nDünən, 10..."
}

unshuffled_deduplicated_azb

  • Size of downloaded dataset files: 5.19 MB
  • Size of the generated dataset: 20.08 MB
  • Total amount of disk used: 25.27 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"لعلی ١٣-جو عصرده یاشاییب یاراتمیش گؤرکملی آذربایجان شاعرلریندندیر. ١٢٢٤-جی ایلده تبریزده آنادان اولموشدور، گنج یاشلاریندا تیجار..."
}

unshuffled_deduplicated_ba

  • Size of downloaded dataset files: 25.98 MB
  • Size of the generated dataset: 93.84 MB
  • Total amount of disk used: 119.82 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Күҙәтеү ҡуласаһы моделен хәҙер Мифтахетдин Аҡмулла исемендәге Башҡорт дәүләт педагогия университетында ла эшләргә мөмкин\\t\\nКүҙ..."
}

unshuffled_deduplicated_bar

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "                                                                                                                                          vo"
}

unshuffled_deduplicated_bcl

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"& ÿ ó / í 0 - ø û ù ö ú ð ï ú \\u0014 ù þ ô ö í ÷ ò \\u0014 ÷ í ù û ö í \\u0001 û ñ ç þ \\u0001 ð \\u0007 þ ò ñ ñ ò ô \\u0017 û ö ô ÷..."
}

unshuffled_deduplicated_be

  • Size of downloaded dataset files: 306.70 MB
  • Size of the generated dataset: 1.08 GB
  • Total amount of disk used: 1.39 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Брэсцкія ўлады не дазволілі прафсаюзу РЭП правесці пікетаванне ў парку Воінаў-інтэрнацыяналістаў 30 мая 2018 года.\\nСітуацыю пр..."
}

unshuffled_deduplicated_bg

  • Size of downloaded dataset files: 3.85 GB
  • Size of the generated dataset: 14.45 GB
  • Total amount of disk used: 18.30 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ЖАЛБОПОДАТЕЛЯТ директор на Дирекция „ Обжалване и данъчно-осигурителна практика“- Бургас, редовно призован, се представлява от ..."
}

unshuffled_deduplicated_bh

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.04 MB
  • Total amount of disk used: 0.04 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"सुकमा जिला भारत के छत्तीसगढ़ राज्य में एगो जिला बाटे। एकर मुख्यालय सुकमा शहर बाटे। एकर कुल रकबा 5636 वर्ग कि॰मी॰ बाटे।\"..."
}

unshuffled_deduplicated_bn

  • Size of downloaded dataset files: 1.26 GB
  • Size of the generated dataset: 6.24 GB
  • Total amount of disk used: 7.50 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ভড়ং সর্বস্ব বাংলা আর্ট অ্যান্ড কালচারের হিসাব গুলিয়ে দেওয়ার ম্যাজিকের নাম ব্রাত্য রাইসু November 23, 2017\\nTagged with ডায়োজিনি..."
}

unshuffled_deduplicated_bo

  • Size of downloaded dataset files: 22.37 MB
  • Size of the generated dataset: 144.65 MB
  • Total amount of disk used: 167.02 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"བོད་མི་འདི་དག་ནི་རང་རྒྱུད་སྒོ་རུ་ཕུད་དེ་གཞན་རྒྱུད་པང་དུ་ཉར་ནས་གསོ་སྐྱོང་བྱེད་དགོས་ཟེར་བ་དང་གཅིག་མཚུངས་རེད།\\nཚན་རིག་ནི་དང་ཐོག་རང..."
}

unshuffled_deduplicated_bpy

  • Size of downloaded dataset files: 0.19 MB
  • Size of the generated dataset: 1.78 MB
  • Total amount of disk used: 1.97 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"পৌরসভা এহার আয়তন (লয়াহান) ২,৭৩০,.৬৩ বর্গ কিলোমিটার। পৌরসভা এহার মাপাহানর অক্ষাংশ বারো দ্রাঘিমাংশ ইলতাই 18.63° S 48.18° W ।[১]..."
}

unshuffled_deduplicated_br

  • Size of downloaded dataset files: 6.47 MB
  • Size of the generated dataset: 17.00 MB
  • Total amount of disk used: 23.47 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Ar mank Magalhães(Daveoù a vank) a zo ur spesad evned, Spheniscus magellanicus an anv skiantel anezhañ.\\nGallout a reer implijo..."
}

unshuffled_deduplicated_bs

  • Size of downloaded dataset files: 0.04 MB
  • Size of the generated dataset: 0.15 MB
  • Total amount of disk used: 0.18 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ž šř é ú šř šř ě šř ž é č ě ž ů ě ď éé ýš ě ě Ž č š ý ě ď é ýš ě ď ě éé ýš ě č ž ě š ý ď ě ýš é ú č ž č š ý ď ý ž é éě ď é č ýš..."
}

unshuffled_deduplicated_bxr

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.01 MB
  • Total amount of disk used: 0.01 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"2002 оной хабар буряад хэлэ бэшэгэй һалбари Үндэһэтэнэй хүмүүнлиг ухаанай дээдэ һургуули болгогдожо өөршэлэгдөө.\\nХарин мүнөө б..."
}

unshuffled_deduplicated_ca

  • Size of downloaded dataset files: 1.73 GB
  • Size of the generated dataset: 4.57 GB
  • Total amount of disk used: 6.30 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Daniel Vendrell, conegut com Vandrell, ha sigut un dels il•lustradors contemporanis més influents, representant a la nova onada..."
}

unshuffled_deduplicated_cbk

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano..."
}

unshuffled_deduplicated_ce

  • Size of downloaded dataset files: 1.87 MB
  • Size of the generated dataset: 7.04 MB
  • Total amount of disk used: 8.90 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Шаьш анархисташ ду бохучу жигархойн дIахьедарехь дуьйцу, оьрсийн ницкъаллийн структурийн а, федералан каналан а Iалашонаш \\\"мар..."
}

unshuffled_deduplicated_ceb

  • Size of downloaded dataset files: 7.12 MB
  • Size of the generated dataset: 24.83 MB
  • Total amount of disk used: 31.95 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Si Isko walay pupamilok nga nagtan-aw sa unahan, natugaw. “Naunsa ka gud diha Isko nga layo man kaayo ang imong panan-aw?” ni I..."
}

unshuffled_deduplicated_ckb

  • Size of downloaded dataset files: 60.32 MB
  • Size of the generated dataset: 237.72 MB
  • Total amount of disk used: 298.05 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"رسی رۆژ - ساڵێک دوای بومەلەرزەی کرماشان میوانی بەرنامە : کاک سیاوەش حەیاتی چالاکی مەدەنی -قەسری شیرین\\nپارچە موزیک 30 / 10 / 20..."
}

unshuffled_deduplicated_cs

  • Size of downloaded dataset files: 10.49 GB
  • Size of the generated dataset: 25.71 GB
  • Total amount of disk used: 36.20 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Akce anarchistů proti připravovanému novému služební řádu a nízkým mzdám 1903 – Historie českého anarchismu (1880 – 1939)\\nRost..."
}

unshuffled_deduplicated_cv

  • Size of downloaded dataset files: 7.47 MB
  • Size of the generated dataset: 27.49 MB
  • Total amount of disk used: 34.95 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Шыранӑ чухне ӑнсӑртран латин кирилл саспаллисем вырӑнне латин саспаллисене ҫырсан, сайт эсир ҫырнине юсама тӑрӑшӗ.\\nКу сайтра ч..."
}

unshuffled_deduplicated_cy

  • Size of downloaded dataset files: 53.63 MB
  • Size of the generated dataset: 141.22 MB
  • Total amount of disk used: 194.86 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Mae capeli Cymreig yr Andes ym Mhatagonia wedi cyhoeddi na fydd gwasanaethau yno weddill y mis, oherwydd yr eira trwm sydd wedi..."
}

unshuffled_deduplicated_da

  • Size of downloaded dataset files: 3.82 GB
  • Size of the generated dataset: 10.24 GB
  • Total amount of disk used: 14.06 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Den 2.-5. februar 2016 løb det tredje kursus i uddannelsen af 4kommunesamarbejdets Local Impact Coaches, af stablen i Gentofte ..."
}

unshuffled_deduplicated_de

  • Size of downloaded dataset files: 60.80 GB
  • Size of the generated dataset: 156.30 GB
  • Total amount of disk used: 217.10 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Auf dieser Seite gibt es mind. ein YouTube Video. Cookies für diese Website wurden abgelehnt. Dadurch können keine YouTube Vide..."
}

unshuffled_deduplicated_diq

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "Zıwanê Slawki, zıwano merdumanê Slawano. Zıwanê Slawki yew lızgeyê Zıwananê Hind u Ewropao. Keyeyê Zıwananê Slawki beno hirê letey:"
}

unshuffled_deduplicated_dsb

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.01 MB
  • Total amount of disk used: 0.01 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Pśiklaskaju južo pśed pśedstajenim... 1500 źiśi njamóžo wěcej docakaś, měsćańska hala w Chóśebuzu - wupśedana."
}

unshuffled_deduplicated_dv

  • Size of downloaded dataset files: 16.84 MB
  • Size of the generated dataset: 82.19 MB
  • Total amount of disk used: 99.03 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ބ. އަތޮޅުގައި ހުޅުވަން ތައްޔާރުވަމުން އަންނަ ވައްކަރު ރިސޯޓުގައި ވަޒީފާ އަދާކުރަން ޝައުގުވެރިވާ ފަރާތްތަކަށް ކުރިމަތިލުމުގެ ފުރ..."
}

unshuffled_deduplicated_el

  • Size of downloaded dataset files: 7.91 GB
  • Size of the generated dataset: 28.74 GB
  • Total amount of disk used: 36.65 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Νεκρός εντοπίστηκε μέσα στο σπίτι του στην οδό Ηρώδου Αττικού στον αριθμό 7 ο επικεφαλής του προξενικού τμήματος της Ρωσικής πρ..."
}

unshuffled_deduplicated_eml

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.02 MB
  • Total amount of disk used: 0.03 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"A séguit dal prucès ad rubutiśasiòṅ di abitànt dal pòpul ad Mikenes, Angoras 'l è finî dènt'r a 'n robot cun la tèsta dna rana ..."
}

unshuffled_deduplicated_en

  • Size of downloaded dataset files: 496.50 GB
  • Size of the generated dataset: 1299.75 GB
  • Total amount of disk used: 1796.24 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visi..."
}

unshuffled_deduplicated_eo

  • Size of downloaded dataset files: 92.86 MB
  • Size of the generated dataset: 240.12 MB
  • Total amount of disk used: 332.99 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Ĉu ... preĝi | mediti | ricevi instigojn || kanti | muziki || informiĝi | legi | studi || prepari Diservon\\nTemas pri kolekto d..."
}

unshuffled_deduplicated_es

  • Size of downloaded dataset files: 60.46 GB
  • Size of the generated dataset: 160.86 GB
  • Total amount of disk used: 221.32 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Como se librará de la celulitis en el gimnasio La piel superflua en las manos después del adelgazamiento, Los bailes fáciles pa..."
}

unshuffled_deduplicated_et

  • Size of downloaded dataset files: 966.79 MB
  • Size of the generated dataset: 2.45 GB
  • Total amount of disk used: 3.41 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"MTÜ AB Video järgib oma tegevuses kodanikuühenduste eetilise tegevuse üldtunnustatud põhimõtteid, mis on lühidalt kokkuvõetud 7..."
}

unshuffled_deduplicated_eu

  • Size of downloaded dataset files: 134.68 MB
  • Size of the generated dataset: 363.93 MB
  • Total amount of disk used: 498.61 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "Gure jarduerek eraikuntzarekin, elkarbizitzarekin, hirigintzarekin eta ekologiarekin dute harremana, baita ideia eta konponbideak irudikatu eta garatzearekin ere, eraikuntza sektorea hobetuz, pertsonen erosotasuna eta bizi-kalitatea hobetzeko."
}

unshuffled_deduplicated_fa

  • Size of downloaded dataset files: 10.46 GB
  • Size of the generated dataset: 40.06 GB
  • Total amount of disk used: 50.52 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"قـــــــــــــــــرار بود با هم کنـــــــــــــار بیایم نه اینکه از کنــــــــــــار هم رد بشیم...!!!\\nاگر روزی دلت لبریز غم بو..."
}

unshuffled_deduplicated_fi

  • Size of downloaded dataset files: 5.38 GB
  • Size of the generated dataset: 13.99 GB
  • Total amount of disk used: 19.37 GB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Kiitos Deelle kaikesta - 1,5 viikkoa kulunut, kun Dee ei ole enää ollut omani. Reilu viikko sitten sunnuntaina vein Deen uuteen kotiinsa. Itselläni on ollut niin ristiriitaiset t..."
}

unshuffled_deduplicated_fr

  • Size of downloaded dataset files: 55.46 GB
  • Size of the generated dataset: 148.28 GB
  • Total amount of disk used: 203.75 GB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "Média de débat d'idées, de culture et de littérature. Récits, décryptages, analyses, portraits et critiques autour de la vie des idées. Magazine engagé, ouvert aux autres et au monde.. Bring up to date in french"
}

unshuffled_deduplicated_frr

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Hiragana’ Practice’Sheet’1’(A -O)’ ’ Name:’________ __________________________’Section:’_______________ _’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ..."
}

unshuffled_deduplicated_fy

  • Size of downloaded dataset files: 10.27 MB
  • Size of the generated dataset: 26.73 MB
  • Total amount of disk used: 37.00 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Nim in sêfte ride op Holmsjön, yn ien fan 'e lytse marren yn de omkriten, of nim se op avontueren lykas nonresidential. lâns Indalsälven wetter. Holm Sportklubb hawwe kano 's te huur, yn gearwurking mei de Baltyske Power konferinsje."
}

unshuffled_deduplicated_ga

  • Size of downloaded dataset files: 22.22 MB
  • Size of the generated dataset: 63.86 MB
  • Total amount of disk used: 86.08 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Is fóram é seo chun plé a dhéanamh ar an leabhar atá roghnaithe do mhí na Samhna 2013 amháin. Ní féidir ach le baill chláraithe..."
}

unshuffled_deduplicated_gd

  • Size of downloaded dataset files: 0.42 MB
  • Size of the generated dataset: 1.36 MB
  • Total amount of disk used: 1.78 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "Zhou Yujun, a 'phàrtaidh Rùnaire Comataidh Sgìre Yanfeng ann Hengyang bhaile agus a Sgìre pàrtaidh agus an riaghaltas a' bhuidheann-riochdachaidh a 'tighinn a chèilidh air ar companaidh air Apr. 14, 2017."
}

unshuffled_deduplicated_gl

  • Size of downloaded dataset files: 155.85 MB
  • Size of the generated dataset: 408.34 MB
  • Total amount of disk used: 564.19 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"O persoal de Inditex da provincia de Pontevedra segue a reclamar iguais condicións laborais no conxunto do país - CIG: Confeder..."
}

unshuffled_deduplicated_gn

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.02 MB
  • Total amount of disk used: 0.03 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"º ѐÆÚÓ À Ã Ð É Æ ¾ Ä ΠÀ ¼ Æ É ÄÛ = Ü Ý\\\"Þ ß†à á â ã ä å æçè ã é ê â å àë ì æê íî é á ë ï í çì àð í Ü à ñ ê é ò ä ì\"..."
}

unshuffled_deduplicated_gom

  • Size of downloaded dataset files: 0.38 MB
  • Size of the generated dataset: 1.87 MB
  • Total amount of disk used: 2.24 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"दुष्ट शीळ हें कौरवांचें । रामें सविस्तर देखूनि साचें । बोलिले वचनें जें दुर्वाचे । करी तयांचें अनुस्मरण ॥२२०॥\"..."
}

unshuffled_deduplicated_gu

  • Size of downloaded dataset files: 162.97 MB
  • Size of the generated dataset: 759.34 MB
  • Total amount of disk used: 922.32 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"અધિક માસ ચાલે છે. સમગ્ર ભારતમાં અને તેમાંય ખાસ કરીને પવિત્ર કે ધાર્મિક કહેવાય છે તેવા સ્થાનક પર કથાનો દોર ચાલે છે. ઉનાળાની કાળઝ..."
}

unshuffled_deduplicated_he

  • Size of downloaded dataset files: 3.04 GB
  • Size of the generated dataset: 10.47 GB
  • Total amount of disk used: 13.51 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"זקוקים לרשתות נגד יתושים? מחפשים רשת מתאימה לחלון צר וקטן? רשתות נגד יתושים אקורדיון של חברת קליר-מש הן הפתרון.\\nרשתות לחלונות ..."
}

unshuffled_deduplicated_hi

  • Size of downloaded dataset files: 2.01 GB
  • Size of the generated dataset: 9.57 GB
  • Total amount of disk used: 11.58 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"'आइटम गर्ल' बनकर हिट हुई थीं राखी सावंत, आज करीना-कटरीना तक फॉलो कर रही हैं ट्रेंड नक्‍सलियों का दम निकालेगा बाइक ग्रेनेड लॉन्च..."
}

unshuffled_deduplicated_hr

  • Size of downloaded dataset files: 46.74 MB
  • Size of the generated dataset: 121.50 MB
  • Total amount of disk used: 168.23 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"U raspravi je sudjelovao i HSS-ov saborski zastupnik rekavši kako poljoprivrednici ne osjete mjere o kojima ministar govori jer..."
}

unshuffled_deduplicated_hsb

  • Size of downloaded dataset files: 0.72 MB
  • Size of the generated dataset: 1.89 MB
  • Total amount of disk used: 2.61 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Budyšin (SN/BŠe). Elektronikarjo mějachu lětsa cyle hinaši zazběh do swojeho wukubłanja. Wokrjesne rjemjeslnistwo bě mjenujcy w..."
}

unshuffled_deduplicated_ht

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan..."
}

unshuffled_deduplicated_hu

  • Size of downloaded dataset files: 7.37 GB
  • Size of the generated dataset: 19.09 GB
  • Total amount of disk used: 26.46 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"monster - Amatőr, házi szex videók és kezdő csjaok pornó filmjei. - Free amateur, home made sex videos and online porn movies. ..."
}

unshuffled_deduplicated_hy

  • Size of downloaded dataset files: 393.62 MB
  • Size of the generated dataset: 1.56 GB
  • Total amount of disk used: 1.96 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Արցախի Հանրապետության հռչակման 26-րդ տարեդարձի կապակցությամբ Շուշիի Արվեստի կենտրոնում կազմակերպվել է մոսկվաբնակ նկարիչներ՝ հայ..."
}

unshuffled_deduplicated_ia

  • Size of downloaded dataset files: 0.05 MB
  • Size of the generated dataset: 0.38 MB
  • Total amount of disk used: 0.43 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha h..."
}

unshuffled_deduplicated_id

  • Size of downloaded dataset files: 6.00 GB
  • Size of the generated dataset: 17.05 GB
  • Total amount of disk used: 23.05 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Perihal dari itu, kalau kunci hal yang demikian hilang, pemilik wajib melapor ke bengkel sah untuk dibuatkan kunci baru dengan ..."
}

unshuffled_deduplicated_ie

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "Plastic Yo Yo Metal Yo Yos Wooden Yo Yo Keychain Yo Yo Translucent Yo Yo Light Up Yo Yo Globe Yo Yo Stress Reliever Yo Yo Jellyfish Yo Yo Sports Ball Yo Yo Sound Yo Yo Miniature Yo Yo Promotional Yo Yo Novelty Yo Yo Video Game Yo Yo ECO Recycled Yo Yo"
}

unshuffled_deduplicated_ilo

  • Size of downloaded dataset files: 0.23 MB
  • Size of the generated dataset: 0.68 MB
  • Total amount of disk used: 0.91 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Segun ken ni Ping-ay, ti yellow corn ti maysa kadagiti nadakamat a liberalized agricultural commodity iti daytoy a free trade k..."
}

unshuffled_deduplicated_io

  • Size of downloaded dataset files: 0.04 MB
  • Size of the generated dataset: 0.14 MB
  • Total amount of disk used: 0.19 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Chekia esas parlamentala republiko. La chefo di stato esas la prezidanto. Til 2013 lu elektesis dal parlamento. Pos ta yaro, ol..."
}

unshuffled_deduplicated_is

  • Size of downloaded dataset files: 332.87 MB
  • Size of the generated dataset: 894.28 MB
  • Total amount of disk used: 1.23 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Eyjar.net - upplýsinga- og fréttamiðill um Vestmannaeyjar - Fréttir - Nái núverandi stefna stjórnvalda fram að ganga mun það va..."
}

unshuffled_deduplicated_it

  • Size of downloaded dataset files: 27.93 GB
  • Size of the generated dataset: 74.09 GB
  • Total amount of disk used: 102.03 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Jaundice - causes, treatment & pathology massaggio a osteochondrosis dellindizio di una controindicazione\\nTrattamento su un co..."
}

unshuffled_deduplicated_ja

  • Size of downloaded dataset files: 40.80 GB
  • Size of the generated dataset: 113.63 GB
  • Total amount of disk used: 154.44 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"神社などへ一緒に同行して、様々な角度のショットで家族写真やお子様の写真を撮影致します!お好みに合わせて様々な写真を取ることができますので、その場でカメラマンへのリクエストも可能です!お子様の晴れ姿を、緊張していない自然な笑顔で残しませんか?\\n※七五三の..."
}

unshuffled_deduplicated_jbo

  • Size of downloaded dataset files: 0.20 MB
  • Size of the generated dataset: 0.70 MB
  • Total amount of disk used: 0.91 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "ni'o 23 la cimast. cu 23moi djedi fi'o masti la cimast. noi ke'a cu cimoi masti .i 22 la cimast. cu purlamdei .ije 24 la cimast. cu bavlamdei"
}

unshuffled_deduplicated_jv

  • Size of downloaded dataset files: 0.21 MB
  • Size of the generated dataset: 0.62 MB
  • Total amount of disk used: 0.82 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"José Mourinho (diwaca: [ʒuˈzɛ moˈɾiɲu]; lair ing Setubal, Portugal, 26 Januari 1963; umur 55 taun) iku salah siji pelatih bal k..."
}

unshuffled_deduplicated_ka

  • Size of downloaded dataset files: 377.23 MB
  • Size of the generated dataset: 1.99 GB
  • Total amount of disk used: 2.36 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"წამიყვანე შენთან ერთად (ქართულად) / Возьми меня с собой (картулад) / (რუსული სერიალები ქართულად) (რუსების პორნო ონლაინში) (ruse..."
}

unshuffled_deduplicated_kk

  • Size of downloaded dataset files: 389.12 MB
  • Size of the generated dataset: 1.59 GB
  • Total amount of disk used: 1.97 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Түлкібас ауданында «Латын негізді әліпби мен емле ережесі туралы насихат» жобасының тобы семинар өткізді\\nЕлорданың «Қазақстан»..."
}

unshuffled_deduplicated_km

  • Size of downloaded dataset files: 114.48 MB
  • Size of the generated dataset: 610.61 MB
  • Total amount of disk used: 725.09 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ខ្សឹបដាក់ត្រចៀក៖ លោក សួស សុផានិត នាយផ្នែករដ្ឋបាលព្រៃឈើ ស្រុកភ្នំក្រវាញ់ ដែលទើបឡើងកាន់តំណែងថ្មី បើកដៃឲ្យឈ្នួញ ប្រព្រឹត្តបទល្មើស ..."
}

unshuffled_deduplicated_kn

  • Size of downloaded dataset files: 215.52 MB
  • Size of the generated dataset: 1.08 GB
  • Total amount of disk used: 1.30 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ರಾಷ್ಟ್ರಪತಿ ಪ್ರಣಬ್ ಮುಖರ್ಜಿಯಿಂದ ಪದ್ಮ ಪ್ರಶಸ್ತಿ ಪ್ರದಾನ | President Pranab Mukherjee Confers Padma Awards | Photo Gallery on Kannada..."
}

unshuffled_deduplicated_ko

  • Size of downloaded dataset files: 4.46 GB
  • Size of the generated dataset: 12.00 GB
  • Total amount of disk used: 16.47 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"CIA 프로젝트에서는 데이터베이스로 들어오는 요청을 중간에 수집(Sniffing)하고 수집한 데이터를 분석(Parsing)하여 그로 인한 결과를 판단하여 알릴 수 있는 시스템(Push Service)이 필요하다. 그리고 연구를 ..."
}

unshuffled_deduplicated_krc

  • Size of downloaded dataset files: 0.62 MB
  • Size of the generated dataset: 2.41 MB
  • Total amount of disk used: 3.03 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Шамханланы, Бийлени къаршысына ябушуп, Батыр уланларыбызны къоллары булан «ортакъ ожакъ» къургъанбыз. Шо иш уллу зараллы иш бол..."
}

unshuffled_deduplicated_ku

  • Size of downloaded dataset files: 23.34 MB
  • Size of the generated dataset: 63.09 MB
  • Total amount of disk used: 86.43 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Me di 114 bernameyên xwe yên berê da perçeyên ji berhemên zanyarî yên kurdzanên mezin bi wergera kurdî da ...\\nMe di 114 bernam..."
}

unshuffled_deduplicated_kv

  • Size of downloaded dataset files: 0.33 MB
  • Size of the generated dataset: 1.21 MB
  • Total amount of disk used: 1.54 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Коми кытшыслӧн ыджытжык тор вӧр увтын куйлӧ, сійӧн и фаунасӧ татӧн аркмӧтӧны вӧрын олісь подаэз. Ассямаӧн лоӧ сія, мый кытшас с..."
}

unshuffled_deduplicated_kw

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.02 MB
  • Total amount of disk used: 0.02 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼Pray without ceasing🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏..."
}

unshuffled_deduplicated_ky

  • Size of downloaded dataset files: 106.22 MB
  • Size of the generated dataset: 408.40 MB
  • Total amount of disk used: 514.61 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Turmush: Бишкек шаардык кеңешинин кезексиз отурумунда мэрге ишенбөөчүлүк көрсөтүү маселеси каралат, - депутат Т.Сагынов\\nБишкек..."
}

unshuffled_deduplicated_la

  • Size of downloaded dataset files: 3.42 MB
  • Size of the generated dataset: 9.79 MB
  • Total amount of disk used: 13.22 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.\\nEcce ego adducam aqua..."
}

unshuffled_deduplicated_lb

  • Size of downloaded dataset files: 8.30 MB
  • Size of the generated dataset: 21.42 MB
  • Total amount of disk used: 29.72 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Während dem Gaardefestival \\\"Ambiance Jardins\\\" vum 15. bis de 17. Mee huet den SNJ nees zesumme mam Groupe Animateur en Inform..."
}

unshuffled_deduplicated_lez

  • Size of downloaded dataset files: 0.77 MB
  • Size of the generated dataset: 3.08 MB
  • Total amount of disk used: 3.84 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Ахцегь хуьр, виридалай ч1ехи лезги хуьрерикая я. Ам Урусатдин виридалай къиблепатавай хуьрерикай я. Ин хуьр...\"..."
}

unshuffled_deduplicated_li

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.03 MB
  • Total amount of disk used: 0.04 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"'t Good Goedenraad aan de Ezerbaek besjteit oet 'n kesjtièl mèt gesjlote haof en 'n park van 26 hectare. Hie in sjtoon väól beu..."
}

unshuffled_deduplicated_lmo

  • Size of downloaded dataset files: 0.10 MB
  • Size of the generated dataset: 0.46 MB
  • Total amount of disk used: 0.57 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Serét (en tortonés: Sregh; en piemontés: Srèj) l'è 'n cümü italià, de la regiù del Piemónt, en Pruvìncia de Alessandria. El g'h..."
}

unshuffled_deduplicated_lo

  • Size of downloaded dataset files: 23.63 MB
  • Size of the generated dataset: 119.29 MB
  • Total amount of disk used: 142.92 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"ຜູ້ພິພາກສາ ປະຈຳເຂດ ສຫລ ທ່ານນຶ່ງ ຕັດສິນວ່າ ໂຄງການເກັບກຳຂໍ້ມູນ ທາງໂທລະສັບ ຂອງອົງການ ຄວາມໝັ້ນຄົງແຫ່ງຊາດ ແມ່ນຖືກຕ້ອງ ຕາມກົດໝາຍ.\\nກະ..."
}

unshuffled_deduplicated_lrc

  • Size of downloaded dataset files: 0.02 MB
  • Size of the generated dataset: 0.06 MB
  • Total amount of disk used: 0.08 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"آرلینگتون یئ گئل د شأریا ڤولاتچە ڤیرجینیا و یئ گئل د شأریا ڤولات ڤولاتچە یا یأکاگئرئتە ئمریکاە. ئی شأر دویومی کألوٙن شأر د راسا..."
}

unshuffled_deduplicated_lt

  • Size of downloaded dataset files: 1.65 GB
  • Size of the generated dataset: 4.20 GB
  • Total amount of disk used: 5.86 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Čir vir vir pavasaris! Čia čia čia… dalinamės labai simpatiška video pamokėle, kurią pristato ab888art galerija.\\nBe galo papra..."
}

unshuffled_deduplicated_lv

  • Size of downloaded dataset files: 710.45 MB
  • Size of the generated dataset: 1.91 GB
  • Total amount of disk used: 2.62 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Dekoratīvi sliekšņi MITSUBISHI OUTLANDER 2007, izgatavoti no ovālas formas, pulētas nerūsējošā tērauda caurules...\\ndažādas tūn..."
}

unshuffled_deduplicated_mai

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.01 MB
  • Total amount of disk used: 0.01 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"१ · २ · ३ · ४ · ५ · ६ · ७ · ८ · ९ · १० · ११ · १२ · १३ · १४ · १५ · १६ · १७ · १८ · १९ · २० · २१ · २२ · २३ · २४ · २५ · २६ · २७ · २..."
}

unshuffled_deduplicated_mg

  • Size of downloaded dataset files: 4.30 MB
  • Size of the generated dataset: 13.59 MB
  • Total amount of disk used: 17.89 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Nanamboatra taratasy apetaka sy soso-kevitra ho an'ny olona te-hanatevin-daharana ity fihetsiketsehana ity i Anocrena.\\nNosorat..."
}

unshuffled_deduplicated_mhr

  • Size of downloaded dataset files: 1.63 MB
  • Size of the generated dataset: 6.26 MB
  • Total amount of disk used: 7.89 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Акрет жап годым Уганда кундемым Пигмей племена- влак айлен шогеныт. мемнан эран 1 курым гыч Банту племена влакат тиде кундемышк..."
}

unshuffled_deduplicated_min

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.31 MB
  • Total amount of disk used: 0.33 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ..."
}

unshuffled_deduplicated_mk

  • Size of downloaded dataset files: 303.12 MB
  • Size of the generated dataset: 1.19 GB
  • Total amount of disk used: 1.49 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"„Филм плус“ е насловен првиот филмски месечник во Македонија, чиј прв број ќе биде промовиран вечер во „Менада“. Новото македон..."
}

unshuffled_deduplicated_ml

  • Size of downloaded dataset files: 496.80 MB
  • Size of the generated dataset: 2.69 GB
  • Total amount of disk used: 3.18 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"സ്ത്രീ പ്രവേശനം സര്‍ക്കാര്‍ പൂര്‍ണമായും അംഗീകരിക്കുന്നുവെന്നും ശബരിമലയുടെ സുരക്ഷയില്‍ ഇടപെടുമെന്നും സര്‍ക്കാര്‍ ഹൈക്കോടതിയില്‍\\..."
}

unshuffled_deduplicated_mn

  • Size of downloaded dataset files: 219.52 MB
  • Size of the generated dataset: 883.46 MB
  • Total amount of disk used: 1.10 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"МУБИС-ын багш мэргэжлийн хөрвөх сургалтыг төгссөн багшид багшлах эрх олгох тухай ~ БМДИ-ийн захирлын тушаал - Багшийн мэргэжил ..."
}

unshuffled_deduplicated_mr

  • Size of downloaded dataset files: 299.68 MB
  • Size of the generated dataset: 1.49 GB
  • Total amount of disk used: 1.79 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Home / motivational marathi story / उद्योजकता (Entrepreneurship) / यांना हे जमलय, तर आपल्याला का नाही जमणार ?\\nयापैकी कोणाचीही ..."
}

unshuffled_deduplicated_mrj

  • Size of downloaded dataset files: 0.29 MB
  • Size of the generated dataset: 1.10 MB
  • Total amount of disk used: 1.38 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Лӹпӹвлӓ (латинлӓ Lepidoptera ; алыкмарла лыве-влак) — капшангывлӓ йыхыш пырышы сӱмӓн нӹл шылдыран капшангывлӓ. Цилӓжӹ 180000 тӹ..."
}

unshuffled_deduplicated_ms

  • Size of downloaded dataset files: 16.39 MB
  • Size of the generated dataset: 49.45 MB
  • Total amount of disk used: 65.85 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Sanad pertama daripada Zuhair bin Harb daripada ‘Affan daripada Hammad daripada Thabit daripada Anas.\\nSanad kedua daripada ‘Ab..."
}

unshuffled_deduplicated_mt

  • Size of downloaded dataset files: 5.90 MB
  • Size of the generated dataset: 17.68 MB
  • Total amount of disk used: 23.58 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "tibgħat il-kawża lura lill-Qorti Ġenerali għall-annullament jew għat-tnaqqis tal-penalità imposta mill-Kummissjoni bid-deċiżjoni inizjali kif emendata bid-deċiżjoni ta’ rettifika;"
}

unshuffled_deduplicated_mwl

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Deciplina social i outónoma que angloba atebidades de ouserbaçon, de análeze, de çcriçon, cumparaçon, de sistematizaçon i de sp..."
}

unshuffled_deduplicated_my

  • Size of downloaded dataset files: 207.14 MB
  • Size of the generated dataset: 1.11 GB
  • Total amount of disk used: 1.32 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ျမ၀တီ - ရန္ကုန္တိုင္းေဒသႀကီး ေျမာက္ဥကၠလာပႏွင္႕ ဗဟန္းၿမိဳ႔နယ္ မေကြးတိုင္း ေဒသႀကီး ပခုကၠဴၿမိဳ႔နယ္တို႔၌ ျမန္မာ႕တပ္မေတာ္အား ေထာက္ခံ..."
}

unshuffled_deduplicated_myv

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"2018 иень умарьковонь 6-це чистэ сась паро куля! Россиянь культурань Министерствась макссь невтемань конёв (прокатной удостовер..."
}

unshuffled_deduplicated_mzn

  • Size of downloaded dataset files: 0.16 MB
  • Size of the generated dataset: 0.63 MB
  • Total amount of disk used: 0.79 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"قرآن یا قوران اسلام ِآسمونی کتاب هسته. مسلمونون گانّّه قرآن ره خدا، وحی جه برسنی‌یه، «محمد معجزه» هسته و ثقلین حدیث دله ونه خَو..."
}

unshuffled_deduplicated_nah

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.01 MB
  • Total amount of disk used: 0.01 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "In mācuīlpōhualxihuitl VI (inic chicuacē) in mācuīlpōhualli xiuhitl cāhuitl īhuīcpa 501 xihuitl oc 600 xihuitl."
}

unshuffled_deduplicated_nap

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.01 MB
  • Total amount of disk used: 0.02 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ò AUDIT í Ç è î ÿ å å 30 ò ÿ ÿ é, õ ñ ì ÿ, ê ã- ò à ì. å â å í ç â à à é ñ è å é ó ó ë. å å å û è å î é è à. à è à AUDIT 1-7 â ..."
}

unshuffled_deduplicated_nds

  • Size of downloaded dataset files: 5.27 MB
  • Size of the generated dataset: 13.48 MB
  • Total amount of disk used: 18.76 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Dor kann sik vun nu af an de hele plattdüütsche Welt – vun Niebüll bit New York, vun Helgoland bit Honolulu – drapen. Allens, w..."
}

unshuffled_deduplicated_ne

  • Size of downloaded dataset files: 240.63 MB
  • Size of the generated dataset: 1.24 GB
  • Total amount of disk used: 1.48 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"बर्दिबास नगरपालिकाको तेस्रो नगर परिषदबाट पारित आ.व.२०७३।७४ को संशोधित र २०७४।७५ को प्रस्तावित नीति, कार्यक्रम तथा बजेट\\nअार्थिक..."
}

unshuffled_deduplicated_new

  • Size of downloaded dataset files: 0.83 MB
  • Size of the generated dataset: 4.26 MB
  • Total amount of disk used: 5.09 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"थ्व शहरयागु अक्षांश ३४.७००१६४ उत्तर व देशान्तर ८६.३७६४६९ पश्चिम खः (34.700164° N 86.376469° W)। थ्व थासे ७२२६७३२ वर्ग मिटर (२.७..."
}

unshuffled_deduplicated_nl

  • Size of downloaded dataset files: 15.73 GB
  • Size of the generated dataset: 41.91 GB
  • Total amount of disk used: 57.65 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Op vrijdag 31 augustus wordt het nieuwe studiejaar van de masteropleiding architectuur geopend met een dagexcursie naar Venlo.\\..."
}

unshuffled_deduplicated_nn

  • Size of downloaded dataset files: 23.58 MB
  • Size of the generated dataset: 58.32 MB
  • Total amount of disk used: 81.90 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "Planomtale krav til innhald Bakgrunn: Spørsmål frå fleire kommunar om kva ein planomtale/planbeskrivelse bør innehalde Fylkeskommunen og fylkesmannen har i ein del saker reist motsegn på formelt grunnlag"
}

unshuffled_deduplicated_no

  • Size of downloaded dataset files: 1.96 GB
  • Size of the generated dataset: 5.11 GB
  • Total amount of disk used: 7.07 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Ytterligere aktører i primærhelsetjenesten og andre NHS-virksomheter ble infisert, inkludert legekontor.Læreren vår er så attra..."
}

unshuffled_deduplicated_oc

  • Size of downloaded dataset files: 1.34 MB
  • Size of the generated dataset: 4.00 MB
  • Total amount of disk used: 5.34 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": ".рф (rf, còdi punycode: .xn--p1ai)[1] es lo nom de domeni en rus per Russia. Foguèt activat lo 12 de mai de 2010. Lo còdi latin es .ru."
}

unshuffled_deduplicated_or

  • Size of downloaded dataset files: 38.72 MB
  • Size of the generated dataset: 197.63 MB
  • Total amount of disk used: 236.36 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ଭୁବନେଶ୍ୱର, ୨୭/୧– (ଓଡ଼ିଆ ପୁଅ) ସିପିଆଇ ଜାତୀୟ ପରିଷଦର ଆହ୍ୱାନକ୍ରମେ ଗତକାଲି ଜାନୁୟାରୀ ୨୬ ସାଧାରଣତନ୍ତ୍ର ଦିବସକୁ ଦେଶ ବ୍ୟାପୀ ସମ୍ବିଧାନ ସୁରକ୍ଷା ..."
}

unshuffled_deduplicated_os

  • Size of downloaded dataset files: 2.83 MB
  • Size of the generated dataset: 11.00 MB
  • Total amount of disk used: 13.83 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"1. Лæппу æмæ чызг казрæдзийы зæрдæмæ куы фæцæуынц æмæ, куы сфæнд кæнынц сæ цард баиу кæнын, уæд лæппу бар ракуры чызгæй, цæмæй ..."
}

unshuffled_deduplicated_pa

  • Size of downloaded dataset files: 102.39 MB
  • Size of the generated dataset: 483.04 MB
  • Total amount of disk used: 585.42 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ਰਜਿ: ਨੰ: PB/JL-138/2018-20 ਜਿਲਦ 63, ਬਾਨੀ ਸੰਪਾਦਕ (ਸਵ:) ਡਾ: ਸਾਧੂ ਸਿੰਘ ਹਮਦਰਦ ਫ਼ੋਨ : 0181-2455961-62-63, 5032400, ਫੈਕਸ : 2455960, 2..."
}

unshuffled_deduplicated_pam

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Áku pu i Anak ning Aláya at ngeni ipákit kó kékayu ngan nûng makanánu lang susúlat détinang kulit a mágkas. Lauan ya ing tarátu..."
}

unshuffled_deduplicated_pl

  • Size of downloaded dataset files: 20.19 GB
  • Size of the generated dataset: 50.59 GB
  • Total amount of disk used: 70.78 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"System informatyczny - Załącznik nr 1 do zarządzenia Wójta Gminy Podegrodzie Nr 530/2013 z dnia 27 maja 2013 r\\nSystem informat..."
}

unshuffled_deduplicated_pms

  • Size of downloaded dataset files: 0.71 MB
  • Size of the generated dataset: 2.00 MB
  • Total amount of disk used: 2.72 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Louvigné-du-Désert a l'é na comun-a fransèisa ant la region aministrativa dla Brëtagna, ant ël dipartiment d'Ille-et-Vilaine. A..."
}

unshuffled_deduplicated_pnb

  • Size of downloaded dataset files: 2.58 MB
  • Size of the generated dataset: 9.44 MB
  • Total amount of disk used: 12.02 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"ایہ فائل Wikimedia Commons توں اے تے دوجیاں ویونتاں تے وی ورتی جاےکدی اے۔ گل بات اس دے فائل گل بات صفہ تے تھلے دتی گئی۔\"..."
}

unshuffled_deduplicated_ps

  • Size of downloaded dataset files: 71.83 MB
  • Size of the generated dataset: 254.79 MB
  • Total amount of disk used: 326.61 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Many people usually use the time period ‘business to business (B2B) advertising,’ however most of them do not know precisely wh..."
}

unshuffled_deduplicated_pt

  • Size of downloaded dataset files: 26.00 GB
  • Size of the generated dataset: 68.37 GB
  • Total amount of disk used: 94.37 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Você pode estar lendo este texto no sofá, levantar pra pegar uma breja na geladeira, dar uma cagada e sentar novamente, sem int..."
}

unshuffled_deduplicated_qu

  • Size of downloaded dataset files: 0.02 MB
  • Size of the generated dataset: 0.07 MB
  • Total amount of disk used: 0.09 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Warayu wichay (kastilla simipi: Ascensión de Guarayos) nisqaqa Buliwya mama llaqtapi, Santa Krus suyupi, huk llaqtam, Warayu pruwinsyap uma llaqtanmi."
}

unshuffled_deduplicated_rm

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.01 MB
  • Total amount of disk used: 0.01 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"practicists agrars / practicistas agraras AFP pon far ina furmaziun da basa scursanida per cuntanscher in attestat federal da q..."
}

unshuffled_deduplicated_ro

  • Size of downloaded dataset files: 4.48 GB
  • Size of the generated dataset: 11.66 GB
  • Total amount of disk used: 16.14 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"“În viață, oportunitatea nu este totul. Cine atrage Lumina, cineva bun în umbră. Timpul ne creează.” maestru\\nLyn.Evans: Ce mar..."
}

unshuffled_deduplicated_ru

  • Size of downloaded dataset files: 166.68 GB
  • Size of the generated dataset: 611.70 GB
  • Total amount of disk used: 778.38 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Доступ к данному профилю для публичного просмотра закрыт администрацией сайта - профиль находится на модерации.\\nРазработчикам ..."
}

unshuffled_deduplicated_sa

  • Size of downloaded dataset files: 7.27 MB
  • Size of the generated dataset: 38.33 MB
  • Total amount of disk used: 45.60 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"अनिरुद्धनगरे क्रीडिता रामलीला सम्‍प्रति समाप्‍ता अस्ति । तस्‍य कानिचन् चित्राणि पूर्वमेव प्रकाशितानि सन्ति । द्वौ चलचित्रौ अपि ..."
}

unshuffled_deduplicated_sah

  • Size of downloaded dataset files: 7.01 MB
  • Size of the generated dataset: 27.46 MB
  • Total amount of disk used: 34.49 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████..."
}

unshuffled_deduplicated_scn

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "La gilusìa è nu sintimentu dulurusu ca nasci d'un disideriu di pussessu sclusivu ntê cunfrunti dâ pirsuna amata e dû timuri, dû suspettu o dâ cirtizza dâ sò nfidiltati."
}

unshuffled_deduplicated_sd

  • Size of downloaded dataset files: 74.17 MB
  • Size of the generated dataset: 275.48 MB
  • Total amount of disk used: 349.66 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"هر ڪو ڄاڻي ٿو ته جڏهن توهان هڪ وڏي خريد ڪرڻ چاهيون ٿا, توهان پڄي ضروري حڪم ۾ ان جي ڪم ڪرڻ جي هٿ ۾ لاڳاپو ڪيو آهي. جي شيء آهي ته..."
}

unshuffled_deduplicated_sh

  • Size of downloaded dataset files: 1.45 MB
  • Size of the generated dataset: 6.44 MB
  • Total amount of disk used: 7.87 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Opština Gornja Radgona se nalazi u sjeveroistočnoj Sloveniji i graniči s susjednom Austriji duž rijeke Mure. Sa tridesetim nase..."
}

unshuffled_deduplicated_si

  • Size of downloaded dataset files: 175.62 MB
  • Size of the generated dataset: 842.57 MB
  • Total amount of disk used: 1.02 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"ලාංකීය සිතිවිලි සිංහල බ්ලොග් කියවනය කොත්තු සින්ඩිය ලංකා Blogger හත්මාළුව ලංකා බ්ලොග් කියවනය මාතලන්ගේ සින්ඩිය මොබයිල්lk\\nඅවකාශය ..."
}

unshuffled_deduplicated_sk

  • Size of downloaded dataset files: 1.96 GB
  • Size of the generated dataset: 4.80 GB
  • Total amount of disk used: 6.76 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Aktivity | Agentúra podporovaného zamestnávania | vzdelávanie pre klientov, vzdelávanie pre odborníkov, kurzy\\nŠpecializované k..."
}

unshuffled_deduplicated_sl

  • Size of downloaded dataset files: 523.22 MB
  • Size of the generated dataset: 1.32 GB
  • Total amount of disk used: 1.85 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Če Creatures, ki je želel, da pridejo na čas, predvsem je povedlo – razlikuje od ljubosumja začel grizenja kolen (ali zadnjica)..."
}

unshuffled_deduplicated_so

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.02 MB
  • Total amount of disk used: 0.02 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт ттттттттттттттттуууууууууууу..."
}

unshuffled_deduplicated_sq

  • Size of downloaded dataset files: 445.36 MB
  • Size of the generated dataset: 1.21 GB
  • Total amount of disk used: 1.66 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Çfarë do të më pëlqente tek një femër ose çfarë do të më shndërronte në një shpërthim drite? – Albert Vataj\\nTë gjithëve një zo..."
}

unshuffled_deduplicated_sr

  • Size of downloaded dataset files: 665.03 MB
  • Size of the generated dataset: 2.36 GB
  • Total amount of disk used: 3.03 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Корисни савети за сваки дан. На сајту су разне категорије, као што су љепота, мода, кување и поправка властитим рукама.\\nШколск..."
}

unshuffled_deduplicated_su

  • Size of downloaded dataset files: 0.05 MB
  • Size of the generated dataset: 0.16 MB
  • Total amount of disk used: 0.21 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Kartu krédit nyaéta \"duit plastik\" anu dikaluarkeun ku bank pikeun alat pambayaran di tempat-tempat nu tangtu samisal jiga di hotél, réstoran, tempat rékréasi jeung sajabana.[1]"
}

unshuffled_deduplicated_sv

  • Size of downloaded dataset files: 10.19 GB
  • Size of the generated dataset: 26.33 GB
  • Total amount of disk used: 36.51 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"1783 är ett viktigt årtal i den nya tidens historia. Det året slöts en fred i Paris och därmed blev de 13 brittiska kolonierna ..."
}

unshuffled_deduplicated_sw

  • Size of downloaded dataset files: 2.95 MB
  • Size of the generated dataset: 8.98 MB
  • Total amount of disk used: 11.92 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki chache tu kabla ya Papa Francis kuanza ziara yake katika nchi hiyo yenye idadi kubwa kabisa ya watu katika ulimwengu wa nchi za Kiarabu."
}

unshuffled_deduplicated_ta

  • Size of downloaded dataset files: 971.12 MB
  • Size of the generated dataset: 5.48 GB
  • Total amount of disk used: 6.45 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"பொழுது சாய்ந்து வெகு நேரமாகிவிட்டது. கூலி வேலைக்குப் போயிருந்த 'சித்தாள் ' பெண்கள் எல்லோரும் வீடு திரும்பி விட்டார்கள். இன்னும்..."
}

unshuffled_deduplicated_te

  • Size of downloaded dataset files: 342.43 MB
  • Size of the generated dataset: 1.70 GB
  • Total amount of disk used: 2.04 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"హర్యానాలో టోల్ దగ్గర సిబ్బంది.. స్థానిక ప్రజలు కొట్టుకున్నారు. కర్నాల్ అనే గ్రామానికి సమీపంలో టోల్ గేట్ ఉంది. అయితే సాధారణంగా స..."
}

unshuffled_deduplicated_tg

  • Size of downloaded dataset files: 62.90 MB
  • Size of the generated dataset: 261.68 MB
  • Total amount of disk used: 324.60 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Ҳумайро гуфтааст, мухолифи низом аст, низоме, ки дар Тоҷикистон вуҷуд дорад. Ба ин маънӣ, худро мухолифи давлату ҳукумати Тоҷик..."
}

unshuffled_deduplicated_th

  • Size of downloaded dataset files: 3.54 GB
  • Size of the generated dataset: 17.11 GB
  • Total amount of disk used: 20.65 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ฟันที่แลดูขาวสะอาดไม่มีเศษอาหารติดอยู่ เหงือกสีชมพู ไม่เจ็บ หรือมีเลือดออกเวลาแปรงฟันหรือขัดฟัน ไม่มีปัญหาเรื่องกลิ่นปาก ทำให้ก..."
}

unshuffled_deduplicated_tk

  • Size of downloaded dataset files: 2.22 MB
  • Size of the generated dataset: 7.12 MB
  • Total amount of disk used: 9.34 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Türkmenistanyň Prezidenti agyr atletika boýunça dünýä çempionatyna taýýarlyk işleriniň barşy bilen tanyşdy\\nHalallykdan kemal t..."
}

unshuffled_deduplicated_tl

  • Size of downloaded dataset files: 151.34 MB
  • Size of the generated dataset: 431.69 MB
  • Total amount of disk used: 583.04 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"“Gusto ko manawagan sa mga Unit Head ng Chanel 2 Salve. Kasi napapansin ko iyon mga alaga ko ang taping halos once a week lang,..."
}

unshuffled_deduplicated_tr

  • Size of downloaded dataset files: 10.39 GB
  • Size of the generated dataset: 28.47 GB
  • Total amount of disk used: 38.86 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Son yıllarda görülen ay tutulmalarına göre daha etkili olacağı söylenen Kanlı veya Kırmızı Ay Tutulmasına saatler kaldı. Bu akş..."
}

unshuffled_deduplicated_tt

  • Size of downloaded dataset files: 85.89 MB
  • Size of the generated dataset: 321.37 MB
  • Total amount of disk used: 407.26 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"\\\"Иремнең вафатына 40 көн узгач, Алмаз да безнең өйгә кереп үлде\\\". Арчада 35 яшьлек ир өстенә кондызлар ега башлаган агач төшк..."
}

unshuffled_deduplicated_tyv

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.01 MB
  • Total amount of disk used: 0.01 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Экии, хүндүлуг аалчылар болгаш тыва дылдың деткикчилери! Тыва дылдың болгаш чогаалдың ховар бир башкызынга, Менги Ооржакка, ажы..."
}

unshuffled_deduplicated_ug

  • Size of downloaded dataset files: 20.53 MB
  • Size of the generated dataset: 86.44 MB
  • Total amount of disk used: 106.97 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"زاڭ-ءتۇزىم | عىلىم-تەحنيكا | ءتىل-ادەبيەت | تۇرمىس | دەنە تاربيە | ساياحات-ورتا | سۋرەتتى حابار | سىر سۇحبات | ارناۋلى تاقىرىپ ..."
}

unshuffled_deduplicated_uk

  • Size of downloaded dataset files: 8.04 GB
  • Size of the generated dataset: 29.86 GB
  • Total amount of disk used: 37.90 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Про надання роз'яснення (щодо форми письмового зобов'язання громадян про зворотне ввезення/вивезення товарів), Державна митна с..."
}

unshuffled_deduplicated_ur

  • Size of downloaded dataset files: 483.59 MB
  • Size of the generated dataset: 1.82 GB
  • Total amount of disk used: 2.31 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"آئیے اہم اسلامی کتب کو یونیکوڈ میں انٹرنیٹ پر پیش کرنے کے لئے مل جل کر آن لائن ٹائپنگ کریں۔ محدث ٹائپنگ پراجیکٹ کے ذریعے آپ روز..."
}

unshuffled_deduplicated_uz

  • Size of downloaded dataset files: 4.30 MB
  • Size of the generated dataset: 12.00 MB
  • Total amount of disk used: 16.29 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Qurama tog'lari tizmasining Toshkentdan 154 km uzoqlikdagi Toshkent-Ush yo'li yeqasidaxushmanzara tabiat qo'ynida joylashgan maydoni 30 ga.\nBolalarni sog'lomlashtirish oromgohi Bo'stonliq tumani Oqtosh muntaqasining soy-salqin gushasida joylashgan."
}

unshuffled_deduplicated_vec

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.02 MB
  • Total amount of disk used: 0.02 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Par ogni pónto, ła derivada ła xe ła pendensa de ła reta tangente a ła curva de ła funsion f. Ła reta de cołor róso l'è senpre ..."
}

unshuffled_deduplicated_vi

  • Size of downloaded dataset files: 10.71 GB
  • Size of the generated dataset: 33.60 GB
  • Total amount of disk used: 44.31 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Canh chua cá bông lau không chỉ là món ăn giải nhiệt, thanh mát ngày hè mà còn là món siêu bổ dưỡng, rất tốt cho người gầy ốm. ..."
}

unshuffled_deduplicated_vo

  • Size of downloaded dataset files: 0.30 MB
  • Size of the generated dataset: 2.10 MB
  • Total amount of disk used: 2.40 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Sarniguet binon zif in ziläk: Hautes-Pyrénées, in topäd: Midi-Pyrénées, in Fransän. Sarniguet topon videtü 43°19’ 7’’ N e lunetü 0°5’ 19’’ L."
}

unshuffled_deduplicated_wa

  • Size of downloaded dataset files: 0.08 MB
  • Size of the generated dataset: 0.22 MB
  • Total amount of disk used: 0.29 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Cisse pådje ci n' est co k' on djermon, dj' ô bén k' el pådje est djusse sibåtcheye, eyet co trop tene; et s' divreut ele ecråxhî ene miete."
}

unshuffled_deduplicated_war

  • Size of downloaded dataset files: 0.55 MB
  • Size of the generated dataset: 2.36 MB
  • Total amount of disk used: 2.90 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "An Honce amo in usa ka baryo ngan munisipalidad ha distrito han Rožňava ha rehiyon han Košice ha nasod han Slovakia.\nAn Rumegies amo in usa ka komyun ha departamento han Nord ngan ha rehiyon han Nord-Pas-de-Calais ha nasod han Fransya."
}

unshuffled_deduplicated_wuu

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.03 MB
  • Total amount of disk used: 0.04 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"伊春元旦天气 伊春腊八天气 伊春春节天气 伊春情人节天气 伊春元宵节天气 伊春愚人节天气 伊春清明节天气 伊春劳动节天气 伊春母亲节天气 伊春端午节天气 伊春七夕节天气 伊春教师节天气 伊春中秋节天气 伊春国庆节天气 伊春重阳节天气 伊春万圣节天气 伊春..."
}

unshuffled_deduplicated_xal

  • Size of downloaded dataset files: 0.03 MB
  • Size of the generated dataset: 0.12 MB
  • Total amount of disk used: 0.15 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Арнгудин Орн гисн Европд бәәдг һазр. 2007 җилин тooһaр эн орн нутгт 3,600,523 әмтн бәәдг билә. Арнгудин Орнин хотл балһсна нерн..."
}

unshuffled_deduplicated_xmf

  • Size of downloaded dataset files: 0.94 MB
  • Size of the generated dataset: 4.63 MB
  • Total amount of disk used: 5.58 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"მოჩამილი ტექსტი წჷმორინელი რე Creative Commons Attribution-ShareAlike ლიცენზიათ; შილებე გეძინელი პირობეფიშ არსებუა. კილიშკილიშა..."
}

unshuffled_deduplicated_yi

  • Size of downloaded dataset files: 22.20 MB
  • Size of the generated dataset: 88.29 MB
  • Total amount of disk used: 110.49 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ממשותדיק - חבֿרה, איך אַרבעט איצט אױף אַ זשורנאַל. טאָמער איר האָט עפּעס צוצוגעבן זאָלט איר שיקן מיר אַן אָנזאָג. ס'װעט הײסן \\\"..."
}

unshuffled_deduplicated_yo

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.03 MB
  • Total amount of disk used: 0.04 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Copyright © 2018 BBC. BBC kò mọ̀ nípa àwọn ohun tí ó wà ní àwọn ojú òpó tí ó wà ní ìta. Ọwọ́ tí a fi mú ìbáṣepọ̀ ti ìta.\"..."
}

unshuffled_deduplicated_yue

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 你還不爆 我累了 投降輸一半可以嗎\"..."
}

unshuffled_deduplicated_zh

  • Size of downloaded dataset files: 99.98 GB
  • Size of the generated dataset: 267.88 GB
  • Total amount of disk used: 367.86 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"中国铝灰网 中国有色金属矿产网 中国黄莲网 中国水轮发电机网 中国抽油泵网 中国数控雕刻机网 中国不锈钢抛光网 中国磨具加工网 中国压铸铝网 中国耐水腻子网 中国手机摄像头网 中国粗粮网 中国车门锁网 中国钛粉网 中国轮圈网\\n天天中奖彩票图 天天中彩票..."
}

Click to expand the Data/size information for each language (original)

unshuffled_original_af

  • Size of downloaded dataset files: 85.79 MB
  • Size of the generated dataset: 254.08 MB
  • Total amount of disk used: 339.87 MB
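
The three figures listed for each configuration appear to be related: the total disk usage matches the downloaded size plus the generated size (here 85.79 MB + 254.08 MB = 339.87 MB). The sketch below checks that arithmetic. The helper names `to_mb` and `total_disk_mb` are hypothetical, and the factor of 1000 MB per GB is an assumption inferred from the mixed-unit entries in this list (for example 248.19 MB + 894.83 MB reported as 1.14 GB), not something the card states.

```python
def to_mb(size: str) -> float:
    """Convert a size string such as '85.79 MB' or '1.82 GB' to megabytes."""
    value, unit = size.split()
    # Assumption: decimal units (1 GB = 1000 MB), inferred from the listed totals.
    factors = {"MB": 1.0, "GB": 1000.0}
    return float(value) * factors[unit]


def total_disk_mb(downloaded: str, generated: str) -> float:
    """Sum the downloaded and generated sizes, returning megabytes."""
    return to_mb(downloaded) + to_mb(generated)


# Figures for unshuffled_original_af, copied from the list above.
total = total_disk_mb("85.79 MB", "254.08 MB")
print(f"{total:.2f} MB")  # 339.87 MB
```

Because every listed value is rounded to two decimals, a recomputed total can differ from the listed one in the last digit.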

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel"
}

unshuffled_original_als

  • Size of downloaded dataset files: 1.49 MB
  • Size of the generated dataset: 5.30 MB
  • Total amount of disk used: 6.78 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"De Nazionalpark hät e Flächi vo 170,3 km² und isch dodemit s grösti Naturschutzgebiet vo de Schwiz. Er ligt uf em Gebiet vo de ..."
}
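
Every configuration name in this list follows the pattern `unshuffled_<variant>_<language code>`, where the variant is `original` or `deduplicated`. A minimal sketch of building such a name and requesting the `train` split follows. The helper names are hypothetical, the repository id `"oscar"` is an assumption taken from the card title, and because this repository relies on a loading script, the `load_dataset` call may need extra opt-in or may not be supported, depending on the installed version of the `datasets` library.

```python
def oscar_config_name(lang: str, deduplicated: bool = False) -> str:
    """Build a configuration name such as 'unshuffled_original_als'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang}"


def load_oscar_train(lang: str, deduplicated: bool = False):
    """Request the 'train' split of one configuration (not called here)."""
    # Imported lazily so the name helper works without the library installed.
    from datasets import load_dataset

    # Assumptions: repository id "oscar"; streaming=True is used so that
    # examples can be read without first downloading the whole configuration.
    return load_dataset(
        "oscar",
        oscar_config_name(lang, deduplicated),
        split="train",
        streaming=True,
    )


print(oscar_config_name("als"))  # unshuffled_original_als
```

Each example yielded this way would be expected to carry the same two fields shown in the samples: an integer `id` and a `text` string.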

unshuffled_original_am

  • Size of downloaded dataset files: 102.79 MB
  • Size of the generated dataset: 378.06 MB
  • Total amount of disk used: 480.85 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"አየር መንገዱ ከአዲስ አበባ ወደ ሮም ጣሊያን በማምራት ላይ በነበረበት ጊዜ ረዳት አብራሪው የጉዞውን አቅጣጫ በመቀየር ጄኔቭ አውሮፓላን ማረፊያ በማሳረፍ እጁን ለፖሊስ ሰጥቷል።\\nየኢትዮጵያ መንግስት የ..."
}

unshuffled_original_an

  • Size of downloaded dataset files: 0.15 MB
  • Size of the generated dataset: 1.33 MB
  • Total amount of disk used: 1.48 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"واااااااأسفاه الأمم تفتخر ب 0 أمي ووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووو..."
}

unshuffled_original_ar

  • Size of downloaded dataset files: 22.23 GB
  • Size of the generated dataset: 87.94 GB
  • Total amount of disk used: 110.17 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"مرحبا بك عزيز الزائر نتمنى لك أوقاتاً سعيدة معنا وأن نزداد شرفا بخدمتك ولا تنسى التسجيل معنا لتستفيد بكل جديد\\nأهلا وسهلا بك زا..."
}

unshuffled_original_arz

  • Size of downloaded dataset files: 15.90 MB
  • Size of the generated dataset: 70.13 MB
  • Total amount of disk used: 86.03 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"بنى عجل : قبيلة من عجل بن لجيم بن صعب بن على بن بكر بن وائل انتقل اغلبهم الى البصرة فى العراق و اصفهان و خراسان فى ايران و اذرب..."
}

unshuffled_original_as

  • Size of downloaded dataset files: 21.43 MB
  • Size of the generated dataset: 117.73 MB
  • Total amount of disk used: 139.17 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"আমি, এই সংগঠনৰ সদস্য সকলে একেলগ হৈ অসমকে ধৰি ভাৰতৰ উত্তৰ পূৰ্বাঞ্চলৰ অমূল্য কলা-সাংস্কৃতিক সম্পদৰাজি বৃহত্তৰ অষ্ট্ৰেলিয়াৰ সন্মু..."
}

unshuffled_original_ast

  • Size of downloaded dataset files: 0.92 MB
  • Size of the generated dataset: 2.54 MB
  • Total amount of disk used: 3.46 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"The Killers llanzaron el so álbum debú, Hot Fuss, en xunu de 2004 nel Reinu Xuníu, al traviés de la discográfica Lizard King, y..."
}

unshuffled_original_av

  • Size of downloaded dataset files: 0.08 MB
  • Size of the generated dataset: 0.42 MB
  • Total amount of disk used: 0.50 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Жинда малъараб ва божизе бегьулеб рагІудаса кьуризе бегьуларо гьев. Гьес насихІат гьабизе кколелъул бацІцІадаб диналъул рахъалъ..."
}

unshuffled_original_az

  • Size of downloaded dataset files: 927.76 MB
  • Size of the generated dataset: 2.96 GB
  • Total amount of disk used: 3.89 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"AZTV-Artıq 7 ildir ki, Abşeron rayonu dotasiya almadan bütün xərclərini yerli daxilolmalar hesabına maliyyələşdirir.\\nDünən, 10..."
}

unshuffled_original_azb

  • Size of downloaded dataset files: 6.64 MB
  • Size of the generated dataset: 28.47 MB
  • Total amount of disk used: 35.11 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"لعلی ١٣-جو عصرده یاشاییب یاراتمیش گؤرکملی آذربایجان شاعرلریندندیر. ١٢٢٤-جی ایلده تبریزده آنادان اولموشدور، گنج یاشلاریندا تیجار..."
}

unshuffled_original_ba

  • Size of downloaded dataset files: 33.22 MB
  • Size of the generated dataset: 133.70 MB
  • Total amount of disk used: 166.92 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Күҙәтеү ҡуласаһы моделен хәҙер Мифтахетдин Аҡмулла исемендәге Башҡорт дәүләт педагогия университетында ла эшләргә мөмкин\\t\\nКүҙ..."
}

unshuffled_original_bar

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "                                                                                                                                          vo"
}

unshuffled_original_bcl

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"& ÿ ó / í 0 - ø û ù ö ú ð ï ú \\u0014 ù þ ô ö í ÷ ò \\u0014 ÷ í ù û ö í \\u0001 û ñ ç þ \\u0001 ð \\u0007 þ ò ñ ñ ò ô \\u0017 û ö ô ÷..."
}

unshuffled_original_be

  • Size of downloaded dataset files: 498.29 MB
  • Size of the generated dataset: 1.88 GB
  • Total amount of disk used: 2.38 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Брэсцкія ўлады не дазволілі прафсаюзу РЭП правесці пікетаванне ў парку Воінаў-інтэрнацыяналістаў 30 мая 2018 года.\\nСітуацыю пр..."
}

unshuffled_original_bg

  • Size of downloaded dataset files: 8.34 GB
  • Size of the generated dataset: 33.75 GB
  • Total amount of disk used: 42.09 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ЖАЛБОПОДАТЕЛЯТ директор на Дирекция „ Обжалване и данъчно-осигурителна практика“- Бургас, редовно призован, се представлява от ..."
}

unshuffled_original_bh

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.12 MB
  • Total amount of disk used: 0.13 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"सुकमा जिला भारत के छत्तीसगढ़ राज्य में एगो जिला बाटे। एकर मुख्यालय सुकमा शहर बाटे। एकर कुल रकबा 5636 वर्ग कि॰मी॰ बाटे।\"..."
}

unshuffled_original_bn

  • Size of downloaded dataset files: 2.14 GB
  • Size of the generated dataset: 10.77 GB
  • Total amount of disk used: 12.91 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ভড়ং সর্বস্ব বাংলা আর্ট অ্যান্ড কালচারের হিসাব গুলিয়ে দেওয়ার ম্যাজিকের নাম ব্রাত্য রাইসু November 23, 2017\\nভড়ং সর্বস্ব বাংলা আর..."
}

unshuffled_original_bo

  • Size of downloaded dataset files: 28.94 MB
  • Size of the generated dataset: 195.40 MB
  • Total amount of disk used: 224.34 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"བོད་མི་འདི་དག་ནི་རང་རྒྱུད་སྒོ་རུ་ཕུད་དེ་གཞན་རྒྱུད་པང་དུ་ཉར་ནས་གསོ་སྐྱོང་བྱེད་དགོས་ཟེར་བ་དང་གཅིག་མཚུངས་རེད།\\nཚན་རིག་ནི་དང་ཐོག་རང..."
}

unshuffled_original_bpy

  • Size of downloaded dataset files: 0.34 MB
  • Size of the generated dataset: 4.35 MB
  • Total amount of disk used: 4.69 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"পৌরসভা এহার আয়তন (লয়াহান) ২,৭৩০,.৬৩ বর্গ কিলোমিটার। পৌরসভা এহার মাপাহানর অক্ষাংশ বারো দ্রাঘিমাংশ ইলতাই 18.63° S 48.18° W ।[১]..."
}

unshuffled_original_br

  • Size of downloaded dataset files: 9.18 MB
  • Size of the generated dataset: 30.20 MB
  • Total amount of disk used: 39.38 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Ar mank Magalhães(Daveoù a vank) a zo ur spesad evned, Spheniscus magellanicus an anv skiantel anezhañ.\\nGallout a reer implijo..."
}

unshuffled_original_bs

  • Size of downloaded dataset files: 0.05 MB
  • Size of the generated dataset: 0.48 MB
  • Total amount of disk used: 0.53 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ž šř é ú šř šř ě šř ž é č ě ž ů ě ď éé ýš ě ě Ž č š ý ě ď é ýš ě ď ě éé ýš ě č ž ě š ý ď ě ýš é ú č ž č š ý ď ý ž é éě ď é č ýš..."
}

unshuffled_original_bxr

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.01 MB
  • Total amount of disk used: 0.02 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"2002 оной хабар буряад хэлэ бэшэгэй һалбари Үндэһэтэнэй хүмүүнлиг ухаанай дээдэ һургуули болгогдожо өөршэлэгдөө.\\nХарин мүнөө б..."
}

unshuffled_original_ca

  • Size of downloaded dataset files: 3.10 GB
  • Size of the generated dataset: 8.62 GB
  • Total amount of disk used: 11.73 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Daniel Vendrell, conegut com Vandrell, ha sigut un dels il•lustradors contemporanis més influents, representant a la nova onada..."
}

unshuffled_original_cbk

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano..."
}

unshuffled_original_ce

  • Size of downloaded dataset files: 2.09 MB
  • Size of the generated dataset: 8.73 MB
  • Total amount of disk used: 10.82 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Шаьш анархисташ ду бохучу жигархойн дIахьедарехь дуьйцу, оьрсийн ницкъаллийн структурийн а, федералан каналан а Iалашонаш \\\"мар..."
}

unshuffled_original_ceb

  • Size of downloaded dataset files: 11.07 MB
  • Size of the generated dataset: 40.97 MB
  • Total amount of disk used: 52.03 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Si Isko walay pupamilok nga nagtan-aw sa unahan, natugaw. “Naunsa ka gud diha Isko nga layo man kaayo ang imong panan-aw?” ni I..."
}

unshuffled_original_ckb

  • Size of downloaded dataset files: 111.88 MB
  • Size of the generated dataset: 510.97 MB
  • Total amount of disk used: 622.85 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"رسی رۆژ - ساڵێک دوای بومەلەرزەی کرماشان میوانی بەرنامە : کاک سیاوەش حەیاتی چالاکی مەدەنی -قەسری شیرین\\nپارچە موزیک 30 / 10 / 20..."
}

unshuffled_original_cs

  • Size of downloaded dataset files: 21.72 GB
  • Size of the generated dataset: 57.08 GB
  • Total amount of disk used: 78.80 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Akce anarchistů proti připravovanému novému služební řádu a nízkým mzdám 1903 – Historie českého anarchismu (1880 – 1939)\\nRost..."
}

unshuffled_original_cv

  • Size of downloaded dataset files: 9.40 MB
  • Size of the generated dataset: 41.05 MB
  • Total amount of disk used: 50.45 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Шыранӑ чухне ӑнсӑртран латин кирилл саспаллисем вырӑнне латин саспаллисене ҫырсан, сайт эсир ҫырнине юсама тӑрӑшӗ.\\nКу сайтра ч..."
}

unshuffled_original_cy

  • Size of downloaded dataset files: 81.74 MB
  • Size of the generated dataset: 224.93 MB
  • Total amount of disk used: 306.67 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Mae capeli Cymreig yr Andes ym Mhatagonia wedi cyhoeddi na fydd gwasanaethau yno weddill y mis, oherwydd yr eira trwm sydd wedi..."
}

unshuffled_original_da

  • Size of downloaded dataset files: 6.00 GB
  • Size of the generated dataset: 16.76 GB
  • Total amount of disk used: 22.76 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Den 2.-5. februar 2016 løb det tredje kursus i uddannelsen af 4kommunesamarbejdets Local Impact Coaches, af stablen i Gentofte ..."
}

unshuffled_original_de

  • Size of downloaded dataset files: 119.51 GB
  • Size of the generated dataset: 331.22 GB
  • Total amount of disk used: 450.73 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Auf dieser Seite gibt es mind. ein YouTube Video. Cookies für diese Website wurden abgelehnt. Dadurch können keine YouTube Vide..."
}

unshuffled_original_diq

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "Zıwanê Slawki, zıwano merdumanê Slawano. Zıwanê Slawki yew lızgeyê Zıwananê Hind u Ewropao. Keyeyê Zıwananê Slawki beno hirê letey:"
}

unshuffled_original_dsb

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.01 MB
  • Total amount of disk used: 0.02 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Pśiklaskaju južo pśed pśedstajenim... 1500 źiśi njamóžo wěcej docakaś, měsćańska hala w Chóśebuzu - wupśedana."
}

unshuffled_original_dv

  • Size of downloaded dataset files: 24.91 MB
  • Size of the generated dataset: 131.63 MB
  • Total amount of disk used: 156.54 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ބ. އަތޮޅުގައި ހުޅުވަން ތައްޔާރުވަމުން އަންނަ ވައްކަރު ރިސޯޓުގައި ވަޒީފާ އަދާކުރަން ޝައުގުވެރިވާ ފަރާތްތަކަށް ކުރިމަތިލުމުގެ ފުރ..."
}

unshuffled_original_el

  • Size of downloaded dataset files: 17.31 GB
  • Size of the generated dataset: 66.27 GB
  • Total amount of disk used: 83.58 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Νεκρός εντοπίστηκε μέσα στο σπίτι του στην οδό Ηρώδου Αττικού στον αριθμό 7 ο επικεφαλής του προξενικού τμήματος της Ρωσικής πρ..."
}

unshuffled_original_eml

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.02 MB
  • Total amount of disk used: 0.03 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"A séguit dal prucès ad rubutiśasiòṅ di abitànt dal pòpul ad Mikenes, Angoras 'l è finî dènt'r a 'n robot cun la tèsta dna rana ..."
}

unshuffled_original_en

  • Size of downloaded dataset files: 903.83 GB
  • Size of the generated dataset: 2525.44 GB
  • Total amount of disk used: 3429.27 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visi..."
}

unshuffled_original_eo

  • Size of downloaded dataset files: 117.07 MB
  • Size of the generated dataset: 314.18 MB
  • Total amount of disk used: 431.27 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Ĉu ... preĝi | mediti | ricevi instigojn || kanti | muziki || informiĝi | legi | studi || prepari Diservon\\nTemas pri kolekto d..."
}

unshuffled_original_es

  • Size of downloaded dataset files: 106.04 GB
  • Size of the generated dataset: 298.49 GB
  • Total amount of disk used: 404.53 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Como se librará de la celulitis en el gimnasio La piel superflua en las manos después del adelgazamiento, Los bailes fáciles pa..."
}

unshuffled_original_et

  • Size of downloaded dataset files: 1.88 GB
  • Size of the generated dataset: 5.17 GB
  • Total amount of disk used: 7.06 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"MTÜ AB Video järgib oma tegevuses kodanikuühenduste eetilise tegevuse üldtunnustatud põhimõtteid, mis on lühidalt kokkuvõetud 7..."
}

unshuffled_original_eu

  • Size of downloaded dataset files: 248.19 MB
  • Size of the generated dataset: 894.83 MB
  • Total amount of disk used: 1.14 GB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "Gure jarduerek eraikuntzarekin, elkarbizitzarekin, hirigintzarekin eta ekologiarekin dute harremana, baita ideia eta konponbideak irudikatu eta garatzearekin ere, eraikuntza sektorea hobetuz, pertsonen erosotasuna eta bizi-kalitatea hobetzeko."
}

unshuffled_original_fa

  • Size of downloaded dataset files: 20.96 GB
  • Size of the generated dataset: 84.21 GB
  • Total amount of disk used: 105.17 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"قـــــــــــــــــرار بود با هم کنـــــــــــــار بیایم نه اینکه از کنــــــــــــار هم رد بشیم...!!!\\nاگر روزی دلت لبریز غم بو..."
}

unshuffled_original_fi

  • Size of downloaded dataset files: 9.97 GB
  • Size of the generated dataset: 28.57 GB
  • Total amount of disk used: 38.54 GB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Kiitos Deelle kaikesta - 1,5 viikkoa kulunut, kun Dee ei ole enää ollut omani. Reilu viikko sitten sunnuntaina vein Deen uuteen kotiinsa. Itselläni on ollut niin ristiriitaiset t..."
}

unshuffled_original_fr

  • Size of downloaded dataset files: 105.32 GB
  • Size of the generated dataset: 303.19 GB
  • Total amount of disk used: 408.51 GB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "Média de débat d'idées, de culture et de littérature. Récits, décryptages, analyses, portraits et critiques autour de la vie des idées. Magazine engagé, ouvert aux autres et au monde.. Bring up to date in french"
}

unshuffled_original_frr

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Hiragana’ Practice’Sheet’1’(A -O)’ ’ Name:’________ __________________________’Section:’_______________ _’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ..."
}

unshuffled_original_fy

  • Size of downloaded dataset files: 12.40 MB
  • Size of the generated dataset: 36.24 MB
  • Total amount of disk used: 48.64 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Nim in sêfte ride op Holmsjön, yn ien fan 'e lytse marren yn de omkriten, of nim se op avontueren lykas nonresidential. lâns Indalsälven wetter. Holm Sportklubb hawwe kano 's te huur, yn gearwurking mei de Baltyske Power konferinsje."
}

unshuffled_original_ga

  • Size of downloaded dataset files: 29.27 MB
  • Size of the generated dataset: 92.37 MB
  • Total amount of disk used: 121.63 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Is fóram é seo chun plé a dhéanamh ar an leabhar atá roghnaithe do mhí na Samhna 2013 amháin. Ní féidir ach le baill chláraithe..."
}

unshuffled_original_gd

  • Size of downloaded dataset files: 0.52 MB
  • Size of the generated dataset: 2.02 MB
  • Total amount of disk used: 2.55 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "Zhou Yujun, a 'phàrtaidh Rùnaire Comataidh Sgìre Yanfeng ann Hengyang bhaile agus a Sgìre pàrtaidh agus an riaghaltas a' bhuidheann-riochdachaidh a 'tighinn a chèilidh air ar companaidh air Apr. 14, 2017."
}

unshuffled_original_gl

  • Size of downloaded dataset files: 235.38 MB
  • Size of the generated dataset: 656.48 MB
  • Total amount of disk used: 891.87 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"O persoal de Inditex da provincia de Pontevedra segue a reclamar iguais condicións laborais no conxunto do país - CIG: Confeder..."
}

unshuffled_original_gn

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.04 MB
  • Total amount of disk used: 0.05 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"º ѐÆÚÓ À Ã Ð É Æ ¾ Ä ΠÀ ¼ Æ É ÄÛ = Ü Ý\\\"Þ ß†à á â ã ä å æçè ã é ê â å àë ì æê íî é á ë ï í çì àð í Ü à ñ ê é ò ä ì\"..."
}

unshuffled_original_gom

  • Size of downloaded dataset files: 0.44 MB
  • Size of the generated dataset: 2.25 MB
  • Total amount of disk used: 2.71 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"दुष्ट शीळ हें कौरवांचें । रामें सविस्तर देखूनि साचें । बोलिले वचनें जें दुर्वाचे । करी तयांचें अनुस्मरण ॥२२०॥\"..."
}

unshuffled_original_gu

  • Size of downloaded dataset files: 232.02 MB
  • Size of the generated dataset: 1.09 GB
  • Total amount of disk used: 1.33 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"અધિક માસ ચાલે છે. સમગ્ર ભારતમાં અને તેમાંય ખાસ કરીને પવિત્ર કે ધાર્મિક કહેવાય છે તેવા સ્થાનક પર કથાનો દોર ચાલે છે. ઉનાળાની કાળઝ..."
}

unshuffled_original_he

  • Size of downloaded dataset files: 5.66 GB
  • Size of the generated dataset: 21.11 GB
  • Total amount of disk used: 26.77 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"זקוקים לרשתות נגד יתושים? מחפשים רשת מתאימה לחלון צר וקטן? רשתות נגד יתושים אקורדיון של חברת קליר-מש הן הפתרון.\\nרשתות לחלונות ..."
}

unshuffled_original_hi

  • Size of downloaded dataset files: 3.66 GB
  • Size of the generated dataset: 17.93 GB
  • Total amount of disk used: 21.59 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"'आइटम गर्ल' बनकर हिट हुई थीं राखी सावंत, आज करीना-कटरीना तक फॉलो कर रही हैं ट्रेंड नक्‍सलियों का दम निकालेगा बाइक ग्रेनेड लॉन्च..."
}

unshuffled_original_hr

  • Size of downloaded dataset files: 79.42 MB
  • Size of the generated dataset: 243.83 MB
  • Total amount of disk used: 323.24 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"U raspravi je sudjelovao i HSS-ov saborski zastupnik rekavši kako poljoprivrednici ne osjete mjere o kojima ministar govori jer..."
}

unshuffled_original_hsb

  • Size of downloaded dataset files: 1.39 MB
  • Size of the generated dataset: 4.49 MB
  • Total amount of disk used: 5.87 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Budyšin (SN/BŠe). Elektronikarjo mějachu lětsa cyle hinaši zazběh do swojeho wukubłanja. Wokrjesne rjemjeslnistwo bě mjenujcy w..."
}

unshuffled_original_ht

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan..."
}

unshuffled_original_hu

  • Size of downloaded dataset files: 15.69 GB
  • Size of the generated dataset: 43.07 GB
  • Total amount of disk used: 58.77 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"monster - Amatőr, házi szex videók és kezdő csjaok pornó filmjei. - Free amateur, home made sex videos and online porn movies. ..."
}

unshuffled_original_hy

  • Size of downloaded dataset files: 897.36 MB
  • Size of the generated dataset: 3.94 GB
  • Total amount of disk used: 4.84 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Արցախի Հանրապետության հռչակման 26-րդ տարեդարձի կապակցությամբ Շուշիի Արվեստի կենտրոնում կազմակերպվել է մոսկվաբնակ նկարիչներ՝ հայ..."
}

unshuffled_original_ia

  • Size of downloaded dataset files: 0.08 MB
  • Size of the generated dataset: 0.69 MB
  • Total amount of disk used: 0.78 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha h..."
}

unshuffled_original_id

  • Size of downloaded dataset files: 10.60 GB
  • Size of the generated dataset: 32.32 GB
  • Total amount of disk used: 42.91 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Perihal dari itu, kalau kunci hal yang demikian hilang, pemilik wajib melapor ke bengkel sah untuk dibuatkan kunci baru dengan ..."
}

unshuffled_original_ie

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.02 MB
  • Total amount of disk used: 0.02 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "Plastic Yo Yo Metal Yo Yos Wooden Yo Yo Keychain Yo Yo Translucent Yo Yo Light Up Yo Yo Globe Yo Yo Stress Reliever Yo Yo Jellyfish Yo Yo Sports Ball Yo Yo Sound Yo Yo Miniature Yo Yo Promotional Yo Yo Novelty Yo Yo Video Game Yo Yo ECO Recycled Yo Yo"
}

unshuffled_original_ilo

  • Size of downloaded dataset files: 0.27 MB
  • Size of the generated dataset: 0.92 MB
  • Total amount of disk used: 1.20 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Segun ken ni Ping-ay, ti yellow corn ti maysa kadagiti nadakamat a liberalized agricultural commodity iti daytoy a free trade k..."
}

unshuffled_original_io

  • Size of downloaded dataset files: 0.04 MB
  • Size of the generated dataset: 0.16 MB
  • Total amount of disk used: 0.20 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Chekia esas parlamentala republiko. La chefo di stato esas la prezidanto. Til 2013 lu elektesis dal parlamento. Pos ta yaro, ol..."
}

unshuffled_original_is

  • Size of downloaded dataset files: 533.03 MB
  • Size of the generated dataset: 1.52 GB
  • Total amount of disk used: 2.06 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Eyjar.net - upplýsinga- og fréttamiðill um Vestmannaeyjar - Fréttir - Nái núverandi stefna stjórnvalda fram að ganga mun það va..."
}

unshuffled_original_it

  • Size of downloaded dataset files: 52.16 GB
  • Size of the generated dataset: 147.38 GB
  • Total amount of disk used: 199.54 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Jaundice - causes, treatment & pathology massaggio a osteochondrosis dellindizio di una controindicazione\\nTrattamento su un co..."
}

unshuffled_original_ja

  • Size of downloaded dataset files: 79.56 GB
  • Size of the generated dataset: 232.22 GB
  • Total amount of disk used: 311.78 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"神社などへ一緒に同行して、様々な角度のショットで家族写真やお子様の写真を撮影致します!お好みに合わせて様々な写真を取ることができますので、その場でカメラマンへのリクエストも可能です!お子様の晴れ姿を、緊張していない自然な笑顔で残しませんか?\\n※七五三の..."
}

unshuffled_original_jbo

  • Size of downloaded dataset files: 0.21 MB
  • Size of the generated dataset: 0.77 MB
  • Total amount of disk used: 0.98 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "ni'o 23 la cimast. cu 23moi djedi fi'o masti la cimast. noi ke'a cu cimoi masti .i 22 la cimast. cu purlamdei .ije 24 la cimast. cu bavlamdei"
}

unshuffled_original_jv

  • Size of downloaded dataset files: 0.22 MB
  • Size of the generated dataset: 0.69 MB
  • Total amount of disk used: 0.91 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"José Mourinho (diwaca: [ʒuˈzɛ moˈɾiɲu]; lair ing Setubal, Portugal, 26 Januari 1963; umur 55 taun) iku salah siji pelatih bal k..."
}

unshuffled_original_ka

  • Size of downloaded dataset files: 680.74 MB
  • Size of the generated dataset: 3.77 GB
  • Total amount of disk used: 4.45 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"წამიყვანე შენთან ერთად (ქართულად) / Возьми меня с собой (картулад) / (რუსული სერიალები ქართულად) (რუსების პორნო ონლაინში) (ruse..."
}
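
The three size figures listed for each configuration are related: the total disk usage is the downloaded size plus the generated size. The sketch below checks this for the `unshuffled_original_ka` figures above. It assumes decimal units (1 GB = 1000 MB), which is an inference from the listed numbers rather than something the card states; the helper names `to_mb` and `total_disk_gb` are hypothetical.

```python
# Assumed conversion factors (decimal units); inferred from the listed totals.
UNIT_TO_MB = {"MB": 1.0, "GB": 1000.0}


def to_mb(size: str) -> float:
    """Convert a string such as '680.74 MB' or '3.77 GB' to megabytes."""
    value, unit = size.split()
    return float(value) * UNIT_TO_MB[unit]


def total_disk_gb(downloaded: str, generated: str) -> float:
    """Sum the downloaded and generated sizes, in GB rounded to two decimals."""
    return round((to_mb(downloaded) + to_mb(generated)) / 1000.0, 2)


# Figures listed for unshuffled_original_ka.
print(total_disk_gb("680.74 MB", "3.77 GB"))  # prints 4.45
```

Because the listed sizes are themselves rounded to two decimals, a recomputed total can differ from the listed one by about 0.01 for some configurations.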

unshuffled_original_kk

  • Size of downloaded dataset files: 615.06 MB
  • Size of the generated dataset: 2.83 GB
  • Total amount of disk used: 3.45 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Түлкібас ауданында «Латын негізді әліпби мен емле ережесі туралы насихат» жобасының тобы семинар өткізді\\nЕлорданың «Қазақстан»..."
}

unshuffled_original_km

  • Size of downloaded dataset files: 193.28 MB
  • Size of the generated dataset: 1.10 GB
  • Total amount of disk used: 1.30 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ខ្សឹបដាក់ត្រចៀក៖ លោក សួស សុផានិត នាយផ្នែករដ្ឋបាលព្រៃឈើ ស្រុកភ្នំក្រវាញ់ ដែលទើបឡើងកាន់តំណែងថ្មី បើកដៃឲ្យឈ្នួញ ប្រព្រឹត្តបទល្មើស ..."
}

unshuffled_original_kn

  • Size of downloaded dataset files: 342.15 MB
  • Size of the generated dataset: 1.76 GB
  • Total amount of disk used: 2.11 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ರಾಷ್ಟ್ರಪತಿ ಪ್ರಣಬ್ ಮುಖರ್ಜಿಯಿಂದ ಪದ್ಮ ಪ್ರಶಸ್ತಿ ಪ್ರದಾನ | President Pranab Mukherjee Confers Padma Awards | Photo Gallery on Kannada..."
}

unshuffled_original_ko

  • Size of downloaded dataset files: 8.81 GB
  • Size of the generated dataset: 25.29 GB
  • Total amount of disk used: 34.10 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"CIA 프로젝트에서는 데이터베이스로 들어오는 요청을 중간에 수집(Sniffing)하고 수집한 데이터를 분석(Parsing)하여 그로 인한 결과를 판단하여 알릴 수 있는 시스템(Push Service)이 필요하다. 그리고 연구를 ..."
}

unshuffled_original_krc

  • Size of downloaded dataset files: 0.66 MB
  • Size of the generated dataset: 2.68 MB
  • Total amount of disk used: 3.34 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Шамханланы, Бийлени къаршысына ябушуп, Батыр уланларыбызны къоллары булан «ортакъ ожакъ» къургъанбыз. Шо иш уллу зараллы иш бол..."
}

unshuffled_original_ku

  • Size of downloaded dataset files: 33.38 MB
  • Size of the generated dataset: 99.06 MB
  • Total amount of disk used: 132.44 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Me di 114 bernameyên xwe yên berê da perçeyên ji berhemên zanyarî yên kurdzanên mezin bi wergera kurdî da ...\\nMe di 114 bernam..."
}

unshuffled_original_kv

  • Size of downloaded dataset files: 0.40 MB
  • Size of the generated dataset: 2.38 MB
  • Total amount of disk used: 2.78 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Коми кытшыслӧн ыджытжык тор вӧр увтын куйлӧ, сійӧн и фаунасӧ татӧн аркмӧтӧны вӧрын олісь подаэз. Ассямаӧн лоӧ сія, мый кытшас с..."
}

unshuffled_original_kw

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.04 MB
  • Total amount of disk used: 0.05 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼Pray without ceasing🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏..."
}

unshuffled_original_ky

  • Size of downloaded dataset files: 152.64 MB
  • Size of the generated dataset: 630.79 MB
  • Total amount of disk used: 783.43 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Turmush: Бишкек шаардык кеңешинин кезексиз отурумунда мэрге ишенбөөчүлүк көрсөтүү маселеси каралат, - депутат Т.Сагынов\\nБишкек..."
}

unshuffled_original_la

  • Size of downloaded dataset files: 5.46 MB
  • Size of the generated dataset: 27.80 MB
  • Total amount of disk used: 33.26 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.\\nEcce ego adducam aqua..."
}

unshuffled_original_lb

  • Size of downloaded dataset files: 10.73 MB
  • Size of the generated dataset: 30.60 MB
  • Total amount of disk used: 41.32 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Während dem Gaardefestival \\\"Ambiance Jardins\\\" vum 15. bis de 17. Mee huet den SNJ nees zesumme mam Groupe Animateur en Inform..."
}

unshuffled_original_lez

  • Size of downloaded dataset files: 0.83 MB
  • Size of the generated dataset: 3.38 MB
  • Total amount of disk used: 4.20 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Ахцегь хуьр, виридалай ч1ехи лезги хуьрерикая я. Ам Урусатдин виридалай къиблепатавай хуьрерикай я. Ин хуьр...\"..."
}

unshuffled_original_li

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.03 MB
  • Total amount of disk used: 0.04 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"'t Good Goedenraad aan de Ezerbaek besjteit oet 'n kesjtièl mèt gesjlote haof en 'n park van 26 hectare. Hie in sjtoon väól beu..."
}

unshuffled_original_lmo

  • Size of downloaded dataset files: 0.10 MB
  • Size of the generated dataset: 0.47 MB
  • Total amount of disk used: 0.58 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Serét (en tortonés: Sregh; en piemontés: Srèj) l'è 'n cümü italià, de la regiù del Piemónt, en Pruvìncia de Alessandria. El g'h..."
}

unshuffled_original_lo

  • Size of downloaded dataset files: 33.92 MB
  • Size of the generated dataset: 182.36 MB
  • Total amount of disk used: 216.28 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"ຜູ້ພິພາກສາ ປະຈຳເຂດ ສຫລ ທ່ານນຶ່ງ ຕັດສິນວ່າ ໂຄງການເກັບກຳຂໍ້ມູນ ທາງໂທລະສັບ ຂອງອົງການ ຄວາມໝັ້ນຄົງແຫ່ງຊາດ ແມ່ນຖືກຕ້ອງ ຕາມກົດໝາຍ.\\nກະ..."
}

unshuffled_original_lrc

  • Size of downloaded dataset files: 0.02 MB
  • Size of the generated dataset: 0.07 MB
  • Total amount of disk used: 0.09 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"آرلینگتون یئ گئل د شأریا ڤولاتچە ڤیرجینیا و یئ گئل د شأریا ڤولات ڤولاتچە یا یأکاگئرئتە ئمریکاە. ئی شأر دویومی کألوٙن شأر د راسا..."
}

unshuffled_original_lt

  • Size of downloaded dataset files: 3.44 GB
  • Size of the generated dataset: 9.45 GB
  • Total amount of disk used: 12.89 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Čir vir vir pavasaris! Čia čia čia… dalinamės labai simpatiška video pamokėle, kurią pristato ab888art galerija.\\nBe galo papra..."
}

unshuffled_original_lv

  • Size of downloaded dataset files: 1.49 GB
  • Size of the generated dataset: 4.27 GB
  • Total amount of disk used: 5.75 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Dekoratīvi sliekšņi MITSUBISHI OUTLANDER 2007, izgatavoti no ovālas formas, pulētas nerūsējošā tērauda caurules...\\ndažādas tūn..."
}

unshuffled_original_mai

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.33 MB
  • Total amount of disk used: 0.34 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"१ · २ · ३ · ४ · ५ · ६ · ७ · ८ · ९ · १० · ११ · १२ · १३ · १४ · १५ · १६ · १७ · १८ · १९ · २० · २१ · २२ · २३ · २४ · २५ · २६ · २७ · २..."
}

unshuffled_original_mg

  • Size of downloaded dataset files: 6.22 MB
  • Size of the generated dataset: 21.79 MB
  • Total amount of disk used: 28.01 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Nanamboatra taratasy apetaka sy soso-kevitra ho an'ny olona te-hanatevin-daharana ity fihetsiketsehana ity i Anocrena.\\nNosorat..."
}

unshuffled_original_mhr

  • Size of downloaded dataset files: 1.84 MB
  • Size of the generated dataset: 7.55 MB
  • Total amount of disk used: 9.38 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Акрет жап годым Уганда кундемым Пигмей племена- влак айлен шогеныт. мемнан эран 1 курым гыч Банту племена влакат тиде кундемышк..."
}

unshuffled_original_min

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.63 MB
  • Total amount of disk used: 0.64 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ‏‏‎ ..."
}

unshuffled_original_mk

  • Size of downloaded dataset files: 508.24 MB
  • Size of the generated dataset: 2.20 GB
  • Total amount of disk used: 2.71 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"„Филм плус“ е насловен првиот филмски месечник во Македонија, чиј прв број ќе биде промовиран вечер во „Менада“. Новото македон..."
}

unshuffled_original_ml

  • Size of downloaded dataset files: 938.69 MB
  • Size of the generated dataset: 5.24 GB
  • Total amount of disk used: 6.18 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"സ്ത്രീ പ്രവേശനം സര്‍ക്കാര്‍ പൂര്‍ണമായും അംഗീകരിക്കുന്നുവെന്നും ശബരിമലയുടെ സുരക്ഷയില്‍ ഇടപെടുമെന്നും സര്‍ക്കാര്‍ ഹൈക്കോടതിയില്‍\\..."
}

unshuffled_original_mn

  • Size of downloaded dataset files: 472.36 MB
  • Size of the generated dataset: 2.33 GB
  • Total amount of disk used: 2.81 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Монгол улс, Улаанбаатар хот - 14191 Энхтайваны өргөн чөлөө - 10, Багш хөгжлийн ордон, Багшийн мэргэжил дээшлүүлэх институт\\nБаг..."
}

unshuffled_original_mr

  • Size of downloaded dataset files: 525.31 MB
  • Size of the generated dataset: 2.82 GB
  • Total amount of disk used: 3.34 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Home / motivational marathi story / उद्योजकता (Entrepreneurship) / यांना हे जमलय, तर आपल्याला का नाही जमणार ?\\nयापैकी कोणाचीही ..."
}

unshuffled_original_mrj

  • Size of downloaded dataset files: 0.30 MB
  • Size of the generated dataset: 1.16 MB
  • Total amount of disk used: 1.47 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Лӹпӹвлӓ (латинлӓ Lepidoptera ; алыкмарла лыве-влак) — капшангывлӓ йыхыш пырышы сӱмӓн нӹл шылдыран капшангывлӓ. Цилӓжӹ 180000 тӹ..."
}

unshuffled_original_ms

  • Size of downloaded dataset files: 28.46 MB
  • Size of the generated dataset: 122.33 MB
  • Total amount of disk used: 150.79 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Sanad pertama daripada Zuhair bin Harb daripada ‘Affan daripada Hammad daripada Thabit daripada Anas.\\nSanad kedua daripada ‘Ab..."
}

unshuffled_original_mt

  • Size of downloaded dataset files: 7.53 MB
  • Size of the generated dataset: 24.47 MB
  • Total amount of disk used: 32.00 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "tibgħat il-kawża lura lill-Qorti Ġenerali għall-annullament jew għat-tnaqqis tal-penalità imposta mill-Kummissjoni bid-deċiżjoni inizjali kif emendata bid-deċiżjoni ta’ rettifika;"
}

unshuffled_original_mwl

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Deciplina social i outónoma que angloba atebidades de ouserbaçon, de análeze, de çcriçon, cumparaçon, de sistematizaçon i de sp..."
}

unshuffled_original_my

  • Size of downloaded dataset files: 369.85 MB
  • Size of the generated dataset: 2.02 GB
  • Total amount of disk used: 2.39 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ျမ၀တီ - ရန္ကုန္တိုင္းေဒသႀကီး ေျမာက္ဥကၠလာပႏွင္႕ ဗဟန္းၿမိဳ႔နယ္ မေကြးတိုင္း ေဒသႀကီး ပခုကၠဴၿမိဳ႔နယ္တို႔၌ ျမန္မာ႕တပ္မေတာ္အား ေထာက္ခံ..."
}

unshuffled_original_myv

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"2018 иень умарьковонь 6-це чистэ сась паро куля! Россиянь культурань Министерствась макссь невтемань конёв (прокатной удостовер..."
}

unshuffled_original_mzn

  • Size of downloaded dataset files: 0.18 MB
  • Size of the generated dataset: 0.72 MB
  • Total amount of disk used: 0.90 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"قرآن یا قوران اسلام ِآسمونی کتاب هسته. مسلمونون گانّّه قرآن ره خدا، وحی جه برسنی‌یه، «محمد معجزه» هسته و ثقلین حدیث دله ونه خَو..."
}

unshuffled_original_nah

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.01 MB
  • Total amount of disk used: 0.01 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "In mācuīlpōhualxihuitl VI (inic chicuacē) in mācuīlpōhualli xiuhitl cāhuitl īhuīcpa 501 xihuitl oc 600 xihuitl."
}

unshuffled_original_nap

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.02 MB
  • Total amount of disk used: 0.02 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ò AUDIT í Ç è î ÿ å å 30 ò ÿ ÿ é, õ ñ ì ÿ, ê ã- ò à ì. å â å í ç â à à é ñ è å é ó ó ë. å å å û è å î é è à. à è à AUDIT 1-7 â ..."
}

unshuffled_original_nds

  • Size of downloaded dataset files: 6.74 MB
  • Size of the generated dataset: 18.23 MB
  • Total amount of disk used: 24.99 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Dor kann sik vun nu af an de hele plattdüütsche Welt – vun Niebüll bit New York, vun Helgoland bit Honolulu – drapen. Allens, w..."
}

unshuffled_original_ne

  • Size of downloaded dataset files: 355.29 MB
  • Size of the generated dataset: 1.87 GB
  • Total amount of disk used: 2.22 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"बर्दिबास नगरपालिकाको तेस्रो नगर परिषदबाट पारित आ.व.२०७३।७४ को संशोधित र २०७४।७५ को प्रस्तावित नीति, कार्यक्रम तथा बजेट\\nअार्थिक..."
}

unshuffled_original_new

  • Size of downloaded dataset files: 1.03 MB
  • Size of the generated dataset: 5.77 MB
  • Total amount of disk used: 6.79 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"थ्व शहरयागु अक्षांश ३४.७००१६४ उत्तर व देशान्तर ८६.३७६४६९ पश्चिम खः (34.700164° N 86.376469° W)। थ्व थासे ७२२६७३२ वर्ग मिटर (२.७..."
}

unshuffled_original_nl

  • Size of downloaded dataset files: 29.35 GB
  • Size of the generated dataset: 83.23 GB
  • Total amount of disk used: 112.58 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Op vrijdag 31 augustus wordt het nieuwe studiejaar van de masteropleiding architectuur geopend met een dagexcursie naar Venlo.\\..."
}

unshuffled_original_nn

  • Size of downloaded dataset files: 32.86 MB
  • Size of the generated dataset: 90.84 MB
  • Total amount of disk used: 123.70 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "Planomtale krav til innhald Bakgrunn: Spørsmål frå fleire kommunar om kva ein planomtale/planbeskrivelse bør innehalde Fylkeskommunen og fylkesmannen har i ein del saker reist motsegn på formelt grunnlag"
}

unshuffled_original_no

  • Size of downloaded dataset files: 3.11 GB
  • Size of the generated dataset: 8.65 GB
  • Total amount of disk used: 11.76 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Ytterligere aktører i primærhelsetjenesten og andre NHS-virksomheter ble infisert, inkludert legekontor.Læreren vår er så attra..."
}

unshuffled_original_oc

  • Size of downloaded dataset files: 1.57 MB
  • Size of the generated dataset: 6.12 MB
  • Total amount of disk used: 7.71 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": ".рф (rf, còdi punycode: .xn--p1ai)[1] es lo nom de domeni en rus per Russia. Foguèt activat lo 12 de mai de 2010. Lo còdi latin es .ru."
}

unshuffled_original_or

  • Size of downloaded dataset files: 49.84 MB
  • Size of the generated dataset: 260.15 MB
  • Total amount of disk used: 309.99 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ଭୁବନେଶ୍ୱର, ୨୭/୧– (ଓଡ଼ିଆ ପୁଅ) ସିପିଆଇ ଜାତୀୟ ପରିଷଦର ଆହ୍ୱାନକ୍ରମେ ଗତକାଲି ଜାନୁୟାରୀ ୨୬ ସାଧାରଣତନ୍ତ୍ର ଦିବସକୁ ଦେଶ ବ୍ୟାପୀ ସମ୍ବିଧାନ ସୁରକ୍ଷା ..."
}

unshuffled_original_os

  • Size of downloaded dataset files: 3.09 MB
  • Size of the generated dataset: 12.90 MB
  • Total amount of disk used: 15.99 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"1. Лæппу æмæ чызг казрæдзийы зæрдæмæ куы фæцæуынц æмæ, куы сфæнд кæнынц сæ цард баиу кæнын, уæд лæппу бар ракуры чызгæй, цæмæй ..."
}

unshuffled_original_pa

  • Size of downloaded dataset files: 164.21 MB
  • Size of the generated dataset: 801.16 MB
  • Total amount of disk used: 965.37 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ਰਜਿ: ਨੰ: PB/JL-138/2018-20 ਜਿਲਦ 63, ਬਾਨੀ ਸੰਪਾਦਕ (ਸਵ:) ਡਾ: ਸਾਧੂ ਸਿੰਘ ਹਮਦਰਦ ਫ਼ੋਨ : 0181-2455961-62-63, 5032400, ਫੈਕਸ : 2455960, 2..."
}

unshuffled_original_pam

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Áku pu i Anak ning Aláya at ngeni ipákit kó kékayu ngan nûng makanánu lang susúlat détinang kulit a mágkas. Lauan ya ing tarátu..."
}

unshuffled_original_pl

  • Size of downloaded dataset files: 42.88 GB
  • Size of the generated dataset: 117.12 GB
  • Total amount of disk used: 160.01 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"System informatyczny - Załącznik nr 1 do zarządzenia Wójta Gminy Podegrodzie Nr 530/2013 z dnia 27 maja 2013 r\\nSystem informat..."
}

unshuffled_original_pms

  • Size of downloaded dataset files: 0.75 MB
  • Size of the generated dataset: 2.15 MB
  • Total amount of disk used: 2.92 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Louvigné-du-Désert a l'é na comun-a fransèisa ant la region aministrativa dla Brëtagna, ant ël dipartiment d'Ille-et-Vilaine. A..."
}

unshuffled_original_pnb

  • Size of downloaded dataset files: 3.22 MB
  • Size of the generated dataset: 12.04 MB
  • Total amount of disk used: 15.26 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"ایہ فائل Wikimedia Commons توں اے تے دوجیاں ویونتاں تے وی ورتی جاےکدی اے۔ گل بات اس دے فائل گل بات صفہ تے تھلے دتی گئی۔\"..."
}

unshuffled_original_ps

  • Size of downloaded dataset files: 103.66 MB
  • Size of the generated dataset: 379.51 MB
  • Total amount of disk used: 483.17 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Many people usually use the time period ‘business to business (B2B) advertising,’ however most of them do not know precisely wh..."
}

unshuffled_original_pt

  • Size of downloaded dataset files: 47.26 GB
  • Size of the generated dataset: 132.64 GB
  • Total amount of disk used: 179.89 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Você pode estar lendo este texto no sofá, levantar pra pegar uma breja na geladeira, dar uma cagada e sentar novamente, sem int..."
}

unshuffled_original_qu

  • Size of downloaded dataset files: 0.02 MB
  • Size of the generated dataset: 0.08 MB
  • Total amount of disk used: 0.10 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Warayu wichay (kastilla simipi: Ascensión de Guarayos) nisqaqa Buliwya mama llaqtapi, Santa Krus suyupi, huk llaqtam, Warayu pruwinsyap uma llaqtanmi."
}

unshuffled_original_rm

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.01 MB
  • Total amount of disk used: 0.01 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"practicists agrars / practicistas agraras AFP pon far ina furmaziun da basa scursanida per cuntanscher in attestat federal da q..."
}

unshuffled_original_ro

  • Size of downloaded dataset files: 9.53 GB
  • Size of the generated dataset: 26.87 GB
  • Total amount of disk used: 36.40 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"“În viață, oportunitatea nu este totul. Cine atrage Lumina, cineva bun în umbră. Timpul ne creează.” maestru\\nLyn.Evans: Ce mar..."
}

unshuffled_original_ru

  • Size of downloaded dataset files: 319.76 GB
  • Size of the generated dataset: 1241.63 GB
  • Total amount of disk used: 1561.38 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Доступ к данному профилю для публичного просмотра закрыт администрацией сайта - профиль находится на модерации.\\nРазработчикам ..."
}

unshuffled_original_sa

  • Size of downloaded dataset files: 17.52 MB
  • Size of the generated dataset: 97.06 MB
  • Total amount of disk used: 114.58 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"अनिरुद्धनगरे क्रीडिता रामलीला सम्‍प्रति समाप्‍ता अस्ति । तस्‍य कानिचन् चित्राणि पूर्वमेव प्रकाशितानि सन्ति । द्वौ चलचित्रौ अपि ..."
}

unshuffled_original_sah

  • Size of downloaded dataset files: 9.08 MB
  • Size of the generated dataset: 43.82 MB
  • Total amount of disk used: 52.90 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████..."
}

unshuffled_original_scn

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

{
    "id": 0,
    "text": "La gilusìa è nu sintimentu dulurusu ca nasci d'un disideriu di pussessu sclusivu ntê cunfrunti dâ pirsuna amata e dû timuri, dû suspettu o dâ cirtizza dâ sò nfidiltati."
}

unshuffled_original_sd

  • Size of downloaded dataset files: 90.62 MB
  • Size of the generated dataset: 364.25 MB
  • Total amount of disk used: 454.88 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"هر ڪو ڄاڻي ٿو ته جڏهن توهان هڪ وڏي خريد ڪرڻ چاهيون ٿا, توهان پڄي ضروري حڪم ۾ ان جي ڪم ڪرڻ جي هٿ ۾ لاڳاپو ڪيو آهي. جي شيء آهي ته..."
}

unshuffled_original_sh

  • Size of downloaded dataset files: 3.46 MB
  • Size of the generated dataset: 25.84 MB
  • Total amount of disk used: 29.30 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Opština Gornja Radgona se nalazi u sjeveroistočnoj Sloveniji i graniči s susjednom Austriji duž rijeke Mure. Sa tridesetim nase..."
}

unshuffled_original_si

  • Size of downloaded dataset files: 310.93 MB
  • Size of the generated dataset: 1.47 GB
  • Total amount of disk used: 1.78 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"ලාංකීය සිතිවිලි සිංහල බ්ලොග් කියවනය කොත්තු සින්ඩිය ලංකා Blogger හත්මාළුව ලංකා බ්ලොග් කියවනය මාතලන්ගේ සින්ඩිය මොබයිල්lk\\nඅවකාශය ..."
}

unshuffled_original_sk

  • Size of downloaded dataset files: 3.71 GB
  • Size of the generated dataset: 9.81 GB
  • Total amount of disk used: 13.52 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Aktivity | Agentúra podporovaného zamestnávania | vzdelávanie pre klientov, vzdelávanie pre odborníkov, kurzy\\nŠpecializované k..."
}

unshuffled_original_sl

  • Size of downloaded dataset files: 956.20 MB
  • Size of the generated dataset: 2.68 GB
  • Total amount of disk used: 3.63 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Če Creatures, ki je želel, da pridejo na čas, predvsem je povedlo – razlikuje od ljubosumja začel grizenja kolen (ali zadnjica)..."
}

unshuffled_original_so

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.06 MB
  • Total amount of disk used: 0.06 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт ттттттттттттттттуууууууууууу..."
}

unshuffled_original_sq

  • Size of downloaded dataset files: 861.84 MB
  • Size of the generated dataset: 2.44 GB
  • Total amount of disk used: 3.30 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Çfarë do të më pëlqente tek një femër ose çfarë do të më shndërronte në një shpërthim drite? – Albert Vataj\\nTë gjithëve një zo..."
}

unshuffled_original_sr

  • Size of downloaded dataset files: 1.08 GB
  • Size of the generated dataset: 4.13 GB
  • Total amount of disk used: 5.21 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Корисни савети за сваки дан. На сајту су разне категорије, као што су љепота, мода, кување и поправка властитим рукама.\\nШколск..."
}

unshuffled_original_su

  • Size of downloaded dataset files: 0.06 MB
  • Size of the generated dataset: 0.23 MB
  • Total amount of disk used: 0.28 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Kartu krédit nyaéta \"duit plastik\" anu dikaluarkeun ku bank pikeun alat pambayaran di tempat-tempat nu tangtu samisal jiga di hotél, réstoran, tempat rékréasi jeung sajabana.[1]"
}

unshuffled_original_sv

  • Size of downloaded dataset files: 17.18 GB
  • Size of the generated dataset: 47.00 GB
  • Total amount of disk used: 64.18 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"1783 är ett viktigt årtal i den nya tidens historia. Det året slöts en fred i Paris och därmed blev de 13 brittiska kolonierna ..."
}

unshuffled_original_sw

  • Size of downloaded dataset files: 3.71 MB
  • Size of the generated dataset: 14.07 MB
  • Total amount of disk used: 17.78 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki chache tu kabla ya Papa Francis kuanza ziara yake katika nchi hiyo yenye idadi kubwa kabisa ya watu katika ulimwengu wa nchi za Kiarabu."
}

unshuffled_original_ta

  • Size of downloaded dataset files: 1.74 GB
  • Size of the generated dataset: 9.93 GB
  • Total amount of disk used: 11.67 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"பொழுது சாய்ந்து வெகு நேரமாகிவிட்டது. கூலி வேலைக்குப் போயிருந்த 'சித்தாள் ' பெண்கள் எல்லோரும் வீடு திரும்பி விட்டார்கள். இன்னும்..."
}

unshuffled_original_te

  • Size of downloaded dataset files: 522.47 MB
  • Size of the generated dataset: 2.61 GB
  • Total amount of disk used: 3.13 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"హర్యానాలో టోల్ దగ్గర సిబ్బంది.. స్థానిక ప్రజలు కొట్టుకున్నారు. కర్నాల్ అనే గ్రామానికి సమీపంలో టోల్ గేట్ ఉంది. అయితే సాధారణంగా స..."
}

unshuffled_original_tg

  • Size of downloaded dataset files: 90.97 MB
  • Size of the generated dataset: 397.43 MB
  • Total amount of disk used: 488.41 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Ҳумайро гуфтааст, мухолифи низом аст, низоме, ки дар Тоҷикистон вуҷуд дорад. Ба ин маънӣ, худро мухолифи давлату ҳукумати Тоҷик..."
}

unshuffled_original_th

  • Size of downloaded dataset files: 7.38 GB
  • Size of the generated dataset: 38.29 GB
  • Total amount of disk used: 45.67 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ฟันที่แลดูขาวสะอาดไม่มีเศษอาหารติดอยู่ เหงือกสีชมพู ไม่เจ็บ หรือมีเลือดออกเวลาแปรงฟันหรือขัดฟัน ไม่มีปัญหาเรื่องกลิ่นปาก ทำให้ก..."
}

unshuffled_original_tk

  • Size of downloaded dataset files: 2.96 MB
  • Size of the generated dataset: 10.66 MB
  • Total amount of disk used: 13.62 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"Türkmenistanyň Prezidenti agyr atletika boýunça dünýä çempionatyna taýýarlyk işleriniň barşy bilen tanyşdy\\nHalallykdan kemal t..."
}

unshuffled_original_tl

  • Size of downloaded dataset files: 204.89 MB
  • Size of the generated dataset: 606.30 MB
  • Total amount of disk used: 811.19 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"“Gusto ko manawagan sa mga Unit Head ng Chanel 2 Salve. Kasi napapansin ko iyon mga alaga ko ang taping halos once a week lang,..."
}

unshuffled_original_tr

  • Size of downloaded dataset files: 21.96 GB
  • Size of the generated dataset: 63.58 GB
  • Total amount of disk used: 85.54 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Son yıllarda görülen ay tutulmalarına göre daha etkili olacağı söylenen Kanlı veya Kırmızı Ay Tutulmasına saatler kaldı. Bu akş..."
}

unshuffled_original_tt

  • Size of downloaded dataset files: 151.06 MB
  • Size of the generated dataset: 703.42 MB
  • Total amount of disk used: 854.47 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"\\\"Иремнең вафатына 40 көн узгач, Алмаз да безнең өйгә кереп үлде\\\". Арчада 35 яшьлек ир өстенә кондызлар ега башлаган агач төшк..."
}

unshuffled_original_tyv

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.01 MB
  • Total amount of disk used: 0.01 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Экии, хүндүлуг аалчылар болгаш тыва дылдың деткикчилери! Тыва дылдың болгаш чогаалдың ховар бир башкызынга, Менги Ооржакка, ажы..."
}

unshuffled_original_ug

  • Size of downloaded dataset files: 27.92 MB
  • Size of the generated dataset: 127.42 MB
  • Total amount of disk used: 155.35 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"زاڭ-ءتۇزىم | عىلىم-تەحنيكا | ءتىل-ادەبيەت | تۇرمىس | دەنە تاربيە | ساياحات-ورتا | سۋرەتتى حابار | سىر سۇحبات | ارناۋلى تاقىرىپ ..."
}

unshuffled_original_uk

  • Size of downloaded dataset files: 14.42 GB
  • Size of the generated dataset: 56.44 GB
  • Total amount of disk used: 70.86 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Про надання роз'яснення (щодо форми письмового зобов'язання громадян про зворотне ввезення/вивезення товарів), Державна митна с..."
}

unshuffled_original_ur

  • Size of downloaded dataset files: 712.61 MB
  • Size of the generated dataset: 2.80 GB
  • Total amount of disk used: 3.51 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"آئیے اہم اسلامی کتب کو یونیکوڈ میں انٹرنیٹ پر پیش کرنے کے لئے مل جل کر آن لائن ٹائپنگ کریں۔ محدث ٹائپنگ پراجیکٹ کے ذریعے آپ روز..."
}

unshuffled_original_uz

  • Size of downloaded dataset files: 5.78 MB
  • Size of the generated dataset: 21.46 MB
  • Total amount of disk used: 27.24 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Qurama tog'lari tizmasining Toshkentdan 154 km uzoqlikdagi Toshkent-Ush yo'li yeqasidaxushmanzara tabiat qo'ynida joylashgan maydoni 30 ga.\nBolalarni sog'lomlashtirish oromgohi Bo'stonliq tumani Oqtosh muntaqasining soy-salqin gushasida joylashgan."
}

unshuffled_original_vec

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.02 MB
  • Total amount of disk used: 0.03 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Par ogni pónto, ła derivada ła xe ła pendensa de ła reta tangente a ła curva de ła funsion f. Ła reta de cołor róso l'è senpre ..."
}

unshuffled_original_vi

  • Size of downloaded dataset files: 21.50 GB
  • Size of the generated dataset: 72.23 GB
  • Total amount of disk used: 93.73 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Canh chua cá bông lau không chỉ là món ăn giải nhiệt, thanh mát ngày hè mà còn là món siêu bổ dưỡng, rất tốt cho người gầy ốm. ..."
}

unshuffled_original_vo

  • Size of downloaded dataset files: 0.30 MB
  • Size of the generated dataset: 2.12 MB
  • Total amount of disk used: 2.42 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Sarniguet binon zif in ziläk: Hautes-Pyrénées, in topäd: Midi-Pyrénées, in Fransän. Sarniguet topon videtü 43°19’ 7’’ N e lunetü 0°5’ 19’’ L."
}

unshuffled_original_wa

  • Size of downloaded dataset files: 0.09 MB
  • Size of the generated dataset: 0.29 MB
  • Total amount of disk used: 0.38 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "Cisse pådje ci n' est co k' on djermon, dj' ô bén k' el pådje est djusse sibåtcheye, eyet co trop tene; et s' divreut ele ecråxhî ene miete."
}

unshuffled_original_war

  • Size of downloaded dataset files: 0.64 MB
  • Size of the generated dataset: 2.68 MB
  • Total amount of disk used: 3.32 MB

An example of 'train' looks as follows.

{
    "id": 1,
    "text": "An Honce amo in usa ka baryo ngan munisipalidad ha distrito han Rožňava ha rehiyon han Košice ha nasod han Slovakia.\nAn Rumegies amo in usa ka komyun ha departamento han Nord ngan ha rehiyon han Nord-Pas-de-Calais ha nasod han Fransya."
}

unshuffled_original_wuu

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.12 MB
  • Total amount of disk used: 0.13 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"伊春元旦天气 伊春腊八天气 伊春春节天气 伊春情人节天气 伊春元宵节天气 伊春愚人节天气 伊春清明节天气 伊春劳动节天气 伊春母亲节天气 伊春端午节天气 伊春七夕节天气 伊春教师节天气 伊春中秋节天气 伊春国庆节天气 伊春重阳节天气 伊春万圣节天气 伊春..."
}

unshuffled_original_xal

  • Size of downloaded dataset files: 0.03 MB
  • Size of the generated dataset: 0.12 MB
  • Total amount of disk used: 0.15 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Арнгудин Орн гисн Европд бәәдг һазр. 2007 җилин тooһaр эн орн нутгт 3,600,523 әмтн бәәдг билә. Арнгудин Орнин хотл балһсна нерн..."
}

unshuffled_original_xmf

  • Size of downloaded dataset files: 1.05 MB
  • Size of the generated dataset: 6.12 MB
  • Total amount of disk used: 7.17 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"მოჩამილი ტექსტი წჷმორინელი რე Creative Commons Attribution-ShareAlike ლიცენზიათ; შილებე გეძინელი პირობეფიშ არსებუა. კილიშკილიშა..."
}

unshuffled_original_yi

  • Size of downloaded dataset files: 33.33 MB
  • Size of the generated dataset: 147.60 MB
  • Total amount of disk used: 180.94 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"ממשותדיק - חבֿרה, איך אַרבעט איצט אױף אַ זשורנאַל. טאָמער איר האָט עפּעס צוצוגעבן זאָלט איר שיקן מיר אַן אָנזאָג. ס'װעט הײסן \\\"..."
}

unshuffled_original_yo

  • Size of downloaded dataset files: 0.01 MB
  • Size of the generated dataset: 0.06 MB
  • Total amount of disk used: 0.06 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 0,
    "text": "\"Copyright © 2018 BBC. BBC kò mọ̀ nípa àwọn ohun tí ó wà ní àwọn ojú òpó tí ó wà ní ìta. Ọwọ́ tí a fi mú ìbáṣepọ̀ ti ìta.\"..."
}

unshuffled_original_yue

  • Size of downloaded dataset files: 0.00 MB
  • Size of the generated dataset: 0.00 MB
  • Total amount of disk used: 0.00 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 你還不爆 我累了 投降輸一半可以嗎\"..."
}

unshuffled_original_zh

  • Size of downloaded dataset files: 206.00 GB
  • Size of the generated dataset: 545.61 GB
  • Total amount of disk used: 751.61 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": 1,
    "text": "\"中国铝灰网 中国有色金属矿产网 中国黄莲网 中国水轮发电机网 中国抽油泵网 中国数控雕刻机网 中国不锈钢抛光网 中国磨具加工网 中国压铸铝网 中国耐水腻子网 中国手机摄像头网 中国粗粮网 中国车门锁网 中国钛粉网 中国轮圈网\\n天天中奖彩票图 天天中彩票..."
}

Data Fields

The data fields are the same among all configs.

  • id: an int64 feature.
  • text: a string feature.
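
The configuration names listed above follow the pattern `unshuffled_<variant>_<language code>`, and every record carries the two fields just described. As a minimal sketch, assuming two hypothetical helper names (`config_name` and `is_valid_record`, neither of which belongs to any library), one could build a configuration name and check a record's shape like this:

```python
def config_name(lang_code: str, deduplicated: bool = True) -> str:
    """Build a configuration name such as 'unshuffled_deduplicated_af'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang_code}"


def is_valid_record(record: dict) -> bool:
    """Return True if the record has an integer 'id' and a string 'text'."""
    record_id = record.get("id")
    return (
        isinstance(record_id, int)
        and not isinstance(record_id, bool)  # bool is a subclass of int
        and isinstance(record.get("text"), str)
    )


example = {"id": 1, "text": "Sarniguet binon zif in ziläk: Hautes-Pyrénées"}
print(config_name("vo", deduplicated=False))  # unshuffled_original_vo
print(is_valid_record(example))               # True
```

With the Hugging Face `datasets` library, a name built this way is what would typically be passed as the configuration argument of `load_dataset("oscar", ...)`; whether the loading script still runs depends on the installed library version.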

Data Splits

Click to expand the number of samples per configuration
Language Language code Name original Train original Words original Size original Name deduplicated Train deduplicated Words deduplicated Size deduplicated
Afrikaans af unshuffled_original_af 201117 43,482,801 241M unshuffled_deduplicated_af 130640 29,533,437 163M
Albanian sq unshuffled_original_sq 672077 374,196,110 2.3G unshuffled_deduplicated_sq 461598 186,856,699 1.2G
Alemannic als unshuffled_original_als 7324 841,750 5.0M unshuffled_deduplicated_als 4518 459,001 2.8M
Amharic am unshuffled_original_am 83663 28,301,601 360M unshuffled_deduplicated_am 43102 16,086,628 206M
Arabic ar unshuffled_original_ar 16365602 8,117,162,828 82G unshuffled_deduplicated_ar 9006977 3,171,221,354 32G
Aragonese an unshuffled_original_an 2449 52,896 1.3M unshuffled_deduplicated_an 2025 45,669 801K
Armenian hy unshuffled_original_hy 659430 273,919,388 3.7G unshuffled_deduplicated_hy 396093 110,196,043 1.5G
Assamese as unshuffled_original_as 14985 6,956,663 113M unshuffled_deduplicated_as 9212 4,366,570 71M
Asturian ast unshuffled_original_ast 6999 381,005 2.4M unshuffled_deduplicated_ast 5343 325,237 2.0M
Avaric av unshuffled_original_av 456 24,720 409K unshuffled_deduplicated_av 360 19,478 324K
Azerbaijani az unshuffled_original_az 912330 322,641,710 2.8G unshuffled_deduplicated_az 626796 167,742,296 1.5G
Bashkir ba unshuffled_original_ba 42551 9,796,764 128M unshuffled_deduplicated_ba 27050 6,922,589 90M
Basque eu unshuffled_original_eu 506883 120,456,652 848M unshuffled_deduplicated_eu 256513 45,359,710 342M
Bavarian bar unshuffled_original_bar 4 399 503 unshuffled_deduplicated_bar 4 399 503
Belarusian be unshuffled_original_be 586031 144,579,630 1.8G unshuffled_deduplicated_be 307405 83,499,037 1.1G
Bengali bn unshuffled_original_bn 1675515 623,575,733 11G unshuffled_deduplicated_bn 1114481 363,766,143 5.8G
Bihari bh unshuffled_original_bh 336 8,848 110K unshuffled_deduplicated_bh 82 2,875 34K
Bishnupriya bpy unshuffled_original_bpy 6046 198,286 4.1M unshuffled_deduplicated_bpy 1770 96,940 1.7M
Bosnian bs unshuffled_original_bs 2143 106,448 447K unshuffled_deduplicated_bs 702 20,485 116K
Breton br unshuffled_original_br 37085 5,013,241 29M unshuffled_deduplicated_br 14724 2,890,384 16M
Bulgarian bg unshuffled_original_bg 5869686 2,947,648,106 32G unshuffled_deduplicated_bg 3398679 1,268,114,977 14G
Burmese my unshuffled_original_my 232329 56,111,184 1.9G unshuffled_deduplicated_my 136639 30,102,173 1.1G
Catalan ca unshuffled_original_ca 4390754 1,360,212,450 8.0G unshuffled_deduplicated_ca 2458067 729,333,440 4.3G
Cebuano ceb unshuffled_original_ceb 56248 6,603,567 39M unshuffled_deduplicated_ceb 26145 3,675,024 24M
Central Bikol bcl unshuffled_original_bcl 1 312 885 unshuffled_deduplicated_bcl 1 312 885
Central Khmer km unshuffled_original_km 159363 20,690,610 1.1G unshuffled_deduplicated_km 108346 10,082,245 581M
Central Kurdish ckb unshuffled_original_ckb 103639 48,478,334 487M unshuffled_deduplicated_ckb 68210 18,726,721 226M
Chavacano cbk unshuffled_original_cbk 1 130 520 unshuffled_deduplicated_cbk 1 130 520
Chechen ce unshuffled_original_ce 4042 711,051 8.3M unshuffled_deduplicated_ce 2984 568,146 6.7M
Chinese zh unshuffled_original_zh 60137667 14,986,424,850 508G unshuffled_deduplicated_zh 41708901 6,350,215,113 249G
Chuvash cv unshuffled_original_cv 20281 3,041,614 39M unshuffled_deduplicated_cv 10130 2,054,810 26M
Cornish kw unshuffled_original_kw 203 8,329 44K unshuffled_deduplicated_kw 68 2,704 14K
Croatian hr unshuffled_original_hr 582219 34,232,765 226M unshuffled_deduplicated_hr 321484 16,727,640 110M
Czech cs unshuffled_original_cs 21001388 7,715,977,441 53G unshuffled_deduplicated_cs 12308039 3,540,997,509 24G
Danish da unshuffled_original_da 7664010 2,637,463,889 16G unshuffled_deduplicated_da 4771098 1,620,091,317 9.5G
Dhivehi dv unshuffled_original_dv 21018 7,559,472 126M unshuffled_deduplicated_dv 17024 4,726,660 79M
Dimli diq unshuffled_original_diq 1 19 146 unshuffled_deduplicated_diq 1 19 146
Dutch nl unshuffled_original_nl 34682142 13,020,136,373 78G unshuffled_deduplicated_nl 20812149 6,598,786,137 39G
Eastern Mari mhr unshuffled_original_mhr 3212 565,992 7.2M unshuffled_deduplicated_mhr 2515 469,297 6.0M
Egyptian Arabic arz unshuffled_original_arz 158113 7,305,151 66M unshuffled_deduplicated_arz 79928 3,659,419 33M
Emilian-Romagnol eml unshuffled_original_eml 84 6,376 25K unshuffled_deduplicated_eml 80 6,121 24K
English en unshuffled_original_en 455994980 418,187,793,408 2.3T unshuffled_deduplicated_en 304230423 215,841,256,971 1.2T
Erzya myv unshuffled_original_myv 6 90 1.4K unshuffled_deduplicated_myv 5 78 1.2K
Esperanto eo unshuffled_original_eo 121171 48,486,161 299M unshuffled_deduplicated_eo 84752 37,324,446 228M
Estonian et unshuffled_original_et 2093621 643,163,730 4.8G unshuffled_deduplicated_et 1172041 309,931,463 2.3G
Finnish fi unshuffled_original_fi 8557453 3,196,666,419 27G unshuffled_deduplicated_fi 5326443 1,597,855,468 13G
French fr unshuffled_original_fr 96742378 46,896,036,417 282G unshuffled_deduplicated_fr 59448891 23,206,776,649 138G
Galician gl unshuffled_original_gl 544388 102,011,291 620M unshuffled_deduplicated_gl 284320 63,600,602 384M
Georgian ka unshuffled_original_ka 563916 171,950,621 3.6G unshuffled_deduplicated_ka 372158 91,569,739 1.9G
German de unshuffled_original_de 104913504 44,878,908,446 308G unshuffled_deduplicated_de 62398034 21,529,164,172 145G
Goan Konkani gom unshuffled_original_gom 640 124,277 2.2M unshuffled_deduplicated_gom 484 102,306 1.8M
Guarani gn unshuffled_original_gn 106 7,382 36K unshuffled_deduplicated_gn 68 4,680 24K
Gujarati gu unshuffled_original_gu 240691 72,045,701 1.1G unshuffled_deduplicated_gu 169834 50,023,432 722M
Haitian ht unshuffled_original_ht 13 1,014 3.9K unshuffled_deduplicated_ht 9 832 3.3K
Hebrew he unshuffled_original_he 3808397 2,067,753,528 20G unshuffled_deduplicated_he 2375030 1,032,018,056 9.8G
Hindi hi unshuffled_original_hi 3264660 1,372,234,782 17G unshuffled_deduplicated_hi 1909387 745,774,934 8.9G
Hungarian hu unshuffled_original_hu 11197780 5,163,936,345 40G unshuffled_deduplicated_hu 6582908 2,339,127,555 18G
Icelandic is unshuffled_original_is 625673 219,900,094 1.5G unshuffled_deduplicated_is 389515 129,818,331 846M
Ido io unshuffled_original_io 694 25,702 147K unshuffled_deduplicated_io 617 22,773 130K
Iloko ilo unshuffled_original_ilo 2638 142,942 874K unshuffled_deduplicated_ilo 1578 105,564 636K
Indonesian id unshuffled_original_id 16236463 4,574,692,265 30G unshuffled_deduplicated_id 9948521 2,394,957,629 16G
Interlingua ia unshuffled_original_ia 1040 180,231 662K unshuffled_deduplicated_ia 529 100,019 360K
Interlingue ie unshuffled_original_ie 101 5,352 24K unshuffled_deduplicated_ie 11 602 1.6K
Irish ga unshuffled_original_ga 83223 14,483,593 88M unshuffled_deduplicated_ga 46493 10,017,303 60M
Italian it unshuffled_original_it 46981781 22,248,707,341 137G unshuffled_deduplicated_it 28522082 11,250,012,896 69G
Japanese ja unshuffled_original_ja 62721527 4,962,979,182 216G unshuffled_deduplicated_ja 39496439 1,123,067,063 106G
Javanese jv unshuffled_original_jv 1445 104,896 659K unshuffled_deduplicated_jv 1163 86,654 583K
Kalmyk xal unshuffled_original_xal 39 10,277 113K unshuffled_deduplicated_xal 36 10,155 112K
Kannada kn unshuffled_original_kn 350363 81,186,863 1.7G unshuffled_deduplicated_kn 251064 49,343,462 1.1G
Karachay-Balkar krc unshuffled_original_krc 1581 185,436 2.6M unshuffled_deduplicated_krc 1377 166,496 2.3M
Kazakh kk unshuffled_original_kk 524591 191,126,469 2.7G unshuffled_deduplicated_kk 338073 108,388,743 1.5G
Kirghiz ky unshuffled_original_ky 146993 44,194,823 600M unshuffled_deduplicated_ky 86561 28,982,620 388M
Komi kv unshuffled_original_kv 1549 201,404 2.3M unshuffled_deduplicated_kv 924 95,243 1.2M
Korean ko unshuffled_original_ko 7345075 2,368,765,142 24G unshuffled_deduplicated_ko 3675420 1,120,375,149 12G
Kurdish ku unshuffled_original_ku 46535 15,561,003 94M unshuffled_deduplicated_ku 29054 9,946,440 60M
Lao lo unshuffled_original_lo 52910 4,133,311 174M unshuffled_deduplicated_lo 32652 2,583,342 114M
Latin la unshuffled_original_la 94588 4,122,201 26M unshuffled_deduplicated_la 18808 1,328,038 8.3M
Latvian lv unshuffled_original_lv 1593820 520,761,977 4.0G unshuffled_deduplicated_lv 843195 236,428,905 1.8G
Lezghian lez unshuffled_original_lez 1485 247,646 3.3M unshuffled_deduplicated_lez 1381 224,871 3.0M
Limburgan li unshuffled_original_li 137 4,730 29K unshuffled_deduplicated_li 118 4,283 27K
Lithuanian lt unshuffled_original_lt 2977757 1,159,661,742 8.8G unshuffled_deduplicated_lt 1737411 516,183,525 3.9G
Lojban jbo unshuffled_original_jbo 832 154,330 736K unshuffled_deduplicated_jbo 617 141,973 678K
Lombard lmo unshuffled_original_lmo 1401 75,229 443K unshuffled_deduplicated_lmo 1374 73,665 433K
Low German nds unshuffled_original_nds 18174 2,906,347 18M unshuffled_deduplicated_nds 8714 2,146,417 13M
Lower Sorbian dsb unshuffled_original_dsb 65 1,787 13K unshuffled_deduplicated_dsb 37 966 7.1K
Luxembourgish lb unshuffled_original_lb 34807 4,403,577 29M unshuffled_deduplicated_lb 21735 3,087,650 21M
Macedonian mk unshuffled_original_mk 437871 189,289,873 2.1G unshuffled_deduplicated_mk 299457 102,849,595 1.2G
Maithili mai unshuffled_original_mai 123 69,161 317K unshuffled_deduplicated_mai 25 874 11K
Malagasy mg unshuffled_original_mg 17957 3,068,360 21M unshuffled_deduplicated_mg 13343 1,872,044 13M
Malay ms unshuffled_original_ms 534016 16,696,882 111M unshuffled_deduplicated_ms 183443 6,045,753 42M
Malayalam ml unshuffled_original_ml 603937 189,534,472 4.9G unshuffled_deduplicated_ml 453904 95,892,551 2.5G
Maltese mt unshuffled_original_mt 26598 2,995,654 24M unshuffled_deduplicated_mt 16383 2,163,358 17M
Marathi mr unshuffled_original_mr 326804 162,609,404 2.7G unshuffled_deduplicated_mr 212556 82,130,803 1.4G
Mazanderani mzn unshuffled_original_mzn 1055 73,870 691K unshuffled_deduplicated_mzn 917 64,481 602K
Minangkabau min unshuffled_original_min 220 5,682 608K unshuffled_deduplicated_min 166 4,825 310K
Mingrelian xmf unshuffled_original_xmf 3783 299,098 5.8M unshuffled_deduplicated_xmf 2418 228,629 4.4M
Mirandese mwl unshuffled_original_mwl 8 171 1.2K unshuffled_deduplicated_mwl 7 152 1.1K
Modern Greek el unshuffled_original_el 10425596 5,479,180,137 62G unshuffled_deduplicated_el 6521169 2,412,419,435 27G
Mongolian mn unshuffled_original_mn 395605 181,307,167 2.2G unshuffled_deduplicated_mn 197878 68,362,013 838M
Nahuatl languages nah unshuffled_original_nah 61 1,234 12K unshuffled_deduplicated_nah 58 1,193 11K
Neapolitan nap unshuffled_original_nap 73 5,282 17K unshuffled_deduplicated_nap 55 4,147 13K
Nepali ne unshuffled_original_ne 299938 107,448,208 1.8G unshuffled_deduplicated_ne 219334 71,628,317 1.2G
Newari new unshuffled_original_new 4696 564,697 5.5M unshuffled_deduplicated_new 2126 288,995 4.1M
Northern Frisian frr unshuffled_original_frr 7 1,516 4.4K unshuffled_deduplicated_frr 7 1,516 4.4K
Northern Luri lrc unshuffled_original_lrc 88 8,022 76K unshuffled_deduplicated_lrc 72 6,740 63K
Norwegian no unshuffled_original_no 5546211 1,344,326,388 8.0G unshuffled_deduplicated_no 3229940 804,894,377 4.7G
Norwegian Nynorsk nn unshuffled_original_nn 185884 14,764,980 85M unshuffled_deduplicated_nn 109118 9,435,139 54M
Occitan oc unshuffled_original_oc 10709 750,301 5.8M unshuffled_deduplicated_oc 6485 512,678 3.7M
Oriya or unshuffled_original_or 59463 14,938,567 248M unshuffled_deduplicated_or 44230 11,321,740 188M
Ossetian os unshuffled_original_os 5213 1,031,268 13M unshuffled_deduplicated_os 2559 878,765 11M
Pampanga pam unshuffled_original_pam 3 130 760 unshuffled_deduplicated_pam 1 52 304
Panjabi pa unshuffled_original_pa 127467 61,847,806 763M unshuffled_deduplicated_pa 87235 37,555,835 460M
Persian fa unshuffled_original_fa 13704702 9,096,554,121 79G unshuffled_deduplicated_fa 8203495 4,363,505,319 38G
Piemontese pms unshuffled_original_pms 3225 362,013 2.1M unshuffled_deduplicated_pms 2859 337,246 1.9M
Polish pl unshuffled_original_pl 35440972 15,277,255,137 109G unshuffled_deduplicated_pl 20682611 6,708,709,674 47G
Portuguese pt unshuffled_original_pt 42114520 20,641,903,898 124G unshuffled_deduplicated_pt 26920397 10,751,156,918 64G
Pushto ps unshuffled_original_ps 98216 46,559,441 361M unshuffled_deduplicated_ps 67921 31,347,348 242M
Quechua qu unshuffled_original_qu 452 10,186 78K unshuffled_deduplicated_qu 411 8,691 67K
Romanian ro unshuffled_original_ro 9387265 3,984,317,058 25G unshuffled_deduplicated_ro 5044757 1,741,794,069 11G
Romansh rm unshuffled_original_rm 41 1,093 7.4K unshuffled_deduplicated_rm 34 960 6.5K
Russia Buriat bxr unshuffled_original_bxr 42 963 13K unshuffled_deduplicated_bxr 36 809 11K
Russian ru unshuffled_original_ru 161836003 92,522,407,837 1.2T unshuffled_deduplicated_ru 115954598 46,692,691,520 568G
Sanskrit sa unshuffled_original_sa 14291 4,331,569 93M unshuffled_deduplicated_sa 7121 1,713,930 37M
Scottish Gaelic gd unshuffled_original_gd 5799 310,689 1.9M unshuffled_deduplicated_gd 3883 207,110 1.3M
Serbian sr unshuffled_original_sr 1013619 364,395,411 3.9G unshuffled_deduplicated_sr 645747 207,561,168 2.2G
Serbo-Croatian sh unshuffled_original_sh 36700 5,292,184 25M unshuffled_deduplicated_sh 17610 1,040,573 5.8M
Sicilian scn unshuffled_original_scn 21 554 3.3K unshuffled_deduplicated_scn 17 468 2.8K
Sindhi sd unshuffled_original_sd 44280 43,530,158 347M unshuffled_deduplicated_sd 33925 33,028,015 263M
Sinhala si unshuffled_original_si 203082 93,053,465 1.4G unshuffled_deduplicated_si 120684 50,864,857 802M
Slovak sk unshuffled_original_sk 5492194 1,322,247,763 9.1G unshuffled_deduplicated_sk 2820821 656,346,179 4.5G
Slovenian sl unshuffled_original_sl 1746604 387,399,700 2.5G unshuffled_deduplicated_sl 886223 193,926,684 1.3G
Somali so unshuffled_original_so 156 1,202 61K unshuffled_deduplicated_so 42 472 16K
South Azerbaijani azb unshuffled_original_azb 15446 2,175,054 27M unshuffled_deduplicated_azb 9985 1,528,709 19M
Spanish es unshuffled_original_es 88199221 47,545,122,279 278G unshuffled_deduplicated_es 56326016 25,928,290,729 149G
Sundanese su unshuffled_original_su 805 30,321 211K unshuffled_deduplicated_su 511 20,278 141K
Swahili sw unshuffled_original_sw 41986 2,211,927 13M unshuffled_deduplicated_sw 24803 1,376,963 8.1M
Swedish sv unshuffled_original_sv 17395625 7,155,994,312 44G unshuffled_deduplicated_sv 11014487 4,106,120,608 25G
Tagalog tl unshuffled_original_tl 458206 98,949,299 573M unshuffled_deduplicated_tl 294132 70,121,601 407M
Tajik tg unshuffled_original_tg 89002 31,758,142 379M unshuffled_deduplicated_tg 56259 21,029,893 249M
Tamil ta unshuffled_original_ta 1263280 420,537,132 9.3G unshuffled_deduplicated_ta 833101 226,013,330 5.1G
Tatar tt unshuffled_original_tt 135923 51,034,893 670M unshuffled_deduplicated_tt 82738 23,825,695 305M
Telugu te unshuffled_original_te 475703 123,711,517 2.5G unshuffled_deduplicated_te 312644 79,094,167 1.6G
Thai th unshuffled_original_th 6064129 951,743,087 36G unshuffled_deduplicated_th 3749826 368,965,202 16G
Tibetan bo unshuffled_original_bo 26795 1,483,589 187M unshuffled_deduplicated_bo 15762 936,556 138M
Turkish tr unshuffled_original_tr 18535253 7,577,388,700 60G unshuffled_deduplicated_tr 11596446 3,365,734,289 27G
Turkmen tk unshuffled_original_tk 6456 1,113,869 11M unshuffled_deduplicated_tk 4694 752,326 6.8M
Tuvinian tyv unshuffled_original_tyv 34 759 12K unshuffled_deduplicated_tyv 24 540 7.9K
Uighur ug unshuffled_original_ug 22255 8,657,141 122M unshuffled_deduplicated_ug 15503 5,852,225 83M
Ukrainian uk unshuffled_original_uk 12973467 4,204,381,276 53G unshuffled_deduplicated_uk 7782375 2,252,380,351 28G
Upper Sorbian hsb unshuffled_original_hsb 7959 545,351 4.2M unshuffled_deduplicated_hsb 3084 236,867 1.8M
Urdu ur unshuffled_original_ur 638596 331,817,982 2.7G unshuffled_deduplicated_ur 428674 218,030,228 1.7G
Uzbek uz unshuffled_original_uz 27537 2,450,256 21M unshuffled_deduplicated_uz 15074 1,381,644 12M
Venetian vec unshuffled_original_vec 73 3,492 18K unshuffled_deduplicated_vec 64 3,199 17K
Vietnamese vi unshuffled_original_vi 14898250 12,036,845,359 68G unshuffled_deduplicated_vi 9897709 5,577,159,843 32G
Volapük vo unshuffled_original_vo 3366 321,121 2.0M unshuffled_deduplicated_vo 3317 318,568 2.0M
Walloon wa unshuffled_original_wa 1001 50,720 273K unshuffled_deduplicated_wa 677 37,543 203K
Waray war unshuffled_original_war 9760 397,315 2.5M unshuffled_deduplicated_war 9161 336,311 2.2M
Welsh cy unshuffled_original_cy 157698 37,422,441 213M unshuffled_deduplicated_cy 98225 23,574,673 133M
Western Frisian fy unshuffled_original_fy 33053 5,691,077 35M unshuffled_deduplicated_fy 20661 4,223,816 26M
Western Mari mrj unshuffled_original_mrj 757 93,338 1.2M unshuffled_deduplicated_mrj 669 87,780 1.1M
Western Panjabi pnb unshuffled_original_pnb 4599 1,426,986 12M unshuffled_deduplicated_pnb 3463 1,111,112 9.0M
Wu Chinese wuu unshuffled_original_wuu 214 11,189 109K unshuffled_deduplicated_wuu 64 4,333 32K
Yakut sah unshuffled_original_sah 22301 2,547,623 42M unshuffled_deduplicated_sah 8555 1,789,174 26M
Yiddish yi unshuffled_original_yi 59364 13,834,320 141M unshuffled_deduplicated_yi 32919 8,212,970 84M
Yoruba yo unshuffled_original_yo 214 8,906 55K unshuffled_deduplicated_yo 49 3,518 27K
Yue Chinese yue unshuffled_original_yue 11 186 3.7K unshuffled_deduplicated_yue 7 128 2.2K
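
The rows above make it easy to see how much each subcorpus shrinks under deduplication. The following sketch parses one row of the table as it appears here (space-separated, with multi-word language names) and computes the fraction of training samples that survive. The regular expression and the names `parse_row` and `kept_fraction` are illustrative assumptions, not part of any OSCAR tooling:

```python
import re

# One row: language name, language code, then name/train/words/size
# for the original and for the deduplicated configuration.
ROW = re.compile(
    r"^(?P<language>.+?) (?P<code>\S+) "
    r"unshuffled_original_\S+ (?P<orig_train>\d+) "
    r"(?P<orig_words>[\d,]+) (?P<orig_size>\S+) "
    r"unshuffled_deduplicated_\S+ (?P<dedup_train>\d+) "
    r"(?P<dedup_words>[\d,]+) (?P<dedup_size>\S+)$"
)


def parse_row(line: str) -> dict:
    """Split one table row into named fields, converting counts to int."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised row: {line!r}")
    row = match.groupdict()
    for key in ("orig_train", "orig_words", "dedup_train", "dedup_words"):
        row[key] = int(row[key].replace(",", ""))
    return row


def kept_fraction(row: dict) -> float:
    """Fraction of training samples remaining after deduplication."""
    return row["dedup_train"] / row["orig_train"]


row = parse_row(
    "Central Kurdish ckb unshuffled_original_ckb 103639 48,478,334 487M "
    "unshuffled_deduplicated_ckb 68210 18,726,721 226M"
)
print(row["language"], row["code"], round(kept_fraction(row), 3))
```

Sizes are left as strings because the table mixes unit suffixes (K, M, G, T) with bare byte counts.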

Dataset Creation

Curation Rationale

OSCAR was constructed using a new pipeline, called goclassy, derived from the fastText one. Goclassy reuses the fastText linear classifier and the pre-trained fastText model for language recognition, but it completely rewrites and parallelises the fastText pipeline in an asynchronous manner.

The order of operations is more or less the same as in the fastText pre-processing pipeline. However, instead of clustering multiple operations into a single blocking process, goclassy launches a worker for each operation and bounds the number of parallel operations at any given time by the number of available threads rather than the number of CPUs. Goclassy is implemented in the Go programming language, so it lets the Go runtime handle the scheduling of the processes. Thus, with goclassy's pipeline, one does not have to wait for a whole WET file to download, decompress and classify before starting to download and process the next one; a new file starts downloading and processing as soon as the scheduler is able to allocate a new process.
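
The bounding idea can be illustrated in Python, with the caveat that this is not goclassy's Go implementation: the sketch collapses the per-operation workers into one task per file, and `process_file`, `run_pipeline` and the file names are hypothetical stand-ins. It only shows how a fixed pool size caps the work in flight while finished files free a slot for the next one:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_file(name: str) -> str:
    """Stand-in for downloading, decompressing and classifying one WET file."""
    return f"{name}: done"


def run_pipeline(file_names, max_workers=4):
    """Process files concurrently with at most `max_workers` tasks in flight."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_file, name): name for name in file_names}
        # Results are collected as each file finishes, not in submission
        # order, so a slow file does not hold up the ones queued behind it.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    print(run_pipeline(["a.wet.gz", "b.wet.gz", "c.wet.gz"], max_workers=2))
```

In Go, the runtime scheduler plays the role that the thread pool plays here.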

Filtering and cleaning at line level are done before each line is fed to the classifier. Lines shorter than 100 UTF-8 characters and lines containing invalid UTF-8 characters are discarded and are not classified. After all files are processed, the deduplicated versions are constructed, and everything is then split into shards and compressed.
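
The line-level filter described above can be sketched as follows. This is a minimal Python illustration rather than the goclassy code, and it rests on two assumptions: length is counted in decoded characters (code points), and the trailing line break is not counted. The name `keep_line` is hypothetical.

```python
MIN_CHARS = 100  # threshold stated in the text above


def keep_line(raw: bytes, min_chars: int = MIN_CHARS) -> bool:
    """Return True if the line is valid UTF-8 and at least `min_chars` long."""
    try:
        # Strict decoding raises UnicodeDecodeError on invalid byte sequences.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Assumption: the trailing line break does not count towards the length.
    return len(text.rstrip("\r\n")) >= min_chars


lines = [
    ("a" * 120).encode("utf-8"),  # long enough, valid
    b"too short",                 # valid but under the threshold
    b"\xff" * 150,                # long but not valid UTF-8
]
print([keep_line(line) for line in lines])  # [True, False, False]
```

Counting characters rather than bytes matters for scripts whose characters take several bytes in UTF-8: a 60-character line can occupy more than 100 bytes and would still be discarded under this reading.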

Source Data

Initial Data Collection and Normalization

Common Crawl is a non-profit foundation which produces and maintains an open repository of web-crawled data that is both accessible and analysable. Common Crawl's complete web archive consists of petabytes of data collected over 8 years of web crawling. The repository contains raw web page HTML data (WARC files), metadata extracts (WAT files) and plain text extracts (WET files). The organisation's crawlers have always respected nofollow and robots.txt policies.

Each monthly Common Crawl snapshot is in itself a massive multilingual corpus, where every single file contains data coming from multiple web pages written in a large variety of languages and covering all possible types of topics.

To construct OSCAR, the WET files of Common Crawl were used. These contain the extracted plain texts from the websites, mostly converted to UTF-8, as well as headers containing the metadata of each crawled document. Each WET file comes compressed in gzip format and is stored on Amazon Web Services. In the case of OSCAR, the November 2018 snapshot was used. It surpasses 20TB of uncompressed data and contains more than 50 thousand plain text files, where each file consists of the plain text from multiple websites along with its metadata header.

Who are the source language producers?

The data comes from multiple web pages in a large variety of languages.

Annotations

The dataset does not contain any additional annotations.

Annotation process

N/A

Who are the annotators?

N/A

Personal and Sensitive Information

Because OSCAR is constructed from Common Crawl, personal and sensitive information might be present. This must be considered before training deep learning models with OSCAR, especially in the case of text-generation models.

Considerations for Using the Data

Social Impact of Dataset

OSCAR is intended to bring more data to a wide variety of languages. The aim of the corpus is to make large amounts of data available to lower-resource languages in order to facilitate the pre-training of state-of-the-art language modeling architectures.

Discussion of Biases

OSCAR is not properly filtered yet, and this can be reflected in the models trained with it. Care is advised, especially concerning biases of the resulting models.

Other Known Limitations

The fastText linear classifier is limited both in performance and in the variety of languages it can recognize, so the quality of some OSCAR sub-corpora might be lower than expected, especially for the lowest-resource languages. Some audits have already been done by third parties.

Additional Information

Dataset Curators

The corpus was put together by Pedro J. Ortiz, Benoît Sagot, and Laurent Romary, during work done at Inria, particularly at the ALMAnaCH team.

Licensing Information

These data are released under this licensing scheme
We do not own any of the text from which these data have been extracted.
We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/
To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR
This work is published from: France.

Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Citation Information

@inproceedings{ortiz-suarez-etal-2020-monolingual,
    title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
    author = "Ortiz Su{\'a}rez, Pedro Javier  and
      Romary, Laurent  and
      Sagot, Benoit",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.156",
    pages = "1703--1714",
    abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}

@inproceedings{OrtizSuarezSagotRomary2019,
  author    = {Pedro Javier {Ortiz Su{\'a}rez} and Benoit Sagot and Laurent Romary},
  title     = {Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures},
  series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019},
  editor    = {Piotr Bański and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Marc Kupietz and Harald L{\"u}ngen and Caroline Iliadi},
  publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
  address   = {Mannheim},
  doi       = {10.14618/ids-pub-9021},
  url       = {http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215},
  pages     = {9 -- 16},
  year      = {2019},
  abstract  = {Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.},
  language  = {en}
}

Contributions

Thanks to @pjox and @lhoestq for adding this dataset.
