Dataset Card for "oscar"
Dataset Summary
OSCAR or Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated form.
The version here is the original OSCAR 2019 release: https://oscar-project.org/post/oscar-2019/
For more recent versions, visit the oscar-corpus organization on the Hub:
- OSCAR 22.01 (released in January 2022): oscar-corpus/OSCAR-2201
- OSCAR 21.09 (released in September 2021): oscar-corpus/OSCAR-2109
Supported Tasks and Leaderboards
OSCAR is mainly intended for pretraining language models and word representations.
Languages
All data is distributed by language, and both the original and the deduplicated versions are available. 166 different languages are covered. The table in the Data Splits Sample Size subsection gives the language code for each subcorpus, along with the number of words (space-separated tokens), the number of lines, and the size of both the original and the deduplicated versions of OSCAR.
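As a rough illustration of how the language codes map to configurations, the sketch below builds a configuration name from a language code and loads the matching subcorpus. The helper names `oscar_config_name` and `load_oscar_subcorpus` are hypothetical, and the `unshuffled_original_` prefix is assumed by analogy with the `unshuffled_deduplicated_` names listed in this card. Whether the load succeeds as written depends on the installed `datasets` version, so any extra arguments are simply passed through.

```python
def oscar_config_name(lang: str, deduplicated: bool = True) -> str:
    """Build a configuration name such as 'unshuffled_deduplicated_af'.

    Hypothetical helper: the 'deduplicated' pattern comes from the names in
    this card; the 'original' pattern is an assumption made by analogy.
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang}"


def load_oscar_subcorpus(lang: str, deduplicated: bool = True, **kwargs):
    """Load the 'train' split of one language subcorpus.

    The import is done lazily so the name helper works without `datasets`
    installed. Extra keyword arguments (for example streaming=True) are
    forwarded to load_dataset unchanged.
    """
    from datasets import load_dataset

    return load_dataset(
        "oscar",
        oscar_config_name(lang, deduplicated),
        split="train",
        **kwargs,
    )


print(oscar_config_name("af"))
```

For the larger subcorpora, calling `load_oscar_subcorpus("en", streaming=True)` would avoid downloading the full files up front, assuming streaming is supported for this dataset in your environment.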
Dataset Structure
We show detailed information for all the configurations of the dataset.
Data Instances
Data and size information for each language (deduplicated):
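Each entry in this list reports three sizes and one sample record with an integer `id` and a string `text`. The reported total disk usage appears to be the sum of the download size and the generated size, up to rounding. The sketch below checks both properties; the function names and the rounding tolerance are assumptions chosen for illustration, and the figures are taken from the `unshuffled_deduplicated_af` entry and the `unshuffled_deduplicated_bar` sample in this card.

```python
import json


def is_valid_record(record: dict) -> bool:
    """Check the two fields shown in the samples: integer `id`, string `text`."""
    return (
        set(record) == {"id", "text"}
        and isinstance(record["id"], int)
        and isinstance(record["text"], str)
    )


def sizes_are_consistent(downloaded: float, generated: float, total: float,
                         tolerance: float = 0.015) -> bool:
    """True if downloaded + generated matches the reported total.

    The tolerance is an assumed allowance for the two-decimal rounding
    used in the card; all three values must be in the same unit.
    """
    return abs(downloaded + generated - total) <= tolerance


# Sample record copied from the unshuffled_deduplicated_bar entry.
sample = json.loads('{"id": 0, "text": " vo"}')
print(is_valid_record(sample))

# Sizes in MB copied from the unshuffled_deduplicated_af entry.
print(sizes_are_consistent(65.99, 172.30, 238.29))
```

When planning storage, note that some entries mix units (for example MB for the download and GB for the generated dataset), so convert to a common unit before adding.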
unshuffled_deduplicated_af
- Size of downloaded dataset files: 65.99 MB
- Size of the generated dataset: 172.30 MB
- Total amount of disk used: 238.29 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel"
}
unshuffled_deduplicated_als
- Size of downloaded dataset files: 1.26 MB
- Size of the generated dataset: 2.96 MB
- Total amount of disk used: 4.22 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"De Nazionalpark hät e Flächi vo 170,3 km² und isch dodemit s grösti Naturschutzgebiet vo de Schwiz. Er ligt uf em Gebiet vo de ..."
}
unshuffled_deduplicated_am
- Size of downloaded dataset files: 61.35 MB
- Size of the generated dataset: 216.15 MB
- Total amount of disk used: 277.50 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"አየር መንገዱ ከአዲስ አበባ ወደ ሮም ጣሊያን በማምራት ላይ በነበረበት ጊዜ ረዳት አብራሪው የጉዞውን አቅጣጫ በመቀየር ጄኔቭ አውሮፓላን ማረፊያ በማሳረፍ እጁን ለፖሊስ ሰጥቷል።\\nየኢትዮጵያ መንግስት የ..."
}
unshuffled_deduplicated_an
- Size of downloaded dataset files: 0.14 MB
- Size of the generated dataset: 0.85 MB
- Total amount of disk used: 0.99 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"واااااااأسفاه الأمم تفتخر ب 0 أمي ووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووو..."
}
unshuffled_deduplicated_ar
- Size of downloaded dataset files: 9.67 GB
- Size of the generated dataset: 33.57 GB
- Total amount of disk used: 43.23 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"مرحبا بك عزيز الزائر نتمنى لك أوقاتاً سعيدة معنا وأن نزداد شرفا بخدمتك ولا تنسى التسجيل معنا لتستفيد بكل جديد\\nأهلا وسهلا بك زا..."
}
unshuffled_deduplicated_arz
- Size of downloaded dataset files: 10.02 MB
- Size of the generated dataset: 35.91 MB
- Total amount of disk used: 45.94 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"بنى عجل : قبيلة من عجل بن لجيم بن صعب بن على بن بكر بن وائل انتقل اغلبهم الى البصرة فى العراق و اصفهان و خراسان فى ايران و اذرب..."
}
unshuffled_deduplicated_as
- Size of downloaded dataset files: 15.51 MB
- Size of the generated dataset: 74.07 MB
- Total amount of disk used: 89.58 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"আমি, এই সংগঠনৰ সদস্য সকলে একেলগ হৈ অসমকে ধৰি ভাৰতৰ উত্তৰ পূৰ্বাঞ্চলৰ অমূল্য কলা-সাংস্কৃতিক সম্পদৰাজি বৃহত্তৰ অষ্ট্ৰেলিয়াৰ সন্মু..."
}
unshuffled_deduplicated_ast
- Size of downloaded dataset files: 0.86 MB
- Size of the generated dataset: 2.17 MB
- Total amount of disk used: 3.03 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"The Killers llanzaron el so álbum debú, Hot Fuss, en xunu de 2004 nel Reinu Xuníu, al traviés de la discográfica Lizard King, y..."
}
unshuffled_deduplicated_av
- Size of downloaded dataset files: 0.07 MB
- Size of the generated dataset: 0.34 MB
- Total amount of disk used: 0.41 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Жинда малъараб ва божизе бегьулеб рагІудаса кьуризе бегьуларо гьев. Гьес насихІат гьабизе кколелъул бацІцІадаб диналъул рахъалъ..."
}
unshuffled_deduplicated_az
- Size of downloaded dataset files: 521.74 MB
- Size of the generated dataset: 1.53 GB
- Total amount of disk used: 2.05 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"AZTV-Artıq 7 ildir ki, Abşeron rayonu dotasiya almadan bütün xərclərini yerli daxilolmalar hesabına maliyyələşdirir.\\nDünən, 10..."
}
unshuffled_deduplicated_azb
- Size of downloaded dataset files: 5.19 MB
- Size of the generated dataset: 20.08 MB
- Total amount of disk used: 25.27 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"لعلی ١٣-جو عصرده یاشاییب یاراتمیش گؤرکملی آذربایجان شاعرلریندندیر. ١٢٢٤-جی ایلده تبریزده آنادان اولموشدور، گنج یاشلاریندا تیجار..."
}
unshuffled_deduplicated_ba
- Size of downloaded dataset files: 25.98 MB
- Size of the generated dataset: 93.84 MB
- Total amount of disk used: 119.82 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Күҙәтеү ҡуласаһы моделен хәҙер Мифтахетдин Аҡмулла исемендәге Башҡорт дәүләт педагогия университетында ла эшләргә мөмкин\\t\\nКүҙ..."
}
unshuffled_deduplicated_bar
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": " vo"
}
unshuffled_deduplicated_bcl
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"& ÿ ó / í 0 - ø û ù ö ú ð ï ú \\u0014 ù þ ô ö í ÷ ò \\u0014 ÷ í ù û ö í \\u0001 û ñ ç þ \\u0001 ð \\u0007 þ ò ñ ñ ò ô \\u0017 û ö ô ÷..."
}
unshuffled_deduplicated_be
- Size of downloaded dataset files: 306.70 MB
- Size of the generated dataset: 1.08 GB
- Total amount of disk used: 1.39 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Брэсцкія ўлады не дазволілі прафсаюзу РЭП правесці пікетаванне ў парку Воінаў-інтэрнацыяналістаў 30 мая 2018 года.\\nСітуацыю пр..."
}
unshuffled_deduplicated_bg
- Size of downloaded dataset files: 3.85 GB
- Size of the generated dataset: 14.45 GB
- Total amount of disk used: 18.30 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ЖАЛБОПОДАТЕЛЯТ директор на Дирекция „ Обжалване и данъчно-осигурителна практика“- Бургас, редовно призован, се представлява от ..."
}
unshuffled_deduplicated_bh
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.04 MB
- Total amount of disk used: 0.04 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"सुकमा जिला भारत के छत्तीसगढ़ राज्य में एगो जिला बाटे। एकर मुख्यालय सुकमा शहर बाटे। एकर कुल रकबा 5636 वर्ग कि॰मी॰ बाटे।\"..."
}
unshuffled_deduplicated_bn
- Size of downloaded dataset files: 1.26 GB
- Size of the generated dataset: 6.24 GB
- Total amount of disk used: 7.50 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ভড়ং সর্বস্ব বাংলা আর্ট অ্যান্ড কালচারের হিসাব গুলিয়ে দেওয়ার ম্যাজিকের নাম ব্রাত্য রাইসু November 23, 2017\\nTagged with ডায়োজিনি..."
}
unshuffled_deduplicated_bo
- Size of downloaded dataset files: 22.37 MB
- Size of the generated dataset: 144.65 MB
- Total amount of disk used: 167.02 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"བོད་མི་འདི་དག་ནི་རང་རྒྱུད་སྒོ་རུ་ཕུད་དེ་གཞན་རྒྱུད་པང་དུ་ཉར་ནས་གསོ་སྐྱོང་བྱེད་དགོས་ཟེར་བ་དང་གཅིག་མཚུངས་རེད།\\nཚན་རིག་ནི་དང་ཐོག་རང..."
}
unshuffled_deduplicated_bpy
- Size of downloaded dataset files: 0.19 MB
- Size of the generated dataset: 1.78 MB
- Total amount of disk used: 1.97 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"পৌরসভা এহার আয়তন (লয়াহান) ২,৭৩০,.৬৩ বর্গ কিলোমিটার। পৌরসভা এহার মাপাহানর অক্ষাংশ বারো দ্রাঘিমাংশ ইলতাই 18.63° S 48.18° W ।[১]..."
}
unshuffled_deduplicated_br
- Size of downloaded dataset files: 6.47 MB
- Size of the generated dataset: 17.00 MB
- Total amount of disk used: 23.47 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Ar mank Magalhães(Daveoù a vank) a zo ur spesad evned, Spheniscus magellanicus an anv skiantel anezhañ.\\nGallout a reer implijo..."
}
unshuffled_deduplicated_bs
- Size of downloaded dataset files: 0.04 MB
- Size of the generated dataset: 0.15 MB
- Total amount of disk used: 0.18 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ž šř é ú šř šř ě šř ž é č ě ž ů ě ď éé ýš ě ě Ž č š ý ě ď é ýš ě ď ě éé ýš ě č ž ě š ý ď ě ýš é ú č ž č š ý ď ý ž é éě ď é č ýš..."
}
unshuffled_deduplicated_bxr
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.01 MB
- Total amount of disk used: 0.01 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"2002 оной хабар буряад хэлэ бэшэгэй һалбари Үндэһэтэнэй хүмүүнлиг ухаанай дээдэ һургуули болгогдожо өөршэлэгдөө.\\nХарин мүнөө б..."
}
unshuffled_deduplicated_ca
- Size of downloaded dataset files: 1.73 GB
- Size of the generated dataset: 4.57 GB
- Total amount of disk used: 6.30 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Daniel Vendrell, conegut com Vandrell, ha sigut un dels il•lustradors contemporanis més influents, representant a la nova onada..."
}
unshuffled_deduplicated_cbk
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano..."
}
unshuffled_deduplicated_ce
- Size of downloaded dataset files: 1.87 MB
- Size of the generated dataset: 7.04 MB
- Total amount of disk used: 8.90 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Шаьш анархисташ ду бохучу жигархойн дIахьедарехь дуьйцу, оьрсийн ницкъаллийн структурийн а, федералан каналан а Iалашонаш \\\"мар..."
}
unshuffled_deduplicated_ceb
- Size of downloaded dataset files: 7.12 MB
- Size of the generated dataset: 24.83 MB
- Total amount of disk used: 31.95 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Si Isko walay pupamilok nga nagtan-aw sa unahan, natugaw. “Naunsa ka gud diha Isko nga layo man kaayo ang imong panan-aw?” ni I..."
}
unshuffled_deduplicated_ckb
- Size of downloaded dataset files: 60.32 MB
- Size of the generated dataset: 237.72 MB
- Total amount of disk used: 298.05 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"رسی رۆژ - ساڵێک دوای بومەلەرزەی کرماشان میوانی بەرنامە : کاک سیاوەش حەیاتی چالاکی مەدەنی -قەسری شیرین\\nپارچە موزیک 30 / 10 / 20..."
}
unshuffled_deduplicated_cs
- Size of downloaded dataset files: 10.49 GB
- Size of the generated dataset: 25.71 GB
- Total amount of disk used: 36.20 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Akce anarchistů proti připravovanému novému služební řádu a nízkým mzdám 1903 – Historie českého anarchismu (1880 – 1939)\\nRost..."
}
unshuffled_deduplicated_cv
- Size of downloaded dataset files: 7.47 MB
- Size of the generated dataset: 27.49 MB
- Total amount of disk used: 34.95 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Шыранӑ чухне ӑнсӑртран латин кирилл саспаллисем вырӑнне латин саспаллисене ҫырсан, сайт эсир ҫырнине юсама тӑрӑшӗ.\\nКу сайтра ч..."
}
unshuffled_deduplicated_cy
- Size of downloaded dataset files: 53.63 MB
- Size of the generated dataset: 141.22 MB
- Total amount of disk used: 194.86 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Mae capeli Cymreig yr Andes ym Mhatagonia wedi cyhoeddi na fydd gwasanaethau yno weddill y mis, oherwydd yr eira trwm sydd wedi..."
}
unshuffled_deduplicated_da
- Size of downloaded dataset files: 3.82 GB
- Size of the generated dataset: 10.24 GB
- Total amount of disk used: 14.06 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Den 2.-5. februar 2016 løb det tredje kursus i uddannelsen af 4kommunesamarbejdets Local Impact Coaches, af stablen i Gentofte ..."
}
unshuffled_deduplicated_de
- Size of downloaded dataset files: 60.80 GB
- Size of the generated dataset: 156.30 GB
- Total amount of disk used: 217.10 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Auf dieser Seite gibt es mind. ein YouTube Video. Cookies für diese Website wurden abgelehnt. Dadurch können keine YouTube Vide..."
}
unshuffled_deduplicated_diq
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "Zıwanê Slawki, zıwano merdumanê Slawano. Zıwanê Slawki yew lızgeyê Zıwananê Hind u Ewropao. Keyeyê Zıwananê Slawki beno hirê letey:"
}
unshuffled_deduplicated_dsb
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.01 MB
- Total amount of disk used: 0.01 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Pśiklaskaju južo pśed pśedstajenim... 1500 źiśi njamóžo wěcej docakaś, měsćańska hala w Chóśebuzu - wupśedana."
}
unshuffled_deduplicated_dv
- Size of downloaded dataset files: 16.84 MB
- Size of the generated dataset: 82.19 MB
- Total amount of disk used: 99.03 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ބ. އަތޮޅުގައި ހުޅުވަން ތައްޔާރުވަމުން އަންނަ ވައްކަރު ރިސޯޓުގައި ވަޒީފާ އަދާކުރަން ޝައުގުވެރިވާ ފަރާތްތަކަށް ކުރިމަތިލުމުގެ ފުރ..."
}
unshuffled_deduplicated_el
- Size of downloaded dataset files: 7.91 GB
- Size of the generated dataset: 28.74 GB
- Total amount of disk used: 36.65 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Νεκρός εντοπίστηκε μέσα στο σπίτι του στην οδό Ηρώδου Αττικού στον αριθμό 7 ο επικεφαλής του προξενικού τμήματος της Ρωσικής πρ..."
}
unshuffled_deduplicated_eml
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.02 MB
- Total amount of disk used: 0.03 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"A séguit dal prucès ad rubutiśasiòṅ di abitànt dal pòpul ad Mikenes, Angoras 'l è finî dènt'r a 'n robot cun la tèsta dna rana ..."
}
unshuffled_deduplicated_en
- Size of downloaded dataset files: 496.50 GB
- Size of the generated dataset: 1299.75 GB
- Total amount of disk used: 1796.24 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visi..."
}
unshuffled_deduplicated_eo
- Size of downloaded dataset files: 92.86 MB
- Size of the generated dataset: 240.12 MB
- Total amount of disk used: 332.99 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Ĉu ... preĝi | mediti | ricevi instigojn || kanti | muziki || informiĝi | legi | studi || prepari Diservon\\nTemas pri kolekto d..."
}
unshuffled_deduplicated_es
- Size of downloaded dataset files: 60.46 GB
- Size of the generated dataset: 160.86 GB
- Total amount of disk used: 221.32 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Como se librará de la celulitis en el gimnasio La piel superflua en las manos después del adelgazamiento, Los bailes fáciles pa..."
}
unshuffled_deduplicated_et
- Size of downloaded dataset files: 966.79 MB
- Size of the generated dataset: 2.45 GB
- Total amount of disk used: 3.41 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"MTÜ AB Video järgib oma tegevuses kodanikuühenduste eetilise tegevuse üldtunnustatud põhimõtteid, mis on lühidalt kokkuvõetud 7..."
}
unshuffled_deduplicated_eu
- Size of downloaded dataset files: 134.68 MB
- Size of the generated dataset: 363.93 MB
- Total amount of disk used: 498.61 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "Gure jarduerek eraikuntzarekin, elkarbizitzarekin, hirigintzarekin eta ekologiarekin dute harremana, baita ideia eta konponbideak irudikatu eta garatzearekin ere, eraikuntza sektorea hobetuz, pertsonen erosotasuna eta bizi-kalitatea hobetzeko."
}
unshuffled_deduplicated_fa
- Size of downloaded dataset files: 10.46 GB
- Size of the generated dataset: 40.06 GB
- Total amount of disk used: 50.52 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"قـــــــــــــــــرار بود با هم کنـــــــــــــار بیایم نه اینکه از کنــــــــــــار هم رد بشیم...!!!\\nاگر روزی دلت لبریز غم بو..."
}
unshuffled_deduplicated_fi
- Size of downloaded dataset files: 5.38 GB
- Size of the generated dataset: 13.99 GB
- Total amount of disk used: 19.37 GB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Kiitos Deelle kaikesta - 1,5 viikkoa kulunut, kun Dee ei ole enää ollut omani. Reilu viikko sitten sunnuntaina vein Deen uuteen kotiinsa. Itselläni on ollut niin ristiriitaiset t..."
}
unshuffled_deduplicated_fr
- Size of downloaded dataset files: 55.46 GB
- Size of the generated dataset: 148.28 GB
- Total amount of disk used: 203.75 GB
An example of 'train' looks as follows.
{
"id": 0,
"text": "Média de débat d'idées, de culture et de littérature. Récits, décryptages, analyses, portraits et critiques autour de la vie des idées. Magazine engagé, ouvert aux autres et au monde.. Bring up to date in french"
}
unshuffled_deduplicated_frr
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Hiragana’ Practice’Sheet’1’(A -O)’ ’ Name:’________ __________________________’Section:’_______________ _’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ..."
}
unshuffled_deduplicated_fy
- Size of downloaded dataset files: 10.27 MB
- Size of the generated dataset: 26.73 MB
- Total amount of disk used: 37.00 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Nim in sêfte ride op Holmsjön, yn ien fan 'e lytse marren yn de omkriten, of nim se op avontueren lykas nonresidential. lâns Indalsälven wetter. Holm Sportklubb hawwe kano 's te huur, yn gearwurking mei de Baltyske Power konferinsje."
}
unshuffled_deduplicated_ga
- Size of downloaded dataset files: 22.22 MB
- Size of the generated dataset: 63.86 MB
- Total amount of disk used: 86.08 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Is fóram é seo chun plé a dhéanamh ar an leabhar atá roghnaithe do mhí na Samhna 2013 amháin. Ní féidir ach le baill chláraithe..."
}
unshuffled_deduplicated_gd
- Size of downloaded dataset files: 0.42 MB
- Size of the generated dataset: 1.36 MB
- Total amount of disk used: 1.78 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "Zhou Yujun, a 'phàrtaidh Rùnaire Comataidh Sgìre Yanfeng ann Hengyang bhaile agus a Sgìre pàrtaidh agus an riaghaltas a' bhuidheann-riochdachaidh a 'tighinn a chèilidh air ar companaidh air Apr. 14, 2017."
}
unshuffled_deduplicated_gl
- Size of downloaded dataset files: 155.85 MB
- Size of the generated dataset: 408.34 MB
- Total amount of disk used: 564.19 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"O persoal de Inditex da provincia de Pontevedra segue a reclamar iguais condicións laborais no conxunto do país - CIG: Confeder..."
}
unshuffled_deduplicated_gn
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.02 MB
- Total amount of disk used: 0.03 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"º ÑÆÚÓ À Ã Ð É Æ ¾ ÄÂ Î À ¼ Æ É ÄÛ = Ü Ý\\\"Þ ßà á â ã ä å æçè ã é ê â å àë ì æê íî é á ë ï í çì àð í Ü à ñ ê é ò ä ì\"..."
}
unshuffled_deduplicated_gom
- Size of downloaded dataset files: 0.38 MB
- Size of the generated dataset: 1.87 MB
- Total amount of disk used: 2.24 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"दुष्ट शीळ हें कौरवांचें । रामें सविस्तर देखूनि साचें । बोलिले वचनें जें दुर्वाचे । करी तयांचें अनुस्मरण ॥२२०॥\"..."
}
unshuffled_deduplicated_gu
- Size of downloaded dataset files: 162.97 MB
- Size of the generated dataset: 759.34 MB
- Total amount of disk used: 922.32 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"અધિક માસ ચાલે છે. સમગ્ર ભારતમાં અને તેમાંય ખાસ કરીને પવિત્ર કે ધાર્મિક કહેવાય છે તેવા સ્થાનક પર કથાનો દોર ચાલે છે. ઉનાળાની કાળઝ..."
}
unshuffled_deduplicated_he
- Size of downloaded dataset files: 3.04 GB
- Size of the generated dataset: 10.47 GB
- Total amount of disk used: 13.51 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"זקוקים לרשתות נגד יתושים? מחפשים רשת מתאימה לחלון צר וקטן? רשתות נגד יתושים אקורדיון של חברת קליר-מש הן הפתרון.\\nרשתות לחלונות ..."
}
unshuffled_deduplicated_hi
- Size of downloaded dataset files: 2.01 GB
- Size of the generated dataset: 9.57 GB
- Total amount of disk used: 11.58 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"'आइटम गर्ल' बनकर हिट हुई थीं राखी सावंत, आज करीना-कटरीना तक फॉलो कर रही हैं ट्रेंड नक्सलियों का दम निकालेगा बाइक ग्रेनेड लॉन्च..."
}
unshuffled_deduplicated_hr
- Size of downloaded dataset files: 46.74 MB
- Size of the generated dataset: 121.50 MB
- Total amount of disk used: 168.23 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"U raspravi je sudjelovao i HSS-ov saborski zastupnik rekavši kako poljoprivrednici ne osjete mjere o kojima ministar govori jer..."
}
unshuffled_deduplicated_hsb
- Size of downloaded dataset files: 0.72 MB
- Size of the generated dataset: 1.89 MB
- Total amount of disk used: 2.61 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Budyšin (SN/BŠe). Elektronikarjo mějachu lětsa cyle hinaši zazběh do swojeho wukubłanja. Wokrjesne rjemjeslnistwo bě mjenujcy w..."
}
unshuffled_deduplicated_ht
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan..."
}
unshuffled_deduplicated_hu
- Size of downloaded dataset files: 7.37 GB
- Size of the generated dataset: 19.09 GB
- Total amount of disk used: 26.46 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"monster - Amatőr, házi szex videók és kezdő csjaok pornó filmjei. - Free amateur, home made sex videos and online porn movies. ..."
}
unshuffled_deduplicated_hy
- Size of downloaded dataset files: 393.62 MB
- Size of the generated dataset: 1.56 GB
- Total amount of disk used: 1.96 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Արցախի Հանրապետության հռչակման 26-րդ տարեդարձի կապակցությամբ Շուշիի Արվեստի կենտրոնում կազմակերպվել է մոսկվաբնակ նկարիչներ՝ հայ..."
}
unshuffled_deduplicated_ia
- Size of downloaded dataset files: 0.05 MB
- Size of the generated dataset: 0.38 MB
- Total amount of disk used: 0.43 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha h..."
}
unshuffled_deduplicated_id
- Size of downloaded dataset files: 6.00 GB
- Size of the generated dataset: 17.05 GB
- Total amount of disk used: 23.05 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Perihal dari itu, kalau kunci hal yang demikian hilang, pemilik wajib melapor ke bengkel sah untuk dibuatkan kunci baru dengan ..."
}
unshuffled_deduplicated_ie
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "Plastic Yo Yo Metal Yo Yos Wooden Yo Yo Keychain Yo Yo Translucent Yo Yo Light Up Yo Yo Globe Yo Yo Stress Reliever Yo Yo Jellyfish Yo Yo Sports Ball Yo Yo Sound Yo Yo Miniature Yo Yo Promotional Yo Yo Novelty Yo Yo Video Game Yo Yo ECO Recycled Yo Yo"
}
unshuffled_deduplicated_ilo
- Size of downloaded dataset files: 0.23 MB
- Size of the generated dataset: 0.68 MB
- Total amount of disk used: 0.91 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Segun ken ni Ping-ay, ti yellow corn ti maysa kadagiti nadakamat a liberalized agricultural commodity iti daytoy a free trade k..."
}
unshuffled_deduplicated_io
- Size of downloaded dataset files: 0.04 MB
- Size of the generated dataset: 0.14 MB
- Total amount of disk used: 0.19 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Chekia esas parlamentala republiko. La chefo di stato esas la prezidanto. Til 2013 lu elektesis dal parlamento. Pos ta yaro, ol..."
}
unshuffled_deduplicated_is
- Size of downloaded dataset files: 332.87 MB
- Size of the generated dataset: 894.28 MB
- Total amount of disk used: 1.23 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Eyjar.net - upplýsinga- og fréttamiðill um Vestmannaeyjar - Fréttir - Nái núverandi stefna stjórnvalda fram að ganga mun það va..."
}
unshuffled_deduplicated_it
- Size of downloaded dataset files: 27.93 GB
- Size of the generated dataset: 74.09 GB
- Total amount of disk used: 102.03 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Jaundice - causes, treatment & pathology massaggio a osteochondrosis dellindizio di una controindicazione\\nTrattamento su un co..."
}
unshuffled_deduplicated_ja
- Size of downloaded dataset files: 40.80 GB
- Size of the generated dataset: 113.63 GB
- Total amount of disk used: 154.44 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"神社などへ一緒に同行して、様々な角度のショットで家族写真やお子様の写真を撮影致します!お好みに合わせて様々な写真を取ることができますので、その場でカメラマンへのリクエストも可能です!お子様の晴れ姿を、緊張していない自然な笑顔で残しませんか?\\n※七五三の..."
}
unshuffled_deduplicated_jbo
- Size of downloaded dataset files: 0.20 MB
- Size of the generated dataset: 0.70 MB
- Total amount of disk used: 0.91 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "ni'o 23 la cimast. cu 23moi djedi fi'o masti la cimast. noi ke'a cu cimoi masti .i 22 la cimast. cu purlamdei .ije 24 la cimast. cu bavlamdei"
}
unshuffled_deduplicated_jv
- Size of downloaded dataset files: 0.21 MB
- Size of the generated dataset: 0.62 MB
- Total amount of disk used: 0.82 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"José Mourinho (diwaca: [ʒuˈzɛ moˈɾiɲu]; lair ing Setubal, Portugal, 26 Januari 1963; umur 55 taun) iku salah siji pelatih bal k..."
}
unshuffled_deduplicated_ka
- Size of downloaded dataset files: 377.23 MB
- Size of the generated dataset: 1.99 GB
- Total amount of disk used: 2.36 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"წამიყვანე შენთან ერთად (ქართულად) / Возьми меня с собой (картулад) / (რუსული სერიალები ქართულად) (რუსების პორნო ონლაინში) (ruse..."
}
unshuffled_deduplicated_kk
- Size of downloaded dataset files: 389.12 MB
- Size of the generated dataset: 1.59 GB
- Total amount of disk used: 1.97 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Түлкібас ауданында «Латын негізді әліпби мен емле ережесі туралы насихат» жобасының тобы семинар өткізді\\nЕлорданың «Қазақстан»..."
}
unshuffled_deduplicated_km
- Size of downloaded dataset files: 114.48 MB
- Size of the generated dataset: 610.61 MB
- Total amount of disk used: 725.09 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ខ្សឹបដាក់ត្រចៀក៖ លោក សួស សុផានិត នាយផ្នែករដ្ឋបាលព្រៃឈើ ស្រុកភ្នំក្រវាញ់ ដែលទើបឡើងកាន់តំណែងថ្មី បើកដៃឲ្យឈ្នួញ ប្រព្រឹត្តបទល្មើស ..."
}
unshuffled_deduplicated_kn
- Size of downloaded dataset files: 215.52 MB
- Size of the generated dataset: 1.08 GB
- Total amount of disk used: 1.30 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ರಾಷ್ಟ್ರಪತಿ ಪ್ರಣಬ್ ಮುಖರ್ಜಿಯಿಂದ ಪದ್ಮ ಪ್ರಶಸ್ತಿ ಪ್ರದಾನ | President Pranab Mukherjee Confers Padma Awards | Photo Gallery on Kannada..."
}
unshuffled_deduplicated_ko
- Size of downloaded dataset files: 4.46 GB
- Size of the generated dataset: 12.00 GB
- Total amount of disk used: 16.47 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"CIA 프로젝트에서는 데이터베이스로 들어오는 요청을 중간에 수집(Sniffing)하고 수집한 데이터를 분석(Parsing)하여 그로 인한 결과를 판단하여 알릴 수 있는 시스템(Push Service)이 필요하다. 그리고 연구를 ..."
}
unshuffled_deduplicated_krc
- Size of downloaded dataset files: 0.62 MB
- Size of the generated dataset: 2.41 MB
- Total amount of disk used: 3.03 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Шамханланы, Бийлени къаршысына ябушуп, Батыр уланларыбызны къоллары булан «ортакъ ожакъ» къургъанбыз. Шо иш уллу зараллы иш бол..."
}
unshuffled_deduplicated_ku
- Size of downloaded dataset files: 23.34 MB
- Size of the generated dataset: 63.09 MB
- Total amount of disk used: 86.43 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Me di 114 bernameyên xwe yên berê da perçeyên ji berhemên zanyarî yên kurdzanên mezin bi wergera kurdî da ...\\nMe di 114 bernam..."
}
unshuffled_deduplicated_kv
- Size of downloaded dataset files: 0.33 MB
- Size of the generated dataset: 1.21 MB
- Total amount of disk used: 1.54 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Коми кытшыслӧн ыджытжык тор вӧр увтын куйлӧ, сійӧн и фаунасӧ татӧн аркмӧтӧны вӧрын олісь подаэз. Ассямаӧн лоӧ сія, мый кытшас с..."
}
unshuffled_deduplicated_kw
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.02 MB
- Total amount of disk used: 0.02 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼Pray without ceasing🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏..."
}
unshuffled_deduplicated_ky
- Size of downloaded dataset files: 106.22 MB
- Size of the generated dataset: 408.40 MB
- Total amount of disk used: 514.61 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Turmush: Бишкек шаардык кеңешинин кезексиз отурумунда мэрге ишенбөөчүлүк көрсөтүү маселеси каралат, - депутат Т.Сагынов\\nБишкек..."
}
unshuffled_deduplicated_la
- Size of downloaded dataset files: 3.42 MB
- Size of the generated dataset: 9.79 MB
- Total amount of disk used: 13.22 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.\\nEcce ego adducam aqua..."
}
unshuffled_deduplicated_lb
- Size of downloaded dataset files: 8.30 MB
- Size of the generated dataset: 21.42 MB
- Total amount of disk used: 29.72 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Während dem Gaardefestival \\\"Ambiance Jardins\\\" vum 15. bis de 17. Mee huet den SNJ nees zesumme mam Groupe Animateur en Inform..."
}
unshuffled_deduplicated_lez
- Size of downloaded dataset files: 0.77 MB
- Size of the generated dataset: 3.08 MB
- Total amount of disk used: 3.84 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Ахцегь хуьр, виридалай ч1ехи лезги хуьрерикая я. Ам Урусатдин виридалай къиблепатавай хуьрерикай я. Ин хуьр...\"..."
}
unshuffled_deduplicated_li
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.03 MB
- Total amount of disk used: 0.04 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"'t Good Goedenraad aan de Ezerbaek besjteit oet 'n kesjtièl mèt gesjlote haof en 'n park van 26 hectare. Hie in sjtoon väól beu..."
}
unshuffled_deduplicated_lmo
- Size of downloaded dataset files: 0.10 MB
- Size of the generated dataset: 0.46 MB
- Total amount of disk used: 0.57 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Serét (en tortonés: Sregh; en piemontés: Srèj) l'è 'n cümü italià, de la regiù del Piemónt, en Pruvìncia de Alessandria. El g'h..."
}
unshuffled_deduplicated_lo
- Size of downloaded dataset files: 23.63 MB
- Size of the generated dataset: 119.29 MB
- Total amount of disk used: 142.92 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"ຜູ້ພິພາກສາ ປະຈຳເຂດ ສຫລ ທ່ານນຶ່ງ ຕັດສິນວ່າ ໂຄງການເກັບກຳຂໍ້ມູນ ທາງໂທລະສັບ ຂອງອົງການ ຄວາມໝັ້ນຄົງແຫ່ງຊາດ ແມ່ນຖືກຕ້ອງ ຕາມກົດໝາຍ.\\nກະ..."
}
unshuffled_deduplicated_lrc
- Size of downloaded dataset files: 0.02 MB
- Size of the generated dataset: 0.06 MB
- Total amount of disk used: 0.08 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"آرلینگتون یئ گئل د شأریا ڤولاتچە ڤیرجینیا و یئ گئل د شأریا ڤولات ڤولاتچە یا یأکاگئرئتە ئمریکاە. ئی شأر دویومی کألوٙن شأر د راسا..."
}
unshuffled_deduplicated_lt
- Size of downloaded dataset files: 1.65 GB
- Size of the generated dataset: 4.20 GB
- Total amount of disk used: 5.86 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Čir vir vir pavasaris! Čia čia čia… dalinamės labai simpatiška video pamokėle, kurią pristato ab888art galerija.\\nBe galo papra..."
}
unshuffled_deduplicated_lv
- Size of downloaded dataset files: 710.45 MB
- Size of the generated dataset: 1.91 GB
- Total amount of disk used: 2.62 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Dekoratīvi sliekšņi MITSUBISHI OUTLANDER 2007, izgatavoti no ovālas formas, pulētas nerūsējošā tērauda caurules...\\ndažādas tūn..."
}
unshuffled_deduplicated_mai
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.01 MB
- Total amount of disk used: 0.01 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"१ · २ · ३ · ४ · ५ · ६ · ७ · ८ · ९ · १० · ११ · १२ · १३ · १४ · १५ · १६ · १७ · १८ · १९ · २० · २१ · २२ · २३ · २४ · २५ · २६ · २७ · २..."
}
unshuffled_deduplicated_mg
- Size of downloaded dataset files: 4.30 MB
- Size of the generated dataset: 13.59 MB
- Total amount of disk used: 17.89 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Nanamboatra taratasy apetaka sy soso-kevitra ho an'ny olona te-hanatevin-daharana ity fihetsiketsehana ity i Anocrena.\\nNosorat..."
}
unshuffled_deduplicated_mhr
- Size of downloaded dataset files: 1.63 MB
- Size of the generated dataset: 6.26 MB
- Total amount of disk used: 7.89 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Акрет жап годым Уганда кундемым Пигмей племена- влак айлен шогеныт. мемнан эран 1 курым гыч Банту племена влакат тиде кундемышк..."
}
unshuffled_deduplicated_min
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.31 MB
- Total amount of disk used: 0.33 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\" ..."
}
unshuffled_deduplicated_mk
- Size of downloaded dataset files: 303.12 MB
- Size of the generated dataset: 1.19 GB
- Total amount of disk used: 1.49 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"„Филм плус“ е насловен првиот филмски месечник во Македонија, чиј прв број ќе биде промовиран вечер во „Менада“. Новото македон..."
}
unshuffled_deduplicated_ml
- Size of downloaded dataset files: 496.80 MB
- Size of the generated dataset: 2.69 GB
- Total amount of disk used: 3.18 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"സ്ത്രീ പ്രവേശനം സര്ക്കാര് പൂര്ണമായും അംഗീകരിക്കുന്നുവെന്നും ശബരിമലയുടെ സുരക്ഷയില് ഇടപെടുമെന്നും സര്ക്കാര് ഹൈക്കോടതിയില്\\..."
}
unshuffled_deduplicated_mn
- Size of downloaded dataset files: 219.52 MB
- Size of the generated dataset: 883.46 MB
- Total amount of disk used: 1.10 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"МУБИС-ын багш мэргэжлийн хөрвөх сургалтыг төгссөн багшид багшлах эрх олгох тухай ~ БМДИ-ийн захирлын тушаал - Багшийн мэргэжил ..."
}
unshuffled_deduplicated_mr
- Size of downloaded dataset files: 299.68 MB
- Size of the generated dataset: 1.49 GB
- Total amount of disk used: 1.79 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Home / motivational marathi story / उद्योजकता (Entrepreneurship) / यांना हे जमलय, तर आपल्याला का नाही जमणार ?\\nयापैकी कोणाचीही ..."
}
unshuffled_deduplicated_mrj
- Size of downloaded dataset files: 0.29 MB
- Size of the generated dataset: 1.10 MB
- Total amount of disk used: 1.38 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Лӹпӹвлӓ (латинлӓ Lepidoptera ; алыкмарла лыве-влак) — капшангывлӓ йыхыш пырышы сӱмӓн нӹл шылдыран капшангывлӓ. Цилӓжӹ 180000 тӹ..."
}
unshuffled_deduplicated_ms
- Size of downloaded dataset files: 16.39 MB
- Size of the generated dataset: 49.45 MB
- Total amount of disk used: 65.85 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Sanad pertama daripada Zuhair bin Harb daripada ‘Affan daripada Hammad daripada Thabit daripada Anas.\\nSanad kedua daripada ‘Ab..."
}
unshuffled_deduplicated_mt
- Size of downloaded dataset files: 5.90 MB
- Size of the generated dataset: 17.68 MB
- Total amount of disk used: 23.58 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "tibgħat il-kawża lura lill-Qorti Ġenerali għall-annullament jew għat-tnaqqis tal-penalità imposta mill-Kummissjoni bid-deċiżjoni inizjali kif emendata bid-deċiżjoni ta’ rettifika;"
}
unshuffled_deduplicated_mwl
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Deciplina social i outónoma que angloba atebidades de ouserbaçon, de análeze, de çcriçon, cumparaçon, de sistematizaçon i de sp..."
}
unshuffled_deduplicated_my
- Size of downloaded dataset files: 207.14 MB
- Size of the generated dataset: 1.11 GB
- Total amount of disk used: 1.32 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ျမ၀တီ - ရန္ကုန္တိုင္းေဒသႀကီး ေျမာက္ဥကၠလာပႏွင္႕ ဗဟန္းၿမိဳ႔နယ္ မေကြးတိုင္း ေဒသႀကီး ပခုကၠဴၿမိဳ႔နယ္တို႔၌ ျမန္မာ႕တပ္မေတာ္အား ေထာက္ခံ..."
}
unshuffled_deduplicated_myv
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"2018 иень умарьковонь 6-це чистэ сась паро куля! Россиянь культурань Министерствась макссь невтемань конёв (прокатной удостовер..."
}
unshuffled_deduplicated_mzn
- Size of downloaded dataset files: 0.16 MB
- Size of the generated dataset: 0.63 MB
- Total amount of disk used: 0.79 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"قرآن یا قوران اسلام ِآسمونی کتاب هسته. مسلمونون گانّّه قرآن ره خدا، وحی جه برسنییه، «محمد معجزه» هسته و ثقلین حدیث دله ونه خَو..."
}
unshuffled_deduplicated_nah
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.01 MB
- Total amount of disk used: 0.01 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "In mācuīlpōhualxihuitl VI (inic chicuacē) in mācuīlpōhualli xiuhitl cāhuitl īhuīcpa 501 xihuitl oc 600 xihuitl."
}
unshuffled_deduplicated_nap
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.01 MB
- Total amount of disk used: 0.02 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ò AUDIT í Ç è î ÿ å å 30 ò ÿ ÿ é, õ ñ ì ÿ, ê ã- ò à ì. å â å í ç â à à é ñ è å é ó ó ë. å å å û è å î é è à. à è à AUDIT 1-7 â ..."
}
unshuffled_deduplicated_nds
- Size of downloaded dataset files: 5.27 MB
- Size of the generated dataset: 13.48 MB
- Total amount of disk used: 18.76 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Dor kann sik vun nu af an de hele plattdüütsche Welt – vun Niebüll bit New York, vun Helgoland bit Honolulu – drapen. Allens, w..."
}
unshuffled_deduplicated_ne
- Size of downloaded dataset files: 240.63 MB
- Size of the generated dataset: 1.24 GB
- Total amount of disk used: 1.48 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"बर्दिबास नगरपालिकाको तेस्रो नगर परिषदबाट पारित आ.व.२०७३।७४ को संशोधित र २०७४।७५ को प्रस्तावित नीति, कार्यक्रम तथा बजेट\\nअार्थिक..."
}
unshuffled_deduplicated_new
- Size of downloaded dataset files: 0.83 MB
- Size of the generated dataset: 4.26 MB
- Total amount of disk used: 5.09 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"थ्व शहरयागु अक्षांश ३४.७००१६४ उत्तर व देशान्तर ८६.३७६४६९ पश्चिम खः (34.700164° N 86.376469° W)। थ्व थासे ७२२६७३२ वर्ग मिटर (२.७..."
}
unshuffled_deduplicated_nl
- Size of downloaded dataset files: 15.73 GB
- Size of the generated dataset: 41.91 GB
- Total amount of disk used: 57.65 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Op vrijdag 31 augustus wordt het nieuwe studiejaar van de masteropleiding architectuur geopend met een dagexcursie naar Venlo.\\..."
}
unshuffled_deduplicated_nn
- Size of downloaded dataset files: 23.58 MB
- Size of the generated dataset: 58.32 MB
- Total amount of disk used: 81.90 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "Planomtale krav til innhald Bakgrunn: Spørsmål frå fleire kommunar om kva ein planomtale/planbeskrivelse bør innehalde Fylkeskommunen og fylkesmannen har i ein del saker reist motsegn på formelt grunnlag"
}
unshuffled_deduplicated_no
- Size of downloaded dataset files: 1.96 GB
- Size of the generated dataset: 5.11 GB
- Total amount of disk used: 7.07 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Ytterligere aktører i primærhelsetjenesten og andre NHS-virksomheter ble infisert, inkludert legekontor.Læreren vår er så attra..."
}
unshuffled_deduplicated_oc
- Size of downloaded dataset files: 1.34 MB
- Size of the generated dataset: 4.00 MB
- Total amount of disk used: 5.34 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": ".рф (rf, còdi punycode: .xn--p1ai)[1] es lo nom de domeni en rus per Russia. Foguèt activat lo 12 de mai de 2010. Lo còdi latin es .ru."
}
unshuffled_deduplicated_or
- Size of downloaded dataset files: 38.72 MB
- Size of the generated dataset: 197.63 MB
- Total amount of disk used: 236.36 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ଭୁବନେଶ୍ୱର, ୨୭/୧– (ଓଡ଼ିଆ ପୁଅ) ସିପିଆଇ ଜାତୀୟ ପରିଷଦର ଆହ୍ୱାନକ୍ରମେ ଗତକାଲି ଜାନୁୟାରୀ ୨୬ ସାଧାରଣତନ୍ତ୍ର ଦିବସକୁ ଦେଶ ବ୍ୟାପୀ ସମ୍ବିଧାନ ସୁରକ୍ଷା ..."
}
unshuffled_deduplicated_os
- Size of downloaded dataset files: 2.83 MB
- Size of the generated dataset: 11.00 MB
- Total amount of disk used: 13.83 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"1. Лæппу æмæ чызг казрæдзийы зæрдæмæ куы фæцæуынц æмæ, куы сфæнд кæнынц сæ цард баиу кæнын, уæд лæппу бар ракуры чызгæй, цæмæй ..."
}
unshuffled_deduplicated_pa
- Size of downloaded dataset files: 102.39 MB
- Size of the generated dataset: 483.04 MB
- Total amount of disk used: 585.42 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ਰਜਿ: ਨੰ: PB/JL-138/2018-20 ਜਿਲਦ 63, ਬਾਨੀ ਸੰਪਾਦਕ (ਸਵ:) ਡਾ: ਸਾਧੂ ਸਿੰਘ ਹਮਦਰਦ ਫ਼ੋਨ : 0181-2455961-62-63, 5032400, ਫੈਕਸ : 2455960, 2..."
}
unshuffled_deduplicated_pam
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Áku pu i Anak ning Aláya at ngeni ipákit kó kékayu ngan nûng makanánu lang susúlat détinang kulit a mágkas. Lauan ya ing tarátu..."
}
unshuffled_deduplicated_pl
- Size of downloaded dataset files: 20.19 GB
- Size of the generated dataset: 50.59 GB
- Total amount of disk used: 70.78 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"System informatyczny - Załącznik nr 1 do zarządzenia Wójta Gminy Podegrodzie Nr 530/2013 z dnia 27 maja 2013 r\\nSystem informat..."
}
unshuffled_deduplicated_pms
- Size of downloaded dataset files: 0.71 MB
- Size of the generated dataset: 2.00 MB
- Total amount of disk used: 2.72 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Louvigné-du-Désert a l'é na comun-a fransèisa ant la region aministrativa dla Brëtagna, ant ël dipartiment d'Ille-et-Vilaine. A..."
}
unshuffled_deduplicated_pnb
- Size of downloaded dataset files: 2.58 MB
- Size of the generated dataset: 9.44 MB
- Total amount of disk used: 12.02 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"ایہ فائل Wikimedia Commons توں اے تے دوجیاں ویونتاں تے وی ورتی جاےکدی اے۔ گل بات اس دے فائل گل بات صفہ تے تھلے دتی گئی۔\"..."
}
unshuffled_deduplicated_ps
- Size of downloaded dataset files: 71.83 MB
- Size of the generated dataset: 254.79 MB
- Total amount of disk used: 326.61 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Many people usually use the time period ‘business to business (B2B) advertising,’ however most of them do not know precisely wh..."
}
unshuffled_deduplicated_pt
- Size of downloaded dataset files: 26.00 GB
- Size of the generated dataset: 68.37 GB
- Total amount of disk used: 94.37 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Você pode estar lendo este texto no sofá, levantar pra pegar uma breja na geladeira, dar uma cagada e sentar novamente, sem int..."
}
unshuffled_deduplicated_qu
- Size of downloaded dataset files: 0.02 MB
- Size of the generated dataset: 0.07 MB
- Total amount of disk used: 0.09 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Warayu wichay (kastilla simipi: Ascensión de Guarayos) nisqaqa Buliwya mama llaqtapi, Santa Krus suyupi, huk llaqtam, Warayu pruwinsyap uma llaqtanmi."
}
unshuffled_deduplicated_rm
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.01 MB
- Total amount of disk used: 0.01 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"practicists agrars / practicistas agraras AFP pon far ina furmaziun da basa scursanida per cuntanscher in attestat federal da q..."
}
unshuffled_deduplicated_ro
- Size of downloaded dataset files: 4.48 GB
- Size of the generated dataset: 11.66 GB
- Total amount of disk used: 16.14 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"“În viață, oportunitatea nu este totul. Cine atrage Lumina, cineva bun în umbră. Timpul ne creează.” maestru\\nLyn.Evans: Ce mar..."
}
unshuffled_deduplicated_ru
- Size of downloaded dataset files: 166.68 GB
- Size of the generated dataset: 611.70 GB
- Total amount of disk used: 778.38 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Доступ к данному профилю для публичного просмотра закрыт администрацией сайта - профиль находится на модерации.\\nРазработчикам ..."
}
unshuffled_deduplicated_sa
- Size of downloaded dataset files: 7.27 MB
- Size of the generated dataset: 38.33 MB
- Total amount of disk used: 45.60 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"अनिरुद्धनगरे क्रीडिता रामलीला सम्प्रति समाप्ता अस्ति । तस्य कानिचन् चित्राणि पूर्वमेव प्रकाशितानि सन्ति । द्वौ चलचित्रौ अपि ..."
}
unshuffled_deduplicated_sah
- Size of downloaded dataset files: 7.01 MB
- Size of the generated dataset: 27.46 MB
- Total amount of disk used: 34.49 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████..."
}
unshuffled_deduplicated_scn
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "La gilusìa è nu sintimentu dulurusu ca nasci d'un disideriu di pussessu sclusivu ntê cunfrunti dâ pirsuna amata e dû timuri, dû suspettu o dâ cirtizza dâ sò nfidiltati."
}
unshuffled_deduplicated_sd
- Size of downloaded dataset files: 74.17 MB
- Size of the generated dataset: 275.48 MB
- Total amount of disk used: 349.66 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"هر ڪو ڄاڻي ٿو ته جڏهن توهان هڪ وڏي خريد ڪرڻ چاهيون ٿا, توهان پڄي ضروري حڪم ۾ ان جي ڪم ڪرڻ جي هٿ ۾ لاڳاپو ڪيو آهي. جي شيء آهي ته..."
}
unshuffled_deduplicated_sh
- Size of downloaded dataset files: 1.45 MB
- Size of the generated dataset: 6.44 MB
- Total amount of disk used: 7.87 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Opština Gornja Radgona se nalazi u sjeveroistočnoj Sloveniji i graniči s susjednom Austriji duž rijeke Mure. Sa tridesetim nase..."
}
unshuffled_deduplicated_si
- Size of downloaded dataset files: 175.62 MB
- Size of the generated dataset: 842.57 MB
- Total amount of disk used: 1.02 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"ලාංකීය සිතිවිලි සිංහල බ්ලොග් කියවනය කොත්තු සින්ඩිය ලංකා Blogger හත්මාළුව ලංකා බ්ලොග් කියවනය මාතලන්ගේ සින්ඩිය මොබයිල්lk\\nඅවකාශය ..."
}
unshuffled_deduplicated_sk
- Size of downloaded dataset files: 1.96 GB
- Size of the generated dataset: 4.80 GB
- Total amount of disk used: 6.76 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Aktivity | Agentúra podporovaného zamestnávania | vzdelávanie pre klientov, vzdelávanie pre odborníkov, kurzy\\nŠpecializované k..."
}
unshuffled_deduplicated_sl
- Size of downloaded dataset files: 523.22 MB
- Size of the generated dataset: 1.32 GB
- Total amount of disk used: 1.85 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Če Creatures, ki je želel, da pridejo na čas, predvsem je povedlo – razlikuje od ljubosumja začel grizenja kolen (ali zadnjica)..."
}
unshuffled_deduplicated_so
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.02 MB
- Total amount of disk used: 0.02 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт ттттттттттттттттуууууууууууу..."
}
unshuffled_deduplicated_sq
- Size of downloaded dataset files: 445.36 MB
- Size of the generated dataset: 1.21 GB
- Total amount of disk used: 1.66 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Çfarë do të më pëlqente tek një femër ose çfarë do të më shndërronte në një shpërthim drite? – Albert Vataj\\nTë gjithëve një zo..."
}
unshuffled_deduplicated_sr
- Size of downloaded dataset files: 665.03 MB
- Size of the generated dataset: 2.36 GB
- Total amount of disk used: 3.03 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Корисни савети за сваки дан. На сајту су разне категорије, као што су љепота, мода, кување и поправка властитим рукама.\\nШколск..."
}
unshuffled_deduplicated_su
- Size of downloaded dataset files: 0.05 MB
- Size of the generated dataset: 0.16 MB
- Total amount of disk used: 0.21 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Kartu krédit nyaéta \"duit plastik\" anu dikaluarkeun ku bank pikeun alat pambayaran di tempat-tempat nu tangtu samisal jiga di hotél, réstoran, tempat rékréasi jeung sajabana.[1]"
}
unshuffled_deduplicated_sv
- Size of downloaded dataset files: 10.19 GB
- Size of the generated dataset: 26.33 GB
- Total amount of disk used: 36.51 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"1783 är ett viktigt årtal i den nya tidens historia. Det året slöts en fred i Paris och därmed blev de 13 brittiska kolonierna ..."
}
unshuffled_deduplicated_sw
- Size of downloaded dataset files: 2.95 MB
- Size of the generated dataset: 8.98 MB
- Total amount of disk used: 11.92 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki chache tu kabla ya Papa Francis kuanza ziara yake katika nchi hiyo yenye idadi kubwa kabisa ya watu katika ulimwengu wa nchi za Kiarabu."
}
unshuffled_deduplicated_ta
- Size of downloaded dataset files: 971.12 MB
- Size of the generated dataset: 5.48 GB
- Total amount of disk used: 6.45 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"பொழுது சாய்ந்து வெகு நேரமாகிவிட்டது. கூலி வேலைக்குப் போயிருந்த 'சித்தாள் ' பெண்கள் எல்லோரும் வீடு திரும்பி விட்டார்கள். இன்னும்..."
}
unshuffled_deduplicated_te
- Size of downloaded dataset files: 342.43 MB
- Size of the generated dataset: 1.70 GB
- Total amount of disk used: 2.04 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"హర్యానాలో టోల్ దగ్గర సిబ్బంది.. స్థానిక ప్రజలు కొట్టుకున్నారు. కర్నాల్ అనే గ్రామానికి సమీపంలో టోల్ గేట్ ఉంది. అయితే సాధారణంగా స..."
}
unshuffled_deduplicated_tg
- Size of downloaded dataset files: 62.90 MB
- Size of the generated dataset: 261.68 MB
- Total amount of disk used: 324.60 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Ҳумайро гуфтааст, мухолифи низом аст, низоме, ки дар Тоҷикистон вуҷуд дорад. Ба ин маънӣ, худро мухолифи давлату ҳукумати Тоҷик..."
}
unshuffled_deduplicated_th
- Size of downloaded dataset files: 3.54 GB
- Size of the generated dataset: 17.11 GB
- Total amount of disk used: 20.65 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ฟันที่แลดูขาวสะอาดไม่มีเศษอาหารติดอยู่ เหงือกสีชมพู ไม่เจ็บ หรือมีเลือดออกเวลาแปรงฟันหรือขัดฟัน ไม่มีปัญหาเรื่องกลิ่นปาก ทำให้ก..."
}
unshuffled_deduplicated_tk
- Size of downloaded dataset files: 2.22 MB
- Size of the generated dataset: 7.12 MB
- Total amount of disk used: 9.34 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Türkmenistanyň Prezidenti agyr atletika boýunça dünýä çempionatyna taýýarlyk işleriniň barşy bilen tanyşdy\\nHalallykdan kemal t..."
}
unshuffled_deduplicated_tl
- Size of downloaded dataset files: 151.34 MB
- Size of the generated dataset: 431.69 MB
- Total amount of disk used: 583.04 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"“Gusto ko manawagan sa mga Unit Head ng Chanel 2 Salve. Kasi napapansin ko iyon mga alaga ko ang taping halos once a week lang,..."
}
unshuffled_deduplicated_tr
- Size of downloaded dataset files: 10.39 GB
- Size of the generated dataset: 28.47 GB
- Total amount of disk used: 38.86 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Son yıllarda görülen ay tutulmalarına göre daha etkili olacağı söylenen Kanlı veya Kırmızı Ay Tutulmasına saatler kaldı. Bu akş..."
}
unshuffled_deduplicated_tt
- Size of downloaded dataset files: 85.89 MB
- Size of the generated dataset: 321.37 MB
- Total amount of disk used: 407.26 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"\\\"Иремнең вафатына 40 көн узгач, Алмаз да безнең өйгә кереп үлде\\\". Арчада 35 яшьлек ир өстенә кондызлар ега башлаган агач төшк..."
}
unshuffled_deduplicated_tyv
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.01 MB
- Total amount of disk used: 0.01 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Экии, хүндүлуг аалчылар болгаш тыва дылдың деткикчилери! Тыва дылдың болгаш чогаалдың ховар бир башкызынга, Менги Ооржакка, ажы..."
}
unshuffled_deduplicated_ug
- Size of downloaded dataset files: 20.53 MB
- Size of the generated dataset: 86.44 MB
- Total amount of disk used: 106.97 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"زاڭ-ءتۇزىم | عىلىم-تەحنيكا | ءتىل-ادەبيەت | تۇرمىس | دەنە تاربيە | ساياحات-ورتا | سۋرەتتى حابار | سىر سۇحبات | ارناۋلى تاقىرىپ ..."
}
unshuffled_deduplicated_uk
- Size of downloaded dataset files: 8.04 GB
- Size of the generated dataset: 29.86 GB
- Total amount of disk used: 37.90 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Про надання роз'яснення (щодо форми письмового зобов'язання громадян про зворотне ввезення/вивезення товарів), Державна митна с..."
}
unshuffled_deduplicated_ur
- Size of downloaded dataset files: 483.59 MB
- Size of the generated dataset: 1.82 GB
- Total amount of disk used: 2.31 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"آئیے اہم اسلامی کتب کو یونیکوڈ میں انٹرنیٹ پر پیش کرنے کے لئے مل جل کر آن لائن ٹائپنگ کریں۔ محدث ٹائپنگ پراجیکٹ کے ذریعے آپ روز..."
}
unshuffled_deduplicated_uz
- Size of downloaded dataset files: 4.30 MB
- Size of the generated dataset: 12.00 MB
- Total amount of disk used: 16.29 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Qurama tog'lari tizmasining Toshkentdan 154 km uzoqlikdagi Toshkent-Ush yo'li yeqasidaxushmanzara tabiat qo'ynida joylashgan maydoni 30 ga.\nBolalarni sog'lomlashtirish oromgohi Bo'stonliq tumani Oqtosh muntaqasining soy-salqin gushasida joylashgan."
}
unshuffled_deduplicated_vec
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.02 MB
- Total amount of disk used: 0.02 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Par ogni pónto, ła derivada ła xe ła pendensa de ła reta tangente a ła curva de ła funsion f. Ła reta de cołor róso l'è senpre ..."
}
unshuffled_deduplicated_vi
- Size of downloaded dataset files: 10.71 GB
- Size of the generated dataset: 33.60 GB
- Total amount of disk used: 44.31 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Canh chua cá bông lau không chỉ là món ăn giải nhiệt, thanh mát ngày hè mà còn là món siêu bổ dưỡng, rất tốt cho người gầy ốm. ..."
}
unshuffled_deduplicated_vo
- Size of downloaded dataset files: 0.30 MB
- Size of the generated dataset: 2.10 MB
- Total amount of disk used: 2.40 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Sarniguet binon zif in ziläk: Hautes-Pyrénées, in topäd: Midi-Pyrénées, in Fransän. Sarniguet topon videtü 43°19’ 7’’ N e lunetü 0°5’ 19’’ L."
}
unshuffled_deduplicated_wa
- Size of downloaded dataset files: 0.08 MB
- Size of the generated dataset: 0.22 MB
- Total amount of disk used: 0.29 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Cisse pådje ci n' est co k' on djermon, dj' ô bén k' el pådje est djusse sibåtcheye, eyet co trop tene; et s' divreut ele ecråxhî ene miete."
}
unshuffled_deduplicated_war
- Size of downloaded dataset files: 0.55 MB
- Size of the generated dataset: 2.36 MB
- Total amount of disk used: 2.90 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "An Honce amo in usa ka baryo ngan munisipalidad ha distrito han Rožňava ha rehiyon han Košice ha nasod han Slovakia.\nAn Rumegies amo in usa ka komyun ha departamento han Nord ngan ha rehiyon han Nord-Pas-de-Calais ha nasod han Fransya."
}
unshuffled_deduplicated_wuu
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.03 MB
- Total amount of disk used: 0.04 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"伊春元旦天气 伊春腊八天气 伊春春节天气 伊春情人节天气 伊春元宵节天气 伊春愚人节天气 伊春清明节天气 伊春劳动节天气 伊春母亲节天气 伊春端午节天气 伊春七夕节天气 伊春教师节天气 伊春中秋节天气 伊春国庆节天气 伊春重阳节天气 伊春万圣节天气 伊春..."
}
unshuffled_deduplicated_xal
- Size of downloaded dataset files: 0.03 MB
- Size of the generated dataset: 0.12 MB
- Total amount of disk used: 0.15 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Арнгудин Орн гисн Европд бәәдг һазр. 2007 җилин тooһaр эн орн нутгт 3,600,523 әмтн бәәдг билә. Арнгудин Орнин хотл балһсна нерн..."
}
unshuffled_deduplicated_xmf
- Size of downloaded dataset files: 0.94 MB
- Size of the generated dataset: 4.63 MB
- Total amount of disk used: 5.58 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"მოჩამილი ტექსტი წჷმორინელი რე Creative Commons Attribution-ShareAlike ლიცენზიათ; შილებე გეძინელი პირობეფიშ არსებუა. კილიშკილიშა..."
}
unshuffled_deduplicated_yi
- Size of downloaded dataset files: 22.20 MB
- Size of the generated dataset: 88.29 MB
- Total amount of disk used: 110.49 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ממשותדיק - חבֿרה, איך אַרבעט איצט אױף אַ זשורנאַל. טאָמער איר האָט עפּעס צוצוגעבן זאָלט איר שיקן מיר אַן אָנזאָג. ס'װעט הײסן \\\"..."
}
unshuffled_deduplicated_yo
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.03 MB
- Total amount of disk used: 0.04 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Copyright © 2018 BBC. BBC kò mọ̀ nípa àwọn ohun tí ó wà ní àwọn ojú òpó tí ó wà ní ìta. Ọwọ́ tí a fi mú ìbáṣepọ̀ ti ìta.\"..."
}
unshuffled_deduplicated_yue
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 你還不爆 我累了 投降輸一半可以嗎\"..."
}
unshuffled_deduplicated_zh
- Size of downloaded dataset files: 99.98 GB
- Size of the generated dataset: 267.88 GB
- Total amount of disk used: 367.86 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"中国铝灰网 中国有色金属矿产网 中国黄莲网 中国水轮发电机网 中国抽油泵网 中国数控雕刻机网 中国不锈钢抛光网 中国磨具加工网 中国压铸铝网 中国耐水腻子网 中国手机摄像头网 中国粗粮网 中国车门锁网 中国钛粉网 中国轮圈网\\n天天中奖彩票图 天天中彩票..."
}
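The three size figures listed for each configuration appear to be related: the total amount of disk used is approximately the size of the downloaded files plus the size of the generated dataset. A minimal sketch of that check follows. The helper names `to_mb` and `total_disk_mb` are hypothetical, and the conversion factor 1 GB = 1000 MB is an assumption that matches the figures in this card but is not stated by it.

```python
# Hypothetical helpers for checking the size figures listed in this card.
# ASSUMPTION: 1 GB = 1000 MB. The card does not state which convention it uses.
MB_PER_UNIT = {"MB": 1.0, "GB": 1000.0}


def to_mb(size: str) -> float:
    """Convert a size string such as '99.98 GB' or '0.30 MB' to megabytes."""
    value, unit = size.split()
    return float(value) * MB_PER_UNIT[unit]


def total_disk_mb(downloaded: str, generated: str) -> float:
    """Estimate total disk use as download size plus generated size, in MB."""
    return to_mb(downloaded) + to_mb(generated)


# Figures copied from the unshuffled_deduplicated_zh entry above.
total = total_disk_mb("99.98 GB", "267.88 GB")
print(f"Estimated total: {total / 1000:.2f} GB")
```

Because every listed figure is rounded to two decimals, the sum can differ from the listed total by about 0.01 for some configurations.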
Data/size information for each language (original)
unshuffled_original_af
- Size of downloaded dataset files: 85.79 MB
- Size of the generated dataset: 254.08 MB
- Total amount of disk used: 339.87 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel"
}
unshuffled_original_als
- Size of downloaded dataset files: 1.49 MB
- Size of the generated dataset: 5.30 MB
- Total amount of disk used: 6.78 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"De Nazionalpark hät e Flächi vo 170,3 km² und isch dodemit s grösti Naturschutzgebiet vo de Schwiz. Er ligt uf em Gebiet vo de ..."
}
unshuffled_original_am
- Size of downloaded dataset files: 102.79 MB
- Size of the generated dataset: 378.06 MB
- Total amount of disk used: 480.85 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"አየር መንገዱ ከአዲስ አበባ ወደ ሮም ጣሊያን በማምራት ላይ በነበረበት ጊዜ ረዳት አብራሪው የጉዞውን አቅጣጫ በመቀየር ጄኔቭ አውሮፓላን ማረፊያ በማሳረፍ እጁን ለፖሊስ ሰጥቷል።\\nየኢትዮጵያ መንግስት የ..."
}
unshuffled_original_an
- Size of downloaded dataset files: 0.15 MB
- Size of the generated dataset: 1.33 MB
- Total amount of disk used: 1.48 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"واااااااأسفاه الأمم تفتخر ب 0 أمي ووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووو..."
}
unshuffled_original_ar
- Size of downloaded dataset files: 22.23 GB
- Size of the generated dataset: 87.94 GB
- Total amount of disk used: 110.17 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"مرحبا بك عزيز الزائر نتمنى لك أوقاتاً سعيدة معنا وأن نزداد شرفا بخدمتك ولا تنسى التسجيل معنا لتستفيد بكل جديد\\nأهلا وسهلا بك زا..."
}
unshuffled_original_arz
- Size of downloaded dataset files: 15.90 MB
- Size of the generated dataset: 70.13 MB
- Total amount of disk used: 86.03 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"بنى عجل : قبيلة من عجل بن لجيم بن صعب بن على بن بكر بن وائل انتقل اغلبهم الى البصرة فى العراق و اصفهان و خراسان فى ايران و اذرب..."
}
unshuffled_original_as
- Size of downloaded dataset files: 21.43 MB
- Size of the generated dataset: 117.73 MB
- Total amount of disk used: 139.17 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"আমি, এই সংগঠনৰ সদস্য সকলে একেলগ হৈ অসমকে ধৰি ভাৰতৰ উত্তৰ পূৰ্বাঞ্চলৰ অমূল্য কলা-সাংস্কৃতিক সম্পদৰাজি বৃহত্তৰ অষ্ট্ৰেলিয়াৰ সন্মু..."
}
unshuffled_original_ast
- Size of downloaded dataset files: 0.92 MB
- Size of the generated dataset: 2.54 MB
- Total amount of disk used: 3.46 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"The Killers llanzaron el so álbum debú, Hot Fuss, en xunu de 2004 nel Reinu Xuníu, al traviés de la discográfica Lizard King, y..."
}
unshuffled_original_av
- Size of downloaded dataset files: 0.08 MB
- Size of the generated dataset: 0.42 MB
- Total amount of disk used: 0.50 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Жинда малъараб ва божизе бегьулеб рагІудаса кьуризе бегьуларо гьев. Гьес насихІат гьабизе кколелъул бацІцІадаб диналъул рахъалъ..."
}
unshuffled_original_az
- Size of downloaded dataset files: 927.76 MB
- Size of the generated dataset: 2.96 GB
- Total amount of disk used: 3.89 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"AZTV-Artıq 7 ildir ki, Abşeron rayonu dotasiya almadan bütün xərclərini yerli daxilolmalar hesabına maliyyələşdirir.\\nDünən, 10..."
}
unshuffled_original_azb
- Size of downloaded dataset files: 6.64 MB
- Size of the generated dataset: 28.47 MB
- Total amount of disk used: 35.11 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"لعلی ١٣-جو عصرده یاشاییب یاراتمیش گؤرکملی آذربایجان شاعرلریندندیر. ١٢٢٤-جی ایلده تبریزده آنادان اولموشدور، گنج یاشلاریندا تیجار..."
}
unshuffled_original_ba
- Size of downloaded dataset files: 33.22 MB
- Size of the generated dataset: 133.70 MB
- Total amount of disk used: 166.92 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Күҙәтеү ҡуласаһы моделен хәҙер Мифтахетдин Аҡмулла исемендәге Башҡорт дәүләт педагогия университетында ла эшләргә мөмкин\\t\\nКүҙ..."
}
unshuffled_original_bar
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": " vo"
}
unshuffled_original_bcl
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"& ÿ ó / í 0 - ø û ù ö ú ð ï ú \\u0014 ù þ ô ö í ÷ ò \\u0014 ÷ í ù û ö í \\u0001 û ñ ç þ \\u0001 ð \\u0007 þ ò ñ ñ ò ô \\u0017 û ö ô ÷..."
}
unshuffled_original_be
- Size of downloaded dataset files: 498.29 MB
- Size of the generated dataset: 1.88 GB
- Total amount of disk used: 2.38 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Брэсцкія ўлады не дазволілі прафсаюзу РЭП правесці пікетаванне ў парку Воінаў-інтэрнацыяналістаў 30 мая 2018 года.\\nСітуацыю пр..."
}
unshuffled_original_bg
- Size of downloaded dataset files: 8.34 GB
- Size of the generated dataset: 33.75 GB
- Total amount of disk used: 42.09 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ЖАЛБОПОДАТЕЛЯТ директор на Дирекция „ Обжалване и данъчно-осигурителна практика“- Бургас, редовно призован, се представлява от ..."
}
unshuffled_original_bh
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.12 MB
- Total amount of disk used: 0.13 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"सुकमा जिला भारत के छत्तीसगढ़ राज्य में एगो जिला बाटे। एकर मुख्यालय सुकमा शहर बाटे। एकर कुल रकबा 5636 वर्ग कि॰मी॰ बाटे।\"..."
}
unshuffled_original_bn
- Size of downloaded dataset files: 2.14 GB
- Size of the generated dataset: 10.77 GB
- Total amount of disk used: 12.91 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ভড়ং সর্বস্ব বাংলা আর্ট অ্যান্ড কালচারের হিসাব গুলিয়ে দেওয়ার ম্যাজিকের নাম ব্রাত্য রাইসু November 23, 2017\\nভড়ং সর্বস্ব বাংলা আর..."
}
unshuffled_original_bo
- Size of downloaded dataset files: 28.94 MB
- Size of the generated dataset: 195.40 MB
- Total amount of disk used: 224.34 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"བོད་མི་འདི་དག་ནི་རང་རྒྱུད་སྒོ་རུ་ཕུད་དེ་གཞན་རྒྱུད་པང་དུ་ཉར་ནས་གསོ་སྐྱོང་བྱེད་དགོས་ཟེར་བ་དང་གཅིག་མཚུངས་རེད།\\nཚན་རིག་ནི་དང་ཐོག་རང..."
}
unshuffled_original_bpy
- Size of downloaded dataset files: 0.34 MB
- Size of the generated dataset: 4.35 MB
- Total amount of disk used: 4.69 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"পৌরসভা এহার আয়তন (লয়াহান) ২,৭৩০,.৬৩ বর্গ কিলোমিটার। পৌরসভা এহার মাপাহানর অক্ষাংশ বারো দ্রাঘিমাংশ ইলতাই 18.63° S 48.18° W ।[১]..."
}
unshuffled_original_br
- Size of downloaded dataset files: 9.18 MB
- Size of the generated dataset: 30.20 MB
- Total amount of disk used: 39.38 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Ar mank Magalhães(Daveoù a vank) a zo ur spesad evned, Spheniscus magellanicus an anv skiantel anezhañ.\\nGallout a reer implijo..."
}
unshuffled_original_bs
- Size of downloaded dataset files: 0.05 MB
- Size of the generated dataset: 0.48 MB
- Total amount of disk used: 0.53 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ž šř é ú šř šř ě šř ž é č ě ž ů ě ď éé ýš ě ě Ž č š ý ě ď é ýš ě ď ě éé ýš ě č ž ě š ý ď ě ýš é ú č ž č š ý ď ý ž é éě ď é č ýš..."
}
unshuffled_original_bxr
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.01 MB
- Total amount of disk used: 0.02 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"2002 оной хабар буряад хэлэ бэшэгэй һалбари Үндэһэтэнэй хүмүүнлиг ухаанай дээдэ һургуули болгогдожо өөршэлэгдөө.\\nХарин мүнөө б..."
}
unshuffled_original_ca
- Size of downloaded dataset files: 3.10 GB
- Size of the generated dataset: 8.62 GB
- Total amount of disk used: 11.73 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Daniel Vendrell, conegut com Vandrell, ha sigut un dels il•lustradors contemporanis més influents, representant a la nova onada..."
}
unshuffled_original_cbk
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano..."
}
unshuffled_original_ce
- Size of downloaded dataset files: 2.09 MB
- Size of the generated dataset: 8.73 MB
- Total amount of disk used: 10.82 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Шаьш анархисташ ду бохучу жигархойн дIахьедарехь дуьйцу, оьрсийн ницкъаллийн структурийн а, федералан каналан а Iалашонаш \\\"мар..."
}
unshuffled_original_ceb
- Size of downloaded dataset files: 11.07 MB
- Size of the generated dataset: 40.97 MB
- Total amount of disk used: 52.03 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Si Isko walay pupamilok nga nagtan-aw sa unahan, natugaw. “Naunsa ka gud diha Isko nga layo man kaayo ang imong panan-aw?” ni I..."
}
unshuffled_original_ckb
- Size of downloaded dataset files: 111.88 MB
- Size of the generated dataset: 510.97 MB
- Total amount of disk used: 622.85 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"رسی رۆژ - ساڵێک دوای بومەلەرزەی کرماشان میوانی بەرنامە : کاک سیاوەش حەیاتی چالاکی مەدەنی -قەسری شیرین\\nپارچە موزیک 30 / 10 / 20..."
}
unshuffled_original_cs
- Size of downloaded dataset files: 21.72 GB
- Size of the generated dataset: 57.08 GB
- Total amount of disk used: 78.80 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Akce anarchistů proti připravovanému novému služební řádu a nízkým mzdám 1903 – Historie českého anarchismu (1880 – 1939)\\nRost..."
}
unshuffled_original_cv
- Size of downloaded dataset files: 9.40 MB
- Size of the generated dataset: 41.05 MB
- Total amount of disk used: 50.45 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Шыранӑ чухне ӑнсӑртран латин кирилл саспаллисем вырӑнне латин саспаллисене ҫырсан, сайт эсир ҫырнине юсама тӑрӑшӗ.\\nКу сайтра ч..."
}
unshuffled_original_cy
- Size of downloaded dataset files: 81.74 MB
- Size of the generated dataset: 224.93 MB
- Total amount of disk used: 306.67 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Mae capeli Cymreig yr Andes ym Mhatagonia wedi cyhoeddi na fydd gwasanaethau yno weddill y mis, oherwydd yr eira trwm sydd wedi..."
}
unshuffled_original_da
- Size of downloaded dataset files: 6.00 GB
- Size of the generated dataset: 16.76 GB
- Total amount of disk used: 22.76 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Den 2.-5. februar 2016 løb det tredje kursus i uddannelsen af 4kommunesamarbejdets Local Impact Coaches, af stablen i Gentofte ..."
}
unshuffled_original_de
- Size of downloaded dataset files: 119.51 GB
- Size of the generated dataset: 331.22 GB
- Total amount of disk used: 450.73 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Auf dieser Seite gibt es mind. ein YouTube Video. Cookies für diese Website wurden abgelehnt. Dadurch können keine YouTube Vide..."
}
unshuffled_original_diq
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "Zıwanê Slawki, zıwano merdumanê Slawano. Zıwanê Slawki yew lızgeyê Zıwananê Hind u Ewropao. Keyeyê Zıwananê Slawki beno hirê letey:"
}
unshuffled_original_dsb
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.01 MB
- Total amount of disk used: 0.02 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Pśiklaskaju južo pśed pśedstajenim... 1500 źiśi njamóžo wěcej docakaś, měsćańska hala w Chóśebuzu - wupśedana."
}
unshuffled_original_dv
- Size of downloaded dataset files: 24.91 MB
- Size of the generated dataset: 131.63 MB
- Total amount of disk used: 156.54 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ބ. އަތޮޅުގައި ހުޅުވަން ތައްޔާރުވަމުން އަންނަ ވައްކަރު ރިސޯޓުގައި ވަޒީފާ އަދާކުރަން ޝައުގުވެރިވާ ފަރާތްތަކަށް ކުރިމަތިލުމުގެ ފުރ..."
}
unshuffled_original_el
- Size of downloaded dataset files: 17.31 GB
- Size of the generated dataset: 66.27 GB
- Total amount of disk used: 83.58 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Νεκρός εντοπίστηκε μέσα στο σπίτι του στην οδό Ηρώδου Αττικού στον αριθμό 7 ο επικεφαλής του προξενικού τμήματος της Ρωσικής πρ..."
}
unshuffled_original_eml
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.02 MB
- Total amount of disk used: 0.03 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"A séguit dal prucès ad rubutiśasiòṅ di abitànt dal pòpul ad Mikenes, Angoras 'l è finî dènt'r a 'n robot cun la tèsta dna rana ..."
}
unshuffled_original_en
- Size of downloaded dataset files: 903.83 GB
- Size of the generated dataset: 2525.44 GB
- Total amount of disk used: 3429.27 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visi..."
}
unshuffled_original_eo
- Size of downloaded dataset files: 117.07 MB
- Size of the generated dataset: 314.18 MB
- Total amount of disk used: 431.27 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Ĉu ... preĝi | mediti | ricevi instigojn || kanti | muziki || informiĝi | legi | studi || prepari Diservon\\nTemas pri kolekto d..."
}
unshuffled_original_es
- Size of downloaded dataset files: 106.04 GB
- Size of the generated dataset: 298.49 GB
- Total amount of disk used: 404.53 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Como se librará de la celulitis en el gimnasio La piel superflua en las manos después del adelgazamiento, Los bailes fáciles pa..."
}
unshuffled_original_et
- Size of downloaded dataset files: 1.88 GB
- Size of the generated dataset: 5.17 GB
- Total amount of disk used: 7.06 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"MTÜ AB Video järgib oma tegevuses kodanikuühenduste eetilise tegevuse üldtunnustatud põhimõtteid, mis on lühidalt kokkuvõetud 7..."
}
unshuffled_original_eu
- Size of downloaded dataset files: 248.19 MB
- Size of the generated dataset: 894.83 MB
- Total amount of disk used: 1.14 GB
An example of 'train' looks as follows.
{
"id": 0,
"text": "Gure jarduerek eraikuntzarekin, elkarbizitzarekin, hirigintzarekin eta ekologiarekin dute harremana, baita ideia eta konponbideak irudikatu eta garatzearekin ere, eraikuntza sektorea hobetuz, pertsonen erosotasuna eta bizi-kalitatea hobetzeko."
}
unshuffled_original_fa
- Size of downloaded dataset files: 20.96 GB
- Size of the generated dataset: 84.21 GB
- Total amount of disk used: 105.17 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"قـــــــــــــــــرار بود با هم کنـــــــــــــار بیایم نه اینکه از کنــــــــــــار هم رد بشیم...!!!\\nاگر روزی دلت لبریز غم بو..."
}
unshuffled_original_fi
- Size of downloaded dataset files: 9.97 GB
- Size of the generated dataset: 28.57 GB
- Total amount of disk used: 38.54 GB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Kiitos Deelle kaikesta - 1,5 viikkoa kulunut, kun Dee ei ole enää ollut omani. Reilu viikko sitten sunnuntaina vein Deen uuteen kotiinsa. Itselläni on ollut niin ristiriitaiset t..."
}
unshuffled_original_fr
- Size of downloaded dataset files: 105.32 GB
- Size of the generated dataset: 303.19 GB
- Total amount of disk used: 408.51 GB
An example of 'train' looks as follows.
{
"id": 0,
"text": "Média de débat d'idées, de culture et de littérature. Récits, décryptages, analyses, portraits et critiques autour de la vie des idées. Magazine engagé, ouvert aux autres et au monde.. Bring up to date in french"
}
unshuffled_original_frr
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Hiragana’ Practice’Sheet’1’(A -O)’ ’ Name:’________ __________________________’Section:’_______________ _’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ..."
}
unshuffled_original_fy
- Size of downloaded dataset files: 12.40 MB
- Size of the generated dataset: 36.24 MB
- Total amount of disk used: 48.64 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Nim in sêfte ride op Holmsjön, yn ien fan 'e lytse marren yn de omkriten, of nim se op avontueren lykas nonresidential. lâns Indalsälven wetter. Holm Sportklubb hawwe kano 's te huur, yn gearwurking mei de Baltyske Power konferinsje."
}
unshuffled_original_ga
- Size of downloaded dataset files: 29.27 MB
- Size of the generated dataset: 92.37 MB
- Total amount of disk used: 121.63 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Is fóram é seo chun plé a dhéanamh ar an leabhar atá roghnaithe do mhí na Samhna 2013 amháin. Ní féidir ach le baill chláraithe..."
}
unshuffled_original_gd
- Size of downloaded dataset files: 0.52 MB
- Size of the generated dataset: 2.02 MB
- Total amount of disk used: 2.55 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "Zhou Yujun, a 'phàrtaidh Rùnaire Comataidh Sgìre Yanfeng ann Hengyang bhaile agus a Sgìre pàrtaidh agus an riaghaltas a' bhuidheann-riochdachaidh a 'tighinn a chèilidh air ar companaidh air Apr. 14, 2017."
}
unshuffled_original_gl
- Size of downloaded dataset files: 235.38 MB
- Size of the generated dataset: 656.48 MB
- Total amount of disk used: 891.87 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"O persoal de Inditex da provincia de Pontevedra segue a reclamar iguais condicións laborais no conxunto do país - CIG: Confeder..."
}
unshuffled_original_gn
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.04 MB
- Total amount of disk used: 0.05 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"º ÑÆÚÓ À Ã Ð É Æ ¾ ÄÂ Î À ¼ Æ É ÄÛ = Ü Ý\\\"Þ ßà á â ã ä å æçè ã é ê â å àë ì æê íî é á ë ï í çì àð í Ü à ñ ê é ò ä ì\"..."
}
unshuffled_original_gom
- Size of downloaded dataset files: 0.44 MB
- Size of the generated dataset: 2.25 MB
- Total amount of disk used: 2.71 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"दुष्ट शीळ हें कौरवांचें । रामें सविस्तर देखूनि साचें । बोलिले वचनें जें दुर्वाचे । करी तयांचें अनुस्मरण ॥२२०॥\"..."
}
unshuffled_original_gu
- Size of downloaded dataset files: 232.02 MB
- Size of the generated dataset: 1.09 GB
- Total amount of disk used: 1.33 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"અધિક માસ ચાલે છે. સમગ્ર ભારતમાં અને તેમાંય ખાસ કરીને પવિત્ર કે ધાર્મિક કહેવાય છે તેવા સ્થાનક પર કથાનો દોર ચાલે છે. ઉનાળાની કાળઝ..."
}
unshuffled_original_he
- Size of downloaded dataset files: 5.66 GB
- Size of the generated dataset: 21.11 GB
- Total amount of disk used: 26.77 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"זקוקים לרשתות נגד יתושים? מחפשים רשת מתאימה לחלון צר וקטן? רשתות נגד יתושים אקורדיון של חברת קליר-מש הן הפתרון.\\nרשתות לחלונות ..."
}
unshuffled_original_hi
- Size of downloaded dataset files: 3.66 GB
- Size of the generated dataset: 17.93 GB
- Total amount of disk used: 21.59 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"'आइटम गर्ल' बनकर हिट हुई थीं राखी सावंत, आज करीना-कटरीना तक फॉलो कर रही हैं ट्रेंड नक्सलियों का दम निकालेगा बाइक ग्रेनेड लॉन्च..."
}
unshuffled_original_hr
- Size of downloaded dataset files: 79.42 MB
- Size of the generated dataset: 243.83 MB
- Total amount of disk used: 323.24 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"U raspravi je sudjelovao i HSS-ov saborski zastupnik rekavši kako poljoprivrednici ne osjete mjere o kojima ministar govori jer..."
}
unshuffled_original_hsb
- Size of downloaded dataset files: 1.39 MB
- Size of the generated dataset: 4.49 MB
- Total amount of disk used: 5.87 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Budyšin (SN/BŠe). Elektronikarjo mějachu lětsa cyle hinaši zazběh do swojeho wukubłanja. Wokrjesne rjemjeslnistwo bě mjenujcy w..."
}
unshuffled_original_ht
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan..."
}
unshuffled_original_hu
- Size of downloaded dataset files: 15.69 GB
- Size of the generated dataset: 43.07 GB
- Total amount of disk used: 58.77 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"monster - Amatőr, házi szex videók és kezdő csjaok pornó filmjei. - Free amateur, home made sex videos and online porn movies. ..."
}
unshuffled_original_hy
- Size of downloaded dataset files: 897.36 MB
- Size of the generated dataset: 3.94 GB
- Total amount of disk used: 4.84 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Արցախի Հանրապետության հռչակման 26-րդ տարեդարձի կապակցությամբ Շուշիի Արվեստի կենտրոնում կազմակերպվել է մոսկվաբնակ նկարիչներ՝ հայ..."
}
unshuffled_original_ia
- Size of downloaded dataset files: 0.08 MB
- Size of the generated dataset: 0.69 MB
- Total amount of disk used: 0.78 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha h..."
}
unshuffled_original_id
- Size of downloaded dataset files: 10.60 GB
- Size of the generated dataset: 32.32 GB
- Total amount of disk used: 42.91 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Perihal dari itu, kalau kunci hal yang demikian hilang, pemilik wajib melapor ke bengkel sah untuk dibuatkan kunci baru dengan ..."
}
unshuffled_original_ie
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.02 MB
- Total amount of disk used: 0.02 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "Plastic Yo Yo Metal Yo Yos Wooden Yo Yo Keychain Yo Yo Translucent Yo Yo Light Up Yo Yo Globe Yo Yo Stress Reliever Yo Yo Jellyfish Yo Yo Sports Ball Yo Yo Sound Yo Yo Miniature Yo Yo Promotional Yo Yo Novelty Yo Yo Video Game Yo Yo ECO Recycled Yo Yo"
}
unshuffled_original_ilo
- Size of downloaded dataset files: 0.27 MB
- Size of the generated dataset: 0.92 MB
- Total amount of disk used: 1.20 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Segun ken ni Ping-ay, ti yellow corn ti maysa kadagiti nadakamat a liberalized agricultural commodity iti daytoy a free trade k..."
}
unshuffled_original_io
- Size of downloaded dataset files: 0.04 MB
- Size of the generated dataset: 0.16 MB
- Total amount of disk used: 0.20 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Chekia esas parlamentala republiko. La chefo di stato esas la prezidanto. Til 2013 lu elektesis dal parlamento. Pos ta yaro, ol..."
}
unshuffled_original_is
- Size of downloaded dataset files: 533.03 MB
- Size of the generated dataset: 1.52 GB
- Total amount of disk used: 2.06 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Eyjar.net - upplýsinga- og fréttamiðill um Vestmannaeyjar - Fréttir - Nái núverandi stefna stjórnvalda fram að ganga mun það va..."
}
unshuffled_original_it
- Size of downloaded dataset files: 52.16 GB
- Size of the generated dataset: 147.38 GB
- Total amount of disk used: 199.54 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Jaundice - causes, treatment & pathology massaggio a osteochondrosis dellindizio di una controindicazione\\nTrattamento su un co..."
}
unshuffled_original_ja
- Size of downloaded dataset files: 79.56 GB
- Size of the generated dataset: 232.22 GB
- Total amount of disk used: 311.78 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"神社などへ一緒に同行して、様々な角度のショットで家族写真やお子様の写真を撮影致します!お好みに合わせて様々な写真を取ることができますので、その場でカメラマンへのリクエストも可能です!お子様の晴れ姿を、緊張していない自然な笑顔で残しませんか?\\n※七五三の..."
}
unshuffled_original_jbo
- Size of downloaded dataset files: 0.21 MB
- Size of the generated dataset: 0.77 MB
- Total amount of disk used: 0.98 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "ni'o 23 la cimast. cu 23moi djedi fi'o masti la cimast. noi ke'a cu cimoi masti .i 22 la cimast. cu purlamdei .ije 24 la cimast. cu bavlamdei"
}
unshuffled_original_jv
- Size of downloaded dataset files: 0.22 MB
- Size of the generated dataset: 0.69 MB
- Total amount of disk used: 0.91 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"José Mourinho (diwaca: [ʒuˈzɛ moˈɾiɲu]; lair ing Setubal, Portugal, 26 Januari 1963; umur 55 taun) iku salah siji pelatih bal k..."
}
unshuffled_original_ka
- Size of downloaded dataset files: 680.74 MB
- Size of the generated dataset: 3.77 GB
- Total amount of disk used: 4.45 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"წამიყვანე შენთან ერთად (ქართულად) / Возьми меня с собой (картулад) / (რუსული სერიალები ქართულად) (რუსების პორნო ონლაინში) (ruse..."
}
unshuffled_original_kk
- Size of downloaded dataset files: 615.06 MB
- Size of the generated dataset: 2.83 GB
- Total amount of disk used: 3.45 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Түлкібас ауданында «Латын негізді әліпби мен емле ережесі туралы насихат» жобасының тобы семинар өткізді\\nЕлорданың «Қазақстан»..."
}
unshuffled_original_km
- Size of downloaded dataset files: 193.28 MB
- Size of the generated dataset: 1.10 GB
- Total amount of disk used: 1.30 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ខ្សឹបដាក់ត្រចៀក៖ លោក សួស សុផានិត នាយផ្នែករដ្ឋបាលព្រៃឈើ ស្រុកភ្នំក្រវាញ់ ដែលទើបឡើងកាន់តំណែងថ្មី បើកដៃឲ្យឈ្នួញ ប្រព្រឹត្តបទល្មើស ..."
}
unshuffled_original_kn
- Size of downloaded dataset files: 342.15 MB
- Size of the generated dataset: 1.76 GB
- Total amount of disk used: 2.11 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ರಾಷ್ಟ್ರಪತಿ ಪ್ರಣಬ್ ಮುಖರ್ಜಿಯಿಂದ ಪದ್ಮ ಪ್ರಶಸ್ತಿ ಪ್ರದಾನ | President Pranab Mukherjee Confers Padma Awards | Photo Gallery on Kannada..."
}
unshuffled_original_ko
- Size of downloaded dataset files: 8.81 GB
- Size of the generated dataset: 25.29 GB
- Total amount of disk used: 34.10 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"CIA 프로젝트에서는 데이터베이스로 들어오는 요청을 중간에 수집(Sniffing)하고 수집한 데이터를 분석(Parsing)하여 그로 인한 결과를 판단하여 알릴 수 있는 시스템(Push Service)이 필요하다. 그리고 연구를 ..."
}
unshuffled_original_krc
- Size of downloaded dataset files: 0.66 MB
- Size of the generated dataset: 2.68 MB
- Total amount of disk used: 3.34 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Шамханланы, Бийлени къаршысына ябушуп, Батыр уланларыбызны къоллары булан «ортакъ ожакъ» къургъанбыз. Шо иш уллу зараллы иш бол..."
}
unshuffled_original_ku
- Size of downloaded dataset files: 33.38 MB
- Size of the generated dataset: 99.06 MB
- Total amount of disk used: 132.44 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Me di 114 bernameyên xwe yên berê da perçeyên ji berhemên zanyarî yên kurdzanên mezin bi wergera kurdî da ...\\nMe di 114 bernam..."
}
unshuffled_original_kv
- Size of downloaded dataset files: 0.40 MB
- Size of the generated dataset: 2.38 MB
- Total amount of disk used: 2.78 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Коми кытшыслӧн ыджытжык тор вӧр увтын куйлӧ, сійӧн и фаунасӧ татӧн аркмӧтӧны вӧрын олісь подаэз. Ассямаӧн лоӧ сія, мый кытшас с..."
}
unshuffled_original_kw
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.04 MB
- Total amount of disk used: 0.05 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼Pray without ceasing🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏..."
}
unshuffled_original_ky
- Size of downloaded dataset files: 152.64 MB
- Size of the generated dataset: 630.79 MB
- Total amount of disk used: 783.43 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Turmush: Бишкек шаардык кеңешинин кезексиз отурумунда мэрге ишенбөөчүлүк көрсөтүү маселеси каралат, - депутат Т.Сагынов\\nБишкек..."
}
unshuffled_original_la
- Size of downloaded dataset files: 5.46 MB
- Size of the generated dataset: 27.80 MB
- Total amount of disk used: 33.26 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.\\nEcce ego adducam aqua..."
}
unshuffled_original_lb
- Size of downloaded dataset files: 10.73 MB
- Size of the generated dataset: 30.60 MB
- Total amount of disk used: 41.32 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Während dem Gaardefestival \\\"Ambiance Jardins\\\" vum 15. bis de 17. Mee huet den SNJ nees zesumme mam Groupe Animateur en Inform..."
}
unshuffled_original_lez
- Size of downloaded dataset files: 0.83 MB
- Size of the generated dataset: 3.38 MB
- Total amount of disk used: 4.20 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Ахцегь хуьр, виридалай ч1ехи лезги хуьрерикая я. Ам Урусатдин виридалай къиблепатавай хуьрерикай я. Ин хуьр...\"..."
}
unshuffled_original_li
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.03 MB
- Total amount of disk used: 0.04 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"'t Good Goedenraad aan de Ezerbaek besjteit oet 'n kesjtièl mèt gesjlote haof en 'n park van 26 hectare. Hie in sjtoon väól beu..."
}
unshuffled_original_lmo
- Size of downloaded dataset files: 0.10 MB
- Size of the generated dataset: 0.47 MB
- Total amount of disk used: 0.58 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Serét (en tortonés: Sregh; en piemontés: Srèj) l'è 'n cümü italià, de la regiù del Piemónt, en Pruvìncia de Alessandria. El g'h..."
}
unshuffled_original_lo
- Size of downloaded dataset files: 33.92 MB
- Size of the generated dataset: 182.36 MB
- Total amount of disk used: 216.28 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"ຜູ້ພິພາກສາ ປະຈຳເຂດ ສຫລ ທ່ານນຶ່ງ ຕັດສິນວ່າ ໂຄງການເກັບກຳຂໍ້ມູນ ທາງໂທລະສັບ ຂອງອົງການ ຄວາມໝັ້ນຄົງແຫ່ງຊາດ ແມ່ນຖືກຕ້ອງ ຕາມກົດໝາຍ.\\nກະ..."
}
unshuffled_original_lrc
- Size of downloaded dataset files: 0.02 MB
- Size of the generated dataset: 0.07 MB
- Total amount of disk used: 0.09 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"آرلینگتون یئ گئل د شأریا ڤولاتچە ڤیرجینیا و یئ گئل د شأریا ڤولات ڤولاتچە یا یأکاگئرئتە ئمریکاە. ئی شأر دویومی کألوٙن شأر د راسا..."
}
unshuffled_original_lt
- Size of downloaded dataset files: 3.44 GB
- Size of the generated dataset: 9.45 GB
- Total amount of disk used: 12.89 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Čir vir vir pavasaris! Čia čia čia… dalinamės labai simpatiška video pamokėle, kurią pristato ab888art galerija.\\nBe galo papra..."
}
unshuffled_original_lv
- Size of downloaded dataset files: 1.49 GB
- Size of the generated dataset: 4.27 GB
- Total amount of disk used: 5.75 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Dekoratīvi sliekšņi MITSUBISHI OUTLANDER 2007, izgatavoti no ovālas formas, pulētas nerūsējošā tērauda caurules...\\ndažādas tūn..."
}
unshuffled_original_mai
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.33 MB
- Total amount of disk used: 0.34 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"१ · २ · ३ · ४ · ५ · ६ · ७ · ८ · ९ · १० · ११ · १२ · १३ · १४ · १५ · १६ · १७ · १८ · १९ · २० · २१ · २२ · २३ · २४ · २५ · २६ · २७ · २..."
}
unshuffled_original_mg
- Size of downloaded dataset files: 6.22 MB
- Size of the generated dataset: 21.79 MB
- Total amount of disk used: 28.01 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Nanamboatra taratasy apetaka sy soso-kevitra ho an'ny olona te-hanatevin-daharana ity fihetsiketsehana ity i Anocrena.\\nNosorat..."
}
unshuffled_original_mhr
- Size of downloaded dataset files: 1.84 MB
- Size of the generated dataset: 7.55 MB
- Total amount of disk used: 9.38 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Акрет жап годым Уганда кундемым Пигмей племена- влак айлен шогеныт. мемнан эран 1 курым гыч Банту племена влакат тиде кундемышк..."
}
unshuffled_original_min
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.63 MB
- Total amount of disk used: 0.64 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\" ..."
}
unshuffled_original_mk
- Size of downloaded dataset files: 508.24 MB
- Size of the generated dataset: 2.20 GB
- Total amount of disk used: 2.71 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"„Филм плус“ е насловен првиот филмски месечник во Македонија, чиј прв број ќе биде промовиран вечер во „Менада“. Новото македон..."
}
unshuffled_original_ml
- Size of downloaded dataset files: 938.69 MB
- Size of the generated dataset: 5.24 GB
- Total amount of disk used: 6.18 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"സ്ത്രീ പ്രവേശനം സര്ക്കാര് പൂര്ണമായും അംഗീകരിക്കുന്നുവെന്നും ശബരിമലയുടെ സുരക്ഷയില് ഇടപെടുമെന്നും സര്ക്കാര് ഹൈക്കോടതിയില്\\..."
}
unshuffled_original_mn
- Size of downloaded dataset files: 472.36 MB
- Size of the generated dataset: 2.33 GB
- Total amount of disk used: 2.81 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Монгол улс, Улаанбаатар хот - 14191 Энхтайваны өргөн чөлөө - 10, Багш хөгжлийн ордон, Багшийн мэргэжил дээшлүүлэх институт\\nБаг..."
}
unshuffled_original_mr
- Size of downloaded dataset files: 525.31 MB
- Size of the generated dataset: 2.82 GB
- Total amount of disk used: 3.34 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Home / motivational marathi story / उद्योजकता (Entrepreneurship) / यांना हे जमलय, तर आपल्याला का नाही जमणार ?\\nयापैकी कोणाचीही ..."
}
unshuffled_original_mrj
- Size of downloaded dataset files: 0.30 MB
- Size of the generated dataset: 1.16 MB
- Total amount of disk used: 1.47 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Лӹпӹвлӓ (латинлӓ Lepidoptera ; алыкмарла лыве-влак) — капшангывлӓ йыхыш пырышы сӱмӓн нӹл шылдыран капшангывлӓ. Цилӓжӹ 180000 тӹ..."
}
unshuffled_original_ms
- Size of downloaded dataset files: 28.46 MB
- Size of the generated dataset: 122.33 MB
- Total amount of disk used: 150.79 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Sanad pertama daripada Zuhair bin Harb daripada ‘Affan daripada Hammad daripada Thabit daripada Anas.\\nSanad kedua daripada ‘Ab..."
}
unshuffled_original_mt
- Size of downloaded dataset files: 7.53 MB
- Size of the generated dataset: 24.47 MB
- Total amount of disk used: 32.00 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "tibgħat il-kawża lura lill-Qorti Ġenerali għall-annullament jew għat-tnaqqis tal-penalità imposta mill-Kummissjoni bid-deċiżjoni inizjali kif emendata bid-deċiżjoni ta’ rettifika;"
}
unshuffled_original_mwl
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Deciplina social i outónoma que angloba atebidades de ouserbaçon, de análeze, de çcriçon, cumparaçon, de sistematizaçon i de sp..."
}
unshuffled_original_my
- Size of downloaded dataset files: 369.85 MB
- Size of the generated dataset: 2.02 GB
- Total amount of disk used: 2.39 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ျမ၀တီ - ရန္ကုန္တိုင္းေဒသႀကီး ေျမာက္ဥကၠလာပႏွင္႕ ဗဟန္းၿမိဳ႔နယ္ မေကြးတိုင္း ေဒသႀကီး ပခုကၠဴၿမိဳ႔နယ္တို႔၌ ျမန္မာ႕တပ္မေတာ္အား ေထာက္ခံ..."
}
unshuffled_original_myv
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"2018 иень умарьковонь 6-це чистэ сась паро куля! Россиянь культурань Министерствась макссь невтемань конёв (прокатной удостовер..."
}
unshuffled_original_mzn
- Size of downloaded dataset files: 0.18 MB
- Size of the generated dataset: 0.72 MB
- Total amount of disk used: 0.90 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"قرآن یا قوران اسلام ِآسمونی کتاب هسته. مسلمونون گانّّه قرآن ره خدا، وحی جه برسنییه، «محمد معجزه» هسته و ثقلین حدیث دله ونه خَو..."
}
unshuffled_original_nah
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.01 MB
- Total amount of disk used: 0.01 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "In mācuīlpōhualxihuitl VI (inic chicuacē) in mācuīlpōhualli xiuhitl cāhuitl īhuīcpa 501 xihuitl oc 600 xihuitl."
}
unshuffled_original_nap
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.02 MB
- Total amount of disk used: 0.02 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ò AUDIT í Ç è î ÿ å å 30 ò ÿ ÿ é, õ ñ ì ÿ, ê ã- ò à ì. å â å í ç â à à é ñ è å é ó ó ë. å å å û è å î é è à. à è à AUDIT 1-7 â ..."
}
unshuffled_original_nds
- Size of downloaded dataset files: 6.74 MB
- Size of the generated dataset: 18.23 MB
- Total amount of disk used: 24.99 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Dor kann sik vun nu af an de hele plattdüütsche Welt – vun Niebüll bit New York, vun Helgoland bit Honolulu – drapen. Allens, w..."
}
unshuffled_original_ne
- Size of downloaded dataset files: 355.29 MB
- Size of the generated dataset: 1.87 GB
- Total amount of disk used: 2.22 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"बर्दिबास नगरपालिकाको तेस्रो नगर परिषदबाट पारित आ.व.२०७३।७४ को संशोधित र २०७४।७५ को प्रस्तावित नीति, कार्यक्रम तथा बजेट\\nअार्थिक..."
}
unshuffled_original_new
- Size of downloaded dataset files: 1.03 MB
- Size of the generated dataset: 5.77 MB
- Total amount of disk used: 6.79 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"थ्व शहरयागु अक्षांश ३४.७००१६४ उत्तर व देशान्तर ८६.३७६४६९ पश्चिम खः (34.700164° N 86.376469° W)। थ्व थासे ७२२६७३२ वर्ग मिटर (२.७..."
}
unshuffled_original_nl
- Size of downloaded dataset files: 29.35 GB
- Size of the generated dataset: 83.23 GB
- Total amount of disk used: 112.58 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Op vrijdag 31 augustus wordt het nieuwe studiejaar van de masteropleiding architectuur geopend met een dagexcursie naar Venlo.\\..."
}
unshuffled_original_nn
- Size of downloaded dataset files: 32.86 MB
- Size of the generated dataset: 90.84 MB
- Total amount of disk used: 123.70 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "Planomtale krav til innhald Bakgrunn: Spørsmål frå fleire kommunar om kva ein planomtale/planbeskrivelse bør innehalde Fylkeskommunen og fylkesmannen har i ein del saker reist motsegn på formelt grunnlag"
}
unshuffled_original_no
- Size of downloaded dataset files: 3.11 GB
- Size of the generated dataset: 8.65 GB
- Total amount of disk used: 11.76 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Ytterligere aktører i primærhelsetjenesten og andre NHS-virksomheter ble infisert, inkludert legekontor.Læreren vår er så attra..."
}
unshuffled_original_oc
- Size of downloaded dataset files: 1.57 MB
- Size of the generated dataset: 6.12 MB
- Total amount of disk used: 7.71 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": ".рф (rf, còdi punycode: .xn--p1ai)[1] es lo nom de domeni en rus per Russia. Foguèt activat lo 12 de mai de 2010. Lo còdi latin es .ru."
}
unshuffled_original_or
- Size of downloaded dataset files: 49.84 MB
- Size of the generated dataset: 260.15 MB
- Total amount of disk used: 309.99 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ଭୁବନେଶ୍ୱର, ୨୭/୧– (ଓଡ଼ିଆ ପୁଅ) ସିପିଆଇ ଜାତୀୟ ପରିଷଦର ଆହ୍ୱାନକ୍ରମେ ଗତକାଲି ଜାନୁୟାରୀ ୨୬ ସାଧାରଣତନ୍ତ୍ର ଦିବସକୁ ଦେଶ ବ୍ୟାପୀ ସମ୍ବିଧାନ ସୁରକ୍ଷା ..."
}
unshuffled_original_os
- Size of downloaded dataset files: 3.09 MB
- Size of the generated dataset: 12.90 MB
- Total amount of disk used: 15.99 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"1. Лæппу æмæ чызг казрæдзийы зæрдæмæ куы фæцæуынц æмæ, куы сфæнд кæнынц сæ цард баиу кæнын, уæд лæппу бар ракуры чызгæй, цæмæй ..."
}
unshuffled_original_pa
- Size of downloaded dataset files: 164.21 MB
- Size of the generated dataset: 801.16 MB
- Total amount of disk used: 965.37 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ਰਜਿ: ਨੰ: PB/JL-138/2018-20 ਜਿਲਦ 63, ਬਾਨੀ ਸੰਪਾਦਕ (ਸਵ:) ਡਾ: ਸਾਧੂ ਸਿੰਘ ਹਮਦਰਦ ਫ਼ੋਨ : 0181-2455961-62-63, 5032400, ਫੈਕਸ : 2455960, 2..."
}
unshuffled_original_pam
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Áku pu i Anak ning Aláya at ngeni ipákit kó kékayu ngan nûng makanánu lang susúlat détinang kulit a mágkas. Lauan ya ing tarátu..."
}
unshuffled_original_pl
- Size of downloaded dataset files: 42.88 GB
- Size of the generated dataset: 117.12 GB
- Total amount of disk used: 160.01 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"System informatyczny - Załącznik nr 1 do zarządzenia Wójta Gminy Podegrodzie Nr 530/2013 z dnia 27 maja 2013 r\\nSystem informat..."
}
unshuffled_original_pms
- Size of downloaded dataset files: 0.75 MB
- Size of the generated dataset: 2.15 MB
- Total amount of disk used: 2.92 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Louvigné-du-Désert a l'é na comun-a fransèisa ant la region aministrativa dla Brëtagna, ant ël dipartiment d'Ille-et-Vilaine. A..."
}
unshuffled_original_pnb
- Size of downloaded dataset files: 3.22 MB
- Size of the generated dataset: 12.04 MB
- Total amount of disk used: 15.26 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"ایہ فائل Wikimedia Commons توں اے تے دوجیاں ویونتاں تے وی ورتی جاےکدی اے۔ گل بات اس دے فائل گل بات صفہ تے تھلے دتی گئی۔\"..."
}
unshuffled_original_ps
- Size of downloaded dataset files: 103.66 MB
- Size of the generated dataset: 379.51 MB
- Total amount of disk used: 483.17 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Many people usually use the time period ‘business to business (B2B) advertising,’ however most of them do not know precisely wh..."
}
unshuffled_original_pt
- Size of downloaded dataset files: 47.26 GB
- Size of the generated dataset: 132.64 GB
- Total amount of disk used: 179.89 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Você pode estar lendo este texto no sofá, levantar pra pegar uma breja na geladeira, dar uma cagada e sentar novamente, sem int..."
}
unshuffled_original_qu
- Size of downloaded dataset files: 0.02 MB
- Size of the generated dataset: 0.08 MB
- Total amount of disk used: 0.10 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Warayu wichay (kastilla simipi: Ascensión de Guarayos) nisqaqa Buliwya mama llaqtapi, Santa Krus suyupi, huk llaqtam, Warayu pruwinsyap uma llaqtanmi."
}
unshuffled_original_rm
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.01 MB
- Total amount of disk used: 0.01 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"practicists agrars / practicistas agraras AFP pon far ina furmaziun da basa scursanida per cuntanscher in attestat federal da q..."
}
unshuffled_original_ro
- Size of downloaded dataset files: 9.53 GB
- Size of the generated dataset: 26.87 GB
- Total amount of disk used: 36.40 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"“În viață, oportunitatea nu este totul. Cine atrage Lumina, cineva bun în umbră. Timpul ne creează.” maestru\\nLyn.Evans: Ce mar..."
}
unshuffled_original_ru
- Size of downloaded dataset files: 319.76 GB
- Size of the generated dataset: 1241.63 GB
- Total amount of disk used: 1561.38 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Доступ к данному профилю для публичного просмотра закрыт администрацией сайта - профиль находится на модерации.\\nРазработчикам ..."
}
unshuffled_original_sa
- Size of downloaded dataset files: 17.52 MB
- Size of the generated dataset: 97.06 MB
- Total amount of disk used: 114.58 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"अनिरुद्धनगरे क्रीडिता रामलीला सम्प्रति समाप्ता अस्ति । तस्य कानिचन् चित्राणि पूर्वमेव प्रकाशितानि सन्ति । द्वौ चलचित्रौ अपि ..."
}
unshuffled_original_sah
- Size of downloaded dataset files: 9.08 MB
- Size of the generated dataset: 43.82 MB
- Total amount of disk used: 52.90 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████..."
}
unshuffled_original_scn
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
{
"id": 0,
"text": "La gilusìa è nu sintimentu dulurusu ca nasci d'un disideriu di pussessu sclusivu ntê cunfrunti dâ pirsuna amata e dû timuri, dû suspettu o dâ cirtizza dâ sò nfidiltati."
}
unshuffled_original_sd
- Size of downloaded dataset files: 90.62 MB
- Size of the generated dataset: 364.25 MB
- Total amount of disk used: 454.88 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"هر ڪو ڄاڻي ٿو ته جڏهن توهان هڪ وڏي خريد ڪرڻ چاهيون ٿا, توهان پڄي ضروري حڪم ۾ ان جي ڪم ڪرڻ جي هٿ ۾ لاڳاپو ڪيو آهي. جي شيء آهي ته..."
}
unshuffled_original_sh
- Size of downloaded dataset files: 3.46 MB
- Size of the generated dataset: 25.84 MB
- Total amount of disk used: 29.30 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Opština Gornja Radgona se nalazi u sjeveroistočnoj Sloveniji i graniči s susjednom Austriji duž rijeke Mure. Sa tridesetim nase..."
}
unshuffled_original_si
- Size of downloaded dataset files: 310.93 MB
- Size of the generated dataset: 1.47 GB
- Total amount of disk used: 1.78 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"ලාංකීය සිතිවිලි සිංහල බ්ලොග් කියවනය කොත්තු සින්ඩිය ලංකා Blogger හත්මාළුව ලංකා බ්ලොග් කියවනය මාතලන්ගේ සින්ඩිය මොබයිල්lk\\nඅවකාශය ..."
}
unshuffled_original_sk
- Size of downloaded dataset files: 3.71 GB
- Size of the generated dataset: 9.81 GB
- Total amount of disk used: 13.52 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Aktivity | Agentúra podporovaného zamestnávania | vzdelávanie pre klientov, vzdelávanie pre odborníkov, kurzy\\nŠpecializované k..."
}
unshuffled_original_sl
- Size of downloaded dataset files: 956.20 MB
- Size of the generated dataset: 2.68 GB
- Total amount of disk used: 3.63 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Če Creatures, ki je želel, da pridejo na čas, predvsem je povedlo – razlikuje od ljubosumja začel grizenja kolen (ali zadnjica)..."
}
unshuffled_original_so
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.06 MB
- Total amount of disk used: 0.06 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт ттттттттттттттттуууууууууууу..."
}
unshuffled_original_sq
- Size of downloaded dataset files: 861.84 MB
- Size of the generated dataset: 2.44 GB
- Total amount of disk used: 3.30 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Çfarë do të më pëlqente tek një femër ose çfarë do të më shndërronte në një shpërthim drite? – Albert Vataj\\nTë gjithëve një zo..."
}
unshuffled_original_sr
- Size of downloaded dataset files: 1.08 GB
- Size of the generated dataset: 4.13 GB
- Total amount of disk used: 5.21 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Корисни савети за сваки дан. На сајту су разне категорије, као што су љепота, мода, кување и поправка властитим рукама.\\nШколск..."
}
unshuffled_original_su
- Size of downloaded dataset files: 0.06 MB
- Size of the generated dataset: 0.23 MB
- Total amount of disk used: 0.28 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Kartu krédit nyaéta \"duit plastik\" anu dikaluarkeun ku bank pikeun alat pambayaran di tempat-tempat nu tangtu samisal jiga di hotél, réstoran, tempat rékréasi jeung sajabana.[1]"
}
unshuffled_original_sv
- Size of downloaded dataset files: 17.18 GB
- Size of the generated dataset: 47.00 GB
- Total amount of disk used: 64.18 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"1783 är ett viktigt årtal i den nya tidens historia. Det året slöts en fred i Paris och därmed blev de 13 brittiska kolonierna ..."
}
unshuffled_original_sw
- Size of downloaded dataset files: 3.71 MB
- Size of the generated dataset: 14.07 MB
- Total amount of disk used: 17.78 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki chache tu kabla ya Papa Francis kuanza ziara yake katika nchi hiyo yenye idadi kubwa kabisa ya watu katika ulimwengu wa nchi za Kiarabu."
}
unshuffled_original_ta
- Size of downloaded dataset files: 1.74 GB
- Size of the generated dataset: 9.93 GB
- Total amount of disk used: 11.67 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"பொழுது சாய்ந்து வெகு நேரமாகிவிட்டது. கூலி வேலைக்குப் போயிருந்த 'சித்தாள் ' பெண்கள் எல்லோரும் வீடு திரும்பி விட்டார்கள். இன்னும்..."
}
unshuffled_original_te
- Size of downloaded dataset files: 522.47 MB
- Size of the generated dataset: 2.61 GB
- Total amount of disk used: 3.13 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"హర్యానాలో టోల్ దగ్గర సిబ్బంది.. స్థానిక ప్రజలు కొట్టుకున్నారు. కర్నాల్ అనే గ్రామానికి సమీపంలో టోల్ గేట్ ఉంది. అయితే సాధారణంగా స..."
}
unshuffled_original_tg
- Size of downloaded dataset files: 90.97 MB
- Size of the generated dataset: 397.43 MB
- Total amount of disk used: 488.41 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Ҳумайро гуфтааст, мухолифи низом аст, низоме, ки дар Тоҷикистон вуҷуд дорад. Ба ин маънӣ, худро мухолифи давлату ҳукумати Тоҷик..."
}
unshuffled_original_th
- Size of downloaded dataset files: 7.38 GB
- Size of the generated dataset: 38.29 GB
- Total amount of disk used: 45.67 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ฟันที่แลดูขาวสะอาดไม่มีเศษอาหารติดอยู่ เหงือกสีชมพู ไม่เจ็บ หรือมีเลือดออกเวลาแปรงฟันหรือขัดฟัน ไม่มีปัญหาเรื่องกลิ่นปาก ทำให้ก..."
}
unshuffled_original_tk
- Size of downloaded dataset files: 2.96 MB
- Size of the generated dataset: 10.66 MB
- Total amount of disk used: 13.62 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Türkmenistanyň Prezidenti agyr atletika boýunça dünýä çempionatyna taýýarlyk işleriniň barşy bilen tanyşdy\\nHalallykdan kemal t..."
}
unshuffled_original_tl
- Size of downloaded dataset files: 204.89 MB
- Size of the generated dataset: 606.30 MB
- Total amount of disk used: 811.19 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"“Gusto ko manawagan sa mga Unit Head ng Chanel 2 Salve. Kasi napapansin ko iyon mga alaga ko ang taping halos once a week lang,..."
}
unshuffled_original_tr
- Size of downloaded dataset files: 21.96 GB
- Size of the generated dataset: 63.58 GB
- Total amount of disk used: 85.54 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Son yıllarda görülen ay tutulmalarına göre daha etkili olacağı söylenen Kanlı veya Kırmızı Ay Tutulmasına saatler kaldı. Bu akş..."
}
unshuffled_original_tt
- Size of downloaded dataset files: 151.06 MB
- Size of the generated dataset: 703.42 MB
- Total amount of disk used: 854.47 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"\\\"Иремнең вафатына 40 көн узгач, Алмаз да безнең өйгә кереп үлде\\\". Арчада 35 яшьлек ир өстенә кондызлар ега башлаган агач төшк..."
}
unshuffled_original_tyv
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.01 MB
- Total amount of disk used: 0.01 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Экии, хүндүлуг аалчылар болгаш тыва дылдың деткикчилери! Тыва дылдың болгаш чогаалдың ховар бир башкызынга, Менги Ооржакка, ажы..."
}
unshuffled_original_ug
- Size of downloaded dataset files: 27.92 MB
- Size of the generated dataset: 127.42 MB
- Total amount of disk used: 155.35 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"زاڭ-ءتۇزىم | عىلىم-تەحنيكا | ءتىل-ادەبيەت | تۇرمىس | دەنە تاربيە | ساياحات-ورتا | سۋرەتتى حابار | سىر سۇحبات | ارناۋلى تاقىرىپ ..."
}
unshuffled_original_uk
- Size of downloaded dataset files: 14.42 GB
- Size of the generated dataset: 56.44 GB
- Total amount of disk used: 70.86 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Про надання роз'яснення (щодо форми письмового зобов'язання громадян про зворотне ввезення/вивезення товарів), Державна митна с..."
}
unshuffled_original_ur
- Size of downloaded dataset files: 712.61 MB
- Size of the generated dataset: 2.80 GB
- Total amount of disk used: 3.51 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"آئیے اہم اسلامی کتب کو یونیکوڈ میں انٹرنیٹ پر پیش کرنے کے لئے مل جل کر آن لائن ٹائپنگ کریں۔ محدث ٹائپنگ پراجیکٹ کے ذریعے آپ روز..."
}
unshuffled_original_uz
- Size of downloaded dataset files: 5.78 MB
- Size of the generated dataset: 21.46 MB
- Total amount of disk used: 27.24 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Qurama tog'lari tizmasining Toshkentdan 154 km uzoqlikdagi Toshkent-Ush yo'li yeqasidaxushmanzara tabiat qo'ynida joylashgan maydoni 30 ga.\nBolalarni sog'lomlashtirish oromgohi Bo'stonliq tumani Oqtosh muntaqasining soy-salqin gushasida joylashgan."
}
unshuffled_original_vec
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.02 MB
- Total amount of disk used: 0.03 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Par ogni pónto, ła derivada ła xe ła pendensa de ła reta tangente a ła curva de ła funsion f. Ła reta de cołor róso l'è senpre ..."
}
unshuffled_original_vi
- Size of downloaded dataset files: 21.50 GB
- Size of the generated dataset: 72.23 GB
- Total amount of disk used: 93.73 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Canh chua cá bông lau không chỉ là món ăn giải nhiệt, thanh mát ngày hè mà còn là món siêu bổ dưỡng, rất tốt cho người gầy ốm. ..."
}
unshuffled_original_vo
- Size of downloaded dataset files: 0.30 MB
- Size of the generated dataset: 2.12 MB
- Total amount of disk used: 2.42 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Sarniguet binon zif in ziläk: Hautes-Pyrénées, in topäd: Midi-Pyrénées, in Fransän. Sarniguet topon videtü 43°19’ 7’’ N e lunetü 0°5’ 19’’ L."
}
unshuffled_original_wa
- Size of downloaded dataset files: 0.09 MB
- Size of the generated dataset: 0.29 MB
- Total amount of disk used: 0.38 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "Cisse pådje ci n' est co k' on djermon, dj' ô bén k' el pådje est djusse sibåtcheye, eyet co trop tene; et s' divreut ele ecråxhî ene miete."
}
unshuffled_original_war
- Size of downloaded dataset files: 0.64 MB
- Size of the generated dataset: 2.68 MB
- Total amount of disk used: 3.32 MB
An example of 'train' looks as follows.
{
"id": 1,
"text": "An Honce amo in usa ka baryo ngan munisipalidad ha distrito han Rožňava ha rehiyon han Košice ha nasod han Slovakia.\nAn Rumegies amo in usa ka komyun ha departamento han Nord ngan ha rehiyon han Nord-Pas-de-Calais ha nasod han Fransya."
}
unshuffled_original_wuu
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.12 MB
- Total amount of disk used: 0.13 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"伊春元旦天气 伊春腊八天气 伊春春节天气 伊春情人节天气 伊春元宵节天气 伊春愚人节天气 伊春清明节天气 伊春劳动节天气 伊春母亲节天气 伊春端午节天气 伊春七夕节天气 伊春教师节天气 伊春中秋节天气 伊春国庆节天气 伊春重阳节天气 伊春万圣节天气 伊春..."
}
unshuffled_original_xal
- Size of downloaded dataset files: 0.03 MB
- Size of the generated dataset: 0.12 MB
- Total amount of disk used: 0.15 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Арнгудин Орн гисн Европд бәәдг һазр. 2007 җилин тooһaр эн орн нутгт 3,600,523 әмтн бәәдг билә. Арнгудин Орнин хотл балһсна нерн..."
}
unshuffled_original_xmf
- Size of downloaded dataset files: 1.05 MB
- Size of the generated dataset: 6.12 MB
- Total amount of disk used: 7.17 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"მოჩამილი ტექსტი წჷმორინელი რე Creative Commons Attribution-ShareAlike ლიცენზიათ; შილებე გეძინელი პირობეფიშ არსებუა. კილიშკილიშა..."
}
unshuffled_original_yi
- Size of downloaded dataset files: 33.33 MB
- Size of the generated dataset: 147.60 MB
- Total amount of disk used: 180.94 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ממשותדיק - חבֿרה, איך אַרבעט איצט אױף אַ זשורנאַל. טאָמער איר האָט עפּעס צוצוגעבן זאָלט איר שיקן מיר אַן אָנזאָג. ס'װעט הײסן \\\"..."
}
unshuffled_original_yo
- Size of downloaded dataset files: 0.01 MB
- Size of the generated dataset: 0.06 MB
- Total amount of disk used: 0.06 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Copyright © 2018 BBC. BBC kò mọ̀ nípa àwọn ohun tí ó wà ní àwọn ojú òpó tí ó wà ní ìta. Ọwọ́ tí a fi mú ìbáṣepọ̀ ti ìta.\"..."
}
unshuffled_original_yue
- Size of downloaded dataset files: 0.00 MB
- Size of the generated dataset: 0.00 MB
- Total amount of disk used: 0.00 MB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 你還不爆 我累了 投降輸一半可以嗎\"..."
}
unshuffled_original_zh
- Size of downloaded dataset files: 206.00 GB
- Size of the generated dataset: 545.61 GB
- Total amount of disk used: 751.61 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"中国铝灰网 中国有色金属矿产网 中国黄莲网 中国水轮发电机网 中国抽油泵网 中国数控雕刻机网 中国不锈钢抛光网 中国磨具加工网 中国压铸铝网 中国耐水腻子网 中国手机摄像头网 中国粗粮网 中国车门锁网 中国钛粉网 中国轮圈网\\n天天中奖彩票图 天天中彩票..."
}
Data Fields
The data fields are the same among all configs.
- id: an int64 feature.
- text: a string feature.
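As a rough illustration of this schema, the sketch below checks that a record has exactly these two fields with the expected types. The helper `is_valid_record` is hypothetical and is not part of the `datasets` library; the sample record is copied from the `unshuffled_original_wa` example above.

```python
# Bounds of a signed 64-bit integer, matching the int64 `id` feature.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def is_valid_record(record: dict) -> bool:
    """Return True if `record` has exactly an int64 `id` and a string `text`."""
    if set(record) != {"id", "text"}:
        return False
    rec_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(rec_id, int) or isinstance(rec_id, bool):
        return False
    return INT64_MIN <= rec_id <= INT64_MAX and isinstance(text, str)


sample = {
    "id": 1,
    "text": "Cisse pådje ci n' est co k' on djermon, dj' ô bén k' el pådje est "
            "djusse sibåtcheye, eyet co trop tene; et s' divreut ele ecråxhî ene miete.",
}
print(is_valid_record(sample))
```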
Data Splits
Number of samples per configuration:
Language | Language code | Name original | Train original | Words original | Size original | Name deduplicated | Train deduplicated | Words deduplicated | Size deduplicated |
---|---|---|---|---|---|---|---|---|---|
Afrikaans | af | unshuffled_original_af | 201117 | 43,482,801 | 241M | unshuffled_deduplicated_af | 130640 | 29,533,437 | 163M |
Albanian | sq | unshuffled_original_sq | 672077 | 374,196,110 | 2.3G | unshuffled_deduplicated_sq | 461598 | 186,856,699 | 1.2G |
Alemannic | als | unshuffled_original_als | 7324 | 841,750 | 5.0M | unshuffled_deduplicated_als | 4518 | 459,001 | 2.8M |
Amharic | am | unshuffled_original_am | 83663 | 28,301,601 | 360M | unshuffled_deduplicated_am | 43102 | 16,086,628 | 206M |
Arabic | ar | unshuffled_original_ar | 16365602 | 8,117,162,828 | 82G | unshuffled_deduplicated_ar | 9006977 | 3,171,221,354 | 32G |
Aragonese | an | unshuffled_original_an | 2449 | 52,896 | 1.3M | unshuffled_deduplicated_an | 2025 | 45,669 | 801K |
Armenian | hy | unshuffled_original_hy | 659430 | 273,919,388 | 3.7G | unshuffled_deduplicated_hy | 396093 | 110,196,043 | 1.5G |
Assamese | as | unshuffled_original_as | 14985 | 6,956,663 | 113M | unshuffled_deduplicated_as | 9212 | 4,366,570 | 71M |
Asturian | ast | unshuffled_original_ast | 6999 | 381,005 | 2.4M | unshuffled_deduplicated_ast | 5343 | 325,237 | 2.0M |
Avaric | av | unshuffled_original_av | 456 | 24,720 | 409K | unshuffled_deduplicated_av | 360 | 19,478 | 324K |
Azerbaijani | az | unshuffled_original_az | 912330 | 322,641,710 | 2.8G | unshuffled_deduplicated_az | 626796 | 167,742,296 | 1.5G |
Bashkir | ba | unshuffled_original_ba | 42551 | 9,796,764 | 128M | unshuffled_deduplicated_ba | 27050 | 6,922,589 | 90M |
Basque | eu | unshuffled_original_eu | 506883 | 120,456,652 | 848M | unshuffled_deduplicated_eu | 256513 | 45,359,710 | 342M |
Bavarian | bar | unshuffled_original_bar | 4 | 399 | 503 | unshuffled_deduplicated_bar | 4 | 399 | 503 |
Belarusian | be | unshuffled_original_be | 586031 | 144,579,630 | 1.8G | unshuffled_deduplicated_be | 307405 | 83,499,037 | 1.1G |
Bengali | bn | unshuffled_original_bn | 1675515 | 623,575,733 | 11G | unshuffled_deduplicated_bn | 1114481 | 363,766,143 | 5.8G |
Bihari | bh | unshuffled_original_bh | 336 | 8,848 | 110K | unshuffled_deduplicated_bh | 82 | 2,875 | 34K |
Bishnupriya | bpy | unshuffled_original_bpy | 6046 | 198,286 | 4.1M | unshuffled_deduplicated_bpy | 1770 | 96,940 | 1.7M |
Bosnian | bs | unshuffled_original_bs | 2143 | 106,448 | 447K | unshuffled_deduplicated_bs | 702 | 20,485 | 116K |
Breton | br | unshuffled_original_br | 37085 | 5,013,241 | 29M | unshuffled_deduplicated_br | 14724 | 2,890,384 | 16M |
Bulgarian | bg | unshuffled_original_bg | 5869686 | 2,947,648,106 | 32G | unshuffled_deduplicated_bg | 3398679 | 1,268,114,977 | 14G |
Burmese | my | unshuffled_original_my | 232329 | 56,111,184 | 1.9G | unshuffled_deduplicated_my | 136639 | 30,102,173 | 1.1G |
Catalan | ca | unshuffled_original_ca | 4390754 | 1,360,212,450 | 8.0G | unshuffled_deduplicated_ca | 2458067 | 729,333,440 | 4.3G |
Cebuano | ceb | unshuffled_original_ceb | 56248 | 6,603,567 | 39M | unshuffled_deduplicated_ceb | 26145 | 3,675,024 | 24M |
Central Bikol | bcl | unshuffled_original_bcl | 1 | 312 | 885 | unshuffled_deduplicated_bcl | 1 | 312 | 885 |
Central Khmer | km | unshuffled_original_km | 159363 | 20,690,610 | 1.1G | unshuffled_deduplicated_km | 108346 | 10,082,245 | 581M |
Central Kurdish | ckb | unshuffled_original_ckb | 103639 | 48,478,334 | 487M | unshuffled_deduplicated_ckb | 68210 | 18,726,721 | 226M |
Chavacano | cbk | unshuffled_original_cbk | 1 | 130 | 520 | unshuffled_deduplicated_cbk | 1 | 130 | 520 |
Chechen | ce | unshuffled_original_ce | 4042 | 711,051 | 8.3M | unshuffled_deduplicated_ce | 2984 | 568,146 | 6.7M |
Chinese | zh | unshuffled_original_zh | 60137667 | 14,986,424,850 | 508G | unshuffled_deduplicated_zh | 41708901 | 6,350,215,113 | 249G |
Chuvash | cv | unshuffled_original_cv | 20281 | 3,041,614 | 39M | unshuffled_deduplicated_cv | 10130 | 2,054,810 | 26M |
Cornish | kw | unshuffled_original_kw | 203 | 8,329 | 44K | unshuffled_deduplicated_kw | 68 | 2,704 | 14K |
Croatian | hr | unshuffled_original_hr | 582219 | 34,232,765 | 226M | unshuffled_deduplicated_hr | 321484 | 16,727,640 | 110M |
Czech | cs | unshuffled_original_cs | 21001388 | 7,715,977,441 | 53G | unshuffled_deduplicated_cs | 12308039 | 3,540,997,509 | 24G |
Danish | da | unshuffled_original_da | 7664010 | 2,637,463,889 | 16G | unshuffled_deduplicated_da | 4771098 | 1,620,091,317 | 9.5G |
Dhivehi | dv | unshuffled_original_dv | 21018 | 7,559,472 | 126M | unshuffled_deduplicated_dv | 17024 | 4,726,660 | 79M |
Dimli | diq | unshuffled_original_diq | 1 | 19 | 146 | unshuffled_deduplicated_diq | 1 | 19 | 146 |
Dutch | nl | unshuffled_original_nl | 34682142 | 13,020,136,373 | 78G | unshuffled_deduplicated_nl | 20812149 | 6,598,786,137 | 39G |
Eastern Mari | mhr | unshuffled_original_mhr | 3212 | 565,992 | 7.2M | unshuffled_deduplicated_mhr | 2515 | 469,297 | 6.0M |
Egyptian Arabic | arz | unshuffled_original_arz | 158113 | 7,305,151 | 66M | unshuffled_deduplicated_arz | 79928 | 3,659,419 | 33M |
Emilian-Romagnol | eml | unshuffled_original_eml | 84 | 6,376 | 25K | unshuffled_deduplicated_eml | 80 | 6,121 | 24K |
English | en | unshuffled_original_en | 455994980 | 418,187,793,408 | 2.3T | unshuffled_deduplicated_en | 304230423 | 215,841,256,971 | 1.2T |
Erzya | myv | unshuffled_original_myv | 6 | 90 | 1.4K | unshuffled_deduplicated_myv | 5 | 78 | 1.2K |
Esperanto | eo | unshuffled_original_eo | 121171 | 48,486,161 | 299M | unshuffled_deduplicated_eo | 84752 | 37,324,446 | 228M |
Estonian | et | unshuffled_original_et | 2093621 | 643,163,730 | 4.8G | unshuffled_deduplicated_et | 1172041 | 309,931,463 | 2.3G |
Finnish | fi | unshuffled_original_fi | 8557453 | 3,196,666,419 | 27G | unshuffled_deduplicated_fi | 5326443 | 1,597,855,468 | 13G |
French | fr | unshuffled_original_fr | 96742378 | 46,896,036,417 | 282G | unshuffled_deduplicated_fr | 59448891 | 23,206,776,649 | 138G |
Galician | gl | unshuffled_original_gl | 544388 | 102,011,291 | 620M | unshuffled_deduplicated_gl | 284320 | 63,600,602 | 384M |
Georgian | ka | unshuffled_original_ka | 563916 | 171,950,621 | 3.6G | unshuffled_deduplicated_ka | 372158 | 91,569,739 | 1.9G |
German | de | unshuffled_original_de | 104913504 | 44,878,908,446 | 308G | unshuffled_deduplicated_de | 62398034 | 21,529,164,172 | 145G |
Goan Konkani | gom | unshuffled_original_gom | 640 | 124,277 | 2.2M | unshuffled_deduplicated_gom | 484 | 102,306 | 1.8M |
Guarani | gn | unshuffled_original_gn | 106 | 7,382 | 36K | unshuffled_deduplicated_gn | 68 | 4,680 | 24K |
Gujarati | gu | unshuffled_original_gu | 240691 | 72,045,701 | 1.1G | unshuffled_deduplicated_gu | 169834 | 50,023,432 | 722M |
Haitian | ht | unshuffled_original_ht | 13 | 1,014 | 3.9K | unshuffled_deduplicated_ht | 9 | 832 | 3.3K |
Hebrew | he | unshuffled_original_he | 3808397 | 2,067,753,528 | 20G | unshuffled_deduplicated_he | 2375030 | 1,032,018,056 | 9.8G |
Hindi | hi | unshuffled_original_hi | 3264660 | 1,372,234,782 | 17G | unshuffled_deduplicated_hi | 1909387 | 745,774,934 | 8.9G |
Hungarian | hu | unshuffled_original_hu | 11197780 | 5,163,936,345 | 40G | unshuffled_deduplicated_hu | 6582908 | 2,339,127,555 | 18G |
Icelandic | is | unshuffled_original_is | 625673 | 219,900,094 | 1.5G | unshuffled_deduplicated_is | 389515 | 129,818,331 | 846M |
Ido | io | unshuffled_original_io | 694 | 25,702 | 147K | unshuffled_deduplicated_io | 617 | 22,773 | 130K |
Iloko | ilo | unshuffled_original_ilo | 2638 | 142,942 | 874K | unshuffled_deduplicated_ilo | 1578 | 105,564 | 636K |
Indonesian | id | unshuffled_original_id | 16236463 | 4,574,692,265 | 30G | unshuffled_deduplicated_id | 9948521 | 2,394,957,629 | 16G |
Interlingua | ia | unshuffled_original_ia | 1040 | 180,231 | 662K | unshuffled_deduplicated_ia | 529 | 100,019 | 360K |
Interlingue | ie | unshuffled_original_ie | 101 | 5,352 | 24K | unshuffled_deduplicated_ie | 11 | 602 | 1.6K |
Irish | ga | unshuffled_original_ga | 83223 | 14,483,593 | 88M | unshuffled_deduplicated_ga | 46493 | 10,017,303 | 60M |
Italian | it | unshuffled_original_it | 46981781 | 22,248,707,341 | 137G | unshuffled_deduplicated_it | 28522082 | 11,250,012,896 | 69G |
Japanese | ja | unshuffled_original_ja | 62721527 | 4,962,979,182 | 216G | unshuffled_deduplicated_ja | 39496439 | 1,123,067,063 | 106G |
Javanese | jv | unshuffled_original_jv | 1445 | 104,896 | 659K | unshuffled_deduplicated_jv | 1163 | 86,654 | 583K |
Kalmyk | xal | unshuffled_original_xal | 39 | 10,277 | 113K | unshuffled_deduplicated_xal | 36 | 10,155 | 112K |
Kannada | kn | unshuffled_original_kn | 350363 | 81,186,863 | 1.7G | unshuffled_deduplicated_kn | 251064 | 49,343,462 | 1.1G |
Karachay-Balkar | krc | unshuffled_original_krc | 1581 | 185,436 | 2.6M | unshuffled_deduplicated_krc | 1377 | 166,496 | 2.3M |
Kazakh | kk | unshuffled_original_kk | 524591 | 191,126,469 | 2.7G | unshuffled_deduplicated_kk | 338073 | 108,388,743 | 1.5G |
Kirghiz | ky | unshuffled_original_ky | 146993 | 44,194,823 | 600M | unshuffled_deduplicated_ky | 86561 | 28,982,620 | 388M |
Komi | kv | unshuffled_original_kv | 1549 | 201,404 | 2.3M | unshuffled_deduplicated_kv | 924 | 95,243 | 1.2M |
Korean | ko | unshuffled_original_ko | 7345075 | 2,368,765,142 | 24G | unshuffled_deduplicated_ko | 3675420 | 1,120,375,149 | 12G |
Kurdish | ku | unshuffled_original_ku | 46535 | 15,561,003 | 94M | unshuffled_deduplicated_ku | 29054 | 9,946,440 | 60M |
Lao | lo | unshuffled_original_lo | 52910 | 4,133,311 | 174M | unshuffled_deduplicated_lo | 32652 | 2,583,342 | 114M |
Latin | la | unshuffled_original_la | 94588 | 4,122,201 | 26M | unshuffled_deduplicated_la | 18808 | 1,328,038 | 8.3M |
Latvian | lv | unshuffled_original_lv | 1593820 | 520,761,977 | 4.0G | unshuffled_deduplicated_lv | 843195 | 236,428,905 | 1.8G |
Lezghian | lez | unshuffled_original_lez | 1485 | 247,646 | 3.3M | unshuffled_deduplicated_lez | 1381 | 224,871 | 3.0M |
Limburgan | li | unshuffled_original_li | 137 | 4,730 | 29K | unshuffled_deduplicated_li | 118 | 4,283 | 27K |
Lithuanian | lt | unshuffled_original_lt | 2977757 | 1,159,661,742 | 8.8G | unshuffled_deduplicated_lt | 1737411 | 516,183,525 | 3.9G |
Lojban | jbo | unshuffled_original_jbo | 832 | 154,330 | 736K | unshuffled_deduplicated_jbo | 617 | 141,973 | 678K |
Lombard | lmo | unshuffled_original_lmo | 1401 | 75,229 | 443K | unshuffled_deduplicated_lmo | 1374 | 73,665 | 433K |
Low German | nds | unshuffled_original_nds | 18174 | 2,906,347 | 18M | unshuffled_deduplicated_nds | 8714 | 2,146,417 | 13M |
Lower Sorbian | dsb | unshuffled_original_dsb | 65 | 1,787 | 13K | unshuffled_deduplicated_dsb | 37 | 966 | 7.1K |
Luxembourgish | lb | unshuffled_original_lb | 34807 | 4,403,577 | 29M | unshuffled_deduplicated_lb | 21735 | 3,087,650 | 21M |
Macedonian | mk | unshuffled_original_mk | 437871 | 189,289,873 | 2.1G | unshuffled_deduplicated_mk | 299457 | 102,849,595 | 1.2G |
Maithili | mai | unshuffled_original_mai | 123 | 69,161 | 317K | unshuffled_deduplicated_mai | 25 | 874 | 11K |
Malagasy | mg | unshuffled_original_mg | 17957 | 3,068,360 | 21M | unshuffled_deduplicated_mg | 13343 | 1,872,044 | 13M |
Malay | ms | unshuffled_original_ms | 534016 | 16,696,882 | 111M | unshuffled_deduplicated_ms | 183443 | 6,045,753 | 42M |
Malayalam | ml | unshuffled_original_ml | 603937 | 189,534,472 | 4.9G | unshuffled_deduplicated_ml | 453904 | 95,892,551 | 2.5G |
Maltese | mt | unshuffled_original_mt | 26598 | 2,995,654 | 24M | unshuffled_deduplicated_mt | 16383 | 2,163,358 | 17M |
Marathi | mr | unshuffled_original_mr | 326804 | 162,609,404 | 2.7G | unshuffled_deduplicated_mr | 212556 | 82,130,803 | 1.4G |
Mazanderani | mzn | unshuffled_original_mzn | 1055 | 73,870 | 691K | unshuffled_deduplicated_mzn | 917 | 64,481 | 602K |
Minangkabau | min | unshuffled_original_min | 220 | 5,682 | 608K | unshuffled_deduplicated_min | 166 | 4,825 | 310K |
Mingrelian | xmf | unshuffled_original_xmf | 3783 | 299,098 | 5.8M | unshuffled_deduplicated_xmf | 2418 | 228,629 | 4.4M |
Mirandese | mwl | unshuffled_original_mwl | 8 | 171 | 1.2K | unshuffled_deduplicated_mwl | 7 | 152 | 1.1K |
Modern Greek | el | unshuffled_original_el | 10425596 | 5,479,180,137 | 62G | unshuffled_deduplicated_el | 6521169 | 2,412,419,435 | 27G |
Mongolian | mn | unshuffled_original_mn | 395605 | 181,307,167 | 2.2G | unshuffled_deduplicated_mn | 197878 | 68,362,013 | 838M |
Nahuatl languages | nah | unshuffled_original_nah | 61 | 1,234 | 12K | unshuffled_deduplicated_nah | 58 | 1,193 | 11K |
Neapolitan | nap | unshuffled_original_nap | 73 | 5,282 | 17K | unshuffled_deduplicated_nap | 55 | 4,147 | 13K |
Nepali | ne | unshuffled_original_ne | 299938 | 107,448,208 | 1.8G | unshuffled_deduplicated_ne | 219334 | 71,628,317 | 1.2G |
Newari | new | unshuffled_original_new | 4696 | 564,697 | 5.5M | unshuffled_deduplicated_new | 2126 | 288,995 | 4.1M |
Northern Frisian | frr | unshuffled_original_frr | 7 | 1,516 | 4.4K | unshuffled_deduplicated_frr | 7 | 1,516 | 4.4K |
Northern Luri | lrc | unshuffled_original_lrc | 88 | 8,022 | 76K | unshuffled_deduplicated_lrc | 72 | 6,740 | 63K |
Norwegian | no | unshuffled_original_no | 5546211 | 1,344,326,388 | 8.0G | unshuffled_deduplicated_no | 3229940 | 804,894,377 | 4.7G |
Norwegian Nynorsk | nn | unshuffled_original_nn | 185884 | 14,764,980 | 85M | unshuffled_deduplicated_nn | 109118 | 9,435,139 | 54M |
Occitan | oc | unshuffled_original_oc | 10709 | 750,301 | 5.8M | unshuffled_deduplicated_oc | 6485 | 512,678 | 3.7M |
Oriya | or | unshuffled_original_or | 59463 | 14,938,567 | 248M | unshuffled_deduplicated_or | 44230 | 11,321,740 | 188M |
Ossetian | os | unshuffled_original_os | 5213 | 1,031,268 | 13M | unshuffled_deduplicated_os | 2559 | 878,765 | 11M |
Pampanga | pam | unshuffled_original_pam | 3 | 130 | 760 | unshuffled_deduplicated_pam | 1 | 52 | 304 |
Panjabi | pa | unshuffled_original_pa | 127467 | 61,847,806 | 763M | unshuffled_deduplicated_pa | 87235 | 37,555,835 | 460M |
Persian | fa | unshuffled_original_fa | 13704702 | 9,096,554,121 | 79G | unshuffled_deduplicated_fa | 8203495 | 4,363,505,319 | 38G |
Piemontese | pms | unshuffled_original_pms | 3225 | 362,013 | 2.1M | unshuffled_deduplicated_pms | 2859 | 337,246 | 1.9M |
Polish | pl | unshuffled_original_pl | 35440972 | 15,277,255,137 | 109G | unshuffled_deduplicated_pl | 20682611 | 6,708,709,674 | 47G |
Portuguese | pt | unshuffled_original_pt | 42114520 | 20,641,903,898 | 124G | unshuffled_deduplicated_pt | 26920397 | 10,751,156,918 | 64G |
Pushto | ps | unshuffled_original_ps | 98216 | 46,559,441 | 361M | unshuffled_deduplicated_ps | 67921 | 31,347,348 | 242M |
Quechua | qu | unshuffled_original_qu | 452 | 10,186 | 78K | unshuffled_deduplicated_qu | 411 | 8,691 | 67K |
Romanian | ro | unshuffled_original_ro | 9387265 | 3,984,317,058 | 25G | unshuffled_deduplicated_ro | 5044757 | 1,741,794,069 | 11G |
Romansh | rm | unshuffled_original_rm | 41 | 1,093 | 7.4K | unshuffled_deduplicated_rm | 34 | 960 | 6.5K |
Russia Buriat | bxr | unshuffled_original_bxr | 42 | 963 | 13K | unshuffled_deduplicated_bxr | 36 | 809 | 11K |
Russian | ru | unshuffled_original_ru | 161836003 | 92,522,407,837 | 1.2T | unshuffled_deduplicated_ru | 115954598 | 46,692,691,520 | 568G |
Sanskrit | sa | unshuffled_original_sa | 14291 | 4,331,569 | 93M | unshuffled_deduplicated_sa | 7121 | 1,713,930 | 37M |
Scottish Gaelic | gd | unshuffled_original_gd | 5799 | 310,689 | 1.9M | unshuffled_deduplicated_gd | 3883 | 207,110 | 1.3M |
Serbian | sr | unshuffled_original_sr | 1013619 | 364,395,411 | 3.9G | unshuffled_deduplicated_sr | 645747 | 207,561,168 | 2.2G |
Serbo-Croatian | sh | unshuffled_original_sh | 36700 | 5,292,184 | 25M | unshuffled_deduplicated_sh | 17610 | 1,040,573 | 5.8M |
Sicilian | scn | unshuffled_original_scn | 21 | 554 | 3.3K | unshuffled_deduplicated_scn | 17 | 468 | 2.8K |
Sindhi | sd | unshuffled_original_sd | 44280 | 43,530,158 | 347M | unshuffled_deduplicated_sd | 33925 | 33,028,015 | 263M |
Sinhala | si | unshuffled_original_si | 203082 | 93,053,465 | 1.4G | unshuffled_deduplicated_si | 120684 | 50,864,857 | 802M |
Slovak | sk | unshuffled_original_sk | 5492194 | 1,322,247,763 | 9.1G | unshuffled_deduplicated_sk | 2820821 | 656,346,179 | 4.5G |
Slovenian | sl | unshuffled_original_sl | 1746604 | 387,399,700 | 2.5G | unshuffled_deduplicated_sl | 886223 | 193,926,684 | 1.3G |
Somali | so | unshuffled_original_so | 156 | 1,202 | 61K | unshuffled_deduplicated_so | 42 | 472 | 16K |
South Azerbaijani | azb | unshuffled_original_azb | 15446 | 2,175,054 | 27M | unshuffled_deduplicated_azb | 9985 | 1,528,709 | 19M |
Spanish | es | unshuffled_original_es | 88199221 | 47,545,122,279 | 278G | unshuffled_deduplicated_es | 56326016 | 25,928,290,729 | 149G |
Sundanese | su | unshuffled_original_su | 805 | 30,321 | 211K | unshuffled_deduplicated_su | 511 | 20,278 | 141K |
Swahili | sw | unshuffled_original_sw | 41986 | 2,211,927 | 13M | unshuffled_deduplicated_sw | 24803 | 1,376,963 | 8.1M |
Swedish | sv | unshuffled_original_sv | 17395625 | 7,155,994,312 | 44G | unshuffled_deduplicated_sv | 11014487 | 4,106,120,608 | 25G |
Tagalog | tl | unshuffled_original_tl | 458206 | 98,949,299 | 573M | unshuffled_deduplicated_tl | 294132 | 70,121,601 | 407M |
Tajik | tg | unshuffled_original_tg | 89002 | 31,758,142 | 379M | unshuffled_deduplicated_tg | 56259 | 21,029,893 | 249M |
Tamil | ta | unshuffled_original_ta | 1263280 | 420,537,132 | 9.3G | unshuffled_deduplicated_ta | 833101 | 226,013,330 | 5.1G |
Tatar | tt | unshuffled_original_tt | 135923 | 51,034,893 | 670M | unshuffled_deduplicated_tt | 82738 | 23,825,695 | 305M |
Telugu | te | unshuffled_original_te | 475703 | 123,711,517 | 2.5G | unshuffled_deduplicated_te | 312644 | 79,094,167 | 1.6G |
Thai | th | unshuffled_original_th | 6064129 | 951,743,087 | 36G | unshuffled_deduplicated_th | 3749826 | 368,965,202 | 16G |
Tibetan | bo | unshuffled_original_bo | 26795 | 1,483,589 | 187M | unshuffled_deduplicated_bo | 15762 | 936,556 | 138M |
Turkish | tr | unshuffled_original_tr | 18535253 | 7,577,388,700 | 60G | unshuffled_deduplicated_tr | 11596446 | 3,365,734,289 | 27G |
Turkmen | tk | unshuffled_original_tk | 6456 | 1,113,869 | 11M | unshuffled_deduplicated_tk | 4694 | 752,326 | 6.8M |
Tuvinian | tyv | unshuffled_original_tyv | 34 | 759 | 12K | unshuffled_deduplicated_tyv | 24 | 540 | 7.9K |
Uighur | ug | unshuffled_original_ug | 22255 | 8,657,141 | 122M | unshuffled_deduplicated_ug | 15503 | 5,852,225 | 83M |
Ukrainian | uk | unshuffled_original_uk | 12973467 | 4,204,381,276 | 53G | unshuffled_deduplicated_uk | 7782375 | 2,252,380,351 | 28G |
Upper Sorbian | hsb | unshuffled_original_hsb | 7959 | 545,351 | 4.2M | unshuffled_deduplicated_hsb | 3084 | 236,867 | 1.8M |
Urdu | ur | unshuffled_original_ur | 638596 | 331,817,982 | 2.7G | unshuffled_deduplicated_ur | 428674 | 218,030,228 | 1.7G |
Uzbek | uz | unshuffled_original_uz | 27537 | 2,450,256 | 21M | unshuffled_deduplicated_uz | 15074 | 1,381,644 | 12M |
Venetian | vec | unshuffled_original_vec | 73 | 3,492 | 18K | unshuffled_deduplicated_vec | 64 | 3,199 | 17K |
Vietnamese | vi | unshuffled_original_vi | 14898250 | 12,036,845,359 | 68G | unshuffled_deduplicated_vi | 9897709 | 5,577,159,843 | 32G |
Volapük | vo | unshuffled_original_vo | 3366 | 321,121 | 2.0M | unshuffled_deduplicated_vo | 3317 | 318,568 | 2.0M |
Walloon | wa | unshuffled_original_wa | 1001 | 50,720 | 273K | unshuffled_deduplicated_wa | 677 | 37,543 | 203K |
Waray | war | unshuffled_original_war | 9760 | 397,315 | 2.5M | unshuffled_deduplicated_war | 9161 | 336,311 | 2.2M |
Welsh | cy | unshuffled_original_cy | 157698 | 37,422,441 | 213M | unshuffled_deduplicated_cy | 98225 | 23,574,673 | 133M |
Western Frisian | fy | unshuffled_original_fy | 33053 | 5,691,077 | 35M | unshuffled_deduplicated_fy | 20661 | 4,223,816 | 26M |
Western Mari | mrj | unshuffled_original_mrj | 757 | 93,338 | 1.2M | unshuffled_deduplicated_mrj | 669 | 87,780 | 1.1M |
Western Panjabi | pnb | unshuffled_original_pnb | 4599 | 1,426,986 | 12M | unshuffled_deduplicated_pnb | 3463 | 1,111,112 | 9.0M |
Wu Chinese | wuu | unshuffled_original_wuu | 214 | 11,189 | 109K | unshuffled_deduplicated_wuu | 64 | 4,333 | 32K |
Yakut | sah | unshuffled_original_sah | 22301 | 2,547,623 | 42M | unshuffled_deduplicated_sah | 8555 | 1,789,174 | 26M |
Yiddish | yi | unshuffled_original_yi | 59364 | 13,834,320 | 141M | unshuffled_deduplicated_yi | 32919 | 8,212,970 | 84M |
Yoruba | yo | unshuffled_original_yo | 214 | 8,906 | 55K | unshuffled_deduplicated_yo | 49 | 3,518 | 27K |
Yue Chinese | yue | unshuffled_original_yue | 11 | 186 | 3.7K | unshuffled_deduplicated_yue | 7 | 128 | 2.2K |
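To compare the original and deduplicated columns, a minimal sketch like the one below parses one row of the table and computes the fraction of samples and words that survive deduplication. The function names are hypothetical, the Train columns are read here as sample counts, and the row is copied verbatim from the table.

```python
def config_name(code: str, deduplicated: bool) -> str:
    """Build a configuration name following the pattern used in the table."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{code}"


def to_int(cell: str) -> int:
    """Parse a count such as '43,482,801' or '201117'."""
    return int(cell.replace(",", ""))


def parse_row(row: str) -> dict:
    """Split one pipe-delimited table row into the fields used below."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    if len(cells) != 10:
        raise ValueError(f"expected 10 cells, got {len(cells)}")
    return {
        "language": cells[0],
        "code": cells[1],
        "samples_original": to_int(cells[3]),
        "words_original": to_int(cells[4]),
        "samples_deduplicated": to_int(cells[7]),
        "words_deduplicated": to_int(cells[8]),
    }


def kept_fraction(row: dict) -> tuple:
    """Return (samples kept, words kept) as fractions of the original."""
    return (
        row["samples_deduplicated"] / row["samples_original"],
        row["words_deduplicated"] / row["words_original"],
    )


ROW = ("Afrikaans | af | unshuffled_original_af | 201117 | 43,482,801 | 241M | "
       "unshuffled_deduplicated_af | 130640 | 29,533,437 | 163M |")
row = parse_row(ROW)
samples_kept, words_kept = kept_fraction(row)
print(f"{row['language']}: {samples_kept:.1%} of samples, {words_kept:.1%} of words kept")
```

The size columns are left unparsed because the table does not state whether the K/M/G/T suffixes are powers of 1000 or 1024.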
Dataset Creation
Curation Rationale
OSCAR was constructed using a new pipeline derived from fastText's, called goclassy. Goclassy reuses the fastText linear classifier and the pre-trained fastText model for language recognition, but it completely rewrites and parallelises the fastText pipeline in an asynchronous manner.
The order of operations is more or less the same as in the fastText pre-processing pipeline, but instead of clustering multiple operations into a single blocking process, goclassy launches a worker for each operation and bounds the number of parallel operations at any given time by the number of available threads rather than the number of CPUs. Goclassy is implemented in the Go programming language, so it lets the Go runtime handle the scheduling of the processes. Thus, the goclassy pipeline does not have to wait for a whole WET file to download, decompress and classify before it starts downloading and processing the next one; a new file starts downloading and processing as soon as the scheduler is able to allocate a new process.
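The scheduling idea can be sketched as follows. This is not goclassy's code: it uses Python's `asyncio` instead of Go, the three operations are simulated, and `MAX_PARALLEL_OPS`, `run_op`, and `process_file` are hypothetical names chosen for illustration. Each operation takes its own slot from a shared bound, so no file holds a slot for its entire download, decompress, and classify sequence.

```python
import asyncio

# Assumed bound, standing in for the number of available threads.
MAX_PARALLEL_OPS = 4


async def run_op(sem: asyncio.Semaphore, name: str, file_id: int, log: list) -> None:
    """Run one simulated operation while holding a slot from the shared bound."""
    async with sem:
        await asyncio.sleep(0)  # placeholder for real I/O or CPU work
        log.append((file_id, name))


async def process_file(sem: asyncio.Semaphore, file_id: int, log: list) -> None:
    """Run the operations of one file in order, releasing the slot between them."""
    for name in ("download", "decompress", "classify"):
        await run_op(sem, name, file_id, log)


async def main(n_files: int) -> list:
    sem = asyncio.Semaphore(MAX_PARALLEL_OPS)
    log: list = []
    # All files are scheduled at once; the semaphore limits how many
    # operations are in flight at any given time.
    await asyncio.gather(*(process_file(sem, i, log) for i in range(n_files)))
    return log


log = asyncio.run(main(8))
print(len(log), "operations completed")
```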
Filtering and cleaning processes at line level are done before feeding each line to the classifier. Lines shorter than 100 UTF-8 characters and lines containing invalid UTF-8 characters are discarded and are not classified. After all files are processed, the deduplicated versions are constructed and everything is then split into shards and compressed.
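A minimal sketch of the line-level filter described here, under two assumptions that the text does not confirm: length is counted in Unicode code points after decoding, and the trailing line break is not counted. The name `keep_line` is hypothetical and this is not the goclassy implementation.

```python
MIN_CHARS = 100  # threshold stated in the text


def keep_line(raw: bytes) -> bool:
    """Return True if `raw` is valid UTF-8 and at least MIN_CHARS characters long."""
    try:
        text = raw.decode("utf-8")  # strict decoding raises on invalid bytes
    except UnicodeDecodeError:
        return False
    # Assumption: the trailing line break does not count toward the length.
    return len(text.rstrip("\r\n")) >= MIN_CHARS


lines = [
    ("a" * 100).encode("utf-8"),   # long enough: kept
    ("a" * 99).encode("utf-8"),    # too short: discarded
    b"\xff" + b"a" * 200,          # invalid UTF-8: discarded
]
print([keep_line(line) for line in lines])
```

Counting characters rather than bytes matters for non-Latin scripts, where a single character can take several bytes in UTF-8.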
Source Data
Initial Data Collection and Normalization
Common Crawl is a non-profit foundation which produces and maintains an open repository of web crawled data that is both accessible and analysable. Common Crawl's complete web archive consists of petabytes of data collected over 8 years of web crawling. The repository contains raw web page HTML data (WARC files), metadata extracts (WAT files) and plain text extracts (WET files). The organisation's crawlers have always respected nofollow and robots.txt policies.
Each monthly Common Crawl snapshot is in itself a massive multilingual corpus, where every single file contains data coming from multiple web pages written in a large variety of languages and covering all possible types of topics.
To construct OSCAR, the WET files of Common Crawl were used. These contain the extracted plain texts from the websites, mostly converted to UTF-8, as well as headers containing the metadata of each crawled document. Each WET file comes compressed in gzip format and is stored on Amazon Web Services. In the case of OSCAR, the November 2018 snapshot was used. It surpasses 20TB of uncompressed data and contains more than 50 thousand plain text files, where each file consists of the plain text from multiple websites along with its metadata header.
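To make the file layout concrete, the sketch below reads a gzip-compressed stream of WET-style records and yields the header block and plain text of each one. It assumes the usual WARC record layout (a `WARC/1.0` version line, `Name: value` headers, a blank line, then `Content-Length` bytes of text). The sample data is synthetic, and `iter_wet_records` and `make_record` are hypothetical helpers, not OSCAR's or Common Crawl's code.

```python
import gzip
import io


def iter_wet_records(stream):
    """Yield (headers, text) for each record in a decompressed WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            key, _, value = line.decode("utf-8").partition(":")
            headers[key.strip()] = value.strip()
        # Content-Length is expressed in bytes, not characters.
        length = int(headers.get("Content-Length", "0"))
        body = stream.read(length)
        yield headers, body.decode("utf-8", errors="replace")


def make_record(uri: str, text: str) -> bytes:
    """Build one synthetic WET-style record for the demonstration."""
    body = text.encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: conversion\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


raw = gzip.compress(
    make_record("http://example.org/a", "first page\nline two")
    + make_record("http://example.org/b", "seconde page é")
)
with gzip.open(io.BytesIO(raw), "rb") as stream:
    records = list(iter_wet_records(stream))

for headers, text in records:
    print(headers["WARC-Target-URI"], len(text))
```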
Who are the source language producers?
The data comes from multiple web pages in a large variety of languages.
Annotations
The dataset does not contain any additional annotations.
Annotation process
N/A
Who are the annotators?
N/A
Personal and Sensitive Information
Being constructed from Common Crawl, OSCAR might contain personal and sensitive information. This must be considered before training deep learning models with OSCAR, especially in the case of text-generation models.
Considerations for Using the Data
Social Impact of Dataset
OSCAR is intended to bring more data to a wide variety of languages. The aim of the corpus is to make large amounts of data available to lower-resource languages in order to facilitate the pre-training of state-of-the-art language modeling architectures.
Discussion of Biases
OSCAR is not properly filtered yet, and this can be reflected in the models trained with it. Care is advised, especially concerning biases of the resulting models.
Other Known Limitations
The fastText linear classifier is limited both in performance and in the variety of languages it can recognize, so the quality of some OSCAR sub-corpora might be lower than expected, especially for the lowest-resource languages. Some audits have already been done by third parties.
Additional Information
Dataset Curators
The corpus was put together by Pedro J. Ortiz, Benoît Sagot, and Laurent Romary, during work done at Inria, particularly at the ALMAnaCH team.
Licensing Information
These data are released under this licensing scheme:
We do not own any of the text from which these data have been extracted.
We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/
To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR
This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Citation Information
@inproceedings{ortiz-suarez-etal-2020-monolingual,
title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
author = "Ortiz Su{\'a}rez, Pedro Javier and
Romary, Laurent and
Sagot, Benoit",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.156",
pages = "1703--1714",
abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}
@inproceedings{OrtizSuarezSagotRomary2019,
author = {Pedro Javier {Ortiz Su{\'a}rez} and Benoit Sagot and Laurent Romary},
title = {Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures},
series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019},
editor = {Piotr Bański and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Marc Kupietz and Harald L{\"u}ngen and Caroline Iliadi},
publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
address = {Mannheim},
doi = {10.14618/ids-pub-9021},
url = {http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215},
pages = {9 -- 16},
year = {2019},
abstract = {Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.},
language = {en}
}
Contributions