The data is gathered from translations of Egyptian songs on lyrics websites.
This is achieved using the JavaScript scripts in the `scripts` folder, executed with Node.
- ▶ Run `npm start`
- ☑ Once the execution is finished, the gathered data will be in `data-compiled.json` in the `data` folder.
- 🔠 Import the JSON file as a table by running `makeTable.m` in MATLAB from the `matlab` folder.
📥 Gathering data from lyricstranslate.com
- 📜 `getArtistsLinks.js` gets and saves a list of Egyptian artists only.
- 📜 `downloadData.js` gets and saves the list of their songs and, for each song, the lyrics in both Arabic and English (bonus: it also gets the transliteration).
- 📜 `preprocessData.js` splits each verse's ID from its text into 2 arrays. It does this for every song of every artist, so the data is easier to compile afterwards.
- 📜 `compileData.js` compiles the data into 2 arrays of sentences, one containing all the Arabic verses and one containing all the English verses. It aligns the data, i.e. it matches the verse IDs across languages so that the sentence at index i of the English array is the translation of the sentence at index i of the Arabic array.
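The alignment step can be sketched as follows. This is a minimal illustration, not the actual code of `compileData.js`: the function name `alignVerses` and the `{ ids, texts }` shape assumed for the preprocessed data are hypothetical.

```javascript
// Hypothetical sketch: align two preprocessed verse lists by verse ID.
// Assumed input shape (not confirmed by the repository):
//   { ids: ['1', '2', ...], texts: ['verse text', ...] }
function alignVerses(arabic, english) {
  // Index the English verses by ID for constant-time lookup.
  const englishById = new Map();
  english.ids.forEach((id, i) => englishById.set(id, english.texts[i]));

  const ar = [];
  const en = [];
  arabic.ids.forEach((id, i) => {
    // Keep a verse only when both languages have it, so indices stay in sync.
    if (englishById.has(id)) {
      ar.push(arabic.texts[i]);
      en.push(englishById.get(id));
    }
  });
  return { ar, en };
}
```

In this sketch, a verse whose ID appears in only one language is dropped, which keeps `ar[i]` and `en[i]` paired. The real script may handle unmatched verses differently.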
ℹ A snapshot of the data is saved for every step of the process, so the data required by a specific step won't be downloaded again when a snapshot exists.
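The snapshot behaviour described above could look roughly like this. It is a sketch under assumptions: the helper name `withSnapshot`, the snapshot path, and the JSON snapshot format are hypothetical and not taken from the scripts themselves.

```javascript
const fs = require('fs');

// Hypothetical helper: run `produce` only when no snapshot exists yet.
// `snapshotPath` is an assumed location for the saved JSON snapshot.
async function withSnapshot(snapshotPath, produce) {
  if (fs.existsSync(snapshotPath)) {
    // Reuse the saved snapshot instead of downloading the data again.
    return JSON.parse(fs.readFileSync(snapshotPath, 'utf8'));
  }
  const data = await produce();
  // Save the result so the next run can skip this step.
  fs.writeFileSync(snapshotPath, JSON.stringify(data, null, 2));
  return data;
}
```

Each step would wrap its download or processing function in such a helper, so deleting a snapshot file is enough to force that step to run again.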
The data can then be imported as a table in MATLAB to train a transformer model.
In the `matlab` folder:
- 🔠 `tableTotal.mat` contains all the Egyptian Arabic - English sentence pairs; it is the table that will be used for training.
- 📜 `makeTable.m` generates the table by importing the JSON data as a table in MATLAB.
For now, data is only gathered from lyricstranslate.com.