This is the main repository for the web scale data mining research project, which took place in summer 2014.
One of the results is the set of visualized topics, which were learned autonomously from terabytes of raw HTML data.
More Results: 100 Learned Topics
└── src - the source code projects, see below
    ├── WSDA
    ├── combine_sequence_files
    ├── examples
    │   ├── spark_example
    │   └── word_count_1
    ├── html_to_text_conversion
    ├── remove_infrequent_words
    ├── results_display
    ├── scripts
    └── word_count
- This is the code repository
- The runs and the raw results can be found in this repository
- The Hadoop config is here
- The Spark config is here
The self-implemented LDA (latent Dirichlet allocation).
@hany-abdelrahman: the WSDA directory should probably be renamed to something more meaningful 😉 TODO: add some more doc, references, etc.
Author: Hany Abdelrahman
Combines the sequence files from each subdirectory into a single sequence file named after that subdirectory.
This makes it possible to create a flat directory structure with a few large sequence files.
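The combining step can be sketched with plain files standing in for Hadoop sequence files (the real project uses the SequenceFile API; all names here are hypothetical):

```python
import os

def combine_subdirectories(input_root: str, output_dir: str) -> None:
    """For each subdirectory of input_root, concatenate all of its files
    into one output file named after that subdirectory."""
    os.makedirs(output_dir, exist_ok=True)
    for subdir in sorted(os.listdir(input_root)):
        subdir_path = os.path.join(input_root, subdir)
        if not os.path.isdir(subdir_path):
            continue
        # One combined file per subdirectory, carrying its name.
        with open(os.path.join(output_dir, subdir), "wb") as out:
            for name in sorted(os.listdir(subdir_path)):
                with open(os.path.join(subdir_path, name), "rb") as part:
                    out.write(part.read())
```

The result is one file per subdirectory, which keeps the output directory flat.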
Author: Lukas Elmer
Contains a Spark example project and a simple word count application, intended only for development environment setup.
Author: Lukas Elmer
Converts web archive records into sequence files, stripping all HTML/JS tags and boilerplate and performing some additional steps:
- remove stopwords
- remove words with non a-z characters
- try to remove non-english documents
- remove numbers
- remove URLs
- convert uppercase to lowercase characters
- apply stemming (org.apache.lucene.analysis.en.EnglishAnalyzer)
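The cleaning steps above can be sketched as follows (a minimal illustration with a toy stopword list; language detection and stemming, which the real pipeline does via Lucene's EnglishAnalyzer, are omitted):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to"}  # tiny illustrative list
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean_text(text: str) -> list[str]:
    """Apply the listed cleaning steps to a raw text string."""
    text = URL_RE.sub(" ", text)   # remove URLs
    text = text.lower()            # uppercase -> lowercase
    tokens = text.split()
    # Keep only pure a-z words: drops numbers and other characters.
    tokens = [t for t in tokens if re.fullmatch(r"[a-z]+", t)]
    # Remove stopwords.
    return [t for t in tokens if t not in STOPWORDS]
```

The real conversion operates on web archive records and writes sequence files; this sketch only shows the per-document token cleaning.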
See also:
Author: Lukas Elmer
Removes words which appear infrequently. Requires a word count dictionary as input.
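The filtering idea can be sketched like this (the threshold value is hypothetical):

```python
def remove_infrequent(tokens, word_counts, min_count=5):
    """Keep only tokens whose global count, looked up in the word
    count dictionary, reaches min_count."""
    return [t for t in tokens if word_counts.get(t, 0) >= min_count]
```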
Author: Lukas Elmer
A script that helps display the topics. It generates:
- A readable text version
- A tag cloud for each topic, with each word's size weighted by the probability of the word
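The probability-to-size weighting can be sketched as a linear scaling (the pixel bounds are hypothetical):

```python
def font_sizes(topic, min_px=10, max_px=48):
    """Map each word's probability in a topic (a word -> probability
    dict) to a font size, linearly scaled between min_px and max_px."""
    lo, hi = min(topic.values()), max(topic.values())
    span = (hi - lo) or 1.0  # avoid division by zero for uniform topics
    return {w: min_px + (p - lo) / span * (max_px - min_px)
            for w, p in topic.items()}
```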
Author: Lukas Elmer
Simple word count for sequence files.
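The core of such a word count can be sketched as follows, assuming the sequence file reader yields (key, text) records (the record shape is an assumption):

```python
from collections import Counter

def word_count(records):
    """Count word occurrences across an iterable of (key, text)
    records, as a sequence-file reader might yield them."""
    counts = Counter()
    for _key, text in records:
        counts.update(text.split())
    return counts
```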
Author: Lukas Elmer