This is the main repository for the web scale data mining research project, which took place in summer 2014.
One of the results is the set of visualized topics, which were learned autonomously from terabytes of raw HTML data.
More Results: 100 Learned Topics
└── src - the source code projects, see below
    ├── WSDA
    ├── combine_sequence_files
    ├── examples
    │   ├── spark_example
    │   └── word_count_1
    ├── html_to_text_conversion
    ├── remove_infrequent_words
    ├── results_display
    ├── scripts
    └── word_count
- This is the code repository
- The runs and the raw results can be found in this repository
- The Hadoop config is here
- The Spark config is here
The self-implemented LDA (latent Dirichlet allocation).
@hany-abdelrahman: the WSDA directory should probably be renamed to something more meaningful 😉 TODO: add some more doc, references, etc.
Author: Hany Abdelrahman
Combines the sequence files from each subdirectory into a single sequence file named after that subdirectory.
This makes it possible to create a flat directory structure with a few large sequence files.
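The combining step can be sketched with plain files standing in for Hadoop sequence files (the real project uses the SequenceFile API; all names here are hypothetical):

```python
import os

def combine_subdirectories(input_root: str, output_dir: str) -> None:
    """For each subdirectory of input_root, concatenate all of its files
    into one output file named after that subdirectory."""
    os.makedirs(output_dir, exist_ok=True)
    for subdir in sorted(os.listdir(input_root)):
        subdir_path = os.path.join(input_root, subdir)
        if not os.path.isdir(subdir_path):
            continue
        # One combined file per subdirectory, carrying its name.
        with open(os.path.join(output_dir, subdir), "wb") as out:
            for name in sorted(os.listdir(subdir_path)):
                with open(os.path.join(subdir_path, name), "rb") as part:
                    out.write(part.read())
```

The result is one file per subdirectory, which keeps the output directory flat.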
Author: Lukas Elmer
Contains a Spark example project and a simple word count application, intended only for development environment setup.
Author: Lukas Elmer
Converts web archive records into sequence files, stripping all HTML/JS tags and boilerplate and performing some additional steps:
- remove stopwords
- remove words with non a-z characters
- try to remove non-english documents
- remove numbers
- remove URLs
- convert uppercase to lowercase characters
- apply stemming (org.apache.lucene.analysis.en.EnglishAnalyzer)
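The cleaning steps above can be sketched as follows (a minimal illustration with a toy stopword list; language detection and stemming, which the real pipeline does via Lucene's EnglishAnalyzer, are omitted):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to"}  # tiny illustrative list
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean_text(text: str) -> list[str]:
    """Apply the listed cleaning steps to a raw text string."""
    text = URL_RE.sub(" ", text)   # remove URLs
    text = text.lower()            # uppercase -> lowercase
    tokens = text.split()
    # Keep only pure a-z words: drops numbers and other characters.
    tokens = [t for t in tokens if re.fullmatch(r"[a-z]+", t)]
    # Remove stopwords.
    return [t for t in tokens if t not in STOPWORDS]
```

The real conversion operates on web archive records and writes sequence files; this sketch only shows the per-document token cleaning.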
See also:
Author: Lukas Elmer
Removes words which appear infrequently. Requires a word count dictionary as input.
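The filtering idea can be sketched like this (the threshold value is hypothetical):

```python
def remove_infrequent(tokens, word_counts, min_count=5):
    """Keep only tokens whose global count, looked up in the word
    count dictionary, reaches min_count."""
    return [t for t in tokens if word_counts.get(t, 0) >= min_count]
```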
Author: Lukas Elmer
A script that helps display the topics. It generates:
- A readable text version
- A tag cloud for each topic, with each word's size weighted by the probability of the word
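The probability-to-size weighting can be sketched as a linear scaling (the pixel bounds are hypothetical):

```python
def font_sizes(topic, min_px=10, max_px=48):
    """Map each word's probability in a topic (a word -> probability
    dict) to a font size, linearly scaled between min_px and max_px."""
    lo, hi = min(topic.values()), max(topic.values())
    span = (hi - lo) or 1.0  # avoid division by zero for uniform topics
    return {w: min_px + (p - lo) / span * (max_px - min_px)
            for w, p in topic.items()}
```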
Author: Lukas Elmer
Simple word count for sequence files.
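The core of such a word count can be sketched as follows, assuming the sequence file reader yields (key, text) records (the record shape is an assumption):

```python
from collections import Counter

def word_count(records):
    """Count word occurrences across an iterable of (key, text)
    records, as a sequence-file reader might yield them."""
    counts = Counter()
    for _key, text in records:
        counts.update(text.split())
    return counts
```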
Author: Lukas Elmer