The task in this assignment is to produce sentiment predictions over a collection of IMDB reviews using several text representations: unigram, unigram with tf-idf, bigram, and bigram with tf-idf.
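As a rough illustration of how the four representations differ, here is a minimal sketch using scikit-learn's `CountVectorizer` and `TfidfVectorizer` on a toy corpus. The corpus, the dictionary keys, and the choice of `ngram_range=(1, 2)` for the bigram model (unigrams plus bigrams) are assumptions made for this example, not requirements stated by the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-ins for reviews (assumption); the real text comes from imdb_tr.csv.
docs = ["great great movie", "terrible movie"]

# One vectorizer per representation. ngram_range=(1, 2) keeps unigrams
# alongside bigrams; (2, 2) would keep bigrams only.
representations = {
    "unigram": CountVectorizer(ngram_range=(1, 1)),
    "bigram": CountVectorizer(ngram_range=(1, 2)),
    "unigram_tfidf": TfidfVectorizer(ngram_range=(1, 1)),
    "bigram_tfidf": TfidfVectorizer(ngram_range=(1, 2)),
}

shapes = {}
for name, vectorizer in representations.items():
    # Rows are documents, columns are terms (n-grams).
    matrix = vectorizer.fit_transform(docs)
    shapes[name] = matrix.shape
    print(name, matrix.shape, sorted(vectorizer.vocabulary_))
```

The count-based and tf-idf variants share the same vocabulary for a given n-gram range; only the cell values differ (raw counts versus tf-idf weights).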
- Combine the raw database into a single CSV file (`imdb_tr.csv`) with 3 columns: `row_number`, `text`, and `polarity`.
- Remove all common stopwords.
- Transform the `text` column in `imdb_tr.csv` into a term-document matrix using the unigram model.
- Train an SGD classifier on it with `loss="hinge"` and `penalty="l1"`.
- Train an SGD classifier using the unigram representation, predict sentiments on `imdb_te.csv`, and write the output to `unigram.output.txt`.
- Train an SGD classifier using the bigram representation, predict sentiments on `imdb_te.csv`, and write the output to `bigram.output.txt`.
- Train an SGD classifier using the unigram representation with tf-idf, predict sentiments on `imdb_te.csv`, and write the output to `unigramtfidf.output.txt`.
- Train an SGD classifier using the bigram representation with tf-idf, predict sentiments on `imdb_te.csv`, and write the output to `bigramtfidf.output.txt`.
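
The steps above could be sketched roughly as follows with scikit-learn. This is one possible arrangement, not a reference solution: the helper names `train_and_predict` and `write_predictions`, the built-in `stop_words="english"` list, the fixed `random_state`, the one-label-per-line output format, and the tiny in-memory data are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline


def train_and_predict(train_texts, train_labels, test_texts,
                      ngram_range=(1, 1), tfidf=False):
    """Fit a vectorizer + SGD pipeline and return labels for test_texts."""
    vectorizer_cls = TfidfVectorizer if tfidf else CountVectorizer
    model = make_pipeline(
        # stop_words="english" is scikit-learn's built-in list, a stand-in
        # for whichever stopword list the assignment supplies (assumption).
        vectorizer_cls(stop_words="english", ngram_range=ngram_range),
        # random_state is fixed only to make the sketch repeatable (assumption).
        SGDClassifier(loss="hinge", penalty="l1", random_state=0),
    )
    model.fit(train_texts, train_labels)
    return model.predict(test_texts)


def write_predictions(predictions, path):
    """Write one predicted label per line (output format is an assumption)."""
    with open(path, "w") as fh:
        fh.writelines(f"{label}\n" for label in predictions)


# Tiny stand-in for the text and polarity columns of imdb_tr.csv.
train_texts = [
    "wonderful touching film", "brilliant acting superb story",
    "loved every minute", "boring dull plot",
    "awful acting terrible script", "hated it, waste of time",
]
train_labels = [1, 1, 1, 0, 0, 0]
# Stand-in for the reviews in imdb_te.csv.
test_texts = ["superb film", "dull script"]

predictions = train_and_predict(train_texts, train_labels, test_texts)
print(list(predictions))
```

The other three runs would reuse the same helper with different arguments, for example `train_and_predict(..., ngram_range=(1, 2), tfidf=True)` for bigrams with tf-idf, followed by `write_predictions(predictions, path)` with the matching output file name.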