Backend API Assignment for Hackercamp 2018
This API aims to make collecting and filtering data from the social media site twitter.com easy for someone with no backend programming experience. The API is divided into four parts:
- API 1 : starts a Twitter stream for the keywords sent along with the API call and stores each tweet and some of its metadata in a NoSQL database (MongoDB) in real time.
- API 2 : filters the data collected by API 1. The filtering can be done on a number of parameters described below, e.g. the user who posted the tweet, the time range in which the tweet was posted, etc. The data can be sorted on a number of fields described below, e.g. in reverse lexicographic order of the tweet text.
- API 3 : exports the filtered data generated by API 2 as a CSV file sent to the client.
- API 4 : stops a running Twitter stream.
- JavaScript
- Node
- Express
- MongoDB
- Fork the repository to your GitHub account.
- Clone the repository to your local machine.
- Open index.js and add consumer_key, consumer_secret, access_token_key and access_token_secret. You can get these keys from https://developer.twitter.com/
- Run the MongoDB server.
- Open a terminal window and cd into the project directory.
- Run `npm install` to install the Node dependencies.
- Run `node index.js`
- The API is now listening on http://127.0.0.1:3000. If the IP or port is not free, change the IP or port number in index.js and run `node index.js` again.
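A minimal sketch of what the credentials added to index.js look like with the twitter module (the placeholder values are illustrative; your index.js may organise this differently):

```javascript
// index.js (sketch): credentials obtained from https://developer.twitter.com/
const Twitter = require('twitter');

const client = new Twitter({
  consumer_key: 'YOUR_CONSUMER_KEY',
  consumer_secret: 'YOUR_CONSUMER_SECRET',
  access_token_key: 'YOUR_ACCESS_TOKEN_KEY',
  access_token_secret: 'YOUR_ACCESS_TOKEN_SECRET'
});
```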
The JavaScript files which define the database schema are in the models directory:
- tweets : _id, time, text, userid : { ref : users }, retweet_count, favorite_count, language, retweeted, favorited, jsdate
- users : _id, name, screen_name, follower_count, friend_count
- urls_user_mentions : _id, tweet_id : { ref : tweets }, content, type
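As a rough illustration, the tweets schema above might be declared with Mongoose along these lines (the field types here are assumptions; the files in the models directory are the authoritative definitions):

```javascript
// models/tweets.js (sketch)
const mongoose = require('mongoose');

const tweetSchema = new mongoose.Schema({
  _id: String,                              // tweet id
  time: String,                             // original Twitter timestamp
  jsdate: Number,                           // milliseconds since Jan 1, 1970
  text: String,
  userid: { type: String, ref: 'users' },   // reference to the users collection
  retweet_count: Number,
  favorite_count: Number,
  language: String,
  retweeted: Boolean,
  favorited: Boolean
});

module.exports = mongoose.model('tweets', tweetSchema);
```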
To start streaming tweets, make a GET call at http://127.0.0.1:3000/?track=keyword_1,keyword_2,keyword_3...keyword_n
To stop a running tweet stream, make a GET call at http://127.0.0.1:3000/stop_stream
To filter the collected data, make a GET call at http://127.0.0.1:3000/filter_data?param_1=value_1&param_2=value_2&...&param_n=value_n
To get the filtered tweets in a CSV file, make a GET call at http://127.0.0.1:3000/get_csv?keys=col_1,col_2,...,col_n
- sort : { -1 for descending, 1 for ascending, default : 1 }
- sorting_field : {allowed_fields : [ time , text , retweet_count , favorite_count , language ] , default : time}
- text : {substring matching}
- language : { string matching }
- start_date : { JavaScript ISO (Date-Time) format , default : 1970-01-01T00:00:00-00:00 }
- end_date : { JavaScript ISO (Date-Time) format , default : current time }
- limit : { no. of tweets to return at a time, default : 10 }
- page : { page no. of tweets , default : 1 }
- user_name : { search for a user_name }
- user_name_type : { string matching methods: {contains : substring , starts : prefix , ends : suffix , exact : same string } , default : exact }
- screen_name : {search for a screen_name }
- screen_name_type : { string matching methods: {contains : substring , starts : prefix , ends : suffix , exact : same string } , default : exact }
- url : {search for a url mentioned in tweet }
- url_type : { string matching methods: {contains : substring , starts : prefix , ends : suffix , exact : same string } , default : exact }
- user_mention : {search for a user mentioned in the tweet }
- user_mention_type : { string matching methods: {contains : substring , starts : prefix , ends : suffix , exact : same string } , default : exact }
- retweet_min : { minimum retweets filter, default : 0 }
- retweet_max : { maximum retweets filter, default : 100000000 }
- retweets : { exact number of retweets filter , note : overrides retweet_min and retweet_max }
- favorite_min : { minimum favorites filter, default : 0 }
- favorite_max : { maximum favorites filter, default : 100000000 }
- favorite : { exact number of favorites filter , note : overrides favorite_min and favorite_max }
- follower_min : {minimum followers filter , default : 0 }
- follower_max : {maximum followers filter , default : 100000000 }
- follower : { exact number of followers filter , note: overrides follower_min and follower_max }
- friend_min : {minimum friends filter , default : 0 }
- friend_max : {maximum friends filter , default : 100000000 }
- friends : { exact number of friends filter , note : overrides friend_min and friend_max }
Example
To extract the tweets of a user whose screen_name ends with 'per', who has between 10 and 60 followers and exactly 60 friends, and who has mentioned a user with screen_name 'harry', sorted in reverse lexicographic order of the tweet text, make a GET call in the following format.
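A sketch of such a call, assembled from the parameters listed above (host and port assume the default setup, and parameters left at their defaults are omitted):

http://127.0.0.1:3000/filter_data?screen_name=per&screen_name_type=ends&follower_min=10&follower_max=60&friends=60&user_mention=harry&sort=-1&sorting_field=text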
To get the filtered tweets in a CSV file make a GET call at http://127.0.0.1:3000/get_csv?keys=col_1,col_2,...,col_n
keys : { allowed fields : ['_id','text','userid._id','userid.name','userid.screen_name','userid.friend_count','user.follower_count','retweet_count','favorite_count','language','time','retweeted','favorited'] }
note : if no value is sent in keys then all the columns will be returned in the CSV
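For example, a call exporting an illustrative selection of columns could look like this:

http://127.0.0.1:3000/get_csv?keys=text,userid.screen_name,retweet_count,favorite_count,time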
I decided to use Node.js and MongoDB because I am more comfortable working with these technologies together. The first challenge I faced was how to collect tweets in real time as a stream. After searching online for some time, I found the twitter Node module.
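A minimal sketch of how the twitter module's streaming API can be used for this (the handler body is an assumption; the actual code stores the tweet and its metadata through the Mongoose models):

```javascript
// Sketch: open a filtered stream for the tracked keywords and keep a handle to it
// so the /stop_stream endpoint can close it later. Reuses the Twitter client
// configured with the credentials in index.js (see the setup sketch above).
let currentStream = null;

function startStream(keywords) {
  client.stream('statuses/filter', { track: keywords }, (stream) => {
    currentStream = stream;
    stream.on('data', (tweet) => {
      // store the tweet text and metadata into MongoDB here
    });
    stream.on('error', (err) => console.error(err));
  });
}

function stopStream() {
  if (currentStream) currentStream.destroy();   // API 4: stop a running stream
}
```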
The next challenge was deciding on a schema that could be queried easily with a large number of filtering parameters for API 2. After putting some hours into this step, I came up with a database schema and ways to write simple queries over it.
The next phase, which was also the most challenging, was dealing with missing parameters. It was evident that a caller shouldn't have to pass every parameter on each API call, even if, say, only lexicographic sorting is required. To handle this I used $regex with Mongoose. For strings, I manipulated the pattern to be matched according to the requested matching technique (contains, starts, ends, exact); if a string parameter was absent, I used a $regex that accepts all strings for that field. For numbers, I set the minimum field to 0 and the maximum field to 100000000 (10^8) whenever the corresponding parameter was absent. For exact-number filtering, I set both the minimum and the maximum to the same number. For filtering by a date-time range, the API requires dates in JavaScript ISO (date-time) format. While storing tweets, the 'tweets' collection gets a field "jsdate" containing the number of milliseconds elapsed since Jan. 1, 1970; it is created by first parsing Twitter's date-time format into JavaScript ISO (date-time) format and then calling getTime(). If the start_date parameter is empty it defaults to Jan. 1, 1970, and if the end_date parameter is empty it defaults to the current date and time of the server.
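A condensed sketch of this approach (the parameter handling and helper names are mine, not necessarily those in the code):

```javascript
// Build a regex pattern from a (possibly missing) string parameter and its matching type.
function toPattern(value, type) {
  if (!value) return '.*';                                   // absent parameter: match everything
  const v = value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');    // escape regex metacharacters
  if (type === 'starts') return '^' + v;
  if (type === 'ends') return v + '$';
  if (type === 'exact') return '^' + v + '$';
  return v;                                                  // 'contains'
}

// Assemble the Mongoose filter; missing numeric bounds fall back to 0 and 10^8,
// missing dates fall back to the Unix epoch and the current server time.
const query = {
  text: { $regex: toPattern(req.query.text, 'contains') },
  retweet_count: {
    $gte: Number(req.query.retweet_min || 0),
    $lte: Number(req.query.retweet_max || 100000000)
  },
  jsdate: {
    $gte: new Date(req.query.start_date || '1970-01-01T00:00:00-00:00').getTime(),
    $lte: req.query.end_date ? new Date(req.query.end_date).getTime() : Date.now()
  }
};
```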
A variable could not be passed into the Mongoose query as the sort field, so the filtered data was sorted afterwards with a comparator function.
Another bug that occurred during development was that using .limit() and .skip() with the Mongoose query was not truly paging the filtered data: it paged the data in its unfiltered form first and only then applied the filters. To fix this, the final sorted and filtered array was paged with JavaScript's Array.prototype.slice().
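A sketch of this post-query sorting and paging (the function name and signature are illustrative):

```javascript
// Sort the filtered tweets on the requested field, then cut out the requested page.
function sortAndPage(tweets, field, order, page, limit) {
  const sorted = tweets.slice().sort((a, b) => {
    if (a[field] < b[field]) return -1 * order;
    if (a[field] > b[field]) return 1 * order;
    return 0;
  });
  return sorted.slice((page - 1) * limit, page * limit);   // page is 1-based
}

// e.g. sortAndPage(results, 'text', -1, 1, 10) returns the first 10 tweets
// in reverse lexicographic order of the tweet text.
```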
To export the filtered data as CSV, I used another Node module, json-2-csv, to convert the array of JavaScript objects into a CSV file.
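A sketch of this step, assuming the callback-style json2csv API of json-2-csv v2 (the response headers and helper name are assumptions):

```javascript
const converter = require('json-2-csv');

// Keep only the requested columns, convert to CSV, and send it as a file.
// (Nested keys such as userid.screen_name would need a small path lookup, omitted here.)
function sendCsv(res, tweets, keys) {
  const rows = keys && keys.length
    ? tweets.map((t) => {
        const row = {};
        keys.forEach((k) => { row[k] = t[k]; });
        return row;
      })
    : tweets;

  converter.json2csv(rows, (err, csv) => {
    if (err) return res.status(500).send(err.message);
    res.setHeader('Content-Type', 'text/csv');
    res.setHeader('Content-Disposition', 'attachment; filename=tweets.csv');
    res.send(csv);
  });
}
```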
I also added one more API call to stop the Twitter stream.
Twitter developer docs : https://developer.twitter.com/en/docs
The API was tested with the following:
- Node (v8.9.4)
- MongoDB(3.2)
- Express(v4.16.2)
- Mongoose(v5.0.6)
- twitter (v1.7.1)
- json-2-csv(v2.1.2)
API calls were made using Postman.