Yuvia Cardenas | Justin Evans | Cristina Lucin | Woodrow Sims |
---|---|---|---|
In 1900, there were fewer than 3,000 cars on the roads of France. In order to increase demand for cars and, accordingly, car tires, car tire manufacturers and brothers Édouard and André Michelin published a guide for French motorists, the Michelin Guide. It provided information to motorists, such as maps, car mechanics listings, petrol stations, hotels and restaurants throughout France. The guide began to award stars for fine dining establishments in 1926. Initially, there was only a single star awarded. Then, in 1931, the hierarchy of zero, one, two, and three stars was introduced. In 1955, a fourth category, "Bib Gourmand", identified restaurants with quality food at a value price.
At present, a star award from the Michelin Guide is widely accepted as the pre-eminent culinary achievement of restaurateurs and chefs alike. Michelin reviewers (commonly called "inspectors") are anonymous. Many of the company's top executives have never met an inspector; inspectors themselves are advised not to disclose their line of work, even to their parents. The secrecy of this process, and the importance of these reviews in the culinary world, led us to ask the question: "What factors can be revealed by examining Michelin restaurant reviews?" Through our shared love of food, we embarked on a journey to utilize Data Science to distill the essence of fine dining perfection.
Our Capstone Team Project utilizes Web-scraping & Natural Language Processing to develop a model that predicts Michelin food star award ratings based on content from the official Michelin review.
Following the Data Science Pipeline, our team will first acquire and prepare the data for exploration. Then, we will explore the data to gain insight into which features to engineer to improve our model's accuracy. After we create several types of machine learning models that can effectively predict the Michelin food star award rating, we will compare each model's performance on the train and validate datasets. The model that performs best will move forward to the test dataset for final results.
- Create a model that effectively predicts Michelin food star award ratings based on content from the official Michelin review
- Provide a well-documented Jupyter notebook that contains our analysis
- Produce a Final GitHub repository
- Present a Canva slide deck suitable for a general audience which summarizes our findings and documents the results with well-labeled visualizations
This project can be reproduced by cloning our repository and running the final notebook, as explained in the instructions below:
Warning: to ensure you are not banned from the host while scraping, the acquire function pauses for 2 seconds per page, with a backup 5-second pause in case of an error. This slows down the initial scraping run of the program. After web-scraping each of the 6,700+ reviews, all data is saved locally to the `michelin_df.pickle` file.
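The pause-and-cache behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual acquire function: the name `scrape_with_pauses`, the injectable `fetch` and `sleep` callables, and the dictionary-shaped cache are all assumptions made so the example stays self-contained and does not touch the network.

```python
import pickle
import time
from pathlib import Path
from typing import Callable, Dict, Iterable


def scrape_with_pauses(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    cache_path: Path,
    pause: float = 2.0,
    retry_pause: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL politely and cache all results in a pickle file.

    `fetch` is a hypothetical callable that downloads one page and
    returns its text; it is passed in so the sketch can run offline.
    """
    # Reuse the local cache if an earlier run already produced it.
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)

    pages: Dict[str, str] = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception:
            # Back off for longer after an error, then retry once.
            sleep(retry_pause)
            pages[url] = fetch(url)
        # Pause between pages so the host is not flooded with requests.
        sleep(pause)

    with cache_path.open("wb") as handle:
        pickle.dump(pages, handle)
    return pages
```

Because the cache is checked first, only the initial run pays the per-page waiting cost; later runs load the pickle directly.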
Reproduction Instructions:
- Clone the repository using this command in your terminal: `git clone git@github.com:CodeupGourmands/Michelin_NLP_Capstone.git`, then run the `mvp_notebook.ipynb` Jupyter Notebook.
- You will need to ensure the files listed below, at a minimum, are included in the repo in order to run the notebook:
  - mvp_notebook.ipynb
  - acquire.py
  - prepare.py
  - explore.py
  - model.py
Our initial thoughts are that country, cuisine, and words/groups of words (bigrams and trigrams) may be very impactful features to predict our target 'award' level. Another thought was that the price level and available facilities could also help determine the target 'award' level.
- Acquire initial data (CSV file) via Kaggle download
- Acquire review data using Beautiful Soup via the `get_michelin_pages` function in the acquire file
- Clean and prepare the data utilizing RegEx and string functions
- Explore data in search of significant relationships to target (Michelin Star Ratings)
- Conduct statistical testing as necessary
- Answer 6 initial exploratory questions:
  - Question 1. What is the distribution of our target variable (award type)?
  - Question 2. What countries have the most Michelin restaurants?
  - Question 3. What is the average word count of restaurant reviews, by award type?
  - Question 4. Do three-star Michelin restaurants have the highest sentiment score?
  - Question 5. What are the most frequent words used in Michelin restaurant reviews?
  - Question 6. Do higher-rated restaurants have more facilities?
- Develop a model to predict the award category of Michelin restaurants:
  - Evaluate models on train and validate data using accuracy score
  - Select the best model based on the smallest difference in the accuracy score on the train and validate sets
  - Evaluate the best model on test data
- Draw conclusions
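The selection rule in the plan (smallest difference between train and validate accuracy) can be illustrated with a short sketch. The function name `pick_most_consistent` is hypothetical, and the accuracy values are made-up placeholders, not results from this project.

```python
from typing import Dict, Tuple


def pick_most_consistent(scores: Dict[str, Tuple[float, float]]) -> str:
    """Return the model whose train and validate accuracy differ least."""
    return min(scores, key=lambda name: abs(scores[name][0] - scores[name][1]))


# Placeholder (train, validate) accuracies, for illustration only.
example_scores = {
    "decision_tree": (0.99, 0.80),
    "logistic_regression": (0.90, 0.88),
}
best = pick_most_consistent(example_scores)
print(best)
```

A small gap mainly signals that a model is not overfitting; on its own it could also favour a uniformly weak model, so it is usually worth checking that the chosen model still beats the baseline.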
Original Features:
Feature | Description |
---|---|
name | Name of the awardee restaurant |
address | Address of the awardee restaurant |
location | City, country, or province of the awardee restaurant |
price | Representation of the price value from one to four (min-max) using the currency symbol of the location country |
cuisine | Main style of cuisine served by the awardee restaurant |
longitude | Geographical longitude of the awardee restaurant |
latitude | Geographical latitude of the awardee restaurant |
url | URL of the Michelin Review of the awardee restaurant |
facilities_and_services | Highlighted facilities and services available at the awardee restaurant |
data | Web-scraped review for each awardee document |
Feature Engineered:
Feature | Description |
---|---|
price_level | Numeric value from 1 to 4 (min-max) representing the same relative level of expense across all countries |
city | City as captured by the first position of the location feature |
country | Country as captured by the second position of the location feature; also captures provinces that only had one entry in the location feature |
review_clean | Tokenized text in lower case, with Latin symbols only, from the original data column containing the scraped reviews |
review_lemmatized | Data column containing the web-scraped reviews after being cleaned and lemmatized |
review_word_count | Word count of each corresponding review |
review_sentiment | Compound sentiment score of each observation |
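As a rough illustration of how some of the engineered features above could be derived from the original columns, here is a minimal sketch. It assumes that `price` is a run of one-character currency symbols (for example "€€€") and that `location` is either "City, Country" or a single name; the function `engineer_features` is hypothetical and is not taken from the project's prepare file.

```python
from typing import Dict, Union


def engineer_features(row: Dict[str, str]) -> Dict[str, Union[int, str]]:
    """Derive price_level, city, country and review_word_count for one row."""
    # Assumption: one character per currency symbol, so length is the level.
    price_level = len(row["price"].strip())

    # First position is the city; second position, when present, is the country.
    parts = [part.strip() for part in row["location"].split(",")]
    city = parts[0]
    # Single-entry locations are treated as both city and country.
    country = parts[1] if len(parts) > 1 else parts[0]

    # Word count of the scraped review text held in the "data" column.
    review_word_count = len(row["data"].split())

    return {
        "price_level": price_level,
        "city": city,
        "country": country,
        "review_word_count": review_word_count,
    }
```

Counting symbols rather than comparing currencies is what makes the level comparable across countries.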
Target Variable:
Feature | Value | Description |
---|---|---|
award | ['1 michelin star', '2 michelin stars', '3 michelin stars', 'bib gourmand'] | This feature identifies which award was presented to the restaurant belonging to each document |
1 michelin star | "High quality cooking, worth a stop!" | |
2 michelin stars | "Excellent cooking, worth a detour!" | |
3 michelin stars | "Exceptional cuisine, worth a special journey!" | |
bib gourmand | "Good quality, good value cooking." | |
Our dataset of all Michelin Awardee restaurants worldwide was acquired January 17, 2023 from Kaggle. This dataset is updated quarterly with new additions of Michelin Awardee restaurants and removal of restaurants that no longer carry the award. From this initial dataset, we utilized the Michelin Guide URL for each restaurant and Beautiful Soup to web-scrape the review text for each restaurant, enhancing the original dataset.
Acquisition Actions:
- Web-scraped data from guide.michelin.com using Beautiful Soup
- The review text for each restaurant was then appended back to the original dataframe
- Each row represents a Michelin Awardee restaurant
- Each column represents a feature of the restaurant, including feature-engineered columns
- 6,780 restaurant reviews acquired (6 NaN values caused by restaurants that are no longer active Michelin Awardees)
Our data set was prepared following standard Data Processing procedures and the details can be explored under the prepare actions below.
Prepare Actions:
- FEATURE ENGINEER: Used 'bag of words' to create new categorical features from polarizing words.
  - Created columns with `clean` and `lemmatized` text
  - Created a column containing the word_count length
  - Created a column containing sentiment score of the text
- DROP: Dropped the phone_number and website_url columns, which contained null values and were not going to be used as features in this iteration of the project. Dropped six restaurants from the original Kaggle dataset that are no longer Michelin restaurants.
- RENAME: Converted column names to lowercase with no spaces.
- ENCODED: Features 'price_category' and 'country' were encoded into dummy variables
- IMPUTED: There were missing values in the price column that were imputed with the mode
- Note: Special care was taken to ensure that there was no leakage of this data
- SPLIT: train, validate and test (approx. 56/24/20), stratifying on the target, `award`
- SCALED: We scaled all numeric columns for modeling ['lem_length','original_length','clean_length','length_diff']
- Xy SPLIT: split each DataFrame (train, validate, test) into X (features) and y (target)
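The 56/24/20 stratified split listed above is equivalent to first holding out 20% for test and then splitting the remaining 80% into 70/30 (0.8 × 0.7 = 0.56 and 0.8 × 0.3 = 0.24). The sketch below shows the stratification idea with the standard library only. The function `stratified_split` is hypothetical; a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would typically be used instead.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str],
    fractions: Tuple[float, float, float] = (0.56, 0.24, 0.20),
    seed: int = 42,
) -> Tuple[List[int], List[int], List[int]]:
    """Split row indices into train/validate/test, class by class."""
    # Group row indices by their award label.
    by_label: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train: List[int] = []
    validate: List[int] = []
    test: List[int] = []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_validate = round(len(indices) * fractions[1])
        # Each class contributes the same proportions to every subset.
        train.extend(indices[:n_train])
        validate.extend(indices[n_train:n_train + n_validate])
        test.extend(indices[n_train + n_validate:])
    return train, validate, test
```

Stratifying on `award` keeps rare classes such as 3 Michelin stars represented in all three subsets.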
- Bib Gourmand is the most common award (baseline is 50.3%); 3 Michelin stars is the least common.
- France has the most Michelin-awarded restaurants, followed by Japan, Italy, the U.S.A. and Germany.
- Restaurants awarded 3 Michelin stars had reviews with the most words, and Bib Gourmand restaurants had reviews with the fewest.
- Restaurants awarded 2 Michelin stars had the highest sentiment score, and Bib Gourmand restaurants had the lowest.
- Most frequent single words used:
  - modern
  - room
  - wine
- Most frequent bigrams:
  - tasting menu
  - la carte
  - open kitchen
- Most frequent trigrams:
  - two tasting menu
  - take pride place
- Higher-rated restaurants had more facilities than lower-rated restaurants.
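Frequency lists like the bigrams and trigrams above can be produced by counting n-grams over the cleaned, lemmatized review text. The following is a minimal sketch using only the standard library; `ngram_counts` is a hypothetical helper, and the two sample strings are invented stand-ins for lemmatized reviews.

```python
from collections import Counter
from typing import Iterable


def ngram_counts(documents: Iterable[str], n: int) -> Counter:
    """Count n-grams (as tuples of tokens) within each document."""
    counts: Counter = Counter()
    for document in documents:
        tokens = document.split()
        # Zipping n staggered views of the token list yields every n-gram,
        # without ever joining words across two different reviews.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Invented sample text, for illustration only.
sample_reviews = ["tasting menu open kitchen", "two tasting menu"]
top_bigram = ngram_counts(sample_reviews, 2).most_common(1)
print(top_bigram)
```

Grouping the reviews by award before counting would give the per-category comparison used in the exploration.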
The models created
We used the following classifier models:
- Decision Tree
- Random Forest
- Logistic Regression
- Gradient Boosting Classifier
The metric used to evaluate the models was the accuracy score. The ideal model's accuracy score is expected to outperform the baseline accuracy score.
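To make the comparison concrete, the baseline here is the accuracy obtained by always predicting the most common award, and a model's accuracy is the share of predictions that match the actual award. A minimal sketch of both calculations follows; the helper names and the toy labels are assumptions for illustration, not project data.

```python
from collections import Counter
from typing import Sequence


def baseline_accuracy(labels: Sequence[str]) -> float:
    """Accuracy of always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of predictions that equal the actual label."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Toy labels, for illustration only.
actual = ["bib gourmand"] * 3 + ["1 michelin star"] * 2 + ["2 michelin stars"]
predicted = ["bib gourmand"] * 3 + ["1 michelin star"] * 3
print(baseline_accuracy(actual), accuracy(actual, predicted))
```

A model is only useful if its accuracy on unseen data exceeds this baseline.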
Modeling Results
- We ran grid search on four different models, optimizing hyperparameters
- Logistic Regression performed the best, over both train and validate
- When run on test, Logistic Regression yielded an accuracy score of 87.9%, improving on the baseline accuracy (50.3%) by 37.6 percentage points
- Restaurants with higher Michelin award levels have, on average, longer reviews
- France, Japan, and Italy have the most Michelin restaurants
- Two (2) Star Michelin Restaurant reviews have the highest sentiment levels, followed by one (1) star restaurants, three (3) star restaurants, and Bib Gourmand restaurants. However, the difference in sentiment levels between the star categories was not significant.
- Utilizing the cleaned and lemmatized text of reviews, we produced a model that predicts, with 87.9% accuracy, the award category of a restaurant.
- Our results suggest that the way Michelin reviewers talk about restaurants is impactful and meaningful, and further exploration could yield valuable results
- To improve your chances for Michelin designation, “shoot for the stars”
- The higher a restaurant is rated, the more service-focused words and groups of two and three words occur in the review
- An improvement in dining experience seems to be the biggest driver towards a three-star restaurant review
- Pruning TF/IDF to hone model performance
- Investigate deep learning to further improve model accuracy
- Exploration of restaurant cuisine type to feature engineer
- Investigation and deeper exploration of unique words and phrases
- Clustering features for modeling