Codeup Gourmands Presents...

Michelin NLP Capstone

Project Links

Click the buttons below to see the Project Repo and Canva presentation.

Meet Team Codeup Gourmands!

Yuvia Cardenas	Justin Evans	Cristina Lucin	Woodrow Sims

Project Inspiration:

In 1900, there were fewer than 3,000 cars on the roads of France. In order to increase demand for cars and, accordingly, car tires, car tire manufacturers and brothers Édouard and André Michelin published a guide for French motorists, the Michelin Guide. It provided information to motorists, such as maps, car mechanics listings, petrol stations, hotels and restaurants throughout France. The guide began to award stars for fine dining establishments in 1926. Initially, there was only a single star awarded. Then, in 1931, the hierarchy of zero, one, two, and three stars was introduced. In 1955, a fourth category, "Bib Gourmand", identified restaurants with quality food at a value price.

At present, a star award from the Michelin Guide is widely accepted as the pre-eminent culinary achievement of restaurateurs and chefs alike. Michelin reviewers (commonly called "inspectors") are anonymous. Many of the company's top executives have never met an inspector; inspectors themselves are advised not to disclose their line of work, even to their parents. The secrecy of this process, and the importance of these reviews in the culinary world, led us to ask: "What factors can be revealed by examining Michelin restaurant reviews?" Through our shared love of food, we embarked on a journey to use data science to distill the essence of fine dining perfection.

Project Overview:

Our Capstone Team Project utilizes Web-scraping & Natural Language Processing to develop a model that predicts Michelin food star award ratings based on content from the official Michelin review.

Following the data science pipeline, our team will first acquire and prepare the data for exploration. Then we will explore the data to gain insight into which features to engineer to improve our model's accuracy. After creating several types of machine learning models that can effectively predict the Michelin food star award rating, we will compare each model's performance on the train and validate datasets. The model that performs best will move forward to the test dataset for final results.

Project Goals:

Create a model that effectively predicts Michelin food star award ratings based on content from the official Michelin review
Provide a well-documented Jupyter notebook that contains our analysis
Produce a Final GitHub repository
Present a Canva slide deck suitable for a general audience which summarizes our findings and documents the results with well-labeled visualizations

Reproduction of this Data:

Our results can be reproduced by cloning our project and running the final notebook, as explained in the instructions below.

Warning: to avoid being banned from the host while scraping, the acquire function pauses for 2 seconds per page, with a backup 5-second pause in case of an error. This slows down the initial scraping run of the program. After web-scraping each of the 6,700+ reviews, all data is saved locally to the michelin_df.pickle file.
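The pause-and-retry behavior described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual acquire function: the name `scrape_politely`, the `fetch` callable, and the injectable `sleep` parameter are assumptions, chosen so the example runs without network access.

```python
import time


def scrape_politely(urls, fetch, pause=2.0, retry_pause=5.0, sleep=time.sleep):
    """Fetch each URL with a pause per page and one delayed retry on error.

    `fetch` is a hypothetical callable (url -> text); it stands in for
    the project's real scraping code. Pages that fail twice map to None.
    """
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception:
            # Back off for longer, then try this page one more time.
            sleep(retry_pause)
            try:
                results[url] = fetch(url)
            except Exception:
                results[url] = None
        # Pause after every page so the host is not flooded with requests.
        sleep(pause)
    return results
```

Passing `sleep` as a parameter is only a convenience for trying the function out quickly; with the default `time.sleep`, a run over thousands of pages would take hours, which matches the slow first run noted above.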

Reproduction Instructions:

Clone the repository by running git clone git@github.com:CodeupGourmands/Michelin_NLP_Capstone.git in your terminal, then run the mvp_notebook.ipynb Jupyter Notebook.
At a minimum, ensure the files listed below are present in the repo before running the notebook:
- mvp_notebook.ipynb
- acquire.py
- prepare.py
- explore.py
- model.py

Initial Thoughts

Our initial thoughts are that country, cuisine, and words/groups of words (bigrams and trigrams) may be very impactful features to predict our target 'award' level. Another thought was that the price level and available facilities could also help determine the target 'award' level.

The Plan

Acquire initial data (CSV file) via Kaggle download
Acquire review data using Beautiful Soup via 'get_michelin_pages' function in acquire file
Clean and Prepare the data utilizing RegEx and string functions
Explore data in search of significant relationships to target (Michelin Star Ratings)
Conduct statistical testing as necessary

▪︎ Answer 6 initial exploratory questions:

Question 1. What is the distribution of our target variable (award type)?
Question 2. What countries have the most Michelin restaurants?
Question 3. What is the average wordcount of restaurant reviews, by award type?
Question 4. Do three star Michelin restaurants have the highest sentiment score?
Question 5. What are the most frequent words used in Michelin Restaurant reviews?
Question 6. Do higher rated restaurants have more facilities?

Develop a Model to predict Award Category of Michelin restaurants:
- Evaluate models on train and validate data using accuracy score
- Select the best model based on the smallest difference in the accuracy score on the train and validate sets.
- Evaluate the best model on test data
Draw conclusions
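The selection rule in the plan (smallest difference between train and validate accuracy) can be sketched as below. The function name, the model names, and the scores are all made up for illustration, and the filter that skips models below the baseline is an added assumption rather than a stated project rule.

```python
def select_best_model(scores, baseline):
    """Pick the model with the smallest train/validate accuracy gap.

    `scores` maps a model name to (train_accuracy, validate_accuracy).
    Models that do not beat the baseline on validate are skipped; that
    filter is an assumption added for this sketch.
    """
    gaps = {
        name: abs(train - validate)
        for name, (train, validate) in scores.items()
        if validate > baseline
    }
    if not gaps:
        raise ValueError("no model beats the baseline")
    return min(gaps, key=gaps.get)


# Made-up accuracy scores, for illustration only.
example_scores = {
    "model_a": (0.99, 0.78),
    "model_b": (0.97, 0.84),
    "model_c": (0.91, 0.88),
}
best = select_best_model(example_scores, baseline=0.503)
print(best)
```

A small gap alone can favor a weak model that is equally poor on both sets, which is why the sketch also requires the validate score to exceed the baseline.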

Data Dictionary:

Original Features:

Feature	Description
name	Name of the awardee restaurant
address	Address of the awardee restaurant
location	City, country, or province of the awardee restaurant
price	Representation of the price value from one to four (min-max) using the currency symbol of the location country
cuisine	Main style of cuisine served by the awardee restaurant
longitude	Geographical longitude of the awardee restaurant
latitude	Geographical latitude of the awardee restaurant
url	Url address to the Michelin Review of the awardee restaurant
facilities_and_services	Highlighted facilities and services available at the awardee restaurant
data	Web-scraped review for each awardee document

Feature Engineered:

Feature	Description
price_level	Numeric value from 1 to 4 (min-max) representing the same relative level of expense across all countries
city	City as captured by the first position of the location feature
country	Country as captured by the second position of the location feature; also captures provinces that only had one entry in the location feature
review_clean	Tokenized, lowercase text containing only Latin characters, derived from the original data column of scraped reviews
review_lemmatized	Web-scraped review text after being cleaned and lemmatized
review_word_count	Word count of each corresponding review
review_sentiment	Compound sentiment score of each observation
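A few of the engineered features in the table above could be derived along these lines. This is a sketch under stated assumptions, not the project's prepare code: the function names are hypothetical, each currency symbol is assumed to be a single character, and a single-entry location is assumed to fill both city and country.

```python
def price_level(price):
    """Count currency symbols, e.g. '$$$' or '€€€' -> 3 (None if missing)."""
    if not price:
        return None
    return len(price.strip())


def split_location(location):
    """Return (city, country) from a 'City, Country' string.

    A single-entry location is used for both fields, which is one
    reading of the data dictionary rather than the project's exact rule.
    """
    parts = [part.strip() for part in location.split(",")]
    return parts[0], parts[-1]


def review_word_count(text):
    """Count whitespace-separated words in a review."""
    return len(text.split())
```

Counting symbols rather than reading the symbol itself is what makes the level comparable across countries with different currencies.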

Target Variable:

Feature	Value	Description
award	['1 michelin star', '2 michelin stars', '3 michelin stars', 'bib gourmand']	This feature identifies which award was presented to the restaurant belonging to each document
	1 michelin star	"High quality cooking, worth a stop!"
	2 michelin stars	"Excellent cooking, worth a detour!"
	3 michelin stars	"Exceptional cuisine, worth a special journey!"
	bib gourmand	"Good quality, good value cooking."

Acquire

Our dataset of all Michelin Awardee restaurants worldwide was acquired January 17, 2023 from Kaggle. This dataset is updated quarterly with new additions of Michelin Awardee restaurants and removal of restaurants that no longer carry the award. From this initial dataset, we utilized the Michelin Guide URL for each restaurant and Beautiful Soup to web-scrape the review text for each restaurant, enhancing the original dataset.

Acquisition Actions:

Web-scraped data from guide.michelin.com using Beautiful Soup
The review text for each restaurant was then appended back to the original dataframe
Each row represents a Michelin Awardee restaurant
Each column represents a feature of the restaurant, including feature-engineered columns
6780 acquired restaurant reviews (6 NaN values caused by restaurants no longer active Michelin Awardees).

Prepare

Our data set was prepared following standard Data Processing procedures and the details can be explored under the prepare actions below.

Prepare Actions:

FEATURE ENGINEER: Used 'bag of words' to create new categorical features from polarizing words.
- Created columns with clean and lemmatized text
- Created a column containing the word_count length
- Created a column containing sentiment score of the text
DROP: Dropped the phone_number and website_url columns, which contained null values and would not be used as features in this iteration of the project. Dropped six restaurants from the original Kaggle dataset that are no longer Michelin restaurants.
RENAME: Converted column names to lowercase with no spaces.
ENCODED: Features 'price_category' and 'country' were encoded into dummy variables
IMPUTED: There were missing values in the price column that were imputed with the mode
Note: Special care was taken to ensure that there was no leakage of this data
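One common way to impute with the mode while avoiding leakage is to compute the fill value from the training data only and then apply it to every split. The sketch below illustrates that idea with the standard library; the helper names and sample values are hypothetical, and the project's own imputation code may differ.

```python
from collections import Counter


def fit_mode(train_values):
    """Return the most common non-missing value in the training data."""
    observed = [value for value in train_values if value is not None]
    return Counter(observed).most_common(1)[0][0]


def impute(values, fill_value):
    """Replace missing entries (None) with a previously fitted value."""
    return [fill_value if value is None else value for value in values]


# Made-up price values; None marks a missing entry.
train_prices = ["$$", None, "$$", "$$$"]
validate_prices = [None, "$"]

mode_from_train = fit_mode(train_prices)           # learned on train only
train_filled = impute(train_prices, mode_from_train)
validate_filled = impute(validate_prices, mode_from_train)
```

Because the mode is never recomputed on validate or test, nothing about those splits influences the values the model is trained on.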

Split

SPLIT: train, validate and test (approx. 56/24/20), stratifying on target of award
SCALED: We scaled all numeric columns for modeling ['lem_length','original_length','clean_length','length_diff']
Xy SPLIT: split each DataFrame (train, validate, test) into X (features) and y (target)
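The approximate 56/24/20 proportions follow from splitting twice: 20% is held out as test, and 30% of the remaining 80% (24% of the total) becomes validate, leaving 56% for train. The sketch below shows a stratified version of that two-step split using only the standard library. It is a stand-in for illustration: the function name, seed, and fractions are assumptions, and the project's own split code is not shown here.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac=0.20, validate_frac=0.30, seed=42):
    """Return (train, validate, test) index lists, stratified by label.

    `validate_frac` is applied to what remains after the test split,
    so the defaults give roughly 56/24/20 within every class.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, validate, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        remainder = indices[n_test:]
        n_validate = round(len(remainder) * validate_frac)
        test.extend(indices[:n_test])
        validate.extend(remainder[:n_validate])
        train.extend(remainder[n_validate:])
    return train, validate, test
```

Stratifying on the award keeps rare classes such as 3 Michelin stars represented in all three splits instead of leaving their placement to chance.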

Exploration Summary of Findings:

Bib gourmand is the most common award (baseline is 50.3%), 3 Michelin stars is the least common.
France has the most Michelin-awarded restaurants, followed by Japan, Italy, the U.S.A., and Germany.
Restaurants awarded 3 Michelin stars had the longest reviews by word count, and Bib Gourmand restaurants had the shortest.
Restaurants awarded 2 Michelin stars had the highest sentiment score, and Bib Gourmand restaurants had the lowest sentiment score
Most frequent single words used:
- modern
- room
- wine
Most frequent bigrams:
- tasting menu
- la carte
- open kitchen
Most frequent trigrams:
- two tasting menu
- take pride place
Higher-rated restaurants had more facilities than lower rated restaurants
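Word, bigram, and trigram frequencies like those listed above can be counted with a short helper. This is a minimal sketch assuming the reviews are already cleaned and lemmatized into space-separated tokens; the function name and the sample reviews are made up.

```python
from collections import Counter


def top_ngrams(documents, n, k=3):
    """Return the k most frequent n-grams across all documents."""
    counts = Counter()
    for document in documents:
        tokens = document.split()
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)


# Made-up, already-cleaned review snippets for illustration.
sample_reviews = [
    "tasting menu open kitchen",
    "tasting menu wine",
]
top_bigram = top_ngrams(sample_reviews, n=2, k=1)
```

N-grams are built within each review separately, so no phrase spans the boundary between two different restaurants' reviews.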

Modeling

The models created

Used following classifier models:

Decision Tree
Random Forest
Logistic Regression
Gradient Boosting Classifier

The metric used to evaluate the models was the accuracy score. The ideal model's accuracy score is expected to outperform the baseline accuracy score.

Modeling Summary:

Modeling Results

We ran grid search on four different models, optimizing hyperparameters
Logistic Regression performed the best, over both train and validate
When run on test, Logistic Regression yielded an accuracy score of 87.9%, improving on the baseline accuracy of 50.3% by 37.6 percentage points
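The baseline here is the accuracy of always predicting the most common award, and the reported improvement is the difference between the test accuracy and that baseline. A short sketch of both calculations follows; the function name and the tiny label list are hypothetical, while the two percentages are the figures reported in this README.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Figures reported in this README, in percent.
baseline_pct = 50.3
test_accuracy_pct = 87.9
improvement_points = round(test_accuracy_pct - baseline_pct, 1)
print(improvement_points)  # 37.6
```

The difference is expressed in percentage points, since it is a subtraction of two accuracy scores rather than a relative change.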

Conclusion:

Restaurants with higher Michelin award levels have, on average, longer reviews
France, Japan, and Italy have the most Michelin restaurants
Two (2) Star Michelin Restaurant reviews have the highest sentiment levels, followed by one (1) star restaurants, three (3) star restaurants, and Bib Gourmand restaurants. However, the difference in sentiment levels between the star categories was not significant.
Utilizing the cleaned and lemmatized text of reviews, we produced a model that predicts, with 87.9% accuracy, the award category of a restaurant.
Our results suggest that the way Michelin reviewers talk about restaurants is impactful and meaningful, and further exploration could yield valuable results

Recommendations

To improve your chances for Michelin designation, “shoot for the stars”
The higher a restaurant is rated, the more service-focused words and groups of two and three words occur in its review
An improvement in dining experience seems to be the biggest driver toward a three-star restaurant review

Next Steps

Pruning TF/IDF to hone model performance
Investigate deep learning to further improve model accuracy
Exploration of restaurant cuisine type to feature engineer
Investigation and deeper exploration of unique words and phrases
Clustering features for modeling
