Codeup Capstone Team Project utilizing Natural Language Processing (NLP) to predict Michelin star ratings of Michelin awardee restaurants using Michelin Guide review text

CodeupGourmands/Michelin_NLP_Capstone

Codeup Gourmands Presents...

Michelin NLP Capstone

Project Links

  • Click the buttons below to see the Project Repo and Canva presentation.

GitHub Canva Trello

Meet Team Codeup Gourmands!

Yuvia Cardenas Justin Evans Cristina Lucin Woodrow Sims

Project Inspiration:

In 1900, there were fewer than 3,000 cars on the roads of France. In order to increase demand for cars and, accordingly, car tires, car tire manufacturers and brothers Édouard and André Michelin published a guide for French motorists, the Michelin Guide. It provided information to motorists, such as maps, car mechanics listings, petrol stations, hotels and restaurants throughout France. The guide began to award stars for fine dining establishments in 1926. Initially, there was only a single star awarded. Then, in 1931, the hierarchy of zero, one, two, and three stars was introduced. In 1955, a fourth category, "Bib Gourmand", identified restaurants with quality food at a value price.

At present, a star award from the Michelin Guide is widely accepted as the pre-eminent culinary achievement for restaurateurs and chefs alike. Michelin reviewers (commonly called "inspectors") are anonymous. Many of the company's top executives have never met an inspector; inspectors themselves are advised not to disclose their line of work, even to their parents. The secrecy of this process, and the importance of these reviews in the culinary world, led us to ask: "What factors can be revealed by examining Michelin restaurant reviews?" Through our shared love of food, we embarked on a journey to use data science to distill the essence of fine dining perfection.

Project Overview:

Our Capstone Team Project utilizes Web-scraping & Natural Language Processing to develop a model that predicts Michelin food star award ratings based on content from the official Michelin review.

Following the data science pipeline, our team will first acquire and prepare the data for exploration. Then, we will explore the data to gain insight into which features to engineer to improve our model's accuracy. After creating several types of machine learning models that can effectively predict the Michelin food star award rating, we will compare each model's performance on the train and validate datasets. The best-performing model will move forward to the test dataset for final results.

Project Goals:

  • Create a model that effectively predicts Michelin food star award ratings based on content from the official Michelin review
  • Provide a well-documented jupyter notebook that contains our analysis
  • Produce a Final GitHub repository
  • Present a Canva slide deck suitable for a general audience which summarizes our findings and documents the results with well-labeled visualizations

Reproduction of this Data:

The data can be reproduced by cloning our project and running the final notebook, as explained in the instructions below.

Warning: to ensure you are not banned from the host while scraping, the acquire function pauses for 2 seconds per page, with a backup 5-second pause in case of an error. This slows down the initial scraping run of the program. After web-scraping each of the 6,700+ reviews, all data is saved locally to the michelin_df.pickle file.
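
The pause-and-retry behavior described above could look roughly like the sketch below. This is not the code in acquire.py: the function name `fetch_politely` and the injectable `fetch` and `sleep` parameters are assumptions made for illustration, with only the 2-second and 5-second pauses taken from the description.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    pause: float = 2.0,
    retry_pause: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Optional[str]]:
    """Fetch each URL, pausing between pages and retrying once on error.

    `fetch` is any function that takes a URL and returns page text
    (hypothetical here; the project's acquire.py has its own scraper).
    """
    pages: Dict[str, Optional[str]] = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception:
            # Back off for longer, then try this page one more time.
            sleep(retry_pause)
            try:
                pages[url] = fetch(url)
            except Exception:
                pages[url] = None  # leave a gap rather than abort the run
        # Regular pause before moving on to the next page.
        sleep(pause)
    return pages
```

Passing `fetch` and `sleep` in as arguments keeps the sketch testable without network access; in a real run `fetch` would wrap an HTTP request plus Beautiful Soup parsing.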

Reproduction Instructions:

  • Clone the repository by running this command in your terminal: git clone [email protected]:CodeupGourmands/Michelin_NLP_Capstone.git and then run the mvp_notebook.ipynb Jupyter Notebook.

  • To run the notebook, ensure that, at a minimum, the files listed below are included in the repo:

    • mvp_notebook.ipynb
    • acquire.py
    • prepare.py
    • explore.py
    • model.py


Initial Thoughts

Our initial thoughts are that country, cuisine, and words/groups of words (bigrams and trigrams) may be very impactful features to predict our target 'award' level. Another thought was that the price level and available facilities could also help determine the target 'award' level.

The Plan

  • Acquire initial data (CSV file) via Kaggle download
  • Acquire review data using Beautiful Soup via 'get_michelin_pages' function in acquire file
  • Clean and Prepare the data utilizing RegEx and string functions
  • Explore data in search of significant relationships to target (Michelin Star Ratings)
  • Conduct statistical testing as necessary
  • Answer 6 initial exploratory questions:

Question 1. What is the distribution of our target variable (award type)?
Question 2. What countries have the most Michelin restaurants?
Question 3. What is the average wordcount of restaurant reviews, by award type?
Question 4. Do three star Michelin restaurants have the highest sentiment score?
Question 5. What are the most frequent words used in Michelin Restaurant reviews?
Question 6. Do higher rated restaurants have more facilities?

  • Develop a Model to predict Award Category of Michelin restaurants:

    • Evaluate models on train and validate data using accuracy score
    • Select the best model based on the smallest difference in the accuracy score on the train and validate sets.
    • Evaluate the best model on test data
  • Draw conclusions
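
The model selection rule in the plan (smallest difference between train and validate accuracy) can be sketched as follows. The function name `pick_most_stable` and the scores in `example_scores` are made up for illustration and are not the project's results.

```python
from typing import Dict, Tuple


def pick_most_stable(scores: Dict[str, Tuple[float, float]]) -> str:
    """Return the model name with the smallest |train - validate| accuracy gap."""
    return min(scores, key=lambda name: abs(scores[name][0] - scores[name][1]))


# Made-up (train, validate) accuracy scores, purely for illustration.
example_scores = {
    "decision_tree": (0.99, 0.80),
    "random_forest": (0.95, 0.84),
    "logistic_regression": (0.90, 0.87),
}
best = pick_most_stable(example_scores)
print(best)
```

A rule based only on the gap would also favor a model that is equally poor on both sets, so it is presumably applied alongside a check that validate accuracy beats the baseline.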

Data Dictionary:

Original Features:

| Feature | Description |
| --- | --- |
| name | Name of the awardee restaurant |
| address | Address of the awardee restaurant |
| location | City, country, or province of the awardee restaurant |
| price | Representation of the price value from one to four (min-max) using the currency symbol of the location country |
| cuisine | Main style of cuisine served by the awardee restaurant |
| longitude | Geographical longitude of the awardee restaurant |
| latitude | Geographical latitude of the awardee restaurant |
| url | URL of the Michelin review of the awardee restaurant |
| facilities_and_services | Highlighted facilities and services available at the awardee restaurant |
| data | Web-scraped review text for each awardee restaurant |

Feature Engineered:

| Feature | Description |
| --- | --- |
| price_level | Numeric value from 1 to 4 (min-max) representing the same relative level of expense across all countries |
| city | City as captured by the first position of the location feature |
| country | Country as captured by the second position of the location feature; also captures provinces that only had one entry in the location feature |
| review_clean | Tokenized, lowercase text containing only Latin characters, derived from the original data column of scraped reviews |
| review_lemmatized | The web-scraped reviews after being cleaned and lemmatized |
| review_word_count | Word count of each corresponding review |
| review_sentiment | Compound sentiment score of each observation |
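
As a rough illustration of how the city, country, price_level, and review_word_count features could be derived, here is a minimal sketch. It assumes that location is a comma-separated "City, Country" string and that price is a run of one to four currency symbols; the helper names are hypothetical and do not come from prepare.py.

```python
from typing import Tuple


def split_location(location: str) -> Tuple[str, str]:
    """Split 'City, Country' into (city, country).

    A single-entry location such as 'Singapore' is used for both fields.
    """
    parts = [part.strip() for part in location.split(",")]
    city = parts[0]
    country = parts[1] if len(parts) > 1 else parts[0]
    return city, country


def price_to_level(price: str) -> int:
    """Map a run of currency symbols ('€€€', '$$') to its length, 1 to 4."""
    return len(price.strip())


def word_count(text: str) -> int:
    """Count whitespace-separated words in a review."""
    return len(text.split())
```

Counting symbols rather than matching a specific currency is what lets the same level apply across countries.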

Target Variable:

| Feature | Value | Description |
| --- | --- | --- |
| award | ['1 michelin star', '2 michelin stars', '3 michelin stars', 'bib gourmand'] | This feature identifies which award was presented to the restaurant belonging to each document |
| | 1 michelin star | "High quality cooking, worth a stop!" |
| | 2 michelin stars | "Excellent cooking, worth a detour!" |
| | 3 michelin stars | "Exceptional cuisine, worth a special journey!" |
| | bib gourmand | "Good quality, good value cooking." |

Acquire

Our dataset of all Michelin Awardee restaurants worldwide was acquired January 17, 2023 from Kaggle. This dataset is updated quarterly with new additions of Michelin Awardee restaurants and removal of restaurants that no longer carry the award. From this initial dataset, we utilized the Michelin Guide URL for each restaurant and Beautiful Soup to web-scrape the review text for each restaurant, enhancing the original dataset.

Acquisition Actions:

  • Web-scraped data from guide.michelin.com using Beautiful Soup
  • The review text for each restaurant was then appended back to the original dataframe
  • Each row represents a Michelin Awardee restaurant
  • Each column represents a feature of the restaurant, including feature-engineered columns
  • 6,780 restaurant reviews acquired (6 NaN values, caused by restaurants that are no longer active Michelin Awardees).

Prepare

Our data set was prepared following standard Data Processing procedures and the details can be explored under the prepare actions below.

Prepare Actions:

  • FEATURE ENGINEER: Used 'bag of words' to create new categorical features from polarizing words.
    • Created columns with clean and lemmatized text
    • Created a column containing the word count of each review
    • Created a column containing sentiment score of the text
  • DROP: Dropped the phone_number and website_url columns, which contained null values and which we determined would not be used as features in this iteration of the project. Dropped six restaurants from the original Kaggle dataset that are no longer Michelin restaurants.
  • RENAME: Converted column names to lowercase with no spaces.
  • ENCODED: Features 'price_category' and 'country' were encoded into dummy variables
  • IMPUTED: There were missing values in the price column that were imputed with the mode
  • Note: Special care was taken to ensure that there was no leakage of this data
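
The cleaning step (lowercase, Latin characters only, tokenized) might be sketched with the standard library as shown here. The function name `clean_review` is an assumption, and lemmatization is left out because it needs an external library such as NLTK.

```python
import re
import unicodedata
from typing import List


def clean_review(text: str) -> List[str]:
    """Lowercase a review, strip accents and non-letters, and split into tokens."""
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Keep letters, apostrophes, and whitespace only.
    letters_only = re.sub(r"[^a-z'\s]", " ", ascii_text.lower())
    return letters_only.split()
```

For example, `clean_review("Chef's crème brûlée, 2 courses!")` yields the tokens `chef's`, `creme`, `brulee`, and `courses`.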

Split

  • SPLIT: train, validate and test (approx. 56/24/20), stratifying on target of award
  • SCALED: We scaled all numeric columns for modeling ['lem_length','original_length','clean_length','length_diff']
  • Xy SPLIT: split each DataFrame (train, validate, test) into X (features) and y (target)
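
A stratified 56/24/20 split can be illustrated with the standard library alone, as in the sketch below. The function `stratified_split` and its seed are assumptions for illustration. In practice the same proportions fall out of calling scikit-learn's `train_test_split` twice with `stratify` set (80/20 first, then 70/30 on the remainder, since 0.8 × 0.7 = 0.56 and 0.8 × 0.3 = 0.24).

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str],
    fractions: Tuple[float, float, float] = (0.56, 0.24, 0.20),
    seed: int = 42,
) -> Dict[str, List[int]]:
    """Return row indices for train/validate/test, split within each label."""
    by_label: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    splits: Dict[str, List[int]] = {"train": [], "validate": [], "test": []}
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_validate = round(len(indices) * fractions[1])
        splits["train"] += indices[:n_train]
        splits["validate"] += indices[n_train:n_train + n_validate]
        # Whatever remains after train and validate goes to test.
        splits["test"] += indices[n_train + n_validate:]
    return splits
```

Splitting within each award category keeps the class proportions of train, validate, and test close to those of the full dataset, which matters when one class (here Bib Gourmand) makes up about half of the rows.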

Exploration Summary of Findings:

  • Bib Gourmand is the most common award (baseline accuracy is 50.3%); 3 Michelin stars is the least common.

  • France has the most Michelin-awarded restaurants, followed by Japan, Italy, the U.S.A., and Germany

  • Restaurants awarded 3 Michelin stars had reviews with the most words, and Bib Gourmand restaurants had reviews with the fewest.

  • Restaurants awarded 2 Michelin stars had the highest sentiment score, and Bib Gourmand restaurants had the lowest sentiment score

  • Most frequent single words used:

    • modern
    • room
    • wine
  • Most frequent bigrams:

    • tasting menu
    • la carte
    • open kitchen
  • Most frequent trigrams:

    • two tasting menu
    • take pride place
  • Higher-rated restaurants had more facilities than lower rated restaurants
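
Frequent bigrams and trigrams like those listed above can be counted from a token list with `collections.Counter`. The sketch below is illustrative only: the function name `top_ngrams` and the sample tokens are assumptions, and it counts within a single token list, so pooled reviews would need to be counted one at a time to avoid n-grams that span two reviews.

```python
from collections import Counter
from typing import List, Tuple


def top_ngrams(tokens: List[str], n: int, k: int = 3) -> List[Tuple[str, int]]:
    """Return the k most common n-grams (joined by spaces) with their counts."""
    ngrams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(ngrams).most_common(k)


# Made-up tokens standing in for one cleaned, lemmatized review.
sample_tokens = "tasting menu open kitchen tasting menu wine tasting menu".split()
print(top_ngrams(sample_tokens, n=2, k=1))
```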

Modeling

The models created

Used following classifier models:

  • Decision Tree
  • Random Forest
  • Logistic Regression
  • Gradient Boosting Classifier

The metric used to evaluate the models was the accuracy score. The ideal model's accuracy score is expected to outperform the baseline accuracy score.

Modeling Summary:

Modeling Results

  • We ran grid search on four different models, optimizing hyperparameters
  • Logistic Regression performed the best, over both train and validate
  • When run on test, Logistic Regression yielded an accuracy score of 87.9%, improving on the baseline accuracy of 50.3% by 37.6 percentage points
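
For readers unfamiliar with the baseline comparison, the sketch below shows how a most-common-class baseline and an accuracy score are computed, and checks the reported improvement. The function names are hypothetical; only the 87.9% and 50.3% figures come from the results above.

```python
from collections import Counter
from typing import Sequence


def baseline_accuracy(labels: Sequence[str]) -> float:
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Share of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Reported test accuracy minus reported baseline, both in percent.
improvement_points = round(87.9 - 50.3, 1)
```

The difference is expressed in percentage points, since it is a subtraction of two percentages rather than a relative change.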

Conclusion:

  • Restaurants with higher Michelin award levels have, on average, longer reviews
  • France, Japan, and Italy have the most Michelin restaurants
  • Two (2) Star Michelin Restaurant reviews have the highest sentiment levels, followed by one (1) star restaurants, three (3) star restaurants, and Bib Gourmand restaurants. However, the difference in sentiment levels between the star categories was not significant.
  • Utilizing the cleaned and lemmatized text of reviews, we produced a model that predicts, with 87.9% accuracy, the award category of a restaurant.
  • Our results suggest that the way Michelin reviewers talk about restaurants is impactful and meaningful, and further exploration could yield valuable results

Recommendations

  • To improve your chances for Michelin designation, “shoot for the stars”
  • The higher a restaurant is rated, the more service-focused words and groups of two and three words occur in its review
  • An improvement in dining experience seems to be the biggest driver toward a three-star restaurant review

Next Steps

  • Pruning TF-IDF features to hone model performance
  • Investigate deep learning to further improve model accuracy
  • Exploration of restaurant cuisine type to feature engineer
  • Investigation and deeper exploration of unique words and phrases
  • Clustering features for modeling
