
Predicting What Makes Food.com Recipes 5-Star Hits

A data science project exploring features of highly-rated Food.com recipes

By AJ Das, arinjoy@umich.edu

Introduction

I used the Recipes and Ratings dataset from Food.com to explore one question: what makes a recipe well-rated on Food.com?

The dataset contains 230,000+ recipes and reviews posted since 2008. I aim to identify which recipe features are most associated with high user ratings, such as preparation time, nutritional content, or complexity. Ultimately, I want to predict a new recipe’s average rating before it receives any user feedback.

Understanding which features contribute to higher-rated recipes can help cooks tailor their recipes for success. It also offers insight into user preferences and can inform recommendation systems on cooking platforms.

Dataset Overview

| Column | Description |
| --- | --- |
| minutes | Total time required to prepare the recipe |
| tags | Descriptive tags such as "easy", "vegan", or "holiday" |
| nutrition | A list of [calories, total fat, sugar, sodium, protein, saturated fat, carbohydrates]; calories are raw values and the remaining entries are percent daily values |
| n_steps | Number of steps in the recipe |
| avg_rating | The average user rating for the recipe, computed from all submitted reviews |
| description | The user-submitted text description of the recipe |

This metadata provides useful signals for predicting a recipe's potential success.


Data Cleaning and Exploratory Data Analysis

Data Cleaning

The raw dataset consisted of recipe metadata and user-submitted reviews. I performed a left merge on the recipe ID to consolidate all the information. After the merge, I replaced ratings of 0 with NaN, since a 0 indicates the absence of a rating rather than a low rating. For each recipe, I computed avg_rating, the average user rating and the target variable for prediction. The nutrition column was initially a stringified list, so I parsed it into individual numeric columns for calories, total_fat, sugar, sodium, protein, saturated_fat, and carbohydrates. Finally, recipes without an average rating were removed from the modeling set. These steps produced a clean, structured table for the analysis and modeling that follow.
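
A minimal sketch of this cleaning process, assuming the raw files follow the public Food.com dataset release (the file and column names here are assumptions, not guaranteed to match the project's exact setup):

```python
import ast
import numpy as np
import pandas as pd

# Assumed file names from the public Food.com dataset release; adjust as needed.
recipes = pd.read_csv("RAW_recipes.csv")
interactions = pd.read_csv("RAW_interactions.csv")

# Left merge so every recipe is kept, even those with no reviews yet.
merged = recipes.merge(interactions, left_on="id", right_on="recipe_id", how="left")

# A rating of 0 means "no rating was given", not a bad rating, so treat it as missing.
merged["rating"] = merged["rating"].replace(0, np.nan)

# Average rating per recipe is the prediction target.
avg = merged.groupby("id")["rating"].mean().rename("avg_rating").reset_index()
recipes = recipes.merge(avg, on="id", how="left")

# The nutrition column is a stringified list; expand it into one numeric column per nutrient.
nutrition_cols = ["calories", "total_fat", "sugar", "sodium",
                  "protein", "saturated_fat", "carbohydrates"]
nutrition = recipes["nutrition"].apply(
    lambda s: pd.Series(ast.literal_eval(s), index=nutrition_cols)
)
recipes = pd.concat([recipes, nutrition], axis=1)

# Recipes with no ratings at all are dropped from the modeling set.
recipes = recipes.dropna(subset=["avg_rating"]).reset_index(drop=True)
```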

| calories | total_fat | sugar | sodium | protein | saturated_fat | carbohydrates | avg_rating |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 138.4 | 10.0 | 50.0 | 3.0 | 3.0 | 19.0 | 6.0 | 4.0 |
| 595.1 | 46.0 | 211.0 | 22.0 | 13.0 | 51.0 | 26.0 | 5.0 |
| 194.8 | 20.0 | 6.0 | 32.0 | 22.0 | 36.0 | 3.0 | 5.0 |
| 194.8 | 20.0 | 6.0 | 32.0 | 22.0 | 36.0 | 3.0 | 5.0 |
| 194.8 | 20.0 | 6.0 | 32.0 | 22.0 | 36.0 | 3.0 | 5.0 |

Univariate Analysis

Distribution of Average Ratings

Most recipes are rated very highly, with the distribution concentrated near 5 stars and a long tail toward lower ratings. This positivity bias in user feedback is important to keep in mind when interpreting model performance.

Distribution of Calories

Calories are heavily right-skewed: most recipes are under 1,000 calories, but a few extreme outliers reach 40,000+. Because of this, I scaled the features before modeling.

Bivariate Analysis

Minutes vs Average Rating

This scatter plot shows a weak negative trend between cooking time (minutes) and average rating. Users slightly prefer quicker recipes, though highly rated dishes exist at all durations.

Steps vs Average Rating

This box plot shows that recipes with fewer steps generally earn higher ratings. Recipes with more steps also show more variation in ratings, which may reflect the difficulty of executing complex dishes.

Interesting Aggregates

I created a new column called is_easy, which flags whether a recipe contains the tag "easy". The intuition is that if a recipe is easy to make, users are more willing to try it.

I grouped the data by this flag and calculated the mean average rating for each group:

| is_easy | avg_rating |
| --- | --- |
| False | 4.67 |
| True | 4.68 |
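
A short sketch of how this flag and aggregate can be computed, assuming the cleaned recipes DataFrame from the data-cleaning step and a stringified-list tags column:

```python
import ast

# Flag recipes whose tag list contains "easy".
recipes["is_easy"] = recipes["tags"].apply(lambda s: "easy" in ast.literal_eval(s))

# Mean average rating for each group.
print(recipes.groupby("is_easy")["avg_rating"].mean().round(2))
```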

Imputation

The parsed nutrition columns had missing values, so I used median imputation via SimpleImputer to handle missing data.

Protein Distribution Before and After Imputation

Filling in missing protein values with the median left the overall shape of the distribution essentially unchanged. This approach stabilized model performance without dropping rows or introducing outliers.
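
A minimal sketch of the imputation step on the parsed nutrition columns (the column list follows the cleaning step above):

```python
from sklearn.impute import SimpleImputer

nutrition_cols = ["calories", "total_fat", "sugar", "sodium",
                  "protein", "saturated_fat", "carbohydrates"]

# Fill missing values in each nutrition column with that column's median.
imputer = SimpleImputer(strategy="median")
recipes[nutrition_cols] = imputer.fit_transform(recipes[nutrition_cols])
```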

Framing a Prediction Problem

The goal of this project was to predict the average user rating (avg_rating) of a recipe. Since avg_rating is a continuous numerical value ranging from 1 to 5, this is a regression problem. I wanted to estimate a real-valued prediction of how well a recipe is expected to be rated before any user reviews are submitted. The choice of avg_rating as the response variable aligns directly with my research question: What makes a recipe well-rated?

To evaluate the model, I chose Mean Squared Error (MSE) as my metric. MSE penalizes substantial errors more severely. Classification metrics like accuracy or F1-score would not be appropriate in this context because we are predicting a continuous outcome, not a class label.
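
For reference, with $y_i$ the true average rating, $\hat{y}_i$ the predicted rating, and $n$ the number of recipes in the test set:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$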

At the time of prediction, I used all features (e.g., minutes, protein, n_steps, tags, etc.) that were available to the system before the user submitted any rating. Therefore, my setup simulates how a platform like Food.com might recommend or rank recipes before they are reviewed.

Baseline Model

I trained a linear regression model as a starting point for predicting a recipe's average user rating. I used minutes (prep time) and calories (energy content) as inputs, since both may affect user perception and satisfaction: how long a recipe takes to make and how calorie-dense it is.

Before training, I dropped any rows where the avg_rating was missing. The preprocessing pipeline consisted of imputation, scaling, and modeling. Any missing values in features were filled with the median. I used StandardScaler() to improve numerical stability. Finally, I trained the model with LinearRegression from sklearn. I bundled the process into a Pipeline and evaluated the model using Mean Squared Error (MSE) on a 20% test set.
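
A sketch of this baseline setup, assuming the cleaned recipes DataFrame from the data-cleaning step (the random_state is an arbitrary choice for reproducibility):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = recipes[["minutes", "calories"]]
y = recipes["avg_rating"]

# Hold out 20% of the recipes as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Impute missing values with the median, standardize, then fit a linear regression.
baseline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
baseline.fit(X_train, y_train)

print("Baseline MSE:", mean_squared_error(y_test, baseline.predict(X_test)))
```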

Baseline MSE: 0.2400

Though simple, this model serves as a baseline to compare against more advanced models with richer feature engineering.

Final Model

I introduced three new features and used a stronger modeling algorithm for my final model. I added:

I chose a RandomForestRegressor to capture nonlinear relationships and interactions between these features. I applied GridSearchCV to evaluate combinations of the following hyperparameters:

I utilized a median imputation strategy for numeric columns, applied standard scaling, and trained the model on 80% of the data using a pipeline structure.
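
A sketch of this final setup; the feature list and hyperparameter grid below are illustrative placeholders rather than the exact values used in the project:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative feature set: baseline features plus engineered ones such as is_easy and n_steps.
features = ["minutes", "calories", "n_steps", "protein", "is_easy"]
X = recipes[features]
y = recipes["avg_rating"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])

# Illustrative hyperparameter grid; the actual grid searched may differ.
param_grid = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [None, 10, 20],
}

search = GridSearchCV(pipeline, param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Test MSE:", -search.score(X_test, y_test))
```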

Final Results

The MSE improved notably over the baseline model. The engineered features and tuned hyperparameters helped the model capture more complex trends in the data.

Feature Importances:

The bar chart shows the importance of each feature according to the Random Forest model.