A Food Data Analysis: Recipe Ratings & Calories

Authors: Clayton Hammen Tan, Varick Janiro Hasim

Overview

The goal of this project is to analyze which recipe factors affect calories and to predict a recipe’s calories from its other attributes (e.g., number of steps, number of ingredients, and nutritional content). Using two datasets (recipes and user interactions), we clean and explore the data, test hypotheses, build predictive models, and assess the final model’s fairness across recipe categories.

Introduction

This project investigates how various factors contribute to a recipe’s calories using data from Food.com. The datasets include:

Recipes Dataset: Over 83,000 recipes with details such as preparation time, nutritional information, and ingredients.
Interactions Dataset: Over 730,000 user reviews and ratings.

By merging these datasets, we explore trends in calorie content and predict calories based on recipe attributes.

Research Question

What factors influence the calories and average rating of a recipe?

This question helps determine if more complex recipes tend to have higher calories and different satisfaction levels. The insights can guide home cooks, food bloggers, and recipe developers.

Dataset Description

Recipes Dataset

Column Name	Description
name	Name of the recipe.
id	Unique Recipe ID.
minutes	Total preparation time (in minutes).
contributor_id	ID of the user who submitted the recipe.
submitted	Date when the recipe was submitted.
tags	Categories associated with the recipe (e.g., cuisine or meal type).
nutrition	Nutritional information including calories, fat, protein, and carbohydrates.
n_steps	Number of steps in the recipe instructions.
steps	Detailed recipe instructions.
description	User-provided description of the recipe.
ingredients	List of ingredients used.
n_ingredients	Total number of ingredients used.
average_rating	Average rating, calculated from user interactions.

Interactions Dataset

Column Name	Description
user_id	ID of the user providing the rating or review.
recipe_id	ID of the reviewed recipe.
date	Date of the interaction (review or rating).
rating	Rating given by the user.
review	Textual review provided by the user (optional).

Importance of the Study

Understanding which recipe attributes influence calories has practical and academic benefits. It helps recipe developers and everyday cooks estimate recipe calories and demonstrates robust data analysis and predictive modeling techniques applicable to other recommendation systems.

Data Cleaning and Exploratory Data Analysis

Data Cleaning

The following cleaning steps were performed:

Merged the Datasets:
The recipes and interactions datasets were merged using a left join on id and recipe_id so that all recipes were retained.
Replaced Invalid Ratings:
Ratings of 0 in the interactions dataset were replaced with NaN, treating them as missing rather than true ratings.
Calculated Average Ratings:
Average ratings were computed for each recipe (grouped by name) and added as a new column average_rating, rounded to two decimals.
Extracted Nutritional Information:
The nutrition column (a string) was split into separate columns for calories, total_fat, sugar, sodium, protein, saturated_fat, and carbohydrates.
Created Additional Binary Columns:
- simple: True if the number of ingredients ≤ 9.
- easy: True if the number of steps ≤ 9.
- healthy: True if the ratio of saturated fat to calories ≤ 0.10.
Removed Outliers:
Outliers (e.g., recipes with 40,000 calories) were removed.
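
The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration on made-up toy data, not the project's actual code: the toy values, the `MAX_CALORIES` cutoff, and the assumption that the nutrition string is a bracketed list in the order given above are ours.

```python
import ast

import numpy as np
import pandas as pd

# Toy stand-ins for the two Food.com tables (all values are made up).
recipes = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["soup", "cake", "salad"],
    "n_steps": [5, 12, 9],
    "n_ingredients": [4, 11, 9],
    # Assumed order: calories, total_fat, sugar, sodium, protein,
    # saturated_fat, carbohydrates.
    "nutrition": [
        "[100.0, 5.0, 10.0, 3.0, 8.0, 4.0, 2.0]",
        "[400.0, 30.0, 40.0, 10.0, 20.0, 60.0, 12.0]",
        "[250.0, 12.0, 6.0, 9.0, 15.0, 20.0, 7.0]",
    ],
})
interactions = pd.DataFrame({"recipe_id": [1, 1, 2], "rating": [5, 0, 4]})

# Left join keeps recipes without any interaction (recipe 3 here).
merged = recipes.merge(interactions, how="left", left_on="id", right_on="recipe_id")

# A rating of 0 is treated as missing, not as a real score.
merged["rating"] = merged["rating"].replace(0, np.nan)

# Per-recipe mean rating (NaN ratings are skipped), rounded to two decimals.
merged["average_rating"] = (
    merged.groupby("name")["rating"].transform("mean").round(2)
)

# Parse the nutrition string into seven numeric columns.
nutrition_cols = ["calories", "total_fat", "sugar", "sodium",
                  "protein", "saturated_fat", "carbohydrates"]
parsed = merged["nutrition"].apply(ast.literal_eval).tolist()
nutrition = pd.DataFrame(parsed, columns=nutrition_cols, index=merged.index)
merged = pd.concat([merged.drop(columns="nutrition"), nutrition], axis=1)

# Binary flags, using the thresholds stated in the text.
merged["simple"] = merged["n_ingredients"] <= 9
merged["easy"] = merged["n_steps"] <= 9
merged["healthy"] = (merged["saturated_fat"] / merged["calories"]) <= 0.10

# Hypothetical outlier cutoff; the text does not state the threshold used.
MAX_CALORIES = 2000
cleaned = merged[merged["calories"] <= MAX_CALORIES]
print(cleaned[["name", "average_rating", "simple", "easy", "healthy"]])
```

Because the join is a left join, the recipe with no reviews stays in the table with a missing average_rating, which is the missingness examined later in the report.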

Effect on Analysis:

Replacing invalid ratings and computing average ratings provides more accurate insights.
Splitting nutrition data facilitates focused analysis of individual nutritional factors.
Merging datasets offers a comprehensive view of recipes and user preferences.

Univariate Analysis

Plot 1: Distribution of Number of Ingredients

This histogram shows the typical number of ingredients per recipe.

Key Observations:
- A clear peak around 9–10 ingredients suggests moderate complexity.
- Very high-ingredient recipes (over 20) are rare.

Plot 2: Distribution of Calories in Recipes

This density-based histogram shows the calorie content across recipes.

Key Observations:
- The distribution is right-skewed, with many recipes under 300 calories.
- A tail exists for recipes with higher calories.

Bivariate Analysis

Scatter Plot: Calories vs. Average Rating

Highly rated recipes span a wide range of calorie values, which suggests that calorie count alone does not drive ratings.

Box Plot: Average Rating by Number of Steps

This box plot indicates that recipes with fewer preparation steps have more consistent average ratings, while recipes with more steps show greater variation in their average ratings.

Interesting Aggregates

Pivot Table: Calories and Ratings by Cooking Time (Minutes)

The table below shows how cooking time correlates with satisfaction and caloric content:

average_rating	calories
5	653.1
4.73875	195.705
4.72001	223.325
4.6944	253.525
4.53713	271.961

Assessment of Missingness

NMAR Analysis: Understanding Missingness in `average_rating`

We believe the average_rating field is likely NMAR (Not Missing At Random): whether a rating is missing plausibly depends on the unobserved rating itself, or on factors that are not captured in the dataset.

Why is `average_rating` NMAR?

User Behavior:
Some users may not rate recipes if they did not follow the recipe strictly or felt indifferent.
Nature of Missingness:
Such behaviors stem from user-specific preferences not reflected in the data.

How Could We Make It MAR?

Additional data (e.g., user interaction frequency, contextual details, engagement metrics) could link the missingness to observable factors, allowing it to be treated as MAR (Missing At Random).

Missingness Dependency: Investigating Relationships

Permutation tests were conducted to determine if the missingness of average_rating depends on other recipe features. For each test:

H₀ (Null Hypothesis): Missingness does not depend on the tested column.
Hₐ (Alternative Hypothesis): Missingness depends on the tested column.
Test Statistic: Difference in means of the column between recipes with missing and non-missing average_rating.
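
One way such a test could be implemented is sketched below with NumPy. The function name, the toy data, the 1,000 permutations, and the use of the absolute difference in means are assumptions made for illustration; the project's own code may differ.

```python
import numpy as np


def missingness_permutation_test(values, is_missing, n_permutations=1000, seed=0):
    """Test whether `values` differs between rows where another column
    is missing and rows where it is present."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_missing = np.asarray(is_missing, dtype=bool)

    def stat(labels):
        # Absolute difference in group means (assumed form of the statistic).
        return abs(values[labels].mean() - values[~labels].mean())

    observed = stat(is_missing)
    # Shuffling the missingness labels simulates the null hypothesis.
    permuted = np.array(
        [stat(rng.permutation(is_missing)) for _ in range(n_permutations)]
    )
    p_value = float(np.mean(permuted >= observed))
    return float(observed), p_value


# Toy data: calories are shifted upward for rows with a missing rating.
rng = np.random.default_rng(42)
calories = rng.normal(300, 50, 200)
rating_missing = np.zeros(200, dtype=bool)
rating_missing[:40] = True
calories[rating_missing] += 100

observed, p_value = missingness_permutation_test(calories, rating_missing)
print(f"observed statistic: {observed:.2f}, p-value: {p_value:.3f}")
```

A p-value printed as 0.000 from this kind of test means that none of the shuffled statistics reached the observed one, so it is better read as "smaller than 1 / n_permutations" than as exactly zero.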

Detailed Results

Missingness vs. calories
- Observed Test Statistic: 19.308042
- P-value: 0.000
Conclusion: Depends on calories.
Missingness vs. healthy
- Observed Test Statistic: 0.026784
- P-value: 0.004
Conclusion: Depends on healthy.
Missingness vs. simple
- Observed Test Statistic: 0.024634
- P-value: 0.026
Conclusion: Depends on simple.
Missingness vs. easy
- Observed Test Statistic: 0.073634
- P-value: 0.000
Conclusion: Depends on easy.
Missingness vs. sugar
- Observed Test Statistic: 9.587501
- P-value: 0.000
Conclusion: Depends on sugar.
Missingness vs. protein
- Observed Test Statistic: 0.993006
- P-value: 0.120
Conclusion: Does not depend on protein.
Missingness vs. sodium
- Observed Test Statistic: 0.828399
- P-value: 0.622
Conclusion: Does not depend on sodium.
Missingness vs. n_ingredients
- Observed Test Statistic: 0.259784
- P-value: 0.000
Conclusion: Depends on n_ingredients.
Missingness vs. n_steps
- Observed Test Statistic: 1.471825
- P-value: 0.000
Conclusion: Depends on n_steps.
Missingness vs. saturated_fat
- Observed Test Statistic: 3.117959
- P-value: 0.000
Conclusion: Depends on saturated_fat.

Summary Table

Tested Column	Observed Statistic	P-Value	Conclusion
`calories`	19.308042	0.000	Depends on `calories`
`healthy`	0.026784	0.004	Depends on `healthy`
`simple`	0.024634	0.026	Depends on `simple`
`easy`	0.073634	0.000	Depends on `easy`
`sugar`	9.587501	0.000	Depends on `sugar`
`protein`	0.993006	0.120	Does not depend on `protein`
`sodium`	0.828399	0.622	Does not depend on `sodium`
`n_ingredients`	0.259784	0.000	Depends on `n_ingredients`
`n_steps`	1.471825	0.000	Depends on `n_steps`
`saturated_fat`	3.117959	0.000	Depends on `saturated_fat`

Summary:
Missingness in average_rating is significantly associated with calories, healthy, simple, easy, sugar, n_ingredients, n_steps, and saturated_fat (p < 0.05) but not with protein or sodium.

Hypothesis Testing

Objective

Determine whether a recipe’s number of preparation steps affects its average rating.

Hypotheses

Null Hypothesis: The mean average rating of recipes does not depend on the number of steps (recipe complexity).
Alternative Hypothesis: The mean average rating of recipes does depend on the number of steps.

Test Statistic & Significance Level

The test statistic is the absolute difference in the mean average rating between:

Easy Recipes: n_steps ≤ 9.
Hard Recipes: n_steps > 9.

Significance Level: 0.05

Methodology

Calculate the observed test statistic (absolute difference in means).
Shuffle the group labels (≤9 steps vs. >9 steps) 1,000 times to simulate the null hypothesis.
Compute the p-value as the proportion of permuted statistics ≥ the observed statistic.
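
In symbols, the procedure above amounts to the following. The notation is ours: \bar{r} denotes the mean average rating within a group, T^{(b)} is the statistic computed on the b-th shuffle, and B is the number of shuffles.

```latex
T_{\text{obs}} = \left| \bar{r}_{\text{easy}} - \bar{r}_{\text{hard}} \right|,
\qquad
\hat{p} = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\!\left[ T^{(b)} \ge T_{\text{obs}} \right],
\qquad
B = 1000
```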

Results & Visualization

Observed Statistic: ~0.000787
P-Value: ~0.87

Conclusion

Since the p-value (~0.87) is greater than 0.05, we fail to reject the null hypothesis. We therefore have no evidence that the mean average rating differs between recipes with at most 9 steps and recipes with more than 9 steps.

Framing a Prediction Problem

Prediction Problem Statement

Predict the total calories (calories) of a recipe based on its characteristics:

Number of preparation steps (n_steps)
Number of ingredients (n_ingredients)
Protein content (protein)
Sodium content (sodium)
Saturated fat content (saturated_fat)
Sugar content (sugar)

Predicting calories helps understand how complexity and nutritional content contribute to a recipe’s caloric value, guiding recipe creation and dietary choices.

Type of Prediction Problem

This is a regression problem since the target variable (calories) is continuous (ranging from 58 to 1036).

Response Variable

Calories – Represents the nutritional content and indirectly the healthiness of the recipe.

Features Used for Prediction

n_ingredients: Recipe complexity in terms of ingredients.
n_steps: Recipe complexity in terms of preparation steps.
protein: Protein content.
sodium: Sodium content.
saturated_fat: Saturated fat content.
sugar: Sugar content.

All features are known at the time of prediction.

Evaluation Metric

We use:

Mean Absolute Error (MAE): Provides an average error in the same units as calories and is less sensitive to outliers than squared-error metrics such as RMSE.
R² (Coefficient of Determination): Indicates how well the model fits the data.
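
As a quick illustration of what the two metrics measure, the sketch below computes them by hand on four made-up calorie values. The helper names and numbers are illustrative only.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


y_true = [200.0, 300.0, 400.0, 500.0]  # made-up actual calories
y_pred = [210.0, 290.0, 420.0, 480.0]  # made-up predictions

mae = mean_absolute_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(f"MAE: {mae:.2f} calories")  # MAE: 15.00 calories
print(f"R^2: {r2:.2f}")            # R^2: 0.98
```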

Visual Exploration of the Problem

Distribution of Calories:
(Right-skewed distribution with most recipes having lower calories.)
Scatter Plot: n_ingredients vs. Calories:
(No clear relationship observed.)
Scatter Plot: n_steps vs. Calories:
(Most recipes have ≤40 steps; relationship is unclear.)
Scatter Plot: Protein vs. Calories:
(Positive relationship observed.)
Scatter Plot: Sodium vs. Calories:
(No clear relationship.)
Scatter Plot: Saturated Fat vs. Calories:
(Positive relationship observed.)
Scatter Plot: Sugar vs. Calories:
(Positive relationship observed.)

Importance

Predicting calories benefits:

Health Enthusiasts: For tailored dietary plans.
Cooks: To design recipes that meet specific nutritional requirements.
Recipe Insights: Understanding how complexity and nutrition combine to affect calorie content.

Baseline Model

The baseline model is a Linear Regression model built using a preprocessing pipeline.

Model Features

Numerical Features

n_steps
n_ingredients
protein
sodium
saturated_fat
sugar

(Standardization was not applied, since rescaling the features does not change the predictions of an unregularized linear regression.)

Categorical Feature

ingredients: Processed via OneHotEncoder.
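
A baseline of this shape could be assembled with scikit-learn roughly as shown below. This is a sketch on synthetic data: the data-generating formula, the train/test split, and the `handle_unknown="ignore"` setting are assumptions, and the real ingredients column holds far more distinct values than the three used here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

NUMERIC = ["n_steps", "n_ingredients", "protein", "sodium", "saturated_fat", "sugar"]
CATEGORICAL = ["ingredients"]

# Synthetic stand-in for the cleaned recipes table (values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "n_steps": rng.integers(1, 30, n),
    "n_ingredients": rng.integers(2, 20, n),
    "protein": rng.uniform(0, 80, n),
    "sodium": rng.uniform(0, 60, n),
    "saturated_fat": rng.uniform(0, 90, n),
    "sugar": rng.uniform(0, 100, n),
    "ingredients": rng.choice(
        ["['flour', 'sugar']", "['rice']", "['beef', 'salt']"], n
    ),
})
df["calories"] = (
    50 + 3 * df["protein"] + 3 * df["saturated_fat"] + df["sugar"]
    + rng.normal(0, 20, n)
)

# Numeric columns pass through unchanged; ingredients is one-hot encoded.
preprocess = ColumnTransformer([
    ("num", "passthrough", NUMERIC),
    ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
])
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    df[NUMERIC + CATEGORICAL], df["calories"], test_size=0.25, random_state=0
)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MAE: {mae:.2f}, R^2: {r2:.3f}")
```

Setting `handle_unknown="ignore"` keeps prediction from failing when the test set contains an ingredients value that was not seen during training.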

Model Performance

Metrics:

R²: 0.7625
MAE: 76.7677

Coefficients:

n_steps: 0.2255
n_ingredients: 4.2917
protein: 3.4293
sodium: 0.1105
saturated_fat: 3.0411
sugar: 0.9368
ingredients: 0.0000

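Among the numerical features, n_ingredients, protein, and saturated_fat have the largest coefficients, which suggests they contribute most to predicted calories. Because the features are not standardized, coefficient sizes depend on each feature’s units and should be compared with care.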

Model’s Strengths and Weaknesses

Strengths:

Reproducible and scalable.
Effective handling of categorical data via one-hot encoding.
Interpretability of coefficients.

Weaknesses:

Sensitive to outliers.
Assumes linear relationships, which may not hold for all predictors.

Final Model

Added Features

complexity: n_steps + n_ingredients
(Captures overall recipe complexity.)
bad_nutrition: sugar + saturated_fat + sodium
(Captures cumulative negative nutritional impact.)

Modeling Algorithm

A Random Forest Regressor was used because it:

Handles both numerical and encoded categorical features.
Captures non-linear relationships.
Allows hyperparameter tuning.

Hyperparameter Tuning

Using GridSearchCV (3-fold CV), the best hyperparameters were:

max_depth: 20
min_samples_split: 10
n_estimators: 200
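
The engineered features and the grid search could be combined as in the sketch below. Only the winning values (max_depth 20, min_samples_split 10, n_estimators 200) come from the results above; the rest of the grid, the MAE-based scoring, and the synthetic data are assumptions, and the categorical ingredients feature is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training features (values are made up).
rng = np.random.default_rng(1)
n = 120
X = pd.DataFrame({
    "n_steps": rng.integers(1, 30, n),
    "n_ingredients": rng.integers(2, 20, n),
    "protein": rng.uniform(0, 80, n),
    "sodium": rng.uniform(0, 60, n),
    "saturated_fat": rng.uniform(0, 90, n),
    "sugar": rng.uniform(0, 100, n),
})
y = (
    50 + 3 * X["protein"] + 0.03 * X["saturated_fat"] ** 2 + X["sugar"]
    + rng.normal(0, 20, n)
)

# Engineered features as defined in the text.
X["complexity"] = X["n_steps"] + X["n_ingredients"]
X["bad_nutrition"] = X["sugar"] + X["saturated_fat"] + X["sodium"]

# Hypothetical search grid; it merely contains the reported best values.
param_grid = {
    "max_depth": [10, 20],
    "min_samples_split": [2, 10],
    "n_estimators": [100, 200],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,  # 3-fold cross-validation, as in the text
    scoring="neg_mean_absolute_error",  # assumed scoring criterion
)
search.fit(X, y)
print(search.best_params_)
print(f"cross-validated MAE: {-search.best_score_:.2f}")
```

On this toy data the selected parameters will not necessarily match the ones reported above, since the best combination depends on the data.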

Performance Comparison

Baseline Model:

MAE: 76.7585
R²: 0.7634

Final Model:

MAE: 67.0454
R²: 0.8076

Conclusion

The final model’s lower MAE and higher R² indicate improved accuracy over the baseline. The improvement likely reflects both the new features and the Random Forest Regressor’s ability to model non-linear relationships.

Fairness Analysis

A fairness analysis was conducted by comparing model performance (MAE) between two groups:

Simple Recipes: ≤ 9 ingredients.
Complex Recipes: > 9 ingredients.

Hypotheses

Null Hypothesis (H₀): The model’s MAE does not differ significantly between Simple and Complex Recipes.
Alternative Hypothesis (Hₐ): The model’s MAE differs significantly between the two groups.

Test Statistic

The absolute difference in MAE between the two groups.

Significance Level

α = 0.05
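
Since a group's MAE is the mean of its per-recipe absolute errors, the test can be run by shuffling group labels over those errors. The sketch below shows one possible implementation; the function name, the 1,000 permutations, and the toy predictions (deliberately noisier for complex recipes) are assumptions.

```python
import numpy as np


def mae_gap_permutation_test(y_true, y_pred, is_simple, n_permutations=1000, seed=0):
    """Permutation test on the absolute difference in MAE between two groups."""
    rng = np.random.default_rng(seed)
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    is_simple = np.asarray(is_simple, dtype=bool)

    def gap(labels):
        # MAE of each group is the mean of that group's absolute errors.
        return abs(abs_err[labels].mean() - abs_err[~labels].mean())

    observed = gap(is_simple)
    permuted = np.array(
        [gap(rng.permutation(is_simple)) for _ in range(n_permutations)]
    )
    return float(observed), float(np.mean(permuted >= observed))


# Toy data: predictions are noisier for complex recipes than for simple ones.
rng = np.random.default_rng(7)
is_simple = np.array([True] * 150 + [False] * 150)
y_true = rng.uniform(58, 1036, 300)
noise_scale = np.where(is_simple, 10.0, 40.0)
y_pred = y_true + rng.normal(0, 1, 300) * noise_scale

observed_gap, p_value = mae_gap_permutation_test(y_true, y_pred, is_simple)
print(f"observed MAE gap: {observed_gap:.2f}, p-value: {p_value:.3f}")
```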

Results

Observed MAE Difference: ~0.0031
P-Value: 0.0

Since the permutation test’s p-value (0.0) is below the 0.05 significance level, we reject H₀: the observed difference in MAE is statistically significant, so the model performs slightly better for one group of recipes than for the other. This indicates potential unfairness between Simple and Complex Recipes that warrants further investigation.

The histogram below illustrates the distribution of permutation test statistics under the null hypothesis, with the observed difference marked by the red dashed line.
