Foood Space

Using open data to identify the drivers of success for New York City restaurants


Insights for restaurateurs

Opening a restaurant in NYC requires making choices that directly impact the chances of success. In the course of pinning down a cuisine, location, or menu price, you might ask:

  • Just how valuable are star ratings?
  • What price level is best for my establishment?
  • Am I better off on an avenue or a street?

A data science approach can help predict how these choices could impact your business.


Based on data from 7,900+ restaurants located in NYC.

We source data via the Yelp Fusion API and extract features on 7,931 restaurants across 141 cuisines and establishment types, focusing on restaurants in Manhattan that were active in early 2018.

Map legend (primary cuisine): Italian, American (New), American (Trad.), Japanese, Mexican, Chinese, Mediterranean, French, Indian, Thai, Asian Fusion, Latin American, Korean, Middle Eastern, Spanish, Barbeque.
NYC restaurant data. This map plots a sample of the data set. Each circle represents a restaurant and is colored according to the primary type of food served. The size of the circle corresponds to the value of the restaurant's revenue proxy.

Restaurant revenue model

We use restaurant features—including average rating, cuisines served, latitude, longitude, menu price, and opening hours—to predict a restaurant's revenues. To develop a proxy for revenues, we multiply (1) the number of reviews a restaurant has received by (2) the average price of a meal at that restaurant.
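As a minimal sketch, the revenue proxy described above could be computed as follows. The `Restaurant` record, its field names, and the sample values are hypothetical, and the sketch assumes an average meal price in dollars is already available for each restaurant.

```python
from dataclasses import dataclass


@dataclass
class Restaurant:
    """Hypothetical record; field names are illustrative, not from the source."""
    name: str
    review_count: int       # number of reviews received
    avg_meal_price: float   # average price of a meal, in dollars


def revenue_proxy(restaurant: Restaurant) -> float:
    """Revenue proxy in review-dollars: review count times average meal price."""
    return restaurant.review_count * restaurant.avg_meal_price


# Made-up example values, for illustration only.
example = Restaurant(name="Example Trattoria", review_count=240, avg_meal_price=30.0)
proxy = revenue_proxy(example)
print(f"{example.name}: {proxy:,.0f} review-dollars")
```

Because the proxy multiplies a count by a price, its unit is review-dollars rather than dollars, so it is best read as a relative measure across restaurants.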

The model takes an ensemble learning approach: it first fits linear relationships using ridge regression and then models the residuals using a random forest. Five-fold cross-validation tunes the ensemble's hyperparameters to a ridge \(\alpha\) of 0.037, a forest of 200 decision trees, and 20 features per tree.
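The two-stage fit can be sketched with scikit-learn as shown below. This is an illustration under stated assumptions, not the project's code: the class name `RidgeThenForest` and the synthetic data are hypothetical, and "20 features per tree" is mapped to scikit-learn's `max_features`, which limits the features considered at each split rather than once per tree. Cross-validated tuning is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge


class RidgeThenForest:
    """Ridge regression for the linear part, random forest on its residuals."""

    def __init__(self, alpha=0.037, n_estimators=200, max_features=20, random_state=0):
        self.ridge = Ridge(alpha=alpha)
        # Assumption: "20 features per tree" is approximated by max_features,
        # which scikit-learn applies at each split.
        self.forest = RandomForestRegressor(
            n_estimators=n_estimators,
            max_features=max_features,
            random_state=random_state,
        )

    def fit(self, X, y):
        self.ridge.fit(X, y)
        residuals = y - self.ridge.predict(X)   # what the linear fit missed
        self.forest.fit(X, residuals)
        return self

    def predict(self, X):
        # Final prediction = linear part + forest's estimate of the residual.
        return self.ridge.predict(X) + self.forest.predict(X)


# Synthetic stand-in data (not the restaurant data set): 25 features with one
# linear signal and one nonlinear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 25))
y = 3.0 * X[:, 0] + 2.0 * np.sin(2.0 * X[:, 1]) + rng.normal(scale=0.1, size=300)

model = RidgeThenForest().fit(X, y)
ridge_mse = float(np.mean((y - model.ridge.predict(X)) ** 2))
ensemble_mse = float(np.mean((y - model.predict(X)) ** 2))
print(f"in-sample MSE: ridge only {ridge_mse:.3f}, ridge + forest {ensemble_mse:.3f}")
```

The errors printed here are in-sample and serve only to show the residual stage at work; judging real predictive quality would require held-out data.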

Predicted revenue.
Min.    1st Qu.    Median    Mean     3rd Qu.    Max.
10      700        2,600     7,021    7,200      269,900
Predicted revenue proxy vs. observed revenue proxy. The observed revenue proxy is calculated as the product of the number of reviews a restaurant has received and the average price of a meal at that restaurant. The unit of proxy revenue is review-dollars.

When trained on the full data set, the model achieves the prediction quality summarized below.

Prediction quality across two levels

The model achieves 80% precision and 97% recall for restaurants with above-median proxy revenues.

Two-way confusion matrix
Prediction quality over two levels. Predictions are classified into the upper and lower halves of revenue proxy values across 7,931 restaurants.
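For readers who want to reproduce this kind of two-level summary, here is a minimal sketch of how precision and recall for the above-median class could be computed. The sample values are made up, and applying the median of the observed values as the single cutoff for both observed and predicted proxies is an assumption; the source does not state how the cutoff is chosen.

```python
from statistics import median


def precision_recall(actual, predicted):
    """Precision and recall for boolean labels (True = above-median proxy)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up proxy values in review-dollars, for illustration only.
observed = [100, 400, 900, 2500, 6000, 12000]
predicted = [150, 1800, 700, 2000, 5000, 9000]

# Assumption: one cutoff, the median of the observed values, labels both lists.
cutoff = median(observed)
actual_high = [v > cutoff for v in observed]
predicted_high = [v > cutoff for v in predicted]

precision, recall = precision_recall(actual_high, predicted_high)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Precision answers "of the restaurants predicted to be above the median, how many were?", while recall answers "of the restaurants truly above the median, how many did the model find?".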

Prediction quality across four levels

The model achieves 79% precision and 93% recall for restaurants with top-quartile proxy revenues.

Four-way confusion matrix
Prediction quality over four levels. Predictions are classified by quartile of revenue proxy values across 7,931 restaurants.
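A four-level confusion matrix of this kind can be built by binning observed and predicted values into quartiles and counting the pairs. The sketch below uses made-up numbers and assumes the quartile cut points come from the observed values; the helper names are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def quartile_bins(values, cuts):
    """Map each value to a quartile index from 0 (lowest) to 3 (highest)."""
    return [bisect_right(cuts, v) for v in values]


def confusion_matrix(actual, predicted, n_levels=4):
    """matrix[i][j] counts items with actual level i and predicted level j."""
    matrix = [[0] * n_levels for _ in range(n_levels)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Made-up proxy values, for illustration only.
observed = [10, 20, 30, 40, 50, 60, 70, 80]
predicted = [15, 35, 25, 42, 48, 90, 75, 66]

# Assumption: the three cut points are the quartiles of the observed values.
cuts = quantiles(observed, n=4)
matrix = confusion_matrix(quartile_bins(observed, cuts), quartile_bins(predicted, cuts))

top = 3  # index of the top quartile
predicted_top = sum(row[top] for row in matrix)   # column total
actual_top = sum(matrix[top])                     # row total
top_precision = matrix[top][top] / predicted_top
top_recall = matrix[top][top] / actual_top

for row in matrix:
    print(row)
print(f"top-quartile precision={top_precision:.2f}, recall={top_recall:.2f}")
```

Rows index the observed quartile and columns the predicted quartile, so correct predictions fall on the diagonal.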

Dollar-value of a rating star

While a review with more stars suggests a superior dining experience, the restaurant revenue model suggests that the dollar value of an additional star becomes meaningful only once a restaurant has already earned three stars.

Revenue-value of a rating star.
Impact of star ratings on predicted revenue. Predictions are conditioned on Italian restaurants in Hell's Kitchen.
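Conditioning a prediction, as in the caption above, amounts to holding every other feature fixed and varying one. A minimal sketch of that what-if sweep follows. The helper `sweep_feature`, the feature names, and `toy_predict` are all hypothetical; `toy_predict` is an arbitrary placeholder for a trained model, and its outputs say nothing about real restaurants.

```python
def sweep_feature(predict, baseline, feature, values):
    """Predict for copies of `baseline` that differ only in `feature`."""
    results = []
    for value in values:
        scenario = {**baseline, feature: value}  # copy; baseline stays unchanged
        results.append((value, predict(scenario)))
    return results


def toy_predict(restaurant):
    """Arbitrary placeholder standing in for a trained model's predict step."""
    return 100.0 * restaurant["rating"] + 50.0 * restaurant["price_tier"]


# Hypothetical baseline: the features held fixed while the rating varies.
baseline = {
    "cuisine": "Italian",
    "neighborhood": "Hell's Kitchen",
    "price_tier": 2,
    "rating": 3.0,
}

curve = sweep_feature(toy_predict, baseline, "rating", [3.0, 3.5, 4.0])
for rating, predicted in curve:
    print(f"rating={rating:.1f} -> predicted proxy {predicted:.0f}")
```

The same helper could vary the price tier or an on-avenue flag instead of the rating, with the remaining features held at the chosen baseline.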

Pivoting on price

While sliding up the menu-price scale generally correlates with higher proxy revenues, the marginal impact of moving up a price tier varies considerably by cuisines served.

Pivoting on price.
Impact of sliding along the price scale on predicted revenue. Predictions are conditioned on a median-latitude location in Hell's Kitchen.

On- vs. off-avenue locations

On-avenue restaurant locations provide exposure to transit patterns that differ significantly from those of off-avenue locations. While avenues tend to be busier than streets, the model suggests that off-avenue locations could command a premium in many neighborhoods.

On- vs. off-avenue locations.
Impact of on- vs. off-avenue location on predicted revenue. Predictions are conditioned on three-dollar-sign restaurants serving Italian food.

Restaurant feature recommender (prototype)

The recommender utility summarizes the output of a model that uses tensor-matrix factorization to decompose a three-dimensional array of average restaurant scores, indexed by cuisine, location, and menu price. This approach allows for both data compression and interpolation across missing values.
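The source does not name the factorization algorithm. One standard way to split a three-way array into a small core tensor and three factor matrices is a truncated higher-order SVD (a Tucker-style decomposition), which the NumPy sketch below uses as a stand-in. It assumes a fully observed array, so it illustrates the compression step only and not the interpolation across missing values; the array sizes, ranks, and function names are hypothetical.

```python
import numpy as np


def unfold(tensor, mode):
    """Matricize: bring axis `mode` to the front and flatten the other axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_dot(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, -1)
    return np.moveaxis(moved @ matrix.T, -1, mode)


def truncated_hosvd(tensor, ranks):
    """Return a core tensor and one factor matrix per axis."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])             # leading left singular vectors
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_dot(core, u.T, mode)        # project onto each factor basis
    return core, factors


def reconstruct(core, factors):
    """Expand the core back to the full array shape."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_dot(out, u, mode)
    return out


rng = np.random.default_rng(0)
# Synthetic array (cuisine x location x price), built to have exact low
# multilinear rank so that compression is lossless in this toy case.
true_core = rng.normal(size=(2, 2, 2))
true_factors = [rng.normal(size=(n, 2)) for n in (6, 5, 4)]
scores = reconstruct(true_core, true_factors)

core, factors = truncated_hosvd(scores, ranks=(2, 2, 2))
approx = reconstruct(core, factors)
print("core shape:", core.shape)
print("max abs error:", float(np.max(np.abs(scores - approx))))
```

In this toy case the 120 entries of the array are stored as an 8-entry core plus three thin matrices. Real score arrays are not exactly low rank and contain gaps, so a practical version would need a fitting procedure that skips missing entries.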

The model decomposes the original three-dimensional array into a small rank-3 tensor and three low-dimensional matrices. The restaurant score is a real number between 0 and 5 that is determined by the function \[ f(r, v; c) = r \left(1 - \frac{1}{cv + 1}\right), \] where \(r\) is the restaurant's average rating on a 0-to-5 scale, \(v\) is the number of reviews (or votes) a restaurant has received, and \(c\) is a parameter that we calibrate so that a restaurant with the median number of reviews receives a score of \(f(r, v; c) = 0.5\,r\). Solving this condition gives \(c = 1/\tilde{v}\), where \(\tilde{v}\) is the median review count.
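The score function and its calibration can be sketched in a few lines. Setting the score to half the rating at the median review count requires \(1/(cv + 1) = 0.5\), so \(c\) is the reciprocal of the median. The function names and the median of 120 reviews below are hypothetical.

```python
def calibrate_c(median_reviews: float) -> float:
    """Choose c so that a restaurant with the median review count scores 0.5 * r."""
    if median_reviews <= 0:
        raise ValueError("median_reviews must be positive")
    return 1.0 / median_reviews


def restaurant_score(rating: float, reviews: int, c: float) -> float:
    """f(r, v; c) = r * (1 - 1 / (c * v + 1)); approaches r as reviews grow."""
    return rating * (1.0 - 1.0 / (c * reviews + 1.0))


# Hypothetical median review count, for illustration only.
median_reviews = 120
c = calibrate_c(median_reviews)

for reviews in (0, 120, 1200):
    score = restaurant_score(4.0, reviews, c)
    print(f"rating=4.0, reviews={reviews}: score={score:.3f}")
```

The score discounts ratings that rest on few reviews: a restaurant with no reviews scores 0, and the score approaches the raw rating only as the review count grows well past the median.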

In the table below, Score and Fit correspond to the observed and estimated restaurant scores, respectively. A higher Fit is better.