IBM Data Science Capstone Final Project

Oct 6, 2019

Find the most suitable location to open a new South Indian restaurant in Toronto, Canada

Introduction

For this project, I assume a hypothetical scenario: an Indian entrepreneur based in Toronto, Canada, wants to explore opening a reliable South Indian restaurant in the city. The premise is that Toronto may not have enough Indian restaurants, which could be a great opportunity for this entrepreneur. The entrepreneur wants to open the restaurant near other Indian restaurants, because people who like Indian food may also like South Indian food. Choosing the location is the most important decision the entrepreneur has to make, and this project is about helping him find the optimal one.

Business Problem

The objective of this project is to find the most suitable location for the entrepreneur to open a new South Indian restaurant in Toronto, Canada. Using data analysis and machine learning techniques such as clustering, this project aims to answer the business question: if an entrepreneur wants to open a new South Indian restaurant in Toronto, where should he consider opening it?

Target Audience

The entrepreneur who wants to find a location to open a reliable South Indian restaurant.

Data

To solve the problem, we need the following data:

List of neighborhoods in Toronto, Canada.
Latitude and longitude of each neighborhood.
Venue data related to Indian restaurants. This will help us find the neighborhoods that are most suitable for opening a South Indian restaurant.

Extracting the data

Scraping Toronto neighborhood data from Wikipedia.
Getting the latitude and longitude of each neighborhood using the Geocoder package.
Using the Foursquare API to get the venues near each neighborhood.
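
As an illustration of the Foursquare step, here is a minimal sketch of how the request for one neighborhood could be built. It assumes the legacy v2 "venues/explore" endpoint and its client_id / client_secret parameters; the function name, the version date, and the placeholder credentials are my own assumptions and not taken from the original notebook. The radius (500 m) and limit (100 venues) are the values described in the Methodology section. The sketch only builds the URL and sends no request.

```python
from urllib.parse import urlencode

# Assumption: legacy Foursquare v2 endpoint; newer API versions use different URLs.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Return the request URL for venues around one (lat, lng) point."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date (assumed value)
        "ll": f"{lat},{lng}",            # neighborhood coordinates
        "radius": radius,                # search radius in meters
        "limit": limit,                  # maximum number of venues returned
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Illustrative coordinates for central Toronto.
url = build_explore_url(43.6532, -79.3832, "YOUR_ID", "YOUR_SECRET")
print(url)
```

The resulting URL could then be fetched with any HTTP client, once per neighborhood, and the JSON response parsed for venue names, categories, and coordinates.
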

Methodology

First, I need to get the list of neighborhoods in Toronto, Canada. This is possible by extracting the list of neighborhoods from the Wikipedia page (“ https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M ”). I did the web scraping with the pandas HTML table scraping method, as it is an easy and convenient way to pull tabular data directly from a web page into a data frame. However, the page only provides neighborhood names and postal codes. I also need the coordinates of each neighborhood so that I can use Foursquare to pull the list of venues nearby. To get the coordinates, I can either use the Geocoder package or simply use the CSV file provided by Coursera. After gathering the coordinates, I plotted them on a map of Toronto using the Folium package to verify that they are correct.

Next, I use the Foursquare API to pull the top 100 venues within a 500-meter radius of each neighborhood. I had created a Foursquare developer account earlier in order to obtain an account ID and API key to pull the data. From Foursquare, I am able to pull the names, categories, latitude, and longitude of the venues. With this data, I can also check how many unique venue categories there are. Then, I analyze each neighborhood by grouping the rows by neighborhood and taking the mean of the frequency of occurrence of each venue category. This prepares the data for the clustering done later. Here, I chose to look specifically for “Indian restaurants” because searching for “South Indian Restaurants” returned too few results.

Lastly, I performed the clustering using k-means. The k-means algorithm identifies k centroids and assigns every data point to the nearest one, while keeping the distances between the points and their centroid as small as possible. It is one of the simplest and most popular unsupervised machine learning algorithms, and it is well suited to this project.
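
The grouping step described above (the mean frequency of each venue category per neighborhood) can be sketched as follows. This is a dependency-free illustration, not the notebook's code: the neighborhood names and venue rows are made up, and the function name is hypothetical. With pandas, the same result would typically come from one-hot encoding the category column and taking a group-by mean, since the mean of a 0/1 column is simply the share of venues in that category.

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, venue category) rows, one per venue found.
venues = [
    ("Neighborhood A", "Indian Restaurant"),
    ("Neighborhood A", "Coffee Shop"),
    ("Neighborhood A", "Coffee Shop"),
    ("Neighborhood A", "Park"),
    ("Neighborhood B", "Coffee Shop"),
    ("Neighborhood B", "Bakery"),
]


def category_frequency(rows):
    """Map each neighborhood to {category: share of that neighborhood's venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(c.values()) for cat, n in c.items()}
        for hood, c in counts.items()
    }


freq = category_frequency(venues)

# Keep only the feature used for clustering: the share of Indian restaurants.
indian_share = {hood: f.get("Indian Restaurant", 0.0) for hood, f in freq.items()}
print(indian_share)
```

Each neighborhood ends up with a single number between 0 and 1, which is the value the clustering step works on.
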
I clustered the neighborhoods in Toronto into 4 clusters based on the frequency of occurrence of “Indian food”. Based on the results (the concentration of clusters), I am able to recommend the ideal location to open the restaurant.
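
To make the clustering step concrete, here is a minimal pure-Python sketch of k-means on a single feature, assuming each neighborhood is described only by its share of Indian restaurants. The input values are invented for illustration, and the evenly spaced initialization is my own simplification; a library implementation such as scikit-learn's KMeans would normally be used instead and may number the clusters differently.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalars (k >= 2); returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start: k evenly spaced centroids between min and max.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:  # no centroid moved, so the result is stable
            break
        centroids = updated
    return centroids, labels


# Hypothetical Indian-restaurant shares for eight neighborhoods.
shares = [0.0, 0.0, 0.03, 0.04, 0.08, 0.09, 0.14, 0.15]
centroids, labels = kmeans_1d(shares, k=4)
print(centroids, labels)
```

Alternating the assignment and update steps reduces the total squared distance between points and their centroids, which is the quantity k-means tries to keep small.
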

Observations

Map of Toronto

Clusters

The results from k-means clustering show that we can categorize Toronto neighborhoods into 4 clusters based on how many Indian restaurants are in each neighborhood:

● Cluster 0: Neighborhoods with no Indian restaurants.

● Cluster 1: Neighborhoods with few Indian restaurants.

● Cluster 2: Neighborhoods with few or no Indian restaurants.

● Cluster 3: Neighborhoods with a high number of Indian restaurants.

The results are visualized in the map above, with Cluster 0 in red, Cluster 1 in purple, Cluster 2 in sky blue, and Cluster 3 in lemon yellow.

Bar Graph

The bar graph is a pictorial representation of the points above. We can clearly see that Cluster 3 has the highest number of Indian restaurants, while Cluster 0 has almost none.

Observations And Recommendation

Most of the Indian restaurants are in Cluster 3, which is around The Annex, North Midtown, and Yorkville. The lowest numbers (close to zero) are in the Cluster 0 and Cluster 2 areas, which cover Downtown Toronto and East, West, and Central Toronto. There are therefore good opportunities to open near Downtown, East, and West Toronto, as the competition there seems to be low. Looking at nearby venues, Cluster 0 and Cluster 2 might be good locations, as there are not a lot of Asian restaurants in these areas. This project therefore recommends that the entrepreneur open a reliable South Indian restaurant in these locations, where there is little to no competition. Nonetheless, if the food is delicious, reliable, and affordable, I am confident that it will be in great demand anywhere!

Conclusion

In this project, we went through the process of identifying the business problem, specifying the data required, extracting and preparing the data, applying machine learning in the form of k-means clustering, and providing a recommendation to the stakeholder.

References

List of neighborhoods in Toronto: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Foursquare Developer Documentation: https://developer.foursquare.com/docs

Latitude and longitude of postal codes: http://cocl.us/Geospatial_data