Machine Learning: Using Python to Find Your Next Business Opportunity¶
Inspired by the course "Advanced Python Projects: Build AI Applications" by Priya Mohan on LinkedIn Learning, this project explores advanced AI concepts, techniques, and applications in Python, with the goal of strengthening our portfolio and deepening our understanding of AI development.
Mohan, Priya. "Advanced Python Projects: Build AI Applications." LinkedIn Learning, 2024.
In [2]:
!pip install pandas scikit-learn matplotlib
!pip install alpha_vantage pandas
Requirement already satisfied: pandas, scikit-learn, matplotlib, alpha_vantage, and their dependencies in c:\users\nayel\anaconda3\lib\site-packages
In [3]:
# Import all required libraries and classes
import pandas as pd
import re
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import warnings
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
In [4]:
# Data Ingestion and Analysis
df= pd.read_excel('Coffee_shop_data.xlsx')
population = pd.read_csv('population.csv', skiprows=[0])
In [5]:
population.head()
Out[5]:
  | Geography | Label for GEO_ID | Race/Ethnic Group | Population Groups | Total | Errata of Total | Total!!1-person household | Total!!2-person household | Total!!3-person household | Total!!4-person household | Total!!5-person household | Total!!6-person household | Total!!7-or-more-person household |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8600000US89010 | ZCTA5 89010 | 1 | Total population | 172 | NaN | 51 | 69 | 19 | 12 | 12 | 4 | 5 |
1 | 8600000US89019 | ZCTA5 89019 | 1 | Total population | 1164 | NaN | 412 | 421 | 134 | 83 | 57 | 32 | 25 |
2 | 8600000US89060 | ZCTA5 89060 | 1 | Total population | 4144 | NaN | 1106 | 1714 | 550 | 361 | 222 | 95 | 96 |
3 | 8600000US89061 | ZCTA5 89061 | 1 | Total population | 2109 | NaN | 469 | 936 | 283 | 206 | 112 | 62 | 41 |
4 | 8600000US89439 | ZCTA5 89439 | 1 | Total population | 671 | NaN | 193 | 314 | 88 | 55 | 17 | 2 | 2 |
In [6]:
df.head()
Out[6]:
  | ID no. | Business Name | Street address | City | State | Zip Code | Phone | Rating | Gender majority | Median Salary | Latte Price |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Brew Haven | 8 Old Shore Place | Oakland | California | 94616 | 415-810-4769 | 4.5 | Male | 72463 | 3.31000 |
1 | 2 | Bean Bliss | 6650 Clarendon Crossing | Stockton | California | 95210 | 209-701-1665 | 5.0 | Female | 87117 | 5.35000 |
2 | 3 | Caffeine Cove | 7281 Buell Road | Fresno | California | 93773 | 559-137-3554 | 4.7 | Male | 86394 | 4.15000 |
3 | 4 | Mug Magic | 670 Jackson Avenue | Torrance | California | 90510 | 818-789-5573 | 4.3 | Male | 88343 | 5.34000 |
4 | 5 | Daily Grind | 37 Ludington Terrace | San Diego | California | 92196 | 619-354-2389 | 1.6 | Male | 77795 | 4.51175 |
In [7]:
# check for data info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   ID no.           1000 non-null   int64
 1   Business Name    1000 non-null   object
 2   Street address   1000 non-null   object
 3   City             1000 non-null   object
 4   State            1000 non-null   object
 5   Zip Code         1000 non-null   int64
 6   Phone            1000 non-null   object
 7   Rating           1000 non-null   float64
 8   Gender majority  1000 non-null   object
 9   Median Salary    1000 non-null   int64
 10  Latte Price      1000 non-null   float64
dtypes: float64(2), int64(3), object(6)
memory usage: 86.1+ KB
In [8]:
df.shape
Out[8]:
(1000, 11)
In [9]:
population.shape
Out[9]:
(1705, 13)
In [10]:
# Get basic stats about the data
df.describe()
Out[10]:
  | ID no. | Zip Code | Rating | Median Salary | Latte Price |
---|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 500.500000 | 92976.163000 | 3.784600 | 81182.842000 | 5.061491 |
std | 288.819436 | 1706.943177 | 1.150717 | 5142.670356 | 0.352002 |
min | 1.000000 | 90005.000000 | 1.000000 | 72001.000000 | 3.090000 |
25% | 250.750000 | 91751.750000 | 3.200000 | 76776.000000 | 4.830175 |
50% | 500.500000 | 92883.000000 | 4.240000 | 81113.000000 | 5.062450 |
75% | 750.250000 | 94257.000000 | 4.580000 | 85684.000000 | 5.310025 |
max | 1000.000000 | 96154.000000 | 5.000000 | 89978.000000 | 6.480000 |
In [11]:
ax=df['City'].value_counts().head(5).plot(kind='bar')
ax.set_title('Top 5 cities with most coffee shops')
plt.show()
In [12]:
# The top 10 most famous brands
ax=df['Business Name'].value_counts().head(10).plot(kind='bar')
ax.set_title('Top 10 most famous brands')
plt.show()
Data Preprocessing¶
In [13]:
# If there were null values we would impute them, replacing missing entries with the mode (the most frequently occurring value)
df.isna().sum()
Out[13]:
ID no.             0
Business Name      0
Street address     0
City               0
State              0
Zip Code           0
Phone              0
Rating             0
Gender majority    0
Median Salary      0
Latte Price        0
dtype: int64
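No missing values turn up here, but if any did, a mode-based fill could look like the minimal sketch below (assuming df is the coffee-shop frame loaded above; on clean data this cell does nothing):

# Sketch: fill any missing values with each column's mode (most frequent value).
for col in df.columns:
    if df[col].isna().any():
        df[col] = df[col].fillna(df[col].mode().iloc[0])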
In [14]:
# Convert Zip Code to a string so it can be joined with the population data's zip codes
df['Zip Code']=df['Zip Code'].astype(str)
In [15]:
# Extract the zip code from the population data:
# take the last five digits of the Geography geocode and store them in a new 'Zip Code' column
def find_zip_code(geocode):
    pattern = r'\d{5}$'
    match = re.search(pattern, geocode)
    if match:
        zip_code = match.group(0)
        return zip_code
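As a quick sanity check, the function can be exercised on sample geocodes (illustrative inputs, not rows from the dataset):

# The regex grabs the trailing five digits; inputs without them return None.
print(find_zip_code('8600000US89010'))      # 89010
print(find_zip_code('no trailing digits'))  # None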
In [16]:
# The actual conversion: apply the function above to create the new column
population['Zip Code']=population['Geography'].apply(find_zip_code)
In [17]:
# Merge the population data in via zip code; population is an important feature for determining price and location
cafe_data=df.copy()
# Note that the dataset shrinks after the join: only zip codes present in both frames are kept
df=pd.merge(cafe_data,population)
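pd.merge with no key arguments joins on all shared column names and defaults to an inner join; since 'Zip Code' is the only column the two frames share, the call above is equivalent to the more explicit sketch below:

# Same result, with the join key and join type spelled out.
df = pd.merge(cafe_data, population, on='Zip Code', how='inner')
print(df.shape)  # fewer than 1000 rows: zip codes without population records are dropped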
In [18]:
# From the population dataset, keep only the 'Total' column alongside the original cafe columns
columns=cafe_data.columns.values.tolist()+['Total']
df=df[columns]
#rename Total to population
df=df.rename(columns={"Total":"Population"})
In [19]:
df
Out[19]:
  | ID no. | Business Name | Street address | City | State | Zip Code | Phone | Rating | Gender majority | Median Salary | Latte Price | Population |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | Bean Bliss | 6650 Clarendon Crossing | Stockton | California | 95210 | 209-701-1665 | 5.00 | Female | 87117 | 5.35000 | 11180 |
1 | 6 | Espresso Elegance | 8427 Atwood Road | Visalia | California | 93291 | 559-929-4731 | 1.20 | Female | 78753 | 5.31765 | 15310 |
2 | 10 | Perk Palace | 7 David Junction | Fresno | California | 93726 | 559-323-2365 | 4.20 | Female | 80973 | 5.30665 | 13942 |
3 | 11 | The Coffee Cartel | 6018 Rockefeller Center | Orange | California | 92867 | 714-292-8390 | 1.90 | Male | 74340 | 4.43900 | 13284 |
4 | 14 | Grindhouse Cafe | 7846 Cherokee Junction | Visalia | California | 93291 | 559-253-9426 | 4.50 | Male | 78753 | 5.27165 | 15310 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
407 | 982 | Perk Paradise | 4 Park Meadow Crossing | Van Nuys | California | 91411 | 818-584-8823 | 4.00 | Female | 76034 | 4.99170 | 9177 |
408 | 983 | Urban Brewtropolis | 2459 Golden Leaf Place | Santa Monica | California | 90405 | 818-215-2671 | 4.82 | Female | 86920 | 4.98200 | 14376 |
409 | 985 | Caffeine Communion | 40 Sommers Parkway | San Jose | California | 95133 | 408-304-3646 | 4.04 | Female | 78176 | 4.64680 | 7365 |
410 | 998 | Urban Brewtropolis | 141 Burrows Place | Los Angeles | California | 90035 | 323-785-4094 | 3.00 | Male | 87604 | 4.64000 | 12814 |
411 | 999 | Steamy Moments | 42004 Bellgrove Terrace | Orange | California | 92867 | 714-506-9394 | 4.50 | Male | 74340 | 3.82000 | 13284 |
412 rows × 12 columns
In [20]:
#Keeping only relevant features
df=df[['Zip Code', 'Rating', 'Median Salary', 'Latte Price', 'Population']]
df.shape
Out[20]:
(412, 5)
In [21]:
df.columns
Out[21]:
Index(['Zip Code', 'Rating', 'Median Salary', 'Latte Price', 'Population'], dtype='object')
In [22]:
# Calculate the total number of coffee shops for each zip code
coffee_shop_counts = df['Zip Code'].value_counts().reset_index()
coffee_shop_counts.columns = ['Zip Code', 'CoffeeShopCount']
# Ensure 'Zip Code' is of type string in both DataFrames
df.loc[:, 'Zip Code'] = df['Zip Code'].astype(str)
coffee_shop_counts['Zip Code'] = coffee_shop_counts['Zip Code'].astype(str)
# Merge the counts back into the original DataFrame
df = df.merge(coffee_shop_counts, on='Zip Code', how='left', suffixes=('', '_coffee'))
# Print the updated DataFrame
print(df)
# Criteria:
# a. High population
# b. Low total number of coffee shops
# c. Low ratings
# d. High median salary
# Sorting the DataFrame based on the criteria
sorted_df = df.sort_values(by=['Population', 'CoffeeShopCount', 'Rating', 'Median Salary'],
ascending=[False, True, True, False]).reset_index(drop=True)
    Zip Code  Rating  Median Salary  Latte Price  Population  CoffeeShopCount
0      95210    5.00          87117      5.35000       11180                4
1      93291    1.20          78753      5.31765       15310                7
2      93726    4.20          80973      5.30665       13942                5
3      92867    1.90          74340      4.43900       13284                9
4      93291    4.50          78753      5.27165       15310                7
..       ...     ...            ...          ...         ...              ...
407    91411    4.00          76034      4.99170        9177                8
408    90405    4.82          86920      4.98200       14376                3
409    95133    4.04          78176      4.64680        7365                6
410    90035    3.00          87604      4.64000       12814                5
411    92867    4.50          74340      3.82000       13284                9

[412 rows x 6 columns]
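The lexicographic sort above ranks almost entirely by population, with the remaining criteria acting only as tie-breakers. An alternative is to blend all four criteria into one min-max-normalized score, as in the sketch below; the weights are illustrative assumptions, not part of the original analysis:

# Sketch: composite opportunity score (higher = better opportunity).
def norm(s):
    return (s - s.min()) / (s.max() - s.min())

score = (0.4 * norm(df['Population'])               # high population is good
         + 0.3 * (1 - norm(df['CoffeeShopCount']))  # few competitors is good
         + 0.2 * (1 - norm(df['Rating']))           # weak incumbents are good
         + 0.1 * norm(df['Median Salary']))         # high salaries are good
scored_df = df.assign(OpportunityScore=score).sort_values('OpportunityScore', ascending=False)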
In [22]:
# Collect the top five unique zip codes: walk the sorted frame, skip zip codes
# already collected, and stop adding once the list holds five entries.
lst=[]
for i in range(len(sorted_df)):
    if len(lst)!=5:
        if (sorted_df['Zip Code'][i]) not in lst:
            lst.append(sorted_df['Zip Code'][i])
# Filter 'sorted_df' to include only rows where 'Zip Code' is in 'lst'
top_5_zip_codes_df = sorted_df[sorted_df['Zip Code'].isin(lst)]
top_5_zip_codes_df
Out[22]:
  | Zip Code | Rating | Median Salary | Latte Price | Population | CoffeeShopCount |
---|---|---|---|---|---|---|
0 | 94110 | 2.22 | 74020 | 5.04500 | 27128 | 5 |
1 | 94110 | 3.50 | 74020 | 5.04700 | 27128 | 5 |
2 | 94110 | 4.40 | 74020 | 5.05900 | 27128 | 5 |
3 | 94110 | 4.50 | 74020 | 4.77900 | 27128 | 5 |
4 | 94110 | 4.70 | 74020 | 4.58100 | 27128 | 5 |
5 | 90805 | 3.00 | 88140 | 5.47900 | 26056 | 4 |
6 | 90805 | 3.89 | 88140 | 5.31900 | 26056 | 4 |
7 | 90805 | 4.31 | 88140 | 5.02100 | 26056 | 4 |
8 | 90805 | 4.37 | 88140 | 5.20700 | 26056 | 4 |
9 | 95823 | 4.30 | 73165 | 4.96825 | 22470 | 3 |
10 | 95823 | 4.82 | 73165 | 4.87225 | 22470 | 3 |
11 | 95823 | 4.94 | 73165 | 4.47225 | 22470 | 3 |
12 | 94544 | 3.98 | 72092 | 4.94860 | 21872 | 6 |
13 | 94544 | 4.12 | 72092 | 4.96060 | 21872 | 6 |
14 | 94544 | 4.17 | 72092 | 4.79060 | 21872 | 6 |
15 | 94544 | 4.24 | 72092 | 4.64460 | 21872 | 6 |
16 | 94544 | 4.24 | 72092 | 4.91660 | 21872 | 6 |
17 | 94544 | 4.86 | 72092 | 4.59060 | 21872 | 6 |
18 | 90025 | 4.50 | 85001 | 5.37405 | 21228 | 1 |
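Because sorted_df is already ordered by the criteria, the manual loop above can be replaced by a one-line equivalent using drop_duplicates, which likewise keeps the first occurrence of each zip code:

# Equivalent to the loop: first five unique zip codes in sorted order.
lst = sorted_df['Zip Code'].drop_duplicates().head(5).tolist()
top_5_zip_codes_df = sorted_df[sorted_df['Zip Code'].isin(lst)]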
In [23]:
# Features excluding 'Latte Price' and 'Zip Code'
X = df.drop(['Latte Price', 'Zip Code'], axis=1)
# Target variable
y = df['Latte Price']
In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Scaling¶
In [25]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test=sc.transform(X_test)
Model Selection¶
In [26]:
#Model selection
models = {
'Linear Regression':LinearRegression(),
'Random Forest': RandomForestRegressor(),
'Gradient Boosting': GradientBoostingRegressor(),
}
Hyperparameter Tuning¶
In [27]:
param_grid = {
'Random Forest':{'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]},
'Gradient Boosting': {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2], 'max_depth': [3, 5, 10]},
}
In [28]:
# Perform hyperparameter tuning using GridSearchCV
for model_name, model in models.items():
    if model_name in param_grid:
        grid_search = GridSearchCV(model, param_grid[model_name], cv=5, scoring='neg_mean_squared_error')
        grid_search.fit(X, y)
        # Set the best hyperparameters on the model
        models[model_name] = grid_search.best_estimator_
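One caveat: grid_search.fit(X, y) tunes on the full, unscaled dataset, so the test split influences hyperparameter selection. A leakage-free variant, sketched below for Random Forest only (variable names and random_state are illustrative assumptions), wraps scaling and the model in a Pipeline so each CV fold is scaled independently and tuning sees only the training split:

from sklearn.pipeline import Pipeline

# Scale inside each CV fold and tune on the training split only.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', RandomForestRegressor(random_state=42))])
rf_grid = {'model__n_estimators': [50, 100, 200], 'model__max_depth': [None, 10, 20]}
leakfree_search = GridSearchCV(pipe, rf_grid, cv=5, scoring='neg_mean_squared_error')
# leakfree_search.fit(X_train, y_train)  # intended for the raw (unscaled) training split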
Model Training and Evaluation¶
In [29]:
# Model training
for model_name, model in models.items():
    model.fit(X_train, y_train)
In [30]:
# Model evaluation
for model_name, model in models.items():
    # Evaluate the model on the test set
    y_pred = model.predict(X_test)
    print(f"{model_name} Metrics:")
    print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("R-squared:", r2_score(y_test, y_pred))
    print()
Linear Regression Metrics:
Mean Absolute Error: 0.21737899529390042
Mean Squared Error: 0.06581667796516916
R-squared: 0.44342320694160287

Random Forest Metrics:
Mean Absolute Error: 0.23160748734464295
Mean Squared Error: 0.07573996171310597
R-squared: 0.359507251050331

Gradient Boosting Metrics:
Mean Absolute Error: 0.2184404032308931
Mean Squared Error: 0.06691055543719977
R-squared: 0.43417286441134295
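Since latte prices are in dollars, root mean squared error is easier to interpret than MSE; a small follow-up sketch:

# RMSE puts the error back in the same units as Latte Price (dollars).
for model_name, model in models.items():
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{model_name} RMSE: ${rmse:.3f}")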
Predictions¶
In [31]:
# Shape this frame like the training features so the models can predict on it
zip_codes_df= top_5_zip_codes_df.drop(['Zip Code', 'Latte Price'], axis=1)
zip_codes_df= sc.transform(zip_codes_df)
In [32]:
for model_name, model in models.items():
    # Predict the prices for lattes in the top 5 zip codes
    predicted_prices = model.predict(zip_codes_df)
In [33]:
predictions = {}
for model_name, model in models.items():
    # Predict the prices for lattes in the top 5 zip codes
    predicted_prices = model.predict(zip_codes_df)
    predictions[model_name] = predicted_prices
# Convert the predictions dictionary to a DataFrame
predictions_df = pd.DataFrame(predictions)
# Add the zip codes to the predictions DataFrame
predictions_df['Zip Code'] = top_5_zip_codes_df['Zip Code'].values
# Rearrange the columns to have 'Zip Code' as the first column
cols = ['Zip Code'] + [col for col in predictions_df.columns if col != 'Zip Code']
predictions_df = predictions_df[cols]
predictions_df
Out[33]:
  | Zip Code | Linear Regression | Random Forest | Gradient Boosting |
---|---|---|---|---|
0 | 94110 | 4.785633 | 4.999094 | 4.909164 |
1 | 94110 | 4.781253 | 5.006316 | 4.909164 |
2 | 94110 | 4.778173 | 4.959572 | 4.909164 |
3 | 94110 | 4.777830 | 4.965996 | 4.909164 |
4 | 94110 | 4.777146 | 4.989925 | 4.909164 |
5 | 90805 | 5.379973 | 5.424480 | 5.304014 |
6 | 90805 | 5.376927 | 5.364141 | 5.292189 |
7 | 90805 | 5.375490 | 5.301565 | 5.276940 |
8 | 90805 | 5.375284 | 5.297322 | 5.276940 |
9 | 95823 | 4.734783 | 4.902095 | 4.838262 |
10 | 95823 | 4.733003 | 4.896670 | 4.838262 |
11 | 95823 | 4.732592 | 4.579438 | 4.765748 |
12 | 94544 | 4.698490 | 4.909205 | 4.838262 |
13 | 94544 | 4.698010 | 4.897126 | 4.838262 |
14 | 94544 | 4.697839 | 4.789904 | 4.838262 |
15 | 94544 | 4.697600 | 4.706985 | 4.838262 |
16 | 94544 | 4.697600 | 4.706985 | 4.838262 |
17 | 94544 | 4.695478 | 4.670329 | 4.778698 |
18 | 90025 | 5.231136 | 5.266994 | 5.216369 |
In [34]:
agg_df = predictions_df.groupby('Zip Code')['Gradient Boosting'].agg([("Highest", "max"), ("Lowest", "min")]).reset_index()
agg_df.columns = ['Zip Code', 'Highest', 'Lowest']
print(agg_df)
  Zip Code   Highest    Lowest
0    90025  5.216369  5.216369
1    90805  5.304014  5.276940
2    94110  4.909164  4.909164
3    94544  4.838262  4.778698
4    95823  4.838262  4.765748
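For a single headline price per zip code, the three models' predictions can also be averaged; the simple mean below is an illustrative choice, not part of the original workflow:

# Blend the three models into one per-zip estimate via a simple mean.
model_cols = ['Linear Regression', 'Random Forest', 'Gradient Boosting']
blended = (predictions_df.assign(Blended=predictions_df[model_cols].mean(axis=1))
           .groupby('Zip Code')['Blended'].mean()
           .reset_index())
print(blended)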
Top five zip codes and prices¶
In [35]:
print("Top five zip code and prices")
print(agg_df)
Top five zip code and prices Zip Code Highest Lowest 0 90025 5.216369 5.216369 1 90805 5.304014 5.276940 2 94110 4.909164 4.909164 3 94544 4.838262 4.778698 4 95823 4.838262 4.765748