Machine Learning: Using Python to Find Your Next Business Opportunity¶
Inspired by the course "Advanced Python Projects: Build AI Applications" by Priya Mohan on LinkedIn Learning, this project explores advanced AI concepts, techniques, and applications in Python, with the goal of strengthening our portfolio and deepening our understanding of AI development.
Mohan, Priya. "Advanced Python Projects: Build AI Applications." LinkedIn Learning, 2024.
In [2]:
!pip install pandas scikit-learn matplotlib
!pip install alpha_vantage pandas
Requirement already satisfied: pandas, scikit-learn, matplotlib, alpha_vantage, and their dependencies in c:\users\nayel\anaconda3\lib\site-packages
In [3]:
# Import all required libraries and classes
import pandas as pd
import re
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import warnings
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
In [4]:
# Data Ingestion and Analysis
df= pd.read_excel('Coffee_shop_data.xlsx')
population = pd.read_csv('population.csv', skiprows=[0])
In [5]:
population.head()
Out[5]:
  | Geography | Label for GEO_ID | Race/Ethnic Group | Population Groups | Total | Errata of Total | Total!!1-person household | Total!!2-person household | Total!!3-person household | Total!!4-person household | Total!!5-person household | Total!!6-person household | Total!!7-or-more-person household |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8600000US89010 | ZCTA5 89010 | 1 | Total population | 172 | NaN | 51 | 69 | 19 | 12 | 12 | 4 | 5 |
1 | 8600000US89019 | ZCTA5 89019 | 1 | Total population | 1164 | NaN | 412 | 421 | 134 | 83 | 57 | 32 | 25 |
2 | 8600000US89060 | ZCTA5 89060 | 1 | Total population | 4144 | NaN | 1106 | 1714 | 550 | 361 | 222 | 95 | 96 |
3 | 8600000US89061 | ZCTA5 89061 | 1 | Total population | 2109 | NaN | 469 | 936 | 283 | 206 | 112 | 62 | 41 |
4 | 8600000US89439 | ZCTA5 89439 | 1 | Total population | 671 | NaN | 193 | 314 | 88 | 55 | 17 | 2 | 2 |
In [6]:
df.head()
Out[6]:
  | ID no. | Business Name | Street address | City | State | Zip Code | Phone | Rating | Gender majority | Median Salary | Latte Price |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Brew Haven | 8 Old Shore Place | Oakland | California | 94616 | 415-810-4769 | 4.5 | Male | 72463 | 3.31000 |
1 | 2 | Bean Bliss | 6650 Clarendon Crossing | Stockton | California | 95210 | 209-701-1665 | 5.0 | Female | 87117 | 5.35000 |
2 | 3 | Caffeine Cove | 7281 Buell Road | Fresno | California | 93773 | 559-137-3554 | 4.7 | Male | 86394 | 4.15000 |
3 | 4 | Mug Magic | 670 Jackson Avenue | Torrance | California | 90510 | 818-789-5573 | 4.3 | Male | 88343 | 5.34000 |
4 | 5 | Daily Grind | 37 Ludington Terrace | San Diego | California | 92196 | 619-354-2389 | 1.6 | Male | 77795 | 4.51175 |
In [7]:
# check for data info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   ID no.           1000 non-null   int64
 1   Business Name    1000 non-null   object
 2   Street address   1000 non-null   object
 3   City             1000 non-null   object
 4   State            1000 non-null   object
 5   Zip Code         1000 non-null   int64
 6   Phone            1000 non-null   object
 7   Rating           1000 non-null   float64
 8   Gender majority  1000 non-null   object
 9   Median Salary    1000 non-null   int64
 10  Latte Price      1000 non-null   float64
dtypes: float64(2), int64(3), object(6)
memory usage: 86.1+ KB
In [8]:
df.shape
Out[8]:
(1000, 11)
In [9]:
population.shape
Out[9]:
(1705, 13)
In [10]:
# Get basic stats about the data
df.describe()
Out[10]:
  | ID no. | Zip Code | Rating | Median Salary | Latte Price |
---|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 500.500000 | 92976.163000 | 3.784600 | 81182.842000 | 5.061491 |
std | 288.819436 | 1706.943177 | 1.150717 | 5142.670356 | 0.352002 |
min | 1.000000 | 90005.000000 | 1.000000 | 72001.000000 | 3.090000 |
25% | 250.750000 | 91751.750000 | 3.200000 | 76776.000000 | 4.830175 |
50% | 500.500000 | 92883.000000 | 4.240000 | 81113.000000 | 5.062450 |
75% | 750.250000 | 94257.000000 | 4.580000 | 85684.000000 | 5.310025 |
max | 1000.000000 | 96154.000000 | 5.000000 | 89978.000000 | 6.480000 |
In [11]:
ax=df['City'].value_counts().head(5).plot(kind='bar')
ax.set_title('Top 5 cities with most coffee shops')
plt.show()
In [12]:
# The top 10 most famous brands
ax=df['Business Name'].value_counts().head(10).plot(kind='bar')
ax.set_title('Top 10 most famous brands')
plt.show()
Data Preprocessing¶
In [13]:
# If there were null values we would impute them, replacing missing entries with the mode (the most frequently occurring value)
df.isna().sum()
Out[13]:
ID no.             0
Business Name      0
Street address     0
City               0
State              0
Zip Code           0
Phone              0
Rating             0
Gender majority    0
Median Salary      0
Latte Price        0
dtype: int64
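No missing values turn up here, but if any did, a mode-based fill could look like the minimal sketch below (assuming df is the coffee-shop frame loaded above; on clean data this cell does nothing):

# Sketch: fill any missing values with each column's mode (most frequent value).
for col in df.columns:
    if df[col].isna().any():
        df[col] = df[col].fillna(df[col].mode().iloc[0])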
In [14]:
# Convert Zip Code to a string so it can be joined with the population data's zip codes
df['Zip Code']=df['Zip Code'].astype(str)
In [15]:
# Extract the zip code from the population data:
# take the last five digits of the Geography geocode and store them in a new 'Zip Code' column
def find_zip_code(geocode):
    pattern = r'\d{5}$'
    match = re.search(pattern, geocode)
    if match:
        zip_code = match.group(0)
        return zip_code
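As a quick sanity check, the function can be exercised on sample geocodes (illustrative inputs, not rows from the dataset):

# The regex grabs the trailing five digits; inputs without them return None.
print(find_zip_code('8600000US89010'))      # 89010
print(find_zip_code('no trailing digits'))  # None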
In [16]:
# The actual conversion: apply the function above to create the new column
population['Zip Code']=population['Geography'].apply(find_zip_code)
In [17]:
# Merge the population data in via zip code; population is an important feature for determining price and location
cafe_data=df.copy()
# Note that the dataset shrinks after the join: only zip codes present in both frames are kept
df=pd.merge(cafe_data,population)
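pd.merge with no key arguments joins on all shared column names and defaults to an inner join; since 'Zip Code' is the only column the two frames share, the call above is equivalent to the more explicit sketch below:

# Same result, with the join key and join type spelled out.
df = pd.merge(cafe_data, population, on='Zip Code', how='inner')
print(df.shape)  # fewer than 1000 rows: zip codes without population records are dropped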
In [18]:
# From the population dataset, keep only the 'Total' column alongside the original cafe columns
columns=cafe_data.columns.values.tolist()+['Total']
df=df[columns]
#rename Total to population
df=df.rename(columns={"Total":"Population"})
In [19]:
df
Out[19]:
  | ID no. | Business Name | Street address | City | State | Zip Code | Phone | Rating | Gender majority | Median Salary | Latte Price | Population |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | Bean Bliss | 6650 Clarendon Crossing | Stockton | California | 95210 | 209-701-1665 | 5.00 | Female | 87117 | 5.35000 | 11180 |
1 | 6 | Espresso Elegance | 8427 Atwood Road | Visalia | California | 93291 | 559-929-4731 | 1.20 | Female | 78753 | 5.31765 | 15310 |
2 | 10 | Perk Palace | 7 David Junction | Fresno | California | 93726 | 559-323-2365 | 4.20 | Female | 80973 | 5.30665 | 13942 |
3 | 11 | The Coffee Cartel | 6018 Rockefeller Center | Orange | California | 92867 | 714-292-8390 | 1.90 | Male | 74340 | 4.43900 | 13284 |
4 | 14 | Grindhouse Cafe | 7846 Cherokee Junction | Visalia | California | 93291 | 559-253-9426 | 4.50 | Male | 78753 | 5.27165 | 15310 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
407 | 982 | Perk Paradise | 4 Park Meadow Crossing | Van Nuys | California | 91411 | 818-584-8823 | 4.00 | Female | 76034 | 4.99170 | 9177 |
408 | 983 | Urban Brewtropolis | 2459 Golden Leaf Place | Santa Monica | California | 90405 | 818-215-2671 | 4.82 | Female | 86920 | 4.98200 | 14376 |
409 | 985 | Caffeine Communion | 40 Sommers Parkway | San Jose | California | 95133 | 408-304-3646 | 4.04 | Female | 78176 | 4.64680 | 7365 |
410 | 998 | Urban Brewtropolis | 141 Burrows Place | Los Angeles | California | 90035 | 323-785-4094 | 3.00 | Male | 87604 | 4.64000 | 12814 |
411 | 999 | Steamy Moments | 42004 Bellgrove Terrace | Orange | California | 92867 | 714-506-9394 | 4.50 | Male | 74340 | 3.82000 | 13284 |
412 rows × 12 columns
In [20]:
#Keeping only relevant features
df=df[['Zip Code', 'Rating', 'Median Salary', 'Latte Price', 'Population']]
df.shape
Out[20]:
(412, 5)
In [21]:
df.columns
Out[21]:
Index(['Zip Code', 'Rating', 'Median Salary', 'Latte Price', 'Population'], dtype='object')
In [22]:
# Calculate the total number of coffee shops for each zip code
coffee_shop_counts = df['Zip Code'].value_counts().reset_index()
coffee_shop_counts.columns = ['Zip Code', 'CoffeeShopCount']
# Ensure 'Zip Code' is of type string in both DataFrames
df.loc[:, 'Zip Code'] = df['Zip Code'].astype(str)
coffee_shop_counts['Zip Code'] = coffee_shop_counts['Zip Code'].astype(str)
# Merge the counts back into the original DataFrame
df = df.merge(coffee_shop_counts, on='Zip Code', how='left', suffixes=('', '_coffee'))
# Print the updated DataFrame
print(df)
# Criteria:
# a. High population
# b. Low total number of coffee shops
# c. Low ratings
# d. High median salary
# Sorting the DataFrame based on the criteria
sorted_df = df.sort_values(by=['Population', 'CoffeeShopCount', 'Rating', 'Median Salary'],
ascending=[False, True, True, False]).reset_index(drop=True)
    Zip Code  Rating  Median Salary  Latte Price  Population  CoffeeShopCount
0      95210    5.00          87117      5.35000       11180                4
1      93291    1.20          78753      5.31765       15310                7
2      93726    4.20          80973      5.30665       13942                5
3      92867    1.90          74340      4.43900       13284                9
4      93291    4.50          78753      5.27165       15310                7
..       ...     ...            ...          ...         ...              ...
407    91411    4.00          76034      4.99170        9177                8
408    90405    4.82          86920      4.98200       14376                3
409    95133    4.04          78176      4.64680        7365                6
410    90035    3.00          87604      4.64000       12814                5
411    92867    4.50          74340      3.82000       13284                9

[412 rows x 6 columns]
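The lexicographic sort above ranks almost entirely by population, with the remaining criteria acting only as tie-breakers. An alternative is to blend all four criteria into one min-max-normalized score, as in the sketch below; the weights are illustrative assumptions, not part of the original analysis:

# Sketch: composite opportunity score (higher = better opportunity).
def norm(s):
    return (s - s.min()) / (s.max() - s.min())

score = (0.4 * norm(df['Population'])               # high population is good
         + 0.3 * (1 - norm(df['CoffeeShopCount']))  # few competitors is good
         + 0.2 * (1 - norm(df['Rating']))           # weak incumbents are good
         + 0.1 * norm(df['Median Salary']))         # high salaries are good
scored_df = df.assign(OpportunityScore=score).sort_values('OpportunityScore', ascending=False)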
In [22]:
# Collect the top five unique zip codes: walk the sorted frame, skip zip codes
# already collected, and stop adding once the list holds five entries.
lst=[]
for i in range(len(sorted_df)):
    if len(lst)!=5:
        if (sorted_df['Zip Code'][i]) not in lst:
            lst.append(sorted_df['Zip Code'][i])
# Filter 'sorted_df' to include only rows where 'Zip Code' is in 'lst'
top_5_zip_codes_df = sorted_df[sorted_df['Zip Code'].isin(lst)]
top_5_zip_codes_df
Out[22]:
  | Zip Code | Rating | Median Salary | Latte Price | Population | CoffeeShopCount |
---|---|---|---|---|---|---|
0 | 94110 | 2.22 | 74020 | 5.04500 | 27128 | 5 |
1 | 94110 | 3.50 | 74020 | 5.04700 | 27128 | 5 |
2 | 94110 | 4.40 | 74020 | 5.05900 | 27128 | 5 |
3 | 94110 | 4.50 | 74020 | 4.77900 | 27128 | 5 |
4 | 94110 | 4.70 | 74020 | 4.58100 | 27128 | 5 |
5 | 90805 | 3.00 | 88140 | 5.47900 | 26056 | 4 |
6 | 90805 | 3.89 | 88140 | 5.31900 | 26056 | 4 |
7 | 90805 | 4.31 | 88140 | 5.02100 | 26056 | 4 |
8 | 90805 | 4.37 | 88140 | 5.20700 | 26056 | 4 |
9 | 95823 | 4.30 | 73165 | 4.96825 | 22470 | 3 |
10 | 95823 | 4.82 | 73165 | 4.87225 | 22470 | 3 |
11 | 95823 | 4.94 | 73165 | 4.47225 | 22470 | 3 |
12 | 94544 | 3.98 | 72092 | 4.94860 | 21872 | 6 |
13 | 94544 | 4.12 | 72092 | 4.96060 | 21872 | 6 |
14 | 94544 | 4.17 | 72092 | 4.79060 | 21872 | 6 |
15 | 94544 | 4.24 | 72092 | 4.64460 | 21872 | 6 |
16 | 94544 | 4.24 | 72092 | 4.91660 | 21872 | 6 |
17 | 94544 | 4.86 | 72092 | 4.59060 | 21872 | 6 |
18 | 90025 | 4.50 | 85001 | 5.37405 | 21228 | 1 |
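Because sorted_df is already ordered by the criteria, the manual loop above can be replaced by a one-line equivalent using drop_duplicates, which likewise keeps the first occurrence of each zip code:

# Equivalent to the loop: first five unique zip codes in sorted order.
lst = sorted_df['Zip Code'].drop_duplicates().head(5).tolist()
top_5_zip_codes_df = sorted_df[sorted_df['Zip Code'].isin(lst)]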
In [23]:
# Features excluding 'Latte Price' and 'Zip Code'
X = df.drop(['Latte Price', 'Zip Code'], axis=1)
# Target variable
y = df['Latte Price']
In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Scaling¶
In [25]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test=sc.transform(X_test)
Model Selection¶
In [26]:
#Model selection
models = {
'Linear Regression':LinearRegression(),
'Random Forest': RandomForestRegressor(),
'Gradient Boosting': GradientBoostingRegressor(),
}
Hyperparameter Tuning¶
In [27]:
param_grid = {
'Random Forest':{'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]},
'Gradient Boosting': {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2], 'max_depth': [3, 5, 10]},
}
In [28]:
# Perform hyperparameter tuning using GridSearchCV
for model_name, model in models.items():
    if model_name in param_grid:
        grid_search = GridSearchCV(model, param_grid[model_name], cv=5, scoring='neg_mean_squared_error')
        grid_search.fit(X, y)
        # Set the best hyperparameters on the model
        models[model_name] = grid_search.best_estimator_
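One caveat: grid_search.fit(X, y) tunes on the full, unscaled dataset, so the test split influences hyperparameter selection. A leakage-free variant, sketched below for Random Forest only (variable names and random_state are illustrative assumptions), wraps scaling and the model in a Pipeline so each CV fold is scaled independently and tuning sees only the training split:

from sklearn.pipeline import Pipeline

# Scale inside each CV fold and tune on the training split only.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', RandomForestRegressor(random_state=42))])
rf_grid = {'model__n_estimators': [50, 100, 200], 'model__max_depth': [None, 10, 20]}
leakfree_search = GridSearchCV(pipe, rf_grid, cv=5, scoring='neg_mean_squared_error')
# leakfree_search.fit(X_train, y_train)  # intended for the raw (unscaled) training split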
Model Training and Evaluation¶
In [29]:
# Model training
for model_name, model in models.items():
    model.fit(X_train, y_train)
In [30]:
# Model evaluation
for model_name, model in models.items():
    # Evaluate the model on the test set
    y_pred = model.predict(X_test)
    print(f"{model_name} Metrics:")
    print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("R-squared:", r2_score(y_test, y_pred))
    print()
Linear Regression Metrics:
Mean Absolute Error: 0.21737899529390042
Mean Squared Error: 0.06581667796516916
R-squared: 0.44342320694160287

Random Forest Metrics:
Mean Absolute Error: 0.23160748734464295
Mean Squared Error: 0.07573996171310597
R-squared: 0.359507251050331

Gradient Boosting Metrics:
Mean Absolute Error: 0.2184404032308931
Mean Squared Error: 0.06691055543719977
R-squared: 0.43417286441134295
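Since latte prices are in dollars, root mean squared error is easier to interpret than MSE; a small follow-up sketch:

# RMSE puts the error back in the same units as Latte Price (dollars).
for model_name, model in models.items():
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{model_name} RMSE: ${rmse:.3f}")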
Predictions¶
In [31]:
# Shape this frame like the training features so the models can predict on it
zip_codes_df= top_5_zip_codes_df.drop(['Zip Code', 'Latte Price'], axis=1)
zip_codes_df= sc.transform(zip_codes_df)
In [32]:
for model_name, model in models.items():
    # Predict the prices for lattes in the top 5 zip codes
    predicted_prices = model.predict(zip_codes_df)
In [33]:
predictions = {}
for model_name, model in models.items():
    # Predict the prices for lattes in the top 5 zip codes
    predicted_prices = model.predict(zip_codes_df)
    predictions[model_name] = predicted_prices
# Convert the predictions dictionary to a DataFrame
predictions_df = pd.DataFrame(predictions)
# Add the zip codes to the predictions DataFrame
predictions_df['Zip Code'] = top_5_zip_codes_df['Zip Code'].values
# Rearrange the columns to have 'Zip Code' as the first column
cols = ['Zip Code'] + [col for col in predictions_df.columns if col != 'Zip Code']
predictions_df = predictions_df[cols]
predictions_df
Out[33]:
  | Zip Code | Linear Regression | Random Forest | Gradient Boosting |
---|---|---|---|---|
0 | 94110 | 4.785633 | 4.999094 | 4.909164 |
1 | 94110 | 4.781253 | 5.006316 | 4.909164 |
2 | 94110 | 4.778173 | 4.959572 | 4.909164 |
3 | 94110 | 4.777830 | 4.965996 | 4.909164 |
4 | 94110 | 4.777146 | 4.989925 | 4.909164 |
5 | 90805 | 5.379973 | 5.424480 | 5.304014 |
6 | 90805 | 5.376927 | 5.364141 | 5.292189 |
7 | 90805 | 5.375490 | 5.301565 | 5.276940 |
8 | 90805 | 5.375284 | 5.297322 | 5.276940 |
9 | 95823 | 4.734783 | 4.902095 | 4.838262 |
10 | 95823 | 4.733003 | 4.896670 | 4.838262 |
11 | 95823 | 4.732592 | 4.579438 | 4.765748 |
12 | 94544 | 4.698490 | 4.909205 | 4.838262 |
13 | 94544 | 4.698010 | 4.897126 | 4.838262 |
14 | 94544 | 4.697839 | 4.789904 | 4.838262 |
15 | 94544 | 4.697600 | 4.706985 | 4.838262 |
16 | 94544 | 4.697600 | 4.706985 | 4.838262 |
17 | 94544 | 4.695478 | 4.670329 | 4.778698 |
18 | 90025 | 5.231136 | 5.266994 | 5.216369 |
In [34]:
agg_df = predictions_df.groupby('Zip Code')['Gradient Boosting'].agg([("Highest", "max"), ("Lowest", "min")]).reset_index()
agg_df.columns = ['Zip Code', 'Highest', 'Lowest']
print(agg_df)
  Zip Code   Highest    Lowest
0    90025  5.216369  5.216369
1    90805  5.304014  5.276940
2    94110  4.909164  4.909164
3    94544  4.838262  4.778698
4    95823  4.838262  4.765748
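For a single headline price per zip code, the three models' predictions can also be averaged; the simple mean below is an illustrative choice, not part of the original workflow:

# Blend the three models into one per-zip estimate via a simple mean.
model_cols = ['Linear Regression', 'Random Forest', 'Gradient Boosting']
blended = (predictions_df.assign(Blended=predictions_df[model_cols].mean(axis=1))
           .groupby('Zip Code')['Blended'].mean()
           .reset_index())
print(blended)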
Top five zip codes and prices¶
In [35]:
print("Top five zip code and prices")
print(agg_df)
Top five zip code and prices Zip Code Highest Lowest 0 90025 5.216369 5.216369 1 90805 5.304014 5.276940 2 94110 4.909164 4.909164 3 94544 4.838262 4.778698 4 95823 4.838262 4.765748