ML project 1

Machine learning for vector control decision making

Dataset

This dataset was collected by KY Lee and colleagues in Seoul, South Korea. It consists of a mosquito indicator: the number of mosquitos per specific area in Seoul.

The data were collected daily from May 2016 to December 2019.

The authors have made this dataset freely available at https://www.kaggle.com/kukuroo3/mosquito-indicator-in-seoul-korea.

Question

Let's assume that Seoul city health officials need to decide, based on climatic factors, whether indoor residual spraying (IRS) is necessary. To do so they need to know the number of mosquitos per specific area (the mosquito indicator). In this sample project I develop a machine learning model that predicts the mosquito indicator value from climate data.

Importing required Python libraries

In [22]:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler  # standardise features to zero mean and unit variance
from sklearn.model_selection import train_test_split  # split data into training and test sets
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

Loading the dataset as a pandas DataFrame

In [23]:

mosData = pd.read_csv('mosIndicator.csv') 
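A slightly more defensive load can parse the date column while reading. The sketch below assumes a comma-separated file whose header matches the column names shown in the preview that follows. The inline sample (three rows copied from that preview) stands in for `mosIndicator.csv`, and `load_mosquito_data` is a hypothetical helper name.

```python
import io

import pandas as pd

# Three rows copied from the dataset preview; they stand in for
# mosIndicator.csv so that the sketch runs without the file.
SAMPLE_CSV = """date,mosquito_Indicator,rain(mm),mean_T(℃),min_T(℃),max_T(℃)
2016-05-01,254.4,0.0,18.8,12.2,26.0
2016-05-02,273.5,16.5,21.1,16.5,28.4
2016-05-03,304.0,27.0,12.9,8.9,17.6
"""


def load_mosquito_data(source):
    """Hypothetical helper: read the CSV and parse the date column as datetimes."""
    return pd.read_csv(source, parse_dates=["date"])


sample = load_mosquito_data(io.StringIO(SAMPLE_CSV))
print(sample.dtypes)
```

With the real file, the same helper would be called as `load_mosquito_data("mosIndicator.csv")`.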

Exploring the data

In [24]:

mosData.head()

Out[24]:

	date	mosquito_Indicator	rain(mm)	mean_T(℃)	min_T(℃)	max_T(℃)
0	2016-05-01	254.4	0.0	18.8	12.2	26.0
1	2016-05-02	273.5	16.5	21.1	16.5	28.4
2	2016-05-03	304.0	27.0	12.9	8.9	17.6
3	2016-05-04	256.2	0.0	15.7	10.2	20.6
4	2016-05-05	243.8	7.5	18.9	10.2	26.9

In [25]:

mosData.describe()

Out[25]:

	mosquito_Indicator	rain(mm)	mean_T(℃)	min_T(℃)	max_T(℃)
count	1342.000000	1342.000000	1342.000000	1342.000000	1342.000000
mean	251.991803	3.539866	14.166021	10.005663	19.096870
std	295.871336	13.868106	10.943990	11.109489	11.063394
min	0.000000	0.000000	-14.800000	-17.800000	-10.700000
25%	5.500000	0.000000	4.500000	0.300000	9.300000
50%	91.900000	0.000000	16.500000	11.500000	21.900000
75%	480.400000	0.400000	23.300000	19.500000	28.175000
max	1000.000000	144.500000	33.700000	30.300000	39.600000

In [26]:

# Changing column names

mosData = mosData.rename(columns={"date": "Date", 'mosquito_Indicator': 'Mosquitos', 'rain(mm)': 'Rain', "mean_T(℃)": "Meant", "min_T(℃)": "Mint", "max_T(℃)": "Maxt"})

In [27]:

mosData.head()

Out[27]:

	Date	Mosquitos	Rain	Meant	Mint	Maxt
0	2016-05-01	254.4	0.0	18.8	12.2	26.0
1	2016-05-02	273.5	16.5	21.1	16.5	28.4
2	2016-05-03	304.0	27.0	12.9	8.9	17.6
3	2016-05-04	256.2	0.0	15.7	10.2	20.6
4	2016-05-05	243.8	7.5	18.9	10.2	26.9

In [28]:

# Checking for missing values

mosData.isnull().values.any()

# If there were null values, the rows containing them could be dropped:
# mosData = mosData.dropna()

Out[28]:

False
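The single `True`/`False` answer above does not say where missing values are. A per-column count is a common follow-up; the sketch below uses a small made-up frame with one deliberately missing value, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (not real data).
toy = pd.DataFrame({"Rain": [0.0, np.nan, 27.0], "Meant": [18.8, 21.1, 12.9]})

missing_per_column = toy.isnull().sum()  # number of NaN values in each column
print(missing_per_column)

cleaned = toy.dropna()  # drop every row that contains at least one NaN
print(len(toy), "->", len(cleaned))
```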

In [42]:

# If some categories had very few rows, they could be merged into "Other" with the following function

#def drop_lows(category, cutoff):
#    categorical_map = {}
#    for i in range(len(category)):
#        if category.values[i] >= cutoff:
#            categorical_map[category.index[i]] = category.index[i]
#        else:
#            categorical_map[category.index[i]] = "Other"
#    return categorical_map

In [43]:

# "Area" is a hypothetical column, used to illustrate how this would work on a dataset with categories
# area_map = drop_lows(mosData.Area.value_counts(), 400)
# mosData["Area"] = mosData["Area"].map(area_map)
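The commented-out idea above can be tried on toy data. This is a minimal sketch, assuming a made-up `Area` column (the real dataset has none) and made-up district labels; `drop_lows` here is a compact rewrite of the same mapping logic.

```python
import pandas as pd


def drop_lows(counts, cutoff):
    """Map each category to itself if its count reaches the cutoff, else to "Other"."""
    return {
        category: (category if count >= cutoff else "Other")
        for category, count in counts.items()
    }


# Hypothetical data: the real dataset has no "Area" column.
toy = pd.DataFrame({"Area": ["Gangnam"] * 4 + ["Mapo"] * 3 + ["Jongno"]})

area_map = drop_lows(toy["Area"].value_counts(), cutoff=3)
toy["Area"] = toy["Area"].map(area_map)
print(toy["Area"].value_counts())
```

Categories below the cutoff are merged rather than deleted, so no rows are lost at this step.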

In [44]:

# Removing outliers, as identified from box plots

#fig, ax = plt.subplots(1, 1, figsize=(12, 7))
#mosData.boxplot(column="Mosquitos", by="Area", ax=ax)
#plt.suptitle("Mosquitos vs Area")
#plt.title("")
#plt.ylabel("Mosquitos")
#plt.xticks(rotation=90)
#plt.show()

In [45]:

# Outliers appear in a box plot as dots beyond the whiskers.
# If the whiskers stay within an upper and a lower cutoff
# (700 and 20 are illustrative values), rows outside that range
# and the merged "Other" category could be removed as follows:
#mosData = mosData[mosData["Mosquitos"] <= 700]
#mosData = mosData[mosData["Mosquitos"] >= 20]
#mosData = mosData[mosData["Area"] != "Other"]
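Instead of reading cutoffs off the plot, the whisker limits can be computed directly. The sketch below assumes the usual box-plot convention of 1.5 times the interquartile range and uses made-up indicator values; it is one possible rule, not the one used in this notebook.

```python
import pandas as pd

# Made-up indicator values with one obvious outlier.
toy = pd.Series([5, 90, 120, 150, 200, 240, 300, 2000], name="Mosquitos")

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1                     # interquartile range (height of the box)
lower = q1 - 1.5 * iqr            # lower whisker limit
upper = q3 + 1.5 * iqr            # upper whisker limit

kept = toy[(toy >= lower) & (toy <= upper)]  # values inside the whiskers
print(lower, upper, len(kept))
```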

Separating the features and the target

In [29]:

xtemp = mosData.drop("Mosquitos", axis = 1)
x = xtemp.drop("Date",  axis = 1)
y = mosData["Mosquitos"]

* Both categories have reasonable counts.

* All the values are numerical and do not need to be converted.

Data standardisation

Rainfall is on a different scale from the temperatures, so the features are standardised.

In [30]:

scaler = StandardScaler()  # create a StandardScaler instance

In [31]:

scaler.fit(x)  # learn each feature's mean and standard deviation from x

Out[31]:

StandardScaler()

In [32]:

st_data_x = scaler.transform(x)

In [33]:

x = st_data_x
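To make the transformation concrete, the sketch below checks that `StandardScaler` matches the by-hand formula (value minus column mean, divided by column standard deviation). The three feature rows are copied from the dataset preview and are used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rain, mean, min and max temperature for the first three preview rows.
features = np.array([
    [0.0, 18.8, 12.2, 26.0],
    [16.5, 21.1, 16.5, 28.4],
    [27.0, 12.9, 8.9, 17.6],
])

scaled = StandardScaler().fit_transform(features)

# The same result by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0, which is NumPy's default).
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and unit variance, so rainfall and temperature contribute on comparable scales.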

Splitting train and test datasets

In [41]:

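X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2)
# test_size=0.2 holds out 20% of the data for testing
# stratify is not used: it preserves class proportions, and the target here is continuous
# No random_state is set, so the split (and the errors below) change from run to run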
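In this notebook the scaler is fitted on all rows before the split. A common alternative is to split first and let a pipeline fit the scaler on the training rows only, so that no test-set statistics reach the model. The sketch below shows that variant on synthetic data; `X_demo`, `y_demo` and the coefficients are made up, and `random_state=42` is an arbitrary choice that makes the split repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))  # synthetic stand-in for the 4 climate features
y_demo = X_demo @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(scale=0.1, size=200)

# Fixing random_state makes the split identical on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only, then the regressor.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

print(model.score(X_te, y_te))  # R^2 on the held-out rows
```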

Training the model

Selecting an algorithm requires considering several factors, including the type of target variable and the size of the dataset. Here the target variable is a continuous number, so a regression algorithm is appropriate. I will use linear regression, lasso regression, support vector regression (Wu et al. 2004), random forest (Ebrahimi-Khusfi et al. 2021) and stochastic gradient boosting (Ebrahimi-Khusfi et al. 2021), and compare which performs best.

In [98]:

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error

1. Linear regression

In [99]:

lin_reg = LinearRegression() 

lin_reg.fit(X_train, Y_train)

Ytrpreds = lin_reg.predict(X_train)
Ytestpred = lin_reg.predict(X_test)

trerror = np.sqrt(mean_squared_error(Y_train, Ytrpreds))
testerror = np.sqrt(mean_squared_error(Y_test, Ytestpred))

print(trerror)
print(testerror)

192.98943977171356
202.67642370681074
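The two numbers above are root mean squared errors (RMSE), in the same units as the mosquito indicator. The sketch below spells out the calculation and also shows `mean_absolute_error`, which is imported in this notebook but not used. The true values are the first three indicator values from the preview; the predictions are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([254.4, 273.5, 304.0])  # first three indicator values from the preview
y_pred = np.array([250.0, 280.0, 290.0])  # made-up predictions for illustration

rmse = np.sqrt(mean_squared_error(y_true, y_pred))       # as computed in the notebook
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))   # the same formula by hand
mae = mean_absolute_error(y_true, y_pred)                # mean of the absolute errors

print(rmse, rmse_manual, mae)
```

RMSE penalises large errors more than MAE does, so RMSE is never smaller than MAE on the same predictions.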

2. Lasso regression

In [100]:

lasso = Lasso()
lasso.fit(X_train, Y_train)

lasso_Ytrpreds = lasso.predict(X_train)
lasso_Ytestpred = lasso.predict(X_test)

lasso_trerror = np.sqrt(mean_squared_error(Y_train, lasso_Ytrpreds))
lasso_testerror = np.sqrt(mean_squared_error(Y_test, lasso_Ytestpred))

print(lasso_trerror)
print(lasso_testerror)

193.17517395879298
203.53583520781154

3. Support vector regression (SVR)

In [101]:

# gamma is ignored by the linear kernel; it only affects the "rbf", "poly" and "sigmoid" kernels
svr = SVR(kernel="linear", C=100, gamma="auto")
svr.fit(X_train, Y_train)

svr_Ytrpreds = svr.predict(X_train)
svr_Ytestpred = svr.predict(X_test)

svr_trerror = np.sqrt(mean_squared_error(Y_train, svr_Ytrpreds))
svr_testerror = np.sqrt(mean_squared_error(Y_test, svr_Ytestpred))

print(svr_trerror)
print(svr_testerror)

194.02301601725773
206.40789365465545

4. Random forest (RF)

In [102]:

rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
rf.fit(X_train, Y_train)

rf_Ytrpreds = rf.predict(X_train)
rf_Ytestpred = rf.predict(X_test)

rf_trerror = np.sqrt(mean_squared_error(Y_train, rf_Ytrpreds))
rf_testerror = np.sqrt(mean_squared_error(Y_test, rf_Ytestpred))

print(rf_trerror)
print(rf_testerror)

62.597274333213534
165.61398484014964

5. Stochastic gradient boosting

In [103]:

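# With the default subsample=1.0 this is plain gradient boosting; subsample < 1.0 makes it stochastic
sgb = GradientBoostingRegressor(n_estimators = 1000, random_state = 0)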
sgb.fit(X_train, Y_train)

sgb_Ytrpreds = sgb.predict(X_train)
sgb_Ytestpred = sgb.predict(X_test)

sgb_trerror = np.sqrt(mean_squared_error(Y_train, sgb_Ytrpreds))
sgb_testerror = np.sqrt(mean_squared_error(Y_test, sgb_Ytestpred))

print(sgb_trerror)
print(sgb_testerror)

33.115401993709604
184.6828138793723

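Random forest has the lowest test RMSE (165.6), followed by gradient boosting (184.7), but both overfit heavily: their training RMSEs (62.6 and 33.1) are far below their test RMSEs. Among the three linear models, linear regression has the lowest test RMSE (202.7) and the smallest gap between training and test error (193.0 vs 202.7). As the simplest and most stable model, it is used for the prediction system below.

Creating a prediction system using linear regression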
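The five cells above repeat the same fit, predict and score steps, so the comparison can also be written as one loop that reports training RMSE, test RMSE and the gap between them. The sketch below runs on synthetic data so it is self-contained: `X_demo` and `y_demo` are made up, and `n_estimators` is reduced to 100 to keep it quick, so its numbers are not comparable to those above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))  # synthetic, already on a standard scale
y_demo = 50 * X_demo[:, 1] + 10 * X_demo[:, 0] + rng.normal(scale=20, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(),
    "svr": SVR(kernel="linear", C=100),
    "rf": RandomForestRegressor(n_estimators=100, random_state=42),
    "sgb": GradientBoostingRegressor(n_estimators=100, random_state=0),
}


def rmse(y_true, y_pred):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = (rmse(y_tr, model.predict(X_tr)), rmse(y_te, model.predict(X_te)))

# Sort by test RMSE; a large gap between train and test suggests overfitting.
for name, (train_rmse, test_rmse) in sorted(results.items(), key=lambda kv: kv[1][1]):
    gap = test_rmse - train_rmse
    print(f"{name:>6}  train={train_rmse:7.2f}  test={test_rmse:7.2f}  gap={gap:7.2f}")
```

Reporting the gap alongside the test error makes the trade-off between accuracy and overfitting visible in one table.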

In [109]:

# Input order: rainfall (0.0), mean temperature (16.4),
# minimum temperature (10.4), maximum temperature (23.5).
# Replace these values to get a prediction for other conditions.
x = np.array([[0.0, 16.4, 10.4, 23.5]])

# Make sure the array has shape (1, 4), since the prediction is for a single instance
x = x.reshape(1, -1)

# Standardise the input with the scaler fitted earlier
x = scaler.transform(x)

predMosquitos = lin_reg.predict(x)

predMosquitos

Out[109]:

array([230.12553066])
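The prediction steps can be wrapped in a small function so that the feature order cannot be mixed up. This is a sketch only: `predict_mosquitos`, `demo_scaler` and `demo_model` are hypothetical names, and the demo objects are fitted on the five preview rows purely so that the block runs. A real predictor would pass in the `scaler` and `lin_reg` fitted above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

FEATURES = ["Rain", "Meant", "Mint", "Maxt"]

# Five preview rows, used only so that the sketch runs on its own.
train = pd.DataFrame(
    [
        [254.4, 0.0, 18.8, 12.2, 26.0],
        [273.5, 16.5, 21.1, 16.5, 28.4],
        [304.0, 27.0, 12.9, 8.9, 17.6],
        [256.2, 0.0, 15.7, 10.2, 20.6],
        [243.8, 7.5, 18.9, 10.2, 26.9],
    ],
    columns=["Mosquitos"] + FEATURES,
)
demo_scaler = StandardScaler().fit(train[FEATURES])
demo_model = LinearRegression().fit(
    demo_scaler.transform(train[FEATURES]), train["Mosquitos"]
)


def predict_mosquitos(rain, mean_t, min_t, max_t, scaler=demo_scaler, model=demo_model):
    """Hypothetical helper: scale one observation and return the predicted indicator."""
    row = pd.DataFrame([[rain, mean_t, min_t, max_t]], columns=FEATURES)
    return float(model.predict(scaler.transform(row))[0])


print(predict_mosquitos(0.0, 16.4, 10.4, 23.5))
```

Because the demo model is fitted on five rows, its output is not meaningful and will differ from the value shown above.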

Note: No feature selection or extraction was performed for this model. Such methods are needed when there are many features and some of them are redundant or irrelevant. Without them, the model may learn noise from those features and overfit or perform below its optimum.
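As an illustration of what such a step could look like, the sketch below applies scikit-learn's `SelectKBest` with `f_regression` to synthetic data containing one informative column, one near-duplicate of it and two unrelated columns. All data and names here are made up; this step was not part of the model above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200
mean_t = rng.normal(15, 10, n)
X_demo = np.column_stack([
    rng.exponential(3.0, n),            # rain-like column, unrelated to the target here
    mean_t,                             # informative column
    mean_t - 4 + rng.normal(0, 1, n),   # near-duplicate of mean_t (redundant)
    rng.normal(0, 1, n),                # pure noise (irrelevant)
])
y_demo = 10 * mean_t + rng.normal(0, 20, n)

# Keep the two columns with the strongest univariate relationship to the target.
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
mask = selector.get_support()  # boolean mask of the columns that were kept
print(mask)

# Univariate scores do not detect redundancy, so check pairwise correlation too.
corr = np.corrcoef(X_demo[:, 1], X_demo[:, 2])[0, 1]
print(corr)
```

The univariate filter removes irrelevant columns but keeps both temperature-like columns, which is why a correlation check is a useful complement when features such as mean, minimum and maximum temperature move together.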

Saving the model

In [95]:

import pickle

In [96]:

with open("lin_reg.pkl", "wb") as file:  # "wb" opens the file for writing in binary mode
    pickle.dump(lin_reg, file)
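Predictions also need the fitted scaler, so one option is to pickle the scaler and the model together and restore both later. The sketch below assumes that approach; the synthetic data, the dictionary keys and the file name `mosquito_model.pkl` are made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))  # synthetic stand-in for the climate features
y_demo = X_demo @ np.array([3.0, 1.0, -2.0, 0.5])

demo_scaler = StandardScaler().fit(X_demo)
demo_model = LinearRegression().fit(demo_scaler.transform(X_demo), y_demo)

# Save the scaler and the model in one file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "mosquito_model.pkl")
with open(path, "wb") as file:
    pickle.dump({"scaler": demo_scaler, "model": demo_model}, file)

# Later, for example inside a web app: load both and predict.
with open(path, "rb") as file:
    bundle = pickle.load(file)

sample = X_demo[:1]
restored = bundle["model"].predict(bundle["scaler"].transform(sample))
original = demo_model.predict(demo_scaler.transform(sample))
print(np.allclose(restored, original))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.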

To build a real-time predictor, the saved model can be wrapped in a web app using Flask (or Streamlit). Heroku is one platform that can host such applications.

References

Ebrahimi-Khusfi Z, Nafarzadegan AR & Dargahian F (2021). Predicting the number of dusty days around the desert wetlands in southeastern Iran using feature selection and machine learning techniques. Ecological Indicators 125, 107499.

Wang Z, Wang Y, Zeng R, Srinivasan RS & Ahrentzen S (2018). Random Forest based hourly building energy prediction. Energy and Buildings 171, 11–25.

Wu CH, Ho JM & Lee DT (2004). Travel-time prediction with support vector regression. IEEE Transactions on Intelligent Transportation Systems 5, 276–281.

Tharaka Wijerathna
