Machine learning for vector control decision making¶
Dataset¶
This data set is collected by KY Lee and the group from Seoul city, South Korea. It consist of the number of mosquitos per specific area in Seoul city.¶
The data has been collected daily from May 2016- December 2019.¶
Authors has made this data set freely available at https://www.kaggle.com/kukuroo3/mosquito-indicator-in-seoul-korea.¶
Question¶
Lets assume that the Seoul city health officials need to decide whether the indoor residual spraying (IRS) is necessary or not based on climatic factors. They need to know the number of mosquitos per specific area (mosquito indicator) for this. In this sample project I will develop a machine learning model to predict the mosquito indicator value by feeding climate data.¶
Importing required python libraries¶
In [22]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler #To standardize the data to a common range
from sklearn.model_selection import train_test_split #to split data
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
Loading the dataset as a pandas dataframe¶
In [23]:
mosData = pd.read_csv('mosIndicator.csv')
Exploring the data¶
In [24]:
mosData.head()
Out[24]:
| date | mosquito_Indicator | rain(mm) | mean_T(℃) | min_T(℃) | max_T(℃) | |
|---|---|---|---|---|---|---|
| 0 | 2016-05-01 | 254.4 | 0.0 | 18.8 | 12.2 | 26.0 |
| 1 | 2016-05-02 | 273.5 | 16.5 | 21.1 | 16.5 | 28.4 |
| 2 | 2016-05-03 | 304.0 | 27.0 | 12.9 | 8.9 | 17.6 |
| 3 | 2016-05-04 | 256.2 | 0.0 | 15.7 | 10.2 | 20.6 |
| 4 | 2016-05-05 | 243.8 | 7.5 | 18.9 | 10.2 | 26.9 |
In [25]:
mosData.describe()
Out[25]:
| mosquito_Indicator | rain(mm) | mean_T(℃) | min_T(℃) | max_T(℃) | |
|---|---|---|---|---|---|
| count | 1342.000000 | 1342.000000 | 1342.000000 | 1342.000000 | 1342.000000 |
| mean | 251.991803 | 3.539866 | 14.166021 | 10.005663 | 19.096870 |
| std | 295.871336 | 13.868106 | 10.943990 | 11.109489 | 11.063394 |
| min | 0.000000 | 0.000000 | -14.800000 | -17.800000 | -10.700000 |
| 25% | 5.500000 | 0.000000 | 4.500000 | 0.300000 | 9.300000 |
| 50% | 91.900000 | 0.000000 | 16.500000 | 11.500000 | 21.900000 |
| 75% | 480.400000 | 0.400000 | 23.300000 | 19.500000 | 28.175000 |
| max | 1000.000000 | 144.500000 | 33.700000 | 30.300000 | 39.600000 |
In [26]:
#### Changing column names
mosData = mosData.rename(columns={"date": "Date", 'mosquito_Indicator': 'Mosquitos', 'rain(mm)': 'Rain', "mean_T(℃)": "Meant", "min_T(℃)": "Mint", "max_T(℃)": "Maxt"})
In [27]:
mosData.head()
Out[27]:
| Date | Mosquitos | Rain | Meant | Mint | Maxt | |
|---|---|---|---|---|---|---|
| 0 | 2016-05-01 | 254.4 | 0.0 | 18.8 | 12.2 | 26.0 |
| 1 | 2016-05-02 | 273.5 | 16.5 | 21.1 | 16.5 | 28.4 |
| 2 | 2016-05-03 | 304.0 | 27.0 | 12.9 | 8.9 | 17.6 |
| 3 | 2016-05-04 | 256.2 | 0.0 | 15.7 | 10.2 | 20.6 |
| 4 | 2016-05-05 | 243.8 | 7.5 | 18.9 | 10.2 | 26.9 |
In [28]:
# Checking for missing values
mosData.isnull().values.any()
#If we had null values
# mosData = mosData.dropna()
Out[28]:
False
In [42]:
#If the data had very low values for some categories we can remove them with following function
#def drop_lows(category, cutoff):
# categorical_map = {}
# for i in range(len(category)):
# if category.values[i] >= cutoff:
# categorical_map[category.index[i] = category.index[i]]
# else:
# categorical_map[category.index[i] = "Other"
# return categorical_map
In [43]:
# Area is a hypothetical column to illustrate how to do with different datasets
# country_map = drop_lows(mosData.Area.value_counts(), 400)
# mosData["Area"] = mosData["Area"].map(country_map)
In [44]:
# Removing outliers - as evident from boxplots
#fig, ax=plt.subplot(1,1,figsize=(12,7))
#mosData.boxplot["Mosquitos", "Area", ax=ax]
#plt.subtitle("Mosquitos vs Country")
#plt.title("")
#plt.ylabel("Mosquitos")
#plt.xticks("rotation=90")
#plt.show()
In [45]:
# If there are outliers There may be dots outside the boxes
# If the boxes of the box plot (including error bars) do not
# go beyond 700 and 50. Lets remove those outliers as well
#mosData = mosData[mosData["Mosquitos"] <= 700]
#mosData = mosData[mosData["Mosquitos"] <= 20]
#mosData = mosData[mosData["Area"] = "Other"]
Separating the data and the features¶
In [29]:
xtemp = mosData.drop("Mosquitos", axis = 1)
x = xtemp.drop("Date", axis = 1)
y = mosData["Mosquitos"]
* Both the categories has reasonable counts.¶
* All the values are numerical and do not need to convert¶
Data standardisation¶
The rainfall is in a different range than the temperatures. Thus, the data must be standardised.
In [30]:
scaler = StandardScaler() # Importing standard scaler function to a variable named scaler
In [31]:
scaler.fit(x) #Fitting and transforming x to the standard scaler function
Out[31]:
StandardScaler()
In [32]:
st_data_x = scaler.transform(x)
In [33]:
x = st_data_x
Splitting train and test datasets¶
In [41]:
X_train, X_test, Y_train, Y_test = train_test_split(x,y, test_size = 0.2)
# Test size 20% of the data
# Using stratify will maintain similar amounts of 1s and 2s of mosquito numbers
Training the model¶
Selecting the correct algorithm require the consideration of several factors including the type of target variable, size of the data set. Here the target variable is an exact number. Thus,I can use a regression or a classification algorithm. I will use linear regression, lasso regression, support vector regression (Wu et al 2020), random forest (Ebrahimi-Khusf et al 2021), Stochastic gradient boosting (Ebrahimi-Khusf et al 2021) and will compare which one is best.¶
In [98]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_squared_error
1. Linear regression¶
In [99]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)
Ytrpreds = lin_reg.predict(X_train)
Ytestpred = lin_reg.predict(X_test)
trerror = np.sqrt(mean_squared_error(Y_train, Ytrpreds))
testerror = np.sqrt(mean_squared_error(Y_test, Ytestpred))
print(trerror)
print(testerror)
192.98943977171356 202.67642370681074
2. Lasso Regression¶
In [100]:
lasso = Lasso()
lasso.fit(X_train, Y_train)
lasso_Ytrpreds = lasso.predict(X_train)
lasso_Ytestpred = lasso.predict(X_test)
lasso_trerror = np.sqrt(mean_squared_error(Y_train, lasso_Ytrpreds))
lasso_testerror = np.sqrt(mean_squared_error(Y_test, lasso_Ytestpred))
print(lasso_trerror)
print(lasso_testerror)
193.17517395879298 203.53583520781154
3. SVR¶
In [101]:
svr = SVR(kernel="linear", C=100, gamma="auto")
svr.fit(X_train, Y_train)
svr_Ytrpreds = svr.predict(X_train)
svr_Ytestpred = svr.predict(X_test)
svr_trerror = np.sqrt(mean_squared_error(Y_train, svr_Ytrpreds))
svr_testerror = np.sqrt(mean_squared_error(Y_test, svr_Ytestpred))
print(svr_trerror)
print(svr_testerror)
194.02301601725773 206.40789365465545
4. RF¶
In [102]:
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
rf.fit(X_train, Y_train)
rf_Ytrpreds = rf.predict(X_train)
rf_Ytestpred = rf.predict(X_test)
rf_trerror = np.sqrt(mean_squared_error(Y_train, rf_Ytrpreds))
rf_testerror = np.sqrt(mean_squared_error(Y_test, rf_Ytestpred))
print(rf_trerror)
print(rf_testerror)
62.597274333213534 165.61398484014964
5. Stochastic gradient boosting¶
In [103]:
sgb = GradientBoostingRegressor(n_estimators = 1000, random_state = 0)
sgb.fit(X_train, Y_train)
sgb_Ytrpreds = sgb.predict(X_train)
sgb_Ytestpred = sgb.predict(X_test)
sgb_trerror = np.sqrt(mean_squared_error(Y_train, sgb_Ytrpreds))
sgb_testerror = np.sqrt(mean_squared_error(Y_test, sgb_Ytestpred))
print(sgb_trerror)
print(sgb_testerror)
33.115401993709604 184.6828138793723
Linear regression shows the best performance among all the algorithms.¶
Creating a prediction system using linear regression¶
In [109]:
x = np.array([[0.0,16.4,10.4,23.5]]) # If we paste the variables rainfall (0.0),
# mean temperature (20.0), minimum temperature (14.8), maximum temperature (27.4) here we can get the predictions as below
# reshaping the array as I predict only for one instance
x = x.reshape(1,-1)
# standardizing the data
x = scaler.transform(x)
predMosquitos = lin_reg.predict(x)
predMosquitos
Out[109]:
array([230.12553066])
Note: In this model the features are not selected or extracted. Such method is required if we have a large number of features and if they are redundant or irrelevant. If not it might cause the model to take the noise of the features to learn and it may be overfitting or perform below optimum level.¶
Saving the model¶
In [95]:
import pickle
In [96]:
with open("lin_reg.pkl", "wb") as file:
pickle.dump(lin_reg, file)
#wb-write binary
