Statistical Learning for Predicting Dengue Fever Rate in Surabaya

: Dengue fever happening most in tropical countries and considered as the fastest spreading mosquito-borne disease which is endemic and estimated to have 96 million cases annually. It is transmitted by Aedes mosquito which infected with a dengue virus. Therefore, predicting the dengue fever rate as become the subject of researches in many tropical countries. Some of them use statistical and machine learning approach to predict the rate of the disease so that the government can prevent that incident. In this study, we explore many models in the statistical learning approaches for predicting the dengue fever rate. We applied several methods in the predictive statistics such as regression, spatial regression, geographically weighted regression and robust geographically weighted regression to predict the dengue fever rate in Surabaya. We then analyse the results, compare them based on the mean square error. Those four models are chosen, to show the global estimator’s approaches, e.g. regression, and the local ones, e.g. geographically weighted regression. The model with the minimum mean square error is regarded as the most suitable model in the statistical learning area for solving the problem. Here, we look at the estimates of the dengue fever rate in the year 2012, to 2017, area, poverty percentage, precipitation, number of rainy days for predicting the dengue fever outbreak in the year 2018. In this study, the pattern of the predicted model can follow the pattern of the true dataset.


Introduction
Dengue fever happening most in tropical countries and considered as the fastest spreading mosquitoborne disease, which is endemic and estimated to have 96 million cases annually. It is transmitted by Aedes mosquito which infected with a dengue virus. Several factors cause dengue fever: the failure to control the Aedes mosquito populations, uncontrolled urbanization, and high population growth [1]. Dengue Fever most commonly happens in the urban environment. Indonesia, with its tropical climate and high humidity, has a high possibility for Dengue transmission. Indonesia reported as the second largest with dengue fever cases among 30 endemic countries [2]. Dengue fever increased rapidly over the past 45 years in Indonesia, with victims shifting from young children to older age groups [3]. The increasing number of dengue fever cases is more likely followed by the spread of the cities infected in all 34 provinces in Indonesia. From a total of 497 cities in Indonesia, about 80% reported the dengue fever cases in 2017 [2]. Therefore, predicting the dengue fever outbreak has become the subject of researches in many tropical countries. Some of those researchers predicted 1 Faculty of Industrial Technology, Industrial Engineering Department, Petra Christian University, Jl. Siwalankerto 121-131, Surabaya, 60238, Indonesia Email: halim@petra.ac.id, felecia@petra.ac.id, tanti@petra.ac.id * Corresponding author dengue fever outbreak in Srilanka [4], in Thailand [5], in the Northwest Coast of Yucatan, Mexico and San Juan, Puerto Rico [6]. These groups of researchers modelled the dengue fever outbreak using neural network approaches. In Indonesia, Mahdiana et al. [7] predicted dengue hemorrhagic fever (DHF) using vector autoregressive spatial autocorrelation (varsa). Mahdiana et al. [7] used five years dataset from Sleman, a district in Central Java, Indonesia, to predict the DHF outbreaks. In the model, they include min, max and average temperature, average humidity, and rainfall and irradiation time.
Past research to predict Dengue Fever Spreading in Surabaya has been focused on the effect of climate variability [8]. All three-climate variability parameters: average temperature, rainfall, humidity shows only moderate correlation with dengue fever incidence. Ramadona, [9], also try to predict dengue fever based on disease surveillance and meteorological data. The model gave reasonably accurate results for outbreak periods, but not for non-outbreaks periods. It can predict only up to two months ahead. Widyaningrum [10], use a different approach to model dengue fever spreading using dynamics transmission vector model. It is used to build a model in the mosquito model (oviposition rate and pre-adult mosquito), infected and death cases in dengue fever. The model for death cases precision still needs improvement by adding some variables that influence dengue fever death cases. Nuryunarsih [11], try to find Sociodemographic Factors to Dengue Hemorrhagic Fever Case in Indonesia. The univariate regression analysis showed that population density and poverty had a significant correlation to Dengue Fever, but the variable of age under 15 years old did not have any positive correlation.
In Surabaya's case, additional to the weather condition, we also explore the population density, the precipitation and the poverty percentage as the factors that may affect the DHF. In the previous study, we [12] used the geographically weighted regression to predict the dengue fever outbreak in Surabaya, to continue the exploration, in this paper, we model the outbreak prediction using statistical learning approaches.

Descriptive Analysis
We describe the data by clustering the location based on the dengue fever rate, poverty rate and the number of rainy days. We use partition around medoids clustering. Partitioning around medoids (PAM) is a clustering algorithm which is based on k-medoids instead of k-means. In the k-medoids clustering, each cluster is represented by one of the data points in the cluster. These points are named as cluster medoids. A medoid is an object within a cluster in which average dissimilarity between the point and all other members of the cluster is minimal [13]. The k-medoids is a robust alternative to k-means clustering. The PAM algorithm [14], is the most common algorithm for performing the k-medoids clustering. It consists of five steps: (1) Select k object to become a medoid, (2) calculate the dissimilarity matrix, (3) assign each object to its closest medoid, (4) for each cluster search if any object of the cluster can decrease the average dissimilarity coefficient if it does, select the object, (5) if there is one medoid changed then go to (3) else end the algorithm.

Statistical Learning
There are many approaches to modelling the dengue fever outbreak. They can be classified into two approaches, i.e., machine learning and statistical learning. In this study, we are exploring the statistical learning approach to find the most suitable model for predicting the DHF outbreaks. We applied several methods in predictive statistics such as regression, spatial regression, geographically weighted regression, and robust geographically weighted regression to predict the dengue fever outbreak in Surabaya. We then analyse the results, compare them based on the mean square error. Those four models are chosen, to show the global estimator's approaches, e.g. regression, and the local ones, e.g. geographically weighted regression. The model with the minimum mean square error is regarded as the most suitable model in the statistical learning area for solving the problem. We also test the spatial correlation of the dengue fever outbreak rate in each Puskesmas; we used the spatial local Moran I statistics. Anselin [15] suggested the local Moran's I statistics for identifying local clusters and local spatial outliers. The Local Moran's I statistics can be formulated as: where y is the variable of interest. is the weight for the , observations. We then modeled the data based on the statistical learning. The first model is the well-known regression model that can be formulated as where is the independent variables, is the number of dengue fever infected in each location, is the global parameter and is the random error. The global parameter means for the whole locations they will have the same . In the regression model, we all know that the error term should be independent. Moreover, in the regression, the spatial dependencies of the dataset do not appear in the model. Therefore, to accommodate those spatial dependencies, the spatial models should be considered.
First, we consider the first-order Spatial Autoregressive (SAR) model where is the spatial weigh matrix, (see Anselin [15] for the detail) Another spatial class model is Spatial Error Model where is identity matrix, is a scalar parameter, is the spatial weight matrix, ~(0, 2 ) is a vector of disturbance (see Anselin [15] for the detail) The last spatial class model used in this paper is Spatial Durbin Model (SDM). The SDM concerns about the spatial heterogeneity (see Anselin [15] for the detail) But and = + are correlated.
The spatial autoregressive models [16] have assumption that the structure of the models remains constant, i.e., there is no local variations in the parameter estimates. The GWR [17] allows the estimated parameters vary locally.
The geographically weighted regression models [17] can be formulated as where is the location in which the local parameters will be estimated.
The is the parameters at the location and can be estimated as where is the weight for the observation at location and location and formulated as the Gaussian function The is the Euclidean distance between the location and location , while ℎ is the bandwidth. The bandwidth ℎ can be selected such that the root mean square prediction error is minimum.
To identify and to reduce the effects of the outliers in GW regression, then various robust GW regression has been proposed. Two of them are described in [1717]. The first robust model re-fits a GW regression with a filtered data set that has been found by removing observations that correspond to large externally studentised residuals of an initial GW regression fit. An externally studentised residual for each regression location is defined as [18]: where is the residual at location ; ̂− is a leaveone-out estimate of ̂; and is the th element of ( − )( − ) . Observations are deemed outlying and filtered from the data if they have | | > 3. The second robust model, iteratively down-weights observations that correspond to large residuals. This (nongeographical) weighting function on the residual is typically taken as: Observe that both approaches have an element of subjectivity, where the filtered data approach depends on the chosen residual cut-off and the iterative (automatic) approach depends on the chosen downweighting function, with its associated cut-offs.

Results and Discussions
The data were collected data from 63 community health centers (pusat kesehatan masyarakat) in Surabaya. Community health centers are government mandated community health clinics located across Indonesia [19] and provide healthcare for the population on sub-district level (Kelurahan). In Surabaya, each community health center provides healthcare for one up to two sub-districts level. The community health center recorded diseases the often affects the community. One of the diseases is dengue fever. Surabaya government focus more on preventive action to reduce dengue fever outbreak. Therefore, the health center will promote dengue prevention through environmental cleaning programs, especially during the wet season [20]. This study will help the Surabaya government to predict outbreaks in the neighborhoods of the center of DHF outbreak.     We collected data of the number of rainy days in a year, precipitation, maximum and minimum temperature, maximum and minimum humidity, population density, and poverty percentage in each Surabaya's district [21]. The weather is collected from the Perak II Meteorology Station Surabaya. The Data in reported in Surabaya in numbers (Surabaya dalam Angka [21]) monthly. The poverty percentage is calculated per family. It is the percentage of total family in a sub-district to the number of considered as poor families by the government.

Clustering
First, we assumed there is any relationship between poverty and the number of dengue fever incidence (DFI) or the dengue fever rate. To prove our assumption, we then cluster the location based on the dengue fever rate (DFR), and the poverty percentage (based on [21]). The DFR is calculated as the number of dengue fever infected in the location at year divided by populations in the same location at the same year for 10,000 people, i.e.: While the number of optimal clusters is determined based on the optimum average silhouette width. Rousseeuw [22] defined the silhouette value as a measure of similarity of an object to its own cluster compare to the other clusters. Based on the optimum average silhouette, we get the optimal number of clusters is two (see Figure 2). We then used the partitioning around medoids with the number of clusters equal to two to cluster the regions with respect to the poverty percentage, DFI dan DFR. As a result, we get a Surabaya map which is clustered based on the poverty percentage, DFI dan DFR ( Figure 3).
We then test the hypothesis to prove the DFR and DFI between those two clusters are significantly differrent. The P-value of t-Test: Two-Sample shows that nor mean value of DFR or DFI 2018 is significantly different (see Table 2). We can conclude that poverty percentage does not influence the number of dengue fever incidence nor dengue fever rate.
In 2018, the three highest DFR was Tambak Wedi, Putat Jaya and Medokan Ayu. In Putat Jaya there were 9 DF infected, but since the population only 15155 people then per 10000 citizens, the DFR is 5.93. While in Putat Jaya, the DFI is 17 (the highest one in number), since the population is 44913 people, then per 10000 citizens the DFR is only 3.78. It is lower compared to Tambak Wedi. For Medokan Ayu, the DFI is 16, the population is 57647, and the DFR is 2.78 per 10000 citizens.
Secondly, we also assumed that precipitation and number of rainy days influence the DFI and DFR (see Figure 4). Using a similar approach, we have three clusters, and the summary of the clusters is given in

Global Linear and Spatial Model
The summary statistics and the clustering above give us a clear description of the dengue fever rate and the number of infected persons that occurred in Surabaya. In this section, we want to explore the global and spatial models to predict the DFR.
At the first step, we use the global Moran's I statistics to test the data are under randomization (Ho) or have spatial dependencies. The test shows that the data significantly have spatial dependencies (p-value = 0.0154. To see which districts are spatially correlated strongly, we then use the local Moran's I statistics. Figure 5 shows that there are two districts which Since the data has dependencies spatially globally and locally, we then use the spatial models for predicting the DFR. To predict the dengue fever rate in 2018 globally, we used simple linear regression. We regressed the DBR2018 to the area, poverty percentage, precipitation, number of rainy days and the dengue fever rate from 2012-2017. As a result, the global model shows only estimated parameters of DFR in 2013, 2016 and 2017 are significant for predicting the DFR2018. The models also not good, since the R2 only 37 percent. So, we cannot use this model for predicting the DFR 2018. The linear regression and the spatial models give similar results (Table 4) shows the summary statistics of those models. Among external parameters used to models the dengue fever rate in 2018, only three variables are significant, they are, DFR 2013, DFR 2016 dan DFR 2017. The external variables such as the area, poverty percentage, precipitation, number of rainfall days rate are not significant.
The p-value of Moran residual test shows that for all models, the residual is randomly distributed, i.e., the spatial process promoting the residual pattern of values is a random chance.
To improve the model performance, we then use the Geographically Weighted Regression (GWR) to model the Dengue Fever Outbreak Rate.

Geographically Weighted Regression
The geographically weighted regression (GWR) models permit the parameters to estimate locally in each district in which the community health centers locate. Table 5 presents the summary of GWR coefficient estimates at data points. The number of rainfall days, precipitation, max and min humidity, and temperature are not varied too much since those community health centers are in the same climate. Therefore, we only look at the local coefficient estimates at the dengue fever rate (DFR) 2016 and 2017 and poverty percentage. The global parameter weight resulting from the GWR model is the same as the parameter weight resulting from the linear model (compare Table 4 and Table 5).
The local coefficients estimate for each explanatory variable are depicted in Figure 5.
In this research, we decided to divide the local coefficients into three intervals. We want to analyze the low, middle and high local coefficient intervals with respect to the regions. It is well known that the absolute value of the coefficient parameter indicates the strongness of the relationship between the response to the respective explanatory variables. In contrast, the sign of the coefficient parameter indicates the direction of that relationship. Therefore, in this research, we used two different ways in defining the interval: (1) If the whole local coefficients are positive or negative, we then divide the local coefficient into three intervals, from the lowest coefficient (in absolute value) to the highest one (in absolute value).
(2) If some of the coefficients are negative, and some of them are positive, we divide the interval from the most negative to zero, and then zero to the highest positive. Between those two intervals, we divide longer one into two again so that in overall, we will have three intervals.
The interval is calculated from the min local coefficient -max local coefficient divided by three (See Table 4). For the Poverty percentage, the local coefficients are varied from -0.0149 to -0.0016. All local coefficients are negative. This sign shows that the highest the poverty percentage the lowest the estimate DFR2018 will be and vice versa. The color map of local coefficients with respect to the poverty percentage (PP) is presented in Figure 6a. As an example, we highlighted three regions in the red area with high poverty percentage but low DFR2018, and low  Jemursari in 2018 is predicted as 0.53. In overall, the mean square error of the forecast is 0.4368.

Conclusion
In this paper, we described the influences of poverty percentage, precipitation and number of rainy days to the 2018 dengue fever rate in Surabaya using partitioning around medoids approach. We then explored statistical learning to predict the DFR2018 using external factors and the DFR2012-DFR2017. The geographically weighted regression in the best model for solving this problem compared to the linear regression model and the spatial models. We also studying the characteristics of local parameters estimate with respect to poverty percentage and precipitation. There is no positive correlation between poverty to the DFR as we presumed in the beginning of the study, in fact the area with high poverty percentage yet has lower DFR compare to the area with the lowest poverty percentage. However, the precipitation has negative correlation even though it is not significant. It's meant the areas with high precipitation tend to have low DFR. This situation fit to the nature of the mosquitos. Mosquitoes breed in pools of water. As a result, we hope that the Surabaya public health officials can use this information to prevent the spread of dengue fever in the coming year.
In future research, we will use the machine learning approach for solving the problems and present the result interactively as an app.