Using hierarchical clustering of time series for variable selection in dengue forecasting


Flávio Codeço Coelho and Elisa Mussumeci


Praia de Botafogo, 190 - sala 317


October 26, 2017, at 4 p.m.

Forecasting dengue incidence time series is a difficult task due to its complex dynamics. These dynamics result from many exogenous factors, which include but are not limited to seasonal weather patterns, the unpredictable arrival of new cases, and variations in the immunological structure of the population.

The InfoDengue system is an online data analysis system focused on the dynamics of arboviral diseases. It integrates meteorological and epidemiological data, as well as social network data, to provide nowcasting of cases and a 4-level alert system for dengue and chikungunya. InfoDengue currently serves more than 500 Brazilian cities in three states.

The nowcasting provided by InfoDengue is already of great value to health authorities, since the recording of reported cases may be delayed by up to 3 months. Nevertheless, forecasting the weekly incidence series has been a goal of InfoDengue since the beginning, and is frequently requested by stakeholders.

It is well established that the spatial dimension is very important for understanding the dynamics of infectious diseases. Many infectious diseases display clear spatial patterns of spread which, when properly monitored, may provide early warning to localities on the path of that spread. Even for endemic diseases with seasonal dynamics, such as dengue, the spatio-temporal dynamics of exogenous factors such as weather can help to anticipate the seasonal increase in incidence.

Since the InfoDengue system maintains hundreds of continually updated incidence time series, it is easy to use non-local time series as exogenous variables for the forecast of any given series. However, choosing which series are informative for the dynamics of a given series is not a simple problem. The full model, i.e., using all the other series to predict each one, is not a good idea: with hundreds of candidate predictors per city, it would be computationally heavy and prone to overfitting.

In this paper we discuss the use of hierarchical clustering of all the time series in InfoDengue to determine an optimal set of exogenous series for the forecast of each locality. We clustered the incidence time series within each state, using a distance metric based on the correlation between the series and complete linkage as the clustering criterion. The clustering led to the formation of well-defined clusters with 2 to 15 cities each.
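The clustering step described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how correlation was converted into a distance, so the sketch assumes 1 minus the Pearson correlation; the cut threshold of 0.5, the synthetic data, and the function names `cluster_series` and `exogenous_for` are likewise hypothetical and not taken from InfoDengue.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_series(series, threshold=0.5):
    """Cluster the rows of `series` (one time series per row).

    Assumption: distance = 1 - Pearson correlation. The tree is built with
    complete linkage and cut at `threshold` (an illustrative value).
    Returns one integer cluster label per series.
    """
    corr = np.corrcoef(series)
    dist = np.clip(1.0 - corr, 0.0, 2.0)   # guard against rounding noise
    dist = (dist + dist.T) / 2.0           # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="complete")
    return fcluster(tree, t=threshold, criterion="distance")


def exogenous_for(labels, target):
    """Indices of the other series that share a cluster with `target`."""
    return [i for i, lab in enumerate(labels)
            if lab == labels[target] and i != target]


# Synthetic stand-in for weekly incidence: two groups of three "cities",
# each group following a different seasonal phase, plus a little noise.
rng = np.random.default_rng(0)
weeks = np.arange(104)
phase_a = np.sin(2 * np.pi * weeks / 52)
phase_b = np.cos(2 * np.pi * weeks / 52)
series = np.vstack(
    [phase_a + 0.1 * rng.standard_normal(weeks.size) for _ in range(3)]
    + [phase_b + 0.1 * rng.standard_normal(weeks.size) for _ in range(3)]
)

labels = cluster_series(series)
neighbours_of_city_0 = exogenous_for(labels, 0)
```

Here `neighbours_of_city_0` holds the candidate exogenous series for the first city, namely the other members of its cluster. With real data the threshold would need tuning per state to obtain clusters of a usable size.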

A Long Short-Term Memory (LSTM) model was then trained to generate forecasts for each city, using the cities in its cluster as exogenous series. Performance aspects of this model will be discussed. We found that the proposed method of variable selection is light enough to be used in an online forecasting system and yields models that forecast incidence better than the univariate base model.
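To make the role of the exogenous series concrete, the sketch below shows one common way to arrange a target series and its cluster neighbours into the (samples, time steps, features) array that recurrent models such as LSTMs typically consume. It covers only the data preparation, not the network itself. The function name `make_windows`, the look-back of 4 weeks, and the one-week horizon are assumptions for illustration; the abstract does not specify these settings.

```python
import numpy as np


def make_windows(target, exogenous, look_back=4, horizon=1):
    """Build sliding-window inputs and targets for a sequence model.

    target:    1-D array of length T (the city being forecast).
    exogenous: 2-D array of shape (n_exog, T) (series from its cluster).
    Returns X with shape (samples, look_back, 1 + n_exog) and y with shape
    (samples,), where y is the target value `horizon` steps after the
    last time step of each window.
    """
    target = np.asarray(target, dtype=float)
    exogenous = np.asarray(exogenous, dtype=float).reshape(-1, target.size)
    # Column 0 is the target itself; the remaining columns are exogenous.
    features = np.vstack([target[np.newaxis, :], exogenous]).T
    n_samples = target.size - look_back - horizon + 1
    X = np.stack([features[i:i + look_back] for i in range(n_samples)])
    y = np.array([target[i + look_back + horizon - 1]
                  for i in range(n_samples)])
    return X, y


# Toy data: a target series and two exogenous series of 10 weeks each.
toy_target = np.arange(10.0)
toy_exog = np.vstack([toy_target * 10, toy_target * 100])
X, y = make_windows(toy_target, toy_exog, look_back=4, horizon=1)

# Univariate baseline: same windows, but with no exogenous series.
X_uni, y_uni = make_windows(toy_target, np.empty((0, toy_target.size)))
```

Passing an empty exogenous array gives the input for a univariate baseline, so both models can be compared on identical windows and targets.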

*Text provided by the author.


Flávio Codeço Coelho is a professor at the School of Applied Mathematics (Escola de Matemática Aplicada, EMAp) of FGV, where he conducts research in mathematical, statistical, and computational modeling. He was an adjunct professor at UERJ (2001 to 2003) and a visiting researcher at Fundação Oswaldo Cruz (2003 to 2008). He held a postdoctoral position at the Instituto Gulbenkian de Ciência, in Portugal, between 2008 and 2010. Elisa Mussumeci is a master's student at EMAp, where she conducts research on the use of neural networks to understand epidemics.