- Glauco Gomes de Azevedo
In this work we propose a methodology to estimate the pairwise distance between observations of mixed continuous and categorical data with missing values. Distance estimation is the basis for many regression and classification methods, such as nearest neighbors and discriminant analysis, and for clustering techniques such as k-means and k-medoids. Classical methods for handling missing data rely on mean imputation, which can underestimate the variance, or on regression-based imputation. Unfortunately, when the goal is to estimate the distance between observations, data imputation may perform badly and bias the results toward the imputation model. In this work we estimate the pairwise distances directly, treating the missing data as random. The joint distribution of the data is approximated using a multivariate mixture model for mixed continuous and categorical data. We present an EM-type algorithm for estimating the mixture and a general methodology for estimating the distance between observations. Experiments show that the proposed method performs well on both simulated and real data.
*Text submitted by the student.
Committee members:
- Eduardo Fonseca Mendes (advisor) - FGV/EMAp
- Renato Rocha Souza - FGV/EMAp
- Carlos Eduardo Ribeiro de Mello - UNIRIO
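
As a rough illustration of the idea summarized in the abstract, the sketch below computes an expected squared Euclidean distance between two observations with missing entries instead of imputing them first. It is not the method of the thesis: it assumes a single multivariate Gaussian rather than a mixture, handles continuous variables only, takes the parameters `mu` and `cov` as known rather than estimated by EM, and treats the missing parts of the two observations as independent. The function names `conditional_gaussian` and `expected_sq_distance` are hypothetical.

```python
import numpy as np

def conditional_gaussian(x, mu, cov):
    """Conditional mean and covariance of x given its observed entries.

    Missing entries are encoded as NaN. Assumes x ~ N(mu, cov), which is
    a simplification made for this sketch only.
    """
    x = np.asarray(x, dtype=float)
    miss = np.isnan(x)
    obs = ~miss
    mean = x.copy()
    var = np.zeros_like(cov, dtype=float)  # observed entries have zero variance
    if miss.any():
        if obs.any():
            # Standard Gaussian conditioning on the observed block
            k = cov[np.ix_(miss, obs)] @ np.linalg.inv(cov[np.ix_(obs, obs)])
            mean[miss] = mu[miss] + k @ (x[obs] - mu[obs])
            var[np.ix_(miss, miss)] = (
                cov[np.ix_(miss, miss)] - k @ cov[np.ix_(obs, miss)]
            )
        else:
            # Nothing observed: fall back to the marginal distribution
            mean[miss] = mu[miss]
            var[np.ix_(miss, miss)] = cov[np.ix_(miss, miss)]
    return mean, var

def expected_sq_distance(x, y, mu, cov):
    """E||X - Y||^2, with missing parts of x and y assumed independent."""
    mx, vx = conditional_gaussian(x, mu, cov)
    my, vy = conditional_gaussian(y, mu, cov)
    # Squared distance between conditional means plus the total
    # conditional variance of the missing coordinates.
    return float(np.sum((mx - my) ** 2) + np.trace(vx) + np.trace(vy))

mu = np.zeros(2)
cov = np.eye(2)
x = np.array([1.0, np.nan])
y = np.array([0.0, 0.0])
d2 = expected_sq_distance(x, y, mu, cov)
```

In this toy case, plugging in the conditional mean for the missing coordinate and then measuring the distance would give a squared distance of 1, whereas the expected squared distance also includes the conditional variance of the missing coordinate. This is one way to see why imputing first can understate distances.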