Comparative analysis of traditional methods and a deep learning approach for multivariate imputation of missing values in the meteorological field
Abstract
Climate observations are the groundwork for several real-world applications such as weather forecasting, climate change monitoring, and environmental impact assessments. However, the data are mostly measured and recorded by external devices exposed to numerous factors that cause malfunctions and, therefore, missing values. Data imputation for time series has been researched in depth and a wide variety of methods have been proposed. Traditional classification and regression algorithms predominate, although there are also deep learning approaches that manage to capture temporal relationships between observations. This article presents a comparative analysis of a classification imputation algorithm, a regression imputation algorithm, and a deep learning imputation model: the MissForest algorithm, based on random forests; Expectation Maximization with Bootstrap (EMB), a maximum likelihood estimation algorithm; and a proposed deep learning model based on the Long Short-Term Memory (LSTM) architecture. The data used come from the Costa Rican meteorological field and consist of multivariate observations from several weather stations in the same geographical area.
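A comparison of imputation methods of this kind is commonly scored by hiding values that were actually observed, imputing them, and measuring the error on the hidden positions. The following minimal sketch illustrates that idea only. It assumes a mask-and-score protocol with RMSE, which the abstract does not state, and it uses synthetic univariate data with two simple stand-in imputers (mean and linear interpolation) rather than MissForest, EMB, or the proposed LSTM model. All function names are hypothetical.

```python
import math
import random


def mask_values(series, fraction, rng):
    """Hide a random fraction of interior points (set to None); return copy and hidden indices."""
    candidates = list(range(1, len(series) - 1))  # keep endpoints so interpolation is defined
    hidden = sorted(rng.sample(candidates, int(fraction * len(candidates))))
    masked = list(series)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def impute_mean(masked):
    """Stand-in imputer: replace each gap with the mean of the observed values."""
    observed = [v for v in masked if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in masked]


def impute_linear(masked):
    """Stand-in imputer: linear interpolation between the nearest observed neighbours."""
    filled = list(masked)
    known = [i for i, v in enumerate(masked) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            weight = (i - left) / (right - left)
            filled[i] = masked[left] + weight * (masked[right] - masked[left])
    return filled


def rmse(truth, imputed, hidden):
    """Root mean squared error, computed only on the artificially hidden positions."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in hidden) / len(hidden))


rng = random.Random(0)
# Synthetic hourly "temperature": a daily cycle plus small noise (made-up data, not the study's)
truth = [22 + 5 * math.sin(2 * math.pi * t / 24) + rng.gauss(0, 0.3) for t in range(240)]
masked, hidden = mask_values(truth, 0.2, rng)

scores = {
    "mean": rmse(truth, impute_mean(masked), hidden),
    "linear": rmse(truth, impute_linear(masked), hidden),
}
for name, score in scores.items():
    print(f"{name}: RMSE = {score:.3f}")
```

Because the error is computed only where the true value is known, any imputer, including multivariate ones that draw on neighbouring stations, can be scored on the same hidden positions and compared directly.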
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The authors retain copyright and grant the journal the right of first publication, as well as the right to edit, reproduce, distribute, exhibit, and communicate the work in the country and abroad through print and electronic media. They also assume responsibility for any litigation or claim related to intellectual property rights, releasing Editorial Tecnológica de Costa Rica from liability. In addition, authors may enter into separate, additional contractual arrangements for the non-exclusive distribution of the version of the article published in this journal (e.g., including it in an institutional repository or publishing it in a book), provided that they clearly state that the work was first published in this journal.