Effect of instance selection algorithms on prediction error of numerical variables
Abstract
The main objective of this study is to analyze the effect of instance selection (IS) algorithms on the prediction error in regression tasks with machine learning. Six algorithms were evaluated: four from the literature and two new variants of one of them. Noise of different percentages and magnitudes was added to the output variable of 52 datasets to evaluate the algorithms. The results show that not all IS algorithms are effective. RegENN and its variants reduce the prediction error (RMSE) of the regression task in most datasets when the percentage and magnitude of noise are high. However, when the magnitude and percentage of noise are lower, for example 10%-10%, 50%-10%, or 10%-30%, there is no evidence of improvement in most datasets. Additional results are presented to answer four new questions about the performance of the algorithms.
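To make the experimental setup concrete, below is a minimal Python sketch (not the authors' code) of how output noise of a given percentage and magnitude could be injected into a regression target, and how an ENN-style instance-selection filter for regression, in the spirit of RegENN, could be compared against no selection via RMSE. The noise model (corrupting a fraction of targets by a multiple of the target's standard deviation), the threshold parameter alpha, and the k-NN regressor are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def add_output_noise(y, percentage, magnitude):
    """Corrupt a given percentage of targets by +/- magnitude * std(y) (assumed noise model)."""
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(percentage * len(y)), replace=False)
    y_noisy[idx] += rng.choice([-1.0, 1.0], size=len(idx)) * magnitude * y.std()
    return y_noisy

def enn_regression_filter(X, y, k=5, alpha=2.0):
    """ENN-style filter for regression: drop an instance whose target deviates
    from its neighbours' mean by more than alpha times their standard deviation."""
    nn = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    # k+1 neighbours, then drop the first column (each point is its own nearest neighbour).
    neigh = nn.kneighbors(X, n_neighbors=k + 1, return_distance=False)[:, 1:]
    keep = np.array([abs(y[i] - y[j].mean()) <= alpha * y[j].std()
                     for i, j in enumerate(neigh)])
    return X[keep], y[keep]

# Toy data standing in for one dataset; 30% of targets corrupted with magnitude 0.5.
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
y_noisy = add_output_noise(y, percentage=0.3, magnitude=0.5)

X_sel, y_sel = enn_regression_filter(X, y_noisy)
for name, (Xt, yt) in {"no selection": (X, y_noisy),
                       "with selection": (X_sel, y_sel)}.items():
    pred = KNeighborsRegressor(n_neighbors=5).fit(Xt, yt).predict(X)
    print(name, "RMSE:", np.sqrt(mean_squared_error(y, pred)))
```

In this sketch the RMSE is computed against the clean targets, so a lower value with selection indicates that the filter removed noisy instances rather than useful ones; the paper's study performs this comparison systematically across 52 datasets and several noise settings.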
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The authors retain copyright and grant the journal the right of first publication, together with the right to edit, reproduce, distribute, display, and communicate the work nationally and abroad through print and electronic media. They also assume responsibility for any litigation or claim related to intellectual property rights, releasing the Editorial Tecnológica de Costa Rica from liability. In addition, the authors may enter into separate, additional contractual arrangements for the non-exclusive distribution of the version of the article published in this journal (e.g., depositing it in an institutional repository or publishing it in a book), provided they clearly indicate that the work was first published in this journal.