Improving Balanced Accuracy for Minority Plant Species under Data Imbalance
Abstract
Despite the well-known success of deep learning in classification, such models are
commonly evaluated with metrics that do not account for data imbalance. These metrics
aggregate over all predictions rather than per class, and so overlook minority classes.
This is a problem because minority classes are often the hardest to collect data for and
to predict. In the plant domain, for example, species with fewer samples are often the
ones that are hardest to collect and predict in the field. As more plant species are
identified, more of them become minority species, making it increasingly difficult to
classify them accurately with traditional machine learning methods. To address this
issue, we explore combining traditional data and machine learning approaches with deep
learning techniques such as self-supervision in a preprocessing stage. By using
self-supervised training together with different sampling algorithms and class weights,
we improved balanced accuracy for minority plant species by 7.9% to 13% without
affecting overall accuracy. This shows that combining deep learning techniques with
traditional machine learning methods can improve prediction accuracy for minority
classes, even in domains where data is limited.
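To illustrate why plain accuracy can hide poor minority-class performance, the following minimal sketch contrasts it with balanced accuracy, taken here as the mean of per-class recall, and derives inverse-frequency class weights. The toy labels, the function names, and the weighting formula are illustrative assumptions, not the data or the exact method used in the article.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / n for c, n in support.items()) / len(support)


def inverse_frequency_weights(y_true):
    """Assumed weighting scheme: n_samples / (n_classes * class_count)."""
    counts = Counter(y_true)
    n_samples, n_classes = len(y_true), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical toy data: 8 samples of a common species, 2 of a rare one.
y_true = ["common"] * 8 + ["rare"] * 2
# A degenerate classifier that always predicts the majority species.
y_pred = ["common"] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
print(inverse_frequency_weights(y_true))  # {'common': 0.625, 'rare': 2.5}
```

In this toy case the classifier never identifies the rare species, yet plain accuracy still reports 0.8, while balanced accuracy drops to 0.5. The weights show how a loss function could be made to penalize errors on the rare class more heavily.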
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The authors retain copyright and grant the journal the right of first publication, as well as the right to edit, reproduce, distribute, exhibit, and communicate the work in the country and abroad through print and electronic media. They also assume responsibility for any litigation or claim related to intellectual property rights, releasing Editorial Tecnológica de Costa Rica from liability. In addition, authors may enter into separate, additional contractual arrangements for the non-exclusive distribution of the version of the article published in this journal (e.g., including it in an institutional repository or publishing it in a book), provided they clearly state that the work was first published in this journal.