Evaluation of different text representation techniques and distance metrics using KNN for document classification
Abstract
Nowadays, text data is a fundamental part of databases around the world, and one of the biggest challenges has been the extraction of meaningful information from large sets of text. The existing literature on text classification is extensive; over the last 25 years, statistical methods (in which similarity functions are applied over vectors of words) have achieved good results in many areas of text mining. Additionally, several models, such as topic modelling, have been proposed to achieve dimensionality reduction and to incorporate semantics. In this paper we evaluate different text representation techniques, including the traditional bag of words and topic modelling. The evaluation is done by testing different combinations of text representations and text distance metrics (Cosine, Jaccard, and Kullback-Leibler divergence) with K-Nearest Neighbors, in order to determine the effectiveness of topic-modelling representations for dimensionality reduction when classifying text. The results show that the simplest version of bag of words combined with Jaccard similarity outperformed the other combinations in most cases. A statistical test showed that the accuracy values obtained with supervised Latent Dirichlet Allocation (LDA) representations, combined with the relative-entropy metric, were not significantly different from those obtained with traditional text classification techniques. LDA managed to abstract thousands of words into fewer than 60 topics for the main set of experiments. Additional experiments suggest that topic modelling can perform better when applied to short documents or when the number of topics (dimensions) is increased at the moment of generating the model.
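The pipeline the abstract describes — represent each document as a vector, compare documents with a similarity or distance function, and let K-Nearest Neighbors vote on the label — can be sketched in a few lines. The sketch below is a minimal illustration, not the paper's actual implementation: it uses a toy corpus and raw term counts as the bag-of-words representation, and implements the three metric families mentioned (Cosine and Jaccard over word vectors, a smoothed Kullback-Leibler divergence for topic distributions); all document texts, labels, and the smoothing constant `eps` are hypothetical.

```python
from collections import Counter
import math

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard_sim(a, b):
    """Jaccard similarity over the sets of words (presence, not counts)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def kl_divergence(p, q, eps=1e-9):
    """Smoothed KL divergence between two topic distributions (lists of probabilities)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def knn_classify(query, train, k=3, sim=jaccard_sim):
    """Majority vote among the k training documents most similar to the query."""
    neighbors = sorted(train, key=lambda t: sim(query, t[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus: (text, label) pairs turned into bag-of-words Counters.
docs = [
    ("the cat sat on the mat", "animals"),
    ("dogs and cats are pets", "animals"),
    ("stocks fell on wall street", "finance"),
    ("the market rallied as stocks rose", "finance"),
]
train = [(Counter(text.split()), label) for text, label in docs]
query = Counter("my cat chased the dogs".split())
print(knn_classify(query, train, k=3, sim=jaccard_sim))  # → animals
```

In the paper's setting, the Counters would be replaced by full-vocabulary term vectors (or by LDA topic distributions, compared with `kl_divergence` as the relative-entropy metric), but the KNN voting step is the same.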
Article Details
The authors retain copyright and grant the journal the right of first publication, together with the right to edit, reproduce, distribute, exhibit, and communicate the work in print and electronic media, both domestically and abroad. They likewise assume responsibility for any litigation or claim related to intellectual property rights, releasing the Editorial Tecnológica de Costa Rica from liability. In addition, the authors may enter into separate, additional contractual arrangements for the non-exclusive distribution of the version of the article published in this journal (e.g., depositing it in an institutional repository or publishing it in a book), provided they clearly indicate that the work was first published in this journal.