Abstract
From an analysis of the conditions a good distance function should satisfy, we derive four rules that it must fulfill. We then introduce two new distance functions, one a metric and one a pseudometric, and test how well they suit distance-based classifiers, especially the IINC classifier. We rank 23 distance functions according to several criteria and statistical tests on 24 different tasks, using the mean, the median, the Friedman aligned test, and the Quade test. The rankings depend not only on the criterion or the nature of the statistical test, but also on whether the test accounts for the differing difficulty of the tasks or treats all tasks as equally difficult. The two new distance functions rank among the four or five best of the 23 tested. Our results show that a suitable distance function can improve the behavior of distance-based classification rules.
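The abstract's central idea is that the distance function is a pluggable component of a near-neighbors classifier, and that swapping it can change classification quality. The paper's new distance functions are not given in this excerpt, so as a minimal illustrative sketch (not the authors' method), here is a 1-NN rule parameterized by an arbitrary distance function, shown with the standard Euclidean metric and the well-known Canberra metric as two interchangeable choices:

```python
from math import sqrt

def euclidean(a, b):
    # Standard L2 metric.
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canberra(a, b):
    # Canberra distance: weights coordinate differences relative to
    # magnitude; terms with |x| + |y| == 0 contribute nothing.
    return sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) > 0)

def nn_classify(train, query, dist):
    # 1-NN rule: predict the label of the training point closest to
    # the query under the chosen distance function.
    _, label = min(train, key=lambda p: dist(p[0], query))
    return label

# Toy labeled data (hypothetical, for illustration only).
train = [((0.0, 0.0), "A"), ((1.0, 1.0), "B"), ((0.2, 0.1), "A")]
print(nn_classify(train, (0.15, 0.12), euclidean))  # A
print(nn_classify(train, (0.9, 0.8), canberra))     # B
```

Because `dist` is just a parameter, any candidate distance function, including the metric and pseudometric the paper proposes, can be benchmarked on the same tasks by substitution alone, which is exactly the kind of comparison the ranking experiments describe.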
The Distance Function Optimization for the Near Neighbors-Based Classifiers