Feature Selection for Performance Estimation of Machine Learning Workflows

  • Conference paper
Information Technology and Systems (ICITS 2023)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 691)

Abstract

Performance prediction of machine learning models can speed up automated machine learning procedures, and it can also be incorporated into model recommendation algorithms. We propose a meta-learning framework that utilizes information about previous runs of machine learning workflows on benchmark tasks. We extract features describing the workflows and meta-data about the tasks, and combine them to train a regressor for performance prediction. In this way, we obtain a performance prediction for a model without training it, just by means of feature extraction and inference with the regressor. The approach is tested on the OpenML-CC18 Curated Classification benchmark by estimating the 75th-percentile value of the area under the ROC curve (AUC) of the classifiers. We obtained consistent predictions with an \(R^2\) score of 0.8 on previously unseen data.
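
To make the general idea concrete, the following is a minimal sketch in Python (not the authors' implementation) of a meta-regressor that maps combined workflow and task meta-features to an expected AUC value and is evaluated with the \(R^2\) score on held-out (workflow, task) pairs. The synthetic meta-features, the simulated AUC targets, and the choice of a random forest regressor are illustrative assumptions; the paper's actual feature extraction and regressor may differ.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Hypothetical meta-dataset: one row per (workflow, task) pair.
    # Columns would encode the workflow (e.g. one-hot pipeline steps,
    # hyperparameters) and task meta-data (e.g. numbers of instances,
    # features, classes); here they are random placeholders.
    n_pairs, n_meta_features = 2000, 20
    X_meta = rng.normal(size=(n_pairs, n_meta_features))

    # Target: the performance (e.g. 75th-percentile AUC) observed in past
    # runs, simulated here as a noisy function of a few meta-features.
    y_auc = (
        0.5
        + 0.4 / (1.0 + np.exp(-X_meta[:, :5].sum(axis=1)))
        + rng.normal(scale=0.02, size=n_pairs)
    )

    # Hold out unseen (workflow, task) pairs to check generalisation.
    X_train, X_test, y_train, y_test = train_test_split(
        X_meta, y_auc, test_size=0.25, random_state=0
    )

    # Any off-the-shelf regressor can play the role of the performance
    # predictor; a random forest is used here only as an example.
    meta_regressor = RandomForestRegressor(n_estimators=300, random_state=0)
    meta_regressor.fit(X_train, y_train)

    # Predicting performance for a new pair is a single inference call;
    # no training of the underlying classifier is required.
    print("R^2 on held-out pairs:",
          r2_score(y_test, meta_regressor.predict(X_test)))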

Author information

Corresponding author

Correspondence to Roman Neruda.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Neruda, R., Figueroa-García, J.C. (2023). Feature Selection for Performance Estimation of Machine Learning Workflows. In: Rocha, Á., Ferrás, C., Ibarra, W. (eds) Information Technology and Systems. ICITS 2023. Lecture Notes in Networks and Systems, vol 691. Springer, Cham. https://doi.org/10.1007/978-3-031-33258-6_33
