Random-Forest-Based Analysis of URL Paths

Puchýř, J.; Holeňa, Martin

SYSNO ASEP	0478626
Document Type	C - Proceedings Paper (int. conf.)
R&D Document Type	Conference Paper
Title	Random-Forest-Based Analysis of URL Paths
Author(s)	Puchýř, J. (CZ) Holeňa, Martin (UIVT-O)
Source Title	Proceedings ITAT 2017: Information Technologies - Applications and Theory. - Aachen & Charleston : Technical University & CreateSpace Independent Publishing Platform, 2017 / Hlaváčová J. - ISSN 1613-0073 - ISBN 978-1974274741
Pages	pp. 129-135
Number of pages	7
Publication form	Online - E
Event	ITAT 2017. Conference on Theory and Practice of Information Technologies - Applications and Theory /17./
Event date	22.09.2017 - 26.09.2017
Event location	Martinské hole
Country	SK - Slovakia
Event type	EUR
Language	eng - English
Country	DE - Germany
Keywords	malicious URLs detection ; classification ; random forest
Subject RIV	IN - Informatics, Computer Science
OECD category	Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
R&D Projects	GA17-01251S GA ČR - Czech Science Foundation (CSF)
Institutional support	UIVT-O - RVO:67985807
EID SCOPUS	85045771719
Annotation	One of the key sources of spreading malware is malicious web sites, which either trick the user into installing malware that imitates legitimate software or, in the case of various exploit kits, initiate malware installation without any user action. The most common technique against such web sites is blacklisting. However, it provides little to no information about new sites never seen before. Therefore, there has been important research into predicting malicious web sites based on their features. This work-in-progress paper presents a lightweight prediction method using solely lexical features of the site URL and classification by random forests. To this end, three possibilities of feature extraction have been elaborated and investigated on real-world data sets with respect to precision and recall. The obtained results indicate that there is almost never a significant difference between the considered methods, and that in spite of the limitation to the lexical features of the site URL, they have an impressive performance in terms of area under the precision-recall curve for the path parts of URLs.
Workplace	Institute of Computer Science
Contact	Tereza Šírová, sirova@cs.cas.cz, Tel.: 266 053 800
Year of Publishing	2018
Electronic address	http://ceur-ws.org/Vol-1885/129.pdf
