Interpreting and Clustering Outliers with Sapling Random Forests

Kopp, Martin; Pevný, T.; Holeňa, Martin

0432410 - ÚI 2015 RIV CZ eng C - Conference Paper (international conference)
Kopp, Martin - Pevný, T. - Holeňa, Martin
Interpreting and Clustering Outliers with Sapling Random Forests.
ITAT 2014. Information Technologies - Applications and Theory. Part II. Prague: Institute of Computer Science AS CR, 2014 - (Kůrková, V.; Bajer, L.; Peška, L.; Vojtáš, R.; Holeňa, M.; Nehéz, M.), pp. 61-67. ISBN 978-80-87136-19-5.
[ITAT 2014. European Conference on Information Technologies - Applications and Theory /14./. Demänovská dolina (SK), 25.09.2014-29.09.2014]
R&D Projects: GA ČR GA13-17187S
Grant - others:GA ČR(CZ) GPP103/12/P514
Institutional support: RVO:67985807
Keywords : anomaly detection * anomaly interpretation * clustering * decision trees * feature selection * random forest
Subject RIV: IN - Informatics, Computer Science

The main objective of outlier detection is to find samples that deviate considerably from the majority. Such outliers, often referred to as anomalies, are increasingly important because they help to uncover interesting events within data. Consequently, a considerable number of statistical and data mining techniques for identifying anomalies have been proposed in the last few years, but only a few works even mention why a sample was labelled as an anomaly. Therefore, we propose a method based on specifically trained decision trees, called sapling random forest. Our method is able to interpret the output of an arbitrary anomaly detector. The explanation is given either as a subset of features in which the sample deviates most, or as conjunctions of atomic conditions, which can be viewed as antecedents of logical rules easily understandable by humans. To simplify the investigation of suspicious samples even further, we propose two methods of clustering anomalies into groups. Such clusters can be investigated at once, saving time and human effort. The feasibility of our approach is demonstrated on several synthetic datasets and one real-world dataset.
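
The abstract does not say how the saplings are trained, so the following is only a rough illustration of the idea and not the authors' implementation. It assumes that each sapling is a tiny tree grown to separate one flagged sample from a small random subset of normal samples, and that the conditions on the path to the flagged sample form the explanation. The names `grow_sapling` and `explain`, the parameters `n_trees` and `sample_size`, and the greedy split rule are all hypothetical choices made for this sketch.

```python
import random
from collections import Counter


def grow_sapling(anomaly, normals):
    """Greedily add axis-aligned splits until the anomaly is isolated.

    Returns the path to the anomaly as a list of atomic conditions
    (feature_index, op, threshold) with op in {">", "<="}.
    ASSUMPTION: this greedy split rule is illustrative, not from the paper.
    """
    conditions = []
    remaining = list(normals)
    while remaining:
        candidates = []
        for f, a in enumerate(anomaly):
            below = [r[f] for r in remaining if r[f] < a]
            above = [r[f] for r in remaining if r[f] > a]
            if below:  # anomaly goes to the "> threshold" side
                kept = len(remaining) - len(below)
                candidates.append((kept, f, ">", (max(below) + a) / 2))
            if above:  # anomaly goes to the "<= threshold" side
                kept = len(remaining) - len(above)
                candidates.append((kept, f, "<=", (min(above) + a) / 2))
        if not candidates:
            break  # remaining normals are identical to the anomaly
        # Prefer the split that leaves the fewest normals with the anomaly.
        _, f, op, t = min(candidates)
        conditions.append((f, op, t))
        remaining = [r for r in remaining if (r[f] > t) == (op == ">")]
    return conditions


def explain(anomaly, normals, n_trees=50, sample_size=20, seed=0):
    """Grow many saplings; return per-feature votes and the raw rules."""
    rng = random.Random(seed)
    votes = Counter()
    rules = []
    for _ in range(n_trees):
        sample = rng.sample(normals, min(sample_size, len(normals)))
        rule = grow_sapling(anomaly, sample)
        rules.append(rule)
        votes.update({f for f, _, _ in rule})  # one vote per feature per tree
    return votes.most_common(), rules


# Toy data: normal samples lie in [-1, 1]^3, the flagged sample deviates
# only in feature 0. In practice the flag would come from any detector.
data_rng = random.Random(1)
normal_samples = [tuple(data_rng.uniform(-1, 1) for _ in range(3))
                  for _ in range(200)]
flagged = (8.0, 0.0, 0.0)

feature_votes, rules = explain(flagged, normal_samples)
print("feature votes:", feature_votes)
print("example rule:", " AND ".join(f"x[{f}] {op} {t:.2f}"
                                    for f, op, t in rules[0]))
```

In this sketch the vote counts play the role of the feature-subset explanation, and each rule is a conjunction of atomic conditions that can be read as the antecedent of a logical rule.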
Permanent Link: http://hdl.handle.net/11104/0236773
File: 0432410.pdf, 156.4 KB, publisher's postprint, open access, 26 downloads
