Anomaly explanation with random forests
Introduction
Anomaly detection algorithms identify samples that deviate from the rest of the data so much that they raise the suspicion of having been generated by a different process. The fact that anomalies are, by definition, rare, and that they can substantially differ from each other, poses different issues and challenges than standard supervised classification. Despite that, anomaly detection algorithms have been successfully applied in many domains, for example network security (Garcia-Teodoro, Diaz-Verdejo, Maciá-Fernández, & Vázquez, 2009), bioinformatics (Tibshirani & Hastie, 2007), astronomy (Dutta, Giannella, Borne, & Kargupta, 2007) and space exploration (Fujimaki, Yairi, & Machida, 2005). With the increasing volume of mostly unlabelled data generated today, anomaly detection receives more and more attention, as it helps to identify interesting samples without human intervention.
Because each application domain has its own definition of an anomaly and its own application constraints, many different algorithms have already been proposed; cf. the recent book by Aggarwal (2013) or the comparison by Pevný (2016). However, the identification of an anomaly is only half of the problem; the second, equally important, half is the interpretation of the found anomalies, because humans have more trust in results they understand. In high-dimensional application domains like network security or bioinformatics, a proper interpretation is even more important: every additional piece of knowledge about the anomaly provides invaluable help to users evaluating the suspicious samples. Despite the extensive prior art in anomaly detection, very few works even mention the explanation of detected anomalies. This is surprising because, in practice, any detection is usually followed by a more in-depth investigation, and because the pioneering works in non-parametric anomaly detection (Knorr & Ng, 1998; Knorr & Ng, 1999) explicitly consider anomaly explanation.
This paper contributes to filling this gap by proposing an algorithm called Explainer, which explains how a sample identified as an anomaly by an arbitrary detector differs from the rest of the data. The explanation is presented as a set of rules, because rules are commonly considered easily understandable for humans (Huysmans, Dejaeger, Mues, Vanthienen, & Baesens, 2011). One of the Explainer's main advantages is that the anomaly detection algorithm used to find the anomalies is treated as a black box, which means that the Explainer can be employed as an additional step after the vast majority of state-of-the-art algorithms. It can also provide an explanation using different features than the anomaly detector. Because the Explainer is based on an ensemble of specifically trained decision trees (a random forest), which can be trained quickly, it is lightweight and can be used effectively, with minor memory requirements, on large databases and data streams.
The rest of this paper is organised as follows. Section 2 reviews related work. The Explainer is described in Section 3. Section 4 presents the results of a comparison to the state of the art. The same section also shows the robustness of the Explainer with respect to the setting of its parameters. Section 5 summarises the advantages and limitations of the proposed approach.
Section snippets
Anomaly detectors
Most unsupervised anomaly detection methods can be classified as either model-centric or data-centric.
Model-centric methods learn a model from the available data, and then during detection, they measure how a given sample deviates from that model. Their training phase is usually computationally expensive while the detection is rather cheap. The most basic models represent data by parameters of some statistical distribution. To this class also belongs the method proposed in the first work that
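The two-phase pattern described above (expensive training of a model, cheap per-sample scoring of deviation) can be illustrated with a deliberately minimal sketch. The example below is not from the paper: it fits the most basic kind of statistical model, a multivariate Gaussian, and scores samples by their Mahalanobis distance from it; class and variable names are our own.

```python
import numpy as np

# Minimal sketch of a model-centric detector (illustrative, not the paper's
# method): fit distribution parameters, then score samples by deviation.
class GaussianDetector:
    def fit(self, X):
        # training phase: estimate the model parameters from the data
        # (for richer models this is the computationally expensive part)
        self.mu = X.mean(axis=0)
        self.cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
        return self

    def score(self, x):
        # detection phase: how far does the sample deviate from the model?
        d = x - self.mu
        return float(np.sqrt(d @ self.cov_inv @ d))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))          # "normal" background data
det = GaussianDetector().fit(X)
# a far-away point scores much higher than one near the bulk of the data
print(det.score(np.array([8.0, 8.0])) > det.score(np.array([0.1, 0.0])))
```

A distant point receives a much larger score than a point near the data mean, which is exactly the deviation measurement the detection phase relies on.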
The Explainer algorithm
The goal of the Explainer is to explain why a point xa is an anomaly with respect to the rest of the points. The Explainer does not impose constraints on the anomaly detection algorithm that has identified xa, but assumes that the anomalous point xa deviates from the rest in a subset of features. From the point of view of the interaction between features and hypothesis space, the main focus are anomalies that can be explained as an aggregation of individual
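The core idea, that a tree separating xa from the remaining points yields a human-readable rule, can be sketched as follows. This is a rough illustration under our own assumptions, using an off-the-shelf scikit-learn tree rather than the paper's specifically trained trees; the function name `explain` and the rule format are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Sketch only (not the paper's exact tree construction): train a tree to
# separate the anomaly x_a from the rest, then read x_a's root-to-leaf
# path as a conjunction of feature-threshold conditions.
def explain(X, x_a):
    data = np.vstack([X, x_a])
    y = np.r_[np.zeros(len(X)), 1]            # label 1 marks the anomaly
    tree = DecisionTreeClassifier(random_state=0).fit(data, y)
    path = tree.decision_path(x_a.reshape(1, -1)).indices
    t, conditions = tree.tree_, []
    for parent, child in zip(path[:-1], path[1:]):
        op = "<=" if child == t.children_left[parent] else ">"
        conditions.append(f"x[{t.feature[parent]}] {op} {t.threshold[parent]:.2f}")
    return " AND ".join(conditions)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
x_a = np.array([0.0, 6.0, 0.0])               # deviates only in feature 1
print(explain(X, x_a))                        # e.g. a single condition on x[1]
```

Because the point deviates only in feature 1, the learned rule involves a threshold on that feature alone, matching the assumption that an anomaly deviates in a subset of features.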
Experiments
This section presents an experimental comparison of the Explainer to the state-of-the-art algorithms by Dang et al. (2014, 2013) (Dang13, Dang14), Micenková et al. (2013) (Micenkova) and Lundberg and Lee (2017) (SHAP) on 34 classification datasets from the UCI machine learning repository (Bache & Lichman, 2013), with the number of features ranging from 4 to 5000 and the number of samples from 22 up to 12k. Larger datasets were not included because of the computational complexity of the LOF and k-nn
Conclusion
This paper has presented a novel approach to explaining the deviation of a sample, identified by an arbitrary anomaly detection algorithm, from the rest of the data. The proposed approach relies on a novel kind of random forest suitable for extracting rules explaining the anomaly. Those rules are then assembled into a set of classification rules sharing the same consequent, or into an equivalent single rule with an antecedent in disjunctive normal form, and presented to the user.
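The final assembly step admits a simple sketch. Assuming each tree contributed one conjunction of conditions (the example rules below are made up for illustration), the equivalent single rule is formed by joining the conjunctions disjunctively under the shared consequent:

```python
# Illustrative sketch (example rules and names are our assumptions, not the
# paper's output): per-tree conjunctions sharing the consequent "anomaly"
# are merged into one rule whose antecedent is in disjunctive normal form.
rules = [
    ["x[1] > 3.1", "x[4] <= 0.2"],   # conjunction extracted from tree 1
    ["x[1] > 2.9"],                  # conjunction extracted from tree 2
]
antecedent = " OR ".join("(" + " AND ".join(conj) + ")" for conj in rules)
print(antecedent + " => anomaly")
# (x[1] > 3.1 AND x[4] <= 0.2) OR (x[1] > 2.9) => anomaly
```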
The proposed approach has
CRediT authorship contribution statement
Martin Kopp: Formal analysis, Investigation, Software, Validation, Visualization, Writing - original draft. Tomáš Pevný: Data curation, Methodology, Supervision. Martin Holeňa: Funding acquisition, Supervision, Writing - review & editing.
References (48)
- Garcia-Teodoro, P., Diaz-Verdejo, J., Maciá-Fernández, G., & Vázquez, E. (2009). Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security.
- Huysmans, J., Dejaeger, K., Mues, C., Vanthienen, J., & Baesens, B. (2011). An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decision Support Systems.
- et al. (2002). Use of k-nearest neighbor classifier for intrusion detection. Computers & Security.
- et al. (2011). Performance of classification models from a user perspective. Decision Support Systems.
- Aggarwal (2013). Outlier analysis.
- et al. (2009). Detecting outlying properties of exceptional objects. ACM Transactions on Database Systems (TODS).
- Bache, K., & Lichman, M. (2013). UCI Machine learning repository.
- et al. (1984). Classification and regression trees.
- et al. (2000). LOF: Identifying density-based local outliers. Proceedings of the 2000 ACM SIGMOD international conference on management of data.
- et al. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining.
- Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends® in Computer Graphics and Vision.
- Dang et al. (2014). Discriminative features for identifying and interpreting outliers. ICDE 2014, 30th IEEE international conference on data engineering.
- Dang et al. (2013). Local outlier detection with interpretation. European conference on machine learning and principles and practice of knowledge discovery in databases (ECML/PKDD 2013).
- Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research.
- Dutta et al. (2007). Distributed top-k outlier detection from astronomy catalogs using the DEMAC system. SDM.
- Systematic construction of anomaly detection benchmarks from real data. Proceedings of the ACM SIGKDD workshop on outlier detection and description.
- A geometric framework for unsupervised anomaly detection. Applications of data mining in computer security.
- Fujimaki et al. (2005). An approach to spacecraft anomaly detection problem using kernel feature space. Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining.
- Sample criteria for testing outlying observations. The Annals of Mathematical Statistics.
- LEMNA: Explaining deep learning based security applications. Proceedings of the 2018 ACM SIGSAC conference on computer and communications security.
- Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning.
- A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics.
- HiCS: High contrast subspaces for density-based outlier ranking. ICDE 2012, 28th IEEE international conference on data engineering.
- Knorr & Ng (1998). Algorithms for mining distance-based outliers in large datasets. Proceedings of the international conference on very large data bases.
The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grants 17-01251 and 18-21409S, and by the student grant SGS18/210/OHK3/3T/18.