Anomaly explanation with random forests

https://doi.org/10.1016/j.eswa.2020.113187

Abstract

Anomaly detection has become an important topic in many domains, and many different solutions have been proposed to date. Despite that, only a few anomaly detection methods try to explain how a sample differs from the rest. This work contributes to filling this gap, because knowing why a sample is considered anomalous is critical in many application domains. The proposed solution uses a specific type of random forest to extract rules explaining the difference; the rules are then filtered and presented to the user either as a set of classification rules sharing the same consequent, or as the equivalent single rule with an antecedent in disjunctive normal form. The quality of the solution is documented by a comparison with state-of-the-art algorithms on 34 real-world datasets.

Introduction

Anomaly detection algorithms identify samples deviating from the rest of the data so much that they raise a suspicion of having been generated by a different process. The fact that anomalies are, by definition, rare and can substantially differ from each other poses different issues and challenges than standard supervised classification. Despite that, anomaly detection algorithms have been successfully applied in many domains, for example network security (Garcia-Teodoro, Diaz-Verdejo, Maciá-Fernández, & Vázquez, 2009), bioinformatics (Tibshirani & Hastie, 2007), astronomy (Dutta, Giannella, Borne, & Kargupta, 2007) and space exploration (Fujimaki, Yairi, & Machida, 2005). With the increasing volume of mostly unlabelled data generated today, anomaly detection receives more and more attention, as it helps to identify interesting samples without human intervention.

Because each application domain has its own definition of anomaly and its own application constraints, many different algorithms have already been proposed, cf. the recent book by Aggarwal (2013) or the comparison by Pevný (2016). However, the identification of an anomaly is only half of the problem; the other, equally important half is the interpretation of the found anomalies, because humans have more trust in results they understand. In high-dimensional application domains like network security or bioinformatics, a proper interpretation is even more important: every additional piece of knowledge about the anomaly provides invaluable help to users evaluating the suspicious samples. Despite the extensive prior art in anomaly detection, very few works even mention the explanation of detected anomalies. This is surprising because, in practice, any detection is usually followed by a more in-depth investigation, and because the pioneering works in non-parametric anomaly detection (Knorr & Ng, 1998, 1999) explicitly consider anomaly explanation.

This paper contributes to filling this gap by proposing an algorithm called Explainer, which explains how a sample identified as an anomaly by an arbitrary detector differs from the rest of the data. The explanation is presented as a set of rules, because rules are commonly considered easily understandable for humans (Huysmans, Dejaeger, Mues, Vanthienen, & Baesens, 2011). One of the Explainer's main advantages is that the anomaly detection algorithm used to find anomalies is treated as a black box, which means that the Explainer can be employed as an additional step after the vast majority of state-of-the-art detectors. It can also provide an explanation using different features than the anomaly detector. Because the Explainer is based on an ensemble of specifically trained decision trees (a random forest), which can be trained quickly, it is lightweight and can be used effectively, with modest memory requirements, on large databases and data streams.
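To make the black-box interface concrete, the following minimal sketch is our illustration, not the authors' code: IsolationForest stands in for an arbitrary detector, and explain_anomaly is a hypothetical placeholder for the rule-extraction step sketched in Section 3. The point is that the explanation step only needs the data and the index of the flagged sample.

```python
# A minimal sketch of the black-box pipeline (our illustration, not the
# authors' code). IsolationForest stands in for an arbitrary detector;
# explain_anomaly is a hypothetical placeholder for the rule-extraction
# step sketched in Section 3.
import numpy as np
from sklearn.ensemble import IsolationForest

def detect(X):
    # The explainer never looks inside this function: any detector able
    # to single out an anomalous sample will do.
    scores = IsolationForest(random_state=0).fit(X).score_samples(X)
    return int(np.argmin(scores))  # index of the most anomalous sample

def explain_anomaly(X, idx):
    # Placeholder: train trees separating X[idx] from the rest and turn
    # their decision paths into rules (see the sketch in Section 3).
    raise NotImplementedError

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 4)), [[8.0, 8.0, 8.0, 8.0]]])
idx = detect(X)  # finds the injected outlier in the last row
```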

The rest of this paper is organised as follows. Section 2 reviews related work. The Explainer is described in Section 3. Section 4 presents the results of a comparison to the state of the art and also shows the robustness of the Explainer with respect to the setting of its parameters. Section 5 summarises the advantages and limitations of the proposed approach.


Anomaly detectors

Most unsupervised anomaly detection methods can be classified as either model-centric or data-centric.

Model-centric methods learn a model from the available data and then, during detection, measure how a given sample deviates from that model. Their training phase is usually computationally expensive, while the detection is rather cheap. The most basic models represent the data by the parameters of some statistical distribution. To this class also belongs the method proposed in the first work that…
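As a concrete illustration of this paradigm (our example, not taken from the paper), one of the most basic model-centric detectors fits a Gaussian during the expensive training phase and then scores each sample cheaply by its Mahalanobis distance from the fitted model:

```python
# A minimal model-centric detector, assuming a single Gaussian model:
# training estimates the distribution's parameters, detection measures
# how far a sample deviates from them (Mahalanobis distance).
import numpy as np

def fit_gaussian(X):
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularised
    return mu, np.linalg.inv(cov)

def anomaly_score(x, mu, cov_inv):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))  # large value = strong deviation

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                      # training data
mu, cov_inv = fit_gaussian(X)                      # expensive part, done once
print(anomaly_score(np.array([5.0, 5.0, 5.0]), mu, cov_inv))  # high score
print(anomaly_score(X[0], mu, cov_inv))                       # low score
```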

The Explainer algorithm

The goal of the Explainer is to explain why a point x_a ∈ X is an anomaly with respect to the rest of the points in X. The Explainer does not impose constraints on the anomaly detection algorithm that has identified x_a, but assumes that the anomalous point x_a deviates from the rest in a subset of the features (corresponding to dimensions of X). From the point of view of the interaction between features and hypothesis space, the main focus is on anomalies that can be explained as an aggregation of individual…
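A minimal sketch of this idea follows (ours, not the paper's implementation): a plain CART tree stands in for the specifically trained trees, and the one-vs-rest labelling of the flagged point against the rest is an assumed setup. The root-to-leaf path followed by the anomaly is a conjunction of threshold conditions on individual features, i.e. one candidate rule.

```python
# A hedged sketch of turning a decision path into a rule. A plain CART
# tree replaces the paper's specifically grown trees; the labelling
# (flagged point vs. the rest) is an assumption for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def path_to_rule(tree, x, feature_names):
    # Walk from the root to the leaf containing x, collecting the
    # threshold conditions along the way.
    t = tree.tree_
    node, conditions = 0, []
    while t.children_left[node] != -1:            # -1 marks a leaf
        f, thr = t.feature[node], t.threshold[node]
        if x[f] <= thr:
            conditions.append(f"{feature_names[f]} <= {thr:.3f}")
            node = t.children_left[node]
        else:
            conditions.append(f"{feature_names[f]} > {thr:.3f}")
            node = t.children_right[node]
    return " AND ".join(conditions)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                     # the rest of the data
x_a = np.array([6.0, 0.0])                        # anomalous in feature f0
data, y = np.vstack([X, x_a]), np.r_[np.zeros(300), 1]
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data, y)
print(path_to_rule(tree, x_a, ["f0", "f1"]))      # e.g. a rule on f0 only
```

In the full algorithm, an ensemble of such trees yields a set of rules sharing the same consequent, which the Explainer then filters before presenting it to the user; the sketch above shows only the path-to-rule conversion for a single tree.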

Experiments

This section presents an experimental comparison of the Explainer to the state-of-the-art algorithms by Dang et al. (2014, 2013) (Dang13, Dang14), Micenková et al. (2013) (Micenkova), and Lundberg and Lee (2017) (SHAP) on 34 classification datasets from the UCI machine learning repository (Bache & Lichman, 2013), with the number of features ranging from 4 to 5000 and the number of samples from 22 up to 12k. Larger datasets were not included because of the computational complexity of the LOF and k-nn…

Conclusion

This paper has presented a novel approach to explaining how a sample, identified by an arbitrary anomaly detection algorithm, deviates from the rest of the data. The proposed approach relies on a novel kind of random forest suited to extracting rules explaining the anomaly. Those rules are then assembled into a set of classification rules sharing the same consequent, or into the equivalent rule with an antecedent in disjunctive normal form, and presented to the user.

The proposed approach has…

CRediT authorship contribution statement

Martin Kopp: Formal analysis, Investigation, Software, Validation, Visualization, Writing - original draft. Tomáš Pevný: Data curation, Methodology, Supervision. Martin Holeňa: Funding acquisition, Supervision, Writing - review & editing.

References (48)

  • Criminisi, A., et al. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends® in Computer Graphics and Vision.
  • Dang, X. H., et al. (2014). Discriminative features for identifying and interpreting outliers. ICDE 2014, 30th IEEE International Conference on Data Engineering.
  • Dang, X.-H., et al. (2013). Local outlier detection with interpretation. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013).
  • Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research.
  • Dutta, H., et al. (2007). Distributed top-k outlier detection from astronomy catalogs using the DEMAC system. SDM.
  • Emmott, A. F., et al. (2013). Systematic construction of anomaly detection benchmarks from real data. Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description.
  • Eskin, E., et al. (2002). A geometric framework for unsupervised anomaly detection. Applications of Data Mining in Computer Security.
  • Fujimaki, R., et al. (2005). An approach to spacecraft anomaly detection problem using kernel feature space. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining.
  • Grubbs, F. E. (1950). Sample criteria for testing outlying observations. The Annals of Mathematical Statistics.
  • Guo, W., et al. (2018). LEMNA: Explaining deep learning based security applications. Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security.
  • Hand, D. J. (2009). Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning.
  • Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics.
  • Keller, F., et al. (2012). HiCS: High contrast subspaces for density-based outlier ranking. ICDE 2012, 28th IEEE International Conference on Data Engineering.
  • Knorr, E., et al. (1998). Algorithms for mining distance-based outliers in large datasets. Proceedings of the International Conference on Very Large Data Bases.

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grants 17-01251 and 18-21409S, and by student grant SGS18/210/OHK3/3T/18.
