1 Introduction

The ultimate goal of causal inference is understanding relationships between random variables and predicting the outcomes resulting from their modification or manipulation (Peters et al. 2017). Causal inference finds utility across a wide array of scientific domains. In medicine, for instance, it aids in comprehending the propagation of epileptic seizures across distinct brain regions (Imbens and Rubin 2015). In climate science, it facilitates the prediction of variables such as temperature and rainfall (Naveau et al. 2020) while unraveling the causes behind sudden changes in river discharges (Mhalla et al. 2020). Extensive efforts have been invested in establishing its mathematical foundation (Pearl 2009). Structural causal models (SCMs) serve as mathematical tools for representing causal relationships between variables, especially in scenarios where temporal order is unavailable. Recent advancements include estimating causal mechanisms within structural causal models (Peters et al. 2014; Mooij et al. 2016; Glymour et al. 2019). However, causal discovery in time series often necessitates alternative models (Peters et al. 2017, Chapter 10) and faces different problems (Runge et al. 2019a).

A commonly used concept for describing causality in time series is Granger causality (Granger 1969, 1980). Though various definitions of causality, such as Sims causality (Sims 1972), structural causality (White and Lu 2010), and interventional causality (Eichler and Didelez 2010), coexist, Granger causality holds a prominent position in the literature (Berzuini et al. 2012, Chapter 22.2). It rests upon two key principles: the precedence of cause over effect and the unique information contained in the cause that is otherwise unattainable. In this paper, we primarily consider strong Granger causality, which employs conditional independence (see Section 1.3). Variations include linear Granger causality (Hosoya 1977) and Granger causality in mean (Ho 2015). In the autoregressive models considered in this paper, all previously mentioned notions are closely related, and their differences are typically not of practical relevance (Berzuini et al. 2012, Chapter 22.8).

Several approaches exist for learning a causal structure in time series. State-of-the-art methods for causal discovery in time series use consecutive conditional independence testing (Zhang and Zhang 2008) or fit vector autoregressive (VAR) models (Eichler 2012). Runge et al. (2019b) introduce the PCMCI method, which combines the PC method (named after its inventors, Peter Spirtes and Clark Glymour; Spirtes et al. (1993)) and MCI (momentary conditional independence; Pompe and Runge (2011)) to identify and adjust for many potential confounders. However, there is still a need for methods more robust to a variety of hidden confounders. Gerhardus and Runge (2021) introduce the LPCMCI algorithm, which is adapted to handle latent confounders. Another approach leverages Shannon’s information theory, utilizing entropy and mutual information to probe causal aspects of dynamic and complex systems (Hlaváčková-Schindler et al. 2007).

Typically, causal inference methods in time series describe causality in the body of the distribution (causality in the mean). Several articles deal with second-order causality (causality in variance; Gemici and Polat (2021)) using, e.g., GARCH (generalized autoregressive conditional heteroskedastic) modeling (Hafner and Herwartz 2008). However, causality in extremes is a new field of research. A different perspective emerges when one focuses on the tails of the distributions (Coles 2001). Many causal mechanisms are present only during extreme events, and interventions often carry information that is likely to be causal (Cox and Wermuth 1996). While the connection between extreme value theory and time series has been thoroughly studied (Yang and Hongzhi 2005; Mikosch and Wintenberger 2015; Kulik and Soulier 2020), causal inference at this juncture remains unexplored.

Recent work intertwines extreme value theory and causality in SCM. Engelke and Hitz (2020) propose graphical models in the context of extremes. Kiriliouk and Naveau (2020) study probabilities of necessary and sufficient causation as defined in the counterfactual theory using the multivariate generalized Pareto distributions. Deuber et al. (2021) develop a method for estimating extremal quantiles of treatment effects. Further approaches involve recursive max-linear models on directed acyclic graphs (Gissibl and Klüppelberg 2018; Klüppelberg and Krali 2021).

Our paper aims to establish a framework of causality in extremes of time series. This work builds on the work of Gnecco et al. (2021), who first introduced the concept of causal tail coefficient in the context of SCM, followed by Pasche et al. (2022), who extended this coefficient by incorporating possible covariates into the model. We aim to move the theory of causality in extremes from SCMs into a context of Granger-type causality in time series.

The paper is organized as follows. The subsequent section provides a concise overview of the causal tail coefficient’s existing developments, presents a motivating time series example, and introduces preliminary results and notation. Section 2 contains the main results and provides an illustrative model example. Section 3 extends the proposed method and discusses outcomes under assumption violations. The section also addresses scenarios involving hidden confounders and a choice of the minimal time delay. Section 4 addresses the estimation problem, examining estimator properties and employing simulations on synthetic datasets. In Section 5, we apply the method to two real-world datasets. First, we consider an application concerning space weather and geomagnetic storms, corroborating prior findings that employ conditional mutual information. Lastly, we employ our methodology on a hydrometeorological dataset, investigating six distinct weather and climate phenomena. To maintain brevity, proofs are relocated to the Appendix. Appendix A introduces auxiliary propositions used in the proofs, while the proofs themselves can be found in Appendix B.

1.1 SCM and existing work on the causal tail coefficient

Gnecco et al. (2021) established the groundwork for the causal tail coefficient within linear SCMs (Pearl 2009). A linear SCM over real random variables \(X_1,\dots ,X_p\) is a collection of p assignments

$$\begin{aligned} X_j = \sum _{k\in pa(j)}\beta _{jk}X_k +\varepsilon _j, \quad j\in V, \end{aligned}$$
(1)

where \(pa(j)\subseteq V=\{1, \dots , p\}\) denotes the set of parents of j, \(\varepsilon _1, \dots , \varepsilon _p\) are jointly independent random variables and \(\beta _{jk}\in \mathbb {R}\setminus \{0\}\) are the causal weights. We suppose that the associated graph \(\mathcal {G}=(V,E)\), with the directed edge \((i,j)\) present if and only if \(i\in pa(j)\), is a directed acyclic graph (DAG) with nodes V and edges E. We say that \(X_i\) causes \(X_j\) if a directed path exists from i to j in \(\mathcal {G}\). Structural causal models describe not only observational distributions but also distributions under interventions (or manipulations) on the variables. By intervening on \(X_j\), we understand a new SCM with p assignments, all of them identical with (1) except the assignment for \(X_j\) (Pearl 2009).

Let \(X_i, X_j\) be a pair of random variables from linear SCM. We assume that \(X_i, X_j\) are heavy-tailed with respective distributions \(F_i, F_j\). The causal (upper) tail coefficient of \(X_i\) on \(X_j\) is defined in Gnecco et al. (2021) as

$$\Gamma _{i,j}:= \lim _{u\rightarrow 1^-}\mathbb {E}[ F_j(X_j)\mid F_i(X_i)>u ],$$

if the limit exists. This coefficient lies between zero and one and captures the causal influence of \(X_i\) on \(X_j\) in the upper tail since, intuitively, if \(X_i\) has a monotonically increasing causal influence on \(X_j\), we expect \(\Gamma _{i,j}\) to be close to unity. The coefficient is asymmetric, as extremes of \(X_j\) need not lead to extremes of \(X_i\), and in that case, \(\Gamma _{j,i}\) will be appreciably smaller than \(\Gamma _{i,j}\). In Section 1.2, we discuss the intuition of the choice of this coefficient in more detail (in the context of time series).

Under certain assumptions on the tails of \(\varepsilon _i\) and the linear SCM, the values of \(\Gamma _{i,j}\) and \(\Gamma _{j,i}\) allow us to discover the causal relationship between \(X_i\) and \(X_j\). These relations are summarized in Table 1.

Table 1 Causal relationship between a pair of random variables \(X_i, X_j\) following appropriately restricted linear SCM under different values of the causal tail coefficients. Blank entries and cases when \(\Gamma _{i,j} < 0.5\) or \(\Gamma _{j,i}<0.5\) cannot occur (Gnecco et al. 2021, Theorem 1)

Gnecco et al. (2021) proposed a consistent non-parametric estimator of \(\Gamma _{i,j}\). Without loss of generality, consider \(i = 1\) and \(j = 2\). If \((x_{1,1},x_{1,2}), \dots ,(x_{n,1},x_{n,2})\) are \(n\in \mathbb {N}\) independent replicates of \((X_1, X_2)\), the estimator of \(\Gamma _{1,2}\) is defined as

$$\begin{aligned} \hat{\Gamma }_{1,2}:=\frac{1}{k}\sum _{i=1}^n\hat{F}_2(x_{i,2})\mathbb {1}(x_{i,1}>x_{(n-k+1),1}), \end{aligned}$$
(2)

where \(x_{(n-k+1),1}\) is the k-th largest value of \(x_{1,1}, \dots , x_{n,1}\), \(\hat{F}_2(t)=\frac{1}{n}\sum _{i=1}^n \mathbb {1}(x_{i,2}\le t)\), and \(\mathbb {1}(\cdot )\) is the indicator function.
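The estimator (2) can be sketched in a few lines of code (a hypothetical implementation, assuming numpy; the function name `gamma_hat` is ours, not from the paper):

```python
import numpy as np

def gamma_hat(x1, x2, k):
    """Sketch of the non-parametric estimator (2) of Gamma_{1,2}.

    x1, x2 : 1-d arrays of n paired observations of (X_1, X_2)
    k      : number of upper order statistics of x1 used
    """
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n = x1.size
    # empirical CDF of X_2 at each observation: F2_hat(x_{i,2}) = rank_i / n
    f2 = (np.argsort(np.argsort(x2)) + 1) / n
    # x_{(n-k+1),1}: the k-th largest value of x1
    threshold = np.sort(x1)[n - k]
    # average F2_hat over observations whose x1-coordinate is extreme
    return float(f2[x1 > threshold].sum() / k)
```

On comonotone data the estimate is close to one, and on anti-monotone data close to zero, mirroring the intuition behind \(\Gamma _{i,j}\) described above.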

Pasche et al. (2022) adapted the estimation process to account for potential confounding factors. The authors modified (2) by replacing \(\hat{F}_2(x)\) by a covariate-dependent estimator, where the upper tail of \(\hat{F}_2(x)\) is modeled by a Pareto approximation (Coles 2001). Moreover, Pasche et al. (2022) implemented a permutation test to formally test the hypothesis that the causal tail coefficient is equal to 1.

1.2 Motivating example and the main idea in the context of time series

The following example illustrates a typical case considered in this paper. Let \((\textbf{X},\textbf{Y})^\top = ((X_t,Y_t)^\top , t\in \mathbb {Z})\) be a bivariate strictly stationary time series described by the following recurrent relations

$$\begin{aligned} X_t&=\frac{1}{2}X_{t-1}+\varepsilon _t^X,\\ Y_t&=\frac{1}{2}Y_{t-1}+ \sqrt{X_{t-5}} + \varepsilon _t^Y, \end{aligned}$$

where \(\varepsilon _t^X, \varepsilon _t^Y\overset{iid}{\sim }\) Pareto(1, 1).

Fig. 1 A sample realisation of \((\textbf{X},\textbf{Y})^\top\) from Section 1.2 (\(\textbf{X}\) is the cause and \(\textbf{Y}\) is the effect). The delay represents the time lag between the two time series, in this case equal to 5

This scenario is depicted in Fig. 1. Here, \(\textbf{X}\) causes \(\textbf{Y}\) (in Granger sense, see Definition 3 given later), simply because the knowledge of \(\textbf{X}\) improves the prediction of \(\textbf{Y}\). However, the converse is not true.

Consider the data shown in Fig. 1. Our goal is to identify any causal relationship between these time series. There is (at least in this realization) an evident asymmetry between the two time series in the extremes. If the cause (\(\textbf{X}\)) is extremely large, then the effect (\(\textbf{Y}\)) will also be extremely large (see the second and third “jump”). However, if \(\textbf{Y}\) is extremely large, then \(\textbf{X}\) will not necessarily be extremely large (as evident in the first “jump”). This indicates that an extreme value of \(\textbf{X}\) tends to trigger an extreme value of \(\textbf{Y}\), implying a causal link in an intuitive sense. Yet, a crucial factor is the presence of a time delay (or time lag). The extremes do not have to be concurrent; there is a time lag during which the influence of \(\textbf{X}\) on \(\textbf{Y}\) becomes apparent. In this specific example, this lag is exactly 5 time units.
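A realisation like the one in Fig. 1 can be generated directly from the recursions in this section (a minimal sketch, assuming numpy; the function name and the burn-in length are our own choices, the latter an arbitrary proxy for stationarity):

```python
import numpy as np

def simulate_example(n, burn=1000, seed=0):
    """Simulate X_t = X_{t-1}/2 + eps_t^X and
    Y_t = Y_{t-1}/2 + sqrt(X_{t-5}) + eps_t^Y with iid Pareto(1,1) noise,
    discarding the first `burn` observations."""
    rng = np.random.default_rng(seed)
    m = n + burn
    eps_x = rng.pareto(1.0, m) + 1.0   # Pareto(1,1): P(eps > x) = 1/x, x >= 1
    eps_y = rng.pareto(1.0, m) + 1.0
    x, y = np.zeros(m), np.zeros(m)
    for t in range(m):
        # at t = 0, x[t-1] wraps to the still-zero end of the array
        x[t] = 0.5 * x[t - 1] + eps_x[t]
        y[t] = 0.5 * y[t - 1] + np.sqrt(x[t - 5]) + eps_y[t] if t >= 5 else eps_y[t]
    return x[burn:], y[burn:]
```

Plotting such a sample reproduces the asymmetry discussed above: peaks of \(\textbf{X}\) reappear in \(\textbf{Y}\) five steps later, while peaks of \(\textbf{Y}\) driven by \(\varepsilon ^Y\) do not propagate back to \(\textbf{X}\).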

To encapsulate this idea mathematically, we introduce the causal tail coefficient for time series

$$\begin{aligned} \Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p):=\lim _{u\rightarrow 1^-}\mathbb {E}[\max \{F_Y(Y_0), \dots , F_Y(Y_{p})\}\mid F_X(X_0)>u], \end{aligned}$$
(3)

where \(F_X, F_Y\) are the marginal distribution functions of \(\textbf{X}, \textbf{Y}\), respectively. This coefficient mathematically expresses natural questions: Does an extreme value in \(\textbf{X}\) invariably lead to an extreme value in \(\textbf{Y}\)? How large will \(\textbf{Y}\) be in the next p steps if \(\textbf{X}\) is extremely large (in their respective scales)? In our example, we can consider \(p=5\). If \(X_0\) is extremely large, then \(Y_5\) will surely also be extremely large (large cause implies large effect), but not the other way around (large effect does not imply large cause). Hence intuitively, the following should hold: \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)=1\), but \(\Gamma ^{time}_{\textbf{Y}\rightarrow \textbf{X}}(p)<1\). The main part of the paper consists of determining the assumptions under which this is true.
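A naive finite-sample analogue of (3) replaces \(F_X, F_Y\) by empirical distribution functions and the limit \(u\rightarrow 1^-\) by a high quantile (a hypothetical sketch assuming numpy, not the estimator studied later in the paper):

```python
import numpy as np

def gamma_time_hat(x, y, p, q_level=0.99):
    """Naive empirical analogue of Gamma^time_{X->Y}(p): the average of
    max{F_Y(Y_t), ..., F_Y(Y_{t+p})} over times t at which X_t exceeds
    its q_level empirical quantile."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    f_y = (np.argsort(np.argsort(y)) + 1) / n        # empirical CDF of Y at Y_t
    # running maximum of F_Y over the windows [t, t + p]
    window_max = np.array([f_y[t:t + p + 1].max() for t in range(n - p)])
    extreme = x[:n - p] > np.quantile(x, q_level)
    return float(window_max[extreme].mean())
```

For a pair where \(\textbf{Y}\) is simply \(\textbf{X}\) lagged by five steps, the estimate in the causal direction is close to one, in line with the intuition above.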

1.3 Preliminaries and notation

We use bold capital letters to represent notation for time series and random vectors. A time series, or a random process, consists of a set of random variables \(\textbf{Z} = (Z_t, t \in \mathbb {Z})\) defined on the same probability space. We exclusively work with time series defined over integers.

We will use the standard notion of regular variation (Resnick 1987; Embrechts et al. 1997). A real random variable X is regularly varying with tail index \(\theta >0\), if its distribution function has a form \(F_X(x)=1-x^{-\theta } L(x)\) for some slowly varying function L, i.e., a function satisfying \(\lim _{x\rightarrow \infty }\frac{L(\alpha x)}{L(x)} = 1\) for every \(\alpha > 0\) (Kulik and Soulier 2020, Section 1.3). This property is denoted by \(X\sim \textrm{RV}(\theta )\). Examples of regularly varying distributions include the Pareto, Cauchy, and Fréchet distributions, to name a few (Mikosch 1999). For real functions \(f, g\), we denote \(f(x)\sim g(x) \iff \lim _{x\rightarrow \infty }\frac{f(x)}{g(x)}=1\).

The main principle that we aim to use is the so-called max-sum equivalence, that is, for two random variables \(X, Y\) we have \(\mathbb {P}(X+Y>x)\sim \mathbb {P}(X>x)+\mathbb {P}(Y>x)\sim \mathbb {P}(\max (X,Y)>x)\) as \(x\rightarrow \infty\). This is satisfied when \(X,Y\overset{iid}{\sim }\ \textrm{RV}(\theta )\) (Embrechts et al. 1997, Section 1.3.1). Similar results hold even if we deal with finite or infinite sums of random variables (Resnick 1987, Section 4.5).
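The max-sum equivalence can be illustrated numerically (a quick Monte-Carlo sketch, assuming numpy; the threshold \(u=200\) and the sample size are arbitrary choices):

```python
import numpy as np

# Monte-Carlo illustration of max-sum equivalence for iid Pareto(1,1) variables:
# for large u, P(X+Y > u), P(X > u) + P(Y > u) and P(max(X,Y) > u) nearly agree.
rng = np.random.default_rng(1)
n, u = 2_000_000, 200.0
x = rng.pareto(1.0, n) + 1.0   # classical Pareto(1,1), support [1, infinity)
y = rng.pareto(1.0, n) + 1.0

p_sum = np.mean(x + y > u)                  # P(X + Y > u)
p_marg = np.mean(x > u) + np.mean(y > u)    # P(X > u) + P(Y > u)
p_max = np.mean(np.maximum(x, y) > u)       # P(max(X, Y) > u)
```

All three estimated probabilities agree up to a few percent at this threshold, and the agreement improves as u grows.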

Consider a bivariate process \((\textbf{X},\textbf{Y})^\top =( (X_t, Y_t)^\top , t\in \mathbb {Z})\). The concept of strong Granger causality (Berzuini et al. 2012, Chapter 22.2) can be expressed as follows: The process \(\textbf{X}\) is said to cause \(\textbf{Y}\) (denoted as \(\textbf{X}\rightarrow \textbf{Y}\)) if \(Y_{t+1}\) is not independent of the past of \(\textbf{X}\) given all relevant variables in the universe up to time t except the past values of \(\textbf{X}\); that is,

$$Y_{t+1} \not \!\perp \!\!\!\perp \textbf{X}_{past(t)}\mid \mathcal {C}_{t}\setminus \textbf{X}_{past(t)},$$

where \(past(t) = (t, t-1, t-2, \dots )\) and \(\mathcal {C}_t\) represents all relevant variables in the universe up to time t. The philosophical notion of \(\mathcal {C}_{t}\) is typically replaced by only a finite set of relevant variables. To illustrate with an example, given a three-dimensional process \((\textbf{X},\textbf{Y},\textbf{Z})^\top =( (X_t, Y_t, Z_t)^\top , t\in \mathbb {Z})\), we replace the information set \(\mathcal {C}_t\) by \((\textbf{X}_{past(t)}, \textbf{Y}_{past(t)}, \textbf{Z}_{past(t)})^\top\) and say that the process \(\textbf{X}\) causes \(\textbf{Y}\) with respect to \((\textbf{X},\textbf{Y},\textbf{Z})^\top\) if \(Y_{t+1} \not \!\perp \!\!\!\perp \textbf{X}_{past(t)}\mid (\textbf{Y}_{past(t)}, \textbf{Z}_{past(t)})\). Note that such an \(\textbf{X}\) has to be seen only as a potential cause, since enlarging the information set can lead to a change in the causal structure.

In certain models, the definition of causality can be simplified (Palachy 2019). For instance, Sims (1972) demonstrated that the Granger causality definition is equivalent to certain parameter restrictions within a linear framework. We provide the formal definition in Section 2.1 for a specific class of autoregressive models.

2 The causal tail coefficient for time series

The central concept introduced in this paper is the causal tail coefficient for time series \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) with extremal delay \(p\in \mathbb {N}\) defined in (3). In scenarios where we exclude instantaneous effects (such as when \(X_0\) directly causes \(Y_0\)), we can directly employ the following coefficient:

$$\begin{aligned} \Gamma ^{time, -0}_{\textbf{X}\rightarrow \textbf{Y}}(p):=\lim _{u\rightarrow 1^-}\mathbb {E}[\max \{F_Y(Y_1), \dots , F_Y(Y_{p})\}\mid F_X(X_0)>u], \end{aligned}$$

where, as usual, \(F_X, F_Y\) are the marginal distribution functions of \(\textbf{X}, \textbf{Y}\), respectively.

Notice that \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\in [0,1]\), and \(\Gamma ^{time,-0}_{\textbf{X}\rightarrow \textbf{Y}}(p)\le \Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\le \Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p+1)\). Furthermore, for any increasing functions \(h_1\) and \(h_2:\mathbb {R}\rightarrow \mathbb {R}\), we observe \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p) = \Gamma ^{time}_{h_1(\textbf{X})\rightarrow h_2(\textbf{Y})}(p)\), where \(h_1(\textbf{X})=(h_1(X_t), t\in \mathbb {Z})\), as \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) depends solely on rescaled margins \(F_X(X_i)\) and \(F_Y(Y_i)\).

2.1 Models

In this paper, we work with two time series models: the \(\textrm{VAR}(q)\) process (vector autoregressive process of order q; Lütkepohl 2005, Section 2.3.1) and the \(\textrm{NAAR}(q)\) model (nonlinear additive autoregressive model of order q). We now introduce the notation.

Definition 1

We say that \((\textbf{X},\textbf{Y})^\top =((X_t,Y_t)^\top , t\in \mathbb {Z})\) follows the bivariate \(\textrm{VAR}(q)\) model, if it has the following representation:

$$\begin{aligned} X_t&= \alpha _{1}X_{t-1}+\dots + \alpha _{q}X_{t-q}+ \gamma _{1}Y_{t-1}+\dots +\gamma _{q}Y_{t-q} + \varepsilon _t^X,\\ Y_t&=\beta _{1}Y_{t-1}+\dots + \beta _{q}Y_{t-q} + \delta _{1}X_{t-1}+\dots +\delta _{q}X_{t-q} + \varepsilon _t^Y, \end{aligned}$$

where \(\alpha _i, \beta _i, \gamma _i, \delta _i\in \mathbb {R}\), \(i=1,\dots ,q\), are real constants, and \((\varepsilon _t^X, t\in \mathbb {Z})\), \((\varepsilon _t^Y, t\in \mathbb {Z})\) are white noises. We say that the model satisfies the stability condition if \(\det (I_{d}-A_1z-\dots -A_qz^q)\ne 0\) for all \(|z|\le 1\), where \(I_d\) denotes the d-dimensional identity matrix (here \(d=2\)) and \(A_i = \begin{pmatrix} \alpha _i & \gamma _i \\ \delta _i & \beta _i \end{pmatrix}\).

We say that \(\textbf{X}\) (Granger) causes \(\textbf{Y}\) (notation \(\textbf{X}\rightarrow \textbf{Y}\)) if there exists \(i\in \{1, \dots , q\}\) such that \(\delta _i\ne 0\).

Let \(i,j\in \mathbb {Z}\) with \(1\le j-i\le q\). We can specify that \(X_i\) causes \(Y_j\) if \(\delta _{j-i}\ne 0\). Note that \(\textbf{X}\) causes \(\textbf{Y}\) if and only if there exist \(i< j\) such that \(X_i\) causes \(Y_j\). In this paper, our focus is on exploring whether \(\textbf{X}\) causes \(\textbf{Y}\). It is also important to highlight that a situation where \(\textbf{X}\) causes \(\textbf{Y}\) and vice versa simultaneously is admissible.

Under the stability assumption, we can rewrite these time series using causal representations (Lütkepohl 2005, Section 2.1.3):

$$\begin{aligned} X_t=\sum _{i=0}^\infty a_i\varepsilon ^X_{t-i}+\sum _{i=0}^\infty c_i\varepsilon ^Y_{t-i};\qquad Y_t=\sum _{i=0}^\infty b_i\varepsilon ^Y_{t-i}+\sum _{i=0}^\infty d_i\varepsilon ^X_{t-i}, \end{aligned}$$
(4)

with suitable constants \(a_i, b_i, c_i, d_i\in \mathbb {R}\). Thus, \(\textbf{X}\) causes \(\textbf{Y}\) if and only if there exists \(i: d_i\ne 0\) (Kuersteiner 2010).
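Both the stability check and the coefficients \(a_i, b_i, c_i, d_i\) of (4) can be computed numerically. Writing the model as \((X_t,Y_t)^\top = \sum _{i=1}^q A_i (X_{t-i},Y_{t-i})^\top + (\varepsilon _t^X,\varepsilon _t^Y)^\top\), the coefficient matrices \(\Phi _s\) of the causal representation satisfy \(\Phi _0 = I\) and \(\Phi _s=\sum _{i=1}^{\min (s,q)}A_i\Phi _{s-i}\), the standard VAR machinery from Lütkepohl (2005). A sketch, assuming numpy (function names are ours):

```python
import numpy as np

def is_stable(A_list):
    """VAR(q) stability: all eigenvalues of the companion matrix
    [[A_1 ... A_q], [I 0 ... 0]] lie strictly inside the unit circle,
    equivalent to det(I - A_1 z - ... - A_q z^q) != 0 for |z| <= 1."""
    q, d = len(A_list), A_list[0].shape[0]
    top = np.hstack(A_list)
    bottom = np.eye(d * (q - 1), d * q)   # [I | 0] block; empty when q = 1
    companion = np.vstack([top, bottom])
    return bool(np.all(np.abs(np.linalg.eigvals(companion)) < 1))

def ma_coefficients(A_list, n_terms):
    """First n_terms matrices Phi_s of the causal (MA-infinity) representation:
    Phi_0 = I, Phi_s = sum_{i=1}^{min(s,q)} A_i Phi_{s-i}.  Entry (1,1) of
    Phi_s is a_s, (1,2) is c_s, (2,1) is d_s and (2,2) is b_s in (4)."""
    q, d = len(A_list), A_list[0].shape[0]
    phis = [np.eye(d)]
    for s in range(1, n_terms):
        phis.append(sum(A_list[i] @ phis[s - 1 - i] for i in range(min(s, q))))
    return phis
```

For the model of Example 1 below, this recursion reproduces \(a_s = 2^{-s}\) and \(d_s = s\,2^{-s}\).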

Now, we state our model assumptions.

Definition 2

(Heavy-tailed VAR model) Let \((\textbf{X},\textbf{Y})^\top\) follow the stable \(\textrm{VAR}(q)\) model as defined above with its causal representation given by (4). We introduce the assumptions:

  • \(\varepsilon _t^X, \varepsilon _t^Y\overset{\text {iid}}{\sim }\textrm{RV}(\theta )\) for some \(\theta >0\),

  • \(\alpha _i, \beta _i, \gamma _i, \delta _i\ge 0\),

  • \(\exists \delta >0\) such that \(\sum _{i=0}^\infty a_i^{\theta -\delta }<\infty\), \(\sum _{i=0}^\infty b_i^{\theta -\delta }<\infty\), \(\sum _{i=0}^\infty c_i^{\theta -\delta }<\infty\), \(\sum _{i=0}^\infty d_i^{\theta -\delta }<\infty\).

Under these assumptions, we refer to \((\textbf{X},\textbf{Y})^\top\) as following the heavy-tailed \(\textrm{VAR}(q, \theta )\) model (\(\textrm{hVAR}(q,\theta )\) for short).

The first assumption is crucial, as it ensures the regular variation of our time series. The second assumption can be relaxed and can be replaced by the extremal causal condition discussed in Section 3.1. The third assumption is aimed at ensuring the a.s. summability of the sums \(\sum _{i=0}^\infty a_i\varepsilon ^X_{t-i}\). Moreover, this assumption guarantees the stationarity of \(\textbf{X}\) and \(\textbf{Y}\). It also establishes a crucial max-sum equivalence relationship, namely \(\mathbb {P}(\sum _{i=0}^\infty \alpha _i\varepsilon _i^\cdot>u)\sim [\sum _{i=0}^\infty \alpha _i^\theta ] \mathbb {P}(\varepsilon _1^\cdot >u)\) (Mikosch and Samorodnitsky 2000, Lemma A.3).

We consider a nonlinear generalization in order to minimize the assumptions on the body of the time series. First, we introduce the nonlinear counterpart of Definition 1.

Definition 3

We define \((\textbf{X},\textbf{Y})^\top =((X_t,Y_t)^\top , t\in \mathbb {Z})\) to follow the bivariate \(\textrm{NAAR}(q)\) model, specified by the equations:

$$\begin{aligned} X_t=f_{1}(X_{t-1}) + f_{2}(Y_{t-q}) + \varepsilon _t^X; \qquad Y_t=g_{1}(Y_{t-1}) + g_{2}(X_{t-q}) + \varepsilon _t^Y. \end{aligned}$$

We say that \(\textbf{X}\) (Granger) causes \(\textbf{Y}\) if \(g_2\) is a non-constant function on the support of \(X_{t-q}\) (a.s.).

It is interesting to note that in the univariate case \(d=1\), the often used condition \(\lim _{|x|\rightarrow \infty }\frac{|f_1(x)|}{|x|}<1\) is “almost” necessary for stationarity (Bhattacharya and Lee 1995, Corollary 2.2; Andel 1989, Theorem 2.2).

Definition 4

(Heavy-tailed NAAR model) Let \((\textbf{X},\textbf{Y})^\top\) follow the stationary \(\textrm{NAAR}(q)\) model from Definition 3. We require functions \(f_1, f_2, g_1, g_2\) to either be zero functions or continuous non-negative functions satisfying \(\lim _{x\rightarrow \infty }h(x)=\infty\) and \(\lim _{x\rightarrow \infty }\frac{h(x)}{x}<1\) for \(h=f_1, f_2, g_1, g_2\). Moreover, let \(\varepsilon _t^X, \varepsilon _t^Y\overset{\text {iid}}{\sim }\textrm{RV}(\theta )\) be non-negative for some \(\theta >0\). Then, we say that \((\textbf{X},\textbf{Y})^\top\) follows the heavy-tailed \(\textrm{NAAR}(q,\theta )\) model (\(\textrm{hNAAR}(q,\theta )\) for short).

It is important to emphasize that our restrictions on the functions \(f_1, f_2, g_1, g_2\) are minimal in the body of the distribution; we have only imposed conditions on their tails. By imposing the constraint \(\lim _{x\rightarrow \infty }\frac{h(x)}{x}<1\) for \(h=f_2, g_2\), we ensure that the tail of \(f_2(Y_{t-q})\) (resp. \(g_2(X_{t-q})\)) is not heavier than the tail of \(Y_{t-q}\) (resp. \(X_{t-q}\)). The assumption \(\lim _{x\rightarrow \infty }\frac{h(x)}{x}<1\) for \(h=f_1, g_1\) is closely related to the stationarity of the time series. In the \(\textrm{hNAAR}\) models, our assumption of regular variation for the noise variables directly implies that \(X_t, Y_t\sim \textrm{RV}(\theta )\) (Yang and Hongzhi 2005, Theorem 2.3). Moreover, our framework does not assume the existence of any moments; it accommodates cases like \(\theta <1\), where the expectation \(\mathbb {E}(X_t)\) does not exist.

At first sight, \(\textrm{hNAAR}(q,\theta )\) models encompass a broader class of time series models than \(\textrm{hVAR}(q,\theta )\) models. However, the two classes are not nested. In the \(\textrm{hNAAR}(q,\theta )\) case, we assumed that \(X_t\) and \(Y_t\) are functions of only two previous values (\(X_t\) being a function of \(X_{t-1},Y_{t-q}\) and \(Y_t\) being a function of \(Y_{t-1}, X_{t-q}\)), whereas in the VAR case, \(X_t, Y_t\) can depend on more than two previous values. Moreover, the NAAR case carries the additional assumption \(\varepsilon _t^X, \varepsilon _t^Y\ge 0\). Nevertheless, up to these two differences, the class of \(\textrm{hVAR}(q,\theta )\) models lies inside the class of \(\textrm{hNAAR}(q,\theta )\) models.

2.2 Causal direction

The subsequent two theorems constitute the core of this paper, connecting the classical concept of causality with causality in extreme events.

Theorem 1

Let \((\textbf{X},\textbf{Y})^\top\) be a bivariate time series which follows either the \(\textrm{hVAR}(q,\theta )\) model or the \(\textrm{hNAAR}(q,\theta )\) model. If \(\textbf{X}\) causes \(\textbf{Y}\), then \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(q)=1\).

The proof can be found in Appendix B. Intriguingly, the regular variation condition is not used within the proof. We assume that we know the exact (correct) order q. Nevertheless, for every \(p\ge q\), we have \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\ge \Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(q) =1\). The choice of an appropriate delay p will be discussed in Section 3.3.

Theorem 2

Let \((\textbf{X},\textbf{Y})^\top\) be a bivariate time series which follows either the \(\textrm{hVAR}(q, \theta )\) model or the \(\textrm{hNAAR}(q, \theta )\) model. If \(\textbf{Y}\) does not cause \(\textbf{X}\), then \(\Gamma ^{time}_{\textbf{Y}\rightarrow \textbf{X}}(p)<1\) for all \(p\in \mathbb {N}\).

The proof can be found in Appendix B. The primary step of the proof stems from Proposition 2, presented in Appendix A. The core idea is that large sums of independent, regularly varying random variables tend to be driven by only a single large value. Consequently, if \(Y_0\) is large, it could be attributed to the largeness of an \(\varepsilon _i^Y\), which does not affect \(\textbf{X}\).

Note that distinct notation is employed for the time series order (denoted \(q\in \mathbb {N}\)) and the extremal delay (denoted \(p\in \mathbb {N}\)). Although these two quantities need not coincide, Theorem 1 prompts our primary focus on scenarios where \(p\ge q\).

Example 1

Consider the bivariate time series \((\textbf{X},\textbf{Y})^\top\) described by the equations:

$$\begin{aligned} X_t&=0.5X_{t-1}+ \varepsilon _t^X;\qquad Y_t=0.5Y_{t-1}+0.5X_{t-1}+ \varepsilon _t^Y, \end{aligned}$$

where \(\varepsilon _t^X, \varepsilon _t^Y\overset{\text {iid}}{\sim }\textrm{Pareto}(1,1)\), with a tail index \(\theta =1\). A causal representation of this \(\textrm{hVAR}(1,1)\) model is given by:

$$\begin{aligned} X_t=\sum _{i=0}^\infty \frac{1}{2^i}\varepsilon ^X_{t-i}; \qquad Y_t=\sum _{i=0}^\infty \frac{1}{2^i}\varepsilon ^Y_{t-i}+\sum _{i=0}^\infty \frac{i}{2^i}\varepsilon ^X_{t-i}. \end{aligned}$$

In this case, the order is \(q=1\), and it is sufficient to take only

$$\Gamma ^{time, -0}_{\textbf{X}\rightarrow \textbf{Y}}(1)=\lim _{u\rightarrow 1^-}\mathbb {E}[F_Y(Y_1)\mid F_X(X_0)>u],$$

as discussed in Section 3.3. Let us give a heuristic computation of this coefficient. From Theorem 1, we know that \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(1)=1\). For the other direction, rewrite

$$\begin{aligned} \lim _{u\rightarrow 1^-}\mathbb {E}[F_X(X_1)\mid F_Y(Y_0)>u]= \lim _{v\rightarrow \infty }\mathbb {E}[F_X(X_1)\mid \sum _{i=0}^\infty \frac{1}{2^i}\varepsilon ^Y_{-i}+\sum _{i=0}^\infty \frac{i}{2^i}\varepsilon ^X_{-i}>v]. \end{aligned}$$

First, note the following relations (the first follows from the independence and the second from Lemma 5 in Appendix A):

$$\begin{aligned}&\lim _{v\rightarrow \infty }\mathbb {E}[F_X(X_1)\mid \sum _{i=0}^\infty \frac{1}{2^i}\varepsilon ^Y_{-i}>v]=\mathbb {E}[F_X(X_1)]=1/2,\\&\lim _{v\rightarrow \infty }\mathbb {E}[F_X(X_1)\mid \sum _{i=0}^\infty \frac{i}{2^i}\varepsilon ^X_{-i}>v]=1. \end{aligned}$$

Furthermore, we know that \(\lim _{v\rightarrow \infty }\mathbb {P}(X_1<\lambda \mid \sum _{i=0}^\infty \frac{1}{2^i}\varepsilon ^Y_{-i}+\sum _{i=0}^\infty \frac{i}{2^i}\varepsilon ^X_{-i}>v)= \frac{\mathbb {P}(X_1<\lambda )}{2}\) for every constant \(\lambda \in \mathbb {R}\) (using Proposition 2). This implies that, with a probability of 1/2, \(X_1\mid {F_Y(Y_0)>u}\) has asymptotically the same distribution as the unconditional \(X_1\), and with complementary probability, it tends to \(\infty\) as \(u\rightarrow 1^-\). Combining these results, we obtain:

$$\Gamma ^{time, -0}_{\textbf{Y}\rightarrow \textbf{X}}(1) = \lim _{u\rightarrow 1^-}\mathbb {E}[F_X(X_1)\mid F_Y(Y_0)>u]=\frac{1}{2}\cdot \frac{1}{2} + \frac{1}{2}\cdot 1=\frac{3}{4}.$$

The order q is usually unknown. If we take \(p=2\), then the value of

$$\Gamma ^{time}_{\textbf{Y}\rightarrow \textbf{X}}(2)=\lim _{u\rightarrow 1^-}\mathbb {E}[\max \{F_X(X_0), F_X(X_1), F_X(X_2)\}\mid F_Y(Y_0)>u]$$

will be slightly larger than \(\frac{3}{4}\). More precisely, it equals \(\frac{1}{2}\cdot \mathbb {E}[\max \{F_X(X_0), F_X(X_1),F_X(X_2)\}] + \frac{1}{2}\cdot 1\). The true value is around 0.80, as determined by simulation.
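These numbers can be checked by simulation (a rough sketch, assuming numpy; the quantile level 0.995 is an arbitrary finite-sample proxy for the limit \(u\rightarrow 1^-\), and the naive empirical version of (3) below is ours, not the estimator studied later in the paper):

```python
import numpy as np

# Simulate Example 1 and estimate the coefficients empirically in both directions.
rng = np.random.default_rng(7)
n, burn = 100_000, 1_000
eps_x = rng.pareto(1.0, n + burn) + 1.0   # Pareto(1,1) noise
eps_y = rng.pareto(1.0, n + burn) + 1.0
x = np.zeros(n + burn)
y = np.zeros(n + burn)
for t in range(1, n + burn):
    x[t] = 0.5 * x[t - 1] + eps_x[t]
    y[t] = 0.5 * y[t - 1] + 0.5 * x[t - 1] + eps_y[t]
x, y = x[burn:], y[burn:]

def gamma_time(cause, effect, p, q_level=0.995):
    """Naive empirical version of (3) at a finite quantile level."""
    m = cause.size
    f = (np.argsort(np.argsort(effect)) + 1) / m
    window = np.array([f[t:t + p + 1].max() for t in range(m - p)])
    hot = cause[:m - p] > np.quantile(cause, q_level)
    return float(window[hot].mean())

g_xy = gamma_time(x, y, p=1)   # theory: tends to 1 (X causes Y)
g_yx = gamma_time(y, x, p=1)   # theory: stays visibly below 1
```

The asymmetry between the two directions is clearly visible even at this moderate sample size.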

3 Properties and extensions

In this section, we exclusively focus on heavy-tailed VAR models. While we believe that similar results also apply to heavy-tailed NAAR models, the formulation and proof of such results are beyond the scope of this paper.

3.1 Real-valued coefficients

We discuss the extension to potentially negative coefficients (relaxing the second assumption in Definition 2) and to dependencies that are not directly proportional. Until now, we have assumed that a large positive \(\textbf{X}\) causes a large positive \(\textbf{Y}\); in other words, that all coefficients in the \(\textrm{hVAR}\) model are non-negative.

The most straightforward modification arises when a large positive \(\textbf{X}\) causes large negative \(\textbf{Y}\) (where the gain for one causes a loss for others). In such cases, it suffices to consider the maxima of the pair \((\textbf{X}, -\textbf{Y})\). While we apply this simple modification in our application, it cannot be universally applied.

The concept behind extending the causal tail coefficient for time series involves utilizing the absolute values \(|\textbf{X}|\) and \(|\textbf{Y}|\) instead of \(\textbf{X}\) and \(\textbf{Y}\). However, to implement this extension, we need to restrict the family of \(\textrm{VAR}(q)\) models in a specific manner: Theorems 1 and 2 do not hold if we allow arbitrary coefficients \(\alpha _i\), \(\beta _i\), \(\gamma _i\), \(\delta _i \in \mathbb {R}\), \(i=1, \dots , q\), whether positive or negative. The following example illustrates such a problematic scenario.

Example 2

Consider a vector \((\textbf{X}, \textbf{Y})^\top\) following a \(\textrm{VAR}(2)\) model defined as:

$$\begin{aligned} X_t&=0.5X_{t-1}+ \varepsilon _t^X; \qquad Y_t=X_{t-1} - 0.5 X_{t-2}+ \varepsilon _t^Y. \end{aligned}$$

Its causal representation can be written as

$$\begin{aligned} X_t&=\sum _{i=0}^\infty \frac{1}{2^i}\varepsilon ^X_{t-i};\qquad Y_t=\varepsilon _t^Y + \varepsilon ^X_{t-1}. \end{aligned}$$
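To verify the second expression, substitute \(X_{t-1} = 0.5X_{t-2} + \varepsilon ^X_{t-1}\) into the defining equation for \(Y_t\); the terms involving \(X_{t-2}\) cancel:

$$\begin{aligned} Y_t&=X_{t-1} - 0.5 X_{t-2}+ \varepsilon _t^Y =\left( 0.5 X_{t-2}+\varepsilon ^X_{t-1}\right) - 0.5 X_{t-2}+ \varepsilon _t^Y =\varepsilon _t^Y + \varepsilon ^X_{t-1}. \end{aligned}$$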

Detecting extreme causal relationships is challenging in this model: although \(\textbf{X}\) is a cause of \(\textbf{Y}\), an extreme value of \(X_{t-1}\) does not necessarily result in an extreme value of \(Y_t\). Indeed, if \(X_{t-1}\) is large only because \(X_{t-2}\) was large, the terms \(X_{t-1} - 0.5 X_{t-2}\) largely cancel, and \(Y_t\) need not be extreme; \(Y_t\) is extreme essentially only when \(\varepsilon ^X_{t-1}\) itself is. To address this, we impose the following restriction on our model.

Definition 5

(Extremal causal condition) Consider the pair \((\textbf{X},\textbf{Y})^\top\), which follows the stable \(\textrm{VAR}(q)\) model, with its causal structure taking the form:

$$\begin{aligned} X_t=\sum _{i=0}^\infty a_i\varepsilon ^X_{t-i}+\sum _{i=0}^\infty c_i\varepsilon ^Y_{t-i};\qquad Y_t=\sum _{i=0}^\infty b_i\varepsilon ^Y_{t-i}+\sum _{i=0}^\infty d_i\varepsilon ^X_{t-i}. \end{aligned}$$

Let \(\textbf{X}\) cause \(\textbf{Y}\). We say that \((\textbf{X},\textbf{Y})^\top\) satisfies the extremal causal condition if there exists an integer \(r \le q\) such that, for every \(i\in \mathbb {N}\cup \{0\}\):

$$a_i\ne 0 \implies d_{i+r} \ne 0.$$

Lemma 1

The extremal causal condition holds in the \(\textrm{hVAR}(q, \theta )\) models (i.e., where the coefficients are non-negative) when \(\textbf{X}\) causes \(\textbf{Y}\).

The proof can be found in Appendix B. Clearly, the aforementioned condition holds in models where \(d_i \ne 0\) for all \(i > 0\).

The extremal causal condition embodies the following idea: take \(k\ge r\). If \(\varepsilon _t^X\) is “extreme”, this extreme event also influences \(Y_{t+k}\). This holds even when some of the coefficients are negative: an “extremely positive” \(\varepsilon _t^X\) may then make \(Y_{t+k}\) “extremely negative”, but under the extremal causal condition an “extreme” \(\varepsilon _t^X\) still produces an “extreme” \(Y_{t+k}\).

Formulating a similar condition for \(\textrm{NAAR}\) models is challenging, since it would require an appropriate combination of functions to be non-zero. For this reason, our focus in this section remains exclusively on \(\textrm{VAR}\) models, excluding \(\textrm{NAAR}\) models. The following theorem demonstrates that, thanks to the extremal causal condition, the principles outlined in Section 2 remain applicable even when negative coefficients are introduced.

Theorem 3

Let \((\textbf{X},\textbf{Y})^\top\) be a time series which follows the \(\textrm{hVAR}(q, \theta )\) model, with possibly negative coefficients, satisfying the extremal causal condition. Moreover, let \(\varepsilon _t^X, \varepsilon _t^Y\) have full support on \(\mathbb {R}\), and \(|\varepsilon _t^X|, |\varepsilon _t^Y|\sim \textrm{RV}(\theta )\). If \(\textbf{X}\) causes \(\textbf{Y}\), and \(\textbf{Y}\) does not cause \(\textbf{X}\), then \(\Gamma ^{time}_{|\textbf{X}|\rightarrow |\textbf{Y}|}(q)=1\), and \(\Gamma ^{time}_{|\textbf{Y}|\rightarrow |\textbf{X}|}(q)<1\).

The proof can be found in Appendix B.

3.2 Common cause and different tail behavior

Reichenbach’s common cause principle (Reichenbach 1956) states that for every given pair of random variables \((\textbf{X}, \textbf{Y})\), precisely one of the following statements holds true: \(\textbf{X}\) causes \(\textbf{Y}\); \(\textbf{Y}\) causes \(\textbf{X}\); they are independent; or there exists \(\textbf{Z}\) causing both \(\textbf{X}\) and \(\textbf{Y}\). The challenge lies in distinguishing between true causality and dependence stemming from a common cause. The subsequent theorem demonstrates that our methodology allows us to distinguish between causality and a correlation due to a common cause.

Theorem 4

Let \((\textbf{X}, \, \textbf{Y}, \, \textbf{Z})^\top = ((X_t,Y_t,Z_t)^\top , t\in \mathbb {Z})\) follow the three-dimensional stable \(\textrm{VAR}(q)\) model, with iid regularly varying noise variables. Let \(\textbf{Z}\) be a common cause of both \(\textbf{X}\) and \(\textbf{Y}\), and neither \(\textbf{X}\) nor \(\textbf{Y}\) cause \(\textbf{Z}\). If \(\textbf{Y}\) does not cause \(\textbf{X}\), then \(\Gamma ^{time}_{\textbf{Y}\rightarrow \textbf{X}}(p)<1\) for all \(p\in \mathbb {N}\).

The proof can be found in Appendix B. Theorem 4 assumes that the tail indexes of \(\textbf{X}, \textbf{Y},\) and \(\textbf{Z}\) are equal; however, it can easily be extended to the case where \(\textbf{Z}\) has lighter tails (see the discussion in Section 4.5). In practical scenarios, complete observation of all pertinent data is often unattainable. Nevertheless, Theorem 4 remains valid even when the common cause is unobserved, provided its noise is regularly varying with tails no heavier than those of \(\textbf{X}\) and \(\textbf{Y}\). We cannot check this assumption in practice.

Example 3

Let \((\textbf{X}, \, \textbf{Y}, \, \textbf{Z})^\top\) follow the three-dimensional \(\textrm{VAR}(1)\) model, specified by

$$\begin{aligned} Z_t&=0.5Z_{t-1}+\varepsilon _t^Z,\\ X_t&=0.5X_{t-1}+0.5Z_{t-1} + \varepsilon _t^X,\\ Y_t&=0.5Y_{t-1}+0.5Z_{t-1} +0.5X_{t-1} + \varepsilon _t^Y, \end{aligned}$$

where \(\varepsilon _t^X, \varepsilon _t^Y\overset{\text {iid}}{\sim }\textrm{Pareto}(2,2)\) and \(\varepsilon _t^Z \overset{\text {iid}}{\sim }\textrm{Pareto}(1,1)\) (i.e., \(\varepsilon _t^Z\) has a heavier tail than \(\varepsilon _t^X, \varepsilon _t^Y\)). Then \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(1)=\Gamma ^{time}_{\textbf{Y}\rightarrow \textbf{X}}(1)=1\), even though \(\textbf{Y}\) does not cause \(\textbf{X}\).
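The behavior described in Example 3 can be reproduced numerically. The sketch below is our Python illustration, not part of the supplementary R package; it simulates the model above and evaluates the estimator \(\hat{\Gamma }^{time}\) of Definition 7 (Section 4.1) in both directions. With the heavier-tailed confounder, both estimates come out close to 1.

```python
import numpy as np

def gamma_hat(x, y, p, k):
    # Estimator of Gamma^time_{X->Y}(p) from Definition 7 (Section 4.1).
    n = len(x)
    F_y = np.searchsorted(np.sort(y), y, side="right") / n  # empirical CDF ranks
    tau = np.sort(x)[n - k]                                 # k-th largest value of x
    return sum(F_y[i:i + p + 1].max() for i in range(n - p) if x[i] >= tau) / k

rng = np.random.default_rng(0)
n, burn = 10_000, 200
# Pareto(1,1) noise for Z (heavier tail); Pareto(2,2) noise for X and Y.
eps_z = 1.0 / rng.uniform(size=n + burn)
eps_x = 2.0 / np.sqrt(rng.uniform(size=n + burn))
eps_y = 2.0 / np.sqrt(rng.uniform(size=n + burn))

z = np.zeros(n + burn)
x = np.zeros(n + burn)
y = np.zeros(n + burn)
for t in range(1, n + burn):
    z[t] = 0.5 * z[t - 1] + eps_z[t]
    x[t] = 0.5 * x[t - 1] + 0.5 * z[t - 1] + eps_x[t]
    y[t] = 0.5 * y[t - 1] + 0.5 * z[t - 1] + 0.5 * x[t - 1] + eps_y[t]
x, y = x[burn:], y[burn:]

k = int(np.sqrt(n))
g_xy = gamma_hat(x, y, 1, k)  # both values land close to 1,
g_yx = gamma_hat(y, x, 1, k)  # even though Y does not cause X
```

Since \(\textbf{X}\) and \(\textbf{Y}\) receive the heavy-tailed shock \(\varepsilon ^Z\) through the same lag, almost every extreme of one series coincides with an extreme of the other, which is exactly what drives both coefficients towards 1.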

Even in the unconfounded scenario (or when \(\varepsilon _t^Z\) possesses tails that are lighter than those of \(\varepsilon _t^X\) and \(\varepsilon _t^Y\)), a distinct challenge emerges when the tail behaviors of \(\varepsilon _t^X\) and \(\varepsilon _t^Y\) differ. Results from Section 2 remain applicable if \(\varepsilon _t^X\) has lighter tails than \(\varepsilon _t^Y\). However, if \(\varepsilon _t^X\) has heavier tails than \(\varepsilon _t^Y\), then \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(1)=\Gamma ^{time}_{\textbf{Y}\rightarrow \textbf{X}}(1)=1\) and we cannot distinguish between the cause and the effect. A more detailed discussion can be found in Section 4.5.

3.3 Estimating the extremal delay p

Up to this point, we have assumed prior knowledge of the exact order q of our time series, with p set equal to q. However, what if this order is unknown? If p is too small, accurate causal relationships cannot be obtained (see Lemma 2). Conversely, as p becomes larger, \(\Gamma ^{time}_{\textbf{Y}\rightarrow \textbf{X}}(p)\) approaches 1, posing challenges for empirical inference.

One practical approach to this challenge is the extremogram (Davis and Mikosch 2009). Similar to the role of correlograms in the classical setting, extremograms help us select a “reasonable” value of p from a plot. Alternatively, we can examine the values of \(\Gamma ^{time}_{\textbf{Y}\rightarrow \textbf{X}}(p)\) for a range of values of p. An illustrative example can be found in Section 4.4.

We now turn to another consideration: estimating the temporal synchronization between two time series \((\textbf{X}, \textbf{Y})^\top\), where \(\textbf{X}\) is the cause of \(\textbf{Y}\). The objective is to determine the time required for information originating from \(\textbf{X}\) to impact \(\textbf{Y}\): if an intervention is applied to \(\textbf{X}\), when will it influence \(\textbf{Y}\)? A practical instance arises in economics, for example with two time series representing milk and cheese prices over time. If the government imposes a large tax increase on milk prices at a specific moment, when can we expect to observe a subsequent rise in cheese prices? Mathematically, we aim to estimate a parameter known as the minimal delay.

Definition 6

(Minimal delay) Let \((\textbf{X},\textbf{Y})^\top\) follow the stable \(\textrm{VAR}(q)\) model specified in Definition 2. We call \(s\in \mathbb {N}\) the minimal delay if \(\gamma _1=\dots = \gamma _{s-1} = \delta _1=\dots = \delta _{s-1}=0\) and either \(\delta _s\ne 0\) or \(\gamma _s\ne 0\). If no such s exists, we define the minimal delay as \(+\infty\).

The subsequent lemma establishes connections between the minimal delay and the causal tail coefficient for time series.

Lemma 2

Let \((\textbf{X},\textbf{Y})^\top\) follow the \(\textrm{hVAR}(q, \theta )\) model, where \(\textbf{X}\) causes \(\textbf{Y}\). Let s be the minimal delay. Then, \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(r)<1\) for all \(r<s\), and \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(r)=1\) for all \(r\ge s\).

The proof can be found in Appendix B. An approach to estimating the minimal delay s involves identifying the smallest value of s for which \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(s)=1\).

Estimating the minimal delay s through extreme values is relevant in various applications. One such application is the assessment of synchrony between two time series. Synchrony, as described in Cheong (2020), quantifies the interdependence of two time series, seeking an optimal lag that aligns them most closely. Conventional methodologies include time-lagged cross-correlation and dynamic time warping. The causal tail coefficient for time series offers a way to proceed using extreme values, suggesting a notion of synchronization of extremes. This may pave the way for future research and methodological developments.

4 Estimations and simulations

All the methodologies presented in this section have been implemented in the R programming language (R Core Team 2022). The corresponding code can be accessed either in the supplementary package or through the online repository available at https://github.com/jurobodik/Causality-in-extremes-of-time-series.git.

4.1 Non-parametric estimator

We propose an estimator of the causal tail coefficient \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) based on a finite sample \((x_1, y_1)^\top , \dots , (x_n, y_n)^\top , n\in \mathbb {N}\). The estimator uses only the values of \(y_i\) where the corresponding \(x_i\) surpasses a given threshold.

Definition 7

We define

$$\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p):=\frac{1}{k}\sum _{\begin{array}{c} i = 1, \dots , n-p \\ x_i\ge \tau _k^X \end{array}}\max \{\hat{F}_Y(y_i), \dots , \hat{F}_Y(y_{i+p})\},$$

where \(\tau _k^X=x_{(n-k+1)}\) is the k-th largest value of \(x_1, \dots , x_n\), where \(k\le n\), and \(\hat{F}_Y(t)=\frac{1}{n}\sum _{j=1}^n \mathbb {1}(y_j\le t)\).
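For concreteness, the estimator admits a direct transcription into code. The following is a minimal Python sketch of Definition 7; the reference implementation accompanying the paper is the R code in the repository above, and the function and variable names here are ours.

```python
import numpy as np

def gamma_hat(x, y, p, k=None):
    """Sketch of the estimator of Gamma^time_{X->Y}(p) from Definition 7."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    if k is None:
        k = int(np.sqrt(n))  # pragmatic choice k_n = sqrt(n) used in the paper
    # Empirical CDF of Y at each observation: F_hat_Y(y_i) = #{j : y_j <= y_i} / n.
    F_y = np.searchsorted(np.sort(y), y, side="right") / n
    # Threshold tau_k^X = x_{(n-k+1)}, the k-th largest value among x_1, ..., x_n.
    tau = np.sort(x)[n - k]
    total = 0.0
    for i in range(n - p):  # i = 1, ..., n - p in the paper's 1-based notation
        if x[i] >= tau:
            total += F_y[i:i + p + 1].max()
    return total / k
```

Applying the function in both directions, `gamma_hat(x, y, p)` versus `gamma_hat(y, x, p)`, exposes the asymmetry between cause and effect exploited in the simulations below.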

The value k denotes the number of extremes considered in the estimator. In what follows, k will depend on n, so we write \(k_n\) instead of k. A fundamental condition in extreme value theory is expressed as:

$$\begin{aligned} k_n\rightarrow \infty , \; \frac{k_n}{n}\rightarrow 0, \quad \text { as } n\rightarrow \infty . \end{aligned}$$
(5)

Note that \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) always lies between zero and one. The next theorem shows that this statistic is “reasonable” by establishing that it is asymptotically unbiased. The result does not require any stringent model assumptions; it relies only on a certain rate of convergence of the empirical distribution function.

Theorem 5

Let \((\textbf{X},\textbf{Y})^\top =((X_t,Y_t)^\top , t\in \mathbb {Z})\) be a stationary bivariate time series, whose marginal distributions are absolutely continuous with support on some neighborhood of infinity. Let \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) exist. Let \(k_n\) satisfy (5) and

$$\begin{aligned} \frac{n}{k_n}\mathbb {P}\left( \frac{n}{k_n} \sup _{x\in \mathbb {R}}|\hat{F}_X(x)-F_X(x)|>\delta \right) \overset{n\rightarrow \infty }{\longrightarrow }0, \,\,\, \forall \delta >0. \end{aligned}$$
(6)

Then, \(\mathbb {E}\,\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\overset{n\rightarrow \infty }{\rightarrow }\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\).

Consequence 1

Let \(\textbf{X}\rightarrow \textbf{Y}\). Under the assumptions of Theorems 1 and 5, the proposed estimator is consistent; that is, \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\overset{P}{\rightarrow }\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) as \(n\rightarrow \infty\).

The proofs of Theorem 5 and Consequence 1 can be found in Appendix B. To establish asymptotic properties of \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\), it is reasonable to assume the consistency of \(\hat{F}_X\). However, this assumption is accompanied by an additional requirement for the rate of its convergence, as specified in condition (6). While this condition is not overly restrictive, we believe that the condition (6) can be improved. For iid random variables, condition (6) holds if \(k_n^2/n \rightarrow \infty\) (follows from the Dvoretzky–Kiefer–Wolfowitz inequality). The following lemma establishes that condition (6) is also fulfilled within a certain subset of autoregressive models.

Lemma 3

Let \(\textbf{X} = (X_t, t\in \mathbb {Z})\) have the form

$$X_t = \sum _{k=0}^\infty a_k \varepsilon _{t-k},$$

where \((\varepsilon _t, t \in \mathbb {Z})\) are iid random variables with density \(f_\varepsilon\). Assume \(|a_k|\le \gamma k^{-\beta }\) for some \(\beta >1\), \(\gamma >0\) and all \(k \in \mathbb {N}\). Let \(f^\star = \max (1, |f_\varepsilon |_{\infty }, |f'_{\varepsilon }|_{\infty }) < \infty\), where \(|f|_{\infty } = \sup _{x\in \mathbb {R}}|f(x)|\) is the supremum norm. Assume \(\mathbb {E}\,|\varepsilon _0|^q<\infty\) for \(q>2\) and \(\mathbb {P}(|\varepsilon _0|>x) \le L(\log x)^{-r_0}x^{-q}\) for some constants \(L>0, r_0>1\) and for every \(x>1\). If the sequence \((k_n)\) satisfies (5) and

$$\begin{aligned} \exists c>\max \left\{ \frac{1}{2},\frac{2}{1+q\beta }\right\} : \frac{k_n}{n^c}\rightarrow \infty , \text { as }n\rightarrow \infty , \end{aligned}$$
(7)

then the condition (6) is satisfied.

The proof can be found in Appendix B. It is based on Proposition 13 in Chen and Wu (2018). Several concentration bounds exist that describe the convergence speed of the empirical distribution functions, and different assumptions can be made to guarantee the validity of the condition (6). According to Theorem 1.2 in Kontorovich and Weiss (2014), condition (6) holds for geometrically ergodic Markov chains. Regrettably, this result was originally presented solely for \(\mathbb {N}\)-valued random variables. The Dvoretzky–Kiefer–Wolfowitz inequality provides a similar outcome, but we did not come across a specific adaptation of this inequality for autoregressive time series.

Lemma 3 requires a finite second moment of \(\varepsilon _i\), a technical condition that follows from the findings in Chen and Wu (2018). In cases of heavy-tailed time series characterized by a tail index \(\theta\), this condition necessitates \(\theta >2\). Notably, the coefficient \(\beta\) in Lemma 3 encapsulates the strength of dependence in the time series. Larger values of \(\beta\) correspond to faster decay of dependence, thereby allowing for milder assumptions on the intermediate sequence \((k_n)\). Throughout the paper, we opt for the choice \(k_n=\sqrt{n}\), a decision briefly explored in Section 4.3.

4.2 Some insight using simulations

We will simulate the behavior of the estimates \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}\) for a series of models. Initially, we employ the Monte Carlo principle to estimate the distributions of \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}\) and \(\hat{\Gamma }_{\textbf{Y}\rightarrow \textbf{X}}^{time}\) for the following model.

Model 1

Let \((\textbf{X},\textbf{Y})^\top\) follow the \(\textrm{VAR}(2)\) model

$$\begin{aligned} X_t&=0.5X_{t-1}+ \varepsilon _t^X; \qquad Y_t=0.5Y_{t-1}+ \delta X_{t-2}+\varepsilon _t^Y, \end{aligned}$$

where \(\varepsilon _t^X, \varepsilon _t^Y\) are independent noise variables and \(\delta \in \mathbb {R}\).

Figure 2 presents the histograms of \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}(2)\) and \(\hat{\Gamma }_{\textbf{Y}\rightarrow \textbf{X}}^{time}(2)\) from 1000 simulated instances of time series, each with a length of \(n=5000\), following Model 1. The parameter \(\delta\) is set to 0.5, and \(\varepsilon _t^X\) and \(\varepsilon _t^Y\) are drawn from the Cauchy distribution. Figure 2 illustrates that within the causal direction, the values of \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}\) predominantly fall within the range of (0.9, 1). Conversely, in the non-causal direction, the values of \(\hat{\Gamma }_{\textbf{Y}\rightarrow \textbf{X}}^{time}\) typically lie below 0.9. This underscores the asymmetry between the cause and the effect.

Fig. 2

The histograms provide an approximate visualization of the distributions of \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}(2)\) (in blue) and \(\hat{\Gamma }_{\textbf{Y}\rightarrow \textbf{X}}^{time}(2)\) (in red) based on Model 1 with a correct causal direction \(\textbf{X}\rightarrow \textbf{Y}\) and a sample size \(n=5000\). The threshold \(k_n\) is chosen as \(k_n=70=\lfloor {\sqrt{5000}}\rfloor\)

We proceed by simulating \((\textbf{X},\textbf{Y})^\top\) according to Model 1. We explore three different values for the parameter \(\delta\) (0.1, 0.5, and 0.9) and three distinct sample sizes (\(n = 100\), 1000, and 10000). The random variables \(\varepsilon _t^X\) and \(\varepsilon _t^Y\) are generated from either the standard normal distribution (thus violating the RV assumption) or the standard Pareto distribution. For each value of \(\delta\), noise distribution, and sample size, we calculate the estimators \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}:= \hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(2)\). This procedure is repeated 200 times, and we compute the means and quantiles of these estimators. The summarized outcomes are presented in Table 2, where each cell corresponds to a specific model defined by \(\delta\), noise distribution, and sample size.
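A scaled-down version of this experiment (fewer repetitions and a single cell of Table 2) can be sketched in Python. The paper's own simulations use the R code referenced at the start of Section 4, so the following is merely illustrative; the estimator is repeated here to keep the snippet self-contained.

```python
import numpy as np

def gamma_hat(x, y, p, k):
    # Estimator of Gamma^time_{X->Y}(p) from Definition 7.
    n = len(x)
    F_y = np.searchsorted(np.sort(y), y, side="right") / n
    tau = np.sort(x)[n - k]
    return sum(F_y[i:i + p + 1].max() for i in range(n - p) if x[i] >= tau) / k

def simulate_model1(n, delta, rng, pareto=True):
    # Model 1 with standard Pareto noise (pareto=True) or standard normal
    # noise (pareto=False, violating the regular-variation assumption).
    if pareto:
        ex, ey = 1.0 / rng.uniform(size=n), 1.0 / rng.uniform(size=n)
    else:
        ex, ey = rng.standard_normal(n), rng.standard_normal(n)
    x, y = np.zeros(n), np.zeros(n)
    for t in range(2, n):
        x[t] = 0.5 * x[t - 1] + ex[t]
        y[t] = 0.5 * y[t - 1] + delta * x[t - 2] + ey[t]
    return x, y

rng = np.random.default_rng(0)
reps, n, delta = 50, 1000, 0.5          # scaled down from 200 repetitions
k = int(np.sqrt(n))
est_xy, est_yx = [], []
for _ in range(reps):
    x, y = simulate_model1(n, delta, rng)
    est_xy.append(gamma_hat(x, y, 2, k))
    est_yx.append(gamma_hat(y, x, 2, k))
mean_xy, q_xy = np.mean(est_xy), np.quantile(est_xy, 0.95)
mean_yx = np.mean(est_yx)
```

Comparing `mean_xy` with `mean_yx` illustrates the asymmetry reported in Table 2; `q_xy` mirrors the 95% empirical quantile used there.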

Table 2 The results obtained from 200 simulated time series following Model 1 where \(\textbf{X}\) causes \(\textbf{Y}\). Each cell in the table corresponds to a distinct coefficient \(\delta\), a specific number of data-points n, and a particular noise distribution. The value \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}=\cdot \pm \cdot\) represents the mean of all 200 estimated coefficients \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}\), along with the difference between the \(95\%\) empirical quantile and the mean, based on all 200 simulations

Conclusion

The method’s performance is unexpectedly robust even when the assumption of regular variation is violated. Although we lack theoretical justification for this observation, particularly with Gaussian noise where the true theoretical value is \(\Gamma _{\textbf{X}\rightarrow \textbf{Y}}^{time}(p) = \Gamma _{\textbf{Y}\rightarrow \textbf{X}}^{time}(p) = 1\), empirical results suggest that \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}(p)\) converges faster than \(\hat{\Gamma }_{\textbf{Y}\rightarrow \textbf{X}}^{time}(p)\). This sub-asymptotic behavior reveals an inherent asymmetry. Further investigation is required to determine whether this observation holds as a general result.

Conversely, when the tails of the noise variables differ between the cause and the effect, our method does not perform well. If \(\varepsilon _t^X\) possesses heavier tails than \(\varepsilon _t^Y\), for large n, both estimates tend to converge close to 1. Alternatively, if \(\varepsilon _t^X\) has lighter tails than \(\varepsilon _t^Y\), both \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}\) and \(\hat{\Gamma }_{\textbf{Y}\rightarrow \textbf{X}}^{time}\) tend to be very small, significantly deviating from 1. Consequently, different tail behaviors of the noise variables appear to be the primary challenge. This issue is explored in greater detail in Section 4.5.

4.3 Choice of a threshold

A common challenge in extreme value theory is the selection of an appropriate threshold. In our case, this corresponds to the choice of the parameter k in Definition 7. There exists a trade-off between bias and variance; smaller values of k lead to reduced bias (but increased variance), and larger values of k result in higher bias (but lower variance). Unfortunately, there is no universally applicable method for threshold selection.

To provide insight into this behavior, consider Model 1 with \(n=1000\) and \(\delta = 0.5\). Figure 3 illustrates the estimators \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(2)\) and \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}(2)\) for various values of k. The variance of \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}(2)\) is very large for small k; conversely, the bias of \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(2)\) increases as k grows.

Based on this example and several others, the choice \(k=\sqrt{n}\) appears to be a reasonable option. It is crucial to emphasize that while this choice may be pragmatic, it might not necessarily be optimal.

Visual inspection of Fig. 3 suggests that \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(2)\) approaches one as k decreases, while the same trend does not hold for \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}(2)\).

Fig. 3

The figure illustrates the behavior of the estimators \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(2)\) (in blue) and \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}(2)\) (in red) under varying selections of the parameter k. The time series are generated based on Model 1, utilizing \(n=1000\) and \(\delta =0.5\). The thick line denotes the mean value across 100 realizations, while the thin lines correspond to the \(5\%\) and \(95\%\) empirical pointwise quantiles

4.4 Choice of the extremal delay

To provide an example of how \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) behaves for various selections of p, we consider the following model.

Model 2

Let \((\textbf{X},\textbf{Y})^\top\) follow the \(\textrm{hVAR}(6,1)\) model

$$\begin{aligned} X_t&=0.5X_{t-1}+ \varepsilon _t^X; \qquad Y_t=0.5Y_{t-1}+ 0.5 X_{t-6}+\varepsilon _t^Y, \end{aligned}$$

where \(\varepsilon _t^X, \varepsilon _t^Y\overset{\text {iid}}{\sim }\textrm{Cauchy}\).

Notice that the minimal delay is equal to 6. Similar to the approach in Section 4.3, we simulate data from Model 2 with a size of \(n=1000\). Subsequently, we compute \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) and \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}(p)\) for different values of p. This process is repeated 100 times. Figure 4 shows the mean, \(5\%\) and \(95\%\) empirical quantiles of these estimates.
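A single run of this experiment, together with a crude minimal-delay estimate, can be sketched as follows. This is our illustrative Python transcription (the estimator is repeated to keep the snippet self-contained), and the threshold 0.9 anticipates the discussion in Section 4.6.

```python
import numpy as np

def gamma_hat(x, y, p, k):
    # Estimator of Gamma^time_{X->Y}(p) from Definition 7.
    n = len(x)
    F_y = np.searchsorted(np.sort(y), y, side="right") / n
    tau = np.sort(x)[n - k]
    return sum(F_y[i:i + p + 1].max() for i in range(n - p) if x[i] >= tau) / k

# Simulate Model 2: minimal delay 6, iid standard Cauchy noise.
rng = np.random.default_rng(0)
n = 1000
eps_x, eps_y = rng.standard_cauchy(n), rng.standard_cauchy(n)
x, y = np.zeros(n), np.zeros(n)
for t in range(6, n):
    x[t] = 0.5 * x[t - 1] + eps_x[t]
    y[t] = 0.5 * y[t - 1] + 0.5 * x[t - 6] + eps_y[t]

k = int(np.sqrt(n))
gammas = [gamma_hat(x, y, p, k) for p in range(13)]  # p = 0, ..., 12
# Crude estimate of the minimal delay: smallest p whose coefficient
# crosses a high threshold (here 0.9).
s_hat = next((p for p, g in enumerate(gammas) if g >= 0.9), None)
```

Plotting `gammas` against p reproduces the qualitative shape of the blue curve in Fig. 4: a sharp rise up to the true minimal delay, then values staying near 1.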

Fig. 4

The figure visualizes the behavior of the estimators \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) (in blue) and \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}(p)\) (in red) for various selections of the extremal delay p. This is done using the \(\textrm{VAR}(6)\) model (Model 2) with \(n=1000\). The thick line represents the mean value across 100 realizations, while the thin lines correspond to the \(5\%\) and \(95\%\) empirical quantiles of these estimates

The coefficient \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) in the correct causal direction rises much faster than in the other direction until it reaches the “correct” minimal delay. Following this, the coefficient approaches 1, remaining in close proximity even for larger p, aligning with theoretical expectations. Conversely, \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}(p)\) experiences a slower ascent, gradually converging towards 1.

Nevertheless, practical scenarios often demand the selection of a specific p. In cases involving a small number of time series, like the application in Section 5.1, it is feasible to calculate \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) for several choices of p and create visualizations, as exemplified in Fig. 8 further ahead. In instances involving a greater number of time series, such as the application in Section 5.2, a pragmatic choice of p should align with the maximum expected physical time delay within the system. The challenge of selecting a suitable time delay is not exclusive to the extreme value context and is encountered in non-extreme scenarios as well (Hacker and Hatemi-J 2008; Runge et al. 2019b).

4.5 Hidden confounder and different tail behavior of the noise

The examples above have exclusively addressed scenarios where \(\textbf{X}\) causes \(\textbf{Y}\). However, it is important to investigate how the methodology fares when the correlation stems from a shared confounder, and how it responds when the tail indexes of the noise variables differ. To address these questions, we analyze the following model.

Model 3

Let \((\textbf{X}, \, \textbf{Y}, \, \textbf{Z})^\top\) follow the \(\textrm{VAR}(3)\) model

$$\begin{aligned} Z_t&=0.5Z_{t-1}+\varepsilon _t^Z,\\ X_t&=0.5X_{t-1}+ 0.5Z_{t-2}+ \delta _YY_{t-3}+\varepsilon _t^X,\\ Y_t&=0.5Y_{t-1}+ 0.5Z_{t-1}+ \delta _XX_{t-3}+\varepsilon _t^Y, \end{aligned}$$

with independent noise variables \(\varepsilon _t^X\sim \textrm{t}_{\theta _{X}}, \varepsilon _t^Y\sim \textrm{t}_{\theta _{Y}}, \varepsilon _t^Z\sim \textrm{t}_{\theta _{Z}}\) (notation \(\textrm{t}_{\theta }\) corresponds to Student’s t-distribution with \(\theta\) degrees of freedom) and constants \(\delta _X, \delta _Y\in \mathbb {R}\).

The process \(\textbf{Z}\) represents an unobserved common cause, illustrating various scenarios: non-causal (\(\delta _X=\delta _Y=0\)), \(\textbf{X}\) causing \(\textbf{Y}\) (\(\delta _X\ne 0\)), \(\textbf{Y}\) causing \(\textbf{X}\) (\(\delta _Y\ne 0\)), and bidirectional causality (\(\delta _X,\delta _Y\ne 0\), notation \(\textbf{X}\leftrightarrow \textbf{Y}\)). The parameters \(\theta _X\), \(\theta _Y\), and \(\theta _Z\) characterize the tail behavior of the corresponding time series. If \(\theta _X\) is large, the distribution of \(\varepsilon _t^X\) is close to Gaussian; the case \(\theta _X < \theta _Y\) indicates that \(\varepsilon _t^X\) exhibits heavier tails than \(\varepsilon _t^Y\).

Simulating time series according to Model 3 with \(n=1000\), \(\delta _X, \delta _Y\in \{0,0.5, 1\}\), and various tail index combinations, we compute \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}:= \hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(3)\). This procedure is repeated 500 times to compute means and empirical quantiles of the estimators. The difference \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}} - \hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}\) is also calculated alongside \(95\%\) empirical quantiles. The outcomes are displayed in Table 3.

The results indicate that whenever either \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}\) or \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}\) is smaller than one, we can detect the correct causal direction. When \(\textbf{X}\) does not cause \(\textbf{Y}\) and vice versa, \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}\) and \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}\) are very similar and smaller than 1. When \(\textbf{X}\) causes \(\textbf{Y}\), \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}\) is generally much larger than \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}\). When \(\textbf{X}\leftrightarrow \textbf{Y}\), both \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}\) and \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}\) closely approach 1. However, if \(\varepsilon _t^Z\) exhibits heavier tails than \(\varepsilon _t^X\) and \(\varepsilon _t^Y\), both \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}\) and \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}\) are in close proximity to 1, regardless of the true causal relationship between \(\textbf{X}\) and \(\textbf{Y}\). Consequently, if \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}\) and \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}\) are close to 1, we cannot distinguish whether this is due to \(\textbf{X}\leftrightarrow \textbf{Y}\) or due to the presence of a hidden confounder with heavier tails.

An interesting scenario arises when \(\theta _X\ne \theta _Y\). Theory suggests that the method should work well as long as \(\theta _X\ge \theta _Y\) (heavier tails of \(\varepsilon _t^Y\)) and should no longer work if \(\theta _X<\theta _Y\) (a supporting argument for this claim can be found in Appendix B, Claim 1). However, it is important to note that these results pertain to a population (limiting) context. For small sample sizes, the situation where \(\theta _X<\theta _Y\) appears to “work well”, as evident from the bold values in Tables 3 and 4. There is an evident asymmetry between the cause and the effect, but it vanishes for large n. Both cases (\(\theta _X<\theta _Y\) and \(\theta _X>\theta _Y\)) can create problems for the estimation, and further investigation is needed to better understand the method’s behavior under different tail indexes.

To summarize, the most challenging situation occurs when a heavier-tailed noise variable acts as a common cause. This situation necessitates careful consideration, particularly when both \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}},\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}\approx 1\). This ambiguity can be attributed to either true bidirectional causality or the influence of a hidden confounder.

Table 3 The estimation of \(\hat{\Gamma }^{time}(3)\) is performed on time series generated from Model 3. The tail indexes \(\theta _X\), \(\theta _Y\), and \(\theta _Z\) correspond to the tail index of the respective time series, where a large \(\theta _X\) signifies lighter tails and a small \(\theta _X\) denotes heavy tails in \(\varepsilon _X\). Each value of \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}=\cdot \pm \cdot\) represents the mean of the estimated coefficients \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}\), along with the difference between this mean and the \(95\%\) empirical quantile out of all 500 simulations

4.6 Testing

One approach to a formal methodology for detecting the causal direction between two time series involves a threshold: we say that \(\textbf{X}\) causes \(\textbf{Y}\) if and only if \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\ge \tau\), where \(\tau =0.9\) or 0.95. Our goal is to evaluate the hypothesis \(H_0: \Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p) < 1\) against the alternative \(H_A: \Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p) = 1\). This requires an understanding of the distribution of \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\), whose computation lies beyond the scope of our current study. We considered estimating confidence intervals using the block bootstrap technique (Section 5.4 in Hesterberg (2014) and the accompanying code in the supplementary package), but opted not to incorporate this methodology in this paper, as its performance in our specific scenario proved suboptimal. Future research may reveal a better-working methodology. Additionally, visual aids such as Figs. 3 and 4 can serve as valuable tools for enhancing understanding. In what follows, we infer \(\textbf{X}\rightarrow \textbf{Y}\) if and only if the value of \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) meets or exceeds the designated threshold \(\tau\).

In the following, we provide insight into selecting an appropriate threshold value \(\tau\). This choice should depend on the sample size n and the extremal delay p. Opting for a small \(\tau\) risks erroneous conclusions, while selecting a large \(\tau\) curtails statistical power. Analogous methodologies are commonly employed within extreme value theory, for instance when distinguishing between extremal dependence and independence (Haan and Zhou 2011). To develop an intuitive understanding, we focus on Model 1 with \(\delta =0.5\) and \(\varepsilon _t^X, \varepsilon _t^Y\overset{\text {iid}}{\sim }\textrm{Pareto}(1,1)\).

We track two objectives as functions of \(\tau\): the percentage of runs in which we correctly conclude that \(\textbf{X}\rightarrow \textbf{Y}\), and the percentage of runs in which we correctly conclude that \(\textbf{Y}\not \rightarrow \textbf{X}\) (that is, that \(\textbf{Y}\) does not cause \(\textbf{X}\)).

Figures 5 and 6 illustrate the sensitivity of the results to the choice of the threshold \(\tau =\tau (n,p)\) across a range of (np) values. It is reasonable to expect that a larger value of p should correspond to a larger value of \(\tau\). To ensure that \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)< \tau\) occurs in less than \(5\%\) of cases, simulations suggest values of \(\tau (500, 6)=0.90\) and \(\tau (500, 11)=0.93\). For larger values of \(p>11\), \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\) and \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}(p)\) exhibit substantial similarity.

It appears that as n increases, the accuracy of \(\hat{\Gamma }^{time}_{\textbf{Y}\rightarrow \textbf{X}}(p)\) improves. The \(5\%\) significance level seems consistent across all n values, with the main enhancement lying in the power to detect causality.

In conclusion, the appropriate selection of \(\tau\) exhibits minimal variation with changes in n and p. In all cases, opting for \(\tau \approx 0.9\) appears reasonable, unless p becomes exceedingly large.

Fig. 5

The figures depict the performance of our methodology for various \(\tau (500,p)\) selections in Model 3. In the left figure, we illustrate the percentage of instances where we accurately infer the relation \(\textbf{X}\rightarrow \textbf{Y}\). The right figure represents the percentage of cases in which we correctly identify the absence of the relation \(\textbf{Y}\not \rightarrow \textbf{X}\). If \(\tau\) is large, we rarely find the correct relation \(\textbf{X}\rightarrow \textbf{Y}\), but we almost always correctly find \(\textbf{Y}\not \rightarrow \textbf{X}\). If \(\tau\) is small, we almost always find the correct relation \(\textbf{X}\rightarrow \textbf{Y}\), but we almost never correctly find \(\textbf{Y}\not \rightarrow \textbf{X}\). A red horizontal line marks the \(95\%\) success rate. Note that the scenario where \(p=1\) signifies an erroneous selection of the extremal delay, specifically a situation where p is smaller than the minimum delay

Fig. 6

Choice of the threshold as a function of n and fixed \(p=6\). The left figure represents the percentage of cases when we correctly inferred \(\textbf{X}\rightarrow \textbf{Y}\). The right figure represents the percentage of cases when we correctly inferred \(\textbf{Y}\not \rightarrow \textbf{X}\). A red horizontal line marks the \(95\%\) success rate

Table 4 A comparison of four methods for the causal inference on two time series models, a simple \(\textrm{VAR}(2)\) model and a more complex nonlinear model with a hidden common cause. The percentage obtained after 100 repetitions indicates how many times the estimation was correct

4.7 Comparison with state-of-the-art methods

We conduct a comparative analysis of our method with three classical techniques found in the literature. The initial approach is the classical Granger test utilizing default parameters and a significance level of \(\alpha =0.05\). The second and third approaches encompass the PCMCI and LPCMCI methods, both employing default parameters and featuring a robust partial correlation independence test. For comprehensive insights into their implementations, please refer to the supplementary materials. Our method corresponds to estimating \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(3)\) and concluding that \(\textbf{X}\) causes \(\textbf{Y}\) if and only if \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(3)\ge \tau\) with the choice \(\tau =0.9\).

We examine two distinct models. The first is the VAR Model 1, characterized by \(\delta =0.5\) and noise following a \(\textrm{Pareto}(1,1)\) distribution. Notably, this noise has an infinite expected value, which can pose challenges for classical methodologies. The second is a more complex NAAR model with a hidden confounder, defined below.

Model 4

Let \((\textbf{X}, \, \textbf{Y}, \, \textbf{Z})^\top\) follow the \(\textrm{hNAAR}(3,1)\) model

$$\begin{aligned} Z_t&=0.5Z_{t-1}+\varepsilon _t^Z,\\ X_t&=0.5X_{t-1}+ 0.5Z_{t-2} +\varepsilon _t^X,\\ Y_t&=0.5Y_{t-1}+ 0.5Z_{t-1}+ f_X(X_{t-3})+\varepsilon _t^Y, \end{aligned}$$

with independent noise variables \(\varepsilon _t^X, \varepsilon _t^Y, 2\varepsilon _t^Z\sim \textrm{Pareto}(1,1)\) and \(f_X(x) = x^{\frac{3}{4}}\mathbb {1}(x>50)\).

The function \(f_X\) in Model 4 represents a case when the causality presents itself only in the tail, not in the body of the distribution.
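Model 4 is straightforward to simulate. The sketch below implements the recursions verbatim; the burn-in period is our own addition, used to wash out the zero initial conditions.

```python
import numpy as np

def simulate_model4(n, burn=200, rng=None):
    """Simulate Model 4: Z is a hidden common cause of X and Y, and X causes
    Y only through the tail function f_X (active only when X exceeds 50)."""
    rng = np.random.default_rng(rng)
    m = n + burn
    # Classical Pareto(1,1) noise: numpy's pareto() samples a Lomax, so shift by 1.
    eX = rng.pareto(1.0, m) + 1.0
    eY = rng.pareto(1.0, m) + 1.0
    eZ = (rng.pareto(1.0, m) + 1.0) / 2.0      # so that 2*eps_Z ~ Pareto(1,1)
    f_X = lambda x: x ** 0.75 * (x > 50)       # causality present only in the tail
    Z, X, Y = np.zeros(m), np.zeros(m), np.zeros(m)
    for t in range(3, m):
        Z[t] = 0.5 * Z[t - 1] + eZ[t]
        X[t] = 0.5 * X[t - 1] + 0.5 * Z[t - 2] + eX[t]
        Y[t] = 0.5 * Y[t - 1] + 0.5 * Z[t - 1] + f_X(X[t - 3]) + eY[t]
    return X[burn:], Y[burn:], Z[burn:]
```

Because \(f_X\) vanishes below 50, any method that looks only at the bulk of the distribution sees \(\textbf{X}\) and \(\textbf{Y}\) linked solely through the confounder \(\textbf{Z}\).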

We conduct simulations on the aforementioned time series, considering two data set sizes: \(n=500\) and \(n=5000\). Subsequently, employing the four methodologies, we compute the count of simulations (out of 100) where the causal relationships \(\textbf{X}\rightarrow \textbf{Y}\) and \(\textbf{Y}\not \rightarrow \textbf{X}\) are accurately inferred. The outcome percentages are presented in Table 4.

The results suggest that state-of-the-art methods behave well within the context of the simple model. Although these methods inherently assume finite expected values, it appears that the \(\textrm{Pareto}(1,1)\) noise distribution does not introduce significant challenges. Our method demonstrates satisfactory performance but exhibits slightly reduced power when compared with other approaches.

On the other hand, state-of-the-art methods encounter difficulties when confronted with the more complex model. PCMCI and LPCMCI frequently miscalculate the directionality, erroneously estimating that \(\textbf{Y}\) causes \(\textbf{X}\). This misjudgment arises from the presence of a hidden common cause \(\textbf{Z}\), which creates a spurious effect that even LPCMCI cannot handle well. In contrast, our method successfully discerns genuine causality from causal connections attributed to a shared cause. To conclude, the outcomes presented in Table 4 imply that our method is particularly well-suited for large sample sizes, where it outperforms state-of-the-art techniques.

5 Applications

We apply our methodology to two real datasets. The first one pertains to geomagnetic storms in the field of space weather science, while the second one is related to earth science and explores causal connections between hydro-meteorological phenomena. All data and a detailed R code are available in the supplementary package.

5.1 Space weather

In the subsequent sections, we address a problem within the realm of space weather studies. The term “space weather” refers to the variable conditions on the Sun and in space that can influence the performance of technology we use on Earth. Extreme space weather could potentially cause damage to critical infrastructures – especially the electric grid. In order to protect people and systems that might be at risk from space weather effects, we need to understand the causes of space weather.Footnote 6

Fig. 7

Space weather. The first plot represents AE (substorm index), the second one BZ (vertical component of an interplanetary magnetic field) and the last one SYM (magnetic storm index). Data were measured in 5-minute intervals for the year 2000 by NASA. The unit of measurement is nanotesla (nT)

Geomagnetic storms and substorms serve as indicators of geomagnetic activity. A substorm manifests as a sudden intensification and heightened motion of auroral arcs, potentially leading to magnetic field disturbances in the auroral zones of up to 1000 nT (nanotesla). The origin of this geomagnetic activity lies within the Sun itself. More precisely, a substantial correlation exists between this geomagnetic activity, the solar wind (a flow of charged particles emitted by the Sun), and the interplanetary magnetic field (which constitutes a portion of the solar magnetic field carried outward by the solar wind).

A fundamental challenge in this field revolves around the determination and prediction of specific characteristics: the magnetic storm index (SYM) and the substorm index (AE). It may seem that AE is a driving factor (cause) of SYM because the accumulation of successive substorms usually precedes the occurrence of magnetic storms. However, a recent article (Manshour et al. 2021) suggests otherwise. It proposes that a vertical component of the interplanetary magnetic field (BZ) serves as a shared trigger for both indices. Our approach involves applying our methodology to validate this finding and ascertain whether this causal influence becomes evident in extreme scenarios.

Our dataset comprises three distinct time series (SYM, AE, BZ), encompassing approximately \(100\,000\) measurements, each captured at 5-minute intervals over the entire year 2000. These data, in addition to being available in the supplementary package, can also be accessed online through the NASA website.Footnote 7 A visual representation of the data is presented in Fig. 7. From the nature of the data, we focus on analyzing extremes characterized by extremely small SYM, extremely large AE, and extremely small BZ (that is, we consider −SYM, \(+\)AE, −BZ and compare maxima). Given the inherent characteristics of the data, an appropriate delay will be smaller than \(p=24\) (2 hours).

First, we discuss whether the assumptions for our method are fulfilled. We estimate the tail indexes of our data.Footnote 8 The results are as follows: SYM has the estimated tail index \(0.25 \,\,(0.015, 0.5)\), AE has \(0.18 \,\,(0.08, 0.28)\) and BZ has \(0.30 \,\,(0.12, 0.46)\). Therefore, the assumption of regular variation with a common tail index appears reasonable. Furthermore, none of the confidence intervals contain zero, although SYM comes quite close; this suggests that our time series can be regarded as regularly varying. We also require stationarity of our time series, which seems to hold: the variables are believed to contain neither a seasonal pattern nor a trend (at least not on a horizon of a few years), and the Dickey–Fuller test also suggests stationarity, although tests for stationarity are known to be unreliable (Dickey 2005). In summary, all the assumptions seem to be reasonably satisfied.
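Footnote 8 details the estimation procedure used for these tail indexes; as a hedged stand-in, the classical Hill estimator of the extreme value index can be sketched as follows (the number of order statistics k is a tuning parameter, and the choice below is illustrative only).

```python
import numpy as np

def hill_estimator(x, k):
    """Hill estimator of the extreme value index, based on the k largest
    order statistics of the (positive) sample x."""
    x = np.sort(np.asarray(x, dtype=float))
    return np.mean(np.log(x[-k:])) - np.log(x[-k - 1])

# Sanity check on simulated Pareto(1,1) data, whose extreme value index is 1.
rng = np.random.default_rng(1)
sample = rng.pareto(1.0, 100_000) + 1.0
xi = hill_estimator(sample, k=1000)
```

Confidence intervals such as those quoted above can then be obtained, for instance, from the asymptotic normality of the Hill estimator.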

Lastly, we proceed to calculate the causal tail coefficient across varying extreme delays p and a range of considered extremes k. The numerical outcomes are depicted in Fig. 8. These findings impart the following insights:

  • BZ and SYM have strong asymmetry (\(\hat{\Gamma }_{BZ\rightarrow SYM}^{time}(p)\approx 1\), and \(\hat{\Gamma }_{SYM\rightarrow BZ}^{time}(p)\ll 1\), see blue lines),

  • BZ and AE have an asymmetry (\(\hat{\Gamma }_{BZ\rightarrow AE}^{time}(p)\approx 1\), and \(\hat{\Gamma }_{AE\rightarrow BZ}^{time}(p)<1\), see green lines),

  • SYM and AE have no asymmetry (\(\hat{\Gamma }_{AE\rightarrow SYM}^{time}(p),\hat{\Gamma }_{SYM\rightarrow AE}^{time}(p)<1\) and both \(\hat{\Gamma }_{SYM\rightarrow AE}^{time}(p)\), \(\hat{\Gamma }_{AE\rightarrow SYM}^{time}(p)\) are very similar, see brown lines).

Fig. 8

The first figure represents all values of \(\hat{\Gamma }^{time}_{\cdot \rightarrow \cdot }(p)\) across a range of extreme delays \(p\in [1, 24]\) with k set to \(k=\sqrt{n}\), for all possible pairs of: SYM (magnetic storm index), AE (substorm index) and BZ (interplanetary magnetic field). An evident asymmetry in the causal influence between the time series BZ-SYM and BZ-AE is visible from the plot. The second figure represents all values of \(\hat{\Gamma }^{time}_{\cdot \rightarrow \cdot }(1)\) for different numbers of extremes k. The red vertical line corresponds to \(k=\sqrt{n}\)

These results align with the hypothesis that BZ is a common cause of SYM and AE, with no causal relation between them. Note that our methodology has the capability to accommodate scenarios involving a common cause, at least in theory. While conventional methods might suggest a causal relationship from AE to SYM, our analysis reveals instances of extreme events where AE exhibits extremity, yet SYM does not.

5.2 Hydrometeorology

Numerous significant phenomena have an impact on weather and climate, even though their importance and effects are not yet well understood. Within this application, we focus on a selection of the most intriguing phenomena to explore potential causal relationships among them. By delving into these relationships, we aim to gain insights into the intricate interplay among these variables, contributing to a deeper understanding of their influence on weather and climate dynamics.

  • El Niño Southern Oscillation (ENSO) is a climate pattern that describes the unusual warming of surface waters in the eastern tropical Pacific Ocean. It has an impact on ocean temperatures, the speed and strength of ocean currents, the health of coastal fisheries, and local weather from Australia to South America and beyond.

  • North Atlantic Oscillation (NAO) is a weather phenomenon describing fluctuations in the difference of atmospheric pressure over the North Atlantic Ocean.

  • Indian Dipole Index (DMI, sometimes referred to as the Dipole Mode Index) describes the difference between sea surface temperature anomalies in two specific regions of the tropical Indian Ocean.

  • Pacific Decadal Oscillation (PDO) is a recurring pattern of ocean–atmosphere climate variability centred over the mid-latitude Pacific basin.

  • East Asian Summer Monsoon Index (EASMI) is defined as an area-averaged seasonally dynamically normalized seasonality at 850 hPa within the East Asian monsoon domain.

  • Amount of rainfall in a region in India (AoR) is a data set recording the amount of rainfall in a region in the centre of India.

For all six variables, we possess monthly measurements starting from 1 January 1953. Following preliminary data manipulation, such as mitigating seasonality and ensuring stationarity (for comprehensive specifics, please refer to the supplementary package), we turn our attention to assessing the assumption of heavy-tailedness.
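The exact preprocessing is documented in the supplementary package; as an assumed illustration only, a minimal way to mitigate seasonality in monthly data is to subtract each calendar month's mean:

```python
import numpy as np

def deseasonalize_monthly(x, period=12):
    """Remove a monthly seasonal cycle by subtracting, from every
    observation, the mean of its calendar month."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    for m in range(period):
        out[m::period] -= x[m::period].mean()  # center each calendar month
    return out
```

After this step, every calendar month has zero mean, so annual cycles no longer masquerade as temporal dependence between the series.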

It seems that the assumption of identical tail indexes across all time series does not hold. Employing the same methodology as in the preceding subsection, we calculate the estimated tail indexes along with their corresponding confidence intervals, as presented in Table 5. It is evident that ENSO and AoR exhibit notably larger tail indexes in comparison to the other variables. On the other hand, variables NAO, DMI, PDO, and EASMI appear to share similar tail indexes (with NAO possibly having a slightly smaller tail index). Still, we will proceed keeping in mind that these relations between the first and second group may be misleading.

Finally, we proceed to compute the causal tail coefficients for each pair of time series. In this application, we conclude that \(\textbf{X}\) causes \(\textbf{Y}\) if and only if both \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}(p)>0.9\) and \(\hat{\Gamma }_{\textbf{Y}\rightarrow \textbf{X}}^{time}(p)<0.8\). This selection is motivated by the outcomes discussed in Section 4.5. Simulation results indicate that if both \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}(p), \hat{\Gamma }_{\textbf{Y}\rightarrow \textbf{X}}^{time}(p)\) are close to 1, we cannot distinguish between the case when \(\textbf{X}\leftrightarrow \textbf{Y}\) and the case when a confounder with heavier tails causes both \(\textbf{X}\) and \(\textbf{Y}\). In this application, the absence of a confounder with heavier tails cannot be assumed, given the distinct tail indexes observed in the relevant time series. Hence, if both \(\hat{\Gamma }_{\textbf{X}\rightarrow \textbf{Y}}^{time}(p), \hat{\Gamma }_{\textbf{Y}\rightarrow \textbf{X}}^{time}(p)\) are large, we cannot conclude any causal relation; we are therefore only interested in asymmetric extremal behavior. We chose the threshold 0.9 following the reasoning discussed in Section 4.6. A similar line of thought guides the choice of the threshold 0.8: a suitable threshold ought to reside within the interval (0.7, 0.9), given that most coefficients \(\hat{\Gamma }_{\textbf{Y}\rightarrow \textbf{X}}^{time}(p)\) from simulations (and the results of this application in Table 6) lie within this interval. Furthermore, the results remain the same if we choose any threshold in the interval (0.77, 0.86), since no arrows in Fig. 9 would change. Consequently, the chosen threshold logically fits within this gap in the computed coefficients.
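The two-threshold rule just described can be written as a small helper; the symmetric branch for the reverse direction is our natural extension of the rule.

```python
def extremal_cause(gamma_xy, gamma_yx, upper=0.9, lower=0.8):
    """Decision rule of this application: declare a causal direction only
    when the coefficient is high in one direction and clearly lower in the
    other, so symmetric (possibly confounded) pairs yield no conclusion."""
    if gamma_xy > upper and gamma_yx < lower:
        return "X -> Y"
    if gamma_yx > upper and gamma_xy < lower:
        return "Y -> X"
    return "no conclusion"
```

For instance, the pair DMI and NAO from Table 6, with coefficients 0.89 and 0.83, yields no conclusion under this rule.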

We opt to use an extremal delay of \(p=12\), equivalent to one year, for all pairs in our analysis.Footnote 9 The resulting causal relationships are outlined in Table 6. Subsequently, upon constructing the corresponding graph, we identify specific extremal causal relationships, as depicted in Fig. 9. Here, directed arrows indicate the pairs with an apparent asymmetry in the corresponding \(\hat{\Gamma }^{time}(p)\) coefficients (bold numbers in Table 6). The red arrows correspond to a pair of time series with significantly different tail indexes.

Concluding that the discovered causal relations are surely true would be a very bold statement. These highly complex datasets may not conform to a straightforward time series model, and their tail indexes differ. Additionally, the absence of certain causal relationships does not imply their nonexistence; for instance, a causal connection might exist without exerting significant influence on extreme events. While this methodology does not offer definitive conclusions, it does identify asymmetries in extreme behavior and provides a preliminary notion of directions that might warrant further exploration.

Table 5 Estimations of the tail indexes and their \(95\%\) confidence intervals
Table 6 Estimation of the causal tail coefficient for time series for each pair of variables. Concerning the order of subscripts, the column comes first, then the row. For example, \(\hat{\Gamma }_{DMI\rightarrow NAO}^{time}(p)=0.89\) and \(\hat{\Gamma }_{NAO\rightarrow DMI}^{time}(p)=0.83\)
Fig. 9

The estimated causal directions, determined using the causal tail coefficient for time series, involve the following variables: AoR (Amount of rainfall in India), DMI (Indian Dipole Index), PDO (Pacific Decadal Oscillation), NAO (North Atlantic Oscillation), EASMI (East Asian Summer Monsoon Index) and ENSO (El Niño Southern Oscillation). The presence of arrows denotes asymmetries in extreme events. In cases where the corresponding tail indexes significantly differ (indicating that the assumptions are not satisfied), the arrow is depicted in red

As a means of comparison with existing research in the field of earth science, we offer a selection of pertinent results and articles that tend to concur with our findings. It’s important to note that this list is not exhaustive.

  • ENSO \(\rightarrow\) EASMI (Pothapakula et al. 2020) (using information theory approach), (Chan and Zhou 2005),

  • ENSO \(\rightarrow\) AoR (Sarkar et al. 2004) (although rigorous causal methodology was not used),

  • ENSO \(\rightarrow\) NAO (Mokhov and Smirnov 2006) (using a phase dynamics modeling which can give different results than looking at extremes),

  • DMI \(\rightarrow\) ENSO (Le et al. 2020),

  • DMI \(\rightarrow\) EASMI (Pothapakula et al. 2020) (using information theory approach),

  • PDO \(\rightarrow\) AoR (Sarkar et al. 2004) (although rigorous causal methodology was not used),

  • PDO \(\rightarrow\) EASMI (Chan and Zhou 2005) (although rigorous causal methodology was not used),

  • NAO is known to influence air temperature and precipitation in Europe, Northern America and a part of Asia (Wang et al. 2019). However, whether there are relations between NAO and the other investigated variables is not yet known.

Our findings align with the majority of results found in the literature. However, there are three exceptions worth noting: our results suggest DMI \(\rightarrow\) PDO (we could not find any support for this causal relation in the literature), and our results do not suggest DMI \(\rightarrow\) ENSO and ENSO \(\rightarrow\) NAO, although these causal relations seem to hold (Mokhov and Smirnov 2006; Le et al. 2020).

6 Conclusion

Causal inference in time series presents a recurring challenge within the literature. In numerous real-world scenarios, the causal mechanisms become apparent during extreme periods rather than within the bulk of the distribution. We introduce an innovative method that leverages extreme events for detecting causal relationships. This method is formalized within a mathematical framework, and several of its properties are systematically demonstrated under specific model assumptions.

This work sheds light on some connections between causality and extremes. Many scientific disciplines use causal inference as a baseline of their work. A method that can detect a causal direction in complex heavy-tailed datasets can be very useful in such domains.

This subject offers an extensive array of prospects for future research. For instance, what is the underlying distribution of \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\)? Could we devise a better (and consistent) statistic that mitigates the negative bias inherent in \(\hat{\Gamma }^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p)\)? Is there a means to establish a rigorous testing methodology for the hypothesis \(\Gamma ^{time}_{\textbf{X}\rightarrow \textbf{Y}}(p) = 1\)? Furthermore, can we incorporate observed covariates into the causal inference between \(\textbf{X}\) and \(\textbf{Y}\), moving beyond reliance solely on pairwise discovery? Does this method retain its efficacy even when applied to light-tailed time series? Can a similar approach be adapted to a more generalized framework? These questions open promising avenues for future research.