Exploring non-linear relationships among redundant variables through non-parametric principal component analysis: An empirical analysis with land-use data

Regional Statistics; Filippo Gambella

Principal Component Analysis (PCA) is a widely applied statistical technique that summarises a multidimensional set of (partly redundant) input variables into a restricted number of independent components that are linear combinations of the inputs. PCA transforms the original data matrix by performing a spectral decomposition of the related variance/covariance (or correlation) matrix. When a correlation matrix is decomposed, it is traditionally built from Pearson product-moment correlation coefficients. The statistical properties of Pearson correlation coefficients (being insensitive to non-linear, high-order correlations) represent an intrinsic limitation of PCA, restricting its applicability to linear relationships among inputs. However, working with variables displaying (more or less intense) deviations from linearity is common in both socioeconomic research and environmental studies. Following the theoretical assumptions of earlier studies, a generalisation of PCA aimed at exploring non-linear multivariate relationships among inputs is illustrated in the present article by using non-parametric Spearman and Kendall coefficients to replace linear Pearson coefficients in the correlation matrix. The per cent shares of 19 land-use classes in the total landscape of a given study area (the Athens metropolitan region, Greece), obtained from a high-resolution map at the local scale, were used as inputs. The results of the standard PCA (via decomposition of a Pearson linear correlation matrix) and a generalised approach (via decomposition of a non-parametric correlation matrix based on Spearman or Kendall rank coefficients) were compared using traditional diagnostics. The PCA performed by decomposing a Spearman correlation matrix extracted the highest cumulated variance across the principal components, giving refined loadings and scores that allow recognition of latent land-use patterns. Contributing to a recent debate on the use of multidimensional techniques in regional studies, non-parametric approaches are promising tools for the analysis of large datasets displaying complex, largely non-linear relationships among inputs.

Gianluca Egidi, Tuscia University, Italy. E-mail: egidi.gianluca@unitus.it
Magda Edwards, Global Change Research Institute (CzechGlobe), Czech Republic. E-mail: edwards.m@czechglobe.cz
Sirio Cividino, University of Udine, Italy. E-mail: sirio.cividino@uniud.it
Filippo Gambella, University of Sassari, Italy. E-mail: gambella@uniss.it
Luca Salvati, University of Macerata, Italy. E-mail: luca.salvati@unimc.it

Keywords: multidimensional techniques, Spearman non-parametric coefficients, Principal Component Analysis (PCA), large data sets, indicators, regional science

Regional Statistics, Vol. 11. No. 1. 2021 Online first Egidi–Edwards–Cividino–Gambella–Salvati 1–17; DOI: 10.15196/RS110105

Introduction

By integrating data collected at various spatial resolutions from different information sources, the multidimensional analysis of large datasets is a pivotal issue in regional studies, which requires empirical refinement (Jolliffe 2002, Bansal et al. 2010, Chelli et al. 2009, Ciommi et al. 2019). Multivariate statistical techniques considering the combination of time, space, and/or other relevant dimensions of analysis may adequately explore latent patterns in the input variables, illustrating, for example, spatio-temporal trends characterising factors that affect the overall matrix variability more or less intensely (Johnson–Wichern 2002, Mérigot et al. 2007, Han–Liu 2012, Ciommi et al. 2018).
With the advent of big data and the continuous implementation of data mining techniques, exploratory statistical frameworks have been increasingly oriented towards the analysis of complex datasets (e.g. Chelli et al. 2016, Ciommi et al. 2017, Salvati et al. 2017). In this context, the treatment of variables deviating from normality and of non-linear relationships among inputs is a key (theoretical and practical) challenge in regional science (Cano-Orellana–Delgado-Cabeza 2015, Lamonica–Chelli 2018, Salvati et al. 2019). New estimators of non-linear relationships (e.g. through new approaches decomposing correlation matrices) have been proposed, sometimes in the context of robust regression, for example, reducing model outcomes’ sensitivity to outliers (Rousseeuw–Leroy 2005).
Closely related to factor analysis (Salvati–Zitti 2009), principal component analysis (PCA) is likely the most widely used multivariate technique in the quantitative analysis of measurement variables. In this regard, PCA is an effective tool for exploring complex datasets, producing interpretative and predictive models in vastly different disciplinary fields, from economics to sociology, physics to chemistry, and from ecology to psychology (e.g. Vyas–Kumaranayake 2006, Linting et al. 2007, Rosti–Chelli 2009, Khatun 2009, Colantoni et al. 2016).
To overcome the increasing complexity and non-linearity of large datasets, generalisations of the standard PCA originally proposed by Karl Pearson in 1901 have been advanced, using multi-linear, higher-order, robust and weighted approaches (Lu et al. 2011, Rosti–Chelli 2012, Carlucci et al. 2018).
More specifically, the idea of replacing the classic correlation coefficient estimators in a correlation matrix with non-parametric estimators has already been discussed in an ecological framework (Mérigot et al. 2007, 2017) and applied to specific environmental case studies with field measurements of biophysical phenomena (Gaertner et al. 2010, 2013, Vásquez-Correa–Laniado Rodas 2019). Practical approaches to the exploratory analysis of non-linear relationships in complex data matrices (a typical data configuration in regional science) are still necessary for socioeconomic research, especially when more traditional statistical techniques (e.g. econometric models) fail because the input variables intrinsically deviate from the working assumptions of these approaches (Chelli–Rosti 2002, Gigliarano–Chelli 2016, Salvati et al. 2018).
Grounded in the broad debate on exploratory multidimensional statistics applied to practical cases of multivariate analysis (e.g. Tóth–Csomós 2016, Varga et al. 2016, Lamonica et al. 2020), the present study contributes to this topic by applying to empirical data a simplified, non-parametric procedure aimed at investigating non-linear relationships among inputs through a generalised PCA. The generalisation proposed here was aimed at (i) improving the overall performance of a PCA in terms of representation of the original data matrix and (ii) providing a better representation of complex relationships among variables. The procedure was run on a data matrix quantifying land-use composition (based on the per cent share of 19 land-use classes in the total landscape obtained from a high-resolution map) in 115 local domains (municipalities) belonging to a sufficiently large region (the Athens metropolitan region in Greece). This two-dimensional structure (variables and spatial units) represents a typical data matrix in regional studies.
With highly complex and fragmented landscapes that result in non-linear relationships among land-use extents over space, the study area was taken as representative of metropolitan regions all over Europe. Such regions are experiencing a progressive shift from monocentric structures ordered along the urban gradient towards more heterogeneous and dispersed configurations with sprawled settlements increasingly mixed with cropland and forests (Salvati 2020). To improve the regional analysis and contribute more significantly to land-use science with a refined understanding of non-linear relationships in the land-use composition of a given landscape, the results of the generalised approach described in this paper were finally compared with the outcomes of a standard PCA.

Methodology

PCA: Definitions and operational framework

A PCA performed on data collected on n variables for j cases investigates both the j-dimensional space defined by n variables and the n-dimensional space defined by j cases (Inkoom et al. 2018). In other words, PCA performs an orthogonal transformation of n inputs, measured on a cardinal scale, into n linearly uncorrelated variables called components (Primpas et al. 2010). The first step is to compute a square symmetric matrix of variances-covariances (or Pearson linear correlations), depending on whether the data need to be centred about the mean values or standardised (Tripathi–Singal 2019). In both cases, the result is a new set of variables (principal components) that are linear combinations of the inputs (Rajesh et al. 2018). Components are selected using optimisation procedures (e.g. the least squares criterion), since linear combinations of inputs are sought that best fit the clouds of points in the vector spaces (of variables and cases). This procedure yields principal components that result in the highest sums of squares for the orthogonal projections (Zhang et al. 2012). Orthogonal selection ensures that all the components extracted are uncorrelated with one another in the multivariate field (Zou–Yoshino 2017).
Each component accounts for a variable proportion of variance in the original dataset. More specifically, components are extracted iteratively, so that the first component extracts the largest proportion of the total variance, the next component accounts for the second largest proportion of variance, and so on. In other words, the second component is extracted from the variance remaining after the extraction of the first component, and so forth. Thus, the proportion of variance extracted by each component from 1 to n decreases monotonically (Jolliffe 2002). Although components are derived to maximise the proportion of explained variance, the first component can seldom capture the entire variance of the dataset (Hosseini–Kaneko 2011). At the same time, the retained (significant) components are usually fewer than the inputs and account for the inherent variation of the data to the maximum possible extent.
Choosing the number of components to analyse is notably subjective, although several quantitative approaches exist for making this choice operationally. Examples of these criteria include thresholds (to include or to exclude) based on the components’ absolute eigenvalues (λ > 1) or the relative proportion of variance explained by each component (σ² > 10%). Other criteria can be set in this context (Salvati–Zitti 2009).
For instance, PCA’s ability to summarise large amounts of information in a few relevant components, taking the explained variance as constant, is intrinsically reflected in the ratio of components selected for analysis to the total inputs n (e.g. Colantoni et al. 2016). A lower value of this ratio indicates a higher ability to summarise information associated with a large dataset into a few meaningful components (Salvati 2014). In any case, the number of components selected for final analysis should never equal the number of original variables; instead, it should be much lower (e.g. Zambon et al. 2017).
Based on the coordinates of the selected components, a new space (the components’ space) is generated, onto which variables and cases can be projected. The results of a principal component extraction are, therefore, discussed in terms of component scores (i.e. the transformed variable values corresponding to a particular data point) and component loadings (i.e. the weight by which each standardised variable should be multiplied to obtain the component score). A biplot, that is, a plot illustrating the relationship between variables and cases (proximity between variables, between cases, or between variables and cases indicates a strong association), is considered the most relevant graphical tool in a PCA. A biplot allows inspection of similarity patterns within the variables (loadings) and the cases (scores), and of the relationships between variables and cases, providing powerful information for analysing apparent and latent patterns in the dataset (Cano-Orellana–Delgado-Cabeza 2015).
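The operational framework described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data are synthetic (a hypothetical 115 × 6 matrix generated from two latent factors), the function name `pca_from_correlation` is invented for illustration, and the "weights" returned are the unit-length eigenvectors, taken here as the loadings in the sense used above (the weights applied to standardised variables to obtain scores).

```python
import numpy as np

def pca_from_correlation(X):
    """PCA via spectral decomposition of the Pearson correlation matrix.

    X is a (cases x variables) array. Returns eigenvalues in descending
    order, the share of variance per component, the component weights
    (eigenvectors) and the component scores.
    """
    X = np.asarray(X, dtype=float)
    # Standardise each variable, consistent with decomposing correlations.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    # eigh is suited to symmetric matrices; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    scores = Z @ eigvecs
    return eigvals, explained, eigvecs, scores

# Synthetic, hypothetical data: 115 cases, 6 partly redundant variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(115, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(115, 6))

eigvals, explained, weights, scores = pca_from_correlation(X)

# Selection thresholds mentioned in the text: eigenvalue > 1, or share > 10%.
n_keep = int((eigvals > 1.0).sum())
n_keep_share = int((explained > 0.10).sum())
print("eigenvalues:", np.round(eigvals, 3))
print("components with eigenvalue > 1:", n_keep)
print("components explaining > 10%:", n_keep_share)
print("ratio of retained components to inputs:", n_keep / X.shape[1])
```

Because a correlation matrix has a unit diagonal, the eigenvalues sum to the number of variables, which is why an eigenvalue above 1 marks a component that carries more variance than a single standardised input.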

PCA generalisation

In the first step of a PCA, mentioned in the previous section, the original data matrix is transformed by performing a spectral decomposition of the variance/covariance (or correlation) matrix. The use of a correlation matrix removes the scaling effect associated with covariance matrices. In a standard PCA, correlation matrices are obtained by computing the pairwise Pearson product-moment correlation coefficients. Pearson correlation coefficients range between −1 and +1 and are extensively adopted for quantifying pairwise linear correlations, where −1 and +1 indicate a perfectly negative and a perfectly positive linear correlation, respectively, with 0 indicating no evidence for a linear correlation (Jolliffe–Cadima 2016). Because Pearson (parametric) correlation coefficients are sensitive to linear correlations only, PCA inherits an intrinsic limitation that restricts its applicability to datasets with mostly linear relationships among the inputs. However, working with variables displaying significant deviations from linearity is relatively common in regional studies (Salvati 2014). Based on the results and simulations of earlier studies (Mérigot et al. 2007, 2017, Ciommi et al. 2019), a generalisation of PCA aimed at exploring non-linear multivariate relationships among inputs is illustrated below, using non-parametric correlation coefficients.
In the present study, PCAs were performed by decomposing correlation matrices of both parametric (Pearson) and non-parametric (Spearman and Kendall) coefficients. Pairwise Spearman’s rank correlation coefficients and Kendall concordance coefficients assume that inputs were measured on at least an ordinal (rank-order) scale, that is, that the individual observations can be ranked into an ordered series (Kendall 1970). More specifically, Spearman's rank correlation coefficient is a non-parametric measure of the statistical dependence between two variables, quantifying how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotonic function of the other variable (Zambon et al. 2017). Spearman correlation is less sensitive than Pearson correlation to strong outliers in the tails of both samples (Smarandache 2009). This is because Spearman's coefficient limits the outlier to the value of its rank. Kendall’s rank correlation coefficient is a non-parametric statistic, measuring the association between two variables using the same numerical scale as Pearson’s or Spearman’s coefficients (i.e. ranging between −1 and +1, with 0 indicating no rank association). Like Spearman’s coefficient, this measure is indicative of a rank correlation, that is, the similarity of the orderings of the data when ranked by each of the variables.
Neither Spearman nor Kendall correlation relies on any assumption about the (marginal and joint) distributions of the inputs (Kendall 1970). In contrast to the Pearson correlation coefficient, this means that neither of the two non-parametric correlation coefficients is sensitive to deviations from (i) normality of the individual variables and (ii) linearity between a pair of variables. Thus, they can capture correlation regimes beyond first-order (linear) correlations, such as monotonic relationships of square, cubic, or higher-order polynomial form.
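The contrast between the three coefficients can be illustrated with a small Python sketch on synthetic data (the variables and the helper name `three_correlations` are hypothetical, not taken from the study). The first pair is a monotonic but strongly non-linear relationship; the second is a symmetric, non-monotonic one, a case that rank coefficients are not expected to detect.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def three_correlations(x, y):
    """Return (Pearson r, Spearman rho, Kendall tau) for two samples."""
    r, _ = pearsonr(x, y)
    rho, _ = spearmanr(x, y)
    tau, _ = kendalltau(x, y)
    return float(r), float(rho), float(tau)

# Monotonic, non-linear relationship: y grows exponentially with x.
x = np.linspace(0.0, 5.0, 50)
mono = three_correlations(x, np.exp(x))

# Non-monotonic relationship: a parabola on a grid symmetric about zero.
x_sym = np.arange(-25.0, 26.0)
non_mono = three_correlations(x_sym, x_sym ** 2)

print("monotonic     r=%.3f rho=%.3f tau=%.3f" % mono)
print("non-monotonic r=%.3f rho=%.3f tau=%.3f" % non_mono)
```

In the monotonic case the two rank coefficients reach 1 while Pearson's coefficient stays below 1, since the ordering of the observations is preserved but the relationship is not a straight line. In the symmetric parabola all three coefficients are approximately zero, which shows that rank-based coefficients address monotonic non-linearity only.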
Based on these premises, a generalised approach was proposed, and three separate analyses were performed on the same data set: (i) the standard procedure of decomposing a Pearson correlation matrix, and the generalised procedure of decomposing (ii) a Spearman or (iii) a Kendall correlation matrix. The empirical findings of these approaches were compared using traditional diagnostics (percentage of explained variance and statistical distribution of component loadings).

PCA diagnostics

The 'quality of representation' is an important measure of the variance accounted for by a given set of principal components. In the present study, components with an eigenvalue > 1 were selected and further analysed. To assess the quality of PCA outputs, the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (testing whether the partial correlations among the variables are small) and Bartlett’s test of sphericity (testing whether the correlation matrix is an identity matrix) were adopted to verify whether the ‘representation model’ based on the selected components was appropriate for analysing the original dataset (Salvati–Zitti 2009). The loadings, indicating the association between a given input and a given component, were analysed to provide supplementary information on the overall quality of representation for each PCA. Finally, the component scores were computed to evaluate the relative contribution of each case to the variance of a given component. The higher the contribution of a case, the heavier its weight on the component (Jolliffe 2002).
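The three analyses and the two diagnostics can be sketched in Python as follows. This is an illustration under stated assumptions rather than the software used in the study: the data are synthetic and hypothetical, the function names are invented, the KMO index is computed from partial correlations obtained through the inverse correlation matrix, and Bartlett's statistic uses the usual chi-square approximation. Applying both diagnostics unchanged to rank correlation matrices is also an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def correlation_pca(data, method="pearson"):
    """Decompose a Pearson, Spearman or Kendall correlation matrix."""
    R = data.corr(method=method).to_numpy()
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Cumulated variance (%) extracted up to each component.
    cumulated = 100.0 * np.cumsum(eigvals) / eigvals.sum()
    return R, eigvals, eigvecs, cumulated

def bartlett_sphericity(R, n_cases):
    """Bartlett's test that R is an identity matrix (chi-square approx.)."""
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)
    stat = -(n_cases - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) / 2
    return stat, chi2.sf(stat, dof)

def kmo(R):
    """Kaiser-Meyer-Olkin measure of sampling adequacy."""
    inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)          # partial correlations
    off = ~np.eye(R.shape[0], dtype=bool)    # off-diagonal mask
    r2 = np.sum(R[off] ** 2)
    q2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + q2)

# Synthetic, hypothetical inputs: monotonic but non-linear in two factors.
rng = np.random.default_rng(42)
latent = rng.normal(size=(115, 2))
base = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(115, 6))
data = pd.DataFrame(np.exp(base), columns=[f"v{i}" for i in range(1, 7)])

diagnostics = {}
for method in ("pearson", "spearman", "kendall"):
    R, eigvals, _, cumulated = correlation_pca(data, method)
    stat, p_value = bartlett_sphericity(R, len(data))
    diagnostics[method] = {
        "eigvals": eigvals,
        "cumulated": cumulated,
        "kmo": kmo(R),
        "bartlett": stat,
        "p_value": p_value,
    }
    print(method, np.round(cumulated[:3], 1),
          round(diagnostics[method]["kmo"], 3), round(stat, 1))
```

Swapping only the `method` argument keeps the rest of the procedure identical across the three runs, so any difference in the diagnostics can be attributed to the correlation measure.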

Empirical application

An input matrix containing the per cent share of 19 land-use classes in the total landscape of 115 spatial domains (local municipalities) encompassing the entire extent (3,025 km²) of a metropolitan region in Southern Europe (Athens, Greece) was obtained from a spatial overlay of a shapefile of municipal boundaries (source: Hellenic Statistical Authority) and a shapefile map developed as part of the Urban Atlas (UA) GMES (Global Monitoring for Environment and Security)-Copernicus initiative (Figure 1, Figure 2). The UA service offers high-resolution (1:10,000) land-use maps of more than 300 urban areas in Europe, representing a typical big-data, remotely sensed source of information for landscape analysis. These maps adopt an articulated classification of land use (Table 1) based on the Corine Land Cover nomenclature system, with a specific focus on urban areas. The study area represents a monocentric city with a distinctive urban–rural gradient (Colantoni et al. 2016).
Figure 1. A simplified map of the study area, 2012. Legend: built-up areas; agricultural areas; forest and other natural areas. Note: Based on the Urban Atlas database, illustrating the spatial distribution of three basic land-use classes.
Figure 2. A map of the administrative boundaries of the local municipalities and the average elevation (m), 2012. Legend (elevation, m): <250; 250–500; >500. Note: Indicating the location of downtown Athens in the metropolitan region (star).
Geo-spatial databases derived from such maps provide an articulated representation of local territories based on specific attributes (e.g. high-resolution land-use surveys with heterogeneous classification systems for urban and rural areas, available at different temporal and spatial scales, with variable coverage among cities).
They represent a great opportunity to test the effectiveness of PCA generalisations when analysing non-linear relationships among inputs. Based on the spatial overlay of the two maps described above, per cent class area in the total landscape was computed at each municipality of the study region using the ‘zonal statistics’ procedure, implemented within a Geographic Information System (ArcGIS 10 release). To avoid complete multi-collinearity (i.e. linear dependence) among the inputs, water bodies (code '50000') were not considered in the analysis. The input matrix subjected to PCA finally comprised 19 columns (land-use classes) and 115 rows (municipal units) with no missing data.

Table 1. Land-use classification (based on Urban Atlas [UA] nomenclature) adopted in the present study, 2012
UA code   Code used in this study   Description
11100     1110                      Continuous Urban Fabric (sealed soil > 80%)
11210     1121                      Discontinuous Dense Urban Fabric (50%–80%)
11220     1122                      Discontinuous Medium-Density Urban Fabric (30%–50%)
11230     1123                      Discontinuous Low-Density Urban Fabric (10%–30%)
11240     1124                      Discontinuous Very-Low-Density Urban Fabric (< 10%)
11300     1130                      Isolated human structures
12100     121                       Industrial, commercial, public, military, and private units
12210     1221                      Fast transit roads and associated land
12220     1222                      Other roads and associated land
12230     1223                      Railways and associated land
12300     1230                      Port areas
12400     1240                      Airports
13100     1310                      Mineral extraction and dump sites
13300     1330                      Construction sites
13400     1340                      Land without current use
14100     1410                      Green urban areas
14200     1420                      Sports and leisure facilities
20000     2                         Agricultural areas a)
30000     3                         Forests a)
50000     Not considered            Water bodies
a) Sub-classes in both of these fields were combined.

Results

The components with eigenvalues > 1 were selected, and Table 2 reports the basic diagnostics of the generalised PCA runs on the input dataset using Pearson, Spearman, or Kendall metrics. While the first component extracted the highest variance when using a Pearson correlation matrix, the cumulated variance from component 2 onwards when using a Spearman’s correlation matrix was systematically higher than that observed when using the other two correlation matrices. Moreover, the statistics used to evaluate the quality of representation (KMO and Bartlett's tests) indicate that the PCA using Spearman’s correlation coefficients achieved an optimal data representation.

Table 2. Diagnostics of generalised PCA based on spectral decomposition of different correlation matrices, by correlation measure, 2012
              Cumulated variance extracted by the j-th component, %
Correlation   1st    2nd    3rd    4th    5th    KMO test   Bartlett's test
Pearson       35.6   49.2   60.0   67.2   73.8   0.676      1712
Spearman      32.8   51.7   62.9   70.2   76.2   0.838      1771
Kendall       26.1   41.4   51.4   58.6   64.6   0.790      819

Figure 3. Difference in the per cent variance extracted by component for Pearson vs Spearman and Kendall vs Pearson correlation matrices, 2012.
Figure 3 illustrates the absolute difference in the percentage of variance extracted by component for Pearson vs Spearman and Kendall vs Pearson correlation matrices. Assuming the first five components as representative of the input data matrix (see above), the difference between Spearman and Pearson matrices indicates that use of Spearman coefficients allows for (more or less) intense gains in the proportion of explained variance up to the fifth component.
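As a worked check, the differences discussed here can be recomputed from the cumulated percentages reported in Table 2. The Python sketch below only re-uses those published figures; whether Figure 3 plots cumulated or per-component differences is not stated explicitly, so both are derived, and the variable names are illustrative.

```python
import numpy as np

# Cumulated variance (%) for components 1 to 5, transcribed from Table 2.
cumulated = {
    "Pearson": np.array([35.6, 49.2, 60.0, 67.2, 73.8]),
    "Spearman": np.array([32.8, 51.7, 62.9, 70.2, 76.2]),
    "Kendall": np.array([26.1, 41.4, 51.4, 58.6, 64.6]),
}

# Variance extracted by each single component (first differences).
per_component = {
    name: np.diff(values, prepend=0.0) for name, values in cumulated.items()
}

# Differences in cumulated variance relative to the Pearson-based PCA.
gain_spearman = cumulated["Spearman"] - cumulated["Pearson"]
gain_kendall = cumulated["Kendall"] - cumulated["Pearson"]

for name, values in per_component.items():
    print(f"{name:9s} per component:", np.round(values, 1))
print("Spearman minus Pearson (cumulated):", np.round(gain_spearman, 1))
print("Kendall minus Pearson (cumulated): ", np.round(gain_kendall, 1))
```

On these figures the Spearman-based PCA trails the Pearson-based PCA by 2.8 percentage points on the first component and leads by 2.4 to 3.0 points in cumulated variance from the second to the fifth component, while the Kendall-based PCA stays 7.8 to 9.5 points below the Pearson-based PCA throughout.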
By contrast, PCA using a Kendall correlation matrix extracted less variance up to the fifth component than PCA using a Pearson correlation matrix.
Figure 4. Loading plot (component 1 vs 2) by correlation matrix type, 2012. Panels: Pearson correlation matrix; Spearman correlation matrix. Plot labels: Rural; Urban.
Figure 4 compares the inputs’ loadings along components 1 and 2. On average, the loadings on the principal components were larger when using Spearman’s correlation coefficients. Components 1 and 2 clearly discriminate continuous urban fabric (‘1110’) from rural areas (both agricultural, ‘2’, and forests, ‘3’), the typical density gradient observed in the study area (Colantoni et al. 2016, Pili et al. 2017). The results of the PCA using Pearson correlation coefficients are less clear regarding the monocentric structure of the city and the consequent urban-rural polarisation in landscape composition (Zambon et al. 2017). This result is consistent with the high quality of representation achieved by the two components extracted with the PCA based on a Spearman correlation matrix. Figure 5 compares the distribution of scores on the selected components extracted using Pearson and Spearman correlation matrices. The two score plots are substantially similar, with the Spearman plot showing a slightly higher capability to discriminate cases along the two components.
Figure 5. Score plot (component 1 vs 2) by correlation matrix type, 2012. Panels: Pearson correlation matrix; Spearman correlation matrix.

Discussion

Structure (e.g. the spatial arrangement of individual elements in a given landscape), functions (e.g. how ecological processes operate within a given landscape structure), and dynamics of change (e.g. disturbance and recovery) are characteristics of a given landscape. They result from the continuous interplay of physical, biological, and socio-cultural processes acting simultaneously over broad regions at different spatio-temporal scales (Bollen et al. 2009, Salvati 2014, Colantoni et al. 2018). The exploratory analysis of landscape data that accounts for non-linear relationships among inputs is an increasingly challenging issue (Salvati et al. 2019). With the evolution of remote sensing techniques, the supply of big data to monitor land-use changes has been rapidly increasing in volume, dimension, and completeness, at both the global and regional scales (Guido et al. 2017, Todde et al. 2017, Jolliffe–Cadima 2016). Because of their intrinsic complexity, datasets derived from such sources represent a pivotal opportunity to verify the informative power, computational effectiveness, and simplicity of non-parametric generalisations of more traditional techniques (Li–Wu 2004). Non-parametric approaches have been regarded as promising tools in the analysis of large data sets characterised by complex relationships among inputs (Salvati et al. 2018).
PCA has been frequently adopted as an exploratory framework for analysing highly correlated data sets (Vyas–Kumaranayake 2006). PCA was originally proposed as a linear technique, whose intrinsic limit was its ineffectiveness in analysing the non-linear systems typically observed in regional studies (Zou–Yoshino 2017).
Several approaches extending linear PCA to the analysis of non-linear datasets have been proposed, including new singular value decomposition techniques, (simple or recursive) kernel PCA, and the adoption of genetic programming algorithms for achieving the objectives of non-linear PCA (Zhang et al. 2012). While giving a high degree of structural parsimony, principal curves contribute to the generalisation of PCA with a natural geometric framework (Linting et al. 2007, Lu et al. 2011, Jolliffe–Cadima 2016).
The present study proposes a simplified evaluation of landscape indicators using a generalised PCA based on the decomposition of parametric (Pearson) and non-parametric (Spearman and Kendall) correlation matrices. In our case, decomposing a Spearman’s correlation matrix gave the highest performance in terms of (i) the explained variance, (ii) standard diagnostics (KMO and Bartlett’s tests), and (iii) component loadings and scores. By increasing the overall representation quality, use of the Spearman correlation matrix provides a refined picture of the latent relationships in a data set and can be routinely applied in a multivariate analysis (Salvati 2014).
Moreover, standard PCA is particularly sensitive to outliers (Jolliffe 2002). Although outliers are difficult to identify in some contexts (Cano-Orellana–Delgado-Cabeza 2015), removing outliers before the computations is a common practice in several techniques, including robust PCA using low-rank and sparse matrices (Johnson–Wichern 2002). The Spearman correlation coefficient helps to address this issue, as it is less sensitive to outliers (Hubert et al. 2005).
Although they reduce sensitivity to outliers, the Spearman and Kendall correlation coefficients do not solve all non-linearity issues, such as non-monotonicity (Han–Liu 2012). These problems need further examination that better integrates practitioners’ analytical needs with the efficiency and robustness objectives of statistical theory (Bansal et al. 2010). Refined and new approaches addressing complex, non-linear, and fragmented data patterns, typical of regional science, are a necessary prerequisite for reliable, stable, and informed outcomes of any exploratory multivariate statistical method (Schwedhelm et al. 2018).

Concluding remarks

The results of this study encourage the integrated use of PCA based on both Pearson and Spearman correlation matrices. The direct comparison of the results derived from the two types of analyses can provide useful information on more complex relationships among inputs, contributing to non-linear pattern recognition. The increasing demand for quantitative approaches, high-resolution data, and indicators suitable to inform multi-disciplinary studies at refined spatio-temporal scales stimulates research aimed at summarising trends and extracting latent patterns from large datasets in regional science. These approaches should concentrate on a comparative analysis of different tools, improving the representation quality of big data in multi-dimensional space.

REFERENCES

BANSAL, S.–POURBOHLOUL, B.–HUPERT, N.–GRENFELL, B.–MEYERS, L. A. (2010): The shifting demographic landscape of pandemic influenza PLoS One 5 (2): e9360. https://doi.org/10.1371/journal.pone.0009360
BOLLEN, J.–VAN DE SOMPEL, H.–HAGBERG, A.–CHUTE, R. (2009): A principal component analysis of 39 scientific impact measures PLoS One 4 (6): e6022. https://doi.org/10.1371/journal.pone.0006022
CANO-ORELLANA, A.–DELGADO-CABEZA, M. (2015): Local ecological footprint using Principal Component Analysis: A case study of localities in Andalusia (Spain) Ecological Indicators 57: 573–579. https://doi.org/10.1016/j.ecolind.2015.03.014
CARLUCCI, M.–CHELLI, F. M.–SALVATI, L. (2018): Toward a new cycle: Short-term population dynamics, gentrification, and re-urbanization of Milan (Italy) Sustainability 10 (9): 3014. https://doi.org/10.3390/su10093014
CHELLI, F.–GIGLIARANO, C.–MATTIOLI, E. (2009): The impact of inflation on heterogeneous groups of households: An application to Italy Economics Bulletin 29 (2): 1276–1295.
CHELLI, F.–ROSTI, L. (2002): Age and gender differences in Italian workers’ mobility International Journal of Manpower 23 (4): 313–325. https://doi.org/10.1108/01437720210439656
CHELLI, F. M.–CIOMMI, M.–EMILI, A.–GIGLIARANO, C.–TARALLI, S. (2016): Assessing the Equitable and Sustainable Well-Being of the Italian Provinces International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 24 (1): 39–62. https://doi.org/10.1142/S0218488516400031
CIOMMI, M.–CHELLI, F. M.–CARLUCCI, M.–SALVATI, L. (2018): Urban growth and demographic dynamics in southern Europe: Toward a new statistical approach to regional science Sustainability 10 (8): 2765. https://doi.org/10.3390/su10082765
CIOMMI, M.–CHELLI, F. M.–SALVATI, L. (2019): Integrating parametric and non-parametric multivariate analysis of urban growth and commuting patterns in a European metropolitan area Quality and Quantity 53 (2): 957–979. https://doi.org/10.1007/s11135-018-0798-2
CIOMMI, M.–GIGLIARANO, C.–EMILI, A.–TARALLI, S.–CHELLI, F. M. (2017): A new class of composite indicators for measuring well-being at the local level: An application to the Equitable and Sustainable Well-being (BES) of the Italian Provinces Ecological Indicators 76: 281–296. https://doi.org/10.1016/j.ecolind.2016.12.050
COLANTONI, A.–GRIGORIADIS, E.–SATERIANO, A.–VENANZONI, G.–SALVATI, L. (2016): Cities as selective land predators? A lesson on urban growth, deregulated planning and sprawl containment Science of the Total Environment 545: 329–339. https://doi.org/10.1016/j.scitotenv.2015.11.170
COLANTONI, A.–MONARCA, D.–LAURENDI, V.–VILLARINI, M.–GAMBELLA, F.–CECCHINI, M. (2018): Agriculture 8 (4): 47. https://doi.org/10.3390/agriculture8040047
GAERTNER, J-C.–MAIORANO, P.–MÉRIGOT, B.–COLLOCA, F.–POLITOU, C-Y.–GIL DE SOLA, L.–BERTRAND, J. A.–MURENU, M.–DURBEC, J-P.–KALIANIOTIS, A.–MANINI, A. (2013): Large-scale diversity of slope fishes: patterns inconsistency between multiple diversity indices PLoS One 13 (1): e0191588. https://doi.org/10.1371/journal.pone.0191588
GAERTNER, J. C.–MÉRIGOT, B.–RELINI, G.–BERTRAND, J. A.–POLITOU, C.-Y.–MAZOUNI, N.–GIL DE SOLA, L.–KALLIANIOTIS, A.–CARPENTIERI, P.–MURENU, M.–DURBEC, J. P.–VRGOC, N.–UNGARO, N. (2010): Reproducibility of the multicomponent aspect of species diversity through different areas and scales: towards the constitution of a shortlist of complementary indices for monitoring fish diversity? Ecography 33 (6): 1123–1135. https://doi.org/10.1111/j.1600-0587.2009.06259.x
GIGLIARANO, C.–CHELLI, F. M. (2016): Measuring inter-temporal intragenerational mobility: an application to the Italian labour market Quality and Quantity 50 (1): 89–102. https://doi.org/10.1007/s11135-014-0138-0
GUIDO, V.–MERCENARO, L.–GAMBELLA, F. (2017): Application of Proximal Sensing in Viticulture: Comparison of Different Berry State Conditions Chemical Engineering Transactions 58: 613–618. https://doi.org/10.3303/CET1758103
HAN, F.–LIU, H. (2012): Semiparametric principal component analysis In: JORDAN, M. I.–LECUN, Y.–SOLLA, S. A. Advances in Neural Information Processing Systems Proceedings of the First 12 Conferences pp. 171–179. The MIT Press, Cambridge, MA.
HOSSEINI, H. M.–KANEKO, S. (2011): Dynamic sustainability assessment of countries at the macro level: A principal component analysis Ecological Indicators 11 (3): 811–823. https://doi.org/10.1016/j.ecolind.2010.10.007
HUBERT, M.–ROUSSEEUW, P. J.–VANDEN BRANDEN, K. (2005): ROBPCA: a new approach to robust principal component analysis Technometrics 47 (1): 64–79. https://doi.org/10.1198/004017004000000563
INKOOM, J. N.–FRANK, S.–GREVE, K.–WALZ, U.–FÜRST, C. (2018): Suitability of different landscape metrics for the assessments of patchy landscapes in West Africa Ecological Indicators 85: 117–127. https://doi.org/10.1016/j.ecolind.2017.10.031
JOHNSON, R. A.–WICHERN, D. W. (2002): Applied Multivariate Statistical Analysis Prentice Hall, New York.
JOLLIFFE, I. T. (2002): Principal Component Analysis. Springer Series in Statistics Springer, New York.
JOLLIFFE, I. T.–CADIMA, J. (2016): Principal component analysis: a review and recent developments Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374 (2065): 20150202. https://doi.org/10.1098/rsta.2015.0202
KENDALL, M. G. (1970): Rank correlation methods Griffin, London.
KHATUN, T. (2009): Measuring environmental degradation by using principal component analysis Environment Development and Sustainability 11: 439–457. https://doi.org/10.1007/s10668-007-9123-2
LAMONICA, G. R.–CHELLI, F. M. (2018): The performance of non-survey techniques for constructing sub-territorial input-output tables Papers in Regional Science 97 (4): 1169–1202. https://doi.org/10.1111/pirs.12297
LAMONICA, G. R.–RECCHIONI, M. C.–CHELLI, F. M.–SALVATI, L. (2020): The efficiency of the cross-entropy method when estimating the technical coefficients of input–output tables Spatial Economic Analysis 15 (1): 62–91. https://doi.org/10.1080/17421772.2019.1615634
LI, H.–WU, J. (2004): Use and misuse of landscape indices Landscape Ecology 19: 389–399. https://doi.org/10.1023/B:LAND.0000030441.15628.d6
LINTING, M.–MEULMAN, J. J.–GROENEN, P. J. F.–VAN DER KOOIJ, A. (2007): Nonlinear principal components analysis: Introduction and application Journal of Psychological Methods 12 (3): 336–358. https://doi.org/10.1037/1082-989X.12.3.336
LU, H.–PLATANIOTIS, K. N.–VENETSANOPOULOS, A. N. (2011): A Survey of Multilinear Subspace Learning for Tensor Data Pattern Recognition 44 (7): 1540–1551. https://doi.org/10.1016/j.patcog.2011.01.004
MÉRIGOT, B.–BERTRAND, J. A.–GAERTNER, J. C.–DURBEC, J. P.–MAZOUNI, N.–MANTÉ, C. (2007): The multi-component structuration of the species diversity of groundfish assemblages of the east coast of Corsica (Mediterranean Sea): Variation according to the bathymetric strata Fisheries Research 88 (1–3): 120–132. https://doi.org/10.1016/j.fishres.2007.08.003
MÉRIGOT, B.–FRÉDOU, F. L.–VIANA, A. P.–FERREIRA, B. P.–JUNIOR, E. D. N. C.–DA SILVA JÚNIOR, C. B.–FRÉDOU, T. (2017): Fish assemblages in tropical estuaries of northeast Brazil: A multi-component diversity approach Ocean & Coastal Management 143: 175–183. https://doi.org/10.1016/j.ocecoaman.2016.08.004
PILI, S.–GRIGORIADIS, E.–CARLUCCI, M.–CLEMENTE, M.–SALVATI, L. (2017): Towards sustainable growth? A multi-criteria assessment of (changing) urban forms Ecological Indicators 76: 71–80.
PRIMPAS, I.–TSIRTSIS, G.–KARYDIS, M.–KOKKORIS, G. D. (2010): Principal component analysis: Development of a multivariate index for assessing eutrophication according to the European water framework directive Ecological Indicators 10 (2): 178–183. https://doi.org/10.1016/j.ecolind.2009.04.007
RAJESH, S.–JAIN, S.–SHARMA, P. (2018): Inherent vulnerability assessment of rural households based on socio-economic indicators using categorical principal component analysis: A case study of Kimsar region, Uttarakhand Ecological Indicators 85: 93–104. https://doi.org/10.1016/j.ecolind.2017.10.014
ROSTI, L.–CHELLI, F. (2009): Self-employment among Italian female graduates Education and Training 51 (7): 526–540. https://doi.org/10.1108/00400910910992736
ROSTI, L.–CHELLI, F. (2012): Higher education in non-standard wage contracts Education and Training 54 (2–3): 142–151. https://doi.org/10.1108/00400911211210251
ROUSSEEUW, P. J.–LEROY, A. M. (2005): Robust regression and outlier detection (Vol. 589). Wiley, London.
SALVATI, L. (2014): Agro-forest landscape and the ‘fringe’ city: A multivariate assessment of land-use changes in a sprawling region and implications for planning Science of the Total Environment 490: 715–723. https://doi.org/10.1016/j.scitotenv.2014.05.080
SALVATI, L. (2020): Density-dependent population growth in Southern Europe (1961–2011): A non-parametric approach using smoothing splines Regional Statistics 10 (2): 27–41. https://doi.org/10.15196/RS100201
SALVATI, L.–ZITTI, M. (2009): The environmental ‘risky’ region: identifying land degradation processes through integration of socio-economic and ecological indicators in a multivariate regionalization model Environmental Management 44 (5): 888–899. https://doi.org/10.1007/s00267-009-9378-5
SALVATI, L.–GUANDALINI, A.–CARLUCCI, M.–CHELLI, F. M. (2017): An empirical assessment of human development through remote sensing: Evidences from Italy Ecological Indicators 78: 167–172. https://doi.org/10.1016/j.ecolind.2017.03.014
SALVATI, L.–FERRARA, A.–CHELLI, F. (2018): Long-term growth and metropolitan spatial structures: an analysis of factors influencing urban patch size under different economic cycles Geografisk Tidsskrift - Danish Journal of Geography 118 (1): 56–71. https://doi.org/10.1080/00167223.2017.1386582
SALVATI, L.–CIOMMI, M. T.–SERRA, P.–CHELLI, F. M. (2019): Exploring the spatial structure of housing prices under economic expansion and stagnation: The role of sociodemographic factors in metropolitan Rome, Italy Land Use Policy 81: 143–152. https://doi.org/10.1016/j.landusepol.2018.10.030
SCHWEDHELM, C.–IQBAL, K.–KNÜPPEL, S.–SCHWINGSHACKL, L.–BOEING, H. (2018): Contribution to the understanding of how principal component analysis–derived dietary patterns emerge from habitual data on food consumption The American Journal of Clinical Nutrition 107 (2): 227–235. https://doi.org/10.1093/ajcn/nqx027
SMARANDACHE, F. (2009): Alternatives to Pearson's and Spearman's correlation coefficients International Journal of Statistics and Economics 3 (S09): 47–53.
TODDE, G.–CARIA, M.–GAMBELLA, F.–PAZZONA, A. (2017): Energy and carbon impact of precision livestock farming technologies implementation in the milk chain: from dairy farm to cheese factory Agriculture 7 (10): 79. https://doi.org/10.3390/agriculture7100079
TÓTH, G.–CSOMÓS, G. (2016): Mapping the position of cities in corporate research and development through a gravity model-based bidimensional regression analysis Regional Statistics 6 (1): 217–220. https://doi.org/10.15196/RS06111
TRIPATHI, M.–SINGAL, S. K. (2019): Use of Principal Component Analysis for parameter selection for development of a novel Water Quality Index: A case study of river Ganga India Ecological Indicators 96 (1): 430–436. https://doi.org/10.1016/j.ecolind.2018.09.025
VARGA, L.–TÓTH, G.–NÉDA, Z. (2016): An improved radiation model and its applicability for understanding commuting patterns in Hungary Regional Statistics 6 (2): 27–38. https://doi.org/10.15196/RS06202
VÁSQUEZ-CORREA, C. M.–LANIADO RODAS, H. (2019): A robust approach for principal component analysis arXiv arXiv:1903.00093
VYAS, S.–KUMARANAYAKE, L. (2006): Constructing socio-economic status indices: how to use principal components analysis Health Policy and Planning 21 (6): 459–468. https://doi.org/10.1093/heapol/czl029
ZAMBON, I.–SERRA, P.–SAURI, D.–CARLUCCI, M.–SALVATI, L. (2017): Beyond the ‘Mediterranean city’: Socioeconomic disparities and urban sprawl in three Southern European cities Geografiska Annaler: Series B, Human Geography 99 (3): 319–337. https://doi.org/10.1080/04353684.2017.1294857
ZHANG, Y.–LI, S.–TENG, Y. (2012): Dynamic processes monitoring using recursive kernel principal component analysis Chemical Engineering Science 72: 78–86. https://doi.org/10.1016/j.ces.2011.12.026
ZOU, T.–YOSHINO, K. (2017): Environmental vulnerability evaluation using a spatial principal components approach in the Daxing’anling region, China Ecological Indicators 78: 405–415. https://doi.org/10.1016/j.ecolind.2017.03.039

Regional Statistics, Vol. 11. No. 1. 2021 Online first Egidi–Edwards–Cividino–Gambella–Salvati 1–17; DOI: 10.15196/RS110105
