Exploring non-linear relationships among
redundant variables through non-parametric
principal component analysis:
An empirical analysis with land-use data
Gianluca Egidi
Tuscia University, Italy
E-mail: egidi.gianluca@unitus.it

Magda Edwards
Global Change Research Institute (CzechGlobe), Czech Republic
E-mail: edwards.m@czechglobe.cz

Sirio Cividino
University of Udine, Italy
E-mail: sirio.cividino@uniud.it

Filippo Gambella
University of Sassari, Italy
E-mail: gambella@uniss.it

Luca Salvati
University of Macerata, Italy
E-mail: luca.salvati@unimc.it

Principal Component Analysis (PCA) is a widely applied statistical technique aimed at
summarising a multidimensional set of input (partly redundant) variables into a restricted
number of independent components that are linear combinations of the inputs. PCA
transforms the original data matrix by performing a spectral decomposition of the
related variance/covariance (or correlation) matrix. When decomposing a correlation
matrix, Pearson product-moment correlation coefficients are traditionally used in the
correlation matrix. The statistical properties of Pearson correlation coefficients (being
insensitive to non-linear, high-order correlations) represent an intrinsic limitation
of PCA, restricting its applicability to linear relationships among inputs. However,
working with variables displaying (more or less intense) deviations from linearity is
common in both socioeconomic research and environmental studies. Following the
theoretical assumptions of earlier studies, a generalisation of PCA aimed at exploring
non-linear multivariate relationships among inputs is illustrated in the present article by
using non-parametric Spearman and Kendall coefficients to replace linear Pearson
coefficients in the correlation matrix. The per cent share of 19 land-use classes in the total
landscape in a given study area (the Athens metropolitan region, Greece), obtained from
a high-resolution map at the local scale, were used as inputs. The results of the standard
PCA (via decomposition of a Pearson linear correlation matrix) and a generalised
approach (via decomposition of a non-parametric correlation matrix based on
Spearman or Kendall rank coefficients) were compared using traditional diagnostics. The
PCA performed by decomposing a Spearman correlation matrix exhibited the highest
variance extracted by the principal components, giving refined loadings and
scores that allow recognition of latent land-use patterns. Contributing to a recent debate
on the use of multidimensional techniques in regional studies, non-parametric approaches
are promising tools for analysis of large datasets displaying complex, almost non-linear
relationships among inputs.

Keywords: multidimensional techniques, Spearman non-parametric coefficients,
Principal Component Analysis (PCA), large data sets, indicators, regional science

Regional Statistics, Vol. 11. No. 1. 2021
Online first Egidi–Edwards–Cividino–Gambella–Salvati 1–17; DOI: 10.15196/RS110105
Introduction
By integrating data collected at various spatial resolutions from different
information sources, the multidimensional analysis of large datasets is a pivotal issue
in regional studies, which requires empirical refinement (Jolliffe 2002, Bansal et al.
2010, Chelli et al. 2009, Ciommi et al. 2019). Multivariate statistical techniques
considering the combination of time, space, and/or other relevant dimensions of
analysis may adequately explore latent patterns in the input variables, illustrating,
for example, spatio-temporal trends characterising factors that affect the overall
matrix variability more or less intensely (Johnson–Wichern 2002, Mérigot et al.
2007, Han–Liu 2012, Ciommi et al. 2018). With the advent of big data and the
continuous implementation of data mining techniques, exploratory statistical
frameworks have been increasingly oriented towards the analysis of complex
datasets (e.g. Chelli et al. 2016, Ciommi et al. 2017, Salvati et al. 2017). In this
context, the treatment of variables deviating from normality and of non-linear
relationships among inputs is a key (theoretical and practical) challenge in regional
science (Cano-Orellana–Delgado-Cabeza 2015, Lamonica–Chelli 2018, Salvati et al.
2019). New estimators of non-linear relationships (e.g. through new approaches
decomposing correlation matrices) have been proposed, sometimes in the context
of robust regression, for example, reducing model outcomes’ sensitivity to outliers
(Rousseeuw–Leroy 2005). Closely related to factor analysis (Salvati–Zitti 2009),
principal component analysis (PCA) is likely the most widely used multivariate
technique in the quantitative analysis of measurement variables. In this regard, PCA
is an effective tool for exploring complex datasets, producing interpretative and
predictive models in vastly different disciplinary fields, from economics to
sociology, physics to chemistry, and from ecology to psychology (e.g. Vyas–
Kumaranayake 2006, Linting et al. 2007, Rosti–Chelli 2009, Khatun 2009, Colantoni
et al. 2016).
To overcome the increasing complexity and non-linearity of large datasets,
generalisations of the standard PCA originally proposed by Karl Pearson in 1901
have been advanced, using multi-linear, higher-order, robust and weighted
approaches (Lu et al. 2011, Rosti–Chelli 2012, Carlucci et al. 2018). More
specifically, the idea of replacing the classic correlation coefficient estimators in a
correlation matrix by non-parametric estimators has already been discussed in an
ecological framework (Mérigot et al. 2007, 2017) and applied to specific
environmental case studies with field measurements of biophysical phenomena
(Gaertner et al. 2010, 2013, Vásquez-Correa–Laniado Rodas 2019).
Practical approaches to exploratory analysis of non-linear relationships in
complex data matrices—a typical data configuration in regional science—are still
necessary for socioeconomic research, especially when more traditional statistical
techniques (e.g. econometric models) fail because of the intrinsic deviation of input
variables from the working assumptions of these approaches (Chelli–Rosti 2002,
Gigliarano–Chelli 2016, Salvati et al. 2018). Grounded in the broad debate of
exploratory multidimensional statistics applied to practical cases of multivariate
analysis (e.g. Tóth–Csomós 2016, Varga et al. 2016, Lamonica et al. 2020), the
present study contributes to this important topic by applying to empirical data a
simplified, non-parametric procedure aimed at appropriately investigating non-linear
relationships among inputs through a generalised PCA.
The generalisation proposed here was aimed at (i) improving the overall
performance of a PCA in terms of representation of the original data matrix and (ii)
providing a better representation of complex relationships among variables. The
procedure was run on a data matrix quantifying land-use composition (based on the
per cent share of 19 land-use classes in the total landscape obtained from a
high-resolution map) in 115 local domains (municipalities) belonging to a sufficiently
large region (the Athens metropolitan region in Greece).
This two-dimensional structure (variables and spatial units) represents a typical
data matrix in regional studies. Displaying highly complex and fragmented
landscapes resulting in a non-linear relationship between land-use extent over space,
the study area was taken as representative of metropolitan regions all over Europe.
Such regions are experiencing a progressive shift from monocentric structures
ordered along the urban gradient towards more heterogeneous and dispersed
configurations with sprawled settlements increasingly mixed with cropland and
forests (Salvati 2020). To improve the regional analysis and contribute more
significantly to land-use science with a refined understanding of non-linear
relationships in the land-use composition of a given landscape, the results of the
generalised approach described in this paper were finally compared with the
outcomes of a standard PCA.
Methodology
PCA: Definitions and operational framework
A PCA performed on data collected on n variables for j cases investigates both the
j-dimensional space defined by n variables and the n-dimensional space defined by j
cases (Inkoom et al. 2018). In other words, PCA performs an orthogonal
transformation of n inputs – measured on a cardinal scale – into n linearly
uncorrelated variables called components (Primpas et al. 2010). The first step is to
compute a square symmetric matrix based on the variances-covariances (or
Pearson linear correlations), depending on whether the data need to be standardised
or centred about the mean values (Tripathi–Singal 2019). In both cases, the result is
a new set of variables (principal components) that are linear combinations of the
inputs (Rajesh et al. 2018). Components are selected using optimisation procedures
(e.g. the least squares criterion), since linear combinations of inputs are sought that
best fit the clouds of points in the vector spaces (of variables and cases). This
procedure yields principal components that result in the highest sums of squares for
the orthogonal projections (Zhang et al. 2012). Orthogonal selection ensures that all
the components extracted are mutually uncorrelated in the multivariate field
(Zou–Yoshino 2017), although uncorrelated components are statistically independent
only under multivariate normality.
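The extraction step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the study: the data are random toy values, and the helper name `pca_from_correlation` is hypothetical.

```python
import numpy as np

def pca_from_correlation(R):
    """Spectral decomposition of a symmetric correlation matrix R.

    Returns the eigenvalues in descending order and the matching
    eigenvectors as columns.
    """
    eigvals, eigvecs = np.linalg.eigh(R)   # eigh handles symmetric matrices; ascending order
    order = np.argsort(eigvals)[::-1]      # re-sort so component 1 has the largest eigenvalue
    return eigvals[order], eigvecs[:, order]

# Hypothetical toy data: 50 cases (rows) by 4 variables (columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += 0.8 * X[:, 0]                   # make two variables partly redundant

R = np.corrcoef(X, rowvar=False)           # Pearson correlation matrix (4 x 4)
eigvals, eigvecs = pca_from_correlation(R)

explained = eigvals / eigvals.sum()        # share of total variance per component
print(np.round(explained, 3))
```

Because the trace of a correlation matrix equals the number of variables, the eigenvalues sum to 4 here, and the shares in `explained` decrease from the first to the last component.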
Each component accounts for a variable proportion of variance in the original
dataset. More specifically, components are extracted iteratively, in a way that the first
component extracts the largest proportion of the total variance; the next component
accounts for the second largest proportion of variance and so on. In other words, the
second component is calibrated by extracting from the proportion of the variance
remaining after the extraction of the first component, and so forth. Thus, the
proportion of variance extracted by each component from 1 to n follows a
mathematical function that is decreasing monotonically (Jolliffe 2002). Although
components are derived to maximise the proportion of explained variance, the first
component can seldom capture the entire variance of the dataset (Hosseini–Kaneko
2011). At the same time, new (significant) variables are usually smaller in number than
the inputs and account for the inherent variation of the data to the maximum possible
extent. Choosing the number of components to analyse is notably subjective, although
several quantitative approaches exist for making this choice operationally.
The components retained in the final analysis can be selected based on
quantitative – although subjective – criteria. Examples of these criteria include
thresholds (to include or to exclude) based on the components’ absolute eigenvalues
(λ > 1) or the relative proportion of variance explained by each component (σ² >
10%). Other criteria can be set in this context (Salvati–Zitti 2009). For instance,
PCA’s ability to summarise large amounts of information in a few relevant
components – taking the explained variance as constant – is intrinsically reflected in
the ratio of components selected for analysis to the total inputs n (e.g. Colantoni et
al. 2016). A lower value of this ratio indicates a higher ability to summarise
information associated with a large dataset into a few meaningful components
(Salvati 2014). In any case, the number of selected components for final analysis
should never equal the number of original variables; instead, it should be much
lower (e.g. Zambon et al. 2017).
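The retention criteria mentioned above (λ > 1, a minimum share of variance, and the ratio of retained components to inputs) can be expressed as a short sketch. The eigenvalues below are hypothetical and the function name `retention_summary` is an assumption made for illustration only.

```python
import numpy as np

def retention_summary(eigvals, min_eigenvalue=1.0, min_share=0.10):
    """Apply two common retention thresholds to a set of PCA eigenvalues."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    share = eigvals / eigvals.sum()                    # proportion of variance per component
    n_kaiser = int((eigvals > min_eigenvalue).sum())   # components with eigenvalue > 1
    n_share = int((share > min_share).sum())           # components explaining > 10% each
    return {
        "n_kaiser": n_kaiser,
        "n_share": n_share,
        "ratio": n_kaiser / eigvals.size,              # retained components / total inputs
        "cumulated": float(share[:n_kaiser].sum()),    # variance kept under the eigenvalue rule
    }

# Hypothetical eigenvalues of a 6-variable correlation matrix (they sum to 6)
example_eigvals = [2.7, 1.5, 0.8, 0.5, 0.3, 0.2]
summary = retention_summary(example_eigvals)
print(summary)
```

In this toy case the two thresholds disagree (two components under the eigenvalue rule, three under the 10% rule), which illustrates why the choice remains partly subjective.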
Based on the coordinates of the selected components, a new space (the
components’ space) is generated, onto which variables and cases can be projected.
The results of a principal component extraction are, therefore, discussed in terms of
component scores (i.e. the transformed variable values corresponding to a particular
data point) and component loadings (i.e. the weight by which each standardised
variable should be multiplied to obtain the component score). A plot illustrating the
relationship between variables and cases, called a biplot, is considered the most
relevant graphical tool in a PCA; proximity between variables, between cases, or
between variables and cases indicates a strong association. A biplot allows inspection of
similarity patterns within the variables (loadings) and the cases (scores), and the
relationships between variables and cases, providing powerful information for
analysing apparent and latent patterns in the dataset (Cano-Orellana–Delgado-Cabeza
2015).
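A minimal sketch of how scores follow from standardised inputs and component weights is given below. It follows the definition used in the text (loadings as the weights applied to standardised variables); note that some software instead rescales these weights by the square root of the eigenvalue. The data are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 30 cases by 3 variables
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 2]                   # introduce redundancy between inputs

# Standardise each variable (zero mean, unit variance)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = (Z.T @ Z) / (Z.shape[0] - 1)                    # Pearson correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, weights = eigvals[order], eigvecs[:, order]  # weights: one column per component

scores = Z @ weights                                # component scores, one row per case
score_cov = np.cov(scores, rowvar=False)            # diagonal, with the eigenvalues on it
print(np.round(score_cov, 3))
```

The covariance matrix of the scores is diagonal, which is the numerical counterpart of the components being uncorrelated; plotting the first two columns of `scores` against the first two columns of `weights` would give the two layers of a biplot.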
PCA generalisation
In the first step of a PCA, described in the previous section, the original data matrix
is transformed by performing a spectral decomposition of the variance/covariance
(or correlation) matrix. The use of a correlation matrix allows removing the scaling
effect associated with covariance matrices. In a standard PCA, correlation matrices
are obtained through the computation of the pairwise Pearson product-moment
correlation coefficients. Pearson correlation coefficients range between –1 and +1
and are extensively adopted for quantifying pairwise linear correlations, where –1
and +1 indicate a perfectly negative and a perfectly positive linear correlation,
respectively, with 0 indicating no evidence for a linear correlation (Jolliffe–Cadima
2016). Because Pearson (parametric) correlation coefficients are sensitive to linear
correlations only, PCA has an intrinsic limitation that restricts its
applicability to specific datasets with mostly linear relationships among the inputs.
However, working with variables displaying significant deviations from linearity is
relatively common in regional studies (Salvati 2014). Based on the results and
simulations of earlier studies (Mérigot et al. 2007, 2017, Ciommi et al. 2019), a
generalisation of PCA aimed at exploring non-linear multivariate relationships
among inputs is illustrated below, using non-parametric correlation coefficients.
In the present study, PCAs were performed by decomposing correlation
matrices of both parametric (Pearson) and non-parametric (Spearman and Kendall)
coefficients. Pairwise Spearman’s rank correlation coefficients and Kendall
concordance coefficients assume that inputs were measured on – at least – an
ordinal (rank-order) scale, that is, that the individual observations can be ranked into
an ordered series (Kendall 1970). More specifically, Spearman's rank correlation
coefficient is a non-parametric measure of the statistical dependence between two
variables, quantifying how well the relationship between two variables can be
described using a monotonic function. If there are no repeated data values, a perfect
Spearman correlation of +1 or −1 occurs when each of the variables is a perfect
monotonic function of the other variable (Zambon et al. 2017).
Spearman correlation is less sensitive than Pearson correlation to strong outliers
in the tails of both samples (Smarandache 2009). This is because Spearman's
coefficient limits the outlier to the value of its rank. Kendall’s rank correlation
coefficient is a non-parametric statistic, measuring the association between two
variables using the same numerical scale as Pearson’s or Spearman’s coefficients (i.e.
ranging between –1 and +1, with 0 indicating the absence of rank correlation). Like
Spearman’s coefficient, this measure is indicative of a rank correlation, that is, the
similarity of the orderings of the data when ranked by each of the variables. Neither
Spearman nor Kendall correlation relies on any assumption about the (marginal and
joint) distributions of the inputs (Kendall 1970).
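For reference, the two rank coefficients described above can be written compactly. The notation is an assumption introduced here for illustration: j cases indexed by k, ranks r_k and s_k of the k-th case on the two variables, d_k = r_k − s_k, and C and D for the numbers of concordant and discordant pairs of cases. The simplified forms hold when there are no ties.

```latex
% Spearman's coefficient: Pearson correlation computed on the ranks
\rho_S = \frac{\sum_{k=1}^{j} (r_k - \bar{r})(s_k - \bar{s})}
              {\sqrt{\sum_{k=1}^{j} (r_k - \bar{r})^2 \; \sum_{k=1}^{j} (s_k - \bar{s})^2}}
       = 1 - \frac{6 \sum_{k=1}^{j} d_k^2}{j\,(j^2 - 1)}
       \quad \text{(second form: no ties)}

% Kendall's coefficient: balance of concordant and discordant pairs
\tau = \frac{C - D}{j\,(j - 1)/2}
       \quad \text{(no ties)}
```

Both expressions depend on the data only through their ranks, which is why an extreme value contributes no more than its rank position.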
In contrast to the Pearson correlation coefficient, this means that neither of the two
aforementioned non-parametric correlation coefficients is sensitive to deviations
from (i) normality of the individual variables and (ii) linearity between a pair of
variables. Thus, they extend the analysis of correlation regimes beyond first-order
(linear) correlations to monotonic relationships of higher order (e.g. correlations
based on square, cubic, or higher-order polynomial relationships). Based on these
premises, a generalised approach was proposed, and three separate analyses were
performed on the same data set: (i) the standard procedure of decomposing a Pearson
correlation matrix, (ii) the generalised procedure of decomposing a Spearman
correlation matrix, and (iii) the generalised procedure of decomposing a Kendall
correlation matrix. The empirical findings of the three analyses were compared using
traditional diagnostics (percentage of explained variance and statistical distribution of
component loadings).
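The comparison of the three decompositions can be sketched as follows. This is an illustrative NumPy-only sketch on hypothetical data, not the software used in the study; the helper names are assumptions, and the Kendall matrix uses the tau-b form, which coincides with the simple form when there are no ties.

```python
import numpy as np

def average_ranks(x):
    """Rank a 1-D array (1 = smallest), giving tied values their mean rank."""
    x = np.asarray(x, dtype=float)
    ranks = np.empty(x.size, dtype=float)
    ranks[np.argsort(x, kind="mergesort")] = np.arange(1, x.size + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_matrix(X):
    """Spearman matrix: Pearson correlation computed on column-wise ranks."""
    ranked = np.column_stack([average_ranks(col) for col in X.T])
    return np.corrcoef(ranked, rowvar=False)

def kendall_matrix(X):
    """Kendall tau-b matrix from pairwise sign concordance between cases."""
    signs = [np.sign(col[:, None] - col[None, :]) for col in X.T]
    p = len(signs)
    T = np.eye(p)
    for a in range(p):
        for b in range(a + 1, p):
            num = (signs[a] * signs[b]).sum()
            den = np.sqrt((signs[a] ** 2).sum() * (signs[b] ** 2).sum())
            T[a, b] = T[b, a] = num / den
    return T

def cumulated_variance(R):
    """Cumulated per cent variance from the eigenvalues of a correlation matrix."""
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
    return 100 * np.cumsum(eigvals) / eigvals.sum()

# Hypothetical data: 60 cases, 4 variables, two of them monotonic but non-linear
rng = np.random.default_rng(42)
base = rng.normal(size=60)
X = np.column_stack([
    base,
    np.exp(base) + 0.1 * rng.normal(size=60),
    base ** 3 + 0.1 * rng.normal(size=60),
    rng.normal(size=60),
])

matrices = {
    "Pearson": np.corrcoef(X, rowvar=False),
    "Spearman": spearman_matrix(X),
    "Kendall": kendall_matrix(X),
}
for name, R in matrices.items():
    print(name, np.round(cumulated_variance(R), 1))
```

Only the correlation matrix changes between the three runs; the spectral decomposition and the diagnostics applied afterwards are identical, which is what makes the outcomes directly comparable.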
PCA diagnostics
The 'quality of representation' is an important measure of the variance accounted
for by a given set of principal components. In the present study, components with
an eigenvalue > 1 were selected and further analysed. To assess the quality of PCA
outputs, the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (testing
whether the partial correlations among the variables are small), and Bartlett’s test of
sphericity (testing whether the correlation matrix is an identity matrix), were
adopted to verify whether the ‘representation model’ based on the selected
components was appropriate for analysing the original dataset (Salvati–Zitti 2009).
The loadings, indicating the association between a given input and a given
component, were analysed to provide supplementary information on the overall
quality of representation for each PCA. Finally, the component scores were
computed to evaluate the relative contribution of each case to the variance of a
given component. The higher the contribution of a case, the heavier its weight on
the component is (Jolliffe 2002).
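The two diagnostics can be computed directly from a correlation matrix. The sketch below uses the textbook formulas (Bartlett's chi-square statistic from the log-determinant, and KMO from squared correlations and squared partial correlations); the 3 × 3 matrix is hypothetical, and applying the chi-square reference distribution to rank-based matrices should be regarded as an approximation.

```python
import numpy as np

def bartlett_sphericity(R, n_cases):
    """Bartlett's test statistic and degrees of freedom for H0: R is an identity matrix."""
    n_vars = R.shape[0]
    _, logdet = np.linalg.slogdet(R)                   # log-determinant of R
    statistic = -(n_cases - 1 - (2 * n_vars + 5) / 6) * logdet
    dof = n_vars * (n_vars - 1) // 2
    return statistic, dof

def kmo(R):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    inv = np.linalg.inv(R)
    scale = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(scale, scale)            # partial correlations (off-diagonal)
    off = ~np.eye(R.shape[0], dtype=bool)              # mask selecting off-diagonal cells
    r2 = (R[off] ** 2).sum()
    p2 = (partial[off] ** 2).sum()
    return r2 / (r2 + p2)

# Hypothetical correlation matrix for three inputs, with 115 cases as in the study
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
statistic, dof = bartlett_sphericity(R, n_cases=115)
print(round(statistic, 1), dof, round(kmo(R), 3))
```

KMO approaches 1 when partial correlations are small relative to the raw correlations, while a large Bartlett statistic indicates that the matrix departs clearly from an identity matrix.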
Empirical application
An input matrix containing the per cent share of 19 land-use classes in the total
landscape of 115 spatial domains (local municipalities) encompassing the entire extent
(3025 km²) of a metropolitan region in Southern Europe (Athens, Greece) was
obtained from a spatial overlay of a shapefile of municipal boundaries (source:
Hellenic Statistical Authority) and a shapefile map developed as part of the Urban
Atlas (UA) GMES (Global Monitoring for Environment and Security)-Copernicus
initiative (Figure 1, Figure 2). The UA service offers high-resolution (1:10,000)
land-use maps of more than 300 urban areas in Europe, representing a typical big-data,
remotely sensed source of information for landscape analysis. These maps adopt an
articulated classification of land-use (Table 1) based on the Corine Land Cover
nomenclature system, with a specific focus on urban areas. The study area represents a
monocentric city with a distinctive urban–rural gradient (Colantoni et al. 2016).
Figure 1
A simplified map of the study area, 2012
[Map; legend: built-up areas, agricultural areas, forest and other natural areas]
Note: Based on the Urban Atlas database, illustrating the spatial distribution of three basic land-use classes.
Figure 2
A map of the administrative boundaries of the local municipalities and
the average elevation (m), 2012
[Map; legend, average elevation (m): <250, 250–500, >500]
Note: Indicating the location of downtown Athens in the metropolitan region (star).
Geo-spatial databases derived from such maps provide an articulated
representation of local territories based on specific attributes (e.g. high-resolution
land-use surveys with heterogeneous classification systems for urban and rural areas,
available at different temporal and spatial scales, with variable coverage among
cities). They represent a great opportunity to test the effectiveness of PCA
generalisations when analysing non-linear relationships among inputs. Based on the
spatial overlay of the two maps described above, per cent class area in the total
landscape was computed at each municipality of the study region using the ‘zonal
statistics’ procedure, implemented within a Geographic Information System
(ArcGIS 10 release). To avoid complete multi-collinearity (i.e. linear dependence)
among the inputs, water bodies (code '50000') were not considered in the analysis.
The input matrix subjected to PCA finally comprised 19 columns (land-use classes)
and 115 rows (municipal units) with no missing data.
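The reason for dropping one class can be checked numerically: when the shares of all classes sum to 100% in every unit, the correlation matrix is singular. The sketch below uses hypothetical random areas for five classes in 115 units; it is not the study's data.

```python
import numpy as np

# Hypothetical areas for 5 land-use classes in 115 spatial units
rng = np.random.default_rng(3)
areas = rng.random((115, 5))
shares = 100 * areas / areas.sum(axis=1, keepdims=True)   # per cent shares; rows sum to 100

R_full = np.corrcoef(shares, rowvar=False)                # all 5 classes
R_reduced = np.corrcoef(shares[:, :-1], rowvar=False)     # last class left out

# An explicit tolerance is used so that rounding noise is not counted as rank
rank_full = np.linalg.matrix_rank(R_full, tol=1e-10)
rank_reduced = np.linalg.matrix_rank(R_reduced, tol=1e-10)
print(rank_full, rank_reduced)
```

With all five classes the matrix has rank 4 (one class is a linear function of the others), whereas the reduced 4 × 4 matrix has full rank, so excluding one class removes the exact linear dependence.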
Table 1
Land-use classification (based on Urban Atlas [UA] nomenclature) adopted in
the present study, 2012

UA code   Description                                                    Code used in this study
11100     Continuous Urban Fabric (sealed soil > 80%)                    1110
11210     Discontinuous Dense Urban Fabric (50%–80%)                     1121
11220     Discontinuous Medium-Density Urban Fabric (30%–50%)            1122
11230     Discontinuous Low-Density Urban Fabric (10%–30%)               1123
11240     Discontinuous Very-Low-Density Urban Fabric (< 10%)            1124
11300     Isolated human structures                                      1130
12100     Industrial, commercial, public, military, and private units    121
12210     Fast transit roads and associated land                         1221
12220     Other roads and associated land                                1222
12230     Railways and associated land                                   1223
12300     Port areas                                                     1230
12400     Airports                                                       1240
13100     Mineral extraction and dump sites                              1310
13300     Construction sites                                             1330
13400     Land without current use                                       1340
14100     Green urban areas                                              1410
14200     Sports and leisure facilities                                  1420
20000     Agricultural areas a)                                          2
30000     Forests a)                                                     3
50000     Water bodies                                                   Not considered

a) Sub-classes in both of these fields were combined.
Results
The components with eigenvalues > 1 were selected, and Table 2 reports the basic
diagnostics of the generalised PCA runs on the input dataset using Pearson,
Spearman, or Kendall metrics. While the first component extracted the highest
variance when using a Pearson correlation matrix, the cumulated variance from
component 2 onwards when using a Spearman’s correlation matrix was
systematically higher than that observed when using the other two correlation
matrices. Moreover, the statistics used to evaluate the quality of representation
(KMO and Bartlett's tests) indicate that the PCA using Spearman’s correlation
coefficients achieved the best data representation of the three (KMO = 0.838;
Bartlett's test = 1771).
Table 2
Diagnostics of generalised PCA based on spectral decomposition of different
correlation matrices, by correlation measure, 2012

              Cumulated variance extracted by the j-th component, %
Correlation   1st    2nd    3rd    4th    5th    KMO test   Bartlett's test
Pearson       35.6   49.2   60.0   67.2   73.8   0.676      1712
Spearman      32.8   51.7   62.9   70.2   76.2   0.838      1771
Kendall       26.1   41.4   51.4   58.6   64.6   0.790       819
Figure 3
Difference in the per cent variance extracted by component for Pearson vs
Spearman and Kendall vs Pearson correlation matrices, 2012
Figure 3 illustrates the absolute difference in the percentage of variance extracted
by component for Pearson vs Spearman and Kendall vs Pearson correlation
matrices. Assuming the first five components as representative of the input data
matrix (see above), the difference between Spearman and Pearson matrices indicates
that use of Spearman coefficients allows for (more or less) intense gains in the
proportion of explained variance up to the fifth component. By contrast, the PCA using
a Kendall correlation matrix extracted less variance up to the fifth component
than the PCA using a Pearson correlation matrix.
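The differences discussed above can be recomputed from the cumulated percentages reported in Table 2. The sketch below only rearranges those published values; whether Figure 3 plots cumulated or per-component differences is not stated here, so both are derived.

```python
import numpy as np

# Cumulated variance (%) for components 1 to 5, transcribed from Table 2
cumulated = {
    "Pearson":  np.array([35.6, 49.2, 60.0, 67.2, 73.8]),
    "Spearman": np.array([32.8, 51.7, 62.9, 70.2, 76.2]),
    "Kendall":  np.array([26.1, 41.4, 51.4, 58.6, 64.6]),
}

# Variance extracted by each single component (first differences of the cumulated series)
per_component = {name: np.diff(values, prepend=0.0) for name, values in cumulated.items()}

# Differences in cumulated variance relative to the Pearson-based PCA
gain_spearman = cumulated["Spearman"] - cumulated["Pearson"]
gain_kendall = cumulated["Kendall"] - cumulated["Pearson"]

print("Spearman - Pearson:", np.round(gain_spearman, 1))
print("Kendall  - Pearson:", np.round(gain_kendall, 1))
```

In cumulated terms, the Spearman-based PCA trails the Pearson-based PCA on the first component and leads from the second component onwards, while the Kendall-based PCA is lower at every step.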
Figure 4
Loading plot (component 1 vs 2) by correlation matrix type, 2012
[Two panels: Pearson correlation matrix; Spearman correlation matrix. Plot annotations: Rural, Urban]
Figure 4 compares the inputs’ loadings along components 1 and 2. On average,
the loadings on the principal components were larger when using Spearman’s
correlation coefficients. Components 1 and 2 clearly discriminate continuous urban
fabric (‘1110’) from rural areas (both agricultural, ‘2’, and forests, ‘3’), the typical
density gradient observed in the study area (Colantoni et al. 2016, Pili et al. 2017).
The results of the PCA using Pearson correlation coefficients are less clear as far as
the monocentric structure of the city and the consequent urban-rural polarisation in
landscape composition are concerned (Zambon et al. 2017). This result confirms the high quality
of representation achieved by the two components extracted with the PCA based
on a Spearman correlation matrix. Figure 5 compares the distribution of scores on
the selected components extracted using Pearson and Spearman correlation
matrices. The two score plots are substantially similar, with the Spearman plot
showing a slightly higher capability to discriminate cases along the two components.
Figure 5
Score plot (component 1 vs 2) by correlation matrix type, 2012
[Two panels: Pearson correlation matrix; Spearman correlation matrix]
Discussion
Structure (e.g. the spatial arrangement of individual elements in a given landscape),
functions (e.g. how ecological processes operate within a given landscape structure),
and dynamics of change (e.g. disturbance and recovery) are characteristics of a given
landscape. They result from the continuous interplay of physical, biological, and
socio-cultural processes acting simultaneously over broad regions at different
spatio-temporal scales (Bollen et al. 2009, Salvati 2014, Colantoni et al. 2018). The
exploratory analysis of landscape data that includes accounting for non-linear
relationships among inputs is an increasingly challenging issue (Salvati et al. 2019).
With the evolution of remote sensing techniques, the supply of big data to monitor land-use
changes has been rapidly increasing in volume, dimension, and completeness, at
both the global and regional scales (Guido et al. 2017, Todde et al. 2017, Jolliffe–
Cadima 2016). Because of their intrinsic complexity, datasets derived from such
sources represent a pivotal opportunity to verify the informative power,
computational effectiveness, and simplicity of non-parametric generalisations of
more traditional techniques (Li–Wu 2004). Non-parametric approaches have been
regarded as promising tools in the analysis of large data sets characterised by
complex relationships among inputs (Salvati et al. 2018).
PCA has been frequently adopted as an exploratory framework for analysing
highly correlated data sets (Vyas–Kumaranayake 2006). PCA was originally
proposed as a linear technique, whose intrinsic limit was the ineffective analysis
of non-linear systems typically observed in regional studies (Zou–Yoshino 2017).
Several approaches extending linear PCA to the analysis of non-linear datasets have
been proposed, including new singular value decomposition techniques, (simple or
recursive) Kernel PCA, and the adoption of genetic programming algorithms for
achieving the objectives of non-linear PCA (Zhang et al. 2012). While giving a high
degree of structural parsimony, principal curves contribute to the generalisation of
PCA with a natural geometric framework (Linting et al. 2007, Lu et al. 2011,
Jolliffe–Cadima 2016).
The present study proposes a simplified evaluation of landscape indicators using
a generalised PCA based on the decomposition of parametric (Pearson) and
non-parametric (Spearman and Kendall) correlation matrices. In our case, decomposing
a Spearman’s correlation matrix gave the highest performance in terms of (i) the
explained variance, (ii) standard diagnostics (KMO and Bartlett’s tests), and (iii)
component loadings and scores. By increasing the overall representation quality, use
of the Spearman correlation matrix provides a refined picture of the latent
relationships in a data set and can be routinely applied in a multivariate analysis
(Salvati 2014). Moreover, standard PCA is particularly sensitive to outliers (Jolliffe
2002). Although outliers are difficult to identify in some contexts (Cano-Orellana–
Delgado-Cabeza 2015), removing outliers before the computations is a common
practice in several techniques, including robust PCA using low-rank and sparse
matrices (Johnson–Wichern 2002). Contributing to resolving this issue, the
Spearman correlation coefficient is also less sensitive to outliers (Hubert et al. 2005).
Although it reduces sensitivity to outliers, it should be mentioned that the Spearman
(and also Kendall) correlation coefficient does not solve all non-linearity issues, such
as non-monotonicity (Han–Liu 2012). These problems need further examination
that better integrates practitioners’ analysis needs with the objective of efficiency
and robustness of statistical theory (Bansal et al. 2010). Refined and new approaches
addressing complex, non-linear, and fragmented data patterns, typical of regional
science, are a necessary pre-requisite to reliable, stable, and informed outcomes of
any exploratory multivariate statistical method (Schwedhelm et al. 2018).
Concluding remarks
The results of this study encourage the integrated use of PCA using both Pearson
and Spearman correlation matrices. The direct comparison of the results derived
from the two types of analyses can provide effective information on more complex
relationships among inputs, contributing to non-linear pattern recognition. The
increasing demand for quantitative approaches, high-resolution data, and indicators
suitable to inform multi-disciplinary studies at refined spatio-temporal scales
stimulates research aimed at summarising trends and extracting latent patterns from
large datasets in regional science. These approaches should concentrate on a
comparative analysis of different tools, improving the representation quality of big
data in multi-dimensional space.
Regional Statistics, Vol. 11. No. 1. 2021
Online first Egidi–Edwards–Cividino–Gambella–Salvati 1–17; DOI: 10.15196/ RS110105
REFERENCES
BANSAL, S.–POURBOHLOUL, B.–HUPERT, N.–GRENFELL, B.–MEYERS, L. A. (2010): The
shifting demographic landscape of pandemic influenza PLoS One 5 (2): e9360.
https://doi.org/10.1371/journal.pone.0009360
BOLLEN, J.–VAN DE SOMPEL, H.–HAGBERG, A.–CHUTE, R. (2009): A principal component
analysis of 39 scientific impact measures PLoS One 4 (6): e6022.
https://doi.org/10.1371/journal.pone.0006022
CANO-ORELLANA, A.–DELGADO-CABEZA, M. (2015): Local ecological footprint using
Principal Component Analysis: A case study of localities in Andalusia (Spain)
Ecological Indicators 57: 573–579. https://doi.org/10.1016/j.ecolind.2015.03.014
CARLUCCI, M.–CHELLI, F. M.–SALVATI, L. (2018): Toward a new cycle: Short-term
population dynamics, gentrification, and re-urbanization of Milan (Italy)
Sustainability 10 (9): 3014. https://doi.org/10.3390/su10093014
CHELLI, F.–GIGLIARANO, C.–MATTIOLI, E. (2009): The impact of inflation on
heterogeneous groups of households: An application to Italy Economics Bulletin
29 (2): 1276–1295.
CHELLI, F.–ROSTI, L. (2002): Age and gender differences in Italian workers’ mobility
International Journal of Manpower 23 (4): 313–325.
https://doi.org/10.1108/01437720210439656
CHELLI, F. M.–CIOMMI, M.–EMILI, A.–GIGLIARANO, C.–TARALLI, S. (2016): Assessing the
Equitable and Sustainable Well-Being of the Italian Provinces International Journal
of Uncertainty, Fuzziness and Knowledge-Based Systems 24 (1): 39–62.
https://doi.org/10.1142/S0218488516400031
CIOMMI, M.–CHELLI, F. M.–CARLUCCI, M.–SALVATI, L. (2018): Urban growth and
demographic dynamics in southern Europe: Toward a new statistical approach to
regional science Sustainability 10 (8): 2765. https://doi.org/10.3390/su10082765
CIOMMI, M.–CHELLI, F. M.–SALVATI, L. (2019): Integrating parametric and non-parametric
multivariate analysis of urban growth and commuting patterns in a European
metropolitan area Quality and Quantity 53 (2): 957–979.
https://doi.org/10.1007/s11135-018-0798-2
CIOMMI, M.–GIGLIARANO, C.–EMILI, A.–TARALLI, S.–CHELLI, F. M. (2017): A new class of
composite indicators for measuring well-being at the local level: An application
to the Equitable and Sustainable Well-being (BES) of the Italian Provinces
Ecological Indicators 76: 281–296. https://doi.org/10.1016/j.ecolind.2016.12.050
COLANTONI, A.–GRIGORIADIS, E.–SATERIANO, A.–VENANZONI, G.–SALVATI, L. (2016):
Cities as selective land predators? A lesson on urban growth, deregulated
planning and sprawl containment Science of the Total Environment 545: 329–339.
https://doi.org/10.1016/j.scitotenv.2015.11.170
COLANTONI, A.–MONARCA, D.–LAURENDI, V.–VILLARINI, M.–GAMBELLA, F.–CECCHINI,
M. (2018): Agriculture 8 (4): 47. https://doi.org/10.3390/agriculture8040047
GAERTNER, J-C.–MAIORANO, P.–MÉRIGOT, B.–COLLOCA, F.–POLITOU, C-Y.–GIL DE
SOLA, L.–BERTRAND, J. A.–MURENU, M.–DURBEC, J-P.–KALIANIOTIS, A.–
MANINI, A. (2013): Large-scale diversity of slope fishes: patterns inconsistency
between multiple diversity indices PLoS One 13 (1): e0191588.
https://doi.org/10.1371/journal.pone.0191588
GAERTNER, J. C.–MÉRIGOT, B.–RELINI, G.–BERTRAND, J. A.–POLITOU, C.-Y.–MAZOUNI,
N.–GIL DE SOLA, L.–KALLIANIOTIS, A.–CARPENTIERI, P.–MURENU, M.–
DURBEC, J. P.–VRGOC, N.–UNGARO, N. (2010): Reproducibility of the multi-component aspect of species diversity through different areas and scales: towards
the constitution of a shortlist of complementary indices for monitoring fish
diversity? Ecography 33 (6): 1123–1135.
https://doi.org/10.1111/j.1600-0587.2009.06259.x
GIGLIARANO, C.–CHELLI, F. M. (2016): Measuring inter-temporal intragenerational mobility:
an application to the Italian labour market Quality and Quantity 50 (1): 89–102.
https://doi.org/10.1007/s11135-014-0138-0
GUIDO, V.–MERCENARO, L.–GAMBELLA, F. (2017): Application of Proximal Sensing in
Viticulture: Comparison of Different Berry State Conditions Chemical Engineering
Transactions 58: 613–618. https://doi.org/10.3303/CET1758103
HAN, F.–LIU, H. (2012): Semiparametric principal component analysis In: JORDAN, M. I.–
LECUN, Y.–SOLLA, S. A. Advances in Neural Information Processing Systems Proceedings
of the First 12 Conferences pp. 171–179. The MIT Press, Cambridge, MA.
HOSSEINI, H. M.–KANEKO, S. (2011): Dynamic sustainability assessment of countries at the
macro level: A principal component analysis Ecological Indicators 11 (3): 811–823.
https://doi.org/10.1016/j.ecolind.2010.10.007
HUBERT, M.–ROUSSEEUW, P. J.–VANDEN BRANDEN, K. (2005): ROBPCA: a new approach
to robust principal component analysis Technometrics 47 (1): 64–79.
https://doi.org/10.1198/004017004000000563
INKOOM, J. N.–FRANK, S.–GREVE, K.–WALZ, U.–FÜRST, C. (2018): Suitability of different
landscape metrics for the assessments of patchy landscapes in West Africa
Ecological Indicators 85: 117–127. https://doi.org/10.1016/j.ecolind.2017.10.031
JOHNSON, R. A.–WICHERN, D. W. (2002): Applied Multivariate Statistical Analysis Prentice Hall,
New York.
JOLLIFFE, I. T. (2002): Principal Component Analysis. Springer Series in Statistics Springer, New York.
JOLLIFFE, I. T.–CADIMA, J. (2016): Principal component analysis: a review and recent
developments Philosophical Transactions of the Royal Society A: Mathematical, Physical and
Engineering Sciences 374 (2065): 20150202. https://doi.org/10.1098/rsta.2015.0202
KENDALL, M. G. (1970): Rank correlation methods Griffin, London.
KHATUN, T. (2009): Measuring environmental degradation by using principal component
analysis Environment Development and Sustainability 11: 439–457.
https://doi.org/10.1007/s10668-007-9123-2
LAMONICA, G. R.–CHELLI, F. M. (2018): The performance of non-survey techniques for
constructing sub-territorial input-output tables Papers in Regional Science 97 (4):
1169–1202. https://doi.org/10.1111/pirs.12297
LAMONICA, G. R.–RECCHIONI, M. C.–CHELLI, F. M.–SALVATI, L. (2020): The efficiency of
the cross-entropy method when estimating the technical coefficients of input–
output tables Spatial Economic Analysis 15 (1): 62–91.
https://doi.org/10.1080/17421772.2019.1615634
LI, H.–WU, J. (2004): Use and misuse of landscape indices Landscape Ecology 19: 389–399.
https://doi.org/10.1023/B:LAND.0000030441.15628.d6
LINTING, M.–MEULMAN, J. J.–GROENEN, P. J. F.–VAN DER KOOIJ, A. (2007): Nonlinear
principal components analysis: Introduction and application Journal of Psychological
Methods 12 (3): 336–358. https://doi.org/10.1037/1082-989X.12.3.336
LU, H.–PLATANIOTIS, K. N.–VENETSANOPOULOS, A. N. (2011): A Survey of Multilinear
Subspace Learning for Tensor Data Pattern Recognition 44 (7): 1540–1551.
https://doi.org/10.1016/j.patcog.2011.01.004
MÉRIGOT, B.–BERTRAND, J. A.–GAERTNER, J. C.–DURBEC, J. P.–MAZOUNI, N.–MANTÉ, C.
(2007): The multi-component structuration of the species diversity of groundfish
assemblages of the east coast of Corsica (Mediterranean Sea): Variation
according to the bathymetric strata Fisheries Research 88 (1–3): 120–132.
https://doi.org/10.1016/j.fishres.2007.08.003
MÉRIGOT, B.–FRÉDOU, F. L.–VIANA, A. P.–FERREIRA, B. P.–JUNIOR, E. D. N. C.–DA SILVA
JÚNIOR, C. B.–FRÉDOU, T. (2017): Fish assemblages in tropical estuaries of
northeast Brazil: A multi-component diversity approach Ocean & Coastal
Management 143: 175–183. https://doi.org/10.1016/j.ocecoaman.2016.08.004
PILI, S.–GRIGORIADIS, E.–CARLUCCI, M.–CLEMENTE, M.–SALVATI, L. (2017): Towards sustainable
growth? A multi-criteria assessment of (changing) urban forms Ecological
Indicators 76: 71–80.
PRIMPAS, I.–TSIRTSIS, G.–KARYDIS, M.–KOKKORIS, G. D. (2010): Principal component
analysis: Development of a multivariate index for assessing eutrophication
according to the European water framework directive Ecological Indicators 10 (2):
178–183. https://doi.org/10.1016/j.ecolind.2009.04.007
RAJESH, S.–JAIN, S.–SHARMA, P. (2018): Inherent vulnerability assessment of rural
households based on socio-economic indicators using categorical principal
component analysis: A case study of Kimsar region, Uttarakhand Ecological
Indicators 85: 93–104. https://doi.org/10.1016/j.ecolind.2017.10.014
ROSTI, L.–CHELLI, F. (2009): Self-employment among Italian female graduates Education and
Training 51 (7): 526–540. https://doi.org/10.1108/00400910910992736
ROSTI, L.–CHELLI, F. (2012): Higher education in non-standard wage contracts Education and
Training 54 (2–3): 142–151. https://doi.org/10.1108/00400911211210251
ROUSSEEUW, P. J.–LEROY, A. M. (2005): Robust regression and outlier detection (Vol. 589). Wiley,
London.
SALVATI, L. (2014): Agro-forest landscape and the ‘fringe’ city: A multivariate assessment of
land-use changes in a sprawling region and implications for planning Science of the
Total Environment 490: 715–723. https://doi.org/10.1016/j.scitotenv.2014.05.080
SALVATI, L. (2020): Density-dependent population growth in Southern Europe (1961–2011): A
non-parametric approach using smoothing splines Regional Statistics 10 (2): 27–41.
https://doi.org/10.15196/RS100201
SALVATI, L.–ZITTI, M. (2009): The environmental ‘risky’ region: identifying land degradation
processes through integration of socio-economic and ecological indicators in a
multivariate regionalization model Environmental Management 44 (5): 888–899.
https://doi.org/10.1007/s00267-009-9378-5
SALVATI, L.–GUANDALINI, A.–CARLUCCI, M.–CHELLI, F. M. (2017): An empirical assessment
of human development through remote sensing: Evidences from Italy Ecological
Indicators 78: 167–172. https://doi.org/10.1016/j.ecolind.2017.03.014
SALVATI, L.–FERRARA, A.–CHELLI, F. (2018): Long-term growth and metropolitan spatial
structures: an analysis of factors influencing urban patch size under different
economic cycles Geografisk Tidsskrift - Danish Journal of Geography 118 (1): 56–71.
https://doi.org/10.1080/00167223.2017.1386582
SALVATI, L.–CIOMMI, M. T.–SERRA, P.–CHELLI, F. M. (2019): Exploring the spatial structure
of housing prices under economic expansion and stagnation: The role of socio-demographic factors in metropolitan Rome, Italy Land Use Policy 81: 143–152.
https://doi.org/10.1016/j.landusepol.2018.10.030
SCHWEDHELM, C.–IQBAL, K.–KNÜPPEL, S.–SCHWINGSHACKL, L.–BOEING, H. (2018):
Contribution to the understanding of how principal component analysis–derived
dietary patterns emerge from habitual data on food consumption The American
Journal of Clinical Nutrition 107 (2): 227–235. https://doi.org/10.1093/ajcn/nqx027
SMARANDACHE, F. (2009): Alternatives to Pearson's and Spearman's correlation coefficients
International Journal of Statistics and Economics 3 (S09): 47–53.
TODDE, G.–CARIA, M.–GAMBELLA, F.–PAZZONA, A. (2017): Energy and carbon impact of
precision livestock farming technologies implementation in the milk chain: from
dairy farm to cheese factory Agriculture 7 (10): 79.
https://doi.org/10.3390/agriculture7100079
TÓTH, G.–CSOMÓS, G. (2016): Mapping the position of cities in corporate research and
development through a gravity model-based bidimensional regression analysis
Regional Statistics 6 (1): 217–220. https://doi.org/10.15196/RS06111
TRIPATHI, M.–SINGAL, S. K. (2019): Use of Principal Component Analysis for parameter
selection for development of a novel Water Quality Index: A case study of river
Ganga India Ecological Indicators 96 (1): 430–436.
https://doi.org/10.1016/j.ecolind.2018.09.025
VARGA, L.–TÓTH, G.–NÉDA, Z. (2016): An improved radiation model and its applicability
for understanding commuting patterns in Hungary Regional Statistics 6 (2): 27–38.
https://doi.org/10.15196/RS06202
VÁSQUEZ-CORREA, C. M.–LANIADO RODAS, H. (2019): A robust approach for principal
component analysis arXiv arXiv:1903.00093
VYAS, S.–KUMARANAYAKE, L. (2006): Constructing socio-economic status indices: how to
use principal components analysis Health Policy and Planning 21 (6): 459–468.
https://doi.org/10.1093/heapol/czl029
ZAMBON, I.–SERRA, P.–SAURI, D.–CARLUCCI, M.–SALVATI, L. (2017): Beyond the
‘Mediterranean city’: Socioeconomic disparities and urban sprawl in three
Southern European cities Geografiska Annaler: Series B, Human Geography 99 (3):
319–337. https://doi.org/10.1080/04353684.2017.1294857
ZHANG, Y.–LI, S.–TENG, Y. (2012): Dynamic processes monitoring using recursive kernel
principal component analysis Chemical Engineering Science 72: 78–86.
https://doi.org/10.1016/j.ces.2011.12.026
ZOU, T.–YOSHINO, K. (2017): Environmental vulnerability evaluation using a spatial
principal components approach in the Daxing’anling region, China Ecological
Indicators 78: 405–415. https://doi.org/10.1016/j.ecolind.2017.03.039