# Validation of Hierarchical Climate Regionalization

`validClimR` computes indices for cluster validation, and an objective tree cut for the `regional` linkage clustering method.

## Usage

```
validClimR(y = NULL, k = NULL, minSize = 1, alpha = 0.05, verbose = TRUE,
plot = FALSE, colPalette = NULL, pch = 15, cex = 1)
```

## Arguments

- y
a dendrogram tree produced by `HiClimR`.

- k
`NULL` or an integer `k > 1` for the number of regions/clusters. `k = NULL` is supported only for the `regional` linkage method, in which case the "optimal" number of regions is selected at a user-specified significance level `alpha`. The other methods require the number of clusters `k` to be specified, since they are not based on inter-cluster correlation; if `k = NULL` is used with any method other than `regional` linkage, `validClimR` will be aborted. To cut the tree objectively for the other methods, one can call `validClimR` at different numbers of clusters and compare the inter-cluster correlations, although covering the entire merging history could be computationally expensive for a large number of spatial elements.

- minSize
minimum cluster size. The `regional` linkage method tends to isolate noisy data in small clusters. `minSize` can be used to exclude these very small clusters from the `statSum` statistical summary, because they are most likely noisy data that need to be checked in a quality control step. The analysis may then be repeated.

- alpha
significance level: the default is `alpha = 0.05` for a 95% confidence level.

- verbose
logical to print processing information if `verbose = TRUE`.

- plot
logical to call the plotting method if `plot = TRUE`.

- colPalette
a color palette or a list of colors such as that generated by `rainbow`, `heat.colors`, `topo.colors`, `terrain.colors` or similar functions.

- pch
either an integer specifying a symbol or a single character to be used as the default in plotting points. See `points` for possible values.

- cex
a numerical value giving the amount by which plotting symbols should be magnified relative to the default (`cex = 1`).
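The role of `alpha` in the objective tree cut can be illustrated by computing the smallest correlation that is statistically significant for a given sample size. The following is a minimal base R sketch, not HiClimR's own code: it assumes a two-sided t-test on a Pearson correlation with `n - 2` degrees of freedom, and the helper `critCor` and the sample size `n = 41` are hypothetical.

```r
# Hypothetical helper (not part of HiClimR): smallest |r| that is
# significant at level alpha for n paired observations, assuming a
# two-sided t-test with n - 2 degrees of freedom.
critCor <- function(n, alpha = 0.05) {
  dof <- n - 2
  tCrit <- qt(1 - alpha / 2, df = dof)
  # Invert t = r * sqrt(dof) / sqrt(1 - r^2) for r
  tCrit / sqrt(tCrit^2 + dof)
}

# A stricter significance level requires a larger correlation
rCrit05 <- critCor(n = 41, alpha = 0.05)
rCrit01 <- critCor(n = 41, alpha = 0.01)
```

Under this assumption, lowering `alpha` raises the correlation threshold, so fewer cluster pairs count as significantly correlated.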

## Value

An object of class `HiClimR` which produces indices for validating the tree produced by the clustering process. The object is a list with the following components:

- cutLevel
the minimum significant correlation used for objective tree cut together with the corresponding confidence level.

- clustMean
the cluster means which are the region's mean timeseries for all selected regions.

- clustSize
cluster sizes for all selected regions.

- clustFlag
a flag (`0` or `1`) indicating whether the cluster is used in the validation indices (`interCor`, `intraCor`, `diffCor`, and `statSum`), based on the `minSize` minimum cluster size. If `clustFlag = 0`, the cluster has been excluded because its size is less than `minSize`. The sum of the `clustFlag` elements is the selected number of clusters.

- interCor
inter-cluster correlations for all selected regions, that is, the correlations between cluster means. The maximum inter-cluster correlation is a measure of separation or contiguity, and it is used for the objective tree cut (to find the "optimal" number of clusters).

- intraCor
intra-cluster correlations for all selected regions, that is, the correlations between the mean of each cluster and its members. The average intra-cluster correlation is a weighted average over all clusters, and it is a measure of homogeneity.

- diffCor
difference between intra-cluster correlation and maximum inter-cluster correlation for all selected regions.

- statSum
overall statistical summary for `interCor`, `intraCor`, and `diffCor`.

- region
ordered regions vector of size `N` (the number of spatial elements) for the selected number of clusters, after excluding the small clusters defined by the `minSize` argument.

- regionID
ordered region ID vector whose length equals the selected number of clusters, after excluding the small clusters defined by the `minSize` argument. It helps in mapping ordered regions to their actual names before ordering. Only the `region` component uses ordered IDs, while the other components use the names used during the clustering process.
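The relationship between the components above can be illustrated on synthetic data. The following is a minimal base R sketch, not HiClimR's implementation: it assumes the rows of `x` are spatial elements and the columns are time steps, uses average linkage from `hclust` as a stand-in for the package's linkage methods, and borrows the component names only for readability.

```r
set.seed(42)
nTime <- 60                                   # number of time steps (columns)
signals <- matrix(rnorm(3 * nTime), nrow = 3) # three independent synthetic signals
truth <- rep(1:3, each = 5)                   # five spatial elements per signal
x <- signals[truth, ] + matrix(rnorm(15 * nTime, sd = 0.5), nrow = 15)

# Stand-in clustering: average linkage on correlation distance
tree <- hclust(as.dist(1 - cor(t(x))), method = "average")
region <- cutree(tree, k = 3)
ids <- sort(unique(region))

# Cluster sizes and cluster means (one mean time series per cluster)
clustSize <- as.vector(table(region))
clustMean <- t(sapply(ids, function(i) colMeans(x[region == i, , drop = FALSE])))

# Maximum correlation of each cluster mean with any other cluster mean
interMat <- cor(t(clustMean))
diag(interMat) <- NA
interCor <- apply(interMat, 1, max, na.rm = TRUE)

# Mean correlation between each cluster mean and its members
intraCor <- sapply(ids, function(i) {
  mean(cor(t(x[region == i, , drop = FALSE]), clustMean[i, ]))
})

# Difference between homogeneity and separation, per cluster
diffCor <- intraCor - interCor

# Size-weighted average intra-cluster correlation
avgIntraCor <- weighted.mean(intraCor, clustSize)
```

In this sketch a well-separated, homogeneous regionalization shows up as positive `diffCor` values for every cluster.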

## Details

The `validClimR` function is used for validation of a dendrogram tree produced by `HiClimR`, by computing detailed statistical information for each cluster about cluster means, sizes, intra- and inter-cluster correlations, and an overall summary. It requires the preprocessed data matrix and the tree from the `HiClimR` function as inputs. An optional parameter can be used to validate clustering for a selected number of clusters `k`. If `k = NULL`, the default, which is supported only for the `regional` linkage method, the tree is cut objectively to find the optimal number of clusters based on a user-specified significance level (the `alpha` parameter). In the `regional` linkage method, noisy spatial elements are isolated in very small clusters or as individuals, since they do not correlate well with any other elements. They can be excluded from the validation indices (`interCor`, `intraCor`, `diffCor`, and `statSum`), based on the `minSize` minimum cluster size. The excluded clusters are identified in the output of `validClimR` in `clustFlag`, which takes a value of `1` for selected clusters or `0` for excluded clusters. The sum of the `clustFlag` elements is the selected number of clusters. This should be followed by a quality control step before repeating the analysis.
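The `minSize` exclusion rule described above can be sketched in a few lines of base R. This is an illustration under stated assumptions rather than the package's code: the `region` vector of cluster assignments is hypothetical, and a cluster is flagged `1` when its size is at least `minSize`, matching the description that clusters smaller than `minSize` are excluded.

```r
# Hypothetical cluster assignments for 13 spatial elements in 4 clusters
region <- c(1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 4)
minSize <- 2

# Size of each cluster, in order of cluster label
clustSize <- as.vector(table(region))

# 1 = cluster kept for the validation indices, 0 = excluded as too small
clustFlag <- as.integer(clustSize >= minSize)

k <- length(clustFlag)   # number of clusters, including small ones
ks <- sum(clustFlag)     # selected number of clusters after exclusion
```

Here cluster 3 has a single member, so it is flagged `0` and the selected number of clusters `ks` is one less than `k`.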

## References

Hamada S. Badr, Zaitchik, B. F. and Dezfuli, A. K. (2015):
A Tool for Hierarchical Climate Regionalization, *Earth Science Informatics*,
**8**(4), 949-958, doi:10.1007/s12145-015-0221-7.

Hamada S. Badr, Zaitchik, B. F. and Dezfuli, A. K. (2014):
Hierarchical Climate Regionalization,
*Comprehensive R Archive Network (CRAN)*,
https://cran.r-project.org/package=HiClimR.

## Author

Hamada S. Badr <badr@jhu.edu>, Benjamin F. Zaitchik <zaitchik@jhu.edu>,
and Amin K. Dezfuli <amin.dezfuli@nasa.gov>. `HiClimR` is a modification of the `hclust` function, which is based on Fortran code contributed to STATLIB by F. Murtagh.

## Examples

```
require(HiClimR)
## Load test case data
x <- TestCase$x
## Generate longitude and latitude mesh vectors
xGrid <- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
lon <- c(xGrid$lon)
lat <- c(xGrid$lat)
## Hierarchical Climate Regionalization
y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
             continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
             standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE,
             kH = NULL, members = NULL, validClimR = TRUE, k = 12, minSize = 1,
             alpha = 0.01, plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
#>
#> PROCESSING STARTED
#>
#> Checking Multivariate Clustering (MVC)...
#> ---> x is a matrix
#> ---> single-variate clustering: 1 variable
#> Checking data...
#> ---> Checking dimensions...
#> ---> Checking row names...
#> ---> Checking column names...
#> Data filtering...
#> ---> Computing mean for each row...
#> ---> Checking rows with mean bellow meanThresh...
#> ---> 4678 rows found, mean ≤ 10
#> ---> Computing variance for each row...
#> ---> Checking rows with near-zero-variance...
#> ---> 3951 rows found, variance ≤ 0
#> Data preprocessing...
#> ---> Applying mask...
#> ---> Checking columns with missing values...
#> ---> Removing linear trend...
#> ---> Standardizing data...
#> Agglomerative Hierarchical Clustering...
#> ---> Computing correlation/dissimilarity matrix...
#> ---> Starting clustering process...
#> ---> Constructing dendrogram tree...
#> Calling cluster validation...
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
#>
#> PROCESSING COMPLETED
#>
#> Running Time:
#> user system elapsed
#> 0.233 0.024 0.257
#> Time difference of 0.2574458 secs
## Validation of Hierarchical Climate Regionalization
z <- validClimR(y, k = 12, minSize = 1, alpha = 0.01, plot = TRUE)
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
## Use a specified number of clusters (k = 12)
z <- validClimR(y, k = 12, minSize = 1, alpha = 0.01, plot = TRUE)
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
## Apply minimum cluster size (minSize = 25)
z <- validClimR(y, k = 12, minSize = 25, alpha = 0.01, plot = TRUE)
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
## The optimal number of clusters, including small clusters
k <- length(z$clustFlag)
## The selected number of clusters, after excluding small clusters (if minSize > 1)
ks <- sum(z$clustFlag)
```