# Fast correlation for large matrices


`fastCor` is a helper function that computes the Pearson correlation matrix for the `HiClimR` and `validClimR` functions. It is similar to the `cor` function in R but uses a faster implementation on 64-bit machines (an optimized `BLAS` library is highly recommended). `fastCor` also uses a memory-efficient algorithm that can split the data matrix and compute only the upper-triangular part of the correlation matrix. It can be used to compute the correlation matrix for the columns of any data matrix.

## Arguments

- xt
  an (`M` rows by `N` columns) matrix of 'double' values: `N` objects (spatial points or stations) to be clustered by `M` observations (temporal points or years). It is the transpose of the input matrix `x` required for the `HiClimR` and `validClimR` functions.

- nSplit
  integer greater than or equal to one: the number of splits into which the `ncol(xt)` columns of the data matrix are divided. If `nSplit = 1`, the default method is used to compute the correlation matrix for the full data matrix (no splits). If `nSplit > 1`, the correlation matrix (or its upper-triangular part if `upperTri = TRUE`) is allocated and filled with the correlation sub-matrix computed for each split. The first `nSplit - 1` splits have equal size, while the last split may include any remaining columns. This is used with `upperTri = TRUE` to compute only the upper-triangular part of the correlation matrix. The maximum number of splits, `nSplitMax = floor(N / 2)`, gives splits of 2 columns; if `nSplit > nSplitMax`, `nSplitMax` is used. A very large number of splits makes computation slower, but it allows big data to be handled when the available memory is not enough to allocate the full correlation matrix, which helps avoid the “Error: cannot allocate vector of size...” memory limitation. It is recommended to start with a small number of splits. If the data is very large compared to the physical memory, it is highly recommended to use a 64-bit machine with enough memory and/or to use the coarsening feature for gridded data by setting `lonStep > 1` and `latStep > 1`.

- upperTri
  logical to compute only the upper-triangular half of the correlation matrix if `upperTri = TRUE` and `nSplit > 1`. The upper-triangular half includes all required information since the correlation/dissimilarity matrix is symmetric. This almost halves memory use, which can be very important for big data.

- optBLAS
  logical to use an optimized BLAS library, if installed, when `optBLAS = TRUE` (64-bit machines only).

- verbose
  logical to print processing information if `verbose = TRUE`.
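The splitting rule described for `nSplit` can be sketched as follows. This is a minimal illustration, not the package's internal code: the helper `split_columns` is hypothetical, and the equal split size is assumed here to be `floor(N / nSplit)`, which the documentation does not state explicitly.

```r
# Hypothetical helper (not part of HiClimR): column ranges for each split.
split_columns <- function(N, nSplit) {
  # Cap the number of splits at nSplitMax = floor(N / 2), and keep it >= 1
  nSplitMax <- floor(N / 2)
  nSplit <- max(1, min(nSplit, nSplitMax))
  # Assumed size of the first nSplit - 1 splits
  size <- floor(N / nSplit)
  starts <- (seq_len(nSplit) - 1) * size + 1
  # The last split ends at column N, so it absorbs any remaining columns
  ends <- c(starts[-1] - 1, N)
  data.frame(split = seq_len(nSplit), start = starts, end = ends)
}

# 10 columns in 3 splits: the last split takes the leftover column
print(split_columns(10, 3))
# Requesting more splits than nSplitMax falls back to nSplitMax = 5
print(split_columns(10, 20))
```

Under this assumption, every split except possibly the last has the same number of columns, which keeps the size of each correlation sub-matrix predictable.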

## Details

The `fastCor` function computes the correlation matrix by calling the cross product function in the Basic Linear Algebra Subroutines (BLAS) library used by R. A significant performance improvement can be achieved when building R on 64-bit machines with an optimized BLAS library, such as *ATLAS*, *OpenBLAS*, or the commercial *Intel MKL*.

For big data, the memory required to allocate the square matrix of correlations may exceed the total amount of physical memory available, resulting in “Error: cannot allocate vector of size...”. `fastCor` allows for splitting the data matrix into `nSplit` splits and computing only the upper-triangular part of the correlation matrix with `upperTri = TRUE`. This almost halves memory use, which can be very important for big data. If `nSplit > 1`, the correlation matrix (or its upper-triangular part if `upperTri = TRUE`) is allocated and filled with the correlation sub-matrix computed for each split. The first `nSplit - 1` splits have equal size, while the last split may include any remaining columns.
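To illustrate the cross-product idea, here is a minimal sketch in base R. It is not the `fastCor` source: the helper `cor_crossprod` is hypothetical, and it omits the zero-variance checks and splitting that `fastCor` performs. Each column is centered and scaled to unit length, after which a single call to `crossprod` (dispatched to BLAS) yields the Pearson correlation matrix.

```r
# Hypothetical helper (not part of HiClimR): Pearson correlation of the
# columns of xt from one BLAS-backed cross product.
cor_crossprod <- function(xt) {
  xt <- as.matrix(xt)
  # Center each column on its mean
  xc <- sweep(xt, 2, colMeans(xt), "-")
  # Scale each centered column to unit Euclidean length
  # (a zero-variance column has length 0 and would produce NaN here)
  xs <- sweep(xc, 2, sqrt(colSums(xc^2)), "/")
  # crossprod(xs) is t(xs) %*% xs: entry (i, j) is cor(xt[, i], xt[, j])
  crossprod(xs)
}

set.seed(1)
xt <- matrix(rnorm(200), nrow = 20, ncol = 10)  # M = 20 rows, N = 10 columns
r_fast <- cor_crossprod(xt)
r_base <- cor(xt)
# Compare with base R's cor(), ignoring dimnames
print(all.equal(r_fast, r_base, check.attributes = FALSE))
```

Because the heavy lifting is a single matrix product, the speed of this approach depends largely on the BLAS library R is linked against.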

## References

Badr, H. S., Zaitchik, B. F. and Dezfuli, A. K. (2015):
A Tool for Hierarchical Climate Regionalization, *Earth Science Informatics*,
**8**(4), 949-958, doi:10.1007/s12145-015-0221-7.

Badr, H. S., Zaitchik, B. F. and Dezfuli, A. K. (2014):
Hierarchical Climate Regionalization,
*Comprehensive R Archive Network (CRAN)*,
https://cran.r-project.org/package=HiClimR.

## Author

Hamada S. Badr <badr@jhu.edu>, Benjamin F. Zaitchik <zaitchik@jhu.edu>, and Amin K. Dezfuli <amin.dezfuli@nasa.gov>.

## See also

`HiClimR`, `HiClimR2nc`, `validClimR`, `geogMask`, `coarseR`, `fastCor`, `grid2D` and `minSigCor`.

## Examples

```
require(HiClimR)
## Load test case data
x <- TestCase$x
## Use fastCor function to compute the correlation matrix
t0 <- proc.time() ; xcor <- fastCor(t(x)) ; proc.time() - t0
#> ---> Checking zero-variance data...
#> ---> Total number of variables: 6400
#> ---> WARNING: 3951 variables found with zero variance
#> user system elapsed
#> 0.542 0.048 0.590
## compare with cor function
t0 <- proc.time() ; xcor0 <- cor(t(x)) ; proc.time() - t0
#> Warning: the standard deviation is zero
#> user system elapsed
#> 0.527 0.032 0.558
if (FALSE) {
## Split the data into 10 splits and return upper-triangular half only
xcor10 <- fastCor(t(x), nSplit = 10, upperTri = TRUE)
}
```