Adapted from glab.library::PCA_from_file
.
Usage
run_PCA(
df,
savename = NULL,
summary = FALSE,
center = TRUE,
scale = FALSE,
tol = 0.05,
rank = NULL,
screeplot = TRUE
)
Arguments
- df
(path to) numeric dataframe; samples as columns, genes/features as rows
- savename
string; filepath (no ext.) to save PCA scores, loadings, sdev under
- summary
logical; output summary info
- center
logical; indicate whether the variables should be shifted to be zero centered
- scale
logical; indicate whether the variables should be scaled to have unit variance
- tol
numeric; indicate the magnitude below which components should be omitted
- rank
integer; a number specifying the maximal rank, i.e., maximal number of principal components to be used
- screeplot
logical; output + save screeplot?
Details
In general, Z-score standardization (center = T
; scale = T
) before PCA is advised. For (transformed) gene expression data, genearlly, center but don't scale.
center = T
: PCA maximizes the sum-of-squared deviations from the origin in the first PC. Variance is only maximized if the data is pre-centered.
scale = T
: If one feature varies more than others, the feature will dominate resulting principal components. Scaling will also result in components in the same order of magnitude.
Use either tol
or rank
, but not both.
Examples
data(iris)
Rubrary::run_PCA(t(iris[,c(1:4)]))
#> ** Cumulative var. exp. >= 80% at PC 1 (92.5%)
#> Standard deviations (1, .., p=4):
#> [1] 2.0562689 0.4926162 0.2796596 0.1543862
#>
#> Rotation (n x k) = (4 x 4):
#> PC1 PC2 PC3 PC4
#> Sepal.Length 0.36138659 -0.65658877 0.58202985 0.3154872
#> Sepal.Width -0.08452251 -0.73016143 -0.59791083 -0.3197231
#> Petal.Length 0.85667061 0.17337266 -0.07623608 -0.4798390
#> Petal.Width 0.35828920 0.07548102 -0.54583143 0.7536574