
panGWAS_with_large_dataset
Introduction
In this vignette we will look at how to optimize aurora for a large dataset. aurora builds hundreds of training datasets and corresponding machine learning models, so large and diverse datasets with many features may not finish in a reasonable time. On the other hand, since aurora is not parallelized, it has low CPU and RAM requirements. If you have the option, we advise running aurora in the background, as the vast majority of datasets with over 500 strains may take a few days to finish. It is always advisable to run aurora with the most accurate settings and all machine learning tools, but even the core version will provide good results.
Workflow
Let's first simulate some data. We will use the Limosilactobacillus reuteri dataset, which will be bootstrapped to obtain a larger dataset.
data("pheno_mat_reuteri")
data("bin_mat_reuteri")
# Bootstrap the dataset to make it larger
index <- sample(1:nrow(pheno_mat), size = 2000, replace = TRUE)
pheno_mat <- pheno_mat[index, ]
bin_mat_tmp <- bin_mat[,-1:-14]
bin_mat <- cbind(bin_mat[,1:14], bin_mat_tmp[, match(pheno_mat$ids, colnames(bin_mat_tmp))])
# give the strains generic names because the indexes must be unique
pheno_mat$ids <- paste0("strain", 1:nrow(pheno_mat))
colnames(bin_mat)[-1:-14] <- paste0("strain", 1:nrow(pheno_mat))
# Create a random tree
tree <- rtree(2000, tip.label = pheno_mat$ids)
With such a large dataset you could apply algorithms that reduce the number of strains while still preserving maximum variability; for example, Treemmer, Treetrimmer and Tree Pruner can all do this. While this is a good option, especially when the dataset contains clonal lineages, it is difficult to know how much reduction is still acceptable. Additionally, due to horizontal gene transfer, variability is present even in clonal lineages. In the aurora paper we show that some clonal lineages of S. Typhimurium contain strains isolated from multiple habitats. Reducing the number of strains may therefore lead to a significant loss of information.
Instead of removing strains you can configure aurora to do just a fraction of the computation, thus lowering the computational time. If you are sure that your dataset does not contain any mislabelled strains and that the species is adapted to the phenotype, then you do not have to run the computationally exhausting aurora_pheno() function and can just run the GWAS analysis with aurora_GWAS(). Running this function should take just a few minutes.
# run GWAS analysis without results from aurora_pheno()
res <- aurora_GWAS(bin_mat = bin_mat,
                   pheno_mat = pheno_mat,
                   tree = midpoint(tree),
                   write_data = FALSE)
#> Finished
See the vignette panGWAS for how to postprocess the GWAS results.
However, what if you do want to run aurora_pheno()? What is the best parameter setting for a large dataset? This depends on the properties of your data. The simplest way to reduce the computational time is to decrease the parameter no_rounds from the default 100 to 50. Going even lower is not recommended, as this could reduce the statistical power to detect adaptation to the phenotype. If the genes responsible for the adaptation have a high effect size, this reduction will not influence the results; however, if multiple low effect size genes are responsible for the adaptation and only a small fraction of the strains is adapted to the phenotype, then no_rounds should be kept at 100. Another way to decrease the computational time is to omit some machine learning algorithms. In the aurora article we demonstrated that using multiple algorithms may be redundant, as they all tend to identify the same strains as mislabelled. Random Forest and CART models produce a distance matrix. These matrices are crucial for analysing genotype-phenotype associations and for identifying more mislabelled strains. Thus we will disable the two remaining algorithms, logistic regression (ovr_log_reg = FALSE) and AdaBoost (adaboost = FALSE), to retain only the two most informative models. This will reduce the computational time almost twofold.
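As a minimal sketch (reusing the objects from the simulation chunk above), an aurora_pheno() call that applies only these two optimizations could look as follows; the full example at the end of this vignette layers further optimizations on top of them.
# minimal sketch: keep only CART and Random Forest and halve no_rounds
aurora_pheno(pheno_mat = pheno_mat,
             bin_mat = bin_mat,
             type_bin_mat = "custom",
             tree = tree,
             ovr_log_reg = FALSE, # skip one-vs-rest logistic regression
             adaboost = FALSE,    # skip AdaBoost
             no_rounds = 50,      # half of the default 100 rounds
             write_data = FALSE)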
Another way to lower the computational time is to reduce the number of strains in the training dataset. You can do so by modifying the argument bag_size. By default, bag_size is set to 5 times the number of strains in the class with the fewest strains. Hence, if you have 60 strains in class A, 30 strains in class B and 80 strains in class C, the training dataset will contain 150 strains from each class. We can lower this by setting the parameter bag_size to bag_size = c(75, 75, 75). This way of reducing the computational time is suitable when the dataset contains roughly the same number of strains in each class.
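As a quick worked example using the hypothetical class counts above:
class_counts <- c(A = 60, B = 30, C = 80)  # hypothetical class counts from the text above
rep(5 * min(class_counts), length(class_counts))  # default bag_size rule: 5 x smallest class
#> [1] 150 150 150
c(75, 75, 75)  # a smaller, user-supplied bag_size that halves the training set per class
#> [1] 75 75 75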
Sometimes fitting parameters to AdaBoost and Random Forest takes a long time. This fitting can be disabled by setting fit_parameters to FALSE. To find the best parameters, fit_parameters runs a grid search algorithm. Due to the stochastic nature of the bagging algorithms, the parameter fitting needs to be repeated multiple times (default: 10). You can reduce the number of repetitions by setting repeats = 5. Alternatively, more experienced users can use the function get_bags() to build their own training datasets and optimize hyperparameters (see the vignette get_bags). These hyperparameters can then be passed to aurora_pheno() (see the ?aurora_pheno() documentation). Lastly, the choice of bagging algorithm also influences the computational time. phylogenetic_walk takes roughly 100 times more CPU time than random_walk. However, since most of the computational time is spent on fitting the models, this change will not influence the overall computational time as much. We advise using random_walk only when an accurate phylogenetic tree could not be obtained or in cases where a large portion of the dataset is mislabelled. Let's demonstrate how we can assess which parameters we can change to reduce the computational time.
# Plot the phylogenetic tree first to see if the dataset is clonal
plot(tree, type = "fan", show.tip.label = FALSE)
# The phylogenetic tree does not appear to contain large clonal lineages that we could collapse into a single strain. Let's therefore look at other ways to reduce the computational time
# If no class in the dataset is strongly dominant, then we can reduce the argument bag_size
table(pheno_mat$pheno)
# While rodent and poultry isolates are clearly more abundant, the numbers are roughly equal and we can reduce bag_size
table_tmp <- table(pheno_mat$pheno)
bag_size <- rep(table_tmp[which.min(table_tmp)] * 2, length(table_tmp))
# Another thing that will drastically reduce the computational time is lowering the number of classes. The results then have to be interpreted accordingly.
pheno_mat$pheno[pheno_mat$pheno != "rodent"] <- "other" # keep only two classes: rodent and other
aurora_pheno(pheno_mat = pheno_mat,
             bin_mat = bin_mat,
             type_bin_mat = "custom",
             tree = tree,
             bag_size = unname(bag_size),
             bagging = "random_walk", # use random walk, which is faster than phylogenetic walk
             repeats = 5, # lowers the time needed to fit hyperparameters to Random Forest
             ovr_log_reg = FALSE, # do not use logistic regression
             adaboost = FALSE, # do not use AdaBoost
             no_rounds = 50, # from the literature we know that L. reuteri colonization factors have a high effect size, so we can reduce this argument
             write_data = FALSE)
Sometimes the number of strains is not limiting but the number of features is (i.e., when SNPs or k-mers are used as the investigated variants). In this case, follow the vignette SNP_and_kmer_GWAS. In the example above, the pangenome has many genes in the cloud genome and many genes near the core genome. All of these are certainly not causal. One simple way to filter the features, and subsequently lower the computational time, is to increase the parameter low_perc_cutoff from 3 to 5, which removes the cloud genes, and to decrease the parameter upp_perc_cutoff from 99 to 95, which removes the near-core genes.
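If you prefer to filter the features yourself before calling aurora, a rough equivalent is to drop genes by their presence frequency directly from bin_mat. The sketch below assumes that columns 15 onward contain 0/1 presence/absence calls, as in the simulated matrix above; adapt it if your matrix is encoded differently.
# manual pre-filter roughly equivalent to low_perc_cutoff = 5 and upp_perc_cutoff = 95:
# keep only genes present in 5-95 % of strains (assumes 0/1 coding in the strain columns)
gene_freq <- rowMeans(bin_mat[, -1:-14] != 0) * 100  # per-gene presence frequency in %
bin_mat_filtered <- bin_mat[gene_freq >= 5 & gene_freq <= 95, ]
dim(bin_mat_filtered)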