
panGWAS_with_large_dataset
Introduction
In this vignette we will look at how to optimize aurora for a large dataset. aurora builds hundreds of training datasets and corresponding machine learning models, so large and diverse datasets with many features may not finish in a reasonable time. On the other hand, since aurora is not parallelized, it has low CPU and RAM requirements. If you have the option, we advise running aurora in the background, as the vast majority of datasets with over 500 strains may take a few days to finish. It is always advisable to run aurora with the most accurate settings and all machine learning tools, but even the core version will provide good results.
Workflow
Let's first simulate some data. We will use the Limosilactobacillus reuteri dataset, which will be bootstrapped to obtain a larger dataset.
data("pheno_mat_reuteri")
data("bin_mat_reuteri")
# Bootstrap the dataset to make it larger
index <- sample(1:nrow(pheno_mat), size = 2000, replace = TRUE)
pheno_mat <- pheno_mat[index, ]
bin_mat_tmp <- bin_mat[,-1:-14]
bin_mat <- cbind(bin_mat[,1:14], bin_mat_tmp[, match(pheno_mat$ids, colnames(bin_mat_tmp))])
# give the strains generic names because the indexes must be unique
pheno_mat$ids <- paste0("strain", 1:nrow(pheno_mat))
colnames(bin_mat)[-1:-14] <- paste0("strain", 1:nrow(pheno_mat))
# Create a random tree
tree <- rtree(2000, tip.label = pheno_mat$ids)
With such a large dataset you could apply algorithms that reduce the number of strains while still preserving maximum variability; for example, Treemmer, Treetrimmer and Tree Pruner can all do this. While this is a good option, especially when the dataset contains clonal lineages, it is difficult to know how much reduction is still acceptable. Additionally, due to horizontal gene transfer, variability is present even in clonal lineages. In the aurora paper we show that some clonal lineages of S. Typhimurium contain strains isolated from multiple habitats. Reducing the number of strains may therefore lead to a significant loss of information.
Instead of removing strains you can configure aurora to do just a fraction of the computation, thus lowering the computational time. If you are sure that your dataset does not contain any mislabelled strains and that the species is adapted to the phenotype, then you do not have to run the computationally exhausting aurora_pheno() function and can just run the GWAS analysis with aurora_GWAS(). Running this function should take just a few minutes.
# run GWAS analysis without results from aurora_pheno()
res <- aurora_GWAS(bin_mat = bin_mat,
                   pheno_mat = pheno_mat,
                   tree = midpoint(tree),
                   write_data = FALSE)
#> Finished
See the vignette panGWAS for how to postprocess the GWAS results.
However, what if you do want to run aurora_pheno()? What is the best parameter setting for a large dataset? This depends on the properties of your data. The simplest way to reduce the computational time is to decrease the parameter no_rounds from the default 100 to 50. Going even lower is not recommended, as this could reduce the statistical power to detect adaptation to the phenotype. If the genes responsible for the adaptation have a high effect size, this reduction will not influence the results; however, if multiple low effect size genes are responsible for the adaptation and only a small fraction of the strains is adapted to the phenotype, then no_rounds should be kept at 100. Another way to decrease the computational time is to omit some machine learning algorithms. In the aurora article we demonstrated that using multiple algorithms may be redundant, as they all tend to identify the same strains as mislabelled. Random Forest and CART models produce a distance matrix. These matrices are crucial for analysing genotype-phenotype associations and for identifying more mislabelled strains. Thus we will disable the two remaining algorithms, logistic regression (ovr_log_reg = FALSE) and AdaBoost (adaboost = FALSE), to retain only the two most informative models. This will reduce the computational time almost twofold.
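As a minimal sketch (reusing the objects from the simulation chunk above), an aurora_pheno() call that applies only these two optimizations could look as follows; the full example at the end of this vignette layers further optimizations on top of them.
# minimal sketch: keep only CART and Random Forest and halve no_rounds
aurora_pheno(pheno_mat = pheno_mat,
             bin_mat = bin_mat,
             type_bin_mat = "custom",
             tree = tree,
             ovr_log_reg = FALSE, # skip one-vs-rest logistic regression
             adaboost = FALSE,    # skip AdaBoost
             no_rounds = 50,      # half of the default 100 rounds
             write_data = FALSE)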
Another way to lower the computational time is to reduce the number of strains in the training dataset. You can do so by modifying the argument bag_size. By default, bag_size is set to 5 times the number of strains in the class with the fewest strains. Hence, if you have 60 strains in class A, 30 strains in class B and 80 strains in class C, the training dataset will contain 150 strains from each class. We can lower this by setting the parameter bag_size to bag_size = c(75, 75, 75). This way of reducing the computational time is suitable when the dataset contains roughly the same number of strains in each class.
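As a quick worked example using the hypothetical class counts above:
class_counts <- c(A = 60, B = 30, C = 80)  # hypothetical class counts from the text above
rep(5 * min(class_counts), length(class_counts))  # default bag_size rule: 5 x smallest class
#> [1] 150 150 150
c(75, 75, 75)  # a smaller, user-supplied bag_size that halves the training set per class
#> [1] 75 75 75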
Sometimes fitting parameters to AdaBoost and Random Forest takes a long time. This fitting can be disabled by setting fit_parameters to FALSE. To find the best parameters, fit_parameters runs a grid search algorithm. Due to the stochastic nature of the bagging algorithms, the parameter fitting needs to be repeated multiple times (default: 10). You can reduce the number of repetitions by setting repeats = 5. Alternatively, more experienced users can use the function get_bags() to build their own training datasets and optimize hyperparameters (see the vignette get_bags). These hyperparameters can then be passed to aurora_pheno() (see the ?aurora_pheno() documentation). Lastly, the choice of bagging algorithm also influences the computational time. phylogenetic_walk takes roughly 100 times more CPU time than random_walk. However, since most of the computational time is spent on fitting the models, this change will not influence the overall computational time as much. We advise using random_walk only when an accurate phylogenetic tree could not be obtained or in cases where a large portion of the dataset is mislabelled. Let's demonstrate how we can assess which parameters we can change to reduce the computational time.
# Plot the phylogenetic tree first to see if the dataset is clonal
plot(tree, type = "fan", show.tip.label = FALSE)
# The phylogenetic tree does not appear to contain large clonal lineages that we could collapse into a single strain. Let's therefore look at other ways to reduce the computational time
# If no class in the dataset is strongly dominant, then we can reduce the argument bag_size
table(pheno_mat$pheno)
# While rodent and poultry isolates are clearly more abundant, the numbers are roughly equal and we can reduce bag_size
table_tmp <- table(pheno_mat$pheno)
bag_size <- rep(table_tmp[which.min(table_tmp)] * 2, length(table_tmp))
# Another thing that will drastically reduce the computational time is lowering the number of classes. The results then have to be interpreted accordingly.
pheno_mat$pheno[pheno_mat$pheno != "rodent"] <- "other" # keep only two classes: rodent and other
aurora_pheno(pheno_mat = pheno_mat,
             bin_mat = bin_mat,
             type_bin_mat = "custom",
             tree = tree,
             bag_size = unname(bag_size),
             bagging = "random_walk", # use random walk, which is faster than phylogenetic walk
             repeats = 5, # lowers the time needed to fit hyperparameters to Random Forest
             ovr_log_reg = FALSE, # do not use logistic regression
             adaboost = FALSE, # do not use AdaBoost
             no_rounds = 50, # from the literature we know that L. reuteri colonization factors have a high effect size, so we can reduce this argument
             write_data = FALSE)
Sometimes the number of strains is not limiting but the number of features is (i.e., when SNPs or k-mers are used as the investigated variants). In this case, follow the vignette SNP_and_kmer_GWAS. In the example above, the pangenome has many genes in the cloud genome and many genes near the core genome. All of these are certainly not causal. One simple way to filter the features, and subsequently lower the computational time, is to increase the parameter low_perc_cutoff from 3 to 5, which removes the cloud genes, and to decrease the parameter upp_perc_cutoff from 99 to 95, which removes the near-core genes.
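If you prefer to filter the features yourself before calling aurora, a rough equivalent is to drop genes by their presence frequency directly from bin_mat. The sketch below assumes that columns 15 onward contain 0/1 presence/absence calls, as in the simulated matrix above; adapt it if your matrix is encoded differently.
# manual pre-filter roughly equivalent to low_perc_cutoff = 5 and upp_perc_cutoff = 95:
# keep only genes present in 5-95 % of strains (assumes 0/1 coding in the strain columns)
gene_freq <- rowMeans(bin_mat[, -1:-14] != 0) * 100  # per-gene presence frequency in %
bin_mat_filtered <- bin_mat[gene_freq >= 5 & gene_freq <= 95, ]
dim(bin_mat_filtered)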