panGWAS • aurora

Introduction

Pangenome GWAS (panGWAS) is an analysis method that investigates the relationship between genes in species’ genofond and observable traits. PanGWAS relies on the construction of pangenomes and phylogenetic trees which are the main input into the analysis.

Objective

In this vignette you will be introduced to the most common use case of aurora package. You will try to find out if Limosilactobacillus reuteri is a host adapted species. To this end you will use an output from Panaroo (pangenome tool) run with 207 L. reuteri assemblies downloaded from NCBI and a core-genome phylogenetic tree constructed using IQ-TREE. In the end you will identify some of the genes that allow L. reuteri to colonize the rodent gastrointestinal tract.

Workflow

First, let’s install and attach all the packages we will need for this vignette.

library(aurora)
library(readr)
library(ape)
library(gplots)

Next let’s load all the data and also create the directory where the output will be stored.

save_dir <- "/path/to/your/dir/"

# this line will not be run because you do not want a vignette to create folders
#if (!dir.exists(save_dir)) {dir.create(save_dir)}

data("pheno_mat_reuteri") # loads the phenotype for each strain
data("bin_mat_reuteri") # loads the pangenome matrix
data("tree_reuteri") # loads the core-genome phylogenetic tree

You need to make sure that all strains in pheno_mat are also in the bin_mat and in the tree. Additionally, the strain names should contain only letters, numbers and special signs “”, ”.”. A strain name cannot start with ”.” or ””. Now, the main aurora function will be executed. This function does not do any GWAS analysis. The only purpose of it is to find out if the species is adapted to the supplied phenotype and find which strains are mislabelled (allochthonous strains). If you don’t want to run it, you can skip this step and run only aurora_GWAS().

The L. reuteri is a medium-size dataset, and thus there is no need to change any parameters to optimize aurora’s performance. Still it might take a few hours for aurora_pheno() to finish. By default aurora_pheno() runs with all machine learning (ML) algorithms. Python functions are used to run AdaBoost and log regression. For this reason, one needs to set up a conda environment. Follow a tutorial in setup_conda to set it up properly and then, provide an exact system path to the conda environment. Alternatively, you can set adaboost = FALSE and ovr_log_reg = FALSE to disable those ML algorithmns. If you don’t provide any condaenv_path then AdaBoost and log regression will not be run.

# I want to use all ML algorithmns
results <- aurora_pheno(pheno_mat = pheno_mat,
                        bin_mat = bin_mat,
                        type_bin_mat = "panaroo",
                        tree = tree,
                        condaenv_path = "/path/to/conda/env", # an example of a path to conda environment
                        save_dir = save_dir)

# I don't want to use AdaBoost and log regression
results <- aurora_pheno(pheno_mat = pheno_mat,
                        bin_mat = bin_mat,
                        type_bin_mat = "panaroo",
                        tree = tree,
                        save_dir = save_dir)
# lets also save the final list for GWAS analysis later
save(results, file = paste0(save_dir, "results_aurora.Rdata"))

Since running the function will take a few hours lets load the precalculated results.

data("aurora_pheno_results_reuteri") # these are only truncated results!

When you run aurora_pheno() the generated figures are stored in save_dir. These figures named “aurora_heatmap_CART_[date].png”, “aurora_heatmap_median_random_forest_[date].png” and “aurora_heatmap_sum_random_forest_[date].png” are crucial for identifying clusters of strains that are not adapted to a phenotype class. These figures contain heatmaps of distances calculated based on CART and Random Forest proximities. If the species is adapted to the phenotype then the clusters should correspond with the observed phenotype classes. Additionally, the heatmaps contains a column color strip that show all strains whose predicted class matches the observed. If the color is lighter then the strain was classified as non-typical. Blank spots means that the class prediction of the strain was either INCONCLUSIVE or different than the observed class. The row color strip shows the observed phenotype class. The color legend of these figures is stored in results$results$results_random_forest$legend_heatmap and results$results$results_CART$legend_heatmap. Next we can examine the results stored in the list results The list contains the function call, the supplied bin_mat, the analysed phenotype classes (phenotypes for short). results$abundance_filter_removed, results$acr_filter_removed and results$hamming_filter_removed contains the number of features that these filters removed or collapsed. results$plot_acr_cutoff contains a histogram that shows how the cutoff for the ancestral reconstruction filter (ancest_rec_filter). results$pairwise_dist_outlier contains a histogram that shows the cutoff applied for reduce_oulier. The fitted parameters for AdaBoost and Random Forest can be found in results$fitted_parameters.

The most important objects are stored in results$results$. The complete list of what this part of the output contains is in ?aurora_pheno(). Let’s explore the main objects. First, the p-value matrices should be examined.

par(mar = c(10, 10, 1, 1))
func <- function(x) {
  # this function will make the results look prettier
  if (is.na(x)) {
    return(NA)
  }
   if (x == 0) {
    return(0)
  }
  y <- 1
  while (x*y < 1 || x*y > 10) {
    y <- y * 10
  }
  x <- x * y
  x_rounded <- round(x, 2)
  x_final <- x_rounded / y
  return(x_final)
}

mat <- apply(results$results$results_random_forest$p_val_mat,
             c(1,2), func)
heatmap.2(mat,
          trace = "none",
          dendrogram = "none",
          cellnote=mat,
          notecol="black",
          density.info="none",
          key = FALSE,
          Rowv = FALSE,
          Colv = FALSE,
          cexRow = 0.7,
          cexCol = 0.7)


mat <- apply(results$results$results_adaboost$p_val_mat,
             c(1,2), func)
heatmap.2(mat,
          trace = "none",
          dendrogram = "none",
          cellnote=mat,
          notecol="black",
          density.info="none",
          key = FALSE,
          Rowv = FALSE,
          Colv = FALSE,
          cexRow = 0.7,
          cexCol = 0.7)


mat <- apply(results$results$results_log_reg$p_val_mat,
             c(1,2), func)
heatmap.2(mat,
          trace = "none",
          dendrogram = "none",
          cellnote=mat,
          notecol="black",
          density.info="none",
          key = FALSE,
          Rowv = FALSE,
          Colv = FALSE,
          cexRow = 0.7,
          cexCol = 0.7)


mat <- apply(results$results$results_CART$p_val_mat,
             c(1,2), func)
heatmap.2(mat,
          trace = "none",
          dendrogram = "none",
          cellnote=mat,
          notecol="black",
          density.info="none",
          key = FALSE,
          Rowv = FALSE,
          Colv = FALSE,
          cexRow = 0.7,
          cexCol = 0.7)

When two p-values (x -> y and y -> x) are both above 0.05 then the two phenotype classes should be considered indistinguishable and vice versa when at least one pairwise p-value is below 0.05 then there is an adaptation to these two classes. aurora thus correctly identified L. reuteri as host adapted. Additionally, we should also confirm that results from all bags are consistent. Let’s plot AUC values for each tool in each iteration of Outlier calculation phase.

plot(results$results$results_random_forest$aucs, ylim = c(0,1), main = "Random Forest AUCs")

plot(results$results$results_adaboost$aucs, ylim = c(0,1), main = "AdaBoost AUCs")

plot(results$results$results_log_reg$aucs, ylim = c(0,1), main = "Log reg AUCs")

Indeed all of the AUCs are close to 1, therefore, no bag is an outlier. You might have noticed that results from CART were not plotted. Individual AUCs are not outputted for each iteration of CART. Instead a median of all runs for each pair of phenotype classes is calculated.

barplot(results$results$results_CART$complexity$AUC, names.arg = NULL, main = "CART AUCs", horiz = T, las = 1)

You can see that even a simple CART model does a pretty good job on this dataset, and all class pairs are close to 1. The fact that a classifier performs well on a dataset does not mean that there are no mislabelled strains. You can examine all predicted labels in the auto_allo_results object.

# These results are truncated to only the first 10 strains
knitr::kable(results$results$results_random_forest$auto_allo_results, format = "html", table.attr = "class='table table-striped'")

	strain_ids	poultry	rodent	homo sapiens	pig	primate	predicted	observed
GCA_025369755.1_ASM2536975v1_genomic	GCA_025369755.1_ASM2536975v1_genomic	TRUE not typical	FALSE	FALSE	FALSE	FALSE	poultry	poultry
GCA_024622285.1_ASM2462228v1_genomic	GCA_024622285.1_ASM2462228v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent
GCA_024622305.1_ASM2462230v1_genomic	GCA_024622305.1_ASM2462230v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent
GCA_024622315.1_ASM2462231v1_genomic	GCA_024622315.1_ASM2462231v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent
GCA_025189695.1_ASM2518969v1_genomic	GCA_025189695.1_ASM2518969v1_genomic	FALSE	FALSE	TRUE	FALSE	FALSE	homo sapiens	homo sapiens
GCA_002159305.1_ASM215930v1_genomic	GCA_002159305.1_ASM215930v1_genomic	TRUE	FALSE	FALSE	FALSE	FALSE	poultry	poultry
GCA_002160565.1_ASM216056v1_genomic	GCA_002160565.1_ASM216056v1_genomic	TRUE	FALSE	FALSE	FALSE	FALSE	poultry	poultry
GCA_022786355.1_ASM2278635v1_genomic	GCA_022786355.1_ASM2278635v1_genomic	FALSE	FALSE	FALSE	TRUE	FALSE	pig	pig
GCA_022780645.1_ASM2278064v1_genomic	GCA_022780645.1_ASM2278064v1_genomic	FALSE	FALSE	FALSE	TRUE	FALSE	pig	pig
GCA_022511545.1_ASM2251154v1_genomic	GCA_022511545.1_ASM2251154v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent

knitr::kable(results$results$results_adaboost$auto_allo_results, format = "html", table.attr = "class='table table-striped'")

	strain_ids	poultry	rodent	homo sapiens	pig	primate	predicted	observed
GCA_025369755.1_ASM2536975v1_genomic	GCA_025369755.1_ASM2536975v1_genomic	TRUE	FALSE	FALSE	FALSE	FALSE	poultry	poultry
GCA_024622285.1_ASM2462228v1_genomic	GCA_024622285.1_ASM2462228v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent
GCA_024622305.1_ASM2462230v1_genomic	GCA_024622305.1_ASM2462230v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent
GCA_024622315.1_ASM2462231v1_genomic	GCA_024622315.1_ASM2462231v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent
GCA_025189695.1_ASM2518969v1_genomic	GCA_025189695.1_ASM2518969v1_genomic	FALSE	FALSE	TRUE	FALSE	FALSE	homo sapiens	homo sapiens
GCA_002159305.1_ASM215930v1_genomic	GCA_002159305.1_ASM215930v1_genomic	TRUE	FALSE	FALSE	FALSE	FALSE	poultry	poultry
GCA_002160565.1_ASM216056v1_genomic	GCA_002160565.1_ASM216056v1_genomic	TRUE	FALSE	FALSE	FALSE	FALSE	poultry	poultry
GCA_022786355.1_ASM2278635v1_genomic	GCA_022786355.1_ASM2278635v1_genomic	FALSE	FALSE	FALSE	TRUE	FALSE	pig	pig
GCA_022780645.1_ASM2278064v1_genomic	GCA_022780645.1_ASM2278064v1_genomic	FALSE	FALSE	FALSE	TRUE	INCONCLUSIVE	INCONCLUSIVE or pig	pig
GCA_022511545.1_ASM2251154v1_genomic	GCA_022511545.1_ASM2251154v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent

knitr::kable(results$results$results_log_reg$auto_allo_results, format = "html", table.attr = "class='table table-striped'")

	strain_ids	poultry	rodent	homo sapiens	pig	primate	predicted	observed
GCA_025369755.1_ASM2536975v1_genomic	GCA_025369755.1_ASM2536975v1_genomic	TRUE not typical	FALSE	FALSE	FALSE	FALSE	poultry not typical	poultry
GCA_024622285.1_ASM2462228v1_genomic	GCA_024622285.1_ASM2462228v1_genomic	FALSE	TRUE not typical	FALSE	FALSE	FALSE	rodent not typical	rodent
GCA_024622305.1_ASM2462230v1_genomic	GCA_024622305.1_ASM2462230v1_genomic	FALSE	TRUE not typical	FALSE	FALSE	FALSE	rodent not typical	rodent
GCA_024622315.1_ASM2462231v1_genomic	GCA_024622315.1_ASM2462231v1_genomic	FALSE	TRUE not typical	FALSE	FALSE	FALSE	rodent not typical	rodent
GCA_025189695.1_ASM2518969v1_genomic	GCA_025189695.1_ASM2518969v1_genomic	FALSE	FALSE	TRUE	FALSE	FALSE	homo sapiens	homo sapiens
GCA_002159305.1_ASM215930v1_genomic	GCA_002159305.1_ASM215930v1_genomic	TRUE	FALSE	FALSE	FALSE	FALSE	poultry	poultry
GCA_002160565.1_ASM216056v1_genomic	GCA_002160565.1_ASM216056v1_genomic	TRUE	FALSE	FALSE	FALSE	FALSE	poultry	poultry
GCA_022786355.1_ASM2278635v1_genomic	GCA_022786355.1_ASM2278635v1_genomic	FALSE	FALSE	FALSE	TRUE not typical	FALSE	pig not typical	pig
GCA_022780645.1_ASM2278064v1_genomic	GCA_022780645.1_ASM2278064v1_genomic	FALSE	FALSE	FALSE	TRUE	FALSE	pig	pig
GCA_022511545.1_ASM2251154v1_genomic	GCA_022511545.1_ASM2251154v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent

knitr::kable(results$results$results_CART$auto_allo_results, format = "html", table.attr = "class='table table-striped'")

strain_ids	poultry	rodent	homo sapiens	pig	primate	predicted	observed
GCA_025369755.1_ASM2536975v1_genomic	TRUE	TRUE	FALSE	FALSE	FALSE	rodent or poultry	poultry
GCA_024622285.1_ASM2462228v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent
GCA_024622305.1_ASM2462230v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent
GCA_024622315.1_ASM2462231v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent
GCA_025189695.1_ASM2518969v1_genomic	FALSE	FALSE	TRUE	FALSE	FALSE	homo sapiens	homo sapiens
GCA_002159305.1_ASM215930v1_genomic	TRUE	FALSE	FALSE	FALSE	FALSE	poultry	poultry
GCA_002160565.1_ASM216056v1_genomic	TRUE	FALSE	FALSE	FALSE	FALSE	poultry	poultry
GCA_022786355.1_ASM2278635v1_genomic	FALSE	FALSE	FALSE	TRUE	FALSE	pig	pig
GCA_022780645.1_ASM2278064v1_genomic	FALSE	FALSE	FALSE	TRUE	FALSE	pig	pig
GCA_022511545.1_ASM2251154v1_genomic	FALSE	TRUE	FALSE	FALSE	FALSE	rodent	rodent

It seems that each algorithm identified only a few strains that are mislabelled. This is another indication that the species is adapted to the habitat. However, log regression identified the most mislabelled strains. If results from one algorithm are different than the rest of the algorithms, we can choose not to use the results in the subsequent GWAS analysis. Another reason why you may want to omit some algorithms is that p-value matrix from a ML algorithm shows that the species is not adapted and it contradicts results of other ML algorithms. Such a model may just have a high variance because it was overfitted. Also, if the AUCs of a ML algorithm are much lower than AUCs of others it might mean that the model is not complex enough (e.g., it has high bias) and the algorithm should be omitted.

In the example below we run the GWAS analysis without log regression. Phylogenetic tree, bin_mat and pheno_mat needs to be provided again. As we show in the aurora paper there are some rodent isolates that appear to be able to colonize primate gut as well. To identify the exclusively primate colonization factors these strains needs to be removed. All these L. reuteri strains were labelled as non-typical by Random Forest. We can thus easily identify them and remove them (argument remove_strains). Alternatively, you can also choose to remove all non-typical strains by setting rm_non_typical = TRUE. aurora_GWAS() is a function with many parameters but one of the most important that the user should pay attention to is the mode. The mode of aurora_GWAS() can either be “consensus” or “strict”. The first named always removes less strains than “strict” mode. Consensus mode removes only strains that were identified as allochthonous by all ML algorithms used. On the other hand, strict mode removes a strain if it was identified as mislabelled by at least one ML algorithm.

# Get all non-typical rodent strains identified by Random Forest
res_rf <- results$results$results_random_forest$auto_allo_results
res_rf_rodent <- res_rf[res_rf$observed == "rodent", ]
ids_rf_not_typ <- res_rf_rodent$strain_ids[grepl("not typical", res_rf_rodent$rodent)]

# Run GWAS analysis
res <- aurora_GWAS(bin_mat = bin_mat,
                   type_bin_mat = "panaroo",
                   pheno_mat = pheno_mat,
                   tree = tree,
                   remove_strains = ids_rf_not_typ,
                   aurora_results = results,
                   mode = "consensus",
                   write_data = FALSE,
                   use_log = FALSE)
#> 5 branches were identified as outliers. These branches were reduced 
#> Random forest removes 0 strains 
#> Adaboost removes 1 strains 
#> CART removes 1 strains 
#> Consensus mode removed in total 0 strains. There are 207 strains left. 
#> Finished

The removed strains are in res$removed_strains but the main results are in res$GWAS_results. This data frame contains precision, recall, F1 values and standardized residuals for each gene in each class. The best (most causal) genes will have the highest F1 value and standardized residual. The colonization factors of rodent gut are well known. It is the urease operon (ure genes), serine-rich adhesin (group_2460). Let’s see if these colonization factors can be found in the GWAS results.

GWAS_rodent <- res$GWAS_results
# Sort the genes based on rodent standardized residuals and give them a rank
GWAS_rodent <- GWAS_rodent[order(GWAS_rodent$`rodent std_residual`, decreasing = TRUE), ]
GWAS_rodent$rank <- rank(1-GWAS_rodent$`rodent std_residual`, ties.method = "average")
# Print the rank of the known colonization factors
index <- grepl("ure", GWAS_rodent$Variant) | GWAS_rodent$Variant == "group_2460"
knitr::kable(GWAS_rodent[index, c(1,29)], format = "html", table.attr = "class='table table-striped'")

	Variant	rank
ureD1	ureD1	3.5
ureF	ureF	12.5
ureG	ureG	12.5
ureE	ureE	12.5
ureC	ureC	12.5
ureA	ureA	12.5
ureB	ureB	12.5
ureI	ureI	12.5
group_2460	group_2460	17.0

Indeed, we can see that the known colonization factors are among the top results. However, sorting features based on their standardized residuals may not be the best approach in all scenarios. Non-causal genes with low frequency can rank before the causal genes. It is a common practice in GWAS studies to apply frequency cutoffs. aurora however uses F1 values which implicitly takes into account the frequency of the gene. Additionally, the user can choose to filter the results based on their precision value to remove high frequency non-causal genes.

GWAS_human <- res$GWAS_results
GWAS_human <- GWAS_human[GWAS_human$`homo sapiens precision` > 0.4, ]
GWAS_human <- GWAS_human[order(GWAS_human$`homo sapiens F1`, decreasing = TRUE), ]
# print the top 10 human colonization factors

knitr::kable(GWAS_human[1:10,c(1,20)], format = "html", table.attr = "class='table table-striped'")

	Variant	homo sapiens F1
xerC_5~~~xerC_6~~_{xerD_2}~~xerC_2	xerC_5~~~xerC_6~~_{xerD_2}~~xerC_2	0.8077482
group_2589	group_2589	0.7485191
group_4028	group_4028	0.7066599
group_4023	group_4023	0.7066599
group_4017	group_4017	0.7066599
panP	panP	0.6495611
group_362	group_362	0.6299141
group_260	group_260	0.6178300
tnpR~~~hin	tnpR~~~hin	0.6068966
group_2912	group_2912	0.5843621

In conclusion, we advise the users run aurora_pheno() with all machine learning algorithms and with phylogenetic_walk if a reliable phylogenetic tree is available. The user should then experiment with different settings of aurora_GWAS(). I often map the removed strains back to the phylogenetic tree to see how the allochthonous strains are dispersed around the tree. You can use res$removed_strains for this.