Function that runs a GWAS analysis

Usage

aurora_GWAS(
  pheno_mat,
  bin_mat = NA,
  type_bin_mat = "roary",
  which_snps = "all_alleles",
  bagging = "phylogenetic_walk",
  tree = NA,
  alternative_dist_mat = NA,
  reduce_outlier = TRUE,
  cutoff_outlier = 3,
  remove_strains = NA,
  aurora_results = NA,
  mode = "consensus",
  rm_non_typical = FALSE,
  use_rf = TRUE,
  use_ada = TRUE,
  use_log = TRUE,
  use_CART = TRUE,
  bag_size = NA,
  max_per_bag = 1e+05,
  get_bagging_counts = FALSE,
  write_data = TRUE,
  save_dir = NA,
  run_hogwash = FALSE,
  perm_val = NA,
  run_treeWAS = FALSE,
  n.snps.sim_val = NA,
  get_scoary = FALSE,
  get_pyseer = FALSE
)

Arguments

pheno_mat: Data frame that contains unique indexes in the first column and the phenotype classes in the second column. The unique indexes should contain only letters, numbers and the special characters "_" and ".".
bin_mat: Binary matrix containing the genomic variants. See type_bin_mat to check what can be supplied as a binary matrix.
type_bin_mat: Specifies the type of binary matrix. Options are: "panaroo"|"roary"; Exact system path to a csv file containing a pangenome matrix in Roary format (gene_presence_absence_roary.csv) or the pangenome matrix loaded as a data frame. The pangenome in this format is produced by both Roary and Panaroo. "DRAM"; Exact system path to a file containing the summary of metabolism produced by DRAM (metabolism_summary.xlsx). "SCARAP"; Exact system path to a file pangenome.tsv or a data frame containing the results of the pangenome tool SCARAP. "custom"; Data frame or matrix containing a custom binary matrix. The features should be in columns and the strains in rows. "k-mers"|"unitigs"; Exact system path to a .gz file containing unitigs called by unitig-counter or k-mers called by fsm-lite. "PIRATE"; Exact system path to a file containing the pangenome produced by PIRATE (PIRATE.gene_families.tsv) or the pangenome loaded as a data frame. "SNPs"; Exact system path to a VCF file. aurora expects these columns: #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT and a column for each analysed strain.
which_snps: There are two options: "biallelic" and "all_alleles". Setting which_snps = "biallelic" will remove all SNPs that have more than one alternative allele. Setting which_snps = "all_alleles" will create a new column for every alternative allele.
bagging: Bagging algorithm applied to capture the population structure. Select from "random_walk" or "phylogenetic_walk". Default: "phylogenetic_walk"
tree: Phylogenetic tree loaded as an object of class phylo. The tree needs to contain edge lengths and the tip labels should be the same as the indexes in pheno_mat.
alternative_dist_mat: Distance matrix that contains pairwise phylogenetic distances. Row names and column names need to be the same as the indexes in pheno_mat.
reduce_outlier: If TRUE, the phylogenetic distances of very distant strains to the rest of the population will be reduced. Default: TRUE.
cutoff_outlier: The number of standard deviations to the right of the mean of the distance distribution. Distances with a z-score higher than the cutoff will be shrunk to the cutoff. Default: 3.
remove_strains: Character vector that specifies which strains to explicitly remove.
aurora_results: Output of the function aurora_pheno.
mode: Specifies how to handle results from aurora_pheno. Value can be either "consensus" or "strict". Consensus mode removes only strains that were found to be allochthonous by all ML tools used in aurora_pheno. Strict mode removes a strain if it was identified as allochthonous by at least one ML tool. Default: "consensus".
rm_non_typical: Specifies if strains that were labelled as non-typical should be treated as allochthonous. Default: FALSE. WARNING: setting to TRUE may remove many strains.
use_rf: If set to TRUE, then results from Random Forest are considered. Default: TRUE.
use_ada: If set to TRUE, then results from AdaBoost are considered. Default: TRUE.
use_log: If set to TRUE, then results from logistic regression are considered. Default: TRUE.
use_CART: If set to TRUE, then results from CART are considered. Default: TRUE.
bag_size: The size of the bag for each class. Default: NA. If NA, then the bag size is 1000 for each class of the phenotype. Provide the size as a number for each class, e.g., c(50, 50, 50) for a phenotype with three classes.
max_per_bag: Maximum number of times a strain can be repeated in the bag. Default: 1e+05.
get_bagging_counts: If set to TRUE, then the number of times each strain appears in the bags will be returned. Default: FALSE.
write_data: Indicates whether aurora should write the data to the directory specified by the argument save_dir. Default: TRUE.
save_dir: An exact system path to the directory where the result will be written.
run_hogwash: If set to TRUE, then the mGWAS tool hogwash will be run with the same input as aurora_GWAS. Default: FALSE.
perm_val: perm value for running hogwash. Default: 5000.
run_treeWAS: If set to TRUE, then the mGWAS tool treeWAS will be run with the same input as aurora_GWAS. Default: FALSE.
n.snps.sim_val: n.snps.sim value for running treeWAS. Default: 100 * number of features.
get_scoary: If set to TRUE, then an input file for the mGWAS tool Scoary will be constructed. Default: FALSE.
get_pyseer: If set to TRUE, then an input file for the mGWAS tool Pyseer will be constructed. Default: FALSE.
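
The difference between the two values of mode can be illustrated with a small base R sketch. This is not aurora's internal code: the flags table, its column names and the strain names are made up for illustration, and the real aurora_pheno output has a different structure. The sketch only shows the rule described above, where "consensus" removes a strain flagged by all considered tools and "strict" removes a strain flagged by at least one.

```r
# Hypothetical per-tool calls: TRUE means the tool flagged the strain
# as allochthonous. Values and names are invented for illustration.
flags <- data.frame(
  rf   = c(TRUE, TRUE,  FALSE, FALSE),
  ada  = c(TRUE, FALSE, FALSE, FALSE),
  log  = c(TRUE, TRUE,  FALSE, TRUE),
  cart = c(TRUE, FALSE, FALSE, FALSE),
  row.names = c("s1", "s2", "s3", "s4")
)

# Tools to consider (analogous to use_rf, use_ada, use_log, use_CART)
use_tools <- c("rf", "ada", "log", "cart")

# Number of considered tools that flagged each strain
n_flagged <- rowSums(flags[, use_tools, drop = FALSE])

# "consensus": flagged by every considered tool
removed_consensus <- rownames(flags)[n_flagged == length(use_tools)]

# "strict": flagged by at least one considered tool
removed_strict <- rownames(flags)[n_flagged >= 1]

print(removed_consensus)  # "s1"
print(removed_strict)     # "s1" "s2" "s4"
```

Dropping a tool from use_tools mimics setting the corresponding use_* argument to FALSE: in this toy table, excluding "log" would leave "s4" untouched in strict mode.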

Value

The output of this function is a data frame that contains the frequency of the feature in the class, standardized residuals, precision, recall and F1 values for each feature and class. Additionally, the output contains a data frame that shows which strains were removed. If get_bagging_counts was set to TRUE, then the output will also contain a data frame that shows how many times each strain was repeated in the bag. If the user also requested to run hogwash or treeWAS, then the results of these two tools will also be part of the output. For more information about the output check https://dalimilbujdos.github.io/aurora/articles/outputs.html.

Examples

if (FALSE) { # \dontrun{
  data(tree_reuteri)
  data(pheno_mat_reuteri)
  data(bin_mat_reuteri)
  data(aurora_pheno_results_reuteri)

  aurora_GWAS(bin_mat = bin_mat,
              type_bin_mat = "panaroo",
              pheno_mat = pheno_mat,
              tree = tree,
              aurora_results = results,
              save_dir = "/path/to/my/dir/")
} # }
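
Before calling aurora_GWAS on your own data, it can help to check the inputs against the requirements listed under Arguments. The following is a minimal base R sketch, not part of aurora: check_aurora_inputs is a hypothetical helper, and the toy data frame is made up. It checks that the indexes are unique, that they contain only letters, numbers, "_" and ".", and that they match a set of labels such as the tip labels of the tree or the row names of a distance matrix.

```r
# Hypothetical helper (not an aurora function): basic sanity checks.
check_aurora_inputs <- function(pheno_mat, labels) {
  ids <- as.character(pheno_mat[[1]])

  # Indexes in the first column must be unique
  if (anyDuplicated(ids) > 0) {
    stop("Indexes in the first column are not unique.")
  }

  # Only letters, numbers, "_" and "." are allowed
  bad <- ids[!grepl("^[A-Za-z0-9_.]+$", ids)]
  if (length(bad) > 0) {
    stop("Invalid characters in indexes: ", paste(bad, collapse = ", "))
  }

  # Tree tip labels (or distance matrix names) should match the indexes
  if (!setequal(ids, labels)) {
    stop("Labels do not match the indexes in pheno_mat.")
  }

  invisible(TRUE)
}

# Toy data, invented for illustration
pheno_toy <- data.frame(
  ids   = c("strain_1", "strain.2", "strain3"),
  pheno = c("rodent", "poultry", "rodent")
)

check_ok <- check_aurora_inputs(pheno_toy, c("strain3", "strain_1", "strain.2"))
```

With a phylo object, the second argument would typically be tree$tip.label; with a distance matrix, rownames(alternative_dist_mat).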