Function that runs GWAS analysis
Usage
aurora_GWAS(
pheno_mat,
bin_mat = NA,
type_bin_mat = "roary",
which_snps = "all_alleles",
bagging = "phylogenetic_walk",
tree = NA,
alternative_dist_mat = NA,
reduce_outlier = TRUE,
cutoff_outlier = 3,
remove_strains = NA,
aurora_results = NA,
mode = "consensus",
rm_non_typical = FALSE,
use_rf = TRUE,
use_ada = TRUE,
use_log = TRUE,
use_CART = TRUE,
bag_size = NA,
max_per_bag = 1e+05,
get_bagging_counts = FALSE,
write_data = TRUE,
save_dir = NA,
run_hogwash = FALSE,
perm_val = NA,
run_treeWAS = FALSE,
n.snps.sim_val = NA,
get_scoary = FALSE,
get_pyseer = FALSE
)
Arguments
- pheno_mat
Data frame that contains unique indexes in the first column and the phenotype classes in the second column. The unique indexes should contain only letters, numbers and the special characters "_" and ".".
- bin_mat
Binary matrix containing the genomic variants. See type_bin_mat to check what can be supplied as a binary matrix.
- type_bin_mat
Specifies the type of binary matrix. Options are:
"panaroo"|"roary": Exact system path to a CSV file containing a pangenome matrix in Roary format (gene_presence_absence_roary.csv), or the pangenome matrix loaded as a data frame. Both Roary and Panaroo produce the pangenome in this format.
"DRAM": Exact system path to the metabolism summary file produced by DRAM (metabolism_summary.xlsx).
"SCARAP": Exact system path to a pangenome.tsv file, or a data frame containing the results of the pangenome tool SCARAP.
"custom": Data frame or matrix containing a custom binary matrix. The features should be in columns and the strains in rows.
"k-mers"|"unitigs": Exact system path to a .gz file containing unitigs called by unitig-counter or k-mers called by fsm-lite.
"PIRATE": Exact system path to the pangenome file produced by PIRATE (PIRATE.gene_families.tsv), or the pangenome loaded as a data frame.
"SNPs": Exact system path to a VCF file. aurora expects these columns: #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, and a column for each analysed strain.
- which_snps
There are two options: "biallelic" and "all_alleles". Setting which_snps = "biallelic" will remove all SNPs that have more than one alternative allele. Setting which_snps = "all_alleles" will create a new column for every alternative allele.
- bagging
Bagging algorithm applied to capture the population structure. Select from "random_walk" or "phylogenetic_walk". Default: "phylogenetic_walk"
- tree
Phylogenetic tree loaded as an object of class Phylo. The tree needs to contain edge lengths and the tip labels should be the same as the indexes in pheno_mat.
- alternative_dist_mat
Distance matrix that contains pairwise phylogenetic distances. Row names and column names need to be the same as the indexes in pheno_mat.
- reduce_outlier
If TRUE, the phylogenetic distances of very distant strains to the rest of the population will be reduced. Default: TRUE.
- cutoff_outlier
The number of standard deviations to the right of the mean of the distance distribution. Distances with a z-score higher than the cutoff will be shrunk to the cutoff. Default: 3.
- remove_strains
Character vector that specifies which strains to explicitly remove.
- aurora_results
Output of the function aurora_pheno.
- mode
Specifies how to handle results from aurora_pheno. Value can be either "consensus" or "strict". Consensus mode removes only strains that were found to be allochthonous by all ML tools used in aurora_pheno. Strict mode removes a strain if it was identified as allochthonous by at least one ML tool. Default: "consensus".
- rm_non_typical
Specifies if strains that were labelled as non-typical should be treated as allochthonous. Default: FALSE. WARNING: setting to TRUE may remove many strains.
- use_rf
If set to TRUE, results from Random Forest are considered. Default: TRUE.
- use_ada
If set to TRUE, results from AdaBoost are considered. Default: TRUE.
- use_log
If set to TRUE, results from logistic regression are considered. Default: TRUE.
- use_CART
If set to TRUE, results from CART are considered. Default: TRUE.
- bag_size
The size of the bag for each class. Default: NA. If NA, the bag size is 1000 for each class of the phenotype. Provide the size as one number per class, e.g., c(50, 50, 50) for a phenotype with three classes.
- max_per_bag
Maximum number of times a strain can be repeated in the bag. Default: 1e+05.
- get_bagging_counts
If set to TRUE, the number of times each strain appears in the bags will be returned. Default: FALSE.
- write_data
Indicates whether aurora should write the data to the directory specified by the argument save_dir. Default: TRUE.
- save_dir
An exact system path to the directory where the results will be written.
- run_hogwash
If set to TRUE, the mGWAS tool hogwash will be run with the same input as aurora_GWAS. Default: FALSE.
- perm_val
The perm value for running hogwash. Default: 5000.
- run_treeWAS
If set to TRUE, the mGWAS tool TreeWAS will be run with the same input as aurora_GWAS. Default: FALSE.
- n.snps.sim_val
The n.snps.sim value for running TreeWAS. Default: 100 * number of features.
- get_scoary
If set to TRUE, an input for the mGWAS tool Scoary will be constructed. Default: FALSE.
- get_pyseer
If set to TRUE, an input for the mGWAS tool Pyseer will be constructed. Default: FALSE.
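The requirements on pheno_mat and bag_size described above can be checked before calling aurora_GWAS. The following is a minimal sketch in base R, not part of aurora: the example data and the helper check_indexes are hypothetical, and the check only assumes what the argument descriptions state (unique indexes made of letters, numbers, "_" and ".", and one bag size per phenotype class).

```r
# Hypothetical example data: strain indexes in the first column,
# phenotype classes in the second column.
pheno_mat <- data.frame(
  ids = c("strain_1", "strain_2", "strain.3", "strain_4"),
  pheno = c("cheese", "cheese", "soil", "soil"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not an aurora function): TRUE when the indexes are
# unique and contain only letters, numbers, "_" and ".".
check_indexes <- function(ids) {
  ids <- as.character(ids)
  anyDuplicated(ids) == 0 && all(grepl("^[A-Za-z0-9_.]+$", ids))
}

indexes_ok <- check_indexes(pheno_mat[[1]])

# One bag size per phenotype class, here 50 for each of the two classes.
n_classes <- length(unique(pheno_mat[[2]]))
bag_size <- rep(50, n_classes)
```

If indexes_ok is FALSE, renaming the offending strains in pheno_mat, and in the matching tree or distance matrix, is likely the simplest fix.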
Value
The output of this function is a data frame that contains the frequency of the feature in the class, standardized residuals, precision, recall and F1 values for each feature and class. Additionally, the output contains a data frame that shows which strains were removed. If get_bagging_counts was set to TRUE, the output will also contain a data frame that shows how many times each strain was repeated in the bag. If the user also requested to run hogwash or TreeWAS, the results of these two tools will also be part of the output. For more information about the output check https://dalimilbujdos.github.io/aurora/articles/outputs.html.
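To make the frequency, precision, recall and F1 columns concrete, here is a small base R sketch on toy data. It assumes the standard definitions of these metrics, with presence of the feature treated as a prediction of class membership. It is not aurora's internal code, which this page does not describe, and it does not compute the standardized residuals. The names feature, classes and feature_metrics are hypothetical.

```r
# Toy data (hypothetical): presence (1) / absence (0) of one feature
# in six strains, and the phenotype class of each strain.
feature <- c(1, 1, 1, 1, 0, 0)
classes <- c("A", "A", "A", "B", "B", "B")

# Standard precision, recall and F1 of "feature present" as a predictor of
# one class. Returns NaN values if the feature is absent from all strains.
feature_metrics <- function(feature, classes, target) {
  in_class <- classes == target
  tp <- sum(feature == 1 & in_class)   # feature present, strain in class
  fp <- sum(feature == 1 & !in_class)  # feature present, strain not in class
  fn <- sum(feature == 0 & in_class)   # feature absent, strain in class
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- 2 * precision * recall / (precision + recall)
  c(
    frequency = mean(feature[in_class]),  # share of class strains with the feature
    precision = precision,
    recall = recall,
    f1 = f1
  )
}

metrics_A <- feature_metrics(feature, classes, "A")
```

In this toy case the feature is present in every class A strain (recall of 1) but also in one class B strain, so precision is 0.75.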