
Main function of the aurora package. This function determines whether the analysed species is adapted to the investigated phenotype and identifies mislabeled strains. Its output can be used as input to the aurora_GWAS function.
Source: R/aurora_pheno.R
aurora_pheno.Rd
Usage
aurora_pheno(
  pheno_mat,
  bin_mat = NA,
  type_bin_mat = "roary",
  which_snps = "all_alleles",
  bagging = "phylogenetic_walk",
  tree = NA,
  alternative_dist_mat = NA,
  reduce_outlier = TRUE,
  cutoff_outlier = 3,
  fit_parameters = TRUE,
  repeats = 10,
  random_forest = TRUE,
  plot_random_forest = TRUE,
  sampsize = 50,
  mtry = 200,
  ntree = 100,
  maxnodes = 12,
  ovr_log_reg = TRUE,
  C_val = 0.5,
  adaboost = TRUE,
  max_depth = 1,
  n_estimators = 500,
  learning_rate = 0.3,
  CART = TRUE,
  CART_plot = TRUE,
  condaenv_path = NA,
  low_perc_cutoff = 3,
  upp_perc_cutoff = 99,
  run_chisq = FALSE,
  cutoff_chisq = 0.1,
  jaccard_filter = FALSE,
  minPts_val = 3,
  eps_val = 0.01,
  hamming_filter = TRUE,
  hamming_cutoff = 3,
  ancest_rec_filter = TRUE,
  cutoff_asr = 2,
  bag_size = NA,
  no_rounds = 100,
  max_per_bag = NA,
  misslabel_no = 1,
  write_data = TRUE,
  save_dir = NA
)
Arguments
- pheno_mat
Data frame that contains unique indexes in the first column and the phenotype classes in the second column. The indexes may contain only letters, numbers and the special characters "_" and ".". The maximum number of unique classes is 9.
- bin_mat
Binary matrix containing the genomic variants. See type_bin_mat for what can be supplied as a binary matrix.
- type_bin_mat
Specifies the type of binary matrix. Options are:
"panaroo"|"roary": Exact system path to a csv file containing a pangenome matrix in Roary format (gene_presence_absence_roary.csv) or the pangenome matrix loaded as a data frame. The pangenome in this format is produced by both Roary and Panaroo.
"DRAM": Exact system path to a file containing the summary of metabolism produced by DRAM (metabolism_summary.xlsx).
"SCARAP": Exact system path to a file pangenome.tsv or a data frame containing the results of the pangenome tool SCARAP.
"custom": Data frame or matrix containing a custom binary matrix. The features should be in columns and the strains in rows.
"k-mers"|"unitigs": Exact system path to a .gz file containing unitigs called by unitig-counter or k-mers called by fsm-lite.
"PIRATE": Exact system path to a file containing the pangenome produced by PIRATE (PIRATE.gene_families.tsv) or the pangenome loaded as a data frame.
"SNPs": Exact system path to a VCF file. aurora expects these columns: #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT and a column for each analysed strain.
- which_snps
There are two options: "biallelic" and "all_alleles". Setting which_snps = "biallelic" will remove all SNPs that have more than one alternative allele. Setting which_snps = "all_alleles" will create a new column for every alternative allele.
- bagging
Bagging algorithm applied to capture the population structure. Select from "random_walk" or "phylogenetic_walk". Default: "phylogenetic_walk".
- tree
Phylogenetic tree loaded as an object of class Phylo. The tree needs to contain edge lengths and the tips should be the same as the indexes in pheno_mat.
- alternative_dist_mat
Distance matrix that contains pairwise phylogenetic distances. Row names and column names need to be the same as the indexes in pheno_mat.
- reduce_outlier
If TRUE, the phylogenetic distances of very distant strains to the rest of the population will be reduced. Default: TRUE.
- cutoff_outlier
The number of standard deviations to the right of the mean of the distance distribution. Distances with z-score higher than the cutoff will be shrunk to the cutoff. Default: 3.
- fit_parameters
Indicates whether the parameters of AdaBoost and Random Forest should be fitted. Default: TRUE.
- repeats
Number of times the parameters are fitted. Default: 10.
- random_forest
Indicates whether to use Random Forest. Default: TRUE.
- plot_random_forest
Indicates if the distance matrix from Random Forest should be plotted. Default: TRUE.
- sampsize
Determines the size of the random sample used to build each tree in Random Forest. Default: 50. Ignored if fit_parameters = TRUE.
- mtry
Controls the number of features considered at each split during tree building in Random Forest. Default: 200. Ignored if fit_parameters = TRUE.
- ntree
Determines the number of trees in Random Forest. Default: 100. Ignored if fit_parameters = TRUE.
- maxnodes
Limits the maximum number of terminal nodes in each Random Forest tree. Default: 12. Ignored if fit_parameters = TRUE.
- ovr_log_reg
Indicates whether to use logistic regression with a one-vs-rest strategy. Default: TRUE.
- C_val
Regularization term for the L1 penalty in logistic regression. Default: 0.5.
- adaboost
Indicates whether to use AdaBoost. Default: TRUE.
- max_depth
Indicates the maximum depth of the decision tree in AdaBoost. Default: 1. Changing the value currently has no effect. The argument is only present because a future version will allow the user to experiment with it.
- n_estimators
Determines the number of weak learners to combine in AdaBoost. Default: 500. Ignored if fit_parameters = TRUE.
- learning_rate
Controls the contribution of each weak learner in the final AdaBoost model. Default: 0.3. Ignored if fit_parameters = TRUE.
- CART
Indicates whether to use a classification and regression tree (CART) model. Default: TRUE.
- CART_plot
Indicates whether the proximities from CART models should be plotted. Default: TRUE.
- condaenv_path
An exact system path to a conda environment that will be used for logistic regression and AdaBoost. If no path is provided or the path is set to NA, AdaBoost and logistic regression will not be run.
- low_perc_cutoff
The lower cutoff for filtering the supplied binary matrix. Default: 3. This means that all features present in less than 3% of all strains will be removed.
- upp_perc_cutoff
The upper cutoff for filtering the supplied binary matrix. Default: 99. This means that all features present in more than 99% of all strains will be removed.
- run_chisq
Indicates whether chi-square filter should be run. Use only if the number of features after the initial filtering is still large (> 10,000). Default: FALSE.
- cutoff_chisq
A cutoff p-value for the chi-square test. Features with a p-value higher than the cutoff will be removed. Default: 0.1.
- jaccard_filter
Grouping of correlated features based on a Jaccard distance matrix and DBSCAN. Default: FALSE. Only one grouping method (jaccard_filter or hamming_filter) can be used.
- minPts_val
minPts parameter in DBSCAN. Specifies the minimum number of neighboring data points for a point to be considered a core point. Default: 3.
- eps_val
eps parameter in DBSCAN. Determines the maximum distance between two points for them to be considered neighbors. Default: 0.01.
- hamming_filter
Grouping of correlated features based on Hamming distance. Default: TRUE. Only one grouping method (jaccard_filter or hamming_filter) can be used.
- hamming_cutoff
Maximum Hamming distance. Features whose intracluster distances are lower than or equal to hamming_cutoff are grouped into one feature. Default: 3.
- ancest_rec_filter
Indicates whether the ancestral reconstruction filter should be used. The filter removes features that are broadly distributed along the phylogenetic tree. These features are often common plasmid genes and IS elements. Default: TRUE.
- cutoff_asr
The number of standard deviations to the right of the mean of the distribution produced by ancest_rec_filter. Features with z-score higher than the cutoff will be removed. Default: 2.
- bag_size
The size of the bag for each class. Default: NA. If NA, bag_size is calculated as 5 times the number of strains in the class with the fewest strains. Provide the size as a number for each class, e.g., c(50, 50, 50) for a phenotype with three classes.
- no_rounds
Number of times the aurora algorithm iterates. Default: 100. You can decrease this number to reduce the computation time.
- max_per_bag
Maximum number of times a strain can be repeated in the bag. Default: NA. If NA, max_per_bag is calculated so that no strain exceeds 20% of the bag of each class.
- misslabel_no
The number of mislabeled strains in the threshold calculation phase. Default: 1. Do not modify this argument. In a future version the user will be able to experiment with it.
- write_data
Indicates whether aurora should write the data to the directory specified by the save_dir argument. Default: TRUE.
- save_dir
An exact system path to the directory where the result will be written.
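The input rules stated above for pheno_mat and for a "custom" bin_mat can be checked before calling aurora_pheno. The following is a minimal base R sketch, not part of aurora: the helper name check_aurora_inputs and the toy data are hypothetical. The requirement that the row names of bin_mat match the indexes is an assumption, since the description only says that strains should be in rows.

```r
# Hypothetical helper (not part of aurora): returns a character vector of
# problems found, or character(0) if the inputs follow the documented rules.
check_aurora_inputs <- function(pheno_mat, bin_mat) {
  ids <- as.character(pheno_mat[[1]])                 # first column: indexes
  classes <- unique(as.character(pheno_mat[[2]]))     # second column: classes
  problems <- character(0)

  if (anyDuplicated(ids) > 0) {
    problems <- c(problems, "indexes are not unique")
  }
  if (!all(grepl("^[A-Za-z0-9_.]+$", ids))) {
    problems <- c(problems, "indexes contain characters other than letters, numbers, '_' and '.'")
  }
  if (length(classes) > 9) {
    problems <- c(problems, "more than 9 phenotype classes")
  }
  if (!all(as.matrix(bin_mat) %in% c(0, 1))) {
    problems <- c(problems, "bin_mat is not binary")
  }
  # Assumption: strains in rows are identified by row names equal to the indexes.
  if (!setequal(rownames(bin_mat), ids)) {
    problems <- c(problems, "row names of bin_mat do not match the indexes")
  }
  problems
}

# Toy data for illustration only.
pheno_mat <- data.frame(
  ids = c("strain_1", "strain_2", "strain.3", "strain_4"),
  pheno = c("cattle", "cattle", "human", "human")
)
bin_mat <- matrix(
  c(1, 0, 1, 1,
    0, 0, 1, 0,
    1, 1, 0, 0),
  nrow = 4,  # strains in rows, features in columns
  dimnames = list(pheno_mat$ids, c("gene_a", "gene_b", "gene_c"))
)

issues <- check_aurora_inputs(pheno_mat, bin_mat)
print(issues)
```

An empty result means that none of the checked rules is violated. It does not guarantee that aurora_pheno will accept the inputs.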
Value
The output is a nested list that mainly contains a table for each ML tool.
The tables show which strains were identified as autochthonous and allochthonous.
The list also contains p-value matrices that show whether the entire species is adapted
to the investigated phenotype. If write_data was set to TRUE and save_dir was provided,
the distance matrices are written to that directory. For more information about the
output, see https://dalimilbujdos.github.io/aurora/articles/outputs.html.
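Two of the numeric rules described under Arguments can be illustrated in a few lines of base R: the prevalence filter controlled by low_perc_cutoff and upp_perc_cutoff, and the shrinkage of outlying distances controlled by cutoff_outlier. This is only a sketch of the documented behaviour, not aurora's internal code. The function names and toy data are hypothetical, and which distances enter the distribution used for the z-scores is an assumption.

```r
# Sketch only: mirrors the documented rules, not aurora's implementation.

# Remove features present in less than `low`% or more than `upp`% of strains.
filter_by_prevalence <- function(bin_mat, low = 3, upp = 99) {
  perc <- colMeans(bin_mat) * 100            # percentage of strains per feature
  bin_mat[, perc >= low & perc <= upp, drop = FALSE]
}

# Shrink distances whose z-score exceeds `cutoff` down to the cutoff value.
# Assumption: `d` is the vector of distances forming the distribution.
shrink_outliers <- function(d, cutoff = 3) {
  cap <- mean(d) + cutoff * sd(d)            # distance at z-score == cutoff
  pmin(d, cap)
}

# Toy binary matrix: 100 strains, three features with 2%, 50% and 100% prevalence.
bin_mat <- cbind(
  rare   = c(rep(1, 2),  rep(0, 98)),
  common = c(rep(1, 50), rep(0, 50)),
  core   = rep(1, 100)
)
filtered <- filter_by_prevalence(bin_mat)
print(colnames(filtered))

# Toy distances: twenty close strains and one very distant strain.
d <- c(rep(1, 20), 50)
shrunk <- shrink_outliers(d)
print(range(shrunk))
```

With the default cutoffs only the feature named common remains, and only the single distant value is reduced while the other distances stay unchanged.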