Main function of the aurora package. This function determines whether the analysed species is adapted to the investigated phenotype and identifies mislabeled strains. Its output can be used as input to the aurora_GWAS function.

Usage

aurora_pheno(
  pheno_mat,
  bin_mat = NA,
  type_bin_mat = "roary",
  which_snps = "all_alleles",
  bagging = "phylogenetic_walk",
  tree = NA,
  alternative_dist_mat = NA,
  reduce_outlier = TRUE,
  cutoff_outlier = 3,
  fit_parameters = TRUE,
  repeats = 10,
  random_forest = TRUE,
  plot_random_forest = TRUE,
  sampsize = 50,
  mtry = 200,
  ntree = 100,
  maxnodes = 12,
  ovr_log_reg = TRUE,
  C_val = 0.5,
  adaboost = TRUE,
  max_depth = 1,
  n_estimators = 500,
  learning_rate = 0.3,
  CART = TRUE,
  CART_plot = TRUE,
  condaenv_path = NA,
  low_perc_cutoff = 3,
  upp_perc_cutoff = 99,
  run_chisq = FALSE,
  cutoff_chisq = 0.1,
  jaccard_filter = FALSE,
  minPts_val = 3,
  eps_val = 0.01,
  hamming_filter = TRUE,
  hamming_cutoff = 3,
  ancest_rec_filter = TRUE,
  cutoff_asr = 2,
  bag_size = NA,
  no_rounds = 100,
  max_per_bag = NA,
  misslabel_no = 1,
  write_data = TRUE,
  save_dir = NA
)

Arguments

pheno_mat: Data frame that contains unique indexes in the first column and the phenotype classes in the second column. The unique indexes should contain only letters, numbers and the special characters "_" and ".". The maximum number of unique classes is 9.
bin_mat: Binary matrix containing the genomic variants. See type_bin_mat to check what can be supplied as a binary matrix.
type_bin_mat: Specifies the type of binary matrix. Options are:
  "panaroo"|"roary": Exact system path to a csv file containing a pangenome matrix in Roary format (gene_presence_absence_roary.csv), or the pangenome matrix loaded as a data frame. Both Roary and Panaroo produce the pangenome in this format.
  "DRAM": Exact system path to a file containing the summary of metabolism produced by DRAM (metabolism_summary.xlsx).
  "SCARAP": Exact system path to a pangenome.tsv file, or a data frame containing the results of the pangenome tool SCARAP.
  "custom": Data frame or matrix containing a custom binary matrix. The features should be in columns and the strains in rows.
  "k-mers"|"unitigs": Exact system path to a .gz file containing unitigs called by unitig-counter or k-mers called by fsm-lite.
  "PIRATE": Exact system path to a file containing the pangenome produced by PIRATE (PIRATE.gene_families.tsv), or the pangenome loaded as a data frame.
  "SNPs": Exact system path to a VCF file. aurora expects these columns: #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT and one column for each analysed strain.
which_snps: There are two options: "biallelic" and "all_alleles". Setting which_snps = "biallelic" will remove all SNPs that have more than one alternative allele. Setting which_snps = "all_alleles" will create a new column for every alternative allele.
bagging: Bagging algorithm applied to capture the population structure. Select from "random_walk" or "phylogenetic_walk". Default: "phylogenetic_walk".
tree: Phylogenetic tree loaded as an object of class phylo. The tree needs to contain edge lengths, and the tip labels should be the same as the indexes in pheno_mat.
alternative_dist_mat: Distance matrix that contains pairwise phylogenetic distances. Row names and column names need to be the same as the indexes in pheno_mat.
reduce_outlier: If TRUE, the phylogenetic distances of very distant strains to the rest of the population will be reduced. Default: TRUE.
cutoff_outlier: The number of standard deviations to the right of the mean of the distance distribution. Distances with a z-score higher than the cutoff will be shrunk to the cutoff. Default: 3.
fit_parameters: Indicates if parameters to AdaBoost and Random Forest should be fitted. Default: TRUE.
repeats: Number of times the parameters are fitted. Default: 10.
random_forest: Indicates whether to use Random Forest. Default: TRUE.
plot_random_forest: Indicates if the distance matrix from Random Forest should be plotted. Default: TRUE.
sampsize: Determines the size of the random sample used to build each tree in Random Forest. Default: 50. Ignored if fit_parameters = TRUE.
mtry: Controls the number of features considered at each split during tree building in Random Forest. Default: 200. Ignored if fit_parameters = TRUE.
ntree: Determines the number of trees in Random Forest. Default: 100. Ignored if fit_parameters = TRUE.
maxnodes: Limits the maximum number of terminal nodes in each Random Forest tree. Default: 12. Ignored if fit_parameters = TRUE.
ovr_log_reg: Indicates whether to use logistic regression with a one-vs-rest strategy. Default: TRUE.
C_val: Regularization term for the l1 penalty in logistic regression. Default: 0.5.
adaboost: Indicates whether to use AdaBoost. Default: TRUE.
max_depth: Indicates the maximum depth of the decision tree in AdaBoost. Default: 1. Changing the value currently has no effect. The argument is present only because a future version will allow the user to experiment with it.
n_estimators: Determines the number of weak learners to combine in AdaBoost. Default: 500. Ignored if fit_parameters = TRUE.
learning_rate: Controls the contribution of each weak learner in the final AdaBoost model. Default: 0.3. Ignored if fit_parameters = TRUE.
CART: Indicates whether to use classification and regression tree (CART) model. Default: TRUE.
CART_plot: Indicates whether the proximities from CART models should be plotted. Default: TRUE.
condaenv_path: An exact system path to a conda environment that will be used for logistic regression and AdaBoost. If no path is provided, or if the path is set to NA, AdaBoost and logistic regression will not be run.
low_perc_cutoff: The lower cutoff for filtering the supplied binary matrix. Default: 3. This means that all features present in less than 3% of all strains will be removed.
upp_perc_cutoff: The upper cutoff for filtering the supplied binary matrix. Default: 99. This means that all features present in more than 99% of all strains will be removed.
run_chisq: Indicates whether chi-square filter should be run. Use only if the number of features after the initial filtering is still large (> 10,000). Default: FALSE.
cutoff_chisq: A cutoff p-value for the chi-square test. Features with a p-value higher than the cutoff will be removed. Default: 0.1.
jaccard_filter: Grouping of correlated features based on Jaccard distance matrix and DBSCAN. Default: FALSE. Only one grouping method (jaccard_filter or hamming_filter) can be used.
minPts_val: minPts parameter in DBSCAN. Specifies the minimum number of neighboring data points for a point to be considered a core point. Default: 3.
eps_val: eps parameter in DBSCAN. Determines the maximum distance between two points for them to be considered neighbors. Default: 0.01.
hamming_filter: Grouping of correlated features based on Hamming distance. Default: TRUE. Only one grouping method (jaccard_filter or hamming_filter) can be used.
hamming_cutoff: Maximum Hamming distance. If features have intracluster distances lower than or equal to hamming_cutoff, they are grouped into one feature. Default: 3.
ancest_rec_filter: Indicates if ancestral reconstruction filter should be used. The filter removes features broadly distributed along the phylogenetic tree. These features are often common plasmid genes and IS elements. Default: TRUE.
cutoff_asr: The number of standard deviations to the right of the mean of the distribution produced by ancest_rec_filter. Features with a z-score higher than the cutoff will be removed. Default: 2.
bag_size: The size of the bag for each class. Default: NA. If NA, bag_size is calculated as 5 times the number of strains in the class with the fewest strains. Provide the size as one number per class, e.g., c(50, 50, 50) for a phenotype with three classes.
no_rounds: Number of times the aurora algorithm iterates. Default: 100. You can decrease this number to lower the computational time.
max_per_bag: Maximum number of times a strain can be repeated in the bag. Default: NA. If NA, max_per_bag is calculated so that no strain exceeds 20% of the bag of its class.
misslabel_no: The number of mislabeled strains in the threshold calculation phase. Default: 1. Do not modify this argument. A future version will allow the user to experiment with it.
write_data: Indicates whether aurora should write the data in a directory specified by an argument save_dir. Default: TRUE.
save_dir: An exact system path to the directory where the result will be written.
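
Several of the argument descriptions above state simple rules that can be checked on toy data. The sketch below is an illustration in base R only, not aurora's internal code: the helper names (prevalence_filter, shrink_outliers, default_bag_size) are hypothetical, and shrinking "to the cutoff" is assumed to mean capping a distance at mean + cutoff_outlier * sd.

```r
# Toy "custom" binary matrix: 10 strains in rows, 4 features in columns.
bin_mat <- matrix(0L, nrow = 10, ncol = 4,
                  dimnames = list(paste0("strain_", 1:10),
                                  c("rare", "common", "core", "mid")))
bin_mat[1, "rare"]     <- 1L  # present in 10% of strains
bin_mat[1:5, "common"] <- 1L  # present in 50% of strains
bin_mat[, "core"]      <- 1L  # present in 100% of strains
bin_mat[1:3, "mid"]    <- 1L  # present in 30% of strains

# Hypothetical helper: drop features present in less than low_perc_cutoff %
# or in more than upp_perc_cutoff % of strains (see low_perc_cutoff and
# upp_perc_cutoff above).
prevalence_filter <- function(mat, low_perc_cutoff = 3, upp_perc_cutoff = 99) {
  perc <- 100 * colMeans(mat)
  mat[, perc >= low_perc_cutoff & perc <= upp_perc_cutoff, drop = FALSE]
}

# A cutoff of 20% is used here only because the toy matrix has 10 strains.
kept <- colnames(prevalence_filter(bin_mat, low_perc_cutoff = 20))

# Hypothetical helper: cap distances whose z-score exceeds cutoff_outlier.
# Assumption: "shrunk to the cutoff" means capped at mean + cutoff * sd.
shrink_outliers <- function(d, cutoff_outlier = 3) {
  cap <- mean(d) + cutoff_outlier * sd(d)
  pmin(d, cap)
}

d <- c(rep(1, 20), 100)       # one very distant strain
shrunk <- shrink_outliers(d)

# Hypothetical helper: the documented default for bag_size = NA, i.e.
# 5 times the number of strains in the smallest class.
default_bag_size <- function(classes) 5 * min(table(classes))

classes <- c(rep("human", 6), rep("rodent", 4))
bag <- default_bag_size(classes)
```

With these toy inputs, kept holds the two features whose prevalence lies between the cutoffs, the single extreme distance in d is pulled down while the others are unchanged, and bag is five times the size of the smaller class.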

Value

The output is a nested list that mainly contains a table for each ML tool. The tables show which strains were identified as autochthonous and allochthonous. The list also contains p-value matrices that show if the entire species is adapted to the investigated phenotype. If write_data was set to TRUE and save_dir was provided, the distance matrices are written to that directory. For more information about the output check https://dalimilbujdos.github.io/aurora/articles/outputs.html.

Examples


if (FALSE) { # \dontrun{
  data(pheno_mat_reuteri)
  data(bin_mat_reuteri)
  data(tree_reuteri)

  aurora_pheno(bin_mat = bin_mat,
               type_bin_mat = "panaroo",
               pheno_mat = pheno_mat,
               tree = tree,
               save_dir = "/path/to/my/dir/")
} # }
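
Before a long run it can help to confirm that the inputs satisfy the constraints listed under Arguments. The following is a minimal sketch in base R, not part of aurora: check_aurora_inputs is a hypothetical helper, and the toy tree is built by hand as a list of class "phylo" only so the example runs without extra packages (a real tree would normally be read with a phylogenetics package).

```r
# Hypothetical pre-flight check; returns a character vector of problems
# (empty if none were found).
check_aurora_inputs <- function(pheno_mat, tree = NULL) {
  ids      <- as.character(pheno_mat[[1]])
  classes  <- pheno_mat[[2]]
  problems <- character(0)

  if (anyDuplicated(ids) > 0) {
    problems <- c(problems, "indexes in the first column are not unique")
  }
  # Only letters, numbers, "_" and "." are allowed in the indexes.
  if (any(!grepl("^[A-Za-z0-9_.]+$", ids))) {
    problems <- c(problems, "indexes contain disallowed characters")
  }
  if (length(unique(classes)) > 9) {
    problems <- c(problems, "more than 9 phenotype classes")
  }

  if (!is.null(tree)) {
    if (!inherits(tree, "phylo")) {
      problems <- c(problems, "tree is not of class 'phylo'")
    } else {
      if (is.null(tree$edge.length)) {
        problems <- c(problems, "tree has no edge lengths")
      }
      if (!setequal(tree$tip.label, ids)) {
        problems <- c(problems, "tree tips differ from pheno_mat indexes")
      }
    }
  }
  problems
}

# Toy inputs (hypothetical data, three strains and two classes).
pheno_ok <- data.frame(ids = c("s_1", "s.2", "s3"),
                       pheno = c("a", "b", "a"))
toy_tree <- structure(
  list(edge = matrix(c(4L, 4L, 5L, 5L, 1L, 5L, 2L, 3L), ncol = 2),
       tip.label = c("s_1", "s.2", "s3"),
       edge.length = c(1, 0.5, 0.3, 0.2),
       Nnode = 2L),
  class = "phylo")

ok_report <- check_aurora_inputs(pheno_ok, toy_tree)

# Duplicated indexes that also contain a space.
pheno_bad  <- data.frame(ids = c("s 1", "s 1"), pheno = c("a", "b"))
bad_report <- check_aurora_inputs(pheno_bad)
```

An empty result suggests the documented input constraints are met. It does not guarantee that aurora_pheno will accept the data, since the function may apply further checks of its own.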