Main function of the aurora package. This function determines whether the analysed species is adapted to the investigated phenotype, and it also identifies mislabeled strains. The output of this function can be used as input to the aurora_GWAS function.

Usage

aurora_pheno(
  pheno_mat,
  bin_mat = NA,
  type_bin_mat = "roary",
  which_snps = "all_alleles",
  bagging = "phylogenetic_walk",
  tree = NA,
  alternative_dist_mat = NA,
  reduce_outlier = TRUE,
  cutoff_outlier = 3,
  fit_parameters = TRUE,
  repeats = 10,
  random_forest = TRUE,
  plot_random_forest = TRUE,
  sampsize = 50,
  mtry = 200,
  ntree = 100,
  maxnodes = 12,
  ovr_log_reg = TRUE,
  C_val = 0.5,
  adaboost = TRUE,
  max_depth = 1,
  n_estimators = 500,
  learning_rate = 0.3,
  CART = TRUE,
  CART_plot = TRUE,
  condaenv_path = NA,
  low_perc_cutoff = 3,
  upp_perc_cutoff = 99,
  run_chisq = FALSE,
  cutoff_chisq = 0.1,
  jaccard_filter = FALSE,
  minPts_val = 3,
  eps_val = 0.01,
  hamming_filter = TRUE,
  hamming_cutoff = 3,
  ancest_rec_filter = TRUE,
  cutoff_asr = 2,
  bag_size = NA,
  no_rounds = 100,
  max_per_bag = NA,
  misslabel_no = 1,
  write_data = TRUE,
  save_dir = NA
)

Arguments

pheno_mat

Data frame that contains unique indexes in the first column and the phenotype classes in the second column. The unique indexes should contain only letters, numbers and the special characters "_" and ".". The maximum number of unique classes is 9.
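The constraints above can be checked before calling aurora_pheno. The following is a minimal sketch in base R; the strain names, class labels and column names are hypothetical, and the checks are an illustration rather than the validation aurora performs internally.

```r
# Hypothetical strain indexes and phenotype classes, for illustration only
pheno_mat <- data.frame(
  ids = c("strain_01", "strain_02", "strain.03", "strain_04"),
  pheno = c("rodent", "poultry", "rodent", "human"),
  stringsAsFactors = FALSE
)

# Indexes must be unique and use only letters, digits, "_" and "."
ids_ok <- !anyDuplicated(pheno_mat[[1]]) &&
  all(grepl("^[A-Za-z0-9_.]+$", pheno_mat[[1]]))

# At most 9 phenotype classes are allowed
classes_ok <- length(unique(pheno_mat[[2]])) <= 9

stopifnot(ids_ok, classes_ok)
```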

bin_mat

Binary matrix containing the genomic variants. See type_bin_mat to check what can be supplied as a binary matrix.

type_bin_mat

Specifies the type of binary matrix. Options are:
"panaroo" | "roary": Exact system path to a CSV file containing a pangenome matrix in Roary format (gene_presence_absence_roary.csv), or the pangenome matrix loaded as a data frame. Both Roary and Panaroo produce the pangenome in this format.
"DRAM": Exact system path to the metabolism summary produced by DRAM (metabolism_summary.xlsx).
"SCARAP": Exact system path to a pangenome.tsv file, or a data frame containing the results of the pangenome tool SCARAP.
"custom": Data frame or matrix containing a custom binary matrix. The features should be in columns and the strains in rows.
"k-mers" | "unitigs": Exact system path to a .gz file containing unitigs called by unitig-counter or k-mers called by fsm-lite.
"PIRATE": Exact system path to a file containing the pangenome produced by PIRATE (PIRATE.gene_families.tsv), or the pangenome loaded as a data frame.
"SNPs": Exact system path to a VCF file. aurora expects these columns: #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT and a column for each analysed strain.
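For the "custom" option, a matrix with strains in rows and features in columns could be assembled as in the sketch below. The strain and feature names are hypothetical, and using the strain indexes as row names is an assumption, not something this page states.

```r
set.seed(1)

# Hypothetical strain indexes (assumed to match those in pheno_mat)
strains <- c("strain_01", "strain_02", "strain_03", "strain_04")
features <- paste0("gene_", 1:5)

# Random presence/absence values: strains in rows, features in columns
bin_mat_custom <- matrix(
  rbinom(length(strains) * length(features), size = 1, prob = 0.5),
  nrow = length(strains),
  dimnames = list(strains, features)
)

# A binary matrix should contain only 0 and 1
stopifnot(all(bin_mat_custom %in% c(0, 1)))
```

Such a matrix would then be passed as bin_mat together with type_bin_mat = "custom".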

which_snps

There are two options: "biallelic" and "all_alleles". Setting which_snps = "biallelic" will remove all SNPs that have more than one alternative allele. Setting which_snps = "all_alleles" will create a new column for every alternative allele.
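The difference between the two options can be illustrated on the ALT column of a VCF-like table. This is a sketch with made-up positions and alleles; it shows only how sites are selected or expanded, not how aurora encodes the per-strain genotypes.

```r
# Toy VCF-like table; a comma in ALT separates alternative alleles
vcf <- data.frame(
  POS = c(101, 205, 330),
  REF = c("A", "G", "C"),
  ALT = c("T", "A,C", "G"),
  stringsAsFactors = FALSE
)

alt_list <- strsplit(vcf$ALT, ",", fixed = TRUE)
n_alt <- lengths(alt_list)

# "biallelic": drop sites with more than one alternative allele
biallelic <- vcf[n_alt == 1, ]

# "all_alleles": one entry per alternative allele
all_alleles <- data.frame(
  POS = rep(vcf$POS, n_alt),
  REF = rep(vcf$REF, n_alt),
  ALT = unlist(alt_list),
  stringsAsFactors = FALSE
)
```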

bagging

Bagging algorithm applied to capture the population structure. Select from "random_walk" or "phylogenetic_walk". Default: "phylogenetic_walk".

tree

Phylogenetic tree loaded as an object of class phylo. The tree needs to contain edge lengths and the tips should be the same as the indexes in pheno_mat.

alternative_dist_mat

Distance matrix that contains pairwise phylogenetic distances. Row names and column names need to be the same as the indexes in pheno_mat.
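One way to obtain such a matrix is to derive it from a tree with the ape package, as sketched below. The use of ape, the random tree and the strain names are assumptions made for illustration; any matrix whose row and column names match the indexes in pheno_mat should fit the description above.

```r
# Runs only if the ape package is installed
if (requireNamespace("ape", quietly = TRUE)) {
  set.seed(42)
  ids <- paste0("strain_", 1:5)  # hypothetical indexes from pheno_mat

  # Random tree with edge lengths, standing in for a real phylogeny
  tree <- ape::rtree(5, tip.label = ids)

  # Pairwise tip-to-tip distances along the tree
  dist_mat <- ape::cophenetic.phylo(tree)

  # Row and column names must match the indexes in pheno_mat
  stopifnot(setequal(rownames(dist_mat), ids),
            setequal(colnames(dist_mat), ids))
}
```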

reduce_outlier

If TRUE the phylogenetic distances of very distant strains to the rest of the population will be reduced. Default: TRUE.

cutoff_outlier

The number of standard deviations to the right of the mean of the distance distribution. Distances with z-score higher than the cutoff will be shrunk to the cutoff. Default: 3.
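The shrinkage described above can be sketched as follows. This is an illustration of capping values by z-score, not aurora's implementation; the function name shrink_outliers and the toy distances are hypothetical.

```r
# Cap values whose z-score exceeds `cutoff` at mean + cutoff * sd
shrink_outliers <- function(d, cutoff = 3) {
  z <- (d - mean(d)) / sd(d)
  cap <- mean(d) + cutoff * sd(d)
  d[z > cutoff] <- cap
  d
}

# Twenty ordinary distances and one very distant strain
d <- c(rep(1, 20), 50)
shrunk <- shrink_outliers(d, cutoff = 3)
```

Only the extreme value is changed; distances with a z-score at or below the cutoff are returned untouched.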

fit_parameters

Indicates if parameters to AdaBoost and Random Forest should be fitted. Default: TRUE.

repeats

Number of times the parameters are fitted. Default: 10.

random_forest

Indicates whether to use Random Forest. Default: TRUE.

plot_random_forest

Indicates if the distance matrix from Random Forest should be plotted. Default: TRUE.

sampsize

Determines the size of the random sample used to build each tree in Random Forest. Default: 50. Ignored if fit_parameters = TRUE.

mtry

Controls the number of features considered at each split during tree building in Random Forest. Default: 200. Ignored if fit_parameters = TRUE.

ntree

Determines the number of trees in Random Forest. Default: 100. Ignored if fit_parameters = TRUE.

maxnodes

Limits the maximum number of terminal nodes in each Random Forest tree. Default: 12. Ignored if fit_parameters = TRUE.

ovr_log_reg

Indicates whether to use logistic regression with a one-vs-rest strategy. Default: TRUE.

C_val

Regularization term for the L1 penalty in logistic regression. Default: 0.5.

adaboost

Indicates whether to use AdaBoost. Default: TRUE.

max_depth

Indicates the maximum depth of the decision tree in AdaBoost. Default: 1. Changing the value currently has no effect. The argument is only here because a future version will let the user experiment with it.

n_estimators

Determines the number of weak learners to combine in AdaBoost. Default: 500. Ignored if fit_parameters = TRUE.

learning_rate

Controls the contribution of each weak learner in the final AdaBoost model. Default: 0.3. Ignored if fit_parameters = TRUE.

CART

Indicates whether to use classification and regression tree (CART) model. Default: TRUE.

CART_plot

Indicates whether the proximities from CART models should be plotted. Default: TRUE.

condaenv_path

An exact system path to a conda environment that will be used for logistic regression and AdaBoost. If no path is provided or if the path is set to NA, then AdaBoost and logistic regression will not be run.

low_perc_cutoff

The lower cutoff for filtering the supplied binary matrix. Default: 3. This means that all features present in fewer than 3% of all strains will be removed.

upp_perc_cutoff

The upper cutoff for filtering the supplied binary matrix. Default: 99. This means that all features present in more than 99% of all strains will be removed.
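Together, low_perc_cutoff and upp_perc_cutoff describe a prevalence filter. A minimal sketch of that idea in base R, with a hypothetical function name and toy data, could look like this:

```r
# Keep features whose prevalence (% of strains) lies within [low, upp]
filter_by_prevalence <- function(bin_mat, low = 3, upp = 99) {
  prevalence <- 100 * colMeans(bin_mat)
  bin_mat[, prevalence >= low & prevalence <= upp, drop = FALSE]
}

# 100 strains: a rare feature (2%), a core feature (100%), an accessory one (40%)
m <- cbind(
  rare = c(rep(1, 2), rep(0, 98)),
  core = rep(1, 100),
  accessory = c(rep(1, 40), rep(0, 60))
)

kept <- filter_by_prevalence(m)
colnames(kept)
```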

run_chisq

Indicates whether chi-square filter should be run. Use only if the number of features after the initial filtering is still large (> 10,000). Default: FALSE.

cutoff_chisq

A cutoff p-value for the chi-square test. Features with a p-value higher than the cutoff will be removed. Default: 0.1.
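A per-feature chi-square filter of this kind can be sketched with stats::chisq.test. The function name chisq_filter and the toy data are hypothetical, and the exact test settings aurora uses are not specified here, so treat this as an illustration of the idea only.

```r
# Keep features whose chi-square p-value against the phenotype is <= cutoff
chisq_filter <- function(bin_mat, pheno, cutoff = 0.1) {
  p_vals <- apply(bin_mat, 2, function(feature) {
    # Warnings about small expected counts are suppressed in this sketch
    suppressWarnings(chisq.test(table(feature, pheno))$p.value)
  })
  bin_mat[, p_vals <= cutoff, drop = FALSE]
}

pheno <- rep(c("A", "B"), each = 20)
m <- cbind(
  assoc = c(rep(1, 18), rep(0, 2), rep(0, 18), rep(1, 2)),  # tracks the phenotype
  noise = rep(c(1, 0), times = 20)                          # evenly split in both classes
)

kept <- chisq_filter(m, pheno, cutoff = 0.1)
colnames(kept)
```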

jaccard_filter

Grouping of correlated features based on Jaccard distance matrix and DBSCAN. Default: FALSE. Only one grouping method (jaccard_filter or hamming_filter) can be used.

minPts_val

minPts parameter in DBSCAN. Specifies the minimum number of neighboring data points for a point to be considered a core point. Default: 3.

eps_val

eps parameter in DBSCAN. Determines the maximum distance between two points for them to be considered neighbors. Default: 0.01.

hamming_filter

Grouping of correlated features based on Hamming distance. Default: TRUE. Only one grouping method (jaccard_filter or hamming_filter) can be used.

hamming_cutoff

Maximum Hamming distance. If features have intracluster distances lower than or equal to hamming_cutoff, then they are grouped into one feature. Default: 3.
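The grouping criterion can be illustrated with hierarchical clustering. The sketch below assumes complete linkage, which guarantees that every pairwise distance within a group is at most the cutoff; the clustering method aurora actually uses and the way it merges grouped features are not stated on this page.

```r
# Assign features to groups whose within-group Hamming distances are <= cutoff
hamming_groups <- function(bin_mat, cutoff = 3) {
  # For 0/1 data, Manhattan distance equals the Hamming distance
  d <- dist(t(bin_mat), method = "manhattan")
  # Complete linkage: merge height is the largest within-group distance
  cutree(hclust(d, method = "complete"), h = cutoff)
}

f1 <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
f2 <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)  # differs from f1 in one strain
f3 <- 1 - f1                           # differs from f1 in every strain

groups <- hamming_groups(cbind(f1 = f1, f2 = f2, f3 = f3), cutoff = 3)
groups
```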

ancest_rec_filter

Indicates if ancestral reconstruction filter should be used. The filter removes features broadly distributed along the phylogenetic tree. These features are often common plasmid genes and IS elements. Default: TRUE.

cutoff_asr

The number of standard deviations to the right of the mean of the distribution produced by ancest_rec_filter. Features with z-score higher than the cutoff will be removed. Default: 2.

bag_size

The size of the bag for each class. Default: NA. If NA, then bag_size is calculated as 5 times the number of strains in the class with the fewest strains. Provide the size as a number for each class, e.g., c(50, 50, 50) for a phenotype with three classes.

no_rounds

Number of times the aurora algorithm iterates. Default: 100. You can decrease this number to lower the computational time.

max_per_bag

Maximum number of times a strain can be repeated in the bag. Default: NA. If NA, then max_per_bag is calculated so that no strain exceeds 20% of the bag of each class.
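The default values of bag_size and max_per_bag can be worked through on a toy phenotype. The sketch assumes that the same default bag size applies to every class and that the 20% limit is rounded down; both are assumptions, and the class counts are made up.

```r
# Hypothetical phenotype with three classes of unequal size
pheno <- c(rep("rodent", 12), rep("poultry", 30), rep("human", 8))
class_counts <- table(pheno)

# Default bag size: 5 times the size of the smallest class (8 strains here)
bag_size <- rep(5 * min(class_counts), length(class_counts))
names(bag_size) <- names(class_counts)

# Default repeat limit: no strain fills more than 20% of a class bag
# (rounding down is an assumption of this sketch)
max_per_bag <- floor(0.2 * bag_size)
```

With these counts each bag holds 40 strains and a single strain may appear at most 8 times.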

misslabel_no

The number of mislabeled strains in the threshold calculation phase. Default: 1. Do not modify this argument. A future version will let the user experiment with it.

write_data

Indicates whether aurora should write the data in a directory specified by an argument save_dir. Default: TRUE.

save_dir

An exact system path to the directory where the result will be written.

Value

The output is a nested list that mainly contains a table for each ML tool. The tables show which strains were identified as autochthonous and allochthonous. The list also contains p-value matrices that show whether the entire species is adapted to the investigated phenotype. If write_data was set to TRUE and save_dir was provided, then distance matrices are written to the folder. For more information about the output, see https://dalimilbujdos.github.io/aurora/articles/outputs.html.

Examples


if (FALSE) { # \dontrun{
  data(pheno_mat_reuteri)
  data(bin_mat_reuteri)
  data(tree_reuteri)

  aurora_pheno(bin_mat = bin_mat,
               type_bin_mat = "panaroo",
               pheno_mat = pheno_mat,
               tree = tree,
               save_dir = "/path/to/my/dir/")
} # }