topiary.pipeline
Integrated pipelines that use topiary for ASR.
topiary.pipeline.alignment_to_ancestors
Pipeline that starts from an alignment, finds the best phylogenetic model, builds a maximum likelihood tree, reconciles this tree with the species tree, and then infers ancestral proteins.
- topiary.pipeline.alignment_to_ancestors.alignment_to_ancestors(df, out_dir=None, starting_tree=None, no_bootstrap=False, force_reconcile=False, force_no_reconcile=False, horizontal_transfer=False, alt_cutoff=0.25, model_matrices=['cpREV', 'Dayhoff', 'DCMut', 'DEN', 'Blosum62', 'FLU', 'HIVb', 'HIVw', 'JTT', 'JTT-DCMut', 'LG', 'mtART', 'mtMAM', 'mtREV', 'mtZOA', 'PMB', 'rtREV', 'stmtREV', 'VT', 'WAG'], model_rates=['', 'G8'], model_freqs=['', 'FC', 'FO'], model_invariant=['', 'IC', 'IO'], restart=False, overwrite=False, num_threads=-1, raxml_binary='raxml-ng', generax_binary='generax')
Given an alignment, find the best phylogenetic model, build a maximum- likelihood tree, reconcile this tree with the species tree, and then infer ancestral protein sequences. Reconciliation is not done for df that only have bacterial sequences. User can force reconciliation to happen or not, regardless of species in df, using force_reconcile and force_no_reconcile.
- Parameters:
df (pandas.DataFrame or str) – topiary data frame or csv written out from topiary df. This topiary dataframe should have an
alignment
column.out_dir (str, optional) – output directory. If not specified, create an output directory with the format “alignment-to-ancestors_{counter}” (where counter increments so previous directories are not overwritten).
starting_tree (str, optional) – tree in newick format. This will be used for the best model inference and starting tree for the maximum likelihood tree estimation. If not specified, the maximum parsimony tree is generated and used.
no_bootstrap (bool, default=False) – do not do bootstrap replicates
force_reconcile (bool, default=False) – reconcile gene and species trees
force_no_reconcile (bool, default=False) – do not reconcile gene and species trees
horizontal_transfer (bool, default=False) – whether to allow horizontal transfer during reconciliation. Default is to not allow transfer (UndatedDL). If set, use undated DTL.
alt_cutoff (float, default=0.25) – cutoff to use for altAll alternate ancestral protein sequence generation. Should be between 0 and 1.
model_matrices (list, default=["cpREV","Dayhoff","DCMut","DEN","Blosum62","FLU","HIVb","HIVw","JTT","JTT-DCMut","LG","mtART","mtMAM","mtREV","mtZOA","PMB","rtREV","stmtREV","VT","WAG"]) – list of model matrices to check. If calling from command line, these can be specified directly (
--model_matrices LG JTT ...
) or by specifying a file with models on each line (--model_matrices SOME_FILE
)model_rates (list, default=["","G8"]) – ways to treat model rates. If calling from command line, these can be specified directly (
--model_rates G8 ...
) or by specifying a file with rates on each line (--model_rates SOME_FILE
)model_freqs (list, default=["","FC","FO"]) – ways to treat model freqs. If calling from command line, these can be specified directly (
--model_freqs FC FO ...
) or by specifying a file with freqs on each line (--model_freqs SOME_FILE
)model_invariant (list, default=["","IC","IO"]) – ways to treat invariant alignment columns. If calling from command line, these can be specified directly (
--model_invariant IC IO ...
) or by specifying a file with invariants on each line (--model_invariant SOME_FILE
)restart (bool, default=False) – restart job from where it stopped in output directory. incompatible with overwrite
overwrite (bool, default=False) – whether or not to overwrite existing output. incompatible with restart
num_threads (int, default=-1) – number of threads to use. if -1 use all available
raxml_binary (str, optional) – raxml binary to use
generax_binary (str, optional) – what generax binary to use
topiary.pipeline.bootstrap_reconcile
Final step on the pipeline. Run replicates in embarrassingly parallel fashion across compute nodes using MPI.
- topiary.pipeline.bootstrap_reconcile.bootstrap_reconcile(previous_run_dir, num_threads, restart=False, overwrite=False, raxml_binary='raxml-ng', generax_binary='generax')
Perform a bootstrap branch support calculation for the gene/species tree reconciliation portion of the analysis.
- previous_run_dirstr
previous pipeline run directory. Should have a directory named xx_*bootstraps*, where xx is an integer and * are any value.
- num_threadsint
number of threads to use. GeneRax uses MPI for parallelization; num_threads correspond to the number of “slots” in MPI lingo. This job can be massively parallelized as there is no cross-talk between replicates, so feel free to span this across as many compute nodes as you like.
- restartbool, default=False
restart job from where it stopped in output directory. incompatible with overwrite
- overwritebool, default=False
whether or not to overwrite existing output. incompatible with restart. This will overwrite an existing 05_reconciliation-bootstraps directory, not the rest of the pipeline directory.
- threadsint, default=-1
number of threads to use. if -1 use all available
- raxml_binarystr, optional
raxml binary to use
- generax_binarystr, optional
what generax binary to use
topiary.pipeline.seed_to_alignment
Pipeline that takes a seed dataframe, BLASTS to find sequence hits, performs quality control, lowers alignment redundancy in a taxonomically informed fashion, and generates an alignment.
- topiary.pipeline.seed_to_alignment.seed_to_alignment(seed_df, out_dir=None, seqs_per_column=1, max_seq_number=500, redundancy_cutoff=0.9, worst_align_drop_fx=0.1, sparse_column_cutoff=0.8, align_trim=(0.05, 0.95), force_species_aware=False, force_not_species_aware=False, ncbi_blast_db=None, local_blast_db=None, blast_xml=None, move_mrca_up_by=2, local_recip_blast_db=None, min_call_prob=0.95, partition_temp=1, hitlist_size=5000, e_value_cutoff=0.001, gapcosts=(11, 1), num_ncbi_blast_threads=1, num_local_blast_threads=-1, restart=False, overwrite=False, keep_recip_blast_xml=False)
Pipeline that takes a seed dataframe, BLASTs to find sequence hits, performs quality control, lowers alignment redundancy in a taxonomically informed fashion, and generates an alignment.
- Parameters:
seed_df (pandas.DataFrame or str) – Spreadsheet with at least four columns: name, species, sequence, and aliases. Can either be a pandas.DataFrame or a string pointing to a spreadsheet file. For details on seed dataframes, see the [documentation](https://topiary-asr.readthedocs.io/en/latest/data_structures.html#seed-dataframe).
out_dir (str, optional) – Output directory. If not specified, create an output directory with the format “seed-to-alignment_{counter}” (where counter increments so previous directories are not overwritten).
seqs_per_column (float, default=1) – Aim to have this number of sequences per column in the key species sequences. (For example, if the key sequence is 100 amino acids long, seqs_per_column=1 would aim for 100 sequences; 2 would aim for 200 sequences).
max_seq_number (int, default=500) – Maximum number of sequences to get, regardless of seqs_per_column and key sequence length.
redundancy_cutoff (float, default=0.90) – Remove redundant sequences from closely related species is they have a sequence identity above redundancy_cutoff.
worst_align_drop_fx (float, default=0.1) – After alignment, drop approximately this fraction of the sequences, selecting those that have long insertions and are missing chunks of sequences
sparse_column_cutoff (float, default=0.80) – When checking alignment quality, a column is sparse if it has gaps in more than sparse_column_cutoff sequences.
align_trim (tuple, default=(0.05,0.95)) – When checking alignment quality, do not score the first and last parts of the alignment. Interpreted like a python list slice, but with percentages. (0.0,1.0) would not trim; (0.05,0,98) would trim the first 0.05 off the front and the last 0.02 off the back.
force_species_aware (bool, default=False) – Lower redundancy in a species-aware fashion, regardless of dataset type. By default, topiary will lower sequence redundancy in a species-aware fashion for non-microbial datasets, and in a non-species aware fashion for microbial datasets. If True, do species aware.
force_not_species_aware (bool, default=False) – Lower redundancy in a non-species-aware fashion, regardless of dataset type. By default, topiary will lower sequence redundancy in a species-aware fashion for non-microbial datasets, and in a non-species aware fashion for microbial datasets. If True, do not do species aware.
ncbi_blast_db (str, optional) – NCBI blast database to use. (If ncbi_blast_db, local_blast_db and blast_xml are all None, ncbi_blast_db is automatically set to “nr”).
local_blast_db (str, optional) – Local blast database to use.
blast_xml (str or list, optional) –
Previously generated blast xml files to load. This argument can be:
single xml file (str)
list of xml files (list of str)
directory (str). Code will grab all .xml files in the directory.
move_mrca_up_by (int, default=2) – When inferring the phylogenetic context from the seed dataframe, get the most recent common ancestor of the seed species, then find the taxonomic rank “move_mrca_up_by” levels above that ancestor. For example, if the key species all come from marsupials (Theria) and move_mrca_up_by == 2, the context will be Amniota (Theria -> Mammalia -> Amniota). Note: if all species in the dataset are bacteria or archaea, this argument is ignored and the context will be set to Bacteria or Archaea.
local_recip_blast_db (str, optional) – Local blast database to use for reciprocal blast. If None, construct a reciprocal blast database by downloading the proteomes of the key species from the ncbi.
min_call_prob (float, default=0.95) – Hits from all paralogs that yield a regular expression match to one of the aliases from the seed dataframe are weighted by their relative blast bit scores. Each paralog is assigned a relative probability. This cutoff is the minimum probability the best paralog match must have to result in a paralog call. Value should be between 0 and 1 (not inclusive), where increasing min_call_prob increases the stringency.
partition_temp (float, default=1) – When calculating posterior probability of the reciprocal blast paralog call, use this for weighting:
2^(bit_score/partition_temp)
.partition_temp
should be a float > 0. A higher value corresponds to a higher stringency. (The bit score difference between the best hit and the bit scores of other hits would have to be higher to be significant). This is a minium value. It may be adjusted automatically to avoid numerical problems in the calculation.hitlist_size (int, default=5000) – Download only the top
hitlist_size
hits in initial BLAST.e_value_cutoff (float, default=0.001) – Only take hits with
e_value
better thane_value_cutoff
in initial blastgapcost (tuple, default=(11,1)) – BLAST gapcosts (length 2 tuple of ints) in initial BLAST. The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. This BLAST search will charge a score
gapcost[0]
for the existence of a gap, and the score ofgapcost[1]
for each residue in the gap. Thus a gap ofk
residues receives a total score of-(gapcost[0] + gapcost[1]*k)
. The default ((11,1)
) gives scores of-(11 + 1*k) for each gap with :code:`k
residues.num_ncbi_blast_threads (int, default=1) – Number of threads to use for NCBI blast. -1 means use all available. (The default is 1; multithreading rarely speeds up remote BLAST).
num_local_blast_threads (int, default=-1) – Number of threads to use for local blast. -1 means all available.
restart (bool, default=False) – Restart the pipeline from where it stopped in
out_dir
. To use this option,out_dir
must be specified and point to an existing calculation. This option is incompatible withoverwrite = True
.overwrite (bool, default=False) – Overwrite
out_dir
if it already exists. This is incompatible withrestart = True
.keep_recip_blast_xml (bool, default=False) – Whether or not to keep raw blast xml files generated by the pipeline.
- Returns:
topiary_dataframe – Topiary dataframe with aligned, quality-controlled sequences.
- Return type:
pandas.DataFrame