topiary.io

Input/output functions for topiary.

topiary.io.alignments

Functions for reading and writing alignments to files.

topiary.io.alignments.read_fasta_into(df, fasta, load_into_column='alignment', unkeep_missing=True)

Load sequences from a fasta file into an existing topiary dataframe. This function expects the fasta file to have headers formatted like >uid|other stuff. It matches each uid in the fasta file to the uid in the topiary dataframe. If a uid from the fasta file is not in the dataframe, the function raises an error.

Parameters:
  • df (pandas.DataFrame) – topiary data frame

  • fasta (str) – a fasta file with headers formatted like >uid|other stuff

  • load_into_column (str, default="alignment") – what column in the dataframe to load the sequences into

  • unkeep_missing (bool, default=True) – set any sequences in the dataframe that are not in the fasta file to keep=False. This allows the user to delete sequences from the alignment and have that reflected in the dataframe.

Returns:

topiary dataframe with sequences now in load_into_column

Return type:

pandas.DataFrame
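
The header convention and the unkeep_missing behavior described above can be illustrated with a small standard-library sketch. This is not topiary's implementation: parse_fasta_uids and load_into_rows are hypothetical helpers, the table is modeled as a plain dict keyed by uid instead of a pandas dataframe, and the uids and sequences are placeholders.

```python
def parse_fasta_uids(fasta_text):
    """Return {uid: sequence} from fasta text with '>uid|other stuff' headers."""
    seqs = {}
    uid = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Everything before the first '|' is treated as the uid.
            uid = line[1:].split("|")[0].strip()
            seqs[uid] = ""
        elif uid is not None:
            # Sequence lines are concatenated until the next header.
            seqs[uid] += line
    return seqs


def load_into_rows(rows, fasta_text, column="alignment", unkeep_missing=True):
    """Hypothetical stand-in for loading sequences into a uid-keyed table."""
    seqs = parse_fasta_uids(fasta_text)
    for uid in seqs:
        if uid not in rows:
            raise ValueError(f"uid '{uid}' is not in the table")
    for uid, row in rows.items():
        if uid in seqs:
            row[column] = seqs[uid]
        elif unkeep_missing:
            # The sequence was removed from the fasta file: flag it as dropped.
            row["keep"] = False
    return rows


rows = {
    "aaaaaaaaaa": {"keep": True},
    "bbbbbbbbbb": {"keep": True},
}
fasta = ">aaaaaaaaaa|Homo sapiens|LY96\nMK-LV\nAA--C\n"
result = load_into_rows(rows, fasta)
```

In this sketch the second row is absent from the fasta text, so it ends up with keep set to False while the first row gains an alignment entry.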

topiary.io.alignments.write_fasta(df, out_file, seq_column=None, label_columns=['species', 'name'], write_only_keepers=True, empty_char='X-?', clean_sequence=False, overwrite=False, sort_on_taxa=False)

Write a fasta file from a dataframe.

Parameters:
  • df (pandas.DataFrame) – data frame to write out

  • out_file (str) – output file

  • seq_column (str, optional) – column in data frame to use as sequence. If not specified, use “alignment” (if present) or “sequence”

  • label_columns (list, default=["species","name"]) – list of columns to use for sequence labels

  • write_only_keepers (bool, default=True) – whether or not to write only seq with keep == True

  • empty_char (str or None, default="X-?") – string of characters that count as empty. A sequence consisting only of these characters is not written out. To disable the check, set empty_char=None.

  • clean_sequence (bool, default=False) – replace any non-aa characters with “-”

  • overwrite (bool, default=False) – whether or not to overwrite an existing file

  • sort_on_taxa (bool, default=False) – sort output taxonomically if possible. This will sort (in order of preference) by recip_paralog, nickname, and then name. Once sorted by protein, the species will then be sorted based on their taxonomic separation, starting with the first key_species in the dataframe.
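
The empty_char check described above can be sketched as follows. This is an illustration under stated assumptions, not topiary's code: is_only_empty is a hypothetical helper and the records are placeholders.

```python
def is_only_empty(sequence, empty_char="X-?"):
    """Return True if every character in sequence is an empty character."""
    if empty_char is None:
        return False  # check disabled, so nothing is treated as empty
    # Note: a zero-length sequence also counts as empty here.
    return all(c in empty_char for c in sequence)


# Placeholder records: (uid, aligned sequence).
records = [
    ("aaaaaaaaaa", "MK-LV"),
    ("bbbbbbbbbb", "X--??"),
    ("cccccccccc", "-----"),
]

# With the default empty_char, only the first record would be written.
written = [uid for uid, seq in records if not is_only_empty(seq)]

# With empty_char=None the check is skipped and every record is kept.
written_unchecked = [uid for uid, seq in records if not is_only_empty(seq, None)]
```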

topiary.io.alignments.write_phy(df, out_file, seq_column='alignment', write_only_keepers=True, empty_char='X-?', clean_sequence=False, overwrite=False)

Write a .phy file from a dataframe. Uses the uid as the sequence name. All sequences must have the same length.

Parameters:
  • df (pandas.DataFrame) – data frame to write out

  • out_file (str) – output file

  • seq_column (str, default="alignment") – column in data frame to use as sequence

  • label_columns (list, default=["species","name"]) – list of columns to use for sequence labels

  • write_only_keepers (bool, default=True) – whether or not to write only seq with keep == True

  • empty_char (str or None, default="X-?") – string of characters that count as empty. A sequence consisting only of these characters is not written out. To disable the check, set empty_char=None.

  • clean_sequence (bool, default=False) – replace any non-aa characters with “-”

  • overwrite (bool, default=False) – whether or not to overwrite an existing file

Returns:

Return type:

None
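
The equal-length requirement can be illustrated with a generic sequential PHYLIP writer. This sketch assumes the common layout of a header line with the sequence count and column count followed by one "name sequence" line per entry; the exact text topiary writes may differ, and to_phylip is a hypothetical helper.

```python
def to_phylip(seqs):
    """Render {name: sequence} as sequential PHYLIP text.

    Raises ValueError unless all sequences share a single length.
    """
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_columns = lengths.pop()

    # Header: number of sequences, then number of alignment columns.
    lines = [f"{len(seqs)} {n_columns}"]
    for name, seq in seqs.items():
        lines.append(f"{name} {seq}")
    return "\n".join(lines) + "\n"


# Placeholder uids and sequences.
phy_text = to_phylip({"aaaaaaaaaa": "MK-LV", "bbbbbbbbbb": "MKALV"})
```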

topiary.io.dataframe

Functions for reading and writing dataframes.

topiary.io.dataframe.read_dataframe(input, remove_extra_index=True)

Read a topiary spreadsheet. Handles .csv, .tsv, .xlsx/.xls. If extension is not one of these, attempts to parse text as a spreadsheet using pandas.read_csv(sep=None).

Parameters:
  • input (pandas.DataFrame or str) – either a pandas dataframe OR the filename to read in.

  • remove_extra_index (bool, default=True) – look for the ‘Unnamed: 0’ column that pandas writes out for pandas.to_csv(index=True) and, if found, drop column.

Returns:

validated topiary dataframe

Return type:

pandas.DataFrame

topiary.io.dataframe.write_dataframe(df, out_file, overwrite=False)

Write a dataframe to an output file. The type of file written depends on the extension of out_file. If .csv, write comma-separated. If .tsv, write tab-separated. If .xlsx, write excel. Otherwise, write as a .csv file.

Parameters:
  • df (pandas.DataFrame) – topiary dataframe

  • out_file (str) – output file name

  • overwrite (bool, default=False) – whether or not to overwrite an existing file

Returns:

Return type:

None
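
The extension rule described for write_dataframe amounts to a small dispatch. The sketch below only picks a format name; pick_output_format is a hypothetical helper, and the case-insensitive comparison is an assumption that the description above does not confirm.

```python
import os


def pick_output_format(out_file):
    """Map an output file name to 'csv', 'tsv', or 'xlsx'."""
    # Assumption: the extension is compared case-insensitively.
    ext = os.path.splitext(out_file)[1].lower()
    if ext == ".tsv":
        return "tsv"
    if ext == ".xlsx":
        return "xlsx"
    # .csv and every unrecognized extension fall back to comma-separated.
    return "csv"


formats = {name: pick_output_format(name)
           for name in ("out.csv", "out.tsv", "out.xlsx", "out.txt")}
```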

topiary.io.paralog_patterns

Convert a paralog patterns dictionary with lists of patterns as values into a dictionary with regex as values.

topiary.io.paralog_patterns.load_paralog_patterns(alias_dict, spacers=[' ', '-', '_', '.'], ignorecase=True, re_flags=None)

Build regex to look for aliases when assigning protein names from raw NCBI description strings and/or doing reciprocal blast. The alias_dict can either have a list of strings as values or a pre-compiled regular expression for each value. If a pre-compiled expression, that expression is used as-is. Otherwise, the list of strings is compiled into a regular expression.

Parameters:
  • alias_dict (dict) – dictionary keying protein names to either a list of aliases (as strings) OR a pre-compiled regular expression.

  • spacers (list, default=[" ","-","_","."]) – list of characters to recognize as spacers

  • re_flags (list, optional) – regular expression flags to pass to compile. None or a list of flags. Note that “ignorecase” takes precedence over re_flags.

  • ignorecase (bool, default=True) – when compiling regex, whether or not to ignore case

Returns:

paralog_patterns – dictionary of compiled regular expressions to use to try to match paralogs. Keys are paralog names; values are regular expressions.

Return type:

dict
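
One way to turn a list of aliases into a spacer-tolerant regular expression is sketched below. It assumes that a spacer inside an alias should match any run of the spacer characters, including none; that is an illustrative choice, not necessarily the pattern topiary builds. build_alias_regex and load_patterns are hypothetical names.

```python
import re


def build_alias_regex(aliases, spacers=(" ", "-", "_", "."), ignorecase=True):
    """Compile one regex that matches any alias, tolerating spacer variants."""
    # Character class that matches any single spacer character.
    spacer_class = "[" + "".join(re.escape(s) for s in spacers) + "]"
    alternatives = []
    for alias in aliases:
        # Split on spacers, escape each piece, then rejoin so that any run
        # of spacers (or none) is accepted where the alias had one.
        pieces = [re.escape(p) for p in re.split(spacer_class, alias) if p]
        if pieces:
            alternatives.append((spacer_class + "*").join(pieces))
    flags = re.IGNORECASE if ignorecase else 0
    return re.compile("|".join(alternatives), flags)


def load_patterns(alias_dict, **kwargs):
    """Pass pre-compiled patterns through; compile lists of aliases."""
    patterns = {}
    for name, value in alias_dict.items():
        if isinstance(value, re.Pattern):
            patterns[name] = value  # used as-is
        else:
            patterns[name] = build_alias_regex(value, **kwargs)
    return patterns


patterns = load_patterns({
    "LY96": ["LY96", "MD-2", "ESOP1"],
    "LY86": re.compile("LY86|MD1"),
})
hit = patterns["LY96"].search("lymphocyte antigen 96 (md 2) [Homo sapiens]")
```

Here the alias "MD-2" also matches "md 2", "MD_2", and "md2" because the spacer position accepts any spacer or none and case is ignored by default.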

topiary.io.seed

Functions for working with seed dataframes.

topiary.io.seed.df_from_seed(seed_df, ncbi_blast_db='nr', local_blast_db=None, blast_xml=None, move_mrca_up_by=2, species_aware=None, hitlist_size=5000, e_value_cutoff=0.001, gapcosts=(11, 1), num_ncbi_blast_threads=1, num_local_blast_threads=-1, keep_blast_xml=False, **kwargs)

Construct a topiary dataframe from a seed dataframe, blasting to fill in the sequences. This can blast an NCBI database, local database, and/or read in previously-run blast xml files.

Parameters:
  • seed_df (pandas.DataFrame or str) – seed dataframe containing seed sequences to launch the analysis. seed_df can be a pandas dataframe or a string pointing to a spreadsheet file.

  • ncbi_blast_db (str or None, default="nr") – NCBI blast database to use.

  • local_blast_db (str or None, default=None) – Local blast database to use.

  • blast_xml (str or list, optional) –

    previously generated blast xml files to load. This argument can be:

    • single xml file (str)

    • list of xml files (list of str)

    • directory (str). Code will grab all .xml files in the directory.

  • move_mrca_up_by (int, default=2) – when inferring the phylogenetic context from the seed dataframe, get the most recent common ancestor of the seed species, then find the taxonomic rank “move_mrca_up_by” levels above that ancestor. For example, if the key species all come from marsupials (Theria) and move_mrca_up_by == 2, the context will be Amniota (Theria -> Mammalia -> Amniota). Note: if the seed dataframe consists entirely of Bacterial or Archaeal sequences, the mrca will be set to the appropriate domain, not a local species ancestor.

  • species_aware (bool or None, optional) – If True, do analysis in species-aware fashion; if False, ignore species; if None, infer this from the dataset. (Microbial datasets will be False; non-microbial datasets will be True.)

  • hitlist_size (int, default=5000) – download only the top hitlist_size hits

  • e_value_cutoff (float, default=0.001) – only take hits with e_value better than e_value_cutoff

  • gapcosts (tuple, default=(11,1)) – BLAST gapcosts (length 2 tuple of ints)

  • num_ncbi_blast_threads (int, default=1) – number of threads to use for NCBI blast. -1 means use all available. (Multithreading rarely speeds up remote BLAST).

  • num_local_blast_threads (int, default=-1) – number of threads to use for local blast. -1 means all available.

  • keep_blast_xml (bool, default=False) – whether or not to keep raw blast xml output

  • **kwargs (dict, optional) – extra keyword arguments are passed directly to biopython NcbiblastXXXCommandline (for local blast) or qblast (for remote blast). These take precedence over anything specified above (hitlist_size, for example).

Returns:

  • topiary_dataframe (pandas.DataFrame) – topiary dataframe with sequences found from seed sequence.

  • key_species (numpy.array) – list of key species to keep during the analysis

  • paralog_patterns (list) – list of compiled regular expressions to use to try to match paralogs

  • species_aware (bool) – whether or not this should be treated as a species aware calculation

Notes

Every sequence in the original seed dataframe will have always_keep set to True, so they will not be deleted by subsequent quality control steps.
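
Two of the parameters above lend themselves to a short sketch: the three accepted forms of blast_xml and the move_mrca_up_by walk. Both helpers are hypothetical and written only for illustration. The lineage is the simplified one from the parameter description, and clamping at the top of the lineage is an assumption.

```python
import glob
import os


def resolve_blast_xml(blast_xml):
    """Expand the accepted forms of blast_xml into a list of file paths."""
    if blast_xml is None:
        return []
    if isinstance(blast_xml, str):
        if os.path.isdir(blast_xml):
            # Directory: take every .xml file inside it.
            return sorted(glob.glob(os.path.join(blast_xml, "*.xml")))
        return [blast_xml]  # single xml file
    return list(blast_xml)  # list of xml files


def move_mrca_up(lineage, mrca, move_mrca_up_by=2):
    """Return the rank move_mrca_up_by levels above mrca.

    lineage is ordered from most specific to most general.
    """
    index = lineage.index(mrca) + move_mrca_up_by
    # Assumption: stop at the top of the lineage instead of raising.
    return lineage[min(index, len(lineage) - 1)]


# Simplified lineage taken from the example in the parameter description.
lineage = ["Theria", "Mammalia", "Amniota"]
context = move_mrca_up(lineage, "Theria", move_mrca_up_by=2)
```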

topiary.io.seed.read_seed(df, species_aware=None)

Read a seed data frame and extract alias patterns and key species.

Parameters:
  • df (pandas.DataFrame or str) – seed dataframe containing seed sequences to launch the analysis. df can be a pandas dataframe or a string pointing to a spreadsheet file.

  • species_aware (bool or None, default=None) – whether or not to read the seed in a species-aware fashion. If True, require that all species in the seed dataset be resolvable. If False, do not require this. If None, choose automatically (False for microbial datasets, True otherwise).

Returns:

  • topiary_dataframe (pandas.DataFrame) – new topiary dataframe built from the seed dataframe

  • key_species (numpy.array) – list of key species to keep during the analysis

  • paralog_patterns (list) – list of compiled regular expressions to use to try to match paralogs

  • species_aware (bool) – whether or not this should be treated as a species aware calculation

Notes

The seed dataframe is expected to have at least four columns:

  • species: species names for seed sequences in binomial format (e.g. Homo sapiens or Mus musculus)

  • name: name of each sequence (e.g. LY96)

  • aliases: other names for this sequence found in different databases/species, separated by ; (e.g. LY96;MD2;ESOP1)

  • sequence: amino acid sequences for these proteins.

It may have one other optional column:

  • key_species: True/False. Indicates whether or not this species should be used as a key species for reciprocal BLASTing.

Other columns in the dataframe are kept but not used by topiary.
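
A seed table with the columns listed above might look like the CSV text in this sketch, which also shows how the semicolon-separated aliases and the key_species flags could be unpacked. It uses only the standard library; read_seed_rows is a hypothetical helper, the sequences are short placeholders and not real LY96 sequences, and the handling of True/False text is an assumption.

```python
import csv
import io

REQUIRED_COLUMNS = ("species", "name", "aliases", "sequence")

# Placeholder seed table; the sequences are not real protein sequences.
seed_csv = """species,name,aliases,sequence,key_species
Homo sapiens,LY96,LY96;MD2;ESOP1,MKLVAA,True
Mus musculus,LY96,LY96;MD2,MKLVSA,False
"""


def read_seed_rows(text):
    """Parse seed CSV text and check that the required columns exist."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"seed table is missing columns: {missing}")
    return list(reader)


rows = read_seed_rows(seed_csv)

# Collect the aliases for each protein name, splitting on ';'.
alias_dict = {}
for row in rows:
    aliases = [a.strip() for a in row["aliases"].split(";") if a.strip()]
    alias_dict.setdefault(row["name"], set()).update(aliases)
alias_dict = {name: sorted(values) for name, values in alias_dict.items()}

# Species flagged as key species (only if the optional column is present).
key_species = [row["species"] for row in rows
               if row.get("key_species", "").strip().lower() == "true"]
```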

topiary.io.tree

Load a tree into an ete3 tree data structure.

topiary.io.tree.load_trees(directory=None, prefix=None, T_clean=None, T_support=None, T_anc_label=None, T_anc_pp=None, T_event=None)

Generate an ete3 tree with features ‘event’, ‘anc_pp’, ‘anc_label’, and ‘bs_support’ on internal nodes. This information is read from the input ete3 trees or the specified topiary output directory. The tree is rooted using T_event. If this tree is not specified, the midpoint root is used. Trees are read from the directory first, followed by any ete3 trees specified as arguments. (This allows the user to override trees from the directory if desired). If no trees are passed in, returns None.

Warning: this will modify input ete3 trees as it works on the trees rather than copies.

Parameters:
  • directory (str) – output directory from a topiary calculation that has .newick files in it. Function will load all trees in that directory.

  • prefix (str, optional) – what type of trees to load from the directory. Should be “reconciled” or “gene”. If None, looks for reconciled trees; if any are found, prefix is set to “reconciled”.

  • T_clean (ete3.Tree, optional) – clean tree (leaf labels and branch lengths, nothing else). Stored as {}-tree.newick in output directories.

  • T_support (ete3.Tree, optional) – support tree (leaf labels, branch lengths, supports). Stored as {}-tree_supports.newick in output directories.

  • T_anc_label (ete3.Tree, optional) – ancestor label tree (leaf labels, branch lengths, internal names) Stored as {}-tree_anc-label.newick.

  • T_anc_pp (ete3.Tree, optional) – ancestor posterior probability tree (leaf labels, branch lengths, posterior probabilities as supports) Stored as {}-tree_anc-pp.newick.

  • T_event (ete3.Tree, optional) – tree with reconciliation events as internal labels (leaf labels, branch lengths, event labels). Stored as reconciled-tree_events.newick

Returns:

merged_tree – rooted tree with features on internal nodes. Return None if no trees are passed in.

Return type:

ete3.Tree or None
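
The file name templates listed in the parameters can be expanded for a given prefix as in the sketch below. It assumes the {} placeholder is filled with the prefix ("reconciled" or "gene"); expected_tree_files is a hypothetical helper and the directory name is a placeholder.

```python
import os

# Templates copied from the parameter descriptions above.
TREE_TEMPLATES = {
    "T_clean": "{}-tree.newick",
    "T_support": "{}-tree_supports.newick",
    "T_anc_label": "{}-tree_anc-label.newick",
    "T_anc_pp": "{}-tree_anc-pp.newick",
}
EVENT_FILE = "reconciled-tree_events.newick"


def expected_tree_files(directory, prefix):
    """Return {argument name: path} for the trees a directory may hold."""
    files = {key: os.path.join(directory, template.format(prefix))
             for key, template in TREE_TEMPLATES.items()}
    # The event tree name is fixed; it does not depend on the prefix.
    files["T_event"] = os.path.join(directory, EVENT_FILE)
    return files


files = expected_tree_files("output_dir", "gene")
names = {key: os.path.basename(path) for key, path in files.items()}
```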

topiary.io.tree.read_tree(tree, fmt=None)

Load a tree into an ete3 tree data structure.

Parameters:
  • tree (ete3.Tree or dendropy.Tree or str) – the tree to load. Can be an ete3.Tree (returns self), a dendropy Tree (converts to newick and drops root), a newick file, or a newick string.

  • fmt (int or None) – format for reading tree from newick. 0-9 or 100. (See Notes for what these mean). If fmt is None, try to parse without a format descriptor, then these formats in numerical order.

Returns:

tree – an ete3 tree object.

Return type:

ete3.Tree

Notes

fmt number is read directly by ete3. See their documentation for how these are read (http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#reading-and-writing-newick-trees). As of ETE3.1.1, these numbers mean:

  • 0: flexible with support values

  • 1: flexible with internal node names

  • 2: all branches + leaf names + internal supports

  • 3: all branches + all names

  • 4: leaf branches + leaf names

  • 5: internal and leaf branches + leaf names

  • 6: internal branches + leaf names

  • 7: leaf branches + all names

  • 8: all names

  • 9: leaf names

  • 100: topology only
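
The fallback order described for fmt=None (no format descriptor first, then the formats in numerical order) can be sketched with a generic parser callable. parse_with_fallback is a hypothetical helper; with ete3 installed, the parse argument could be something like lambda s, f=0: ete3.Tree(s, format=f). The fake_parse function below is a stand-in so the sketch runs without ete3.

```python
NEWICK_FORMATS = list(range(10)) + [100]


def parse_with_fallback(parse, newick, fmt=None):
    """Try parse(newick) with no format, then each known format in order."""
    if fmt is not None:
        return parse(newick, fmt)

    last_error = None
    for candidate in [None] + NEWICK_FORMATS:
        try:
            if candidate is None:
                return parse(newick)  # no format descriptor
            return parse(newick, candidate)
        except Exception as err:
            last_error = err  # remember the failure and try the next format
    raise ValueError("could not parse tree with any format") from last_error


def fake_parse(newick, fmt=0):
    """Stand-in parser that only accepts format 1."""
    if fmt != 1:
        raise ValueError(f"format {fmt} does not fit this tree")
    return ("tree", newick, fmt)


parsed = parse_with_fallback(fake_parse, "(A,B)anc1;")
```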

topiary.io.tree.write_trees(T, name_dict=None, out_file=None, overwrite=False, anc_pp=True, anc_label=True, bs_support=True, event=True)

Write out an ete3.Tree in newick format. This function looks for features set by load_trees and then writes out an individual tree for each feature. The features are anc_pp, anc_label, bs_support, and event. This will write out trees for any of these features present; not all features need to be in place for this function to work.

Parameters:
  • T (ete3.TreeNode) – ete3 tree with information loaded into appropriate features. This is the tree returned by load_trees.

  • name_dict (dict, optional) – dictionary mapping strings in node.name to more useful names. (Can be generated using topiary.draw.core.create_name_dict). If not specified, trees are written out with uid as tip names

  • out_file (str, optional) – output file. If defined, write the newick string to the file.

  • overwrite (bool, default=False) – whether or not to overwrite an existing file

  • anc_pp (bool, default=True) – whether or not to write a tree with anc_pp as support values

  • anc_label (bool, default=True) – whether or not to write a tree with anc_label as internal node names

  • bs_support (bool, default=True) – whether or not to write a tree with bs_support as support values

  • event (bool, default=True) – whether or not to write a tree with events as internal node names

Returns:

tree – Newick string representation of the output tree(s)

Return type:

str