topiary

Python framework for doing ancestral sequence reconstruction

Ancestral sequence reconstruction (ASR) is a powerful method to study protein evolution. It requires constructing a multiple sequence alignment, running a pipeline built from several software packages, and converting between arcane file types. Topiary streamlines this process, simplifying the workflow and helping non-experts do best-practice ASR.

Features

  • Automatic. Performs sequence database construction, quality control, multiple sequence alignment, tree construction, gene/species tree reconciliation, and ancestral reconstruction with minimal user input.

  • Human-oriented. Users prepare their input as spreadsheets, not complicated text files. Outputs are spreadsheets and graphical summaries of ancestor quality.

  • Species aware. Integrates with the Open Tree of Life database, improving selection of sequences and tree/ancestor inference.

  • Flexible. Use as a command line program or do custom analyses and plotting using the topiary API in a Jupyter notebook or Python script.

  • Modern. Topiary is built around a collection of modern, actively supported phylogenetic software tools: OpenTree, muscle 5, RAxML-NG, GeneRax, PastML, and toytree.

Steps automated by topiary

Try it out on Google Colab

Workflow

Topiary automates the computational steps of an ASR calculation, allowing the user to focus on the three steps that require human insight: defining the problem, validating the alignment, and characterizing the resulting ancestors. The graphic below shows the steps done by the user (brain icons) versus software (tree icons) in a topiary calculation.

Topiary workflow

Example input/output

User input to a topiary calculation

A user prepares a “seed dataframe” setting the scope for the calculation.

| name | aliases | species | sequence |
|------|---------|---------|----------|
| LY96 | ESOP1;Myeloid Differentiation Protein-2;MD-2;lymphocyte antigen 96;LY-96 | Homo sapiens | MLPFLFF… |
| LY96 | ESOP1;Myeloid Differentiation Protein-2;MD-2;lymphocyte antigen 96;LY-96 | Danio rerio | MALWCPS… |
| LY86 | Lymphocyte Antigen 86;LY86;Myeloid Differentiation Protein-1;MD-1;RP105-associated 3;MMD-1 | Homo sapiens | MKGFTAT… |
| LY86 | Lymphocyte Antigen 86;LY86;Myeloid Differentiation Protein-1;MD-1;RP105-associated 3;MMD-1 | Danio rerio | MKTYFNM… |
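
The seed table above can also be generated from a script rather than typed into a spreadsheet. The following is a minimal sketch using only the Python standard library. The column names and values come from the table; the helper name `write_seed_csv`, the output file name `seed.csv`, and the assumption that topiary will accept a CSV file in place of `seed.xlsx` are illustrative and should be checked against the protocol page. The sequences are left truncated exactly as shown above and must be replaced with full-length sequences before a real run.

```python
import csv

# Column names taken from the seed table above.
SEED_COLUMNS = ["name", "aliases", "species", "sequence"]

LY96_ALIASES = "ESOP1;Myeloid Differentiation Protein-2;MD-2;lymphocyte antigen 96;LY-96"
LY86_ALIASES = ("Lymphocyte Antigen 86;LY86;Myeloid Differentiation Protein-1;"
                "MD-1;RP105-associated 3;MMD-1")

# Sequences are truncated ("…") as in the table; they are placeholders,
# not usable protein sequences.
SEED_ROWS = [
    {"name": "LY96", "aliases": LY96_ALIASES, "species": "Homo sapiens", "sequence": "MLPFLFF…"},
    {"name": "LY96", "aliases": LY96_ALIASES, "species": "Danio rerio", "sequence": "MALWCPS…"},
    {"name": "LY86", "aliases": LY86_ALIASES, "species": "Homo sapiens", "sequence": "MKGFTAT…"},
    {"name": "LY86", "aliases": LY86_ALIASES, "species": "Danio rerio", "sequence": "MKTYFNM…"},
]


def write_seed_csv(path, rows=SEED_ROWS):
    """Write seed rows to a CSV file (hypothetical helper); return the row count."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=SEED_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    count = write_seed_csv("seed.csv")
    print(f"wrote {count} seed rows to seed.csv")
```

Each row pairs one protein name with one species, so a seed dataset covering two paralogs in two species needs four rows.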

Final output from a small example topiary calculation

After running the pipeline, topiary returns a shareable directory with an HTML summary of all results.

Animation showing topiary final report

Installation

See the installation page.

Short protocol

For a more detailed protocol, see the protocol page.

  1. Create a seed spreadsheet with a handful of sequences that define the scope of the ASR study. For examples, see the table above or download the full example.

  2. Construct a multiple sequence alignment from the seed spreadsheet (“seed.xlsx”, for example). This can be run on a local computer or a cluster.

    topiary-seed-to-alignment seed.xlsx --out_dir output
    
  3. If desired, visually inspect and edit the alignment in an external alignment viewer. (We recommend aliview.) Load the edited alignment into a topiary dataframe.

    topiary-load-fasta-into output/dataframe.csv edited_fasta final-dataframe.csv
    
  4. Build a species-reconciled phylogenetic tree and infer ancestral sequences. This is usually run on a cluster.

    topiary-alignment-to-ancestors final-dataframe.csv --out_dir ali_to_anc
    
  5. Generate bootstrap replicates to measure branch supports. This is usually run on a cluster.

    topiary-bootstrap-reconcile ali_to_anc num_threads
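
The four commands in the protocol above can be collected in a small driver script, for example to log exactly what was run. This is a minimal sketch, not part of topiary. The command names and arguments are copied from the steps above; the function names `build_protocol_commands` and `run_protocol` and the default thread count are hypothetical. It defaults to a dry run that only prints the commands, because alignment inspection (step 3) is manual and the later steps are usually submitted to a cluster rather than run back to back.

```python
import shlex
import subprocess


def build_protocol_commands(seed="seed.xlsx", edited_fasta="edited_fasta", num_threads=1):
    """Return the short-protocol commands as argument lists (hypothetical helper)."""
    return [
        # Step 2: seed spreadsheet -> multiple sequence alignment
        ["topiary-seed-to-alignment", seed, "--out_dir", "output"],
        # Step 3: load the manually edited alignment into a topiary dataframe
        ["topiary-load-fasta-into", "output/dataframe.csv", edited_fasta, "final-dataframe.csv"],
        # Step 4: reconciled tree and ancestral sequences
        ["topiary-alignment-to-ancestors", "final-dataframe.csv", "--out_dir", "ali_to_anc"],
        # Step 5: bootstrap replicates for branch supports
        ["topiary-bootstrap-reconcile", "ali_to_anc", str(num_threads)],
    ]


def run_protocol(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the loop at the first failing command
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_protocol(build_protocol_commands(num_threads=4))
```

Passing `dry_run=False` would execute the commands in order, which only makes sense once topiary is installed and the edited alignment already exists.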
    

How to cite

If you use topiary in your research, please cite:

Orlandi KN*, Phillips SR*, Sailer ZR, Harman JL, Harms MJ. “Topiary: pruning the manual labor from ancestral sequence reconstruction” (2022) Protein Science 10.1002/pro.4551.

* Authors contributed equally

Please also cite the tools that topiary is built around: OpenTree, muscle 5, RAxML-NG, GeneRax, PastML, and toytree.

API and data structures

Topiary can also be used as an API to organize general phylogenetic workflows. It uses pandas dataframes to manage phylogenetic data, so it connects readily to other data science pipelines. Topiary also provides programmatic access to RAxML-NG, GeneRax, muscle 5, and BLAST (local and remote), and it wraps portions of the OpenTree and PastML Python APIs for convenient interaction with topiary pandas dataframes.

You can see examples of the topiary API in action in the Jupyter notebooks in the topiary-examples GitHub repo. For a detailed description of the data structures and API, see the Data Structures and API pages.
