topiary#

Python framework for doing ancestral sequence reconstruction

Ancestral sequence reconstruction (ASR) is a powerful method to study protein evolution. It requires constructing a multiple sequence alignment, running a pipeline built from several software packages, and converting between arcane file types. Topiary streamlines this process, simplifying the workflow and helping non-experts do best-practice ASR.

Features#

Automatic. Performs sequence database construction, quality control, multiple sequence alignment, tree construction, gene/species tree reconciliation, and ancestral reconstruction with minimal user input.
Human-oriented. Users prepare their input as spreadsheets, not complicated text files. Outputs are spreadsheets and graphical summaries of ancestor quality.
Species aware. Integrates with the Open Tree of Life database, improving selection of sequences and tree/ancestor inference.
Flexible. Use as a command line program or do custom analyses and plotting using the topiary API in a Jupyter notebook or Python script.
Modern. Topiary is built around a collection of modern, actively supported phylogenetic software tools: OpenTree, muscle 5, RAxML-NG, GeneRax, PastML, and toytree.

Steps done by topiary

Try it out on Google Colab#

Workflow#

Topiary automates the computational steps of an ASR calculation, allowing the user to focus on the three steps that require human insight: defining the problem, validating the alignment, and characterizing the resulting ancestors. The graphic below shows the steps done by the user (brain icons) versus software (tree icons) in a topiary calculation.

Example input/output#

User input to a topiary calculation

A user prepares a “seed dataframe” setting the scope for the calculation.

| name | aliases | species | sequence |
| --- | --- | --- | --- |
| LY96 | ESOP1;Myeloid Differentiation Protein-2;MD-2;lymphocyte antigen 96;LY-96 | Homo sapiens | MLPFLFF… |
| LY96 | ESOP1;Myeloid Differentiation Protein-2;MD-2;lymphocyte antigen 96;LY-96 | Danio rerio | MALWCPS… |
| LY86 | Lymphocyte Antigen 86;LY86;Myeloid Differentiation Protein-1;MD-1;RP105-associated 3;MMD-1 | Homo sapiens | MKGFTAT… |
| LY86 | Lymphocyte Antigen 86;LY86;Myeloid Differentiation Protein-1;MD-1;RP105-associated 3;MMD-1 | Danio rerio | MKTYFNM… |
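
As a minimal sketch, the seed table above can be assembled and sanity-checked with only the Python standard library before it is handed to topiary. The four column names and the semicolon-separated aliases come from the table; the helper names `check_seed_rows`, `split_aliases`, and `write_seed_csv`, the CSV output, and the specific checks are illustrative assumptions rather than topiary's own validation. The sequences are left truncated exactly as shown.

```python
import csv
import io

# Column names taken from the seed table above.
SEED_COLUMNS = ["name", "aliases", "species", "sequence"]

# Aliases copied from the table; they are shared by both species of each protein.
LY96_ALIASES = "ESOP1;Myeloid Differentiation Protein-2;MD-2;lymphocyte antigen 96;LY-96"
LY86_ALIASES = ("Lymphocyte Antigen 86;LY86;Myeloid Differentiation Protein-1;"
                "MD-1;RP105-associated 3;MMD-1")

# Rows copied from the table. Sequences stay truncated ("…") as in the source.
SEED_ROWS = [
    {"name": "LY96", "aliases": LY96_ALIASES, "species": "Homo sapiens", "sequence": "MLPFLFF…"},
    {"name": "LY96", "aliases": LY96_ALIASES, "species": "Danio rerio", "sequence": "MALWCPS…"},
    {"name": "LY86", "aliases": LY86_ALIASES, "species": "Homo sapiens", "sequence": "MKGFTAT…"},
    {"name": "LY86", "aliases": LY86_ALIASES, "species": "Danio rerio", "sequence": "MKTYFNM…"},
]


def check_seed_rows(rows):
    """Hypothetical check: list every empty or missing cell (empty list if none)."""
    problems = []
    for i, row in enumerate(rows):
        for column in SEED_COLUMNS:
            if not row.get(column, "").strip():
                problems.append(f"row {i}: missing value for '{column}'")
    return problems


def split_aliases(row):
    """Split the semicolon-separated aliases cell into a list of names."""
    return [alias.strip() for alias in row["aliases"].split(";") if alias.strip()]


def write_seed_csv(rows, handle):
    """Write the seed rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=SEED_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```

For example, `write_seed_csv(SEED_ROWS, io.StringIO())` writes a header line followed by one line per seed sequence. Whether a given topiary version reads CSV as well as spreadsheet files is not stated on this page, so check the protocol page before relying on it.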

Final output from a small example topiary calculation

After running the pipeline, topiary returns a shareable directory with an html summary of all results.

Installation#

See the installation page.

Short protocol#

For a more detailed protocol, see the protocol page.

Create a seed spreadsheet with a handful of sequences that define the scope of the ASR study. For examples, see the table above or download the full example.
Construct a multiple sequence alignment from the seed spreadsheet (“seed.xlsx”, for example). This can be run on a local computer or a cluster.
```
topiary-seed-to-alignment seed.xlsx --out_dir output
```
If desired, visually inspect and edit the alignment in an external alignment viewer. (We recommend AliView.) Load the edited alignment into a topiary dataframe.
```
topiary-load-fasta-into output/dataframe.csv edited_fasta final-dataframe.csv
```
Build a species-reconciled phylogenetic tree and infer ancestral sequences. This is usually run on a cluster.
```
topiary-alignment-to-ancestors final-dataframe.csv --out_dir ali_to_anc
```
Generate bootstrap replicates to measure branch supports. This is usually run on a cluster.
```
topiary-bootstrap-reconcile ali_to_anc num_threads
```
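
The command-line steps above can also be driven from a Python script. The sketch below only assembles the argument lists, using the commands and file names exactly as they appear in this protocol. The function names `build_protocol_commands` and `run_protocol` and their parameters are hypothetical, and passing the thread count as the second positional argument of `topiary-bootstrap-reconcile` is an assumption based on the `num_threads` placeholder above.

```python
import shlex
import subprocess


def build_protocol_commands(seed="seed.xlsx", edited_fasta="edited_fasta",
                            num_threads=1):
    """Return the short-protocol steps as argument lists (hypothetical helper)."""
    return [
        # Seed spreadsheet -> multiple sequence alignment.
        ["topiary-seed-to-alignment", seed, "--out_dir", "output"],
        # Load the (optionally hand-edited) alignment back into a dataframe.
        ["topiary-load-fasta-into", "output/dataframe.csv", edited_fasta,
         "final-dataframe.csv"],
        # Reconciled tree and ancestral sequences.
        ["topiary-alignment-to-ancestors", "final-dataframe.csv",
         "--out_dir", "ali_to_anc"],
        # Bootstrap replicates for branch supports (thread count is an assumption).
        ["topiary-bootstrap-reconcile", "ali_to_anc", str(num_threads)],
    ]


def run_protocol(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the script at the first failing step.
            subprocess.run(argv, check=True)
```

Calling `run_protocol(build_protocol_commands(num_threads=4))` prints the four commands without running anything. Because inspecting the alignment is a manual step, a real script would likely run the first command, pause for editing, and only then continue with the rest.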

How to cite#

If you use topiary in your research, please cite:

Orlandi KN^*, Phillips SR^*, Sailer ZR, Harman JL, Harms MJ. “Topiary: pruning the manual labor from ancestral sequence reconstruction” (2022) Protein Science 10.1002/pro.4551.

^* Authors contributed equally

Please make sure to cite the tools we use in the package as well:

Muscle: Edgar RC (2021) bioRxiv https://doi.org/10.1101/2021.06.20.449169.
RAxML-NG: Kozlov et al (2019) Bioinformatics 35(21):4453–4455 https://doi.org/10.1093/bioinformatics/btz305.
GeneRax: Morel et al (2020) MBE https://doi.org/10.1093/molbev/msaa141.
PastML: Ishikawa et al (2019) MBE 36(9):2069–2085 https://doi.org/10.1093/molbev/msz131.
OpenTree: McTavish EJ et al (2021) Syst Biol 70(6):1295–1301 https://doi.org/10.1093/sysbio/syab033.

API and data structures#

Topiary can also be used as an API to organize general phylogenetic workflows. It uses pandas dataframes to manage phylogenetic data, allowing it to readily connect to other data science pipelines. Further, topiary provides programmatic access to RAxML-NG, GeneRax, Muscle5, and BLAST (local and remote). It also wraps portions of the OpenTree and PastML Python APIs for convenient interaction with topiary pandas dataframes.

You can see examples of the topiary API in action in the Jupyter notebooks in the topiary-examples GitHub repo. For a detailed description of the data structures and API, see the Data Structures and API pages.
