topiary.quality

Functions for doing quality control on sequences in a dataframe.

topiary.quality.alignment

Functions for scoring alignment quality.

topiary.quality.alignment.score_alignment(df, sparse_column_cutoff=0.8, align_trim=(0.1, 0.9), silent=False)

Calculate alignment quality scores for each sequence in the dataframe. The resulting scores are loaded as columns into the dataframe. In all cases, a higher score is a worse alignment.

The three calculated scores are:

  • fx_in_sparse: fraction of columns from the sequence that have no gap where most other sequences have a gap.

  • fx_missing_dense: fraction of columns from the sequence that have a gap where most other sequences have no gap.

  • sparse_run_length: length of longest insertion in the sequence. The insertion does not have to be continuous in sequence. Instead, this function finds runs of sparse_columns and then counts how many columns within each run come from this sequence. If, for example, there are ten sparse columns in a row and a sequence has eight non-gap characters over that run of sparse columns, the run length would be eight.

Parameters:
  • df (pandas.DataFrame) – topiary dataframe

  • sparse_column_cutoff (float, default=0.80) – when checking alignment quality, a column is sparse if it has gaps in more than sparse_column_cutoff sequences.

  • align_trim (tuple, default=(0.05,0.95)) – when checking alignment quality, do not score the first and last parts of the alignment. Interpreted like a slice, but with percentages. (0.0,1.0) would not trim; (0.05,0,98) would trim the first 0.05 off the front and the last 0.02 off the back.

  • silent (bool, default=False) – whether or not to silence muscle output

Returns:

topiary_dataframe – copy of df with three new columns containing scores.

Return type:

pandas.DataFrame

topiary.quality.polish

Polish a near final alignment by removing sequences with long insertions or that are missing sequence.

topiary.quality.polish.polish_alignment(df, realign=True, sparse_column_cutoff=0.8, align_trim=(0, 1), fx_sparse_percentile=0.9, sparse_run_percentile=0.9, fx_missing_percentile=0.9)

Polish a near-final alignment by removing sequences that have long insertions or are missing large chunks of the sequence.

Parameters:
  • df (pandas.DataFrame) – topiary dataframe

  • realign (bool, default=True) – align after dropping columns

  • sparse_column_cutoff (float, default=0.80) – when checking alignment quality, a column is sparse if it has gaps in more than sparse_column_cutoff sequences.

  • align_trim (tuple, default=(0,1)) – when checking alignment quality, do not score the first and last parts of the alignment. Interpreted like a slice, but with percentages. (0.0,1.0) would not trim; (0.05,0,98) would trim the first 0.05 off the front and the last 0.02 off the back. The default to this function does not trim at all.

  • fx_sparse_percentile (float, default=0.90) – flag any sequence that is has a fraction sparse above this percentile cutoff.

  • sparse_run_percentile (float, default=0.90) – flag any sequence that is has total sparse run length above this percentile cutoff.

  • fx_missing_percentile (float, default=0.90) – flag any sequence that is has a fraction missing above this percentile cutoff.

Notes

The alignment is scored using topiary.quality.score_alignment (see that docstring for details). Briefly: columns are characterized as either dense (most sequences have a non-gap) or sparse (defined as not dense). This call is made using the sparse_column_cutoff argument. This function then identifies sequences that have many non-gap characters in sparse columns overall, sequences with runs of non-gap characters in long runs of sparse columns, and sequences that are missing large portions of the dense columns. It drops sequences that have BOTH large fx_sparse AND large sparse_run. It also drops sequences that have BOTH large fx_missing AND are flagged as partial in the original NCBI entry.

topiary.quality.redundancy

Remove redundancy for datasets in a semi-intelligent way.

topiary.quality.redundancy.find_redundancy_cutoff(df, target_seq_number, sample_fx=0.5, max_cutoff=0.99, min_cutoff=0.25, max_iterations=20, step_bias=0.75, target_length_cutoff=0.25, discard_key=False, num_threads=-1)

Find a redundancy cutoff that, when applied to df, will yield approximately target_seq_number

Parameters:
  • df (pandas.DataFrame) – topiary dataframe with sequences whose redundancy is to be reduced

  • target_seq_number (int) – number of sequences desired after redundancy is reduced

  • sample_fx (float, default=0.1) – fraction of sequences to use when doing redundancy reduction.

  • max_cutoff (float, default=0.99) – maximum redundancy cutoff value. The redundancy cutoff will be no more than this after optimization.

  • min_cutoff (float, default=0.35) – minimum redundancy cutoff value. The redundancy cutoff will be no less than this after optimization.

  • max_iterations (int, default=20) – when refining, do no more than max_iterations of optimization.

  • step_bias (float, default=0.75) – when iterating through cutoffs, bias towards higher value (> 0.5), lower value (< 0.5), or take exact average of two cutoffs (0.5). Must be between 0 and 1.

  • target_length_cutoff (float, default=0.25) – Give a higher quality to any sequence whose length is within target_length_cutoff pct of the median key_species sequence length. To disable, set to None.

  • discard_key (bool, default=False) – whether or not to discard key sequences when doing the optimization

  • num_threads (int, default=-1) – number of threads to use. If -1, use all available

Returns:

cutoff – optimized redundnacy cutoff

Return type:

float

topiary.quality.redundancy.remove_redundancy(df, cutoff=0.95, target_length_cutoff=0.25, discard_key=False, silent=False, num_threads=-1)

Remove redundant sequences according to cutoff and semi-intelligent heuristics.

If two sequences are identical within a sequence cutof, it selects the sequence to keep according to the following criteria, in order:

  1. whether sequence is from a key species

  2. whether it is significantly different in length from the median length of sequences from key species

  3. whether it’s annotated as low quality

  4. whether it’s annotated as partial

  5. whether it’s annotated as precursor

  6. whether it’s annotated as hypothetical

  7. whether it’s annotated as isoform

  8. whether it’s annotated as structure

  9. sequence length (preferring longer)

Parameters:
  • df (pandas.DataFrame) – topiary data frame with sequences

  • cutoff (float, default=0.95) – %identity cutoff for combining removing sequences (between 0 and 1)

  • target_length_cutoff (float, default=0.25) – Give a higher quality to any sequence whose length is within target_length_cutoff pct of the median key_species sequence length. To disable, set to None.

  • discard_key (bool, default=False) – whether or not to discard sequences from key species

  • silent (bool, default=False) – whether to print output and use status bars

  • only_in_species (bool, default=False) – only reduce redundancy within species; do not compare sequences between species

  • num_threads (int, default=-1) – number of threads to use. If -1, use all available

Returns:

topiary_dataframe – Copy of df in which “keep” is set to False for redundant sequences

Return type:

pandas.dataframe

topiary.quality.shrink

Shrink the sequence database in a rational way.

topiary.quality.shrink.shrink_aligners(df, target_seq_number, paralog_column='recip_paralog', species_tree_aware=True, weighted_paralog_split=False, sparse_column_cutoff=0.8, align_trim=(0.05, 0.95))

Select sequences that align best within taxonomically informed blocks.

Parameters:
  • df (pandas.DataFrame) – topiary dataframe

  • target_seq_number (int) – number of sequences that should be returned in final dataset

  • paralog_column (str, default="recip_paralog") – column holding preliminary paralog calls.

  • species_tree_aware (bool, default=True) – if True, merge paying attention to species tree and paralog. If False, merge all sequences based on sequence identity (similar to CD-hit)

  • weighted_paralog_split (bool, default=False) – when deciding how much of the total budget to assign to each paralog, weight the budget by the number of times each paralog is seen. If False, (default), split the budget as evenly as possible between the paralogs in the dataframe.

  • sparse_column_cutoff (float, default=0.80) – when checking alignment quality, a column is sparse if it has gaps in more than sparse_column_cutoff sequences.

  • align_trim (tuple, default=(0.05,0.95)) – when checking alignment quality, do not score the first and last parts of the alignment. Interpreted like a slice, but with percentages. (0.0,1.0) would not trim; (0.05,0,98) would trim the first 0.05 off the front and the last 0.02 off the back.

Returns:

df – dataframe with keep set to False for redundant sequences

Return type:

pandas.DataFrame

topiary.quality.shrink.shrink_dataset(df, paralog_column='recip_paralog', seqs_per_column=1, max_seq_number=500, species_tree_aware=True, redundancy_cutoff=0.96, merge_block_size=50, weighted_paralog_split=False, sparse_column_cutoff=0.8, align_trim=(0.05, 0.95))

Select a subset of sequences from a topiary dataframe in a rational way.

Parameters:
  • df (pandas.DataFrame) – topiary dataframe

  • paralog_column (str, default="recip_paralog") – column holding preliminary paralog calls.

  • seqs_per_column (float, default=1) – aim to have this number of sequences per column in the key species sequences. (For example, if the key sequence is 100 amino acids long, seqs_per_column=1 would aim for 100 sequences; 2 would aim for 200 sequences).

  • max_seq_number (int, default=500) – maximum number of sequences to get, regardless of seqs_per_column and key sequence length.

  • species_tree_aware (bool, default=True) – if True, merge paying attention to species tree and paralog. If False, merge all sequences based on sequence identity (similar to CD-hit)

  • redundancy_cutoff (float, default=0.98) – merge sequences from closely related species with sequence identity above cutoff.

  • merge_block_size (int, default=50) – create blocks of paralogs merge_block_size out of the species tree to do merging based on sequence identity.

  • weighted_paralog_split (bool, default=False) – when deciding how much of the total budget to assign to each paralog, weight the budget by the number of times each paralog is seen. If False, (default), split the budget as evenly as possible between the paralogs in the dataframe.

  • sparse_column_cutoff (float, default=0.80) – when checking alignment quality, a column is sparse if it has gaps in more than sparse_column_cutoff sequences.

  • align_trim (tuple, default=(0.05,0.95)) – when checking alignment quality, do not score the first and last parts of the alignment. Interpreted like a slice, but with percentages. (0.0,1.0) would not trim; (0.05,0,98) would trim the first 0.05 off the front and the last 0.02 off the back.

topiary.quality.shrink.shrink_in_species(df, redundancy_cutoff=0.98)

Lower sequence redundancy within individual species, ignoring paralog annotation.

Parameters:
  • df (pandas.DataFrame) – topiary dataframe

  • redundancy_cutoff (float, default=0.98) – merge sequences that have identities greater than or equal to redundancy_cutoff.

Returns:

df – dataframe with keep set to False for redundant sequences

Return type:

pandas.DataFrame

topiary.quality.shrink.shrink_redundant(df, paralog_column='recip_paralog', species_tree_aware=True, weighted_paralog_split=False, merge_block_size=50, redundancy_cutoff=0.98)

Lower sequence redundancy by sequence identity within taxonomically informed blocks.

Parameters:
  • df (pandas.DataFrame) – topiary dataframe

  • paralog_column (str, default="recip_paralog") – column holding preliminary paralog calls.

  • species_tree_aware (bool, default=True) – if True, merge paying attention to species tree and paralog. If False, merge all sequences based on sequence identity (similar to CD-hit)

  • weighted_paralog_split (bool, default=False) – when deciding how much of the total budget to assign to each paralog, weight the budget by the number of times each paralog is seen. If False, (default), split the budget as evenly as possible between the paralogs in the dataframe.

  • merge_block_size (int, default=50) – create blocks of paralogs merge_block_size out of the species tree to do merging based on sequence identity.

  • redundancy_cutoff (float, default=0.98) – merge sequences that have identities greater than or equal to redundancy_cutoff.

Returns:

df – dataframe with keep set to False for redundant sequences

Return type:

pandas.DataFrame

topiary.quality.taxonomic

Functions for finding blocks of sequences to merge in a taxonomically-informed (species/paralog) fashion.

topiary.quality.taxonomic.get_merge_blocks(df, target_seq_number, paralog_column='recip_paralog', weighted_paralog_split=False, target_merge_block_size=None, dummy_merge_blocks=False)

Determine blocks of sequences to merge in a taxonomically informed fashion.

Parameters:
  • df (pandas.DataFrame) – topiary dataframe to evaluate

  • target_seq_number (int) – target number of sequences after the merge

  • paralog_column (str, default="recip_paralog") – column in dataframe containing paralog name of each sequence

  • weighted_paralog_split (bool, default=False) – when deciding how much of the total budget to assign to each paralog, weight the budget by the number of times each paralog is seen. If False, (default), split the budget as evenly as possible between the paralogs in the dataframe.

  • target_merge_block_size (int, optional) – if specified, attempt to make merge blocks have the given size. The actual block sizes will vary wildly, as it is done by traversing the tree and thus depends strongly on the tree topology. merge blocks will all be <= the target size except for tips that have more copies of a given paralog than the target size. In such a case, the paralogs from that species will form their own merge block.

  • dummy_merge_blocks (bool, default=False) – if True, do not actually generate merge blocks. Return a dictionary with key None that requests a merge of all uid in the dataframe.

Returns:

merge_blocks – dictionary keyed to paralog names taken from paralog_column. Values are lists of blocks to merge with the following form [(budget_for_merge,list_of_uid_to_merge,ete3_leaf_this_came_from),…]

Return type:

dict