topiary.ncbi.blast

Interface to NCBI blast.

topiary.ncbi.blast.local

Run BLAST against a local database.

topiary.ncbi.blast.local.local_blast(sequence, db, blast_program='blastp', hitlist_size=100, e_value_cutoff=0.001, gapcosts=(11, 1), keep_blast_xml=False, num_threads=-1, block_size=20, **kwargs)

Perform a blast query against a local blast database. Takes a sequence or list of sequences and returns a list of topiary dataframes containing hits for each sequence.

Parameters:
  • sequence (str or list) – sequence as a string OR list of string sequences

  • db (str) – name of local blast database

  • blast_program (str, default="blastp") – NCBI blast program to use (i.e. “blastp”, “tblastn”, etc.)

  • hitlist_size (int, default=100) – return only the top hitlist_size hits

  • e_value_cutoff (float, default=0.001) – only return hits with e_value better than e_value_cutoff

  • gapcosts (tuple, default=(11,1)) – BLAST gapcosts (length 2 tuple of ints)

  • keep_blast_xml (bool, default=False) – whether or not to keep temporary blast xml files

  • num_threads (int, default=-1) – number of threads to use. if -1, use all available.

  • block_size (int, default=20) – run blast in blocks of block_size sequences

  • **kwargs (dict, optional) – extra keyword arguments are passed directly to apps.NcbiblastXXXCommandline.

Returns:

blast_output – If a single sequence is passed in, return a dataframe of hits. If a list of queries is passed in, returns a list of dataframes–one entry per query. If no hits are found for a given query, return an empty dataframe.

Return type:

pandas.DataFrame or list

topiary.ncbi.blast.make

Construct a blast database.

topiary.ncbi.blast.make.make_blast_db(input_files, db_name, overwrite=False, makeblastdb_binary='makeblastdb')

Make a protein blast database from a set of fasta input files.

Parameters:
  • input_files (list) – list of input files. Must be fasta files (with extension .faa) or zipped fasta files (with extension .faa.gz).

  • db_name (str) – name of final database

  • overwrite (bool, default=False) – whether or not to overwrite an existing database

  • makeblastdb_binary (str, default="makeblastdb") – makeblastdb binary to use

Returns:

Return type:

None

topiary.ncbi.blast.merge

Merge dataframes from multiple BLAST queries.

topiary.ncbi.blast.merge.merge_and_annotate(blast_df_list, blast_source_list=None)

Merge a list of blast output dataframes into a single non-redundant dataframe. (See merge_blast_df documentation for details on merge). Grabs meta data (isoform, structure, etc.). Converts the “seuqence” column to “subject_sequence” and downloads full sequence for that accession from ncbi.

Parameters:
  • blast_df_list (list) – list of pandas dataframes containing blast output from local_blast, ncbi_blast, or read_blast_xml.

  • blast_source_list (list, optional) – list of values to put into a “blast_source” column that name where the dataframes from blast_df_list came from. Must be the same length as blast_df_list. if None, do not populate the “blast_source” column.

Returns:

df – merged blast dataframe

Return type:

pandas.DataFrame

topiary.ncbi.blast.merge.merge_blast_df(blast_df_list)

Merge dataframes from multiple BLAST queries. Merge happens based on accession. If different queries returned an overlapping sequence region from the same accession, merge them together. The merge can include an arbitrary number of sequences. This merge will:

  1. Take the longest contiguous hit as the row to keep.

  2. Update the “query” column to include all merged queries. For example, if the Human and Troll queries gave overlapping hits from subject region 1-100 and 2-101, respectively, the merged query will be “Human|1-100;Troll|2-101.

  3. Expand the “subject_start” and “subject_end” numbers to cover the maximum extent of all hits. In the example from (1), these would become 1 and 101 (human start, troll end).

  4. Set the “e_value” column to be the lowest e_value observed by any of the merged hits.

Parameters:

blast_df_list (list) – list of dataframes returned by a blast call with multiple sequence queries

Returns:

merged – single dataframe with duplicate hits between queries merged

Return type:

pandas.DataFrame

topiary.ncbi.blast.ncbi

Run BLAST against a remote NCBI database.

topiary.ncbi.blast.ncbi.ncbi_blast(sequence, db='nr', blast_program='blastp', taxid=None, hitlist_size=100, e_value_cutoff=0.001, gapcosts=(11, 1), url_base='https://blast.ncbi.nlm.nih.gov/Blast.cgi', max_query_length=80000, num_tries_allowed=5, num_threads=1, verbose=False, keep_blast_xml=False, **kwargs)

Perform a blast query against a remote NCBI blast database. Takes a sequence or list of sequences and returns a list of topiary dataframes containing hits for each sequence.

Parameters:
  • sequence (str or list) – sequence as a string OR list of string sequences

  • db (str, default="nr") – name of ncbi blast database

  • blast_program (str, default="blastp") – NCBI blast program to use (i.e. “blastp”, “tblastn”, etc.)

  • taxid (str or int or list or None, default=None) – limit ncbi blast search to specified taxid. If None, do not limit search. If single str/int, interpret as an integer taxid (i.e. 9606 for Homo sapiens). If list of str/int, interpret as multiple taxid (i.e. [9606,10090] for Homo sapiens OR Mus musculus).

  • hitlist_size (int, default=100) – return only the top hitlist_size hits

  • e_value_cutoff (float, default=0.001) – only return hits with e_value better than e_value_cutoff

  • gapcosts (tuple, default=(11,1)) – BLAST gapcosts (length 2 tuple of ints)

  • url_base (str, default=”https://blast.ncbi.nlm.nih.gov/Blast.cgi”) – NCBI base url

  • max_query_length (int, default=80000) – maximum string length accepted by the server. if the query is longer than max_query_length, the function will break it into multiple requests shorter than max_query_length, then combine the results.

  • num_tries_allowed (int, default=5) – try num_tries_allowed times in case of server timeout

  • num_threads (int, default=1) – number of threads to use. if -1, use all available.

  • verbose (bool, default=False) – whether or not to use verbose output

  • keep_blast_xml (bool, default=False) – whether or not to keep raw blast xml output

  • **kwargs (dict, optional) – extra keyword arguments are passed directly to Bio.Blast.NCBIWWW.qblast, overriding anything constructed above. You could, for example, pass entrez_query=”txid9606[ORGN] or txid10090[ORGN]” to limit blast results to hits from human or mouse. This would take precedence over any taxid specified above.

Returns:

blast_output – If a single sequence is passed in, return a dataframe of hits. If a list of queries is passed in, returns a list of dataframes–one entry per query. If no hits are found for a given query, return an empty dataframe.

Return type:

pandas.DataFrame or list

topiary.ncbi.blast.read

Read BLAST xml output.

topiary.ncbi.blast.read.check_for_cpu_limit(xml_file)

Check to see if an ncbi server rejected the request because it hit a CPU limit.

Parameters:

xml_file (str) – xml file that came off server

Returns:

result – True if the cpu limit was hit, False otherwise.

Return type:

bool

topiary.ncbi.blast.read.read_blast_xml(xml_input, do_cpu_check=False)

Load blast xml file(s) and convert to pandas dataframe(s).

Parameters:
  • xml_input (str or list) –

    blast xml files to load. This can have a few formats:

    1. single xml file (str)

    2. list of xml files

    3. directory (str). topiary will grab all .xml in that directory.

  • do_cpu_check (bool, default=False) – check files to see if they indicate cpu limit exceeded. if True and this is seen, return None, xml_files.

Returns:

  • all_df (list or None) – pandas dataframes constructed from the blast xml files. Will be None if cpu_limit_check == True and the cpu limit was observed.

  • xml_files (list) – list of xml files parsed.

topiary.ncbi.blast.read.records_to_df(blast_records)

Read list of blast records and return as a single pandas data frame.

Parameters:

blast_records (list) – list of biopython.Blast records

Returns:

results – pandas dataframe with all blast hits

Return type:

list

topiary.ncbi.blast.recip

Reciprocal blast sequences.

topiary.ncbi.blast.recip.recip_blast(df, paralog_patterns, local_blast_db=None, ncbi_blast_db=None, ncbi_taxid=None, ignorecase=True, min_call_prob=0.95, partition_temp=1, drop_combo_fx=0.9, use_start_end=True, hitlist_size=10, e_value_cutoff=0.01, gapcosts=(11, 1), num_threads=-1, keep_blast_xml=False, **kwargs)

Take sequences from a topiary dataframe and do a recip blast analysis against an NCBI or local blast database. Looks in blast hits for the regular expressions defined in paralog_patterns to call paralog for each sequence in df. Returns a copy of the input topiary dataframe with five new columns:

  • recip_found_paralog: True/False, whether a paralog was found

  • recip_hit: string, description for paralog hit (if found) or best hit (if no match found)

  • recip_paralog: string or None. name of paralog from paralog_patterns

  • recip_prob_match: float. probability that this is the correct paralog call based on relative bit scores of all paralog hits (see Note 2)

  • recip_bit_score: float. bit_score for the paralog hit (if found)

Parameters:
  • df (pandas.DataFrame) – topiary dataframe. Will pull sequences from df.sequences. If there are ‘start’ and ‘stop’ columns in the dataframe, only blast sequences between start/top (for example: start = 5, stop = 20 would blast sequence[5:20]. To turn this off, set use_start_end = False.

  • paralog_patterns (dict) – dictionary with paralogs as keys and lists of patterns or compiled pattern regular regular expressions as values. See Note 1 for details

  • local_blast_db (str or None, default=None) – Local blast database to use. If None, use an NCBI database. Incompatible with ncbi_blast_db.

  • ncbi_blast_db (str or None, default=None) – NCBI blast database to use. If None, use a local database. Incompatible with local_blast_db.

  • ncbi_taxid (str or int or list or None, default=None) – limit ncbi blast search to specified taxid. If None, do not limit search. If single str/int, interpret as an integer taxid (i.e. 9606 for Homo sapiens). If list of str/int, interpret as multiple taxid (i.e. [9606,10090] for Homo sapiens OR Mus musculus).

  • ignorecase (bool, default=True) – whether to ignore letter case when searching blast. NOTE: if you pass in compiled regular expressions, the flags (such as re.IGNORECASE) on the last pattern for each paralog will be used for all patterns.

  • min_call_prob (float, default=0.95) – hits from all paralogs that yield a regular expression match are weighted by their relative bit scores. Each paralog is assigned a relative probability. This cutoff is the minimum probability the best paralog match must have to result in a paralog call. Value should be between 0 and 1 (not inclusive), where min_call_prob –> 1 increases the stringency.

  • partition_temp (float, default=1,) – when calculating posterior probability of the paralog call, use this for weighting: 2^(bit_score/partition_temp). partition_temp should be a float > 0. A higher value corresponds to a higher stringency. (The bit score difference between the best hit and the rest would have to be higher to be significant). This is a minium value: it may be adjusted on-the-fly to avoid numerical problems in the calculation. See Note 2.

  • drop_combo_fx (float, default=0.90) – when deciding whether to call a paralog as a combination (i.e. A/B) versus singleton (i.e. A), prefer A if pp_A/pp_AB > drop_combo_fx.

  • use_start_end (bool, default=True) – whether or not to use start/stop columns in dataframe (if present) to slice subset of sequence for reciprocal blast.

  • hitlist_size (int, default=10) – return only the top hitlist_size hits

  • e_value_cutoff (float, default=0.001) – only return hits with e_value better than e_value_cutoff

  • gapcosts (tuple, default=(11,1)) – BLAST gapcosts (length 2 tuple of ints)

  • num_threads (int, default=-1) – number of threads to use. if -1, use all available. (Multithreading rarely speeds up remote BLAST).

  • keep_blast_xml (bool, default=False) – whether or not to keep raw blast xml output

  • **kwargs (dict, optional) – extra keyword arguments are passed directly to biopython blast). These take precedence over anything specified above (hitlist_size, for example).

Returns:

topiary_dataframe – copy of input dataframe with new recip blast columns

Return type:

pandas.DataFrame

Notes

  1. paralog_patterns are used to match reciprocal blast hits. We strongly recommend that you use the patterns generated automatically by io.read_seed, as this will find patterns that are unique to each paralog. If you want to specify them manually, however, use something like:

    paralog_patterns = {"LY96":["lymphocyte antigen 96","MD-2"],
                        "LY86":["lymphocyte antigen 86","MD-1"]}
    

    This would mean hits with ‘lymphocyte antigen 96’ or ‘MD-2’ will map to LY96; hits with ‘lymphocyte antigen 86’ or ‘MD-1’ will map to LY86. Hits that match neither would not be assigned a paralog. String patterns are interpreted literally. The pattern "[A-Z]" would be escaped to look for "[A-Z]". If you want to use regular expressions in your patterns, pass them in as compiled regular expressions. For example,

    my_pattern = re.compile("[A-Z]")
    
  2. The posterior probability of a given call is calculated using a weighted bit score. For each blast hit, we calculate w = 2^(bit_score/partition_temp). We then calculate the posterior probability for each paralog match as w_paralog/sum(w_all). See MC Frith (2020) Bioinformatics. https://doi.org/10.1093/bioinformatics/btz576 By increasing the partition_temp parameter, you increase stringency.

topiary.ncbi.blast.util

Shared functions for dealing with NCBI blast.