topiary.ncbi.entrez

Functions for interfacing with NCBI Entrez databases.

topiary.ncbi.entrez.download

Functions to download files from NCBI via FTP.

topiary.ncbi.entrez.download.ncbi_ftp_download(full_url, file_base='_protein.faa.gz', md5_file='md5checksums.txt', num_attempts=5)

Download a proteome, genome, etc. from NCBI. Makes multiple attempts, resumes interrupted downloads, and checks the md5 checksum to validate integrity.

Parameters:
  • full_url (str) – full URL of the directory containing the data. This will look like ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14 (a leading ftp:// is allowed).

  • file_base (str, default="_protein.faa.gz") – class of file to download. “_protein.faa.gz” corresponds to a proteome, “_genomic.fna.gz” to a genome, etc.

  • md5_file (str, default="md5checksums.txt") – name of md5 checksum file on the server.

  • num_attempts (int, default=5) – number of times to try to download before giving up.
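
To make the relationship between full_url, file_base, and md5_file concrete, here is a minimal sketch. It is not topiary's implementation. It assumes the usual NCBI layout, in which the target file is named after the last component of the directory URL plus file_base, and in which the checksum file holds lines of the form "<md5>  ./<file name>". The helper names build_file_name, expected_md5, and md5_matches are hypothetical.

```python
import hashlib
import posixpath


def build_file_name(full_url, file_base="_protein.faa.gz"):
    """Guess the file name inside an NCBI assembly directory (assumed layout)."""
    # Drop an optional scheme (e.g. ftp://) and any trailing slash, then
    # keep the last path component, which names the assembly.
    path = full_url.split("://", 1)[-1].rstrip("/")
    return posixpath.basename(path) + file_base


def expected_md5(md5_text, file_name):
    """Find the checksum for file_name in md5checksums.txt-style text."""
    for line in md5_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        # Listed names are assumed to carry a leading "./"
        listed = parts[1][2:] if parts[1].startswith("./") else parts[1]
        if listed == file_name:
            return parts[0]
    return None


def md5_matches(data, md5_hex):
    """Return True if the md5 digest of the bytes in data equals md5_hex."""
    return hashlib.md5(data).hexdigest() == md5_hex
```

With the example URL above, build_file_name would give GCF_000001405.40_GRCh38.p14_protein.faa.gz for the default file_base.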

topiary.ncbi.entrez.proteome

Use entrez to download a proteome from the NCBI.

topiary.ncbi.entrez.proteome.get_proteome(taxid=None, species=None)

Use entrez to download a proteome from the NCBI.

Parameters:
  • taxid (int or str, optional) – NCBI taxid (integer or string version of the integer). Mutually exclusive with the species argument; exactly one of taxid or species must be specified.

  • species (str, optional) – binomial name of species (e.g. Mus musculus). Mutually exclusive with the taxid argument; exactly one of taxid or species must be specified.

Returns:

proteome_file – the downloaded file, or None if no file was downloaded

Return type:

str or None
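
The taxid/species rule can be illustrated with a small validation sketch. This is an assumption about how such a check might look, not the check topiary actually performs; check_taxid_species is a hypothetical helper.

```python
def check_taxid_species(taxid=None, species=None):
    """Return ("taxid", value) or ("species", value); raise ValueError otherwise."""
    # Exactly one of the two arguments must be given.
    if (taxid is None) == (species is None):
        raise ValueError("specify exactly one of taxid or species")
    if taxid is not None:
        # Accept an integer or its string form; int() rejects anything else.
        return "taxid", str(int(taxid))
    return "species", species.strip()
```

A caller would then pass only one keyword, for example get_proteome(taxid=10090) or get_proteome(species="Mus musculus").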

topiary.ncbi.entrez.proteome.get_proteome_ids(taxid=None, species=None)

Query entrez to get a list of proteome ids that match a taxid or species. This does not raise an error on failure; instead, it returns None in place of the ids, together with a descriptive error string.

Parameters:
  • taxid (int or str, optional) – NCBI taxid (integer or string version of the integer). Mutually exclusive with the species argument; exactly one of taxid or species must be specified.

  • species (str, optional) – binomial name of species (e.g. Mus musculus). Mutually exclusive with the taxid argument; exactly one of taxid or species must be specified.

Returns:

  • returned_ids (list or None) – list of proteome ids. None if no ids were found or an error occurred.

  • err (str or None) – descriptive error message if returned_ids is None. None if there was no error.
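
Because failures are reported through the return value rather than an exception, callers need to inspect both items. The sketch below assumes the two values come back as a (returned_ids, err) pair, as the Returns list suggests; ids_or_raise is a hypothetical convenience wrapper, not part of topiary.

```python
def ids_or_raise(result):
    """Unpack a (returned_ids, err) pair, raising if no ids came back."""
    returned_ids, err = result
    if returned_ids is None:
        # err is expected to describe what went wrong; fall back to a generic note
        raise RuntimeError(err or "no proteome ids found")
    return returned_ids


# Intended use (requires network access, so it is left commented out):
# from topiary.ncbi.entrez.proteome import get_proteome_ids
# ids = ids_or_raise(get_proteome_ids(species="Mus musculus"))
```

Code that prefers to continue after a failed lookup can instead test returned_ids for None and log err.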

topiary.ncbi.entrez.sequences

Use entrez to download protein sequences from the NCBI.

topiary.ncbi.entrez.sequences.get_sequences(to_download, block_size=50, num_tries_allowed=10, num_threads=-1)

Use entrez to download protein sequences from the NCBI.

Parameters:
  • to_download (list) – list of NCBI ids to download.

  • block_size (int, default=50) – download in chunks of this size.

  • num_tries_allowed (int, default=10) – number of times to try before giving up and raising an error.

  • num_threads (int, default=-1) – number of threads to use. If -1, use all available.

Returns:

seq_output – list of tuples of strings. Each tuple looks like (seq_id, sequence)

Return type:

list
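
As a rough illustration of what block_size and num_threads=-1 imply, the sketch below splits an id list into blocks and resolves -1 to a thread count. It is an assumption about the general approach, not topiary's code; split_into_blocks and resolve_num_threads are hypothetical helpers, and os.cpu_count() is only one plausible reading of "all available".

```python
import os


def split_into_blocks(ids, block_size=50):
    """Split ids into consecutive lists of at most block_size items."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    return [ids[i:i + block_size] for i in range(0, len(ids), block_size)]


def resolve_num_threads(num_threads=-1):
    """Map -1 to the number of CPUs reported by the OS (at least 1)."""
    if num_threads == -1:
        return os.cpu_count() or 1
    return num_threads
```

With 120 ids and the default block_size, this yields three blocks of 50, 50, and 20 ids, each of which could be requested (and retried) independently.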

topiary.ncbi.entrez.taxid

Use entrez to get the NCBI taxid for species.

topiary.ncbi.entrez.taxid.get_taxid(species_list)

Use entrez to get the NCBI taxid for species.

Parameters:

species_list (list) – list of species in binomial format (e.g. Homo sapiens).

Returns:

taxid_list – list of taxids (not guaranteed to be in the same order as the input species_list)

Return type:

list