topiary.ncbi.entrez
Functions for interfacing with NCBI entrez databases.
topiary.ncbi.entrez.download
Functions to download files off of the NCBI via FTP.
- topiary.ncbi.entrez.download.ncbi_ftp_download(full_url, file_base='_protein.faa.gz', md5_file='md5checksums.txt', num_attempts=5)
Download a proteome, genome, etc. from the NCBI. Makes multiple attempts, resumes interrupted downloads, and checks the MD5 checksum to validate the integrity of the downloaded file.
- Parameters:
full_url (str) – full URL of the directory containing the data. This will look like ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14 (a leading ftp:// is allowed).
file_base (str, default="_protein.faa.gz") – class of file to download. "_protein.faa.gz" corresponds to a proteome, "_genomic.fna.gz" to a genome, etc.
md5_file (str, default="md5checksums.txt") – name of the MD5 checksum file on the server.
num_attempts (int, default=5) – number of times to try the download before giving up.
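The parameters above can be illustrated with a minimal sketch of the URL handling and checksum validation. This is not the library's implementation. The helper names (split_ftp_url, expected_md5, md5_matches) are hypothetical, and two assumptions are made: the target file is named after the last component of the directory path plus file_base, and each line of the checksum file holds a hex digest followed by a "./"-prefixed file name.

```python
import hashlib
from urllib.parse import urlparse


def split_ftp_url(full_url, file_base="_protein.faa.gz"):
    """Split a directory URL into (host, directory, expected file name)."""
    # Accept URLs with or without a leading ftp://
    if "://" not in full_url:
        full_url = "ftp://" + full_url
    parsed = urlparse(full_url)
    directory = parsed.path.rstrip("/")
    # Assumption: the file is named <last path component><file_base>
    file_name = directory.split("/")[-1] + file_base
    return parsed.netloc, directory, file_name


def expected_md5(md5_text, file_name):
    """Look up the digest for file_name in the text of a checksum file."""
    for line in md5_text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        digest, name = fields
        # Assumption: names in the checksum file carry a "./" prefix
        if name.startswith("./"):
            name = name[2:]
        if name == file_name:
            return digest
    return None


def md5_matches(path, expected):
    """Return True if the MD5 digest of the file at path equals expected."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large archives are not loaded at once
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

A caller would fetch md5_file from the same directory, pass its text to expected_md5, and retry the download (up to num_attempts times) whenever md5_matches returns False.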
topiary.ncbi.entrez.proteome
Use entrez to download a proteome from the NCBI.
- topiary.ncbi.entrez.proteome.get_proteome(taxid=None, species=None)
Use entrez to download a proteome from the NCBI.
- Parameters:
taxid (int or str, optional) – NCBI taxid (integer or string version of the integer). Mutually exclusive with species; exactly one of taxid or species must be specified.
species (str, optional) – binomial name of species (e.g. Mus musculus). Mutually exclusive with taxid; exactly one of taxid or species must be specified.
- Returns:
proteome_file – the downloaded file, or None if no file was downloaded
- Return type:
str or None
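The taxid/species rule can be sketched as a small validation helper. The function normalize_query is hypothetical and is not part of topiary; it only illustrates one way to enforce the "exactly one of taxid or species" constraint described above.

```python
def normalize_query(taxid=None, species=None):
    """Return ("taxid", "<digits>") or ("species", "<binomial name>")."""
    # Exactly one of the two arguments must be given
    if (taxid is None) == (species is None):
        raise ValueError("specify exactly one of taxid or species")
    if taxid is not None:
        # Accept an integer or its string form; int() rejects anything else
        return "taxid", str(int(taxid))
    return "species", species.strip()
```

For example, normalize_query(taxid="10090") and normalize_query(taxid=10090) both yield the same normalized query, while passing both arguments (or neither) raises ValueError.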
- topiary.ncbi.entrez.proteome.get_proteome_ids(taxid=None, species=None)
Query entrez to get a list of proteome ids that match a taxid or species. On failure, this does not raise an error; it returns None together with an error string.
- Parameters:
taxid (int or str, optional) – NCBI taxid (integer or string version of the integer). Mutually exclusive with species; exactly one of taxid or species must be specified.
species (str, optional) – binomial name of species (e.g. Mus musculus). Mutually exclusive with taxid; exactly one of taxid or species must be specified.
- Returns:
returned_ids (list or None) – list of proteome ids. None if no ids were found or an error occurred.
err (str or None) – descriptive error if no returned_ids. None if no error.
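Because failures are reported through the second return value rather than an exception, callers need to check it explicitly. The sketch below shows one way to do so. The wrapper ids_or_raise is hypothetical, and it assumes the two documented return values arrive as a (returned_ids, err) pair. The lookup function is passed in as an argument, so the wrapper can be exercised without network access.

```python
def ids_or_raise(lookup, **query):
    """Call a (returned_ids, err)-style lookup and raise if it failed."""
    returned_ids, err = lookup(**query)
    if returned_ids is None:
        # Surface the descriptive error string instead of passing None on
        raise LookupError(f"no proteome ids for {query}: {err}")
    return returned_ids


# Intended use (requires topiary and network access):
#   from topiary.ncbi.entrez.proteome import get_proteome_ids
#   ids = ids_or_raise(get_proteome_ids, species="Mus musculus")
```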
topiary.ncbi.entrez.sequences
Use entrez to download protein sequences from the NCBI.
- topiary.ncbi.entrez.sequences.get_sequences(to_download, block_size=50, num_tries_allowed=10, num_threads=-1)
Use entrez to download protein sequences from the NCBI.
- Parameters:
to_download (list) – list of NCBI ids to download
block_size (int, default=50) – download in chunks of this size
num_tries_allowed (int, default=10) – number of times to try before giving up and raising an error.
num_threads (int, default=-1) – number of threads to use. If -1, use all available threads.
- Returns:
seq_output – list of tuples of strings. Each tuple looks like (seq_id, sequence)
- Return type:
list
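The block_size parameter and the shape of seq_output can be illustrated with two small helpers. Both are hypothetical sketches and not topiary code: split_into_blocks shows how a list of ids divides into chunks of at most block_size, and to_fasta shows one way to turn the documented (seq_id, sequence) tuples into FASTA text.

```python
def split_into_blocks(ids, block_size=50):
    """Split ids into consecutive chunks of at most block_size items."""
    return [ids[i:i + block_size] for i in range(0, len(ids), block_size)]


def to_fasta(seq_output):
    """Format a list of (seq_id, sequence) tuples as FASTA text."""
    return "".join(f">{seq_id}\n{sequence}\n" for seq_id, sequence in seq_output)
```

With the default block_size of 50, a list of 120 ids would split into blocks of 50, 50, and 20.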
topiary.ncbi.entrez.taxid
Use entrez to get the NCBI taxid for species.
- topiary.ncbi.entrez.taxid.get_taxid(species_list)
Use entrez to get the NCBI taxid for species.
- Parameters:
species_list (list) – list of species in binomial format (e.g. Homo sapiens).
- Returns:
taxid_list – list of taxids (not guaranteed to be in the same order as the input species_list)
- Return type:
list
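Since the output order is not guaranteed to match the input, pairing each species with its taxid takes some care. One possible workaround, sketched below, is to query one species at a time. The helper taxids_by_species is hypothetical, and it assumes that a one-element species_list yields a one-element result. The lookup is passed in as an argument, so the sketch can be tried with a stand-in function.

```python
def taxids_by_species(lookup, species_list):
    """Map each species to its taxid by querying one species at a time."""
    paired = {}
    for species in species_list:
        result = lookup([species])
        # Assumption: a single-species query returns exactly one taxid
        if len(result) != 1:
            raise ValueError(f"expected one taxid for {species!r}, got {len(result)}")
        paired[species] = result[0]
    return paired


# Intended use (requires topiary and network access):
#   from topiary.ncbi.entrez.taxid import get_taxid
#   mapping = taxids_by_species(get_taxid, ["Homo sapiens", "Mus musculus"])
```

This trades speed for an unambiguous mapping, since it issues one query per species instead of a single batched query.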