topiary.util

Public utility functions for the topiary package.

topiary.util.create_nicknames

Create a nickname column that has a friendly nickname for each sequence.

topiary.util.create_nicknames.create_nicknames(df, paralog_patterns, source_column='name', output_column='nickname', separator='/', unassigned_name='unassigned', overwrite_output=False, ignorecase=True)

Create a nickname column that has a friendly nickname for each sequence. Nicknames are generated by searching the source_column of the dataframe for the patterns defined in the paralog_patterns dictionary.

Parameters:
  • df (pandas.DataFrame) – topiary dataframe

  • paralog_patterns (dict) –

    dictionary for creating standardized nicknames from input names. Each key specifies the nickname to output; each value is a list of patterns that map back to that key. For example:

    {"S100A9":["S100-A9","S100 A9","S-100 A9","MRP14"],
     "S100A8":["S100-A8","S100 A8","S-100 A8","MRP8"]}
    

    would assign “S100A9” to any sequence matching patterns only from its list; “S100A8” to any sequence matching patterns only from its list; and “S100A9/S100A8” to any sequence matching patterns from both lists.

  • source_column (str, default="name") – source column in dataframe to use to generate a nickname

  • output_column (str, default="nickname") – column in which to store newly constructed nicknames

  • separator (str, default="/") – character to place between nicknames if patterns from more than one key match.

  • unassigned_name (str, default="unassigned") – nickname to give sequences that do not match any of the patterns.

  • overwrite_output (bool, default=False) – overwrite an existing output column

  • ignorecase (bool, default=True) – whether to ignore case when matching patterns to assign the nickname.

Returns:

topiary_dataframe – Copy of dataframe with new nickname column

Return type:

pandas.DataFrame
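
The matching behavior described above can be sketched in plain Python. This is an illustration only, not topiary's implementation: the helper name sketch_nicknames is hypothetical, it works on a list of strings rather than a dataframe, and it assumes each pattern is treated as a literal substring (the real function may interpret patterns differently).

```python
import re

def sketch_nicknames(names, paralog_patterns, separator="/",
                     unassigned_name="unassigned", ignorecase=True):
    """Hypothetical stand-in that mimics the documented nickname logic."""
    flags = re.IGNORECASE if ignorecase else 0

    # One regular expression per nickname; patterns are escaped so they
    # are matched literally (an assumption made for this sketch).
    compiled = {
        nickname: re.compile("|".join(re.escape(p) for p in patterns), flags)
        for nickname, patterns in paralog_patterns.items()
    }

    nicknames = []
    for name in names:
        # Collect every nickname whose pattern list matches this name,
        # in the order the keys appear in the dictionary.
        hits = [nick for nick, rx in compiled.items() if rx.search(name)]
        nicknames.append(separator.join(hits) if hits else unassigned_name)
    return nicknames

paralog_patterns = {
    "S100A9": ["S100-A9", "S100 A9", "S-100 A9", "MRP14"],
    "S100A8": ["S100-A8", "S100 A8", "S-100 A8", "MRP8"],
}

names = [
    "protein S100-A9",         # matches only the S100A9 list
    "mrp8",                    # matches only the S100A8 list (case ignored)
    "S100 A8 / MRP14 fusion",  # matches both lists
    "calmodulin",              # matches neither list
]

result = sketch_nicknames(names, paralog_patterns)
print(result)
```

With the real function, the equivalent call would presumably pass a topiary dataframe and the same dictionary, for example create_nicknames(df, paralog_patterns), and read the result from the output_column of the returned copy.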