MolSetInspector
Molecular Sets Inspector

MolSetInspector (Molecular Sets Inspector) is a Python package that facilitates the processing of multiple molecular sets stored in various text file formats. As input, it takes a directory containing sets of molecules stored in sdf, csv, smi or txt files. The sets are read and merged into one library of distinct molecules. During processing, the molecules are canonicalized; optionally, they can be standardised (neutralised, unsalted, etc.) and tautomers can be removed. As output, MolSetInspector produces the intersections of the individual molecular sets, the IDs of defective (unparsable) molecules, and the list of distinct molecules, optionally with a hit table showing in which sets each molecule was found. MolSetInspector can also filter the distinct molecules by diversity in two ways: by setting 1) a maximum total number of diverse molecules or 2) a maximum similarity threshold for any pair of molecules in the set.
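The two diversity filters can be illustrated with a minimal sketch. This is not MolSetInspector's own code: the selection strategies (a greedy pass for the similarity threshold, a MaxMin-style pick for the total number), the function names, and the inclusive treatment of the threshold are all assumptions made for illustration. Fingerprints are represented as plain Python sets of on-bits rather than real molecular fingerprints.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def diverse_by_max_similarity(fingerprints, threshold):
    """Greedy filter (assumed strategy): keep an item only if its similarity
    to every item kept so far is at most `threshold`. Returns kept indices."""
    kept = []
    for idx, fp in enumerate(fingerprints):
        if all(tanimoto(fp, fingerprints[k]) <= threshold for k in kept):
            kept.append(idx)
    return kept


def diverse_by_total_number(fingerprints, n):
    """MaxMin-style pick (assumed strategy): start from the first item, then
    repeatedly add the item whose nearest picked neighbour is farthest away."""
    if n <= 0 or not fingerprints:
        return []
    picked = [0]
    while len(picked) < min(n, len(fingerprints)):
        candidates = (i for i in range(len(fingerprints)) if i not in picked)
        best = max(
            candidates,
            key=lambda i: min(
                1 - tanimoto(fingerprints[i], fingerprints[p]) for p in picked
            ),
        )
        picked.append(best)
    return picked


# Toy fingerprints: sets of on-bit positions, not real molecular fingerprints.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}, {1, 2, 3, 4, 5}]
print(diverse_by_max_similarity(toy_fps, 0.7))
print(diverse_by_total_number(toy_fps, 2))
```

The threshold variant lets the data decide how many molecules survive, while the total-number variant fixes the size of the result and maximises spread within it.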
Installation

On Linux, install MolSetInspector using the following commands:
sudo unzip MolSetInspector-0.1.0.zip
cd MolSetInspector-0.1.0/
sudo python setup.py install
On Windows, install MolSetInspector using the following command:
python setup.py install
MolSetInspector requires the following dependencies to be installed:
Usage

MolSetInspector can be used either from the command line or from Python code.
In the example below, MolSetInspector is invoked from the command line. The path to the input directory (/path/to/input_directory) is the only required argument. The -o option (-o /path/to/output_directory) sets the directory to which all exports are written. All molecular sets in the input directory are read and the molecules are standardised (-standardise option). The intersection table for all pairs of sets is written to the output directory (-inter option). The distinct compounds combined from all sets are filtered so that no pair of molecules exceeds a similarity of 0.7, measured by Tanimoto similarity on ECFP4 fingerprints (-dist and -dbs 0.7 options). The diverse set is written to the output directory as an .sdf file (-outf sdf option).
python molsetinspector.py /path/to/input_directory -o /path/to/output_directory -dist -dbs 0.7 -inter -standardise -outf sdf
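When the tool has to be run from a script, for example over several input directories, the same invocation can be assembled in Python. The sketch below only builds the argument list shown above; the helper names build_command and run_molsetinspector are hypothetical, and the script is assumed to be reachable as molsetinspector.py in the working directory.

```python
import subprocess
import sys


def build_command(input_dir, output_dir, max_similarity=0.7, output_format="sdf"):
    """Assemble the argument list for the command-line call shown above.

    Hypothetical helper: only flags documented in this README are used.
    """
    return [
        sys.executable, "molsetinspector.py",
        input_dir,
        "-o", output_dir,
        "-dist",
        "-dbs", str(max_similarity),
        "-inter",
        "-standardise",
        "-outf", output_format,
    ]


def run_molsetinspector(input_dir, output_dir, **kwargs):
    """Run the tool and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(input_dir, output_dir, **kwargs), check=True)


# Print the command without running it.
print(" ".join(build_command("/path/to/input_directory", "/path/to/output_directory")))
```

Passing the arguments as a list avoids shell quoting problems with directory names that contain spaces.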
Example
We have an input directory with 4 molecular sets in files with different formats:
input_directory/
    set_1.sdf (810 molecules, classic SDF file)
    set_2.smi (319 molecules, file with molecules in SMILES format one per line)
    set_3.csv (94 molecules, CSV file with 2 columns with header, molecules in SMILES)
    set_4.csv (354 molecules, CSV file with 7 columns without header, molecules in SMILES)
Running the command above creates the output directory:
output_directory/
    distinct.sdf (1442 distinct molecules, classic SDF file)
    interesections.csv (CSV file with table containing number of common molecules for all pairs of sets)
If we ran the command with the -hit and -outf csv options instead, we would get:
output_directory/
    distinct.csv (1442 distinct molecules in SMILES with information about the presence of molecule in each set)
    interesections.csv (CSV file with table containing number of common molecules for all pairs of sets)
interesections.csv
            set_2.smi  set_4.csv  set_1.sdf  set_3.csv
set_2.smi         319         26          0          0
set_4.csv          26        351         10          0
set_1.sdf           0         10        791          0
set_3.csv           0          0          0         94
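A table of this shape can be reproduced with plain set intersections once every molecule is reduced to a canonical string. The following is a minimal sketch under that assumption, not the package's implementation: the function name intersection_table and the toy data are hypothetical, and the SMILES strings are taken as already canonical.

```python
def intersection_table(sets_by_name):
    """Return (names, table), where table[i][j] is the number of molecules
    shared by set i and set j. The diagonal holds each set's own size."""
    names = list(sets_by_name)
    table = [
        [len(sets_by_name[a] & sets_by_name[b]) for b in names]
        for a in names
    ]
    return names, table


# Toy data: canonical SMILES strings stand in for canonicalized molecules.
toy_sets = {
    "set_a.smi": {"CCO", "c1ccccc1", "CC(=O)O"},
    "set_b.csv": {"CCO", "CCN"},
    "set_c.sdf": {"CC(=O)O", "CCO", "CCCC"},
}

names, table = intersection_table(toy_sets)
for name, row in zip(names, table):
    print(name, *row)
```

Because each set is a Python set, duplicates within one file are counted once, so a diagonal value can be smaller than the number of records in the file.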
distinct.csv (5 of 1443 lines in total)
smiles set_2.smi set_4.csv set_1.sdf set_3.csv sum
CC(=NNC(=O)C1CCCC1)c1ccc(NC(=O)C(F)(F)F)cc1 1 1 2
O=C(O)c1cccnc1SCCc1ccccc1 1 1
Cc1c2cnccc2c(C)c2c1c1ccccc1n2CCOC(=O)c1ccccc1 1 1
CC1=C(C)C(Cc2ccc(O)cc2)N(Cc2ccccc2)CC1 1 1
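The hit table can be thought of as one row per distinct molecule, one membership column per set, and a final column with the row sum. The sketch below builds such rows from toy data; the function name hit_table is hypothetical, and using an explicit 0 for absent molecules is an assumption based on the "0/1" wording of the -hit option, not on the excerpt above.

```python
def hit_table(sets_by_name):
    """Return (header, rows): each row is [smiles, hit per set..., sum of hits]."""
    names = list(sets_by_name)
    # Union of all sets gives the distinct molecules; sorted for stable output.
    distinct = sorted(set().union(*sets_by_name.values()))
    rows = []
    for smiles in distinct:
        hits = [1 if smiles in sets_by_name[name] else 0 for name in names]
        rows.append([smiles, *hits, sum(hits)])
    header = ["smiles", *names, "sum"]
    return header, rows


# Toy data: canonical SMILES strings stand in for canonicalized molecules.
toy_hit_sets = {
    "set_a.smi": {"CCO", "c1ccccc1", "CC(=O)O"},
    "set_b.csv": {"CCO", "CCN"},
    "set_c.sdf": {"CC(=O)O", "CCO", "CCCC"},
}

header, rows = hit_table(toy_hit_sets)
print(*header)
for row in rows:
    print(*row)
```

A molecule with a sum greater than 1 occurs in several sets, which makes the last column a quick filter for overlapping compounds.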
List of all possible MolSetInspector arguments:
positional arguments:
  input_directory       directory containing molecular sets files

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        output directory (default: False)
  -outf OUTPUT_FORMAT, --output_format OUTPUT_FORMAT
                        output format for distinct molecules (smi, sdf, csv) (default: csv)
  -dist, --distinct     write file with distinct molecules from all sets (default: False)
  -hit, --hit_table     append table with 0/1 to distinct export which shows the sets that contains the molecule (default: False)
  -inter, --intersections
                        write file with set intersections (default: False)
  -defect, --defective_molecules
                        write file with molecules that couldn't be parsed (default: False)
  -standardise, --standardise_structures
                        standardise structures using the standardiser (default: False)
  -tautomer, --remove_tautomers
                        remove tautomers (default: False)
  -dbs DIVERSE_BY_MAX_SIMILARITY, --diverse_by_max_similarity DIVERSE_BY_MAX_SIMILARITY
                        get diverse structures by maximum similarity treshold (from 0 to 1, 1 is most similar) (default: False)
  -dbn DIVERSE_BY_TOTAL_NUMBER, --diverse_by_total_number DIVERSE_BY_TOTAL_NUMBER
                        get maximum specified number of diverse structures (positive integer) (default: False)
The example below shows how to use MolSetInspector from Python code:
from molsetinspector import MolSetInspector as MSI

# Create an instance of MolSetInspector, specifying the input directory,
# the output directory, and whether the molecules should be standardised
# and tautomers removed.
msi = MSI(indirectory="/path/to/input_directory", outdirectory="/path/to/output_directory", standardise=True, remove_tautomers=False)

# Get a table (list of lists) with the number of common molecules for all pairs of sets.
# When an output directory is specified, the result is also written to a .csv file.
msi.get_set_intersections()

# Get the distinct molecules combined from all molecular sets.
#
# diversity_type: filter the set by diversity, either "max_similarity" or "total_number"
# treshold: criterion for the diversity selection
#           (max_similarity: float from 0.0 to 1.0; total_number: positive integer)
# hit_table: append information about the presence of each molecule in each set;
#            only works with the csv output format
# output_format: "sdf", "smi" or "csv"
#
# When an output directory is specified, the result is also written to a .sdf/.smi/.csv file.
msi.get_distinct(diversity_type="max_similarity", treshold=0.7, hit_table=True, output_format="csv")

# Get the molecules that couldn't be parsed by RDKit.
# Each molecule is identified by its original molecular set and its position
# index in the file (starting from position 1).
# When an output directory is specified, the result is also written to a .csv file.
msi.get_defective()
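The way defective molecules are identified, by set name and 1-based position, can be sketched independently of RDKit. In the example below, find_defective and toy_parse are hypothetical names; the stand-in parser only mimics the convention of returning None for input that cannot be parsed, which is also how RDKit's Chem.MolFromSmiles reports a failure.

```python
def find_defective(sets_by_name, parse):
    """Return (set_name, position) pairs for records that `parse` rejects.

    Positions are 1-based, matching the indexing described above.
    `parse` must return None for a record it cannot parse.
    """
    defective = []
    for name, records in sets_by_name.items():
        for position, record in enumerate(records, start=1):
            if parse(record) is None:
                defective.append((name, position))
    return defective


def toy_parse(smiles):
    """Stand-in parser (not a real SMILES parser): rejects empty strings
    and strings with unbalanced parentheses."""
    if not smiles or smiles.count("(") != smiles.count(")"):
        return None
    return smiles


toy_records = {
    "set_a.smi": ["CCO", "CC(=O", "c1ccccc1"],
    "set_b.smi": ["", "CCN"],
}
print(find_defective(toy_records, toy_parse))
```

With RDKit installed, a real parser could be passed in place of toy_parse without changing find_defective.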
License

MolSetInspector is released under the MIT license.