Data Science - Toxicological Predictions


[Download this notebook](05 - Data Science.ipynb)


In this lesson you’ll learn:

  • how to train an SVM.
  • why it is necessary to scale variables.
  • how to train a random forest model.
  • about Y-scrambling and the necessity of train/test splits.
  • how to split data into a training set and test set.

Today you will learn some basics of data science and machine learning; these will also be relevant later when we train neural networks. As an example, we will build models to predict the toxicological concern of molecules. In this specific example, we are interested in an assay that measures the mitochondrial membrane potential (MMP), which is used as an indicator of the overall health of the cell [1].

The data come from the Tox21 Challenge dataset. The goal of this competition was to predict the toxicological properties of molecules; for this purpose, measurements for a total of twelve assays were made available. For now, we focus on a single assay and also do not use the entire data set, but only about 2000 molecules.

But first, let us look at the effect of variable scaling using a small example.

References:

[1] Sakamuru, S., Attene-Ramos, M. S., & Xia, M. (2016). Mitochondrial membrane potential assay. In High-Throughput Screening Assays in Toxicology (pp. 17-22). Humana Press, New York, NY. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5375165/

[2] Liao, C., Xu, D., Liu, X., Fang, Y., Yi, J., Li, X., & Guo, B. (2018). Iridium(III) complex-loaded liposomes as a drug delivery system for lung cancer through mitochondrial dysfunction. International Journal of Nanomedicine, 13, 4417.

import pandas as pd 
import numpy as np
!pip install rdkit==2022.3.4
!pip install seaborn
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw # needed below for MolsToGridImage and DrawMorganBits
from rdkit.Chem import AllChem as Chem
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from rdkit.Chem.Lipinski import * 
from rdkit.Chem.rdMolDescriptors import CalcExactMolWt, CalcTPSA
from rdkit.Chem.Crippen import MolLogP, MolMR
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
from matplotlib import pyplot as plt
import sys
%matplotlib inline

if 'google.colab' in sys.modules: # checks whether the notebook runs on Google Colab
  !wget https://raw.githubusercontent.com/kochgroup/intro_pharma_ai/main/utils/utils.py
  %run utils.py
else:
  %run ../utils/utils.py # loads prewritten function

Scaling

The scaling of variables can be critical to the success of machine learning models. Scaling variables means that we change the scale of a variable’s values. More specifically, we want all variables to use the same scale.

For example, the value of a house in € often ranges from 50,000 to several million. However, the area of the house is probably only between 10 and several hundred square meters. The values that the price can take are much larger than those of the area.

For algorithms based on distances or gradients, this can be a problem, because the scale of a variable has a direct impact on its importance to the model.

Variables with a larger scale are given more importance, even though the scale is arbitrary: one could just as well express the price of a house in thousands of € (50,000 becomes 50). The scale changes, but the underlying variable does not.

To avoid this effect, we scale variables to a uniform range. MinMax scaling, for example, maps all values to the range between 0 and 1.
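sklearn also ships a ready-made version of this scaler, the MinMaxScaler imported above. A minimal sketch of how it works, on made-up house prices (numpy and MinMaxScaler are already imported at the top of this notebook):

prices = np.array([[50_000.0], [250_000.0], [1_000_000.0]]) # house prices in € (made-up values)
scaler = MinMaxScaler()
print(scaler.fit_transform(prices)) # smallest value becomes 0, largest becomes 1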

In the following example you can see the influence scaling can have on a Support Vector Machine.

First, you load the data using Python.


Data types: There are many different formats in which data is stored. Probably the most commonly used is the comma-separated value format, recognizable by the extension .csv at the end of a file name.

As the name implies, the individual values are separated by commas: values that belong to the same row are written on the same line, and columns are separated by commas. Another common file format is the plain text file, .txt. Here the extension alone does not tell you how the file is structured, but usually the same "values in a row belong to a row" convention is followed; only the character that separates individual values differs. A frequently used separator is the tab character, which is also common in the .smi format; you will see that format more often when working with SMILES. If you are not sure which separator is used, open the file in a simple text editor (on Windows, for example, Notepad) and look at how the values are separated.
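If you prefer to stay in Python, repr() makes the separator visible by printing tabs as \t. A minimal sketch with a made-up in-memory file:

import io

raw = "x1\tx2\ty\n0.5\t9.1\t0\n-0.3\t11.2\t1\n" # hypothetical file content
for line in io.StringIO(raw).readlines():
    print(repr(line)) # repr() shows the \t separators and \n line endings explicitly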

In this case the data is stored in the .tab format, so the values are separated by tabs. You can use pd.read_csv() even if the file is not a .csv file, but then you must additionally specify the separator with sep="\t". The string "\t" is recognized as a tab.

print("ab")
print("a\tb")

As you may have already seen from some of the loaded libraries, today we will use some functions that are still unknown to you.

  • pandas contains useful functions for processing large amounts of data and can do pretty much everything Excel can do (some things even better). pandas stores data in so-called DataFrames. DataFrames are similar to 2D arrays in numpy: they store data in rows and columns. The difference is that a DataFrame can hold different types of data, for example floats and strings in two different columns. We can also assign names to columns and rows to keep a better overview of our data (see the sketch after this list).

  • You already know sklearn from the ROC-AUC (Notebook 04). sklearn brings many functions that are important for the preparation of data. There are also several machine learning algorithms in sklearn, including the Random Forest and Support Vector Machines.
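As a small illustration of the pandas bullet above, a DataFrame can mix data types and carries column names (a minimal sketch with made-up values):

df = pd.DataFrame({
    "name": ["aspirin", "caffeine"], # strings
    "mw": [180.16, 194.19],          # floats
    "active": [0, 1],                # integers
})
print(df)
print(df.dtypes) # each column keeps its own data type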

In the following cell, the data for a simple example is read in.

toy_example = pd.read_csv("https://uni-muenster.sciebo.de/s/XTC29EBCPfMfqqh/download", sep = "\t")

print("Type:", type(toy_example))
print("Shape:",toy_example.shape)
toy_example.head()

The data read in with pandas is first stored in a DataFrame, i.e. a table with rows and columns. With .head() you can display the first 5 rows of a DataFrame. The data contains three columns: x1, x2, and y. In total, the file contains 150 entries, i.e. 150 data points.

y indicates the (made-up) membership of each data point in one of two classes.

To get a better overview, we create a diagram for the data.

plt.plot(toy_example.x1, toy_example.x2,"o")
plt.plot( toy_example.x1[toy_example.y==1],  toy_example.x2[toy_example.y==1],"o")
plt.xlabel("x1")
plt.ylabel("x2")

As you can see, one class is surrounded by the other. We want to train an SVM to distinguish the two classes.

You should also notice that the scales of the two input variables x1 and x2 differ significantly.

The x1 values lie roughly between -1 and 1, while the x2 values lie between about 8 and 12.

At first, we will not scale the data and instead train an SVM on it directly. For this purpose the module sklearn.svm offers the necessary functionality. As with linear regression, we first create the model and then train it. For classification, the class SVC (Support Vector Classification) is used. First we create a variable model that contains the Support Vector Classifier (SVC). To train it, we call .fit(x, y) to fit the classifier to our data.

So far we have only trained the SVC; to get its predictions, we again use the function model.predict(x).

Important: As you may have seen above, in pandas we can select variables (i.e. columns) directly from the DataFrame: toy_example.y selects the column y from the DataFrame toy_example. If you want to select values by classical indexing, as with arrays, you must append .iloc[] to the DataFrame name. The indexing then works the same way as with numpy.

toy_example.iloc[:,:2] selects the first two columns, i.e. x1 and x2.
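A few more indexing patterns, assuming the toy_example DataFrame loaded above (a minimal sketch):

print(toy_example.iloc[0, :])  # first row, all columns
print(toy_example.iloc[:3, 1]) # first three values of the second column (x2)
print(toy_example["y"].head()) # selection by name, equivalent to toy_example.y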

from sklearn.svm import SVC

model = SVC()
model.fit(toy_example.iloc[:,:2], toy_example.y)
y_pred = model.predict(toy_example.iloc[:,:2])
y_pred

To assess the quality of the predictions, you can again use the accuracy. You can use the same function as in the last notebook. Calculate the accuracy for the predictions of this model.

def accuracy(y_true, y_pred):
    return np.sum(y_true==y_pred)/len(y_true)

accuracy(_____, ____)
accuracy(toy_example.y, y_pred)

The accuracy is around 0.7. Not bad, but there is still room for improvement. With one of the precoded functions in utils.py you can also make the decision boundaries visible.

plot_svc(toy_example, model)

You can see that the decision boundary is elongated. This is because the scale of x2 is larger: the margin between the decision boundary and the data points of the two classes (which the SVM tries to maximize) is dominated by the variable with the larger scale. As a result, the boundary separates the classes well along x2, but not along x1.
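If you are curious how such a plot can be produced, here is a minimal sketch using a dense prediction grid (this is not the actual plot_svc implementation from utils.py):

# evaluate the trained SVC on a grid and shade the plane by predicted class
xx, yy = np.meshgrid(
    np.linspace(toy_example.x1.min(), toy_example.x1.max(), 200),
    np.linspace(toy_example.x2.min(), toy_example.x2.max(), 200),
)
zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.3)
plt.plot(toy_example.x1, toy_example.x2, "o")
plt.plot(toy_example.x1[toy_example.y == 1], toy_example.x2[toy_example.y == 1], "o")
plt.xlabel("x1")
plt.ylabel("x2")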

To change this, we can try to scale the data. To do this, we use the so-called MinMax scaler, which scales all values to lie between 0 and 1. We apply this scaler to both input variables (x1, x2).

def min_max(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

toy_example.x1 = min_max(toy_example.x1)
toy_example.x2 = min_max(toy_example.x2)
plt.plot(toy_example.x1, toy_example.x2,"o")
plt.plot( toy_example.x1[toy_example.y==1],  toy_example.x2[toy_example.y==1],"o")
plt.xlabel("x1")
plt.ylabel("x2")

The plot looks exactly like the one before; only the scales, i.e. the x- and y-axes, have changed. The relative relationships between the values are unchanged. Can you now create a model_2 with SVC that is trained on the scaled values? Also calculate the accuracy!

model_2 = ____
model_2.fit(_____, ______)
y_pred = model_2.predict(toy_example._____)
accuracy(______,_____)
model_2 = SVC()
model_2.fit(toy_example.iloc[:,:2], toy_example.y)
y_pred = model_2.predict(toy_example.iloc[:,:2])
accuracy(toy_example.y, y_pred)

Scaling the input variables alone improves the accuracy by about 0.3. A clear improvement can also be seen in the decision boundary.

plot_svc(toy_example, model_2)

Even though the data was generated just for this example, it shows why scaling the input variables is so important.

Train & Test Splits

From a theoretical example, we now move to a practical task: a toxicological prediction. pandas can load data directly from a URL and store it as a DataFrame.

data = pd.read_csv("https://raw.githubusercontent.com/filipsPL/tox21_dataset/master/compounds/sr-mmp.tab", sep = "\t")

print("Shape:",data.shape)
data.head()

The data contains three columns: Compound, SMILES, and activity. In total the file contains 2246 molecules.

  • Compound contains the ID that can be used to distinguish the values in the original data set.
  • SMILES contains the SMILES-strings.
  • activity contains the results of the assay. A molecule is active (1) or inactive (0).

If you want to know how many active and therefore toxic molecules are contained in the data set, you can simply calculate the sum of the column activity.

np.sum(data.activity) # data.activity selects the column 'activity' contained in 'data'

The Compound column is not important for the analysis, so we remove it from data. Then we rename the columns so that all names are lowercase.

data = data.iloc[:,1:] # all columns except the first one (index 0) are selected
data.columns = ["smiles", "activity"]
data.head()

You have data, but no input for your model yet. SMILES are str variables, but models need numeric variables as input. This means that you still have to create so-called features. For this you can use the descriptors from the "Cheminformatics" notebook. In the following cells we convert the SMILES to mol objects and use them to calculate the descriptors.

This time we also calculate the molar refractivity (a measure of total polarizability), the number of rotatable bonds, and the TPSA (topological polar surface area). Then we combine all these variables into one DataFrame.

Adjust the for-loops:

# convert all SMILES to mols
mols = np.array([Chem.MolFromSmiles(x) for x in ________ ]) # which SMILES have to be chosen

# 1) N hydrogen bond donors
num_hb_donors = [NumHDonors(x) for x in _______ ] # which variables do we loop over

# 2) Hydrogen bond acceptors
num_hb_acceptors = [NumHAcceptors(x) for _____ in ______] # which variables do we loop over

# 3) Number of rotatable bonds 
num_rotablebonds = [NumRotatableBonds(x) for x in mols]

# 4) Molecular Mass: CalcExactMolWt()
mw = [ _________(___ ) for x in mols]  # calculate the weight with CalcExactMolWt()

# 5) log P: MolLogP()
logP = [________ ___ __ __ ______] # calculate logP with MolLogP()

# 6) Molar refractivity 
mr = [MolMR(x) for x in mols]

# 7) Polar Surface
tpsa = [CalcTPSA(x) for x in mols]

aux_data=pd.DataFrame({
    "hb_donors": num_hb_donors,
    "hb_acceptors": num_hb_acceptors,
    "rotable_bonds": num_rotablebonds,
    "mw": mw,
    "logP": logP,
    "mr":mr,
    "tpsa":tpsa
    })

aux_data["activity"] = data.activity # we add the activity to the DataFrame
aux_data
# convert all SMILES to mols
mols = np.array([Chem.MolFromSmiles(x) for x in data.smiles])

# 1) N hydrogen bond donors
num_hb_donors = [NumHDonors(x) for x in mols]

# 2) Hydrogen bond acceptors
num_hb_acceptors = [NumHAcceptors(x) for x in mols]

# 3) Number of rotatable bonds 
num_rotablebonds = [NumRotatableBonds(x) for x in mols]

# 4) Molecular Mass
mw = [CalcExactMolWt(x) for x in mols]

# 5) log P
logP = [MolLogP(x) for x in mols]

# 6) Molar refractivity 
mr = [MolMR(x) for x in mols]

# 7) Polar Surface
tpsa = [CalcTPSA(x) for x in mols]

aux_data=pd.DataFrame({
    "hb_donors": num_hb_donors,
    "hb_acceptors": num_hb_acceptors,
    "rotable_bonds": num_rotablebonds,
    "mw": mw,
    "logP": logP,
    "mr":mr,
    "tpsa":tpsa
    })

aux_data["activity"] = data.activity 
aux_data

The DataFrame aux_data has a separate column for each descriptor, seven in total. The first row contains the descriptors for the first molecule, the second for the second, and so on. You can now use this data set to build a model. However, there are some additional steps required before you can create a working model and make predictions.

First, we have two types of variables. Variables like the number of rotatable bonds are called discrete because they take only integer values; there is no such thing as 3.5 rotatable bonds in a molecule. In contrast, variables such as logP or TPSA are continuous, i.e. they can take any value within their range.

In addition, we have variables with different scales. Values for weight are much larger than values for logP or the number of HB acceptors, for example.

However, since we will use a random forest model next, the variables do not need to be scaled: random forests rely on neither distances nor gradients.
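You can convince yourself of this with a quick experiment: with the same random_state, a random forest trained on min-max-scaled inputs makes the same predictions as one trained on the raw inputs, because the trees only compare values within each feature. A minimal sketch on synthetic data (not part of the actual analysis):

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3)) * [1, 100, 10_000]        # three features with wildly different scales
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 100 > 0).astype(int) # made-up labels
X_scaled = (X_demo - X_demo.min(axis=0)) / (X_demo.max(axis=0) - X_demo.min(axis=0))

rf_raw = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)
rf_scl = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_scaled, y_demo)
print(np.all(rf_raw.predict(X_demo) == rf_scl.predict(X_scaled))) # expected: True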

# First we divide the data set into 'x' and 'y', i.e. input and output
x = aux_data.iloc[:,:7].values # '[:,:7]' => all rows, the first seven columns (the descriptors)
y = aux_data.iloc[:,7].values  # '[:,7]'  => all rows, only column 7 (the activity)

The new variable x is now an np.array (instead of a DataFrame). This is done by the .values attribute, which quickly converts a pd.DataFrame into an np.array.
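A quick check of what we now have (shapes follow from the 2246 molecules and 7 descriptors):

print(type(aux_data))   # pandas DataFrame
print(type(x))          # numpy ndarray
print(x.shape, y.shape) # (2246, 7) and (2246,)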

You can now train a random forest model. Similar to SVC you must first create a variable with the model and then use .fit() to train the model on the data.

rf = RandomForestClassifier(n_estimators = 1000, random_state = 42) 
# n_estimators sets the number of trees to use
rf.fit(x, y)

To see how good the model is, we need to extract the model's predictions for our data. This time we use the function .predict_proba(x)[:,1], which returns the predicted probability of the positive class. Using the trained rf model, we store these predictions in y_hat.

y_hat=rf.predict_proba(x)[:,1]
y_hat

The predicted probability for the first molecule is 0.014. As with logistic regression, this means that according to our model there is a 1.4% probability that this molecule is active in the MMP assay, i.e. it is unlikely to be toxic. The higher the probability, the more likely it is (according to the model) that the molecule is active in the assay. Often, 0.5 is chosen as the cut-off value: molecules with a predicted probability above 0.5 are classified as toxic.
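In code, such a cut-off is a one-liner (a sketch; the np.round() used below does essentially the same for a 0.5 threshold):

y_pred = (y_hat >= 0.5).astype(int) # 1 if the predicted probability is at least 0.5, else 0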

To better assess how well the model works, compare the predicted values y_pred with the actual values y.

To do this, you can again use the accuracy. To calculate it, you first have to round the probabilities to 0 or 1.

y_pred = np.round(y_hat)
accuracy(y, y_pred)

Then we can calculate the AUC with the function roc_auc_score(). Unlike for the accuracy, the probabilities y_hat are used here instead of the rounded values y_pred.

roc_auc_score(y, ____)
roc_auc_score(y,y_hat)

Very good. The model is almost perfect at prediction: it makes the correct decision in 99% of the cases. In the following cell, the misclassified molecules are selected and displayed together with their predicted probabilities. You do not need to understand this code in detail.

falsely_classified = np.where(y_pred != y)[0] # indices where prediction and label disagree
Draw.MolsToGridImage(mols[falsely_classified],
                     legends=["Predicted Probability:\n"+str(np.round(x,3)) for x in y_hat[falsely_classified]],
                     useSVG=True)

It is notable that the probabilities for these molecules are mostly relatively close to 0.5, which means the model was not very sure about them. In total, however, only 19 molecules were misclassified, so we should not worry too much.

Y-Scrambling

The Problem:

Did the model really learn what is important for the MMP assay, or did the RF model just memorize our data?

We can find out with a simple test: we train the RF model again, but shuffle the variable activity randomly. That is, the true measurements are randomly redistributed among the molecules. This process is called Y-scrambling.

Suppose our real data looks like this:

| smiles | Descriptor 1 | Descriptor 2 | activity |
| --- | --- | --- | --- |
| SMILES 1 | $x_{1,1}$ | $x_{1,2}$ | $y_1$ |
| SMILES 2 | $x_{2,1}$ | $x_{2,2}$ | $y_2$ |
| SMILES 3 | $x_{3,1}$ | $x_{3,2}$ | $y_3$ |
| SMILES 4 | $x_{4,1}$ | $x_{4,2}$ | $y_4$ |

Two descriptors were calculated for each SMILES. We also recorded the activity ($y_1$-$y_4$) for each molecule. The activity $y_1$ is the measured activity of SMILES 1 and so on.

After Y-scrambling, our data looks like this:

| smiles | Descriptor 1 | Descriptor 2 | activity |
| --- | --- | --- | --- |
| SMILES 1 | $x_{1,1}$ | $x_{1,2}$ | $y_2$ |
| SMILES 2 | $x_{2,1}$ | $x_{2,2}$ | $y_3$ |
| SMILES 3 | $x_{3,1}$ | $x_{3,2}$ | $y_4$ |
| SMILES 4 | $x_{4,1}$ | $x_{4,2}$ | $y_1$ |

The $y$ values were randomly assigned to other molecules.

Y-scrambling destroys the actual relationship between the variables x (i.e. our descriptors like logP, …) and the variable y to be predicted; the assignment is now purely random. If our random forest actually learns patterns instead of memorizing the data, the model should perform worse on this data set.

To try this out, you will need the random library. The function random.shuffle() randomly reorders the values of y. Afterwards you can re-train the random forest model.

import random
random.seed(15) # a seed ensures that you all get the same "random" data
y_random = np.array(y) # first we copy y into a new array
random.shuffle(y_random) # then we shuffle y_random in place; no reassignment needed

We have just reshuffled the activity information. Now we can train a random forest model. However, this time we do not use y, but y_random.

# retrain the model 
rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
rf.fit(x, ______) # Which y variable do you need?
y_hat=rf.predict_proba(x)[:,1]
rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
rf.fit(x, y_random) 
y_hat=rf.predict_proba(x)[:,1]

Now calculate the accuracy again (make sure to use y_random, not y, as y_true):

y_pred = np.round(y_hat)
accuracy(y_random, y_pred)

Indeed, the accuracy deteriorates after Y-scrambling, but it is still above 90%. The random forest is still seemingly good at predicting toxicity, even though the labels no longer make any sense. The model cannot have learned anything real, since there is no relationship left between x and y. Thus, the RF model achieved its accuracy only by memorizing the data.

For this reason, a test data set is always used. The test set is not used during training, so the model sees these molecules for the first time when we evaluate its quality. Memorizing the molecules in the training set can still happen, but it does not help the model with the molecules in the test set.


Often datasets are split not only into training/test sets, but into training/validation/test sets. Models are then optimized based on the validation set, and only the optimized model is tested on the test set. In cheminformatics, the validation set is sometimes referred to as the test set; the actual test set is then called the external validation set.
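Such a three-way split can be built from two successive calls to train_test_split. A minimal sketch for a roughly 60/20/20 split of aux_data (the variable names here are only for illustration):

# first split off 20% as the final test set ...
train_val, test_tmp = train_test_split(aux_data, test_size=0.2, random_state=1234)
# ... then split the remainder into training and validation (0.25 of 80% = 20% overall)
train_tmp, val_tmp = train_test_split(train_val, test_size=0.25, random_state=1234)
print(len(train_tmp), len(val_tmp), len(test_tmp))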


Here, too, there are functions that do the work for you. The function train_test_split splits the data into a training set and a test set. We use 80% of the data for training and the rest for testing; the molecules are assigned randomly to the two sets. Afterwards, the data is again split into x and y, this time separately for train and test.

train, test=train_test_split(aux_data,test_size= 0.2, train_size= 0.8, random_state=1234)

train_x = train.iloc[:,:7]
train_y = train.iloc[:,7]
test_x = test.iloc[:,:7]
test_y =  test.iloc[:,7]
f"Train Shape: {train.shape}, Test Shape: {test.shape}"

The training set contains 1796 molecules and the test set 450.

We first train the Random Forest with the training data only:

rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
rf.fit(train_x, train_y)

But we only make the predictions for the molecules in the test set.

y_hat=rf.predict_proba(test_x)[:,1]
y_pred = np.round(y_hat)

We can now also calculate the accuracy. Which variable do we need now for y_true?

accuracy(___, y_pred)
accuracy(test_y, y_pred)

The accuracy has deteriorated somewhat, but it is still good. This time, however, we can be sure that the performance is not due to memorization, since the model has never seen these molecules. If we now apply Y-scrambling, the performance should deteriorate dramatically. We replace aux_data.activity with the previously created y_random (the shuffled y values) and repeat the analysis.

aux_data.activity = y_random 

Then we divide the data again into training and test sets.

train_random, test_random=train_test_split(aux_data,test_size= 0.2, train_size= 0.8, random_state=1234)
train_x_random = train_random.iloc[:,:7]
train_y_random = train_random.iloc[:,7]
test_x_random = test_random.iloc[:,:7]
test_y_random =  test_random.iloc[:,7]

We repeat the training with the randomized data. Then we let the model make predictions for the test set.

rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
rf.fit(train_x_random, train_y_random)
y_hat=rf.predict_proba(test_x_random)[:,1]

Finally, we calculate the accuracy.

y_pred = np.round(y_hat)
accuracy(test_y_random, y_pred)

Using a test set, the accuracy of the model on the Y-scrambled data drops dramatically, to about 50%. It is no better than a model that simply guesses. Only by using a test set can we tell whether a model has actually learned something or merely memorized the training data. The main point was to show the importance of evaluating a model not only on the training set.

Y-scrambling is rarely used in practice, but the OECD, for example, requires a Y-scrambling test to validate QSAR (Quantitative Structure-Activity Relationship) models.

Feature Importance

As a final step, let’s look at the feature importance. It indicates how important each input variable is for the model’s decisions; depending on the ML algorithm, it can be extracted relatively easily.

In the next cell, we first restore the original activity values and re-train the random forest model (without Y-scrambling) on a train/test split. Then we can output the feature importance of the model.

aux_data.activity = data.activity

train, test=train_test_split(aux_data,test_size= 0.2, train_size= 0.8, random_state=1234)
train_x = train.iloc[:,:7]
train_y = train.iloc[:,7]
test_x = test.iloc[:,:7]
test_y =  test.iloc[:,7]


# train model
rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
rf.fit(train_x, train_y)
y_hat=rf.predict_proba(test_x)[:,1]
feat_importances = pd.Series(rf.feature_importances_, index=aux_data.columns.values[:-1])
# .nlargest(20).nsmallest(20) keeps the (at most 20) largest values, sorted ascending for the bar plot
feat_importances.nlargest(20).nsmallest(20).plot(kind='barh')

It is clearly evident that logP is the most important parameter for predicting toxicity here, while the numbers of hydrogen-bond donors and acceptors are less relevant.

That the logP value is important is not surprising. For example, we can look at the density plots of the logP values of active and inactive molecules.

import seaborn as sns
sns.kdeplot(aux_data.logP[aux_data.activity==1], color="red")
sns.kdeplot(aux_data.logP[aux_data.activity==0])

Here we see a clear trend: molecules with a higher logP are more likely to be active.

Try it yourself with the other descriptors (hb_donors, hb_acceptors, rotable_bonds, mw, mr, tpsa). Besides logP, which descriptors show different distributions depending on activity? One possible approach is sketched below.
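A sketch of such a comparison: loop over the remaining descriptors and draw the two density plots for each (sns and plt are imported above):

descriptors = ["hb_donors", "hb_acceptors", "rotable_bonds", "mw", "mr", "tpsa"]
fig, axes = plt.subplots(3, 2, figsize=(10, 10))
for ax, col in zip(axes.flatten(), descriptors):
    sns.kdeplot(aux_data[col][aux_data.activity == 1], color="red", ax=ax) # active
    sns.kdeplot(aux_data[col][aux_data.activity == 0], ax=ax)              # inactive
    ax.set_title(col)
plt.tight_layout()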

Practice Exercise

You have already learned about fingerprints as molecular representations. Since they are easy to calculate and always have a fixed length, they are well suited as input for ML models. However, fingerprints are not as easy for humans to interpret.

Your task is to train a random forest model again; this time you will use ECFP4 fingerprints as input.

For you, the function get_fingerprints() has already been precoded. With it you can compute fingerprints from the SMILES.
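get_fingerprints() itself is precoded in utils.py; the following is not its actual implementation, just a sketch of how such a helper could compute 2048-bit ECFP4 fingerprints with RDKit (the helper name ecfp4_dataframe is made up):

def ecfp4_dataframe(smiles_list):
    # Morgan fingerprint with radius 2 ("ECFP4"), folded to 2048 bits
    fps = [Chem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius=2, nBits=2048)
           for s in smiles_list]
    return pd.DataFrame(np.array(fps)) # one row per molecule, one column per bit

# usage sketch: ecfp4_dataframe(data.smiles)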

fps = get_fingerprints(data)
fps["activity"] = data.activity
fps.head()

fps contains a total of 2049 columns: 2048 of them are the bits of the fingerprint, and the last column contains the activity.

First, the data set is divided into training and test. 80% of the data should be in the training set and 20% in the test set.

train, test = train_test_split(_____,test_size= ___ , train_size= ______, random_state=1234)

train_x = __________
train_y = __________
test_x = __________
test_y = __________ 

After splitting the data, train a random forest classifier with the training dataset. Then use the trained model to classify the molecules in the test dataset.

rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
rf.fit(_____, _____)
y_hat=rf.predict_proba(______)[:,1]
roc_auc_score(___,____)

We can also take another look at the feature importance:

feat_importances = pd.Series(rf.feature_importances_, index=range(2048))
feat_importances.nlargest(20).nsmallest(20).plot(kind='barh', title = "Importance of Features")

Unfortunately, this plot can no longer be interpreted as easily, even though it becomes clear that the top five bits are important for the activity prediction.

The bits themselves can’t be visualized that well, but with RDKit we can show the substructures that are assigned to each bit.

most_important_bits = feat_importances.nlargest(20).index.values
print("The 20 most important bits:", most_important_bits)
mol_ll = []
bi_ll = []


for i in range(20):
    bit = most_important_bits[i]
    # find the first molecule whose fingerprint sets this bit
    for x in data.smiles:
        bi = {}
        mol = Chem.MolFromSmiles(x)
        fp = Chem.rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius=2, bitInfo=bi)
        if bit in bi: # bitInfo maps each set bit to the atoms that produced it
            mol_ll.append(mol)
            bi_ll.append(bi)
            break
        
prints=[(mol_ll[i],most_important_bits[i], bi_ll[i]) for i in range(20)]

Draw.DrawMorganBits(prints, useSVG=True, molsPerRow=3, legends= [str(most_important_bits[i]) for i in range(20)], subImgSize= [300,300])

The most important bit for our data corresponds to a phenolic hydroxyl group (aromatic atoms are highlighted in yellow, the central atom in blue). Many other aromatic fragments are also represented among the most important bits.