DETERMINING THE BEST SET OF MOLECULAR DESCRIPTORS FOR A TOXICITY CLASSIFICATION PROBLEM

The safety norms for drug design are very strict with at least three stages of trials. One test, early on in the trials, is about the cardiotoxicity of the molecules, that is, whether the compound blocks any heart channel. Chemical libraries contain millions of compounds. Accurate a priori and in silico classification of non-blocking molecules, can reduce the screening for an effective drug, by half. The compound has to be checked for other risk factors alongside its therapeutic effect; these tests can also be done using a computer. Actual screening in a research laboratory is very expensive and time consuming. To enable the computer modelling, the molecules are provided in Simplified Molecular Input Line Entry (SMILE) format. In this study, they have been decoded using the chem-informatics development kit written in the Java language. The kit is accessed in the R statistical software environment through the rJava package, that is further wrapped in the rcdk package. The strings representing the molecular structure, are parsed by the rcdk functions, to provide structure-activity descriptors, that are known, to be good predictors of biological activity. These descriptors along with the known blocking behaviour of the molecule, constitute the input to the Decision Tree, Random Forest, Gradient Boosting, SupportVector-Machine, Logistic Regression, and Artificial Neural Network algorithms. This paper reports the results of the data analysis project with shareware tools, to determine the best subset of molecular descriptors, from the large set that is available. Mathematics Subject Classification. 62P10, 92-10. Received February 22, 2021. Accepted August 14, 2021.


Literature review
Notable researchers in the biophysics and biochemistry field are Paul and Gautham [17,18], Vengadesan and Gautham [23], and Wang [24]. A recent, but somewhat technical discussion of the viral pharmacokinetics is given by Hirano and Murakami [10]. Simplified Molecular Input Line Entry System is a notation or nomenclature that allows a user to represent a chemical structure in a way that can be parsed by the computer. This is compact way of representing the molecular structure of a drug molecule. A simple exposition of this notation is presented by Anderson et al. [3]. More detailed information about this notation is presented by Hunter et al. [11], Weininger [25] and Weininger et al. [26]. Roy, an expert in QSAR models, clarifies that toxic activity and therapeutic activity are the two sides of the same coin, in these classification studies. He also emphasizes, that in the end of the data analysis, a mechanistic interpretation of the final set of descriptors should be attempted [16]. Smith and Toppur, have investigated the geometrical structure of the Collagen protein, which is the connective network for transmitting forces in human and animal tissues, in the context of shortest interconnecting networks [20]. Collagen owes its unique properties not only to its chemical composition, but also to the physical arrangement of its individual molecules. The basic molecular polypeptide chain forms a left-handed helix, and three such helices are wrapped around each other to form a right-handed super-helix [12]. In retrospect, that was also a structure-activity study, which used the three-dimensional coordinates of the atoms. Atomic coordinates, have not been used so far in this study. Such an analytical approach combined with statistical methods, are sure to be useful with other molecules besides Collagen.
The dataset below in Table 1 shows an extract from a large dataset of molecules compiled by Sharath Kumar et al. [14]. The concern is about how they affect the human heart which is described by the dichotomous variable  Blocker/Non-blocker. The researchers used Random Forest, Multilayer Perceptron and Sequential Minimal Optimization techniques. Random Forest models, were found to be most robust. Various classification methods, have been explained in the SPSS guide by Elliott [7]. The mathematics underneath the method is explained, in the textbooks by James [13], and Kumar [15]. There are various diagnostic tests for determining the goodness of the model, such as precision, recall/sensitivity, specificity, Omnibus test, Wald test, Hosmer-Lemeshow test, Classification Plot, Receiver Operating Curve (ROC), and Area Under Curve (AUC). The precision of a model is the ratio of the number of true positives to the total number of predicted positives. The recall of a model is another name for the true positive rate. The specificity refers to the true negative rate. Approaches to Pattern Recognition (PR) in electrical and computer engineering, using structural, syntactic, and neural networks have been systematized as far back as 1992, by Schalkoff [19]. A recent book on deep learning, as the field of neural networks is referred to these days, and one specific to the R statistical environment, is by Chollet and Allaire [6].
Quantitative Structure-Activity Relationship (QSAR) models attempt to relate the structure of the molecule with its chemical and biological activity. A comprehensive and recent survey of QSAR approaches is presented by Abdel-Illah et al. [1]. They differentiate between structure based and ligand based approaches to drug design. QSAR is a type of ligand based drug design, in which the structure of the target molecule is not available. QSAR models are divided further into two dimensional and three dimensional models. We have seen in the literature, that some physical property such as boiling point of a compound can be predicted very precisely by a neural network that has been trained on inputs obtained from the structure of the molecule of the compound [8].
We next provide some context about the computer packages used in this paper; the Java computer language was developed at Oracle Corporation, in California. The Java software package CDK, stands for Cheminformatics Developer Kit and can be used for analysing molecular information. The OpenScience Project, can be used for free, to analyse molecular information [22]. A derived wrapper package rcdk by Rajarshi Guha, makes the functionality of the CDK accessible, to beginner and intermediate users of the R language [9]. Although advanced, licensed, and commercial systems are being used at the pharmaceutical companies, these shareware packages can be used by small researchers, who are interested in the search for the present day holy grailan effective antivirus for COVID-19. PaDEL is another popular software for generating the set of molecular descriptors using the CDK. Dong et al. have developed the Rcpi package that also provides similar descriptors [5]. Their paper mentions 48 descriptor types, yielding 288 molecular descriptors. The rcdk package by Guha, describe 55 molecular descriptors across 5 descriptor categories, which are enumerated in the next section. This is the package that we have used for obtaining the input variables.
The Rattle package in R from Togaware Pty, offers many classification tools [27]. We have tried out all the classification techniques, available in Rattle, namely, decision trees, Random Forest, Gradient Boosting, Logistic Regression, followed by Neural Network with one hidden layer. The package gives the results of most of the statistical tests such as sensitivity and specificity, Wald's test, ROC curve, and area under the curve, to determine the significance of the predictor variables, and many charts that can help one determine the robustness of the model. The package exhibits artificial intelligence; the variable selection rules for example, are built into the various classification algorithms. However, if the end user so wishes, he may also specify the variables to be used in a model.

Methodology
Just like there are various atomic descriptors, such as mass and charge, to tell apart different elements in the periodic table, the differences in the structure of a molecule is captured by the molecular descriptors. Common sense descriptors are counts of atoms, bonds, acids, bases, and about the lengths of chains and rings. The less obvious descriptors, are about attributes such as charge, polar surface area, anisotropy, and immiscibility. Anisotropy or Aelotropy is about possessing different physical properties in different directions; for example, certain crystals have a different refractive index, in different directions [12]. These descriptors may be used to relate the structure of the molecule to the biological activity of the molecule. First, we must look at the available descriptor categories. According to one popular taxonomy there are five descriptor classes: (1) Hybrid -2 descriptors.
Adding across these five descriptor types, there are 55 molecular descriptors. However they are not mutually exclusive categories, and one or two descriptors appear in more than one category. The dataset in question, has 9204 molecules in SMILE format. Though we have taken a black-box approach and simply used the classification tools in the Rattle data-miner, a few words about the internal mechanism of these tools, is essential, so that one may know how they are different from each other. Naïve Bayesian classification is traditionally, the simplest approach to classification, based on branching at some critical value of a variable. With multiple input variables, the branching decisions becomes increasingly complex. This methodology is generalized to Random Forest which creates numerous decision trees to determine the best one. Random Forest is an ensemble of unpruned decision trees, that are robust to variance and bias. Random forests are often used when we have large training datasets and particularly a very large number of input variables [27]. In Gradient Boosting, a weight is associated with each observation in the dataset. A series of models are built and the weights are increased or boosted, if a model incorrectly classifies an observation. Support Vector Machine models (SVM) attempt to identify a separating hyperplane between the two classes of points reminiscent of discriminant analysis. Logistic regression works by generalizing the multiple regression function, with a sigmoid function. The logistics function is typically defined in the form of a probability: This can be written as: After finding the odds ratio and taking the natural logarithm of both sides, we get the Generalized Linear Model (GLM) which resembles the multiple regression function. After the estimation of the beta parameters, one can calculate the class probability, for any set of values. From the probability, the odds, and furthermore, the Odds Ratio (OR) is calculated, that measures the importance of a predictor variable on the response. The relative importance of predictor variables is also ranked [15]. Finally neural networks, based on a connectionist architecture and view of learning patterns, adjust the values of a large network of weights, over many iterations, to correctly identify the output categories, when presented with an input.
Out of the 9204 molecules, 7630 molecules (83%) were classified as blockers, and 1574 molecules (17%) were classified as Non-blockers. Forty-nine unique descriptors, gave a total of 303 related molecular descriptors in the dataset. Only 190 were usable in the classification exercise, others being constant or "NA" or missing for the molecules. This then was the unbalanced dataset to use in the classification algorithms, with many of the molecules being of the blocker type. More weightage is to be given to the non-blocker data, since they constituted only 17% of the entire dataset, otherwise any hyperplane of discrimination, would be overfitted to the more prevalent class. A balanced dataset taking equal number of molecules from both classes had 3148 observations. All the experiments were performed on a Lenovo desktop computer running Windows 10, with an i3 dual core Intel chip, and 16 GB RAM.

Results
Though the variable selection was automatically done by the Rattle software, some exploratory data analysis was done to create box and whisker plots, and also histograms including probability density functions across the two classes. A few examples for selected descriptors are given in Appendix B. Such charts are useful to conjecture, which descriptors are going to be significant in the classification. If the boxplot for a class is higher or lower, or wider than for the other class, it is highly probable that the descriptor will play an important role in the model. Similarly if the overlap in the histograms for the two classes is almost perfect for some variables, then those descriptors are likely to play a minor role in the model. For any descriptor, a statistical two sample t-test gives the p-value for which the hypothesized mean difference for the two categories, is significantly nonzero. These t-tests can be used with the boxplots and other charts, to build up, like in step-wise regression, the number of input variables in the classification function.

Models with complete dataset
The complete dataset as mentioned earlier, has 9204 observations, and 192 variables. In our preliminary analysis, we used the entire dataset, and followed this up with the same battery of tests, on the balanced dataset.

Decision tree and random forest
The decision tree tool took 9.33 s, and uses only four variables, namely ALogP, khs.aaCH, MDEN.22, and XLogP. A definition for each descriptor is given in Appendix A. The classification error is very low at 2.2% for blocker class, and quite high at 84.5% for the non-blocker class. If we only want to eliminate the blocker class, this is good, but it is by no means a robust classification model. The random forest method took 3.50 min. The runtime is comparatively long, because it creates 500 decision trees, sampling 13 variables from the set of 190 input variables. The error for the Blocker class is 1.2%, and the error for the Non-Blocker class is reduced to 57.6%. The top ten descriptors from this execution were, XLogP, ALogP, ALogp2, WTPT.5, MDEC.33, MDEC.13, ATSc3, Fsp3, tpsaEfficiency, and MDEO.11.

Gradient boosting & SVM
From the execution of the Gradient Boosting method on the full dataset, the important descriptors were, ALogP, XLogP, MDEO.11, MDEN.23, and WTPT.4. The Boosting method took 8.07 s. The results are inferior to the Random Forest method, with error for the Blocker class at 3.3% and error for the Non-Blocker class at 52.7%. The SVM method took 49.87 s. The error for the Blocker class was 0% but 98.4% for the Non-Blocker class, a highly polarized outcome.

Logistic regression
An execution, of the logistic regression method, took 5.16 min. It yielded a Chi-square p-value of 0.000 and a pseudo R-Square (optimistic) of 0.54497665. For the test set, the overall error was 16.1%, and the averaged class error was 36.85%. The results for this experiment are displayed in Table 2. We also ran the logistic regression without partitioning the dataset into training, validation and testing parts. The pseudo r 2 improved to 0.53 from 0.51, and the misclassification of non-blockers was reduced by about 2.4% from 69%.

Neural network
An artificial neural network with 10 neurons in a single hidden layer took 33.09 s, and the classification matrix was as in Table 3. Increasing the number of neurons did not improve the performance. Furthermore, in neural network implementations, one is unable to tell the most important variables.
Comparing all these methods with respect to classification error, the Random Forest, ensemble method emerges the best. However the almost 50% misclassification for Non-Blocker class indicates that the using the full dataset, with an under-represented class is not effective.

Models with balanced dataset
To counter the effects of imbalance, we next took equal number of molecules from each class in a dataset referred to in the following tables as the balanced dataset. Since we had 1574 non-blocker molecules, we took 1574 molecules from the blocker class, for a total of 3148 observations. All cited results are for the test data which constitutes 15% of the balanced dataset.

Decision tree and random forest
The decision tree method for the reduced dataset, took 2.29 s. The error for the Blocker class was 17.4% and for the Non-Blocker class, it was 19.0%. The eleven variables selected for splitting the tree were ALogP, C2SP2, ECCEN, FMF, MDEO.11, nAcid, nBase, nG, SC.5, SP.5, and VCH.7.
The Random Forest method took 37.49 s. The error for the Blocker class was 9.8% and for the Non-Blocker class was 6.4% with an overall error of only 8.1%. This is the best performance so far. Figure 1 shows how the error rate falls, as the number of decision trees calculated increases. An ROC chart plots the true positive rate against the false positive rate [27]. The Area under the curve (AUC) of 0.957 in the Receiver Operating Characteristics (ROC) curve, in Figure 2 shows that the accuracy of the model is very good. In this particular figure the false positive rate along the x-axis, is indicated as False Alarm Rate, and the true positive rate along the y-axis is indicated as Hit Rate.
The ten most important descriptors in descending order are ALogP, Alogp2, XLogP, tpsaEfficiency, MDEO.11, SC.5, nHBAcc, nAcid, TopoPSA, and WTPT.5. In fact, all the variables are ranked according to the mean decrease in the Gini index. Statistical t-tests for difference of two means resulted in very small p-values indicating that the means were significantly different for the two classes. One can see from the correlation clusters in Figure 3, that the partition coefficient variables (ALogP, Alogp2, and XLogP) that measure the hydrophobicity or immiscibility of organic compounds in water, are correlated at a low minimum distance (MD). The other seven variables, such as number of acids (nAcid) or number of Hydrogen bond acceptors (nHBAcc) are relatively independent.

Gradient boosting & SVM
The gradient boosting method takes only 2.10 s on this dataset. The error is 11.5% and 5.5% respectively, for the Blocker and Non-Blocker class. The overall error was 8.5% which is the best amongst all the methods used. Comparing this to the skewed and poor results obtained from applying Gradient Boosting to the full dataset implies that a balanced dataset is necessary for the method to work well. The five best descriptors are ALogP,

Logistic regression
The logistics regression on this reduced dataset provided a much higher pseudo r 2 of about 0.62, up from 0.54 obtained for the full dataset. This does also suggest that imbalanced datasets detract significantly from the classification accuracy. However, from Table 4 for the test dataset, we can see that the classification error for the majority class, has gone up to 20.9% from 4.7%. The classification error for the minority class, is down from 69% to 26.3% which is 42.7% percentage points reduction in error. The overall error is 21%, and the averaged class error is 23.6%. Twenty four descriptors are shown to be significant at α = 0.001. These are nAcid, aLogP, Alogp2, apol, nA, nG, nAromBond, ATSc1, ATSc3, ATSm4, ATsp2, nBase, C1SP2, C3SP3, SCH.3, VCH.4, VCH.7, VP.3, Kier3, khs.sCH3, khs.sBr, MDEN.33, WTPT.2, and WTPT.5. Out of all these descriptors, four descriptors are common with the set obtained from Random Forests. These are nAcid, ALogP, ALogp2, and WTPT.5, and they represent all three clusters of variables in Figure 3.

Neural network
The neural network with 10 neurons in the hidden layer, is trained in 1.63 s. The error is quite high at 53.2% and 33.5%.

Conclusions
We have tried the entire range of methods available in the Rattle data-miner. If we are concerned only with excluding the blocker type of molecule, in the interest of reducing the search space, then the balanced dataset is not a necessary condition, and the error percentage for the class, which constitutes 83% of the data, is less than 3%. If good classification performance is required for both classes, a balanced dataset is a must.
With the balanced dataset, the Random forest method shows accuracy over 91% for both classes. Gradient Boosting method has shown that classification accuracy can be as good as 88.5% for both classes with up to 94.5% accuracy for one class. The significant descriptors have been identified. These descriptors may be used to create a mechanistic model of the pharmacokinetic action of concern.
The neural network implementation did not show very good results. Neural networks with its many weights and parameters require many hours of fine tuning and the development of a tuned neural network for this dataset, in a dedicated deep learning framework such as Keras, is definitely a research direction to be taken. Even if classification is near perfect in such a neural network, a method to determine the best descriptors from the weights would be a scientific advance.
This study has shown that shareware software such as rcdk package and the Rattle data-mining tool, are competitive with the commercial grade software in this area of scientific research. The Moreau-Broto autocorrelation descriptors using partial charges ATSm1, ATSm2, ATSm3, ATSm4, ATSm5

Hybrid
Eigenvalue based descriptor noted for its utility in chemical diversity described by Pearlman et al.

MW
Constitutional Descriptor based on the weight of atoms of a certain element type. If no element is specified, the returned value is the Molecular Weight WTPT-1 WTPT-2 WTPT-3 WTPT-4 WTPT-5

Topological
The weighted path (molecular ID) descriptors described by Randic.