Determining the best set of molecular descriptors for a Toxicity classification problem

Badri Toppur; K.J. Jaims

doi:10.1051/ro/2021134

RAIRO-Oper. Res., 55 5 (2021) 2769-2783

Open Access

Issue: RAIRO-Oper. Res. Volume 55, Number 5, September-October 2021
Page(s): 2769 - 2783
DOI: https://doi.org/10.1051/ro/2021134
Published online: 20 September 2021

Badri Toppur¹^* and K.J. Jaims²

¹ Rajalakshmi School of Business, Chennai, India
² DC School of Management and Technology, Kochi, India

^* Corresponding author: badri.toppur@rsb.edu.in; badri.toppur@gmail.com

Received: 22 February 2021
Accepted: 14 August 2021

Abstract

The safety norms for drug design are very strict, with at least three stages of trials. One test, early in the trials, concerns the cardiotoxicity of the molecules, that is, whether the compound blocks any heart channel. Chemical libraries contain millions of compounds. Accurate a priori, in silico classification of non-blocking molecules can reduce the screening effort for an effective drug by half. The compound also has to be checked for other risk factors alongside its therapeutic effect; these tests can likewise be done on a computer. Actual screening in a research laboratory is very expensive and time-consuming. To enable computer modelling, the molecules are provided in Simplified Molecular Input Line Entry System (SMILES) format. In this study, they have been decoded using the Chemistry Development Kit (CDK), a cheminformatics library written in Java. The kit is accessed from the R statistical software environment through the rJava package, which is in turn wrapped by the rcdk package. The strings representing the molecular structure are parsed by the rcdk functions to provide structure-activity descriptors that are known to be good predictors of biological activity. These descriptors, along with the known blocking behaviour of the molecule, constitute the input to the Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine, Logistic Regression, and Artificial Neural Network algorithms. This paper reports the results of a data analysis project, carried out with shareware tools, to determine the best subset of molecular descriptors from the large set that is available.

Mathematics Subject Classification: 62P10 / 92-10

Key words: Data mining / Bayesian classification problem / random forest / gradient boosting / biochemistry
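
As a rough illustration of what "determining the best subset of descriptors" can mean in practice, the sketch below scores every subset of a small descriptor table and keeps the best one. It is not the authors' method: the paper works in R with rcdk and the six classifiers named in the abstract, whereas this sketch uses plain Python, a nearest-centroid classifier as a stand-in, and leave-one-out accuracy as the score. The descriptor names (logP, mol_weight, h_bond_donors) and all numeric values are synthetic and hypothetical, chosen only to make the example run.

```python
from itertools import combinations

# Synthetic, purely illustrative rows: (descriptor values, blocker label).
# Names and numbers are hypothetical; they are not data from the paper.
DESCRIPTORS = ["logP", "mol_weight", "h_bond_donors"]
ROWS = [
    ((4.1, 410.0, 1.0), 1),
    ((3.8, 395.0, 3.0), 1),
    ((4.5, 402.0, 0.0), 1),
    ((3.9, 420.0, 2.0), 1),
    ((1.2, 405.0, 2.0), 0),
    ((0.9, 398.0, 0.0), 0),
    ((1.5, 415.0, 3.0), 0),
    ((1.1, 408.0, 1.0), 0),
]


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy(rows, cols):
    """Leave-one-out accuracy of a nearest-centroid classifier on columns `cols`."""
    hits = 0
    for i, (features, label) in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        centroids = {}
        for cls in {lab for _, lab in train}:
            pts = [tuple(f[c] for c in cols) for f, lab in train if lab == cls]
            centroids[cls] = centroid(pts)
        probe = tuple(features[c] for c in cols)
        predicted = min(centroids, key=lambda cls: sq_dist(probe, centroids[cls]))
        hits += predicted == label
    return hits / len(rows)


def best_subset(rows, names):
    """Exhaustively score every non-empty descriptor subset, smallest first."""
    best, best_score = (), -1.0
    for k in range(1, len(names) + 1):
        for cols in combinations(range(len(names)), k):
            score = loo_accuracy(rows, cols)
            if score > best_score:  # strict: ties keep the smaller, earlier subset
                best, best_score = cols, score
    return tuple(names[c] for c in best), best_score


subset, score = best_subset(ROWS, DESCRIPTORS)
print(subset, score)
```

Exhaustive search is only feasible for a handful of descriptors, since the number of subsets doubles with each one added. For the large descriptor sets mentioned in the abstract, a greedy or importance-based selection would be the more realistic choice.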

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
