---
title: "Using the CDK from R"
author: "Rajarshi Guha"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Using the CDK from R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Using the CDK from R

## Introduction

Given that much of cheminformatics involves mathematical and statistical modeling of chemical information, R is a natural platform for such work. There are many cheminformatics applications that will generate useful information such as descriptors, fingerprints and so on. While one can always run these applications to generate data that is then imported into R, it can be convenient to be able to manipulate chemical structures and generate chemical information within the R environment.

The [CDK](https://cdk.github.io) is a Java library for cheminformatics that supports a wide variety of functionality, ranging from reading molecular file formats and performing ring perception and aromaticity detection to fingerprint generation and molecular descriptors. The CDK website provides links to useful documentation as well as complete [Javadocs](https://cdk.github.io/cdk/2.3/docs/api/index.html?overview-summary.html).

## Getting started

The goal of the `rcdk` package is to allow an R user to access the cheminformatics functionality of the [CDK](https://cdk.github.io) from within R. While one can use the [rJava](https://CRAN.R-project.org/package=rJava) package to make direct calls to specific methods in the [CDK](https://cdk.github.io) from R, such usage does not usually follow common R idioms. Thus `rcdk` aims to allow users to use the [CDK](https://cdk.github.io) classes and methods in an R-like fashion.
The library is loaded as follows:

```{r}
library(rcdk)
```

We can also check the version of the [CDK](https://cdk.github.io) that is being used in the package:

```{r}
cdk.version()
```

The package also provides an example data set, called `bpdata`, which contains 277 molecules in SMILES format and their associated boiling points (BP) in Kelvin. The `data.frame` has two columns, viz., the SMILES and the BP. Molecule names are used as row names:

```{r echo=FALSE}
data(bpdata)
str(bpdata)
```

## Input and Output

Chemical structures come in a variety of formats and the CDK supports many of them. Many such formats are disk based and these files can be parsed and loaded by specifying their full paths:

```{r, eval=FALSE}
mols <- load.molecules( c('data1.sdf', '/some/path/data2.sdf') )
```

Note that the above function will load any file format that is supported by the CDK, so there’s no need to specify formats. In addition one can specify a URL (which should start with `http://`) to specify remote files as well. The result of this function is a list of molecule objects. The molecule objects are of class `jobjRef` (provided by the [rJava](https://CRAN.R-project.org/package=rJava) package). As a result, they are pretty opaque to the user and are really meant to be processed using methods from the `rcdk` or [rJava](https://CRAN.R-project.org/package=rJava) packages.

However, since `load.molecules` loads all the molecules from the specified files into a list, large files can lead to out of memory errors. In such a situation it is preferable to iterate over the file, one structure at a time. Currently this behavior is supported for SDF and SMILES files. An example of such a usage for a large SD file would be:

```{r eval=FALSE}
iter <- iload.molecules('verybig.sdf', type='sdf')
while(hasNext(iter)) {
  mol <- nextElem(iter)
  print(get.property(mol, "cdk:Title"))
}
```

### Parsing SMILES

Another common way to obtain molecule objects is by parsing SMILES strings.
The simplest way to do this is:

```{r eval=FALSE}
smile <- 'c1ccccc1CC(=O)C(N)CC1CCCCOC1'
mol <- parse.smiles(smile)[[1]]
```

Usage is more efficient when multiple SMILES are supplied, since then a single SMILES parser object is used to parse all the supplied SMILES.

If you plan on parsing a large number of SMILES, you may run into memory issues, due to the large size of `IAtomContainer` objects. In such a case, it can be useful to call the Java and R garbage collectors explicitly at the appropriate time. In addition it can be useful to explicitly allocate a large amount of memory for the JVM. For example,

```{r eval=FALSE}
options("java.parameters"=c("-Xmx4000m"))
library(rcdk)
for (smile in smiles) {
  m <- parse.smiles(smile)[[1]]
  ## perform operations on this molecule
  rJava::.jcall("java/lang/System","V","gc")  # ask the JVM to collect garbage
  gc()                                        # then run R's garbage collector
}
```

Given a list of molecule objects, it is possible to serialize them to a file in some specified format. Currently, the only output formats are SMILES or SDF. To write molecules to a disk file in SDF format:

```{r eval=FALSE}
write.molecules(mols, filename='mymols.sdf')
```

By default, if `mols` is a list of multiple molecules, all of them will be written to a single SDF file. If this is not desired, you can write each one to an individual file (the file names are prefixed by the value of `filename`):

```{r eval=FALSE}
write.molecules(mols, filename='mymols.sdf', together=FALSE)
```

### Generating SMILES

Finally, we can generate a SMILES representation of a molecule using

```{r}
smiles <- c('CCC', 'c1ccccc1', 'CCCC(C)(C)CC(=O)NC')
mols <- parse.smiles(smiles)
get.smiles(mols[[1]])
unlist(lapply(mols, get.smiles))
```

The [CDK](https://cdk.github.io) supports a number of _flavors_ when generating SMILES. For example, you can generate a SMILES with or without chirality information or generate SMILES in [Kekule form](https://www.chemguide.co.uk/basicorg/bonding/benzene1.html). The `smiles.flavors` function generates an object that represents the various flavors desired for SMILES output.
See the [SmiFlavor javadocs](https://cdk.github.io/cdk/2.3/docs/api/index.html?org/openscience/cdk/smiles/SmiFlavor.html) for the full list of possible flavors. Example usage is

```{r}
smiles <- c('CCC', 'c1ccccc1', 'CCc1ccccc1CC(C)(C)CC(=O)NC')
mols <- parse.smiles(smiles)
get.smiles(mols[[3]], smiles.flavors(c('UseAromaticSymbols')))
get.smiles(mols[[3]], smiles.flavors(c('Generic','CxSmiles')))
```

Using the [CxSmiles](http://butane.chem.uiuc.edu/jsmoore/marvin/help/formats/cxsmiles-doc.html) flavors allows the user to encode a variety of information in the SMILES string, such as 2D or 3D coordinates.

```{r}
m <- parse.smiles('CCC')[[1]]
m <- generate.2d.coordinates(m)
m
get.smiles(m, smiles.flavors(c('CxSmiles')))
get.smiles(m, smiles.flavors(c('CxCoordinates')))
```

## Visualization

The `rcdk` package supports 2D rendering of chemical structures. This can be used to view the structure of individual molecules or multiple molecules in a tabular format. It is also possible to view a molecular-data table, where one of the columns is the 2D image and the remainder can contain data associated with the molecules.

Due to Java event handling issues on OS X, depictions are handled using an external helper, which means that depiction generation can be slower on OS X compared to other platforms.

Molecule visualization is performed using the `view.molecule.2d` function. This handles individual molecules as well as a list of molecules. In the latter case, the depictions are arranged in a grid (with 4 columns by default).

```{r eval=FALSE}
smiles <- c('CCC', 'CCN', 'CCN(C)(C)',
            'c1ccccc1Cc1ccccc1',
            'C1CCC1CC(CN(C)(C))CC(=O)CC')
mols <- parse.smiles(smiles)
view.molecule.2d(mols[[1]])
view.molecule.2d(mols)
```

The CDK depiction routines allow for extensive customization.
These customizations can be accessed by creating a depictor object using `get.depictor`, which allows you to specify the size of the depiction, the depiction style (black and white, color on white, etc.), atom annotations (e.g., atom index), whether functional group abbreviations should be used or not and so on.

```{r eval=FALSE}
depictor <- get.depictor(style='cob', abbr='reagents', width=300, height=300)
view.molecule.2d(mols[[5]], depictor=depictor)
```

Once you have a depictor object, you can set individual properties using the `$` notation. This can be useful if you plan to generate a lot of depictions, since a new depictor does not have to be created for each new structure.

```{r eval=FALSE}
depictor <- get.depictor(style='cob', abbr='reagents', width=300, height=300)
view.molecule.2d(mols[[5]], depictor=depictor)
#depictor$setStyle('cow')
#view.molecule.2d(mols[[5]], depictor=depictor)
```

The method also allows you to highlight substructures using [SMARTS](https://en.wikipedia.org/wiki/Smiles_arbitrary_target_specification). This is useful for highlighting common substructures in a set of molecules:

```{r eval=FALSE}
depictor <- get.depictor(style='cob', abbr='reagents', sma='N(C)(C)')
view.molecule.2d(mols, depictor=depictor)
```

In many cases, it is useful to view a “molecular spreadsheet”, which is a table of molecular structures along with information (numeric or textual) related to the molecules being viewed. The data is arranged in a spreadsheet-like manner, with one of the columns being molecules and the remainder being textual or numeric information.

This can be achieved using the `view.table` method, which takes a list of molecule objects and a `data.frame` containing the associated data. As expected, the number of rows in the `data.frame` should equal the length of the molecule list. Note that currently, there is no explicit binding between the rows of the `data.frame` and the elements of the list containing the molecules.
Thus the user should take care that the ordering of the `data.frame` matches that of the list.

```{r eval=FALSE}
smiles <- c('CCC', 'CCN', 'CCN(C)(C)','c1ccccc1Cc1ccccc1')
mols <- parse.smiles(smiles)
dframe <- data.frame(x = runif(4),
                     toxicity = factor(c('Toxic', 'Toxic', 'Nontoxic', 'Nontoxic')),
                     solubility = c('yes', 'yes', 'no', 'yes'))
view.table(mols, dframe)
```

While the `view.molecule.2d` function is useful to visualize structures, the depictions can't be included in other visualizations such as plots. For such use cases, the `view.image.2d` function produces a raster image that can be included in plots. This function handles one molecule at a time.

```{r}
img <- view.image.2d(parse.smiles("B([C@H](CC(C)C)NC(=O)[C@H](CC1=CC=CC=C1)NC(=O)C2=NC=CN=C2)(O)O")[[1]])
plot(1:10, 1:10, pch=19)
rasterImage(img, 1,6, 5,10)
```

Finally, the `copy.image.to.clipboard` function allows you to copy a depiction to the system clipboard, from where it can be pasted into other applications. This can be more convenient than saving a raster image.

## Manipulating Molecules

In general, given a `jobjRef` for a molecule object one can access all the classes and methods of the [CDK](https://cdk.github.io) library via rJava. However this can be cumbersome. The `rcdk` package exposes methods and classes that manipulate molecules.

### Adding Information to Molecules

In many scenarios it’s useful to associate information with molecules. Within R, you could always create a `data.frame` and store the molecule objects along with relevant information in it. However, when serializing the molecules, you want to be able to store the associated information with the structure itself (though keep in mind that only certain chemical file formats support metadata along with the structure).

Using the [CDK](https://cdk.github.io) it’s possible to directly add information to a molecule object using properties.
Note that adding such properties uses a key-value paradigm, where the key should be of class `character`. The value can be of class `integer`, `double`, `character` or `jobjRef`. After setting a property, you can retrieve it by its key.

```{r}
mol <- parse.smiles('c1ccccc1')[[1]]
set.property(mol, "title", "Molecule 1")
set.property(mol, "hvyAtomCount", 6)
get.property(mol, "title")
```

It is also possible to get all available properties at once in the form of a list. The property names are used as the list names.

```{r}
get.properties(mol)
```

After adding such properties to the molecule, you can write it out to an SD file, so that the property values become SD tags.

```{r eval=FALSE}
write.molecules(mol, 'tagged.sdf', write.props=TRUE)
```

### Atoms and Bonds

Probably the most important thing to do is to get the atoms and bonds of a molecule. The code below gets the atoms and bonds as lists of `jobjRef` objects, which can be manipulated using rJava or via other methods of this package.

```{r}
mol <- parse.smiles('c1ccccc1C(Cl)(Br)c1ccccc1')[[1]]
atoms <- get.atoms(mol)
bonds <- get.bonds(mol)
cat('No. of atoms =', length(atoms), '\n')
cat('No. of bonds =', length(bonds), '\n')
```

Given an atom, the `rcdk` package does not offer a lot of methods to operate on it. One must access the [CDK](https://cdk.github.io) directly. In the future more manipulators will be added. Right now, you can get the symbol for each atom:

```{r}
unlist(lapply(atoms, get.symbol))
```

It’s also possible to get the 3D (or 2D) coordinates for an atom.
```{r}
coords <- get.point3d(atoms[[1]])
```

Given this, it’s quite easy to get the 3D coordinate matrix for a molecule:

```{r}
coords <- do.call('rbind', lapply(atoms, get.point3d))
```

Once you have the coordinate matrix, a quick way to check whether the molecule is flat is to do

```{r}
if ( any(apply(coords, 2, function(x) length(unique(x))) == 1) ) {
  print("molecule is flat")
}
```

This is quite a simplistic check that just looks at whether the X, Y or Z coordinates are constant. To be more rigorous one could evaluate the moments of inertia about the axes.

### Substructure matching

The [CDK](https://cdk.github.io) library supports substructure searches using [SMARTS](https://en.wikipedia.org/wiki/Smiles_arbitrary_target_specification) (or SMILES) patterns. The implementation allows one to check whether a target molecule contains a substructure or not as well as to retrieve the atoms and bonds of the target molecule that match the query substructure. At this point, `rcdk` only supports the former operation - given a query pattern, does it occur or not in a list of target molecules. The `matches` method of this package returns a logical vector where the $i$'th element is `TRUE` if the $i$'th target molecule contains the query substructure. An example of its usage would be to identify molecules that contain a carbon atom that has exactly two bonded neighbors.

```{r}
mols <- parse.smiles(c('CC(C)(C)C','c1ccc(Cl)cc1C(=O)O', 'CCC(N)(N)CC'))
query <- '[#6D2]'
matches(query, mols)
```

## Molecular Descriptors

A key requirement for the predictive modeling of molecular properties and activities is a set of [molecular descriptors](https://en.wikipedia.org/wiki/Molecular_descriptor) - numerical characterizations of the molecular structure. The [CDK](https://cdk.github.io/) implements a variety of molecular descriptors, categorized into topological, constitutional, geometric, electronic and hybrid.
It is possible to evaluate all available descriptors in one go, or evaluate individual descriptors. First, we can take a look at the available descriptor categories.

```{r}
dc <- get.desc.categories()
dc
```

Given the categories we can get the names of the descriptors for a single category. Of course, you can always provide the category name directly.

```{r}
dn <- get.desc.names(dc[4])
dn
```

Each descriptor name is actually a fully qualified Java class name for the corresponding descriptor. These names can be supplied to `eval.desc` to evaluate a single or multiple descriptors for one or more molecules.

```{r eval=FALSE}
aDesc <- eval.desc(mol, dn[4])
allDescs <- eval.desc(mol, dn)
```

The return value of `eval.desc` is a `data.frame` with the descriptors in the columns and the molecules in the rows. For the above example we get a single row. But given a list of molecules, we can easily get a descriptor matrix. For example, let's build a linear regression model to predict boiling points for the BP dataset. First we need a set of descriptors and so we evaluate all available descriptors. Also note that since a descriptor might belong to more than one category, we should obtain a unique set of descriptor names:

```{r}
descNames <- unique(unlist(sapply(get.desc.categories(), get.desc.names)))
```

For the current discussion we focus on a few, manually selected descriptors that we know will be related to boiling point.

```{r}
data(bpdata)
mols <- parse.smiles(bpdata[,1])
descNames <- c(
 'org.openscience.cdk.qsar.descriptors.molecular.KierHallSmartsDescriptor',
 'org.openscience.cdk.qsar.descriptors.molecular.APolDescriptor',
 'org.openscience.cdk.qsar.descriptors.molecular.HBondDonorCountDescriptor')
descs <- eval.desc(mols, descNames)
class(descs)
dim(descs)
```

When a descriptor value cannot be computed, its value is set to `NA`. This may happen if a descriptor requires 3D coordinates, but only 2D coordinates are available.
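A quick way to see which descriptors are affected is to count the missing values in each column of the descriptor table. The following is a minimal sketch in base R, using a small made-up table (`toy.descs`, with hypothetical columns `d1` to `d3`) as a stand-in for the output of `eval.desc`:

```{r}
# Toy descriptor table with made-up values, standing in for eval.desc() output
toy.descs <- data.frame(d1 = c(1.2, 3.4, 5.6),
                        d2 = c(NA, 0.5, NA),
                        d3 = c(7, 7, 7))

# Number of missing values in each descriptor column
na.counts <- colSums(is.na(toy.descs))
na.counts

# Names of descriptors that could not be computed for at least one molecule
names(na.counts)[na.counts > 0]
```

Applied to the `descs` table computed above, the equivalent call would be `colSums(is.na(descs))`.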
In this case, we have manually selected descriptors such that there will be no undefined values.

Given the ubiquity of certain descriptors, some of them are directly available via their own functions. Specifically, one can calculate [TPSA](https://pubmed.ncbi.nlm.nih.gov/19149561/) (topological polar surface area), [AlogP](https://pubs.acs.org/doi/abs/10.1021/jp980230o) and [XlogP](http://www.compchemcons.com/xlogp/xlogp.html) without having to go through `eval.desc`. (Note that AlogP and XlogP assume that hydrogens are explicitly specified in the molecule. This may not be true if the molecules were obtained from SMILES.)

```{r}
mol <- parse.smiles('CC(=O)CC(=O)NCN')[[1]]
convert.implicit.to.explicit(mol)
get.tpsa(mol)
get.xlogp(mol)
get.alogp(mol)
```

Now that we have a descriptor matrix, we can easily build a linear regression model. First, remove `NA`’s, correlated and constant columns. The code is shown below, but since it involves a stochastic element, we will not run it for this example. If we were to perform feature selection, then this type of reduction would have to be performed.

```{r}
# drop columns containing any NA
descs <- descs[, !apply(descs, 2, function(x) any(is.na(x)) )]
# drop constant columns
descs <- descs[, !apply( descs, 2, function(x) length(unique(x)) == 1 )]
# drop one column from each pair with squared correlation above 0.6
r2 <- which(cor(descs)^2 > .6, arr.ind=TRUE)
r2 <- r2[ r2[,1] > r2[,2] , ]
descs <- descs[, -unique(r2[,2])]
```

Note that the above correlation reduction step is pretty crude and there are better ways to do it. Given the reduced descriptor matrix, we can perform feature selection (say using [leaps](https://CRAN.R-project.org/package=leaps), [caret](https://CRAN.R-project.org/package=caret) or a [GA](https://CRAN.R-project.org/package=GenAlgo)) to identify a suitable subset of descriptors. Given that we selected the descriptors by hand, we can skip this section, and directly build the model and generate a plot of predicted versus observed BP. (Note that this is a toy example and is not an example of good QSAR practice!)
```{r}
model <- lm(BP ~ khs.sCH3 + khs.sF + apol + nHBDon, data.frame(bpdata, descs))
summary(model)
plot(bpdata$BP, predict(model, descs),
     xlab="Observed BP", ylab="Predicted BP",
     pch=19, xlim=c(100, 700), ylim=c(100, 700))
abline(0,1, col='red')
```

## Fingerprints

Fingerprints are a common representation used for a variety of purposes such as similarity searching and predictive modeling. The [CDK](https://cdk.github.io/) provides a variety of fingerprints ranging from path-based hashed fingerprints to circular (specifically, an implementation of the [ECFP](https://pubs.acs.org/doi/abs/10.1021/ci100050t) fingerprints) and signature fingerprints (based on the [signature](https://pubs.acs.org/doi/abs/10.1021/ci020345w) molecular descriptor). Some of the fingerprints are represented as binary strings and others by integer vectors. The `rcdk` package employs the [fingerprint](https://CRAN.R-project.org/package=fingerprint) package to support operations on the resultant fingerprints.

In this section, we present an example of using fingerprints to generate a hierarchical clustering of a set of molecules from the included boiling point dataset. We first parse the SMILES for the molecules in the dataset and then compute the fingerprints, specifying the `circular` type.

```{r}
data(bpdata)
mols <- parse.smiles(bpdata[,1])
fps <- lapply(mols, get.fingerprint, type='circular')
```

With the fingerprints, we can then compute a pairwise similarity matrix using the [Tanimoto](https://en.wikipedia.org/wiki/Chemical_similarity) metric. Since R's `hclust` method requires a distance matrix, we convert the similarity matrix to a distance matrix:

```{r}
fp.sim <- fingerprint::fp.sim.matrix(fps, method='tanimoto')
fp.dist <- 1 - fp.sim
```

Finally, we can perform the clustering. In this case we use the `hclust` method though any of R's clustering methods could be used.
```{r}
cls <- hclust(as.dist(fp.dist))
plot(cls, main='A Clustering of the BP dataset', labels=FALSE)
```

Another common task for fingerprints is similarity searching. That is, given a collection of _target_ molecules, find those molecules that are similar to a _query_ molecule. This is achieved by evaluating a similarity metric between the query and each of the target molecules. Those target molecules exceeding a user-defined cutoff will be returned. With the help of the [fingerprint](https://CRAN.R-project.org/package=fingerprint) package this is easily accomplished.

For example, we can identify all the molecules in the BP dataset that have a Tanimoto similarity of 0.3 or more with acetaldehyde, and then create a tabular summary. Note that this could also be accomplished with molecular descriptors, in which case you’d probably evaluate the Euclidean distance between descriptor vectors.

```{r}
query.mol <- parse.smiles('CC(=O)')[[1]]
target.mols <- parse.smiles(bpdata[,1])
query.fp <- get.fingerprint(query.mol, type='circular')
target.fps <- lapply(target.mols, get.fingerprint, type='circular')
sims <- data.frame(sim=do.call(rbind, lapply(target.fps,
                                             fingerprint::distance,
                                             fp2=query.fp, method='tanimoto')))
subset(sims, sim >= 0.3)
```
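
To make the cutoff logic above concrete, recall that the Tanimoto coefficient for two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The following is a minimal base R sketch on made-up bit positions. The function `toy.tanimoto` and the three toy targets are illustrative assumptions and are not part of `rcdk` or the `fingerprint` package, which operate on proper fingerprint objects.

```{r}
# Tanimoto similarity between two fingerprints, each given as a vector of
# the positions of the bits that are set (toy data, not real CDK output)
toy.tanimoto <- function(a, b) {
  common <- length(intersect(a, b))
  total  <- length(union(a, b))
  if (total == 0) return(NA_real_)  # undefined for two empty fingerprints
  common / total
}

# Hypothetical query and targets, purely for illustration
toy.query   <- c(1, 4, 7, 9)
toy.targets <- list(t1 = c(1, 4, 7, 9), t2 = c(1, 4, 8), t3 = c(2, 3, 5))

# Similarity of every target to the query, then apply the cutoff and rank
toy.sims <- sapply(toy.targets, toy.tanimoto, b = toy.query)
toy.hits <- toy.sims[toy.sims >= 0.3]
toy.hits[order(toy.hits, decreasing = TRUE)]
```

Sorting the hits by decreasing similarity, as in the last line, is often a convenient way to present the results of a similarity search.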