issue with aromatic rings in `write.molecules()`
From my email:
Hi,
I'm running into grief with write.molecules() in rcdk. Any molecules
that I am inputting with an aromatic ring containing double bonds, have
those double bonds delocalised in the output .sdf file.
See example of the pubchem sdf for salicylic acid (Structure2D_CID_338.sdf),
which I'm using as input. Using load.molecules() to read this in, then write.molecules()
to output it as Structure2D_CID_338_cdk.sdf changes the bond coding from fixed to
delocalised in the ring. Taking this .sdf as input in openbabel turns it back to Structure2D_CID_338_cdk_openbabel.sdf, which has fixed position aromatic bonds again.
Is there a way to control this behavior in the write to sdf file? I have some downstream programs that
can't deal with delocalised bonds in input sdf files (Compound DIscoverer)
thanks
Tony
> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS
Matrix products: default
BLAS: /shared/storage/biology/rsrch/tf-PAB/lab/trl1/bin_research0/lib/R/lib/libRblas.so
LAPACK: /shared/storage/biology/rsrch/tf-PAB/lab/trl1/bin_research0/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_GB LC_NUMERIC=C LC_TIME=en_GB
[4] LC_COLLATE=en_GB LC_MONETARY=en_GB LC_MESSAGES=en_GB
[7] LC_PAPER=en_GB LC_NAME=en_GB LC_ADDRESS=en_GB
[10] LC_TELEPHONE=en_GB LC_MEASUREMENT=en_GB LC_IDENTIFICATION=en_GB
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rcdk_3.5.0 rcdklibs_2.3 rJava_0.9-11
loaded via a namespace (and not attached):
[1] BiocManager_1.30.10 compiler_3.6.2 parallel_3.6.2
[4] fingerprint_3.5.7 tools_3.6.2 iterators_1.0.12
[7] itertools_0.1-3 png_0.1-7
Structure2D_CID_338.sdf
338
CDK 06292018032D
16 16 0 0 0 0 0 0 0 0999 V2000
2.5369 -0.0600 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.1350 1.4400 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.4030 1.4400 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.2690 -0.0600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.4030 -0.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.1350 -0.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.4030 -1.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.1350 -1.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2690 -2.0600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2690 0.9400 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.6720 -0.2500 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
2.8660 -1.8700 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
5.6720 -1.8700 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
4.2690 -2.6800 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
2.0000 -0.3700 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
5.1350 2.0600 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 5 1 0 0 0 0
1 15 1 0 0 0 0
2 10 1 0 0 0 0
2 16 1 0 0 0 0
3 10 2 0 0 0 0
4 5 4 0 0 0 0
4 6 4 0 0 0 0
4 10 1 0 0 0 0
5 7 4 0 0 0 0
6 8 4 0 0 0 0
6 11 1 0 0 0 0
7 9 4 0 0 0 0
7 12 1 0 0 0 0
8 9 4 0 0 0 0
8 13 1 0 0 0 0
9 14 1 0 0 0 0
M END
> <PUBCHEM_COMPOUND_CID>
338
> <PUBCHEM_COMPOUND_CANONICALIZED>
1
> <PUBCHEM_CACTVS_COMPLEXITY>
133
> <PUBCHEM_CACTVS_HBOND_ACCEPTOR>
3
> <PUBCHEM_CACTVS_HBOND_DONOR>
2
> <PUBCHEM_CACTVS_ROTATABLE_BOND>
1
> <PUBCHEM_CACTVS_SUBSKEYS>
AAADcYBgMAAAAAAAAAAAAAAAAAAAAAAAAAAwAAAAAAAAAAABAAAAGgAACAAADASAmAAwDoAAAgCIAiDSCAACAAAkIAAIiAEGCMgIJzaCFRKAcUAl4BEImYeIyCCOAAAAAAAIAAAAAAAAABAAAAAAAAAAAA==
> <PUBCHEM_IUPAC_OPENEYE_NAME>
2-hydroxybenzoic acid
> <PUBCHEM_IUPAC_CAS_NAME>
2-hydroxybenzoic acid
> <PUBCHEM_IUPAC_NAME_MARKUP>
2-hydroxybenzoic acid
> <PUBCHEM_IUPAC_NAME>
2-hydroxybenzoic acid
> <PUBCHEM_IUPAC_SYSTEMATIC_NAME>
2-oxidanylbenzoic acid
> <PUBCHEM_IUPAC_TRADITIONAL_NAME>
salicylic acid
> <PUBCHEM_IUPAC_INCHI>
InChI=1S/C7H6O3/c8-6-4-2-1-3-5(6)7(9)10/h1-4,8H,(H,9,10)
> <PUBCHEM_IUPAC_INCHIKEY>
YGSDEFSMJLZEOE-UHFFFAOYSA-N
> <PUBCHEM_XLOGP3>
2.3
> <PUBCHEM_EXACT_MASS>
138.031694
> <PUBCHEM_MOLECULAR_FORMULA>
C7H6O3
> <PUBCHEM_MOLECULAR_WEIGHT>
138.12
> <PUBCHEM_OPENEYE_CAN_SMILES>
C1=CC=C(C(=C1)C(=O)O)O
> <PUBCHEM_OPENEYE_ISO_SMILES>
C1=CC=C(C(=C1)C(=O)O)O
> <PUBCHEM_CACTVS_TPSA>
57.5
> <PUBCHEM_MONOISOTOPIC_WEIGHT>
138.031694
> <PUBCHEM_TOTAL_CHARGE>
0
> <PUBCHEM_HEAVY_ATOM_COUNT>
10
> <PUBCHEM_ATOM_DEF_STEREO_COUNT>
0
> <PUBCHEM_ATOM_UDEF_STEREO_COUNT>
0
> <PUBCHEM_BOND_DEF_STEREO_COUNT>
0
> <PUBCHEM_BOND_UDEF_STEREO_COUNT>
0
> <PUBCHEM_ISOTOPIC_ATOM_COUNT>
0
> <PUBCHEM_COMPONENT_COUNT>
1
> <PUBCHEM_CACTVS_TAUTO_COUNT>
-1
> <PUBCHEM_COORDINATE_TYPE>
1
5
255
> <PUBCHEM_BONDANNOTATIONS>
4 5 8
4 6 8
5 7 8
6 8 8
7 9 8
8 9 8
$$$$
Structure2D_CID_338_cdk_openbabel.sdf
338
OpenBabel06292018072D
16 16 0 0 0 0 0 0 0 0999 V2000
2.5369 -0.0600 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.1350 1.4400 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.4030 1.4400 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.2690 -0.0600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.4030 -0.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.1350 -0.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.4030 -1.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.1350 -1.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2690 -2.0600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2690 0.9400 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.6720 -0.2500 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
2.8660 -1.8700 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
5.6720 -1.8700 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
4.2690 -2.6800 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
2.0000 -0.3700 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
5.1350 2.0600 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 5 1 0 0 0 0
1 15 1 0 0 0 0
2 10 1 0 0 0 0
2 16 1 0 0 0 0
3 10 2 0 0 0 0
4 5 2 0 0 0 0
4 6 1 0 0 0 0
4 10 1 0 0 0 0
5 7 1 0 0 0 0
6 8 2 0 0 0 0
6 11 1 0 0 0 0
7 9 2 0 0 0 0
7 12 1 0 0 0 0
8 9 1 0 0 0 0
8 13 1 0 0 0 0
9 14 1 0 0 0 0
M END
> <PUBCHEM_COMPOUND_CID>
338
> <PUBCHEM_COMPOUND_CANONICALIZED>
1
> <PUBCHEM_CACTVS_COMPLEXITY>
133
> <PUBCHEM_CACTVS_HBOND_ACCEPTOR>
3
> <PUBCHEM_CACTVS_HBOND_DONOR>
2
> <PUBCHEM_CACTVS_ROTATABLE_BOND>
1
> <PUBCHEM_CACTVS_SUBSKEYS>
AAADcYBgMAAAAAAAAAAAAAAAAAAAAAAAAAAwAAAAAAAAAAABAAAAGgAACAAADASAmAAwDoAAAgCIAiDSCAACAAAkIAAIiAEGCMgIJzaCFRKAcUAl4BEImYeIyCCOAAAAAAAIAAAAAAAAABAAAAAAAAAAAA==
> <PUBCHEM_IUPAC_OPENEYE_NAME>
2-hydroxybenzoic acid
> <PUBCHEM_IUPAC_CAS_NAME>
2-hydroxybenzoic acid
> <PUBCHEM_IUPAC_NAME_MARKUP>
2-hydroxybenzoic acid
> <PUBCHEM_IUPAC_NAME>
2-hydroxybenzoic acid
> <PUBCHEM_IUPAC_SYSTEMATIC_NAME>
2-oxidanylbenzoic acid
> <PUBCHEM_IUPAC_TRADITIONAL_NAME>
salicylic acid
> <PUBCHEM_IUPAC_INCHI>
InChI=1S/C7H6O3/c8-6-4-2-1-3-5(6)7(9)10/h1-4,8H,(H,9,10)
> <PUBCHEM_IUPAC_INCHIKEY>
YGSDEFSMJLZEOE-UHFFFAOYSA-N
> <PUBCHEM_XLOGP3>
2.3
> <PUBCHEM_EXACT_MASS>
138.031694
> <PUBCHEM_MOLECULAR_FORMULA>
C7H6O3
> <PUBCHEM_MOLECULAR_WEIGHT>
138.12
> <PUBCHEM_OPENEYE_CAN_SMILES>
C1=CC=C(C(=C1)C(=O)O)O
> <PUBCHEM_OPENEYE_ISO_SMILES>
C1=CC=C(C(=C1)C(=O)O)O
> <PUBCHEM_CACTVS_TPSA>
57.5
> <PUBCHEM_MONOISOTOPIC_WEIGHT>
138.031694
> <PUBCHEM_TOTAL_CHARGE>
0
> <PUBCHEM_HEAVY_ATOM_COUNT>
10
> <PUBCHEM_ATOM_DEF_STEREO_COUNT>
0
> <PUBCHEM_ATOM_UDEF_STEREO_COUNT>
0
> <PUBCHEM_BOND_DEF_STEREO_COUNT>
0
> <PUBCHEM_BOND_UDEF_STEREO_COUNT>
0
> <PUBCHEM_ISOTOPIC_ATOM_COUNT>
0
> <PUBCHEM_COMPONENT_COUNT>
1
> <PUBCHEM_CACTVS_TAUTO_COUNT>
-1
> <PUBCHEM_COORDINATE_TYPE>
1
5
255
> <PUBCHEM_BONDANNOTATIONS>
4 5 8
4 6 8
5 7 8
6 8 8
7 9 8
8 9 8
$$$$
further tests and input/output files attached. The problem seems to be that implicit H typing is lost in the write.molecules() process.
library("rcdk") Loading required package: rcdklibs Loading required package: rJava
load structure from downloaded pubchem .sdf
mols <- load.molecules("Structure2D_CID_338.sdf")
plot structure
img <- view.image.2d(mols[[1]]) plot.new() rasterImage(img, 0, 0, 1, 1) savePlot("Structure2D_CID_338.png") dev.off() null device 1
write to sdf
write.molecules(mols, "Structure2D_CID_338_cdk.sdf", write.props = TRUE)
load and re plot
mols2 <- load.molecules("Structure2D_CID_338_cdk.sdf")
img2 <- view.image.2d(mols2[[1]]) Error in .jcall(mi, "[B", "getBytes", as.integer(depictor$getWidth()), : java.lang.NullPointerException: One or more atoms had an undefined number of implicit hydrogens plot.new() rasterImage(img2, 0, 0, 1, 1) Error in rasterImage(img2, 0, 0, 1, 1) : object 'img2' not found savePlot("Structure2D_CID_338_cdk.png") dev.off() null device 1
sessionInfo() R version 3.6.2 (2019-12-12) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.4 LTS
Matrix products: default BLAS: /shared/storage/biology/rsrch/tf-PAB/lab/trl1/bin_research0/lib/R/lib/libRblas.so LAPACK: /shared/storage/biology/rsrch/tf-PAB/lab/trl1/bin_research0/lib/R/lib/libRlapack.so
locale: [1] LC_CTYPE=en_GB LC_NUMERIC=C LC_TIME=en_GB [4] LC_COLLATE=en_GB LC_MONETARY=en_GB LC_MESSAGES=en_GB [7] LC_PAPER=en_GB LC_NAME=en_GB LC_ADDRESS=en_GB [10] LC_TELEPHONE=en_GB LC_MEASUREMENT=en_GB LC_IDENTIFICATION=en_GB
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] rcdk_3.5.0 rcdklibs_2.3 rJava_0.9-11
loaded via a namespace (and not attached): [1] compiler_3.6.2 parallel_3.6.2 fingerprint_3.5.7 iterators_1.0.12 [5] itertools_0.1-3 png_0.1-7
@trljcl apologies for the delay
I am looking at this again and agree that the hydrogens are lost. The write command is a wrapper of CDK's SDFWriter. Can you post this issue on the CDK user list? If it is explained/fixed in CDK, I can help get equivalent functionality in rCDK.
# suport for your theory
mols[[1]]$getAtomCount()
# 16
mols2[[1]]$getAtomCount()
# 10