Create file for PubChem deposition at every release

Open schymane opened this issue 6 years ago • 9 comments

It would be great if we could auto-create a file to deposit in PubChem with every stable release of MassBank-data. To discuss: compound information only (=> relatively easy), mappings with spectral IDs (slightly more info needed), or actual spectra as well (more work on our side)? Shall we start with getting a deposit file for compound information only? Then we need e.g.:

PUBCHEM_EXT_DATASOURCE_REGID <= InChIKey, or any unique identifier our side
PUBCHEM_EXT_DATASOURCE_SMILES <= SMILES
PUBCHEM_EXT_DATASOURCE_CID <= PubChem CID (if available)
PUBCHEM_SUBSTANCE_COMMENT <= here we could e.g. provide accession IDs, collapsed
PUBCHEM_SUBSTANCE_SYNONYM <= any names our side (can have multiple columns, but maybe e.g. max 3 would be sensible)
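The field mapping above could be sketched roughly as follows. This is only an illustration: the layout of the `compound` dictionary, the function name `to_deposit_fields`, the comment prefix, and the accession IDs are assumptions made for the example, not anything MassBank or PubChem prescribes.

```python
def to_deposit_fields(compound, max_synonyms=3):
    """Map one compound summary to (column, value) pairs for a deposit file.

    `compound` is a hypothetical dict with the keys used below.
    A list of pairs is returned because the synonym column may repeat.
    """
    fields = [
        ("PUBCHEM_EXT_DATASOURCE_REGID", compound["inchikey"]),
        ("PUBCHEM_EXT_DATASOURCE_SMILES", compound["smiles"]),
        # The CID is optional; leave the value empty when it is unknown.
        ("PUBCHEM_EXT_DATASOURCE_CID", str(compound.get("cid") or "")),
        # Collapse all accession IDs of this compound into one comment.
        (
            "PUBCHEM_SUBSTANCE_COMMENT",
            "MassBank accessions: " + "; ".join(compound["accessions"]),
        ),
    ]
    # Keep at most `max_synonyms` names, one synonym column each.
    for name in compound["names"][:max_synonyms]:
        fields.append(("PUBCHEM_SUBSTANCE_SYNONYM", name))
    return fields


example = {
    "inchikey": "ISWSIDIOOBJBQZ-UHFFFAOYSA-N",  # phenol
    "smiles": "C1=CC=C(C=C1)O",
    "cid": 996,
    "accessions": ["ACC-0001", "ACC-0002"],  # made-up placeholder IDs
    "names": ["phenol", "hydroxybenzene", "carbolic acid", "benzenol"],
}

for column, value in to_deposit_fields(example):
    print(column, "<=", value)
```

Capping the synonyms in one place keeps the "max 3" decision easy to change later.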

@meier-rene @sneumann @tsufz what do you think? If yes, who will look after the file? I would contact PubChem to get us a MassBank login for deposition, so credit goes to MassBank(EU) and we can track our submissions.

schymane avatar Nov 21 '19 16:11 schymane

Hi, we have started to embed Bioschemas information, e.g. line 60+ in view-source:https://msbi.ipb-halle.de/MassBank/RecordDisplay2?id=PB006301 and I would prefer it if PubChem adopted that. The benefit would be that this way they can scrape other Bioschemas-compatible resources as well; I would reckon that WikiPathways also embeds such information. Otherwise we end up generating and maintaining mappings for PubChem, ChemSpider, CompTox, ... separately. Yours, Steffen

sneumann avatar Nov 21 '19 22:11 sneumann

Let's discuss with Evan and @egonw at Dagstuhl then ...

schymane avatar Nov 21 '19 22:11 schymane

adding @alasdairgray in to contribute his Bioschemas wisdom ...

schymane avatar Nov 22 '19 08:11 schymane

@AlasdairGray, I guess a good first step forward is to have that aggregator website you just showed crawl the MassBank website and extract the chemical structures and data record JSON-LD.
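As a rough idea of what such a crawler would do per page, here is a minimal sketch that pulls JSON-LD blocks out of an HTML document using only the Python standard library. The sample HTML and its properties are invented for illustration and are not the actual MassBank markup; `JsonLdExtractor` and `extract_jsonld` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append(json.loads("".join(self._buffer)))


def extract_jsonld(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Invented sample page, not real MassBank output.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "MolecularEntity",
 "name": "phenol", "smiles": "C1=CC=C(C=C1)O"}
</script>
</head><body></body></html>
"""

for block in extract_jsonld(sample_html):
    print(block["@type"], block["name"])
```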

egonw avatar Nov 22 '19 10:11 egonw

In addition, or as an alternative, to scraping the JSON-LD, we can design a REST query that would deliver this. What exactly is needed? CSV? Can we have a pointer to an example deposition file? Yours, Steffen

sneumann avatar Jan 28 '20 16:01 sneumann

This is the CSV format I use now:

PUBCHEM_EXT_DATASOURCE_REGID,PUBCHEM_SUBSTANCE_SYNONYM,PUBCHEM_EXT_SUBSTANCE_URL,PUBCHEM_EXT_DATASOURCE_SMILES
Q127900,β-lactose,https://tools.wmflabs.org/scholia/Q127900,C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O[C@@H]2[C@H](O[C@H]([C@@H]([C@H]2O)O)O)CO)O)O)O)O
Q128381,harmine,https://tools.wmflabs.org/scholia/Q128381,CC1=NC=CC2=C1NC3=C2C=CC(=C3)OC
Q128540,zirconyl chloride,https://tools.wmflabs.org/scholia/Q128540,[Cl-].[Cl-].[O].[Zr+2]
Q129163,tin(IV) oxide,https://tools.wmflabs.org/scholia/Q129163,O=[Sn]=O
Q130175,butane-1-selenol,https://tools.wmflabs.org/scholia/Q130175,CCCC[SeH]
Q130336,phenol,https://tools.wmflabs.org/scholia/Q130336,C1=CC=C(C=C1)O
Q130365,riboflavin,https://tools.wmflabs.org/scholia/Q130365,CC1=CC2=C(C=C1C)N(C3=NC(=O)NC(=O)C3=N2)C[C@@H]([C@@H]([C@@H](CO)O)O)O
Q131189,propane,https://tools.wmflabs.org/scholia/Q131189,CCC
Q131994,veratrole,https://tools.wmflabs.org/scholia/Q131994,COC1=CC=CC=C1OC
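
A file in this layout can be produced with Python's standard `csv` module, which also takes care of quoting names that contain commas. The sketch below reuses two rows from the example above; the function name `write_deposit_csv` is hypothetical.

```python
import csv
import io

COLUMNS = [
    "PUBCHEM_EXT_DATASOURCE_REGID",
    "PUBCHEM_SUBSTANCE_SYNONYM",
    "PUBCHEM_EXT_SUBSTANCE_URL",
    "PUBCHEM_EXT_DATASOURCE_SMILES",
]


def write_deposit_csv(rows, handle):
    """Write dict rows keyed by COLUMNS to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


rows = [
    dict(zip(COLUMNS, [
        "Q130336", "phenol",
        "https://tools.wmflabs.org/scholia/Q130336", "C1=CC=C(C=C1)O",
    ])),
    dict(zip(COLUMNS, [
        "Q131189", "propane",
        "https://tools.wmflabs.org/scholia/Q131189", "CCC",
    ])),
]

buffer = io.StringIO()
write_deposit_csv(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of `io.StringIO`, open it with `newline=""` as the `csv` documentation recommends.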

egonw avatar Jan 28 '20 23:01 egonw

@schymane is there any progress with the implementation in PubChem? I guess you are in contact with Evan anyway. If not, I could contact him to discuss the topic further.

tsufz avatar Sep 14 '20 14:09 tsufz

BTW, our schema still lacks the Description tag. See https://github.com/MassBank/MassBank-web/issues/206

tsufz avatar Sep 14 '20 14:09 tsufz

We need a consistent deposit file created on the MassBank side, preferably automatically with every release (see subject line). Then Jeff can grab it and update automatically on the PubChem side. If we design it to have DTXSIDs in it as well, this is all CompTox would need. The process is simple, but the file needs to come from the MassBank side. It's basically the deduplicated record summary (i.e. a summary of unique compounds, not records). Could easily be auto-generated …
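
A deduplicated summary of this kind could be sketched as below: records are grouped by a compound key, and accessions and names are collected per compound. The record layout, the function name `summarise_compounds`, and every identifier in the example (keys, accessions, DTXSID) are placeholders assumed for illustration, not real MassBank or CompTox values.

```python
def summarise_compounds(records):
    """Collapse spectral records into one entry per unique compound.

    Each record is a hypothetical dict with the keys used below;
    grouping is by the `inchikey` value.
    """
    compounds = {}
    for rec in records:
        entry = compounds.setdefault(
            rec["inchikey"],
            {
                "inchikey": rec["inchikey"],
                "smiles": rec["smiles"],
                "dtxsid": "",
                "names": [],
                "accessions": [],
            },
        )
        entry["accessions"].append(rec["accession"])
        if rec["name"] not in entry["names"]:
            entry["names"].append(rec["name"])
        # Take the first DTXSID seen for this compound, if any.
        if not entry["dtxsid"] and rec.get("dtxsid"):
            entry["dtxsid"] = rec["dtxsid"]
    return list(compounds.values())


# Placeholder records: keys, accessions and DTXSID are not real identifiers.
records = [
    {"accession": "ACC-0001", "inchikey": "KEY_PHENOL", "name": "phenol",
     "smiles": "C1=CC=C(C=C1)O"},
    {"accession": "ACC-0002", "inchikey": "KEY_PROPANE", "name": "propane",
     "smiles": "CCC"},
    {"accession": "ACC-0003", "inchikey": "KEY_PHENOL", "name": "hydroxybenzene",
     "smiles": "C1=CC=C(C=C1)O", "dtxsid": "DTXSID_PLACEHOLDER"},
]

for compound in summarise_compounds(records):
    print(compound["inchikey"], len(compound["accessions"]), "record(s)")
```

Three records collapse into two compound entries here, which is the shape a deposit file would be generated from.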

schymane avatar Sep 14 '20 15:09 schymane