Clinical data in PanCan TCGA
I was wondering if there are plans to expand the clinical data in the PanCan TCGA studies with data that is present in the Provisional TCGA data. For example, the Provisional TCGA BRCA has 140 clinical columns, while PanCan TCGA BRCA has 81. Columns that would be nice to have in BRCA PanCan are HER, ER and PR status.
Are many of the columns from Provisional missing in PanCan because of the different processing pipelines? Would it be possible to add this data at the end, or would that require modifying the pipeline itself, and is it therefore out of scope for the PanCan project?
I am not aware of any such plans, although it may be a good thing to do in a well-planned manner. My understanding is that somebody created a huge clinical file for PanCanAtlas by basically collating individual clinical files (I believe that this is the final file made available at the GDC and containing 746 columns: http://api.gdc.cancer.gov/data/0fc78496-818b-4896-bd83-52db1f533c5c ). Since this table is really sparse, when the cBioPortal section for PanCanAtlas (the one combining all the studies) was created, only a subset of those columns was kept, and columns containing mostly NA or missing values were removed. That may have been the case for annotations such as HER, ER and PR status, which may be relevant for BRCA but are missing for most other types. The individual cancer type sections were then instantiated as subsets of the big PanCan section, so they would have the same pruned sets of clinical annotations.
For this particular example (receptor status in BRCA), the annotations must have been curated by the Pan-Gynecological group. It should be possible to look up their annotations in the supplemental file that they must have submitted with their PanCanAtlas manuscript and add them to the PanCan BRCA section of the portal. However, in my opinion, that sounds more like a manual process; I am not sure that it will be easy to automate as part of an existing or modified pipeline.
@nanauat the PR, HER, ER status related fields (and more) are available in the sheet you shared (see screenshot for example). Thanks.

@n1zea144 is there something like a "pancan 2 cbio staging files" pipeline and would it be possible to adjust this to include these fields?
Hi @pieterlukasse We don't have a pancan pipeline. As far as I know, all the data for cBioPortal was put together in an ad hoc manner. I would have imagined that most of the clinical annotation for PanCanAtlas would have just taken what was collected by the biospecimen core resource (maybe with the exception of genomic based annotations like receptor status, but did you notice that in the file @nanauat referenced, there are only 9 records with values other than NA/NE/blank?).
If expanding the current PanCan clinical annotation with data from the individual TCGA studies is the right way to go (I'd like to get @schultzn's input), it would just be a matter of determining its priority.
@n1zea144 thanks for the update. The file referenced by @nanauat actually looks quite complete, at least for the test we did: we were able to find 116 triple negative samples in it for BRCA, which is the same number we get when we query the provisional TCGA BRCA study.
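For reference, a check like the triple negative count above could be sketched roughly as follows. This is only an illustration: the column names `er_status`, `pr_status` and `her2_status` are hypothetical placeholders (the real file uses its own names), and the rule "all three receptor columns equal Negative" is an assumption about how triple negative is defined.

```python
import csv
import io

# Hypothetical column names; replace with the actual receptor status columns.
RECEPTOR_COLUMNS = ("er_status", "pr_status", "her2_status")


def is_triple_negative(row, columns=RECEPTOR_COLUMNS):
    """Return True when every receptor column reads 'Negative' (case-insensitive)."""
    return all((row.get(col) or "").strip().lower() == "negative" for col in columns)


# Tiny made-up sample standing in for the BRCA rows of the clinical file.
SAMPLE = (
    "bcr_patient_barcode\ter_status\tpr_status\ther2_status\n"
    "P1\tNegative\tNegative\tNegative\n"
    "P2\tPositive\tNegative\tNegative\n"
    "P3\tNegative\tNegative\t[Not Available]\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
triple_negative = [r["bcr_patient_barcode"] for r in rows if is_triple_negative(r)]
print(len(triple_negative))
```

Rows with an explicit `[Not Available]` in any receptor column are not counted here, which may or may not match how the portal query treats them.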
I wrote a parser to extract all rows per study, and drop the columns that only contain empty values. Columns with explicit NA values [Not Available] and [Not Applicable] are kept.
This way you end up with the following number of (rows, columns) per study:
Shape of input file: (10956, 746)
| Study | PanCan file | Current |
|---|---|---|
| ACC | (92, 85) | (92, 30) |
| BLCA | (412, 88) | (411, 30) |
| BRCA | (1099, 137) | (1084, 30) |
| CESC | (308, 153) | (297, 30) |
| CHOL | (36, 83) | (36, 30) |
| COAD | (459, 100) | (439, 30) |
| DLBC | (48, 146) | (48, 30) |
| ESCA | (185, 103) | (182, 30) |
| GBM | (596, 40) | (585, 30) |
| HNSC | (528, 74) | (523, 30) |
| KICH | (113, 59) | (65, 30) |
| KIRC | (537, 65) | (512, 30) |
| KIRP | (291, 67) | (283, 30) |
| LGG | (515, 73) | (514, 30) |
| LIHC | (377, 79) | (372, 30) |
| LUAD | (522, 133) | (566, 30) |
| LUSC | (504, 131) | (487, 30) |
| MESO | (87, 70) | (87, 30) |
| OV | (587, 77) | (585, 30) |
| PAAD | (185, 96) | (184, 30) |
| PCPG | (179, 43) | (178, 30) |
| PRAD | (500, 77) | (494, 30) |
| READ | (171, 79) | (155, 30) |
| SARC | (261, 69) | (255, 30) |
| SKCM | (471, 85) | (442, 30) |
| STAD | (443, 89) | (440, 30) |
| TGCT | (134, 89) | (149, 30) |
| THCA | (507, 78) | (499, 30) |
| THYM | (124, 50) | (123, 30) |
| UCEC | (548, 77) | (529, 30) |
| UCS | (57, 69) | (57, 30) |
| UVM | (80, 69) | (80, 30) |
So there are a few more patients and a lot more columns per study. The current cBio PanCan format has the same 30 columns for each study.
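The per-study extraction described above can be sketched with the standard library alone. This is a minimal illustration, not the parser that produced the table: the study column name `acronym`, the other column names, and the sample rows are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def split_by_study(tsv_text, study_column="acronym"):
    """Group rows by study, then drop columns that are empty for every row of that study."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows_by_study = defaultdict(list)
    for row in reader:
        rows_by_study[row[study_column]].append(row)

    result = {}
    for study, rows in rows_by_study.items():
        # Explicit markers such as "[Not Available]" are non-empty strings, so they are kept.
        kept = [
            col for col in reader.fieldnames
            if any((row[col] or "").strip() for row in rows)
        ]
        result[study] = [{col: row[col] for col in kept} for row in rows]
    return result


def shape(rows):
    """Return (number of rows, number of columns) for a list of row dicts."""
    return (len(rows), len(rows[0]) if rows else 0)


# Made-up sample: gleason_score is empty for BRCA, er_status is empty for PRAD.
SAMPLE = (
    "bcr_patient_barcode\tacronym\ter_status\tgleason_score\n"
    "P1\tBRCA\tPositive\t\n"
    "P2\tBRCA\t[Not Available]\t\n"
    "P3\tPRAD\t\t7\n"
)

studies = split_by_study(SAMPLE)
for name, rows in sorted(studies.items()):
    print(name, shape(rows))
```

Because emptiness is evaluated per study, a column like receptor status survives for BRCA even though it is blank for most other cancer types.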
The header of the current cBio PanCan format contains detailed names and descriptions. If there is a mapping file available on GDC for the column names of the shared file, it would be possible to add detailed descriptions and pretty names for its columns as well.
Source of PanCan clinical file: https://gdc.cancer.gov/about-data/publications/pancanatlas
From https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=demographic a JSON file can be retrieved with the terms, pretty names and descriptions.
For example:
"her2_erbb2_result_fish": {
  "description": "the type of outcome for HER2 as determined by an in situ hybridization (ISH) assay.\n",
  "termDef": {
    "cde_id": 2854089,
    "cde_version": 1.0,
    "source": "caDSR",
    "term": "Laboratory Procedure HER2/neu in situ Hybridization Outcome Type",
    "term_url": "https://cdebrowser.nci.nih.gov/cdebrowserClient/cdeBrowser.html#/search?publicId=2854089&version=1.0"
  }
}
If all column names can be found in this JSON file, it should be easy to get those in the cBioPortal clinical metadata-header format.
EDIT: Unfortunately, only a small number of attributes are in both files. Perhaps the PanCan file is relatively old. I can't figure out the versioning of these data.
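As a rough sketch of how dictionary entries like the one above could be turned into a cBioPortal clinical metadata header: the function below assumes the layout of four `#` lines (display names, descriptions, datatypes, priorities) followed by the attribute ID row, uses `termDef.term` as the display name, and falls back to the raw column name when a column is not in the dictionary. The function name, the defaults (`STRING`, priority `1`) and the unmapped column are all hypothetical.

```python
# Dictionary excerpt shaped like the GDC JSON; in practice this would be
# loaded from the downloaded file with json.load.
DICTIONARY = {
    "her2_erbb2_result_fish": {
        "description": "the type of outcome for HER2 as determined by an in situ hybridization (ISH) assay.\n",
        "termDef": {
            "term": "Laboratory Procedure HER2/neu in situ Hybridization Outcome Type",
        },
    },
}


def build_header(columns, dictionary, datatype="STRING", priority="1"):
    """Build the metadata header lines, falling back to the column name when unmapped."""
    names, descriptions = [], []
    for col in columns:
        entry = dictionary.get(col, {})
        names.append(entry.get("termDef", {}).get("term", col))
        descriptions.append(entry.get("description", col).strip())
    return [
        "#" + "\t".join(names),
        "#" + "\t".join(descriptions),
        "#" + "\t".join([datatype] * len(columns)),
        "#" + "\t".join([priority] * len(columns)),
        "\t".join(col.upper() for col in columns),
    ]


header = build_header(["her2_erbb2_result_fish", "some_unmapped_column"], DICTIONARY)
print("\n".join(header))
```

Given how few attributes overlap, most columns would hit the fallback branch, so the result would still need manual curation of names, descriptions and datatypes.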
Status:
- Adding all clinical data to PanCan studies should be possible with the `clinical_PANCAN_patient_with_followup.tsv` source file. There are 746 columns in total. When a column does not have data for a specific cancer type, this column can be excluded for that cancer type. My code to transform the files to cBioPortal staging file format: https://github.com/sandertan/clinical-data-tcga-pancancer/blob/master/transform_clinical_pancan_to_cbioportal.ipynb This still requires column and value harmonization.
- The data dictionary for `clinical_PANCAN_patient_with_followup.tsv` is not available anymore from GDC, so columns should be manually mapped to cBioPortal attribute IDs, attribute names and attribute descriptions, see Clinical Data Dictionary http://oncotree.mskcc.org/cdd/swagger-ui.html#/clinical-data-dictionary-controller.
merge into issue #241
Let's keep this open until we add the pancan clinical data. I added it to the checklist in #241.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@jjgao what is the status of this one? Still want to keep it open?
@Sjoerd-van-Hagen This is part of the enhancement. We have added survival from the list. We will add more but we want to get the genomic part complete first.
Ok, great!
Good luck!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@ritikakundra @jjgao do we keep this open or do we track in #241 ?
@Sjoerd-van-Hagen I think we should keep this open. #241 tracks all data types including clinical data.