
Clinical data in PanCan TCGA

Open sandertan opened this issue 7 years ago • 18 comments

I was wondering if there are plans to expand the clinical data in the PanCan TCGA studies with data that is present in the Provisional TCGA data. For example, the Provisional TCGA BRCA has 140 clinical columns, while PanCan TCGA BRCA has 81. Columns that would be nice to have in BRCA PanCan are HER, ER and PR status.

Are many of the columns from Provisional missing in PanCan because of differences in the processing pipelines? Would it be possible to add this data at the end, or would that require modifying the pipeline itself, putting it outside the scope of the PanCan project?

sandertan avatar Aug 09 '18 12:08 sandertan

I am not aware of any such plans, although it may be a good thing to do in a well-planned manner. My understanding is that somebody created a huge clinical file for PanCanAtlas by basically collating individual clinical files (I believe that this is the final file made available at the GDC and containing 746 columns: http://api.gdc.cancer.gov/data/0fc78496-818b-4896-bd83-52db1f533c5c ). Since this table is really sparse, when the cBioPortal section for PanCanAtlas (the one combining all the studies) was created, only a subset of those columns was kept, and columns containing mostly NA or missing values were removed. That may have been the case for annotations such as HER, ER and PR status, which may be relevant for BRCA but are missing for most other types. The individual cancer type sections were then instantiated as subsets of the big PanCan section, so they would have the same pruned sets of clinical annotations.

For this particular example (receptor status in BRCA), the annotations must have been curated by the Pan-Gynecological group. It should be possible to look up their annotations in the supplemental file that they must have submitted with their PanCanAtlas manuscript and add them to the PanCan BRCA section of the portal. However, in my opinion, that sounds more like a manual process; I am not sure it will be easy to automate as part of an existing or modified pipeline.

nanauat avatar Aug 09 '18 14:08 nanauat

@nanauat the PR, HER, ER status related fields (and more) are available in the sheet you shared (see screenshot for example). Thanks.

image

@n1zea144 is there something like a "pancan 2 cbio staging files" pipeline and would it be possible to adjust this to include these fields?

pieterlukasse avatar Aug 16 '18 15:08 pieterlukasse

Hi @pieterlukasse We don't have a pancan pipeline. As far as I know, all the data for cBioPortal was put together in an ad hoc manner. I would have imagined that most of the clinical annotation for PanCanAtlas would have just taken what was collected by the biospecimen core resource (maybe with the exception of genomic-based annotations like receptor status, but did you notice that in the file @nanauat referenced there are only 9 records with values other than NA/NE/blank?).

If expanding the current PanCan clinical annotation with data from the individual TCGA studies is the right way to go (I'd like to get @schultzn's input), it would just be a matter of determining its priority.

n1zea144 avatar Aug 22 '18 13:08 n1zea144

@n1zea144 thanks for the update. The file referenced by @nanauat actually looks quite complete, at least for the test we did: we were able to find 116 triple negative samples in it for BRCA, which is the same number we get when we query the provisional TCGA BRCA study.
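A check like this could be reproduced along the following lines. This is only a sketch: the three column names and the literal value "Negative" are assumptions for illustration (the thread does not say which columns or which definition of triple negative were used), and the toy rows stand in for the BRCA subset of the clinical file.

```python
import pandas as pd

# Column names below are assumptions for illustration; check the header of
# the actual clinical file before relying on them.
ER_COL = "breast_carcinoma_estrogen_receptor_status"
PR_COL = "breast_carcinoma_progesterone_receptor_status"
HER2_COL = "lab_proc_her2_neu_immunohistochemistry_receptor_status"


def count_triple_negative(df, er=ER_COL, pr=PR_COL, her2=HER2_COL):
    """Count rows where ER, PR and HER2 are all reported as 'Negative'."""
    is_negative = df[[er, pr, her2]].eq("Negative").all(axis=1)
    return int(is_negative.sum())


# Toy rows standing in for the BRCA subset of the PanCan clinical file.
toy = pd.DataFrame({
    ER_COL: ["Negative", "Positive", "Negative", "[Not Available]"],
    PR_COL: ["Negative", "Negative", "Negative", "Negative"],
    HER2_COL: ["Negative", "Negative", "Positive", "Negative"],
})
n_triple_negative = count_triple_negative(toy)
```

Running the same function on both the PanCan file and the provisional study export would make the two counts directly comparable, provided both use the same column names and value spellings.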

pieterlukasse avatar Aug 22 '18 13:08 pieterlukasse

I wrote a parser to extract all rows per study, and drop the columns that only contain empty values. Columns with explicit NA values [Not Available] and [Not Applicable] are kept.

This way you end up with the following number of (rows, columns):

Shape of input file: (10956, 746)

| Study | PanCan file | Current |
|---|---|---|
| ACC | (92, 85) | (92, 30) |
| BLCA | (412, 88) | (411, 30) |
| BRCA | (1099, 137) | (1084, 30) |
| CESC | (308, 153) | (297, 30) |
| CHOL | (36, 83) | (36, 30) |
| COAD | (459, 100) | (439, 30) |
| DLBC | (48, 146) | (48, 30) |
| ESCA | (185, 103) | (182, 30) |
| GBM | (596, 40) | (585, 30) |
| HNSC | (528, 74) | (523, 30) |
| KICH | (113, 59) | (65, 30) |
| KIRC | (537, 65) | (512, 30) |
| KIRP | (291, 67) | (283, 30) |
| LGG | (515, 73) | (514, 30) |
| LIHC | (377, 79) | (372, 30) |
| LUAD | (522, 133) | (566, 30) |
| LUSC | (504, 131) | (487, 30) |
| MESO | (87, 70) | (87, 30) |
| OV | (587, 77) | (585, 30) |
| PAAD | (185, 96) | (184, 30) |
| PCPG | (179, 43) | (178, 30) |
| PRAD | (500, 77) | (494, 30) |
| READ | (171, 79) | (155, 30) |
| SARC | (261, 69) | (255, 30) |
| SKCM | (471, 85) | (442, 30) |
| STAD | (443, 89) | (440, 30) |
| TGCT | (134, 89) | (149, 30) |
| THCA | (507, 78) | (499, 30) |
| THYM | (124, 50) | (123, 30) |
| UCEC | (548, 77) | (529, 30) |
| UCS | (57, 69) | (57, 30) |
| UVM | (80, 69) | (80, 30) |

So for most studies there are a few more patients (LUAD and TGCT have fewer), and a lot more columns per study. The current cBio PanCan format has the same 30 columns for each study.
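The per-study split described in this comment can be sketched with pandas roughly as follows. This is not the actual parser, just a minimal illustration; the column name `acronym` for the cancer type and the toy column names are assumptions.

```python
import io
import pandas as pd

# Assumption: the cancer type lives in a column named "acronym"; adjust to
# whatever the real file uses.
STUDY_COL = "acronym"


def split_by_study(df, study_col=STUDY_COL):
    """Return {study: DataFrame} with all-empty columns dropped per study."""
    per_study = {}
    for study, rows in df.groupby(study_col):
        # dropna(how="all") removes only columns with no value at all;
        # literal strings such as "[Not Available]" count as values.
        per_study[study] = rows.dropna(axis=1, how="all")
    return per_study


# Toy input standing in for the 746-column clinical file.
toy_tsv = (
    "bcr_patient_barcode\tacronym\tgender\ther2_status\tgleason_score\n"
    "P1\tBRCA\tFEMALE\t[Not Available]\t\n"
    "P2\tBRCA\tFEMALE\tNegative\t\n"
    "P3\tPRAD\tMALE\t\t7\n"
)
# keep_default_na=False with na_values=[""] treats only empty cells as missing.
toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t", dtype=str,
                  keep_default_na=False, na_values=[""])
shapes = {study: frame.shape for study, frame in split_by_study(toy).items()}
```

Reading everything as strings and disabling the default NA markers is a deliberate choice here: otherwise pandas would turn values such as "NA" into missing cells, and columns holding only explicit NA markers would be dropped along with the truly empty ones.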

The header of the current cBio PanCan format contains detailed names and descriptions. If there is a mapping file available on GDC for the column names of the shared file, it would be possible to add detailed descriptions and pretty names for its columns as well.

Source of PanCan clinical file: https://gdc.cancer.gov/about-data/publications/pancanatlas

sandertan avatar Aug 27 '18 14:08 sandertan

From https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=demographic a JSON file can be retrieved with the terms, pretty names and descriptions.

For example:

    "her2_erbb2_result_fish": {
      "description": "the type of outcome for HER2 as determined by an in situ hybridization (ISH) assay.\n",
      "termDef": {
        "cde_id": 2854089,
        "cde_version": 1.0,
        "source": "caDSR",
        "term": "Laboratory Procedure HER2/neu in situ Hybridization Outcome Type",
        "term_url": "https://cdebrowser.nci.nih.gov/cdebrowserClient/cdeBrowser.html#/search?publicId=2854089&version=1.0"
      }

If all column names can be found in this JSON file, it should be easy to get those in the cBioPortal clinical metadata-header format.

EDIT: Unfortunately, only a small number of attributes appear in both files. Perhaps the PanCan file is relatively old. I can't figure out the versioning of these data.
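The overlap between the two files could be checked with something like the sketch below. It assumes the dictionary JSON is a flat object keyed by term name, shaped like the excerpt quoted above; `describe_columns` and `some_other_column` are made-up names used only for illustration.

```python
import json

# Hypothetical excerpt shaped like the data dictionary entry quoted above.
dictionary_json = """
{
  "her2_erbb2_result_fish": {
    "description": "the type of outcome for HER2 as determined by an in situ hybridization (ISH) assay.\\n",
    "termDef": {"term": "Laboratory Procedure HER2/neu in situ Hybridization Outcome Type"}
  }
}
"""


def describe_columns(columns, dictionary):
    """Split columns into those with a dictionary entry and those without."""
    described, missing = {}, []
    for column in columns:
        entry = dictionary.get(column)
        if entry is None:
            missing.append(column)
            continue
        described[column] = {
            # Fall back to the raw column name when no pretty name exists.
            "name": entry.get("termDef", {}).get("term", column),
            "description": entry.get("description", "").strip(),
        }
    return described, missing


dictionary = json.loads(dictionary_json)
# "some_other_column" is a made-up name used only to show the unmatched case.
described, missing = describe_columns(
    ["her2_erbb2_result_fish", "some_other_column"], dictionary)
```

The `missing` list is the part that matters for this issue: every column that ends up there would need a name and description from some other source.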

sandertan avatar Aug 28 '18 10:08 sandertan

Status:

  1. Adding all clinical data to the PanCan studies should be possible with the clinical_PANCAN_patient_with_followup.tsv source file. There are 746 columns in total. When a column has no data for a specific cancer type, it can be excluded for that cancer type. My code to transform the files to the cBioPortal staging file format: https://github.com/sandertan/clinical-data-tcga-pancancer/blob/master/transform_clinical_pancan_to_cbioportal.ipynb This still requires column and value harmonization.

  2. The data dictionary for clinical_PANCAN_patient_with_followup.tsv is no longer available from GDC, so columns should be manually mapped to cBioPortal attribute IDs, attribute names and attribute descriptions; see the Clinical Data Dictionary at http://oncotree.mskcc.org/cdd/swagger-ui.html#/clinical-data-dictionary-controller
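For the manual mapping in point 2, a hand-maintained table could drive the five-line header of a cBioPortal clinical data file (display names, descriptions, datatypes, priorities, then attribute IDs). The sketch below is illustrative only: the mapping entries, including the attribute ID `HER2_FISH_STATUS`, are hypothetical and not taken from the Clinical Data Dictionary.

```python
# Hand-written mapping: source column -> (attribute ID, display name,
# description, datatype, priority). Every entry here is illustrative only.
MANUAL_MAPPING = {
    "bcr_patient_barcode": (
        "PATIENT_ID", "Patient Identifier",
        "Identifier to uniquely specify a patient.", "STRING", "1"),
    "her2_erbb2_result_fish": (
        "HER2_FISH_STATUS", "HER2 FISH status",
        "HER2 outcome determined by an ISH assay.", "STRING", "1"),
}


def build_header(source_columns, mapping=MANUAL_MAPPING):
    """Return the five header lines of a cBioPortal clinical data file."""
    unmapped = [c for c in source_columns if c not in mapping]
    if unmapped:
        # Fail loudly so unmapped columns are not silently dropped.
        raise KeyError(f"no manual mapping for: {unmapped}")
    entries = [mapping[c] for c in source_columns]
    ids, names, descriptions, datatypes, priorities = zip(*entries)
    # The four metadata rows start with '#'; the attribute ID row does not.
    lines = ["#" + "\t".join(row)
             for row in (names, descriptions, datatypes, priorities)]
    lines.append("\t".join(ids))
    return lines


header = build_header(["bcr_patient_barcode", "her2_erbb2_result_fish"])
```

Raising on unmapped columns keeps the harmonization step explicit: a column only reaches the staging file once someone has decided on its ID, name and description.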

sandertan avatar Mar 04 '19 17:03 sandertan

merge into issue #241

yichaoS avatar Jun 04 '20 18:06 yichaoS

Let's keep this open until we add the PanCan clinical data. I added it to the checklist in #241.

jjgao avatar Jun 10 '20 18:06 jjgao

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 09 '20 14:09 stale[bot]

@jjgao what is the status of this one? Still want to keep it open?

Sjoerd-van-Hagen avatar Sep 28 '20 16:09 Sjoerd-van-Hagen

@Sjoerd-van-Hagen This is part of the enhancement. We have added survival from the list. We will add more but we want to get the genomic part complete first.

ritikakundra avatar Sep 28 '20 16:09 ritikakundra

Ok, great!

Good luck!

Sjoerd-van-Hagen avatar Sep 28 '20 16:09 Sjoerd-van-Hagen

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 28 '21 06:03 stale[bot]

@ritikakundra @jjgao do we keep this open or do we track in #241 ?

Sjoerd-van-Hagen avatar Mar 29 '21 10:03 Sjoerd-van-Hagen

@Sjoerd-van-Hagen I think we should keep this open. #241 tracks all data types including clinical data.

jjgao avatar May 17 '21 14:05 jjgao
