Request support on data input sample and output sample just for prediction

Open Darrshan-Sankar opened this issue 1 year ago • 17 comments

I used AIONER output to extract relations, but it didn't work. I went through the issues and found that the example is in the BioRED repo. I would like to know how to create such data, and to see a sample output showing what predict.pubtator will look like.

Darrshan-Sankar avatar Jul 16 '24 06:07 Darrshan-Sankar

Hi @Darrshan-Sankar,

The results of AIONER cannot be fed directly to BioREx. BioREx requires that the entities' IDs be normalized. You have to use our normalization components, such as GNORM2. If you just want to process PubMed abstracts, you can use https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/, where we provide the processed relation results for PubMed. Please let me know if you need further help.

ptlai avatar Jul 18 '24 13:07 ptlai

@ptlai Thanks for your support. I actually have to process full texts. Could you please explain how to normalize AIONER results as input for BioREx? A script would be especially helpful.

Darrshan-Sankar avatar Jul 18 '24 13:07 Darrshan-Sankar

Hi @Darrshan-Sankar,

The simplest way is to use the NE/ID annotations in https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/ as well (BioCXML files). We processed the NEs/IDs for full-text already, but relations for abstracts only. You can treat each paragraph as an abstract and then feed it to BioREx. If you still need help using normalization components, you may contact Dr. Wei ([email protected]), who deals with the entire backend process of our PubTator.

ptlai avatar Jul 18 '24 14:07 ptlai

@ptlai Yes, I went through the FTP. As you said, it only has relations for abstracts. Thank you for providing Dr. Wei's contact.

Darrshan-Sankar avatar Jul 18 '24 14:07 Darrshan-Sankar

@ptlai Hi, could you please provide an example of a normalized AIONER file? I'd like to review the workflow for extracting entities with AIONER and then performing relation extraction with BioREx.

zy2376 avatar Oct 28 '24 14:10 zy2376

Hi @zy2376 ,

A normalized example of an AIONER file can be found at [bc8_biored_task1_val.txt](https://github.com/user-attachments/files/17559816/bc8_biored_task1_val.txt). Please note that AIONER NE types must be converted to their corresponding BioRED NE types (e.g., 'Gene' to 'GeneOrGeneProduct') before running BioREx.
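
The type conversion mentioned above could be sketched roughly as follows. This is only an illustration, not part of BioREx or AIONER: the function name `convert_types` is hypothetical, and only the 'Gene' to 'GeneOrGeneProduct' pair is stated in this thread. The other source labels in the map are assumptions and should be checked against the labels your AIONER model actually emits.

```python
# Hypothetical AIONER -> BioRED type map. Only 'Gene' -> 'GeneOrGeneProduct'
# is confirmed in this thread; the remaining source labels are assumptions.
AIONER_TO_BIORED = {
    "Gene": "GeneOrGeneProduct",
    "Disease": "DiseaseOrPhenotypicFeature",
    "Chemical": "ChemicalEntity",
    "Species": "OrganismTaxon",
    "Variant": "SequenceVariant",
    "CellLine": "CellLine",
}


def convert_types(pubtator_text: str, type_map=AIONER_TO_BIORED) -> str:
    """Rewrite the type column of tab-separated annotation lines."""
    out = []
    for line in pubtator_text.split("\n"):
        cols = line.split("\t")
        # Annotation lines look like: id, start, end, mention, type, identifier.
        if len(cols) >= 5 and cols[1].isdigit() and cols[2].isdigit():
            # Unknown labels are left unchanged so they can be inspected later.
            cols[4] = type_map.get(cols[4], cols[4])
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```

Title (`|t|`) and abstract (`|a|`) lines pass through untouched because they do not have numeric start and end columns.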

ptlai avatar Oct 29 '24 16:10 ptlai

@ptlai Thank you very much for providing the normalized example. I was finally able to run the AIONER-to-BioREx process on PubMed abstracts. However, the process failed when applied to PMC full text. Could you please provide guidance on resolving this issue?

zy2376 avatar Nov 19 '24 08:11 zy2376

Hi @zy2376 ,

To process the full-text data with BioREx, you can treat each paragraph as a separate abstract. For instance, take the article available at https://www.ncbi.nlm.nih.gov/research/pubtator3/publication/33202951.

You can format the content like this:

33202951|t|1. Introduction. Paragraph 1.
33202951|a|In general, N-nitrosamines (NAs) are the products of reactions between a nitrosating agent and a secondary or tertiary amine; NAs are formed preferentially at elevated temperature. Thus, NAs are mainly detected in food and drinks after processing. In foods, nitrous anhydride is the main nitrosating agent formed from nitrite in an acidic aqueous solution. In drinking water, N-nitrosodimethylamine (NDMA) is the most simple and volatile NA that can form during the degradation of dimethylhydrazine (a component of rocket fuel) by chloramination of amine-based precursors or as a byproduct of anion exchange purification of water. NDMA has been shown to be formed in certain foods due to a direct-fire drying process. International Agency for Research on Cancer (IARC) has classified NDMA as a probable carcinogen in humans. NDMA is known to be genotoxic in vivo and in vitro. Several case-control studies and a single cohort study of NDMA in humans supported the assumption that NDMA consumption is positively associated with either gastric or colorectal cancer. Therefore, due to possible contamination of water with NDMA, the World Health Organization (WHO) and U.S. Environmental Protection Agency (EPA) have set the drinking water guideline limits to 100 ng/L and 0.4 ng/L in tap water, respectively. Only in a few foods and countries, limits have been set for NAs. In the United States, a limit of 10 microg/kg has been set for total volatile NAs in cured meat products. In 2005, China introduced a limit of 4 and 7 microg/kg of NDMA in fish and related products, respectively. There are currently no maximum regulatory limits for the level of N-nitroso-compounds in food in the European Union.
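
As a minimal sketch of the formatting shown above, a paragraph can be wrapped into a `|t|` line and an `|a|` line like this. The helper name `format_paragraph` is hypothetical, and the choice to replace line breaks and tabs with single spaces is an assumption made here so that each part stays on one line without shifting character offsets.

```python
def format_paragraph(doc_id: str, heading: str, paragraph: str) -> str:
    """Return one PubTator-style document: a |t| line followed by an |a| line."""

    def one_line(text: str) -> str:
        # Swap each line break or tab for exactly one space, so the text
        # fits on a single line and its length (hence offsets) is unchanged.
        return text.replace("\r", " ").replace("\n", " ").replace("\t", " ")

    return f"{doc_id}|t|{one_line(heading)}\n{doc_id}|a|{one_line(paragraph)}\n"


if __name__ == "__main__":
    print(format_paragraph("33202951", "1. Introduction. Paragraph 1.",
                           "In general, N-nitrosamines (NAs) are ..."))
```

The document ID is supplied by the caller, and entity annotation lines would still need to be appended after the `|a|` line.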

ptlai avatar Nov 19 '24 22:11 ptlai

@ptlai Thanks to your comments, I've converted my full text into the |t| and |a| format, and it works for some paragraphs (see the attached BioREx input file "PMC7611502_t-a-format_239rows.txt" and BioREx output file "PMC7611502_t-a-format_239rows_predict.txt"). However, for the full-text data, I encountered a mapping issue, as indicated by the following warning: "IFN 15184 3 <annotation.AnnotationInfo object at 0x2af9f2c126d0> cannot be mapped to original text". By the way, the full-text data was formatted using the following workflow: paper from BioC API -> AIONER -> GNorm2 -> BioRED PubTator. I checked that the full-text data didn't change from BioC to BioRED PubTator, which led me to wonder whether the issue might be due to a difference between AIONER and BioREx mapping. Could you help resolve the mapping issue?

PMC7611502_t-a-format.txt PMC7611502_t-a-format_239rows.txt PMC7611502_t-a-format_239rows_predict.txt

zy2376 avatar Dec 03 '24 12:12 zy2376

Hi @zy2376 ,

Thank you for providing the example PubTator files. Upon review, I noticed a few formatting issues that need to be addressed:

  1. Each document in the file should be separated by an empty line. For example:

Incorrect

7611502|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502	18	21	A20	GeneOrGeneProduct	7128
7611502|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502	242	249	TNFAIP3	GeneOrGeneProduct	21929,7128
7611502	265	268	A20	GeneOrGeneProduct	7128
7611502	299	302	A20	GeneOrGeneProduct	7128,21929
7611502	357	360	CD8	GeneOrGeneProduct	925
7611502	530	551	TANK-binding kinase 1	GeneOrGeneProduct	29110,56480
7611502	553	557	TBK1	GeneOrGeneProduct	29110,56480
7611502	602	607	STAT1	GeneOrGeneProduct	6772,20846
7611502	631	636	PD-L1	GeneOrGeneProduct	29126
7611502	733	736	A20	GeneOrGeneProduct	7128
7611502	791	794	A20	GeneOrGeneProduct	7128
7611502	840	844	TBK1	GeneOrGeneProduct	29110
7611502	845	850	STAT1	GeneOrGeneProduct	6772
7611502	851	856	PD-L1	GeneOrGeneProduct	29126,60533
7611502	401	409	patients	OrganismTaxon	9606
7611502	414	418	mice	OrganismTaxon	10090
7611502	718	722	mice	OrganismTaxon	10090
7611502	486	496	interferon	GeneOrGeneProduct	3439
7611502|t|Introduction
7611502|a|Cancer cells express immune regulatory factors that remodel the tumor microenvironment (TME) and promote tumor immune escape, a hallmark of cancer progression. Accordingly, TME targeting therapies to break tumor-induced immune tolerance are heavily pursued. The development of immune checkpoint inhibitors blocking negative effectors of T cell function was a major advance, especially in malignancies with poor prognosis. In lung cancer, which is the leading cause of cancer related deaths, the approval of immune checkpoint blockade (ICB) raised high hopes and fundamentally changed therapies. Nevertheless, only around 20% of unselected patients suffering from non-small cell lung cancer (NSCLC) respond to monotherapies targeting Programmed Cell Death Protein 1 (PD-1)/Programmed Death Ligand 1 (PD-L1), and predicting the response of individual patients remains challenging. A better understanding of factors altering the TME is needed in order to avoid exposing non-responders to the unnecessary toxicity of costly ICB therapeutic regimen.
7611502	1677	1708	Programmed Cell Death Protein 1	GeneOrGeneProduct	5133
7611502	1743	1748	PD-L1	GeneOrGeneProduct	29126
7611502	1583	1591	patients	OrganismTaxon	9606
7611502	1793	1801	patients	OrganismTaxon	9606

Correct

7611502|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502	18	21	A20	GeneOrGeneProduct	7128
7611502|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502	242	249	TNFAIP3	GeneOrGeneProduct	21929,7128
7611502	265	268	A20	GeneOrGeneProduct	7128
7611502	299	302	A20	GeneOrGeneProduct	7128,21929
7611502	357	360	CD8	GeneOrGeneProduct	925
7611502	530	551	TANK-binding kinase 1	GeneOrGeneProduct	29110,56480
7611502	553	557	TBK1	GeneOrGeneProduct	29110,56480
7611502	602	607	STAT1	GeneOrGeneProduct	6772,20846
7611502	631	636	PD-L1	GeneOrGeneProduct	29126
7611502	733	736	A20	GeneOrGeneProduct	7128
7611502	791	794	A20	GeneOrGeneProduct	7128
7611502	840	844	TBK1	GeneOrGeneProduct	29110
7611502	845	850	STAT1	GeneOrGeneProduct	6772
7611502	851	856	PD-L1	GeneOrGeneProduct	29126,60533
7611502	401	409	patients	OrganismTaxon	9606
7611502	414	418	mice	OrganismTaxon	10090
7611502	718	722	mice	OrganismTaxon	10090
7611502	486	496	interferon	GeneOrGeneProduct	3439

7611502|t|Introduction
7611502|a|Cancer cells express immune regulatory factors that remodel the tumor microenvironment (TME) and promote tumor immune escape, a hallmark of cancer progression. Accordingly, TME targeting therapies to break tumor-induced immune tolerance are heavily pursued. The development of immune checkpoint inhibitors blocking negative effectors of T cell function was a major advance, especially in malignancies with poor prognosis. In lung cancer, which is the leading cause of cancer related deaths, the approval of immune checkpoint blockade (ICB) raised high hopes and fundamentally changed therapies. Nevertheless, only around 20% of unselected patients suffering from non-small cell lung cancer (NSCLC) respond to monotherapies targeting Programmed Cell Death Protein 1 (PD-1)/Programmed Death Ligand 1 (PD-L1), and predicting the response of individual patients remains challenging. A better understanding of factors altering the TME is needed in order to avoid exposing non-responders to the unnecessary toxicity of costly ICB therapeutic regimen.
7611502	1677	1708	Programmed Cell Death Protein 1	GeneOrGeneProduct	5133
7611502	1743	1748	PD-L1	GeneOrGeneProduct	29126
7611502	1583	1591	patients	OrganismTaxon	9606
7611502	1793	1801	patients	OrganismTaxon	9606

  2. The first two lines in each document must begin with |t| (title) and |a| (abstract), respectively. Entity annotations should start from the third line onward.

Incorrect

7611502|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502	18	21	A20	GeneOrGeneProduct	7128
7611502|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502	242	249	TNFAIP3	GeneOrGeneProduct	21929,7128

Correct

7611502|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502	18	21	A20	GeneOrGeneProduct	7128
7611502	242	249	TNFAIP3	GeneOrGeneProduct	21929,7128

  3. Entity offsets should reset to 0 at the beginning of each document.
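
The three rules above can be checked before running BioREx with a small script along these lines. It is only a sketch: `check_pubtator` is a hypothetical helper, not part of BioREx. It assumes offsets are counted over the title, one separator character, and then the abstract, which is consistent with the offsets in the examples in this thread, and it expects input files that contain only entity annotation lines.

```python
from typing import List


def check_pubtator(content: str) -> List[str]:
    """Return a list of human-readable problems found in a PubTator input file."""
    problems = []
    # Rule 1: documents are separated by an empty line.
    for block in content.strip().split("\n\n"):
        lines = block.split("\n")
        # Rule 2: the first two lines are the |t| and |a| lines, in that order.
        if len(lines) < 2 or "|t|" not in lines[0] or "|a|" not in lines[1]:
            problems.append(f"bad header near: {lines[0][:40]!r}")
            continue
        doc_id, _, title = lines[0].partition("|t|")
        abstract = lines[1].partition("|a|")[2]
        text = title + " " + abstract  # assumed offset convention
        for line in lines[2:]:
            cols = line.split("\t")
            if len(cols) < 5 or not (cols[1].isdigit() and cols[2].isdigit()):
                problems.append(f"{doc_id}: not an annotation line: {line[:40]!r}")
                continue
            start, end = int(cols[1]), int(cols[2])
            # Rule 3: offsets are relative to this document, so the slice
            # must reproduce the mention text exactly.
            if text[start:end] != cols[3]:
                problems.append(
                    f"{doc_id}: offsets {start}-{end} give "
                    f"{text[start:end]!r}, expected {cols[3]!r}"
                )
    return problems
```

An empty result means every annotation slice matches its mention. Offsets that were not reset per document will usually show up as mismatches, which may help locate annotations that cannot be mapped back to the text.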

ptlai avatar Dec 03 '24 14:12 ptlai

@ptlai Thanks to your help, the full text can now be processed with BioREx. However, another issue has arisen: every document yields the same relations 🤦. I've included the input and output files below. Could you please check them?

PMC7611502_input.txt PMC7611502_predict.txt

zy2376 avatar Dec 28 '24 15:12 zy2376

Apologies for the confusion. I noticed that the document ID serves as a unique index for the input. Therefore, you need to use a different index for each input text, as shown below:

7611502_0|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502_0|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502_0	18	21	A20	GeneOrGeneProduct	7128
7611502_0	242	249	TNFAIP3	GeneOrGeneProduct	7128
7611502_0	265	268	A20	GeneOrGeneProduct	7128
7611502_0	299	302	A20	GeneOrGeneProduct	7128
7611502_0	357	360	CD8	GeneOrGeneProduct	925
7611502_0	530	551	TANK-binding kinase 1	GeneOrGeneProduct	29110
7611502_0	553	557	TBK1	GeneOrGeneProduct	29110
7611502_0	602	607	STAT1	GeneOrGeneProduct	6772
7611502_0	631	636	PD-L1	GeneOrGeneProduct	29126
7611502_0	733	736	A20	GeneOrGeneProduct	7128
7611502_0	791	794	A20	GeneOrGeneProduct	7128
7611502_0	840	844	TBK1	GeneOrGeneProduct	29110
7611502_0	845	850	STAT1	GeneOrGeneProduct	6772
7611502_0	851	856	PD-L1	GeneOrGeneProduct	29126
7611502_0	401	409	patients	OrganismTaxon	9606
7611502_0	414	418	mice	OrganismTaxon	10090
7611502_0	718	722	mice	OrganismTaxon	10090

7611502_1|t|Introduction
7611502_1|a|Cancer cells express immune regulatory factors that remodel the tumor microenvironment (TME) and promote tumor immune escape, a hallmark of cancer progression. Accordingly, TME targeting therapies to break tumor-induced immune tolerance are heavily pursued. The development of immune checkpoint inhibitors blocking negative effectors of T cell function was a major advance, especially in malignancies with poor prognosis. In lung cancer, which is the leading cause of cancer related deaths, the approval of immune checkpoint blockade (ICB) raised high hopes and fundamentally changed therapies. Nevertheless, only around 20% of unselected patients suffering from non-small cell lung cancer (NSCLC) respond to monotherapies targeting Programmed Cell Death Protein 1 (PD-1)/Programmed Death Ligand 1 (PD-L1), and predicting the response of individual patients remains challenging. A better understanding of factors altering the TME is needed in order to avoid exposing non-responders to the unnecessary toxicity of costly ICB therapeutic regimen.
7611502_1	746	777	Programmed Cell Death Protein 1	GeneOrGeneProduct	5133
7611502_1	779	783	PD-1	GeneOrGeneProduct	5133
7611502_1	812	817	PD-L1	GeneOrGeneProduct	29126
7611502_1	652	660	patients	OrganismTaxon	9606
7611502_1	862	870	patients	OrganismTaxon	9606
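
One way to produce input in the shape shown above is sketched below: every paragraph gets its own document ID and its annotation offsets are shifted so they restart at 0. The names `Section` and `split_full_text`, the `_<index>` suffix scheme, and the input layout (annotations carrying offsets into the whole article, plus the position where each section title begins) are assumptions for illustration, not something BioREx prescribes.

```python
from typing import List, NamedTuple, Tuple


class Section(NamedTuple):
    offset: int  # position in the full article where this section's title begins
    title: str
    text: str


# (start, end, mention, type, identifier), with article-wide offsets
Annotation = Tuple[int, int, str, str, str]


def split_full_text(base_id: str, sections: List[Section],
                    annotations: List[Annotation]) -> str:
    docs = []
    for i, sec in enumerate(sections):
        doc_id = f"{base_id}_{i}"  # unique ID per paragraph
        length = len(sec.title) + 1 + len(sec.text)
        lines = [f"{doc_id}|t|{sec.title}", f"{doc_id}|a|{sec.text}"]
        for start, end, mention, ne_type, ident in annotations:
            # Keep only annotations that fall entirely inside this section,
            # and shift them so offsets restart at 0 for the new document.
            if sec.offset <= start and end <= sec.offset + length:
                lines.append("\t".join([
                    doc_id, str(start - sec.offset), str(end - sec.offset),
                    mention, ne_type, ident,
                ]))
        docs.append("\n".join(lines))
    # Documents are separated by one empty line.
    return "\n\n".join(docs) + "\n"
```

Since the document ID acts as the index for the input, any scheme that yields distinct IDs should work equally well. The suffix is just convenient for mapping predictions back to the source article.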

ptlai avatar Dec 30 '24 14:12 ptlai

Hello,

I am wondering how you normalized the outputs from AIONER. I cannot see how the normalized AIONER example above helped you, because the entity tagging is obviously different in your paper. I guess you are using GNorm2.

jose-lopez avatar Mar 05 '25 18:03 jose-lopez

Hi @jose-lopez ,

AIONER results cannot be directly input into BioREx. We utilize multiple components for normalization, including GNorm2, TaggerOne, the NLM-Chem model, and tmVar3. If you only need to process PubMed or PMC full-text articles, consider using our PubTator 3 API (https://www.ncbi.nlm.nih.gov/research/pubtator3/api) to retrieve the normalization results. Thanks

ptlai avatar Mar 05 '25 20:03 ptlai

Thank you very much!

jose-lopez avatar Mar 06 '25 00:03 jose-lopez

Hello, could you please tell me how you got the normalized entity annotations for each paragraph? I have tried installing GNorm2 and passing AIONER's annotations to it, but I didn't get a file that BioREx can work with. The GNorm2 output seems to need another stage of normalization before a fully normalized PubTator file is produced. I am asking because I would like to make predictions for a full paper that is not publicly available. Thanks a lot!

jose-lopez avatar Mar 13 '25 18:03 jose-lopez

Hi @jose-lopez ,

Sorry for the confusion. Regarding the problem with GNORM2, do you receive any error messages? GNORM2 was developed by Dr. Wei. I'll forward this issue to him for further assistance.

BioREx only supports the BioRED named entity types. Therefore, you might need an additional post-processing step to convert the GNORM2 named entity types to those recognized by BioRED before using BioREx.
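
Before sending a file to BioREx, a quick pre-flight check along these lines can list the annotations that would still need post-processing. It is a sketch only: `unsupported_annotations` is a hypothetical helper, the set below lists the six BioRED entity types, and treating an empty or "-" identifier column as "not normalized" is an assumption about how unnormalized output looks.

```python
from typing import List, Tuple

BIORED_TYPES = {
    "GeneOrGeneProduct",
    "DiseaseOrPhenotypicFeature",
    "ChemicalEntity",
    "SequenceVariant",
    "OrganismTaxon",
    "CellLine",
}


def unsupported_annotations(pubtator_text: str) -> List[Tuple[str, str]]:
    """Return (line, reason) pairs for annotations that need fixing."""
    flagged = []
    for line in pubtator_text.split("\n"):
        cols = line.split("\t")
        if len(cols) < 5 or not (cols[1].isdigit() and cols[2].isdigit()):
            continue  # title, abstract, blank, or other non-annotation lines
        ident = cols[5].strip() if len(cols) > 5 else ""
        if cols[4] not in BIORED_TYPES:
            flagged.append((line, "type not in BioRED"))
        elif ident in ("", "-"):  # assumed markers of a missing ID
            flagged.append((line, "missing identifier"))
    return flagged
```

Lines flagged for their type would need converting to a BioRED type, and lines flagged for their identifier would need a normalization step, before the file is passed on.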

If you already have data generated by GNORM2 but are unable to process it with BioREx, please feel free to send it to me. I'd be happy to review it and assist further. Thanks.

ptlai avatar Mar 13 '25 18:03 ptlai