Request for sample input and output data for prediction only
I tried to extract relations from AIONER output, but it didn't work. I went through the issues and found that the example is in the BioRED repo. I'd like to know how to create such input data, and to see a sample of what the predict.pubtator output looks like.
Hi @Darrshan-Sankar,
The results of AIONER cannot be fed directly to BioREx. BioREx requires that the entities' IDs be normalized, so you have to use our normalization components, such as GNORM2. If you only want to process PubMed abstracts, you can use https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/, where we provide the processed relation results for PubMed. Please let me know if you need further help.
@ptlai Thanks for your support. I actually have to process full texts. Could you please explain how to normalize AIONER results so they can be used as input for BioREx? A script would help, if possible.
Hi @Darrshan-Sankar,
The simplest way is to use the NE/ID annotations in https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/ as well (BioCXML files). We processed the NEs/IDs for full-text already, but relations for abstracts only. You can treat each paragraph as an abstract and then feed it to BioREx. If you still need help using normalization components, you may contact Dr. Wei ([email protected]), who deals with the entire backend process of our PubTator.
@ptlai Yes, I went through the FTP. As you said, relations are only available for abstracts. Thank you for providing Dr. Wei's contact details.
@ptlai Hi, could you please provide an example of a normalized AIONER file? I'd like to review the workflow for extracting entities with AIONER and then performing relation extraction with BioREx.
Hi @zy2376 ,
A normalized example of an AIONER file can be found at [bc8_biored_task1_val.txt](https://github.com/user-attachments/files/17559816/bc8_biored_task1_val.txt). Please note that AIONER NE types must be converted to their corresponding BioRED NE types (e.g., 'Gene' to 'GeneOrGeneProduct') before running BioREx.
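The type conversion mentioned above could be sketched roughly as follows. This is a minimal illustration, not part of BioREx: only the 'Gene' to 'GeneOrGeneProduct' pair comes from this thread, the other pairs in `TYPE_MAP` are assumptions to verify against your tagger's actual labels, and the function name `to_biored_type` is hypothetical. It also assumes tab-separated annotation lines, as in the standard PubTator format.

```python
# Hypothetical mapping from tagger type labels to BioRED NE types.
# Only 'Gene' -> 'GeneOrGeneProduct' is stated in this thread; the other
# pairs are assumptions and should be checked against your own data.
TYPE_MAP = {
    "Gene": "GeneOrGeneProduct",
    "Disease": "DiseaseOrPhenotypicFeature",
    "Chemical": "ChemicalEntity",
    "Species": "OrganismTaxon",
    "Variant": "SequenceVariant",
    "CellLine": "CellLine",
}


def to_biored_type(line: str) -> str:
    """Rewrite the type column of one tab-separated PubTator annotation line.

    Title/abstract lines and lines with an unknown type are returned unchanged.
    """
    cols = line.rstrip("\n").split("\t")
    # Annotation columns: id, start, end, mention, type[, identifier].
    if len(cols) >= 5 and cols[4] in TYPE_MAP:
        cols[4] = TYPE_MAP[cols[4]]
    return "\t".join(cols)
```

Applying `to_biored_type` to every line of a PubTator file leaves the `|t|` and `|a|` lines untouched, because they contain no tab-separated type column.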
@ptlai Thank you very much for providing the normalized example. I was finally able to run the AIONER-to-BioREx process successfully for PubMed abstracts. However, the process failed when applied to PMC full text. Could you please provide guidance on resolving this issue?
Hi @zy2376 ,
To process the full-text data with BioREx, you can treat each paragraph as a separate abstract. For instance, take the article available at https://www.ncbi.nlm.nih.gov/research/pubtator3/publication/33202951.
You can format the content like this:
33202951|t|1. Introduction. Paragraph 1.
33202951|a|In general, N-nitrosamines (NAs) are the products of reactions between a nitrosating agent and a secondary or tertiary amine; NAs are formed preferentially at elevated temperature. Thus, NAs are mainly detected in food and drinks after processing. In foods, nitrous anhydride is the main nitrosating agent formed from nitrite in an acidic aqueous solution. In drinking water, N-nitrosodimethylamine (NDMA) is the most simple and volatile NA that can form during the degradation of dimethylhydrazine (a component of rocket fuel) by chloramination of amine-based precursors or as a byproduct of anion exchange purification of water. NDMA has been shown to be formed in certain foods due to a direct-fire drying process. International Agency for Research on Cancer (IARC) has classified NDMA as a probable carcinogen in humans. NDMA is known to be genotoxic in vivo and in vitro. Several case-control studies and a single cohort study of NDMA in humans supported the assumption that NDMA consumption is positively associated with either gastric or colorectal cancer. Therefore, due to possible contamination of water with NDMA, the World Health Organization (WHO) and U.S. Environmental Protection Agency (EPA) have set the drinking water guideline limits to 100 ng/L and 0.4 ng/L in tap water, respectively. Only in a few foods and countries, limits have been set for NAs. In the United States, a limit of 10 microg/kg has been set for total volatile NAs in cured meat products. In 2005, China introduced a limit of 4 and 7 microg/kg of NDMA in fish and related products, respectively. There are currently no maximum regulatory limits for the level of N-nitroso-compounds in food in the European Union.
@ptlai Thanks to your comments, I've converted my full text into the |t| and |a| format, and it works for some paragraphs (see the attached BioREx input file "PMC7611502_t-a-format_239rows.txt" and BioREx output file "PMC7611502_t-a-format_239rows_predict.txt"). However, for the full-text data, I encountered a mapping issue, as indicated by the following warning: "IFN 15184 3 <annotation.AnnotationInfo object at 0x2af9f2c126d0> cannot be mapped to original text ". The full-text data was formatted using the following workflow: paper from BioC API -> AIONER -> GNorm2 -> BioRED PubTator. I checked that the full-text data didn't change from BioC to BioRED PubTator, which led me to wonder whether the issue might be due to a difference between AIONER and BioREx mapping. Could you help resolve the mapping issue?
PMC7611502_t-a-format.txt PMC7611502_t-a-format_239rows.txt PMC7611502_t-a-format_239rows_predict.txt
Hi @zy2376 ,
Thank you for providing the example PubTator files. Upon review, I noticed a few formatting issues that need to be addressed:
- Each document in the file should be separated by an empty line. For example:
Incorrect
7611502|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502 18 21 A20 GeneOrGeneProduct 7128
7611502|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502 242 249 TNFAIP3 GeneOrGeneProduct 21929,7128
7611502 265 268 A20 GeneOrGeneProduct 7128
7611502 299 302 A20 GeneOrGeneProduct 7128,21929
7611502 357 360 CD8 GeneOrGeneProduct 925
7611502 530 551 TANK-binding kinase 1 GeneOrGeneProduct 29110,56480
7611502 553 557 TBK1 GeneOrGeneProduct 29110,56480
7611502 602 607 STAT1 GeneOrGeneProduct 6772,20846
7611502 631 636 PD-L1 GeneOrGeneProduct 29126
7611502 733 736 A20 GeneOrGeneProduct 7128
7611502 791 794 A20 GeneOrGeneProduct 7128
7611502 840 844 TBK1 GeneOrGeneProduct 29110
7611502 845 850 STAT1 GeneOrGeneProduct 6772
7611502 851 856 PD-L1 GeneOrGeneProduct 29126,60533
7611502 401 409 patients OrganismTaxon 9606
7611502 414 418 mice OrganismTaxon 10090
7611502 718 722 mice OrganismTaxon 10090
7611502 486 496 interferon GeneOrGeneProduct 3439
7611502|t|Introduction
7611502|a|Cancer cells express immune regulatory factors that remodel the tumor microenvironment (TME) and promote tumor immune escape, a hallmark of cancer progression. Accordingly, TME targeting therapies to break tumor-induced immune tolerance are heavily pursued. The development of immune checkpoint inhibitors blocking negative effectors of T cell function was a major advance, especially in malignancies with poor prognosis. In lung cancer, which is the leading cause of cancer related deaths, the approval of immune checkpoint blockade (ICB) raised high hopes and fundamentally changed therapies. Nevertheless, only around 20% of unselected patients suffering from non-small cell lung cancer (NSCLC) respond to monotherapies targeting Programmed Cell Death Protein 1 (PD-1)/Programmed Death Ligand 1 (PD-L1), and predicting the response of individual patients remains challenging. A better understanding of factors altering the TME is needed in order to avoid exposing non-responders to the unnecessary toxicity of costly ICB therapeutic regimen.
7611502 1677 1708 Programmed Cell Death Protein 1 GeneOrGeneProduct 5133
7611502 1743 1748 PD-L1 GeneOrGeneProduct 29126
7611502 1583 1591 patients OrganismTaxon 9606
7611502 1793 1801 patients OrganismTaxon 9606
Correct
7611502|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502 18 21 A20 GeneOrGeneProduct 7128
7611502|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502 242 249 TNFAIP3 GeneOrGeneProduct 21929,7128
7611502 265 268 A20 GeneOrGeneProduct 7128
7611502 299 302 A20 GeneOrGeneProduct 7128,21929
7611502 357 360 CD8 GeneOrGeneProduct 925
7611502 530 551 TANK-binding kinase 1 GeneOrGeneProduct 29110,56480
7611502 553 557 TBK1 GeneOrGeneProduct 29110,56480
7611502 602 607 STAT1 GeneOrGeneProduct 6772,20846
7611502 631 636 PD-L1 GeneOrGeneProduct 29126
7611502 733 736 A20 GeneOrGeneProduct 7128
7611502 791 794 A20 GeneOrGeneProduct 7128
7611502 840 844 TBK1 GeneOrGeneProduct 29110
7611502 845 850 STAT1 GeneOrGeneProduct 6772
7611502 851 856 PD-L1 GeneOrGeneProduct 29126,60533
7611502 401 409 patients OrganismTaxon 9606
7611502 414 418 mice OrganismTaxon 10090
7611502 718 722 mice OrganismTaxon 10090
7611502 486 496 interferon GeneOrGeneProduct 3439

7611502|t|Introduction
7611502|a|Cancer cells express immune regulatory factors that remodel the tumor microenvironment (TME) and promote tumor immune escape, a hallmark of cancer progression. Accordingly, TME targeting therapies to break tumor-induced immune tolerance are heavily pursued. The development of immune checkpoint inhibitors blocking negative effectors of T cell function was a major advance, especially in malignancies with poor prognosis. In lung cancer, which is the leading cause of cancer related deaths, the approval of immune checkpoint blockade (ICB) raised high hopes and fundamentally changed therapies. Nevertheless, only around 20% of unselected patients suffering from non-small cell lung cancer (NSCLC) respond to monotherapies targeting Programmed Cell Death Protein 1 (PD-1)/Programmed Death Ligand 1 (PD-L1), and predicting the response of individual patients remains challenging. A better understanding of factors altering the TME is needed in order to avoid exposing non-responders to the unnecessary toxicity of costly ICB therapeutic regimen.
7611502 1677 1708 Programmed Cell Death Protein 1 GeneOrGeneProduct 5133
7611502 1743 1748 PD-L1 GeneOrGeneProduct 29126
7611502 1583 1591 patients OrganismTaxon 9606
7611502 1793 1801 patients OrganismTaxon 9606
- The first two lines in each document must begin with |t| (title) and |a| (abstract), respectively. Entity annotations should start from the third line onward.
Incorrect
7611502|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502 18 21 A20 GeneOrGeneProduct 7128
7611502|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502 242 249 TNFAIP3 GeneOrGeneProduct 21929,7128
Correct
7611502|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502 18 21 A20 GeneOrGeneProduct 7128
7611502 242 249 TNFAIP3 GeneOrGeneProduct 21929,7128
- Entity offsets should reset to 0 at the beginning of each document.
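The three rules above can be checked before running BioREx. The sketch below is only an illustration and not part of BioREx; the name `check_pubtator` is hypothetical. It assumes tab-separated annotation columns and that offsets count from the start of the title with one separator character between title and abstract, which is consistent with the offsets in the examples in this thread. A mismatch between an annotation's mention and the text at its offsets is the kind of problem that would plausibly trigger a "cannot be mapped to original text" warning.

```python
import re

# Matches "<id>|t|<text>" or "<id>|a|<text>".
TEXT_LINE = re.compile(r"^([^\t|]+)\|([ta])\|(.*)$")


def check_pubtator(content: str) -> list[str]:
    """Return a list of formatting problems found in a PubTator-format string."""
    problems = []
    # Rule 1: documents are separated by an empty line.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.split("\n")
        head = [TEXT_LINE.match(line) for line in lines[:2]]
        # Rule 2: the first two lines are |t| and |a|, in that order.
        if len(head) < 2 or not all(head) or [m.group(2) for m in head] != ["t", "a"]:
            problems.append(f"{lines[0][:30]!r}: first two lines must be |t| then |a|")
            continue
        doc_id = head[0].group(1)
        # Rule 3: offsets start at 0 at the title; the abstract follows
        # after a single separator character.
        full_text = head[0].group(3) + " " + head[1].group(3)
        for line in lines[2:]:
            if TEXT_LINE.match(line):
                problems.append(f"{doc_id}: extra text line, missing empty line between documents?")
                continue
            cols = line.split("\t")
            if len(cols) < 5 or not (cols[1].isdigit() and cols[2].isdigit()):
                continue  # not an entity annotation line
            start, end, mention = int(cols[1]), int(cols[2]), cols[3]
            if full_text[start:end] != mention:
                problems.append(f"{doc_id}: {mention!r} does not match text at {start}-{end}")
    return problems
```

An empty result means the file passes these three checks; it does not guarantee that BioREx will accept it.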
@ptlai Thanks to your help, the full text can now be processed with BioREx. However, another issue has arisen: each document yields the same relations🤦. I've included the input and output files below. Please help me check them. PMC7611502_input.txt PMC7611502_predict.txt
Apologies for the confusion. I noticed that the document ID serves as a unique index for the input. Therefore, you need to use a different index for each input text, as shown below:
7611502_0|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502_0|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502_0 18 21 A20 GeneOrGeneProduct 7128
7611502_0 242 249 TNFAIP3 GeneOrGeneProduct 7128
7611502_0 265 268 A20 GeneOrGeneProduct 7128
7611502_0 299 302 A20 GeneOrGeneProduct 7128
7611502_0 357 360 CD8 GeneOrGeneProduct 925
7611502_0 530 551 TANK-binding kinase 1 GeneOrGeneProduct 29110
7611502_0 553 557 TBK1 GeneOrGeneProduct 29110
7611502_0 602 607 STAT1 GeneOrGeneProduct 6772
7611502_0 631 636 PD-L1 GeneOrGeneProduct 29126
7611502_0 733 736 A20 GeneOrGeneProduct 7128
7611502_0 791 794 A20 GeneOrGeneProduct 7128
7611502_0 840 844 TBK1 GeneOrGeneProduct 29110
7611502_0 845 850 STAT1 GeneOrGeneProduct 6772
7611502_0 851 856 PD-L1 GeneOrGeneProduct 29126
7611502_0 401 409 patients OrganismTaxon 9606
7611502_0 414 418 mice OrganismTaxon 10090
7611502_0 718 722 mice OrganismTaxon 10090
7611502_1|t|Introduction
7611502_1|a|Cancer cells express immune regulatory factors that remodel the tumor microenvironment (TME) and promote tumor immune escape, a hallmark of cancer progression. Accordingly, TME targeting therapies to break tumor-induced immune tolerance are heavily pursued. The development of immune checkpoint inhibitors blocking negative effectors of T cell function was a major advance, especially in malignancies with poor prognosis. In lung cancer, which is the leading cause of cancer related deaths, the approval of immune checkpoint blockade (ICB) raised high hopes and fundamentally changed therapies. Nevertheless, only around 20% of unselected patients suffering from non-small cell lung cancer (NSCLC) respond to monotherapies targeting Programmed Cell Death Protein 1 (PD-1)/Programmed Death Ligand 1 (PD-L1), and predicting the response of individual patients remains challenging. A better understanding of factors altering the TME is needed in order to avoid exposing non-responders to the unnecessary toxicity of costly ICB therapeutic regimen.
7611502_1 746 777 Programmed Cell Death Protein 1 GeneOrGeneProduct 5133
7611502_1 779 783 PD-1 GeneOrGeneProduct 5133
7611502_1 812 817 PD-L1 GeneOrGeneProduct 29126
7611502_1 652 660 patients OrganismTaxon 9606
7611502_1 862 870 patients OrganismTaxon 9606
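Putting the points from this thread together (one document per paragraph, a unique ID per document, offsets restarting at 0, an empty line between documents), the conversion could look roughly like the sketch below. It is an assumption-laden illustration, not the PubTator backend: the `Paragraph` and `Annotation` classes and the `to_pubtator` function are hypothetical names, the input is assumed to carry full-document offsets (as BioC passages do), and annotations are assumed to fall inside the paragraph text rather than the section title.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int        # offset in the full document
    end: int
    mention: str
    ne_type: str      # already converted to a BioRED type
    identifier: str   # normalized ID, e.g. an NCBI Gene ID


@dataclass
class Paragraph:
    title: str        # written as the |t| line, e.g. the section heading
    text: str         # written as the |a| line
    offset: int       # where `text` starts in the full document
    annotations: list[Annotation] = field(default_factory=list)


def to_pubtator(pmid: str, paragraphs: list[Paragraph]) -> str:
    """Write each paragraph as its own PubTator document."""
    docs = []
    for i, para in enumerate(paragraphs):
        doc_id = f"{pmid}_{i}"  # unique ID per paragraph
        # Rebase offsets so the title starts at 0; +1 accounts for the
        # separator character between title and abstract.
        shift = len(para.title) + 1 - para.offset
        lines = [f"{doc_id}|t|{para.title}", f"{doc_id}|a|{para.text}"]
        for a in para.annotations:
            lines.append("\t".join([
                doc_id, str(a.start + shift), str(a.end + shift),
                a.mention, a.ne_type, a.identifier,
            ]))
        docs.append("\n".join(lines))
    # Every document is followed by an empty line.
    return "".join(doc + "\n\n" for doc in docs)
```

Using the paragraph index as the ID suffix keeps the output traceable back to the original article, since the part before the underscore is still the article ID.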
Hello,
I am wondering how you normalized the outputs from AIONER. I cannot see how the normalized AIONER example above helped you, because the entity tagging is obviously different in your paper. I guess you are using GNorm2.
Hi @jose-lopez ,
AIONER results cannot be directly input into BioREx. We utilize multiple components for normalization, including GNorm2, TaggerOne, the NLM-Chem model, and tmVar3. If you only need to process PubMed or PMC full-text articles, consider using our PubTator 3 API (https://www.ncbi.nlm.nih.gov/research/pubtator3/api) to retrieve the normalization results. Thanks
Thank you very much!
Hello, could you please tell me how you got the normalized entity annotations for each paragraph? I tried installing GNORM2 and passing AIONER's annotations to it, but I didn't get a file that BioREx can work with. The GNORM2 output seems to need another normalization stage before a fully normalized PubTator file is produced. I am asking because I would like to make predictions for a full paper that is not publicly available. Thanks a lot!
Hi @jose-lopez ,
Sorry for the confusion. Regarding the problem with GNORM2, do you receive any error messages? GNORM2 was developed by Dr. Wei. I'll forward this issue to him for further assistance.
BioREx only supports the BioRED named entity types. Therefore, you might need an additional post-processing step to convert the GNORM2 named entity types to those recognized by BioRED before using BioREx.
If you already have data generated by GNORM2 but are unable to process it with BioREx, please feel free to send it to me. I'd be happy to review it and assist further. Thanks.