Text Content Manipulation
NBA Game Dataset for Text Content Manipulation
This is a dataset for the task of text content manipulation, as first proposed in the paper:
Toward Unsupervised Text Content Manipulation
Wentao Wang*, Zhiting Hu*, Zichao Yang, Haoran Shi, Frank Xu, Eric P. Xing; 2019
Data Format
Each example in the dataset consists of four elements, namely, (x, y_aux, x_ref, y_ref), where
- `x` is a content record containing a set of data tuples `x = {x_i}`. Each tuple `x_i` contains three fields `(type, value, associated)`. For example, `x_i = (TEAM-AST, 25, Boston)` means "Boston had 25 team assists". More specifically:
  - `type`: the data type of the tuple, e.g., `TEAM-AST`, `PLAYER-PTS`, etc. There are 34 data types in total. See the file `x_type.vocab.txt` for all data types.
  - `value`: the value of the data, usually a scalar number or a string (e.g., a player's name).
  - `associated`: the team or player that the tuple is associated with.

  The three fields of each `x` instance are stored in three parallel files. For example, each line in the file `train/x_type.train.txt` contains the data types of all tuples in one `x` training instance, separated by whitespace. The first line in `train/x_type.train.txt` is `TEAM_NAME TEAM-AST TEAM-AST TEAM_NAME`, meaning that the first `x` instance has 4 tuples with these types, in order.

  We also provide joined files of `x`. For example, each line in `train/x.joined.train.txt` contains all tuples in one `x` training instance. Within each tuple, the three fields are joined with `|`. For example, the first line in `train/x.joined.train.txt` is `Boston|TEAM_NAME|Boston 25|TEAM-AST|Boston 11|TEAM-AST|New_York New_York|TEAM_NAME|New_York`. These files are simply joined from the separate files and are only used when evaluating results.
- `y_aux` is the auxiliary sentence describing the content of `x`.
- `x_ref` is the content record of the reference sentence `y_ref`, in the same format as `x`. During data construction, we guarantee that `x_ref` has a structure similar to that of `x`, but differs in the number of tuples, or in their values or types.
- `y_ref` is the reference sentence that defines the desired writing style of the output sentence.
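The relationship between the three parallel files and the joined file can be sketched as follows. This is a minimal illustration, not the authors' tooling: the helper names `join_record` and `split_record` are hypothetical, the `value|type|associated` field order is inferred from the joined example above, and the value and associated lines are reconstructed from that same example rather than copied from the data files. It also assumes that no field contains whitespace or `|`.

```python
def join_record(types_line, values_line, assoc_line):
    """Join three parallel lines into one joined line.

    Field order (value|type|associated) is inferred from the example
    line of x.joined.train.txt; treat it as an assumption.
    """
    types = types_line.split()
    values = values_line.split()
    assocs = assoc_line.split()
    if not (len(types) == len(values) == len(assocs)):
        raise ValueError("parallel lines disagree on the number of tuples")
    return " ".join(f"{v}|{t}|{a}" for t, v, a in zip(types, values, assocs))


def split_record(joined_line):
    """Parse a joined line back into (type, value, associated) tuples."""
    tuples = []
    for token in joined_line.split():
        value, data_type, associated = token.split("|")
        tuples.append((data_type, value, associated))
    return tuples


# The type line is quoted in the text; the other two lines are
# reconstructed from the joined example, not read from the dataset.
types_line = "TEAM_NAME TEAM-AST TEAM-AST TEAM_NAME"
values_line = "Boston 25 11 New_York"
assoc_line = "Boston Boston New_York New_York"

joined = join_record(types_line, values_line, assoc_line)
print(joined)
print(split_record(joined))
```

Raising an error on mismatched lengths is a design choice for this sketch: since the three files are parallel, a length mismatch most likely means the lines being read are misaligned.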
Data Files
- The dataset is split into train/val/test sets, each stored in its own folder.
- The four elements `(x, y_aux, x_ref, y_ref)` of each example are stored in parallel files. For example, each line of `train/y_aux.train.txt` is the auxiliary sentence of the corresponding data example.

  As explained above, the three fields of `x` are stored separately in three files, namely (taking the training data as an example) `x_type.train.txt`, `x_value.train.txt`, and `x_associated.train.txt`. The joined tuples of `x` are stored in a single file, `x.joined.train.txt`. `x_ref` is stored in the same format, in files such as `x_ref_type.train.txt` and `x_ref.joined.train.txt`.
- The vocabulary file `y.vocab.txt` contains all words that occur in `y_aux` and `y_ref`. `x_type.vocab.txt`, `x_value.vocab.txt`, and `x_associated.vocab.txt` are the vocabularies of the `type`, `value`, and `associated` fields of both `x` and `x_ref`.
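Because all files in a split are parallel, one way to load a split is to read each file line by line and zip the lines together. The sketch below assumes the naming pattern `<name>.<suffix>.txt` seen in the training examples. The file names `x_ref_value`, `x_ref_associated`, and `y_ref` are extrapolated by analogy and are not confirmed by the text, and the suffix used in the validation and test folders is likewise not stated, so it is left as a parameter. `load_split` and `FIELD_NAMES` are hypothetical names.

```python
from pathlib import Path

# File stems assumed to exist in every split folder. Only x_type,
# x_value, x_associated, y_aux and x_ref_type are named in the text;
# the remaining stems are extrapolated by analogy.
FIELD_NAMES = [
    "x_type", "x_value", "x_associated", "y_aux",
    "x_ref_type", "x_ref_value", "x_ref_associated", "y_ref",
]


def read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_split(split_dir, suffix):
    """Load one split as a list of dicts, one dict per example."""
    split_dir = Path(split_dir)
    columns = [read_lines(split_dir / f"{name}.{suffix}.txt")
               for name in FIELD_NAMES]
    if len({len(col) for col in columns}) != 1:
        raise ValueError("parallel files have different numbers of lines")
    # zip(*columns) yields one row per example across all files.
    return [dict(zip(FIELD_NAMES, row)) for row in zip(*columns)]
```

For the training data this would be called as `load_split("train", "train")`, after which `examples[0]["y_aux"]` holds the first auxiliary sentence.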
Data Statistics
| | train | valid | test |
|---|---|---|---|
| #Instances | 31,751 | 6,833 | 6,999 |
| #Tokens | 1.64M | 0.35M | 0.36M |
| Avg Sentence Length | 25.90 | 25.87 | 25.99 |
| #Data Types | 34 | 34 | 34 |
| Avg Record Length | 4.88 | 4.88 | 4.94 |
Dataset Creation Process
We briefly describe the process of creating the above dataset.
This dataset is derived from RotoWire, one of the data-to-text datasets proposed in the paper Challenges in Data-to-Document Generation (Wiseman et al., 2017) for NBA game report generation. The original data can be downloaded from here.
The original dataset consists of (table, paragraph) pairs. We first split each data example into (record, sentence) pairs:
- The original dataset is first preprocessed with a modified version of the script provided with the Data-to-Text dataset. In this step, we make sure that each entity name (team/city/player) becomes a single token (e.g., `LeBron_James`, `Los_Angeles_Clippers`), and that all numbers written as words are replaced by their digit forms (e.g., `fifty` in the original text becomes `50`).
- We split the paragraph in each data example into sentences, which become the `y_aux`.
- We then use the above script to extract all candidate relations between entities and numbers in each sentence `y_aux`. Additional rule-based constraints are imposed to filter out as many redundant relations as possible. The extracted relations form the record `x`. At this point, we have obtained all `(x, y_aux)` pairs.
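The normalization in the first step can be illustrated with a small sketch. This is not the modified RotoWire script itself, only an assumed approximation of its effect: `normalize`, `NUMBER_WORDS`, and the entity list are hypothetical, the number-word table is deliberately tiny, and the input is assumed to be whitespace-tokenized already.

```python
import re

# A deliberately small, illustrative number-word table (assumption).
NUMBER_WORDS = {"eleven": "11", "twenty": "20", "fifty": "50"}


def normalize(sentence, entity_names):
    """Join multi-word entity names with underscores and replace
    number words with digits."""
    # Handle longer names first so that "Los Angeles Clippers" is joined
    # before the shorter name "Los Angeles" can match inside it.
    for name in sorted(entity_names, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        sentence = re.sub(pattern, name.replace(" ", "_"), sentence)
    tokens = [NUMBER_WORDS.get(tok.lower(), tok) for tok in sentence.split()]
    return " ".join(tokens)


entities = ["LeBron James", "Los Angeles Clippers", "Los Angeles"]
text = "LeBron James scored fifty points against the Los Angeles Clippers"
print(normalize(text, entities))
```

Once every entity and number is a single token, the later relation extraction step can treat each of them as a single unit when pairing entities with numbers.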
We next use a retrieval method to retrieve from the training set an `(x_ref, y_ref)` pair for each of the above `(x, y_aux)` pairs. In particular, as mentioned above, we want to guarantee that `x_ref` has similar, but not exactly the same, content as `x`. Formally, we use the following criterion for retrieval:
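$$x_{ref} = \arg\max_{x'} \; J(\text{types}(x'), \text{types}(x)) \quad \text{s.t.} \quad J(\text{types}(x'), \text{types}(x)) < 1$$

where `x'` ranges over the records in the training set; `types(x)` is the set of all data types in record `x`; and `J(A, B)` is the Jaccard index between two sets `A` and `B`. The larger `J(A, B)` is, the closer `A` and `B` are. When `J(A, B) = 1`, `A` is exactly the same as `B`; otherwise there is some difference between them. We measure the similarity between two records based on their types. Our criterion therefore finds the `x_ref` that is most similar to, but not exactly the same as, `x`.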
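A minimal sketch of this retrieval, based only on the description above: compute the Jaccard index between type sets and keep the highest-scoring candidate whose score is below 1. The function names are hypothetical, and details that the text does not specify, such as tie-breaking (here: the first best candidate wins) and the linear scan over candidates, are assumptions.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def retrieve_reference(query_types, candidate_types):
    """Return the index of the candidate whose type set is most similar
    to, but not identical to, the query's type set (None if no such
    candidate exists)."""
    best_index, best_score = None, -1.0
    for i, cand in enumerate(candidate_types):
        score = jaccard(query_types, cand)
        # Skip exact matches (J = 1); keep the best remaining score.
        if score < 1.0 and score > best_score:
            best_index, best_score = i, score
    return best_index


query = ["TEAM_NAME", "TEAM-AST", "TEAM-AST", "TEAM_NAME"]
candidates = [
    ["TEAM_NAME", "TEAM-AST"],                # same type set, J = 1, skipped
    ["TEAM_NAME", "TEAM-AST", "PLAYER-PTS"],  # J = 2/3
    ["PLAYER-PTS"],                           # J = 0
]
print(retrieve_reference(query, candidates))
```

Since the comparison is over sets of types, repeated types in a record (such as the two `TEAM-AST` tuples in the query) do not affect the score.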