dkpro-core

TigerXmlReader produces wrong range when a target is noncontiguous

Open maxxkia opened this issue 9 years ago • 9 comments

TigerXmlReader produces wrong begin and end indexes for the target (SemPred) of a semantic frame when the target is noncontiguous.

For instance, in the sentence w1 w2 w3 w4 w5 w6 w7, if a target consists of w2 and w5, then the begin and end indexes of the target are wrongly set to:

target.begin = w2.begin;
target.end = w5.end;

To fix this issue:

  • [x] modify the reader so that in this case it returns the offsets of the first token (w2 here)
  • [x] merge boundary of neighboring tokens inside a frame
  • [x] produce appropriate warning message when a frame spans over noncontiguous tokens
  • [ ] implement a data structure (e.g. List) to store all constituents of a target
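
The first three items above can be sketched roughly as follows. This is only a minimal illustration, not the actual TigerXmlReader code: `Tok`, `Span`, `mergeAdjacent` and `firstSpanWithWarning` are hypothetical names, and it assumes each target token is available with its position in the sentence and its character offsets.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-ins for illustration; these are not DKPro Core or UIMA types. */
class TargetSpans {

    /** A token with its position in the sentence and its character offsets. */
    static final class Tok {
        final int index, begin, end;
        Tok(int index, int begin, int end) {
            this.index = index;
            this.begin = begin;
            this.end = end;
        }
    }

    /** A character range covering one run of adjacent tokens. */
    static final class Span {
        final int begin, end;
        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Merges tokens at consecutive sentence positions into one span per run. */
    static List<Span> mergeAdjacent(List<Tok> tokens) {
        List<Tok> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingInt((Tok t) -> t.index));
        List<Span> spans = new ArrayList<>();
        boolean open = false;
        int begin = 0, end = 0, prev = 0;
        for (Tok t : sorted) {
            if (open && t.index == prev + 1) {
                end = t.end;                      // neighbour: extend the current span
            } else {
                if (open) {
                    spans.add(new Span(begin, end)); // gap: close the previous span
                }
                begin = t.begin;
                end = t.end;
                open = true;
            }
            prev = t.index;
        }
        if (open) {
            spans.add(new Span(begin, end));
        }
        return spans;
    }

    /** Returns the first span and warns on stderr if the target is noncontiguous. */
    static Span firstSpanWithWarning(List<Tok> tokens) {
        List<Span> spans = mergeAdjacent(tokens);
        if (spans.isEmpty()) {
            throw new IllegalArgumentException("target has no tokens");
        }
        if (spans.size() > 1) {
            System.err.println("Warning: target is noncontiguous (" + spans.size()
                    + " spans); using the first span only");
        }
        return spans.get(0);
    }
}
```

Returning the full list from `mergeAdjacent` would also cover the unchecked fourth item, since a caller could keep every span instead of only the first.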

maxxkia avatar Jun 08 '16 16:06 maxxkia

Please also add a note to the NOTICE.txt file about where you got the new Tiger sample from.

reckart avatar Jun 09 '16 10:06 reckart

How should we deal with neighbouring tokens? For example, in w1 w2 w3 w4 w5 w6 w7, if a target is made up of w2 w3, which is the correct assignment of begin and end for the first element?

a)

begin = w2.begin
end = w3.end

b)

begin = w2.begin
end = w2.end

@reckart any suggestions?

maxxkia avatar Jun 09 '16 10:06 maxxkia

I think for continuous spans, we can just extend the offsets. I think the only problem is non-continuous spans, because we presently do not have a concept to represent these in DKPro Core.

reckart avatar Jun 09 '16 10:06 reckart

@reckart The problem that I imagined to be discontinuous frame arguments (#895) turned out to be another issue.

Consider the following example:

<frame name="SubjectiveExpression" id="s6_f2">
    <target>
        <fenode idref="s6_3"/>
        <fenode idref="s6_2"/>
    </target>
    <fe name="Source" id="s6_f2_e1">
        <flag name="Sprecher">
        </flag>
    </fe>
    <fe name="Target" id="s6_f2_e2">
        <fenode idref="s6_4"/>
        <fenode idref="s6_503"/>
        <fenode idref="s6_5"/>
    </fe>
</frame>

When the reader processes the frame element named Target (id="s6_f2_e2"), it creates 3 instances of SemArgLink with the role set to Target, each linking to an instance of SemArg that represents the annotation covered by one of the fenodes (i.e. s6_4, s6_503 and s6_5).

These SemArgLinks are accessible as arguments of a SemPred:

FSArray arguments = element.getArguments();

However, since the instances of SemArgLink belonging to a single argument are not stored in a collection of their own, one has to iterate over all of them to identify each SemArgLink group. One solution would be to group them by their FE name (i.e. Target in this case), but I'm not sure that value is distinct (can there be two FE in a TigerXml file having the same name but different ids?). Also note that the FE id (i.e. s6_f2_e2), which could be used to uniquely identify arguments, is dropped in TigerXmlReader.
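
A rough sketch of the grouping idea, assuming the reader were changed to keep the FE id. `Link` is a hypothetical flattened view of a SemArgLink (it is not the DKPro Core type), and `groupByFeId` is an illustrative helper, not an existing API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ArgGrouping {

    /** Hypothetical flattened view of one SemArgLink: FE id, role and argument offsets. */
    static final class Link {
        final String feId, role;
        final int begin, end;
        Link(String feId, String role, int begin, int end) {
            this.feId = feId;
            this.role = role;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Groups links by FE id, keeping the order in which the ids first appear. */
    static Map<String, List<Link>> groupByFeId(List<Link> links) {
        Map<String, List<Link>> groups = new LinkedHashMap<>();
        for (Link link : links) {
            groups.computeIfAbsent(link.feId, id -> new ArrayList<>()).add(link);
        }
        return groups;
    }
}
```

Grouping by id rather than by role name keeps two arguments apart even when they share the same role name.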

maxxkia avatar Jul 01 '16 12:07 maxxkia

This problem was raised when I tried to identify the boundaries of sources and targets for subjective expressions.

maxxkia avatar Jul 01 '16 12:07 maxxkia

"can there be two fe in a TigerXml file having the same name but different ids?"

In principle, yes. There could be two arguments with the same role name.

    <fe name="Target" id="s6_f2_e2">
        <fenode idref="s6_4"/>
        <fenode idref="s6_503"/>
        <fenode idref="s6_5"/>
    </fe>

I think that if these three are adjacent tokens, they should be merged into a single SemArg span. If they are not adjacent tokens, then we have a discontinuous SemArg. Does that make sense to you?

reckart avatar Jul 01 '16 12:07 reckart

Only one SemArgLink & SemArg should IMHO be per FE.

reckart avatar Jul 01 '16 12:07 reckart

Actually, in this example and in many more that I checked manually, the constituents of an FE are adjacent and can be merged. I should write a piece of code to see whether there are any discontinuous FEs in my dataset.
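
Such a check could look roughly like this. It is only a sketch: it assumes the fenodes of each FE have already been resolved to token positions in the sentence, and `findDiscontinuous` and the map layout are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class DiscontinuityCheck {

    /** True if the distinct positions form one unbroken run (an empty list counts as contiguous). */
    static boolean isContiguous(List<Integer> positions) {
        TreeSet<Integer> distinct = new TreeSet<>(positions);
        if (distinct.isEmpty()) {
            return true;
        }
        // Distinct integers are consecutive exactly when max - min + 1 equals their count.
        return distinct.last() - distinct.first() + 1 == distinct.size();
    }

    /** Returns the ids of all FEs whose token positions leave a gap. */
    static List<String> findDiscontinuous(Map<String, List<Integer>> positionsByFeId) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : positionsByFeId.entrySet()) {
            if (!isContiguous(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```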

"Only one SemArgLink & SemArg should IMHO be per FE."

I agree, since I haven't yet seen any example violating this condition.

maxxkia avatar Jul 01 '16 14:07 maxxkia

@reckart I found discontinuous FE instances (see #895).

maxxkia avatar Jul 01 '16 16:07 maxxkia