dkpro-core

Add language feature to Div type

Open maxxkia opened this issue 9 years ago • 3 comments

Some time ago I discussed a project with @reckart in which we need paragraph-level or even sentence-level language annotations in our documents. The conclusion was that it may be a good idea to add language as a feature to the Div type (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div), which is the supertype of the aforementioned types.

Are there any other things that need to be done with respect to this requirement?

  • [ ] add feature
  • [ ] add feature in UML diagrams in type system documentation
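
To make the proposal concrete, here is a minimal plain-Java sketch of what a language feature on a div could look like from a consumer's point of view. The Div class below is a hypothetical stand-in, not the real DKPro Core type, and the fallback to the document language when the feature is unset is an assumption for illustration, not something that has been decided.

```java
public class DivLanguageSketch {

    /** Hypothetical stand-in for a Div annotation; not the real DKPro Core class. */
    public static class Div {
        public final int begin;
        public final int end;
        public final String language; // proposed feature; null means "not set"

        public Div(int begin, int end, String language) {
            this.begin = begin;
            this.end = end;
            this.language = language;
        }
    }

    /** Assumed fallback rule: use the div's language if set, else the document's. */
    public static String effectiveLanguage(Div div, String documentLanguage) {
        return div.language != null ? div.language : documentLanguage;
    }

    public static void main(String[] args) {
        Div german = new Div(0, 40, "de");    // div that carries its own language
        Div unmarked = new Div(41, 80, null); // div without a language

        System.out.println(effectiveLanguage(german, "en"));
        System.out.println(effectiveLanguage(unmarked, "en"));
    }
}
```

With this rule, documents that never set the feature would behave exactly as they do today, since every div would resolve to the document language.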

maxxkia avatar Sep 20 '16 12:09 maxxkia

As far as I remember, Div does not have any capability of storing values, such as language. MetaDataStringField might be more suitable, as it allows you to store arbitrary key-value pairs, e.g. lang:en.

carschno avatar Sep 20 '16 15:09 carschno

@carschno well, the idea is that we add the capability of storing language information to the div.

I consider the MetaDataStringField to be used for metadata affecting the whole document, not only sections of it.

reckart avatar Sep 20 '16 15:09 reckart

One question is whether components should be aware of divs with a language, and how they should handle them. E.g. should a POS tagger use an English model on text in an "en" div and a German model on text in a "de" div? Or should we introduce some CAS multiplier that splits up a document into multiple CASes (one per div-with-language), and then run separate POS taggers on them?
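
As a rough illustration of the routing both options need, the plain-Java sketch below groups divs by their language code and looks up one model per group. It does not use UIMA at all: the Div class, the model names, and the assumption that every div carries a language are hypothetical and only there to show the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerLanguageRoutingSketch {

    /** Hypothetical stand-in for a Div annotation with a language feature. */
    public static class Div {
        public final int begin;
        public final int end;
        public final String language;

        public Div(int begin, int end, String language) {
            this.begin = begin;
            this.end = end;
            this.language = language;
        }
    }

    /** Groups divs by language code, keeping the first-seen order of languages. */
    public static Map<String, List<Div>> groupByLanguage(List<Div> divs) {
        Map<String, List<Div>> groups = new LinkedHashMap<>();
        for (Div div : divs) {
            groups.computeIfAbsent(div.language, k -> new ArrayList<>()).add(div);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical model names, one per language code.
        Map<String, String> models = new LinkedHashMap<>();
        models.put("en", "english-pos-model");
        models.put("de", "german-pos-model");

        List<Div> divs = Arrays.asList(
                new Div(0, 50, "en"),
                new Div(51, 120, "de"),
                new Div(121, 160, "en"));

        // Each group could be tagged in place or split off into its own CAS.
        for (Map.Entry<String, List<Div>> entry : groupByLanguage(divs).entrySet()) {
            String model = models.get(entry.getKey());
            System.out.println(entry.getKey() + ": " + entry.getValue().size()
                    + " div(s), model " + model);
        }
    }
}
```

Whether the groups are then processed by a language-aware tagger or handed to separate pipelines is exactly the open design question; the grouping step would be the same either way.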

reckart avatar Sep 20 '16 19:09 reckart