dkpro-cassis

High level helper functions for extracting typesystem independent annotations

Open aggarwalpiush opened this issue 6 years ago • 12 comments

Hi,

We are trying to extract the tokens' text strings and POS tags from CAS objects. Also, different type systems return POS tags in different formats. @zesch, please correct me if I am wrong. It would be great to have some helper functions (a few are shown in the examples below) that could cover these needs.

For example, for a given CAS object:

  1. To return all the token texts:
cas.get_token_strings() or cas.select(TOKEN).as_text()
  2. To return all POS tags in PTB tag format:
cas.select(TOKEN).get_pos_tags(format='ptb')

We hope to see these helper functions as part of this API.

Thanks!!

aggarwalpiush avatar Nov 05 '19 17:11 aggarwalpiush

Hi @aggarwalpiush :)

The issue is that cassis is meant to be a generic, type-system-agnostic library, i.e. it should support any UIMA type system. In fact, we have users who use e.g. the cTAKES type system and may not work with DKPro Core at all. So we would need some way of

@jcklie and I have been throwing around a number of ideas, e.g.

  • passing a strategy to the CAS constructor which would monkey-patch the CAS instance and add convenience methods: cas = CAS(DKPro_Core); cas.get_tokens() - but there would be no IDE auto-completion support
  • using some kind of generic typing, e.g. cas = CAS(); cas.$(DKPro_Core).get_tokens() - where $ would be a method returning the type passed to it as an argument; but apparently Python doesn't support this kind of trick (Java does) and there would be no auto-completion in the IDE
  • an extension mechanism like the one Pandas has; but again no auto-complete support
  • simply using static functions: import dkpro_core.accessors; get_tokens(cas) - at least some IDE auto-complete support, but not necessarily a nice API
  • subclassing the CAS: cas = DKProCoreCAS() - has IDE auto-complete support, but honestly I don't like it because IMHO it doesn't separate concerns sufficiently. E.g. what if you want to use a CAS object with different type systems, e.g. DKPro Core plus your own type system? Nah...
  • wrapping the CAS with an accessor which implements the same interface as the CAS: cas = DKPro_Core(CAS()); cas.get_tokens() - has IDE auto-completion support and also you could wrap the same CAS object with different accessors if you wanted to work with multiple type systems

... so the wrapper approach seems to us the most promising one for the moment. Also, cassis doesn't need to be extended to support it.
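
A rough sketch of the wrapper idea, just to make it concrete. FakeCas and FakeAnnotation are hypothetical stand-ins so that the snippet runs on its own; DKProCoreAccessor, get_tokens() and get_token_strings() are illustrative names, not an existing cassis API. It assumes only that the wrapped object offers select(type_name) and get_covered_text(annotation).

```python
from dataclasses import dataclass
from typing import Iterator, List

# Fully qualified name of the DKPro Core token type.
TOKEN = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"


@dataclass
class FakeAnnotation:
    """Hypothetical stand-in for an annotation with a type and offsets."""
    type_name: str
    begin: int
    end: int


class FakeCas:
    """Hypothetical stand-in for a CAS: document text plus annotations."""

    def __init__(self, sofa_string: str) -> None:
        self.sofa_string = sofa_string
        self._annotations: List[FakeAnnotation] = []

    def add(self, annotation: FakeAnnotation) -> None:
        self._annotations.append(annotation)

    def select(self, type_name: str) -> Iterator[FakeAnnotation]:
        # Lazy, like the generator mentioned in this thread.
        return (a for a in self._annotations if a.type_name == type_name)

    def get_covered_text(self, annotation: FakeAnnotation) -> str:
        return self.sofa_string[annotation.begin:annotation.end]


class DKProCoreAccessor:
    """Wraps a CAS and adds DKPro-Core-specific convenience methods."""

    def __init__(self, cas) -> None:
        self._cas = cas

    # Explicit delegation keeps the generic methods visible to the IDE.
    def select(self, type_name: str):
        return self._cas.select(type_name)

    def get_covered_text(self, annotation) -> str:
        return self._cas.get_covered_text(annotation)

    # Type-system-specific helpers live only in the wrapper.
    def get_tokens(self) -> list:
        return list(self._cas.select(TOKEN))

    def get_token_strings(self) -> List[str]:
        return [self._cas.get_covered_text(t) for t in self.get_tokens()]


raw_cas = FakeCas("Hello world")
raw_cas.add(FakeAnnotation(TOKEN, 0, 5))
raw_cas.add(FakeAnnotation(TOKEN, 6, 11))

cas = DKProCoreAccessor(raw_cas)
token_strings = cas.get_token_strings()
```

The CAS itself stays untouched, so the same raw_cas could be wrapped by a second accessor for another type system.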

That said ...

cas.select(TOKEN).as_text()

This is something which I think would be really nice to have.

reckart avatar Nov 05 '19 18:11 reckart

Wouldn't that also be type system specific?

cas.select(TOKEN).as_text() # token.getCoveredText()
cas.select(LEMMA).as_text() # lemma.getValue()

zesch avatar Nov 05 '19 19:11 zesch

If we imagine TOKEN and LEMMA to be type name string constants - no.

reckart avatar Nov 05 '19 19:11 reckart

How would cassis know what feature to use for as_text()?

jcklie avatar Nov 05 '19 19:11 jcklie

In Python, one would normally just use a list comprehension for that, e.g.

values = [x.value for x in cas.select(LEMMA)]

jcklie avatar Nov 05 '19 19:11 jcklie

For as_text(), we would use get_covered_text(), not a feature value.
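
A rough sketch of that behaviour, written as a free function because as_text() does not exist yet. The annotations are hypothetical stand-ins built with SimpleNamespace, and the function assumes nothing beyond begin/end offsets into the document text.

```python
from types import SimpleNamespace


def as_text(annotations, sofa_string):
    """Return the covered text of each annotation, ignoring its features."""
    return [sofa_string[a.begin:a.end] for a in annotations]


text = "Dogs bark"

# Stand-in token annotations: offsets only.
tokens = [
    SimpleNamespace(begin=0, end=4),
    SimpleNamespace(begin=5, end=9),
]

# Stand-in POS annotations: same offsets plus a PosValue feature.
pos_tags = [
    SimpleNamespace(begin=0, end=4, PosValue="NNS"),
    SimpleNamespace(begin=5, end=9, PosValue="VBP"),
]

token_texts = as_text(tokens, text)
pos_texts = as_text(pos_tags, text)
```

Since only the offsets are read, both calls give the covered text and the PosValue feature is never consulted.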

reckart avatar Nov 05 '19 20:11 reckart

This would somewhat diminish the usefulness, as many types beyond token would not return useful results. If we use an accessor, couldn't it decide to return different feature values depending on the type?

zesch avatar Nov 05 '19 20:11 zesch

It probably could, but it could be confusing. E.g. if as_text() returned the covered text for tokens but, say, the entity type for entities, I would find that confusing. How would I get the covered text of an entity? If you wanted to introduce a convenience accessor for "the most commonly used feature value", I would find it sensible for it to have a different name, e.g. as_value() - this could e.g. return the "value" feature for named entities (instead of the "identifier" feature) or the "PosValue" feature for POS tags (instead of the "CoarseValue" feature).
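
A rough sketch of how such an as_value() could pick the feature per type. The function, the DEFAULT_VALUE_FEATURE mapping and the sample feature values are all hypothetical; only the feature names come from the paragraph above, and the annotations are SimpleNamespace stand-ins.

```python
from types import SimpleNamespace

NAMED_ENTITY = "de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity"
POS = "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS"

# Hypothetical mapping: type name -> "most commonly used" feature.
DEFAULT_VALUE_FEATURE = {
    NAMED_ENTITY: "value",
    POS: "PosValue",
}


def as_value(annotations, type_name, feature_map=DEFAULT_VALUE_FEATURE):
    """Return the default feature value of each annotation of a type."""
    feature = feature_map[type_name]  # KeyError for unmapped types
    return [getattr(a, feature) for a in annotations]


# Stand-in annotations carrying both the default and the other feature.
entities = [SimpleNamespace(value="PER", identifier="some-id")]
pos_tags = [SimpleNamespace(PosValue="NNS", CoarseValue="NOUN")]

entity_values = as_value(entities, NAMED_ENTITY)
pos_values = as_value(pos_tags, POS)
```

Because the mapping is explicit, a type without an entry fails loudly instead of silently falling back to the covered text.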

reckart avatar Nov 05 '19 20:11 reckart

  1. There should be a way to access feature values of annotations.
  2. I would find it confusing if cas.select(TOKEN).as_text() and cas.select(POS).as_text() returned the same values (as they would now, right?)

zesch avatar Nov 05 '19 21:11 zesch

There is a way to access feature values, e.g. as @jcklie illustrated:

values = [x.value for x in cas.select(LEMMA)]

x.value reads the feature value on the feature structure x. You can also write to the feature: x.value = "value".

Right now, as_text() does not exist. cas.select(XXX) returns a "Generator", i.e. not a list - so evaluation is lazy. That is why we currently cannot easily add methods to it - we also cannot easily figure out whether the result is non-empty. We have been looking e.g. at https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.peekable and have considered returning a list ... no final decision for the time being. I think it would be good if cas.select(xxx) returned something we can define methods on - some kind of lazily evaluated iterable maybe, to eventually allow mirroring the UIMAv3 select API - or at least do a Pythonista version of it.
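
A rough sketch of what such a lazily evaluated result could look like, using only the standard library. LazySelection, is_empty() and as_text() are hypothetical names, not something cassis provides; the peeking is done with itertools.chain here rather than more_itertools.peekable, and the annotations are SimpleNamespace stand-ins.

```python
from itertools import chain
from types import SimpleNamespace


class LazySelection:
    """Lazy iterable over annotations that can also carry methods."""

    def __init__(self, annotations, sofa_string):
        self._iterator = iter(annotations)
        self._sofa_string = sofa_string

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._iterator)

    def is_empty(self):
        """Peek at the first element without losing it."""
        try:
            first = next(self._iterator)
        except StopIteration:
            return True
        # Put the peeked element back in front of the rest.
        self._iterator = chain([first], self._iterator)
        return False

    def as_text(self):
        """Consume the selection and return the covered texts."""
        return [self._sofa_string[a.begin:a.end] for a in self]


pulled = []  # records which annotations were actually produced


def fake_select():
    """Stand-in for the generator returned by cas.select(...)."""
    for begin, end in [(0, 5), (6, 11)]:
        pulled.append(begin)
        yield SimpleNamespace(begin=begin, end=end)


selection = LazySelection(fake_select(), "Hello world")
pulled_before_peek = list(pulled)
empty = selection.is_empty()
pulled_after_peek = list(pulled)
texts = selection.as_text()
```

Like the underlying generator, a selection of this kind can be iterated only once.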

reckart avatar Nov 05 '19 21:11 reckart

I will track the extension mechanism in #83 and the extension methods you want here so that we do not mix up the issues.

jcklie avatar Nov 05 '19 21:11 jcklie