python-ucto

Accessing hyphenated tokens at the end of a paragraph

Open pirolen opened this issue 2 years ago • 2 comments

Hyphens are a source of some more problems in certain types of documents: e.g. tokens at the end of a paragraph that end with a hyphen are not valid tokens, nor are their continuations in the subsequent paragraph.

I am exploring various ways to identify such token fragments with the FoLiA ecosystem and encountered the following:

  • One can access the last token with some basic scripting from the tokenizer, but the token's hyphenation information is not (that trivially) accessible. Or is it? Do you think a corresponding method on ucto.Token instances would help in this use case, i.e. a token.isendofparagraph() method (maybe with some keyword arguments)? Or a method analogous to token.nospace(), e.g. token.ishyphenated()? Or am I overlooking some simple way to achieve what I want?

"In addition to the low-level process() method, the tokenizer can also read an input file and produce an output file, in the same fashion as ucto itself does when invoked from the command line."

One would need to be able to pass the --textclass argument as well, at least I would like to.

"Text is passed to the tokeniser using the process() method, this method returns the number of tokens rather than the tokens itself."

I tried this out, but the process() method seems to return None in all cases.

pirolen, Mar 14 '23 21:03

One can access the last token with some basic scripting from the tokenizer, but the token's hyphenation information is not (that trivially) accessible. Or is it?

I think a hyphen is currently simply included in the token as a suffix, or it gets caught by another rule and ends up as a separate token. In either case, you just need to inspect the string contents of the token (str(token)). There is no token.ishyphenated(); that interpretation is done at the level of the conversion to FoLiA.

i.e. a token.isendofparagraph() method

Check the token.isnewparagraph() method on the next token (or the case in which there is no next token at all).
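A minimal sketch of that lookahead, assuming only that each token offers isnewparagraph() and str() as described above. The helper name with_paragraph_end and the StubToken class are hypothetical and exist only so the example runs without ucto installed:

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def with_paragraph_end(tokens: Iterable[T]) -> Iterator[Tuple[T, bool]]:
    """Yield (token, is_last_in_paragraph) pairs using one-token lookahead."""
    previous = None
    have_previous = False
    for token in tokens:
        if have_previous:
            # The previous token closes a paragraph if this one opens a new one.
            yield previous, bool(token.isnewparagraph())
        previous, have_previous = token, True
    if have_previous:
        # No next token at all: the final token also ends a paragraph.
        yield previous, True


class StubToken:
    """Hypothetical stand-in for ucto.Token (not part of python-ucto)."""

    def __init__(self, text, newparagraph=False):
        self.text = text
        self._newparagraph = newparagraph

    def isnewparagraph(self):
        return self._newparagraph

    def __str__(self):
        return self.text


stream = [
    StubToken("A", newparagraph=True),
    StubToken("word"),
    StubToken("Next", newparagraph=True),
    StubToken("one"),
]
paragraph_final = [str(t) for t, last in with_paragraph_end(stream) if last]
print(paragraph_final)  # ['word', 'one']
```

With a real tokenizer, the same helper should accept the Tokenizer instance itself once text has been processed, since iterating over it yields the tokens.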

One would need to be able to pass the --textclass argument as well, at least I would like to.

Ah! That's a good one, it currently is not exposed via the Python API. I'll make a separate issue for it: #15

I tried this out, but the process() method seems to return None in all cases.

It stores things in an internal buffer which you can access by iterating over the Tokenizer instance.
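The process-then-iterate pattern can be sketched as follows. The function tokenize_to_list and the StubTokenizer class are hypothetical; the stub only mimics the buffering behaviour described here so the example runs without ucto, and a real ucto.Tokenizer would take its place:

```python
def tokenize_to_list(tokenizer, text):
    """Feed text to the tokenizer, then read the tokens back by iterating over it."""
    # The return value of process() is ignored; the tokens live in the buffer.
    tokenizer.process(text)
    return [str(token) for token in tokenizer]


class StubTokenizer:
    """Hypothetical whitespace tokenizer that buffers tokens like the real one is said to."""

    def __init__(self):
        self._buffer = []

    def process(self, text):
        # Store the tokens internally instead of returning them.
        self._buffer.extend(text.split())

    def __iter__(self):
        # Drain the internal buffer while iterating.
        while self._buffer:
            yield self._buffer.pop(0)


tokens = tokenize_to_list(StubTokenizer(), "Hello brave new world")
print(tokens)  # ['Hello', 'brave', 'new', 'world']
```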

proycon, Apr 03 '23 11:04

One can access the last token with some basic scripting from the tokenizer, but the token's hyphenation information is not (that trivially) accessible. Or is it?

I think a hyphen is currently simply included in the token as a suffix, or it gets caught by another rule and ends up as a separate token. In either case, you just need to inspect the string contents of the token (str(token)).

The hyphen gets stripped from the text, so it is no longer there in str(token). Such a token is of type WORD and token.nospace() is not set for it, but an enumeration counter shows that it is the last token in e.g. a paragraph. (I call python-ucto at paragraph level.)

pirolen, Apr 03 '23 21:04