wiktextract

English forms inside head_templates not properly parsed

Open seth-js opened this issue 2 years ago • 3 comments

The forms array for the term "big, fat, hairy deal" should be:

[
  { form: 'big, fat, hairy deals', tags: [ 'plural' ] }
]

Instead it looks like this:

[
  { form: 'big', tags: [ 'plural' ] },
  { form: 'fat', tags: [ 'plural' ] }
]

The term "speak to" has the forms:

[
  {
    form: 'speaks to',
    tags: [ 'present', 'singular', 'third-person' ]
  },
  { form: 'speaking to', tags: [ 'participle', 'present' ] },
  { form: 'spoke to', tags: [ 'past' ] },
  { form: 'to', tags: [ 'colloquial', 'participle', 'past' ] }
]

It should be:

[
  {
    form: 'speaks to',
    tags: [ 'present', 'singular', 'third-person' ]
  },
  { form: 'speaking to', tags: [ 'participle', 'present' ] },
  { form: 'spoke to', tags: [ 'past' ] },
  { form: 'spoken to', tags: [ 'participle', 'past' ] }
]

I also don't know where it's getting the colloquial tag since I don't see it on Wiktionary.

seth-js, Oct 20 '23 00:10

At a guess, the first one breaks because of the commas in the term, which is a bit obvious but bears saying anyhow. For the second, "spoken" gets parsed as a tag ("spoken" -> ["colloquial"]), which is where the colloquial tag comes from. Both are probably going to be really annoying edge cases that involve delving into some spectacularly tricky bits of code, so unless someone else wants to take a look, I'm leaving this on the back burner for a bit.
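To make the two failure modes concrete, here is a minimal Python sketch. It is not wiktextract's actual code: the function names and the `HYPOTHETICAL_TAG_WORDS` table are made up for illustration, and only the "spoken" -> ["colloquial"] mapping comes from the description above. It shows the same kind of mis-parse, not necessarily the exact output reported in this issue.

```python
# Illustrative only: NOT wiktextract's real implementation.
# Hypothetical word-to-tags table; the single entry mirrors the
# "spoken" -> ["colloquial"] mapping described in the comment above.
HYPOTHETICAL_TAG_WORDS = {"spoken": ["colloquial"]}


def naive_split_forms(text):
    """Split a list of forms on commas, unaware that a form may contain commas."""
    return [part.strip() for part in text.split(",") if part.strip()]


def naive_parse_form(text):
    """Consume leading words that look like tag labels and treat them as tags."""
    words = text.split()
    tags = []
    while words and words[0] in HYPOTHETICAL_TAG_WORDS:
        tags.extend(HYPOTHETICAL_TAG_WORDS[words.pop(0)])
    return {"form": " ".join(words), "tags": tags}


# A term containing commas is cut into pieces.
print(naive_split_forms("big, fat, hairy deals"))
# ['big', 'fat', 'hairy deals']

# "spoken" is swallowed as a tag, leaving only "to" as the form.
print(naive_parse_form("spoken to"))
# {'form': 'to', 'tags': ['colloquial']}
```

In both cases the parser has no way to know that the text it is looking at is part of the headword itself, which is why a fix needs some knowledge of the page title or of phrases to protect.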

kristian-clausal, Oct 20 '23 05:10

No problem.

seth-js, Oct 20 '23 06:10

It seems the "forms" data is added by the parse_word_head() function. That complex function processes the expanded plain text; it might be easier to work with if it processed HTML nodes instead.

xxyzz, Oct 20 '23 09:10

This has been a long time coming, but with the addition of a kludge that lets split_at_semi_comma() skip given words and phrases, I've made a commit that should fix this.
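The general idea of skipping given phrases while splitting can be sketched roughly as follows. This is an assumption-laden illustration, not the committed code: `split_skipping` and its `skipped` parameter are hypothetical names, and the real function's signature and behavior may differ.

```python
import re


def split_skipping(text, skipped=()):
    """Split at commas and semicolons, except inside any phrase in `skipped`.

    Hypothetical helper for illustration; not wiktextract's API.
    """
    # Try longer protected phrases first so they win over shorter prefixes.
    protected = sorted(skipped, key=len, reverse=True)
    # Protected phrases come before the separator class in the alternation,
    # so a protected phrase is matched whole before its commas are seen.
    alternatives = [re.escape(p) for p in protected] + [r"[,;]"]
    pattern = re.compile("|".join(alternatives))

    parts = []
    current = []
    pos = 0
    for m in pattern.finditer(text):
        current.append(text[pos:m.start()])
        if m.group(0) in (",", ";"):
            # A real separator: close the current part.
            parts.append("".join(current).strip())
            current = []
        else:
            # A protected phrase: keep it verbatim, commas included.
            current.append(m.group(0))
        pos = m.end()
    current.append(text[pos:])
    parts.append("".join(current).strip())
    return [p for p in parts if p]


print(split_skipping("big, fat, hairy deals"))
# ['big', 'fat', 'hairy deals']
print(split_skipping("big, fat, hairy deals; other", skipped=["big, fat, hairy deals"]))
# ['big, fat, hairy deals', 'other']
```

One limitation of this sketch is that a protected phrase is matched as a plain substring, so it would also be protected when it happens to appear inside a longer word.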

kristian-clausal, Apr 11 '24 10:04