
Server crashes with StringIndexOutOfBoundsException when processing Macedonian (mk) templates using 'Шаблон:' namespace

Open haniyakonain opened this issue 3 months ago • 7 comments

Describe the bug

Server crashes during startup with StringIndexOutOfBoundsException when processing Macedonian (mk) language template redirects. Many Macedonian Wikipedia templates use 'Шаблон:' (Cyrillic for "Template") instead of the expected 'Предлошка:' namespace prefix, causing the substring operation to fail with index -1.

The crash occurs at:

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(String.java:1931)
        at org.dbpedia.extraction.server.stats.MappingStatsHolder$$anonfun$1.apply(MappingStatsHolder.scala:54)

More than 100 mk templates are affected, including:

  • Шаблон:Инфокутија Верски објект
  • Шаблон:2TeamBracket
  • Шаблон:Инфокутија Православна црква
  • Шаблон:Оклопно возило
  • And many more...

Expected behaviour

The server should either:

  1. Handle alternative namespace prefixes for mk language (recognizing both 'Предлошка:' and 'Шаблон:')
  2. Log warnings and skip invalid templates without crashing (PR #795 provides a temporary fix for this)
  3. Successfully start and process mk language templates without throwing exceptions

Environment

  • Extraction: (commit hash): 5eb208b932a63a6f0cd5cbede3e446315686e6a7 (enable-wikidata-server branch)
  • OS: Linux 6.14.0-33-generic (Ubuntu)
  • Java SDK Version (java --version): 1.8.0_462 (OpenJDK)
  • Maven version (mvn --version): Apache Maven 3.8.7

To reproduce

  1. Enable Macedonian (mk) language in server.default.properties with @mappings or any configuration
  2. Start the DBpedia extraction server: cd server && ../run server
  3. Server attempts to load mk template statistics
  4. Server crashes with StringIndexOutOfBoundsException during MappingStatsHolder initialization

Additional context & logs

The root cause is in MappingStatsHolder.scala:54 where the code expects all templates to start with 'Предлошка:' but many mk templates use 'Шаблон:'.
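The code at line 54 is not quoted in this thread, so the following is only a sketch of the general failure class, not the actual implementation. It assumes a hypothetical helper that strips a fixed prefix without checking for it, and contrasts it with a checked variant that returns an Option. The object and method names are invented for illustration.

```scala
object UncheckedPrefixSketch {
  // The prefix the issue says the code expects for mk templates.
  val expectedPrefix = "Предлошка:"

  // Hypothetical unchecked stripping: assumes the prefix is always present.
  // For a title shorter than the prefix this throws
  // StringIndexOutOfBoundsException; for a longer title with a different
  // prefix it silently returns a wrongly truncated name.
  def stripUnchecked(title: String): String =
    title.substring(expectedPrefix.length)

  // Checked variant: None signals "prefix missing" instead of throwing.
  def stripChecked(title: String): Option[String] =
    if (title.startsWith(expectedPrefix)) Some(title.substring(expectedPrefix.length))
    else None
}
```

With the checked variant, a title such as 'Шаблон:2TeamBracket' yields None, so the caller can decide whether to log, skip, or try another prefix.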

Related:

  • PR #795 fixes the crash by adding validation but doesn't address the namespace mismatch
  • This may require updating the mk language configuration to recognize 'Шаблон:' as a valid template namespace

Full stack trace:

org.dbpedia.extraction.server.stats.MappingStatsHolder$$anonfun$apply$2 apply
WARNING: mk template 'Шаблон:Инфокутија Верски објект' does not start with 'Предлошка:'
[... 100+ similar warnings ...]

java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at scala_maven_executions.MainHelper.runMain(MainHelper.java:164)
        at scala_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(String.java:1931)
        at org.dbpedia.extraction.server.stats.MappingStatsHolder$$anonfun$1.apply(MappingStatsHolder.scala:54)
        at org.dbpedia.extraction.server.stats.MappingStatsHolder$$anonfun$1.apply(MappingStatsHolder.scala:54)
        at scala.collection.MapLike$FilteredKeys$$anonfun$foreach$1.apply(MapLike.scala:231)
        [...]

haniyakonain avatar Oct 31 '25 21:10 haniyakonain

Thanks for the report — this reproduces locally. The root cause is that MappingStatsHolder assumed templates start with 'Предлошка:'; many mk templates use 'Шаблон:' so substring(...) fails.

I’m preparing a small fix that:

  • accepts both 'Предлошка:' and 'Шаблон:' for mk,
  • logs and skips unexpected templates instead of crashing,
  • adds tests.

If maintainers prefer a config-driven approach, I can read valid template prefixes from the language config instead of hardcoding. Please assign this to me if you want me to proceed with a PR.
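A minimal sketch of the hardcoded variant described above, assuming the raw statistics arrive as a map from full template title to a count. That shape, and the names `MkTemplateFilterSketch`, `stripAnyPrefix`, and `cleanStats`, are assumptions made for illustration and are not taken from MappingStatsHolder.scala. Logging uses java.util.logging, which matches the WARNING format seen in the stack trace.

```scala
import java.util.logging.Logger

object MkTemplateFilterSketch {
  private val logger = Logger.getLogger(getClass.getName)

  // Hardcoded for illustration; both prefixes come from the issue text.
  val mkPrefixes: Seq[String] = Seq("Предлошка:", "Шаблон:")

  // Returns the title without its prefix, or None if no known prefix matches.
  def stripAnyPrefix(title: String, prefixes: Seq[String]): Option[String] =
    prefixes.find(p => title.startsWith(p)).map(p => title.substring(p.length))

  // Keeps entries with a recognised prefix; logs and skips everything else.
  def cleanStats(raw: Map[String, Int]): Map[String, Int] =
    raw.toSeq.flatMap { case (title, count) =>
      stripAnyPrefix(title, mkPrefixes) match {
        case Some(name) => List(name -> count)
        case None =>
          logger.warning(s"mk template '$title' has no known template prefix; skipping")
          Nil
      }
    }.toMap
}
```

If the same template name appears under both prefixes, `toMap` keeps only one of the two entries. A real fix would need to decide whether such counts should be merged.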

arnavsharma990 avatar Nov 27 '25 15:11 arnavsharma990

Greetings @haniyakonain, I am interested in working on this issue. Could you please assign it to me? Thank you.

DhanashreePetare avatar Dec 18 '25 14:12 DhanashreePetare

Thanks for the interest! @arnavsharma990 and @DhanashreePetare, feel free to coordinate on this. @arnavsharma990's approach of accepting both 'Предлошка:' and 'Шаблон:' for mk with proper logging sounds good. Looking forward to seeing a PR from whoever gets to it first!

haniyakonain avatar Dec 21 '25 07:12 haniyakonain

Hello @haniyakonain. I took the following approach to solve the issue:

  1. Query ALL valid template prefixes from the Namespaces configuration
  2. Match templates against ANY valid prefix instead of just one
  3. Apply the same logic to redirect processing

This prevents the Macedonian crashes (‘Предлошка:’/‘Шаблон:’) and stays compatible with languages that have a single prefix. I am requesting a review of the PR. Please let me know if any further changes are needed. Thank you.
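The three steps above could look roughly like the sketch below. The PR's code is not shown in this thread, so `namespaceNames` is a hypothetical stand-in for the framework's Namespaces configuration and all method names are invented. The only outside fact relied on is that MediaWiki assigns the Template namespace the numeric ID 10.

```scala
object NamespacePrefixSketch {
  // MediaWiki's numeric ID for the Template namespace.
  val TemplateNamespaceCode = 10

  // Hypothetical stand-in for the namespace configuration:
  // language code -> (namespace name -> numeric namespace ID). Sample data only.
  val namespaceNames: Map[String, Map[String, Int]] = Map(
    "mk" -> Map("Предлошка" -> 10, "Шаблон" -> 10),
    "en" -> Map("Template" -> 10, "Category" -> 14)
  )

  // Step 1: collect every name registered for the Template namespace.
  def templatePrefixes(lang: String): Seq[String] =
    namespaceNames.getOrElse(lang, Map.empty[String, Int])
      .collect { case (name, code) if code == TemplateNamespaceCode => name + ":" }
      .toSeq

  // Step 2: match a title against any valid prefix.
  def stripTemplatePrefix(lang: String, title: String): Option[String] =
    templatePrefixes(lang).find(p => title.startsWith(p)).map(p => title.substring(p.length))

  // Step 3: apply the same logic to both sides of each redirect,
  // dropping pairs where either side lacks a valid prefix.
  def cleanRedirects(lang: String, redirects: Map[String, String]): Map[String, String] =
    redirects.toSeq.flatMap { case (from, to) =>
      (for {
        f <- stripTemplatePrefix(lang, from)
        t <- stripTemplatePrefix(lang, to)
      } yield f -> t).toList
    }.toMap
}
```

Because the prefixes are derived from configuration, a language with a single template name behaves exactly as before, while mk picks up both names without a language-specific branch.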

DhanashreePetare avatar Dec 21 '25 21:12 DhanashreePetare

Hi @DhanashreePetare, Thanks for working on this! I noticed the PR includes many unintended file deletions (1,900+ files). Could you please update it to include only the changes to MappingStatsHolder.scala? Thanks!

haniyakonain avatar Dec 22 '25 17:12 haniyakonain

Thank you for the feedback! I tried to fix this. The deleted files had been pulled into my working copy, and my first PR included them. For the second PR I committed only the one changed file on a clean branch, but when I try to resolve conflicts I still hit file path conflicts caused by Windows limitations with some test files (colons in filenames such as Lexeme:L11/wiki.xml.bz2).

My actual fix is very clean: just the changes to MappingStatsHolder.scala. Is there any way around this? Would you be able to cherry-pick commit cd3c26b03 from the fix/issue-804-clean branch? It contains only the fix without any other changes.

Alternatively, if you can suggest a fix for this, that would be great.

DhanashreePetare avatar Dec 22 '25 21:12 DhanashreePetare

@DhanashreePetare, thanks for working on this.

The Windows path limitation with : filenames is causing the conflicts, and you’ll need to resolve that on your side (e.g. via WSL/Linux or a clean branch with only the intended change).

For now, I’m merging a temporary fix to prevent the mk crash so the server stays operational. Proper mk namespace handling (Предлошка: / Шаблон:) is still needed, and I’ll be happy to review a clean PR for that later.

haniyakonain avatar Dec 23 '25 11:12 haniyakonain

@haniyakonain and @jimkont, I will resolve the deleted-files issue by using WSL and submit a PR with the changes shortly.

DhanashreePetare avatar Jan 21 '26 11:01 DhanashreePetare