extraction-framework

Patch WikiInfo.scala

Open ghost-2362003 opened this issue 3 months ago • 10 comments

Updated the fromLines() and fromLine() methods of WikiInfo.scala for proper parsing of wikipedias.csv.

Summary by CodeRabbit

  • Chores

    • Added CSV processing library to project dependencies
  • Bug Fixes

    • Improved CSV data parsing with enhanced field validation and error handling for invalid entries

ghost-2362003 avatar Nov 02 '25 07:11 ghost-2362003

📝 Walkthrough

A Maven dependency on the scala-csv library is added to the project. WikiInfo.scala refactors CSV parsing from naive string splitting to proper CSV reader-based parsing with explicit field validation and language code verification, replacing exceptions with warning logs and None returns for invalid input.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Maven Dependency: core/pom.xml | Added scala-csv library dependency (com.github.tototoshi:scala-csv_2.11:1.3.10) to support CSV parsing functionality |
| CSV Parsing Refactor: core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala | Migrated fromLines and fromLine methods from naive string splitting to CSVReader-based parsing; added field count validation (≥15 fields); added language code validation; replaced exception throwing with warning logs and None returns for invalid input; ensured reader resource cleanup in finally block |

Sequence Diagram(s)

sequenceDiagram
    participant Input as Input Line(s)
    participant Parser as CSVReader
    participant Validator as Field Validator
    participant LangCheck as Language Validator
    participant Output as Result

    Input->>Parser: Parse CSV
    Parser->>Validator: Extract fields
    Validator->>Validator: Check field count ≥ 15
    alt Field count valid
        Validator->>LangCheck: Extract & validate language code
        alt Language code valid
            LangCheck->>Output: Return WikiInfo(Some)
        else Language code invalid
            LangCheck->>Output: Log warning, Return None
        end
    else Field count invalid
        Validator->>Output: Log warning, Return None
    end
    Output-->>Input: Result
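
The validation flow in the diagram can be sketched roughly as follows. This is a minimal illustration that operates on an already-parsed row, not the code from the pull request: the object name `WikiInfoSketch`, the field indices, the language code pattern, and the fallback for a non-numeric page count are all assumptions made for the example.

```scala
// Sketch only: names, field indices and the regex are illustrative
// assumptions, not the actual WikiInfo.scala implementation.
object WikiInfoSketch {
  final case class WikiInfo(wikicode: String, pages: Int)

  // Assumed positions of the language code and page count in a row.
  val LangIndex = 2
  val PagesIndex = 4
  val MinFields = 15

  // Assumed shape of a language code, e.g. "en" or "zh-yue".
  private val LangCode = "[a-z][a-z0-9-]+".r

  def fromFields(fields: Seq[String]): Option[WikiInfo] =
    if (fields.length < MinFields) {
      // Too few fields: warn and skip the row instead of throwing.
      Console.err.println(s"WARN: expected at least $MinFields fields, found ${fields.length}")
      None
    } else if (LangCode.pattern.matcher(fields(LangIndex)).matches) {
      // Assumption: a non-numeric page count is treated as 0.
      val pages = scala.util.Try(fields(PagesIndex).toInt).getOrElse(0)
      Some(WikiInfo(fields(LangIndex), pages))
    } else {
      // Invalid language code: warn and skip the row.
      Console.err.println(s"WARN: invalid language code '${fields(LangIndex)}'")
      None
    }
}
```

Returning `Option` lets a caller drop bad rows (for example with `flatMap`) while still processing the rest of the file, which matches the "log warning, return None" branches in the diagram.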

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • CSV parsing correctness: Verify that CSVReader properly handles edge cases, field extraction, and line joining logic in fromLines
  • Validation logic: Ensure field count (≥15) and language code validation are consistent and correct
  • Error handling strategy: Confirm that warning logs and None returns are appropriate fallback behavior instead of exceptions
  • Resource management: Verify CSVReader is properly closed via finally block in fromLine
  • Dependency compatibility: Confirm scala-csv 1.3.10 is compatible with the Scala 2.11 version used in the project
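
For the CSV parsing correctness point, the difference between naive splitting and quote-aware parsing can be illustrated with a small standard-library sketch. The hand-written `quotedSplit` below is a hypothetical stand-in for a real CSV reader, not the scala-csv API, and it handles only a single line.

```scala
// Sketch only: a hand-rolled stand-in for a CSV reader, used to show
// why splitting on "," is not enough. Not the scala-csv implementation.
object CsvSplitSketch {
  // Naive approach: breaks quoted fields that contain commas.
  def naiveSplit(line: String): List[String] = line.split(",", -1).toList

  // Minimal quote-aware splitter for one line (doubled quotes escape a quote).
  def quotedSplit(line: String): List[String] = {
    val fields = scala.collection.mutable.ListBuffer.empty[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
          current += '"'; i += 1 // escaped quote inside a quoted field
        } else if (c == '"') inQuotes = false
        else current += c
      } else c match {
        case '"' => inQuotes = true
        case ',' => fields += current.toString; current.clear()
        case _   => current += c
      }
      i += 1
    }
    fields += current.toString
    fields.toList
  }
}
```

On a line such as `1,"Hello, World",en`, `naiveSplit` yields four fields while `quotedSplit` yields three, so any index-based field access after a naive split would read the wrong column.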

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Title Check | ❓ Inconclusive | The pull request title "Patch WikiInfo.scala" is vague and uses non-descriptive language that fails to convey the actual nature of the changes. While the title correctly identifies WikiInfo.scala as the modified file, it employs the generic term "Patch" without explaining what the patch accomplishes. According to the PR objectives, the main goal is to update the CSV parsing methods to enable proper parsing of wikipedias.csv, which is a meaningful change that should be reflected in the title. The current title does not communicate this purpose to reviewers scanning the repository history. | Consider revising the title to be more specific and descriptive, such as "Add CSV parsing support to WikiInfo.scala" or "Implement proper CSV parsing for wikipedias.csv in WikiInfo.scala". This would clearly communicate the main objective of the pull request and help reviewers understand the changeset at a glance without needing to read the detailed description. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai[bot] avatar Nov 02 '25 07:11 coderabbitai[bot]

Hi @ghost-2362003! Could you confirm how you tested these parsing changes?

haniyakonain avatar Nov 11 '25 09:11 haniyakonain

I tested these by running the entire framework using the redeploy-server script. Earlier it gave errors when parsing wikipedias.csv; now it does not. The framework compiles fine.

ghost-2362003 avatar Nov 11 '25 09:11 ghost-2362003

Could you share the exact error message/stack trace that was occurring before your fix when you ran the redeploy-server script?

The current test failures I'm seeing are:

  • NonIsoLanguagesMappingTest - failing due to Wikipedia API user-agent issue
  • BooleanParserTest - failing due to Language$ class initialization

These appear unrelated to your WikiInfo CSV parsing changes. I want to see what the original CSV parsing error looked like to confirm your fix addresses the right issue.

haniyakonain avatar Nov 11 '25 10:11 haniyakonain

Unfortunately, I did not take any screenshots. I only pasted the error into the prompt to understand and fix it.

ghost-2362003 avatar Nov 11 '25 10:11 ghost-2362003

Maybe you can try rebasing onto the latest master and running the redeploy-server script again; you might get the same error message. If it appears, please paste it here so we can confirm it matches the original CSV parsing issue.

haniyakonain avatar Nov 11 '25 13:11 haniyakonain

Well, I was not able to reproduce the exact error. However, I did get this by resetting a dummy branch to a relatively old commit, about 2 months old. Would this be of any help?
Screenshot 2025-11-11 214452

ghost-2362003 avatar Nov 11 '25 16:11 ghost-2362003

Thanks! The screenshot shows an XML dump parsing error (corrupted/incomplete dump), not the CSV issue, so it doesn’t confirm the original problem.

haniyakonain avatar Nov 18 '25 08:11 haniyakonain

@haniyakonain Do you know how to get past the snapshot deploy error?

ghost-2362003 avatar Nov 19 '25 04:11 ghost-2362003