extraction-framework Patch WikiInfo.scala

Updated the fromLines() and fromLine() methods of WikiInfo.scala for proper parsing of wikipedias.csv.

Summary by CodeRabbit

Chores
- Added CSV processing library to project dependencies
Bug Fixes
- Improved CSV data parsing with enhanced field validation and error handling for invalid entries

Nov 02 '25 07:11 ghost-2362003

📝 Walkthrough

A Maven dependency on the scala-csv library is added to the project. WikiInfo.scala refactors CSV parsing from naive string splitting to proper CSV reader-based parsing with explicit field validation and language code verification, replacing exceptions with warning logs and None returns for invalid input.
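
To illustrate why naive string splitting is fragile for CSV input, here is a minimal sketch using only the Scala standard library. The sample row is hypothetical (the actual columns of wikipedias.csv are not shown in this thread), and `quotedSplit` is a small hand-written stand-in for a real CSV reader, not the scala-csv API and not the code from this PR.

```scala
object NaiveSplitDemo {
  // Hypothetical row: the second field contains a comma inside quotes.
  val line = "1,\"Example, with comma\",en"

  // Naive approach: splits on every comma, including the quoted one.
  def naiveSplit(s: String): Array[String] = s.split(",", -1)

  // Minimal quote-aware splitter (illustrative stand-in for a CSV reader).
  def quotedSplit(s: String): List[String] = {
    val fields = scala.collection.mutable.ListBuffer.empty[String]
    val cur = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      if (c == '"') {
        if (inQuotes && i + 1 < s.length && s.charAt(i + 1) == '"') {
          cur.append('"') // a doubled quote inside quotes is an escaped quote
          i += 1
        } else {
          inQuotes = !inQuotes
        }
      } else if (c == ',' && !inQuotes) {
        fields += cur.toString // field boundary outside quotes
        cur.clear()
      } else {
        cur.append(c)
      }
      i += 1
    }
    fields += cur.toString // last field
    fields.toList
  }
}
```

With this row, `naiveSplit` yields four pieces because it cuts the quoted field in two, while `quotedSplit` yields the three intended fields.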

Changes

Cohort / File(s)	Summary
Maven Dependency `core/pom.xml`	Added scala-csv library dependency (`com.github.tototoshi:scala-csv_2.11:1.3.10`) to support CSV parsing functionality
CSV Parsing Refactor `core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala`	Migrated `fromLines` and `fromLine` methods from naive string splitting to CSVReader-based parsing; added field count validation (≥15 fields); added language code validation; replaced exception throwing with warning logs and None returns for invalid input; ensured reader resource cleanup in finally block

Sequence Diagram(s)

sequenceDiagram
    participant Input as Input Line(s)
    participant Parser as CSVReader
    participant Validator as Field Validator
    participant LangCheck as Language Validator
    participant Output as Result

    Input->>Parser: Parse CSV
    Parser->>Validator: Extract fields
    Validator->>Validator: Check field count ≥ 15
    alt Field count valid
        Validator->>LangCheck: Extract & validate language code
        alt Language code valid
            LangCheck->>Output: Return WikiInfo(Some)
        else Language code invalid
            LangCheck->>Output: Log warning, Return None
        end
    else Field count invalid
        Validator->>Output: Log warning, Return None
    end
    Output-->>Input: Result
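
The flow in the diagram can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `Info` is a hypothetical stand-in for `WikiInfo`, `LanguageColumn` and `knownLanguages` are placeholders (the thread does not say which column holds the language code or how it is validated), and logging uses `java.util.logging` only to keep the sketch self-contained. Only the 15-field threshold comes from the walkthrough.

```scala
import java.util.logging.Logger

object ParseFlowSketch {
  private val logger = Logger.getLogger("ParseFlowSketch")

  // Hypothetical stand-in for WikiInfo; the real class has different fields.
  final case class Info(languageCode: String, fieldCount: Int)

  val MinFields = 15                          // threshold stated in the walkthrough
  val LanguageColumn = 2                      // assumption: index of the language code column
  val knownLanguages = Set("en", "de", "fr")  // assumption: placeholder for the real check

  // Returns Some(Info) for a valid row; logs a warning and returns None otherwise.
  def fromFields(fields: Seq[String]): Option[Info] =
    if (fields.length < MinFields) {
      logger.warning(s"expected at least $MinFields fields, found ${fields.length}")
      None
    } else {
      val code = fields(LanguageColumn).trim
      if (knownLanguages.contains(code)) {
        Some(Info(code, fields.length))
      } else {
        logger.warning(s"unknown language code: '$code'")
        None
      }
    }
}
```

Returning `Option` lets a caller that processes many rows skip the invalid ones (for example with `flatMap`) instead of aborting on the first bad line.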

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

CSV parsing correctness: Verify that CSVReader properly handles edge cases, field extraction, and line joining logic in fromLines
Validation logic: Ensure field count (≥15) and language code validation are consistent and correct
Error handling strategy: Confirm that warning logs and None returns are appropriate fallback behavior instead of exceptions
Resource management: Verify CSVReader is properly closed via finally block in fromLine
Dependency compatibility: Confirm scala-csv 1.3.10 is compatible with the Scala 2.11 version used in the project
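
For the resource management item, the pattern to look for is a reader that is closed in `finally` whether or not parsing succeeds. A minimal sketch of that pattern, using a standard-library `BufferedReader` as a stand-in for `CSVReader` (the helper name `firstLine` is hypothetical):

```scala
import java.io.{BufferedReader, StringReader}

object CloseInFinally {
  // Reads the first line of the text; the reader is closed on every path,
  // including when readLine throws.
  def firstLine(text: String): Option[String] = {
    val reader = new BufferedReader(new StringReader(text))
    try {
      Option(reader.readLine()) // readLine returns null at end of input
    } finally {
      reader.close()
    }
  }
}
```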

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Title Check	❓ Inconclusive	The pull request title "Patch WikiInfo.scala" is vague and uses non-descriptive language that fails to convey the actual nature of the changes. While the title correctly identifies WikiInfo.scala as the modified file, it employs the generic term "Patch" without explaining what the patch accomplishes. According to the PR objectives, the main goal is to update the CSV parsing methods to enable proper parsing of wikipedias.csv, which is a meaningful change that should be reflected in the title. The current title does not communicate this purpose to reviewers scanning the repository history.	Consider revising the title to be more specific and descriptive, such as "Add CSV parsing support to WikiInfo.scala" or "Implement proper CSV parsing for wikipedias.csv in WikiInfo.scala". This would clearly communicate the main objective of the pull request and help reviewers understand the changeset at a glance without needing to read the detailed description.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

Nov 02 '25 07:11 coderabbitai[bot]

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

Nov 02 '25 07:11 sonarqubecloud[bot]

Hi @ghost-2362003! Could you confirm how you tested these parsing changes?

Nov 11 '25 09:11 haniyakonain

I tested these changes by running the entire framework with the redeploy-server script. Earlier it used to give errors while parsing wikipedias.csv; now it does not. The framework compiles fine.

Nov 11 '25 09:11 ghost-2362003

Could you share the exact error message/stack trace that was occurring before your fix when you ran the redeploy-server script?

The current test failures I'm seeing are:

NonIsoLanguagesMappingTest - failing due to Wikipedia API user-agent issue
BooleanParserTest - failing due to Language$ class initialization

These appear unrelated to your WikiInfo CSV parsing changes. I want to see what the original CSV parsing error looked like to confirm your fix addresses the right issue.

Nov 11 '25 10:11 haniyakonain

Unfortunately, I did not take any screenshots. I only pasted the error into the prompt to understand and fix it.

Nov 11 '25 10:11 ghost-2362003

Maybe you can try rebasing onto the latest master and running the redeploy-server script again; you might get the same error message. If it appears, please paste it here so we can confirm it matches the original CSV parsing issue.

Nov 11 '25 13:11 haniyakonain

Well, I was not able to reproduce the exact error. However, I did get this by resetting a dummy branch to a relatively old commit, about 2 months old. Would this be of any help?
Screenshot 2025-11-11 214452

Nov 11 '25 16:11 ghost-2362003

Thanks! The screenshot shows an XML dump parsing error (corrupted/incomplete dump), not the CSV issue, so it doesn’t confirm the original problem.

Nov 18 '25 08:11 haniyakonain

@haniyakonain Do you know how to get past the snapshot deploy error?

Nov 19 '25 04:11 ghost-2362003