scautable icon indicating copy to clipboard operation
scautable copied to clipboard

Add enhanced header management utilities

Open krishivseth opened this issue 10 months ago • 11 comments

Enhanced Header Management for Scautable

This PR introduces a comprehensive header management system for Scautable, providing robust utilities for handling CSV headers. The implementation includes:

  1. Header Normalization: Converts headers to valid Scala identifiers, handles spaces/special characters, ensures uniqueness, and prevents conflicts with reserved keywords.

  2. Header Validation: Detects empty/duplicate headers, validates against Scala identifier rules, and provides detailed error messages with severity levels.

  3. Content-Based Column Name Inference: Analyzes column content to suggest meaningful names for numbers, dates, emails, URLs, booleans, and names.

  4. Header Improvement Suggestions: Suggests better names for poor headers, converts to camelCase, and provides explanations for suggestions.

The implementation includes a sealed trait hierarchy for header problems, comprehensive error reporting, and a complete test suite covering all features and edge cases.

krishivseth avatar Apr 08 '25 03:04 krishivseth

Wow - thanks for looking at this.

My request would be to see how we can integrate it into the object which reads the CSV i.e. pass a test like this;

https://github.com/Quafadas/scautable/blob/f1bebfc0080f16bd3bae240852e70311d0a766bf/scautable/test/jvm/src/testCsv.scala#L228C1-L232C4

 test("reading data") {
    val csv: CsvIterator[("col1", "col2", "col3")] = CSV.absolutePath(Generated.resourceDir0 + "simple.csv")

  }

But for a file which has duplicated headers in? If we can get that to compile, I'd love to merge this.

Quafadas avatar Apr 08 '25 05:04 Quafadas

Hi Simon,

I've implemented the duplicate headers support. The solution maintains compile-time typing while handling duplicate column names.

I took three approaches:

  1. An auto-detection method that requires no type annotations (detectHeaders)
  2. An explicit method where you provide normalized header types (absolutePathWithDuplicates)
  3. Helper methods for users who need to determine normalized headers first

The CsvIterator now tracks both original and normalized headers (dups get numbered like "col1_1"), with utilities to check the mapping.

I've included tests with a duplicate-header CSV file to demonstrate all approaches. All tests are passing. Let me know if you'd like any adjustments before merging.

krishivseth avatar Apr 08 '25 23:04 krishivseth

This is the part I'm really interested in .

test("access columns by original name") {
    val csv: CsvIterator[("col1", "col2", "col1_1", "col3")] = 
      CSV.absolutePathWithDuplicates(Generated.resourceDir0 + "duplicate_headers.csv")

because this is the "user-facing API" that you are expecting a human to work with and read documentation about.

Do you think it makes sense? I'm not convinced - why might that be? Can you think of a better alternative?

Quafadas avatar Apr 09 '25 06:04 Quafadas

I think we can keep detectHeaders as the main way to read the files (since it avoids the type signature issue), but add more intuitive runtime helper methods for accessing columns, like this:

   // Keep as the recommended way to read
   val csv = CSV.detectHeaders("path/to/duplicates.csv")

   // Add helper methods for easier access:
   val secondCol1Value = csv.getByOccurrence("col1", 1) // get 2nd occurrence (0-indexed)
   val thirdColValue = csv.getByPosition(2)             // get column by original position

   // Users can easily find which columns are duplicates:
   val duplicates = csv.getDuplicateColumns() // e.g., Map("col1" -> 2)

The getByOccurrence and getByPosition methods feel more user-friendly. The underlying type-safe access (CSV.column["col1_1"]) is available for those who prefer it or need that level of compile-time checking, but it's not the primary way we'd document accessing duplicates. Please let me know what you think, and I'll make a new commit.

krishivseth avatar Apr 10 '25 04:04 krishivseth

I agree, that we should have as few entry points as possible.

I'm not going to lie; I don't understand what these are or why I would need them.

   // Add helper methods for easier access:
   val secondCol1Value = csv.getByOccurrence("col1", 1) // get 2nd occurrence (0-indexed)
   val thirdColValue = csv.getByPosition(2)             // get column by original position

I'm happy for you to make the case but it looks to me as though you're writing a custom version of scala's standard library.

 val secondCol1Value = csv.getByOccurrence("col1", 1) // get 2nd occurrence (0-indexed)

Could this be expressed as

csv.drop(1).take(1).head.col1

I'll come out and say that if that is more or less what you are proposing, I am opposed to custom coding stdlib functionality in the library. A guiding principle here is that if we need standard functionality, we should use scalas standard collection library - a terrific piece of work that is maintained by the scala centre.

I am interested in the header deduplication, but I think we need to avoid creating a zoo of entry points. I haven't had the time to review this is great detail, so I could be wrong, but I'm not sure why this deduplication can't be a method on the CSVIterator class that simply deduplicates the headers and returns another CSV iterator with better columns headers?

I'm also not sure what the "type signature issue" is? You'll have to lay that out in more detail.

Quafadas avatar Apr 10 '25 06:04 Quafadas

Let me jump into this conversation.

I think it is a good idea right here. Particularly, the work on header deduplication is indispensable. The user is virtually blocked to work on the data otherwise.

As my understanding goes, named tuples may be constructed with empty identifiers and/or duplicated ones. Unless the user manipulates the underlying type, they are left with inaccessible columns in these cases.

With this in mind, I would implement this normalization (of duplicated headers) directly on the main entry points for the library — namely absolutePath, resource, etc. As it is a matter of necessity, not only a style choice.

For the other normalizations, I think it is worthwhile mentioning the Scala's literal identifiers feature (see this Stack Overflow answer). The feature by itself makes most normalizations unnecessary.

val `A identifier with spaces` = true
val `1 identifier with a number on the start` = true
val `Backticks are a thing, and they let me use "any string that's accepted by the runtime" as an identifier` = "Awesome, don't you think? It's a shame that this feature is somewhat obscure."

The only exception I can think of is if the column has a backtick on it, as the specification forbids backticks inside the identifier (see this other question). So, if we choose to opt for using backticks, I don't think that other normalizations are crucial.

Finally, for the suggestions feature, I think they ARE useful, and at the same time I fail to imagine a good use case for including them on the library. So I don't have a strong opinion here.

The getByOccurrence and getByPosition methods feel more user-friendly. The underlying type-safe access (CSV.column["col1_1"]) is available for those who prefer it or need that level of compile-time checking, but it's not the primary way we'd document accessing duplicates.

I'm happy for you to make the case but it looks to me as though you're writing a custom version of scala's standard library.

I interpreted this differently, something like

// suppose a csv that contains the headers: ("dup_col", "middle_col", "dup_col", "unique_col")
//                                              vvvvvvvvv we rename duplicated names
val csv: CsvIterator[("dup_col", "middle_col", "dup_col_1", "unique_col")] = CSV.detectHeaders(...)

val duplicated_first_ref = csv.getByOcurrence("duplicated_col", 0) // "dup_col"
val duplicated_second_ref = csv.getByOcurrence("duplicated_col", 1) // "dup_col_1"
val col_by_position = csv.getByPosition(3) // "unique_col"

If this is indeed the case, how would we access values inside map functions, for example? One of the beauties of using named tuples is that we can do things like this:

val csv: CsvIterator[("Name", "Surname", "Age")] =
  CSV.absolutePath(...)
    .mapColumn["Age", Option[Int]](_.toIntOption)

val data: Iterator[(`Full Name`: String, Age: Option[Int])] =
  csv.map(x => (`Full Name` = s"${x.Name} ${x.Surname}", Age = x.Age))

@krishivseth could you enlighten us on this?

Dorodri avatar Apr 11 '25 02:04 Dorodri

@Dorodri My understanding is that this PR was about header management. I am interested in it and it's definitely a goal.

I think the discussion around these extra methods can be moved to a seperate PR unless it's to do with header management? (I don't think it is) val duplicated_first_ref = csv.getByOcurrence("duplicated_col", 0) // "dup_col"

At least one concern, is where this leads. CSV.absolutePathWithDuplicates

Are we also going to have CSV.absolutePathWithDuplicatesAndSemicolonSeperator("/some4thing.csv")

My commentary here stands.

Do you think it makes sense? I'm not convinced - why might that be? Can you think of a better alternative?

Quafadas avatar Apr 11 '25 05:04 Quafadas

I think the discussion around these extra methods can be moved to a seperate PR unless it's to do with header management? (I don't think it is)

Sure! Let's see what krishivseth has to say about this first; maybe we misunderstood the idea.

Do you think it makes sense? I'm not convinced - why might that be? Can you think of a better alternative?

I think we share the same concern here.

I don't think it is necessary to add more entry points for this feature, and I would rather have the header deduplication be the default behavior (that is, just be a part of CSV.absolutePath and the other — existing — entry points).

My argument for this is that if the user tries to import a CSV that contains duplicated headers, with the current behavior, they end up with inaccessible fields.*

Having inaccessible fields is surely not a feature, so we can opt to

  1. provide an error message, or
  2. deduplicate the headers.

I think that deduplicating the headers and emitting a warning is a good default behavior. What do you think?

*: For documentation purposes, this behavior of NamedTuples can be tested with this snippet:

import NamedTuple.*

val t = ("1", "2").withNames[("foo", "foo")]
println(t.foo) // output: 1

Dorodri avatar Apr 11 '25 19:04 Dorodri

Some notes:

https://github.com/Quafadas/scautable/discussions/43

Quafadas avatar Apr 12 '25 06:04 Quafadas

Hi @Quafadas and @Dorodri,

Thanks for the feedback, and Simon, really appreciate you sharing PR #44 and your notes on the match type complexity. To me it seems that the core constraint is NamedTuple needing unique names at compile time. We have to normalize duplicates somehow to get a valid type.

Seeing the challenges you hit with the runtime .deduplicateHeaders() in #44, and thinking about Dorodri's point about default handling and the goal of fewer entry points, how about we try this:

  1. Normalize by Default, Inside Existing Methods: Rather than adding a new method like in #44, we modify the existing macros (CSV.absolutePath, CSV.resource, etc.). They'd always run the HeaderUtils normalization logic upfront. This seems necessary anyway, as Dorodri mentioned, to avoid inaccessible columns.

  2. Add Compile-Time Warnings: If normalization actually renames a header (e.g., "col1" becomes "col1_1"), the compiler gives a heads-up: Warning: Duplicate header 'col1' normalized to 'col1_1'. Access using the normalized name. This makes the default behavior visible.

  3. Clean Up Entry Points: We'd remove the extra detectHeaders and absolutePathWithDuplicates methods I currently have in my branch (header-management2), simplifying the API.

  4. Standard Access: Users stick with the normal csv.column["header_name"]. If a name was normalized, the warning tells them which name to use (like csv.column["col1_1"]). This keeps the standard type-safe access.

  5. Keep Discovery Tools: csv.headers and csv.headerNormalizationReport remain so users can always check the final normalized names.

This feels like it integrates the necessary deduplication more directly into the existing flow Simon is familiar with, makes it default as Dorodri suggested, avoids the runtime method complexity from #44, and keeps the API surface cleaner. Does this sound like a more practical way forward based on everything we've discussed? Please let me know your thoughts.

krishivseth avatar Apr 22 '25 20:04 krishivseth

I've been AFK for a couple of days so apologies.

I had another look at it and updated it again today, and I'm leaning toward #44 actually solving the problem I wanted it to, in that it appears to pass the tests and fits reasonable well my mental model of how I'd like the API to look.

However, I'm not sure if #44 is the best solution, and I agree the points you made above. I'd be curious to see if you can indeed keep the API clean and have some nice diagnostics and warnings to go with it. I'm not in a rush t merge anything and am still exploring the solution space, so I'll leave this open a bit longer.

Quafadas avatar Apr 26 '25 18:04 Quafadas