csv icon indicating copy to clipboard operation
csv copied to clipboard

#convert_fields's sym/string break isn't clarified in #initialize's docs

Open troycroz opened this issue 7 years ago • 5 comments

If iterating thru multiple converters, if any single converter results in a symbol, subsequent converters are skipped. As seen in the #convert_field docs below:

#
  # Processes +fields+ with <tt>@converters</tt>, or <tt>@header_converters</tt>
  # if +headers+ is passed as +true+, returning the converted field set.  Any
  # converter that changes the field into something other than a String halts
  # the pipeline of conversion for that field.  This is primarily an efficiency
  # shortcut.
  #
  def convert_fields(fields, headers = false)
    # see if we are converting headers or fields
    converters = headers ? @header_converters : @converters

    fields.map.with_index do |field, index|
      converters.each do |converter|
        break if field.nil?
        field = if converter.arity == 1  # straight field converter
          converter[field]
        else                             # FieldInfo converter
          header = @use_headers && !headers ? @headers[index] : nil
          converter[field, FieldInfo.new(index, lineno, header)]
        end
        break unless field.is_a? String  # short-circuit pipeline for speed
      end
      field  # final state of each field, converted or original
    end
  end

However the docs for the #initialize options don't relay that same requirement about converter arrays:

# <b><tt>:converters</tt></b>::         An Array of names from the Converters
  #                                       Hash and/or lambdas that handle custom
  #                                       conversion.  A single converter
  #                                       doesn't have to be in an Array.  All
  #                                       built-in converters try to transcode
  #                                       fields to UTF-8 before converting.
  #                                       The conversion will fail if the data
  #                                       cannot be transcoded, leaving the
  #                                       field unchanged.

This leads a reader to believe that you can pass converters like so:

:header_converters => [:symbol, lambda { |h| h.do_some_stuff }]

But this will result in lambda { |h| h.do_some_stuff } never executing.

I could reverse the order of my converters, as shown below, and now both will execute, but the intended result is dependent on the correct order.

:header_converters => [lambda { |h| h.do_some_stuff }, :symbol]

My workaround: Because I didn't want to duplicate your :symbol lambda (and risk falling out of sync with subsequent changes), I had to write a custom lambda, that as a part of its definition invokes the default symbol lambda:

  def self.symbol_and_squeezed_lambda
    lambda do |header|
      native_symbol_lambda = CSV::HeaderConverters[:symbol]
      first_conversion = native_symbol_lambda.call(header)
      first_conversion.to_s.squeeze('_').to_sym
    end
  end
  ...

:header_converters => symbol_and_squeezed_lambda

All of this could only be found by putting debuggers inside csv.rb and tracing my way down to and through the convert_field method before I realized why me second/custom converter wasn't being executed.

Possible changes:

  1. Document this symbol/order requirement of converted in the docs/comments for #initialize
  2. Remove the break in the #convert_fields

Even if change 1 is implemented, it would only help clarity, and would still require the use of workaround as I showed above.

troycroz avatar Mar 28 '18 17:03 troycroz

Can you show a runnable Ruby script that reproduces this case?

kou avatar Mar 31 '18 13:03 kou

@kou

raw_csv_data = "col_header_1, col_header_2\nval_1, val_2\n"

# This should raise and break execution flow if it is called
custom_converter = lambda { |header_val| raise 'break'}

broken_converters = [:symbol, custom_converter]
working_converters = [custom_converter, :symbol]

# custom_converter not called (as seen via lack of raise)
# symbol converter happens first, and short-circuits subsequent converters
CSV.parse(raw_csv_data, :headers => true, :header_converters => broken_converters)
# custom_converter called (as seen via raise)
CSV.parse(raw_csv_data, :headers => true, :header_converters => working_converters)

troycroz avatar Apr 02 '18 15:04 troycroz

Thanks.

In your use case (symbol -> squeeze("_") -> symbol), you should use :headers_converters =>[squeeze_converter, :symbol]. Because you can't call squeeze for Symbol. You can't put the squeeze converter after symbol converter.

:headers_converters => [:symbol, squeeze_converter] is a your application bug.

The current behavior is reasonable for this use case. Do you have any other use case?

kou avatar Apr 07 '18 16:04 kou

@kou regardless of what my converter does, if its after the :symbol converted, it would never execute.

The docs for the CSV initializer method don't state this, and it seems unreasonable that a dev should have to dig way deep into the source code to discover that the header_converter loop early-exits after anything returns a symbol.

It seems like that early-exit is a bug, or, at the very least, needs to be better documented in the initializer docs.

troycroz avatar Apr 12 '18 18:04 troycroz

Can you try sending a pull request to improve the current document?

kou avatar Apr 19 '18 02:04 kou