#convert_fields's sym/string break isn't clarified in #initialize's docs
If iterating thru multiple converters, if any single converter results in a symbol, subsequent converters are skipped. As seen in the #convert_field docs below:
#
# Processes +fields+ with <tt>@converters</tt>, or <tt>@header_converters</tt>
# if +headers+ is passed as +true+, returning the converted field set. Any
# converter that changes the field into something other than a String halts
# the pipeline of conversion for that field. This is primarily an efficiency
# shortcut.
#
def convert_fields(fields, headers = false)
# see if we are converting headers or fields
converters = headers ? @header_converters : @converters
fields.map.with_index do |field, index|
converters.each do |converter|
break if field.nil?
field = if converter.arity == 1 # straight field converter
converter[field]
else # FieldInfo converter
header = @use_headers && !headers ? @headers[index] : nil
converter[field, FieldInfo.new(index, lineno, header)]
end
break unless field.is_a? String # short-circuit pipeline for speed
end
field # final state of each field, converted or original
end
end
However the docs for the #initialize options don't relay that same requirement about converter arrays:
# <b><tt>:converters</tt></b>:: An Array of names from the Converters
# Hash and/or lambdas that handle custom
# conversion. A single converter
# doesn't have to be in an Array. All
# built-in converters try to transcode
# fields to UTF-8 before converting.
# The conversion will fail if the data
# cannot be transcoded, leaving the
# field unchanged.
This leads a reader to believe that you can pass converters like so:
:header_converters => [:symbol, lambda { |h| h.do_some_stuff }]
But this will result in lambda { |h| h.do_some_stuff } never executing.
I could reverse the order of my converters, as shown below, and now both will execute, but the intended result is dependent on the correct order.
:header_converters => [lambda { |h| h.do_some_stuff }, :symbol]
My workaround: Because I didn't want to duplicate your :symbol lambda (and risk falling out of sync with subsequent changes), I had to write a custom lambda, that as a part of its definition invokes the default symbol lambda:
def self.symbol_and_squeezed_lambda
lambda do |header|
native_symbol_lambda = CSV::HeaderConverters[:symbol]
first_conversion = native_symbol_lambda.call(header)
first_conversion.to_s.squeeze('_').to_sym
end
end
...
:header_converters => symbol_and_squeezed_lambda
All of this could only be found by putting debuggers inside csv.rb and tracing my way down to and through the convert_field method before I realized why me second/custom converter wasn't being executed.
Possible changes:
- Document this symbol/order requirement of converted in the docs/comments for
#initialize - Remove the
breakin the#convert_fields
Even if change 1 is implemented, it would only help clarity, and would still require the use of workaround as I showed above.
Can you show a runnable Ruby script that reproduces this case?
@kou
raw_csv_data = "col_header_1, col_header_2\nval_1, val_2\n"
# This should raise and break execution flow if it is called
custom_converter = lambda { |header_val| raise 'break'}
broken_converters = [:symbol, custom_converter]
working_converters = [custom_converter, :symbol]
# custom_converter not called (as seen via lack of raise)
# symbol converter happens first, and short-circuits subsequent converters
CSV.parse(raw_csv_data, :headers => true, :header_converters => broken_converters)
# custom_converter called (as seen via raise)
CSV.parse(raw_csv_data, :headers => true, :header_converters => working_converters)
Thanks.
In your use case (symbol -> squeeze("_") -> symbol), you should use :headers_converters =>[squeeze_converter, :symbol]. Because you can't call squeeze for Symbol. You can't put the squeeze converter after symbol converter.
:headers_converters => [:symbol, squeeze_converter] is a your application bug.
The current behavior is reasonable for this use case. Do you have any other use case?
@kou regardless of what my converter does, if its after the :symbol converted, it would never execute.
The docs for the CSV initializer method don't state this, and it seems unreasonable that a dev should have to dig way deep into the source code to discover that the header_converter loop early-exits after anything returns a symbol.
It seems like that early-exit is a bug, or, at the very least, needs to be better documented in the initializer docs.
Can you try sending a pull request to improve the current document?