CsvSource type conversion with custom schema

Open tszolar opened this issue 8 years ago • 3 comments

From the CSV source section of the project README, I got the idea that type conversion for a loaded CSV should be performed according to the specified schema.

But if I define a custom schema for a CsvSource with columns of types other than String (Int, for example), the values in those columns are still returned as String.

Is this intended behaviour, a bug, or has it just not been implemented yet?

Runnable example:

import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import io.eels.component.csv.CsvSource
import io.eels.schema._

object CsvSourceTypeConversionTest extends App {

  val exampleCsvString =
    """A,B,C,D
      |1,2.2,3,foo
      |4,5.5,6,bar
    """.stripMargin

  val stream = new ByteArrayInputStream(exampleCsvString.getBytes(StandardCharsets.UTF_8))
  val schema = new StructType(Vector(
    Field("A", IntType.Signed),
    Field("B", DoubleType),
    Field("C", IntType.Signed),
    Field("D", StringType)
  ))
  val ds = new CsvSource(stream _, Some(schema)).toDataStream()
  val firstRow = ds.iterator.toIterable.head
  val firstRowA = firstRow.get("A")
  println(firstRowA) // prints 1 as expected
  println(firstRowA.getClass.getTypeName) // prints java.lang.String
  assert(firstRowA == 1) // this assertion will fail because firstRowA is not an Int
}
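
The failing comparison can be reproduced without eel at all. Below is a minimal sketch in plain Scala, assuming only that the row value is typed as `Any` and holds a `String` at runtime; `StringVsIntEquality`, `looksEqual` and `parsedEqual` are illustrative names, not part of the library.

```scala
object StringVsIntEquality {
  // Compares the raw value as-is; a String never equals an Int.
  def looksEqual(raw: Any, expected: Int): Boolean = raw == expected

  // Parses the text first, then compares the resulting Int.
  def parsedEqual(raw: Any, expected: Int): Boolean = raw.toString.toInt == expected

  def main(args: Array[String]): Unit = {
    val raw: Any = "1" // stand-in for the value read from column A
    println(looksEqual(raw, 1))  // false
    println(parsedEqual(raw, 1)) // true
  }
}
```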

tszolar, Feb 08 '18 16:02

@flexik Have you considered using the SchemaInferrer?

val inferrer = SchemaInferrer(SchemaType.String, SchemaRule("qty", SchemaType.Int, false), SchemaRule(".*_id", SchemaType.Int))
CsvSource("myfile").withSchemaInferrer(inferrer)

I take your point though that perhaps if you explicitly pass in the schema it should use the schema under the hood - we will be looking into this.

hannesmiller, Feb 13 '18 08:02

@hannesmiller Using SchemaInferrer yields exactly the same result as using the schema directly. No type conversion happens. All Row fields are still Strings.

Updated example with SchemaInferrer:

import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets

import io.eels.{DataTypeRule, SchemaInferrer}
import io.eels.component.csv.CsvSource
import io.eels.schema._

object CsvSourceTypeConversionTest extends App {

  val exampleCsvString =
    """A,B,C,D
      |1,2.2,3,foo
      |4,5.5,6,bar
    """.stripMargin

  def stream = new ByteArrayInputStream(exampleCsvString.getBytes(StandardCharsets.UTF_8))
  val inferrer = SchemaInferrer(
    StringType,
    DataTypeRule("A", IntType.Signed),
    DataTypeRule("B", DoubleType),
    DataTypeRule("C", IntType.Signed),
    DataTypeRule("D", StringType)
  )
  val ds = new CsvSource(stream _).withSchemaInferrer(inferrer).toDataStream()
  val firstRow = ds.iterator.toIterable.head
  val firstRowA = firstRow.get("A")
  println(firstRowA) // prints 1 as expected
  println(firstRowA.getClass.getTypeName) // prints java.lang.String
  assert(firstRowA == 1) // this assertion will fail because firstRowA is not an Int
}
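
One possible workaround in the meantime is to convert the String values by hand after reading. This is a minimal sketch in plain Scala that does not touch eel's API: `ColType`, `coerce` and `coerceRow` are hypothetical names for illustration, and the column types are a stand-in for the real schema classes.

```scala
import scala.util.Try

object ManualCoercion {
  // Hypothetical column types for this sketch; these are NOT eel's DataType classes.
  sealed trait ColType
  case object IntCol extends ColType
  case object DoubleCol extends ColType
  case object StringCol extends ColType

  // Convert one raw CSV cell to its declared type; None if the text does not parse.
  def coerce(raw: String, tpe: ColType): Option[Any] = tpe match {
    case IntCol    => Try(raw.trim.toInt).toOption
    case DoubleCol => Try(raw.trim.toDouble).toOption
    case StringCol => Some(raw)
  }

  // Convert a whole row, pairing each cell with its declared type by position.
  def coerceRow(cells: Seq[String], types: Seq[ColType]): Seq[Option[Any]] =
    cells.zip(types).map { case (cell, tpe) => coerce(cell, tpe) }

  def main(args: Array[String]): Unit = {
    val types = Seq(IntCol, DoubleCol, IntCol, StringCol) // columns A, B, C, D
    val row   = coerceRow(Seq("1", "2.2", "3", "foo"), types)
    assert(row.head == Some(1)) // column A is now an Int, not a String
    println(row)
  }
}
```

Returning `Option` keeps unparseable cells visible as `None` instead of throwing in the middle of a stream; whether that is the right failure mode depends on the data.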

tszolar, Feb 14 '18 14:02

@flexik OK, this may be a bug that was introduced between versions. Nevertheless, I agree this should match the supplied schema. We will make this a priority for the next release, which is looking like the early part of March.

Will keep you posted if we manage to get this resolved in an alpha release beforehand.

Regards Hannes

hannesmiller, Feb 14 '18 16:02