SNOW-826851: Improve DataFrameReader and DataFrameWriter API
What is the current behavior?
Currently the DataFrameReader and DataFrameWriter APIs differ from the Spark DataFrame APIs in several ways. They are also internally inconsistent: DataFrameReader has option()/options(), but DataFrameWriter does not.
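To make the asymmetry concrete, here is a rough sketch of the current Snowpark surface (hedged: session, my_schema, the @my_stage stage, and the table name are placeholders):

# Reader: option()/options() are available and pass Snowflake format
# options through, e.g. SKIP_HEADER for CSV
df = session.read.schema(my_schema).option("SKIP_HEADER", 1) \
    .csv("@my_stage/zipcodes.csv")

# Writer: no option()/options()/format()/save(); instead there are
# separate methods such as save_as_table() and copy_into_location()
df.write.mode("overwrite").save_as_table("zipcodes")
df.write.copy_into_location("@my_stage/out/",
                            file_format_type="csv",
                            header=True,
                            overwrite=True)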
What is the desired behavior?
Provide an API that is more familiar to developers coming from typical Spark reader/writer code.
How would this improve snowflake-snowpark-python?
It would make it easier for people transitioning to, or evaluating a move to, Snowpark. It might also enable a more natural API for supporting additional formats.
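For illustration only, the desired surface might look like the following hypothetical Snowpark code (none of these writer methods exist today; the names simply mirror Spark):

# Hypothetical: the writer gains option()/options()/format()/save()
df.write.option("header", True).mode("overwrite").csv("@my_stage/zipcodes")
df.write.options(header=True, delimiter=",") \
    .format("csv").mode("overwrite").save("@my_stage/zipcodes")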
References, Other Background
@sfc-gh-mrojas could you please provide concrete examples?
Sure. Here are some common Spark DataFrame writer patterns:
# Write a CSV file with a column header (column names)
df.write.option("header", True) \
    .csv("/tmp/spark_output/zipcodes")

# Other CSV options
df2.write.options(header='True', delimiter=',') \
    .csv("/tmp/spark_output/zipcodes")

# Saving modes
df2.write.mode('overwrite').csv("/tmp/spark_output/zipcodes")

# You can also use format()/save()
df2.write.format("csv").mode('overwrite').save("/tmp/spark_output/zipcodes")
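For comparison, a rough Snowpark equivalent of those writer patterns today goes through copy_into_location(), where format and copy options are keyword arguments rather than chained option() calls (the @my_stage path is a placeholder):

# Snowpark today: one call carries the format, its options, and the
# overwrite behavior that mode('overwrite') expresses in Spark
df.write.copy_into_location(
    "@my_stage/zipcodes",
    file_format_type="csv",
    format_type_options={"FIELD_DELIMITER": ","},
    header=True,
    overwrite=True,
)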
# Path to a US flight delays CSV file
csv_file = "/databricks-datasets/learning-spark-v2/flights/departuredelays.csv"

# Read the file and create a temporary view, inferring the schema
# (for larger files you may want to specify the schema explicitly)
df = (spark.read.format("csv")
      .option("inferSchema", "true")
      .option("header", "true")
      .load(csv_file))
df.createOrReplaceTempView("us_delay_flights_tbl")
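A rough Snowpark counterpart of that read (hedged: assumes a Snowpark version that supports the INFER_SCHEMA option for CSV reads, and that the file has been uploaded to a placeholder stage @my_stage):

# Snowpark: read a staged CSV, inferring the schema, then create a temp view
df = (session.read
      .option("INFER_SCHEMA", True)
      .csv("@my_stage/departuredelays.csv"))
df.create_or_replace_temp_view("us_delay_flights_tbl")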
# The same Spark read, this time with an explicit schema
# defined as a DDL string
schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"
flights_df = spark.read.csv(csv_file, schema=schema)
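The explicit-schema version highlights another divergence: Snowpark's reader takes a StructType via schema() rather than a DDL string (again, @my_stage is a placeholder):

from snowflake.snowpark.types import (
    StructType, StructField, StringType, IntegerType,
)

# Snowpark requires a StructType; DDL-string schemas are not accepted
schema = StructType([
    StructField("date", StringType()),
    StructField("delay", IntegerType()),
    StructField("distance", IntegerType()),
    StructField("origin", StringType()),
    StructField("destination", StringType()),
])
flights_df = session.read.schema(schema).csv("@my_stage/departuredelays.csv")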