csv-schema-inference

Files with quoted values that contain commas throw an exception

Open greghall76 opened this issue 2 years ago • 0 comments

Describe the bug
The file contains a quoted number with embedded commas, e.g. "2,126,000,000". This throws off the index alignment between the types extracted from the header and the types extracted from the data rows.

  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 397, in run_inference
    schemas_result = prl.parallel(records = lines,obj=dtype, d_schema = self.__schema)
  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 165, in parallel
    return [p.get() for p in results]
  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 165, in <listcomp>
    return [p.get() for p in results]
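The suspected misalignment can be illustrated with a minimal sketch. It assumes (not verified against the library source) that lines are split on the separator without honoring quotes. The shortened header and row below are hypothetical, cut down from the sample file for illustration, and the comparison uses Python's standard `csv` module rather than anything from csv_schema_inference.

```python
import csv
import io

# Hypothetical, shortened versions of the header and first data row.
header = '"id","country","year","gdp_for_year","gdp_per_capita"'
row = '0,"Albania",1987,"2,156,624,900",796'

# Naive split: every comma counts as a separator, including the ones
# inside the quoted number, so the row gets more fields than the header.
naive_header = header.split(",")
naive_row = row.split(",")

# Quote-aware parse with the standard library: commas inside quotes
# stay part of the value, so header and row line up again.
parsed_header, parsed_row = list(csv.reader(io.StringIO(header + "\n" + row + "\n")))

print(len(naive_header), len(naive_row))    # field counts differ
print(len(parsed_header), len(parsed_row))  # field counts match
print(parsed_row[3])                        # the quoted number, kept whole
```

If the library does split this way, any per-column type list built from the header would be shorter than the list built from a data row, which would be consistent with the indexing failure reported here.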

To Reproduce Steps to reproduce the behavior:

  1. See example below...

     "id","country","year","sex","age","suicides_no","population","country-year","HDI for year"," gdp_for_year","gdp_per_capita","generation"
     0,"Albania",1987,"male","15-24 years",21,312900,"Albania1987",,"2,156,624,900",796,"Generation X"
     1,"Albania",1987,"male","35-54 years",16,308000,"Albania1987",,"2,156,624,900",796,"Silent"
     2,"Albania",1987,"female","15-24 years",14,289700,"Albania1987",,"2,156,624,900",796,"Generation X"
     3,"Albania",1987,"male","75+ years",1,21800,"Albania1987",,"2,156,624,900",796,"G.I. Generation"
     4,"Albania",1987,"male","25-34 years",9,274300,"Albania1987",,"2,156,624,900",796,"Boomers"
     5,"Albania",1987,"female","75+ years",1,35600,"Albania1987",,"2,156,624,900",796,"G.I. Generation"

  2. See code below...

     from multiprocessing import freeze_support, Process
     from csv_schema_inference import csv_schema_inference

     def main():
         # If the inferred data type is INTEGER and FLOAT is also present in the results, the result will be FLOAT.
         conditions = {"INTEGER": "FLOAT"}
         pathfile = "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/suicide_data.csv"

         csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.9, max_length=100, batch_size=200000, acc=0.8, seed=2, header=True, sep=",", conditions=conditions)
         aprox_schema = csv_infer.run_inference(pathfile)
         csv_infer.pretty(aprox_schema)

     if __name__ == '__main__':
         freeze_support()
         Process(target=main).start()

Expected behavior
Inference should have completed and produced some kind of schema, e.g.:

  0
    name Username; Identifier;One-time password;Recovery code;First name;Last name;Department;Location
    type STRING
    nullable False
  ....

Desktop (please complete the following information):

  • OS: Ubuntu 22.04 and Python 3.10.12

greghall76, Aug 24 '23 14:08