bug/get_delimiter() fails when file read stops in the middle of a line
Describe the bug
When ingesting CSV files, partitioning sometimes fails with Error("Could not determine delimiter"). This only happens for some CSV files; for others it works as expected. The error originates in the get_delimiter() function.
To Reproduce
PMID35839768_Correlation_matrix.csv
Code snippet, using the above attached file:
from unstructured.partition.csv import get_delimiter
get_delimiter(file_path = "PMID35839768_Correlation_matrix.csv")
Output:
---------------------------------------------------------------------------
Error Traceback (most recent call last)
Cell In[9], line 1
----> 1 get_delimiter(file_path = "PMID35839768_Correlation_matrix.csv")
File /usr/local/lib/python3.11/site-packages/unstructured/partition/csv.py:124, in get_delimiter(file_path, file)
    121 with open(file_path) as f:
    122     data = f.read(num_bytes)
--> 124 return sniffer.sniff(data, delimiters=[",", ";"]).delimiter

File /usr/local/lib/python3.11/csv.py:187, in Sniffer.sniff(self, sample, delimiters)
    183 delimiter, skipinitialspace = self._guess_delimiter(sample,
    184                                                      delimiters)
    186 if not delimiter:
--> 187     raise Error("Could not determine delimiter")
    189 class dialect(Dialect):
    190     _name = "sniffed"
Error: Could not determine delimiter
Expected behavior
The function returns the delimiter, which is ',' for this file.
Screenshots
Not applicable.
Environment Info
Python 3.11.8
unstructured 0.12.5
Additional context
After looking into this issue for a bit, I found a similar issue for another Python module: https://github.com/Textualize/rich-cli/issues/54#issuecomment-1135595885
Scrolling further down that same thread, another comment (https://github.com/Textualize/rich-cli/issues/54#issuecomment-1252186953) points out that the example in the official Python csv.Sniffer docs has the same problem, which is likely where this bug came from, since the implementation in unstructured is nearly identical.
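To make the failure concrete, here is a small self-contained example (the rows are made up; they are not from the attached file) showing that csv.Sniffer detects the delimiter fine when it sees only whole rows, but raises once the last row is cut off mid-line, because the short row breaks the per-line consistency check the sniffer uses:

import csv

sniffer = csv.Sniffer()

# four well-formed rows, each with five commas
rows = [
    "gene,s1,s2,s3,s4,s5",
    "TP53,0.10,0.20,0.30,0.40,0.50",
    "EGFR,0.60,0.70,0.80,0.90,1.00",
    "MYC,0.11,0.21,0.31,0.41,0.51",
]
whole = "\n".join(rows)
truncated = whole[:-12]  # simulate f.read(num_bytes) stopping mid-row

print(sniffer.sniff(whole, delimiters=[",", ";"]).delimiter)  # prints ','
try:
    sniffer.sniff(truncated, delimiters=[",", ";"]).delimiter
except csv.Error as e:
    print(e)  # Could not determine delimiter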
Here is a code snippet I used to fix the issue by reading whole lines instead of truncating the read mid-line. The same change should be applied to both instances of .read() that appear in the get_delimiter() function: both should be switched to readlines().
import csv

sniffer = csv.Sniffer()
max_bytes = 8192
with open("PMID35839768_Correlation_matrix.csv") as f:
    # readlines(hint) returns a list of complete lines, stopping once the
    # total size of the lines read exceeds the hint
    line_strs = f.readlines(max_bytes)
data = "".join(line_strs)
sniffer.sniff(data, delimiters=[",", ";"]).delimiter
Output:
','
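For reference, here is a rough sketch of what that change could look like inside get_delimiter() itself. It is based only on the signature and the lines visible in the traceback above; the file-object branch and the num_bytes default are my assumptions, so the real function will differ in its details:

import csv

def get_delimiter(file_path=None, file=None, num_bytes=8192):
    # Sketch only: mirrors the lines shown in the traceback, with read()
    # swapped for readlines() so the sample always ends on a line boundary.
    sniffer = csv.Sniffer()

    if file is not None:
        # assumed branch for file-like input; only the file_path branch is
        # visible in the traceback
        lines = file.readlines(num_bytes)
        data = "".join(
            line.decode() if isinstance(line, bytes) else line for line in lines
        )
    else:
        with open(file_path) as f:
            data = "".join(f.readlines(num_bytes))

    return sniffer.sniff(data, delimiters=[",", ";"]).delimiter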
@awalker4 it looks like this is the commit where the bug was introduced: https://github.com/Unstructured-IO/unstructured/commit/d594c06a3e3a6c3b19fefc3fbcc316f3f872c530
Tagging you since you would probably be the best one to make this minor fix! :)
Sorry for the delay! This does look like a good fix. I can try to get to it soon, but if you have a chance to open a PR, that would be a huge help 🙏
In addition, this portion of the code sometimes errors out when reading the file raises a UnicodeDecodeError. I'd recommend passing errors='ignore' to the open() call so the delimiter can still be determined instead of failing on a stray character that can't be decoded.
import csv

sniffer = csv.Sniffer()
max_bytes = 8192
with open("PMID35839768_Correlation_matrix.csv", errors='ignore') as f:
    # same as above, but characters that can't be decoded are dropped
    # instead of raising
    line_strs = f.readlines(max_bytes)
data = "".join(line_strs)
sniffer.sniff(data, delimiters=[",", ";"]).delimiter
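A small synthetic illustration of that second point (the file name and the stray 0xFF byte are made up, and the encoding is pinned to UTF-8 so the result is reproducible): strict decoding raises UnicodeDecodeError partway through the sample, while errors='ignore' drops the bad byte and the delimiter is still detected.

import csv

# write a tiny CSV containing one byte that is not valid UTF-8
with open("bad_byte.csv", "wb") as f:
    f.write(b"a,b,c\n1,2,\xff3\n4,5,6\n")

sniffer = csv.Sniffer()

try:
    with open("bad_byte.csv", encoding="utf-8") as f:  # default errors='strict'
        f.readlines(8192)
except UnicodeDecodeError as e:
    print("strict decoding fails:", e)

with open("bad_byte.csv", encoding="utf-8", errors="ignore") as f:
    data = "".join(f.readlines(8192))
print(sniffer.sniff(data, delimiters=[",", ";"]).delimiter)  # prints ','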