Problem with the labels.csv file
Site: https://www.jaided.ai/easyocr/modelhub/ Dataset link: en_sample.zip
If you download the dataset and open its .csv file, you will find several issues:
Normally, all the data should be in the 1st column. If you scroll down, you will see that some of the data ends up in the 2nd column.
- In cell B71, eleven pictures share the same cell: 95-50-93-52-96-91-98-89-94-63-102.jpg. I don't know why they are together.
- Apart from cell B71 mentioned above, all other non-empty cells in column B look as if they had been split off by a ";" separator (169-393-561-677-685-768-910-946-97.jpg).
- In addition, ";" is not among the separators accepted by the pd.read_csv( sep='^([^,]+),' ) call in dataset.py.
- Not to mention that the CSV separator is the comma, but the comma also appears inside the text of certain photos.
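To illustrate the comma problem from the last point, here is a minimal sketch using only the Python standard library. The sample line is made up for illustration (it is not taken from en_sample.zip), and I am assuming each row has the form `filename,words`. It compares naive comma splitting with splitting at the first comma only:

```python
import csv
import io

# Hypothetical label data: the text for 97.jpg contains a comma itself.
sample = "filename,words\n97.jpg,Hello, world\n"

# Naive comma splitting: the text is broken into two separate fields,
# which is what a spreadsheet shows as data spilling into column B.
naive_rows = list(csv.reader(io.StringIO(sample)))

# Splitting at the first comma only keeps the whole text in one field.
first_comma_rows = [line.split(",", 1) for line in sample.splitlines()]

print(naive_rows[1])
print(first_comma_rows[1])
```

With naive splitting the row gets three fields instead of two, while splitting at the first comma keeps the text intact.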
So I think it's impossible to use the dataset directly without modifying it. Personally, I appended a ";" character and the data from the B cells to the A cells. For the 11 photos in line 71, I simply transcribed them onto separate lines.
Is it just me who doesn't understand how to use easy_ocr, or is there really a problem with the .csv? I think the ideal solution would be to create a CSV that uses a TAB as the separator. What do you think?
Thanks, and sorry for asking this question!
Alex
OK, I modified dataset.py to use a TAB as the separator in the labels file, and it works well.
In dataset.py, at line 153:
self.df = pd.read_csv(os.path.join(root,'labels.txt'), sep='\t', engine='python', usecols=['filename', 'words'], keep_default_na=False)
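For anyone who wants to produce the tab-separated labels.txt that the line above reads, here is a rough sketch of the conversion. The function name `csv_to_tsv` is my own (hypothetical), and I am assuming every line is `filename,words` with no TAB inside the text:

```python
import io

def csv_to_tsv(src, dst):
    """Copy 'filename,words' lines from src to dst, tab-separated.

    Only the first comma is treated as the separator, so commas
    inside the text are preserved.
    """
    for line in src:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        filename, sep, words = line.partition(",")
        if not sep:
            raise ValueError(f"no comma found in line: {line!r}")
        dst.write(f"{filename}\t{words}\n")

# Hypothetical sample data, used in place of the real labels.csv.
src = io.StringIO("filename,words\n97.jpg,Hello, world\n")
dst = io.StringIO()
csv_to_tsv(src, dst)
result = dst.getvalue()
print(result)
```

With real files, `src` and `dst` would be file objects opened on labels.csv and labels.txt.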
Also, since the lines can't be sorted directly by the numerical value of the photo name, I simply put every photo whose name starts with "1" into EN_VAL and the rest into _TRAIN. The validation set represents 12.5% of the total dataset. Here are my new .csv and .txt files: labels_val.csv labels__train_.csv labels_val.txt labels__train_.txt
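The split I describe can be sketched roughly like this. The helper `split_by_prefix` and the sample rows are hypothetical, only there to show the idea:

```python
def split_by_prefix(rows, prefix="1"):
    """Split (filename, words) rows into validation and training lists.

    Rows whose filename starts with `prefix` go to validation,
    everything else goes to training.
    """
    val = [row for row in rows if row[0].startswith(prefix)]
    train = [row for row in rows if not row[0].startswith(prefix)]
    return val, train

# Hypothetical rows, not taken from the real dataset.
rows = [("1.jpg", "a"), ("12.jpg", "b"), ("2.jpg", "c"), ("97.jpg", "d")]
val_rows, train_rows = split_by_prefix(rows)
print(len(val_rows), len(train_rows))
```

The resulting ratio depends on how the filenames are distributed, so it is worth checking the sizes of both lists before training.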
I also spent a lot of time figuring out these issues (in my case, my model wasn't able to recognize comma characters). Thanks! I will also use a tab as the separator.