EasyOCR

Problem with the labels.csv file

Open iAlexMG opened this issue 2 years ago • 2 comments

Site: https://www.jaided.ai/easyocr/modelhub/ Dataset link: en_sample.zip

If you download the dataset's .csv file, you will find several issues in it:

Normally, all the data should be in the first column. If you scroll down, you will see that some of the data is in the second column.

  1. Cell B71 holds eleven pictures in the same cell: 95-50-93-52-96-91-98-89-94-63-102.jpg. I don't know why they are together.
  2. Apart from line 71 in point 1 above, all other cells in column B that contain data appear to have been split off as if ";" were the separator (169-393-561-677-685-768-910-946-97.jpg).
  3. In addition, the ";" character is not part of the separator pattern passed to pd.read_csv( sep='^([^,]+),' ) in dataset.py.
  4. Finally, the CSV separator is the comma, but the comma also appears inside the text of certain photos.
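
To make point 3 concrete, here is a minimal sketch of how the separator pattern quoted above behaves on a single line, using only Python's `re` module. The sample lines and the helper name `split_label_line` are made up for illustration; only the pattern itself comes from dataset.py.

```python
import re

# Pattern quoted in point 3; the helper and sample lines are hypothetical.
FIRST_COMMA = re.compile(r'^([^,]+),')

def split_label_line(line):
    """Return (filename, words), or None when the line contains no comma."""
    parts = FIRST_COMMA.split(line, maxsplit=1)
    if len(parts) != 3:
        # The pattern did not match, so nothing was split off.
        return None
    _, filename, words = parts
    return filename, words

# Only the first comma is consumed, so commas in the text survive.
print(split_label_line("12.jpg,Hello, world"))
# A line that relies on ";" alone is not split at all.
print(split_label_line("13.jpg;Hello"))
# A ";" before the first comma ends up inside the filename field.
print(split_label_line("14.jpg;Hello, world"))
```

If this reading is right, the pattern copes with commas inside the text, but any row that uses ";" between the filename and the words would be parsed incorrectly.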

So I think it's impossible to use the dataset directly without modifying it. Personally, I appended a ";" character and the data from the B cells to the A cells. For the 11 photos on line 71, I simply transcribed them onto separate lines.

Is it just me who doesn't understand how to use easy_ocr, or is there really a problem with the .csv? I think the ideal solution would be to create a CSV that uses TAB as the separator. What do you think?

THX and sorry for asking this question!

Alex

iAlexMG avatar Jun 20 '23 01:06 iAlexMG

OK, I modified dataset.py to use TAB as the separator in the label file, and it works well.

dataset.py at line 153 : self.df = pd.read_csv(os.path.join(root,'labels.txt'), sep='\t', engine='python', usecols=['filename', 'words'], keep_default_na=False)
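
For anyone who wants to produce such a tab-separated labels.txt from an existing comma-separated file, the conversion could look roughly like this. It is only a sketch: the function names and sample rows are hypothetical, it assumes every row has the form `filename,words` with the filename before the first comma, and rows already damaged by ";" would still need to be fixed by hand first.

```python
def csv_line_to_tsv(line):
    """Turn 'filename,words' into 'filename<TAB>words', splitting on the first comma only."""
    filename, sep, words = line.rstrip("\n").partition(",")
    if not sep:
        raise ValueError(f"no comma found in line: {line!r}")
    if "\t" in filename or "\t" in words:
        # A tab inside the data would break the new format.
        raise ValueError(f"line already contains a tab: {line!r}")
    return f"{filename}\t{words}"

def convert_lines(lines):
    """Convert every non-empty line; the header row is converted like any other row."""
    return [csv_line_to_tsv(line) for line in lines if line.strip()]

# Hypothetical sample rows, including a comma inside the text.
sample = ["filename,words", "12.jpg,Hello, world", "13.jpg,OK"]
for row in convert_lines(sample):
    print(row)
```

The converted lines can then be written to labels.txt and read with the `sep='\t'` call shown above.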

Also, since the lines can't be sorted directly by the numerical value of the photo name, I simply chose all photo names starting with "1" for EN_VAL (12.5% of the total dataset) and the rest for _TRAIN. Here are my new .CSV and .TXT files: labels_val.csv labels__train_.csv labels_val.txt labels__train_.txt
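
The split described above can be sketched in a few lines. The function name and the sample filenames are hypothetical, and the resulting proportion depends entirely on how the real filenames are numbered.

```python
def split_by_first_digit(filenames, val_digit="1"):
    """Put names starting with val_digit into the validation list, the rest into training."""
    train, val = [], []
    for name in filenames:
        if name.startswith(val_digit):
            val.append(name)
        else:
            train.append(name)
    return train, val

# Hypothetical filenames 1.jpg .. 20.jpg.
names = [f"{i}.jpg" for i in range(1, 21)]
train, val = split_by_first_digit(names)
print(len(train), len(val))
```

This keeps the split deterministic without having to sort the names numerically, at the cost of a validation set that is not a random sample.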

iAlexMG avatar Jun 20 '23 01:06 iAlexMG

I also spent a lot of time figuring out these issues (my issue: my model wasn't able to recognize comma characters). Thanks! I will also use tab as the separator.

BMukhtar avatar Jan 01 '24 08:01 BMukhtar