
Adding TableNet model to extract tabular data

Open felixdittrich92 opened this issue 4 years ago • 8 comments

Add a TableNet model to extract tabular data as a DataFrame from images. I have a ready-to-use model (.pt) trained on the Marmot dataset and need a bit of guidance on where to add it (preferably as ONNX). For self-training I can also add a script in references, and the same for the dataset, but only in PyTorch (Lightning).

After the restructuring / hOCR & PDF/A export. @fg-mindee @charlesmindee

felixdittrich92 avatar Oct 04 '21 16:10 felixdittrich92

Hi @felixdittrich92,

Thanks for bringing this to the table, it is a very interesting and useful feature. It would be great to integrate such a model in doctr; however, we need to think about the global architecture: should it be a separate model (no shared features) from our detection + recognition pipeline (which would certainly slow down the end-to-end prediction), or should it be integrated into the detection predictor to maximize feature sharing?

To answer this question we can look at the speed of your model. Can you benchmark this on your side?
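A minimal way to measure this kind of CPU latency could look like the sketch below (a hedged illustration; `run_model` in the usage comment is a hypothetical stand-in for the exported TableNet inference call, not a doctr API):

```python
import statistics
import time

def benchmark(fn, *args, warmup=2, runs=10):
    """Return the median wall-clock latency (seconds) of fn(*args)."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs are excluded from the timings
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Usage (hypothetical): latency = benchmark(run_model, image_tensor)
```

Using the median rather than the mean makes the number robust to one-off scheduler hiccups on a laptop CPU.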

If it is fast enough, we can start by implementing it separately in a new module, and it will run independently from the main pipeline. We can first implement the model in PyTorch as you suggested and provide a pretrained version (.pt) in the config, then tackle the dataset/training script integration later on!

Have a nice day ! :smile:

charlesmindee avatar Oct 06 '21 07:10 charlesmindee

@charlesmindee Yes, I think I will do that later today :) I wish you the same. I have attached the TensorBoard logs if you want to take a look: version_0.zip

felixdittrich92 avatar Oct 06 '21 09:10 felixdittrich92

@charlesmindee On an 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, the ONNX model takes ~3-3.5 s without (Tesseract) OCR (tomorrow I can also test the pure .pt model if you want!?). I think optimizations are still possible, such as smaller input sizes or model pruning. Sample output:

```
                                                0     1      2      3     4     5     6
0   Protein-ligand Complex #rotable bonds stoDock  Dock  FlexX    ICM  GOLD   T10   120
1                                     3pib 3 0.80  0.59     Mu    0.4   109  0.56   054
2                                      ing 2 0.62  0.86    108    O71   189  0.70  0.69
3                                      Lin) 3 121   156    173   2.17   190   142  1.50
4                                      ink 4 1.69   187   1.70   2.53   308  1.16    14
5                                      ini 5 2.61  5.26   2.73   3.40   493  2.22  2.22
6                                     Lipp 7 1.80  3.25    195     un   233  2.43   253
7                                    Ipph "1 5.14  3.91   3.27    144    43  4.00  0.53
8                                     Ipht 1 2.09  2.39   4.68    123    42   120  1.20
9                                     Iphg 5 3.52   537    487   0.46   420   107   108
10                                    2epp 3 3.40  2.48    04d   2.53   349  3.26  3.27
11                                    Inse 2 1.40  4.86   6.00    180   102   147  1.40
12                                    Insd n 1.20   451    156   1.04   096   18s   18s
13                                   Innb nl 0.92   451   0.92  1.08,   034  1.67  3.97
14                                    lebx 5 1.33  3.13    132   0.82   187  0.62  0.62
15                                    Bepa 8 2.22  6.48    151     on    87  2.22  2.22
16                                    Gepa 16 830   830   9.83   1.60   496  4.00  4.00
17                                    labe 4 0.16   187    OSS    036   ois  0.56  0.56
18                                    labf 5 0.48  3.25   0.76   0.61   030  0.68  0.70
19                                    Sabp 6 0.48  3.89   4.68    oss   030  0.48   O51
20                                    letr 15 461  6.66   7.26   0.87   $90  1.09  1.09
21                                    lets B 5.06  3.93      2   6.22   230   197   197
22                                     lett n 812   133   6.24  0.98,   130  0.82  0.82
23                                    3tmn 10 4si  7.09    530    136   396  3.65  3.65
24                                     Stln 4 534   139    633    142   160   421   421
25                                    ima 20 8.72   778    451   2.60   ssa   221   224
26                                    apt 30 1.89  8.06   5.95   0.88   882  5.72  4.79
27                                   lapu 29 9.10   758    843   2.02  1070   132   132
28                                   2itb 1s 3.09   143   8.94   1.04    26  2.09  5.19
29                                     teil 6 581  2.78   3.52   2.00    04  1.86  1.86
30                                      lok 5 854  5.65    422   3.03   385  2.84  2.84
31                                    Lenx B 10.9   735    683   2.09   632  6.20  6.20
32
```

What do you think ?
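For context on how raw cell detections end up as DataFrame rows like the sample above, a purely illustrative sketch of such post-processing is shown below (this is not TableNet's actual pipeline; `cells_to_rows` and the `(x, y, text)` coordinate format are assumptions):

```python
def cells_to_rows(cells, y_tol=10):
    """Group (x, y, text) cell detections into rows by vertical proximity,
    then order each row left to right. Purely illustrative post-processing."""
    rows = []
    for cell in sorted(cells, key=lambda c: (c[1], c[0])):
        # Extend the current row if this cell is vertically close to it,
        # otherwise start a new row.
        if rows and abs(cell[1] - rows[-1][0][1]) <= y_tol:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    return [[c[2] for c in sorted(row, key=lambda c: c[0])] for row in rows]
```

A list of rows produced this way could then be passed to `pandas.DataFrame(rows)` to get output like the table above.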

felixdittrich92 avatar Oct 06 '21 13:10 felixdittrich92

Hi @felixdittrich92,

Thanks for the benchmark. Does the ONNX model which takes ~3 s to run include the OCR task as well? (I understand that it doesn't include Tesseract, but is there any other module apart from the raw TableNet?) If so, we should benchmark the table detection part alone; if it is only the table detection module, it seems quite slow (we are aiming at ~1 s inference per page for our end-to-end pipeline on CPU, maybe more if the document is large), and we should see how we can optimize that.

Have a nice day! :smile:

charlesmindee avatar Oct 12 '21 09:10 charlesmindee

@charlesmindee Yes, currently the pure table segmentation alone takes ~3 s. For this reason, I think model pruning, a smaller input size, or a teacher/student distillation experiment could help to optimize it. I currently have an internal problem to take care of, so I probably won't get to it in the near future (just like with the restructuring, #512). However, if you want, I can send you the dataset and the training scripts!?

I wish you the same

felixdittrich92 avatar Oct 12 '21 13:10 felixdittrich92

Hi @felixdittrich92,

It is absolutely not a problem if we don't take care of this in the near future. It would indeed be great for us if you could share the dataset/training scripts, but don't get too wrapped up in it!

Best!

charlesmindee avatar Oct 13 '21 10:10 charlesmindee

@charlesmindee You can download it (along with my pretrained model) at Dataset_Model_Trained; tell me if you got it :) One thing: if you train this on a multi-GPU system, you have to save the model only from global rank zero, or save it from a checkpoint after training :)
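The rank-zero saving caveat can be sketched framework-agnostically as below (`save_on_rank_zero` is a hypothetical helper; in actual PyTorch Lightning code you would check `trainer.global_rank` instead, and distributed launchers such as `torchrun` set the `RANK` environment variable):

```python
import os

def save_on_rank_zero(state, path, save_fn):
    """Call save_fn(state, path) only on the rank-0 process, so a multi-GPU
    run writes a single checkpoint instead of one per worker.
    Returns True if this process performed the save."""
    rank = int(os.environ.get("RANK", "0"))  # 0 when not running distributed
    if rank == 0:
        save_fn(state, path)
        return True
    return False

# Usage (hypothetical):
# save_on_rank_zero(model.state_dict(), "tablenet.pt", torch.save)
```

Saving from every rank at once can corrupt the file or produce duplicates, which is the failure mode described above.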

felixdittrich92 avatar Oct 13 '21 12:10 felixdittrich92