tabula-java

Is there any way to extract only particular columns by column name, without specifying the area?

Open rakshitcgupta opened this issue 7 years ago • 11 comments

rakshitcgupta • May 29 '18 12:05

What's your use-case?

criztovyl • May 29 '18 12:05

I want only the debit and credit columns from bank statements.

rakshitcgupta • May 30 '18 09:05

Tabula only does extraction, so I think this is not possible at the moment.

But it should be easy to post-process the Tabula result so you get only the columns you want.
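
A minimal sketch of that kind of post-processing, assuming the extracted table has already been turned into rows of strings (in tabula-java that would typically mean walking `Table.getRows()` and calling `getText()` on each cell) and that the first row holds the column names. `ColumnSelector` and `select` are hypothetical names for illustration, not part of Tabula.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of tabula-java.
class ColumnSelector {

    /**
     * Keeps only the columns whose header (row 0) matches one of the wanted names.
     * Matching is case-insensitive and ignores surrounding whitespace.
     */
    static List<List<String>> select(List<List<String>> rows, List<String> wanted) {
        List<List<String>> out = new ArrayList<>();
        if (rows.isEmpty()) {
            return out;
        }
        // Resolve the wanted names to column indices using the header row.
        List<String> header = rows.get(0);
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < header.size(); i++) {
            for (String name : wanted) {
                if (header.get(i).trim().equalsIgnoreCase(name)) {
                    keep.add(i);
                    break;
                }
            }
        }
        // Copy only the selected cells; rows that are too short are padded with "".
        for (List<String> row : rows) {
            List<String> picked = new ArrayList<>();
            for (int i : keep) {
                picked.add(i < row.size() ? row.get(i) : "");
            }
            out.add(picked);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for an extracted bank statement.
        List<List<String>> rows = List.of(
            List.of("Date", "Description", "Debit", "Credit", "Balance"),
            List.of("2018-05-01", "ATM withdrawal", "100.00", "", "900.00"),
            List.of("2018-05-02", "Salary", "", "2500.00", "3400.00"));
        for (List<String> row : select(rows, List.of("Debit", "Credit"))) {
            System.out.println(String.join(",", row));
        }
    }
}
```

The selected columns come out in the order they appear in the table, not in the order they were requested.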

criztovyl • May 30 '18 18:05

I did the column extraction as a post-processing step in the CSVWriter class, but that only runs after Tabula has extracted the whole table. I actually want to specify the columns in the pre-processing phase, which could reduce the complexity. So I wanted to know where exactly the header extraction/detection is done in the code.

rakshitcgupta • May 31 '18 07:05

I'm not sure where that code is atm, let's see if I can find it.

criztovyl • May 31 '18 07:05

A quick look didn't bring up what you want. If I understand correctly, Tabula does not even have a concept of headings - they're just the first line of the table.

On a side note, why do you think that specifying the column names you want beforehand reduces complexity? I would say it even increases complexity, if my statement about Tabula having no concept of headers is correct.

Or are you worried about memory consumption? Here I would say that a CSV can be read line by line, and most CSV parsers won't consume much memory.
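
To illustrate the line-by-line point, here is a rough sketch that streams a CSV and writes out only the wanted columns, holding one line in memory at a time. It assumes a simple CSV where no field contains a quoted comma or an embedded line break; a real CSV parser would be needed otherwise. `StreamingColumnFilter` and `filter` are made-up names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of tabula-java.
class StreamingColumnFilter {

    /** Copies only the wanted columns (exact header match) from in to out. */
    static void filter(Reader in, Writer out, List<String> wanted) {
        try {
            BufferedReader reader = new BufferedReader(in);
            BufferedWriter writer = new BufferedWriter(out);
            String line = reader.readLine();
            if (line == null) {
                return; // empty input, nothing to write
            }
            // Assumption: the first line is the header and fields are split by plain commas.
            String[] header = line.split(",", -1);
            List<Integer> keep = new ArrayList<>();
            for (int i = 0; i < header.length; i++) {
                if (wanted.contains(header[i].trim())) {
                    keep.add(i);
                }
            }
            // Process one line at a time so memory use does not grow with file size.
            while (line != null) {
                String[] cells = line.split(",", -1);
                StringBuilder sb = new StringBuilder();
                for (int k = 0; k < keep.size(); k++) {
                    if (k > 0) {
                        sb.append(',');
                    }
                    int i = keep.get(k);
                    if (i < cells.length) {
                        sb.append(cells[i]);
                    }
                }
                writer.write(sb.toString());
                writer.write('\n');
                line = reader.readLine();
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data; in practice the reader would wrap Tabula's CSV output.
        String csv = "Date,Debit,Credit\n2018-05-01,100.00,\n2018-05-02,,2500.00\n";
        StringWriter out = new StringWriter();
        filter(new StringReader(csv), out, List.of("Debit", "Credit"));
        System.out.print(out);
    }
}
```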

Sorry for the discussion; I'm just trying to help with what I know. :)

criztovyl • May 31 '18 08:05

First of all, I have gone through the Nurminen algorithm for table detection, which is used in the code: https://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3 The thesis mentions header detection, but no header detection algorithm is implemented in the code.

Yeah, I understand that this repository does not even have a concept of headings; they're just the first line of the table. What I wanted help with is finding the first line of the table.

rakshitcgupta • May 31 '18 10:05

Okay, then I can't help you, sorry.

But I saw #230 earlier, so I thought this issue here was about getting the data from a PDF and the other one about header detection in Tabula.

criztovyl • May 31 '18 10:05

Yeah, I just want help with finding where the first line of the table is handled in the code, so that I can modify the code to extract particular columns at that point.
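
Since the thread establishes that Tabula has no header concept, one possible workaround is a heuristic that scans the extracted rows for the first one containing every wanted column name; bank statements often have a few preamble rows before the real header. This is only a sketch of such a heuristic, not something taken from Tabula's code. It again assumes the rows have been converted to strings, and `HeaderRowFinder` and `findHeaderRow` are hypothetical names.

```java
import java.util.List;

// Hypothetical heuristic for illustration; tabula-java does not provide this.
class HeaderRowFinder {

    /** Returns the index of the first row that contains every wanted name, or -1 if none does. */
    static int findHeaderRow(List<List<String>> rows, List<String> wanted) {
        for (int r = 0; r < rows.size(); r++) {
            if (containsAll(rows.get(r), wanted)) {
                return r;
            }
        }
        return -1;
    }

    // Case-insensitive check that each wanted name appears as a whole cell in the row.
    private static boolean containsAll(List<String> row, List<String> wanted) {
        for (String name : wanted) {
            boolean found = false;
            for (String cell : row) {
                if (cell.trim().equalsIgnoreCase(name)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample data with two preamble rows before the header.
        List<List<String>> rows = List.of(
            List.of("Statement of account", "", ""),
            List.of("Account no.", "12345", ""),
            List.of("Date", "Debit", "Credit"),
            List.of("2018-05-01", "100.00", ""));
        System.out.println(findHeaderRow(rows, List.of("Debit", "Credit")));
    }
}
```

Rows before the returned index can then be dropped, and the header row used to pick column indices.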

rakshitcgupta • Jun 01 '18 05:06

Did you ever get around to this @rakshitcgupta?

thubamamba • Nov 08 '23 11:11

> Did you ever get around to this @rakshitcgupta?

I don't even remember. It's been 5 years. 😞

rakshitcgupta • Nov 11 '23 09:11