How are folks here specifying the areas to extract in tabula-java?

Open vishaln79 opened this issue 7 years ago • 4 comments

For example, I am trying to run a batch job based on a Tabula template I had created: {"page":5,"extraction_method":"guess","x1":306.4589953308106,"x2":548.1989953308106,"y1":177.0975,"y2":230.6475,"width":241.74,"height":53.550000000000004}

I am not sure how to specify the area based on this.

Thanks!

vishaln79 avatar Aug 10 '18 14:08 vishaln79

The selected area is defined by its coordinates (i.e. x1, x2, y1, and y2), while the width and height are defined as well (I believe these are the page's, although in the template above they equal x2 - x1 and y2 - y1). On a side note: the one problem I have with this JSON object is that you get redundant data when you repeat the selection across multiple pages (do the width and height of the pages ever differ within a document?).

rosenjcb avatar Sep 05 '18 23:09 rosenjcb

I'm also interested in understanding how to convert the JSON information into command-line options. For example, how should I convert this line:

{"page":1,"extraction_method":"guess","x1":1.1161425018310547,"x2":594.9039534759522,"y1":342.6557480621338,"y2":761.5812337493896,"width":593.7878109741212,"height":418.9254856872559}

in order to call the command-line interface with the -a parameter?

java -jar tabula-1.0.2.jar -b docs -p 1 -a ...

edit: it seems that the exported data needs to be transformed as follows: -a 342.656,1.116,761.581,594.904 ... meaning: y1,x1,y2,x2
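
The reordering above can be sketched in a few lines of Java. This is only an illustration of the y1,x1,y2,x2 mapping; the class name TemplateArea and the method toAreaArg are made up for this example and are not part of tabula-java, and rounding to three decimals is an assumption chosen to match the values shown above.

```java
import java.util.Locale;

// Hypothetical helper, not part of tabula-java.
final class TemplateArea {

    // Reorders the template's x1/y1/x2/y2 into the top,left,bottom,right
    // order that -a expects, i.e. y1,x1,y2,x2, rounded to three decimals.
    public static String toAreaArg(double x1, double y1, double x2, double y2) {
        // Locale.ROOT keeps '.' as the decimal separator on every machine.
        return String.format(Locale.ROOT, "%.3f,%.3f,%.3f,%.3f", y1, x1, y2, x2);
    }

    public static void main(String[] args) {
        // Values copied from the template line quoted above.
        String area = toAreaArg(1.1161425018310547, 342.6557480621338,
                                594.9039534759522, 761.5812337493896);
        System.out.println("-a " + area); // prints "-a 342.656,1.116,761.581,594.904"
    }
}
```

The width and height fields of the template are not needed for the conversion, since -a takes the four edges directly.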

ghost avatar Oct 22 '18 10:10 ghost

@fabcan Refer to CommandLineApp.java: it states that flag 'a' "Accepts top,left,bottom,right". In other words, it takes coordinates as y1,x1,y2,x2. You can further deduce its functionality by investigating the methods that are called when extraction is done via CLI. Specifically, I think the coordinates the user enters are properly mapped in this conditional:

if (pageAreas != null) {
    for (Pair<Integer, Rectangle> areaPair : pageAreas) {
        Rectangle area = areaPair.getRight();
        if (areaPair.getLeft() == RELATIVE_AREA_CALCULATION_MODE) {
            area = new Rectangle((float) (area.getTop() / 100 * page.getHeight()),
                    (float) (area.getLeft() / 100 * page.getWidth()),
                    (float) (area.getWidth() / 100 * page.getWidth()),
                    (float) (area.getHeight() / 100 * page.getHeight()));
        }
        tables.addAll(tableExtractor.extractTables(page.getArea(area)));
    }
} else {
    tables.addAll(tableExtractor.extractTables(page));
}
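
To make the relative branch in that conditional more concrete, here is a minimal standalone sketch of the same arithmetic. It assumes the area was entered as percentages of the page size; the class RelativeArea and the method toAbsolute are hypothetical names for this example and do not exist in tabula-java, and the page size in main is invented.

```java
import java.util.Arrays;

// Hypothetical illustration, not part of tabula-java.
final class RelativeArea {

    // Scales a percentage-based area to absolute units, mirroring the
    // "/ 100 * page.getHeight()" and "/ 100 * page.getWidth()" terms above.
    // Returns {top, left, width, height}.
    public static float[] toAbsolute(float topPct, float leftPct,
                                     float widthPct, float heightPct,
                                     float pageWidth, float pageHeight) {
        return new float[] {
            topPct / 100 * pageHeight,    // vertical values scale with page height
            leftPct / 100 * pageWidth,    // horizontal values scale with page width
            widthPct / 100 * pageWidth,
            heightPct / 100 * pageHeight
        };
    }

    public static void main(String[] args) {
        // Made-up example: a 600 x 800 page and an area starting 10% from
        // the top and 20% from the left, 50% wide and 25% tall.
        float[] area = toAbsolute(10f, 20f, 50f, 25f, 600f, 800f);
        System.out.println(Arrays.toString(area));
    }
}
```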

rosenjcb avatar Oct 22 '18 16:10 rosenjcb

In case anyone comes across this while trying to define the extraction area in Java code rather than on the command line, you can do this:

    BasicExtractionAlgorithm extractionAlgorithm = new BasicExtractionAlgorithm();
    List<Table> tables = extractionAlgorithm.extract(page.getArea(399.6f, 71.052f, 817.284f, 556.14f));

The getArea arguments are the top, left, bottom and right coordinates. This thing really needs better documentation.

KalebTeixeira avatar Feb 05 '22 20:02 KalebTeixeira