Potential bug: Rows from one table appear in the parsing of another when 1 row is added to a third
This is kind of a weird one. I am not sure if I am doing something wrong or if there is a bug in some of the _start_/_end_ logic (or somewhere else).
Here's the setup: I have a text file with many tables and other values that need to be extracted. For the purpose of this issue, I have reduced the text file to only 3 tables and a few other values, as a minimal reproducible example.
Here is the file:
Here is the template file:
<vars>
HASH3 = "\#\#\#"
</vars>
<group name="network">
{{ ignore("HASH3") }} START NETWORK DATA {{ _start_ }}
#Network_nSections {{ num_sections | to_int }}
<group name="section_data" method="table">
#Network_SectionData {{ ignore(".*") }} {{ _start_ }}
<group>
{{ ignore(" *") }}{{ A }} {{ B }} {{ C }} {{ D }} {{ J }} {{ active }} {{ K }} {{ L }}
</group>
#Application_FileName {{ ignore(".*") }} {{ _end_ }}
</group>
<group name="application">
#Application_FileName {{ filename }}
#Application_SpreadFactor {{ spread_factor }}
</group>
{{ ignore("HASH3") }} END NETWORK DATA {{ _end_ }}
</group>
In this template I only care about extracting values from one table, namely "Network_SectionData" (second table), plus a few other values. In the text file, we also have a building table (first table) and a summary table (third table).
If I run
python -m ttp.ttp -t example.ttp -d example-output-01.txt -o json > out.json
Then I see the list of expected extracted rows in network.section_data.
However, if the following line is added to the end of the building table, just above ### END BUILDING DATA
52.422 6.502 22.2 0.0 0.65 2.100E+02 8.086E+03 4.982E+11 3.654E+03
then these values from the third table start to appear in the parsed output:
...
{
"A": "id",
"B": "mode",
"C": "on/off",
"D": "Light",
"J": "Freq.",
"K": "Ship.",
"L": "[log.dec.]",
"active": "[Hz]"
},
{
"A": "2",
"B": "Found",
"C": ":",
"D": "u_z",
"J": "dynamic",
"K": "(Hz)",
"L": "0.000",
"active": "0.00"
},
...
These are the values from the summary table (third table) that happen to match the row template:
{{ ignore(" *") }}{{ A }} {{ B }} {{ C }} {{ D }} {{ J }} {{ active }} {{ K }} {{ L }}
I find this very peculiar, because
- The change happens in a part of the file that ttp seemingly shouldn't care about.
- I have 2 different _end_ indicators, and if just one of them found a correct match, it should never look down in the summary table section in the first place.
Note: I know that I could probably find a way around this by making sure that my match variables only match numbers, for instance, but for my use case I need to rely solely on _start_ and _end_ indicators.
Windows 10, python 3.7, ttp 0.9.1
@dmulyalin - Sorry for tagging you directly, but do you have any idea what could be the cause of this?
I would recommend simplifying your template. For example, this gives the same results as yours but is a bit easier to read IMHO:
<vars>
HASH3 = "\#\#\#"
</vars>
<group name="network">
{{ ignore("HASH3") }} START NETWORK DATA {{ _start_ }}
<group name="section_data">
{{ ignore(" *") }}{{ A }} {{ B }} {{ C }} {{ D | DIGIT }} {{ J }} {{ active | DIGIT }} {{ K }} {{ L }}
</group>
{{ ignore("HASH3") }} END NETWORK DATA {{ _end_ }}
</group>
<group name="network.application">
#Network_nSections {{ num_sections | to_int }}
#Application_FileName {{ filename }}
#Application_SpreadFactor {{ spread_factor }}
</group>
As for the undesired matches, I was not able to reproduce the problem by doing this:
However, if the following line is added to the end of the building table, just above ### END BUILDING DATA
52.422 6.502 22.2 0.0 0.65 2.100E+02 8.086E+03 4.982E+11 3.654E+03
but here are several techniques to avoid unnecessary matches:
- Use an end indicator; you are already using it.
- Use more specific regexes. E.g. in your template you are using:
{{ ignore(" *") }}{{ A }} {{ B }} {{ C }} {{ D }} {{ J }} {{ active }} {{ K }} {{ L }}
while in my template I am using:
{{ ignore(" *") }}{{ A }} {{ B }} {{ C }} {{ D | DIGIT }} {{ J }} {{ active | DIGIT }} {{ K }} {{ L }}
My template will only match digits for the D and active variables; that alone should solve the problem with false matches in your case.
- Pre-process the input data by removing parts of it that do not need to be matched. In other words, provide TTP with data that is as clean as possible, where ideally each line will be matched by some variables.
- Do inline filtering using condition functions, e.g. using
{{ A | contains(".") }}
will filter out any unwanted matches that do not contain a dot character.
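A rough Python-side analogue of the filtering idea, applied after parsing rather than inside the template: drop any parsed row whose D or active value is not a number. This is only a sketch. `is_number` and `keep_numeric_rows` are hypothetical helper names, not part of TTP, and the third sample row is made up for illustration (the first two are the unwanted rows quoted in this thread).

```python
def is_number(token):
    """Return True if token parses as a float (e.g. '0.00' or '2.100E+02')."""
    try:
        float(token)
        return True
    except (TypeError, ValueError):
        return False


def keep_numeric_rows(rows, fields=("D", "active")):
    """Keep only rows whose listed fields all look numeric."""
    return [row for row in rows if all(is_number(row.get(f)) for f in fields)]


rows = [
    # Unwanted rows reported in this thread (header text from the summary table).
    {"A": "id", "B": "mode", "C": "on/off", "D": "Light",
     "J": "Freq.", "K": "Ship.", "L": "[log.dec.]", "active": "[Hz]"},
    {"A": "2", "B": "Found", "C": ":", "D": "u_z",
     "J": "dynamic", "K": "(Hz)", "L": "0.000", "active": "0.00"},
    # Hypothetical well-formed row, invented for this example.
    {"A": "1", "B": "2", "C": "3", "D": "0.0",
     "J": "0.65", "K": "5", "L": "6", "active": "1"},
]

clean_rows = keep_numeric_rows(rows)
```

Filtering after the fact does not explain why the extra rows are matched in the first place; it only keeps them out of downstream processing.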
Sorry I haven't gotten back to you yet.
I find it a bit unsettling that you are not able to reproduce the problem. It makes me doubt whether there is a setup issue at my end. But I did test it multiple times and tried to boil it down to the very core before creating the issue.
Regarding the change of template: You are probably right, but this template was taken from a bigger template, perhaps 10 times as large and with a lot of complexity. It might not be possible to make these simplifications in real life. And since the data we are parsing is quite messy, outside our control, and sometimes unpredictable, we need at least some general matching.
I think you might be right that we need to pre-process the input data. I had just hoped that TTP would spare us that, because it is such a strong parser. Regardless, I will work more on the template.
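A minimal sketch of that kind of pre-processing, assuming the data file delimits each block with lines such as `### START NETWORK DATA` and `### END NETWORK DATA` (inferred from the template, since the data file itself is not shown in this thread). `extract_section` is a hypothetical helper and the sample text is invented.

```python
import re


def extract_section(text, name):
    """Return the lines from '### START <name>' through '### END <name>', inclusive.

    Returns an empty string if the start marker is never found.
    """
    start = re.compile(r"^\s*###\s+START\s+" + re.escape(name) + r"\b")
    end = re.compile(r"^\s*###\s+END\s+" + re.escape(name) + r"\b")
    kept = []
    inside = False
    for line in text.splitlines():
        if not inside and start.search(line):
            inside = True
        if inside:
            kept.append(line)
            if end.search(line):
                break
    return "\n".join(kept)


# Invented sample that only mimics the assumed layout of the real file.
sample = """### START BUILDING DATA
52.422 6.502 22.2
### END BUILDING DATA
### START NETWORK DATA
#Network_nSections 2
### END NETWORK DATA
id mode on/off
"""

network_only = extract_section(sample, "NETWORK DATA")
```

The trimmed text could then be passed to TTP instead of the whole file, so that the building and summary tables are never seen by the template.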
Regarding reproducing: Before I close this, I would like to make one last effort to see if anyone else can reproduce it. Let me think a bit about how.
I have now reproduced the issue in PythonAnywhere.
main.py

I then installed ttp==0.9.1 in the python3.7 environment.
Calling main.py with the input file without the line mentioned above:

Calling main.py with the input file with the problematic line:

In the red circle, I have highlighted a couple of matches that contain data from a different table than they are supposed to, for instance
{
"A": "id",
"B": "mode",
"C": "on/off",
"D": "Light",
"J": "Freq.",
"K": "Ship.",
"L": "[log.dec.]",
"active": "[Hz]"
}
Here's the live console you can play around with: https://www.pythonanywhere.com/shared_console/0e996b6e-fa94-4244-aed7-4c000b7fdd60
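To pin down exactly which rows show up only when the extra line is present, the two parsed row lists could be compared. A small sketch, assuming the `section_data` rows from each run have already been loaded as lists of dicts; `extra_rows` is a hypothetical helper and the sample rows are shortened for illustration.

```python
import json


def extra_rows(baseline_rows, changed_rows):
    """Return rows in changed_rows that have no identical row in baseline_rows."""
    # Serialize with sorted keys so that key order does not affect equality.
    seen = {json.dumps(row, sort_keys=True) for row in baseline_rows}
    return [row for row in changed_rows
            if json.dumps(row, sort_keys=True) not in seen]


# Shortened, partly invented rows; real rows carry all eight fields.
baseline = [{"A": "1", "D": "0.0", "active": "1"}]
changed = [
    {"A": "1", "D": "0.0", "active": "1"},
    {"A": "id", "D": "Light", "active": "[Hz]"},
]

new_rows = extra_rows(baseline, changed)
```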