Attribute single quote mark errors as "Ambiguous syntax"
Quite excited to see what I can make of pyVHDLParser. When using it on a simple enough file, I encounter the following:
================================================================================
pyVHDLParser - Test Application
================================================================================
FATAL: An unknown or unhandled exception reached the topmost exception handler!
Exception type: TokenizerException
Exception message: (line: 15, col: 37): Ambiguous syntax detected. buffer: ''l'
Caused in: GetVHDLTokenizer in file '/mnt/d/Development/OpenSource/pyVHDLParser/pyVHDLParser/Token/Parser.py' at line 368
--------------------------------------------------------------------------------
...Token/Parser.py", line 368, in GetVHDLTokenizer
raise TokenizerException("Ambiguous syntax detected. buffer: '{buffer}'".format(buffer=buffer), start)
--------------------------------------------------------------------------------
Please report this bug at GitHub: https://github.com/VLSI-EDA/pyIPCMI/issues
--------------------------------------------------------------------------------
This seems to be triggered by the access of 'length... I'll be diving into the Parser.py file to see if I can rectify.
Thanks!
pyVHDLParser/Token/Parser.py : 363-365
if ((buffer[0] in __ALPHA_CHARACTERS__) and (buffer[1] in __ALPHA_CHARACTERS__)):
    tokenKind = cls.TokenKind.AlphaChars
elif ((buffer[0] in __WHITESPACE_CHARACTERS__) and (buffer[1] in __WHITESPACE_CHARACTERS__)):
    ...
So buffer[0] is the single quote... doesn't seem like any attribute access will survive this. Am I missing something?
I changed line 362 to
buffer = buffer[1:3]
from
buffer = buffer[:2]
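To illustrate the difference between the two slices, here is a minimal sketch. `ALPHA_CHARACTERS` is a stand-in for the tokenizer's `__ALPHA_CHARACTERS__` (its real contents are an assumption), `both_alpha` is a hypothetical helper, and the buffer value is an assumed example of what the tokenizer holds when it reaches 'length:

```python
import string

# Stand-in for the tokenizer's __ALPHA_CHARACTERS__; the real contents are assumed.
ALPHA_CHARACTERS = string.ascii_letters + "_"


def both_alpha(pair: str) -> bool:
    """Return True if `pair` has exactly two characters and both are alphabetic."""
    return len(pair) == 2 and all(ch in ALPHA_CHARACTERS for ch in pair)


# Assumed buffer contents when the tokenizer reaches the attribute 'length.
buffer = "'length"

print(both_alpha(buffer[:2]))   # False: buffer[0] is the apostrophe
print(both_alpha(buffer[1:3]))  # True: "le" follows the apostrophe
```

With `buffer[:2]` the alpha check can never succeed for an attribute, because the first character is always the apostrophe itself.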
Hi, sorry for the wrong issue link in the error message. It's a pyVHDLParser error, not a pyIPCMI error :).
Parsing attributes is very complex in VHDL and ambiguous by nature. Some parsers (to be exact, lexers/tokenizers) protect themselves by allowing only attribute names longer than 1 character; otherwise an attribute might get mixed up with a character literal such as 'c'.
Examples:
- a'b'c => character literal 'b' between a and c
- aa'bb'cc => chain of attribute names: cc applied to bb, applied to aa
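The "names longer than 1 character" rule can be sketched without regular expressions as follows. This is only an illustration of the rule described above, not pyVHDLParser's implementation; `classify_after_apostrophe` and its return values are hypothetical:

```python
def classify_after_apostrophe(text: str, pos: int) -> str:
    """Classify the apostrophe at text[pos].

    Returns "character literal", "attribute", or "ambiguous".
    """
    if text[pos] != "'":
        raise ValueError("text[pos] must be an apostrophe")
    # A character literal is exactly: apostrophe, one character, apostrophe.
    if pos + 2 < len(text) and text[pos + 2] == "'":
        return "character literal"
    # A letter followed by another identifier character cannot be a
    # character literal, so treat it as the start of an attribute name.
    following = text[pos + 1:pos + 3]
    if (len(following) == 2 and following[0].isalpha()
            and (following[1].isalnum() or following[1] == "_")):
        return "attribute"
    return "ambiguous"


print(classify_after_apostrophe("a'b'c", 1))     # character literal
print(classify_after_apostrophe("aa'bb'cc", 2))  # attribute
print(classify_after_apostrophe("aa'bb'cc", 5))  # attribute
```

A single-character attribute name at the end of the input (for example `a'b`) is reported as "ambiguous", which is exactly the case the rule gives up on.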
Okay, nice! Only character literals use single quotes (I believe longer literals require double quotes), right? If not, one could check against an exhaustive list of attributes. Otherwise, perhaps the following is helpful in achieving that definition?
re.search("('\w(\w+))+", buffer)
>>> l = "a'b'c"
>>> a = "a'bb'cc"
>>> al = "a'bb'c"
>>> print(re.search("('\w(\w+))+", l))
None
>>> print(re.search("('\w(\w+))+", a))
<re.Match object; span=(1, 7), match="'bb'cc">
>>> print(re.search("('\w(\w+))+", al))
<re.Match object; span=(1, 4), match="'bb">
It would catch a chain of attributes as a single compound-attribute, which may be quite neat. The chain will extend until there is a character literal. It'd have to be used recursively if one wants the full tree of literal-attribute accesses.
It doesn't care if there are literals before an attribute:
>>> la = "a'c'bb"
>>> print(re.search("('\w(\w+))+", la))
<re.Match object; span=(3, 6), match="'bb">
Is ☝️ an issue?
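For reference, the same pattern can be applied to every attribute chain in a line with `re.finditer`. This is a sketch only: `ATTRIBUTE_CHAIN` and `attribute_chains` are hypothetical names, and the pattern is written as a raw string with a non-capturing group, which matches the same text as the pattern above:

```python
import re

# A raw string avoids the invalid-escape warning that "\w" triggers in a
# normal string literal on newer Python versions.
ATTRIBUTE_CHAIN = re.compile(r"(?:'\w\w+)+")


def attribute_chains(text: str) -> list[tuple[int, str]]:
    """Return (start offset, matched text) for each attribute chain in `text`."""
    return [(m.start(), m.group(0)) for m in ATTRIBUTE_CHAIN.finditer(text)]


print(attribute_chains("a'b'c"))     # []
print(attribute_chains("a'bb'cc"))   # [(1, "'bb'cc")]
print(attribute_chains("a'c'bb"))    # [(3, "'bb")]
```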
The list of attributes is not limited. Users can define their own attributes, so comparing against such a list doesn't work.
Moreover, a tokenizer doesn't know these details. It just splits the input file into a stream of tokens and tries its best to figure out what kind of token each one is. The Tokenizer in pyVHDLParser already creates more token types than any other parser I know.
A literal is a class of base element in a language:
- keywords (VHDL calls them reserved words)
- identifiers
  - extended identifier
- literals (leaf elements in an expression tree)
  - integer number
  - floating point number
  - character
  - string
  - bitstrings
- operators
- delimiter
- whitespace
- comment
Thus, 125, 45.975, 'c', "hello world", x"0110" are literals.
I extensively edited my last comment, primarily because I thought to research literals. Thanks for exhaustively listing them. I think, though, that you weren't answering the meat of what I was querying...
I think that, from the get-go, none of the conversation has moved towards a solution:
1. I believe that isAlpha(buffer[0:2]) will always fail because buffer[0] is the apostrophe.
2. I believe that the original intention was to ensure that there are at least two alpha characters after the apostrophe (isAlpha(buffer[1:3])), in order to ensure that one is not dealing with a character literal.
3. It still has not been confirmed that the only literal that uses an apostrophe (single quote) is the character literal, which would justify treating it as the only non-attribute ambiguity to guard against.
4. I proposed a superfluous regex to accomplish finding attributes in general, perhaps undermining whatever pre-processing occurred on buffer.
I am trying to share that the tokeniser failed on my file, which used attributes, and I am looking to commit the fix. I think 2. may be more immediate, and I was looking to see if that was your original intention.
I feel that this repo is on hold, so I don't mind if your head is not in the right space to fully consider any fix whatsoever. Just say.
@RocketRoss yes development here is currently very slow due to other activities.
The repo now has >250 test cases and reaches 48% branch coverage. I'm working on improving test coverage and also documentation. While doing so, some bugs were discovered and fixed. Your issue has not yet been investigated.
Regarding a regexp solution: the Tokenizer works without regexps to ensure high performance.
My plan is to work more on this project over the Christmas holidays.