pyVHDLParser Attribute single quote mark errors as "Ambiguous syntax"

I'm quite excited to see what I can make of pyVHDLParser. Using it on a simple enough file, I encounter the following:

================================================================================
                        pyVHDLParser - Test Application
================================================================================
FATAL: An unknown or unhandled exception reached the topmost exception handler!
  Exception type:      TokenizerException
  Exception message:   (line:  15, col: 37): Ambiguous syntax detected. buffer: ''l'
  Caused in:           GetVHDLTokenizer in file '/mnt/d/Development/OpenSource/pyVHDLParser/pyVHDLParser/Token/Parser.py' at line 368
--------------------------------------------------------------------------------
  ...Token/Parser.py", line 368, in GetVHDLTokenizer
    raise TokenizerException("Ambiguous syntax detected. buffer: '{buffer}'".format(buffer=buffer), start)
--------------------------------------------------------------------------------
Please report this bug at GitHub: https://github.com/VLSI-EDA/pyIPCMI/issues
--------------------------------------------------------------------------------

This seems to be triggered by the attribute access 'length. I'll dive into Parser.py to see if I can rectify it.

Thanks!

Jun 10 '20 19:06 radonnachie

pyVHDLParser/Token/Parser.py : 363-365

if ((buffer[0] in __ALPHA_CHARACTERS__) and (buffer[1] in __ALPHA_CHARACTERS__)):
  tokenKind =     cls.TokenKind.AlphaChars
elif ((buffer[0] in __WHITESPACE_CHARACTERS__) and (buffer[1] in __WHITESPACE_CHARACTERS__)):
...

So buffer[0] is the single quote... doesn't seem like any attribute access will survive this. Am I missing something?

Jun 10 '20 19:06 radonnachie

I changed line 362 to

buffer =          buffer[1:3]

from

buffer =          buffer[:2]
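To illustrate why the slice matters, here is a minimal sketch. It assumes the lookahead buffer holds the apostrophe plus the next two characters when the tokenizer reaches 'length; `ALPHA_CHARACTERS` and `both_alpha` are hypothetical stand-ins, not names from pyVHDLParser.

```python
import string

# Hypothetical stand-in for the tokenizer's __ALPHA_CHARACTERS__ set.
ALPHA_CHARACTERS = string.ascii_letters


def both_alpha(pair):
    """Mirror the two-character alpha check quoted from Parser.py."""
    return (len(pair) == 2
            and pair[0] in ALPHA_CHARACTERS
            and pair[1] in ALPHA_CHARACTERS)


# Assumed buffer contents at the point of failure (apostrophe + "le").
lookahead = "'le"

original_check = both_alpha(lookahead[:2])   # inspects "'l"
patched_check = both_alpha(lookahead[1:3])   # inspects "le"

print(original_check, patched_check)  # False True
```

With the original slice the apostrophe itself is tested, so the alpha branch can never be taken.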

Jun 10 '20 19:06 radonnachie

Hi, sorry for the wrong issue link in the error message. It's a pyVHDLParser error, not a pyIPCMI error :).

Parsing attributes is very complex in VHDL and by nature ambiguous. Some parsers (to be exact, lexers/tokenizers) protect themselves by allowing only attribute names longer than 1 character; otherwise an attribute might get mixed up with a character literal such as 'c'.

Examples:

a'b'c => character literal 'b' between a and c
aa'bb'cc => chain of attribute names: cc applied to bb, applied to aa
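The rule behind these two examples can be sketched as a two-character lookahead. This is only an illustration of the "attribute names longer than 1 character" heuristic; `classify_apostrophe` is a hypothetical helper and not pyVHDLParser code.

```python
def classify_apostrophe(text, i):
    """Classify the apostrophe at text[i] with a two-character lookahead."""
    if text[i] != "'":
        raise ValueError("text[i] must be an apostrophe")
    # A character literal is exactly one character between two apostrophes.
    if i + 2 < len(text) and text[i + 2] == "'":
        return "character-literal"
    # Otherwise treat the apostrophe as the tick that introduces an attribute.
    return "attribute-tick"


print(classify_apostrophe("a'b'c", 1))     # character-literal
print(classify_apostrophe("aa'bb'cc", 2))  # attribute-tick
print(classify_apostrophe("aa'bb'cc", 5))  # attribute-tick
```

The heuristic is deliberately simple: a qualified expression such as std_logic'('1') would be misread as the character literal '(', so a real tokenizer needs more context than this.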

Jun 11 '20 23:06 Paebbels

Okay, nice! Only character literals use single quotes, right? (I believe longer literals require double quotes.) If not, one could exhaust the attribute list. Otherwise, perhaps the following is helpful in achieving that definition?

re.search(r"('\w(\w+))+", buffer)

>>> import re
>>> l = "a'b'c"
>>> a = "a'bb'cc"
>>> al = "a'bb'c"
>>> print(re.search(r"('\w(\w+))+", l))
None
>>> print(re.search(r"('\w(\w+))+", a))
<re.Match object; span=(1, 7), match="'bb'cc">
>>> print(re.search(r"('\w(\w+))+", al))
<re.Match object; span=(1, 4), match="'bb">

It would catch a chain of attributes as a single compound-attribute, which may be quite neat. The chain will extend until there is a character literal. It'd have to be used recursively if one wants the full tree of literal-attribute accesses.

It doesn't care if there are literals before an attribute:

>>> la = "a'c'bb"
>>> print(re.search(r"('\w(\w+))+", la))
<re.Match object; span=(3, 6), match="'bb">

Is ☝️ an issue?

Jun 13 '20 07:06 radonnachie

The list of attributes is not limited. Users can define own attributes, thus comparing against such a list doesn't work.

Moreover, a tokenizer doesn't know these details. It just splits the input file into a stream of tokens and tries its best to figure out what kind of token each one is. The Tokenizer in pyVHDLParser already creates more token types than any other parser I know.

Literals are one class of basic element in a language:

keywords (VHDL calls them reserved words)
identifiers
extended identifiers
literals (leaf elements in an expression tree)
- integer numbers
- floating point numbers
- characters
- strings
- bitstrings
operators
delimiters
whitespace
comments

Thus, 125, 45.975, 'c', "hello world", x"0110" are literals.

Jun 13 '20 14:06 Paebbels

I extensively edited my last comment, primarily because I thought to research literals. Thanks for exhaustively listing them. I think, though, that you weren't answering the meat of what I was querying...

I think that, from the get-go, none of the conversation has moved towards a solution:

1. I believe that isAlpha(buffer[0:2]) will always fail because buffer[0] is the apostrophe.
2. I believe that the original intention was to check for at least two alpha characters after the apostrophe (isAlpha(buffer[1:3])), to ensure that one is not dealing with a character literal.
3. It still has not been confirmed that the character literal is the only literal that uses an apostrophe (single quote), which would justify it being the only non-attribute ambiguity to guard against.
4. I proposed a superfluous regex to find attributes in general, perhaps undermining whatever pre-processing occurred on buffer.

I am trying to share that the tokeniser failed on my file, which uses attributes, and I am looking to commit the fix. I think 2. may be more immediate, and I wanted to check whether that was your original intention.

I feel that this repo is on hold, so I don't mind if your head is not in the right space to fully consider any fix whatsoever. Just say.

Jun 14 '20 09:06 radonnachie

@RocketRoss yes, development here is currently very slow due to other activities.

The repo now has >250 test cases and reaches 48% branch coverage. I'm working on improving test coverage and also documentation. While doing so, some bugs were discovered and fixed. Your issue has not yet been investigated.

Regarding a regexp solution: the Tokenizer works without regexps to ensure high performance.
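For comparison, the regex proposed earlier in this thread can be expressed as a plain character scan. The sketch below is not how pyVHDLParser does it; `find_attribute_chain` is a hypothetical function, and its word-character test only approximates `\w`.

```python
def find_attribute_chain(text):
    """Return (start, end) of the first chain of 'name parts whose names
    have at least two word characters, or None if there is no such chain."""
    def is_word(ch):
        # Approximation of the regex class \w.
        return ch.isalnum() or ch == "_"

    for start in range(len(text)):
        if text[start] != "'":
            continue
        end = start
        # Extend the chain while each apostrophe is followed by 2+ word chars.
        while end < len(text) and text[end] == "'":
            j = end + 1
            while j < len(text) and is_word(text[j]):
                j += 1
            if j - (end + 1) < 2:
                break
            end = j
        if end > start:
            return (start, end)
    return None


print(find_attribute_chain("a'b'c"))    # None
print(find_attribute_chain("a'bb'cc"))  # (1, 7)
print(find_attribute_chain("a'c'bb"))   # (3, 6)
```

The spans match those reported by re.search for the same inputs, so the same decision can be made with a single forward pass and no regex engine.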

My plan is to work more on this project during the Christmas holidays.

Nov 28 '20 20:11 Paebbels