
VAL_ parsing error

Open seobilee opened this issue 2 years ago • 6 comments

Hello,

I wanted to bring to your attention an issue I recently discovered with the signal value table in some dbc files. The contents of these files are not parsed properly. The syntax of a dbc file does not necessarily confine a definition to a single line, which causes problems for line-level parsing with Regex.

Here's an example:

VAL_ 134 TEST_BuckleSwitch 0 "Buckled " 1 " Unbuckle " 2 "Not Used" 3 "Not Used  
Default value: 0x0";

This dbc file is written over two lines, and it's actually in use. Notably, this syntax is parsed correctly by Vector CANdb++.

Could you please consider including improvements in this area in your plans? If not, should we provide guidance to writers on writing the dbc file accurately, line by line?

Your assistance in addressing this matter would be greatly appreciated.

Thank you for your time and attention.

seobilee avatar Nov 06 '23 04:11 seobilee

Hello @seobilee ,

you are right; unfortunately, at the moment the library relies on the file being formatted in a specific way, and we expect a command to start at the beginning of a line. Multiline support is present, but only for some commands (e.g. comments).

Could you please consider including improvements in this area in your plans?

Unfortunately, this would not be a simple improvement; it would require a complete rewrite of the parsers. We contribute to this project in our spare time, so I cannot promise this rewrite will happen anytime soon. To be fair, it would be the right thing to do, but it would take a lot of time. In the vast majority of cases, line-by-line parsing works fine.

Bottom line: thanks for spotting this. We'll keep you posted, but realistically nothing will happen in the short term.

Regards A

Adhara3 avatar Nov 06 '23 12:11 Adhara3

Hello @Adhara3,

I want to express my gratitude for your prompt response.

In light of your guidance, I will make the necessary adjustments by writing the dbc file in one line to address the issue. Additionally, your information regarding INextLineProvider has been invaluable in this regard.

I also wanted to inform you that, due to the circumstances mentioned, we are presently utilizing tag version 1.2.0.

I am eager to contribute to a resolution code-wise and will explore opportunities to do so.

Once again, thank you for your assistance.

Warm regards S

seobilee avatar Nov 07 '23 01:11 seobilee

Hello @seobilee ,

sorry, could you please elaborate on why you are using version 1.2.0? Is it related to the issue above? Or is there another reason?

Regards A

Adhara3 avatar Nov 09 '23 15:11 Adhara3

Hello @Adhara3

Yes, the observed issue is indeed connected.

The parser's behavior changed starting with version 1.3.0. The change primarily involves parsing with regular expressions. In version 1.2.0, a line is parsed even if it does not end with ";".

        private const string ValueTableLinkParsingRegex = @"VAL_\s+(\d+)\s+([a-zA-Z_][\w]*)\s+([a-zA-Z_][\w]*)\s*;";
        private const string ValueTableParsingRegex = @"VAL_\s+(?:(?:(\d+)\s+([a-zA-Z_][\w]*))|([a-zA-Z_][\w]*))\s+((?:(?:-?\d+)\s+(?:""[^""]*"")\s+)*)\s*;";

To illustrate, consider the following example:

VAL_ 134 TEST_BuckleSwitch 0 "Buckled " 1 " Unbuckle " 2 "Not Used" 3 "Not Used  
Default value: 0x0";

In version 1.2.0, only the first line is parsed as a VALUE TABLE, with the second line being ignored. Later versions, however, cannot parse the above example at all, and there are some additional minor issues, such as quotes (") being included in the values of the value table's Dictionary.
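To make the failure concrete, here is a minimal sketch in Python (the pattern is the library's C# constant from above with the doubled "" escaping collapsed to plain quotes; Python's re engine accepts it unchanged). Neither physical line of the two-line example matches on its own, which is exactly the line-level parsing problem described:

```python
import re

# The ValueTableParsingRegex from v1.3.0+, C# ""-escaping collapsed to plain quotes
PATTERN = r'VAL_\s+(?:(?:(\d+)\s+([a-zA-Z_]\w*))|([a-zA-Z_]\w*))\s+((?:(?:-?\d+)\s+(?:"[^"]*")\s+)*)\s*;'

lines = [
    'VAL_ 134 TEST_BuckleSwitch 0 "Buckled " 1 " Unbuckle " 2 "Not Used" 3 "Not Used',
    'Default value: 0x0";',
]

# Neither physical line matches on its own: the first lacks the terminating ';',
# the second is not a VAL_ statement at all.
for line in lines:
    assert re.search(PATTERN, line) is None
```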

Regards S

seobilee avatar Nov 11 '23 18:11 seobilee

Hi @seobilee,

you are right: since version v1.3.0, regex has been used in all parsers to make parsing more robust and efficient, hence the problem you are facing. However, several bugs have been fixed and new features added in the latest versions (quotes are no longer included in the dictionary value), so if you have the opportunity to modify your dbc, I suggest you update to the latest version.

Regards

Whitehouse112 avatar Nov 13 '23 13:11 Whitehouse112

I just did a quick check to see whether general multiline support could be added to the parser (starting with the VAL_ parser) without completely restructuring it and with minimal effort. It's probably not an amazing solution, though.

I came up with the following (a proof of concept):

        private static readonly IReadOnlyCollection<string> keywords = new List<string> { "SG_", "CM_", "BO_", "BA_", "VAL_" }; // Incomplete list, just a mock

        private static string TryparseNextLines(string currentLine, IParseFailureObserver observer, INextLineProvider nextLineProvider)
        {
            var stringBuilder = new StringBuilder();
            stringBuilder.AppendLine(currentLine);

            while (nextLineProvider.TryGetLine(out var nextLine))
            {
                var cleanLine = nextLine.Trim(' ');

                if (cleanLine == string.Empty)
                {
                    observer.CurrentLine++;
                    break;
                }

                if (keywords.Any(keyword => cleanLine.StartsWith(keyword)))
                {
                    break;
                }

                observer.CurrentLine++;
                stringBuilder.Append(" " + cleanLine);

                if (cleanLine.EndsWith(";"))
                {
                    break;
                }
            }
            return stringBuilder.ToString().Replace("\r\n", string.Empty) + "\r\n";
        }

This reads lines until it encounters an empty line, a line ending with ";", or a line starting the next definition (marked by a keyword).

If you then use

            if (!cleanLine.EndsWith(";"))
            {
                cleanLine = TryparseNextLines(cleanLine, m_observer, nextLineProvider);
            }

before the actual regex match in the ValueTableLineParser, it combines the two rows into a single line for parsing.

This worked for me in some tests; however, I noticed that:

  1. Looking at the Vector dbc documentation file present in the solution, I'm not sure a VAL_ even must end with ";"; this is not explicitly marked as required. If it were required, you could just use the same logic as in the comment parser and read until a line ends with ";".
  2. Parsing the exact string given in this issue doesn't work even on a single line. The problem seems to be that the line must end with at least one " " (space) before the ";". This can be checked with the "ExplicitValTableIsAppliedTest" test case by removing the space there. I'm not sure why, because the regex actually specifies zero or more spaces before the ";", but zero doesn't work.
  3. Even if I add the space so the line gets parsed, the default value is just appended to the last value's string. I'm not sure if that's intended. Is the default value here just a comment, or is it something that needs to be parsed separately?

@Whitehouse112 could you check on 1 and 2 when you find time?

Uight avatar Aug 11 '24 11:08 Uight

So I looked into this further, especially the fact that VAL_ 134 TEST_BuckleSwitch 0 "Buckled " 1 " Unbuckle " 2 "Not Used" 3 "Not Used Default value: 0x0"; (notably on a single line) wouldn't get parsed.

There is actually a test case currently present in ValueTableLineParserTests.cs that even checks that this does not get parsed: [TestCase("VAL_ 869 qGearboxOil 0 "Running" 1 "Idle";")] public void ValueTableSyntaxErrorIsObserved(string line)

I don't know why this test case is there, as the syntax is valid and just missing a final " ", which isn't required in my opinion.

This is the parsing regex: @"VAL_\s+(?:(?:(\d+)\s+([a-zA-Z_][\w]*))|([a-zA-Z_][\w]*))\s+((?:(?:-?\d+)\s+(?:""[^""]*"")\s+)*)\s*;". The problem is the \s+ after each quoted string (just before the closing )*): it only matches the final value-table entry if at least one space follows it.

You can fix that by replacing that \s+ with \s*, and the only failing test is then the one listed above, which in my opinion is wrong anyway. All other tests pass.
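The trailing-space behavior is easy to reproduce outside the library. Here is a sketch in Python (the pattern carries over from the C# constant with the "" escaping collapsed); note the \s* variant shown is just the change proposed in this comment, not necessarily the fix that was eventually merged:

```python
import re

# Current pattern: each key-value pair must be followed by at least one space (\s+)
PATTERN = r'VAL_\s+(?:(?:(\d+)\s+([a-zA-Z_]\w*))|([a-zA-Z_]\w*))\s+((?:(?:-?\d+)\s+(?:"[^"]*")\s+)*)\s*;'
# Same pattern with the \s+ after each quoted string relaxed to \s*
RELAXED = r'VAL_\s+(?:(?:(\d+)\s+([a-zA-Z_]\w*))|([a-zA-Z_]\w*))\s+((?:(?:-?\d+)\s+(?:"[^"]*")\s*)*)\s*;'

no_space = 'VAL_ 869 qGearboxOil 0 "Running" 1 "Idle";'
with_space = 'VAL_ 869 qGearboxOil 0 "Running" 1 "Idle" ;'

assert re.search(PATTERN, no_space) is None       # fails without the space before ';'
assert re.search(PATTERN, with_space) is not None # passes with it
assert re.search(RELAXED, no_space) is not None   # \s* accepts both forms
assert re.search(RELAXED, with_space) is not None
```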

For the test case, I would change it to something like the following, which in my opinion is a bit clearer than using the strict mock behavior. But doing this for all tests would be time-consuming.

        public void ValueTableSyntaxErrorIsObserved(string line)
        {
            var observerMock = m_repository.Create<IParseFailureObserver>();
            var dbcBuilderMock = m_repository.Create<IDbcBuilder>();
            var nextLineProviderMock = m_repository.Create<INextLineProvider>();

            int counter = 0;
            observerMock.Setup(o => o.ValueTableSyntaxError())
                .Callback(() => counter++);

            var lineParser = new ValueTableLineParser(observerMock.Object);
            lineParser.TryParse(line, dbcBuilderMock.Object, nextLineProviderMock.Object);
            Assert.That(counter, Is.EqualTo(1));
        }

This part of the issue could be solved. The other part, multiline support, is just a matter of whether the ";" is actually required: if it is, use the same code as the comment multiline handling; if it isn't, use the code I posted before.

@Adhara3, @Whitehouse112 or @EFeru, could one of you check my point above about the test case actually being valid? In that case I could create a fix and change the test case (move it to the valid test cases and use \s* instead). I could also do the multiline support in the same PR if someone tells me whether the ";" is required.

Uight avatar Aug 27 '24 19:08 Uight

Ok, let's stop a second here.

VAL_ parsing without space before ";"

This is a bug in the regex due to the fact that VAL_ contains a list of items. @Whitehouse112 will provide a fix in a dedicated PR

Multiline support

History

In the past we allowed multiline support in an awkward way, but for a very specific reason and in a very specific scenario:

CM_ EV_ envVarData "We would like to format this comment, with a sort of bulletlist:
* Item 1
* Item 2
* Item 3

Ok, done!";

This comment is somehow formatted; the spacing and the newlines have a meaning for the user and help visualization. Moreover, a comment text is delimited by "" and what is inside is considered free text. Both CANdb++ and the Kvaser db editor preserve the formatting. This has a value we wanted to preserve. Implementation-wise, the solution was awkward, and we realized that the line-by-line parsing was weak.

Other examples

The current issue is about a TAG (VAL_) spreading over multiple lines. It is not text spreading; it's the fact that a definition is split over multiple lines:

VAL_ 134 TEST_BuckleSwitch 0 "Buckled " 1 " Unbuckle " 2 "Not Used" 3 "Not Used  
Default value: 0x0";

Note that VAL_ does end with ";", so a solution similar to comments could be implemented, but what about this?

BO_ 1160 DAS_steeringControl: 4 NEO
 SG_ DAS_steeringControlType : 23|2@0+ 
 (1,0) [0|0] ""  EPAS
 SG_ DAS_steeringControlChecksum : 31|8@0+ (1,0) [0|0] ""  EPAS, OTHER
 SG_ DAS_steeringControlCounter : 19|4@0+ (1,0) [0|0] ""  EPAS

The first SG_ definition spans multiple lines, but in this case no ";" is required. So?

NB: Also note that both the above examples

  • are not explicitly mentioned as allowed or forbidden by the doc
  • are tolerated by CANdb++
  • are not tolerated by Kvaser that fails parsing

Even more

What about this, a single line containing 2 separate definitions?

VAL_ 1160 DAS_steeringControlType 1 "ANGLE_CONTROL" 3 "DISABLED" 0 "NONE" 2 "RESERVED" ; VAL_ 1160 DAS_steeringAngleRequest 16384 "ZERO_ANGLE" ;

We currently skip the second one, both CANdb++ and Kvaser parse both correctly. Having two items in a line is the flip side of the multiline support coin.
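The asymmetry is easy to see in a quick sketch (Python, pattern taken from the library's C# constant with the escaping collapsed): scanning for all matches finds both definitions, while a single anchored parse sees only the first.

```python
import re

PATTERN = r'VAL_\s+(?:(?:(\d+)\s+([a-zA-Z_]\w*))|([a-zA-Z_]\w*))\s+((?:(?:-?\d+)\s+(?:"[^"]*")\s+)*)\s*;'

line = ('VAL_ 1160 DAS_steeringControlType 1 "ANGLE_CONTROL" 3 "DISABLED" '
        '0 "NONE" 2 "RESERVED" ; '
        'VAL_ 1160 DAS_steeringAngleRequest 16384 "ZERO_ANGLE" ;')

# Scanning the whole line finds both definitions...
matches = list(re.finditer(PATTERN, line))
assert len(matches) == 2

# ...but a single match, the "one definition per line" assumption, sees only the first.
first = re.search(PATTERN, line)
assert first.group(2) == 'DAS_steeringControlType'
```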

Bottom line

DbcParserLib currently relies on some sort of line-based formatting that is respected in most of the examples we found on the net, including the ones we use for unit tests. Moreover, the spacing/newlines are also applied by CANdb++ when it saves a dbc file. We are basically assuming that there is value in the file formatting. The doc is not clear on this, and even CANdb++ and Kvaser behave differently. If we want to provide full multiline support, we then need to change the way we parse the file: stop reading line by line, start reading by block, convert the parser into a sort of state machine, and look for starting tags anywhere, not only at the beginning of a line.

So the current issue spots a potential parsing problem that is nevertheless coherent with the current library approach, which requires a dbc file to be gracefully formatted. Good or bad, this is currently a requirement for us. If we want to fix this, we should fix it in a comprehensive way; I would avoid fixing each parser separately. The problem is not the single parser, the problem is the overall parsing approach.

So to me this issue is currently a won't-fix. If we want to support unformatted dbc files, then we need to (hopefully only partially) rewrite the parsing approach, no longer treating a line as something valuable, meaningful, or a complete/single item.

Cheers Andrea

Adhara3 avatar Aug 28 '24 09:08 Adhara3

Hi @Uight, thanks for spotting the regex parsing issue; my focus was on the multiline issue and I didn't spot the bug beside it. I've already submitted a PR:

  • The VAL_ parsing regex has been modified to correctly read the last key-value pair as a separate matching group. The proposed change (replacing \s+ with \s*) would cause the following line to be matched even though it's wrong:

    "VAL_ 134 TEST_BuckleSwitch 0 "Buckled " 1 " Unbuckle " 2 "Not Used"3 "Not Used_"

  • Added a test that addresses the above use case

  • Added a test to check the parsing of a single key-value pair value table since the last pair is now treated as a separate regex match group

  • Removed the failing test you spotted above: you are right about that, the final " " (space) is not required

Regarding ";": it's a required final char, as stated in the Vector dbc file format documentation.


Whitehouse112 avatar Aug 28 '24 10:08 Whitehouse112

@Adhara3 I would think that Vector mostly scans for the keywords, a bit like what I tried with the comment approach from two weeks ago, but char-based instead of line-based.

Uight avatar Aug 28 '24 15:08 Uight

@Adhara3 and @Whitehouse112

This is 13-year-old code lying around in one of my company's old test tools. I adjusted it for .NET and to read the two dbc files that are in this repo. The read is pretty much done like this:

string fileContent = File.ReadAllText(filePath);
string[] keywords = new[] {
    "VERSION",
    "FILTER",

    "NS_DESC_",
    "NS_",

    "CM_",

    "BA_DEF_DEF_REL_",
    "BA_DEF_REL_",
    "BA_REL_",
    "BA_DEF_SGTYPE_",
    "BA_SGTYPE_",
    "BA_DEF_DEF_",
    "BA_DEF_",
    "BA_",

    "CAT_DEF_",
    "CAT_",

    "SGTYPE_VAL_",
    "SGTYPE_",

    "SIGTYPE_VALTYPE_",

    "VAL_TABLE_",
    "VAL_",

    "SIG_GROUP_",
    "SIG_VALTYPE_",
    "SIG_TYPE_REF_",

    "EV_DATA_",
    "ENVVAR_DATA_",

    "BO_TX_BU_",
    "BO_",

    "BU_SG_REL_",
    "BU_EV_REL_",
    "BU_BO_REL_",
    "BU_",

    "SG_MUL_VAL_",
    "SG_",

    "BS_",
};

string pattern = $@"(?={string.Join("|", keywords.Select(Regex.Escape))})";

string[] parts = Regex.Split(fileContent, pattern, RegexOptions.IgnoreCase);

List<string> resultLines = new List<string>();

foreach (string part in parts)
{
    resultLines.Add(part.Trim());
}

I don't like that it reads the whole file in one swoop, but on the PCs that run modern .NET I wouldn't expect an issue even with big files. In fact, I tried it with a 2.5 MB file that I'm not allowed to share, and it takes 1.8 seconds just for the line splitting. It seems to take around 1 s per MB, but it's not fully linear. For the Tesla dbc it takes 50 ms.

Another thing I don't really like is that you have to order the keywords.

If this could be done with a file stream instead, I think it would be a good option. You could then parse the resulting "lines" with the current parsers.

Another thing is input like this: CM_ SG_ 961 COUNTER_ALT "only increments on change"; which would parse as an empty comment and then as a signal, which is total nonsense ;) There must be some logic combining keywords with the ";" end-of-definition handling in Vector's solution.
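The CM_ SG_ pitfall is easy to reproduce with the same lookahead split (a Python sketch of the approach above, with a reduced keyword list purely for illustration): a context-free keyword split cuts the comment in two.

```python
import re

# A small subset of the keyword list above, purely for illustration
keywords = ["CM_", "SG_", "BO_", "VAL_"]
pattern = "(?=" + "|".join(map(re.escape, keywords)) + ")"

line = 'CM_ SG_ 961 COUNTER_ALT "only increments on change";'
parts = [p for p in re.split(pattern, line) if p]

# The split has no notion of context, so the SG_ inside the comment's scope
# starts a bogus new part: an "empty" comment followed by a fake signal.
assert parts == ['CM_ ', 'SG_ 961 COUNTER_ALT "only increments on change";']
```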

Uight avatar Aug 28 '24 16:08 Uight

There are tons of subtle problems with this approach, like the CM_ SG_ example you provided: basically, parsing is not just a matter of finding strings but of finding strings in a context, which is something Regex is not designed for. The SG_ part in your example is in the scope (context) of a CM_, so there is value in the order. Basically, a state machine.

Loading the whole file should not be a problem, but you would lose the ability to work over a network. We switched to Stream so that you could parse a DBC while downloading it. We may say "who cares", anyway.

A

Adhara3 avatar Aug 28 '24 20:08 Adhara3

Closing as won't fix now

Adhara3 avatar Aug 29 '24 07:08 Adhara3

@Adhara3 I would still consider making some improvements to the code for this.

I did some stuff in: https://github.com/Uight/DbcParser/tree/TestBranchMultiLineSupport

In this branch I moved the m_observer.CurrentLine++ logic to the NextLineProvider and generally added some things there. I'm not too sure about the virtual-line handling I added, because it still requires that ";" is always used as a terminator; I could imagine it appearing inside a comment, but I'm not sure whether the EBNF allows that at all.

However, moving the Trim and m_observer.CurrentLine++ logic to the NextLineProvider is a cleaner solution than what we have now, and I would still like to see whether we could adjust the NextLineProvider in a way similar to what I did in my branch (which was partly based on this: https://stackoverflow.com/questions/842465/reading-a-line-from-a-streamreader-without-consuming). I think some parsers could definitely profit from peeking at the next line. E.g., if the next line is empty, the definition always ends there, and if the next line starts with another definition, you also know that the current definition is finished. It doesn't mean we would want to implement multiline support or multiple definitions per line right away, but I think it could be a preparation.
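The peek-before-consume idea can be sketched generically (Python, purely illustrative of the pattern; this is not the branch's actual C# NextLineProvider):

```python
class PeekableLineProvider:
    """Wraps an iterable of lines and allows peeking without consuming."""

    def __init__(self, lines):
        self._it = iter(lines)
        self._peeked = None
        self._has_peeked = False

    def peek(self):
        # Look at the next line without consuming it; None means end of input.
        if not self._has_peeked:
            self._peeked = next(self._it, None)
            self._has_peeked = True
        return self._peeked

    def next_line(self):
        # Consume the next line, honoring a pending peek.
        if self._has_peeked:
            self._has_peeked = False
            return self._peeked
        return next(self._it, None)


provider = PeekableLineProvider(['VAL_ 1 A 0 "x"', "", "BO_ 2 Msg: 8 ECU"])
assert provider.peek() == 'VAL_ 1 A 0 "x"'       # peek does not consume
assert provider.next_line() == 'VAL_ 1 A 0 "x"'
assert provider.peek() == ""                      # an empty line ends the definition
```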

Using virtual lines like in my code, this would actually be one line containing two virtual lines: VAL_ 1160 DAS_steeringControlType 1 "ANGLE_CONTROL" 3 "DISABLED" 0 "NONE" 2 "RESERVED" ; VAL_ 1160 DAS_steeringAngleRequest 16384 "ZERO_ANGLE" ;

This could be parsed by peeking at the next line and checking whether it starts with a definition identifier. This is based on the assumption that in the EBNF you cannot have a definition identifier at the start of a line if it's not a definition (at least that's what I think): BO_ 1160 DAS_steeringControl: 4 NEO SG_ DAS_steeringControlType : 23|2@0+ (1,0) [0|0] "" EPAS SG_ DAS_steeringControlChecksum : 31|8@0+ (1,0) [0|0] "" EPAS, OTHER SG_ DAS_steeringControlCounter : 19|4@0+ (1,0) [0|0] "" EPAS

The case below shows two things I'm not sure about: would the comment end at "Item 2;" in this case? And would this comment fail parsing because the second line of the comment starts with an identifier? (If not, that would mean a comment is actually a special case where you just read until ";".)

CM_ EV_ envVarData "We would like to format this comment, with a sort of bulletlist: SG_ Signal is a Signal definition

  • Item 2;
  • Item 3

Ok, done!";

Uight avatar Aug 29 '24 08:08 Uight

I personally would not mess with the code for a partial solution. But do whatever you like; after all, I'm not the owner.

A

Adhara3 avatar Aug 29 '24 10:08 Adhara3

@Adhara3 I believe I could make it work as a general solution. At least I like a challenge ;)

https://github.com/Uight/DbcParser/tree/TestBranchMultiLineSupport

I did some work in this test branch, moving multiline support fully into the NextLineProvider and removing it from the comment parser. It also supports multiple definitions per line.

Some tests fail at the moment. These fail because they use an array-based next-line provider that is only defined in the tests and doesn't have the changes I made to the NextLineProvider.

Another test fails because it's a strange test case: it could only happen if CM_ is the very last entry. It's easy enough to fix by just checking that the line ends with ";". However, returning false would still be strange, as the line is a comment after all, just ill-formatted.

And then there's a test that fails because it currently expects line breaks, which I remove when concatenating multi-lines.

Overall it would be easy to fix these problems (for the second one I would change the test case). As I also moved the string.Trim() calls to the NextLineProvider, you could remove them from all LineParsers, or better, replace them with something that replaces "\r\n" with " "; that way all parsers would have multiline support. For comments, if you want to keep "\r\n", just don't remove it there.

@Whitehouse112 and @Adhara3, I'll add some of the cases Whitehouse112 mentioned in the comments above at some point. Are there other test cases you would like?

Uight avatar Aug 31 '24 09:08 Uight

@Uight I have no problem with you fixing this; my concern is timeline, milestones, small incremental releases, and above all having the time to manage/review/organize changes.

We now have: ext mux, immutability (with API changes), advanced reading, several fixes, multiline support, and probably even more I do not remember. That is huge.

All of that is great stuff, but I can't handle all this burden alone, this being a spare-time collaboration.

So please, slow down. Let's first release what we have now; then we merge your multiline support, if you will, and release just that; finally, we create a branch for the milestone API 2.0 in which I merge immutability, you put ext mux, and then we see.

Can we agree on this? Thanks A

Adhara3 avatar Aug 31 '24 10:08 Adhara3

@Adhara3 it's all fine for me. You don't have to review changes right away; I would also have no problem merging my stuff a few times. My timeline for needing the message stuff is probably end of this year or early next year anyway. And with the other stuff, I just do it to get more into programming again, and it's always interesting to build on codebases you're not used to.

Anyway: I kind of finished my multiline branch. It successfully runs through all tests, including new ones with the test cases you mentioned:

VAL_ 1160 DAS_steeringControlType 1 "ANGLE_CONTROL" 3 "DISABLED" 0 "NONE" 2 "RESERVED" ; VAL_ 1160 DAS_steeringAngleRequest 16384 "ZERO_ANGLE" ;

BO_ 1160 DAS_steeringControl: 4 NEO SG_ DAS_steeringControlType : 23|2@0+ (1,0) [0|0] "" EPAS SG_ DAS_steeringControlChecksum : 31|8@0+ (1,0) [0|0] "" EPAS, OTHER SG_ DAS_steeringControlCounter : 19|4@0+ (1,0) [0|0] "" EPAS

VAL_ 1160 DAS_steeringControlType 1 "ANGLE_CONTROL" 3 "DISABLED" 0 "NONE" 2 "RESERVED" ; VAL_ 1160 DAS_steeringAngleRequest 16384 "ZERO_ANGLE" ;

All of that gets parsed correctly. The string for the value table is currently checked against: Assert.That(dbc.Messages.First().Signals.First().ValueTableMap.Last().Value, Is.EqualTo("Not Used\r\nDefault value: 0x0"));

Notice that the "\r\n" is still in there. I can easily fix that by replacing var cleanLine = line.Trim(' '); in every parser with something that replaces "\r\n" with " ". My line trimming is done in the NextLineProvider anyway. However, this causes problems with tests that have leading " ", which cannot happen in the real code, as the NextLineProvider already cleans it up. I think I should adjust those cases to not have leading " ".

But then there are some really strange test cases (here in MessageLineParserTests.cs):

        [Test]
        public void OnlyPrefixIsIgnored()
        {
            var dbcBuilderMock = m_repository.Create<IDbcBuilder>();
            var messageLineParser = CreateParser();
            var nextLineProviderMock = m_repository.Create<INextLineProvider>();

            Assert.That(messageLineParser.TryParse("BO_ ", dbcBuilderMock.Object, nextLineProviderMock.Object), Is.False);
        }

I have no idea why this is not failing at the moment. The line clearly starts with BO_, so why is it not detected? When I remove the Trim and replace it with the "\r\n"-to-" " replacement, it fails, which is completely logical to me.

As for the comment parsing, which I did first, I already changed the test to the following, which would be right in my opinion:

        [Test]
        // Should parse as it is a comment but should be observed as error
        // This however would be catched previously by the IgnoreLineParser
        public void OnlyPrefixIsIgnored()
        {
            var dbcBuilderMock = m_repository.Create<IDbcBuilder>();

            var counter = 0;
            var failureObserverMock = new Mock<IParseFailureObserver>();
            failureObserverMock
                .Setup(observer => observer.CommentSyntaxError())
                .Callback(() => counter++);

            var commentLineParser = new CommentLineParser(failureObserverMock.Object);

            var nextLineProviderMock = m_repository.Create<INextLineProvider>();

            Assert.That(commentLineParser.TryParse("CM_ ", dbcBuilderMock.Object, nextLineProviderMock.Object), Is.True);
            Assert.That(counter, Is.EqualTo(1));
        }

Edit: there's a cheeky space at the end of MessageLineStarter = "BO_ "; which causes the trimmed string to mismatch, as the final " " is missing. However, the test is still a bit flawed, I think.

Also, more test cases might be appropriate, and if you find time you can do a code review whenever you like.

Uight avatar Aug 31 '24 13:08 Uight

Just a note: do not trust \r\n if you want to replace it; a file may have been saved with another newline convention (e.g. Unix).

Cheers A

Adhara3 avatar Aug 31 '24 14:08 Adhara3

@Adhara3 I wanted to use String.ReplaceLineEndings(), but I can't, as it's not available in .NET Standard and .NET 4.6.2. I would, however, handle pretty much everything according to its remarks:

The list of recognized newline sequences is CR (U+000D), LF (U+000A), CRLF (U+000D U+000A), NEL (U+0085), LS (U+2028), FF (U+000C), and PS (U+2029). This list is given by the Unicode Standard, Sec. 5.8, Recommendation R4 and Table 5-2.

This method is guaranteed O(n * r) complexity, where n is the length of the input string, and where r is the length of replacementText.

Then I wanted to use System.Environment.NewLine, but that would only work if the file was created on the same kind of system. Still searching at the moment, also for a better way to do the next-line parse check; there must be something faster than that.

By the way, I have now changed all tests and code and checked against more dbc files from https://github.com/jamiejones85/DBC-files; the only failure is BU_ being empty in some cases, which we have a ticket for.

Edit: Did this now:

        // The set of newline characters was taken from the internals of the "String.ReplaceLineEndings" method.
        private const string NewLineCharsExceptLineFeed = "\r\f\u0085\u2028\u2029\n";
        private static readonly string pattern = $"[{Regex.Escape(NewLineCharsExceptLineFeed)}]+";

        public static string ReplaceNewlinesWithSpace(this string input)
        {
            // Would like to use "String.ReplaceLineEndings" but its unavailable because of the target frameworks
            // Feel free to optimize
            return Regex.Replace(input, pattern, " ");
        }

I also added a test using \n only, which works.
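The same normalization can be cross-checked with a small Python sketch (the character class mirrors the C# constant above; this is only an illustration, not the library code):

```python
import re

# Mirrors the newline characters handled by the C# snippet above,
# including LF, which the C# code appends to its constant
NEWLINE_CLASS = "[\r\f\x85\u2028\u2029\n]+"

def replace_newlines_with_space(text: str) -> str:
    # Collapse any run of recognized newline characters into a single space
    return re.sub(NEWLINE_CLASS, " ", text)

assert replace_newlines_with_space("a\r\nb") == "a b"     # CRLF collapses to one space
assert replace_newlines_with_space("a\nb\nc") == "a b c"  # LF-only files work too
assert replace_newlines_with_space("a\u2028b") == "a b"   # Unicode line separator
```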

Uight avatar Aug 31 '24 14:08 Uight

I got the multiline parser running with the read-to-next-definition method (WIP). Some things are unclear at the moment:

Is it valid for a comment to have a line ending with ";"?

var dbcString = @"CM_ SG_ 75 channelName ""This is the first line; this is the second line this is the third line"";";

This would fail with the current parser, but it should be possible according to @Adhara3's comment, where he stated that ";" can occur anywhere and is therefore not usable as definite syntax in a comment. If a comment can contain a ";", why not at the end?

And then there are some additional cases I didn't bother to implement (yet):

var dbcString = @"CM_ SG_ 75 channelName ""This is the first line this is the second line""; BO_ some correctly formatted message";

I assume this would be possible but very unlikely. (At the moment, when in multiline mode, I don't check for multiple definitions either.) This applies to everything, so also to something like this:

var dbcString = @"VAL_ 134 TEST_BuckleSwitch 0 ""Buckled "" 1 "" Unbuckle "" 2 ""Not Used"" 3 ""Not Used
Default value: 0x0""; VAL_ 1160 DAS_steeringControlType 1 ""ANGLE_CONTROL"" 3 ""DISABLED"" 0 ""NONE"" 2 ""RESERVED"" ";

These cases could be handled, but are they even allowed?

var dbcString = @"CM_ SG_ 75 channelName ""This is is a comment containing a ; char""; BO_ some correctly formatted message";

Reading the first ";" in the line and seeing that it's not followed by a keyword, I assume the whole line is a comment.

Another quirk of this solution is visible in this case:

var dbcString = @"BU_ Node1, Node1

BO_ some correctly formatted message";

This would be a parsing failure due to the duplicated nodes in line one. However, the observer would report the error on line two, because the empty line is read together with the first definition (needed for comments containing empty lines). Remark: in my opinion, this cannot be handled without removing the LineProvider in large parts and giving full control to the individual parsers. Or: we could agree that an empty line always terminates a definition, except for comments, in which case I could handle that explicitly in the comment parser.

Uight avatar Sep 01 '24 16:09 Uight

From the docs, this is comment definition

comment = 'CM_' (char_string |
         'BU_' node_name char_string |
         'BO_' message_id char_string |
         'SG_' message_id signal_name char_string |
         'EV_' env_var_name char_string)
         ';' ; 

So comment text is defined as a char_string which is defined as follows

char_string: an arbitrary string consisting of any printable characters except double hyphens ('"') and backslashes ('\'). Control
        characters like Line Feed, Horizontal Tab, etc. are tolerated,
        but their interpretation depends on the application.

The CANdb++ editor e.g. interprets the character combination ‘0x0D 0x0A’ as a Microsoft Windows specific newline
and displays this information accordingly in the comment dialog page of the objects. In contrast to this behavior in the
‘Unit’-property field of signals the newline couldn’t be interpreted meaningful. 

So any char seems to be tolerated except " and \, thus ";" is valid (no limits on positioning are specified, so I guess it could be at the end too). Please note that in the spec the ";" MUST close the comment tag.

In my opinion, as already stated, "support multiline" isn't the right definition of what we need; the right one would be "parse treating newlines and indenting as a human-only thing", which would imply supporting both multiline and multiple definitions per line (the other side of the coin). To me, any work that does not lead to both is of little value. After all, we had just one report of a failing multiline definition, and even going through all the dbcs in the repo didn't highlight a high occurrence rate, so a multiline definition is no more likely to occur than multiple definitions per line.

And when I say worthless, I know that every improvement is welcome, but if the user's pattern is:

  1. I see a failure in DbcParserLib
  2. I open the dbc with CANdb++
  3. I see there it works
  4. DbcParserLib is then bugged, I open an issue

then a whole core-parser rewrite that fixes only half of the problem has a low cost-benefit ratio to me.

The current implementation lacks this feature in a coherent way, which is bad but also good (the coherent part), so it could be listed among known issues, knowing that the cases where this happens are quite few. On the other hand, we are supporting comment text the right way, even more, the Vector CANdb++ way, which is good.

A

Adhara3 avatar Sep 02 '24 07:09 Adhara3

@Adhara3 I have it running in https://github.com/Uight/DbcParser/tree/TestBranchMultiLineSupport. It can parse all the cases listed here, and the observer reports the right error lines. Pretty much, it's working.

I have no idea what the current performance is compared to the old code; I didn't check that. (I would guess bad, given the number of string comparisons => maybe improvable?)

You should maybe check out the code at some point, but my opinion would be to probably not use it. I think you'll get it when you see the code. I mean, it's readable, but the NextLineProvider is a complex mess (it provides all the lines already put together for the other parsers to use). At least it's pretty much only the NextLineProvider that changed. I'm not sure anybody wants something like this in an open-source codebase; it's just not maintainable. Maybe there are ways to make it more readable, but I'm not sure.

If we decide not to use it, we could move it to a stale branch or just delete it. Maybe keep the test cases around in case someone wants to try again at some point, and then remove it from 1.8.

Edit: it's a bit better now ;) I removed some stuff that wasn't needed anymore.

Another edit: if we were to use this, I would also change

        internal interface ILineParser
        {
            bool TryParse(string line, IDbcBuilder builder, INextLineProvider nextLineProvider);
        }

to:

        internal interface ILineParser
        {
            // Only checks whether the line should be parsed by this parser
            bool TryParse(string line);

            // Actually parses the line and returns false if not parsable
            // (kept similar so test changes aren't too big)
            bool Parse(string line, IDbcBuilder builder);
        }

This would allow cleaning up line parsing even more, as you could Trim() and remove newline chars depending on the parser that is going to handle the line, meaning each line parser can fully focus on parsing.

Uight avatar Sep 02 '24 20:09 Uight

Dear all,

I updated the home page readme to specifically address this issue, and there are ongoing discussions about how to improve parsing. I'm closing this specific issue to avoid noise.

Adhara3 avatar Sep 08 '24 14:09 Adhara3