UnescapeAttrVal

Open 89z opened this issue 4 years ago • 8 comments

I am looking for something to get attribute values without quotes. Currently I am using this:

package parse

import (
   "bytes"
   "html"
)

func getAttr(b []byte) []byte {
   b = bytes.Trim(b, `'"`)
   s := html.UnescapeString(string(b))
   return []byte(s)
}

I found the EscapeAttrVal function [1]:

EscapeAttrVal returns the escaped attribute value bytes without quotes.

but I am confused. It says "without quotes", but it actually adds quotes:

https://github.com/tdewolff/parse/blob/f1ab946d07f836657ec47972229c77c0efa3b3bd/html/util_test.go#L17-L25

  1. https://godocs.io/github.com/tdewolff/parse/v2/html#EscapeAttrVal

89z avatar Oct 10 '21 13:10 89z

You're right that the comment is inaccurate. I've fixed the documentation, but note that the function has a different purpose than what you're looking for. If you want to remove quotes, you'll need to check whether the first character is a single or double quote and, if so, remove the first and last characters. The value may still contain HTML entities (such as &#39;) that were present in the original markup.
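
The quote check described above could be sketched roughly as follows. This is only an illustration under assumptions: stripQuotes is a hypothetical helper name, not a function from tdewolff/parse, and the input is assumed to be a raw attribute value that may or may not still carry its surrounding quotes.

```go
package main

import "fmt"

// stripQuotes is a hypothetical helper (not part of tdewolff/parse).
// It removes exactly one pair of matching surrounding quotes, if present,
// and leaves unquoted values untouched.
func stripQuotes(b []byte) []byte {
	if len(b) >= 2 && (b[0] == '"' || b[0] == '\'') && b[len(b)-1] == b[0] {
		return b[1 : len(b)-1]
	}
	return b
}

func main() {
	// Sample inputs are made up for illustration.
	for _, in := range []string{`"value"`, `'user'`, `other`, `"'quoted'"`} {
		fmt.Printf("%s -> %s\n", in, stripQuotes([]byte(in)))
	}
}
```

Unlike bytes.Trim with the cutset `'"`, this removes only a single matching pair, so a value such as "'quoted'" keeps its inner single quotes.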

tdewolff avatar Oct 14 '21 00:10 tdewolff

I might be wrong on this, but I think it's not possible for an attribute value to start with a literal single or double quote. If a value starts with one of those, it must be interpreted as a quote character, not as a literal character. For one of those to start the value literally, it would need to be encoded as an entity, like you mentioned.

If I am right, then the check for the first character would be redundant.

89z avatar Oct 14 '21 00:10 89z

I'm a little confused. Are you using the HTML parser of this package to get attribute values? The HTML lexer will put everything after = as the attribute value, which might include quotes. E.g. alt="value" would have an attribute value of "value", and alt=other would have an attribute value of other. The HTML minifier will strip the first and last quotes (if they exist) and use the EscapeAttrVal function to optimize the attribute value by choosing single or double quotes, or none at all, whichever is shorter.

I'm not sure what value you feed into your getAttr function, the output from this HTML lexer? I see you already trim the single and double quotes, so what is it that you're looking for? Could you give me some examples?

tdewolff avatar Oct 14 '21 14:10 tdewolff

Are you using the HTML parser of this package to get attribute values?

Technically no, because by my definition, this package can't currently do that. If you have an element like this:

<span class='user'>John Doe</span>

With JavaScript [1], you get this:

> temp1.getAttribute('class') == 'user';
true

That is to say, JavaScript (and other languages) don't consider surrounding quotes part of the attribute value. They are merely a semantic tool to avoid parsing errors. Granted, tdewolff/parse is not JavaScript; it is a parser and/or lexer, so it has different goals. Nevertheless, my goal is to return a value similar to what you'd get from a programming language, so no surrounding quotes if they exist.

I'm not sure what value you feed into your getAttr function, the output from this HTML lexer?

The input to my function is the output from html#Lexer.AttrVal.

  1. https://developer.mozilla.org/docs/Web/API/Element/getAttribute

89z avatar Oct 14 '21 14:10 89z

So what is the problem with using bytes.Trim(b, `'"`) on html.Lexer.AttrVal? Do you have an example that doesn't work as expected?

tdewolff avatar Oct 14 '21 14:10 tdewolff

So what is the problem with using bytes.Trim on html.Lexer.AttrVal? Do you have an example that doesn't work as expected?

My code seems to work well enough. I just thought that if the package already has an EscapeAttrVal, then maybe it could have an UnescapeAttrVal function too. That way I could just use the package function instead of writing my own. However, if a function like that doesn't align with the goals of this project, I understand.

89z avatar Oct 14 '21 14:10 89z

I see now, yes, that sounds like a good idea! Generally, functions are implemented as they seem needed. I want to add that besides cutting off the quotes from the start and end, you probably want the HTML entities to be converted to ASCII characters to get the textual value? E.g. attr="val&#34;ue" would become val"ue and not val&#34;ue.
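
For illustration, the two steps (dropping the surrounding quotes, then decoding entities) might be combined as in the sketch below. The name unescapeAttrVal is hypothetical and does not refer to an existing function in tdewolff/parse; entity decoding is delegated to the standard library's html.UnescapeString, on the assumption that its behavior is acceptable for attribute values.

```go
package main

import (
	"fmt"
	"html"
)

// unescapeAttrVal is a hypothetical sketch, not the library's API.
// It strips one pair of matching surrounding quotes (if present) and
// then decodes HTML entities using the standard library.
func unescapeAttrVal(b []byte) []byte {
	if len(b) >= 2 && (b[0] == '"' || b[0] == '\'') && b[len(b)-1] == b[0] {
		b = b[1 : len(b)-1]
	}
	return []byte(html.UnescapeString(string(b)))
}

func main() {
	// The raw value as it would appear after the = sign in attr="val&#34;ue".
	fmt.Printf("%s\n", unescapeAttrVal([]byte(`"val&#34;ue"`))) // prints val"ue
}
```

The round trip through string costs extra allocations; a version intended for a lexer-oriented package would more likely decode entities on the byte slice directly.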

tdewolff avatar Oct 14 '21 15:10 tdewolff

you probably want the HTML entities to be converted to ASCII characters to get the textual value? E.g. attr="val&#34;ue" would become val"ue and not val&#34;ue.

Yes, that is true. Currently I am doing that using UnescapeString [1]. It's in my function in the original post as well.

  1. https://godocs.io/html#UnescapeString

89z avatar Oct 14 '21 15:10 89z