<noscript> causes selector to fail

Open nathan-osman opened this issue 9 years ago • 9 comments

Consider the following program:

package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

const data = `<noscript><a href="http://example.org">click this link</a></noscript>`

func main() {
	d, err := goquery.NewDocumentFromReader(strings.NewReader(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	a, ok := d.Find("noscript a").Attr("href")
	fmt.Printf("URL: '%s', %t\n", a, ok)
}

The expected output is:

URL: 'http://example.org', true

But instead the output is:

URL: '', false

Changing noscript to div in both the document and the selector produces the expected output, so the problem seems to affect only <noscript> elements.

nathan-osman avatar Dec 05 '16 01:12 nathan-osman

After further investigation, the problem appeared to originate in Cascadia:

package main

import (
	"fmt"
	"strings"

	"github.com/andybalholm/cascadia"
	"golang.org/x/net/html"
)

const data = `<noscript><a href="http://example.org">click</a></noscript>`

func main() {
	n, err := html.Parse(strings.NewReader(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	s, err := cascadia.Compile("noscript a")
	if err != nil {
		fmt.Println(err)
		return // s is unusable if the selector failed to compile
	}
	fmt.Println(len(s.MatchAll(n)))
}

Before I could file a bug there, however, I came across this: https://github.com/andybalholm/cascadia/issues/14

"The net/html parser parses the document as if javascript were enabled. Because of that, the contents of noscript elements are just a single text node, not parsed HTML elements."

Now it looks like the bug exists in the golang.org/x/net/html package. Indeed, there is an open bug there for this very problem: https://github.com/golang/go/issues/16318

Sadly, it hasn't been fixed yet. :cry:

nathan-osman avatar Dec 05 '16 02:12 nathan-osman

Hello Nathan,

Thanks for looking into this. It makes sense that this is at the html parser level; it would be nice if the parser provided an option to turn javascript on or off for parsing. I'll keep the issue open until some decision is made in the parser.

Martin

mna avatar Dec 05 '16 12:12 mna

just noticed the same issue :)

Arnold1 avatar Jan 08 '17 04:01 Arnold1

For those looking for a workaround, re-parsing the content of the noscript tag seems to do the trick.

s.Find("noscript").SetHtml(s.Find("noscript").Text())

machinae avatar Jun 28 '18 21:06 machinae

@machinae cool thanks i will try it :) do you have an example which i can run?

Arnold1 avatar Oct 23 '18 18:10 Arnold1

s in my example is any *goquery.Selection. Just add that line after loading the document:

package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

const data = `<noscript><a href="http://example.org">click this link</a></noscript>`

func main() {
	d, err := goquery.NewDocumentFromReader(strings.NewReader(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	d.Find("noscript").SetHtml(d.Find("noscript").Text())
	a, ok := d.Find("noscript a").Attr("href")
	fmt.Printf("URL: '%s', %t\n", a, ok)
}

machinae avatar Oct 24 '18 17:10 machinae

@machinae wouldn't this set the contents of the first noscript as the text of all noscript tags combined? I would think getting the instance of the tag would be safer? (not tested)

d.Find("noscript").Each(func(i int, s *goquery.Selection) { 
    s.ReplaceWithHtml(s.Text())
})

(I don't use goquery so the above is just a guess)

xeoncross avatar Feb 08 '19 15:02 xeoncross

I resolved it with the code below:

root := doc.Selection
root.Find(`noscript`).Each(func(i int, selection *goquery.Selection) {
	selection.SetHtml(selection.Text())
})

orestonce avatar Aug 02 '19 02:08 orestonce

Looks like there was a partial fix in the referenced issue: ParseOptionEnableScripting(bool), which supports disabling script emulation mode. According to the last comment on that issue, though, it only works when noscript is inside the head.

mikestead avatar Jul 29 '21 00:07 mikestead