How to achieve the effect of BeautifulSoup get_text?

Open chushuai opened this issue 2 years ago • 6 comments

How to achieve the effect of BeautifulSoup get_text?

from bs4 import BeautifulSoup


def extract_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    return text

chushuai, Apr 13 '23 18:04

Hello,

I'm not familiar with BeautifulSoup, what does this achieve? It seems like it would be something like https://pkg.go.dev/github.com/PuerkitoBio/goquery#Selection.Text, but with some text handling applied, space normalization or something?

Martin

mna, Apr 15 '23 21:04

@mna Thank you for your reply. I want to do data extraction, and with goquery I cannot get a result similar to what BeautifulSoup produces. A comparison of the two follows. Example address: https://host7.bienvenidohosting.com:2096/

This is the goquery output, which contains a lot of whitespace and JavaScript code: (image)

This is the BeautifulSoup output, which is simple and clean: (image)

chushuai, Apr 17 '23 12:04

BeautifulSoup get_text definition

def get_text(self, separator="", strip=False,
                 types=default):
        """Get all child strings of this PageElement, concatenated using the
        given separator.

        :param separator: Strings will be concatenated using this separator.

        :param strip: If True, strings will be stripped before being
            concatenated.

        :param types: A tuple of NavigableString subclasses. Any
            strings of a subclass not found in this list will be
            ignored. Although there are exceptions, the default
            behavior in most cases is to consider only NavigableString
            and CData objects. That means no comments, processing
            instructions, etc.

        :return: A string.
        """

chushuai, Apr 17 '23 12:04
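To make the `separator` and `strip` parameters concrete, here is a minimal Go sketch of the string handling only. It assumes the text nodes have already been collected into a slice, and that with `strip` set a string that is empty after trimming is dropped rather than joined. `joinText` is a hypothetical helper name, not part of goquery or BeautifulSoup.

```go
package main

import (
	"fmt"
	"strings"
)

// joinText mimics how get_text combines strings: with strip set, each string
// is trimmed and dropped if nothing remains; the rest are joined with sep.
// joinText is a hypothetical helper, not part of goquery or BeautifulSoup.
func joinText(parts []string, sep string, strip bool) string {
	kept := make([]string, 0, len(parts))
	for _, p := range parts {
		if strip {
			p = strings.TrimSpace(p)
			if p == "" {
				continue
			}
		}
		kept = append(kept, p)
	}
	return strings.Join(kept, sep)
}

func main() {
	// Three "text nodes", one of them whitespace only.
	parts := []string{"\n  Hello ", "\n\t", "world\n"}
	fmt.Printf("%q\n", joinText(parts, " ", true)) // prints "Hello world"
}
```

Without `strip`, the whitespace-only entries are kept and joined as they are, which is roughly why untrimmed output contains so much blank space.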

It's hard to tell from those screenshots but it looks like (and the function documentation seems to confirm this) it optionally trims each text node and concatenates them using the provided separator, and it ignores comments and some other nodes ("processing instructions", not sure what that means in this context).

Based on your screenshots, it looks like doing this would indeed get you something similar.

This is not supported in goquery out of the box, but it should be doable relatively easily using Contents(), Map(), strings.TrimSpace and strings.Join.

I wouldn't be opposed to adding a top-level function (i.e. not a Selection method, as those are reserved for jQuery API compatibility) that would do something similar to BeautifulSoup, if anyone is interested in providing a PR. It should be general enough (i.e. support similar arguments for trimming and joining, and maybe a filter function to decide whether a node's text is included). There are probably finer details to figure out.
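As a rough illustration of the shape such a helper could take (trim flag, separator, filter function), here is a sketch that runs with the standard library only. It is not goquery code: `node` and `nodeType` are simplified stand-ins for `html.Node` and `html.NodeType` from golang.org/x/net/html, and `collectText` and `skipNonContent` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeType and node are simplified stand-ins for html.NodeType and html.Node
// from golang.org/x/net/html; only the fields used here are modelled.
type nodeType int

const (
	textNode nodeType = iota
	elementNode
	commentNode
)

type node struct {
	Type        nodeType
	Data        string // tag name for elements, content for text and comments
	FirstChild  *node
	NextSibling *node
}

// collectText walks the tree depth-first, skips any subtree for which keep
// returns false, optionally trims each text node, and joins what is left.
func collectText(root *node, sep string, trim bool, keep func(*node) bool) string {
	var texts []string
	var walk func(*node)
	walk = func(n *node) {
		if !keep(n) {
			return
		}
		if n.Type == textNode {
			s := n.Data
			if trim {
				s = strings.TrimSpace(s)
			}
			if s != "" {
				texts = append(texts, s)
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(root)
	return strings.Join(texts, sep)
}

// skipNonContent drops comments and the contents of script and style elements.
func skipNonContent(n *node) bool {
	if n.Type == commentNode {
		return false
	}
	return !(n.Type == elementNode && (n.Data == "script" || n.Data == "style"))
}

func main() {
	// Equivalent to: <div> Hello <script>var x = 1;</script><p>world </p></div>
	p := &node{Type: elementNode, Data: "p", FirstChild: &node{Type: textNode, Data: "world "}}
	script := &node{Type: elementNode, Data: "script",
		FirstChild: &node{Type: textNode, Data: "var x = 1;"}, NextSibling: p}
	div := &node{Type: elementNode, Data: "div",
		FirstChild: &node{Type: textNode, Data: " Hello ", NextSibling: script}}
	fmt.Println(collectText(div, " ", true, skipNonContent)) // prints: Hello world
}
```

Passing the filter as a function keeps the decision about which nodes count as content out of the walker, so a caller could also exclude other elements without changing the traversal.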

But yeah, to answer your initial question, there's nothing equivalent but it should be possible using the methods I linked above.

Hope this helps, Martin

mna, Apr 17 '23 22:04

@mna Thank you for your reply. This is the code I wrote. Could you help me see why, for many nested nodes such as div and form, the child nodes are not parsed out?

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"golang.org/x/net/html"
)

// TextAll returns the trimmed text contents of all the nodes in the document, joined using the provided separator.
// It skips script and style elements; comments and other non-text nodes contribute no text.
func TextAll(doc *goquery.Document, sep string) string {
	var texts []string
	// Slightly optimized vs calling Each: no single selection object created
	var f func(*html.Node)
	f = func(n *html.Node) {
		// Ignore script and style nodes
		if n.Type == html.ElementNode {
			switch n.Data {
			case "style", "script":
				return
			}
		}
		if n.Type == html.TextNode {
			if n.FirstChild == nil {
				text := strings.TrimSpace(n.Data)
				if len(text) > 0 {
					texts = append(texts, text)
					// Debugging information
					fmt.Println("+n.Type:", n.Type)
					fmt.Println("+n.FirstChild != nil:", n.FirstChild != nil)
					fmt.Println("+n.text:", text)
				}
			}
		} else {
			// Debugging information
			fmt.Println("-n.Type:", n.Type)
			fmt.Println("-n.text:", n.Data)
			fmt.Println("-n.FirstChild != nil:", n.FirstChild != nil)
		}
		// Recursively process child nodes
		if n.FirstChild != nil {
			for c := n.FirstChild; c != nil; c = c.NextSibling {
				f(c)
			}
		}
	}

	// Iterate over all nodes in the selection
	for _, n := range doc.Nodes {
		f(n)
	}
	// Join the texts slice using the provided separator
	return strings.Join(texts, sep)
}

image

image

chushuai, Apr 18 '23 04:04