How to achieve the effect of BeautifulSoup get_text?
def extract_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    return text
Hello,
I'm not familiar with BeautifulSoup, what does this achieve? It seems like it would be something like https://pkg.go.dev/github.com/PuerkitoBio/goquery#Selection.Text, but with some text handling applied, space normalization or something?
Martin
@mna Thank you for your reply. I want to do data extraction, but with goquery I can't achieve an effect similar to BeautifulSoup's. I'll give a comparison of the two below. Example address: https://host7.bienvenidohosting.com:2096/
This is the result of the goquery output, which contains a lot of whitespace and JS code (screenshot omitted).
This is the result of the BeautifulSoup output, very simple and clean (screenshot omitted).
BeautifulSoup get_text definition
def get_text(self, separator="", strip=False,
             types=default):
    """Get all child strings of this PageElement, concatenated using the
    given separator.

    :param separator: Strings will be concatenated using this separator.
    :param strip: If True, strings will be stripped before being
        concatenated.
    :param types: A tuple of NavigableString subclasses. Any
        strings of a subclass not found in this list will be
        ignored. Although there are exceptions, the default
        behavior in most cases is to consider only NavigableString
        and CData objects. That means no comments, processing
        instructions, etc.
    :return: A string.
    """
It's hard to tell from those screenshots but it looks like (and the function documentation seems to confirm this) it optionally trims each text node and concatenates them using the provided separator, and it ignores comments and some other nodes ("processing instructions", not sure what that means in this context).
Based on your screenshots, it looks like doing this would indeed get you something similar.
This is not supported in goquery out of the box, but it should be doable relatively easily using Contents(), Map(), strings.TrimSpace and strings.Join.
I wouldn't be opposed to adding a top-level function (i.e. not a Selection method, as those are reserved for jquery API compatibility) that would do something similar to BeautifulSoup, if anyone was interested in providing a PR. It should be general enough (i.e. support similar args to trim, join, and maybe a filter function to decide whether a node's text is included or not). There are probably finer details to figure out.
But yeah, to answer your initial question, there's nothing equivalent but it should be possible using the methods I linked above.
Hope this helps, Martin
@mna Thank you for your reply. This is the code I wrote; could you help me see why many nested nodes, such as div and form, do not have their child nodes parsed out?
import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"golang.org/x/net/html"
)

// TextAll returns the trimmed text contents of all the nodes in the selection, joined using the provided separator.
// It ignores comments and some other nodes ("processing instructions").
func TextAll(doc *goquery.Document, sep string) string {
	var texts []string
	// Slightly optimized vs calling Each: no single selection object created
	var f func(*html.Node)
	f = func(n *html.Node) {
		// Ignore script and style nodes
		if n.Type == html.ElementNode {
			switch n.Data {
			case "style", "script":
				return
			}
		}
		if n.Type == html.TextNode {
			if n.FirstChild == nil {
				text := strings.TrimSpace(n.Data)
				if len(text) > 0 {
					texts = append(texts, text)
					// Debugging information
					fmt.Println("+n.Type:", n.Type)
					fmt.Println("+n.FirstChild != nil:", n.FirstChild != nil)
					fmt.Println("+n.text:", text)
				}
			}
		} else {
			// Debugging information
			fmt.Println("-n.Type:", n.Type)
			fmt.Println("-n.text:", n.Data)
			fmt.Println("-n.FirstChild != nil:", n.FirstChild != nil)
		}
		// Recursively process child nodes
		if n.FirstChild != nil {
			for c := n.FirstChild; c != nil; c = c.NextSibling {
				f(c)
			}
		}
	}
	// Iterate over all nodes in the selection
	for _, n := range doc.Nodes {
		f(n)
	}
	// Join the texts slice using the provided separator
	return strings.Join(texts, sep)
}

