
Consider Get-PSHTMLDocument

Open Stephanevg opened this issue 6 years ago • 4 comments

It would be nice to have a function that could read an HTML page and return an object, which could then be developed further, or even converted to a PSHTML PowerShell file (is that utopian?).

  1. The parsing

For that, we will need the ability to parse a HTML document.

This snippet might be an option to do so:

Add-Type -AssemblyName System.Xml.Linq
# Read the whole file as one string
$txt = [IO.File]::ReadAllText("c:\myhtml.html")
# XDocument.Parse only accepts well-formed XML, so the page must be valid XHTML
$xml = [System.Xml.Linq.XDocument]::Parse($txt)
$ns = 'http://www.w3.org/1999/xhtml'
# Select every <td> element in the XHTML namespace
$cells = $xml.Descendants("{$ns}td")

  2. Create a PSHTML.Document object

Once it is parsed (or while parsing), we could create the corresponding PSHTML object for each HTML element. This assumes that this issue is closed and implemented first -> https://github.com/Stephanevg/PSHTML/issues/218

Stephanevg, Jul 12 '19 14:07

Hi, I think it's better to use something really dedicated to HTML. System.Xml is XML-focused. I tried with the small function I created in #218, and it fails when some "special" HTML syntaxes are used (atom stuff...).
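As a minimal sketch of that limitation (the two sample strings are made up for illustration, not taken from a real page): XDocument.Parse accepts well-formed XHTML, but throws on ordinary HTML containing an unclosed void element such as <br>.

```powershell
Add-Type -AssemblyName System.Xml.Linq

# Well-formed XHTML: every tag is closed, so the XML parser accepts it.
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><table><tr><td>a</td><td>b</td></tr></table></body></html>'
$ns    = 'http://www.w3.org/1999/xhtml'
$doc   = [System.Xml.Linq.XDocument]::Parse($xhtml)

# Build the namespaced element name explicitly and collect the <td> elements.
$cells = @($doc.Descendants([System.Xml.Linq.XName]::Get('td', $ns)))
"td cells found: $($cells.Count)"

# Ordinary HTML: <br> has no closing tag, so this is not well-formed XML.
$plainHtml = '<html><body><p>line one<br>line two</p></body></html>'
$parsedOk  = $true
try {
    [void][System.Xml.Linq.XDocument]::Parse($plainHtml)
} catch {
    # The XML parser rejects the mismatched <br> ... </p> pair.
    $parsedOk = $false
}
"plain HTML parsed as XML: $parsedOk"
```

An HTML-aware parser is expected to tolerate the second input, which is the motivation for looking beyond System.Xml here.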

I tried with the HtmlAgilityPack and ... well, it's HTML-oriented, and it's almost the same. It also works on PowerShell Core (6.2).

It's available here: https://html-agility-pack.net (download the NuGet package and unzip it somewhere).

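# Load the HtmlAgilityPack assembly from the unzipped NuGet package
[Reflection.Assembly]::LoadFrom("C:\Users\Lx\Downloads\htmlagilitypack.1.11.12\lib\Net45\HtmlAgilityPack.dll")
$html = New-Object -TypeName HtmlAgilityPack.HtmlDocument
# $a holds the HTML source as a string
$html.LoadHtml($a)
# DocumentNode is the root of the parsed tree
$html.DocumentNode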

LxLeChat, Aug 18 '19 21:08

Here is a working example with HtmlAgilityPack and core PSHTML with classes like in #218.

First, load HtmlAgilityPack:

[Reflection.Assembly]::LoadFrom("C:\Users\Lx\Downloads\htmlagilitypack.1.11.12\lib\Net45\HtmlAgilityPack.dll")

Then get the HTML code from your favorite page, copy/paste it into an HTML file, and fetch the content as a single string:

$a = Get-Content -Raw .\yourhtmlpage.html

And voilà:

PS C:\Users\Lx> $x = get-pshtmldocument -html $a
PS C:\Users\Lx> $x

TagName  id Class Children
-------  -- ----- --------
                  {$null}
#comment          {}
html              {, }    


PS C:\Users\Lx> $x[2]

TagName id Class Children

PS C:\Users\Lx> $x[2].children[1].children

TagName  id         Class                                                        Children
-------  --         -----                                                        --------
script                                                                           {}
script                                                                           {var config = {     autoCapture: {             lineage: true     }... 
noscript                                                                         {}
div      headerArea uhf                                                          {headerRegion}
link                                                                             {}
link                                                                             {}
script                                                                           {}
div      page       hfeed site                                                   {single-wrapper, wrapper-footer}
div                 a2a_kit a2a_kit_size_32 a2a_floating_style a2a_default_style {, , }
script                                                                           {var CrayonSyntaxSettings = {"version":"_2.7.2_beta","is_admin":"0... 
script                                                                           {(function (undefined) {var _targetWindow ="prefer-popup"; window.... 
script                                                                           {/*{literal}*/window.lightningjs||function(c){function g(b,d){d&&(... 
div      footerArea uhf                                                          {footerRegion}
link                                                                             {}
link                                                                             {}
script                                                                           {}
script                                                                           {//fix calendar hide when change month        var string = window.... 
script                                                                           {}
script                                                                           {window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","... 


PS C:\Users\Lx>

The function itself:

function get-pshtmldocument {
    param (
        $html
    )

    begin {

       function HtmlToPSHTMLClass {
            param(
                $node
            )

            If ( $node.nodetype -ne 'Text' ) {

                $plop = [htmlParentElement]::New()
                $plop.SetTagName($node.Name)
                $plop.Id = $node.Attributes.where({$_.name -eq 'id'}).Value
                $plop.Class = $node.Attributes.where({$_.name -eq 'class'}).Value

                If ( $node.hasChildNodes ) { 
                    foreach ( $n in $node.childnodes ) {
                        # Some text nodes are 'empty' (whitespace only), so skip them... maybe a bug?
                        If ( $n.nodetype -eq 'Text' -and $n.InnerText.trim() -ne '' ) {
                            $child = $n.InnerText
                            $plop.AddChild( $child )
                        } elseif ( $n.nodetype -ne 'Text') {
                            $child = HtmlToPSHTMLClass -node $n
                            $plop.AddChild( $child )
                        }
                    }
                }
            }

            $plop
        } 

    }

    process {

        $document = New-Object -TypeName HtmlAgilityPack.HtmlDocument
        $document.LoadHtml($html)

        Foreach( $node in $document.DocumentNode.ChildNodes ) {
            HtmlToPSHTMLClass -node $node
        }

    }

    end {

    }

}
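
On the idea from the opening post of converting the result to a PSHTML PowerShell file, here is a rough sketch. It assumes the returned objects expose only TagName, Id, Class and Children, as in the table output above. New-DemoNode and ConvertTo-PSHTMLScript are hypothetical names, not part of PSHTML, and the stand-in objects replace the classes from #218. It also assumes that every tag maps to a PSHTML function accepting -Id, -Class and a content script block, which will not hold for void elements such as link.

```powershell
# Hypothetical stand-in for the objects returned by get-pshtmldocument:
# only TagName, Id, Class and Children are assumed.
function New-DemoNode {
    param([string]$TagName, [string]$Id, [string]$Class, [object[]]$Children = @())
    [pscustomobject]@{ TagName = $TagName; Id = $Id; Class = $Class; Children = $Children }
}

# Hypothetical helper: walks the tree and emits PSHTML-style script lines.
function ConvertTo-PSHTMLScript {
    param($Node, [int]$Depth = 0)

    $indent = '    ' * $Depth

    # Plain strings are text children: emit them as single-quoted literals.
    if ($Node -is [string]) {
        return ($indent + "'" + $Node.Replace("'", "''") + "'")
    }

    # Start with the tag name, then append the attributes that are set.
    $line = $indent + $Node.TagName
    if ($Node.Id)    { $line += " -Id '" + $Node.Id.Replace("'", "''") + "'" }
    if ($Node.Class) { $line += " -Class '" + $Node.Class.Replace("'", "''") + "'" }

    # Children go inside the content script block, one level deeper.
    $lines = @("$line {")
    foreach ($child in $Node.Children) {
        $lines += ConvertTo-PSHTMLScript -Node $child -Depth ($Depth + 1)
    }
    $lines += "$indent}"
    $lines
}

# Small made-up tree: a div containing a paragraph with one text child.
$tree = New-DemoNode -TagName 'div' -Id 'page' -Class 'hfeed site' -Children @(
    (New-DemoNode -TagName 'p' -Children @('Hello'))
)

(ConvertTo-PSHTMLScript -Node $tree) -join "`n"
```

Attributes other than id and class are dropped here because the objects shown above do not carry them; a real converter would need them on the class first.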

LxLeChat, Aug 19 '19 13:08

A side note: the HTML Agility Pack (HAP) is MIT licensed, so we could strongly consider it...

Stephanevg, Aug 15 '20 06:08

Another side note: it looks like Justin Grote already wrote a PowerShell implementation of the Agility Pack: PowerHTML (under MIT as well).

Stephanevg, May 07 '24 09:05