Unable to parse HTML

Open harshzalavadiya opened this issue 5 years ago • 1 comments

I was trying to parse content from multiple document formats. It works for other formats (PDF, DOC, etc.), but not for HTML files.

Below is a minimal example with a sample HTML file.

main.go

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	// Attempt to read file
	txt, err := docconv.ConvertPath("test.html")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(txt.Body)
}

test.html

<!DOCTYPE html>
<html>
  <body>
    <h1>This is heading 1</h1>
    <h2>This is heading 2</h2>
    <h3>This is heading 3</h3>
    <h4>This is heading 4</h4>
    <h5>This is heading 5</h5>
    <h6>This is heading 6</h6>
  </body>
</html>

As of now, the output is blank.
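
One way to narrow this down is to check the input independently of docconv and confirm that the file is actually read and that its bytes look like HTML. The sketch below uses only the Go standard library (`net/http.DetectContentType`). The helper name `sniffContentType` and the embedded `sampleHTML` are hypothetical names introduced for illustration, and this is not necessarily how docconv itself detects file types.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// sampleHTML mirrors the start of test.html from the report above.
const sampleHTML = `<!DOCTYPE html>
<html>
  <body>
    <h1>This is heading 1</h1>
  </body>
</html>
`

// sniffContentType is a hypothetical helper (not part of docconv). It returns
// the MIME type the standard library infers from the leading bytes of data.
func sniffContentType(data []byte) string {
	return http.DetectContentType(data)
}

func main() {
	data, err := os.ReadFile("test.html")
	if err != nil {
		// Fall back to the embedded sample so the sketch still runs on its own.
		fmt.Println("could not read test.html, using embedded sample:", err)
		data = []byte(sampleHTML)
	}

	fmt.Printf("bytes read: %d\n", len(data))
	fmt.Printf("sniffed type: %s\n", sniffContentType(data))
}
```

If the sniffed type is something other than text/html, the input file itself is worth a closer look. If it is text/html and the byte count is non-zero, the blank output more likely comes from the conversion step.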

I also noticed that there has been no release since February 2019, so code.sajari.com might be serving an older version of the library. Is there any way to publish a pre-release version, or to configure CI to do that?

Jul 27 '20 14:07 harshzalavadiya

I have the same problem on Ubuntu x64 and on macOS on an ARM (M1) Mac. No errors, no meta info, and no content.

May 13 '22 16:05 stuta