Using LoadAndSplit for PDF fails with streams not present
While debugging, I traced the problem through the following call chain: pdf.LoadAndSplit -> p.GetPlainText(fonts) -> (page.go) GetPlainText -> (read.go) Key(key string) Value -> (page.go) Interpret(strm, func -> (ps.go) Interpret -> (read.go) func (v Value) Reader() io.ReadCloser -> v.data.(stream). The error occurs at this last step: the PDF data being read is not a stream.
Thanks for the report. Could you include a test case or code snippet?
Here's a code snippet:
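package main

import (
	"context"
	"log"
	"os"

	"github.com/tmc/langchaingo/documentloaders"
	"github.com/tmc/langchaingo/schema"
	"github.com/tmc/langchaingo/textsplitter"
)

// LoadPdfFile opens the PDF at filename, splits it into token-based chunks,
// logs each chunk, and returns the resulting documents.
func LoadPdfFile(filename string) []schema.Document {
	// The loader needs the file size up front.
	fileInfo, err := os.Stat(filename)
	if err != nil {
		log.Fatal(err)
	}
	file, err := os.Open(filename)
	if err != nil {
		panic(err)
	}
	defer file.Close()

	pdf := documentloaders.NewPDF(file, fileInfo.Size())
	chunkSize := textsplitter.WithChunkSize(1000)
	chunkOverlap := textsplitter.WithChunkOverlap(0)
	splitter := textsplitter.NewTokenSplitter(chunkSize, chunkOverlap)

	// This is the call that fails.
	pdfDocs, err := pdf.LoadAndSplit(context.Background(), splitter)
	if err != nil {
		panic(err)
	}
	for i, pdfDoc := range pdfDocs {
		log.Printf("Page Number: %d Content: %s\n", i, pdfDoc.PageContent)
	}
	return pdfDocs
}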
I hope this helps. Let me know if you need anything else or have code you want me to test.
@sherwoodzern Hi, I tried to reproduce the issue but could not. Could you please share the PDF file that produced the error?
Uploading ESLII_print12_toc.pdf…
@jaylalakiya I've uploaded the pdf file. Let me know if I can be of any assistance. I'm looking forward to receiving your feedback.