langchaingo Using LoadAndSplit for PDF fails with streams not present

I walked through the code as part of a debugging session and traced the problem along this call chain: pdf.LoadAndSplit -> p.GetPlainText(fonts) -> (page.go) GetPlainText -> (read.go) Key(key string) Value -> (page.go) Interpret(strm, func -> (ps.go) Interpret -> (read.go) func (v Value) Reader() io.ReadCloser -> v.data.(stream). The error occurs at this last step: the PDF data is not a stream.
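
To illustrate the kind of failure at the end of that chain, here is a minimal, self-contained sketch of a type assertion on an interface-typed field. The names `value`, `stream`, `dict`, `contents`, and `errNoStream` are hypothetical stand-ins chosen for illustration; this is not the PDF library's actual code, only an assumption about the general pattern.

```go
package main

import (
	"errors"
	"fmt"
)

// stream and dict are hypothetical stand-ins for the kinds of objects a PDF
// parser might hold; they are not the library's real types.
type stream struct{ raw []byte }
type dict map[string]any

// value wraps an arbitrary parsed object behind an interface-typed field.
type value struct{ data any }

// errNoStream is an illustrative error, not one taken from the library.
var errNoStream = errors.New("stream not present")

// contents returns the raw bytes only when the dynamic type of v.data is stream.
func (v value) contents() ([]byte, error) {
	s, ok := v.data.(stream) // comma-ok form: reports a mismatch without panicking
	if !ok {
		return nil, errNoStream
	}
	return s.raw, nil
}

func main() {
	values := []value{
		{data: stream{raw: []byte("BT (hello) Tj ET")}},
		{data: dict{"Type": "Page"}}, // not a stream, so the assertion fails
	}
	for _, v := range values {
		b, err := v.contents()
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("read %d bytes\n", len(b))
	}
}
```

If the object reached through the chain holds anything other than a stream, an assertion like this cannot succeed, which would be consistent with the behavior described above.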

Jan 16 '24 19:01 sherwoodzern

Thanks for the report, could you include a test case or code snippet?

Jan 16 '24 22:01 tmc

Here's a code snippet:

import (
	"context"
	"log"
	"os"

	"github.com/tmc/langchaingo/documentloaders"
	"github.com/tmc/langchaingo/schema"
	"github.com/tmc/langchaingo/textsplitter"
)

// LoadPdfFile loads the PDF at filename and splits it into token-based chunks.
func LoadPdfFile(filename string) []schema.Document {
	// NewPDF needs the file size in addition to the reader.
	fileInfo, err := os.Stat(filename)
	if err != nil {
		log.Fatal(err)
	}
	file, err := os.Open(filename)
	if err != nil {
		panic(err)
	}
	defer file.Close()

	pdf := documentloaders.NewPDF(file, fileInfo.Size())

	chunkSize := textsplitter.WithChunkSize(1000)
	chunkOverlap := textsplitter.WithChunkOverlap(0)
	splitter := textsplitter.NewTokenSplitter(chunkSize, chunkOverlap)

	// The call chain that fails starts here.
	pdfDocs, err := pdf.LoadAndSplit(context.Background(), splitter)
	if err != nil {
		panic(err)
	}

	for i, pdfDoc := range pdfDocs {
		log.Printf("Page Number: %d Content: %s\n", i, pdfDoc.PageContent)
	}
	return pdfDocs
}

I hope this helps. Let me know if you need anything else or have code you want me to test.

Jan 17 '24 01:01 sherwoodzern

@sherwoodzern Hi, I tried to reproduce the issue but could not. Could you please share the PDF file that produced the error?

Jan 19 '24 06:01 jaylalakiya

Uploading ESLII_print12_toc.pdf…

Jan 19 '24 14:01 sherwoodzern

@jaylalakiya I've uploaded the PDF file. Let me know if I can be of any assistance. I'm looking forward to your feedback.

Jan 19 '24 14:01 sherwoodzern