langchaingo

Using LoadAndSplit for PDF fails with streams not present

Open sherwoodzern opened this issue 2 years ago • 5 comments

I stepped through the code in a debugging session and traced the problem along the following call chain: pdf.LoadAndSplit -> p.GetPlainText(fonts) -> (page.go) GetPlainText -> (read.go) Key(key string) Value -> (page.go) Interpret(strm, func -> (ps.go) Interpret -> (read.go) func (v Value) Reader() io.ReadCloser -> v.data.(stream). The error occurs at this last step: the type assertion fails because the PDF data is not a stream.

sherwoodzern avatar Jan 16 '24 19:01 sherwoodzern

Thanks for the report, could you include a test case or code snippet?

tmc avatar Jan 16 '24 22:01 tmc

Here's a code snippet:

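import (
	"context"
	"log"
	"os"

	"github.com/tmc/langchaingo/documentloaders"
	"github.com/tmc/langchaingo/schema"
	"github.com/tmc/langchaingo/textsplitter"
)

// LoadPdfFile loads the PDF at filename and splits its text into token-based chunks.
func LoadPdfFile(filename string) []schema.Document {
	// NewPDF needs the file size in addition to the reader.
	fileInfo, err := os.Stat(filename)
	if err != nil {
		log.Fatal(err)
	}
	file, err := os.Open(filename)
	if err != nil {
		panic(err)
	}
	defer file.Close()

	pdf := documentloaders.NewPDF(file, fileInfo.Size())

	// Split into chunks of 1000 tokens with no overlap.
	chunkSize := textsplitter.WithChunkSize(1000)
	chunkOverlap := textsplitter.WithChunkOverlap(0)
	splitter := textsplitter.NewTokenSplitter(chunkSize, chunkOverlap)

	// This is the call that fails.
	pdfDocs, err := pdf.LoadAndSplit(context.Background(), splitter)
	if err != nil {
		panic(err)
	}

	// i indexes the chunks produced by the splitter, not the PDF pages.
	for i, pdfDoc := range pdfDocs {
		log.Printf("Chunk: %d Content: %s\n", i, pdfDoc.PageContent)
	}
	return pdfDocs
}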

I hope this helps. Let me know if you need anything else or have code you want me to test.

sherwoodzern avatar Jan 17 '24 01:01 sherwoodzern

@sherwoodzern Hi, I tried to reproduce the issue but could not. Could you please share the PDF file that produced the error?

jaylalakiya avatar Jan 19 '24 06:01 jaylalakiya

Uploading ESLII_print12_toc.pdf…

sherwoodzern avatar Jan 19 '24 14:01 sherwoodzern

@jaylalakiya I've uploaded the pdf file. Let me know if I can be of any assistance. I'm looking forward to receiving your feedback.

sherwoodzern avatar Jan 19 '24 14:01 sherwoodzern