Bad PDF files with junk characters before the header and after the EOF marker fail with an "Unexpected character" error
Hi! I get the Unexpected character error when parsing many of my PDFs. Here is one example of a PDF that gives me the error: https://drive.google.com/file/d/1YXdN7TfwK87_5ekbUElYRFOkVLifKj1F/view?usp=sharing
julia> pdDocOpen("/home/diego/Downloads/Vernon et al. - 2018 - Pi-Pi contacts are an overlooked protein feature relevant to phase separation.pdf")
ERROR: Unexpected character
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] doc_trailer_update(ps::IOStream, doc::PDFIO.Cos.CosDocImpl)
@ PDFIO.Cos ~/.julia/packages/PDFIO/FcFZB/src/CosDoc.jl:399
[3] cosDocOpen(fp::String; access::Function)
@ PDFIO.Cos ~/.julia/packages/PDFIO/FcFZB/src/CosDoc.jl:141
[4] PDFIO.PD.PDDocImpl(fp::String; access::Function)
@ PDFIO.PD ~/.julia/packages/PDFIO/FcFZB/src/PDDocImpl.jl:16
[5] pdDocOpen(filepath::String; access::Function)
@ PDFIO.PD ~/.julia/packages/PDFIO/FcFZB/src/PDDoc.jl:77
[6] pdDocOpen(filepath::String)
@ PDFIO.PD ~/.julia/packages/PDFIO/FcFZB/src/PDDoc.jl:77
[7] top-level scope
@ REPL[53]:1
My system is:
[c27321d9] Glob v1.3.0
[4d0d745f] PDFIO v0.1.12
[b8865327] UnicodePlots v1.3.0
julia> versioninfo(verbose=true)
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
Ubuntu 18.04.2 LTS (beaver-osp1-bowen X37)
uname: Linux 5.4.0-72-generic #80~18.04.1-Ubuntu SMP Mon Apr 12 23:26:25 UTC 2021 x86_64 x86_64
CPU: Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz:
speed user nice sys idle irq
#1-12 2500 MHz 957939 s 4753 s 202091 s 2197037 s 0 s
Memory: 15.245685577392578 GB (864.19140625 MB free)
Uptime: 521353.0 sec
Load Avg: 1.05 1.38 1.39
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
MANDATORY_PATH = /usr/share/gconf/ubuntu.mandatory.path
DEFAULTS_PATH = /usr/share/gconf/ubuntu.default.path
HOME = /home/diego
WINDOWPATH = 2
TERM = xterm-256color
PATH = /home/diego/.local/bin:/home/diego/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/diego/bin:/home/diego/.local/bin
I really appreciate any help you can provide.
Best regards,
The file is corrupt. A PDF file must start with %PDF and end with %%EOF. Some readers are lenient about this, but one cannot say that is the right approach. Anyway, I fixed the file and am uploading it here for reference: fixed.pdf
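To see whether a given file has this problem before opening it, a small check along the following lines may help. This is only a sketch: `junk_bytes` is a hypothetical helper, not part of PDFIO, and it does nothing more than count the bytes outside the first %PDF header and the last %%EOF marker. It relies on the byte-vector methods of `findfirst` and `findlast`, which need Julia 1.6 or later.

```julia
# Hypothetical helper (not a PDFIO function): count the bytes that sit before
# the first "%PDF" header and after the last "%%EOF" marker.
function junk_bytes(data::Vector{UInt8})
    header = findfirst(Vector{UInt8}("%PDF"), data)   # index range or nothing
    marker = findlast(Vector{UInt8}("%%EOF"), data)   # index range or nothing
    # Without both markers there is nothing meaningful to count.
    (header === nothing || marker === nothing) && return nothing
    leading = first(header) - 1
    trailing = length(data) - last(marker)
    return (leading = leading, trailing = trailing)
end

# Small in-memory example with junk on both sides.
sample = Vector{UInt8}("junk\n%PDF-1.4\n...body...\n%%EOF\nmore junk")
println(junk_bytes(sample))
```

For a real file the call would be `junk_bytes(read(path))`. A line break after %%EOF is common in otherwise well-formed files, so a trailing count of one or two bytes is probably not the cause of the error.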
Thank you so much for the quick answer. I am getting this error with 75% of my files. Would it be possible to have a keyword argument that allows parsing this kind of file? Something like permissive=true, but false by default?
These files do not conform to the PDF spec. So technically, the behavior of a parser on corrupt files cannot be guaranteed, and it should not be fixed in a hurry. While I will keep in mind to update the parser to handle some bad files, I cannot make it a guaranteed feature of the product. For now, you can remove the MIME corruption from the files manually and work with them.
This can be done easily with a binary-preserving text editor like vi or emacs on Unix or Linux.
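With many files affected, the manual cleanup could also be scripted. The sketch below is one possible approach, not something PDFIO provides: `strip_pdf_junk` and `write_stripped_pdf` are hypothetical names, and the code assumes the junk was added around an otherwise intact file, so that keeping only the bytes from the first %PDF through the last %%EOF is enough. It needs Julia 1.6 or later for the byte-vector methods of `findfirst` and `findlast`.

```julia
# Hypothetical helper (not a PDFIO function): keep only the bytes from the
# first "%PDF" header through the last "%%EOF" marker.
function strip_pdf_junk(data::Vector{UInt8})
    header = findfirst(Vector{UInt8}("%PDF"), data)
    marker = findlast(Vector{UInt8}("%%EOF"), data)
    if header === nothing || marker === nothing || first(header) > first(marker)
        error("no %PDF header followed by an %%EOF marker was found")
    end
    # Range indexing returns a copy, so the input vector is left unchanged.
    return data[first(header):last(marker)]
end

# Read `src`, strip the junk, and write the result to `dst`.
# The original file is not modified.
function write_stripped_pdf(src::AbstractString, dst::AbstractString)
    write(dst, strip_pdf_junk(read(src)))
    return dst
end
```

The cleaned copy could then be passed to pdDocOpen as usual. Writing to a separate path rather than overwriting keeps the original available in case the stripped file still fails to parse.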
Thank you very much! There is no hurry at all :D
A fix is in https://github.com/sambitdash/PDFIO.jl/commit/4a4f0713d840fa6db74b24a25ae4c35cf792d412