PyPDF4 Dealing with invalid bookmarks

I'm dealing with a collection of PDF files that contain invalid bookmarks: in Acrobat, they show in the bookmark tree, but the bookmark properties show that they have no destination.

The getOutlines() method does not return these bookmarks at all. I was hoping that I could extract these invalid bookmarks and fix them with PyPDF4 (in my specific case, I would set them to the destination of the previous bookmark), but since they're not listed, that's not possible.

Is there a tweak I could apply to get these invalid bookmarks (and update them)?

Thanks!

R.

Sep 17 '20 07:09 rgoubet

Hi @mrgou, I've done quite a lot of rewriting and fixing in my fork https://github.com/pubpub-zz/PyPDF4; this is still in alpha. A whl is available here: pypdf4-1.27.0PPzz_1-py2.py3-none-any.whl.zip. If you still get some issues, can you send an example for analysis?

Sep 17 '20 18:09 pubpub-zz

Hi @mrgou, have you been able to test my whl? Do you have any feedback? Otherwise, can you send me an example with invalid bookmarks?

Sep 24 '20 20:09 pubpub-zz

Sorry for the delay in responding. I've tested your whl but couldn't see any difference. Invalid bookmarks seem to be skipped entirely. So they are not returned by the following code:

# reader is an already opened PdfFileReader
def printBookmarkTree(bookmark_list, depth=0):
    for item in bookmark_list:
        if isinstance(item, list):
            # a nested list holds the children of the previous bookmark:
            # recurse one level deeper
            printBookmarkTree(item, depth + 1)
        else:
            page = reader.getDestinationPageNumber(item) + 1
            print('\t' * depth + item.title + '\t' + str(page))

printBookmarkTree(reader.getOutlines())

By the way, it seems that I had to import PyPDF4 under the name pypdf:

import pypdf as PyPDF4

Not sure whether this is intended.

Unfortunately, I am not able to share an example file, as they are highly confidential. I can't even say how to reproduce the issue: how the issue occurred is what I'm trying to find out in the first place.

Oct 01 '20 15:10 rgoubet

No problem about the delay or about not being able to share documents. First, about the pypdf renaming: this choice was already implemented in claird's fork. About your problem, I propose the following code:

def get_outlines1(self, node=None, _outlines=None):
        if _outlines is None:
            _outlines = []
            catalog = self.root_object
            # get the outline dictionary and named destinations
            if "/Outlines" in catalog:
                try:
                    lines = catalog["/Outlines"]
                except PdfReadError:
                    # This occurs if the /Outlines object reference is
                    # incorrect for an example of such a file, see
                    # https://unglueit-files.s3.amazonaws.com/ebf/7552c42e9280b4476e59e77acc0bc812.pdf
                    # so continue to load the file without the Bookmarks
                    return _outlines
                if "/First" in lines:
                    node = lines["/First"]
        if node is None:
            return _outlines
        # see if there are any more outlines
        while True:
            outline = self._build_outline(node)
            if outline:
                _outlines.append((node,outline))
            # check for sub-outlines
            if "/First" in node:
                sub_outlines = []
                get_outlines1(self,node["/First"], sub_outlines)
                if sub_outlines:
                    _outlines.append(sub_outlines)
            if "/Next" not in node:
                break
            node = node["/Next"]
        return _outlines

def flatten_and_whiten_outlines(ar, prefix, blanking, line_=0):
    for a in ar:
        if isinstance(a, list):
            # nested list of sub-outlines: recurse with a deeper prefix
            flatten_and_whiten_outlines(a, prefix + "  ", blanking, line_ + 1)
        else:
            # a is a (raw node, built outline) tuple from get_outlines1
            if blanking:
                a[0]["/Title"] = pypdf.TextStringObject("****BLANKED***")
                a[1]["/Title"] = pypdf.TextStringObject("****BLANKED***")
                try:
                    a[1]["/Dest"]["/Title"] = pypdf.TextStringObject("****BLANKDEST***")
                except Exception:
                    # no usable /Dest on this entry: nothing to blank
                    pass
            print("$%d" % line_, prefix, repr(a))
            line_ += 1

Define those two functions directly in the shell. You can then call the two functions one after the other:

ar=get_outlines1(***pdf_object***)

flatten_and_whiten_outlines(ar,"",False)

This will output the raw data from the PDF and then the decrypted data. This should help you to troubleshoot. I've also implemented some code to blank data, which you may find useful for sharing some data. Hope this helps to improve the library.
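
As a rough illustration of what to look for in that output, here is a minimal sketch (not code from the fork). It assumes only what the PDF format defines for outline items: siblings are chained with /Next, children start at /First, and a target is given by /Dest or by an action in /A. Plain dicts stand in for the PDF dictionary objects, and the names walk_outline, has_target and fill_missing_dests are hypothetical helpers made up for this example.

```python
# Plain dicts stand in for PDF outline dictionaries; with a real library
# the nodes would be its own dictionary objects instead.

def walk_outline(node, depth=0):
    """Yield (depth, node) for every item in a /First-/Next outline chain."""
    seen = set()
    while node is not None and id(node) not in seen:
        seen.add(id(node))  # guard against a malformed /Next cycle
        yield depth, node
        if "/First" in node:
            yield from walk_outline(node["/First"], depth + 1)
        node = node.get("/Next")


def has_target(node):
    """An outline item points somewhere if it has /Dest or an action /A."""
    return "/Dest" in node or "/A" in node


def fill_missing_dests(first_node):
    """Copy the previous item's /Dest onto items that have no target.

    Returns the titles of the items that were patched.
    """
    patched = []
    previous_dest = None
    for _depth, node in walk_outline(first_node):
        if has_target(node):
            previous_dest = node.get("/Dest", previous_dest)
        elif previous_dest is not None:
            node["/Dest"] = previous_dest
            patched.append(node["/Title"])
    return patched


# Toy outline: Intro (with one child), then Broken (no /Dest), then End
end = {"/Title": "End", "/Dest": ["page3", "/Fit"]}
broken = {"/Title": "Broken", "/Next": end}
child = {"/Title": "Child", "/Dest": ["page2", "/Fit"]}
intro = {"/Title": "Intro", "/Dest": ["page1", "/Fit"],
         "/First": child, "/Next": broken}

missing = [n["/Title"] for _d, n in walk_outline(intro) if not has_target(n)]
patched = fill_missing_dests(intro)
print("without target:", missing)
print("patched:", patched)
```

Here "previous bookmark" means the previous item in display order, so a child of the preceding entry counts. If only siblings on the same level should be considered, the traversal would need to track the last destination per depth instead.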

Oct 04 '20 10:10 pubpub-zz