Error dealing with pdf files

Open soham-aiplanet opened this issue 1 year ago • 1 comments

This is my code

import json

import airbyte as ab


def process_google_drive_documents(folder_url: str, service_account_cred: dict):
    source = ab.get_source(
        "source-google-drive",
        config={
            "folder_url": folder_url,
            "credentials": {
                "auth_type": "Service",
                "service_account_info": json.dumps(service_account_cred),
            },
            "streams": [
                {
                    "name": "pdf_loader_stream",
                    "globs": ["**"],
                    "format": {"filetype": "unstructured"},
                }
            ],
        },
    )

    source.check()
    source.select_all_streams()
    read_result = source.read()

And here's the error - [Document(page_content='', metadata={'_ab_source_file_last_modified': '2023-11-28T19:43:49.000000Z', '_ab_source_file_url': 'TermPaper.docx', 'document_key': 'TermPaper.docx', '_ab_source_file_parse_error': "Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. Contact Support if you need assistance.\nfilename=TermPaper.docx message=\n**********************************************************************\n Resource \x1b[93mpunkt_tab\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('punkt_tab')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mtokenizers/punkt_tab/english/\x1b[0m\n\n Searched in:\n - '/home/soham/nltk_data'\n - '/home/soham/work/apps/tryouts/gdrive_integration/.venv-source-google-drive/nltk_data'\n - '/home/soham/work/apps/tryouts/gdrive_integration/.venv-source-google-drive/share/nltk_data'\n - '/home/soham/work/apps/tryouts/gdrive_integration/.venv-source-google-drive/lib/nltk_data'\n - '/usr/share/nltk_data'\n - '/usr/local/share/nltk_data'\n - '/usr/lib/nltk_data'\n - '/usr/local/lib/nltk_data'\n**********************************************************************\n", '_airbyte_raw_id': '01JAQ6ZEB720CS3BNHYVMKFQEC', '_airbyte_extracted_at': datetime.datetime(2024, 10, 21, 9, 36, 50, 530000), '_airbyte_meta': {}, 'last_modified': '2024-10-21T15:06:52.694685'})]

Any idea how to resolve this?
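Note that the failure above is not raised as an exception: it is stored per record under the _ab_source_file_parse_error metadata key, next to an empty page_content. A minimal sketch for surfacing such failures is below. The field names are taken from the output above; the helper name summarize_parse_errors is hypothetical, and how the metadata dicts are pulled out of the read result is left to the caller.

```python
import re
from typing import Dict, Iterable, Mapping, Optional

# ANSI colour codes such as "\x1b[93m" appear inside the NLTK message.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
# NLTK phrases a missing resource as "Resource <name> not found."
MISSING_RESOURCE = re.compile(r"Resource (\S+) not found")


def summarize_parse_errors(
    metadata_records: Iterable[Mapping[str, object]],
) -> Dict[str, Optional[str]]:
    """Map each file that failed to parse to the missing NLTK resource
    named in its error message, or None if no resource is named."""
    failures: Dict[str, Optional[str]] = {}
    for meta in metadata_records:
        error = meta.get("_ab_source_file_parse_error")
        if not error:
            continue  # record parsed without a reported error
        clean = ANSI_ESCAPE.sub("", str(error))
        match = MISSING_RESOURCE.search(clean)
        key = str(
            meta.get("document_key")
            or meta.get("_ab_source_file_url")
            or "<unknown>"
        )
        failures[key] = match.group(1) if match else None
    return failures


if __name__ == "__main__":
    # Abbreviated copy of the metadata shown above, for illustration only.
    sample = [
        {
            "document_key": "TermPaper.docx",
            "_ab_source_file_parse_error": (
                "Error parsing record.\n"
                " Resource \x1b[93mpunkt_tab\x1b[0m not found.\n"
            ),
        }
    ]
    print(summarize_parse_errors(sample))  # prints {'TermPaper.docx': 'punkt_tab'}
```

An empty result after a rerun would suggest that every file parsed cleanly.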

Oct 28 '24 07:10 soham-aiplanet

Hi @soham-aiplanet! The error comes from the Natural Language Toolkit (NLTK) library: the parser needs NLTK's punkt_tab tokenizer data to split the text in the document, and that resource was not found in any of the directories NLTK searched. As the error message itself suggests, download it by running:

import nltk
nltk.download('punkt_tab')

After the download finishes, run the code again; the parse error should be resolved. Please let me know if this works or if you run into any other issues.
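If the download should only happen when the data is actually missing, a small guard can help. This is a sketch, not part of the connector: ensure_nltk_resource is a hypothetical helper, and the resource path and package name are copied from the error message ("Attempted to load tokenizers/punkt_tab/english/"). It relies on nltk.data.find raising LookupError for a missing resource. The find and download parameters exist only so the logic can be exercised without NLTK installed or network access.

```python
from typing import Callable, Optional


def ensure_nltk_resource(
    resource_path: str,
    package: str,
    find: Optional[Callable[[str], object]] = None,
    download: Optional[Callable[[str], object]] = None,
) -> bool:
    """Download `package` only if `resource_path` cannot be found.

    Returns True if a download was attempted, False if the resource
    was already present.
    """
    if find is None or download is None:
        # Imported lazily so stand-ins can be passed instead of NLTK.
        import nltk

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)
    except LookupError:
        download(package)
        return True
    return False


if __name__ == "__main__":
    # Requires NLTK to be installed and, if the data is missing, network access.
    ensure_nltk_resource("tokenizers/punkt_tab/english/", "punkt_tab")
```

The first directory in the search list above is /home/soham/nltk_data, so data downloaded to the user's default NLTK directory should typically be visible to the connector's own virtual environment (.venv-source-google-drive) as well. This is an assumption based on the listed paths and is worth verifying by rerunning the read.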

Nov 03 '24 17:11 pinaak-goel