[Bug?] Please, restore scipdf_parser! It is not working anymore on Colab :(

Open MatteoRiva95 opened this issue 1 year ago • 0 comments

Hello everyone,

I used to run "scipdf_parser" on Google Colab and it worked so well! Today I tried to run it again with the same commands, but it does not work anymore! Please, can someone help me? :(

Here is the code I was using:

from google.colab import drive
drive.mount('/content/drive')

!pip install git+https://github.com/titipata/scipdf_parser

!python -m spacy download en_core_web_sm

import subprocess
subprocess.Popen("bash serve_grobid.sh", shell=True)

(It now returns: <Popen: returncode: None args: 'bash serve_grobid.sh'>)
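The `<Popen: returncode: None ...>` line only shows that a process object was created and has not finished (or been waited on) yet; it does not say whether the script actually started Grobid. A minimal sketch for seeing the exit code and output of the script, assuming `serve_grobid.sh` is in the current working directory. The helper name `start_and_check`, the log file name, and the wait time are hypothetical choices, not part of scipdf_parser:

```python
import subprocess


def start_and_check(command, log_path="grobid.log", wait_seconds=5):
    """Start a shell command, wait briefly, and report what happened.

    Returns (process, exit_code, output). exit_code is None if the
    command is still running after wait_seconds.
    """
    # Send stdout and stderr to a log file so nothing is lost and the
    # child never blocks on a full pipe.
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            command, shell=True, stdout=log, stderr=subprocess.STDOUT
        )
    try:
        exit_code = proc.wait(timeout=wait_seconds)
    except subprocess.TimeoutExpired:
        exit_code = None  # still running, which is expected for a server
    with open(log_path) as log:
        output = log.read()
    return proc, exit_code, output


# Example usage in a notebook cell (not executed here):
# proc, exit_code, output = start_and_check("bash serve_grobid.sh")
# print("exit code:", exit_code)
# print(output)
```

An exit code of `None` would mean the script is still running in the background; any other value together with the logged output should show why it stopped.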

!bash serve_grobid.sh

(It now returns: Error: Docker is not installed. Please install Docker before running Grobid.)
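The error message suggests the script now looks for Docker before starting Grobid. A small sketch for confirming from Python whether the `docker` executable is visible in the current runtime; the helper name `missing_tools` is hypothetical, and which tools the script really needs is an assumption based only on the message above:

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# "docker" is taken from the error message; extend the list as needed.
print("Missing tools:", missing_tools(["docker"]))
```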

import scipdf
import os
import pandas as pd
import warnings
from bs4.builder import XMLParsedAsHTMLWarning
warnings.filterwarnings('ignore', category=XMLParsedAsHTMLWarning)

files_to_process = os.listdir("/content/drive/My Drive/PDF/")[:1500]
for idx, filename in enumerate(files_to_process, start=1):
    if filename.lower().endswith(".pdf"):
        try:
            pmid = filename.split(".")[0]
            percorso_file_csv = f"/content/drive/My Drive/CSV/{pmid}.csv"
            dizionario = scipdf.parse_pdf_to_dict(f'/content/drive/My Drive/PDF/{filename}')
            sections_content = [f"{key}: {value}" for section in dizionario.get('sections', []) for key, value in section.items()]
            #references_content = [f"{key}: {value}" for reference in dizionario.get('references', []) for key, value in reference.items()]
            #figures_content = [f"{key}: {value}" for figure in dizionario.get('figures', []) for key, value in figure.items()]

            content = {
                "Title": f"{dizionario.get('title', '')}\n",
                "Authors": f"{dizionario.get('authors', '')}\n",
                "Publication date": f"{dizionario.get('pub_date', '')}\n",
                "Abstract": f"{dizionario.get('abstract', '')}\n",
                "Sections": "\n".join(sections_content),
                #"References": "\n".join(references_content),
                #"Figures": "\n".join(figures_content),
                "Doi": f"{dizionario.get('doi', '')}"
            }

            # NOTE: content_str was not defined in the snippet as posted;
            # it is assumed here to be all fields joined into one string.
            content_str = "\n".join(f"{key}: {value}" for key, value in content.items())

            df = pd.DataFrame([[pmid, content_str]])

            df.to_csv(percorso_file_csv, index=False, header=["pmid", "content"])
            print(f"The CSV file was created successfully for file {filename}")
            print(f"File number {idx}")

        except Exception as e:
            print(e)
            continue

(It now returns:

OSError                                   Traceback (most recent call last)
in <cell line: 9>()
      7
      8 # Process only the first 10 files
----> 9 files_to_process = os.listdir("/content/drive/My Drive/PDF/")[:1500]
     10 for idx, filename in enumerate(files_to_process, start=1):
     11     if filename.lower().endswith(".pdf"):

OSError: [Errno 5] Input/output error: '/content/drive/My Drive/PDF/')
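This error is raised by `os.listdir` on the Drive folder, before any scipdf code runs, so it may be a separate problem with the Drive mount rather than with the parser (that is an assumption; the traceback does not say why the read failed). A minimal sketch for retrying the listing a few times before giving up; the helper name `list_pdfs` and the retry and delay values are hypothetical:

```python
import os
import time


def list_pdfs(folder, retries=3, delay=2.0):
    """Return the sorted .pdf filenames in `folder`, retrying on OSError."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            names = os.listdir(folder)
            return sorted(n for n in names if n.lower().endswith(".pdf"))
        except OSError as error:
            last_error = error
            print(f"Attempt {attempt} of {retries} failed: {error}")
            time.sleep(delay)
    # All attempts failed: re-raise the last error so it is not hidden.
    raise last_error


# Example usage in a notebook cell (not executed here):
# files_to_process = list_pdfs("/content/drive/My Drive/PDF/")[:1500]
```

If every attempt fails with the same error, the listing itself is not recoverable from code and the mount would need to be checked.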

What is going on?

Thank you so much in advance!!

Mar 05 '24 11:03 MatteoRiva95