[Bug?] Please restore scipdf_parser! It is not working anymore on Colab :(
Hello everyone,
I used to run "scipdf_parser" on Google Colab and it worked very well. Today I tried to run it again with the same commands, but it does not work anymore. Can someone please help me? :(
Here is the code I was using:
from google.colab import drive
drive.mount('/content/drive')
!pip install git+https://github.com/titipata/scipdf_parser
!python -m spacy download en_core_web_sm
import subprocess
subprocess.Popen("bash serve_grobid.sh", shell=True)
(It now returns: <Popen: returncode: None args: 'bash serve_grobid.sh'>)
!bash serve_grobid.sh
(It now returns: Error: Docker is not installed. Please install Docker before running Grobid.)
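In case it helps narrow this down, here is a minimal sketch of a check for whether a `docker` executable is visible in the runtime at all. The helper name `docker_available` is only illustrative and is not part of scipdf_parser or Grobid; it uses only the standard library.

```python
import shutil


def docker_available() -> bool:
    """Return True if a `docker` executable is found on PATH."""
    return shutil.which("docker") is not None


# Prints whether the current runtime can see a Docker binary.
print("Docker found:", docker_available())
```

If this prints `False`, the message from serve_grobid.sh matches what the runtime reports, so the Grobid server never starts.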
import scipdf
import os
import pandas as pd
import warnings
from bs4.builder import XMLParsedAsHTMLWarning
warnings.filterwarnings('ignore', category=XMLParsedAsHTMLWarning)
files_to_process = os.listdir("/content/drive/My Drive/PDF/")[:1500]
for idx, filename in enumerate(files_to_process, start=1):
    if filename.lower().endswith(".pdf"):
        try:
            pmid = filename.split(".")[0]
            percorso_file_csv = f"/content/drive/My Drive/CSV/{pmid}.csv"
            dizionario = scipdf.parse_pdf_to_dict(f'/content/drive/My Drive/PDF/{filename}')
            sections_content = [f"{key}: {value}" for section in dizionario.get('sections', []) for key, value in section.items()]
            #references_content = [f"{key}: {value}" for reference in dizionario.get('references', []) for key, value in reference.items()]
            #figures_content = [f"{key}: {value}" for figure in dizionario.get('figures', []) for key, value in figure.items()]
            content = {
                "Title": f"{dizionario.get('title', '')}\n",
                "Authors": f"{dizionario.get('authors', '')}\n",
                "Publication date": f"{dizionario.get('pub_date', '')}\n",
                "Abstract": f"{dizionario.get('abstract', '')}\n",
                "Sections": "\n".join(sections_content),
                #"References": "\n".join(references_content),
                #"Figures": "\n".join(figures_content),
                "Doi": f"{dizionario.get('doi', '')}"
            }
            df = pd.DataFrame([[pmid, content_str]])
            df.to_csv(percorso_file_csv, index=False, header=["pmid", "content"])
            print(f"CSV file successfully created for {filename}")
            print(f"File number {idx}")
        except Exception as e:
            print(e)
            continue
(It now returns:
OSError Traceback (most recent call last)
OSError: [Errno 5] Input/output error: '/content/drive/My Drive/PDF/')
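To separate the Drive problem from the parser problem, a small sketch like the following can test whether the folder is readable before any PDF is parsed. The helper name `list_pdfs` is hypothetical, and unlike the loop above it filters for `.pdf` first and applies the 1500 limit afterwards; it does not touch scipdf at all.

```python
import os


def list_pdfs(folder: str, limit: int = 1500) -> list[str]:
    """Return up to `limit` PDF file names in `folder`, or [] if it cannot be read."""
    try:
        names = sorted(os.listdir(folder))
    except OSError as err:
        # Covers a missing folder as well as an I/O error on the mounted drive.
        print(f"Cannot read {folder!r}: {err}")
        return []
    return [name for name in names if name.lower().endswith(".pdf")][:limit]


# Same folder as in the loop above; prints the error instead of raising.
pdf_files = list_pdfs("/content/drive/My Drive/PDF/")
print(f"{len(pdf_files)} PDF files found")
```

If this already reports the Input/output error, the failure happens while reading the mounted Drive folder, before scipdf is ever called.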
What is going on?
Thank you so much in advance!!