Problem with umlaut in exported project ZIP
Describe the bug
Importing an export fails with the following error:
Error importing project: NullPointerException: Source document must be specified
We are trying to export from one INCEpTION instance and import into a clean, newly deployed second instance, but the import fails.
Both instances have the same version deployed but the old one used HSQLDB while the new one uses MariaDB.
To Reproduce
Use the web interface to export a project in one instance, then use the web interface to import it in the other instance.
Expected behavior
The export/import via the web interface should work.
Screenshots
No screenshots, but a stack trace:
And this is the Docker Compose file we used to set up INCEpTION:
networks:
  web:
    external: true
  inception-net:
services:
  db:
    image: "mariadb:10.7"
    environment:
      - MYSQL_RANDOM_ROOT_PASSWORD=yes
      - MYSQL_DATABASE=inception
      - MYSQL_USER=${DBUSER:-inception}
      - MYSQL_PORT=3306
      - MYSQL_PASSWORD=${DBPASSWORD:-ADIFFERENTPASSWORDOFCOURSE}
    volumes:
      - /inception/data/inception/db-data:/var/lib/mysql
    command: ["--character-set-server=utf8mb4", "--collation-server=utf8mb4_bin"]
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost", "-p${DBPASSWORD:-ADIFFERENTPASSWORDOFCOURSE}", "-u${DBUSER:-inception}"]
      interval: 20s
      timeout: 10s
      retries: 10
    networks:
      inception-net:
  app:
    image: "${INCEPTION_IMAGE:-ghcr.io/inception-project/inception}:${INCEPTION_VERSION:-29.9}"
    environment:
      - INCEPTION_DB_DIALECT=org.hibernate.dialect.MariaDB103Dialect
      - INCEPTION_DB_DRIVER=org.mariadb.jdbc.Driver
      - INCEPTION_DB_URL=jdbc:mariadb://db:3306/inception?useSSL=false&useUnicode=true&characterEncoding=UTF-8
      - INCEPTION_DB_USERNAME=${DBUSER:-inception}
      - INCEPTION_DB_PASSWORD=${DBPASSWORD:-ADIFFERENTPASSWORDOFCOURSE}
      - JAVA_OPTS=-Dspring.jpa.properties.hibernate.dialect.storage_engine=innodb
    volumes:
      - /inception/data/inception/app-data:/export
    depends_on:
      db:
        condition: service_healthy
    restart: unless-stopped
    labels:
      - traefik.enable=true
      - traefik.http.routers.inception.entryPoints=websecure # entrypoint
      - traefik.http.routers.inception.rule=Host(`myurl`) # host
      - traefik.http.routers.inception.service=inception # service to target
      - traefik.http.routers.inception.tls=true # use tls
      - traefik.http.routers.inception.tls.certResolver=lets-encrypt
      - traefik.http.services.inception.loadbalancer.server.port=8080
      - traefik.http.services.inception.loadbalancer.passhostheader=true
    networks:
      web:
      inception-net:
Please complete the following information:
- Version and build ID: INCEpTION -- 29.0 (2023-08-08 14:34:53, build 86d5bbd2)
- OS: Linux
- Browser: Firefox
I was able to copy the /export folder from the old instance. Now I'm trying to write a script that reads the HSQLDB and ports the data into MariaDB. Hacky, but it may work.
I think it is better to go the way of exporting the projects and importing them again.
Now regarding your problem:
Please unzip the exported project file that is giving you problems.
Please check the folder source from the ZIP. For every file inside that folder, there should be an entry in the source_documents section of the exportedprojectXXXX.json file that is also in the ZIP.
I would guess that the error occurs because there is some kind of mismatch between the folder content and the JSON file content...
Okay, I tested this with the following command:
diff <(ls -1 source/ | sort) <(jq -r '.source_documents[].name' < exportedproject.json | sort)
This diff showed me entries like this:
19c19
< f├╝r jetzt.pdf
---
> für jetzt.pdf
This looks like an encoding issue with the file names in the source directory when it comes to umlauts. For all the other files there are no differences.
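The garbled name is what you would get if the UTF-8 bytes of "ü" were decoded as CP437, the code page ZIP tools commonly fall back to when an entry is not flagged as UTF-8. That is only a hypothesis about this export, but it is easy to check. A minimal Python sketch (the helper name repair_cp437_mojibake is made up for illustration and is not part of INCEpTION):

```python
def repair_cp437_mojibake(name: str) -> str:
    """Undo a UTF-8 name that was wrongly decoded as CP437.

    Returns the name unchanged if it does not fit that pattern.
    """
    try:
        # Recover the raw bytes, then decode them as UTF-8.
        return name.encode("cp437").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not CP437-decoded UTF-8 (e.g. already correct), leave it alone.
        return name


print(repair_cp437_mojibake("f├╝r jetzt.pdf"))  # für jetzt.pdf
```

If the repaired name matches the entry in the JSON file, the file name stored in the ZIP is most likely UTF-8 without the corresponding flag, rather than a genuinely different name.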
What I noticed is that during the import, the directories corresponding to the IDs in the table in the database got created but no documents were actually uploaded. I initially thought it may be a permissions issue with the docker volume mount but that doesn't seem to be the case...
The code is written so the directories get created first... don't ask why, I have no idea :) In the beginning, there were directories I guess.
So where is the umlaut broken? In the filename or in the JSON?
In the filename.
OK. Please try renaming the file inside the ZIP and then try to import again. It is best to do this with a tool that does not require you to extract the ZIP file. If you do need to extract it, make sure that you do not accidentally introduce a top-level folder in the new ZIP when re-compressing.
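One way to rename an entry without extracting to disk is to copy the archive entry by entry with Python's zipfile module. This is a rough sketch, not an official INCEpTION tool: the function name rename_in_zip and the paths in the usage comment are hypothetical, and it reads each entry fully into memory, so it may be slow for very large projects.

```python
import zipfile


def rename_in_zip(src_path, dst_path, renames):
    """Copy src_path to dst_path, renaming the entries listed in `renames`.

    `renames` maps entry names as Python's zipfile reports them to the
    desired names. All other entries keep their exact paths, so no extra
    top-level folder is introduced.
    """
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():
            new_name = renames.get(info.filename, info.filename)
            # zipfile stores non-ASCII names as UTF-8 and sets the UTF-8 flag.
            dst.writestr(new_name, src.read(info), compress_type=info.compress_type)


# Hypothetical usage; check the actual entry names with ZipFile.namelist() first:
# rename_in_zip("export.zip", "export-fixed.zip",
#               {"source/f├╝r jetzt.pdf": "source/für jetzt.pdf"})
```

Writing to a new file rather than modifying the export in place keeps the original ZIP available in case the import still fails.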