
Non-UTF-8 HTML files fail to download/sync properly

Open thedanbob opened this issue 2 months ago • 9 comments

Describe the bug

Uploaded HTML files that are not UTF-8 encoded are not downloaded or synced intact.

Steps to reproduce

  1. Upload an HTML file with non-UTF-8 characters (example: test.htm)
  2. Download the file and compare checksums (for the example file: original 8d1c3bb3587e137d3671f1759b5ef488b2db67ca, downloaded fa06ac170eb3f0c9051bde3551633014fc36a8b0)
  3. Alternatively, attempt to sync the file using the desktop app. The file will fail to sync as the checksums do not match
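For reference, steps 1–2 can be sketched with curl against a WebDAV endpoint. The URL and credentials below are placeholders, and the sample file is generated locally instead of using the attached test.htm:

```shell
# Create a small HTML file whose bytes are valid windows-1252 but not valid
# UTF-8 (0xE9 is "é" in windows-1252 but an invalid lone byte in UTF-8).
printf '<html><head><meta charset="windows-1252"></head><body>caf\xe9</body></html>\n' > repro.htm
sha1sum repro.htm           # note the original checksum

# Round trip through the server (placeholder URL and credentials):
#   curl -sS -u user:pass -T repro.htm 'https://cloud.example.com/remote.php/webdav/repro.htm'
#   curl -sS -u user:pass -o downloaded.htm 'https://cloud.example.com/remote.php/webdav/repro.htm'
#   sha1sum downloaded.htm  # on an affected instance this differs from the original

# Confirm the local file is not valid UTF-8:
file -b --mime-encoding repro.htm   # e.g. "iso-8859-1" or "unknown-8bit", not "utf-8"
```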

Expected behavior

Files should always download byte-for-byte as they were uploaded.

Actual behavior

Non-UTF-8 characters are being scrubbed from the HTML file and replaced with �.

Setup

I'm running opencloudeu/opencloud-rolling:latest in podman, not the compose setup (I only need file syncing).

OC_URL=<snip>
OC_OIDC_ISSUER=<snip>
OC_OIDC_CLIENT_ID=<snip>
OC_EXCLUDE_RUN_SERVICES=idp
PROXY_TLS=false
PROXY_CSP_CONFIG_FILE_LOCATION=/etc/opencloud/csp.yaml
PROXY_OIDC_REWRITE_WELLKNOWN=true
PROXY_USER_OIDC_CLAIM=preferred_username
PROXY_USER_CS3_CLAIM=username
PROXY_AUTOPROVISION_ACCOUNTS=true
GRAPH_USERNAME_MATCH=none
PROXY_ROLE_ASSIGNMENT_DRIVER=oidc
PROXY_ROLE_ASSIGNMENT_OIDC_CLAIM=opencloudRoles
WEB_UI_CONFIG_FILE=/etc/opencloud/web.config.json
WEB_ASSET_APPS_PATH=/etc/opencloud/assets

Additional context

I tried downloading the file directly from the OpenCloud instance via ip:port (bypassing my reverse proxy) and it still had the issue, so the problem must be within OpenCloud. Also, the file stored on the server matches the uploaded file so the issue is specifically when downloading.

thedanbob avatar Nov 11 '25 17:11 thedanbob

@rhafer @dragotin @dragonchaser

This needs to be confirmed IMHO.

micbar avatar Nov 11 '25 19:11 micbar

In case it's helpful, here are two more files that reproduce the issue, as well as one that doesn't for some reason (even though it's from the same source):

broken1.htm broken2.htm working.htm

thedanbob avatar Nov 11 '25 21:11 thedanbob

I can see that the htm files are encoded with windows-1252:

  <meta http-equiv="content-type" content="text/html; charset=windows-1252">

And there are some "invisible" characters: U+00A0 (a non-breaking space) is invisible, and 0x0A is LF / linefeed.
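To make such invisible or non-ASCII bytes visible, one option is to list every byte outside the ASCII range with its offset (a sketch using GNU grep's -P mode; the sample file is created inline here rather than using the attached .htm files):

```shell
# Create a sample containing one non-ASCII byte (0xA0, the low byte of U+00A0):
printf 'abc\xa0def\n' > sample.bin

# List offset:match for every byte >= 0x80.
# -a treats the file as text, -b prefixes byte offsets, -o prints only the
# matches, -P enables the \xNN escape syntax (GNU grep).
LC_ALL=C grep -oabP '[\x80-\xff]' sample.bin   # first field is the byte offset (3 here)

# Or simply eyeball the raw bytes:
od -An -tx1 sample.bin
```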

AI produced this test script:

#!/usr/bin/env bash
#
# webdav-encoding-test.sh
#
# Verify that a WebDAV server does NOT alter the byte‑level encoding
# (windows‑1252) nor the SHA‑256 hash of an HTML file that contains
# deliberately invalid UTF‑8 sequences.

set -euo pipefail   # abort on error, undefined var, or failed pipe

# ---------- helpers ----------
usage() {
    grep '^#' "$0" | cut -c4-
    exit 1
}
log()   { printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*"; }
die()   { log "ERROR: $*" >&2; exit 1; }

# ---------- parse arguments ----------
USER='' PASS='' TOKEN='' BASEURL='' SUBDIR='' INSECURE=0

while getopts ":u:p:U:t:d:kh" opt; do
    case $opt in
        u) USER=$OPTARG ;;
        p) PASS=$OPTARG ;;
        U) BASEURL=$OPTARG ;;
        t) TOKEN=$OPTARG ;;
        d) SUBDIR=$OPTARG ;;
        k) INSECURE=1 ;;
        h) usage ;;
        \?) die "Invalid option: -$OPTARG" ;;
        :) die "Option -$OPTARG requires an argument." ;;
    esac
done

[[ -z $BASEURL ]] && die "Base URL (-U) is mandatory."

# Normalise URL – ensure it ends with a single slash
BASEURL="${BASEURL%/}/"

# Build the final remote URL (remote directory + file name)
REMOTE_PATH="${BASEURL}${SUBDIR}"
REMOTE_PATH="${REMOTE_PATH%/}/"   # make sure there is exactly one slash at the end

# ---------- constants ----------
FILENAME="win1252‑test‑$(date +%s).html"
LOCAL_ORIG="$(mktemp /tmp/${FILENAME}.orig.XXXXXX)"
LOCAL_DOWN="$(mktemp /tmp/${FILENAME}.down.XXXXXX)"

# A linefeed (0x0A) followed by five bytes that are invalid in UTF‑8 but
# harmless as raw Windows‑1252 content
INVALID_BYTES=$'\x0A\x81\x8D\x8F\x90\x9D'

# ---------- create the test HTML ----------
cat >"$LOCAL_ORIG" <<'EOF'
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="windows-1252">
    <title>WebDAV encoding test</title>
</head>
<body>
<p>This file is encoded in Windows‑1252 and contains deliberately invalid UTF‑8 bytes.</p>
EOF

# Append a plain LF (0x0A) followed by the illegal bytes
printf '\n' >>"$LOCAL_ORIG"
printf '%b' "$INVALID_BYTES" >>"$LOCAL_ORIG"

log "Created test file: $LOCAL_ORIG (size: $(stat -c%s "$LOCAL_ORIG") bytes)"

# ---------- compute original hash ----------
ORIG_HASH=$(openssl dgst -sha256 "$LOCAL_ORIG" | awk '{print $2}')
log "Original SHA‑256: $ORIG_HASH"

# ---------- sanity‑check the original encoding ----------
ORIG_ENC=$(file -b --mime-encoding "$LOCAL_ORIG")
log "Original file reports MIME‑encoding: $ORIG_ENC"
[[ $ORIG_ENC != "windows-1252" && $ORIG_ENC != "binary" ]] && \
    log "NOTE: 'file' reports '$ORIG_ENC'. This is fine as long as the bytes are unchanged."

# ---------- build curl option string ----------
# If -k/--insecure was given, add -k to every curl invocation.
CURL_INSECURE_OPTS=()
if (( INSECURE )); then
    CURL_INSECURE_OPTS+=("-k")
    log "Insecure mode enabled – curl will skip TLS certificate verification."
fi

# ---------- upload via WebDAV ----------
UPLOAD_URL="${REMOTE_PATH}${FILENAME}"
log "Uploading to $UPLOAD_URL …"

if [[ -n $TOKEN ]]; then
    curl -sS -X PUT "${CURL_INSECURE_OPTS[@]}" \
         -H "Authorization: Bearer $TOKEN" \
         --data-binary @"$LOCAL_ORIG" \
         "$UPLOAD_URL"
else
    curl -sS -X PUT "${CURL_INSECURE_OPTS[@]}" \
         -u "$USER:$PASS" \
         --data-binary @"$LOCAL_ORIG" \
         "$UPLOAD_URL"
fi
log "Upload finished."

# ---------- download the same file ----------
log "Downloading back to $LOCAL_DOWN …"
if [[ -n $TOKEN ]]; then
    curl -sS "${CURL_INSECURE_OPTS[@]}" \
         -H "Authorization: Bearer $TOKEN" \
         -o "$LOCAL_DOWN" "$UPLOAD_URL"
else
    curl -sS "${CURL_INSECURE_OPTS[@]}" \
         -u "$USER:$PASS" \
         -o "$LOCAL_DOWN" "$UPLOAD_URL"
fi
log "Download finished (size: $(stat -c%s "$LOCAL_DOWN") bytes)."

# ---------- compute downloaded hash ----------
DOWN_HASH=$(openssl dgst -sha256 "$LOCAL_DOWN" | awk '{print $2}')
log "Downloaded SHA‑256: $DOWN_HASH"

# ---------- compare hashes ----------
if [[ "$ORIG_HASH" != "$DOWN_HASH" ]]; then
    die "HASH MISMATCH! The file was altered during the WebDAV round‑trip."
else
    log "✅ Hashes match – byte‑wise integrity preserved."
fi

# ---------- verify charset declaration ----------
if grep -i -q '<meta[^>]*charset=["'\'']\?windows-1252["'\'']\?' "$LOCAL_DOWN"; then
    log "✅ Charset meta‑tag still present."
else
    die "Charset meta‑tag missing or altered in the downloaded file."
fi

# ---------- verify the illegal bytes are still there ----------
# Extract the last six bytes (LF + the five illegal bytes) and display them as hex.
LAST_LINE_HEX=$(tail -c +$(($(stat -c%s "$LOCAL_DOWN") - 5)) "$LOCAL_DOWN" | xxd -p -c 0)
log "Last 6 bytes (hex) of the downloaded file: $LAST_LINE_HEX"

EXPECTED_HEX=$(printf '%s' "$INVALID_BYTES" | xxd -p -c 0)
if [[ "$LAST_LINE_HEX" != "$EXPECTED_HEX" ]]; then
    die "Invalid‑byte sequence corrupted (expected $EXPECTED_HEX, got $LAST_LINE_HEX)."
else
    log "✅ Invalid Windows‑1252 bytes survived unchanged."
fi

# ---------- clean‑up ----------
log "All checks passed. Cleaning up temporary files."
rm -f "$LOCAL_ORIG" "$LOCAL_DOWN"

log "✅ Test completed successfully."
exit 0

and it passes nicely for me:

❯ bash ./webdav-encoding-test.sh -u dennis -p demo -U https://opencloud-server:9200/remote.php/webdav -k
[14:51:57] Created test file: /tmp/win1252‑test‑1762955517.html.orig.RS64Zx (size: 231 bytes)
[14:51:57] Original SHA‑256: c8c646818aef558a67c5e0e6c8d20e200f33689fb7a17304c05df6da20a82547
[14:51:57] Original file reports MIME‑encoding: unknown-8bit
[14:51:57] NOTE: 'file' reports 'unknown-8bit'. This is fine as long as the bytes are unchanged.
[14:51:57] Insecure mode enabled – curl will skip TLS certificate verification.
[14:51:57] Uploading to https://opencloud-server:9200/remote.php/webdav/win1252‑test‑1762955517.html …
[14:51:57] Upload finished.
[14:51:57] Downloading back to /tmp/win1252‑test‑1762955517.html.down.zfwCXB …
[14:51:57] Download finished (size: 231 bytes).
[14:51:57] Downloaded SHA‑256: c8c646818aef558a67c5e0e6c8d20e200f33689fb7a17304c05df6da20a82547
[14:51:57] ✅ Hashes match – byte‑wise integrity preserved.
[14:51:57] ✅ Charset meta‑tag still present.
[14:51:57] Last 6 bytes (hex) of the downloaded file: 0a818d8f909d
[14:51:57] ✅ Invalid Windows‑1252 bytes survived unchanged.
[14:51:57] All checks passed. Cleaning up temporary files.
[14:51:57] ✅ Test completed successfully.

You may need to enable basic auth for this, or use the -t option to pass in a bearer auth token from your browser.
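(Side note on enabling basic auth: in ownCloud Infinite Scale, which OpenCloud is derived from, this is a proxy environment variable; I'm assuming OpenCloud kept the same name, so verify against the OpenCloud docs before relying on it:)

```
PROXY_ENABLE_BASIC_AUTH=true
```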

@thedanbob can you run this test script?

That being said, ... when renaming those htm files to txt so I can edit them, the GET response sends a Content-Type: text/plain; charset=UTF-8 header ... maybe the desktop client expects that encoding then? cc @TheOneRing

butonic avatar Nov 12 '25 13:11 butonic

I ran the script both against the external domain and, on my server, against the internal IP:port; the results were the same both times:

> % ./webdav-encoding-test.sh -U 'https://<snip>/remote.php/dav/spaces/<snip>' -u <snip> -p <snip>
[09:33:06] Created test file: /tmp/win1252‑test‑1762961586.html.orig.nNqVzd (size: 231 bytes)
[09:33:06] Original SHA‑256: c8c646818aef558a67c5e0e6c8d20e200f33689fb7a17304c05df6da20a82547
[09:33:06] Original file reports MIME‑encoding: unknown-8bit
[09:33:06] NOTE: 'file' reports 'unknown-8bit'. This is fine as long as the bytes are unchanged.
[09:33:06] Uploading to https://<snip>/remote.php/dav/spaces/<snip>/win1252‑test‑1762961586.html …
[09:33:07] Upload finished.
[09:33:12] Downloading back to /tmp/win1252‑test‑1762961586.html.down.aSr4OT …
[09:33:13] Download finished (size: 241 bytes).
[09:33:13] Downloaded SHA‑256: 3e2f324e91f1d5da06fcc4460a4cf9aa4389468ffaf638fd161cddf8d97b9e32
[09:33:13] ERROR: HASH MISMATCH! The file was altered during the WebDAV round‑trip.
> % ./webdav-encoding-test.sh -U 'http://10.89.0.33:9200/remote.php/dav/spaces/<snip>' -u <snip> -p <snip>
[09:34:16] Created test file: /tmp/win1252‑test‑1762961656.html.orig.0hK4tw (size: 231 bytes)
[09:34:16] Original SHA‑256: c8c646818aef558a67c5e0e6c8d20e200f33689fb7a17304c05df6da20a82547
[09:34:16] Original file reports MIME‑encoding: unknown-8bit
[09:34:16] NOTE: 'file' reports 'unknown-8bit'. This is fine as long as the bytes are unchanged.
[09:34:16] Uploading to http://10.89.0.33:9200/remote.php/dav/spaces/<snip>/win1252‑test‑1762961656.html …
[09:34:16] Upload finished.
[09:34:21] Downloading back to /tmp/win1252‑test‑1762961656.html.down.QbKwhr …
[09:34:21] Download finished (size: 241 bytes).
[09:34:21] Downloaded SHA‑256: 3e2f324e91f1d5da06fcc4460a4cf9aa4389468ffaf638fd161cddf8d97b9e32
[09:34:21] ERROR: HASH MISMATCH! The file was altered during the WebDAV round‑trip.

When I examine the downloaded files, the invalid bytes have been replaced with �.

thedanbob avatar Nov 12 '25 15:11 thedanbob

It seems that only HTML files are affected. When I changed the script to write a .txt file with the same contents, it passed.

~Edit: In case it's significant, I couldn't get the /remote.php/webdav endpoint to work which is why I used /remote.php/dav/spaces/<id>. Maybe that uses a slightly different code path which triggers the bug?~

Edit 2: never mind, I tried /remote.php/webdav again and it worked but I still got the hash mismatch.

Edit 3: I tried running the test for as many different text-based file formats as I could think of (i.e. writing the same content with a different file extension). The only file types that were affected were .html and .htm, and only if they actually contained HTML. If I made sure there were no valid HTML fragments in the file (for example, removing all the opening tags) then the issue didn't occur.
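The per-extension experiment from Edit 3 can be sketched like this (only the local file creation is shown; the upload/download round trip needs a live instance):

```shell
# Write byte-identical non-UTF-8 content under several extensions. Locally all
# checksums match, so any difference after a server round trip depends only on
# the extension / detected content type.
for ext in html htm txt md csv xml; do
    printf '<p>caf\xe9</p>\n' > "probe.$ext"
done
sha1sum probe.*
```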

thedanbob avatar Nov 12 '25 15:11 thedanbob

This is making less sense all the time. I tried spinning up a new OpenCloud instance on my local machine using the directions here, and it passes the test script. Yet if I try downloading one of the test files within the running container on my server, it gets altered. So I guess it has to be my configuration? But I tried disabling as much of my config as possible and nothing made a difference.

thedanbob avatar Nov 12 '25 21:11 thedanbob

Here are debug logs from downloading one of the offending files: download.log

The last two lines (copied below) are the most interesting because they show the content length changing between the internal request to /data (231 bytes) and the originating external request (241 bytes). Unfortunately, I don't know Go well enough to figure out where the /data request originates and what might happen after that.

Nov 17 13:42:37 potados systemd-opencloud[313364]: {"level":"info","service":"proxy","proto":"HTTP/1.1","request-id":"opencloud/yaGSSaCTYi-000031","traceid":"65ae1abf18015025e979cc35898d57f6","remote-addr":"68.97.210.9","method":"GET","status":200,"path":"/data","duration":6.957201,"bytes":231,"time":"2025-11-17T19:42:37Z","line":"github.com/opencloud-eu/opencloud/services/proxy/pkg/middleware/accesslog.go:34","message":"access-log"}
Nov 17 13:42:37 potados systemd-opencloud[313364]: {"level":"info","service":"proxy","proto":"HTTP/1.1","request-id":"opencloud/yaGSSaCTYi-000029","traceid":"57bb311fac20c4b308989d8c92733ae5","remote-addr":"::ffff:10.13.13.3","method":"GET","status":200,"path":"/remote.php/webdav/win1252-test-1763408517.html","duration":172.318565,"bytes":241,"time":"2025-11-17T19:42:37Z","line":"github.com/opencloud-eu/opencloud/services/proxy/pkg/middleware/accesslog.go:34","message":"access-log"}

thedanbob avatar Nov 17 '25 22:11 thedanbob

We discussed the issue briefly, and our opinion is also that it is something in your configuration. However, we do not know what yet, and that does not feel good.

A few questions:

  1. Can you elaborate about your server: Hardware, Operating System, Filesystem, Docker?
  2. Is a virus scanning app involved somewhere?
  3. Any system proxies or vpns involved?

Also, can you check again that the file "at rest" in OpenCloud is not altered at all? You wrote that above, but it is important enough to verify again.

And: how do the download from within the container and the external download actually differ in the end? Could you diff them? Is it just a different encoding?

It also seems that the internal request uses IPv4 while the external one uses IPv6 - whatever that may mean for this problem.

dragotin avatar Nov 18 '25 09:11 dragotin

> However, we do not know what yet, and that does not feel good.

Exactly how I feel! The actual issue isn't that big a deal and I could work around it, but the fact that it doesn't make sense is bugging me.

My server is an Asrock A300 with an AMD Ryzen 5 3400G and 32 GB RAM running Arch Linux. I'm running podman rather than docker, here's the command:

/usr/bin/podman run --name systemd-%N --replace --rm --cgroups=split \
  --hostname opencloud --ip 10.89.0.33 --add-host (oidc host):10.0.0.2 \
  --network systemd-primary --sdnotify=conmon -d --user 1023:1023 \
  -v /srv/docker/opencloud:/etc/opencloud \
  -v /mnt/zfs/opencloud:/var/lib/opencloud \
  --label io.containers.autoupdate=local \
  --env GRAPH_USERNAME_MATCH=none \
  --env OC_EXCLUDE_RUN_SERVICES=idp \
  --env OC_OIDC_CLIENT_ID=b2de8dfe-2732-4a89-8f4d-fe3f27e59597 \
  --env OC_OIDC_ISSUER=https://(oidc host) \
  --env OC_URL=https://(opencloud host) \
  --env PROXY_AUTOPROVISION_ACCOUNTS=true \
  --env PROXY_CSP_CONFIG_FILE_LOCATION=/etc/opencloud/csp.yaml \
  --env PROXY_OIDC_REWRITE_WELLKNOWN=true \
  --env PROXY_ROLE_ASSIGNMENT_DRIVER=oidc \
  --env PROXY_ROLE_ASSIGNMENT_OIDC_CLAIM=opencloudRoles \
  --env PROXY_TLS=false \
  --env PROXY_USER_CS3_CLAIM=username \
  --env PROXY_USER_OIDC_CLAIM=preferred_username \
  --env WEB_ASSET_APPS_PATH=/etc/opencloud/assets \
  --env WEB_UI_CONFIG_FILE=/etc/opencloud/web.config.json \
  docker.io/opencloudeu/opencloud-rolling:latest

The configuration folder is on an ext4 SSD while the storage folder is on a ZFS pool. No virus scanning. I have haproxy sitting in front of opencloud, but I get the same issue if I download the file within the running container, bypassing any proxies.

I can confirm that the file is uploaded and stored correctly on disk. The issue appears to only occur on download.

Here is a partial diff of a changed file (the rest of the diff is identical):

Original:

000000d0: e280 9138 2062 7974 6573 2e3c 2f70 3e0a  ...8 bytes
000000e0: 0a0a 818d 8f90 9d                        .......

Downloaded:

000000d0: e280 9138 2062 7974 6573 2e3c 2f70 3e0a  ...8 bytes
000000e0: 0a0a efbf bdef bfbd efbf bdef bfbd efbf  ..........
000000f0: bd                                       .

The final five bytes 81 8d 8f 90 9d are each replaced by ef bf bd, the UTF-8 encoding of the replacement character � (U+FFFD). It's as if the file is run through a string parser at some point and converted to valid UTF-8, but only if it's a legitimate HTML file.
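The observed size change is consistent with that substitution; a quick sanity check of the arithmetic:

```shell
# U+FFFD encodes to three bytes in UTF-8:
printf '\xef\xbf\xbd' | wc -c          # 3
# Five 1-byte invalid bytes each grow to 3 bytes: 5 * (3 - 1) = 10 extra bytes,
# matching the observed 231 -> 241:
echo $(( 231 + 5 * (3 - 1) ))          # 241
```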

I just noticed that there is a difference between GET and HEAD requests: with GET, the file is changed and the content length increases accordingly. But HEAD (e.g. curl -I) returns the correct original content length.

thedanbob avatar Nov 18 '25 12:11 thedanbob

I solved it by accident while experimenting with a second OpenCloud instance on my server. Apparently all I needed to do was add --add-host external.domain.com:10.0.0.2 to the podman run command (where external.domain.com is the public domain name and 10.0.0.2 is my server's IP address). The second instance wouldn't let me upload at all without that, but somehow it worked on my original instance except for corrupting particular file downloads 🤷‍♂️

Edit: Just in case someone else runs into this issue, I can confirm that OpenCloud was not responsible for the altered downloads. I was proxying my instance through Cloudflare, and was still seeing the issue off my home network until just now when I switched off the proxying.

thedanbob avatar Nov 18 '25 17:11 thedanbob