Non-UTF-8 HTML files fail to download/sync properly
Describe the bug
Uploaded HTML files that are not UTF-8 do not get downloaded or synced accurately.
Steps to reproduce
- Upload an HTML file with non-UTF-8 characters (example: test.htm)
- Download the file and compare checksums (for the example file: original 8d1c3bb3587e137d3671f1759b5ef488b2db67ca, downloaded fa06ac170eb3f0c9051bde3551633014fc36a8b0)
- Alternatively, attempt to sync the file using the desktop app. The file will fail to sync as the checksums do not match
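The checksum comparison from the steps above can be scripted. This is a minimal sketch, assuming `sha1sum` from GNU coreutils is installed. The 40-character hashes quoted above look like SHA-1, but that is an assumption, and `same_bytes` is a hypothetical helper name:

```shell
# same_bytes ORIGINAL DOWNLOADED   (hypothetical helper name)
# Prints both SHA-1 digests and succeeds only if they are identical.
same_bytes() (
  a=$(sha1sum < "$1" | cut -d' ' -f1)
  b=$(sha1sum < "$2" | cut -d' ' -f1)
  printf 'original:   %s\ndownloaded: %s\n' "$a" "$b"
  [ "$a" = "$b" ]
)

# Usage (file names are placeholders):
#   same_bytes test.htm downloaded-test.htm || echo "file was altered"
```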
Expected behavior
Files should always download byte-for-byte as they were uploaded.
Actual behavior
Non-UTF-8 characters are being scrubbed from the HTML file and replaced with �.
Setup
I'm running opencloudeu/opencloud-rolling:latest in podman, not the compose setup (I only need file syncing).
OC_URL=<snip>
OC_OIDC_ISSUER=<snip>
OC_OIDC_CLIENT_ID=<snip>
OC_EXCLUDE_RUN_SERVICES=idp
PROXY_TLS=false
PROXY_CSP_CONFIG_FILE_LOCATION=/etc/opencloud/csp.yaml
PROXY_OIDC_REWRITE_WELLKNOWN=true
PROXY_USER_OIDC_CLAIM=preferred_username
PROXY_USER_CS3_CLAIM=username
PROXY_AUTOPROVISION_ACCOUNTS=true
GRAPH_USERNAME_MATCH=none
PROXY_ROLE_ASSIGNMENT_DRIVER=oidc
PROXY_ROLE_ASSIGNMENT_OIDC_CLAIM=opencloudRoles
WEB_UI_CONFIG_FILE=/etc/opencloud/web.config.json
WEB_ASSET_APPS_PATH=/etc/opencloud/assets
Additional context
I tried downloading the file directly from the OpenCloud instance via ip:port (bypassing my reverse proxy) and it still had the issue, so the problem must be within OpenCloud. Also, the file stored on the server matches the uploaded file so the issue is specifically when downloading.
@rhafer @dragotin @dragonchaser
This needs to be confirmed IMHO.
In case it's helpful, here are two more files that reproduce the issue, as well as one that doesn't for some reason (even though it's from the same source):
I can see that the htm files are encoded with windows-1252:
<meta http-equiv="content-type" content="text/html; charset=windows-1252">
And there are some "invisible" characters: U+00A0 (no-break space) is invisible, and 0x0A is LF (line feed).
An AI produced this test script:
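Whether a given file is valid UTF-8 can be checked directly. This is a minimal sketch, assuming an `iconv` command that exits non-zero on an illegal input sequence (glibc and GNU libiconv behave this way); `is_valid_utf8` is a hypothetical helper name. In a windows-1252 file, U+00A0 is stored as the single byte 0xA0, which is not valid UTF-8 on its own:

```shell
# is_valid_utf8 FILE   (hypothetical helper name)
# Succeeds if FILE decodes cleanly as UTF-8, fails otherwise.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Usage (file name is a placeholder):
#   is_valid_utf8 test.htm || echo "test.htm contains bytes that are not valid UTF-8"
```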
#!/usr/bin/env bash
#
# webdav-encoding-test.sh
#
# Verify that a WebDAV server does NOT alter the byte‑level encoding
# (windows‑1252) nor the SHA‑256 hash of an HTML file that contains
# deliberately invalid UTF‑8 sequences.
set -euo pipefail # abort on error, undefined var, or failed pipe
# ---------- helpers ----------
usage() {
grep '^#' "$0" | cut -c4-
exit 1
}
log() { printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*"; }
die() { log "ERROR: $*" >&2; exit 1; }
# ---------- parse arguments ----------
USER='' PASS='' TOKEN='' BASEURL='' SUBDIR='' INSECURE=0
while getopts ":u:p:U:t:d:kh" opt; do
case $opt in
u) USER=$OPTARG ;;
p) PASS=$OPTARG ;;
U) BASEURL=$OPTARG ;;
t) TOKEN=$OPTARG ;;
d) SUBDIR=$OPTARG ;;
k) INSECURE=1 ;;
h) usage ;;
\?) die "Invalid option: -$OPTARG" ;;
:) die "Option -$OPTARG requires an argument." ;;
esac
done
[[ -z $BASEURL ]] && die "Base URL (-U) is mandatory."
# Normalise URL – ensure it ends with a single slash
BASEURL="${BASEURL%/}/"
# Build the final remote URL (remote directory + file name)
REMOTE_PATH="${BASEURL}${SUBDIR}"
REMOTE_PATH="${REMOTE_PATH%/}/" # make sure there is exactly one slash at the end
# ---------- constants ----------
FILENAME="win1252‑test‑$(date +%s).html"
LOCAL_ORIG="$(mktemp /tmp/${FILENAME}.orig.XXXXXX)"
LOCAL_DOWN="$(mktemp /tmp/${FILENAME}.down.XXXXXX)"
# Test bytes: a plain LF (0x0A) followed by five bytes that are invalid as UTF‑8
# (0x81 0x8D 0x8F 0x90 0x9D; these five are also unassigned in Windows‑1252)
INVALID_BYTES=$'\x0A\x81\x8D\x8F\x90\x9D'
# ---------- create the test HTML ----------
cat >"$LOCAL_ORIG" <<'EOF'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="windows-1252">
<title>WebDAV encoding test</title>
</head>
<body>
<p>This file is encoded in Windows‑1252 and contains deliberately invalid UTF‑8 bytes.</p>
EOF
# Append a plain LF (0x0A) followed by the illegal bytes
printf '\n' >>"$LOCAL_ORIG"
printf '%b' "$INVALID_BYTES" >>"$LOCAL_ORIG"
log "Created test file: $LOCAL_ORIG (size: $(stat -c%s "$LOCAL_ORIG") bytes)"
# ---------- compute original hash ----------
ORIG_HASH=$(openssl dgst -sha256 "$LOCAL_ORIG" | awk '{print $2}')
log "Original SHA‑256: $ORIG_HASH"
# ---------- sanity‑check the original encoding ----------
ORIG_ENC=$(file -b --mime-encoding "$LOCAL_ORIG")
log "Original file reports MIME‑encoding: $ORIG_ENC"
[[ $ORIG_ENC != "windows-1252" && $ORIG_ENC != "binary" ]] && \
log "NOTE: 'file' reports '$ORIG_ENC'. This is fine as long as the bytes are unchanged."
# ---------- build curl option string ----------
# If -k/--insecure was given, add -k to every curl invocation.
CURL_INSECURE_OPTS=()
if (( INSECURE )); then
CURL_INSECURE_OPTS+=("-k")
log "Insecure mode enabled – curl will skip TLS certificate verification."
fi
# ---------- upload via WebDAV ----------
UPLOAD_URL="${REMOTE_PATH}${FILENAME}"
log "Uploading to $UPLOAD_URL …"
if [[ -n $TOKEN ]]; then
curl -sS -X PUT "${CURL_INSECURE_OPTS[@]}" \
-H "Authorization: Bearer $TOKEN" \
--data-binary @"$LOCAL_ORIG" \
"$UPLOAD_URL"
else
curl -sS -X PUT "${CURL_INSECURE_OPTS[@]}" \
-u "$USER:$PASS" \
--data-binary @"$LOCAL_ORIG" \
"$UPLOAD_URL"
fi
log "Upload finished."
# ---------- download the same file ----------
log "Downloading back to $LOCAL_DOWN …"
if [[ -n $TOKEN ]]; then
curl -sS "${CURL_INSECURE_OPTS[@]}" \
-H "Authorization: Bearer $TOKEN" \
-o "$LOCAL_DOWN" "$UPLOAD_URL"
else
curl -sS "${CURL_INSECURE_OPTS[@]}" \
-u "$USER:$PASS" \
-o "$LOCAL_DOWN" "$UPLOAD_URL"
fi
log "Download finished (size: $(stat -c%s "$LOCAL_DOWN") bytes)."
# ---------- compute downloaded hash ----------
DOWN_HASH=$(openssl dgst -sha256 "$LOCAL_DOWN" | awk '{print $2}')
log "Downloaded SHA‑256: $DOWN_HASH"
# ---------- compare hashes ----------
if [[ "$ORIG_HASH" != "$DOWN_HASH" ]]; then
die "HASH MISMATCH! The file was altered during the WebDAV round‑trip."
else
log "✅ Hashes match – byte‑wise integrity preserved."
fi
# ---------- verify charset declaration ----------
if grep -i -q '<meta[^>]*charset=["'\'']\?windows-1252["'\'']\?' "$LOCAL_DOWN"; then
log "✅ Charset meta‑tag still present."
else
die "Charset meta‑tag missing or altered in the downloaded file."
fi
# ---------- verify the illegal bytes are still there ----------
# Extract the last six bytes (the LF plus the five invalid bytes) and display them as hex.
LAST_LINE_HEX=$(tail -c +$(($(stat -c%s "$LOCAL_DOWN") - 5)) "$LOCAL_DOWN" | xxd -p -c 0)
log "Last 6 bytes (hex) of the downloaded file: $LAST_LINE_HEX"
EXPECTED_HEX=$(printf '%s' "$INVALID_BYTES" | xxd -p -c 0)
if [[ "$LAST_LINE_HEX" != "$EXPECTED_HEX" ]]; then
die "Invalid‑byte sequence corrupted (expected $EXPECTED_HEX, got $LAST_LINE_HEX)."
else
log "✅ Invalid Windows‑1252 bytes survived unchanged."
fi
# ---------- clean‑up ----------
log "All checks passed. Cleaning up temporary files."
rm -f "$LOCAL_ORIG" "$LOCAL_DOWN"
log "✅ Test completed successfully."
exit 0
and it passes nicely for me:
⯠bash ./webdav-encoding-test.sh -u dennis -p demo -U https://opencloud-server:9200/remote.php/webdav -k
[14:51:57] Created test file: /tmp/win1252‑test‑1762955517.html.orig.RS64Zx (size: 231 bytes)
[14:51:57] Original SHA‑256: c8c646818aef558a67c5e0e6c8d20e200f33689fb7a17304c05df6da20a82547
[14:51:57] Original file reports MIME‑encoding: unknown-8bit
[14:51:57] NOTE: 'file' reports 'unknown-8bit'. This is fine as long as the bytes are unchanged.
[14:51:57] Insecure mode enabled – curl will skip TLS certificate verification.
[14:51:57] Uploading to https://opencloud-server:9200/remote.php/webdav/win1252‑test‑1762955517.html …
[14:51:57] Upload finished.
[14:51:57] Downloading back to /tmp/win1252‑test‑1762955517.html.down.zfwCXB …
[14:51:57] Download finished (size: 231 bytes).
[14:51:57] Downloaded SHA‑256: c8c646818aef558a67c5e0e6c8d20e200f33689fb7a17304c05df6da20a82547
[14:51:57] ✅ Hashes match – byte‑wise integrity preserved.
[14:51:57] ✅ Charset meta‑tag still present.
[14:51:57] Last 6 bytes (hex) of the downloaded file: 0a818d8f909d
[14:51:57] ✅ Invalid Windows‑1252 bytes survived unchanged.
[14:51:57] All checks passed. Cleaning up temporary files.
[14:51:57] ✅ Test completed successfully.
You may need to enable basic auth for this, or use the -t option to pass in a bearer auth token from your browser.
@thedanbob can you run this test script?
That being said, ... when renaming those htm files to txt so I can edit them, the GET response sends a Content-Type: text/plain; charset=UTF-8 header ... maybe the desktop client expects that encoding then? cc @TheOneRing
I ran the script using both the external domain and on my server using the internal IP:port and the results were the same both times:
> % ./webdav-encoding-test.sh -U 'https://<snip>/remote.php/dav/spaces/<snip>' -u <snip> -p <snip>
[09:33:06] Created test file: /tmp/win1252‑test‑1762961586.html.orig.nNqVzd (size: 231 bytes)
[09:33:06] Original SHA‑256: c8c646818aef558a67c5e0e6c8d20e200f33689fb7a17304c05df6da20a82547
[09:33:06] Original file reports MIME‑encoding: unknown-8bit
[09:33:06] NOTE: 'file' reports 'unknown-8bit'. This is fine as long as the bytes are unchanged.
[09:33:06] Uploading to https://<snip>/remote.php/dav/spaces/<snip>/win1252‑test‑1762961586.html …
[09:33:07] Upload finished.
[09:33:12] Downloading back to /tmp/win1252‑test‑1762961586.html.down.aSr4OT …
[09:33:13] Download finished (size: 241 bytes).
[09:33:13] Downloaded SHA‑256: 3e2f324e91f1d5da06fcc4460a4cf9aa4389468ffaf638fd161cddf8d97b9e32
[09:33:13] ERROR: HASH MISMATCH! The file was altered during the WebDAV round‑trip.
> % ./webdav-encoding-test.sh -U 'http://10.89.0.33:9200/remote.php/dav/spaces/<snip>' -u <snip> -p <snip>
[09:34:16] Created test file: /tmp/win1252‑test‑1762961656.html.orig.0hK4tw (size: 231 bytes)
[09:34:16] Original SHA‑256: c8c646818aef558a67c5e0e6c8d20e200f33689fb7a17304c05df6da20a82547
[09:34:16] Original file reports MIME‑encoding: unknown-8bit
[09:34:16] NOTE: 'file' reports 'unknown-8bit'. This is fine as long as the bytes are unchanged.
[09:34:16] Uploading to http://10.89.0.33:9200/remote.php/dav/spaces/<snip>/win1252‑test‑1762961656.html …
[09:34:16] Upload finished.
[09:34:21] Downloading back to /tmp/win1252‑test‑1762961656.html.down.QbKwhr …
[09:34:21] Download finished (size: 241 bytes).
[09:34:21] Downloaded SHA‑256: 3e2f324e91f1d5da06fcc4460a4cf9aa4389468ffaf638fd161cddf8d97b9e32
[09:34:21] ERROR: HASH MISMATCH! The file was altered during the WebDAV round‑trip.
When I examine the downloaded files, the invalid bytes have been replaced with �.
It seems like only HTML files specifically are affected. When I changed the script to write a .txt file with the same contents it passed.
~Edit: In case it's significant, I couldn't get the /remote.php/webdav endpoint to work which is why I used /remote.php/dav/spaces/<id>. Maybe that uses a slightly different code path which triggers the bug?~
Edit 2: never mind, I tried /remote.php/webdav again and it worked but I still got the hash mismatch.
Edit 3: I tried running the test for as many different text-based file formats as I could think of (i.e. writing the same content with a different file extension). The only file types that were affected were .html and .htm, and only if they actually contained HTML. If I made sure there were no valid HTML fragments in the file (for example, removing all the opening tags) then the issue didn't occur.
This is making less sense all the time. I tried spinning up a new opencloud instance using the directions here on my local machine -- passes the test script. Yet if I try downloading one of the test files within the running container on my server it gets altered. So I guess it has to be my configuration? But I tried disabling as much of my config as possible and nothing made a difference.
Here are debug logs from downloading one of the offending files: download.log
The last two lines (copied below) are the most interesting because they show the content length changing between the internal request to /data (231 bytes) and the originating external request (241 bytes). Unfortunately, I don't know Go well enough to figure out where the /data request originates and what might happen after that.
Nov 17 13:42:37 potados systemd-opencloud[313364]: {"level":"info","service":"proxy","proto":"HTTP/1.1","request-id":"opencloud/yaGSSaCTYi-000031","traceid":"65ae1abf18015025e979cc35898d57f6","remote-addr":"68.97.210.9","method":"GET","status":200,"path":"/data","duration":6.957201,"bytes":231,"time":"2025-11-17T19:42:37Z","line":"github.com/opencloud-eu/opencloud/services/proxy/pkg/middleware/accesslog.go:34","message":"access-log"}
Nov 17 13:42:37 potados systemd-opencloud[313364]: {"level":"info","service":"proxy","proto":"HTTP/1.1","request-id":"opencloud/yaGSSaCTYi-000029","traceid":"57bb311fac20c4b308989d8c92733ae5","remote-addr":"::ffff:10.13.13.3","method":"GET","status":200,"path":"/remote.php/webdav/win1252-test-1763408517.html","duration":172.318565,"bytes":241,"time":"2025-11-17T19:42:37Z","line":"github.com/opencloud-eu/opencloud/services/proxy/pkg/middleware/accesslog.go:34","message":"access-log"}
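To compare the byte counts across many access-log lines, the `path` and `bytes` fields can be pulled out with `sed`. This is a minimal sketch, assuming every line has `"path"` before `"bytes"` as in the two lines above; `access_bytes` is a hypothetical helper name:

```shell
# access_bytes: reads proxy access-log lines on stdin (hypothetical helper name)
# and prints "<path> <bytes>" for each line that contains both fields.
access_bytes() {
  sed -n 's/.*"path":"\([^"]*\)".*"bytes":\([0-9]*\).*/\1 \2/p'
}

# Usage (download.log is the attached log file):
#   grep '"message":"access-log"' download.log | access_bytes
```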
We discussed the issue briefly, and the opinion is also that it is something within your configuration. However, we do not know what yet, and that does not feel good.
A few questions:
- Can you elaborate about your server: Hardware, Operating System, Filesystem, Docker?
- Is a virus scanning app involved somewhere?
- Any system proxies or vpns involved?
Also, can you check again that the file "at rest" in OpenCloud is not altered at all? You wrote that above, but that is so important.
And - the difference between the download from within the container and from external, how do they differ in the end? Could you diff? Is it just different encoding?
It also seems that the internal request uses IPv4 while the external one uses IPv6, whatever that may mean for this problem.
> However, we do not know what yet, and that does not feel good.
Exactly how I feel! The actual issue isn't that big a deal and I could work around it, but the fact that it doesn't make sense is bugging me.
My server is an Asrock A300 with an AMD Ryzen 5 3400G and 32 GB RAM running Arch Linux. I'm running podman rather than docker, here's the command:
The configuration folder is on an ext4 SSD while the storage folder is on a ZFS pool. No virus scanning. I have haproxy sitting in front of opencloud, but I get the same issue if I download the file within the running container, bypassing any proxies.
I can confirm that the file is uploaded and stored correctly on disk. The issue appears to only occur on download.
Here is a partial diff of a changed file (the rest of the diff is identical):
000000d0: e280 9138 2062 7974 6573 2e3c 2f70 3e0a ...8 bytes 000000d0: e280 9138 2062 7974 6573 2e3c 2f70 3e0a ...8 bytes
000000e0: 0a0a 818d 8f90 9d ....... | 000000e0: 0a0a efbf bdef bfbd efbf bdef bfbd efbf ..........
> 000000f0: bd .
The final five bytes 81 8d 8f 90 9d are each replaced by efbfbd which is the UTF-8 replacement character �. It's like the file is run through a string parser at some point and turned into valid UTF-8, but only if it's a legitimate HTML file.
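The size difference matches this explanation: each invalid byte (1 byte) becomes EF BF BD (3 bytes), so five replacements add 5 × 2 = 10 bytes, taking 231 to 241. A minimal sketch for counting the replacement sequences in a downloaded file, assuming a `grep` that supports `-a` and `-o` (GNU grep does); both helper names are hypothetical:

```shell
# count_replacements FILE   (hypothetical helper name)
# Counts occurrences of the byte sequence EF BF BD (U+FFFD encoded as UTF-8).
count_replacements() {
  LC_ALL=C grep -a -o "$(printf '\357\277\275')" "$1" | wc -l | tr -d ' '
}

# expected_size ORIGINAL_SIZE REPLACED_BYTES   (hypothetical helper name)
# Each replaced byte grows from 1 byte to 3 bytes, i.e. by 2 bytes.
expected_size() {
  echo $(( $1 + $2 * 2 ))
}

# Usage (file name is a placeholder):
#   count_replacements downloaded.html
#   expected_size 231 5
```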
I just noticed that there is a difference between GET and HEAD requests: with GET, the file is changed and the content length increases accordingly. But HEAD (e.g. curl -I) returns the correct original content length.
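The HEAD/GET comparison can be repeated with curl. This is a minimal sketch: `content_length` is a hypothetical helper that extracts the header value, and `$URL`, `$USER` and `$PASS` in the usage lines are placeholders. A response sent with chunked transfer encoding may carry no Content-Length at all, in which case the helper prints an empty line:

```shell
# content_length: reads HTTP response headers on stdin (hypothetical helper name)
# and prints the value of the last Content-Length header seen.
content_length() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "content-length" { v = $2 } END { print v }'
}

# Usage:
#   curl -sS -I -u "$USER:$PASS" "$URL" | content_length                 # HEAD request
#   curl -sS -D - -o /dev/null -u "$USER:$PASS" "$URL" | content_length  # headers of a GET
#   curl -sS -u "$USER:$PASS" "$URL" | wc -c                             # bytes actually received
```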
I solved it by accident while experimenting with a second OpenCloud instance on my server. Apparently all I needed to do was add --add-host external.domain.com:10.0.0.2 to the podman run command (where external.domain.com is the public domain name and 10.0.0.2 is my server's IP address). The second instance wouldn't let me upload at all without that, but somehow it worked on my original instance except for corrupting particular file downloads 🤷♂️
Edit: Just in case someone else runs into this issue, I can confirm that OpenCloud was not responsible for the altered downloads. I was proxying my instance through Cloudflare, and was still seeing the issue off my home network until just now when I switched off the proxying.
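For anyone checking whether their own downloads pass through Cloudflare, the response headers usually show it: proxied responses carry a `CF-RAY` header and `Server: cloudflare`. A minimal sketch, where `via_cloudflare` is a hypothetical helper name and `$URL` is a placeholder:

```shell
# via_cloudflare: reads HTTP response headers on stdin (hypothetical helper name)
# and succeeds if they contain a CF-RAY header or "Server: cloudflare".
via_cloudflare() {
  tr -d '\r' | grep -i -q -E '^(cf-ray:|server: *cloudflare)'
}

# Usage:
#   curl -sS -I "$URL" | via_cloudflare && echo "response passed through Cloudflare"
```

Running the same check against the internal IP:port should not match, which helps separate what the server sends from what a proxy in front of it delivers.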