python-database-sanitizer Newline characters cause rows in PostgreSQL table to be broken inadvertently.

This issue replaces #30.

The problem is that user-entered data containing any of these newline characters:

\u2028
\u2029
\x85

causes the dump to treat a single row as if it were split across several lines. As a result, the dump raises:

ValueError("Mismatch between column names and values.")
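A likely mechanism, assuming the dump output is split with Python's str.splitlines() or something built on it (that is an assumption, not confirmed in this thread): Python treats \x85, \u2028 and \u2029 as line boundaries, so one row breaks into several fragments. A minimal sketch with a made-up row:

```python
# Hypothetical dump row: an id column and a text column, tab-separated.
# The text column contains all three problematic characters.
row = "1\tfirst\u2028second\x85third\u2029fourth"

# str.splitlines() treats \u2028, \x85 and \u2029 as line boundaries.
fragments = row.splitlines()

# Splitting on "\n" only keeps the row in one piece.
intact = row.split("\n")

print(len(fragments), len(intact))
```

Each fragment after the first no longer has the expected number of columns, which would be consistent with the mismatch error above.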

To solve it, I piped the pg_dump output through additional sed processes in the Python code:

    process = subprocess.Popen(
        (
            "pg_dump",
            # Force output to be UTF-8 encoded.
            "--encoding=utf-8",
            # Quote all table and column names, just in case.
            "--quote-all-identifiers",
            # Luckily `pg_dump` supports DB URLs, so we can just pass it the
            # URL as argument to the command.
            "--dbname",
            url.geturl().replace('postgis://', 'postgresql://'),
        ) + tuple(extra_params),
        stdout=subprocess.PIPE,
    )

    # Replace the Unicode newline characters with spaces.
    process = subprocess.Popen(
        "sed $'s/\u2028/ /g'",
        shell=True,
        stdin=process.stdout,
        stdout=subprocess.PIPE)
    process = subprocess.Popen(
        "sed $'s/\u2029/ /g'",
        shell=True,
        stdin=process.stdout,
        stdout=subprocess.PIPE)
    process = subprocess.Popen(
        "sed $'s/\x85/ /g'",
        shell=True,
        stdin=process.stdout,
        stdout=subprocess.PIPE)

I'd be happy to submit this as a PR if it's helpful. Or is there a better way to handle the issue?
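One possible alternative, sketched here under assumptions rather than taken from the project's code: wrap the pg_dump output in io.TextIOWrapper with newline="\n", which splits lines on "\n" only and leaves \u2028, \u2029 and \x85 inside the row. The helper name iter_dump_lines is hypothetical, and a BytesIO object stands in for process.stdout:

```python
import io


def iter_dump_lines(binary_stream):
    """Yield decoded lines from a binary stream, splitting on "\\n" only.

    Hypothetical helper; `binary_stream` would be `process.stdout`.
    """
    text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")
    for line in text_stream:
        # Drop only the trailing "\n"; other characters stay in the row.
        yield line.rstrip("\n")


# Stand-in for pg_dump output: two rows that contain the problem characters.
fake_dump = io.BytesIO("1\tfoo\u2028bar\n2\tbaz\x85qux\n".encode("utf-8"))
lines = list(iter_dump_lines(fake_dump))
print(len(lines))
```

This keeps the data unchanged instead of replacing characters with spaces, and it avoids the extra shell processes. Whether it fits depends on how the sanitizer currently reads the stream.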

Jun 17 '20 02:06 YPCrumble

I had a similar issue with MySQL. See if this fix works for you: https://github.com/andersinno/python-database-sanitizer/pull/29

Dec 10 '21 10:12 azin634

@azin634 this seems to help with the first two types of newlines, but not all of them. I'm now getting this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 1: invalid continuation byte
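One way an error of this shape can arise (a guess, not a confirmed diagnosis): if the byte 0x85 is replaced at the byte level, it can be stripped out of a multi-byte UTF-8 sequence that merely contains it. The sketch below uses U+0505, whose UTF-8 encoding is the bytes 0xd4 0x85, as a made-up example:

```python
# "a" followed by U+0505, which encodes to b"\xd4\x85" in UTF-8.
original = "a\u0505".encode("utf-8")

# Byte-level replacement of 0x85 breaks the two-byte sequence.
corrupted = original.replace(b"\x85", b" ")

try:
    corrupted.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Replacing the encoded form of U+0085 (b"\xc2\x85") leaves other characters alone.
safe = original.replace("\x85".encode("utf-8"), b" ")
```

If that is what happens here, matching the full encoded sequence of each character, or replacing after decoding, would avoid the corruption.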

Dec 05 '22 17:12 YPCrumble