python-database-sanitizer

Newline characters cause rows in PostgreSQL table to be broken inadvertently.

Open YPCrumble opened this issue 5 years ago • 2 comments

This issue replaces #30.

The issue is that user-entered data containing any of these newline characters:

  • \u2028
  • \u2029
  • \x85

causes the dump parser to treat a single row as several lines. As a result, the dump raises:

ValueError("Mismatch between column names and values.")
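A minimal sketch of how such a split can happen, assuming (this is not confirmed from the sanitizer's source) that the dump text is broken into rows with `str.splitlines()`. Python's `str.splitlines()` treats `\u2028`, `\u2029`, and `\x85` as line boundaries, whereas `str.split("\n")` does not. The helper name `split_rows` and the sample row are hypothetical.

```python
# Characters from this issue that str.splitlines() treats as line boundaries.
SEPARATORS = ("\u2028", "\u2029", "\x85")


def split_rows(text, strict_newline):
    """Split dump text into rows.

    strict_newline=True splits only on "\n"; False uses str.splitlines(),
    which also splits on the Unicode separators above.
    """
    if strict_newline:
        return text.split("\n")
    return text.splitlines()


if __name__ == "__main__":
    for sep in SEPARATORS:
        # Hypothetical tab-separated row whose second column contains `sep`.
        row = "1\tfirst" + sep + "second\tdone"
        loose = split_rows(row, strict_newline=False)
        strict = split_rows(row, strict_newline=True)
        print(repr(sep), len(loose), len(strict))
```

If the row splitting does work this way, each fragment would carry fewer values than there are columns, which is consistent with the mismatch error.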

To work around it, I piped the `pg_dump` output through three additional `sed` processes:

    process = subprocess.Popen(
        (
            "pg_dump",
            # Force output to be UTF-8 encoded.
            "--encoding=utf-8",
            # Quote all table and column names, just in case.
            "--quote-all-identifiers",
            # Luckily `pg_dump` supports DB URLs, so we can just pass it the
            # URL as argument to the command.
            "--dbname",
            url.geturl().replace('postgis://', 'postgresql://'),
        ) + tuple(extra_params),
        stdout=subprocess.PIPE,
    )

    # Replace the Unicode newline characters with spaces.
    process = subprocess.Popen(
        "sed $'s/\u2028/ /g'",
        shell=True,
        stdin=process.stdout,
        stdout=subprocess.PIPE)
    process = subprocess.Popen(
        "sed $'s/\u2029/ /g'",
        shell=True,
        stdin=process.stdout,
        stdout=subprocess.PIPE)
    process = subprocess.Popen(
        "sed $'s/\x85/ /g'",
        shell=True,
        stdin=process.stdout,
        stdout=subprocess.PIPE)

I'd be happy to submit this as a PR if it's helpful. Or is there a better way to handle the issue?
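One possible alternative to the extra `sed` processes, sketched under assumptions: decode the `pg_dump` output in Python, split rows only on `\n`, and replace the three characters with spaces on the decoded text. The function name `iter_clean_lines` is hypothetical and not part of the sanitizer; the stream passed in would be something like `process.stdout` from the first `Popen` call.

```python
import io

# Map each problematic character to a plain space (same effect as the sed steps).
_LINE_BREAK_TABLE = str.maketrans({"\u2028": " ", "\u2029": " ", "\x85": " "})


def iter_clean_lines(binary_stream):
    """Yield decoded lines from a binary stream, splitting only on "\n".

    newline="\n" makes the text wrapper end lines only at "\n", so the
    Unicode separators stay inside the line until they are replaced.
    """
    text = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")
    for line in text:
        yield line.translate(_LINE_BREAK_TABLE)


if __name__ == "__main__":
    # Stand-in for pg_dump output; a real caller would pass process.stdout.
    fake_dump = io.BytesIO("1\ta\u2028b\n2\tc\x85d\n".encode("utf-8"))
    for line in iter_clean_lines(fake_dump):
        print(repr(line))
```

Because the replacement happens after decoding, it operates on whole characters rather than on individual bytes, and it does not depend on which shell `shell=True` invokes.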

YPCrumble Jun 17 '20 02:06

I had a similar issue in MySQL. See if this fix works for you: https://github.com/andersinno/python-database-sanitizer/pull/29

azin634 Dec 10 '21 10:12

@azin634 this seems to help with the first two types of newlines, but not all. I'm now getting this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 1: invalid continuation byte
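One way an error of this shape can arise, offered as a guess rather than a diagnosis: if `\x85` is replaced at the byte level, the raw byte `0x85` also occurs as a continuation byte inside other UTF-8 characters (for example `U+0505` encodes as `0xd4 0x85`), so the replacement can leave a lead byte such as `0xd4` without its continuation. The function names below are hypothetical.

```python
def replace_nel_bytewise(data):
    """Replace every raw 0x85 byte with a space (unsafe for UTF-8 data)."""
    return data.replace(b"\x85", b" ")


def replace_nel_decoded(data):
    """Decode first, then replace only the actual U+0085 character."""
    return data.decode("utf-8").replace("\x85", " ").encode("utf-8")


if __name__ == "__main__":
    # "a" followed by U+0505, whose UTF-8 encoding is the bytes 0xd4 0x85.
    sample = "a\u0505".encode("utf-8")

    try:
        replace_nel_bytewise(sample).decode("utf-8")
    except UnicodeDecodeError as exc:
        print("bytewise replacement broke the data:", exc)

    # The decoded variant leaves U+0505 untouched.
    print(replace_nel_decoded(sample).decode("utf-8") == "a\u0505")
```

Whether this is what happens in the sanitizer depends on how the replacement is applied there, which this thread does not show.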

YPCrumble Dec 05 '22 17:12