census-postgres-scripts icon indicating copy to clipboard operation
census-postgres-scripts copied to clipboard

geoheader for 2009_5yr has an invalid character encoding

Open peterwang998 opened this issue 9 years ago • 9 comments

After unzipping the 2009_5yr manually, it turns out a file in the geoheader has an incorrect encoding:

(I also removed the -q flag to see where it failed for importing geoheader)

psql:create_geoheader.sql:1: NOTICE:  table "geoheader" does not exist, skipping
Creating geoheader comments
Creating temp geoheader
psql:create_tmp_geoheader.sql:1: NOTICE:  table "tmp_geoheader" does not exist, skipping
Creating temp import tables
Importing geoheader
COPY 4946
COPY 7117
COPY 14397
COPY 32732
COPY 2707
COPY 12598
COPY 1899
COPY 15952
COPY 11356
psql:import_geoheader.sql:10: ERROR:  invalid byte sequence for encoding "UTF8": 0xf1 0x6f 0x6e 0x63
CONTEXT:  COPY tmp_geoheader, line 965
Failed importing geoheader.

peterwang998 avatar Nov 09 '16 21:11 peterwang998

I'm pretty sure these files are all latin-1.

iandees avatar Nov 09 '16 21:11 iandees

Good call!

WARNING: CHANGING ENCODING PREVENTS THE LOADING OF METADATA (which is encoded in UTF-8)

Just found this for future reference: https://www.census.gov/geo/maps-data/data/tiger/char_encoding.html

Until 2014, character encoding was in Latin-1 Starting from 2015, character encoding changed to UTF-8

For future reference, if anyone else is having this issue, the safe way to resolve this is to drop the database and then create a new database with the correct encoding (LATIN1)

or use

sudo -u postgres psql -c "update pg_database set encoding = pg_char_to_encoding('LATIN1') where datname = 'census'"

if you're ok with an adhoc solution

peterwang998 avatar Nov 10 '16 19:11 peterwang998

If you're trying to get at the data, you can just download the SQL dumps here: http://censusreporter.tumblr.com/post/73727555158/easier-access-to-acs-data

iandees avatar Nov 10 '16 19:11 iandees

Unfortunately not all the tables are avaliable as a snapshot.

peterwang998 avatar Nov 10 '16 19:11 peterwang998

Now i'm getting a duplicate key value error:

COPY 12012
COPY 47601
COPY 6296
COPY 5721
COPY 9645
COPY 2541
COPY 23230
COPY 1571
COPY 29396
COPY 14580
COPY 34314
COPY 17149
COPY 9524
COPY 19446
COPY 12131
COPY 15782
COPY 10689
COPY 12362
COPY 12445
COPY 37573
COPY 27692
Parsing geoheader
psql:parse_tmp_geoheader.sql:55: ERROR:  duplicate key value violates unique constraint "geoheader_pkey"
DETAIL:  Key (geoid)=(04000US49) already exists.
Failed parsing geoheader.

peterwang998 avatar Nov 10 '16 19:11 peterwang998

What tables aren't available in the snapshot?

iandees avatar Nov 10 '16 19:11 iandees

acs2009_5yr

peterwang998 avatar Nov 10 '16 19:11 peterwang998

I think you're running into the problems I ran into and didn't have time to work through earlier (since Census Reporter wasn't going to use 2009 data).

I would be extremely happy to review a pull request that fixes this, though!

iandees avatar Nov 11 '16 13:11 iandees

I'll try my best, not sure if I can find the duplicate key though....

peterwang998 avatar Nov 11 '16 17:11 peterwang998