0

I am parsing out a report form amazon, splits up lines into fields and then creates a mysql upload. Data I believe is originally iso-8859-1. Data uploads fine to mysql unless it has some special characters in it like an Ä or ®. If that happens I get an error like pymysql.err.InternalError: (1366, "Incorrect string value: '\\xAE Kids...' for column 'item-name' at row 74") & TypeError: can only concatenate str (not "bytearray") to str. I can hack around it by doing a replacing the bytes but I dont want to build a giant list plus I really want to store the proper values. I tried changing my mysql character sets and collations but that didnt seem to fix. I feel like the fix is a simple fix but i am already been trying things for a few hours.

report_as_dict = report.parsed
report_as_dict = report_as_dict.replace(b' \r\n', b'\r\n')  # remove black space at end

 multi_line_rebuild=list()
    for line in line_split[1:]:
        field_split = line.split(b'\t')
        logger.debug('Field Split : %s', field_split)
        field_split = [x.replace(b'\x92', b'') for x in field_split]  # removes single quotes
        field_split = [x.replace(b'\xA0', b'') for x in field_split]  # removes (
        field_split = [x.replace(b'\xAE', b'') for x in field_split]  # removes @
        field_split = [x.replace(b'\xCD', b'l') for x in field_split]  # replaces l with ' with l
        field_split = [x.replace(b'\xE4', b'a') for x in field_split]  # replaces a with two dots with a

        multi_line_rebuild.append(field_split)

 ....


 run_query_with_warnings(query_string, field_split=multi_line_rebuild)

Function

def run_query_with_warnings(warn_type, query_string, **kargs):

db = MySQLdb.connect(host=cred.host, user=cred.user, password=cred.password, db=cred.db, port=cred.port)
cursor = db.cursor()
cursor.executemany(query_string, kargs['field_split'])
1
  • Each table is utf8 / utf8_general_ci Commented Mar 1, 2020 at 12:26

2 Answers 2

1

The code is trying to write text encoded as ISO-8859-1 to tables set to expect UTF-8.

There are two solutions:

  • Set the charset argument on the connection to latin1 (this is the same as ISO-8859-1) and let the connection handle re-encoding the bytes to UTF-8

    db = MySQLdb.connect(host=cred.host, user=cred.user, password=cred.password, 
                         db=cred.db, port=cred.port, charset='latin1')
    
  • decode the encoded bytes to str and let the connection perform the encoding.

    report_as_dict = report.parsed.decode('ISO-8859-1')
    

If the code is doing nothing apart from writing the bytes directly to the database then the first option is fine; if the bytes are undergoing further manipulations then decoding to str will keep things simple.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks... if all my data is ISO-8859-1 would it make sense to change the data type of my tables?
@personalt UTF-8 supports all unicode codepoints, whereas ISO-8859-1 only support 256 of them, so in general I'd be inclined to stick with UTF-8. On the other hand, if you know your data's encoding will never change, then changing the table's encoding might be worthwhile.
0

The client is working with latin1 encoding (92, etc). The table would like to have the utf8 encoding (E28099) for that "right single quotation mark". You can achieve that by telling MySQL that the client is using latin1 in the connection parameters, and having the column be utf8 (or utf8mb4).

The former is something like

db = MySQLdb.connect(host=DB_HOST, user=DB_USER, passwd=DB_PASS, db=DB_NAME,
              charset="utf8", use_unicode=True)

Also check whether or not you should change the beginning of your source to

# -*- coding: utf-8 -*-

But... I am worried. Are you really using right quote, registered sign (AE), I-acute, and a-double-dot? Or is this merely the beginning of some other mess? Sometimes multiple bytes in a row are 'bad'. To further analyze your situation, please get the hex for more than just one byte, and/or provide what characters you think the text should include.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.