
SQLAlchemy result for UTF-8 column is of type 'str', why?

Posted by: admin, November 30, 2017

Questions:

I have a SQL query that I execute like this with an SQLAlchemy engine:

result = engine.execute('SELECT utf_8_field FROM table')

The database is MySQL and the column type is TEXT with UTF-8 encoding. The type of the returned utf_8_field is “str”, even if I set the option convert_unicode=True when creating the engine. What happens now is that if I have a character like ‘é’ in my string (which is not in 7-bit ASCII, but is in the extended ASCII set), I get a UnicodeDecodeError when trying to execute this:

utf_8_field.encode("utf-8")

The exact error is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)

When looking into this, I found that str.encode does not support the extended ASCII character set! I find this really strange, but that’s another question.
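What actually happens here is Python 2’s implicit decode: calling .encode() on a byte string first decodes it with the default 'ascii' codec before re-encoding. A minimal sketch of that hidden step (written in Python 3 syntax, where the decode must be explicit):

```python
# Python 2's str.encode('utf-8') effectively runs str.decode('ascii').encode('utf-8').
# The hidden ASCII decode is what raises; here it is made explicit:
raw = b'\xe9'  # a latin-1 encoded 'é' as raw bytes; not valid ASCII (or UTF-8)
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
```

This is why the traceback mentions the 'ascii' codec even though the code never asked for ASCII.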

What I don’t understand is why SQLAlchemy is not giving me a unicode string. I was previously using DB-API and that was working fine. I also don’t have SQLAlchemy table objects for my tables yet, that’s why I’m using an execute command.

Any idea?

Answers:

If you want the data converted automatically, you should specify the charset when you create the engine:

create_engine('mysql+mysqldb:///mydb?charset=utf8')

Setting use_unicode alone won’t tell SQLAlchemy which character set to use to decode the data.
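A sketch of such an engine URL (user, password, host, and database name are placeholders; only the charset query parameter matters here, and parsing the URL does not require the MySQL driver to be installed):

```python
from sqlalchemy.engine.url import make_url

# Hypothetical DSN; adjust credentials and database name for your setup.
url = make_url('mysql+mysqldb://user:password@localhost/mydb?charset=utf8')
print(url.query['charset'])  # utf8 -- the driver will decode TEXT columns as UTF-8

# engine = create_engine(url)  # would also need the MySQLdb (mysqlclient) driver installed
```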


To convert from a UTF-8 bytestring to a unicode object, you need to decode:

utf_8_field.decode('utf8')
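For illustration, a round trip on a UTF-8 byte string (Python 3 shown, where the bytes/str distinction is explicit; in Python 2 the decoded type would be unicode):

```python
raw = b'caf\xc3\xa9'           # the UTF-8 bytes for 'café'
text = raw.decode('utf8')      # bytes -> unicode string
print(text)                    # café
print(text.encode('utf8') == raw)  # True: encode reverses decode
```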

Also, when executing a raw SELECT through .execute, SQLAlchemy has no metadata from which to work out that your query returns UTF-8 data, so it does not decode the results to unicode for you.

In other words, convert_unicode only works if you use the SQLAlchemy SQL expression API or the ORM functionality.

EDIT: As pointed out, your data is not even UTF-8 encoded; in UTF-8, the byte 0xe9 can only start a three-byte sequence encoding a character between \u9000 and \u9fff (CJK unified ideographs), while you said it was a latin-1 character, whose UTF-8 encoding would start with 0xc3. The data is probably ISO-8859-1 (latin-1) or similar instead:

>>> u'é'.encode('ISO-8859-1')
'\xe9'
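The two encodings of the same character can be compared directly (Python 3 str shown):

```python
ch = '\u00e9'  # 'é'
print(ch.encode('ISO-8859-1'))  # b'\xe9'     -- the single byte seen in the error
print(ch.encode('utf-8'))       # b'\xc3\xa9' -- what UTF-8 would have produced
```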

The conclusion then is to tell SQLAlchemy to connect with a different character set, using the charset=utf8 parameter, as pointed out by @mata.