I have a SQL query that I execute like this with an SQLAlchemy engine:
result = engine.execute('SELECT utf_8_field FROM table')
The database is MySQL and the column type is TEXT with UTF-8 encoding. The type of the returned utf_8_field is "str", even if I set the option convert_unicode=True when creating the engine. The problem is that if the string contains a character like 'é' (which is not in 7-bit ASCII, but is in the extended ASCII range), I get a UnicodeDecodeError when I try to call .encode('utf-8') on the returned value:
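Roughly, the setup looks like this (the connection URL here is just a placeholder; the real credentials differ):

from sqlalchemy import create_engine

engine = create_engine('mysql://user:password@localhost/mydb', convert_unicode=True)
result = engine.execute('SELECT utf_8_field FROM table')
utf_8_field = result.fetchone()[0]
utf_8_field.encode('utf-8')  # this is the call that fails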
The exact error is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)
When looking into this, I found that str.encode does not support the extended ASCII character set! I find this really strange, but that's another question.
What I don't understand is why SQLAlchemy is not giving me a unicode string. I was previously using the DB-API directly and that was working fine. I also don't have SQLAlchemy Table objects for my tables yet, which is why I'm using a raw execute call.
If you want the data converted automatically, you should specify the charset when you create the engine:
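For example, with the default MySQLdb driver (the credentials and database name here are placeholders):

from sqlalchemy import create_engine

engine = create_engine('mysql://user:password@localhost/mydb?charset=utf8&use_unicode=1')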
use_unicode alone won’t tell sqlalchemy which charset to use.
To convert a UTF-8 bytestring to a unicode object, you need to decode it:
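For example, assuming the bytestring really is UTF-8 encoded:

>>> 'caf\xc3\xa9'.decode('utf-8')
u'caf\xe9'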
Also, when you run a raw SQL string through .execute, SQLAlchemy has no metadata to work out that your query is returning UTF-8 data, so it does not convert the result to unicode for you. In other words, convert_unicode only works if you use the SQLAlchemy SQL expression API or the ORM functionality.
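A minimal sketch of the expression-API approach (the Table and Column definitions here are assumptions about your schema):

from sqlalchemy import MetaData, Table, Column, Unicode, create_engine, select

engine = create_engine('mysql://user:password@localhost/mydb?charset=utf8')
metadata = MetaData()
# declaring the column as Unicode tells SQLAlchemy how to treat the values
my_table = Table('table', metadata, Column('utf_8_field', Unicode(255)))

for row in engine.execute(select([my_table.c.utf_8_field])):
    print type(row['utf_8_field'])  # unicode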
EDIT: As pointed out, your data is not even UTF-8 encoded; 0xe9 as the first byte of a UTF-8 sequence would indicate a character between \u9000 and \u9fff, which are CJK unified ideographs, while you said it was a latin-1 character, whose UTF-8 encoding would start with 0xc3. The data is probably ISO-8859-1 (latin-1) or similar instead:
>>> u'é'.encode('ISO-8859-1')
'\xe9'
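Compare that with the UTF-8 encoding of the same character, which starts with 0xc3:

>>> u'é'.encode('utf-8')
'\xc3\xa9'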
The conclusion then is to tell SQLAlchemy to connect with a different character set, using the
charset=utf8 parameter, as pointed out by @mata.