
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c

Posted by: admin, November 1, 2017

Questions:

I have a socket server that is supposed to receive valid UTF-8 characters from clients.

The problem is that some clients (mainly hackers) send all the wrong kinds of data over it.

I can easily distinguish the genuine clients, but I log all the data sent to files so I can analyze it later.

Sometimes I get characters like œ that cause a UnicodeDecodeError.

I need to be able to make the string UTF-8 with or without those characters.


Update:

For my particular case, the socket service was an MTA, so I only expect to receive ASCII commands such as:

EHLO example.com
MAIL FROM: <[email protected]>
...

I was logging all of this in JSON.

Then some folks without good intentions decided to sell all kinds of junk.

That is why, for my specific case, it is perfectly OK to strip the non-ASCII characters.
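
A minimal Python 3 sketch of that approach, assuming the raw data arrives from the socket as bytes. The helper name to_log_record and the "command" key are hypothetical, chosen only for illustration:

```python
import json


def to_log_record(raw: bytes) -> str:
    """Drop non-ASCII bytes and return a JSON-encoded log line."""
    # errors="ignore" silently discards every byte outside the ASCII range.
    text = raw.decode("ascii", errors="ignore")
    return json.dumps({"command": text})


# The stray 0x9c byte is not ASCII, so it is dropped before logging.
print(to_log_record(b"EHLO example.com\x9c"))
# {"command": "EHLO example.com"}
```

Because the result is plain ASCII, it is also valid UTF-8, so the JSON log can be written without further encoding concerns.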

Answers:

http://docs.python.org/howto/unicode.html#the-unicode-type

# Python 2; "data" is used instead of "str" to avoid shadowing the built-in type
data = unicode(data, errors='replace')

or

data = unicode(data, errors='ignore')

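Note:
With errors='ignore', the undecodable bytes are stripped out and the string is returned without them. With errors='replace', each one is replaced by the Unicode replacement character (U+FFFD).
Only use this if you need to strip or mask those bytes, not convert them.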
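
The unicode() built-in no longer exists in Python 3. A rough Python 3 equivalent, assuming the data is a bytes object that is meant to be UTF-8, would use bytes.decode with the same error handlers (the sample value is made up for illustration):

```python
raw = b"caf\x9c"  # 0x9c on its own is not a valid UTF-8 sequence

# "replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# "ignore" drops the undecodable byte entirely.
ignored = raw.decode("utf-8", errors="ignore")

print(ascii(replaced))  # 'caf\ufffd'
print(ascii(ignored))   # 'caf'
```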

Alternatively, use the open function from the codecs module to read in the file:

import codecs
with codecs.open(file_name, "r", encoding='utf-8', errors='ignore') as fdata:
    data = fdata.read()  # undecodable bytes are dropped
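
In Python 3 the built-in open() accepts the same encoding and errors arguments, so codecs.open is not strictly needed. A small self-contained sketch, which creates its own temporary file purely for demonstration:

```python
import os
import tempfile

# Write a small file containing one byte (0x9c) that is invalid UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"EHLO example.com\x9c\n")
    file_name = tmp.name

try:
    # errors="ignore" drops bytes that cannot be decoded as UTF-8.
    with open(file_name, "r", encoding="utf-8", errors="ignore") as fdata:
        contents = fdata.read()
finally:
    os.remove(file_name)

print(repr(contents))  # 'EHLO example.com\n'
```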

Answers:
>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ
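
The session above is Python 2. A hedged Python 3 sketch of the same idea tries UTF-8 first and falls back to Windows-1252 only when that fails. It assumes the non-UTF-8 clients really are sending cp1252, which cannot be known from the bytes alone, and the function name decode_with_fallback is hypothetical:

```python
def decode_with_fallback(raw: bytes) -> str:
    """Decode as UTF-8, falling back to cp1252 for invalid input."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A few bytes are undefined in cp1252, so "replace" keeps
        # this branch from raising as well.
        return raw.decode("cp1252", errors="replace")


print(ascii(decode_with_fallback(b"\x9c")))  # '\u0153'
```

Here U+0153 is the character œ from the question, and valid UTF-8 input passes through unchanged.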

Answers:

This type of issue crops up for me now that I’ve moved to Python 3. I had no idea Python 2 was simply steamrolling any issues with file encoding.

I found this nice explanation of the differences and how to find a solution after none of the above worked for me.

http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

In short, to make Python 3 behave as similarly as possible to Python 2 use:

with open(filename, encoding="latin-1") as datafile:
    # work on datafile here

However, read the article; there is no one-size-fits-all solution.
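
To illustrate why latin-1 never raises, and what it costs: every byte value maps to the code point with the same number, so decoding always succeeds and round-trips exactly, but the result is not necessarily the character the sender intended. A short sketch:

```python
# Every possible byte value decodes under latin-1 without error.
raw = bytes(range(256))
text = raw.decode("latin-1")

# Encoding back yields exactly the original bytes.
round_tripped = text.encode("latin-1")

# Byte 0x9c becomes the control character U+009C, not U+0153 (œ),
# which is what cp1252 would give for the same byte.
as_latin1 = b"\x9c".decode("latin-1")
as_cp1252 = b"\x9c".decode("cp1252")

print(round_tripped == raw)                 # True
print(ascii(as_latin1), ascii(as_cp1252))   # '\x9c' '\u0153'
```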

Answers:

I had the same problem with UnicodeDecodeError and I solved it with this line.
I don’t know if it is the best way, but it worked for me.

data = data.decode('unicode_escape').encode('utf-8')

Answers:

Just in case someone has the same problem: I’m using Vim with YouCompleteMe, and ycmd failed to start with this error message. Running export LC_CTYPE="en_US.UTF-8" made the problem go away.