Unicode

2013/04/26 Python

License: (CC 3.0) BY-NC-SA

Example

Unicode object

>>> s1 = r'幸福'
>>> type(s1)
<type 'str'>
>>> s1
'\xe5\xb9\xb8\xe7\xa6\x8f'
>>> s2 = u'快乐'
>>> type(s2)
<type 'unicode'>
>>> s2
u'\u5feb\u4e50'

encode

encode means transform an unicode object to a str object

>>> s1.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
>>> s2.encode('utf-8')
'\xe5\xbf\xab\xe4\xb9\x90'

decode

decode means transform a str to an unicode object, please note that i’m using linux system, and the default charset is utf-8, if you’r using windows or other charset, please change utf-8 to your default charset

>>> s1.decode('utf-8')
u'\u5e78\u798f'
>>> s2.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

decode a str in the code of utf-8 to unicode, please note that if you’re using utf-8 directly, it will change the backslash to a unichar instead of treat it as unicode escape, so you must specify it formaly.

>>> s3 = r'\u5e78\u798f'
>>> s3.decode('utf-8')
u'\\u5e78\\u798f'
>>> print s3.decode('utf-8')
\u5e78\u798f
>>> s3.decode('unicode-escape')
u'\u5e78\u798f'
>>> print s3.decode('unicode-escape')
幸福

if your str contains escaped character, you should use raw-unicode-escape instead of raw-unicode-escape

>>> s3 = '\\n\u5e78\u798f'
>>> s3.decode('unicode-escape')
u'\n\u5e78\u798f'
>>> s3.decode('raw-unicode-escape')
u'\\n\u5e78\u798f'
>>> print s3.decode('unicode-escape')

幸福
>>> print s3.decode('raw-unicode-escape')
\n幸福

Search

    Table of Contents