程式扎記: [Python Article Collection] Handling Unicode Strings in Python 2.x

Thursday, October 20, 2016

[Python Article Collection] Handling Unicode Strings in Python 2.x

Source From Here
Preface
Anyone who has written Python has probably run into the error below. It is the classic Python 2.x encoding error message:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0:
ordinal not in range(128)

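Compared with other programming languages, encoding handling in Python 2.x is hard for newcomers to grasp, yet working with CJK text requires Unicode. This article uses simple examples to show how to handle Unicode strings in Python 2.x.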

Handling Unicode Strings in Python 2.x

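Non-ASCII characters in source code
Python 2.x assumes that source files are ASCII by default. If the source (comments included) contains Chinese or any other non-ASCII character, compilation fails with a SyntaxError. Adding the following declaration as the first or second line of the file lets it compile: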

# -*- coding: utf-8 -*-  

"unicode" strings and "str" strings in Python 2.x
In Python 2.x, strings come in two types, unicode and str:

>>> str_name = '劉德華'
>>> print '1', str_name, type(str_name)
1 劉德華 <type 'str'>
>>> uni_name = u'劉德華'  # the u prefix creates a Python unicode object whose content is '劉德華'
>>> print '2', uni_name, type(uni_name)
2 劉德華 <type 'unicode'>

Here uni_name is a Python unicode object, not a str object, so str operations on it raise errors, for example converting it with str() or concatenating it with a str object that holds non-ASCII bytes:

>>> print str(uni_name)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> print uni_name + "中文"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

unicode operations on uni_name work as expected (for example, concatenating it with another unicode object):

>>> print '3', uni_name + u'中文'
3 劉德華中文
# Conversely, str_name is a Python str object, so str operations on it succeed (for example, concatenating it with another str object)
>>> print '4', str_name + '中文'
4 劉德華中文
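
The distinction between the two types can also be checked in code. The following is a minimal sketch, not part of the original article: the helper name describe and the alias text_type are made up for illustration, and the code is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

# type(u'') is `unicode` on Python 2 and `str` on Python 3;
# `bytes` is simply an alias of `str` on Python 2.
text_type = type(u'')


def describe(value):
    """Hypothetical helper: report whether a value is text or raw bytes."""
    if isinstance(value, text_type):
        return 'text'
    if isinstance(value, bytes):
        return 'bytes'
    return 'other'


uni_name = u'劉德華'                  # unicode object: 3 characters
raw_name = uni_name.encode('utf-8')   # byte string: 9 bytes in UTF-8

print('%s %d' % (describe(uni_name), len(uni_name)))   # text 3
print('%s %d' % (describe(raw_name), len(raw_name)))   # bytes 9
```

The different lengths show why the two types cannot be mixed freely: one counts characters, the other counts bytes.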

encode([encoding]) and decode([encoding]) in Python 2.x
Python provides a method, encode([encoding_], [errors='strict']), that converts a unicode object into a str object encoded with encoding_:

# The uni_name variable from before is a unicode object
# .encode('utf-8') converts it into a str object encoded as UTF-8
>>> uni_name = u'劉德華'
>>> new_name = uni_name.encode('utf-8')
>>> print '5', new_name, type(new_name)
5 劉德華 <type 'str'>

# new_name is now a str object, so str operations on it succeed (for example, concatenating it with another str object)
>>> print '6', new_name + '中文'
6 劉德華中文

Likewise, decode([encoding_]) turns a str object back into a unicode object:

>>> original_unicode_form = new_name.decode('utf-8')
>>> print '7', original_unicode_form, type(original_unicode_form)
7 劉德華 <type 'unicode'>

# From here on the variable supports unicode operations (for example, concatenation with another unicode object)
>>> print '8', original_unicode_form + u'略懂略懂'
8 劉德華略懂略懂
# Mixing in a str object that holds non-ASCII bytes fails, because Python 2.x first tries to decode it as ASCII
>>> print '8', original_unicode_form + '略懂略懂'
8
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
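
The errors argument of encode() mentioned above controls what happens when a character cannot be represented in the target encoding. The sketch below is an illustrative addition rather than code from the original article. It assumes only the standard handlers 'strict', 'replace', and 'ignore', and it should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

uni_name = u'劉德華'

# Round trip: text -> UTF-8 bytes -> text
utf8_bytes = uni_name.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')

# errors='strict' is the default: unencodable characters raise an exception
try:
    uni_name.encode('ascii')
    strict_reason = None
except UnicodeEncodeError as exc:
    strict_reason = exc.reason

# errors='replace' substitutes '?' for each unencodable character
replaced = uni_name.encode('ascii', 'replace')

# errors='ignore' silently drops unencodable characters
ignored = uni_name.encode('ascii', 'ignore')

print(round_trip == uni_name)   # True
print(strict_reason)            # ordinal not in range(128)
print(len(replaced))            # 3
print(len(ignored))             # 0
```

Both 'replace' and 'ignore' lose data, so they are best kept for display or logging. Anything that has to be decoded again later should use 'strict' with an encoding that can represent the text, such as UTF-8.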

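Unicode code points in Python 2.x strings
Besides sparing you encoding worries during string operations, Python unicode objects also let you insert characters directly by their Unicode code point, for example: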

# Note 1. In Python, "\uXXXX" inside a unicode literal stands for the Unicode code point U+XXXX
#         For example, u'\u5566' stands for U+5566
# Note 2. http://www.charbase.com/5566-unicode-cjk-unified-ideograph
# Note 3. \u6211 = 我, \u672C = 本, \u4EBA = 人, \u5566 = 啦
>>> print '9', original_unicode_form + u"\u6211\u672C\u4EBA\u5566"
9 劉德華我本人啦
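
Going the other way, from characters back to code points, can be done with the built-in ord(). This is a small sketch added for illustration. The variable names are made up, and it only covers characters in the Basic Multilingual Plane, like the ones above.

```python
# -*- coding: utf-8 -*-

# The escaped form and the literal form are the same unicode object
phrase = u'\u6211\u672C\u4EBA\u5566'
same_phrase = u'我本人啦'

# ord() returns the code point of a single character
code_points = ['U+%04X' % ord(ch) for ch in phrase]

print(phrase == same_phrase)     # True
print(', '.join(code_points))    # U+6211, U+672C, U+4EBA, U+5566
```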

Handling Unicode Strings in Python 2.x - File I/O

1. open(file)
When reading a file, data comes back as str objects by default:

# Python's default way of reading a file returns the data as Python str objects
>>> file_handler = open('test.txt', 'r')
>>> for line in file_handler: print("%s %s" % (line.rstrip(), type(line)))
...
出師表 <type 'str'>
諸葛亮 <type 'str'>
>>> file_handler.close()

2. codecs.open(file, encoding)
The codecs module lets you specify an encoding when reading or writing a file, so data can be read in as unicode objects:

# After importing codecs, use the encoding parameter of codecs.open().
# If it is set correctly, Python converts the data to unicode objects as it reads
>>> import codecs
>>> file_handler = codecs.open('test.txt', 'r', encoding='utf-8')
>>> for line in file_handler: print("%s\t%s" % (line.rstrip(), type(line)))
...
出師表	<type 'unicode'>
諸葛亮	<type 'unicode'>
>>> file_handler.close()
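
The two reading styles can be compared side by side in one script. The sketch below is an illustrative addition. It swaps in io.open, which accepts an encoding argument much like codecs.open and is available on Python 2.6+ and Python 3, and it writes to a temporary file rather than the article's test.txt, so the path is an assumption of the example.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

lines = [u'出師表', u'諸葛亮']

# Scratch file standing in for test.txt (hypothetical path)
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Writing: text is encoded to UTF-8 on the way out
    with io.open(path, 'w', encoding='utf-8') as fh:
        fh.write(u'\n'.join(lines) + u'\n')

    # Reading with an encoding: each line arrives already decoded
    with io.open(path, 'r', encoding='utf-8') as fh:
        decoded = [line.rstrip() for line in fh]

    # Reading in binary mode: raw bytes, no decoding applied
    with io.open(path, 'rb') as fh:
        raw = fh.read()
finally:
    os.remove(path)

print(decoded == lines)          # True
print(isinstance(raw, bytes))    # True
```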

3. json.load(), json.loads()
When json.loads parses JSON data, the strings in the result are unicode objects:

>>> import json
>>> file_handler = open('test_json.txt', 'r')
>>> data = json.loads(file_handler.read())
>>> title = data['title']
>>> author = data['author']
>>> print title, type(title)
出師表 <type 'unicode'>
>>> print author, type(author)
諸葛亮 <type 'unicode'>
>>> file_handler.close()
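
The same behavior can be seen without a file on disk. This is a minimal sketch added for illustration: the in-memory payload stands in for test_json.txt and is an assumption of the example. On Python 3 the decoded values are str, which is that version's text type.

```python
# -*- coding: utf-8 -*-
import json

text_type = type(u'')   # `unicode` on Python 2, `str` on Python 3

# Bytes as they would arrive from a UTF-8 encoded file or network socket
raw_json = u'{"title": "出師表", "author": "諸葛亮"}'.encode('utf-8')

# Decode first, then parse: string values come back as text objects
data = json.loads(raw_json.decode('utf-8'))

# By default json.dumps escapes non-ASCII characters as \uXXXX,
# so the serialized form is plain ASCII
dumped = json.dumps(data, sort_keys=True)

print(isinstance(data['title'], text_type))   # True
print(all(ord(c) < 128 for c in dumped))      # True
```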

Handling Unicode Strings in Python 2.x - Conclusion

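1. Check the string type with type()
When you see garbled text (mojibake), use type() to check whether the variable is a unicode object or a str object, then use encode() or decode() to convert it to the type you need.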
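
This check-then-convert habit can be wrapped in a pair of small helpers. The sketch below is an illustrative addition, not code from the original article. The names to_text and to_bytes are made up, UTF-8 is assumed as the default encoding, and the code should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

text_type = type(u'')   # `unicode` on Python 2, `str` on Python 3


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return value as a text (unicode) object."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Hypothetical helper: return value as a byte string."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


uni_name = u'劉德華'
raw_suffix = u'中文'.encode('utf-8')   # bytes, e.g. read from a file

# Normalizing both sides to text avoids the implicit ASCII decode
combined = to_text(uni_name) + to_text(raw_suffix)

print(combined == u'劉德華中文')          # True
print(len(to_bytes(combined)))           # 15
```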

2. encode() and decode()

Anyway, all you have to remember for your to-and-fro Unicode conversions is:
a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
a Python 2.x string gets decoded to a Unicode string
In both cases, you need to specify the encoding that will be used. – tzot

* A unicode object becomes a str object (i.e. a sequence of bytes) through encode(encoding)
* A str object becomes a unicode object through decode(encoding)
* In Python 2.x, encode() and decode() also work with non-text codecs such as base64:

>>> str_name = '金城武'
>>> print str_name, type(str_name)
金城武 <type 'str'>

>>> base64_name = str_name.encode('base64')
>>> print 'Base64 of', str_name, 'is', base64_name
Base64 of 金城武 is 6YeR5Z+O5q2m

>>> print base64_name.decode('base64')
金城武
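
The 'base64' codec on str objects is specific to Python 2.x. A version-independent alternative, shown here as an illustrative sketch rather than as part of the original article, is the standard base64 module, which works on byte strings in both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import base64

name = u'金城武'

# Base64 operates on bytes, so the text is encoded to UTF-8 first
encoded = base64.b64encode(name.encode('utf-8'))

# Decoding reverses both steps: Base64 -> UTF-8 bytes -> text
decoded = base64.b64decode(encoded).decode('utf-8')

print(encoded.decode('ascii'))   # 6YeR5Z+O5q2m
print(decoded == name)           # True
```

Unlike the Python 2.x 'base64' codec, b64encode does not append a trailing newline to its result.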

3. Input and output (I/O)
As Unicode In Python, Completely Demystified recommends, remember three principles:

* Decode early
* Unicode everywhere
* Encode late

and use codecs.open(file, encoding) for file access.
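
The three principles can be sketched as a single function with bytes at both edges and unicode in between. This is an illustrative addition: the function name greet, its parameters, and the greeting text are all made up for the example, and Big5 is used only as a sample legacy input encoding.

```python
# -*- coding: utf-8 -*-


def greet(raw_name, input_encoding='utf-8', output_encoding='utf-8'):
    """Hypothetical example: bytes in, bytes out, unicode in between."""
    # Decode early: convert incoming bytes to unicode at the boundary
    name = raw_name.decode(input_encoding)

    # Unicode everywhere: all string manipulation happens on text
    message = u'你好，' + name.strip() + u'！'

    # Encode late: convert back to bytes only when leaving the program
    return message.encode(output_encoding)


# Input arriving as Big5 bytes, output produced as UTF-8 bytes
big5_input = u' 諸葛亮 '.encode('big5')
utf8_output = greet(big5_input, input_encoding='big5')

print(utf8_output.decode('utf-8') == u'你好，諸葛亮！')   # True
```

Because the encodings appear only at the two boundaries, changing the input or output encoding does not touch the logic in the middle.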

Supplement
* Python Doc - Unicode HOWTO
