Source From Here
Agenda
To follow along easily, it helps to understand the concepts of Unicode, encoding and decoding in general. Please refer to our last blog post for the basics of Unicode and encoding. This post assumes you use Python 2.7; it will not be useful if you are using Python 3.
Basics
Make sure your terminal encoding is set to utf-8.
As discussed in the last post, Unicode is just a standard which assigns a codepoint to each character. You cannot store the codepoint of a character on disk directly. The codepoint must be encoded using some encoding scheme before it can be stored in a file.
Codepoints are integers. For example, the codepoint of the character 'a' is U+0061, which is the integer 97. This codepoint has a different binary representation in different encoding schemes. Put another way, this codepoint has a different byte sequence in different encoding schemes, and that byte sequence is what gets written to disk when we write 'a' to a file.
Codepoint of 'ä' is U+00E4, which is integer 228. This codepoint has a different binary representation, or byte sequence, in different encoding schemes.
Usually the binary representation will not be shown to you directly. It is displayed as hexadecimal in the output. For example, in 'utf-8' encoding, 'ä' is represented by '11000011 10100100', but most of the time you will see its hexadecimal equivalent 'c3a4', written as '\xc3\xa4'.
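The points above can be checked with a small sketch. The variable names are illustrative only, and the literals use explicit u'' and b'' prefixes so the same lines run on Python 2.7 and, as an assumption of convenience, on Python 3 as well:

```python
# Codepoints are integers; each encoding scheme maps them to different bytes.
a = u'a'
a_umlaut = u'\u00e4'                       # the character 'ä'

codepoint_a = ord(a)                       # U+0061 as an integer
codepoint_a_umlaut = ord(a_umlaut)         # U+00E4 as an integer

utf8_bytes = a_umlaut.encode('utf-8')      # two bytes in utf-8
latin1_bytes = a_umlaut.encode('latin-1')  # a single byte in latin-1

# Binary view of the utf-8 bytes, one group of 8 bits per byte.
bits = ' '.join(format(b, '08b') for b in bytearray(utf8_bytes))
print(bits)  # 11000011 10100100
```

The same codepoint gives '\xc3\xa4' in utf-8 but '\xe4' in latin-1, which is why the encoding scheme always has to be named.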
Python 2 has two different string datatypes. One is 'unicode' and the other is 'str'. Type 'unicode' is meant for working with the codepoints of characters. Type 'str' is meant for working with the encoded binary representation of characters.
A 'unicode' object needs to be converted to a 'str' object before Python can write the character to a file. The same conversion is needed for the character to be printed.
Python 'unicode' and 'str' type
We will use a character which has a different binary representation in different encoding schemes. ä is one such character. This character is called 'LATIN SMALL LETTER A WITH DIAERESIS'. The codepoint for this character is U+00E4. You can check it at http://www.utf8-chartable.de/
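As an alternative to the chart, the name and codepoint can be looked up with the standard library's unicodedata module. This is only a quick check, and the variable names are made up for illustration:

```python
import unicodedata

latin_a = u'\u00e4'

# Official Unicode name of the character.
name = unicodedata.name(latin_a)

# Codepoint formatted in the usual U+XXXX notation.
codepoint = 'U+%04X' % ord(latin_a)

print(name)       # LATIN SMALL LETTER A WITH DIAERESIS
print(codepoint)  # U+00E4
```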
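The way to define a 'unicode' object holding this codepoint is with a literal such as uni_latin_a = u'\u00e4'. Calling type(uni_latin_a) then reports 'unicode'.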
A 'unicode' literal starts with 'u' followed by a quote, and the codepoint has to be preceded by '\u'. Let's define a 'str':
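A minimal sketch of both definitions follows. The name str_latin_a is an assumption made for illustration. In Python 2.7, bytes is an alias of 'str', so the b'' prefix used here produces a 'str'; with a utf-8 terminal, typing 'ä' between plain quotes would give the same two bytes.

```python
# 'unicode' in Python 2.7: holds one codepoint, U+00E4.
uni_latin_a = u'\u00e4'

# 'str' in Python 2.7: holds the utf-8 bytes of the same character.
str_latin_a = b'\xc3\xa4'

# One character versus two bytes.
uni_length = len(uni_latin_a)
str_length = len(str_latin_a)
print(uni_length)  # 1
print(str_length)  # 2
```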
UnicodeEncodeError
Let's try to convert 'unicode' to 'str':
When 'encode()' is called without an argument, Python 2 uses its default encoding, which is ascii. So 'encode()' is equivalent to 'encode('ascii')'. ascii can only encode characters whose codepoint is less than 128. uni_latin_a represents a character whose codepoint is 228, which is outside that range, and so we get a UnicodeEncodeError.
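The failure can be reproduced with the sketch below. It passes 'ascii' explicitly, so it does not depend on the interpreter's default encoding; the variable names other than uni_latin_a are illustrative.

```python
uni_latin_a = u'\u00e4'

# A codepoint below 128 encodes to ascii without trouble.
ascii_ok = u'a'.encode('ascii')

# Codepoint 228 is outside the ascii range, so encoding raises.
try:
    uni_latin_a.encode('ascii')
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    print(exc)
```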
utf-8 encoding scheme can encode codepoints greater than 128. Let's use 'utf-8' to encode uni_latin_a:
So, utf-8 representation of codepoint 'U+00E4' is '\xc3\xa4'. You can also verify it at the table provided at http://www.utf8-chartable.de/.
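A short sketch of this step, with the hexadecimal form checked through the standard library's binascii module (the names encoded and hex_form are illustrative):

```python
import binascii

uni_latin_a = u'\u00e4'

# Encode the codepoint with utf-8; the result is a byte string.
encoded = uni_latin_a.encode('utf-8')

# Hexadecimal view of the two bytes.
hex_form = binascii.hexlify(encoded)
print(hex_form)
```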
A 'unicode' cannot be written to a file directly.
A 'unicode' object must be encoded to get its binary representation, and then the encoded binary representation gets written to the file.
When a 'unicode' is passed to write(), Python tries to do implicit encoding. Python can only write 'str' to a file. Since we are passing a 'unicode' to write, Python tries to convert the 'unicode' into 'str'. Internally Python effectively runs f.write(uni_latin_a.encode('ascii')), which causes a UnicodeEncodeError.
Encode uni_latin_a using utf-8 so it can be written:
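A minimal sketch of the working version is below. It writes into a temporary directory rather than the current one, which is an assumption made to keep the example side-effect free, and it opens the file in binary mode so the same lines behave identically on Python 2.7 and 3. The implicit ascii step that Python 2.7 attempts is reproduced explicitly.

```python
import io
import os
import tempfile

uni_latin_a = u'\u00e4'
path = os.path.join(tempfile.mkdtemp(), 'uni_latin_a.txt')

# What Python 2.7 attempts implicitly for f.write(uni_latin_a): it fails.
try:
    uni_latin_a.encode('ascii')
    implicit_ok = True
except UnicodeEncodeError:
    implicit_ok = False

# Encoding explicitly with utf-8 gives bytes that can be written.
with io.open(path, 'wb') as f:
    f.write(uni_latin_a.encode('utf-8'))

size = os.path.getsize(path)
print(size)  # 2
```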
UnicodeDecodeError
UnicodeDecodeError will usually happen when you try to process something read from a file.
We just wrote a utf-8 encoded character to 'uni_latin_a.txt'. Let's read this file.
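Reading the file back can be sketched as follows. To stay self-contained, the sketch first recreates the file in a temporary directory (an assumption, not the original location), then reads it in binary mode:

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'uni_latin_a.txt')

# Recreate the file with the utf-8 bytes of 'ä'.
with io.open(path, 'wb') as f:
    f.write(u'\u00e4'.encode('utf-8'))

# Reading gives back the encoded bytes, not a 'unicode' object.
with io.open(path, 'rb') as f:
    read_str_latin_a = f.read()

print(repr(read_str_latin_a))
```

What comes back is the encoded byte sequence '\xc3\xa4', a 'str' in Python 2.7 terms.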
Often it makes sense to work with Unicode internally, and in that case we need to convert the value we read into 'unicode'. Decoding is the process of converting an encoded representation back into Unicode codepoints.
'.decode()' is meant to convert from 'str' to 'unicode'. Always use '.decode()' on a 'str'. Never use it on 'unicode' object. Let's try converting read_str_latin_a to a 'unicode' object:
When '.decode()' is called without an argument, Python 2 assumes by default that the string was encoded using 'ascii'. So it tries to find the Unicode codepoint which corresponds to this encoded representation. In ascii, no codepoint corresponds to the bytes '\xc3\xa4', and so an error is raised.
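The error can be reproduced with this sketch, which names 'ascii' explicitly instead of relying on the default. The variable decode_error is illustrative:

```python
read_str_latin_a = b'\xc3\xa4'

# Bytes of 0x80 and above have no meaning in ascii, so decoding raises.
try:
    read_str_latin_a.decode('ascii')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc)

# The exception records where decoding failed: the very first byte.
failed_at = decode_error.start
```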
We already know that encoding was done using 'utf-8' when writing to the file. So use 'utf-8' with decode().
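A short sketch of the correct decode; read_uni_latin_a is a name chosen here for illustration:

```python
read_str_latin_a = b'\xc3\xa4'

# Decoding with the encoding used for writing recovers the codepoint.
read_uni_latin_a = read_str_latin_a.decode('utf-8')

print(len(read_uni_latin_a))  # 1
```

Two bytes go in and a single character, U+00E4, comes out.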
Suppose we did not know that the file content was encoded with 'utf-8'. In that case, we could have tried decoding it with latin-5 or any other encoding scheme. Suppose we try latin-5:
In ISO 8859-9 (latin-5), every byte maps to a single codepoint: U+00C3 is encoded as '\xc3' and U+00A4 is encoded as '\xa4'. So when '\xc3\xa4' is decoded, it gives back the two codepoints U+00C3 and U+00A4. Codepoint U+00C3 means 'Ã' and codepoint U+00A4 means '¤', and that's what we see in the output.
That's why it's important to know the encoding of a file; otherwise we will read it wrong.
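The wrong guess can be sketched as below. Python's codec name for latin-5 is 'iso8859_9', and the variable names are illustrative. No error is raised; the result is simply the wrong text.

```python
import unicodedata

read_str_latin_a = b'\xc3\xa4'

# Decoding utf-8 bytes as latin-5 succeeds but yields two characters.
wrong = read_str_latin_a.decode('iso8859_9')

# Name each decoded character to see what we actually got.
names = [unicodedata.name(ch) for ch in wrong]
print(names)
```

Because the decode succeeds silently, this kind of mistake tends to surface only later, as garbled text.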
Takeaway
Supplement
* Unicode HOWTO