程式扎記: [ Ruby Gossip ] Basic: Built-in Types and Operations

Source From Here
Preface
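In Ruby 1.8, every string in a program is really just a collection of raw bytes. If the source code contains characters outside the Western European range, the only supported encodings are Shift-JIS (start ruby with -Ks), EUC-JP (start ruby with -Ke), and UTF-8 (start ruby with -Ku). To write Chinese in source code, one option is to save the source file as Big5 and start ruby with -Ks or -Ke, borrowing Shift-JIS or EUC-JP because they resemble Big5. The other option is to save the source file as UTF-8 and start ruby with -Ku. Only then do the strings contain the correct bytes.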

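Because every string in Ruby 1.8 is a collection of raw bytes, there are many situations where you have to handle encoding conversion yourself. For example, take a UTF-8 source file in Ruby 1.8, run with -Ku: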
- main.rb

puts "良".length  

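The program prints 3 rather than 1, because in Ruby 1.8 the string's length and size methods both return the length in bytes, not in characters. There are several ways to get the character count, such as combining a regular expression with the string API. Plenty of articles online discuss the encoding problems of Ruby 1.8, and there is a fair amount of code that supports encoding conversion.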
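
To make the byte-versus-character distinction concrete, here is a minimal sketch that runs on a current Ruby rather than on 1.8. It assumes String#b (Ruby 2.0 and later) to get a raw-byte copy that behaves like a Ruby 1.8 string, and it uses unpack("U*") as one possible way to count UTF-8 characters from raw bytes. The variable names are only illustrative.

```ruby
# "\u826F" is the character used in main.rb, written as an escape so the
# snippet does not depend on the encoding of the source file.
raw = "\u826F".b                      # binary copy: raw bytes, like a Ruby 1.8 string

byte_count = raw.length               # counts bytes, as length did in Ruby 1.8
char_count = raw.unpack("U*").length  # decode the bytes as UTF-8 code points, then count

puts byte_count
puts char_count
```

The first number is the byte length that Ruby 1.8 reported; the second is the character count you would normally expect.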

Everything that follows concerns how Ruby 1.9.2 handles Chinese text encoding.

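About Encoding
To write Chinese in Ruby source code, put a comment at the top of the file that tells the interpreter which encoding the file uses. For example, on a system whose default encoding is Big5, a file edited and saved as Big5 can be written as follows: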
- main2.rb

# encoding: Big5  
puts "哈囉! 良葛格!"  

The comment at the top of the file can also be written as:

# coding: Big5  

Or as:

# -*- coding: Big5 -*-  

Or as:

#!/usr/bin/env ruby -w  
# encoding: Big5  

If the console displays text as Big5, the result is:

# ruby main2.rb
哈囉! 良葛格!
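
One way to confirm which encoding the interpreter picked up from the comment is the `__ENCODING__` keyword, which evaluates to the encoding of the current source file. A small sketch, assuming the snippet is saved as its own file so that the comment sits on the first line:

```ruby
# encoding: UTF-8
source_encoding  = __ENCODING__       # encoding the interpreter uses for this file
literal_encoding = "hello".encoding   # a plain literal takes the source encoding

puts source_encoding
puts literal_encoding
```

Changing the comment to another encoding name and rerunning the file should change both values together.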

In Ruby 1.9, the string's size and length methods return the number of characters, not the number of bytes.
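
A quick check of this behavior, as a sketch that should run on Ruby 1.9 or later. The text is written with \u escapes (an illustrative choice) so that it is UTF-8 no matter how the file is saved:

```ruby
text = "\u54C8\u56C9"            # the first two characters of the example greeting, as UTF-8

char_count = text.length         # counted in characters
size_count = text.size           # size is an alias of length
byte_count = text.bytesize       # counted in bytes

puts char_count
puts size_count
puts byte_count
```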

Now suppose the system default encoding is Big5 but the file is edited and saved as UTF-8:
- main3.rb

# encoding: UTF-8  
puts "哈囉! 良葛格!"  

If the console displays text as Big5, the result is:

# ruby main3.rb
?历?! ?航???

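Garbled output? The interpreter reads the string literals as UTF-8 (without an # encoding comment, the default is US-ASCII), so the strings are UTF-8 by default. STDOUT by default just writes the string's bytes out faithfully, so if your console does not display UTF-8, the UTF-8 bytes show up as garbage. If you want the strings to stay UTF-8 while the output uses Big5 as its external encoding, one approach is to call the string's encode method to convert the text to the console's encoding before passing it to puts or a similar method. For example, for the console on Chinese Windows you can specify Big5: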
- main4.rb

# encoding: UTF-8  
puts "哈囉! 良葛格!".encode("Big5")  

Now the Chinese text displays correctly, but every string has to go through encode. Another approach is to use STDOUT's set_encoding method. For example:
- main5.rb

# encoding: UTF-8  
STDOUT.set_encoding("Big5")  
puts "哈囉! 良葛格!"  

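This sets STDOUT's external encoding to Big5, so strings are automatically converted to that external encoding on output and the Chinese text displays correctly. Yet another approach is to pass -EBig5:UTF-8 when running ruby. -E stands for Encoding; the part before the colon is the external encoding and the part after it is the internal encoding. For example: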

# ruby -EBig5:UTF-8 main3.rb
哈囉! 良葛格!
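
The effect of an external encoding on output can be observed without a Big5 console. The sketch below is an illustration rather than one of the original examples: it uses the write end of an IO.pipe as a stand-in for STDOUT (an assumption made so the bytes can be read back) and checks what actually gets written.

```ruby
text = "\u54C8\u56C9"                 # UTF-8 string

reader, writer = IO.pipe
reader.binmode                        # read back raw bytes, no decoding
writer.set_encoding("Big5")           # same call as STDOUT.set_encoding("Big5")

writer.print text                     # transcoded to the external encoding on write
writer.close

written = reader.read
reader.close

puts text.bytesize                    # size of the UTF-8 string
puts written.bytesize                 # size of what reached the pipe
puts written.bytes == text.encode("Big5").bytes
```

The bytes that reach the pipe match what encode("Big5") produces, which is why the explicit encode call on every string becomes unnecessary once the external encoding is set.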

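At this point we can see that Ruby tracks three pieces of encoding information: the internal encoding, the string encoding, and the external encoding.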

Consider the following program:
- main6.rb

# encoding: UTF-8  
puts STDOUT.internal_encoding  
puts STDOUT.external_encoding  
puts "哈囉! 良葛格!".encoding  

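Without the -EBig5:UTF-8 option, STDOUT has no external or internal encoding information and the string encoding is UTF-8. With the -EBig5:UTF-8 option, STDOUT's external encoding is Big5, its internal encoding is UTF-8, and the string encoding is UTF-8: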

# ruby main6.rb

UTF-8

# ruby -EBig5:UTF-8 main6.rb
UTF-8
Big5
UTF-8

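With STDOUT.set_encoding("Big5", "UTF-8"), STDOUT's external encoding is Big5 and its internal encoding is UTF-8. You can use a string's encoding method to get its encoding, bytesize to get its length in bytes, and bytes to get the bytes themselves. For example: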
- main7.rb

# encoding: UTF-8  
text = "哈囉"  
puts text.encoding  
puts text.bytesize  
puts "%X %X %X %X %X %X" % text.bytes.to_a, "\n"  
  
text = text.encode("Big5")  
puts text.encoding  
puts text.bytesize  
puts "%X %X %X %X" % text.bytes.to_a  

The result is:

# ruby main7.rb
UTF-8
6
E5 93 88 E5 9B 89

Big5
4
AB A2 C5 6F

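In irb, the system default encoding is used as the string encoding. You can use force_encoding to change a string's encoding without altering its bytes, valid_encoding? to check whether the bytes are valid for the encoding, and Encoding.compatible? to test whether two strings' encodings are compatible. If they are not compatible it returns nil, which also means the strings cannot be joined with +.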
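
These three methods can be tried outside irb as well. The following is a minimal sketch with illustrative variable names; it builds a UTF-8 string and a Big5 copy of it, then relabels and compares them.

```ruby
utf8 = "\u54C8\u56C9"                         # UTF-8 text
big5 = utf8.encode("Big5")                    # same text, transcoded to Big5

# force_encoding only changes the label; the bytes stay the same.
relabeled  = big5.dup.force_encoding("UTF-8")
same_bytes = relabeled.bytes == big5.bytes
valid      = relabeled.valid_encoding?        # are the Big5 bytes valid as UTF-8?

# Two non-ASCII strings in different encodings are not compatible.
compatible = Encoding.compatible?(utf8, big5)

begin
  utf8 + big5
  concat_error = nil
rescue Encoding::CompatibilityError => e
  concat_error = e.class                      # + refuses to join incompatible strings
end

puts same_bytes
puts valid
puts compatible.inspect
puts concat_error.inspect
```

Note the difference between encode and force_encoding here: encode produces new bytes for the same characters, while force_encoding keeps the bytes and only changes how they are interpreted.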

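In Ruby 1.8, the string's each method iterated over the string line by line, which was easy to mistake for byte or character iteration. In Ruby 1.9, each has been removed in favor of the more explicit each_line, each_byte, and each_char methods.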
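
A short sketch of the three explicit iterators on a two-line UTF-8 string (the sample text is an illustrative choice):

```ruby
text = "\u54C8\n\u56C9"                 # two lines, one Chinese character each

lines = text.each_line.to_a             # split on the line separator
chars = text.each_char.to_a             # one element per character
bytes = text.each_byte.to_a             # one Integer per byte

puts lines.length
puts chars.length
puts bytes.length
puts text.respond_to?(:each)            # String no longer has a plain each
```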

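When a file is opened without specifying external or internal encodings for the file object, the file object's external encoding is the system default encoding, strings read from the file object use that external encoding, and the file object has no internal encoding set. For example: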
- main8.rb

# Encoding: UTF-8  
print "File name: "  
name = gets.chomp  
file = open(name, "r")  
printf("file.external_encoding=%s\n", file.external_encoding)  
printf("file.internal_encoding=%s\n", file.internal_encoding)  
printf("file.gets.encoding=%s\n", file.gets.encoding)  
printf("\"哈囉\".encoding=%s\n", "哈囉".encoding)  
file.close  

The result is:

> ruby main8.rb
File name: main8.rb
file.external_encoding=ASCII-8BIT
file.internal_encoding=
file.gets.encoding=ASCII-8BIT
"哈囉".encoding=UTF-8

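When reading a file you can specify the file object's external encoding; strings read from it then have the same encoding as the file object's external encoding. For example: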
- main9.rb

# Encoding: UTF-8  
print "File name: "  
name = gets.chomp  
file = open(name, "r:utf-8")  
printf("file.external_encoding=%s\n", file.external_encoding)  
printf("file.internal_encoding=%s\n", file.internal_encoding)  
printf("file.gets.encoding=%s\n", file.gets.encoding)  
printf("\"哈囉\".encoding=%s\n", "哈囉".encoding)  
file.close  
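
To try the external-encoding behavior without preparing a file by hand, the sketch below writes Big5 bytes to a temporary file and reads them back with "r:Big5". Tempfile and the variable names are assumptions made for a self-contained example; they are not part of the original programs.

```ruby
require "tempfile"

text = "\u54C8\u56C9"                  # UTF-8 text to store
line = nil
external = nil

Tempfile.create("big5-sample") do |tmp|
  tmp.binmode
  tmp.write(text.encode("Big5"))       # store the text as Big5 bytes
  tmp.close

  File.open(tmp.path, "r:Big5") do |file|
    external = file.external_encoding
    line = file.gets                   # tagged with the external encoding
  end
end

puts external
puts line.encoding
puts line.encode("UTF-8") == text      # converting by hand recovers the UTF-8 text
```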

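You can also specify both the external and internal encodings of the file object. If an internal encoding is specified, strings read from the file have the same encoding as the file object's internal encoding:
- main10.rb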

# Encoding: UTF-8  
print "File name: "  
name = gets.chomp  
file = open(name, "r:big5:utf-8")  
printf("file.external_encoding=%s\n", file.external_encoding)  
printf("file.internal_encoding=%s\n", file.internal_encoding)  
printf("file.gets.encoding=%s\n", file.gets.encoding)  
printf("\"哈囉\".encoding=%s\n", "哈囉".encoding)  
file.close  

Result:

> ruby main10.rb
File name: test.txt
file.external_encoding=Big5
file.internal_encoding=UTF-8
file.gets.encoding=UTF-8
"哈囉".encoding=UTF-8

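When writing data to a file, the default is to write the string's bytes out faithfully. You can also specify the encoding to use when writing to the file. For example: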
- main11.rb

# encoding: UTF-8  
print "File name: "  
name = gets.chomp  
open(name, "w:big5") do |file|  
    text = "哈囉"  
    puts text.encoding          # prints UTF-8  
    puts file.external_encoding # prints Big5  
    file.print text  
end  

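With this setting, although the string encoding is UTF-8, the text is encoded as Big5 when written to the file, so when you open the file you see correct Chinese in Big5. You can also specify both the external and internal encodings. For example: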
- main12.rb

# encoding: UTF-8  
print "File name: "  
name = gets.chomp  
open(name, "w:big5:utf-8") do |file|  
    text = "哈囉"  
    puts text.encoding          # prints UTF-8  
    puts file.external_encoding # prints Big5  
    puts file.internal_encoding # prints UTF-8  
    file.print text  
end  
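
The write and read modes can be checked together with a round trip. This is a sketch under the assumption that a temporary file is acceptable as the target; it writes UTF-8 text through "w:Big5", inspects the raw bytes on disk, and reads them back through "r:Big5:UTF-8".

```ruby
require "tempfile"

text = "\u54C8\u56C9"                  # UTF-8 text
on_disk = nil
read_back = nil

Tempfile.create("roundtrip") do |tmp|
  tmp.close

  # UTF-8 string in, Big5 bytes on disk.
  File.open(tmp.path, "w:Big5") { |file| file.print text }
  on_disk = File.binread(tmp.path)

  # Big5 bytes on disk, UTF-8 string out.
  File.open(tmp.path, "r:Big5:UTF-8") { |file| read_back = file.gets }
end

puts on_disk.bytesize
puts read_back.encoding
puts read_back == text
```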

The default external and internal encodings for I/O objects can be set as follows:

Encoding.default_external = Encoding.find("Big5")  
Encoding.default_internal = Encoding.find("UTF-8")   

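The -EBig5:UTF-8 option used earlier when running the ruby command does exactly this: it specifies the external and internal encodings for I/O objects. To find out which encodings Ruby 1.9 supports, use Encoding.name_list.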
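
A small sketch for inspecting the current defaults and checking whether a given encoding is available. The printed defaults depend on your environment and on any -E option, so no particular values are assumed here.

```ruby
big5 = Encoding.find("big5")                      # name lookup ignores case
supported = Encoding.name_list.include?("Big5")   # is Big5 among the known names?

puts Encoding.default_external                    # depends on the environment
puts Encoding.default_internal.inspect            # depends on -E or explicit assignment
puts big5
puts supported
```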


Tuesday, October 7, 2014
