程式扎記: [ Java Code Template ] Detecting a File's Encoding

Monday, October 15, 2012

[ Java Code Template ] Detecting a File's Encoding - java.nio.charset.CharsetDecoder

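Source: here
Description:
When you are handed a text file, is there a way to determine its encoding in code, rather than relying on a text editor to tell you? One option is java.nio.charset.CharsetDecoder.
The code below uses this class to decode the contents of the text file. If the file decodes completely under a given encoding, we guess that the file uses that encoding: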

public Charset detectCharset(File f, String[] charsets) {

    Charset charset = null;

    // charsets is our array of candidate encodings, e.g. UTF-8, Big5, etc.
    for (String charsetName : charsets) {
        charset = detectCharset(f, Charset.forName(charsetName));
        if (charset != null) {
            break;
        }
    }
    System.out.printf("\t[Test] Using '%s' encoding!\n", charset);
    return charset;
}

private Charset detectCharset(File f, Charset charset) {
    try {
        // Read the whole file and decode it in one pass. Decoding fixed-size
        // chunks independently would reject a valid file whenever a multi-byte
        // character straddles a chunk boundary.
        byte[] bytes = Files.readAllBytes(f.toPath());
        CharsetDecoder decoder = charset.newDecoder();
        // identify() appears in the full listing below.
        return identify(bytes, decoder) ? charset : null;
    } catch (IOException e) {
        // The file could not be read, so no charset can be confirmed.
        return null;
    }
}

Sample code:
The code below detects the encoding of the text file "example.txt" or "example_utf8.txt":

package test;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.file.Files;

public class CharsetDetector {
    public Charset detectCharset(File f, String[] charsets) {

        Charset charset = null;

        // charsets is our array of candidate encodings, e.g. UTF-8, Big5, etc.
        for (String charsetName : charsets) {
            charset = detectCharset(f, Charset.forName(charsetName));
            if (charset != null) {
                break;
            }
        }
        System.out.printf("\t[Test] Using '%s' encoding!\n", charset);
        return charset;
    }

    private Charset detectCharset(File f, Charset charset) {
        try {
            // Read the whole file and decode it in one pass. Decoding fixed-size
            // chunks independently would reject a valid file whenever a multi-byte
            // character straddles a chunk boundary.
            byte[] bytes = Files.readAllBytes(f.toPath());
            CharsetDecoder decoder = charset.newDecoder();
            return identify(bytes, decoder) ? charset : null;
        } catch (IOException e) {
            // The file could not be read, so no charset can be confirmed.
            return null;
        }
    }

    // Returns true only if every byte decodes cleanly. A new decoder reports
    // malformed or unmappable input by throwing CharacterCodingException.
    private boolean identify(byte[] bytes, CharsetDecoder decoder) {
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
        } catch (CharacterCodingException e) {
            return false;
        }
        return true;
    }
  
    public static void main(String[] args) {
        File f = new File("example.txt");

        String[] charsetsToBeTested = {"UTF-8", "big5", "windows-1253", "ISO-8859-7"};

        CharsetDetector cd = new CharsetDetector();
        Charset charset = cd.detectCharset(f, charsetsToBeTested);

        if (charset != null) {
            // try-with-resources closes the reader even if reading fails.
            try (InputStreamReader reader = new InputStreamReader(new FileInputStream(f), charset)) {
                int c;
                while ((c = reader.read()) != -1) {
                    System.out.print((char) c);
                }
            } catch (FileNotFoundException fnfe) {
                fnfe.printStackTrace();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        } else {
            System.out.println("Unrecognized charset.");
        }
    }
}  

With the text file example.txt (Big5-encoded), the output is:

[Test] Using 'Big5' encoding!
這是中文

With the text file example_utf8.txt (UTF-8-encoded), the output is:

[Test] Using 'UTF-8' encoding!
這是中文
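
When a file is validated in fixed-size chunks rather than in one piece, an incomplete multi-byte sequence at the end of a chunk has to be carried over to the next read, otherwise valid text can be rejected. The following is a minimal sketch of one way to do that with the three-argument CharsetDecoder.decode(in, out, endOfInput). The class name StreamingCharsetCheck, the method decodes and the chunkSize parameter are hypothetical names chosen for illustration, not part of the original post.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class StreamingCharsetCheck {

    // Returns true if the whole stream decodes cleanly with the given charset.
    static boolean decodes(InputStream input, Charset charset, int chunkSize) throws IOException {
        if (chunkSize < 16) {
            // Room is needed for leftover bytes plus at least some new input.
            throw new IllegalArgumentException("chunkSize must be at least 16");
        }
        CharsetDecoder decoder = charset.newDecoder(); // default action: report errors
        ByteBuffer in = ByteBuffer.allocate(chunkSize);
        CharBuffer out = CharBuffer.allocate(chunkSize);
        boolean endOfInput = false;

        while (!endOfInput) {
            // Append new bytes after any leftover bytes kept by compact().
            int n = input.read(in.array(), in.position(), in.remaining());
            if (n == -1) {
                endOfInput = true;
            } else {
                in.position(in.position() + n);
            }
            in.flip();
            CoderResult result;
            do {
                out.clear(); // decoded text is discarded; only validity matters
                result = decoder.decode(in, out, endOfInput);
                if (result.isError()) {
                    return false;
                }
            } while (result.isOverflow());
            in.compact(); // keep an incomplete trailing sequence for the next read
        }
        out.clear();
        return !decoder.flush(out).isError();
    }

    public static void main(String[] args) throws IOException {
        // "\u4e2d\u6587" is the text 中文; each character takes 3 bytes in UTF-8,
        // so with 16-byte chunks a character straddles the first chunk boundary.
        byte[] utf8 = "\u4e2d\u6587\u4e2d\u6587\u4e2d\u6587\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodes(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8, 16));    // true
        System.out.println(decodes(new ByteArrayInputStream(utf8), StandardCharsets.US_ASCII, 16)); // false
    }
}
```

Passing endOfInput as false lets the decoder stop before an incomplete sequence instead of reporting it as malformed, and compact() moves those leftover bytes to the front of the buffer for the next read.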

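Additional notes:
This code guesses by trial and error. More than one encoding may be able to decode the same file without error (a wrong one produces garbled text), so when arranging the array of candidate encodings, put the most likely encodings first. Otherwise an inappropriate encoding that happens to decode the file cleanly will be picked, and the text will come out garbled.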
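
The effect of candidate order can be seen in a small self-contained sketch. It assumes ISO-8859-1 as the "inappropriate" encoding, because that charset maps every byte value and therefore never fails to decode. The class name CandidateOrderDemo and the method firstMatch are hypothetical names for illustration; they mirror the trial-and-error loop above but work on a byte array instead of a file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class CandidateOrderDemo {

    // Returns the first candidate that decodes the bytes without error, or null.
    static Charset firstMatch(byte[] bytes, Charset... candidates) {
        for (Charset candidate : candidates) {
            try {
                // A fresh decoder reports malformed input by throwing.
                candidate.newDecoder().decode(ByteBuffer.wrap(bytes));
                return candidate;
            } catch (CharacterCodingException e) {
                // Not decodable with this candidate; try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // "\u9019\u662f\u4e2d\u6587" is the sample text 這是中文 (4 characters, 12 UTF-8 bytes).
        byte[] utf8Bytes = "\u9019\u662f\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);

        Charset latin1First = firstMatch(utf8Bytes, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        Charset utf8First = firstMatch(utf8Bytes, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // The wrong winner turns every byte into its own (garbled) character.
        System.out.println(latin1First + ", " + new String(utf8Bytes, latin1First).length() + " chars"); // ISO-8859-1, 12 chars
        System.out.println(utf8First + ", " + new String(utf8Bytes, utf8First).length() + " chars");     // UTF-8, 4 chars
    }
}
```

Both orderings report success on the same bytes, but only the one that tries UTF-8 first recovers the original four characters. Stricter encodings therefore belong ahead of permissive ones in the candidate array.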

* [ Java FAQ ] Why Java garbles UTF-8 files that have a BOM, and how to fix it
* [ Java FAQ ] Handle UTF8 file with BOM
