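Description:
When you are handed a text file, is there a way to work out its encoding in code, rather than relying on a text editor to tell you? You can try java.nio.charset.CharsetDecoder.
The code below uses that class to decode the file's contents. If the entire file can be decoded successfully with a given encoding, we guess that the file uses that encoding: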
    public Charset detectCharset(File f, String[] charsets) {
        Charset charset = null;
        // charsets is the array of candidate encodings we define, e.g. UTF-8, Big5, etc.
        for (String charsetName : charsets) {
            charset = detectCharset(f, Charset.forName(charsetName));
            if (charset != null) {
                break;
            }
        }
        System.out.printf("\t[Test] Using '%s' encoding!\n", charset);
        return charset;
    }

    private Charset detectCharset(File f, Charset charset) {
        try {
            // Read the whole file and decode it in one call (needs java.nio.file.Files),
            // so that a multi-byte character is never split across two buffers
            // and no stale buffer bytes are fed to the decoder.
            byte[] bytes = Files.readAllBytes(f.toPath());
            CharsetDecoder decoder = charset.newDecoder();
            return identify(bytes, decoder) ? charset : null;
        } catch (IOException e) {
            // The file could not be read, so no encoding can be confirmed.
            return null;
        }
    }
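The check above depends on the decoder reporting invalid input as an error instead of silently substituting a replacement character. A decoder obtained from Charset.newDecoder() already behaves this way by default, but the intent can be spelled out explicitly. The following is only a minimal sketch: the class name StrictDecodeSketch and the method canDecode are made up for illustration and are not part of the original code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class StrictDecodeSketch {
    // Returns true only if all of data is valid in the given charset.
    static boolean canDecode(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)       // fail on invalid byte sequences
                   .onUnmappableCharacter(CodingErrorAction.REPORT)  // fail on bytes with no mapping
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two CJK characters encoded as UTF-8 (written with escapes to avoid source-encoding issues).
        byte[] valid = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // A 3-byte UTF-8 sequence with its last byte missing.
        byte[] truncated = {(byte) 0xE4, (byte) 0xB8};
        System.out.println(canDecode(valid, StandardCharsets.UTF_8));
        System.out.println(canDecode(truncated, StandardCharsets.UTF_8));
        System.out.println(canDecode(valid, StandardCharsets.US_ASCII));
    }
}
```

The truncated case is also why decoding fixed-size chunks independently is risky: a chunk that ends in the middle of a multi-byte character looks malformed even when the file as a whole is valid.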
The complete program below detects the encoding of the text file "example.txt" or "example_utf8.txt":
    package test;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.file.Files;

    public class CharsetDetector {

        public Charset detectCharset(File f, String[] charsets) {
            Charset charset = null;
            // charsets is the array of candidate encodings we define, e.g. UTF-8, Big5, etc.
            for (String charsetName : charsets) {
                charset = detectCharset(f, Charset.forName(charsetName));
                if (charset != null) {
                    break;
                }
            }
            System.out.printf("\t[Test] Using '%s' encoding!\n", charset);
            return charset;
        }

        private Charset detectCharset(File f, Charset charset) {
            try {
                // Read the whole file and decode it in one call, so that a multi-byte
                // character is never split across two buffers and no stale buffer
                // bytes are fed to the decoder.
                byte[] bytes = Files.readAllBytes(f.toPath());
                CharsetDecoder decoder = charset.newDecoder();
                return identify(bytes, decoder) ? charset : null;
            } catch (IOException e) {
                // The file could not be read, so no encoding can be confirmed.
                return null;
            }
        }

        // Returns true if the bytes decode without error in the decoder's charset.
        private boolean identify(byte[] bytes, CharsetDecoder decoder) {
            try {
                decoder.decode(ByteBuffer.wrap(bytes));
            } catch (CharacterCodingException e) {
                return false;
            }
            return true;
        }

        public static void main(String[] args) {
            File f = new File("example.txt");
            String[] charsetsToBeTested = {"UTF-8", "big5", "windows-1253", "ISO-8859-7"};
            CharsetDetector cd = new CharsetDetector();
            Charset charset = cd.detectCharset(f, charsetsToBeTested);
            if (charset != null) {
                // try-with-resources closes the reader even if reading fails.
                try (InputStreamReader reader = new InputStreamReader(new FileInputStream(f), charset)) {
                    int c;
                    while ((c = reader.read()) != -1) {
                        System.out.print((char) c);
                    }
                } catch (IOException ioe) {
                    // Also covers FileNotFoundException, which is a subclass of IOException.
                    ioe.printStackTrace();
                }
            } else {
                System.out.println("Unrecognized charset.");
            }
        }
    }
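For files too large to hold in memory comfortably, one possible alternative is to stream the bytes through an InputStreamReader built around a strict CharsetDecoder, so that the reader itself takes care of multi-byte characters that straddle its internal buffers. This is only a sketch under that assumption: the class name StreamingCheckSketch and the method decodes are hypothetical and do not appear in the program above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class StreamingCheckSketch {
    // Returns true if the whole stream decodes without error in the given charset.
    static boolean decodes(InputStream in, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader reader = new InputStreamReader(in, decoder)) {
            char[] buf = new char[4096];
            while (reader.read(buf) != -1) {
                // Discard the characters; only the absence of errors matters here.
            }
            return true;
        } catch (CharacterCodingException e) {
            // The reader hit bytes that are invalid in this charset.
            return false;
        } catch (IOException e) {
            // A genuine I/O failure is not evidence about the encoding.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] valid = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = {0x41, (byte) 0xFF, 0x42}; // 0xFF never occurs in valid UTF-8
        System.out.println(decodes(new ByteArrayInputStream(valid), StandardCharsets.UTF_8));
        System.out.println(decodes(new ByteArrayInputStream(invalid), StandardCharsets.UTF_8));
    }
}
```

In a real program the InputStream would come from the file being tested, for example a FileInputStream, opened once per candidate encoding.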
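If the input is the text file example_utf8.txt (UTF-8 encoded), the CharsetDetector program reports the detected encoding as UTF-8 and then prints the file's contents.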
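Supplementary notes:
This code guesses by trial and error. More than one encoding may be able to decode the same file without error, and decoding with the wrong one yields garbled text. When arranging the array of candidate encodings, therefore, put the encodings you expect most often first, so that an "inappropriate" encoding that merely happens to decode the file does not win and produce garbled output.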
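The effect of candidate order can be reproduced with a small sketch. It assumes ISO-8859-1 as one of the candidates (it is not in the list used by the original program) because that encoding maps every possible byte value and therefore accepts any input. The class name CandidateOrderSketch and the method firstMatch are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
class CandidateOrderSketch {
    // Returns the first candidate that decodes data without error, or null if none does.
    static Charset firstMatch(byte[] data, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder().decode(ByteBuffer.wrap(data));
                return cs;
            } catch (CharacterCodingException e) {
                // Not decodable in this charset; try the next candidate.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Two CJK characters encoded as UTF-8.
        byte[] data = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);

        Charset utf8First = firstMatch(data,
                Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
        Charset latin1First = firstMatch(data,
                Arrays.asList(StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));

        // Same bytes, different winner depending only on the order of the candidates.
        System.out.println(utf8First + " -> " + new String(data, utf8First));
        System.out.println(latin1First + " -> " + new String(data, latin1First));
    }
}
```

With UTF-8 listed first the text round-trips correctly; with ISO-8859-1 listed first the same bytes are accepted as six unrelated single-byte characters. Permissive single-byte encodings are therefore best placed at the end of the candidate list.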
* [ Java FAQ ] Why Java produces garbled text when reading a UTF-8 file with a BOM, and how to fix it
* [ Java FAQ ] Handle UTF8 file with BOM