程式扎記

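Preface :
For an IR project at school, I needed to determine whether a document's text is Chinese or English. After consulting Google, I compiled the methods below.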

String length comparison :
First, take a look at the following code :

String s1 = "我是中國人";  
String s2 = "imchinese";  
String s3 = "im中國人";  
System.out.println(s1 + ":" + s1.length());
System.out.println(s2 + ":" + s2.length());
System.out.println(s3 + ":" + s3.length());

You will get the following output :

我是中國人:5
imchinese:9
im中國人:5

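In Java, the method length() returns the number of UTF-16 code units (char values) in the string, not the number of bytes, which is why each Chinese character above counts as 1. getBytes(), on the other hand, encodes the string with the platform's default charset. In multi-byte charsets such as UTF-8, Big5, or GBK, an ASCII character takes one byte while a Chinese character takes two or more. So if your string contains only English (ASCII) characters, its byte count equals its length(); if the string contains Chinese, its byte count is greater than its length(). Note that under these charsets the comparison actually detects any non-ASCII character (accented letters, kana, and so on), and the result depends on the default charset. With that in mind, you can use the code below to check whether a string contains Chinese :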

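System.out.println((s1.getBytes().length == s1.length()) ? "s1: no Chinese" : "s1: contains Chinese");
System.out.println((s2.getBytes().length == s2.length()) ? "s2: no Chinese" : "s2: contains Chinese");
System.out.println((s3.getBytes().length == s3.length()) ? "s3: no Chinese" : "s3: contains Chinese");

The execution result is :

s1: contains Chinese
s2: no Chinese
s3: contains Chinese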
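
Because getBytes() is called without arguments, the comparison above depends on the platform's default charset. The following is a minimal sketch, not a definitive implementation, that pins the charset to UTF-8 so the result is the same on every machine. The class name ByteLengthCheck and the method hasNonAscii are hypothetical names introduced here for illustration. The sample strings are written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.nio.charset.StandardCharsets;

final class ByteLengthCheck {
    // Hypothetical helper: true when the UTF-8 byte count differs from
    // the number of chars, i.e. the string holds a non-ASCII character.
    static boolean hasNonAscii(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length != s.length();
    }

    public static void main(String[] args) {
        String s1 = "\u6211\u662F\u4E2D\u570B\u4EBA"; // the five Chinese characters of s1
        String s2 = "imchinese";
        String s3 = "im\u4E2D\u570B\u4EBA";           // "im" followed by three Chinese characters
        String s4 = "caf\u00E9";                      // Latin text with an accented e

        System.out.println(hasNonAscii(s1)); // true  (15 bytes vs. 5 chars)
        System.out.println(hasNonAscii(s2)); // false (9 bytes vs. 9 chars)
        System.out.println(hasNonAscii(s3)); // true  (11 bytes vs. 5 chars)
        System.out.println(hasNonAscii(s4)); // true  (5 bytes vs. 4 chars), yet no Chinese
    }
}
```

The last case shows the limit of this approach: it tells ASCII apart from everything else, so it is only a rough Chinese/English test for text known to contain just those two.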

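Using a regular expression to extract Chinese :
This approach checks the Unicode range that Chinese characters are encoded in: if a character falls within the range, it is treated as Chinese. The range used here is U+4E00..U+9FA5, the part of the CJK Unified Ideographs block (U+4E00..U+9FFF) that holds the original 20,902 ideographs. Since these ideographs are shared by Chinese, Japanese, and Korean text, a match strictly means "Han ideograph" rather than "Chinese". More explanation can be found here. So the code can be written as follows :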

String str = "測試中文aA123";  
for(int i=0; i<str.length(); i++)
{  
    String test = str.substring(i, i+1);  
    if(test.matches("[\\u4E00-\\u9FA5]+"))  
    {  
        System.out.printf("\t[Info] %s -> Chinese!\n", test);  
    }  
    else  
    {  
        System.out.printf("\t[Info] %s\n", test);  
    }  
}  

The execution result is as follows :

[Info] 測 -> Chinese!
[Info] 試 -> Chinese!
[Info] 中 -> Chinese!
[Info] 文 -> Chinese!
[Info] a
[Info] A
[Info] 1
[Info] 2
[Info] 3
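
The loop above tests one character at a time. As a hedged sketch of the same idea applied to a whole string, the block below compiles the range once and uses Matcher.find() both to test for Chinese and to pull out only the matching characters. The names ChineseExtractor, containsChinese, and extractChinese are hypothetical, introduced for illustration; the range is the same U+4E00..U+9FA5 used above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChineseExtractor {
    // Same range as in the text; the doubled backslash leaves the escape
    // for the regex engine to interpret rather than the compiler.
    private static final Pattern CJK = Pattern.compile("[\\u4E00-\\u9FA5]+");

    // True if at least one character falls inside the range.
    static boolean containsChinese(String s) {
        return CJK.matcher(s).find();
    }

    // Concatenates every run of in-range characters, dropping the rest.
    static String extractChinese(String s) {
        StringBuilder out = new StringBuilder();
        Matcher m = CJK.matcher(s);
        while (m.find()) {
            out.append(m.group());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The test string from the text: four Chinese characters, then "aA123".
        String str = "\u6E2C\u8A66\u4E2D\u6587aA123";
        System.out.println(containsChinese(str));         // true
        System.out.println(containsChinese("aA123"));     // false
        System.out.println(extractChinese(str).length()); // 4
    }
}
```

Compiling the pattern once avoids re-parsing the regular expression for every character, which String.matches() does on each call.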

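Reference sources :
* (JAVA) Determining whether a String contains Chinese characters
* 306Doc : Using a regular expression to extract Chinese
* Java: checking that a filename string contains only English letters / digits and no Chinese characters
* Complete CJK Unicode ranges (version 5.0)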

CJK is a collective term for Chinese, Japanese, and Korean, which is used in the field of software and communications internationalization.
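
Because Han ideographs are shared across CJK text, a range check cannot tell Chinese from Japanese kanji. As a rough sketch of an alternative to a hard-coded range, Java's Character.UnicodeScript (available since Java 7) classifies each code point by script. The class ScriptCheck and its methods are hypothetical names used only for illustration.

```java
final class ScriptCheck {
    // True if the code point belongs to the Han script.
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    // Counts the Han code points in a string (String.codePoints() needs Java 8+).
    static int countHan(String s) {
        return (int) s.codePoints().filter(ScriptCheck::isHan).count();
    }

    public static void main(String[] args) {
        String kanji = "\u65E5\u672C\u8A9E";          // three kanji (Japanese text, Han script)
        String hiragana = "\u3072\u3089\u304C\u306A"; // four hiragana letters
        String hangul = "\uD55C\uAE00";               // two hangul syllables

        System.out.println(countHan(kanji));    // 3
        System.out.println(countHan(hiragana)); // 0
        System.out.println(countHan(hangul));   // 0
    }
}
```

The kanji sample counts as Han even though it is Japanese, so telling the languages apart needs more context than a per-character test can provide.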

Saturday, May 12, 2012

[ Java FAQ ] Determining whether a String contains Chinese characters
