程式扎記: [ Java Package ] jsoup

Tuesday, December 3, 2013

[ Java Package ] jsoup - Java HTML Parser

Preface:
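Recently I wanted to study JS (JavaScript), and the first step was to extract the JS code from web pages. Without thinking much about it, I wrote a regular expression to do the job. Suppose we have the following page: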
- Test.html

The JS to extract from it is:

window.location="http://localhost/FF/redir/r/windowsloc2.html";

So how do we pull this JS out? Simple, a one-line regular expression does it:

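import java.util.regex.Matcher;
import java.util.regex.Pattern;

// pageBody holds the content of the page above.
// DOTALL lets "." match line breaks, so a multi-line script body is captured.
Pattern jsPtn = Pattern.compile("<script.*?>(.*)</script>", Pattern.DOTALL);
Matcher mth = jsPtn.matcher(pageBody);
if(mth.find())
{
    // group(1) is the text between the opening and closing script tags.
    System.out.printf("\t[Info] JS:\n%s\n", mth.group(1).trim());
}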

The output is:

[Info] JS:
window.location="http://localhost/FF/redir/r/windowsloc2.html";

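But when I ran it on the production system, I found that a lot of JS was present in the pages yet never captured... Orz. Part of the source of one page that was missed: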

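It looks much like Test.html above, yet the regular expression above cannot match it! The reason is that "script" has become upper-case "SCRIPT" ("language" is upper-case too). Would I have to keep changing the regular expression to cover every possible spelling of the script tag (Script/scrIpt/ScripT...)? My instinct said there should be a package that handles this kind of work, and luckily, after consulting the great Google, I found Jsoup - Java HTML Parser!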
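As an aside on the regular-expression route: the case problem alone can be handled with the standard Pattern.CASE_INSENSITIVE flag. The sketch below is only an illustration, not what the post ended up using. The class name ScriptRegexDemo, the RELAXED pattern, and the sample page fragment are assumptions made up for this example. The body group is also made lazy, because the greedy (.*) would run from the first script block to the last closing tag on a page with several blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptRegexDemo {
    // The pattern from the post: case-sensitive, greedy body.
    static final Pattern ORIGINAL =
            Pattern.compile("<script.*?>(.*)</script>", Pattern.DOTALL);

    // Assumed variant for illustration: ignore case, lazy body.
    static final Pattern RELAXED =
            Pattern.compile("<script.*?>(.*?)</script>",
                    Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Collects the trimmed body of every match of the given pattern.
    static List<String> extract(Pattern ptn, String html) {
        List<String> bodies = new ArrayList<>();
        Matcher m = ptn.matcher(html);
        while (m.find()) {
            bodies.add(m.group(1).trim());
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Hypothetical fragment with upper-case tag and attribute names.
        String html = "<HTML><SCRIPT LANGUAGE=\"JavaScript\">\n"
                + "window.location=\"http://localhost/FF/redir/r/windowsloc2.html\";\n"
                + "</SCRIPT></HTML>";
        // The original pattern finds no match in this fragment.
        System.out.println("original matches: " + extract(ORIGINAL, html).size());
        // The relaxed pattern captures the script body.
        System.out.println("relaxed matches: " + extract(RELAXED, html));
    }
}
```

Even with the flag, a regular expression still knows nothing about HTML structure (comments, malformed markup, and so on), which is why a real parser remains the more robust choice.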

jsoup: Java HTML Parser
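First, of course, let's look at what this package is for, what it can do, and how to use it. As for what it is for, the official site gives this short description: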

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
* Scrape and parse HTML from a URL, file, or string
* Find and extract data, using DOM traversal or CSS selectors
* Manipulate the HTML elements, attributes, and text
* Clean user-submitted content against a safe white-list, to prevent XSS attacks
* Output tidy HTML

Many features are listed above, and "Parse HTML from string/Extract data using DOM traversal" happen to be exactly what I need. Next, let's see how to use the package.

Parsed From File
If the content you want to process has already been saved to a file, you can parse the page as follows:

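import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

File pageFile = new File("data/windowsloc.html");
Document doc = Jsoup.parse(pageFile, "UTF-8");  // may throw IOException
// Tag names are matched case-insensitively, so <SCRIPT> is found as well.
Elements jsElms = doc.getElementsByTag("script");
for(Element e : jsElms)
{
    // A missing "language" attribute is treated as JavaScript.
    String langAttr = e.attr("language").toLowerCase().trim();
    if(langAttr.equals("javascript")||langAttr.isEmpty())
        // data() returns the script body and is safe for empty <script> elements.
        System.out.printf("\t[Info] JS:\n%s\n", e.data().trim());
}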

Parsed From String
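If the page was downloaded directly from the network and is still in memory, you can keep it in a string and parse it with the API parse(String html, String baseUri), where baseUri is the URL used to resolve relative links in the page: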

// The page content is held in the variable "pageBody".
Document doc = Jsoup.parse(pageBody, "http://localhost/FF/");
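To tie this back to the upper-case SCRIPT problem from the preface, here is a minimal sketch that parses a page held in a string and collects the script bodies. The class name ParseFromStringDemo, the helper extractScripts, and the sample page fragment are assumptions made up for this example. It relies on the jsoup calls Jsoup.parse(String, String), getElementsByTag, attr, and data, and needs the jsoup jar on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParseFromStringDemo {
    // Hypothetical helper: returns the body of every JavaScript <script> block.
    static List<String> extractScripts(String pageBody, String baseUri) {
        Document doc = Jsoup.parse(pageBody, baseUri);
        List<String> scripts = new ArrayList<>();
        // Tag lookup is case-insensitive, so <SCRIPT> and <script> both match.
        for (Element e : doc.getElementsByTag("script")) {
            String lang = e.attr("language").trim().toLowerCase();
            // Keep blocks with no language attribute or with language=javascript.
            if (lang.isEmpty() || lang.equals("javascript")) {
                scripts.add(e.data().trim());
            }
        }
        return scripts;
    }

    public static void main(String[] args) {
        // Hypothetical fragment with upper-case tag and attribute names.
        String pageBody = "<HTML><SCRIPT LANGUAGE=\"JavaScript\">\n"
                + "window.location=\"http://localhost/FF/redir/r/windowsloc2.html\";\n"
                + "</SCRIPT></HTML>";
        for (String js : extractScripts(pageBody, "http://localhost/FF/")) {
            System.out.printf("\t[Info] JS:\n%s\n", js);
        }
    }
}
```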

Parsed From URL
If what you have is a URL, you can call the API connect(String url) to obtain a Connection object, then call get() on that object to obtain the Document for that URL:

Document doc = Jsoup.connect("http://example.com/").get();  
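In practice you may want to set a few request options before calling get(). The following is a rough sketch under assumptions: the class name ParseFromUrlDemo, the helper prepare, and the user-agent and timeout values are made up for illustration and are not from the original post. userAgent(String) and timeout(int) are methods of jsoup's Connection, and get() may throw an IOException.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParseFromUrlDemo {
    // Assumed settings for illustration only; adjust for your own use.
    static final String USER_AGENT = "Mozilla/5.0";
    static final int TIMEOUT_MS = 10_000;

    // Hypothetical helper: configures the Connection; nothing is sent yet.
    static Connection prepare(String url) {
        return Jsoup.connect(url).userAgent(USER_AGENT).timeout(TIMEOUT_MS);
    }

    public static void main(String[] args) throws IOException {
        // get() performs the HTTP request and parses the response.
        Document doc = prepare("http://example.com/").get();
        for (Element e : doc.getElementsByTag("script")) {
            System.out.printf("\t[Info] JS:\n%s\n", e.data().trim());
        }
    }
}
```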

Supplement:
* Extract data - Use DOM methods to navigate a document
* Extract data - http://jsoup.org/cookbook/extracting-data/selector-syntax
* Extract data - Extract attributes, text, and HTML from elements
* Extract data - Working with URLs
* Extract data - Example program: list links
