Wednesday, September 4, 2013

[ Java Code Template ] HttpClient 4.x: Handling meta refresh redirection

Preface: 
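In HttpClient 4.x, redirects with status code 301 (Moved Permanently) or 302 (Found) are controlled by the client's redirect strategy. DefaultRedirectStrategy already follows them for GET and HEAD requests; to log each hop, or to force a redirect on every 301/302 regardless of the request method, subclass it and install the subclass with setRedirectStrategy(). Sample code:
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.http.HttpRequest;
    import org.apache.http.HttpResponse;
    import org.apache.http.ProtocolException;
    import org.apache.http.client.methods.HttpUriRequest;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.impl.client.DefaultRedirectStrategy;
    import org.apache.http.protocol.HttpContext;

    // redirtList is not declared in the original snippet; it is assumed here
    // to be a list that records every redirect target.
    final List<String> redirtList = new ArrayList<String>();
    DefaultHttpClient httpClient = new DefaultHttpClient();
    httpClient.setRedirectStrategy(new DefaultRedirectStrategy() {
        @Override
        public HttpUriRequest getRedirect(HttpRequest request,
                                          HttpResponse response,
                                          HttpContext context) throws ProtocolException {
            HttpUriRequest redirect = super.getRedirect(request, response, context);
            redirtList.add(redirect.getURI().toString()); // record every hop
            System.out.printf("\t[Test] Redirect to '%s'\n", redirect.getURI());
            return redirect;
        }

        @Override
        public boolean isRedirected(HttpRequest request, HttpResponse resp, HttpContext context) {
            boolean isRedirect = false;
            try {
                // Ask the default strategy first.
                isRedirect = super.isRedirected(request, resp, context);
            } catch (ProtocolException e) {
                e.printStackTrace();
            }
            if (!isRedirect) {
                // Force a redirect on 301/302 even when the default strategy declines.
                int responseCode = resp.getStatusLine().getStatusCode();
                isRedirect = (responseCode == 301 || responseCode == 302);
            }
            return isRedirect;
        }
    });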
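With this configuration, httpClient automatically follows redirects to the final page when it executes an HttpUriRequest. For convenience, the helper functions below are used in the rest of this post. The first one determines the charset of a page:
    // Requires org.apache.http.entity.ContentType and java.nio.charset.Charset.
    public static String Charset(HttpResponse resp, String def)
    {
        ContentType contentType = ContentType.getOrDefault(resp.getEntity());
        Charset charSet = contentType.getCharset();
        // Fall back to the caller's default when the response declares no charset.
        if (charSet != null) return charSet.name();
        else return def;
    }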
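To illustrate what this helper boils down to, here is a minimal sketch that reads the charset parameter straight from a Content-Type header value using only the JDK. It is not how ContentType.getOrDefault() is implemented; the class name ContentTypeCharset and the method charsetOf are hypothetical names chosen for this example.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, JDK only: extracts the charset parameter from a
// Content-Type header value such as "text/html; charset=utf-8".
final class ContentTypeCharset {
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the canonical charset name, or def when the header is missing,
    // has no charset parameter, or names a charset this JVM does not support.
    static String charsetOf(String contentType, String def) {
        if (contentType == null) {
            return def;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (m.find() && isSupported(m.group(1))) {
            return Charset.forName(m.group(1)).name();
        }
        return def;
    }

    private static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false; // syntactically illegal charset name
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", "ISO-8859-1"));
        System.out.println(charsetOf("text/html", "utf-8"));
    }
}
```

Like the helper above, the sketch returns the caller's default instead of failing when no usable charset is declared.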
Next, a helper that reads the page content from an HttpResponse:
    // Requires java.io.*, java.util.zip.GZIPInputStream, org.apache.http.Header
    // and org.apache.http.util.EntityUtils.
    public synchronized static String GetPageBody(HttpResponse resp, String cs) throws IOException {
        // String cs = Charset(resp, "utf-8");
        Header contentEncoding = resp.getFirstHeader("Content-Encoding");
        if (contentEncoding != null && contentEncoding.getValue().equalsIgnoreCase("gzip")) {
            // The body is gzip-compressed: decompress it while reading line by line.
            BufferedReader br = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(resp.getEntity().getContent()), cs));
            try {
                StringBuilder pageBodyBuf = new StringBuilder();
                String line;
                while ((line = br.readLine()) != null) pageBodyBuf.append(line).append('\n');
                return pageBodyBuf.toString().trim();
            } finally {
                br.close(); // closing the stream also releases the connection
            }
        } else {
            return EntityUtils.toString(resp.getEntity(), cs);
        }
    }
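The gzip branch can be tried without a server. The following is a minimal sketch, using only the JDK, that compresses a string in memory and decodes it again the way the helper decodes a gzip-encoded response body. GzipBodyDemo, decodeBody and gzip are hypothetical names for this example and are not part of HttpClient.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: decodes an in-memory body, optionally gzip-compressed.
final class GzipBodyDemo {
    // Decodes body into text; when gzipped is true the bytes are decompressed first.
    static String decodeBody(byte[] body, boolean gzipped, Charset charset) {
        try (InputStream in = gzipped
                ? new GZIPInputStream(new ByteArrayInputStream(body))
                : new ByteArrayInputStream(body)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: gzip-compresses text so the decoder has something to read.
    static byte[] gzip(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(charset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("<html>hello</html>", StandardCharsets.UTF_8);
        System.out.println(decodeBody(compressed, true, StandardCharsets.UTF_8));
    }
}
```

Unlike GetPageBody, this sketch copies raw bytes, so it leaves line endings and surrounding whitespace untouched.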
You can then fetch the content of "http://pp.ceromulta.cl/" with the following code:
    HttpGet get = new HttpGet("http://pp.ceromulta.cl/");
    HttpResponse resp = httpClient.execute(get);
    String pageBody = HttpKit.GetPageBody(resp, "utf-8");
    System.out.printf("\t[Info] Page Body:\n%s\n", pageBody);
Output:
[Info] Page Body:
301 Moved Permanently


<meta http-equiv="refresh" content="0; URL=Login.php?webapps/mpp/home">
 
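The key point is the meta http-equiv="refresh" tag in the output above, which tells a browser to redirect automatically. HttpClient does not follow this kind of redirect, because it is declared in the HTML body rather than signaled by the HTTP status code and Location header. The sample code below shows how to handle it.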

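Sample code:
First, define a Pattern that captures this syntax:
    // Case-insensitive; group(1) is the refresh target, which ends at the closing quote.
    Pattern redirPtn = Pattern.compile("<meta\\s+http-equiv=\"refresh\"[^>]*?URL=([^\"'>\\s]+)", Pattern.CASE_INSENSITIVE);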
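As a quick check of what such a pattern captures, here is a small self-contained sketch. The class MetaRefreshParser and the method extractRefreshUrl are hypothetical names for this example, and the regular expression is one possible variant: it ignores case and tolerates single or double quotes around the URL, but it still assumes that http-equiv appears before content inside the tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the target URL out of a meta refresh tag.
final class MetaRefreshParser {
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*?url\\s*=\\s*[\"']?([^\"'\\s>]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the refresh target as written in the page, or null when there is none.
    static String extractRefreshUrl(String html) {
        Matcher m = META_REFRESH.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<meta http-equiv=\"refresh\" content=\"0; URL=Login.php?webapps/mpp/home\">";
        System.out.println(extractRefreshUrl(html)); // prints Login.php?webapps/mpp/home
    }
}
```

A regular expression is only a pragmatic shortcut here; pages with unusual attribute order or markup would need a real HTML parser.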

Then, whenever the page contains a meta refresh, a while loop requests the refresh URL and keeps doing so until the resulting page no longer contains a meta refresh. The complete code:
    // Requires java.net.URI, java.util.regex.*, and the HttpClient classes used below.
    Pattern redirPtn = Pattern.compile("<meta\\s+http-equiv=\"refresh\"[^>]*?URL=([^\"'>\\s]+)", Pattern.CASE_INSENSITIVE);
    HttpContext localContext = new BasicHttpContext();
    HttpGet get = new HttpGet("http://pp.ceromulta.cl/");
    HttpResponse resp = httpClient.execute(get, localContext);
    String pageBody = GetPageBody(resp, "utf-8");
    Matcher mth = redirPtn.matcher(pageBody);
    while (mth.find())
    {
        String rurl = mth.group(1);
        if (!rurl.startsWith("http:") && !rurl.startsWith("https:")) {
            // Relative target: rebuild the URL of the page actually served (after any 301/302 hops).
            HttpUriRequest currentReq = (HttpUriRequest) localContext.getAttribute(ExecutionContext.HTTP_REQUEST);
            HttpHost currentHost = (HttpHost) localContext.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
            String rdu = (currentReq.getURI().isAbsolute()) ? currentReq.getURI().toString() : (currentHost.toURI() + currentReq.getURI());
            // Resolve against the current page URL instead of concatenating strings.
            rurl = URI.create(rdu).resolve(rurl).toString();
        }
        System.out.printf("\t[Info] Refresh URL=%s...\n", rurl);
        get = new HttpGet(rurl);
        pageBody = GetPageBody(httpClient.execute(get, localContext), "utf-8");
        mth = redirPtn.matcher(pageBody);
    }
    System.out.printf("\t[Info] Final Page Body:\n%s\n", pageBody);
The output is shown below. There are three redirects in total: the first two are HTTP redirects (301, 302), and the last one goes through meta refresh:
[Test] Redirect to 'http://pp.ceromulta.cl/d3993414284beca7c9d20faab65c690f' # first redirect
[Test] Redirect to 'http://pp.ceromulta.cl/d3993414284beca7c9d20faab65c690f/' # second redirect
[Info] Refresh URL=http://pp.ceromulta.cl/d3993414284beca7c9d20faab65c690f/Login.php?webapps/mpp/home... # third redirect
[Info] Final Page Body:


...
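A loop that follows meta refresh never terminates if two pages refresh to each other. The sketch below shows one way to bound the loop. It is an illustration under stated assumptions, not the code used in this post: the HTTP call is replaced by a hypothetical fetcher function that maps a URL to its page body, MetaRefreshFollower and follow are made-up names, and the hop limit is an arbitrary parameter. Relative targets are resolved with java.net.URI.resolve.

```java
import java.net.URI;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: follows meta refresh hops with an upper bound.
final class MetaRefreshFollower {
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*?url\\s*=\\s*[\"']?([^\"'\\s>]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the URL of the first page that has no meta refresh.
    // fetcher stands in for the HTTP client: it maps an absolute URL to a page body.
    // Throws IllegalStateException when more than maxHops pages were fetched.
    static String follow(String startUrl, Function<String, String> fetcher, int maxHops) {
        String url = startUrl;
        for (int hop = 0; hop < maxHops; hop++) {
            String body = fetcher.apply(url);
            Matcher m = META_REFRESH.matcher(body == null ? "" : body);
            if (!m.find()) {
                return url; // no further refresh: this is the final page
            }
            // Resolve relative targets against the URL of the page just fetched.
            url = URI.create(url).resolve(m.group(1)).toString();
        }
        throw new IllegalStateException("Too many meta refresh hops, last target: " + url);
    }
}
```

With HttpClient, the fetcher would wrap httpClient.execute() and GetPageBody(); in a test it can simply be a lookup in a map of canned pages.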

Supplement: 
Meta 十大功用 (Ten Uses of the Meta Tag)
