Wednesday, July 26, 2017

[ Python FAQ ] Web Crawler HTTP Error 403: Forbidden

Source From Here 
Question 
Consider the code below: 
>>> import urllib2 
>>> url = 'https://threatpost.com/threatpost-news-wrap-june-23-2017/126503/' 
>>> resp = urllib2.urlopen(url) 
Traceback (most recent call last): 
  File "<stdin>", line 1, in <module> 
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen 
    return opener.open(url, data, timeout) 
  File "/usr/lib64/python2.7/urllib2.py", line 437, in open 
    response = meth(req, response) 
  File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response 
    'http', request, response, code, msg, hdrs) 
  File "/usr/lib64/python2.7/urllib2.py", line 475, in error 
    return self._call_chain(*args) 
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain 
    result = func(*args) 
  File "/usr/lib64/python2.7/urllib2.py", line 558, in http_error_default 
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

How-To 
This is probably because of mod_security or a similar server security feature that blocks known spider/bot user agents (urllib2 identifies itself with a User-Agent string like Python-urllib/2.7, which is easy to detect). Check the sample code below: 
>>> headers = {'User-Agent': 'Mozilla/5.0'} 
>>> req = urllib2.Request(url, headers=headers)  # use the headers dict to pretend to be a 'Mozilla' agent 
>>> resp = urllib2.urlopen(req) 
>>> resp.code 
200 
>>> resp.msg 
'OK' 
>>> resp.headers.keys() 
['x-xss-protection', 'x-cache', 'content-security-policy', 'x-content-type-options', 'transfer-encoding', 'strict-transport-security', 'vary', 'x-cache-group', 'x-cacheable', 'server', 'x-pass-why', 'connection', 'link', 'x-ua-compatible', 'cache-control', 'date', 'x-frame-options', 'x-type', 'content-type'] 
>>> page = resp.read()  # read the content of the HTML page 
>>> resp.close()
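
If you need to fetch several pages, an alternative (not from the original post, just a sketch using the same urllib2 APIs) is to set the User-Agent once on an opener object instead of on every Request; on Python 2.7 the default header urllib2 would otherwise send looks like Python-urllib/2.7, which is exactly what such servers block: 
>>> import urllib2 
>>> opener = urllib2.build_opener() 
>>> opener.addheaders                 # the default header servers can easily detect 
[('User-agent', 'Python-urllib/2.7')] 
>>> opener.addheaders = [('User-Agent', 'Mozilla/5.0')]  # override it once for every request 
>>> resp = opener.open(url)           # url as defined in the Question section 
>>> resp.code 
200 
>>> resp.close()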


Supplement 
* Stackoverflow - HTTP error 403 in Python 3 Web Scraping 
  from urllib.request import Request, urlopen  
  req = Request('http://www.cmegroup.com/trading/products/', headers={'User-Agent': 'Mozilla/5.0'})  
  webpage = urlopen(req).read()  

* urllib3 - User Guide
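
For completeness, the same User-Agent trick with urllib3 (the library covered by the User Guide linked above) looks roughly like this; it is only a sketch, assuming urllib3 is installed and reusing the threatpost url from the Question: 
>>> import urllib3 
>>> http = urllib3.PoolManager()      # manages and reuses HTTP connections 
>>> resp = http.request('GET', url, headers={'User-Agent': 'Mozilla/5.0'}) 
>>> resp.status 
200 
>>> page = resp.data                  # raw bytes of the HTML page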

