程式扎記: [ Python FAQ ] Web Crawler HTTP Error 403: Forbidden

Wednesday, July 26, 2017

Source From Here
Question
Consider the following code:

>>> import urllib2
>>> url = 'https://threatpost.com/threatpost-news-wrap-june-23-2017/126503/'
>>> resp = urllib2.urlopen(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

How-To
This is probably caused by mod_security or a similar server-side security feature that blocks known spider/bot user agents. By default urllib2 identifies itself with a User-Agent such as Python-urllib/2.7, which is easily detected. Sending a browser-like User-Agent header works around the block:

>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> req = urllib2.Request(url, headers=headers)  # Use the header to present the request as a 'Mozilla' agent
>>> resp = urllib2.urlopen(req)
>>> resp.code
200
>>> resp.msg
'OK'
>>> resp.headers.keys()
['x-xss-protection', 'x-cache', 'content-security-policy', 'x-content-type-options', 'transfer-encoding', 'strict-transport-security', 'vary', 'x-cache-group', 'x-cacheable', 'server', 'x-pass-why', 'connection', 'link', 'x-ua-compatible', 'cache-control', 'date', 'x-frame-options', 'x-type', 'content-type']
>>> page = resp.read()  # Read the content of the HTML page
>>> resp.close()
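
To see what the server sees, it can help to look at the default header before overriding it. The following is a minimal sketch in Python 3 (urllib.request rather than the urllib2 session above); make_request is a hypothetical helper name and https://example.com/ is a placeholder URL, neither comes from the original session. Nothing here touches the network.

```python
from urllib.request import Request, build_opener

# A freshly built opener carries the library's default headers,
# including the User-agent that bot filters tend to recognise.
default_headers = dict(build_opener().addheaders)
print(default_headers["User-agent"])  # a string starting with "Python-urllib/"


def make_request(url, user_agent="Mozilla/5.0"):
    """Return a Request carrying a browser-like User-Agent (hypothetical helper)."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder URL, used only to build the object; no request is sent.
req = make_request("https://example.com/")

# Request stores header names capitalised, so look it up as "User-agent".
print(req.get_header("User-agent"))
```

A header set on the Request takes precedence over the opener's default of the same name, which is why passing headers= is enough here.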

Supplement
* Stack Overflow - HTTP error 403 in Python 3 Web Scraping


from urllib.request import Request, urlopen

req = Request('http://www.cmegroup.com/trading/products/', headers={'User-Agent': 'Mozilla/5.0'})

webpage = urlopen(req).read()

* urllib3 - User Guide
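
A crawler usually also wants to survive a 403 instead of dying with a traceback. The sketch below wraps the Python 3 pattern from the Stack Overflow answer in a small function that catches HTTPError. The name fetch, its parameters, and the opener argument (there only so the function can be exercised without a network connection) are assumptions for illustration, not part of the original answer.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch(url, user_agent="Mozilla/5.0", timeout=10, opener=urlopen):
    """Return (status, body) on success, or (error_code, None) on an HTTP error.

    `opener` defaults to urllib's urlopen; it is a hypothetical hook that
    lets a stand-in be swapped in for testing.
    """
    req = Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except HTTPError as err:
        # 403, 404, 500, ... all arrive here; the caller decides what to do.
        return err.code, None
```

Called as status, body = fetch(url), the caller can check status == 403 and then retry with different headers, back off, or skip the page.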
