Source From Here
Question
Check the code below:
>>> import urllib2
>>> url = 'https://threatpost.com/threatpost-news-wrap-june-23-2017/126503/'
>>> resp = urllib2.urlopen(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
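How-To
This is probably caused by mod_security or a similar server-side security feature that blocks known spider/bot user agents. By default urllib2 identifies itself with a User-Agent such as Python-urllib/2.7, which is easily detected. Sending a browser-like User-Agent header usually gets past this kind of filter. Check the sample code below: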
>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> req = urllib2.Request(url, headers=headers)  # Use the header to pretend to be a 'Mozilla' agent
>>> resp = urllib2.urlopen(req)
>>> resp.code
200
>>> resp.msg
'OK'
>>> resp.headers.keys()
['x-xss-protection', 'x-cache', 'content-security-policy', 'x-content-type-options', 'transfer-encoding', 'strict-transport-security', 'vary', 'x-cache-group', 'x-cacheable', 'server', 'x-pass-why', 'connection', 'link', 'x-ua-compatible', 'cache-control', 'date', 'x-frame-options', 'x-type', 'content-type']
>>> page = resp.read()  # Read the content of the HTML page
>>> resp.close()
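For Python 3, where urllib2 was split into urllib.request and urllib.error, the same idea can be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the helper names default_user_agent, build_request and fetch are hypothetical (they do not come from the original post), and the 10-second timeout is an arbitrary assumption.

```python
from urllib.request import Request, build_opener, urlopen
from urllib.error import HTTPError


def default_user_agent():
    """Return the User-Agent string urllib sends when none is supplied."""
    # A fresh opener carries one default header: ('User-agent', 'Python-urllib/X.Y').
    return dict(build_opener().addheaders)["User-agent"]


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a Request that announces a browser-like User-Agent (hypothetical helper)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent="Mozilla/5.0", timeout=10):
    """Return (status, body); body is None when the server answers with an HTTP error."""
    req = build_request(url, user_agent)
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except HTTPError as err:
        # Still getting 403 here would suggest the server filters on more than the User-Agent.
        return err.code, None
```

Calling default_user_agent() shows the string a server-side filter would see by default, and fetch(url) can then be tried against the URL from the question. Returning the status code instead of raising keeps the calling code simple, at the cost of hiding the error body.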
Supplement
* Stackoverflow - HTTP error 403 in Python 3 Web Scraping
    from urllib.request import Request, urlopen
    req = Request('http://www.cmegroup.com/trading/products/', headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
* urllib3 - User Guide