Tuesday, July 18, 2017

[ Python FAQ ] Python HTMLParser: UnicodeDecodeError

Source From Here 
Question 
I'm using HTMLParser to parse pages I download with urllib, and I get UnicodeDecodeError exceptions when I pass some of them to HTMLParser. The content is there if I just print the page.
    from HTMLParser import HTMLParser
    import urllib
    import chardet

    class search_youtube(HTMLParser):

        def __init__(self, search_terms):
            HTMLParser.__init__(self)
            self.track_ids = []
            for search in search_terms:
                self.__in_result = False
                search = urllib.quote_plus(search)
                query = 'http://youtube.com/results?search_query='
                # read() returns a raw byte string, not unicode
                page = urllib.urlopen(query + search).read()
                try:
                    self.feed(page)
                except UnicodeDecodeError:
                    # Fallback: guess the encoding and drop non-ASCII characters
                    encoding = chardet.detect(page)['encoding']
                    if encoding != 'unicode':
                        page = page.decode(encoding)
                        page = page.encode('ascii', 'ignore')
                    self.feed(page)
                    print 'success'

    searches = ['telepopmusik breathe']
    results = search_youtube(searches)
    print results.track_ids
Here's the output: 
    Traceback (most recent call last):
      File "test.py", line 27, in <module>
        results = search_youtube(searches)
      File "test.py", line 23, in __init__
        self.feed(page)
      File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python2.6/HTMLParser.py", line 252, in parse_starttag
        attrvalue = self.unescape(attrvalue)
      File "/usr/lib/python2.6/HTMLParser.py", line 390, in unescape
        return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
      File "/usr/lib/python2.6/re.py", line 151, in sub
        return _compile(pattern, 0).sub(repl, string, count)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
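The final line of the traceback can be reproduced in isolation, without HTMLParser or a network connection. The following is a minimal sketch, assuming the failing attribute value contained UTF-8 text; the sample bytes and the helper `try_decode` are illustrative and are not taken from the actual page. It should run on both Python 2 and Python 3.

```python
# Hypothetical sample: "Telepopmusik" spelled with e-acute, encoded as UTF-8.
# Each e-acute becomes the two bytes 0xC3 0xA9.
raw = b'T\xc3\xa9l\xc3\xa9popmusik'


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error message) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


# Decoding as ASCII fails at the first byte outside range(128).
ascii_text, ascii_error = try_decode(raw, 'ascii')

# Decoding with the real encoding succeeds.
utf8_text, utf8_error = try_decode(raw, 'utf-8')

print(ascii_error)
print(utf8_text == u'T\xe9l\xe9popmusik')
```

The ASCII attempt fails on byte 0xc3, the lead byte of a two-byte UTF-8 sequence, which is the same kind of failure the traceback reports.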
How-To 
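The page is indeed UTF-8. The error occurs because the raw byte string is fed to HTMLParser, which mixes it with unicode strings when it unescapes entities in attribute values, so Python 2 implicitly tries to decode the bytes as ASCII. Decoding the page to unicode before calling feed() avoids this. This works: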
    from HTMLParser import HTMLParser
    import urllib

    class search_youtube(HTMLParser):

        def __init__(self, search_terms):
            HTMLParser.__init__(self)
            self.track_ids = []
            for search in search_terms:
                self.__in_result = False
                search = urllib.quote_plus(search)
                query = 'http://youtube.com/results?search_query='
                connection = urllib.urlopen(query + search)
                # Charset declared in the response's Content-Type header, if any
                encoding = connection.headers.getparam('charset')
                if encoding:
                    # Decode the bytes to unicode before parsing
                    page = connection.read().decode(encoding)
                else:
                    page = connection.read()
                self.feed(page)
                print 'success'

    searches = ['telepopmusik breathe']
    results = search_youtube(searches)
    print results.track_ids
You don't need chardet: YouTube actually sends the correct encoding in the Content-Type header.
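The same idea carries over to Python 3, where the parser lives in `html.parser` and response bodies are always bytes. The following is a rough sketch under that assumption, not a port of the code above: `charset_from_content_type`, `LinkCollector`, and `parse_page` are hypothetical names, the UTF-8 fallback is an assumed default, and a canned response stands in for the real request so the example runs offline.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(content_type, default='utf-8'):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = content_type
    # Falls back to `default` (an assumption here) when no charset is declared.
    return msg.get_param('charset', default)


class LinkCollector(HTMLParser):
    """Hypothetical parser that records the href of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.hrefs.append(value)


def parse_page(raw_bytes, content_type):
    """Decode the body with the header charset, then feed text to the parser."""
    encoding = charset_from_content_type(content_type)
    parser = LinkCollector()
    parser.feed(raw_bytes.decode(encoding))
    parser.close()
    return parser.hrefs


# Canned response body and header; no network access is needed.
body = u'<a href="/watch?v=abc&amp;q=T\xe9l\xe9popmusik">link</a>'.encode('utf-8')
hrefs = parse_page(body, 'text/html; charset=utf-8')
```

With a live response from `urllib.request.urlopen`, the header object is itself an `email.message.Message` subclass, so the charset can usually be read from it directly instead of rebuilding a message as this sketch does.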
