Saturday, March 16, 2019

[ Python FAQ ] Parsing relative and absolute links with Python

Source From Here 
Question 
Given a test HTML source, I want to first extract a relative link address from it and then find its absolute path.

How-To 
Use urllib.parse.urljoin() to join the base URL with each src value. The example below uses requests and BeautifulSoup:
- test.py 
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    # Site root used to resolve relative links, and the page to scan
    base_url = 'http://www.ragalahari.com'
    url = 'http://www.ragalahari.com/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx'

    # Download the page and parse it with the built-in HTML parser
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    # Visit every <img> tag that has a src attribute
    for img in soup.find_all('img', src=True):
        src = img.get('src')
        # Links that do not start with 'http' are joined with the site root
        if not src.startswith('http'):
            src = urljoin(base_url, src)

        print(src)
Running the script prints the src of each image on the page as an absolute URL, one per line.
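
Separately from the live page above, here is a minimal offline sketch of how urljoin() treats different kinds of src values. It is only an illustration under stated assumptions: the page URL, the HTML snippet, and the names ImgSrcCollector and absolute_image_urls are all hypothetical, and the standard library's html.parser stands in for BeautifulSoup so that nothing needs to be downloaded. Note that this sketch resolves links against the full page URL rather than the site root, which matters for links such as '../c.jpg'.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page URL and markup, used only for illustration.
PAGE_URL = 'http://www.example.com/gallery/2019/index.html'
HTML = '''
<img src="images/a.jpg">
<img src="/static/b.jpg">
<img src="../c.jpg">
<img src="http://cdn.example.org/d.jpg">
<img alt="no src here">
'''


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def absolute_image_urls(page_url, html):
    """Return every image src in html, resolved against page_url."""
    collector = ImgSrcCollector()
    collector.feed(html)
    # urljoin() returns absolute URLs unchanged, so no startswith() check is needed.
    return [urljoin(page_url, src) for src in collector.sources]


if __name__ == '__main__':
    for image_url in absolute_image_urls(PAGE_URL, HTML):
        print(image_url)
```

With these assumed inputs, the page-relative link resolves under /gallery/2019/, the root-relative link resolves from the host root, '../c.jpg' resolves under /gallery/, and the link that is already absolute passes through untouched.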

