程式扎記: [ Python FAQ ] Parse the relative link and absolute link using Python

Saturday, March 16, 2019

Source From Here
Question
Here is a sample piece of HTML source:

<img src='/test/abc.com' />

I want to extract the relative link address first:

/test/abc.com  

and then resolve it to its absolute URL.

How-To
Use urllib.parse.urljoin() to join the base URL with the src value. The example below uses requests and BeautifulSoup:
- test.py

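from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Site root used as the base when resolving relative links
base_url = 'http://www.ragalahari.com'
url = 'http://www.ragalahari.com/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx'

# Download the page and parse it with the built-in HTML parser
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Visit every <img> tag that carries a src attribute
for img in soup.find_all('img', src=True):
    src = img.get('src')
    # Anything that does not already start with http is joined to the base
    if not src.startswith('http'):
        src = urljoin(base_url, src)

    print(src)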

Running the script prints:

https://d5nxst8fruw4z.cloudfront.net/atrk.gif?account=gcxwp1IWh910mh
http://icdn.raagalahari.com/ragalaharilogo.png
http://imgcdn.raagalahari.com/nov2014/starzone/kaj...jal-agarwal-memu-saitham1t.jpg
http://imgcdn.raagalahari.com/nov2014/starzone/kaj...jal-agarwal-memu-saitham2t.jpg
http://imgcdn.raagalahari.com/nov2014/starzone/kaj...jal-agarwal-memu-saitham3t.jpg
...
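
To make the behavior of urljoin() easier to see without fetching a page, here is a minimal sketch that resolves several kinds of src values against one base. The page URL, the example.com hosts, and the resolve() helper are assumptions made up for illustration; only urllib.parse.urljoin() comes from the standard library.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only for illustration
page_url = 'http://example.com/gallery/2019/index.html'


def resolve(src, base=page_url):
    """Return src as an absolute URL, resolved against base (hypothetical helper)."""
    return urljoin(base, src)


# Root-relative path, like the one in the question: replaces the whole path
print(resolve('/test/abc.com'))
# -> http://example.com/test/abc.com

# Document-relative path: resolved against the directory of the page
print(resolve('thumbs/a.jpg'))
# -> http://example.com/gallery/2019/thumbs/a.jpg

# Parent-directory reference
print(resolve('../b.jpg'))
# -> http://example.com/gallery/b.jpg

# Protocol-relative URL: keeps the host, inherits the scheme of the base
print(resolve('//cdn.example.com/c.png'))
# -> http://cdn.example.com/c.png

# Already absolute: returned unchanged
print(resolve('https://other.example.org/d.gif'))
# -> https://other.example.org/d.gif
```

Two things follow from this. urljoin() returns an absolute URL unchanged, so a startswith('http') check before calling it is optional. And document-relative paths such as thumbs/a.jpg resolve differently depending on whether the base is the site root or the URL of the page itself, so passing the page URL is usually the safer choice when a page may contain such paths.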
