程式扎記

Source From Here
Question
Python has str.find() and str.rfind() to get the index of a substring in a string.

I'm wondering whether there is something like str.find_all() which can return all found indexes (not only the first from the beginning or the first from the end). For example:

string = "test test test test"

print(string.find('test'))   # 0
print(string.rfind('test'))  # 15

# this is the goal (str has no find_all method; this is the wished-for API)
print(string.find_all('test'))  # [0, 5, 10, 15]

How-To
There is no simple built-in string function that does what you're looking for, but you could use the more powerful regular expression module re:

>>> import re
>>> ts = 'test test test test'
>>> [m.start() for m in re.finditer('test', ts)]
[0, 5, 10, 15]

If you want to find overlapping matches, lookahead will do that:

>>> [m.start() for m in re.finditer('(?=tt)', 'ttt')]
[0, 1]
>>> [m.start() for m in re.finditer('tt', 'ttt')]
[0]
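
If you would rather not build a regular expression at all (for example because the substring contains characters such as "." or "+"), the same result can be had by calling str.find() in a loop. The following is only a minimal sketch; the function name find_all and its overlapping flag are made up for illustration and are not part of the standard library:

```python
def find_all(text, sub, overlapping=False):
    """Return the start index of every occurrence of sub in text."""
    if not sub:
        raise ValueError("sub must not be empty")
    # Move past the whole match for non-overlapping results,
    # or by a single character to allow overlapping matches.
    step = 1 if overlapping else len(sub)
    indexes = []
    start = text.find(sub)
    while start != -1:
        indexes.append(start)
        start = text.find(sub, start + step)
    return indexes


print(find_all("test test test test", "test"))  # [0, 5, 10, 15]
print(find_all("ttt", "tt"))                    # [0]
print(find_all("ttt", "tt", overlapping=True))  # [0, 1]
print(find_all("a.b.c", "."))                   # [1, 3]
```

Because str.find() treats the substring literally, no escaping is needed, which is the main reason to prefer this over a hand-written pattern.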

If you want a reverse find-all without overlaps, you can combine positive and negative lookahead into an expression like this:

>>> search = 'tt'
>>> [m.start() for m in re.finditer('(?=%s)(?!.{1,%d}%s)' % (search, len(search)-1, search), 'ttt')]
[1]

re.finditer returns an iterator, so you could change the [] in the above to () to get a generator expression instead of a list, which is more memory-efficient if you're only iterating through the results once.
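
As a rough sketch of that lazy style, the helper below wraps re.finditer in a generator expression and escapes the substring first, so that it is matched literally. The name iter_find_all is hypothetical and used here only for illustration:

```python
import re


def iter_find_all(text, sub):
    """Lazily yield the start index of each non-overlapping match of sub."""
    # re.escape makes characters such as "+" or "." match literally.
    pattern = re.compile(re.escape(sub))
    return (m.start() for m in pattern.finditer(text))


positions = iter_find_all("1+1=2, 1+1", "1+1")
print(next(positions))  # 0
print(list(positions))  # [7]
```

Nothing is searched until the generator is consumed, so stopping after the first few matches avoids scanning the rest of the string.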

Source From Here
Preface
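Python is known for getting the same work done with more concise syntax than many other languages. The unpacking techniques in this post let you pull elements out of iterable data types such as list, tuple, and dictionary with concise, efficient syntax, so you can move on to more complex operations. As the name suggests, unpacking means taking packed data back out. This post covers five usage scenarios:

* List Unpacking
* Tuple Unpacking
* For-Loop Unpacking
* Swapping Variables Value
* Unpacking Operator

List Unpacking
Normally, when we want to read the items of a list and use them individually, we index each one, as in this example: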

>>> names = ["Mike", "Peter", "John"]
>>> print("first={}; second={}; third={}".format(names[0], names[1], names[2]))
first=Mike; second=Peter; third=John

This style gets tedious as the number of variables grows. Instead, we can use unpacking to assign the items of the list to several variables at once:

>>> first, second, third = names
>>> print(f'first={first}; second={second}; third={third}')
first=Mike; second=Peter; third=John

Much more concise. One thing to watch for: the number of items in the list must match the number of variables, otherwise an exception is raised:

>>> first, second = names
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack (expected 2)

When the list holds many items, you can name only the variables you need and use the * symbol to pack the remaining elements:

>>> letters = list("ABCDRY")
>>> first, second, *others = letters
>>> print(f"first={first}; second={second}; others={others}")
first=A; second=B; others=['C', 'D', 'R', 'Y']

A variation is to take the first and the last element of the list and pack everything in between with the * symbol:

>>> first, *others, last = letters
>>> print(f"first={first}; others={others}; last={last}")
first=A; others=['B', 'C', 'D', 'R']; last=Y
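
A few edge cases of the starred target are worth seeing in code. The sketch below is illustrative only; the record string is made-up sample data:

```python
letters = list("ABCDRY")

# The starred target is always a list, and it may be empty.
first, *middle, last = letters[:2]
print(middle)  # []

# "_" is a conventional throwaway name for values you do not need.
head, *_ = letters
print(head)  # A

# Hypothetical CSV-style record: keep two fields, pack the rest.
record = "Mike,170,65,Taipei"
name, height, *rest = record.split(",")
print(name, height, rest)  # Mike 170 ['65', 'Taipei']

# The plain targets still each need a value, so this raises ValueError.
try:
    first, second, third, *others = ["only", "two"]
except ValueError as err:
    print(err)
```

Note that the failed assignment leaves the earlier values of first, second, and third untouched.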

Tuple Unpacking
The list unpacking techniques work the same way for tuples: you can assign the items to several variables, and you can use the * symbol flexibly to handle larger amounts of data:

>>> names = ('Mike', 'Peter', 'John', 'Jack')
>>> first, second, third, fourth = names
>>> first, *others, last = names
>>> print(f'first={first}; others={others}; last={last}')
first=Mike; others=['Peter', 'John']; last=Jack
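
Tuple unpacking shows up most often with functions that return several values. A small sketch, where min_max is a hypothetical helper written only for this example:

```python
def min_max(values):
    """Return the smallest and largest value as a 2-tuple."""
    return min(values), max(values)


low, high = min_max([3, 1, 4, 1, 5])
print(low, high)  # 1 5

# Built-ins such as divmod() return tuples that unpack the same way.
quotient, remainder = divmod(17, 5)
print(quotient, remainder)  # 3 2

# Nested tuples can be unpacked by mirroring their shape on the left.
name, (height, weight) = ("Mike", (170, 65))
print(name, height, weight)  # Mike 170 65
```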

For-Loop Unpacking
Suppose we loop over a sequence with a Python for loop and want both the index and the value of each element. This can be done by combining the enumerate() function with unpacking. First, look at what enumerate() produces:

>>> names = ("Mike", "Peter", "John")
>>> for name in enumerate(names):
...     print(name)
...
(0, 'Mike')
(1, 'Peter')
(2, 'John')

As the output shows, enumerate() yields a tuple on every iteration, so we can unpack that tuple into two variables right in the for statement and get the index and the value at the same time:

>>> for index, name in enumerate(names):
...     print(f"{index}) {name}")
...
0) Mike
1) Peter
2) John

Another scenario is looping over a dictionary: the items() method also yields tuples, so unpacking gives us the key and the value together:

>>> heights = {"Mike": 170, "Peter": 165}
>>> for name, height in heights.items():
...     print(f"{name} with height as {height}")
...
Mike with height as 170
Peter with height as 165
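
The same idea extends to zip(), which also yields tuples. The sketch below pairs two lists and then nests the unpacking inside enumerate(); John's height is a made-up value for illustration:

```python
names = ["Mike", "Peter", "John"]
heights = [170, 165, 180]  # the last value is hypothetical sample data

lines = []
# zip() yields (name, height) tuples, so they unpack directly.
for name, height in zip(names, heights):
    lines.append(f"{name}: {height}")

# enumerate() wraps each pair again, so mirror the nesting with parentheses.
for index, (name, height) in enumerate(zip(names, heights), start=1):
    lines.append(f"{index}) {name}: {height}")

print("\n".join(lines))
```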

Swapping Variables Value
Here is another use of unpacking. To swap the values of two variables without unpacking, we would write:

>>> a = 15; b = 20
>>> c = a; a = b; b = c
>>> print(f'a={a}; b={b}')
a=20; b=15

We first define an extra variable c and assign a's value to it, then assign b's value to a, and finally assign c (a's original value) to b, which swaps the two values. With Python's unpacking, a single line is enough:

>>> a = 15; b = 20
>>> a, b = b, a
>>> print(f'a={a}; b={b}')
a=20; b=15
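
The swap works because the right-hand side is evaluated in full before any name on the left is rebound. That property carries over to more than two variables, as this small sketch shows (fib is a hypothetical helper used only to illustrate the point):

```python
# Rotate three values in one statement.
a, b, c = 1, 2, 3
a, b, c = b, c, a
print(a, b, c)  # 2 3 1


def fib(n):
    """Return the n-th Fibonacci number (fib(0) == 0)."""
    x, y = 0, 1
    for _ in range(n):
        # Both new values are computed from the old x and y.
        x, y = y, x + y
    return x


print(fib(10))  # 55
```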

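Unpacking Operator
There are two unpacking operators:

* The * operator: works on any iterable object.
* The ** operator: works only on dictionary (mapping) objects.

Their main purpose is to expand the elements of an iterable, which is very handy when building or merging collections. First, an example of the * operator: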

>>> values = [*range(10)]
>>> print(f'values: {values}')
values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> combined = [*values, *"Python"]
>>> print(f'combined: {combined}')
combined: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'P', 'y', 't', 'h', 'o', 'n']

Next, an example of the ** operator applied to dictionaries:

>>> heights = {"Mike": 170, "Peter": 165}
>>> weights = {"Mike": 65, "John": 70}
>>> combined = {**heights, **weights, "other": 1}
>>> print(f'combined: {combined}')
combined: {'Mike': 65, 'Peter': 165, 'John': 70, 'other': 1}

As the output shows, when dictionaries are merged, the value of a duplicate key is overwritten by the one that appears later: "Mike" goes from 170 to 65, so only Mike's weight is kept and not his height.
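
Both operators can also be used when calling a function, where * spreads an iterable over positional parameters and ** spreads a dictionary over keyword parameters. A minimal sketch, with describe being a hypothetical function invented for this example:

```python
def describe(name, height, weight=None):
    """Format one person's data as a short string."""
    return f"{name}: {height} cm, {weight} kg"


args = ("Mike", 170)
kwargs = {"weight": 65}

# Equivalent to describe("Mike", 170, weight=65).
print(describe(*args, **kwargs))  # Mike: 170 cm, 65 kg

# The "later key wins" rule makes ** handy for defaults plus overrides.
defaults = {"color": "blue", "size": "M"}
overrides = {"size": "L"}
settings = {**defaults, **overrides}
print(settings)  # {'color': 'blue', 'size': 'L'}
```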

Supplement
* How to format strings in Python

Source From Here
Question

How-To

#!/usr/bin/env python3
class Solution:
    def diStringMatch(self, S):
        r'''
        Examples (output of this approach):
            IDID -> 0 2 1 4 3   (0 < 2 > 1 < 4 > 3)
            III  -> 0 1 2 3
            DDI  -> 2 1 0 3     (2 > 1 > 0 < 3)

        Solution:
        Start from the ascending list 0..N, find each run of consecutive
        'D' characters and reverse the numbers covered by that run.
        For example:
        1) Find the ranges of D
           IDID => [(1, 2), (3, 4)]

        2) Reverse each D range
           (1, 2) => 01234 -> 02134
           (3, 4) => 02134 -> 02143
        '''
        s_len = len(S)
        plist = list(range(s_len + 1))

        # 1) Collect the [start, end] index pair of every run of 'D'
        rt_list = []
        i = 0
        while i < s_len:
            if S[i] == 'D':
                start = i
                # Advance i to the first position after the run of 'D'
                while i < s_len and S[i] == 'D':
                    i += 1
                rt_list.append([start, i])
            else:
                i += 1

        # 2) Reverse the slice covered by each run (end index is inclusive)
        for start, end in rt_list:
            plist[start:end + 1] = plist[start:end + 1][::-1]

        print('S={}; rt_list={}; plist={}'.format(S, rt_list, plist))
        return plist
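
For comparison, here is a sketch of a different, commonly used greedy technique for the same problem, not the run-reversal approach above: keep a low and a high counter, take low for every 'I' and high for every 'D'. It assumes the usual statement of the problem, namely that the result must be a permutation of 0..N with A[i] < A[i+1] for 'I' and A[i] > A[i+1] for 'D'. The names di_string_match and is_valid are hypothetical; is_valid can be used to check either approach:

```python
def di_string_match(s):
    """Greedy low/high construction of a permutation matching s."""
    low, high = 0, len(s)
    result = []
    for ch in s:
        if ch == 'I':
            # The smallest unused number is below everything still to come.
            result.append(low)
            low += 1
        else:
            # The largest unused number is above everything still to come.
            result.append(high)
            high -= 1
    result.append(low)  # low == high here; one number is left
    return result


def is_valid(s, perm):
    """Check that perm is a permutation of 0..len(s) that follows s."""
    if sorted(perm) != list(range(len(s) + 1)):
        return False
    return all(
        (a < b) if ch == 'I' else (a > b)
        for ch, a, b in zip(s, perm, perm[1:])
    )


print(di_string_match("IDID"))            # [0, 4, 1, 3, 2]
print(di_string_match("DDI"))             # [3, 2, 0, 1]
print(is_valid("IDID", [0, 2, 1, 4, 3]))  # True
```

Several answers are valid for the same input, which is why the checker is more useful than comparing against one fixed list.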

Supplement
* FAQ - How do I reverse a part (slice) of a list in Python?

>>> alist = list(range(10))
>>> alist
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> alist[1:5] = alist[1:5][::-1] # Reverse 1, 2, 3, 4 -> 4, 3, 2, 1
>>> alist
[0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

Source From Here
Preface
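This section continues from "Learning Python the Easy Way: Scraping Website Data via APIs" and "Learning Python the Easy Way: Scraping Website Data by Parsing HTML". It discusses how to use Python with a third kind of source: web pages, using the selenium package to control a browser and retrieve data in HTML (HyperText Markup Language) format. Besides controlling the browser, selenium has built-in data-locating functions based on XPath (a way of locating data in XML/HTML documents by node path) and CSS selectors (a way of locating data in HTML documents with stylesheet selectors).

The complete Jupyter Notebook is available here.

About Selenium
We turn to Selenium, which can control a browser. It is a browser automation solution used mainly for testing web applications. Data science teams use it to get around problems met when scraping websites, for example sites that only show their data after you log in, fill in a form, or click a button. Python calls the browser driver through Selenium WebDriver, and the browser driver in turn controls the browser. Selenium WebDriver supports the two mainstream browsers Google Chrome and Mozilla Firefox best. To avoid problems, it is recommended to use the latest versions of the browser, the browser driver, and the module.

Download a Browser
Go to the official website and download the latest version of the browser.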

* Google Chrome
* Mozilla Firefox

Install the Browser Driver and Selenium
Go to the official websites and download the latest browser driver. The driver for Chrome is called ChromeDriver, and the driver for Firefox is called geckodriver.

* ChromeDriver
* geckodriver

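After downloading, unzip the driver to a path you know well so that it is easy to reference later. In my case, after downloading and unzipping, the executable is located at "C:\tmp\chromedriver.exe".

Next, install the Selenium module from a terminal: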

# pip install selenium

The test code below uses ChromeDriver to drive Chrome to IMDB.com, prints the URL of the home page, and then closes the browser:

from selenium import webdriver  
  
imdb_home = "https://www.imdb.com/"  
driver = webdriver.Chrome(executable_path="C:/tmp/chromedriver.exe") # Use Chrome  
driver.get(imdb_home)  
print(driver.current_url)  
driver.close()  

Here is the same example using the Firefox browser:

from selenium import webdriver  
  
imdb_home = "https://www.imdb.com/"  
driver = webdriver.Firefox(executable_path="YOURGECKODRIVERPATH") # Use Firefox  
driver.get(imdb_home)  
print(driver.current_url)  
driver.close()  

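List the Manual Steps and the Matching selenium Functions
Once we have confirmed that Python can launch both Chrome and Firefox, the next step is to list the mouse and keyboard actions we perform by hand when going from IMDB.com to a given movie's page:

1. Go to IMDB.com
2. Type the movie title into the search field
3. Click the search button
4. Narrow the search results to "Movie"
5. Click the link of the best-matching result
6. Arrive at the page of the movie we want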

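Then list the Selenium WebDriver methods we will use:

* driver.get(): go to the IMDB.com home page
* driver.find_element_by_xpath() or driver.find_element_by_css_selector(): locate the search field, the search button, and the search result link
* driver.current_url: get the URL the browser is currently on
* elem.send_keys(): type the movie title
* elem.click(): click the search button and the links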
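
The find_element_by_* helpers and the executable_path argument belong to the Selenium 3 API and were removed in later Selenium 4 releases. As a rough sketch only, the same calls in Selenium 4 style look like the following. It assumes Selenium 4 is installed and that the element id "suggestion-search", taken from this post's own code, still exists on the page, which may no longer be true. The function name search_box_demo is hypothetical:

```python
def search_box_demo(driver_path, title, url="https://www.imdb.com/"):
    """Open url in Chrome, type title into the search box, return the URL."""
    # Imported here so the sketch can be loaded even where selenium
    # is not installed; normally these go at the top of the file.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    # Selenium 4 takes the driver path through a Service object.
    driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    try:
        driver.get(url)
        # Selenium 4 spelling of driver.find_element_by_id("suggestion-search").
        search_elem = driver.find_element(By.ID, "suggestion-search")
        search_elem.send_keys(title)
        return driver.current_url
    finally:
        # quit() ends the whole browser session, even if a step above fails.
        driver.quit()
```

A call such as search_box_demo("C:/tmp/chromedriver.exe", "Avengers: Endgame") needs a matching browser and driver to be present. By.XPATH and By.CSS_SELECTOR work the same way for the other locators.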

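Install and Use the Chrome Extension: XPath Helper
Like BeautifulSoup4 and PyQuery, Selenium WebDriver can locate data with CSS selectors, and it also supports XPath. We take this opportunity to briefly show how to install and use the Chrome extension XPath Helper:

1. Go to the Chrome Web Store and choose Extensions
2. Search for XPath Helper and click to add it to Chrome
3. Confirm that you want to add XPath Helper
4. Installation complete

Follow these steps to use the XPath Helper extension:

1. Click the XPath Helper extension icon
2. Hold the Shift key and move the mouse over the element you want to locate
3. Try to shorten the XPath by trimming it from the front, watching the XPath on the left of the XPath Helper panel and the matched data on the right, until you find the shortest XPath that still matches correctly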
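
The shortening step can be tried without a browser. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset, on a small made-up fragment that imitates a search result table. The markup and the movie titles are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a search result page.
PAGE = """
<html><body>
  <div class="findSection">
    <table class="findList">
      <tr class="findResult odd"><td class="result_text"><a href="/title/tt0000001/">First Movie</a></td></tr>
      <tr class="findResult even"><td class="result_text"><a href="/title/tt0000002/">Second Movie</a></td></tr>
    </table>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# A long path that spells out every level from the root.
long_xpath = ("./body/div[@class='findSection']/table[@class='findList']"
              "/tr[@class='findResult odd']/td[@class='result_text']/a")
# Trimmed from the front: still lands on the same first link.
short_xpath = ".//td[@class='result_text']/a"

long_hit = root.find(long_xpath)
short_hit = root.find(short_xpath)
print(long_hit.text, long_hit.get("href"))
print(short_hit.text, short_hit.get("href"))
```

Both lookups print the first link. A shorter path is easier to read but matches more loosely, so check that it still selects only what you want.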

Scrape Multiple Movies with Selenium
Next we write the get_movies() function. It accepts movie titles, uses Selenium to browse to each movie's page, calls the get_movie_info() function written earlier, and finally stores the results for all movies in a Python dict keyed by movie title:

from pyquery import PyQuery as pq  
from selenium import webdriver  
from random import randint  
import time  
  
def get_movie_info(movie_url):  
    """  
    Get movie info from certain IMDB url  
    """  
    d = pq(movie_url)  
    movie_rating = float(d("strong span").text())  
    movie_genre = [x.text() for x in d(".subtext a").items()]  
    movie_release_date = movie_genre.pop()  
    movie_poster = d(".poster img").attr('src')  
    movie_cast = [x.text() for x in d(".primary_photo+ td a").items()]  
  
    # Return the collected info
    movie_info = {  
        "movieRating": movie_rating,  
        "movieReleaseDate": movie_release_date,  
        "movieGenre": movie_genre,  
        "moviePosterLink": movie_poster,  
        "movieCast": movie_cast  
    }  
    return movie_info  
  
def get_movies(*args):  
    """  
    Get multiple movies' info from movie titles  
    """  
    imdb_home = "https://www.imdb.com/"  
    driver = webdriver.Chrome(executable_path="C:/tmp/chromedriver.exe") # Use Chrome  
    # driver = webdriver.Firefox(executable_path="PATHTOYOURGECKODRIVER") # Use Firefox  
      
    movies = dict()  
    for movie_title in args:  
        # Go to the IMDB home page
        driver.get(imdb_home)
        # Locate the search field
        search_elem = driver.find_element_by_id("suggestion-search")
        # Type the movie title
        search_elem.send_keys(movie_title)
        # Locate the search button
        submit_elem = driver.find_element_by_id("suggestion-search-button")
        # Click the search button
        submit_elem.click()
        # Locate the filter that narrows the results to "Movie"
        category_movie_elem = driver.find_element_by_xpath("//ul[@class='findTitleSubfilterList']/li[1]/a")
        # Click the filter
        category_movie_elem.click()
        # Locate the link of the first search result
        first_result_elem = driver.find_element_by_xpath("//div[@class='findSection'][1]/table[@class='findList']/tbody/tr[@class='findResult odd'][1]/td[@class='result_text']/a")
        # Click the search result link
        first_result_elem.click()

        # Call get_movie_info()
        current_url = driver.current_url  
        movie_info = get_movie_info(current_url)  
        movies[movie_title] = movie_info  
        time.sleep(randint(3, 8))  
          
    driver.close()  
    return movies  

You can then test it as follows:

get_movies("Avengers: Endgame", "Captain Marvel")  

Output:

{'Avengers: Endgame': {'movieRating': 8.5,  
  'movieReleaseDate': '24 April 2019 (Taiwan)',  
  'movieGenre': ['Action', 'Adventure', 'Drama'],  
  'moviePosterLink': 'https://m.media-amazon.com/images/M/MV5BMTc5MDE2ODcwNV5BMl5BanBnXkFtZTgwMzI2NzQ2NzM@._V1_UX182_CR0,0,182,268_AL_.jpg',  
  'movieCast': ['Robert Downey Jr.',  
   'Chris Evans',  
   'Mark Ruffalo',  
   'Chris Hemsworth',  
   'Scarlett Johansson',  
   'Jeremy Renner',  
   'Don Cheadle',  
   'Paul Rudd',  
   'Benedict Cumberbatch',  
   'Chadwick Boseman',  
   'Brie Larson',  
   'Tom Holland',  
   'Karen Gillan',  
   'Zoe Saldana',  
   'Evangeline Lilly']},  
'Captain Marvel': {'movieRating': 6.9,  
  'movieReleaseDate': '6 March 2019 (Taiwan)',  
  'movieGenre': ['Action', 'Adventure', 'Sci-Fi'],  
  'moviePosterLink': 'https://m.media-amazon.com/images/M/MV5BMTE0YWFmOTMtYTU2ZS00ZTIxLWE3OTEtYTNiYzBkZjViZThiXkEyXkFqcGdeQXVyODMzMzQ4OTI@._V1_UX182_CR0,0,182,268_AL_.jpg',  
  'movieCast': ['Brie Larson',  
   'Samuel L. Jackson',  
   'Ben Mendelsohn',  
   'Jude Law',  
   'Annette Bening',  
   'Djimon Hounsou',  
   'Lee Pace',  
   'Lashana Lynch',  
   'Gemma Chan',  
   'Clark Gregg',  
   'Rune Temte',  
   'Algenis Perez Soto',  
   'Mckenna Grace',  
   'Akira Akbar',  
   'Matthew Maher']}}  
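
Since the result is a plain dict keyed by title, it is easy to post-process. The sketch below uses a trimmed copy of the output above (only the ratings and genres are kept) to rank the movies by rating and serialize the data to JSON with the standard library:

```python
import json

# Trimmed copy of the get_movies() output shown above.
movies = {
    "Avengers: Endgame": {"movieRating": 8.5,
                          "movieGenre": ["Action", "Adventure", "Drama"]},
    "Captain Marvel": {"movieRating": 6.9,
                       "movieGenre": ["Action", "Adventure", "Sci-Fi"]},
}

# Titles ordered from highest to lowest rating.
ranked = sorted(movies, key=lambda title: movies[title]["movieRating"],
                reverse=True)
print(ranked)  # ['Avengers: Endgame', 'Captain Marvel']

# ensure_ascii=False keeps any non-ASCII titles readable in the output.
as_json = json.dumps(movies, indent=2, ensure_ascii=False)
print(as_json)
```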

Supplement
* Selenium WebDriver API

Wednesday, December 25, 2019

[ Python FAQ ] How to find all occurrences of a substring?

Thursday, December 19, 2019

[ Python Article Collection ] Practical Python Unpacking Tips

Saturday, December 7, 2019

[LeetCode] Easy - 942. DI String Match

Thursday, December 5, 2019

[ Python Article Collection ] Learning Python the Easy Way: Scraping Website Data by Controlling a Browser

[Git FAQ] error: The following untracked working tree files would be overwritten by merge
