2018年11月28日 星期三

[ 常見問題 ] How to compare dates in datetime fields in Postgresql?

Source From Here 
Question 
How to make below query right while the column update_date is of type as "date/time"? 
  1. select * from table where update_date >= '2013-05-03' AND update_date <= '2013-05-03'  
How-To 
One example as below: 
  1. SELECT *  
  2. FROM table  
  3. WHERE update_date >= '2013-05-03'::date  
  4. AND update_date < ('2013-05-03'::date + '1 day'::interval);  
Supplement 
PostgreSQL Doc - Date/Time Functions and Operators

[ 文章收集 ] PostgreSQL 筆記 — 索引

Source From Here 
PostgreSQL 筆記 — 索引 
在資料庫當中,為了改進查詢效率,在資料量大的時候,可以透過建立索引讓資料庫加速查詢效率。本篇文章將會探討 PostgreSQL 的 index。雖然在大部分的情況,直接使用 CREATE INDEX 的語法已經能夠應付常見的開發場景,但是在 PostgreSQL 當中,其實還提供了許多不同的 index types 使用。 

索引的基本原理 
資料庫最重要的就是查詢資料,當資料存儲在硬碟時就要考慮如何處理 I/O 的效能問題。外接硬碟因為其本身硬體的限制,會比直接在記憶體讀取耗費更多的時間。在資料庫當中,如果沒有對資料表建立索引,將會使用 sequential scans 來查詢資料。在資料數量小的時候並沒有太明顯的效能差異,甚至還比建立索引還快,不過當資料量越來越大,sequential scans 很有可能造成讀取過慢的問題。 

索引如何解決 sequential scans 過慢的問題 
透過額外在硬碟建立一張索引表,我們可以透過索引表當作目錄,資料庫在查詢時,就會用給定的索引表尋找相對應的索引。再透過由索引指向實體資料,連結到硬碟當中的實體位址。 

B Tree 
B Tree 是一個專門為硬碟優化的平衡樹,跟一般的二元樹不同,B Tree 的樹可以有很多分支,有效減少樹的深度。B tree 的原理需要比較長的篇幅介紹,所以這邊省略介紹。 

PostgreSQL 中的索引 

建立索引 
在 PostgreSQL 中可以透過 SQL 語法來建立索引: 
  1. CREATE INDEX index_name ON table_name (column_name) USING method;  
在 PostgreSQL 中,如果沒有特別指定演算法,預設使用 B-Tree 來建立索引。也可以針對某個 column 做索引。也可以一次對多個 column 建立索引: 
# CREATE INDEX index_name ON table_name (col1, col2);
"Index Scan using index_subscribers_on_email on subscribers (cost=0.28..8.29 rows=1 width=76)"

用 EXPLAIN 來觀察 query 的效能。發現透過 B-Tree 的方式成功減少了查詢時間。 
注意: 
* 使用 CREATE INDEX 建立索引時會鎖住整張表,在資料量大時可能需要耗費不少時間。
* 建立索引時我們可以加入 CREATE INDEX CONCURRENTLY 來確保資料表不會被 lock 住。不過為了確保索引表的完整性,CONCURRENTLY 需要花更多的時間來建立索引
When this option is used, PostgreSQL will build the index without taking any locks that prevent concurrent inserts, updates, or deletes on the table; whereas a standard index build locks out writes (but not reads) on the table until it’s done. There are several caveats to be aware of when using this option — see Building Indexes Concurrently.

source


恰如其分的 index 
為了解釋資料庫使用 index 的時機,我們先對某個只有兩筆資料的 table 下 index: 
# CREATE INDEX index_users_on_email ON users (email);
# EXPLAIN SELECT email FROM users WHERE email = 'xxx@gmail.com';
"Seq Scan on users (cost=0.00..1.02 rows=1 width=32)"
" Filter: ((email)::text = 'xxx@gmail.com'::text)"

對於資料量小的資料,儘管建立了索引,資料庫仍然會使用 sequential scan 查詢。這是因為 Random I/O 的花費會比循序查詢的方式還高,因此如果資料量小時,就算建立 index 也不會增加查詢效能,反而浪費了硬碟空間。 

index size 
對一個具有 1300 多筆的資料建立 index: 
# CREATE INDEX index_subscribers_on_email on subscribers (email);
# SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size; //80K

透過 pg_relation_size 拿取相關 relation 的大小。刪除 index 可使用 SQL 如 DROP INDEX IF EXISTS index_name. 為了確保索引的完整性,在建立索引時,PostgreSQL 會將整張表做 exclusive lock,建立完索引後才 release。因此對頻繁查改的資料表來說很有可能造成服務中斷,對於筆數相當多的資料來說,建立索引有可能花上幾分鐘或是幾十分鐘。 

維護 index 
首先先在資料庫刪除幾筆資料,對於資料庫來說,預設刪除的方式並不是真正從硬碟當中釋放空間,而是標上類似「已刪除」的標記。對於索引來說,刪除資料並不會減少索引大小
# DELETE FROM subscribers WHERE id >1350;
# SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size; // 80K

這時可以透過 REINDEX 的方式重新建立索引。REINDEX INDEX index_name
# REINDEX INDEX index_subscibers_on_email;
# SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size; // 72K

重建索引會把無用的索引刪除,索引表的大小減少了。從上面簡單的測試可以得知,對於增刪頻繁的資料表來說,定期重建索引可以減少索引表的大小,如果需要刪除大量的資料,也盡量重建索引來維護索引表的大小,否則可能造成硬碟空間的浪費不過 REINDEX 也會造成鎖表,如果不允許服務維修或暫停的話,可以直接用 CREATE INDEX CONCURRENTLY 的方式重新建立一個全新的索引,再把舊的索引刪除,並且重新命名索引。這會比 REINDEX 耗費更久時間,而且也比較麻煩一些,不過可以確保資料表不會被鎖住。 

對特定資料做索引 
有些時候我們只會頻繁查詢某些特定條件的資料,這時候就不一定要對所有資料下索引。透過 CREATE INDEX index_name ON table_name WHERE ... 的方式,我們可以針對需要的資料做索引。 

索引排序 
在建立索引時如果沒有特別宣告排序,預設會使用 ASC 來建立索引。不過在某些使用情境下,我們可能更常用 DESC 的方式來排序,例如排名、文章發佈日期等等,這時使用 DESC 建立索引更有效率。 
  1. CREATE INDEX index_articles_on_published_date ON articles (published_date DESC);  
重建索引 
B-Tree索引只有在全部清空索引 page 才會被重複利用,所以在資料庫刪除某個值並不會減少索引表的大小。原本的資料為 1400 筆,刪除 50 筆,索引大小沒有變化。 
# DELETE FROM subscribers WHERE id > 1350;
# SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size; // 80K

重新建立索引 
# REINDEX INDEX index_subscibers_on_email;
# SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size;
72K

Postgre 當中的 index type 
上面介紹完了 postgre index 與使用方式,接下來介紹 Postgre 當中常見的 index type。如果在建立 index 時沒有特別指定使用的方法,PostgreSQL 會使用 B Tree 來建立索引。 

Btree 
B Tree 支援下列幾種查詢操作: 
<
<=
=
>=
>

GIN(Generalized Inverted Index) 
GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items.

GIN 被用在全文搜尋或是或是像 Array 這種可能會被。如同字面上的意思,倒排索引,就是利用 Value 當作索引值,再去找相對應出現的索引。例如: 
  1. this is cat  
  2. this is an apple.  
  3. cat meows.  
  4. // build inverted index  
  5. "this": {(00), (1,0)}  
  6. "is": {(0,1), (1,1)}  
  7. "an": {(1,2)}  
  8. "apple: {(1,3)}  
  9. "cat": {(0,2), (2,0)}  
  10. "meows": {(2,1)}  
透過下面的索引表,如果我們想要搜尋 cat 這個關鍵字,就可以從 key 為 cat 的 value 找到對應的索引。發現 cat 出現在第一個句子的第二個字以及第三個句子的第一個字。 

適用場景: 
* 全文搜尋
* 陣列

在 PostgreSQL 中使用 GIN: 
  1. CREATE INDEX index_on_document on sentences using gin(document);  
GiST(Generalized Search Tree) 
GiST 其實更算是一種通用的 interface,我們可以使用 GiST 來定義自己的 index 實現方式(B-Tree, R-Tree 等)。一般來說比較少使用到。但像是空間結構的資料,用 B-Tree 可能較難加速像是包含、相鄰、相交的搜索, 因此像是 PostGIS 就是使用 GiST 來建立對空間結構查詢效能較好的索引。 

適用場景: 
* 自己實現 index 接口
* 空間結構
* 使用 B-Tree 可能無法符合使用場景的資料結構

BRIN(Block Range Indexes) 
跟 B-Tree 對每一行做索引不同,BRIN 會對一段 range 做索引,雖然性能方面上仍然是 B-TREE 勝,但是 size 比 B-Tree 小很多。 

如果資料常常是以範圍查詢,例如 log 檔,某個範圍的交易訂單、出帳資料等等,這些資料量通常相當驚人,用 B-Tree 建立索引可能會耗費不小的硬碟空間。不過這些資料通常比較少使用精確搜尋,而是用像範圍搜尋的方式來批次處理這些資料,這時候使用 BRIN 可以得到還不錯的效能,同時節省了相當大的空間。 

適用場景: 
* Log 檔分析
* 交易訂單分析
* 資料量大,常用範圍搜尋

Bloom 
Bloom Filter 是為了解決 Hash table 的空間以及查詢時間,算法本身的原理可以相當快速地判斷某個元素是否有出現在集合,但會有 false positive 的情形出現,因此在 PostgreSQL 當中需要做二次檢查。 
  1. CREATE INDEX bloom_index_event ON events USING bloom (startTime, endTime, name, gifts)  
  2. WITH (length=80, startTime=2, endTime=2, name=4, gifts=2);  
length 為每個 signature 的長度,預設為 80,最長為 4096。後面的參數為對應的 column name 會被 mapped 到幾個 bits,最少為 2,最多為 4096。 

適用場景: 
對多個 column 的精確搜尋 SELECT * FROM events WHERE startTime=20171111 and endTime=20171231 and name=christmas and gifts="gift_special",透過 bloom filter 可以快速濾掉不在集合裡的結果。


總結 
一般來說,B-Tree 幾乎能夠處理大部分的使用情境。但 PostgreSQL 有豐富的 index 類型可供使用,讓資料能夠更容易地被查詢、索引,也給開發者更高的彈性。這是我相當喜歡 PostgreSQL 的一大原因之一。這次介紹的 index types,並沒有很深入探討背後的實作原理,但有興趣的讀者可以自行尋找相關的資源閱讀。簡單整理重點如下: 
1. 在 PostgreSQL 中有數種下 index 的演算法可供使用,每個演算法都有適合的使用情景。

2. 使用 CREATE INDEX 時會鎖住整張表,在為大量資料建立索引時很可能會造成影響。可以用 CREATE INDEX CONCURRENTLY 解決,但需要花更多時間建立索引。

3. 若資料頻繁地更新、刪除,會導致冗餘的 index。盡可能地固定使用 REINDEX 重新建立索引減少硬碟空間的使用。同時注意 REINDEX 會造成鎖表。可以再次建立 index,再把原本的 index 名稱刪除來避免鎖表

4. CREATE UNIQUE INDEX 可透過 unique 來確保值的唯一性

5. INDEX 可以透過 WHERE 條件來對特定的 column 下索引,針對特定使用情景。

6. INDEX 可以透過排序(預設為 ASC),來符合特定的使用情景(文章發佈日期等等

7. index 並不是銀彈,可以發現在小資料當中就算下 index 仍然會使用 seq scan,這是因為 Random I/O 的代價遠高於 seq can。因此對於固定且數量不多的資料(ex: 縣市、郵遞區號),建立索引並不一定能增加查詢時間,反而佔據了不必要的硬碟空間。


Supplement 
Tutorialspoint - PostgreSQL - INDEXES

2018年11月27日 星期二

[ Python 文章收集 ] Processing XML in Python with ElementTree

Source From Here 
Preface 
When it comes to parsing and manipulating XML, Python lives true to its "batteries included" motto. Looking at the sheer amount of modules and tools it makes available out of the box in the standard library can be somewhat overwhelming for programmers new to Python and/or XML. 

A few months ago an interesting discussion took place amongst the Python core developers regarding the relative merits of the XML tools Python makes available, and how to best present them to users. This article (and hopefully a couple of others that will follow) is my humble contribution, in which I plan to provide my view on which tool should be preferred and why, as well as present a friendly tutorial on how to use it. 

The code in this article is demonstrated using Python 2.7; it can be adapted for Python 3.x with very few modifications. 

Which XML library to use? 
Python has quite a few tools available in the standard library to handle XML. In this section I want to give a quick overview of the packages Python offers and explain why ElementTree is almost certainly the one you want to use. xml.dom.* modules - implement the W3C DOM API. If you're used to working with the DOM API or have some requirement to do so, this package can help you. Note that there are several modules in the xml.dom package, representing different tradeoffs between performance and expressivity. 

xml.sax.* modules - implement the SAX API, which trades convenience for speed and memory consumption. SAX is an event-based API meant to parse huge documents "on the fly" without loading them wholly into memory; xml.parser.expat - a direct, low level API to the C-based expat parser. The expat interface is based on event callbacks, similarly to SAX. But unlike SAX, the interface is non-standard and specific to the expat library. 

Finally, there's xml.etree.ElementTree (from now on, ET in short). It provides a lightweight Pythonic API, backed by an efficient C implementation, for parsing and creating XML. Compared to DOM, ET is much faster and has a more pleasant API to work with. Compared to SAX, there is ET.iterparse which also provides "on the fly" parsing without loading the whole document into memory. The performance is on par with SAX, but the API is higher level and much more convenient to use; it will be demonstrated later in the article. 

My recommendation is to always use ET for XML processing in Python, unless you have very specific needs that may call for the other solutions. 

ElementTree - one API, two implementations 
ElementTree is an API for manipulating XML, and it has two implementations in the Python standard library. One is a pure Python implementation in xml.etree.ElementTree, and the other is an accelerated C implementation in xml.etree.cElementTree (depreciated in v3.3). It's important to remember to always use the C implementation, since it is much, much faster and consumes significantly less memory. If your code can run on platforms that might not have the _elementtree extension module available [4], the import incantation you need is (For Python 2.x): 
  1. try:  
  2.     import xml.etree.cElementTree as ET  
  3. except ImportError:  
  4.     import xml.etree.ElementTree as ET  
This is a common practice in Python to choose from several implementations of the same API. Although chances are that you'll be able to get away with just importing the first module, your code may end up running on some strange platform where this will fail, so you better prepare for the possibility. Note that starting with Python 3.3, this will no longer be needed, since the ElementTreemodule will look for the C accelerator itself and fall back on the Python implementation if that's not available. So it will be sufficient to just import xml.etree.ElementTree. But until 3.3 is out and your code runs on it, just use the two-stage import presented above. 

Parsing XML into a tree 
Let's start with the basics. XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two objects for this purpose - ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading, writing, finding interesting elements) are usually done on the ElementTreelevel. Interactions with a single XML element and its sub-elements is done on the Element level. The following examples will demonstrate the main uses [5]. 

We're going to use the following XML document for the sample code: 
- test.xml 
  1. "1.0"?>  
  2.   
  3.     "testing" hash="1cdf045c">  
  •         text,source  
  •     
  •   
  •     "release01" hash="f200013e">  
  •         "subrelease01">  
  •             xml,sgml  
  •         
  •   
  •     
  •   
  •     "invalid">  
  •     
  •   
  •   Let's load and parse the document: 
    >>> import xml.etree.ElementTree as ET
    >>> tree = ET.ElementTree(file='test.xml')

    Now let's fetch the root element: 
    >>> tree.getroot()

    As expected, the root is an Element object. We can examine some of its attributes: 
    >>> root = tree.getroot()
    >>> root.tag, root.attrib
    ('doc', {})

    True, the root element has no attributes [6]. As any Elementit presents an iterable interface for going over its direct children
    >>> for child_of_root in root:
    ... print("{}, {}".format(child_of_root.tag, child_of_root.attrib))
    ...
    branch, {'name': 'testing', 'hash': '1cdf045c'}
    branch, {'name': 'release01', 'hash': 'f200013e'}
    branch, {'name': 'invalid'}

    We can also access a specific child, by index: 
    >>> root[0].tag, root[0].text
    ('branch', '\n text,source\n ')

    Finding interesting elements 
    From the examples above it's obvious how we can reach all the elements in the XML document and query them, with a simple recursive procedure (for each element, recursively visit all its children). However, since this can be a common task, ET presents some useful tools for simplifying it. 

    The Element object has an iter method that provices depth-first iteration (DFS) over all the sub-elements below it. The ElementTree object also has the iter method as a convenience, calling the root's iterHere's the simplest way to find all the elements in the document
    >>> for elem in tree.iter():
    ... print("{}, {}".format(elem.tag, elem.attrib))
    ...
    doc, {}
    branch, {'name': 'testing', 'hash': '1cdf045c'}
    branch, {'name': 'release01', 'hash': 'f200013e'}
    sub-branch, {'name': 'subrelease01'}
    branch, {'name': 'invalid'}

    This could naturally serve as a basis for arbitrary iteration of the tree - go over all elements, examine those with interesting properties. ET can make this task more convenient and efficient, however. For this purpose, the iter method accepts a tag name, and iterates only over those elements that have the required tag
    >>> for elem in tree.iter(tag='branch'):
    ... print("{}, {}".format(elem.tag, elem.attrib))
    ...
    branch, {'name': 'testing', 'hash': '1cdf045c'}
    branch, {'name': 'release01', 'hash': 'f200013e'}
    branch, {'name': 'invalid'}

    XPath support for finding elements 
    A much more powerful way for finding interesting elements with ET is by using its XPath supportElement has some "find" methods that can accept an XPath path as an argument. findreturns the first matching sub-element, findall all the matching sub-elements in a list and iterfind provides an iterator for all the matching elements. These methods also exist on ElementTree, beginning the search on the root element. 

    Here's an example for our document: 
    >>> for elem in tree.iterfind('branch/sub-branch'):
    ... print("{}, {}".format(elem.tag, elem.attrib))
    ...
    sub-branch, {'name': 'subrelease01'}

    It found all the elements in the tree tagged sub-branch that are below an element called branch. And here's how to find all branch elements with a specific name attribute
    >>> for elem in tree.iterfind('branch[@name="release01"]'):
    ... print("{}, {}".format(elem.tag, elem.attrib))
    ...
    branch, {'name': 'release01', 'hash': 'f200013e'}

    To study the XPath syntax ET supports, see this page

    Building XML documents 
    ET provides a simple way to build XML documents and write them to files. The ElementTree object has the write method for this purpose. Now, there are probably two main use scenarios for writing XML documents. You either read one, modify it, and write it back, or create a new document from scratch. 

    Modifying documents can be done by means of manipulating Element objects. Here's a simple example: 
    >>> root = tree.getroot()
    >>> del root[2]
    >>> root[0].set('foo', 'bar')
    >>> for subelem in root:
    ... print("{}, {}".format(subelem.tag, subelem.attrib))
    ...
    branch, {'name': 'testing', 'foo': 'bar', 'hash': '1cdf045c'}
    branch, {'name': 'release01', 'hash': 'f200013e'}

    What we did here is delete the 3rd child of the root element, and add a new attribute to the first child. The tree can then be written back into a file. Here's how the result would look: 
    >>> tree.write('/tmp/test.xml')
    >>> with open('/tmp/test.xml', 'r') as fh:
    ... print(fh.read())
    ...
    1.   
    2.     "bar" hash="1cdf045c" name="testing">  
  •         text,source  
  •     
  •   
  •     "f200013e" name="release01">  
  •         "subrelease01">  
  •             xml,sgml  
  •         
  •   
  •     
  •   
  •     
  •   
    Note that the order of the attributes is different than in the original document. This is because ET keeps attributes in a dictionary, which is an unordered collection. Semantically, XML doesn't care about the order of attributes. Building whole new elements is easy too. The ET module provides the SubElement factory function to simplify the process: 
    >>> a = ET.Element('elem')
    >>> c = ET.SubElement(a, 'child1')
    >>> c.text = "some text"
    >>> d = ET.SubElement(a, 'child2')
    >>> b = ET.Element('elem_b')
    >>> root = ET.Element('root')
    >>> root.extend((a, b))
    >>> tree = ET.ElementTree(root)
    >>> ET.dump(root)
    some text

    XML stream parsing with iterparse 
    As I mentioned in the beginning of this article, XML documents tend to get huge and libraries that read them wholly into memory may have a problem when parsing such documents is required. This is one of the reasons to use the SAX API as an alternative to DOM. 

    We've just learned how to use ET to easily read XML into a in-memory tree and manipulate it. But doesn't it suffer from the same memory hogging problem as DOM when parsing huge documents? Yes, it does. This is why the package provides a special tool for SAX-like, on the fly parsing of XML. This tool is iterparse

    I will now use a complete example to demonstrate both how iterparse may be used, and also measure how it fares against standard tree parsing. I'm auto-generating an XML document to work with. Here's a tiny portion from its beginning: 
    1. "1.0" standalone="yes"?>  
    2.   
    3.     
    4.       
    5.       "item0">  
  •         United States      
  •         1  
  •         duteous nine eighteen   
  •         Creditcard  
  •           
  •             
  •            ...  
  • I've emphasized the element I'm going to refer to in the example with a comment. We'll see a simple script that counts how many such location elements there are in the document with the text value "Zimbabwe". Here's a standard approach using ET.parse
    1. tree = ET.parse(sys.argv[2])  
    2.   
    3. count = 0  
    4. for elem in tree.iter(tag='location'):  
    5.     if elem.text == 'Zimbabwe':  
    6.         count += 1  
    7.   
    8. print count  
    All elements in the XML tree are examined for the desired characteristic. When invoked on a ~100MB XML file, the peak memory usage of the Python process running this script is ~560MB and it takes 2.9 seconds to run. Note that we don't really need the whole tree in memory for this task. It would suffice to just detect location elements with the desired value. All the other data can be discarded. This is where iterparse comes in: 
    1. count = 0  
    2. for event, elem in ET.iterparse(sys.argv[2]):  
    3.     if event == 'end':  
    4.         if elem.tag == 'location' and elem.text == 'Zimbabwe':  
    5.             count += 1  
    6.     elem.clear() # discard the element  
    7.   
    8. print count  
    The loop iterates over iterparse events, detecting "end" events for the location tag, looking for the desired value. The call to elem.clear() is key here - iterparse still builds a tree, doing it on the fly. Clearing the element effectively discards the tree [7], freeing the allocated memory. 

    When invoked on the same file, the peak memory usage of this script is just 7MB, and it takes 2.5 seconds to run. The speed improvement is due to our traversing the tree only once here, while it is being constructed. The parse approach builds the whole tree first, and then traverses it again to look for interesting elements. 

    The performance of iterparse is comparable to SAX, but its API is much more useful - since it builds the tree on the fly for you; SAX just gives you the events and you build the tree yourself. This article presents a basic tutorial for ET. I hope it will provide anyone with interest in the subject enough material to start using the module and explore its more advanced capabilities on their own. 

    Conclusion 
    Of the many modules Python offers for processing XML, ElementTree really stands out. It combines a lightweight, Pythonic API with excellent performance through its C accelerator module. All things considered, it almost never makes sense not to use it if you need to parse or produce XML in Python. 

    Supplement 
    用 ElementTree 在 Python 中解析 XML

    [ Py DS ] Ch3 - Data Manipulation with Pandas (Part5)

    Source From  Here   Pivot Tables   We have seen how the  GroupBy  abstraction lets us explore relationships within a dataset. A pivot ta...