程式扎記: [ Data Structures with Java ] Section 21.3 : Designing Hash Tables

2011年4月6日星期三

[ Data Structures with Java ] Section 21.3 : Designing Hash Tables

Preface :
In the previous section, we focused on the concept of a hash function. To understand its action, we used an array to introduce a hash table and had the hash function map a key to an index position in the array. In this section, we turn our attention to the design of a hash table. In the process, we define different storage models that work with a hash function to resolve collisions. We assume the table doesn't contain duplicate values and so collisions result from the hashing process. The issue is clear. When two or more data items hash to the same table index, they cannot occupy the same position in the table. We are left with the option of locating one of the items at another position in the table or of redesigning the table to store a sequence of colliding keys at each index. In the latter case, a list holds all of the elements that hash to that location. These options represent two classical strategies for collision resolution called linear probing and chaining with separate list.

Linear Probing :
This technique assumes that the hash table is an array of elements with an associated hash function. Initially, we tag each entry in the table as "empty". To add an item to the table, apply the hash function to the key and divide the value by the table size. The resulting index is a location in the table. If the entry is empty, the algorithm inserts the item at the index position. Otherwise, it "probes" the table looking for the first open slot. The probe starts at the next hash index and begins a sequential search at successive indices. The search wraps around to the start of the table after it probes the last table entry. An insertion occurs at the first open location. The search could cycle through all of the indices and return to the original hash location without finding an open slot. In this case, the table is full, and the algorithm throws an exception. Below is a example :
Example 21.2
Consider a hash table of size 11 which combines with the identity hash function to store eight distinct integer data items. The table lists the items and the corresponding hash index, whose value is item%11 :

* Inserts 54, 77, 94, 89, 14 :

The first five items hash to different indices. In each case, the table location is empty and the item is inserted at the hash index. Below figure(1) displays the table with the annotation "1" next to the occupied cells, indicating that only one probe (test for empty) was necessary to insert the item.

* Insert 54 :

The hash index for 45 is 1, which provides the first location. The location is occupied by 89 from a previous insertion. The linear probe picks up at index 2 and continues until the table has an open slot. This occurs immediately because no item occupies the position at index 2. Inserting 45 requires only two probes (Below figure(b))

* Insert 35 :

As the table begins to fill, the likelihood of collisions increases along with the number of probes required to insert an item. With data value 35, which hashes to index 2, there is no collision with any previous value. However, we require additional probes because the value 45 took slot 2 in response to a collision with 89. Value 35 requires three probes before landing at index 4(Below figure(c))

* Insert 76 :

Adding 76 to the list dramatically illustrates a potential problem with linear probing. The value hashes to index 10 but finds an open slot only after seven probes (Below figure(d))

The following code outline describes the linear probe insertion algorithm, assuming no duplicate values :

view plaincopy to clipboardprint?
int index, origIndex;  
  
// compute the hash index of item for a table of size n  
index = (item.hashCode() & Integer.MAX_VALUE) % n;  
  
// save the original hash index  
origIndex = index;  
  
// cycle through the table looking for an empty slot, a match, or a table full condition (origIndex = index)  
do{  
    // test whehter the table slot is empty of the key matches the data field of the table entry  
    if table[index] is empty  
        insert item in table at table[index] and return  
    else if table[index] matches item  
        return  
    // we are not yet successful; begin a probe starting at the next table location  
    index = (index+1) % n;  
  
} while (index != origIndex);  
// we have gone around the table without finding an open slot or a match the table is full  
throw new BufferOverflowException();  

- Evaluating Linear Probing
If the size of the table is large relative to the number of items, linear probing works well because a good hash function generates indices that are evenly distributed over the table range, and collisions will be minimal. As the ration of table size to the number of items approaches 1, inherent difficulties with the process become apparent. In the previous example, adding 76 to the table requires seven probes, which is more than 40% of the 17 probes required to store the entire list. Even through five of the eight data values hash to distinct indices, the linear probe algorithm requires an average 2.1 probes per item. A phenomenon called clustering can occur when using linear probes and will degrade performance.

Chaining with Separate Lists :
A second approach to hashing defines the hash table as an indexed sequence of linked lists. Each list, called a bucket, holds a set of items that hash to the same table location. This collision resolution strategy is referred to as chaining with separate lists.

A bucket is a singly linked list. Each entry of the table is a reference to the first node in a sequence of items that hash to the table index. A node has the familiar structure with two fields for the value and for the reference to the next node.
To add object item, begin with the key and use the hash function to identify the index, i, for the appropriate bucket (linked list) in the table. Assume the array of node reference is called table. If table is null, add item as the first entry in the list; otherwise, a scan of the list compares item with the value of each node. If the scan ends with no match, then item is not in the list. Add it to the front of the list.

- Evaluating Chaining with Separate Lists
Chaining with separate lists is generally faster than linear probing because chaining only searches items that hash to the same table location. Furthermore, with linear probing, the number of table entries is limited to the table size, whereas the collections used in chaining grow as necessary and hence the number of elements in the table is limited only by the amount of memory. In addition, deleting an element from the hash table is simple. Just erase it from the associated list. Deleting an element from a hash table that uses linear probing is more difficult.

- Rehashing
When using chaining with separate lists, the number of elements is not limited by the size of the hash table. However, the individual linked lists increases the search time for elements in a list. We can solve these problem by using a strategy known as [i]rehashing. The idea is relatively simple. When the number of elements in the current hash table is a specified percentage of table size, create a larger hash table. Elements from the old table must be hashed into the new table. The new hash table will provide better performance until it begins to fill up, at which time apply rehashing again. We use this strategy in building the Hash and HashMap classes.

程式扎記

標籤

2011年4月6日星期三

[ Data Structures with Java ] Section 21.3 : Designing Hash Tables

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2011年4月6日 星期三

[ Data Structures with Java ] Section 21.3 : Designing Hash Tables

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

2011年4月6日星期三