程式扎記: [ Data Structures with Java ] Section 21.8 : Hash Table Performance

標籤

2011年4月12日 星期二

[ Data Structures with Java ] Section 21.8 : Hash Table Performance

Preface : 
The hashing process includes both the hash function and the hash table. The design of the process has the goal of providing very efficient search operations. In this section, we discuss the performance of hashing using linear probing and chaining with separate lists. As with any performance analysis, we look at the search algorithm under both worst-case and average-case conditions. 
The analysis of hashing performance depends upon the quality of the hash function and the size of the hash table. A poor hash function maps all of the keys to a single table index; a good hash function provides a uniform distribution of hash values. A concept called load factor provides a measure of the table size. If a hash table has m entries, and n is the number of elements currently in the table, the load factor λ for the table is : 
λ = n / m 

In the case of linear probe, m is the size of the array (table range) ; for chaining with separate lists, m is the number of buckets. When the table is empty, λ is 0. As we add more items to the table, λ increases. For linear probe, λ attains the maximum value of 1 when the table is full (m=n). When the table uses chaining with separate lists, the individual buckets can grow as large as needed, and λ can become greater than 1.0. 
The worst-case linear probe or chaining with separate lists occurs when all data items hash to the same table location. If the table contains n elements, the search time is O(n), no better than that for the sequential search. To analyze the average case for a search, we assume that the hash function uniformly distributes indices around the hash table. That is, any item is equally likely to hash into any of the m slots, independent of where any other elements falls in the table. To attain a uniform distribution, it is best for m to be a prime number. Any discussion of the average case for linear probe is beyond the scope of this book. Consider the case in which the table uses chaining with separate lists. For the target named item, the time to compute the hash value hf(item) is O(1), so the number of comparisons it takes to locate item depends on the length of the list at array location bucket[hf(item)]. The assumption of a uniform hash distribution implies that we can expect λ = n/m elements in each bucket(list). There are two cases, Either item is in the list and the search is successful, or item is not in the list and the search fails. On average, an unsuccessful search makes λ comparisons before arriving at the end of a list and returning failure. Computing the average running for a successful search is more difficult. A mathematical analysis shows that the average number of probes for a successful search requires no more than 1.25 probes. When the table is 2/3 full (λ = 2/3), chaining requires no more than 1.33 probes. 

- Fine-Tuning Hashing 
The performance of chaining is a function of the load factor λ. In theory, λ = n/m can increase without bounds when the table stores elements in a fixed number of buckets. As λ grows, the search performance deteriorates. For this reason, a programmer does not use a hash table if there is no upper bound on the total number of element, say, R * m. In this case, λ = n/m <= (R*m)/m = R. The running time for a successful and an unsuccessful search satisfies the relationships : 
 

The terms on the right hand side of the inequalities are constants, so the average search has running time O(1)! That is, the running time is independent of the number of data items. The challenge is to make the constant time reasonably small, which is equivalent to saying that λ is reasonably small. The programmer can fine-tune the hashing process by adjusting the number of buckets. The choice of m must take into account system resources and the need to have the hash function uniformly distributed its values in the index range from 0 to m - 1. Note that difficult mathematical analysis indicates that the average performance of linear probing is also O(1). 

Evaluating Ordered and Unordered Sets : 
An unordered set or map class uses a hash table, so it provides average constant-time performance for the insertion, removal and retrieval operations. An ordered set or map stores elements in a binary search tree. Access and update operation have runtime efficiency O(log2n). The application will dictate which type is more appropriate. If the programmer needs very efficient access and updates without any concern for the ordering of the elements, the unordered set or map is appropriate. When order is important, an ordered collection is more appropriate.

沒有留言:

張貼留言

網誌存檔

關於我自己

我的相片
Where there is a will, there is a way!