程式扎記: [ Data Structures with Java ] Section 21.2 : Designing Hash Functions

標籤

2011年4月5日 星期二

[ Data Structures with Java ] Section 21.2 : Designing Hash Functions


Preface :
The identity function for an integer key is an example of a hash function. We need other functions for different types of keys. Some general design principles guide the creation of all hash functions. Evaluating a hash function should be efficient. The goal of the hashing process is efficient access to the corresponding table entry. A hash function's ability to compute the hash value efficiently is an important factor in attaining this goal. A hash function can produce collisions when the value is telescoped into the table range by using % operator. Ideally, we would create a perfect hash function that produces no collisions. Short of this ideal situation, a hash function should produce uniformly distributed hash values. This spreads the hash table indices around the table, which help minimize collisions.

Java Method hashCode() :
Hash table applications involve elements that typically use strings or numeric values as the key. A design strategy mush include hash functions that deal with integer, real and string types. The field of algorithms provides extensive research on this topic. We limit out our discussion to a small set of hash functions that meet our design criteria.
The Java programming language provides a general hashing function with the hashCode() method in the Object superlcass.
The implementation of this general hashCode() method converts the internal address of the object into an integer value. This has limited application because two different objects will have different values for hashCode(), even if they store the same data! For instance, hashCode() would produce distinct values for the following strings :
// strings one and two are the same; not so for integer values
// one.hashCode() and two.hashCode()

String one = "java", two="java";

Key Java classes such as String, Integer and Double override hashCode() in the Object class with implementations that are based on the data in the object. As a result, we have access to predefined hash functions for most hash table applications. The presence of hashCode() in the Object superclass allows us to declare generic hash-table-based-collection classes where the method is specific to the generic type. Let us look at the Java hashCode() implementations for Integer and String objects.
- Integer Class Hash Function
The Integer class provides the identity function for hashCode(). The implementation returns the integer value that is wrapped in the object :
- Integer hashCode() :
  1. ...  
  2. public int hashCode(){return value;}  
  3. ...  

With an integer key, the identity function can serve as a good hash function, provided all or a portion of the number is random. Assume a television manufacturer marks its products with seven-digit serial numbers, in which the low-order five digits are random. Using a table with size n<=10^k(k<=5), the identify hash function maps the serial number to a random table index :
Integer tv = 6824015;
// index for a serial number in a 100000-element table
tv.hashCode() % 100000 = 6824015 % 100000 = 24015

- String Class Hash Function
In the majority of hash-table applications, the key is a string, such as a Social Security number, a license number, or a policy number. For instance, compilers often keep track of identifiers in a program by using a hash table. To create an efficient hash function, we must combine the sequence of characters in the string to for an integer. This task is carried out by the String class hashCode() method. The algorithm assumes that the class internally stores the ncharacters of the string in a character array. The following is the definition of the hash value for string str. Let s be the corresponding array with characters from the string and size n = str.length(). The calculations include both multiplication and addition :
hash = s[0] * 31^(n-1) + s[1] * 31^(n-2) + ... + s[n-1]

The String class hashCode() method is implemented with a loop having n (string length) iterations. Note that the hash value of the empty string is zero :
- String hashCode() :
  1. ...  
  2. public int hashCode() {  
  3.     int hash = 0;  
  4.     for(int i=0; i
  5.         hash = 31*hash + s[i];  
  6.     return hash;  
  7. }  
  8. ...  

User-Defined Hash Functions :
A class would override the method hashCode() only if instances of the class could serve as a key. We do this for the Time24 class. The hash value for an object is its time converted to minutes. Because hour and minute are normalized to fall within range 0-23 and 0-59 respectively, each time is a unique positive value :
- Time24 hashCode() rewrite :
  1. ...  
  2.     @Override  
  3.     public int hashCode(){  
  4.         return 60*hour + minute;  
  5.     }  
  6. ...  

The hash function in the String class uses multiplication to create random values that limit collision. A function that simply added the ASCII values of the characters would result in the same hash value for the string "dog" and "god". Similar calculations may be needed for integer keys. The integer identity function is a good hash function if low-order digits in the integer keys are random. When this is not true, two may collisions may occur. For instance, suppose a television manufacturer stamps each item with a serial number that uses the last four digits to record the year in which the item is made. If the hash table size is 10,000, the identify hash function maps all of the products for a given year to the same index. Only TV sets manufactured in different years would avoid a collision. If the television serial number will be a key, we need to design a custom hash function that "mixes up" the bits and produces more uniformly distributed values. Below is a simple algorithm to meet our requirement :
Algorithm :
Assume serialNum is a 32-bits integer value specifying a product's serial number. Its value is in the range from 0 to 2^31-1 (between 0 and Integer.MAX_VALUE). Assign the integer value into the 64-bits long variable, hashValue. Nonnegative long values have a range from 0 to 2^63-1. With the relative bit-sizes for int and long variables, the square of hashValue must be a nonnegative 64-bits number, because :
(Integer.MAX_VALUE)^2 = (2^31-1)^2 = 2^62 - 2^32 +1 < 2^63 - 1

Return the remainder after dividing the square by Integer.MAX_VALUE. This is a nonnegative 32-bits integer that contains "jumbled up" bits from the original value.

Television sets are instances of the Product class, which defines the instance variable serialNum. We use the custom algorithm to override the method hashCode() in the class :
- Product class with overrided hashCode() :
  1. public class Product {  
  2.     private int serialNum;  
  3.     ...  
  4.     @Override  
  5.     public int hashCode() {  
  6.         long hashValue = serialNum;  
  7.         hashValue *= hashValue;  
  8.         return (int)(hashValue % Integer.MAX_VALUE);  
  9.     }  
  10. }  

沒有留言:

張貼留言

網誌存檔

關於我自己

我的相片
Where there is a will, there is a way!