1. Times33算法

虽然写了好几年php,可我之前还真没听过这个算法.. 哈哈哈 略尴尬

1.1.1. 概述

Times33是一个很简单的hash算法,对字符串进行哈希的函数,现在几乎所有流行的 HashMap都采用 DJB Hash Function, 俗称 Times33 算法

1.1.2. 原理


1.1.3. 代码


unsigned int DJBHash(char* str, unsigned int len)
   unsigned int hash = 5381;
   unsigned int i    = 0;

   for(i = 0; i < len; str++, i++)
      hash = ((hash << 5) + hash) + (*str);

   return hash;

一般可以改进一下, *33可以换成<< 5 + hash

1.1.4. 为啥是33和5381

  • 33

    这个神奇的数字33,为什么用来计算哈希的效果会比其他许多常数(无论是否为质数)更有效,并没有人给过足够充分的解释。因此,Ralf S. Engelschall尝试通过自己的方法解释其原因。通过对1到256中的每个数字进行测试,发现偶数的哈希效果非常差,根据用不了。而剩下的128个奇数,除了1之外,效果都差不多。这些奇数在分布上都表现不错,对哈希表的填充覆盖大概在86%。

    从哈希效果来看(Chi^2应该是指卡方分布),虽然33并不一定是最好的数值。但17、31、33、63、127和129等相对其他的奇数的一个很明显的优势是, 由于这些奇数与16、32、64、128只相差1,可以通过移位(如1 << 4 = 16)和加减1来代替乘法,速度更快 。

  • 5381

    • 奇数
    • 质数
    • 亏数(在数论中,若一个正整数除了本身外之所有因数之和比此数自身小,则称此数为亏数。 更为严格地说,亏数是指使得函数σ < 2n的正整数,其中σ指的是因数和函数,即n的所有正因数之和。2n − σ称作n的亏度。 例如15的真因数有1,3,5,而1+3+5=9,9<15,所以15可称为亏数。)
    • 二进制是001/010/100/000/101
  • 作者原话

DJBX33A (Daniel J. Bernstein, Times 33 with Addition)

  This is Daniel J. Bernstein's popular `times 33' hash function as
  posted by him years ago on comp.lang.c. It basically uses a function
  like ``hash(i) = hash(i-1) * 33 + str[i]''. This is one of the best
  known hash functions for strings. Because it is both computed very
  fast and distributes very well.

  The magic of number 33, i.e. why it works better than many other
  constants, prime or not, has never been adequately explained by
  anyone. So I try an explanation: if one experimentally tests all
  multipliers between 1 and 256 (as RSE did now) one detects that even
  numbers are not useable at all. The remaining 128 odd numbers
  (except for the number 1) work more or less all equally well. They
  all distribute in an acceptable way and this way fill a hash table
  with an average percent of approx. 86%.

  If one compares the Chi^2 values of the variants, the number 33 not
  even has the best value. But the number 33 and a few other equally
  good numbers like 17, 31, 63, 127 and 129 have nevertheless a great
  advantage to the remaining numbers in the large set of possible
  multipliers: their multiply operation can be replaced by a faster
  operation based on just one shift plus either a single addition
  or subtraction operation. And because a hash function has to both
  distribute good _and_ has to be very fast to compute, those few
  numbers should be preferred and seems to be the reason why Daniel J.
  Bernstein also preferred it.

                   -- Ralf S. Engelschall <rse@engelschall.com>

results matching ""

    No results matching ""