From kreckel at thep.physik.uni-mainz.de  Tue Apr  9 18:25:57 2002
From: kreckel at thep.physik.uni-mainz.de (Richard B. Kreckel)
Date: Tue, 9 Apr 2002 18:25:57 +0200 (CEST)
Subject: Fibonacci Hashing revisited
Message-ID: 

Hi,

As a sequel to that thread about Fateman's fast mult benchmark I have
growing qualms about our hashcode generation.  First, the observations
made in basic::compare(const basic&):

1) A hash collision between different types hardly ever occurs, while
   hash collisions between objects of the same type are very frequent.
   This suggests that the hashing works better in one case than in the
   other.

2) Of course, there are unavoidable collisions when two equal objects
   are compared.  These amount to about 2/3 of all cases.  In other
   words: the remaining 1/3 of all collisions result in a call to
   compare_same_type() and a subsequent traversal.  This seems way too
   much, given a hash space of 31 bits.  (More on that later.)

3) Applying a unary ~ to the hash values of class numeric (i.e.
   inverting all bits) reduces the number of collisions by about 20%
   and results in a speedup of about 10%.  As a crude estimate we could
   deduce a potential speedup factor of two(!), given ideal
   collision-free hashing.  No, I don't expect this to happen, but
   maybe 50% should be doable.

What's wrong with our hashing?  I am still trying to find out.  Right
now, what I have is this:

a) CLN's hashes are sub-optimal for our purposes.  Basically, they seem
   to populate too few bits for the frequent small numbers.  Maybe that
   can be improved.  This results in some funny collisions due to some
   fundamental properties of the binary number system.  That should be
   easy to improve even without changing anything in CLN.  (I hesitate
   to do this in CLN since I have no idea what other people are using
   cln::equal_hashcode() for.)

b) We rotate the bits in expairseq::calchash() before xoring with the
   next element.  I doubt this makes sense.  The rotation is done in
   order to have h(h(X),h(Y)) be different from h(h(Y),h(X)) (see the
   sketch after this message).  This looks superfluous to me, since the
   elements of an expairseq are already canonicalized when the hashing
   occurs, so commutativity should be a non-issue.

c) The split-up of our hash space into two halves, one for numeric
   hashes and one for non-numeric hashes, effectively throws away one
   valuable bit.  As far as I can recall, there were some historic
   reasons why this was done (initially, we didn't have hashes for
   numeric types).  However, if one would like to try out the
   hashtab-based class expairseq again, then the implementation seems
   to demand that split-up.  See expairseq::calc_hashindex(const ex&)
   for the only occurrence of the is_a_numeric_hash() macro.  Hmm, I
   fail to see the theoretical reason why hashtabs need such a trick.
   Since I would very much like to introduce one clean, globalized hash
   space, I need to remove this code.  Is it really needed?  Why?

Call for help: Does anyone know more about this stuff?
(Alex, maybe you, since you wrote it long ago?!?)

Regards
  -richy.
-- 
Richard B. Kreckel
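Here is a minimal sketch of the rotate-then-xor combining scheme referred to
in point b) above.  The helper name, the one-bit rotation, and the zero seed
are assumptions made for illustration only; the actual expairseq::calchash()
may rotate by a different amount and mix in additional per-class data.

    #include <vector>

    // Rotate a 32-bit value left by one bit.
    static inline unsigned rotate_left_1(unsigned h)
    {
        return (h << 1) | (h >> 31);
    }

    // Combine the hashes of a sequence of elements into one hash value.
    unsigned combine_hashes(const std::vector<unsigned> &element_hashes)
    {
        unsigned h = 0;
        for (unsigned eh : element_hashes) {
            h = rotate_left_1(h);   // makes the result depend on element order
            h ^= eh;                // fold in the hash of the next element
        }
        return h;
    }

Without the call to rotate_left_1() the loop would reduce to a plain xor,
which is fully commutative; that is exactly the property the message above
argues is unnecessary once the elements are already in canonical order.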
From form at nikhef.nl  Wed Apr 17 13:58:10 2002
From: form at nikhef.nl (FORM Administrator)
Date: Wed, 17 Apr 2002 13:58:10 +0200
Subject: Form is really slow and fast for me. What could this mean?
Message-ID: <200204171158.g3HBwAW11551@tosca.nikhef.nl>

In the single line example FORM will definitely not look for common
subexpressions and try to evaluate them separately.  That is completely
against its philosophy.  There you would have to help it a bit.  Note
that it will always be possible to invent benchmarks like this that
give poor performance.  Actual performance is what counts most.

Cheers

Jos Vermaseren

From kreckel at thep.physik.uni-mainz.de  Tue Apr 30 21:42:59 2002
From: kreckel at thep.physik.uni-mainz.de (Richard B. Kreckel)
Date: Tue, 30 Apr 2002 21:42:59 +0200 (CEST)
Subject: Fibonacci Hashing revisited
In-Reply-To: 
Message-ID: 

Hi,

After several beers and off-list discussions with Alex I have now
committed some changes in our hashing code to CVS.  (It's on the branch
for version 1.1 only and there are no plans of backporting it to 1.0.)
Anyway, the split-up into 31 bits plus 1 bit is gone now, so the code
should leave fewer questions and generally be shorter and more
readable.

The main problem was with CLN's hash values, which are not suited for
our purposes, since small integers' hash values have too few bits
populated.  They are soon xor'ed out with a high probability.
Multiplying with the golden ratio easily fixes this (a sketch follows
below).

All issues I brought up in my email from three weeks ago have been
addressed, but the speedup is minute.  I took the opportunity to
analyze the hashing's effectiveness a bit.  Executive summary: it's as
good as it gets now.

What we want to measure is: given any two objects A and B of classes
derived from class basic (not necessarily the same class), how good is
our hashing?  IOW, when A and B are different, how high is the
probability

        number of collisions
    P = ---------------------
        number of comparisons

The best we can expect with 32-bit hash values is obviously
P_opt = 2^(-32) = 2.3*10^(-10).  This is measured by instrumenting
basic::compare(const basic&) with two counters.  The results after
running the timings from the suite of regression tests are
approximately P_new = 3.2*10^(-5) vs. P_old = 8.5*10^(-3), suggesting
an improvement by a factor of 260 or 270.  At the same time, the
timings got only very marginally better, indicating that there is
nothing to be gained by better hash functions, even perfect
collision-free ones (if they existed), since deep traversals caused by
equal objects outweigh deep traversals caused by hash collisions of
different objects.

Regards
  -richy.
-- 
Richard B. Kreckel
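Here is a minimal sketch of the golden-ratio multiplication mentioned above
(Fibonacci hashing).  The function name is an assumption made for
illustration, and 2^32/phi is only one common way to form the constant; the
actual GiNaC/CLN code may differ in detail.

    #include <cstdio>

    // 0x9e3779b9 is 2^32 divided by the golden ratio phi ~= 1.618; the
    // multiplication spreads the few set bits of a small integer over all
    // 32 bits of the hash value.
    static inline unsigned golden_ratio_hash(unsigned x)
    {
        return x * 0x9e3779b9u;
    }

    int main()
    {
        // With near-identity hashes, small integers cancel out when xor'ed
        // together: 1 ^ 2 ^ 3 == 0.
        std::printf("%u\n", 1u ^ 2u ^ 3u);
        // After the golden-ratio spread the xor-combination no longer
        // collapses to zero.
        std::printf("%u\n", golden_ratio_hash(1) ^ golden_ratio_hash(2)
                                                 ^ golden_ratio_hash(3));
        return 0;
    }

The toy main() illustrates the kind of "funny collision due to fundamental
properties of the binary number system" mentioned in the quoted message
below: xor-folding raw small-integer hashes cancels bits, whereas the
multiplied hashes keep the contributions apart.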
On Tue, 9 Apr 2002, I wrote:
> As a sequel to that thread about Fateman's fast mult benchmark I have
> growing qualms about our hashcode generation.  First, the observations
> made in basic::compare(const basic&):
>
> 1) A hash collision between different types hardly ever occurs, while
>    hash collisions between objects of the same type are very frequent.
>    This suggests that the hashing works better in one case than in the
>    other.
>
> 2) Of course, there are unavoidable collisions when two equal objects
>    are compared.  These amount to about 2/3 of all cases.  In other
>    words: the remaining 1/3 of all collisions result in a call to
>    compare_same_type() and a subsequent traversal.  This seems way too
>    much, given a hash space of 31 bits.  (More on that later.)
>
> 3) Applying a unary ~ to the hash values of class numeric (i.e.
>    inverting all bits) reduces the number of collisions by about 20%
>    and results in a speedup of about 10%.  As a crude estimate we
>    could deduce a potential speedup factor of two(!), given ideal
>    collision-free hashing.  No, I don't expect this to happen, but
>    maybe 50% should be doable.
>
> What's wrong with our hashing?  I am still trying to find out.  Right
> now, what I have is this:
>
> a) CLN's hashes are sub-optimal for our purposes.  Basically, they
>    seem to populate too few bits for the frequent small numbers.
>    Maybe that can be improved.  This results in some funny collisions
>    due to some fundamental properties of the binary number system.
>    That should be easy to improve even without changing anything in
>    CLN.  (I hesitate to do this in CLN since I have no idea what other
>    people are using cln::equal_hashcode() for.)
>
> b) We rotate the bits in expairseq::calchash() before xoring with the
>    next element.  I doubt this makes sense.  The rotation is done in
>    order to have h(h(X),h(Y)) be different from h(h(Y),h(X)).  This
>    looks superfluous to me, since the elements of an expairseq are
>    already canonicalized when the hashing occurs, so commutativity
>    should be a non-issue.
>
> c) The split-up of our hash space into two halves, one for numeric
>    hashes and one for non-numeric hashes, effectively throws away one
>    valuable bit.  As far as I can recall, there were some historic
>    reasons why this was done (initially, we didn't have hashes for
>    numeric types).  However, if one would like to try out the
>    hashtab-based class expairseq again, then the implementation seems
>    to demand that split-up.  See expairseq::calc_hashindex(const ex&)
>    for the only occurrence of the is_a_numeric_hash() macro.  Hmm, I
>    fail to see the theoretical reason why hashtabs need such a trick.
>    Since I would very much like to introduce one clean, globalized
>    hash space, I need to remove this code.  Is it really needed?  Why?
>
> Call for help: Does anyone know more about this stuff?
> (Alex, maybe you, since you wrote it long ago?!?)
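For illustration, the two-counter measurement described in the 30 April
message above could be sketched as follows.  The function and counter names
are assumptions made for illustration; the actual figures were obtained by
instrumenting basic::compare(const basic&) itself.

    #include <cstdio>

    static unsigned long long comparisons = 0;  // all pairwise comparisons
    static unsigned long long collisions  = 0;  // equal hashes on unequal objects

    // To be called from the comparison routine for every pair of objects.
    void record_comparison(unsigned hash_a, unsigned hash_b, bool objects_are_equal)
    {
        ++comparisons;
        if (hash_a == hash_b && !objects_are_equal)
            ++collisions;   // a genuine collision: different objects, same hash
    }

    void report_collision_rate()
    {
        if (comparisons == 0)
            return;
        // Compare against P_opt = 2^-32 ~= 2.3e-10 and the measured values
        // P_old ~= 8.5e-3 and P_new ~= 3.2e-5 quoted above.
        std::printf("P = %g\n", static_cast<double>(collisions) / comparisons);
    }

    int main()
    {
        // Toy run: three comparisons, one of which is a genuine collision.
        record_comparison(0x1234u, 0x9999u, false);  // different hashes
        record_comparison(0x4242u, 0x4242u, true);   // equal objects
        record_comparison(0x4242u, 0x4242u, false);  // collision
        report_collision_rate();                     // prints P = 0.333333
        return 0;
    }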