Fibonacci Hashing revisted
Richard B. Kreckel
kreckel at thep.physik.uni-mainz.de
Tue Apr 9 18:25:57 CEST 2002
Hi,
As a sequel to that thread about Fateman's fast mult benchmark I have
growing qualms about our hashcode generation. First, the observations
made in basic::compare(const basic&):
1) Hardly ever a hash collision between different types occur while
hash collissions between different types are very frequent. This
suggests that something is better in one case than in the other.
2) Of course, there are unavoidable collisions when two equal objects
are compared. These amount to about 2/3 of all cases. In other
words: 1/3 of all collisions result in a call to compare_same_type()
and a subsequent traversal. This seems way too much, given a
hash-space of 31 bit. (More on that later.)
3) Applying a unary ~ on the hash-values of class numeric (i.e.
inverting all bits) reduces the number of collisions by about 20%
and results in a speedup of about 10%. As a crude estimate we could
deduce a potential speedup factor two(!), given an ideal
collision-free hashing. No, I don't expect this to happen, but maybe
50% should be doable.
What's wrong with our hashing? I am still trying to find out. Right now,
what I have is this:
a) CLN's hashes are sub-optimal for our purposes. Basically, they seem
to populate too few bits for the frequent small numbers. Maybe that
can be improved. This results in some funny collisions due to some
fundamental properties of the binary number system. That should be
easy to improve even without changing anything in CLN. (I hesitate
doing this in CLN since I have no idea what other people are using
cln::equal_hashcode() for.)
b) We rotate the bits in expairseq::calchash() before xoring with the
next element. I doubt this makes sense. The rotation is done in
order to have h(h(X),h(Y)) be different from h(h(Y),h(X)). This looks
superfluous to me, since the elements of expairseq are canonicalized
anyways when the hashing occurs so commutativity should be a non-issue.
c) The split-up of our hash space into two halves, one for numeric hashes
and one for non-numeric hashes effectively throws away one valuable
bit. As far as I can recall, there were some historic reasons why
this was done (initially, we didn't have hashes for numeric types).
However, if one would like to try out the hashtab-based class expairseq
again, then the implementation seems to demand that split-up. See
expairseq::calc_hashindex(const ex&) for the only occurrence of
the is_a_numeric_hash() macro. Hmm, I fail to see the theoretical
reason why hashtabs need such a trick. Since I very much like to
introduce one clean globalized hash space, I need to remove this code.
Is it really needed? Why?
Call for help: Does anyone know more about this stuff?
(Alex, maybe you, since you wrote it long ago?!?)
Regards
-richy.
--
Richard B. Kreckel
<Richard.Kreckel at Uni-Mainz.DE>
<http://wwwthep.physik.uni-mainz.de/~kreckel/>
More information about the GiNaC-devel
mailing list