Date: 20 Feb 2008 17:23:37
From: dkraft
Subject: hash table size
Hi,

How large should the hash table size be?
The larger the better?
Or is 1536 MB wasting hash memory?
And is 512 MB enough?

Thanks
Dieter




 
Date: 20 Feb 2008 11:10:06
From: Guest
Subject: Re: hash table size
Basically, the larger the better provided it all fits into physical memory
with no chance of paging out to disk. It's not going to fit into the L2
cache anyway....

Basically, run a simple test...

With whatever test suite you have, try it with various sizes.

The odds are good you'll see that the larger it is, the better the program
performs, with diminishing returns as the table size grows.

If you have the memory and it's not being used for anything else, then why
not use it? You have nothing to lose by doing so because a larger hash
table will never make a program play worse (apart from bugs in the
program and a few pathological test positions.)


Incidentally, this newsgroup is no longer suitable for discussions of
computer chess. Too much spam, Sam Sloan, USCF, etc. etc. garbage.

Most chess programmers have moved over to:

http://www.talkchess.com/forum/viewforum.php?start=0&f=7&topic_view=flat

(They have several forums, but that's the programmer's forum.)


"dkraft" <[email protected] > wrote in message
news:[email protected]...
> Hi,
>
> How large should the hash table size be?
> The larger the better?
> Or is 1536 MB wasting hash memory?
> And is 512 MB enough?
>
> Thanks
> Dieter





  
Date: 20 Feb 2008 19:25:26
From: Anders Thulin
Subject: Re: hash table size
Guest wrote:
> Basically, the larger the better provided it all fits into physical memory
> with no chance of paging out to disk. It's not going to fit into the L2
> cache anyway....

I don't quite agree: every hash table entry imposes some administrative
overhead that offsets the benefit of having a hash table. If that overhead
gets too large, you lose.

Just where the cutoff lies depends on how the hash table was implemented.
Testing can reveal it, but it will probably take a lot of careful
measurement to pin down.

So: ask the software writers. They should have an idea if there
is a sweet spot or not.

--
Anders Thulin anders*thulin.name http://www.anders.thulin.name/


   
Date: 20 Feb 2008 18:08:17
From: Guest
Subject: Re: hash table size
"Anders Thulin" <[email protected] > wrote in message
news:G8%[email protected]...
> Guest wrote:
>> Basically, the larger the better provided it all fits into physical
>> memory
>> with no chance of paging out to disk. It's not going to fit into the L2
>> cache anyway....
>
> I don't quite agree: every hash table entry imposes some administrative
> overhead that offsets the benefit of having a hash table. If that
> overhead
> gets too large, you lose.

The only overhead is:

HashIndex = Hash % TableSize;

That's constant regardless of table size.

And most tables are implemented in one of four ways.

1) Check one entry.

2) Check two entries.

3) Check one hash line, which may contain two or more entries.

4) One small table and one large table.

All four cases behave the same regardless of table size.
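
To make that concrete, a probe in case 2 looks something like this (just a
sketch; the struct fields and names are made up):

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t lock;    /* hash key stored for verification */
    uint32_t move;
    int16_t  score;
    uint8_t  depth;
    uint8_t  flags;
} HashEntry;

/* Case 2: check two entries.  The work here is identical whether the
   table has 64K entries or 64M entries; only the index changes.      */
HashEntry *probe(HashEntry *table, size_t table_size, uint64_t hash)
{
    size_t index = (size_t)(hash % table_size);

    for (int i = 0; i < 2; i++) {
        HashEntry *e = &table[(index + i) % table_size];
        if (e->lock == hash)
            return e;          /* hit: entry verified against full hash */
    }
    return NULL;               /* miss: search the position normally    */
}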

The only additional overhead would be extra pressure on the TLB (the CPU
internal entries that translate addresses into physical memory addresses).
But since a big table already needs far more of those than can be kept
around, and the table isn't going to stay in the L2 cache, every hash check
is nearly guaranteed to be a full-cost random memory access.

So unless your hash table is *very* small (like small enough to fit into L2
or only slightly larger), there is no extra overhead when increasing the
size, as long as you stay within physical memory.


Now, if you try to clear the hash, then a larger hash will take more time.
But that's a one time thing before each search, and most people don't
actually clear the hash. Instead other things are done so that old entries
are very unlikely to match for a new search.

> Just where the cutoff is depends on how the hash table was implemented.
> Testing can be used to detect the cut-off, but it will probably take a lot
> of careful testing to discover.
>
> So: ask the software writers. They should have an idea if there
> is a sweet spot or not.


I am one of those writers. Not the writer of a major program, no. But I have
written a few chess programs over the years.

And this question has come up repeatedly over the years. And the answer from
the major chess program writers (Hyatt, etc. etc.) is always the same. People
who have done these kinds of tests always get the same basic answer.


As to whether there is a 'sweet spot', that would suggest the most benefit
for the amount of memory consumed.

Like 75% at 16m, 95% at 32m and 96% at 64m.

In that case, the 'sweet spot' would be 95% at 32m. The most 'bang for the
buck'.

Finding that 'sweet spot' would require testing. Which I did suggest he do.

But although going to 64m would cost you an extra 32m for only a 1%
improvement, which makes it unattractive, it's still not a bad setting if you
have the extra memory to spare.

Except for the occasional random 'bad' position that might hurt from a
larger hash, the majority of time, for normal positions, more memory is
better.






    
Date: 21 Feb 2008 16:49:59
From: Anders Thulin
Subject: Re: hash table size
Guest wrote:

> The only overhead is:
>
> HashIndex = Hash % TableSize;

and later:

> All four cases behave the same regardless of table size.

No, there's more to it. You show it yourself when you say:

> 1) Check one entry.
> 2) Check two entries.

no further overhead here, true, except that the second may
increase code space, and so cause other kinds of overhead.
If it's implemented as a 1..2 loop, there is obviously
counting overhead.

> 3) Check one hash line, which may contain two or more entries.

and here you have (I presume) a bucket search, which may be more
or less complex depending on how you implement it. I don't see how
you can claim no overhead in this situation.

> 4) One small table and one large table.

But why stop there? The table is not static -- you don't only
do searches. You also have to insert new stuff, and weed out old.
In the single entry case, it's easy, but in the bucket hash it
tends to get hairy: a fast search usually means a slow insert.


Now all that adds up: code has to execute to search the table,
code has to execute to change the table. As long as that code
executes faster than the code the hash replaces, all is fine. But
does it? If the hash table is poorly implemented, you may turn
out to have a bucket hash with linear searches of the buckets,
and one that resizes by increasing bucket chain length only.
I've seen those in actual use -- and turned out by professional
programmers, too.

It doesn't matter one whit what theory says, if the programmer
didn't do a good job implementing it.

In this particular case, we don't have a single clue as to what code
base the question applies to. I assume it applies to a code base,
of course -- if the question is purely theoretical, one of the
standard books on search algorithms will probably be the best source for
a reply.

--
Anders Thulin anders*thulin.name http://www.anders.thulin.name/


     
Date: 21 Feb 2008 21:08:15
From: David Richerby
Subject: Re: hash table size
Anders Thulin <[email protected] > wrote:
>Guest wrote:
>> The only overhead is:
>>
>> HashIndex = Hash % TableSize;
>
>and later:
>
> > All four cases behave the same regardless of table size.
>
> No, there's more to it. You show it yourself when you say:
>
>> 1) Check one entry.
>> 2) Check two entries.
>
> no further overhead here, true, except that the second may increase
> code space, and so cause other kinds of overhead. If it's
> implemented as a 1..2 loop, there is obviously counting overhead.

Yes but that's insignificant and independent of table size.

>> 3) Check one hash line, which may contain two or more entries.
>
> and here you have (I presume) a bucket search, which may be more or
> less complex depending on how you implement it. I don't see how you
> can claim no overhead in this situation.

Because the buckets are of constant size, independent of the size of
the hash table.

>> 4) One small table and one large table.
>
> But why stop there? The table is not static -- you don't only
> do searches. You also have to insert new stuff, and weed out old.

Insertion overwrites old data; it's not deleted `manually'. The hash
table is operating as a cache, not a dictionary: if the data you want
isn't in its `primary' location, you look in a constant number of
other places for it and then give up, deleting the stalest or
shallowest data you have and replacing it with the new evaluation.
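
As a sketch (field names invented), a store is just `probe a constant
handful of slots, then overwrite the least useful one':

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t key;    /* verification part of the hash          */
    uint32_t move;
    int16_t  score;
    uint8_t  depth;  /* `draft' of the stored search result    */
    uint8_t  age;    /* which search wrote this entry          */
} Entry;

/* Probe a constant number of slots; replace the stalest, or failing
   that the shallowest, entry.  Cost is independent of table size.   */
void store(Entry *table, size_t size, uint64_t hash, Entry fresh)
{
    size_t base = (size_t)(hash % size);
    Entry *victim = &table[base];

    for (size_t i = 1; i < 4; i++) {
        Entry *e = &table[(base + i) % size];
        int e_stale = (e->age != fresh.age);
        int v_stale = (victim->age != fresh.age);
        if ((e_stale && !v_stale) ||
            (e_stale == v_stale && e->depth < victim->depth))
            victim = e;        /* prefer stale data, then shallow data */
    }
    *victim = fresh;           /* simply overwrite: nothing is freed   */
}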

> Now all that adds up: code has to execute to search the table,
> code hash to execute to change the table.

Yes but the speed of this code is completely independent of the size
of the hash table.

> If the hash table is poorly implemented

... then you've already lost. Any code might have been badly written
and behave stupidly as a result. Yes, it's possible that the hash
table was written by an idiot but, if it was, the engine will probably
be weak for other reasons, too.

In a well-written engine, the optimal hash table size really is `as
big as you can have it without swapping.'


Dave.

--
David Richerby Disgusting Atlas (TM): it's like
www.chiark.greenend.org.uk/~davidr/ a map of the world but it'll turn
your stomach!


      
Date: 22 Feb 2008 02:18:37
From: Andy Walker
Subject: Re: hash table size
In article <l5A*[email protected] >,
David Richerby <[email protected] > wrote:
>In a well-written engine, the optimal hash table size really is `as
>big as you can have it without swapping.'

Whoa, it's not *quite* as simple as that.

(a) Even in a well-written engine, you may want/need to "clear
out" the table from time to time. Initialising several terabytes
may take a little while even on a modern machine ....

(b) Some of the standard operations on a hash table involve
indexing and finding remainders modulo the table size. These may
be more efficient if the table size is a power of 2, and if the
relevant numbers fit into [in C terms] ints rather than longs.

(c) If you manage to lock down all, or most, of the storage of
a machine, you may perhaps be unpopular, in a multi-user environment,
with other users of the machine. *Your* program may not be swapping,
but *theirs* are! [Even on a home PC, the other users may be *you*,
trying to run other things, such as the engine that this program is
playing against.]

--
Andy Walker
Nottingham


       
Date: 21 Feb 2008 21:24:56
From: Guest
Subject: Re: hash table size
"Andy Walker" <[email protected] > wrote in message
news:[email protected]...
> In article <l5A*[email protected]>,
> David Richerby <[email protected]> wrote:
>>In a well-written engine, the optimal hash table size really is `as
>>big as you can have it without swapping.'
>
> Whoa, it's not *quite* as simple as that.
>
> (a) Even in a well-written engine, you may want/need to "clear
> out" the table from time to time. Initialising several terabytes
> may take a little while even on a modern machine ....

Actually, current programs don't do that. They leave the hash alone. They
don't even initialize it when the program is first run.

Either they actually use the data from the previous search, or they
invalidate it at the time of access...

There are several ways to invalidate at access...

1) Change all the hash keys so you get a new random value. This isn't a
preferred choice because many people use the hash as a way to quickly check
for repetition.

2) The hash entry contains some sort of counter as part of the 'lock'.

3) At time of access into the hash, XOR in an additional key that changes
for every search.


This way old entries from the previous search (or uninitialized hash memory)
will very very rarely be considered valid.
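
As an example, method 3 only adds a couple of lines (a sketch, with made-up
names):

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t lock;           /* verification key stored in the entry */
    uint32_t move;
    int16_t  score;
    int16_t  depth;
} TTEntry;

static uint64_t search_key;  /* set to a fresh random value before
                                every search                          */

/* XOR the per-search key into the lock on both store and probe.  An
   entry written during an earlier search (or uninitialized memory)
   will almost never match the current search_key.                    */
void tt_store(TTEntry *table, size_t size, uint64_t hash, TTEntry e)
{
    e.lock = hash ^ search_key;
    table[hash % size] = e;
}

TTEntry *tt_probe(TTEntry *table, size_t size, uint64_t hash)
{
    TTEntry *e = &table[hash % size];
    return (e->lock == (hash ^ search_key)) ? e : NULL;
}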

The odds are so remote that you are more likely to encounter a normal false
hash hit during the search.

And if you actually check the returned move as valid (not everybody does
because it takes time and they feel the odds are acceptable), then the odds
are so remote that you are accessing old data as current data that you can
totally ignore it.

You have a better chance of getting random bit flip errors in the memory or
in the cpu.


As for terabytes... well, not too many people have access to machines with
terabytes of physical memory....


> (b) Some of the standard operations on a hash table involve
> indexing and finding remainders modulo the table size. These may
> be more efficient if the table size is a power of 2, and if the
> relevant numbers fit into [in C terms] ints rather than longs.

It is true that doing an AND is faster than a modulo. But the time
difference isn't really all that significant considering how long main
memory is going to take to access. And the gain from being able to use all
the available memory can greatly outweigh the few extra cycles a modulo takes.

But yes, many people do prefer a power of two size.
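
i.e. with a power-of-two size the index is just a mask (sketch):

#include <stddef.h>
#include <stdint.h>

/* size is a power of two, so hash % size reduces to a single AND */
static inline size_t index_pow2(uint64_t hash, size_t size)
{
    return (size_t)(hash & (size - 1));
}

/* arbitrary size: a 64-bit modulo -- slower, but still tiny next
   to the 200+ cycle cache miss the table lookup itself costs     */
static inline size_t index_any(uint64_t hash, size_t size)
{
    return (size_t)(hash % size);
}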


And as for the 'ints' vs. 'longs'... Sorry. That was true in the days of 16
bit compilers, but these days on 32 bit systems, both int & long will be 32
bits. And on a 32 bit system, 32 bits will be enough to access all of
physical memory. (With the exception of certain hacks that Intel did for
some servers to increase memory by paging it.) So there will be no need for
64 bit 'long long' which would involve a more complicated modulo.

And on 64 bit systems (like the powerful chess programs all use and even
many hobbyists use), int's and long's are both going to be 64 bits. (Well,
actually that needs to be qualified a bit, since the C99 standard is a
little vague on how to handle all 8, 16, 32, 64 and 128 bit data sizes that
can be done on a 64 bit system.)

And no 64 bit system is going to have more than 2^64 bytes of memory, so
there's no need to get into 128 bit 'long long'.


> (c) If you manage to lock down all, or most, of the storage of
> a machine, you may perhaps be unpopular, in a multi-user environment,

Well DUH.

However, on a multi-user environment you are likely to have memory
limitations imposed so you can't use more than your fair share. In fact,
you may only be given a couple meg of physical memory and all the rest of
your data will be paged in & out as needed. Definitely not a good thing for
large hash tables.

If you are on a system without that restriction, then it's probably not a
problem for other reasons, such as verbal agreements between users, etc.
etc.

And if you are on a system without that restriction and you don't have
agreements etc. and you do it anyway and cause problems... (shrug) Tough.
That's your problem when the sysadmin catches you.

> with other users of the machine. *Your* program may not be swapping,
> but *theirs* are! [Even on a home PC, the other users may be *you*,
> trying to run other things, such as the engine that this program is
> playing against.]

No.

If you are running a chess program for a reason (such as a tournament), then
you know not to use it for other things.

If you intend to use it for other things, then just don't tell it to use all
your available memory for the hash. (Although your other programs will
reduce the cpu power available to the chess program, which may cost it a
full ply of search.)

And if you do want to run some other program and a chess program with
massive hash tables at the same time.... (shrug) Get a second computer.
It's not like they are expensive. A second computer can be bought for $400
that's more than good enough for basic usage, web surfing, etc. For just a
few dollars more, you could get a low end laptop instead.





        
Date: 24 Feb 2008 01:23:09
From: Andy Walker
Subject: Re: hash table size
In article <[email protected] >, Guest <[email protected]> wrote:
>> (a) Even in a well-written engine, you may want/need to "clear
>> out" the table from time to time. Initialising several terabytes
>> may take a little while even on a modern machine ....
>Actually, current programs don't do that. They leave the hash alone. They
>don't even initialize it when the program is first run.

I don't have access to the source code of the commercial
programs, so you seem to have an advantage in this respect. But:

>Either they actually use the data from the previous search, or they
>invalidate it at the time of access...
>There are several ways to invalidate at access... [...]

OK. But they all involve trade-offs, including extra computation
and/or extra data stored in the table; which way the trade-offs
balance would seem to depend on depth of search, expected number
of reads vs writes, relative speed of memory, etc., and it would
be at least a slight surprise if they balanced the same way for
all engines on all hardware for all time.

>> (b) Some of the standard operations on a hash table involve
>> indexing and finding remainders modulo the table size. [...]
>It is true that doing an AND is faster than a modulo. But the time
>difference isn't really all that significant considering how long main
>memory is going to take to access. [...]

OK, but DR's contention was that swapping was the *only*
reason not to have the table as large as possible. Presumably,
it is very likely that a table with [eg] 2^25 entries is better
than one with 2^25 + 1?

>And as for the 'ints' vs. 'longs'... Sorry. That was true in the days of 16
>bit compilers, but these days on 32 bit systems, both int & long will be 32
>bits. And on a 32 bit system, 32 bits will be enough to access all of
>physical memory. [...]

I was recently using a system with 32-bit ints and 64-bit
longs and pointers; I didn't think it particularly unusual. But
perhaps it was.

>> (c) If you manage to lock down all, or most, of the storage of
>> a machine, you may perhaps be unpopular, in a multi-user environment,
>Well DUH.
>However, on a multi-user environment you are likely to have memory
>limitations imposed so you can't use more than your fair share. In fact,
>you may only be given a couple meg of physical memory and all the rest of
>your data will be paged in & out as needed. Definitely not a good thing for
>large hash tables.

I don't think I've used a system like *that* since the days when
2M of physical memory was more than "elephantine"! ...

>[...]
>And if you are on a system without that restriction and you don't have
>agreements etc. and you do it anyway and cause problems... (shrug) Tough.
>That's your problem when the sysadmin catches you.

... And ever since those days I've *been* the sysadmin, at
least for the machine on my desk. But not really the point:

>If you are running a chess program for a reason (such as a tournament), then
>you know not to use it for other things.

OK.

>If you intend to use it for other things, then just don't tell it to use all
>your available memory for the hash. (Although your other programs will
>reduce the cpu power available to the chess program, which may cost it a
>full ply of search.)

Then we're back to contention with DR's claim that the larger
the better until swapping occurs. Personally, I've just acquired my
first "Vista" machine [no flowers, by request], with its annoying
habits of firing off all sorts of stuff to and from M$ without any
notice, suddenly deciding to re-arrange its discs, telling me at
awkward moments to insert such-and-such disc to do dumping, and so
on. But even my Unix box is liable to receive e-mail, or to start
"cron" jobs. None of these things seem likely to occupy that much
CPU, but they do load up some surprisingly large programs.

>And if you do want to run some other program and a chess program with
>massive hash tables at the same time.... (shrug) Get a second computer.
>It's not like they are expensive. A second computer can be bought for $400
>that's more than good enough for basic usage, web surfing, etc. For just a
>few dollars more, you could get a low end laptop instead.

Perhaps that's a left- vs right-pondian viewpoint. The machines
you can get for that sort of price over here really aren't "good enough"
at all. If they were merely last year's low-end models reduced to clear,
that might be different, but we only seem to get a teensy discount for
that. Instead, the very cheap ones are totally inadequate. YMMV.

As it happens, Prof Kraft [the OP] really gave us no idea what
hardware or software he was asking about, if any [eg, perhaps he is
writing his own program, or perhaps he needs background info for an
academic paper] so all bets are off. Your initial advice ["Basically,
run a simple test..."] to him is sound, of course. My only concern
was that one should not be quite so confident that "bigger is better"
in all [non-swapping] cases.

--
Andy Walker
Nottingham


         
Date: 24 Feb 2008 12:22:31
From: Guest
Subject: Re: hash table size
"Andy Walker" <[email protected] > wrote in message
news:[email protected]...
> In article <[email protected]>, Guest <[email protected]>
> wrote:
>>> (a) Even in a well-written engine, you may want/need to "clear
>>> out" the table from time to time. Initialising several terabytes
>>> may take a little while even on a modern machine ....
>>Actually, current programs don't do that. They leave the hash alone.
>>They
>>don't even initialize it when the program is first run.
>
> I don't have access to the source code of the commercial
> programs, so you seem to have an advantage in this respect. But:

I don't really either. But the methods have been widely discussed for
years.

(Well, I do have the same kind of access that everybody else has. The early
Fruit & Toga, Strelka (the reverse-engineered Rybka), Crafty, and so on.
However I generally make a point not to look at them. I much prefer people
talking about chess programming ideas and what they do rather than actually
looking at the code itself.)

Sure, you can clear out the memory. That'll work, and for most memory sizes
it won't take an excessive amount of time.

Or you can leave it in and use the results for the next search. That's been
done by programs too.

The current methods of invalidating at time of access have really only
become used in the last few years, when processors got so much faster than
memories.

In the older days, probably right up to the P3 days, processor speeds were
closer to main memory, and memory sizes were small enough that directly
clearing it wasn't a problem.

I don't think even the mainframe programs worried about it. (Although I
can't say for sure, since there weren't too many mainframe chess programs
left after the mid-80s.)

Right off hand, I don't know when the 'invalidate at access' ideas actually
came about.



>>Either they actually use the data from the previous search, or they
>>invalidate it at the time of access...
>>There are several ways to invalidate at access... [...]
>
> OK. But they all involve trade-offs, inc extra computation
> and/or extra data stored in the table; which way the trade-offs

Just an extra 64 bit XOR. Just a couple cycles per hash check. And
considering you'll probably be spending 200 to 400 cycles waiting for a
random access to main memory, it's negligible.

Some people prefer to use a few extra bits in the hash entry to hold a
search ID number. It depends on how much info you try to fit into a hash
entry and what cache line size you expect. Since you don't want a hash
entry (or group of entries) to span more than one cache line, you may have a
few extra bits to spare for a unique search ID.
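
For example, with 64-byte cache lines you might pack things something like
this (a sketch; the exact field split varies from program to program):

#include <stdint.h>

/* 16-byte entry, so four of them fill one 64-byte cache line and a
   whole bucket arrives with a single memory access.                */
typedef struct {
    uint32_t lock;       /* upper hash bits for verification          */
    uint32_t move;
    int16_t  score;
    uint8_t  depth;
    uint8_t  age_bound;  /* 6 bits of search ID, 2 bits of bound type */
    uint32_t spare;      /* e.g. static eval; keeps the size at 16    */
} TTEntry;

typedef struct {
    TTEntry entry[4];    /* one bucket == one cache line              */
} TTBucket;

/* compile-time check that a bucket really is one cache line */
typedef char bucket_is_one_cache_line[sizeof(TTBucket) == 64 ? 1 : -1];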

> balance would seem to depend on depth of search, expected number
> of reads vs writes, relative speed of memory, etc., and it would
> be at least a slight surprise if they balanced the same way for
> all engines on all hardware for all time.

That is true.

There are lots of ways to do things in a chess program.

And as I said above, it's only relatively recently that cpu's became so
much faster than main memory, so in the 'old days' it just didn't matter.


>>> (b) Some of the standard operations on a hash table involve
>>> indexing and finding remainders modulo the table size. [...]
>
>>It is true that doing an AND is faster than a modulo. But the time
>>difference isn't really all that significant considering how long main
>>memory is going to take to access. [...]
>
> OK, but DR's contention was that swapping was the *only*
> reason not to have the table as large as possible. Presumably,

It is the main reason. But not the only one.

> it is very likely that a table with [eg] 2^25 entries is better
> than one with 2^25 + 1?

Not really any great reason for it to be.

There would only be a few reasons it might.

1) The extra size of the hash table.

Not too likely one more entry is going to matter! Even doubling a large
hash is only going to help a few rating points once you get past a
reasonable size. (I don't have any current numbers for that.)


2) The odd size of the hash table might spread out the entries more or less
uniformly.

Some truth to that, but for chess hashes, which use random Zobrist keys,
it's not really a major issue. Probably wouldn't hurt, though.


3) It being easier to make a hash index out of the 64 bit hash value.

Working with powers of two (or +1 or -1 of 2^x) is certainly more
convenient. But not really a major problem.

True, a 64 bit divide is likely to take around 64 cycles. Some cpu's will
be faster and some slower. But slow. Some may even be "dead dog" slow.

But you (or the compiler) can probably schedule things so you can do some
other stuff while waiting for the divide to finish.

If you are on a system where you know it has a slow divide and you still
want an odd size hash, all you have to do is multiply by the reciprocal.
Since you'll be working with one size hash for the whole program run, it'll
be quick & easy to set up.

And since we are doing this for an index rather than a math answer, you won't
even need to do a 'fix-up' to get the exact right answer. It being off by
one will be tolerable since it will be consistently off by one for that
index. (But if you want to do the fix-up, it'd take only another few
cycles. Still much faster than a divide.)
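
A variation on that theme (not exactly the reciprocal trick, but the same
multiply-instead-of-divide idea): take the high 64 bits of hash * size.
That lands in 0..size-1 directly, with no divide and no fix-up. A sketch,
assuming the compiler has a 128-bit type (gcc on 64-bit targets does):

#include <stddef.h>
#include <stdint.h>

/* Map a 64-bit hash onto 0..size-1 with one multiply: the high half
   of hash * size.  For random Zobrist keys this is as uniform as a
   modulo, and it works for any table size, odd or not.             */
static inline size_t hash_to_index(uint64_t hash, uint64_t size)
{
    return (size_t)(((unsigned __int128)hash * size) >> 64);
}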

Do any current programs do it that way.... (shrug) I have no idea. I
haven't heard anybody actually say.

But it's a simple tweak so I would certainly assume at least some do if they
want to use odd size hashes. If you have 4g of memory, the OS and the
program are going to use some, so you'd be left with about 1.5g of unused
memory if you used only powers of two.

Or they may just go ahead and do a power of two but use a multi-bucket hash
entry. They may be able to fit 3 entries per cache line, so their table
size would actually be 3*2^x.


There's a lot of room for doing things different ways. It depends a lot on
personal preference and what you know about your target hardware.




>>And as for the 'ints' vs. 'longs'... Sorry. That was true in the days of
>>16
>>bit compilers, but these days on 32 bit systems, both int & long will be
>>32
>>bits. And on a 32 bit system, 32 bits will be enough to access all of
>>physical memory. [...]
>
> I was recently using a system with 32-bit ints and 64-bit
> longs and pointers; I didn't think it particularly unusual. But
> perhaps it was.

I'm not sure which you are meaning.

1) An unusual system where the cpu really does work with 32 bit ints but
extended pointers.

Well... there are quite a few non-x86 systems around. And not all of them
do things the same way.

Admittedly I am more familiar with regular x86 style consumer processors,
rather than RISC, mainframes, workstations, etc.

And the C89 & C99 standards provide quite a bit of flexibility in how big
things can be. You can indeed have a 32 bit system with a 64 bit address
space. The C standard doesn't really care what the system requires or even
what the compiler writer chooses to do (61 bit integers and 47 bit pointers
on an 8 bit micro, for example.)


2) Simply an extended 32 bit cpu, where the pointers have been hacked in the
cpu to be longer. Somewhat like the segment registers of the old 8088 or
the stuff Intel put into its servers before they were ready to release 64
bit cpus.

I haven't used those Intel servers, but I don't think you can access the
extra memory directly. You have to go through some hoops to access the extra
memory.

Would the cost be excessive for that increased hash size... Don't know.
That'll depend on the chess program & the cpu.

However, the extra cost for each access is not going to be large. It'd
probably be swamped by the slow memory access times. If main memory often
takes 300 cycles for a random access, then what is an extra 4-10 cycles?
Your program is still at a dead stop waiting for the hash entry to be
returned.


3) A real 64 bit cpu where the compiler just happens to use 32 bits for an
int but 64 for longs and pointers.

In this case, it doesn't matter. It's still all 64 bit hardware. It's not
like a 32 bit cpu working with 64 bit math.

The only issues you have to deal with are the unspecified & inconsistent
sizes of int & long. The C standard doesn't really care as long as
sizeof(short) <= sizeof(int) <= sizeof(long). You could have all of them be
64 bits, if you wanted. It's so hopelessly screwed up that the C99 standard
gave up and created brand new data types that you should use whenever size
is important. I wouldn't be surprised if the next C standard actually
started deprecating the use of 'int'.

So it is certainly possible you have a 64 bit system where the C compiler
does 32 bit ints and 64 bit longs. That's just the compiler. The hardware
is still in 64 bit mode, so it's not an issue.

You will likely have issues if you switch compilers, though. That's why
it's better to use the new data types that C99 specified. I generally never
use 'int' for anything except general vars, loop indexes, etc. where there
is never any possibility that it would even approach 32 bits. (Yeah, I am
kind of gambling that my programs will never be ported to a system that has
less than 32 bit integers. Which is probably safe....)
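
i.e. something like this, using the C99 <stdint.h> types (the type names
here are just examples of how I'd use them):

#include <stdint.h>

/* Fixed-width C99 types: the same size on every conforming compiler,
   unlike plain 'int' and 'long'.                                     */
typedef uint64_t HashKey;   /* 64-bit Zobrist key  */
typedef uint32_t Move;      /* packed move         */
typedef int16_t  Score;     /* evaluation score    */

/* plain 'int' is still fine for loop counters and other small values */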



But I guess I shouldn't have done a blanket statement like that. (grin) I
guess I sit corrected....


Now, having said all of that....


Hashes are going to be at least 64 bits. Nothing smaller is going to be
acceptable. 32 bits stopped being tolerable back in the 70s or early 80s, I
guess. 48 bits by the early 80s. At 64 bits, the false hash hit rate on
current processors is low enough that it's not a major concern (provided the
occasional false hash hit doesn't crash your engine. That's why some
authors always verify the suggested move from the hash.)

So you aren't really going to be getting into 'int' vs. 'long' anyway.

About the only issue would be whether or not your cpu has 64 bit modulo or
not. If it doesn't, then a power of two might be a better choice else you
might have to do a slower 'software' based modulo.

You can do a full cache line of hash buckets per entry. That way you get
the advantage of using more memory than a simple power of two while at the
same time keeping the hash->index conversion quick & simple.

If you have terabytes of memory, then you probably have a 64 bit cpu, with a
full 64 bit ALU with a full 64 bit divide (and mul) instruction.



>>> (c) If you manage to lock down all, or most, of the storage of
>>> a machine, you may perhaps be unpopular, in a multi-user environment,
>>Well DUH.
>>However, on a multi-user environment you are likely to have memory
>>limitations imposed so you can't use more than your fair share. In fact,
>>you may only be given a couple meg of physical memory and all the rest of
>>your data will be paged in & out as needed. Definitely not a good thing
>>for
>>large hash tables.
>
> I don't think I've used a system like *that* since the days when
> 2M of physical memory was more than "elephantine"! ...

Well, a couple meg of memory was just an example.

But no multi-user environment today is going to let one user hog all the
resources of the system. I don't think even the most primitive multi-user
systems in use today lets a user get away with that. They wouldn't be much
of an OS if they did.

If you are on a 1g system with 8 other people, no reasonable multi-user OS
is going to let one user have 900meg of that and starve the rest of the
users. Not without elevating their privileges.

The amount of physical memory each user gets is likely to be somewhat
variable, changing depending on the load etc. But the user is likely to
only be 'guaranteed' a small amount of memory. Everything else will be
paged to disk as needed and you can't really count on any particular amount
of memory for a hash.

So it wouldn't be a good idea for a user to even try to do a big hash. It
just wouldn't work well; the program would run slowly and it'd play terribly
because stuff would be paging to & from disk.

>>[...]
>>And if you are on a system without that restriction and you don't have
>>agreements etc. and you do it anyway and cause problems... (shrug) Tough.
>>That's your problem when the sysadmin catches you.
>
> ... And ever since those days I've *been* the sysadmin, at
> least for the machine on my desk. But not really the point:


>>If you are running a chess program for a reason (such as a tournament),
>>then
>>you know not to use it for other things.
>
> OK.
>
>>If you intend to use it for other things, then just don't tell it to use
>>all
>>your available memory for the hash. (Although your other programs will
>>reduce the cpu power available to the chess program, which may cost it a
>>full ply of search.)
>
> Then we're back to contention with DR's claim that the larger
> the better until swapping occurs. Personally, I've just acquired my
> first "Vista" machine [no flowers, by request], with its annoying

I got my first Vista system (a laptop) a few months ago.... Can't say I
like it.

> habits of firing off all sorts of stuff to and from M$ without any
> notice, suddenly deciding to re-arrange its discs, telling me at

You can turn off the background defrag. Along with much of the other
background & scheduled stuff.

> awkward moments to insert such-and-such disc to do dumping, and so

Never had Vista ask me for a disk.

> on. But even my Unix box is liable to receive e-mail, or to start
> "cron" jobs. None of these things seem likely to occupy that much
> CPU, but they do load up some surprisingly large programs.

Right.

On a highly variable system like that.... well, you are screwed, basically.
That's just the nature of having a system that isn't suitable for running a
processor intensive, memory hungry program.

About the best you can do is only use a few meg and hope that'll be enough
and that the OS won't page out even that small amount. Or you live with the
variability.


But on a system that is fairly stable, with no sudden background tasks etc.
(like my XP box, where very little happens without me doing it), you can get
pretty consistent memory & processor availability.


>>And if you do want to run some other program and a chess program with
>>massive hash tables at the same time.... (shrug) Get a second computer.
>>It's not like they are expensive. A second computer can be bought for
>>$400
>>that's more than good enough for basic usage, web surfing, etc. For just
>>a
>>few dollars more, you could get a low end laptop instead.
>
> Perhaps that's a left- vs right-pondian viewpoint. The machines
> you can get for that sort of price over here really aren't "good enough"

I don't know where you are. I'm in the U.S. A low end system with a 1.8ghz
AMD x2, 1g and 250g drive would be around $400. A little shopping around
would probably improve that somewhat. Not a massive cost. Eating out at a
fast food restaurant 15 times. Eating at a regular restaurant 10 times.
10-15 tanks of gas for the car.

Most people (and even families) can blow that much money in a month or two
and never even notice it.

The US, Canada, & UK should have roughly comparable average household
incomes.


> at all. If they were merely last year's low-end models reduced to clear,
> that might be different, but we only seem to get a teensy discount for
> that. Instead, the very cheap ones are totally inadequate. YMMV.

Depends on what you are wanting to do.

The 'average' person is going to surf the web, check email, do some instant
chatting, watch a few youtube videos, etc.

A more advanced person might do a little programming or video or photo
editing, or a game or two.

No serious number crunching or heavy 3d game playing. (Yes, they may like
heavy 3d games, but they probably aren't going to be playing one in the
short time the main system is occupied.)

For that, most low end systems *are* good enough to be a secondary computer.
That's why they are a secondary system, rather than a second main system.

Most low end systems have more computing power than many of the more
expensive systems of just a couple years ago. We just expect more from them
and the OS (Vista) takes up more of the power.


I have two computers at my desk. My laptop and a desktop. Before I got the
laptop, I had two desktop systems hooked up to a KVM switch. My main system
and a slower old secondary system. The second one usually didn't get used
unless I was doing something on the main one that I didn't want to
interrupt, or I was wanting to do a long term program that would run several
days and I didn't want to tie up the main system.


(As a side note, I've read that a lot of chess programmers actually use a
small collection of old or cheap computers to do their chess program tuning
test games. They can get a couple cheap low end computers and do more test
games that way than if they spent the same amount on a high end system.)


> As it happens, Prof Kraft [the OP] really gave us no idea what

I have no idea who the author was. I guess you recognised him, though.

> hardware or software he was asking about, if any [eg, perhaps he is

Since he was only talking about using up to 1.5g of hash memory, it
obviously was either a small desktop system or a laptop.

A high end desktop would have had more memory. Perhaps up to 4g. A server
would have had more. A mainframe or workstation would have had more.

So it probably was just a plain desktop or laptop.

> writing his own program, or perhaps he needs background info for an
> academic paper] so all bets are off. Your initial advice ["Basically,

If he's needing background info for a paper, then he's in the wrong
newsgroup! I don't think he wants to hear much about Sam Sloan etc. etc...

He should head over to TalkChess.com. That's where the chess programmers
hang out these days.


> run a simple test..."] to him is sound, of course. My only concern
> was that one should not be quite so confident that "bigger is better"
> in all [non-swapping] cases.

Maybe a blanket statement is a little unwise. Just because of the few
situations and systems and such where issues might arise.

But it's not too far off though. Close enough that you could safely say
something like:

"Using as much memory as what's available without paging is nearly always a
good idea."

Just the addition of the word 'nearly' covers the few cases where it
wouldn't.



>
> --
> Andy Walker
> Nottingham




      
Date: 21 Feb 2008 15:36:15
From: Guest
Subject: Re: hash table size
"David Richerby" <[email protected] > wrote in message
news:l5A*[email protected]...
> Anders Thulin <[email protected]> wrote:
>>> 4) One small table and one large table.
>>
>> But why stop there? The table is not static -- you don't only
>> do searches. You also have to insert new stuff, and weed out old.
>
> Insertion overwrites old data; it's not deleted `manually'. The hash
> table is operating as a cache, not a dictionary: if the data you want

Your pointing out that a chess hash table is done as a cache rather
than a dictionary, and his saying that a table isn't static, actually
explains a few things about his message.

I bet he's thinking along the lines of a linked list or tree or b-tree or
some sort of data structure that gets updated dynamically. Maybe even a
structure whose size and organization can change during the search.

Stuff gets inserted, removed, re-organized, searched, etc.

Data structures and algorithms that are simply not used in a chess hash
table.


Anders, a chess hash table is done as a fixed sized table. It's a straight
array with no other structure. It's not a linked list or a tree or a b-tree
or any other 'fancy' data structure. Just a basic linear vector array.

To access a hash entry, you do:

HashEntry = HashTable[HashIndex % HashSize];               /* one array lookup */
if (HashEntry.Lock == MAGIC(HashIndex)) return HashEntry;  /* verified hit     */
return NULL;                                               /* miss             */

To save it, you do:

OldEntry = HashTable[HashIndex % HashSize];
if (NewEntry.Draft >= OldEntry.Draft)        /* keep the deeper ("draft") entry */
    HashTable[HashIndex % HashSize] = NewEntry;

And so on.

If you do a hash entry that actually contains two or more (a fixed number of)
entries, then the code will be slightly more complicated, but still fairly
simple and a fixed cost regardless of the size of the table.
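
For example, with four entries per bucket (same made-up names as above), the
probe is still just a tiny fixed loop:

Bucket = HashTable[HashIndex % HashSize];        /* one bucket = a few entries */
for (i = 0; i < 4; i++)                          /* fixed count, fixed cost    */
    if (Bucket.Entry[i].Lock == MAGIC(HashIndex))
        return Bucket.Entry[i];
return NULL;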

Other modifications to the hash table will change things slightly, but
again, they will be of a fixed cost that is independent of the size of the
table.


It didn't occur to me until now that you might be thinking of too high a
level of data structures.

Since this is a computer chess newsgroup, most people in here are already
familiar with the basic algorithms, and I just assumed you were.





     
Date: 21 Feb 2008 11:57:48
From: Guest
Subject: Re: hash table size
"Anders Thulin" <[email protected] > wrote in message
news:[email protected]...
> Guest wrote:
>
>> The only overhead is:
>>
>> HashIndex = Hash % TableSize;
>
> and later:
>
> > All four cases behave the same regardless of table size.
>
> No, there's more to it. You show it yourself when you say:
>
>> 1) Check one entry.
> > 2) Check two entries.
>
> no further overhead here, true, except that the second may
> increase code space, and so cause other kinds of overhead.
> If it's implemented as a 1..2 loop, there is obviously
> counting overhead.

Say what???

A loop, if you did it that way, would only cost a few cycles. Say 10
cycles.

An access to main memory, which is essentially a random access at full cost,
will cost at least 200 cycles or even up to 400 cycles. Yeah, main memory
is very expensive!

Is an extra 10 cycles really significant?

And that has *nothing* to do with whether you are using 32m, 128m or even 1.5g of
memory.

That's an implementation detail of how you choose to implement the hash
table. It will be a constant and will not be influenced by how big the hash
table is. Which is what the original question was.... Which was whether he
should go ahead and use 1536m or stay with 512m.



>> 3) Check one hash line, which may contain two or more entries.
>
> and here you have (I presume) a bucket search, which may be more

Why a bucket search?? Due to cache line sizes (the only reason to do this
kind of hash organization), you'll only have 2 or almost certainly no more
than 4 entries. No need for any fancy search when a straight comparison
will do fine.

Don't go trying to make things more complex than you need to.

> or less complex depending on how you implement it. I don't see how
> you can claim no overhead in this situation.

I'm not claiming no overhead for how the hash table is implemented. I'm
claiming there would be no additional overhead (or penalty) if you increase
your cache size from 512m to 1.5g, which is what the original question was.

It's even what the thread subject says... "hash table size". Not "hash
table implementation cost".

Go read the original message.

I'll save you the trouble and quote it here...

***
How large should the hash table size be?
The larger the better?
Or is 1536 MB wasting hash memory?
And is 512 MB enough?
***

Again, his question wasn't about how to implement it. Whether one method
was better than the other or easier or whatever.

It was how much memory was enough. Whether using his whole 1.5g available
memory was a waste or whether he should cut it back to just 512m.


>
>> 4) One small table and one large table.
>
> But why stop there? The table is not static -- you don't only
> do searches. You also have to insert new stuff, and weed out old.
> In the single entry case, it's easy, but in the bucket hash it
> tends to get hairy: a fast search usually means a slow insert.

Which will be a constant cost regardless of how big the hash is.

Which is what the original question was.


> Now all that adds up: code has to execute to search the table,
> code hash to execute to change the table. As long as that code
> executes faster than the code the hash replaces, all is fine. But
> does it? If the hash table is poorly implemented, you may turn

You are talking more about bugs etc. than what the original question was
about.


> out to have a bucket hash with linear searches of the buckets,
> and one that resizes by increasing bucket chain length only.
> I've seen those in actual use -- and turned out by professional
> programmers, too.
>
> It doesn't matter one whit what theory says, if the programmer
> didn't do a good job implementing it.

And how is that at all relevant to the original question?


> In this particular case, we don't have a single clue to what code
> base the question applies to. I assume it applies to code base,

Assuming his hash table isn't outright broken, and assuming he hasn't gone
out of his way to implement it in an idiotic manner, then the cost of
accessing the hash entry in main memory will be larger than all the other
maintenance costs.

So, as long as his hash table manages to work, and do the job it's designed
for, it will be saving move generation costs, reducing the search bounds,
which makes the tree smaller which cuts out evals and move gens and
q-searches, etc...

So it really all comes down to:

1) Assuming his table isn't broken.
2) Assuming his table wasn't deliberately designed to be inefficient at large
memory sizes
3) the implementation details will affect the performance of the hash table
(ie: hit rate)
4) the cost of actually maintaining the table will be constant (for whatever
method he chooses) regardless of table size.
5) then it all comes down to his original question...

How about that.

Even after going through your totally off subject stuff, it all still comes
down to his original question...

Whether it was a waste of memory to use his full 1.5g of available memory or
just cut it down to 512m.


Unless your next message actually talks about the original subject (which is
also what the subject header says), I won't be replying to any more of your
trolling.

If you want to start a brand new thread that talks about implementation
ideas and methods and the overheads involved with each, and so on, then I
might be willing to read and participate in that discussion. Although
actually, it would be much more convenient to do this over at the TalkChess
forum. That's where the major chess programmers are. That way you could
get the feedback from dozens of chess programmers with 10 or 20 or more
years experience. I don't go there every day, but I'm sure there are many
others that would be willing to immediately discuss implementation issues.

But if you keep trying to change the meaning of the original question in
this thread.... bye.



> of course -- if the question is purely theoretical, one of the
> standard books on search algorithms will probably be the best source for
> a reply.
>
> --
> Anders Thulin anders*thulin.name http://www.anders.thulin.name/




   
Date: 20 Feb 2008 20:42:13
From: Gian-Carlo Pascutto
Subject: Re: hash table size
Anders Thulin wrote:

> I don't quite agree: every hash table entry imposes some
> administrative overhead that offsets the benefit of having a hash
> table.

What would that administrative overhead be?

--
GCP