
I have a rather large dataset consisting of 2.3 GB of data spread over 160 million byte[] arrays, with an average length of 15 bytes. The value for each byte[] key is only an int, so nearly half of the hashmap's memory usage (which is over 6 GB) is made up of the 16-byte overhead of each byte array:

overhead = 8-byte object header + 4-byte length field, rounded up by the VM to 16 bytes.

So my overhead is 2.5GB.
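(Working that through from the numbers above: 160,000,000 arrays × 16 bytes of per-array overhead ≈ 2.56 GB of pure bookkeeping.)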

Does anybody know of a hashmap implementation that stores its (variable-length) byte[] keys in one single large byte array, so there would be no per-key overhead (apart from a 1-byte length field)?

I would rather not use an in-memory DB, as those usually have a performance overhead compared to the plain Trove TObjectIntHashMap I'm using, and I value CPU cycles even more than memory usage.

Thanks in advance

  • Arrays don't override equals() or hashCode(); does your Trove hashmap provide these for you? Commented Jan 25, 2017 at 1:14
  • Why are you using byte[] as the key of your hashmap? Commented Jan 25, 2017 at 1:14
  • What JVM are you using? My impression is that the overhead will only be 12 bytes if the length of the byte[] is odd, because it's just that the whole structure is rounded up to a multiple of 8. Commented Jan 25, 2017 at 1:15
  • I override equals() and hashCode() with new TObjectIntCustomHashMap<byte[]>(new HashingStrategy<byte[]>() {...}) -- see the sketch below. I'm using byte[] because that's what my data is! And the byte[] size is not always odd. Commented Jan 25, 2017 at 1:18
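
For reference, a minimal sketch of the setup that comment describes, assuming Trove 3.x. TObjectIntCustomHashMap and HashingStrategy are the real Trove types; delegating to java.util.Arrays is my assumption about how the strategy is filled in:

import java.util.Arrays;
import gnu.trove.map.custom_hash.TObjectIntCustomHashMap;
import gnu.trove.strategy.HashingStrategy;

public class ByteArrayKeyedMap {
    public static void main(String[] args) {
        // Content-based hashing and equality for byte[] keys via a custom strategy.
        TObjectIntCustomHashMap<byte[]> map =
                new TObjectIntCustomHashMap<>(new HashingStrategy<byte[]>() {
                    @Override public int computeHashCode(byte[] key) {
                        return Arrays.hashCode(key);   // hash by content, not identity
                    }
                    @Override public boolean equals(byte[] a, byte[] b) {
                        return Arrays.equals(a, b);    // compare by content, not identity
                    }
                });
        map.put(new byte[] {1, 2, 3}, 42);
        System.out.println(map.get(new byte[] {1, 2, 3})); // 42, even for a distinct array instance
    }
}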

1 Answer


Since most personal computers have 16 GB these days, and servers often 32-128 GB or more, is there a real problem with a degree of bookkeeping overhead?

If we consider the alternative -- the byte data concatenated into one single large array -- we should think about what individual keys would have to look like, in order to reference a slice of that larger array.

Most generally you would start with:

public class ByteSlice {
    protected byte[] array;   // reference to the shared backing array
    protected int offset;     // where this slice starts
    protected int len;        // how many bytes it spans
}

However, that's 8 bytes of int fields + an array reference (perhaps just 4 bytes, with compressed oops) + the JVM object header (12 bytes on a 64-bit JVM). So perhaps 24 bytes in total.

If we try and make this single-purpose & minimalist, we're still going to need 4 bytes for the offset.

public class DedicatedByteSlice {
    protected int offset;     // where this slice starts in the shared array
    protected byte len;       // 1-byte length, as the question suggests

    protected static byte[] sharedArray;   // somebody else knows about (and fills) the array

    protected static byte[] getArray() { return sharedArray; }
}

This is still 5 bytes of instance fields + the 12-byte JVM object header; with the usual 8-byte object alignment, that probably comes to 24 bytes in total.
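
Rather than estimating, these layouts can be measured -- a minimal sketch assuming the OpenJDK JOL (Java Object Layout) jol-core library is on the classpath:

import org.openjdk.jol.info.ClassLayout;

public class SizeCheck {
    public static void main(String[] args) {
        // Prints field offsets, padding, and total instance size for each class.
        System.out.println(ClassLayout.parseClass(ByteSlice.class).toPrintable());
        System.out.println(ClassLayout.parseClass(DedicatedByteSlice.class).toPrintable());
    }
}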

It seems that the cost of dereferencing with an offset & length, and having an object to track that, are not substantially less than the cost of storing the small array directly.

One further theoretical possibility -- de-structuring the Map Key so it is not an object

It is possible to conceive of destructuring the "length & offset" data so that it no longer lives in an object. It would then be passed as a set of scalar parameters, e.g. (offset, length), and -- in a hashmap implementation -- stored as arrays of the separate components (e.g. parallel int[] and byte[] arrays, instead of a single Object[] keyArray).

However I expect it is extremely unlikely that any library offers an existing hashmap implementation for your (very particular) usecase.

If you were talking about the values, it would probably be pointless, since Java offers neither multiple return values nor OUT parameters, which makes communication impractical without "boxing" the destructured data back into an object. But since you are asking about map keys specifically, and these are passed as parameters but need not be returned, such an approach is at least theoretically possible.

[Extended] Even given this, it becomes tricky -- for your use case the map API probably has to become asymmetrical for population vs. lookup: population has to be by (offset, len) to define keys, whereas practical lookup is likely still by concrete byte[] arrays.
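
To make the idea concrete, here is a minimal sketch of such a destructured map. All names are made up; it omits resizing, removal, pool-overflow checks, and keys longer than 255 bytes. Keys live once in one shared pool, and each table slot holds only an int offset, a 1-byte length, and the int value:

import java.util.Arrays;

// Sketch only: open addressing with linear probing, no resize/removal,
// key length limited to 255 bytes (stored in one byte, as the question asks).
public class PackedByteArrayIntMap {
    private final byte[] pool;      // the single large array holding all key bytes
    private int poolSize;           // number of pool bytes used so far
    private final int[] offsets;    // per slot: key offset into pool, or -1 if empty
    private final byte[] lengths;   // per slot: key length (unsigned byte)
    private final int[] values;     // per slot: the mapped int
    private final int mask;

    public PackedByteArrayIntMap(int capacity /* power of two */, int poolBytes) {
        pool = new byte[poolBytes];
        offsets = new int[capacity];
        lengths = new byte[capacity];
        values = new int[capacity];
        Arrays.fill(offsets, -1);
        mask = capacity - 1;
    }

    // Population: the key bytes are copied into the pool exactly once.
    public void put(byte[] key, int value) {
        int slot = findSlot(key);
        if (offsets[slot] < 0) {                       // new key: append to pool
            System.arraycopy(key, 0, pool, poolSize, key.length);
            offsets[slot] = poolSize;
            lengths[slot] = (byte) key.length;
            poolSize += key.length;
        }
        values[slot] = value;
    }

    // Lookup is still by a concrete byte[], as discussed above.
    // Returns -1 when the key is absent (a sketch-level sentinel).
    public int get(byte[] key) {
        int slot = findSlot(key);
        return offsets[slot] < 0 ? -1 : values[slot];
    }

    private int findSlot(byte[] key) {
        int slot = hash(key) & mask;
        while (offsets[slot] >= 0 && !matches(slot, key)) {
            slot = (slot + 1) & mask;                  // linear probing
        }
        return slot;
    }

    private boolean matches(int slot, byte[] key) {
        if ((lengths[slot] & 0xFF) != key.length) return false;
        int off = offsets[slot];
        for (int i = 0; i < key.length; i++) {
            if (pool[off + i] != key[i]) return false;
        }
        return true;
    }

    private static int hash(byte[] key) {
        int h = Arrays.hashCode(key);                  // content-based hash
        return h ^ (h >>> 16);                         // spread high bits into the mask
    }
}

Per occupied slot this costs 9 bytes (4 + 1 + 4) plus the key bytes themselves and the empty-slot slack from the load factor -- compared to the ~16 bytes of per-array overhead plus a reference per key in the current setup.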

OTOH: Even quite old laptops have 16GB now. And your time to write this (times 4-10 to maintain) should be worth far more than the small cost of extra RAM.


1 Comment

Yeah, I was thinking of a simple get(int offset, byte len). The problem then is the hash function: how would you translate a byte[] key into an offset in the single large byte[] array without partitioning the array into fixed-length "buckets"? Those buckets would have to be as large as the largest key, which would waste lots of space, since the average key size is much smaller than the max key size. I have 32 GB of RAM, but I'm using the rest of it for other data during processing (too much to explain), hence I need to be frugal with my memory usage.
