3

I have an A/B test split based on the result of Java hashCode() function applied to user's id (a string). I want to emulate that split in my dataframe to analyse the results. Is there a python equivalent for that function? Or maybe a documentation on the specific hashing algorithm inside hashCode() so I can produce that function myself?

Thanks

I searched for the documentation but couldn't find the specific details

3
  • You mean you want to reproduce exactly the same results as Java's String.hashCode in Python? Commented Jun 29, 2023 at 13:36
  • 1
    You can call hash(obj) for any object in Python. Mutable types (dict, list, etc) are however unhashable. Custom classes can define the behavior by implementing a __hash__ method. Default for custom classes is based on object identity (memory location) with hash(obj) == hash(id(obj)//16) Commented Jun 29, 2023 at 13:37
  • @Sweeper, yes, exactly that Commented Jun 29, 2023 at 13:41

2 Answers 2

6

According to java String source code, the hash implementation is:

public int hashCode()
    {
      if (cachedHashCode != 0)
        return cachedHashCode;
    
      // Compute the hash code using a local variable to be reentrant.
      int hashCode = 0;
      int limit = count + offset;
      for (int i = offset; i < limit; i++)
        hashCode = hashCode * 31 + value[i];
      return cachedHashCode = hashCode;
    }

You can transfer this to Python (w/o caching):

class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = hashCode * 31 + ord(char)
        return hashCode

>>> j = JavaHashStr("abcd")
>>> hash(j)
2987074  # same as java
>>> j = JavaHashStr("abcdef")
>>> hash(j)
2870581347  # java: -1424385949

Note, Python ints do not overflow like java, so this is wrong for many cases. You would have to add a simulation for the overflow (Update: thx to @PresidentJamesK.Polk for the improved version, SO thread on the topic):

class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = (hashCode * 31 + ord(char)) & (2**32 - 1)  # unsigned
        if hashCode & 2**31:
            hashCode -= 2**32  # make it signed
        return hashCode

Now, even overflowing hashes behave the same:

>>> j = JavaHashStr("abc")
>>> hash(j)
96354
>>> j = JavaHashStr("abcdef")
>>> hash(j)
-1424385949  # Java hash for "abcdef"

This might still be off for characters from the latter unicode panes like emojis or the like. But for the most common punctuation and latin-based characters, this should work.

Sign up to request clarification or add additional context in comments.

4 Comments

I verified this works for the string "The Quick Brown Fox" resulting in -732416445
hashCode = (hashCode * 31 + ord(char)) & 0xffffffff should produce the same bit value in hashCode in python as in Java, you can convert to a signed value at the end.
@PresidentJamesK.Polk Good improvement. Used it in the updated version.
@KellyBundy lol, no more hex for me. Off-by-one erros definitely one of the 2 hard things in IT.
1

This produces the same results for the string "The Quick Brown Fox" as the answer by @user2390182 and an online tool I tried. I think it is a little easier to follow but it might be slower, not sure. You might want to test it for performance if that was critical.

def java_hasher(text):
    size = 32
    sign = 1 << size-1
    text_hashed = sum(ord(t)*31**i for i, t in enumerate(reversed(text)))
    return (text_hashed & sign-1) - (text_hashed & sign)
    
print(java_hasher("The Quick Brown Fox"))

That should give you: -732416445

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.