Python equivalent for Java hashCode() function

Question

I have an A/B test split based on the result of Java hashCode() function applied to user's id (a string). I want to emulate that split in my dataframe to analyse the results. Is there a python equivalent for that function? Or maybe a documentation on the specific hashing algorithm inside hashCode() so I can produce that function myself?

Thanks

I searched for the documentation but couldn't find the specific details

You mean you want to reproduce exactly the same results as Java's String.hashCode in Python? — Sweeper
– Sweeper, Commented Jun 29, 2023 at 13:36
You can call hash(obj) for any object in Python. Mutable types (dict, list, etc) are however unhashable. Custom classes can define the behavior by implementing a __hash__ method. Default for custom classes is based on object identity (memory location) with hash(obj) == hash(id(obj)//16) — user2390182
– user2390182, Commented Jun 29, 2023 at 13:37

user2390182 · Accepted Answer · 2023-06-29 17:06:54Z

6

According to java String source code, the hash implementation is:

public int hashCode()
    {
      if (cachedHashCode != 0)
        return cachedHashCode;
    
      // Compute the hash code using a local variable to be reentrant.
      int hashCode = 0;
      int limit = count + offset;
      for (int i = offset; i < limit; i++)
        hashCode = hashCode * 31 + value[i];
      return cachedHashCode = hashCode;
    }

You can transfer this to Python (w/o caching):

class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = hashCode * 31 + ord(char)
        return hashCode

>>> j = JavaHashStr("abcd")
>>> hash(j)
2987074  # same as java
>>> j = JavaHashStr("abcdef")
>>> hash(j)
2870581347  # java: -1424385949

Note, Python ints do not overflow like java, so this is wrong for many cases. You would have to add a simulation for the overflow (Update: thx to @PresidentJamesK.Polk for the improved version, SO thread on the topic):

class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = (hashCode * 31 + ord(char)) & (2**32 - 1)  # unsigned
        if hashCode & 2**31:
            hashCode -= 2**32  # make it signed
        return hashCode

Now, even overflowing hashes behave the same:

>>> j = JavaHashStr("abc")
>>> hash(j)
96354
>>> j = JavaHashStr("abcdef")
>>> hash(j)
-1424385949  # Java hash for "abcdef"

This might still be off for characters from the latter unicode panes like emojis or the like. But for the most common punctuation and latin-based characters, this should work.

edited Jun 29, 2023 at 17:06

answered Jun 29, 2023 at 14:02

user2390182

73.7k6 gold badges71 silver badges95 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

JonSG Over a year ago

I verified this works for the string "The Quick Brown Fox" resulting in -732416445

President James K. Polk Over a year ago

hashCode = (hashCode * 31 + ord(char)) & 0xffffffff should produce the same bit value in hashCode in python as in Java, you can convert to a signed value at the end.

user2390182 Over a year ago

@PresidentJamesK.Polk Good improvement. Used it in the updated version.

user2390182 Over a year ago

@KellyBundy lol, no more hex for me. Off-by-one erros definitely one of the 2 hard things in IT.

JonSG · Accepted Answer · 2023-06-29 14:31:27Z

1

This produces the same results for the string "The Quick Brown Fox" as the answer by @user2390182 and an online tool I tried. I think it is a little easier to follow but it might be slower, not sure. You might want to test it for performance if that was critical.

def java_hasher(text):
    size = 32
    sign = 1 << size-1
    text_hashed = sum(ord(t)*31**i for i, t in enumerate(reversed(text)))
    return (text_hashed & sign-1) - (text_hashed & sign)
    
print(java_hasher("The Quick Brown Fox"))

That should give you: -732416445

edited Jun 29, 2023 at 14:31

answered Jun 29, 2023 at 14:17

JonSG

13.6k2 gold badges32 silver badges48 bronze badges

Collectives™ on Stack Overflow

Python equivalent for Java hashCode() function

2 Answers 2

4 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related