This question came up while I was saving a large number of model-inferred embeddings to plain text. To do so, I needed to convert lists of float embeddings into strings, and I found this conversion to be surprisingly time-consuming.

Inspired by this discussion, I benchmarked four different methods for converting float arrays to strings. Surprisingly, orjson was the fastest, even though it is a third-party JSON serialization library.

This got me wondering: Is there a native Python method that can achieve performance comparable to orjson for converting lists of floats to strings?

Below are the commands I used for profiling, along with the results:

$ python -m pyperf timeit --fast -s 'x = [3141592653589793] * 100' 'str(x)'
Mean +- std dev: 4.79 us +- 0.06 us

$ python -m pyperf timeit --fast -s 'from orjson import dumps; x = [3141592653589793] * 100' 'dumps(x)'
Mean +- std dev: 2.70 us +- 0.02 us

$ python -m pyperf timeit --fast -s 'from json import dumps; x = [3141592653589793] * 100' 'dumps(x)'
Mean +- std dev: 8.03 us +- 0.31 us

$ python -m pyperf timeit --fast -s 'x = [3141592653589793] * 100' '"{}".format(x)'
Mean +- std dev: 4.94 us +- 0.16 us
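
For reference, the same four-way comparison can be sketched as a standalone script. This is only a rough sketch: it uses the standard-library `timeit` instead of pyperf, swaps the integer test value for a float (an assumption about what the intended workload looks like), and skips orjson if it is not installed. The `candidates` dict and the `bench` helper are illustrative names, not part of any library.

```python
import json
import timeit

# Assumed sample data: 100 floats, mirroring the list size used in the commands above.
x = [3.141592653589793] * 100

# The conversion methods being compared, keyed by a display label.
candidates = {
    "str(x)": lambda: str(x),
    "json.dumps(x)": lambda: json.dumps(x),
    '"{}".format(x)': lambda: "{}".format(x),
}

try:
    import orjson  # third-party; only benchmarked when available

    candidates["orjson.dumps(x)"] = lambda: orjson.dumps(x)  # returns bytes, not str
except ImportError:
    pass


def bench(number=2000, repeat=5):
    """Return the best observed time per call, in microseconds, for each candidate."""
    results = {}
    for name, fn in candidates.items():
        best = min(timeit.repeat(fn, number=number, repeat=repeat))
        results[name] = best / number * 1e6
    return results


if __name__ == "__main__":
    for name, us in bench().items():
        print(f"{name:20s} {us:8.2f} us per call")
```

Absolute numbers from a script like this will differ from the pyperf figures above, so only the relative ordering on one machine is meaningful.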
  • You linked to the very discussion that answers your question, and your own results show that either a) there have already been improvements or b) this microbenchmark is too sensitive to noise. Your results show that str(x) is less than twice as slow as orjson. Back in 2022 it was 3-4 times slower. orjson is faster because it's written in Rust and uses a different implementation. The discussion links to other fast libraries. Commented Sep 12 at 10:36
  • As for "Surprisingly ... even though it's a third-party": it's quite the opposite. External libraries can be faster because they can use different implementations without breaking compatibility. Converting floats to strings isn't trivial. The linked discussion is about replacing the existing C algorithm with newer and faster algorithms like Ryu and Dragonbox. orjson may be using very different string management mechanisms too. Allocating and releasing memory is expensive, so fast serializers pre-allocate and reuse buffers. Commented Sep 12 at 10:47
  • Why are you talking about floats but measuring with ints? Please fix the wrong one. Commented Sep 12 at 12:51
  • What about serializing to bytes instead of strings? Commented Sep 12 at 16:22
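
On the bytes suggestion in the last comment, here is a minimal sketch using only the standard-library `array` module. It assumes the saved embeddings do not need to be human-readable and will be read back on a machine with the same byte order; the sample values are made up for illustration.

```python
from array import array

# Made-up sample embedding; real data would be the model's float outputs.
x = [3.141592653589793, 2.718281828459045, 1.0]

# Pack the floats as raw C doubles in native byte order (no text formatting at all).
packed = array("d", x)
payload = packed.tobytes()

# Read them back: the round trip is exact because no decimal conversion happens.
restored = array("d")
restored.frombytes(payload)
assert restored.tolist() == x
```

This avoids the float-to-decimal conversion entirely, which is the expensive step discussed in the comments above, at the cost of producing a binary file rather than plain text.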
