
I am currently processing time series data stored in HDF5 (.h5) files, each file containing one hour of data.

In order to move towards real time processing, I would like to process time series data, one second at a time. The plan is to aggregate one second of data, process the data, clear the cache and repeat.
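
To make the plan concrete, here is a rough sketch of the per-second loop I have in mind (handle_sample and process_one_second are placeholders for the acquisition callback and the actual processing, not code I already have):

import time

buffer = []                        # samples collected during the current second
window_start = time.monotonic()

def process_one_second(samples):
    # Placeholder: this is where the numpy/pandas (or MySQL) work would happen
    pass

def handle_sample(sample):
    """Called for every incoming sample by the acquisition loop."""
    global window_start
    buffer.append(sample)
    if time.monotonic() - window_start >= 1.0:
        process_one_second(buffer)
        buffer.clear()             # "clear the cache" and start the next second
        window_start = time.monotonic()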

My first idea was to do this with numpy arrays or a pandas DataFrame, but a colleague suggested caching the data in a MySQL database instead.

In order to benchmark the performance of each approach, I ran a simple timing exercise, trying to access 1,000 samples:

Method Execution time
Pandas 1.36 µs
Numpy 790 ns
MySQL 552 ns

The code used to obtain these results is detailed below.

From this limited exercise, the MySQL approach looks like the winner. However, since most of the processing relies on numpy and pandas functions anyway, I am not sure it would make much sense to cache the data in a database before writing it to a numpy array or a pandas DataFrame.

So here's my question: apart from improved performance, what are the benefits of using a MySQL database to cache data?


Benchmark

import pandas as pd
import numpy as np
import mysql.connector
from timeit import timeit

Pandas dataframe:

df = pd.DataFrame()
df['test'] = np.arange(1,1000)
%timeit df['test']

This returns 1.36 µs ± 26.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Numpy array:

%timeit np.arange(1,1000)

This returns 790 ns ± 21.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

MySQL database:

cnx = mysql.connector.connect(user='root', password='',
                              host='127.0.0.1',
                              database='mydb')

try:
    cursor = cnx.cursor()
    cursor.execute("""
        select * from dummy_data
    """)
    %timeit result_mysql = [item[0] for item in cursor.fetchall()]
finally:
    cnx.close()

This yields 552 ns ± 26.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

  • These timings don't look right to me. I find it hard to believe that you can query a full table across the MySQL server in less time than it takes numpy to initialise an array of 1000 values. Equally, %timeit df['test'] isn't doing any assignment; it's just referencing a column. Commented Jan 9, 2023 at 19:03
  • Also, the premise doesn't seem right to me. Why aren't you doing the aggregation by second inside a single pandas dataframe or a MySQL query? Commented Jan 9, 2023 at 19:05
  • Thanks for your comment @roganjosh: indeed, the times that I tried comparing are the times needed to "query" the data, not those needed to populate a table (e.g. write data into a MySQL table). Commented Jan 9, 2023 at 19:09
  • Please re-read my comment; you appear to have the opposite understanding to what I was saying. "I find it hard to believe that you can query a full table across the MySQL server in less time than it takes numpy to initialise an array of 1000 values" Commented Jan 9, 2023 at 19:11
  • If I understand your comment properly, it means that I cannot compare %timeit result_mysql = [item[0] for item in cursor.fetchall()] with %timeit df['test'], because the latter is just referencing a column. How do you suggest assessing the performance of pandas then? Commented Jan 9, 2023 at 19:17

1 Answer


There are two parts to this answer.


Timings

The first thing that should stand out here is that there is no way that a list comprehension should beat the initialisation of a numpy array. This is immediately suspicious:

%timeit np.arange(1,1000)
790 ns ± 21.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit result_mysql = [item[0] for item in cursor.fetchall()]
552 ns ± 26.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

It's not impossible that Python might beat numpy on something like this, but the probability is vanishingly small (perhaps a regression in a new numpy release?).

The issue here is that the result set behind cursor.fetchall() can only be consumed once: the first of the 1,000,000 timed iterations does the real work and exhausts it, and the remaining 999,999 iterations are effectively timing [item[0] for item in ()].

This becomes more obvious when you run something like this:

import numpy as np

def create_array():
    # Build a fresh 999-element array, mirroring np.arange(1, 1000) above
    a = np.arange(1, 1000)


def list_comp(tups):
    # Mirror the list comprehension applied to cursor.fetchall()
    a = [item[0] for item in tups]


# 999 one-element tuples, i.e. the shape of the rows a cursor would return
test = [(x,) for x in range(999)]

# Results

%timeit create_array()
2.12 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit list_comp(test)
61.7 µs ± 4.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
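
A fairer version of your MySQL timing would put the execute and the fetch inside the timed call, so every iteration pays for the full round trip instead of re-reading an already-exhausted result set (a sketch only, re-using the connection details from your question):

import mysql.connector
from timeit import timeit

cnx = mysql.connector.connect(user='root', password='',
                              host='127.0.0.1', database='mydb')
cursor = cnx.cursor()

def query_mysql():
    # Run the query and build the list on every call
    cursor.execute("select * from dummy_data")
    return [item[0] for item in cursor.fetchall()]

print(timeit(query_mysql, number=1000))  # total seconds for 1000 full queries
cnx.close()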

Actual Approach

I don't really understand your use of "cache" here. There are a couple of things that don't add up for me:

  1. You throw the intermediate data away - that's not a cache.
  2. You don't re-use any pre-allocated memory (less important in Python than in compiled languages, but it can still make a difference) - that's not a cache either; see the sketch below.
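
A genuine cache along the lines of point 2 would re-use a pre-allocated buffer and simply overwrite it every second, roughly like this (a sketch, assuming a fixed sample rate):

import numpy as np

SAMPLES_PER_SECOND = 1000             # assumed; adjust to the real sample rate
buf = np.empty(SAMPLES_PER_SECOND)    # allocated once, re-used every second

def load_second(samples):
    # Overwrite the existing buffer in place instead of allocating a new array
    n = len(samples)
    buf[:n] = samples
    return buf[:n]

Whether that actually pays off depends on how stable the number of samples per second is.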

Unless pulling in all the data at once would blow up your RAM, I don't see why you're processing it like this. Both pandas and MySQL will benefit from bulk analysis that is then reduced down to per-second results. Even with the flawed timings from your own investigation in mind, MySQL might actually beat pandas on speed, especially if the data is too large to hold in memory.
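
Concretely, the bulk-then-reduce pattern in pandas is just a resample over the whole hour; the HDF5 key and the column names below are assumptions about your data, not something taken from your question:

import pandas as pd

# Load one hour of data, then reduce it to per-second results in one pass
df = pd.read_hdf('one_hour.h5', key='data')                  # assumed key
df = df.set_index(pd.to_datetime(df['timestamp']))           # assumed column
per_second = df['value'].resample('1s').agg(['mean', 'max', 'count'])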

Bottom line - these speed tests are not suitable on their own to determine anything about what's most appropriate for your actual application. Just don't use MySQL as interim storage on a per-second basis.


2 Comments

Thanks for your detailed answer @roganjosh. The Timings sanity check was really insightful. I might not have been clear enough about my actual approach, though: I know that I can use pandas to "bulk process" my dataset; in fact, this is what I am currently doing. The reason I want to analyze one second of data at a time is to be able to deliver processed results in quasi-real time. I was simply wondering about the best way to stream, aggregate and process these data over short intervals; based on your answer, I will stay away from using a MySQL database for the time being.
Just to elaborate on my second point of "not a cache" in that case - every time you make a new array, numpy needs to make an allocation in memory, which takes time. If you're always going to have 1000 elements, it would be better just to overwrite the data in the existing array each time to speed things up. I suspect, though, that the number of signals will be highly variable.
