I am currently processing time series data stored in HDF5 (.h5) files, each file containing one hour of data.
In order to move towards real-time processing, I would like to process the time series one second at a time: aggregate one second of data, process it, clear the cache, and repeat.
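A minimal sketch of the loop I have in mind (assuming a hypothetical `read_samples` iterator that yields `(timestamp, value)` pairs in real time):

    import numpy as np

    def one_second_batches(read_samples):
        """Group incoming (timestamp, value) samples into one-second batches."""
        buffer, current_second = [], None
        for ts, value in read_samples:
            second = int(ts)
            if current_second is None:
                current_second = second
            if second != current_second:
                # a full second has been collected: hand it off, then clear the cache
                yield current_second, np.asarray(buffer)
                buffer, current_second = [], second
            buffer.append(value)
        if buffer:
            # flush the final, possibly partial, second
            yield current_second, np.asarray(buffer)

Each one-second batch would then be processed with the usual numpy/pandas functions before being discarded.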
My first idea was to do this with numpy arrays or a pandas DataFrame, but a colleague suggested caching the data in a MySQL database instead.
In order to benchmark the performance of each approach, I ran a simple timing exercise: accessing roughly 1,000 samples with each method.
| Method | Execution time |
|---|---|
| Pandas | 1.36 µs |
| Numpy | 790 ns |
| MySQL | 552 ns |
The code used to obtain these results is detailed below.
From this limited exercise, it looks like the MySQL approach is the winner. But since most of the processing relies on numpy and pandas functions anyway, I am not sure it would make much sense to cache the data in a database only to load it back into a numpy array or a pandas DataFrame.
So here's my question: apart from improved performance, what are the benefits of using a MySQL database to cache data?
Benchmark
    import pandas as pd
    import numpy as np
    import mysql.connector
    from timeit import timeit  # unused below: %timeit is the IPython magic, not this function
Pandas DataFrame:

    df = pd.DataFrame()
    df['test'] = np.arange(1, 1000)  # 999 samples: 1 through 999
    %timeit df['test']               # note: this only references the column

This returns 1.36 µs ± 26.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Numpy array:

    %timeit np.arange(1, 1000)  # note: this times array creation, not element access

This returns 790 ns ± 21.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
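As an aside, these two timings measure different operations: `df['test']` is a column lookup, while `np.arange(1, 1000)` constructs a new array on every loop. A more like-for-like comparison would pre-build both containers and time the same read on each; a minimal sketch in IPython, reusing the `df` built above:

    data = np.arange(1, 1000)

    %timeit data[500]             # numpy: single-element read
    %timeit df['test'].iloc[500]  # pandas: the same read via the DataFrame
    %timeit data.sum()            # numpy: whole-column reduction
    %timeit df['test'].sum()      # pandas: the same reduction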
MySQL database:

    cnx = mysql.connector.connect(user='root', password='',
                                  host='127.0.0.1',
                                  database='mydb')
    try:
        cursor = cnx.cursor()
        cursor.execute("""
            select * from dummy_data
        """)
        %timeit result_mysql = [item[0] for item in cursor.fetchall()]
    finally:
        cnx.close()

This yields 552 ns ± 26.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
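One caveat with this measurement: `fetchall()` consumes the cursor's result set, so the repeated `%timeit` iterations are probably not measuring a round trip to MySQL at all. A fairer timing would put the query itself inside the timed statement; a minimal sketch against the same `dummy_data` table:

    def fetch_column(cnx):
        """Run the query and materialise the first column as a list."""
        cursor = cnx.cursor()
        cursor.execute("select * from dummy_data")
        values = [item[0] for item in cursor.fetchall()]
        cursor.close()
        return values

    %timeit fetch_column(cnx)  # times query + transfer + conversion together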
Comments:

- "`%timeit df['test']` isn't doing any assignment, it's just referencing a column."
- "You can't compare `%timeit result_mysql = [item[0] for item in cursor.fetchall()]` with `%timeit df['test']`, because the latter is just referencing a column."
- My reply: "How do you suggest to assess the performance of pandas then?"