
I work with large amounts of data that I process with a TensorFlow Dataset (TFDS) and collect into a pandas.DataFrame. My goal is to convert the data from one format to another for further analysis. But when I create a DataFrame with a large number of columns (~8500), RAM fills up quickly and the process terminates with an out-of-memory error.

Current code:

import tensorflow as tf
import pandas as pd
import numpy as np
from tqdm import tqdm

# filtered_ranking_table and dataset are defined earlier in the pipeline
datapoint_indices = [x[0] for x in filtered_ranking_table]

# Empty DataFrame to store results
column_names = ["class"]
column_names += [f'datapoint_{i}' for i in datapoint_indices]
# df = pd.DataFrame(columns=column_names)
# max_rows = 114003  # or some other upper limit
# df = pd.DataFrame({name: [None] * 162078 for name in column_names})

# Trying to create a DataFrame with a fixed number of rows
# max_rows = 114003  # Row limit
# df = pd.DataFrame(index=range(max_rows), columns=column_names)

# Pre-allocate a fixed number of rows (162078), filled with NaN
df = pd.DataFrame({name: [np.nan] * 162078 for name in column_names})

for datapoint_n, clusters in tqdm(dataset.take(114003), total=114003):
    if datapoint_n.numpy() in datapoint_indices:
        prev_index = len(df)  # Current length of df
        for i, cluster in enumerate(clusters):
            cluster = cluster.numpy()
            cluster = [x for x in cluster if x != 0]
            df.loc[prev_index:prev_index + len(cluster) - 1, 'class'] = i
            df.loc[prev_index:prev_index + len(cluster) - 1, f'datapoint_{datapoint_n}'] = pd.Series(cluster, index=range(prev_index, prev_index + len(cluster)))
            prev_index += len(cluster)

df = df.dropna(how='all')
df = df.astype({"class": int})

What I've tried so far:

  • Creating an empty DataFrame with a fixed number of rows (max_rows) and a dynamic set of columns (built from datapoint_indices).
  • Filling the data column by column in blocks with a for loop, as in the code above; this works for a small number of columns but runs out of RAM at 8500+ columns.

Questions:

  1. How can this process be optimised to reduce memory consumption?
  2. Is there any way to write the data directly to a file (like Parquet, CSV or HDF5) instead of loading it into RAM?
  3. What approaches can help with this amount of data and number of columns?

Any tips on optimisation or approaches to save the data directly to a file would be appreciated.
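
For question 2, this is roughly the kind of incremental write I have in mind, sketched with pyarrow's ParquetWriter; the schema, file name and placeholder chunks here are purely illustrative, not my real data or layout:

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema: one "class" column plus a single datapoint column,
# just to illustrate appending row groups without holding everything in RAM.
schema = pa.schema([
    ("class", pa.int64()),
    ("datapoint_0", pa.float64()),
])

with pq.ParquetWriter("clusters.parquet", schema) as writer:
    for class_id, values in enumerate([[1.0, 2.0], [3.0]]):  # placeholder chunks
        batch = pa.table(
            {"class": [class_id] * len(values), "datapoint_0": values},
            schema=schema,
        )
        writer.write_table(batch)  # each chunk becomes a row group on disk

The open part for me is whether this row-group-by-row-group pattern still makes sense with ~8500 columns.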

  • You could try using Polars instead of Pandas in a lazy execution mode and stream the data to a Parquet file; see the documentation here: docs.pola.rs/api/python/stable/reference/api/… Commented Nov 4, 2024 at 6:03
  • Thanks for the tip, but when I tried to implement this with Polars I ran into the problem that I can't write the data column by column the way I do in the original code. Commented Nov 4, 2024 at 20:41
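
For context, the streaming pattern the first comment points to would look roughly like this, assuming the data had first been dumped as many small Parquet chunk files (the glob path and file names are hypothetical):

import polars as pl

# Lazily scan many small chunk files and stream them into one combined
# Parquet file without materialising everything in memory.
lazy = pl.scan_parquet("chunks/datapoint_*.parquet")  # hypothetical chunk files
lazy.sink_parquet("combined.parquet")

This streams the chunks into one file without collecting them in RAM, but it assumes a row-wise chunk layout rather than the column-by-column writes used above.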
