
I work with large amounts of data that I process with a TensorFlow Dataset (TFDS) and collect into a pandas.DataFrame. My goal is to convert the data from one format to another for further analysis. But when I create a DataFrame with a large number of columns (~8500), RAM fills up quickly and the process terminates with an out-of-memory error.

Current code:

import tensorflow as tf
import pandas as pd
import numpy as np
from tqdm import tqdm

# filtered_ranking_table and dataset are defined earlier in the pipeline
datapoint_indices = [x[0] for x in filtered_ranking_table]

# Empty DataFrame to store results
column_names = ["class"]
column_names += [f'datapoint_{i}' for i in datapoint_indices]
# df = pd.DataFrame(columns=column_names)
# max_rows = 114003  # or some other upper limit
# df = pd.DataFrame({name: [None] * 162078 for name in column_names})

# Trying to create a DataFrame with a fixed number of rows
# max_rows = 114003  # Row limit
# df = pd.DataFrame(index=range(max_rows), columns=column_names)

# Pre-allocate a fixed number of rows (162078), filled with NaN
df = pd.DataFrame({name: [np.nan] * 162078 for name in column_names})

for datapoint_n, clusters in tqdm(dataset.take(114003), total=114003):
    if datapoint_n.numpy() in datapoint_indices:
        prev_index = len(df)  # Current length of df
        for i, cluster in enumerate(clusters):
            cluster = cluster.numpy()
            cluster = [x for x in cluster if x != 0]
            df.loc[prev_index:prev_index + len(cluster) - 1, 'class'] = i
            df.loc[prev_index:prev_index + len(cluster) - 1, f'datapoint_{datapoint_n}'] = pd.Series(cluster, index=range(prev_index, prev_index + len(cluster)))
            prev_index += len(cluster)

df = df.dropna(how='all')
df = df.astype({"class": int})

What I've tried so far:

  • Creating an empty DataFrame with a fixed number of rows (max_rows) and a dynamic set of columns (built from datapoint_indices).
  • Filling the data column by column in blocks with a for loop, as in the code above; this works for a small number of columns but runs out of RAM at 8500+ columns.

Questions:

  1. How can this process be optimised to reduce memory consumption?
  2. Is there any way to write the data directly to a file (like Parquet, CSV or HDF5) instead of loading it into RAM?
  3. What approaches can help with this amount of data and number of columns?

Any tips on optimisation or approaches to save the data directly to a file would be appreciated.
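
For question 2, this is roughly the kind of incremental write I have in mind, sketched with pyarrow's ParquetWriter; the schema, file name and placeholder chunks here are purely illustrative, not my real data or layout:

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema: one "class" column plus a single datapoint column,
# just to illustrate appending row groups without holding everything in RAM.
schema = pa.schema([
    ("class", pa.int64()),
    ("datapoint_0", pa.float64()),
])

with pq.ParquetWriter("clusters.parquet", schema) as writer:
    for class_id, values in enumerate([[1.0, 2.0], [3.0]]):  # placeholder chunks
        batch = pa.table(
            {"class": [class_id] * len(values), "datapoint_0": values},
            schema=schema,
        )
        writer.write_table(batch)  # each chunk becomes a row group on disk

The open part for me is whether this row-group-by-row-group pattern still makes sense with ~8500 columns.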

  • You could try using Polars instead of Pandas in a lazy execution mode and stream the data to a Parquet file; see the documentation here: docs.pola.rs/api/python/stable/reference/api/… Commented Nov 4, 2024 at 6:03
  • Thanks for the tip, but when I tried to implement this with Polars I ran into the problem that I can't write the data column by column the way I do in the original code. Commented Nov 4, 2024 at 20:41
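
For context, the streaming pattern the first comment points to would look roughly like this, assuming the data had first been dumped as many small Parquet chunk files (the glob path and file names are hypothetical):

import polars as pl

# Lazily scan many small chunk files and stream them into one combined
# Parquet file without materialising everything in memory.
lazy = pl.scan_parquet("chunks/datapoint_*.parquet")  # hypothetical chunk files
lazy.sink_parquet("combined.parquet")

This streams the chunks into one file without collecting them in RAM, but it assumes a row-wise chunk layout rather than the column-by-column writes used above.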
