
I'm working with a large dataset (~10 million rows and 50 columns) in pandas and experiencing significant performance issues during data manipulation and analysis. The operations include filtering, merging, and aggregating the data, and they are currently taking too long to execute.

I've read about several optimization techniques, but I'm unsure which ones are most effective and applicable to my case. Here are a few specifics about my workflow:

I primarily use pandas for data cleaning, transformation, and analysis. My operations include multiple groupby and apply functions. I'm running the analysis on a machine with 16GB RAM.

Could the community share best practices for optimizing pandas performance on large datasets? Specifically, I'm looking for:

1. Memory management techniques.
2. Efficient ways to perform groupby and apply.
3. Alternatives to pandas for handling large datasets.
4. Any tips for parallel processing or utilizing multiple cores effectively.


1 Comment
  • Please provide enough code so others can better understand or reproduce the problem. Commented Jul 16, 2024 at 23:24

3 Answers


If you are using Linux, I recommend trying FireDucks, an accelerator that optimizes any pandas workload without code changes. https://fireducks-dev.github.io/

FireDucks can efficiently address the existing bottlenecks in your code as well as memory issues during execution. I am one of the developers of this library; feel free to contact me with any queries you may have.
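
For context, the usual way to adopt FireDucks (per its documentation) is simply to swap the pandas import; the snippet below is a sketch with placeholder file and column names, not code from the original answer.

import fireducks.pandas as pd   # drop-in replacement; the rest of the script is unchanged

df = pd.read_csv("data.csv")                      # placeholder path
out = df.groupby("category")["value"].mean()      # same pandas API, accelerated by FireDucks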


2 Comments

"By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024." (news.ycombinator.com/item?id=42135303) TL;DR: don't use it if you don't want vendor lock-in.
Unlike many other libraries, FireDucks doesn't require you to learn and integrate any code-level dependencies into your existing pandas code to get the optimization benefit. It integrates seamlessly with existing code, and you can switch back to your original pandas program whenever you decide to stop using FireDucks. Hence, please don't misjudge the above statement about commercialization: the library will remain freely available to the community, with continuous improvements and bug fixes, while the business team focuses on strengthening enterprise support for the commercial side.

Memory management in Pandas

The Kaggle Book suggests downcasting each numeric column to the smallest integer or floating-point type (based on the machine limits from np.iinfo/np.finfo) that can hold its values, which can substantially reduce memory usage in pandas:

import numpy as np

# Downcast each numeric column to the smallest integer/float type that can hold
# its values, and report the memory saved.
def reduce_mem_usage(df, verbose=True):
    numerics = ["int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print(
            "Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df
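
A typical call pattern would be the following (the CSV path is just a placeholder):

import pandas as pd

df = pd.read_csv("data.csv")       # placeholder path
df = reduce_mem_usage(df)          # downcasts numeric columns and prints the savings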

Alternatives to pandas for handling large datasets

1. Polars: A DataFrame library written in Rust that aims to provide fast performance with a low memory footprint. According to published benchmarks, Polars can achieve more than 30x performance gains compared to pandas (a minimal sketch follows after this list).

2. RAPIDS: An open-source suite of data science and analytics libraries that leverages NVIDIA GPUs to accelerate data processing workflows. Within RAPIDS, cuDF is a GPU DataFrame library that offers a pandas-like API with the computational advantages of GPU acceleration (a minimal sketch also follows below). It is not limited to data processing: you can also speed up the training of ML models. For instance, you can train a RandomForest on a GPU using RAPIDS: Kaggle Notebook.
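
To make the Polars option concrete, here is a minimal sketch of its lazy API for the kind of filter/group-by workload described in the question. The CSV path and the column names ("category", "value") are placeholders, not from the original post; note that older Polars releases spell group_by as groupby.

import polars as pl

result = (
    pl.scan_csv("data.csv")           # lazy scan: nothing is loaded into memory yet
    .filter(pl.col("value") > 0)      # the predicate is pushed down into the scan
    .group_by("category")
    .agg(pl.col("value").mean().alias("mean_value"))
    .collect()                        # execute the optimized query plan
)

And a minimal cuDF sketch under the same placeholder names; it assumes an NVIDIA GPU and a working RAPIDS installation:

import cudf

gdf = cudf.read_csv("data.csv")                  # loads directly into GPU memory
out = gdf.groupby("category")["value"].mean()    # pandas-like API, executed on the GPU
print(out.to_pandas())                           # copy the small result back to the CPU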



You can try the parallel-pandas library. It's very easy to use:

import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas with the number of worker processes to use
ParallelPandas.initialize(n_cpu=16)

# create a big DataFrame
df = pd.DataFrame(np.random.random((1_000_000, 100)))

# your CPU-intensive function (placeholder: substitute your real per-column logic)
def foo(x):
    return np.sin(x) ** 2 + np.cos(x) ** 2

# p_apply works like DataFrame.apply but distributes the work across the workers
res = df.p_apply(foo)
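
For reference, parallel-pandas exposes p_-prefixed counterparts of the usual pandas methods (p_apply, p_describe, and so on) that distribute the work across the worker processes configured in ParallelPandas.initialize; this description is based on the library's documentation rather than the original answer.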


