
I'm working with a large dataset (~10 million rows and 50 columns) in pandas and experiencing significant performance issues during data manipulation and analysis. The operations include filtering, merging, and aggregating the data, and they are currently taking too long to execute.

I've read about several optimization techniques, but I'm unsure which ones are most effective and applicable to my case. Here are a few specifics about my workflow:

I primarily use pandas for data cleaning, transformation, and analysis. My operations include multiple groupby and apply functions. I'm running the analysis on a machine with 16GB RAM.

Could the community share best practices for optimizing pandas performance on large datasets? Specifically, I'm looking for:

1. Memory management techniques.
2. Efficient ways to perform groupby and apply.
3. Alternatives to pandas for handling large datasets.
4. Any tips for parallel processing or utilizing multiple cores effectively.


1 Comment
  • Please provide enough code so others can better understand or reproduce the problem. Commented Jul 16, 2024 at 23:24

3 Answers


If you are using Linux, I recommend trying FireDucks, an accelerator that optimizes any pandas workload without code changes. https://fireducks-dev.github.io/

FireDucks can efficiently address the existing bottlenecks in your code as well as memory issues during execution. I am one of the developers of this library; feel free to contact me with any queries you may have.
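
For context, the usual way to adopt FireDucks (per its documentation) is simply to swap the pandas import; the snippet below is a sketch with placeholder file and column names, not code from the original answer.

import fireducks.pandas as pd   # drop-in replacement; the rest of the script is unchanged

df = pd.read_csv("data.csv")                      # placeholder path
out = df.groupby("category")["value"].mean()      # same pandas API, accelerated by FireDucks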


2 Comments

"By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024." (news.ycombinator.com/item?id=42135303) TL;DR: don't use it if you don't want vendor lock-in.
Unlike many other libraries, FireDucks doesn't require you to learn and integrate any code-level dependencies into your existing pandas code to get the optimization benefit. It integrates seamlessly with existing code, and you can switch back to your original pandas program whenever you decide to stop using FireDucks. Hence, please don't misjudge the above statement about commercialization: the library will remain freely available to the community, with continuous improvements and bug fixes, while the business team focuses on strengthening enterprise support for the commercial side.

Memory management in Pandas

The Kaggle Book suggests downcasting each numeric column to the smallest integer or floating-point type (based on the machine limits from np.iinfo/np.finfo) that can hold its values, which can substantially reduce memory usage in pandas:

import numpy as np

# Downcast each numeric column to the smallest integer/float type that can hold
# its values, and report the memory saved.
def reduce_mem_usage(df, verbose=True):
    numerics = ["int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print(
            "Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df
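
A typical call pattern would be the following (the CSV path is just a placeholder):

import pandas as pd

df = pd.read_csv("data.csv")       # placeholder path
df = reduce_mem_usage(df)          # downcasts numeric columns and prints the savings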

Alternatives to pandas for handling large datasets

1. Polars: A DataFrame library written in Rust that aims to provide fast performance with a low memory footprint. According to published benchmarks, Polars can achieve more than 30x performance gains compared to pandas (a minimal sketch follows after this list).

2. RAPIDS: An open-source suite of data science and analytics libraries that leverages NVIDIA GPUs to accelerate data processing workflows. Within RAPIDS, cuDF is a GPU DataFrame library that offers a pandas-like API with the computational advantages of GPU acceleration (a minimal sketch also follows below). It is not limited to data processing: you can also speed up the training of ML models. For instance, you can train a RandomForest on a GPU using RAPIDS: Kaggle Notebook.
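
To make the Polars option concrete, here is a minimal sketch of its lazy API for the kind of filter/group-by workload described in the question. The CSV path and the column names ("category", "value") are placeholders, not from the original post; note that older Polars releases spell group_by as groupby.

import polars as pl

result = (
    pl.scan_csv("data.csv")           # lazy scan: nothing is loaded into memory yet
    .filter(pl.col("value") > 0)      # the predicate is pushed down into the scan
    .group_by("category")
    .agg(pl.col("value").mean().alias("mean_value"))
    .collect()                        # execute the optimized query plan
)

And a minimal cuDF sketch under the same placeholder names; it assumes an NVIDIA GPU and a working RAPIDS installation:

import cudf

gdf = cudf.read_csv("data.csv")                  # loads directly into GPU memory
out = gdf.groupby("category")["value"].mean()    # pandas-like API, executed on the GPU
print(out.to_pandas())                           # copy the small result back to the CPU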



You can try the parallel-pandas library. It's very easy to use:

import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas with the number of worker processes to use
ParallelPandas.initialize(n_cpu=16)

# create a big DataFrame
df = pd.DataFrame(np.random.random((1_000_000, 100)))

# your CPU-intensive function (placeholder: substitute your real per-column logic)
def foo(x):
    return np.sin(x) ** 2 + np.cos(x) ** 2

# p_apply works like DataFrame.apply but distributes the work across the workers
res = df.p_apply(foo)
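
For reference, parallel-pandas exposes p_-prefixed counterparts of the usual pandas methods (p_apply, p_describe, and so on) that distribute the work across the worker processes configured in ParallelPandas.initialize; this description is based on the library's documentation rather than the original answer.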


