
I'm trying to apply a pandas_udf to my PySpark dataframe for some filtering, following the groupby('Key').apply(UDF) approach. To use the pandas_udf I defined an output schema, and I have a condition on the column Number. As a simplified example, I wish to return only the ID of the rows with an odd Number.

This raises a problem: sometimes a group contains no odd Number, so the UDF returns an empty dataframe, which conflicts with the defined schema's requirement that Number be an int.

Is there a way to solve this problem and output only the odd-Number rows, combined into a new dataframe?

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("ID", StringType()),
    StructField("Number", IntegerType())
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def get_odd(df):
    # keep only the rows of this group whose Number is odd
    odd = df.loc[df['Number'] % 2 == 1]
    return odd[['ID', 'Number']]
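A plain-pandas sketch of the per-group behavior (with hypothetical sample data) shows why an empty group is possible — a group where every Number is even yields a zero-row frame, which is what trips the Spark schema:

```python
import pandas as pd

def get_odd_pandas(df):
    # same filter logic as the UDF body, applied to one group
    odd = df.loc[df['Number'] % 2 == 1]
    return odd[['ID', 'Number']]

# hypothetical groups for illustration
group_with_odd = pd.DataFrame({'ID': ['a', 'b'], 'Number': [1, 2]})
group_all_even = pd.DataFrame({'ID': ['c', 'd'], 'Number': [4, 6]})

print(get_odd_pandas(group_with_odd))       # one row: ID 'a', Number 1
print(len(get_odd_pandas(group_all_even)))  # 0 -- the empty result that conflicts with the schema
```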
3 Comments

  • Because I wish to deploy the algorithm on a cluster, and groupby enables distributed computing. Applying my conditions to a huge dataframe is very expensive without groupby. Commented May 17, 2020 at 19:47
  • Use an if/else to return an empty data frame with the columns defined? Also, how does your return match the schema when only the ID column is returned? Commented May 17, 2020 at 20:16
  • That's a typo, just fixed it. Commented May 17, 2020 at 20:26

1 Answer


I came across this issue when some groups produced an empty DataFrame. I solved it by checking for an empty DataFrame and returning one with the schema explicitly defined:

if df_out.empty:
    # change the schema as needed
    return pd.DataFrame({'fullVisitorId': pd.Series([], dtype='str'),
                         'time': pd.Series([], dtype='datetime64[ns]'),
                         'total_transactions': pd.Series([], dtype='int')})
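A self-contained sketch of this pattern (column names taken from the answer's example; swap in your own schema): building the empty frame with explicit dtypes means Arrow sees int/datetime columns rather than the default object dtype an empty DataFrame would otherwise have.

```python
import pandas as pd

# empty DataFrame whose columns carry the intended dtypes,
# matching the answer's example schema
empty = pd.DataFrame({'fullVisitorId': pd.Series([], dtype='str'),
                      'time': pd.Series([], dtype='datetime64[ns]'),
                      'total_transactions': pd.Series([], dtype='int')})

print(empty.dtypes)  # dtypes are set even though there are no rows
```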

1 Comment

I've found that it's sufficient to provide the column names alone. So if something goes wrong in my pandas_udf function and I want to return an empty pandas dataframe, I just do: return pd.DataFrame(columns=schema.fieldNames()), where schema is the schema of the Spark DataFrame (to be returned) that you passed into your pandas_udf.
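The columns-only variant in isolation (field names hardcoded here for illustration; in a real UDF they would come from schema.fieldNames() on the Spark schema object):

```python
import pandas as pd

# stand-in for schema.fieldNames() from the Spark StructType
field_names = ['ID', 'Number']

# empty frame with the right column names but default (object) dtypes
empty = pd.DataFrame(columns=field_names)

print(empty.shape)  # (0, 2)
```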
