I'm trying to apply a pandas_udf to my PySpark DataFrame to do some filtering, using the groupby('Key').apply(UDF) pattern. To use the pandas_udf I defined an output schema, and the filter is a condition on the column Number. As a simplified example, I want to return only the ID of the rows whose Number is odd.
The problem is that sometimes a group contains no odd Number, so the UDF returns an empty DataFrame, which conflicts with the defined schema that expects an int for Number.
Is there a way to solve this so that only the odd-Number rows from all groups are returned and combined into a new DataFrame?
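from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Output schema declared for the grouped-map UDF
schema = StructType([
    StructField("Key", StringType()),
    StructField("Number", IntegerType())
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def get_odd(df):
    # Keep only the rows of this group whose Number is odd
    odd = df.loc[df['Number'] % 2 == 1]
    return odd[['ID', 'Number']]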
Should the ID column be returned? The schema declares Key and Number, but get_odd returns ID and Number.
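One possible direction, shown here as a minimal pandas-only sketch rather than a confirmed fix: have the function always return the same columns with explicitly pinned dtypes, so that a group with no odd Number still yields an empty frame of the expected shape. The name OUTPUT_DTYPES is hypothetical, and treating ID as a string is an assumption, since the question does not state its type.

```python
import pandas as pd

# Hypothetical mapping of output columns to dtypes.
# Assumption: ID is a string column; Number is a 32-bit integer.
OUTPUT_DTYPES = {"ID": "object", "Number": "int32"}


def get_odd(pdf: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with an odd Number; always return the same columns and dtypes."""
    # Select the odd rows and only the columns named in OUTPUT_DTYPES
    odd = pdf.loc[pdf["Number"] % 2 == 1, list(OUTPUT_DTYPES)]
    # astype pins the dtypes even when no row matched the filter
    return odd.astype(OUTPUT_DTYPES).reset_index(drop=True)


# One group containing an odd Number, and one group with none
mixed = pd.DataFrame({"Key": ["a", "a"], "ID": ["x1", "x2"], "Number": [1, 2]})
evens = pd.DataFrame({"Key": ["b", "b"], "ID": ["y1", "y2"], "Number": [2, 4]})

print(get_odd(mixed))
print(get_odd(evens).dtypes)
```

On Spark 3.0 or later, a plain function like this could be passed to df.groupby('Key').applyInPandas(get_odd, schema="ID string, Number int"). The declared schema has to name the same columns the function returns, which the Key/Number schema in the question does not.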