Convert multiple rows into one row with multiple columns in pyspark?

Question

I have a DataFrame like this (I've reduced the number of columns for brevity; there are about 10 other attributes):

id    name    foods      foods_eaten    color    continent
1     john    apples     2              red      Europe
1     john    oranges    3              red      Europe
2     jack    apples     1              blue     North America

I want to convert it to:

id    name    apples    oranges    color    continent
1     john    2         3          red      Europe
2     jack    1         0          blue     North America

Edit:

(1) I updated the data to show a few more of the columns.

(3) So far I've tried:

df_piv = df.groupBy(['id', 'name', 'color', 'continent', ...]).pivot('foods').avg('foods_eaten')

Is there a simpler way to do this sort of thing? As far as I can tell, I'll need to group by almost every attribute to get my result.

Please show the code you already have when posting a question.
– Jacob Celestine, Commented Jun 20, 2022 at 22:42
Which columns are you losing? Can you elaborate on your scenario?
– samkart, Commented Jun 21, 2022 at 6:25
@samkart I will update the question. I feel like there is an easier implementation using pivot or something.
– oogway74, Commented Jun 21, 2022 at 14:31

teedak8s · Accepted Answer · 2022-06-20 23:56:50Z

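Extending what you have done so far, and building on the approach from a linked answer (the link target was not preserved), you can collect the non-grouping columns into lists and then spread the list elements across separate columns: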

>>> # Run in the pyspark shell, where `spark` is the predefined SparkSession.
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.functions import collect_list
>>> data = [{'id': 1, 'name': 'john', 'foods': "apples"}, {'id': 1, 'name': 'john', 'foods': "oranges"}, {'id': 2, 'name': 'jack', 'foods': "banana"}]
>>> dataframe = spark.createDataFrame(data)
>>> dataframe.show()
+-------+---+----+
|  foods| id|name|
+-------+---+----+
| apples|  1|john|
|oranges|  1|john|
| banana|  2|jack|
+-------+---+----+
>>> # Group by the key columns and collect every other column into a list.
>>> grouping_cols = ["id", "name"]
>>> other_cols = [c for c in dataframe.columns if c not in grouping_cols]
>>> df = dataframe.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols])
>>> df.show()
+---+----+-----------------+
| id|name|            foods|
+---+----+-----------------+
|  1|john|[apples, oranges]|
|  2|jack|         [banana]|
+---+----+-----------------+

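>>> # Find the longest list per column to know how many output columns to create.
>>> df_sizes = df.select(*[F.size(col).alias(col) for col in other_cols])
>>> df_max = df_sizes.agg(*[F.max(col).alias(col) for col in other_cols])
>>> max_dict = df_max.collect()[0].asDict()

>>> # Expand each list element into its own column; missing elements become null.
>>> df_result = df.select('id', 'name', *[df[col][i] for col in other_cols for i in range(max_dict[col])])
>>> df_result.show()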
+---+----+--------+--------+
| id|name|foods[0]|foods[1]|
+---+----+--------+--------+
|  1|john|  apples| oranges|
|  2|jack|  banana|    null|
+---+----+--------+--------+
