I have a DataFrame like this (I've trimmed the columns for brevity; there are about 10 other attributes):

id    name    foods    foods_eaten    color    continent
1     john    apples   2              red      Europe
1     john    oranges  3              red      Europe
2     jack    apples   1              blue     North America

I want to convert it to:

id    name    apples    oranges    color    continent
1     john    2         3          red      Europe
2     jack    1         0          blue     North America

Edit:

(1) I updated the data to show a few more of the columns.

(2) Here is what I've tried:

df_piv = df.groupBy(['id', 'name', 'color', 'continent', ...]).pivot('foods').avg('foods_eaten')

Is there a simpler way to do this? As far as I can tell, I need to group by almost every attribute to get my result.

  • Please show the code you already have while posting a question. Commented Jun 20, 2022 at 22:42
  • what columns are you losing? can you elaborate your scenario? Commented Jun 21, 2022 at 6:25
  • @samkart i will update the question. i feel like there is an easier implementation using pivot or something. Commented Jun 21, 2022 at 14:31

1 Answer

Extending what you have done so far, you can collect the non-grouping columns into lists and then expand each list element into its own column:

>>> from pyspark.sql import functions as F
>>> from pyspark.sql.functions import collect_list
>>> data = [
...     {'id': 1, 'name': 'john', 'foods': 'apples'},
...     {'id': 1, 'name': 'john', 'foods': 'oranges'},
...     {'id': 2, 'name': 'jack', 'foods': 'banana'},
... ]
>>> dataframe = spark.createDataFrame(data)
>>> dataframe.show()
+-------+---+----+
|  foods| id|name|
+-------+---+----+
| apples|  1|john|
|oranges|  1|john|
| banana|  2|jack|
+-------+---+----+
>>> grouping_cols = ["id", "name"]
>>> other_cols = [c for c in dataframe.columns if c not in grouping_cols]
>>> # Collect every non-grouping column into one list per (id, name) group
>>> df = dataframe.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols])
>>> df.show()
+---+----+-----------------+
| id|name|            foods|
+---+----+-----------------+
|  1|john|[apples, oranges]|
|  2|jack|         [banana]|
+---+----+-----------------+

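>>> # The longest list in each column decides how many output columns are needed
>>> df_sizes = df.select(*[F.size(col).alias(col) for col in other_cols])
>>> df_max = df_sizes.agg(*[F.max(col).alias(col) for col in other_cols])
>>> max_dict = df_max.collect()[0].asDict()

>>> # Expand each list into one column per index; missing entries become null
>>> df_result = df.select(*grouping_cols, *[df[col][i] for col in other_cols for i in range(max_dict[col])])
>>> df_result.show()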
+---+----+--------+--------+
| id|name|foods[0]|foods[1]|
+---+----+--------+--------+
|  1|john|  apples| oranges|
|  2|jack|  banana|    null|
+---+----+--------+--------+
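
For the pivot in your question specifically, one possible shortcut is to derive the grouping columns from `df.columns` instead of listing them by hand. Spark still groups by every attribute, but you no longer have to type them. This is only a minimal sketch: it assumes the pivot column is `foods` and the value column is `foods_eaten` as in the question, `grouping_columns` and `pivot_counts` are hypothetical helper names, and `sum` is swapped in for the question's `avg` so the counts stay integers.

```python
def grouping_columns(all_columns, pivot_col, value_col):
    """Return every column except the pivot and value columns, in order."""
    return [c for c in all_columns if c not in (pivot_col, value_col)]


def pivot_counts(df, pivot_col="foods", value_col="foods_eaten", values=None):
    """Pivot `pivot_col` into columns, grouping by all remaining attributes.

    `df` is assumed to be a PySpark DataFrame. `values` is an optional list
    of pivot values (for example ['apples', 'oranges']).
    """
    group_cols = grouping_columns(df.columns, pivot_col, value_col)
    pivoted = df.groupBy(*group_cols).pivot(pivot_col, values).sum(value_col)
    # Only the newly created pivot columns get their nulls replaced with 0,
    # so nulls in the original attributes are left untouched.
    new_cols = [c for c in pivoted.columns if c not in group_cols]
    return pivoted.fillna(0, subset=new_cols)
```

Calling `pivot_counts(df)` on the question's DataFrame should give one row per person with one column per food. Passing `values=['apples', 'oranges']` tells Spark the pivot values up front, so it can skip the extra pass it otherwise runs to find the distinct foods.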