I want to do group cols some aggregations operations like count, count_distinct or nunique.
For examples,
# the samples values in `date` column are all unique
df.show(7)
+--------------------+---------------------------------+-------------------+---------+
| category| tags| datetime| date|
+--------------------+---------------------------------+-------------------+---------+
| null| ,industry,display,Merchants|2018-01-08 14:30:32| 20200704|
| social,smart| smart,swallow,game,Experience|2019-06-17 04:34:51| 20200705|
| ,beauty,social| social,picture,social|2017-08-19 09:01:37| 20200706|
| default| default,game,us,adventure|2019-10-02 14:18:56| 20200707|
|financial management|financial management,loan,product|2018-07-17 02:07:39| 20200708|
| system| system,font,application,setting|2015-07-18 00:45:57| 20200709|
| null| ,system,profile,optimization|2018-09-07 19:59:03| 20200710|
df.printSchema()
root
|-- category: string (nullable = true)
|-- tags: string (nullable = true)
|-- datetime: string (nullable = true)
|-- date: string (nullable = true)
# I want to do some group aggregations by PySpark like follows in pandas
group_date_tags_cnt_df = df.groupby('date')['tags'].count()
group_date_tags_nunique_df = df.groupby('date')['tags'].nunique()
group_date_category_cnt_df = df.groupby('date')['category'].count()
group_date_category_nunique_df = df.groupby('date')['category'].nunique()
# expected output here
# AND all results should ignore ',' in the splitted result and `null` value in aggregations operations
group_date_tags_cnt_df.show(4)
+---------+---------+
| date| count|
+---------+---------+
| 20200704| 3|
| 20200705| 4|
| 20200706| 3|
| 20200707| 4|
group_date_tags_nunique_df.show(4)
+---------+---------------------------------+
| date| count(DISTINCT tag)|
+---------+---------------------------------+
| 20200704| 3|
| 20200705| 4|
| 20200706| 3|
| 20200707| 4|
# It should ignore `null` here
group_date_category_cnt_df.show(4)
+---------+---------+
| date| count|
+---------+---------+
| 20200704| 0|
| 20200705| 2|
| 20200706| 2|
| 20200707| 1|
group_date_category_nunique_df.show(4)
+---------+----------------------------+
| date| count(DISTINCT category)|
+---------+----------------------------+
| 20200704| 1|
| 20200705| 2|
| 20200706| 2|
| 20200707| 1|
But the tags and category columns are string type here.
So I think I should do split way first and do group aggregations operations based on.
But I am a little awkward to implement it.
So could anyone help me?