I don't know how to do this using only PySpark SQL, but here is a way to do it using PySpark DataFrames.
Basically, we can convert the array-of-structs column into a MapType() column using the create_map() function. Then the fields can be accessed directly using string indexing.
Consider the following example:
Define Schema
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType
)

schema = StructType([
    StructField('person', IntegerType()),
    StructField(
        'customtags',
        ArrayType(
            StructType([
                StructField('name', StringType()),
                StructField('value', StringType())
            ])
        )
    )
])
Create Example DataFrame
data = [
    (
        1,
        [
            {'name': 'name', 'value': 'tom'},
            {'name': 'age', 'value': '20'},
            {'name': 'gender', 'value': 'm'}
        ]
    ),
    (
        2,
        [
            {'name': 'name', 'value': 'jerry'},
            {'name': 'age', 'value': '20'},
            {'name': 'gender', 'value': 'm'}
        ]
    ),
    (
        3,
        [
            {'name': 'name', 'value': 'ann'},
            {'name': 'age', 'value': '20'},
            {'name': 'gender', 'value': 'f'}
        ]
    )
]
df = spark.createDataFrame(data, schema)
df.show(truncate=False)
#+------+------------------------------------+
#|person|customtags |
#+------+------------------------------------+
#|1 |[[name,tom], [age,20], [gender,m]] |
#|2 |[[name,jerry], [age,20], [gender,m]]|
#|3 |[[name,ann], [age,20], [gender,f]] |
#+------+------------------------------------+
Convert the struct column to a map
from functools import reduce
from operator import add

import pyspark.sql.functions as f

df = df.withColumn(
    'customtags',
    f.create_map(
        *reduce(
            add,
            [
                [f.col('customtags')['name'][i], f.col('customtags')['value'][i]]
                for i in range(3)
            ]
        )
    )
).select('person', 'customtags')
df.show(truncate=False)
#+------+------------------------------------------+
#|person|customtags |
#+------+------------------------------------------+
#|1 |Map(name -> tom, age -> 20, gender -> m) |
#|2 |Map(name -> jerry, age -> 20, gender -> m)|
#|3 |Map(name -> ann, age -> 20, gender -> f) |
#+------+------------------------------------------+
The catch here is that you have to know the length of the ArrayType() a priori (in this case 3), as I don't know of a way to dynamically loop over it. This also assumes that the array has the same length for every row.
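If you happen to be on Spark 2.4 or later, this limitation can be avoided: map_from_entries() builds a map directly from an array of two-field structs, using each struct's first field as the key and the second as the value. A minimal sketch, assuming Spark 2.4+ (it should also handle arrays whose length varies between rows):

import pyspark.sql.functions as f

# Spark 2.4+: build the map straight from the array of structs,
# with no need to know the array length in advance
df = df.withColumn('customtags', f.map_from_entries('customtags'))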
I had to use reduce(add, ...) in the create_map() call above because create_map() expects its arguments as an alternating sequence of keys and values.
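To see what that does in isolation, here is the same flattening with plain placeholder strings (pure Python, nothing Spark-specific):

from functools import reduce
from operator import add

# the comprehension builds a list of [key, value] pairs ...
pairs = [['k0', 'v0'], ['k1', 'v1'], ['k2', 'v2']]

# ... and reduce(add, ...) concatenates them into the flat,
# interleaved sequence that create_map() expects
print(reduce(add, pairs))
# ['k0', 'v0', 'k1', 'v1', 'k2', 'v2']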
Group by fields in the map column
df.groupBy((f.col('customtags')['name']).alias('name')).count().show()
#+-----+-----+
#| name|count|
#+-----+-----+
#| ann| 1|
#|jerry| 1|
#| tom| 1|
#+-----+-----+
df.groupBy((f.col('customtags')['gender']).alias('gender')).count().show()
#+------+-----+
#|gender|count|
#+------+-----+
#| m| 2|
#| f| 1|
#+------+-----+
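The same string indexing works in any column expression, not just in groupBy(). For example, to flatten the map values out into ordinary columns (a quick sketch using the same tag names as above):

import pyspark.sql.functions as f

df.select(
    'person',
    f.col('customtags')['name'].alias('name'),
    f.col('customtags')['age'].alias('age'),
    f.col('customtags')['gender'].alias('gender')
).show()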