I have a large DataFrame (30 million rows) with the following columns, one of which is an array of structs. I'd like to fetch all the ids that match a specific key or a specific key/value pair in that column.
+--------+--------------------+--------------------+
|      id|                tags|           timestamp|
+--------+--------------------+--------------------+
|    id_1|[{k1,v1}, {k2,v2}..]|                  t1|
|    id_2|[{k3,v3}, {k4,v4}..]|                  t2|
|    id_3|[{k5,v5}, {k6,v6}..]|                  t3|
+--------+--------------------+--------------------+
The schema for this df is as follows:
root
 |-- id: long (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- timestamp: long (nullable = true)
I've tried exploding the tags into separate key and value columns (using this answer) on a smaller df and querying that, which works, but I'd like something more efficient for the larger df.
I've looked into similar questions like this and this but struggled to make anything out of them. Maybe I can use the create_map() function to convert the struct first? TIA!