
I have a large dataframe (30 million rows) with the following columns, where one column is an array of structs. I'd like to fetch all the ids by querying for a specific key or key/value pair.

+--------+--------------------+--------------------+
|      id|                tags|           timestamp|
+--------+--------------------+--------------------+
|    id_1|[{k1,v1}, {k2,v2}..]|                  t1|
|    id_2|[{k3,v3}, {k4,v4}..]|                  t2|
|    id_3|[{k5,v5}, {k6,v6}..]|                  t3|
+--------+--------------------+--------------------+

The schema for this df is as follows:

root
 |-- id: long (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- timestamp: long (nullable = true)

I've tried exploding the tags into key and value columns using this answer on a smaller df and querying it (which works), but I'd like something more efficient for the larger df.

I've looked into similar questions like this and this but struggled to make anything out of them. Maybe I can use the create_map() function to convert the structs first? A rough sketch of what I have in mind is below. TIA!
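For context, the conversion I have in mind would look roughly like this (sketched with map_from_entries rather than create_map, since it takes an array of key/value structs directly; untested on the full df, and it assumes the keys within each row's tags array are unique):

from pyspark.sql import functions as F

# rough sketch: turn the array<struct<key,value>> into a map<string,string>
df_map = df.withColumn('tags_map', F.map_from_entries('tags'))

# a key lookup then becomes a simple map access
res = df_map.where(F.col('tags_map').getItem('k1').isNotNull())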

2 Answers


Another solution is to use array_contains. Normally, array_contains only accepts literal values; that is, we cannot check whether a value coming from another column is contained in the array. To overcome this limitation, we can use array_contains inside an expr expression, as follows.

from pyspark.sql import functions as f

key, value = 'k1', 'v1'

# checking if a key/value pair is in the array (note the f-strings,
# which interpolate key and value into the SQL expression):
res = df.where(f.expr(f"array_contains(tags, struct('{key}' as key, '{value}' as value))"))

# checking if a key is in the array (tags.key extracts the array of keys):
res = df.where(f.expr(f"array_contains(tags.key, '{key}')"))
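For a quick sanity check, here is a self-contained toy example with made-up rows matching the question's schema (only the first row should match):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# made-up sample data following the schema in the question
df = spark.createDataFrame(
    [(1, [('k1', 'v1'), ('k2', 'v2')], 100),
     (2, [('k3', 'v3'), ('k4', 'v4')], 200)],
    'id long, tags array<struct<key:string,value:string>>, timestamp long')

key, value = 'k1', 'v1'
df.where(f.expr(f"array_contains(tags, struct('{key}' as key, '{value}' as value))")).show()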

1 Comment

This is magical! It's faster and I can query it directly. Accepting this.

You can use the filter higher-order function on the tags column and check whether the resulting array is empty:

from pyspark.sql import functions as F

# Let us first define a function that tells whether a Spark struct is
# equal to the key/value pair passed in as parameters
def struct_equals(struct, key, value):
    return struct == F.struct(F.lit(key).alias('key'), F.lit(value).alias('value'))

# let's say that we want all rows containing the pair (k6, v6)
res = df.where(F.size(F.filter(F.col('tags'), lambda x: struct_equals(x, 'k6', 'v6'))) > 0)

# simpler, let's say that we want the rows with a key equal to 'k7'
res = df.where(F.size(F.filter(F.col('tags'), lambda x: x.getItem('key') == 'k7')) > 0)
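Note that F.filter was only added to the Python API in Spark 3.1. On older versions you can reach the same higher-order function through SQL, something like:

# same filter expressed in SQL; higher-order functions such as
# filter and exists are available in SQL since Spark 2.4
res = df.where(F.expr("size(filter(tags, x -> x.key = 'k7')) > 0"))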

4 Comments

I've tried out the map_from_entries() function, which results in {k1 -> v1, k2 -> v2}. Which approach would be better in terms of query performance? I haven't tried searching for specific key/value pairs with that approach yet.
They both use built-in functions only, without exploding or shuffling anything, so honestly it's hard to tell without testing both approaches. It should be comparable. It could depend on the size of the arrays. Let me know if you try ;-)
I added another solution based on array_contains.
Like you said, both approaches are really similar in terms of performance. I ended up slightly adapting struct_equals for MapType and achieved the same results.
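(For reference, the MapType variant mentioned in the last comment would look roughly like this — a sketch, not the exact code used:)

# hypothetical sketch: with tags converted to a map column via
# df_map = df.withColumn('tags_map', F.map_from_entries('tags')),
# a key/value check becomes a direct map lookup
res = df_map.where(F.col('tags_map').getItem('k6') == 'v6')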
