
I have a large dataframe (30 million rows) with the following columns, where one column is an array of structs. I'd like to fetch all the ids by querying for a specific key or key/value pair.

+--------+--------------------+--------------------+
|      id|                tags|           timestamp|
+--------+--------------------+--------------------+
|    id_1|[{k1,v1}, {k2,v2}..]|                  t1|
|    id_2|[{k3,v3}, {k4,v4}..]|                  t2|
|    id_3|[{k5,v5}, {k6,v6}..]|                  t3|
+--------+--------------------+--------------------+

The schema for this df is as follows:

root
 |-- id: long (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- timestamp: long (nullable = true)

I've tried exploding the tags into key and value columns using this answer on a smaller df and querying it (which works), but I'd like something more efficient for the larger df.

I've looked into similar questions like this and this but struggled to make anything out of them. Maybe I can use the create_map() function to convert the structs first? A rough sketch of what I have in mind is below. TIA!
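For context, the conversion I have in mind would look roughly like this (sketched with map_from_entries rather than create_map, since it takes an array of key/value structs directly; untested on the full df, and it assumes the keys within each row's tags array are unique):

from pyspark.sql import functions as F

# rough sketch: turn the array<struct<key,value>> into a map<string,string>
df_map = df.withColumn('tags_map', F.map_from_entries('tags'))

# a key lookup then becomes a simple map access
res = df_map.where(F.col('tags_map').getItem('k1').isNotNull())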

2 Answers


Another solution is to use array_contains. Normally, array_contains only accepts literal values; that is, we cannot check whether a value coming from another column is contained in the array. To overcome this limitation, we can use array_contains inside an expr expression, as follows.

from pyspark.sql import functions as f

key, value = 'k1', 'v1'

# checking if a key/value pair is in the array (note the f-strings,
# which interpolate key and value into the SQL expression):
res = df.where(f.expr(f"array_contains(tags, struct('{key}' as key, '{value}' as value))"))

# checking if a key is in the array (tags.key extracts the array of keys):
res = df.where(f.expr(f"array_contains(tags.key, '{key}')"))
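For a quick sanity check, here is a self-contained toy example with made-up rows matching the question's schema (only the first row should match):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# made-up sample data following the schema in the question
df = spark.createDataFrame(
    [(1, [('k1', 'v1'), ('k2', 'v2')], 100),
     (2, [('k3', 'v3'), ('k4', 'v4')], 200)],
    'id long, tags array<struct<key:string,value:string>>, timestamp long')

key, value = 'k1', 'v1'
df.where(f.expr(f"array_contains(tags, struct('{key}' as key, '{value}' as value))")).show()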

1 Comment

This is magical! It's faster and I can query it directly. Accepting this.

You can use the filter higher-order function on the tags column and check whether the resulting array is empty:

from pyspark.sql import functions as F

# Let us first define a function that tells whether a Spark struct is
# equal to the key/value pair passed in as parameters
def struct_equals(struct, key, value):
    return struct == F.struct(F.lit(key).alias('key'), F.lit(value).alias('value'))

# let's say that we want all rows containing the pair (k6, v6)
res = df.where(F.size(F.filter(F.col('tags'), lambda x: struct_equals(x, 'k6', 'v6'))) > 0)

# simpler, let's say that we want the rows with a key equal to 'k7'
res = df.where(F.size(F.filter(F.col('tags'), lambda x: x.getItem('key') == 'k7')) > 0)
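Note that F.filter was only added to the Python API in Spark 3.1. On older versions you can reach the same higher-order function through SQL, something like:

# same filter expressed in SQL; higher-order functions such as
# filter and exists are available in SQL since Spark 2.4
res = df.where(F.expr("size(filter(tags, x -> x.key = 'k7')) > 0"))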

4 Comments

I've tried out the map_from_entries() function, which results in {k1 -> v1, k2 -> v2}. Which approach would be better in terms of query performance? I haven't tried searching for specific key/value pairs with that approach yet.
They both use built-in functions only, without exploding or shuffling anything, so honestly it's hard to tell without testing both approaches. It should be comparable. It could depend on the size of the arrays. Let me know if you try ;-)
I added another solution based on array_contains.
Like you said, both approaches are really similar in terms of performance. I ended up slightly adapting struct_equals for MapType and achieved the same results.
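(For reference, the MapType variant mentioned in the last comment would look roughly like this — a sketch, not the exact code used:)

# hypothetical sketch: with tags converted to a map column via
# df_map = df.withColumn('tags_map', F.map_from_entries('tags')),
# a key/value check becomes a direct map lookup
res = df_map.where(F.col('tags_map').getItem('k6') == 'v6')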
