
There is a JSON data source. Here is an example of one row:

{
  "PrimaryAcctNumber": "account1",
  "AdditionalData": [
    {
      "Addrs": [
        "an address for account1",
        "the longest address in the address list for account1",
        "another address for account1"
      ],
      "AccountNumber": "Account1",
      "IP": 2368971684
    },
    {
      "Addrs": [
        "an address for account2",
        "the longest address in the address list for account2",
        "another address for account2"
      ],
      "AccountNumber": "Account2",
      "IP": 9864766814
    }
  ]
}

So when I load it into a Spark DataFrame, the schema is:

root
 |-- PrimaryAcctNumber: string (nullable = true)
 |-- AdditionalData: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- AccountNumber: string (nullable = true)
 |    |    |-- Addrs: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- IP: long (nullable = true)
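
For reference, a minimal sketch of how such a DataFrame could be produced (the file path is hypothetical, and the multiLine option is needed because each sample object spans multiple lines):

val sourceDdf = spark.read
  .option("multiLine", "true")    // the sample JSON is pretty-printed
  .json("/path/to/accounts.json") // hypothetical path

sourceDdf.printSchema()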

I want to use Spark to create a new column called LongestAddressOfPrimaryAccount based on the column AdditionalData (ArrayType[StructType]) using the following logic:

  • Iterate over AdditionalData
    • If an element's AccountNumber equals the row's PrimaryAcctNumber, the value of LongestAddressOfPrimaryAccount will be the longest string in that element's Addrs array
    • If no element's AccountNumber equals PrimaryAcctNumber, the value will be "N/A"

So for the given row above, the expected output is:

{
  "PrimaryAcctNumber": "account1",
  "AdditionalData": [
    {
      "Addrs": [
        "an address for account1",
        "the longest address in the address list for account1",
        "another address for account1"
      ],
      "AccountNumber": "Account1",
      "IP": 2368971684
    },
    {
      "Addrs": [
        "an address for account2",
        "the longest address in the address list for account2",
        "another address for account2"
      ],
      "AccountNumber": "Account2",
      "IP": 9864766814
    }
  ],
  "LongestAddressOfPrimaryAccount": "the longest address in the address list for account1"
}

It is doable with a UDF or a map function, but that is not best practice in Spark.

Is it doable using only built-in Spark functions? Something like:

sourceDdf.withColumn("LongestAddressOfPrimaryAccount", coalesce(
  longest(
    get_field(iterate_array_for_match($"AdditionalData", "AccountNumber", $"PrimaryAcctNumber"), "Addrs")
  )
  , lit("N/A")))

1 Answer

If you are on Spark version 2.2 or earlier, you will have to write a UDF for this requirement; combining built-in functions would be more complex and slower (slower in the sense that you would have to chain many of them) than a UDF. And I am not aware of any built-in function that meets your requirement directly.
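
A minimal sketch of such a UDF, assuming the schema above. Note that in the sample row AccountNumber is "Account1" while PrimaryAcctNumber is "account1", so the comparison here ignores case; use plain equality if an exact match is intended:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

val longestAddressOfPrimary = udf { (primary: String, data: Seq[Row]) =>
  Option(data).getOrElse(Seq.empty)
    // find the element whose AccountNumber matches the row's PrimaryAcctNumber
    .find(r => Option(r.getAs[String]("AccountNumber")).exists(_.equalsIgnoreCase(primary)))
    .flatMap(r => Option(r.getAs[Seq[String]]("Addrs")))
    .filter(_.nonEmpty)
    .map(_.maxBy(_.length)) // longest string in Addrs
    .getOrElse("N/A")       // no matching account
}

val result = sourceDdf.withColumn(
  "LongestAddressOfPrimaryAccount",
  longestAddressOfPrimary(col("PrimaryAcctNumber"), col("AdditionalData")))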

The Databricks team is working on support for nested data using higher-order functions in SQL, and those should arrive in upcoming releases.

Until then, you will have to write a UDF if you don't want your code to become overly complex.
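
For readers on Spark 2.4 or later, where those higher-order functions did ship, a UDF-free version is possible. A sketch, again comparing case-insensitively to match the sample data:

import org.apache.spark.sql.functions.{coalesce, expr, lit}

val result = sourceDdf.withColumn(
  "LongestAddressOfPrimaryAccount",
  coalesce(
    expr("""
      aggregate(
        filter(AdditionalData, x -> lower(x.AccountNumber) = lower(PrimaryAcctNumber))[0].Addrs,
        cast(null as string),
        (acc, addr) -> case when acc is null or length(addr) > length(acc) then addr else acc end
      )"""),
    lit("N/A")))

Here coalesce supplies the "N/A": when no element matches, the filtered array is empty, its [0] element is null, and aggregate over a null array returns null.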


1 Comment

I am for sure using the latest Spark release version. Thanks for the guide.
