
There is a JSON data source. Here is an example of one row:

{
  "PrimaryAcctNumber": "account1",
  "AdditionalData": [
    {
      "Addrs": [
        "an address for account1",
        "the longest address in the address list for account1",
        "another address for account1"
      ],
      "AccountNumber": "Account1",
      "IP": 2368971684
    },
    {
      "Addrs": [
        "an address for account2",
        "the longest address in the address list for account2",
        "another address for account2"
      ],
      "AccountNumber": "Account2",
      "IP": 9864766814
    }
  ]
}

So when I load it into a Spark DataFrame, the schema is:

root
 |-- PrimaryAcctNumber: string (nullable = true)
 |-- AdditionalData: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- AccountNumber: string (nullable = true)
 |    |    |-- Addrs: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- IP: long (nullable = true)
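
For reference, a minimal sketch of how such a DataFrame could be produced (the file path is hypothetical, and the multiLine option is needed because each sample object spans multiple lines):

val sourceDdf = spark.read
  .option("multiLine", "true")    // the sample JSON is pretty-printed
  .json("/path/to/accounts.json") // hypothetical path

sourceDdf.printSchema()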

I want to use Spark to create a new column called LongestAddressOfPrimaryAccount based on the column AdditionalData (ArrayType[StructType]) using the following logic:

  • Iterate over AdditionalData
    • If an element's AccountNumber equals the row's PrimaryAcctNumber, the value of LongestAddressOfPrimaryAccount will be the longest string in that element's Addrs array
    • If no element's AccountNumber equals PrimaryAcctNumber, the value will be "N/A"

So for the given row above, the expected output is:

{
  "PrimaryAcctNumber": "account1",
  "AdditionalData": [
    {
      "Addrs": [
        "an address for account1",
        "the longest address in the address list for account1",
        "another address for account1"
      ],
      "AccountNumber": "Account1",
      "IP": 2368971684
    },
    {
      "Addrs": [
        "an address for account2",
        "the longest address in the address list for account2",
        "another address for account2"
      ],
      "AccountNumber": "Account2",
      "IP": 9864766814
    }
  ],
  "LongestAddressOfPrimaryAccount": "the longest address in the address list for account1"
}

It is doable with a UDF or a map function, but that is not best practice in Spark.

Is it doable using only built-in Spark functions? Something like:

sourceDdf.withColumn("LongestAddressOfPrimaryAccount", coalesce(
  longest(
    get_field(iterate_array_for_match($"AdditionalData", "AccountNumber", $"PrimaryAcctNumber"), "Addrs")
  )
  , lit("N/A")))

1 Answer

If you are on Spark version 2.2 or earlier, you will have to write a UDF for this requirement; combining built-in functions would be more complex and slower (slower in the sense that you would have to chain many of them) than a UDF. And I am not aware of any built-in function that meets your requirement directly.
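
A minimal sketch of such a UDF, assuming the schema above. Note that in the sample row AccountNumber is "Account1" while PrimaryAcctNumber is "account1", so the comparison here ignores case; use plain equality if an exact match is intended:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

val longestAddressOfPrimary = udf { (primary: String, data: Seq[Row]) =>
  Option(data).getOrElse(Seq.empty)
    // find the element whose AccountNumber matches the row's PrimaryAcctNumber
    .find(r => Option(r.getAs[String]("AccountNumber")).exists(_.equalsIgnoreCase(primary)))
    .flatMap(r => Option(r.getAs[Seq[String]]("Addrs")))
    .filter(_.nonEmpty)
    .map(_.maxBy(_.length)) // longest string in Addrs
    .getOrElse("N/A")       // no matching account
}

val result = sourceDdf.withColumn(
  "LongestAddressOfPrimaryAccount",
  longestAddressOfPrimary(col("PrimaryAcctNumber"), col("AdditionalData")))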

The Databricks team is working on support for nested data using higher-order functions in SQL, and those should arrive in upcoming releases.

Until then, you will have to write a UDF if you don't want your code to become overly complex.
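
For readers on Spark 2.4 or later, where those higher-order functions did ship, a UDF-free version is possible. A sketch, again comparing case-insensitively to match the sample data:

import org.apache.spark.sql.functions.{coalesce, expr, lit}

val result = sourceDdf.withColumn(
  "LongestAddressOfPrimaryAccount",
  coalesce(
    expr("""
      aggregate(
        filter(AdditionalData, x -> lower(x.AccountNumber) = lower(PrimaryAcctNumber))[0].Addrs,
        cast(null as string),
        (acc, addr) -> case when acc is null or length(addr) > length(acc) then addr else acc end
      )"""),
    lit("N/A")))

Here coalesce supplies the "N/A": when no element matches, the filtered array is empty, its [0] element is null, and aggregate over a null array returns null.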


1 Comment

I am for sure using the latest Spark release version. Thanks for the guide.
