1

I have a DataFrame with an array column like the one below:

 col1
 -----------------
 [a1_b1_c1, a2_b2_c2, a3_b3_c3]
 [aa1_bb1_cc1, aa2_bb2_cc2, aa3_bb3]
 [aaa2_bbb2_ccc1, aaa2_bbb2_cc2, aaa3_bbb3]

Now I want to split each element on '_' and keep the third part, so that I get the DataFrame below:

newcol1
--------
[c1,c2,c3]
[cc1,cc2,null]
[ccc1,cc2,null]

What is the best way to achieve this?

3
  • Which version of Spark are you on? Commented Jul 29, 2021 at 8:03
  • spark 2.4.4 and python3 Commented Jul 29, 2021 at 8:28
  • @Yeskay Glad you found my solution helpful! Please consider also upvoting it, in addition to already having accepted it :) Commented Jul 29, 2021 at 10:16

2 Answers

1

You can use the built-in higher-order function TRANSFORM, which is available in Spark SQL from version 2.4.

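# Register the DataFrame as a temp view so it can be queried with SQL
df.createOrReplaceTempView("tab")
# Indexing [2] on an element with only two parts yields null
spark.sql(
    "select col1, TRANSFORM(col1, v -> split(v, '_')[2]) as newcol1 from tab"
).show(truncate=False)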

+------------------------------------------+------------+
|col1                                      |newcol1     |
+------------------------------------------+------------+
|[a1_b1_c1, a2_b2_c2, a3_b3_c3]            |[c1, c2, c3]|
|[aa1_bb1_cc1, aa2_bb2_cc2, aa3_bb3]       |[cc1, cc2,] |
|[aaa2_bbb2_ccc1, aaa2_bbb2_cc2, aaa3_bbb3]|[ccc1, cc2,]|
+------------------------------------------+------------+
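
If you would rather not register a temp view, the same expression can probably be passed to F.expr and used with withColumn. The snippet below is only a sketch: nth_part_expr and with_nth_part are hypothetical helper names, not part of Spark, and it assumes the separator contains no regex metacharacters (split treats it as a pattern) and that an out-of-range index returns null, as in the output above.

```python
def nth_part_expr(array_col: str, index: int, sep: str = "_") -> str:
    """Build a Spark SQL expression that takes part `index` of every array element.

    Hypothetical helper: it only assembles the expression string.
    Assumes `sep` has no regex metacharacters, since split() takes a pattern.
    """
    return f"transform({array_col}, v -> split(v, '{sep}')[{index}])"


def with_nth_part(df, array_col: str, new_col: str, index: int, sep: str = "_"):
    """Add `new_col` to `df` using the expression above (needs Spark 2.4+)."""
    # Imported lazily so nth_part_expr can be used and tested without Spark.
    from pyspark.sql import functions as F

    return df.withColumn(new_col, F.expr(nth_part_expr(array_col, index, sep)))


if __name__ == "__main__":
    # Show the expression that would be handed to F.expr.
    print(nth_part_expr("col1", 2))
```

With the DataFrame from the question, the call would be with_nth_part(df, "col1", "newcol1", 2).show(truncate=False).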

0

You can achieve the desired output with a UDF:

import pyspark.sql.functions as F
import pyspark.sql.types as T

def my_split(l):
  # Split every element once; elements with fewer than three parts give None
  parts = [e.split('_') for e in l]
  return [p[2] if len(p) > 2 else None for p in parts]

my_udf = F.udf(my_split, T.ArrayType(T.StringType()))

df = df.withColumn('newcol1', my_udf('col1'))

df.show(truncate=False)

+------------------------------------------+-----------------+
|col1                                      |newcol1          |
+------------------------------------------+-----------------+
|[a1_b1_c1, a2_b2_c2, a3_b3_c3]            |[c1, c2, c3]     |
|[aa1_bb1_cc1, aa2_bb2_cc2, aa3_bb3]       |[cc1, cc2, null] |
|[aaa2_bbb2_ccc1, aaa2_bbb2_cc2, aaa3_bbb3]|[ccc1, cc2, null]|
+------------------------------------------+-----------------+
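
Since the splitting logic is plain Python, it can be checked without a Spark session before it is wrapped in a UDF. Below is a rough sketch of that idea. The name third_part and its index and sep parameters are my own additions, and the None checks assume that a null array (or a null element) reaches a Python UDF as None.

```python
from typing import List, Optional


def third_part(
    values: Optional[List[Optional[str]]], index: int = 2, sep: str = "_"
) -> Optional[List[Optional[str]]]:
    """Return part `index` of each element, or None where an element has too few parts."""
    if values is None:  # assumed: a null array cell arrives as None
        return None
    result = []
    for value in values:
        # A null element is treated as having no parts at all.
        parts = value.split(sep) if value is not None else []
        result.append(parts[index] if len(parts) > index else None)
    return result


if __name__ == "__main__":
    print(third_part(["aa1_bb1_cc1", "aa2_bb2_cc2", "aa3_bb3"]))
    # prints ['cc1', 'cc2', None]
```

It can then be registered the same way as above, for example F.udf(third_part, T.ArrayType(T.StringType())).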
