
This is my actual code, and it works fine:

import pyspark.sql.functions as f

df_train_taxrate = (
  df_train.groupby(
    'Company_code_BUKRS',
    'Vendor_Customer_Code_WT_ACCO',
    'Expense_GL_HKONT',
    'PAN_J_1IPANNO',
    'HSN_SAC_HSN_SAC'
  ).agg(
    f.collect_set('Section_WT_QSCOD').alias('Unique_Sectio_Code'),
    f.collect_set('WHT_rate_QSATZ').alias('Unique_Wtax_rate')
  )
)

But the problem is that 'Section_WT_QSCOD' and 'WHT_rate_QSATZ' are aggregated into arrays, and while converting those arrays into strings I get the error below.

My code:

df_train_taxrate = df_train.groupby(
    'Company_code_BUKRS',
    'Vendor_Customer_Code_WT_ACCO',
    'Expense_GL_HKONT',
    'PAN_J_1IPANNO',
    'HSN_SAC_HSN_SAC'
  ).agg(
    f.collect_set('Section_WT_QSCOD').withColumn(
      'Section_WT_QSCOD',
      concat_ws(',', 'Unique_Sectio_Code')
    ),
    f.collect_set('WHT_rate_QSATZ').withColumn(
      'WHT_rate_QSATZ',
      concat_ws(',', 'Unique_W_tax_rate')
    )
  )

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Column' object is not callable
  • Split your code into lines for better reading, please. Commented Apr 10, 2020 at 17:05
  • Hi Henrique, I was able to do it for the first one; for the second one I'm not able to fit it into the code section. Commented Apr 10, 2020 at 17:18
  • I think your problem is missing parentheses. It looks like you're trying to call withColumn on collect_set(), which doesn't make any sense. That would explain why you get that error message. Commented Apr 10, 2020 at 21:41
  • You should be doing something like concat_ws(',', f.collect_set('Section_WT_QSCOD')).alias('Section_WT_QSCOD') Commented Apr 10, 2020 at 21:47
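
The last two comments identify the cause: f.collect_set(...) returns a Column, and Column has no withColumn method (that lives on DataFrame), so PySpark interprets .withColumn as struct-field access and returns another Column, which then isn't callable, hence the TypeError. A minimal sketch of the fix suggested in the last comment, applied to the question's aggregation (df_train and the column names are from the question; f is assumed to be pyspark.sql.functions):

import pyspark.sql.functions as f

df_train_taxrate = (
    df_train.groupby(
        'Company_code_BUKRS',
        'Vendor_Customer_Code_WT_ACCO',
        'Expense_GL_HKONT',
        'PAN_J_1IPANNO',
        'HSN_SAC_HSN_SAC'
    ).agg(
        # build the set first, then flatten it to a comma-separated string
        f.concat_ws(',', f.collect_set('Section_WT_QSCOD')).alias('Unique_Sectio_Code'),
        # if WHT_rate_QSATZ is numeric, cast before collecting, e.g.
        # f.collect_set(f.col('WHT_rate_QSATZ').cast('string'))
        f.concat_ws(',', f.collect_set('WHT_rate_QSATZ')).alias('Unique_Wtax_rate')
    )
)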

1 Answer


You need to use array_join instead.

Example data

import pyspark.sql.functions as F
data = [
    ('a', 'x1'),
    ('a', 'x2'),
    ('a', 'x3'),
    ('b', 'y1'),
    ('b', 'y2')
]
df = spark.createDataFrame(data, ['id', 'val'])

Solution

result = (
    df.
        groupby('id').
        agg(
            F.collect_set(F.col('val')).alias('arr_of_vals')
        ).
        withColumn(
            'arr_to_string',
            F.array_join(
                F.col('arr_of_vals'),
                ','
            )
        )
)
result
DataFrame[id: string, arr_of_vals: array<string>, arr_to_string: string]
result.show(truncate=False)
+---+------------+-------------+                                                
|id |arr_of_vals |arr_to_string|
+---+------------+-------------+
|b  |[y2, y1]    |y2,y1        |
|a  |[x1, x3, x2]|x1,x3,x2     |
+---+------------+-------------+

2 Comments

  • AttributeError: module 'pyspark.sql.functions' has no attribute 'array_join'
  • Which version of pyspark are you on?
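
That AttributeError points at an older Spark: array_join was added in Spark 2.4. On earlier versions, concat_ws also accepts an array-of-strings column, so a sketch of an equivalent fallback, reusing df and F from the answer's example (treat the pre-2.4 behavior as an assumption worth verifying on your version):

result = (
    df
    .groupby('id')
    .agg(F.collect_set(F.col('val')).alias('arr_of_vals'))
    # concat_ws joins the elements of an array<string> column with the separator
    .withColumn('arr_to_string', F.concat_ws(',', F.col('arr_of_vals')))
)
result.show(truncate=False)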
