
This is my actual code, and it works fine:

import pyspark.sql.functions as f

df_train_taxrate = (
  df_train.groupby(
    'Company_code_BUKRS',
    'Vendor_Customer_Code_WT_ACCO',
    'Expense_GL_HKONT',
    'PAN_J_1IPANNO',
    'HSN_SAC_HSN_SAC'
  ).agg(
    f.collect_set('Section_WT_QSCOD').alias('Unique_Sectio_Code'),
    f.collect_set('WHT_rate_QSATZ').alias('Unique_Wtax_rate')
  )
)

But the problem is that 'Section_WT_QSCOD' and 'WHT_rate_QSATZ' are aggregated into arrays, and while converting those arrays into strings I get the error below.

My code:

df_train_taxrate = df_train.groupby(
    'Company_code_BUKRS',
    'Vendor_Customer_Code_WT_ACCO',
    'Expense_GL_HKONT',
    'PAN_J_1IPANNO',
    'HSN_SAC_HSN_SAC'
  ).agg(
    f.collect_set('Section_WT_QSCOD').withColumn(
      'Section_WT_QSCOD',
      concat_ws(',', 'Unique_Sectio_Code')
    ),
    f.collect_set('WHT_rate_QSATZ').withColumn(
      'WHT_rate_QSATZ',
      concat_ws(',', 'Unique_W_tax_rate')
    )
  )

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Column' object is not callable
  • Split your code into lines for better reading, please. Commented Apr 10, 2020 at 17:05
  • Hi Henrique, I was able to do it for the first one; for the second one I'm not able to fit it into the code section. Commented Apr 10, 2020 at 17:18
  • I think your problem is missing parentheses. It looks like you're trying to call withColumn on collect_set(), which doesn't make any sense. That would explain why you get that error message. Commented Apr 10, 2020 at 21:41
  • You should be doing something like concat_ws(',', f.collect_set('Section_WT_QSCOD')).alias('Section_WT_QSCOD') Commented Apr 10, 2020 at 21:47
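
The last two comments identify the cause: f.collect_set(...) returns a Column, and Column has no withColumn method (that lives on DataFrame), so PySpark interprets .withColumn as struct-field access and returns another Column, which then isn't callable, hence the TypeError. A minimal sketch of the fix suggested in the last comment, applied to the question's aggregation (df_train and the column names are from the question; f is assumed to be pyspark.sql.functions):

import pyspark.sql.functions as f

df_train_taxrate = (
    df_train.groupby(
        'Company_code_BUKRS',
        'Vendor_Customer_Code_WT_ACCO',
        'Expense_GL_HKONT',
        'PAN_J_1IPANNO',
        'HSN_SAC_HSN_SAC'
    ).agg(
        # build the set first, then flatten it to a comma-separated string
        f.concat_ws(',', f.collect_set('Section_WT_QSCOD')).alias('Unique_Sectio_Code'),
        # if WHT_rate_QSATZ is numeric, cast before collecting, e.g.
        # f.collect_set(f.col('WHT_rate_QSATZ').cast('string'))
        f.concat_ws(',', f.collect_set('WHT_rate_QSATZ')).alias('Unique_Wtax_rate')
    )
)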

1 Answer


You need to use array_join instead.

Example data

import pyspark.sql.functions as F
data = [
    ('a', 'x1'),
    ('a', 'x2'),
    ('a', 'x3'),
    ('b', 'y1'),
    ('b', 'y2')
]
df = spark.createDataFrame(data, ['id', 'val'])

Solution

result = (
    df.
        groupby('id').
        agg(
            F.collect_set(F.col('val')).alias('arr_of_vals')
        ).
        withColumn(
            'arr_to_string',
            F.array_join(
                F.col('arr_of_vals'),
                ','
            )
        )
)
result
DataFrame[id: string, arr_of_vals: array<string>, arr_to_string: string]
result.show(truncate=False)
+---+------------+-------------+                                                
|id |arr_of_vals |arr_to_string|
+---+------------+-------------+
|b  |[y2, y1]    |y2,y1        |
|a  |[x1, x3, x2]|x1,x3,x2     |
+---+------------+-------------+

2 Comments

  • AttributeError: module 'pyspark.sql.functions' has no attribute 'array_join'
  • Which version of pyspark are you on?
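
That AttributeError points at an older Spark: array_join was added in Spark 2.4. On earlier versions, concat_ws also accepts an array-of-strings column, so a sketch of an equivalent fallback, reusing df and F from the answer's example (treat the pre-2.4 behavior as an assumption worth verifying on your version):

result = (
    df
    .groupby('id')
    .agg(F.collect_set(F.col('val')).alias('arr_of_vals'))
    # concat_ws joins the elements of an array<string> column with the separator
    .withColumn('arr_to_string', F.concat_ws(',', F.col('arr_of_vals')))
)
result.show(truncate=False)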
