
I need to merge multiple columns of a DataFrame into a single column whose value is a list (or tuple) of the original column values, using PySpark in Python.

Input DataFrame:

+-------+-------+-------+-------+-------+
| name  |mark1  |mark2  |mark3  | Grade |
+-------+-------+-------+-------+-------+
| Jim   | 20    | 30    | 40    |  "C"  |
+-------+-------+-------+-------+-------+
| Bill  | 30    | 35    | 45    |  "A"  |
+-------+-------+-------+-------+-------+
| Kim   | 25    | 36    | 42    |  "B"  |
+-------+-------+-------+-------+-------+

The output DataFrame should be:

+-------+-----------------+
| name  |marks            |
+-------+-----------------+
| Jim   | [20,30,40,"C"]  |
+-------+-----------------+
| Bill  | [30,35,45,"A"]  |
+-------+-----------------+
| Kim   | [25,36,42,"B"]  |
+-------+-----------------+

4 Answers


Columns can be merged with Spark's array function:

import pyspark.sql.functions as f

columns = [f.col("mark1"), ...] 

output = input.withColumn("marks", f.array(columns)).select("name", "marks")

You might need to change the type of the entries for the merge to succeed, since array expects all of its inputs to have the same type.
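For reference, a complete runnable sketch of this approach, assuming the question's DataFrame is bound to input_df (the column names come from the question's table). Casting every entry to string avoids the type mismatch between the integer mark columns and the string Grade column:

import pyspark.sql.functions as f

# Cast all columns to string so f.array receives elements of a single type.
columns = [f.col(c).cast("string") for c in ["mark1", "mark2", "mark3", "Grade"]]

output = input_df.withColumn("marks", f.array(*columns)).select("name", "marks")
output.show(truncate=False)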




Have a look at this doc: https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler

from pyspark.ml.feature import VectorAssembler

# VectorAssembler combines the given numeric columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3"],
    outputCol="marks")

output = assembler.transform(dataset)
output.select("name", "marks").show(truncate=False)

2 Comments

I have string columns that I need to merge as well. With string columns, output = assembler.transform(mydata_df) fails with: pyspark.sql.utils.IllegalArgumentException: u'Data type StringType is not supported.'
VectorAssembler only merges numeric columns, not strings; in this case there is a string column (Grade).

You can do it in a select like the following:

from pyspark.sql.functions import *    
df.select( 'name' ,
        concat( 
            col("mark1"), lit(","), 
            col("mark2"), lit(","), 
            col("mark3"), lit(","),
            col("Grade")
        ).alias('marks')  
    )

If the brackets [ ] are necessary, they can be added with the lit function:

from pyspark.sql.functions import *    
df.select( 'name' ,
        concat(lit("["), 
            col("mark1"), lit(","), 
            col("mark2"), lit(","), 
            col("mark3"), lit(","),
            col("Grade"), lit("]")
        ).alias('marks')  
    )
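Note that concat returns a single string such as [20,30,40,C], not an array column. If a plain comma-separated string is enough, concat_ws is a slightly shorter equivalent; this is a sketch, not part of the original answer, and depending on your Spark version you may need to cast the numeric columns to string explicitly:

from pyspark.sql.functions import concat_ws, col

# concat_ws joins the column values with the given separator into one string.
df.select(
    'name',
    concat_ws(",", col("mark1"), col("mark2"), col("mark3"), col("Grade")).alias('marks')
).show(truncate=False)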



If this is still relevant, you can use StringIndexer to encode your string values as numeric indices.
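A minimal sketch of that idea, combined with the VectorAssembler answer above; the indexed column name grade_idx is made up for illustration, and the other column names come from the question:

from pyspark.ml.feature import StringIndexer, VectorAssembler

# StringIndexer maps each distinct Grade value to a numeric index (a double),
# which VectorAssembler can then merge with the mark columns.
indexer = StringIndexer(inputCol="Grade", outputCol="grade_idx")
indexed = indexer.fit(df).transform(df)

assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3", "grade_idx"],
    outputCol="marks")

output = assembler.transform(indexed)
output.select("name", "marks").show(truncate=False)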

