Remove empty string from list (Spark Dataframe) [duplicate]

Question

Current Dataframe

+-----------------+--------------------+
|__index_level_0__|        Text_obj_col|
+-----------------+--------------------+
|                1|   [ ,entrepreneurs]|
|                2|[eat, , human, poop]|
|                3|    [Manafort, case]|
|                4|  [Sunar, Khatris, ]|
|                5|[become, arrogant, ]|
|                6|  [GPS, get, name, ]|
|                7|[exactly, reality, ]|
+-----------------+--------------------+

I want the empty strings removed from each list. This is test data; the actual data is pretty big. How can I do this in PySpark?

shuvalov · Accepted Answer · 2019-12-27 13:22:37Z


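You could use a UDF for this task:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def filter_empty(l):
    # Pass null arrays through unchanged
    if l is None:
        return None
    # Return a list rather than a lazy filter object (Python 3)
    return [x for x in l if x is not None and len(x) > 0]

# Declare the return type so Spark keeps the column as array<string>
filter_empty_udf = udf(filter_empty, ArrayType(StringType()))

df.select(filter_empty_udf("Text_obj_col").alias("Text_obj_col")).show(10, False)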

Tested on a few rows from your sample:

+------------------+
|Text_obj_col      |
+------------------+
|[entrepreneurs]   |
|[eat, human, poop]|
+------------------+
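
If a Python UDF is too slow on the full dataset, a built-in function may be worth trying. The following is only a sketch under a few assumptions: it needs Spark 2.4 or later (where `array_remove` was added), the helper names `drop_empty` and `remove_empty_builtin` are hypothetical, and the `strip` option is an illustrative extra for the case where the "empty" entries are actually whitespace, which the sample table does not make clear.

```python
def drop_empty(values, strip=False):
    """Pure-Python reference for the filtering rule used by the UDF.

    With strip=True (a hypothetical extra), whitespace-only strings
    are also treated as empty.
    """
    if values is None:
        return None
    return [
        v for v in values
        if v is not None and len(v.strip() if strip else v) > 0
    ]


def remove_empty_builtin(df, column="Text_obj_col"):
    """Assumes Spark 2.4+: drop "" elements without a Python UDF.

    Unlike the UDF, array_remove only removes elements equal to "";
    null elements inside the array are left in place.
    """
    # Imported lazily so the pure-Python helper works without PySpark
    from pyspark.sql.functions import array_remove

    return df.withColumn(column, array_remove(df[column], ""))
```

Calling `remove_empty_builtin(df).show(10, False)` should mirror the UDF call for plain empty strings, while `drop_empty` can be wrapped in `udf(...)` as before when the whitespace handling is needed.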

