Current Dataframe

+-----------------+--------------------+
|__index_level_0__|        Text_obj_col|
+-----------------+--------------------+
|                1|   [ ,entrepreneurs]|
|                2|[eat, , human, poop]|
|                3|    [Manafort, case]|
|                4|  [Sunar, Khatris, ]|
|                5|[become, arrogant, ]|
|                6|  [GPS, get, name, ]|
|                7|[exactly, reality, ]|
+-----------------+--------------------+

I want the empty strings removed from each list. This is test data; the actual data is much larger. How can I do this in PySpark?

1 Answer

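You could use a UDF for this task:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def filter_empty(l):
    # Return a real list: on Python 3, filter() yields a lazy iterator,
    # which does not match the declared ArrayType return type.
    if l is None:
        return None  # pass null arrays through unchanged
    return [x for x in l if x is not None and len(x) > 0]

filter_empty_udf = udf(filter_empty, ArrayType(StringType()))

df.select(filter_empty_udf("Text_obj_col").alias("Text_obj_col")).show(10, False)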

Tested on a few rows from your sample:

+------------------+
|Text_obj_col      |
+------------------+
|[entrepreneurs]   |
|[eat, human, poop]|
+------------------+
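
Since the real data is large, it may be worth avoiding a Python UDF altogether. The sketch below is an alternative to the UDF above, not something tested on your data: it assumes Spark 2.4 or later, where the SQL higher-order function `filter` is available for array columns. The helper names `non_empty_expr` and `drop_empty_strings` are made up for illustration.

```python
def non_empty_expr(column):
    """Build a Spark SQL expression that drops null and empty-string elements."""
    # `filter` here is Spark SQL's higher-order array function (assumed Spark 2.4+),
    # not Python's built-in filter().
    return "filter(`{0}`, x -> x is not null and x != '')".format(column)


def drop_empty_strings(df, column="Text_obj_col"):
    """Return df with null/empty strings removed from the given array column."""
    # Imported lazily so the expression builder above can be inspected without Spark.
    from pyspark.sql.functions import expr
    return df.withColumn(column, expr(non_empty_expr(column)))
```

Usage would look like `drop_empty_strings(df).show(10, False)`. Because the expression runs inside the JVM, rows are not serialized to Python workers, which usually matters on big inputs. If null elements can be left alone, `pyspark.sql.functions.array_remove(col, "")` (also Spark 2.4+) is a shorter option.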