
I have the following abstracted DataFrame (my original DF has 60+ billion rows):

Id    Date          Val1   Val2
1     2021-02-01    10     2
1     2021-02-05    8      4
2     2021-02-03    2      0
1     2021-02-07    12     5
2     2021-02-05    1      3

My expected output is:

Id    Date          Val1   Val2
1     2021-02-01    10     2
1     2021-02-02    10     2
1     2021-02-03    10     2
1     2021-02-04    10     2
1     2021-02-05    8      4
1     2021-02-06    8      4
1     2021-02-07    12     5
2     2021-02-03    2      0
2     2021-02-04    2      0
2     2021-02-05    1      3

Basically, what I need is: if Val1 or Val2 changes between two dates, every date in between must carry the values from the previous date (to see this clearly, look at Id 2).

I know that I can do this in many ways (window function, UDF, ...), but my question is: since my original DF has more than 60 billion rows, what is the best approach for this processing?

1 Answer


I think the best approach (performance-wise) is an inner join (probably with broadcasting). If you are worried about the number of records, I suggest running them in batches (split by number of records, by date, or even by a random key). The general idea is simply to avoid processing everything at once, as in the sketch below.
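A minimal sketch of that batching idea, assuming a PySpark DataFrame `df` with the columns from the question. The bucket count, the output path, and the `fill_dates` helper (which stands for the date-filling join sketched after the comments below) are all hypothetical:

    from pyspark.sql import functions as F

    N_BUCKETS = 64  # assumption: tune to your cluster

    for b in range(N_BUCKETS):
        # Stable split of the Ids into buckets so no single pass
        # touches all 60B rows at once.
        batch = df.where((F.abs(F.hash("Id")) % N_BUCKETS) == b)
        # fill_dates is a hypothetical helper implementing the
        # broadcast date join sketched after the comments below.
        fill_dates(batch).write.mode("append").parquet("/tmp/filled")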


5 Comments

I understand your approach of using batch processing and broadcasting, but I have a doubt: how would I generate the date range using the joins?
From your original dataset, I'd get the min and max dates, generate a list of dates with Pandas, convert it to a Spark DataFrame, and perform an inner join between your original dataset and the dates dataset based on the start and end date per row.
OK, but I'll need to get the ranges of dates, right? For example, for Id 1 there are three changes in Val1 and Val2, so my min date should be 2021-02-01 and my max date 2021-02-05; after that, my min date should be 2021-02-05 and my max date 2021-02-07. The only way I have to do this validation is through a UDF, right?
No, you just need to get the min and max of the entire dataset.
Sorry, but I still don't understand. How do I generate the in-between values using an inner join with the min and max values?
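
Putting the steps from these comments together, here is a minimal PySpark sketch. It assumes the original DataFrame is `df` with the columns from the question; deriving each row's end date with a small `lead` window is my assumption, since the comments don't spell out that step:

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.getOrCreate()

    # Min and max of the entire dataset: two scalars, cheap to collect.
    bounds = df.agg(F.min("Date").alias("lo"), F.max("Date").alias("hi")).first()

    # Tiny date dimension built in Pandas, converted to Spark -- small
    # enough to broadcast no matter how large df is.
    dates = spark.createDataFrame(
        pd.DataFrame({"d": pd.date_range(bounds["lo"], bounds["hi"]).date})
    )

    # Per row, the interval during which its values hold: from its own
    # Date up to the day before the Id's next Date (the last row per Id
    # keeps only its own date).
    w = Window.partitionBy("Id").orderBy("Date")
    intervals = df.withColumn(
        "end", F.coalesce(F.date_sub(F.lead("Date").over(w), 1), F.col("Date"))
    )

    # Inner join against the broadcast date dimension: each row fans out
    # to one output row per date in its interval.
    result = (
        intervals.join(
            F.broadcast(dates),
            (F.col("d") >= F.col("Date")) & (F.col("d") <= F.col("end")),
        )
        .select("Id", F.col("d").alias("Date"), "Val1", "Val2")
        .orderBy("Id", "Date")
    )

On the sample data this fans Id 1's 2021-02-01 row out to 2021-02-01 through 2021-02-04, and Id 2's 2021-02-03 row out to 2021-02-03 and 2021-02-04, matching the expected output.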
