
I have the following abstracted DataFrame (my original DF has 60+ billion rows):

Id    Date          Val1   Val2
1     2021-02-01    10     2
1     2021-02-05    8      4
2     2021-02-03    2      0
1     2021-02-07    12     5
2     2021-02-05    1      3

My expected output is:

Id    Date          Val1   Val2
1     2021-02-01    10     2
1     2021-02-02    10     2
1     2021-02-03    10     2
1     2021-02-04    10     2
1     2021-02-05    8      4
1     2021-02-06    8      4
1     2021-02-07    12     5
2     2021-02-03    2      0
2     2021-02-04    2      0
2     2021-02-05    1      3

Basically, what I need is: if Val1 or Val2 changes between two dates, every date in between must carry the values from the previous date (to see this clearly, look at Id 2).

I know that I can do this in many ways (window function, UDF, ...), but my question is: since my original DF has more than 60 billion rows, what is the best approach for this processing?

1 Answer


I think the best approach (performance-wise) is an inner join (probably with broadcasting). If you are worried about the number of records, I suggest running them in batches (split by number of records, by date, or even by a random key). The general idea is simply to avoid processing everything at once, as in the sketch below.
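A minimal sketch of that batching idea, assuming a PySpark DataFrame `df` with the columns from the question. The bucket count, the output path, and the `fill_dates` helper (which stands for the date-filling join sketched after the comments below) are all hypothetical:

    from pyspark.sql import functions as F

    N_BUCKETS = 64  # assumption: tune to your cluster

    for b in range(N_BUCKETS):
        # Stable split of the Ids into buckets so no single pass
        # touches all 60B rows at once.
        batch = df.where((F.abs(F.hash("Id")) % N_BUCKETS) == b)
        # fill_dates is a hypothetical helper implementing the
        # broadcast date join sketched after the comments below.
        fill_dates(batch).write.mode("append").parquet("/tmp/filled")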


5 Comments

I understand your approach of using batch processing and broadcasting, but I have a doubt: how would I generate the date range using the joins?
From your original dataset, I'd get the min and max dates, generate a list of dates with Pandas, convert it to a Spark DataFrame, and perform an inner join between your original dataset and the dates dataset based on the start and end date per row.
OK, but I'll need to get the ranges of dates, right? For example, for Id 1 there are three changes in Val1 and Val2, so my min date should be 2021-02-01 and my max date 2021-02-05; after that, my min date should be 2021-02-05 and my max date 2021-02-07. The only way I have to do this validation is through a UDF, right?
No, you just need to get the min and max of the entire dataset.
Sorry, but I still don't understand. How do I generate the in-between values using an inner join with the min and max values?
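
Putting the steps from these comments together, here is a minimal PySpark sketch. It assumes the original DataFrame is `df` with the columns from the question; deriving each row's end date with a small `lead` window is my assumption, since the comments don't spell out that step:

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.getOrCreate()

    # Min and max of the entire dataset: two scalars, cheap to collect.
    bounds = df.agg(F.min("Date").alias("lo"), F.max("Date").alias("hi")).first()

    # Tiny date dimension built in Pandas, converted to Spark -- small
    # enough to broadcast no matter how large df is.
    dates = spark.createDataFrame(
        pd.DataFrame({"d": pd.date_range(bounds["lo"], bounds["hi"]).date})
    )

    # Per row, the interval during which its values hold: from its own
    # Date up to the day before the Id's next Date (the last row per Id
    # keeps only its own date).
    w = Window.partitionBy("Id").orderBy("Date")
    intervals = df.withColumn(
        "end", F.coalesce(F.date_sub(F.lead("Date").over(w), 1), F.col("Date"))
    )

    # Inner join against the broadcast date dimension: each row fans out
    # to one output row per date in its interval.
    result = (
        intervals.join(
            F.broadcast(dates),
            (F.col("d") >= F.col("Date")) & (F.col("d") <= F.col("end")),
        )
        .select("Id", F.col("d").alias("Date"), "Val1", "Val2")
        .orderBy("Id", "Date")
    )

On the sample data this fans Id 1's 2021-02-01 row out to 2021-02-01 through 2021-02-04, and Id 2's 2021-02-03 row out to 2021-02-03 and 2021-02-04, matching the expected output.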
