
I am hoping to do an event study analysis, but I cannot seem to properly build a simple predictive model with time as the independent variable. I've been using this as a guide.

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

#sample data
units = [0.916301354, 0.483947819, 0.551258976, 0.147971439, 0.617461504, 0.957460424, 0.905076453, 0.274261518, 0.861609383, 0.285914819, 0.989686616, 0.86614591, 0.074250832, 0.209507105, 0.082518752, 0.215795111, 0.953852132, 0.768329343, 0.380686392, 0.623940323, 0.155944248, 0.495745862, 0.0845513, 0.519966471, 0.706618333, 0.872300766, 0.70769554, 0.760616731, 0.213847926, 0.703866155, 0.802862491, 0.52468101, 0.352283626, 0.128962646, 0.684358794, 0.360520106, 0.889978575, 0.035806225, 0.15459103, 0.227742501, 0.06248614, 0.903500165, 0.13851151, 0.664684486, 0.011042697, 0.86353796, 0.971852899, 0.487774978, 0.547767217, 0.153629408, 0.076994094, 0.230693561, 0.961345948]
begin_date = '2022-8-01'
df = pd.DataFrame({'date':pd.date_range(begin_date, periods=len(units)),'units':units})


# Create estimation data set
est_data = df['2022-08-01':'2022-08-30']

# And observation data
obs_data = df['2022-09-01':'2022-09-14']

# Estimate a model predicting stock price with market return
m = smf.ols('variable ~ date', data = est_data).fit()

# Get AR
# Using mean of estimation return
var_return = np.mean(est_data['variable'])
obs_data['AR_mean'] = obs_data['variable'] - var_return

# Then using model fit with estimation data
obs_data['risk_pred'] = m.predict()

obs_data['AR_risk'] = obs_data['variable'] - obs_data['risk_pred']

# Graph the results
sns.lineplot(x = obs_data['date'],y = obs_data['AR_risk'])
plt.show()

As is, it won't recognise the date as a variable (the error message was attached as an image).

I've tried leaving the index as a counter and making the date a separate variable, but then when it gets to the "predict" portion, it doesn't understand how to predict on dates that it has not seen before.

  • You need to specify a sep in your pd.read_csv line: pd.read_csv(..., sep = " // "). Also, why do you have a return in the middle of your code? Please add the csv file as well, maybe a google link or justpasteit Commented Apr 1, 2024 at 17:48
  • Ah I see, you assigned a variable named return. Don't do that, return is a keyword and it will mess up your execution Commented Apr 1, 2024 at 17:49
  • @TinoD Apologies! Both of those errors came from me adjusting the code to post. I've fixed those mistakes now. Commented Apr 1, 2024 at 20:52
  • What is the aim of your task? There are some inconsistencies in your approach, like assigning different lengths to the data frame and not using indexing correctly. Commented Apr 2, 2024 at 8:29

1 Answer

There are quite a few bugs in your code. I'll go through them one by one below (see the comments between ''' '''):

'''
small note, here you defined the variable as units and below you want to use a column called "variable".
Not a big problem, most probably you were reading the data from a file anyway, just something to keep in mind
'''
df = pd.DataFrame({'date':pd.date_range(begin_date, periods=len(units)),'units':units})
'''
The following two lines do not work like that.
First, the dataframe is not indexed by a datetime, so date strings cannot be used to slice its rows
Second, to slice rows by label you need to use .loc (.iloc only takes integer positions)
'''
# Create estimation data set
est_data = df['2022-08-01':'2022-08-30'] 
# And observation data
obs_data = df['2022-09-01':'2022-09-14']
'''
Here you are fitting according to est_data.
using the m.predict() function will give you the fitted points of est_data.
This will be important later
'''
# Estimate a model predicting stock price with market return
m = smf.ols('variable ~ date', data = est_data).fit()
# Get AR
# Using mean of estimation return
'''
you don't need np.mean for this, just use est_data['variable'].mean()
Also it is most probably not needed to have the mean in your script.
You can directly subtract using obs_data['variable'] - est_data['variable'].mean()
'''
var_return = np.mean(est_data['variable'])
obs_data['AR_mean'] = obs_data['variable'] - var_return
'''
This will not always work, and in this case it does not.
m.predict() without arguments returns the fitted values for est_data, so it outputs as many points as est_data has
For this assignment to work, obs_data would need to have the same number of points as est_data
'''
obs_data['risk_pred'] = m.predict()
obs_data['AR_risk'] = obs_data['variable'] - obs_data['risk_pred']

I am currently working on fixing the bugs and will give you a working example soon. In the meantime, can you please answer the following question:

  • Do you really want to fit the model according to est_data? If so, how are you gonna combine this with obs_data?

Edit 1: how to separate the data

The following code references the dates in the data frame:

est_data_Start = pd.to_datetime('2022-08-01')
est_data_End = pd.to_datetime('2022-08-30')
obs_data_Start = pd.to_datetime('2022-09-01')
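est_data = df[df["date"].between(est_data_Start,est_data_End)].copy()
# note: the strict ">" excludes 2022-09-01 itself; use ">=" to include it
obs_data = df[df["date"]>obs_data_Start].copy() # .copy() so columns can be added later without a SettingWithCopyWarning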

The results for est_data are:

    date    variable
0   2022-08-01  0.916301
1   2022-08-02  0.483948
2   2022-08-03  0.551259
3   2022-08-04  0.147971
4   2022-08-05  0.617462
5   2022-08-06  0.957460
6   2022-08-07  0.905076
7   2022-08-08  0.274262
8   2022-08-09  0.861609
9   2022-08-10  0.285915
10  2022-08-11  0.989687
11  2022-08-12  0.866146
12  2022-08-13  0.074251
13  2022-08-14  0.209507
14  2022-08-15  0.082519
15  2022-08-16  0.215795
16  2022-08-17  0.953852
17  2022-08-18  0.768329
18  2022-08-19  0.380686
19  2022-08-20  0.623940
20  2022-08-21  0.155944
21  2022-08-22  0.495746
22  2022-08-23  0.084551
23  2022-08-24  0.519966
24  2022-08-25  0.706618
25  2022-08-26  0.872301
26  2022-08-27  0.707696
27  2022-08-28  0.760617
28  2022-08-29  0.213848
29  2022-08-30  0.703866

And the rest is going to obs_data.
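
As an alternative to the boolean masks above, the date-string slicing from the question can be made to work by indexing the frame by date first. This is only a minimal sketch: df_demo, est_demo and obs_demo are hypothetical names, and the values are random stand-ins for the question's data over the same date range.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 53 daily values starting 2022-08-01, as in the question
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "date": pd.date_range("2022-08-01", periods=53),
    "variable": rng.random(53),
})

# With a DatetimeIndex, .loc accepts date strings and includes both endpoints
indexed = df_demo.set_index("date")
est_demo = indexed.loc["2022-08-01":"2022-08-30"]
obs_demo = indexed.loc["2022-09-01":"2022-09-14"]

print(len(est_demo), len(obs_demo))  # 30 14
```

Here the observation window is the 14 days requested in the question, rather than everything after the estimation window.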

Edit 2: fitting and predicting

The following code uses OLS to fit a model to the est_data. Then, the model is used to predict the values based on the data found in obs_data:

XTrain = est_data.index # get the training predictor
XTrain = sm.add_constant(XTrain) # add constant term to account for any intercept
m = sm.OLS(est_data["variable"], XTrain).fit() # fit according to training
XTest = obs_data.index # get the testing predictors
XTest = sm.add_constant(XTest) # and add a constant term
obs_data["risk_pred"] = m.predict(XTest) # predict based on the new data
# the following two calculations I just copied from you...
obs_data["AR_mean"] = obs_data["variable"] - est_data["variable"].mean()
obs_data["AR_risk"] = obs_data["variable"] - obs_data["risk_pred"]
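
The same idea can probably also be written with the formula interface used in the question, as long as time enters the formula as a number instead of a raw date. The sketch below is an assumption-laden illustration, not the code that produced the results here: the column t (days since the first date) and the names demo, est, obs and trend are hypothetical, and the data are random stand-ins.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in data over the question's date range
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "date": pd.date_range("2022-08-01", periods=53),
    "variable": rng.random(53),
})

# Numeric time trend: days elapsed since the first observation
demo["t"] = (demo["date"] - demo["date"].min()).dt.days

# Estimation window and observation window (copied so columns can be added)
est = demo[demo["date"] <= pd.Timestamp("2022-08-30")]
obs = demo[demo["date"].between(pd.Timestamp("2022-09-01"),
                                pd.Timestamp("2022-09-14"))].copy()

# Fit on the estimation window only
trend = smf.ols("variable ~ t", data=est).fit()

# Passing the observation frame evaluates the fitted line at its unseen t values
obs["risk_pred"] = trend.predict(obs)
obs["AR_risk"] = obs["variable"] - obs["risk_pred"]
```

Because predict() receives obs explicitly, it returns one value per observation row, so the length mismatch with the estimation data does not arise.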

The following code plots the results:

plt.figure()
plt.plot(est_data["date"], est_data["variable"], "-o", label = "Estimated")
plt.plot(obs_data["date"], obs_data["variable"], "-o", label = "Observed")
plt.plot(est_data["date"], m.predict(XTrain), label = "Train fit")
plt.plot(obs_data["date"], m.predict(XTest), label = "Test fit")
plt.legend(ncols =4, bbox_to_anchor=[0.5, 1.1, 0.5, 0])
plt.grid()
locator = mdates.AutoDateLocator(minticks = 7) 
formatter = mdates.ConciseDateFormatter(locator) 
plt.gca().xaxis.set_major_locator(locator) 
plt.gca().xaxis.set_major_formatter(formatter) 

The results and the imports are in the following section:

[Image: fit]

Imports:

import pandas as pd
import numpy as np
import statsmodels.api as sm
%matplotlib notebook
import matplotlib.pyplot as plt
import matplotlib.dates as mdates 

5 Comments

This is very helpful thank you! I'm hoping to do an event study - which would require some period of time where I estimate "normal data" and then some small observation window (of varying lengths) where we compare the predicted values in this point to the actual observed data. That will come in within that last line - obs_data['AR_risk'] = obs_data['variable'] - obs_data['risk_pred']
The thing is that you cannot assign those predictions to obs_data. Semantically, they are not within the time range, one is in the future and the other is in the past. Syntactically because the two do not have the same number of points, one has 21 and the other 30. Without knowing what you want to do, I cannot help you past separating the data, which I will include in my first edit.
Is there not a way to use the past data to estimate what we might expect in the future and then compare the actual future data to this? At the very basic, I want to do linear regression with variable as a function of time and use that regression to predict x many data points. With x being the length of the obs_data.
For sure this can be done, I'll edit my answer later
@josephbags check my latest edit
