Welcome to the wonderful, wacky world of dates and times in Python. Specifically, how timezone-aware and timezone-naive dates don't always play nice. Generally, all the times returned by methods on Quantopian are timezone aware, with a timezone of 'UTC'. By default, the date_range method returns timezone-naive dates. Consider what happens when pandas tries to make a dataframe out of the following:
pd.DataFrame ({
'close':aapl_close,
'returns':aapl_returns},
index=pd.date_range(start=period_start, periods=6))
The 'aapl_close' and 'aapl_returns' specified as columns of the dataframe are both pandas series. Each has an index of UTC timezone-aware dates. They aren't just dates but datetimes, which include hours. The date_range index being specified is, by default, timezone naive.
Now, the problem arises because pandas is smart, sometimes too smart. The first thing pandas does when creating the dataframe above is to create the index. Next it assigns values from the two data series to the appropriate index entries. Since an index was supplied, pandas doesn't simply assume the first value goes with the first index date. Instead, pandas takes each index entry and 'looks up' the value having the same label in each series. The problem is that the timezone-naive datetimes in the supplied index don't match the timezone-aware datetimes in the series. Pandas doesn't find any data with the same datetimes as the index, so it fills in NaNs for the data.
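To see the mismatch in isolation, here is a minimal sketch. The series below is a hypothetical stand-in for the Quantopian pricing data (the index dates and the way it is built are assumptions for illustration), and the exact behavior may vary between pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for Quantopian data: a series with a UTC-aware index.
aware_index = pd.date_range("2013-01-02", periods=3, tz="UTC")
aapl_close = pd.Series([68.732, 68.032, 66.091], index=aware_index)

# date_range without tz gives a timezone-naive index.
naive_index = pd.date_range("2013-01-02", periods=3)

# Pandas aligns by label, so the naive labels find no match in the aware series.
mismatched = pd.DataFrame({"close": aapl_close}, index=naive_index)

# With an aware index the labels line up and the values come through.
matched = pd.DataFrame({"close": aapl_close}, index=aware_index)

print(mismatched)
print(matched)
```

The first frame should show only NaN in the close column, while the second shows the three prices.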
What's the fix? The best thing to do is not supply an index. Pandas will infer it from the two series and make things work. Like this:
pd.DataFrame ({'close':aapl_close, 'returns':aapl_returns,})
If one really wanted to set the index, however, simply supply the timezone information to the date_range method. Like this:
pd.DataFrame ({
'close':aapl_close,
'returns':aapl_returns,},
index=pd.date_range(start=period_start, periods=6, tz='UTC'))
Note the added parameter of tz='UTC'. One detail, however: the index of that dataframe will now contain every calendar day in the range, not just trading days. So the result will look like this:
                            close    returns
2013-01-01 00:00:00+00:00     NaN        NaN
2013-01-02 00:00:00+00:00  68.732   0.028322
2013-01-03 00:00:00+00:00  68.032  -0.010184
2013-01-04 00:00:00+00:00  66.091  -0.028531
2013-01-05 00:00:00+00:00     NaN        NaN
2013-01-06 00:00:00+00:00     NaN        NaN
The reason is, as explained above, that pandas first creates the index (all calendar days by default) and then assigns the matching data to each day. The new dataframe includes holidays and weekends, so the values for those days are NaN (i.e. there is no pricing or returns data on non-trading days). We usually don't want all those NaNs, so as a rule, don't supply the index. Use the index inferred from the price and returns series instead. That will only include trading days.
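The difference between the two approaches can be sketched as follows. The prices and returns are the ones from the output above, but the way the two series are built here is an assumption standing in for the real Quantopian data.

```python
import pandas as pd

# Assumed stand-ins for the Quantopian series: three trading days, UTC-aware.
trading_days = pd.date_range("2013-01-02", periods=3, tz="UTC")
aapl_close = pd.Series([68.732, 68.032, 66.091], index=trading_days)
aapl_returns = pd.Series([0.028322, -0.010184, -0.028531], index=trading_days)

# Explicit calendar-day index: a holiday and a weekend show up as NaN rows.
calendar = pd.date_range(start="2013-01-01", periods=6, tz="UTC")
with_calendar = pd.DataFrame(
    {"close": aapl_close, "returns": aapl_returns}, index=calendar
)

# No index supplied: pandas infers it from the series, trading days only.
inferred = pd.DataFrame({"close": aapl_close, "returns": aapl_returns})

print(len(with_calendar), int(with_calendar["close"].isna().sum()))
print(len(inferred), int(inferred["close"].isna().sum()))
```

If the calendar-day index is really needed, calling dropna() on the result is another way to discard the empty rows afterwards.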
To get a specific date or time range in pandas, perhaps the easiest approach is simple bracket notation. For example, to get dates between 2013-01-04 and 2013-01-08 (inclusive), simply enter this:
df = pd.DataFrame ({'close':aapl_close, 'returns':aapl_returns,})
my_selected_dates = df['2013-01-04':'2013-01-08']
This bracket notation is really shorthand for using the loc indexer:
my_selected_dates = df.loc['2013-01-04':'2013-01-08']
Both work the same, but loc has a bit more flexibility. This notation also works with datetimes if desired.
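As a rough sketch of slicing with both strings and datetimes, the example below uses a made-up frame of business days (the index and the placeholder values are assumptions, not real prices). When slicing a timezone-aware index with Timestamp objects, it seems safest to make the timestamps timezone aware as well.

```python
import pandas as pd

# Made-up frame: seven business days starting 2013-01-02, UTC-aware index.
idx = pd.date_range("2013-01-02", periods=7, freq="B", tz="UTC")
df = pd.DataFrame({"close": list(range(7))}, index=idx)  # placeholder values

# Slice with date strings; both endpoints are included.
by_string = df.loc["2013-01-04":"2013-01-08"]

# Slice with timezone-aware Timestamp objects covering the same range.
start = pd.Timestamp("2013-01-04", tz="UTC")
end = pd.Timestamp("2013-01-08", tz="UTC")
by_timestamp = df.loc[start:end]

print(by_string)
print(by_timestamp)
```

Both slices should return the same rows: the Friday, then the following Monday and Tuesday.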
Hope that makes things a bit clearer. As a rule, always use timezone-aware dates on Quantopian. Your life will be easier.