Working with History DataFrames

With the recent improvements to history(), I've noticed a fair amount of confusion in the forums surrounding the use of Pandas DataFrame objects. While DataFrames provide an incredibly powerful interface for manipulating tabular data, there's a fairly steep learning curve involved in using them to their fullest potential.

Rather than answer a whole bunch of individual questions, I threw together a quick IPython Notebook with examples of how to work with DataFrames that have the same structure as the frames returned by history(). Hopefully it covers some common use cases.

http://nbviewer.ipython.org/gist/ssanderson/8df2e0eecf7e323de447

Note also that a Pandas DataFrame can be converted to its NumPy array (matrix) representation with as_matrix:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.as_matrix.html

If your sids are in a list (e.g. context.stocks), then the matrix columns can be ordered as they are in the list:

myDataFrame.as_matrix(context.stocks)
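For example, a rough sketch with a made-up frame and placeholder sid names (on newer pandas, as_matrix has been removed, so you'd select the columns and take .values instead):

import numpy as np
import pandas as pd

# A toy frame shaped like a history() result: a DatetimeIndex of bars
# and one column per sid (these sid names are placeholders).
dates = pd.date_range('2014-01-02', periods=5, freq='D')
prices = pd.DataFrame(np.random.rand(5, 3) * 100,
                      index=dates,
                      columns=['AAA', 'BBB', 'CCC'])

stocks = ['CCC', 'AAA', 'BBB']     # stands in for context.stocks
mat = prices[stocks].values        # ndarray with columns in list order
# On the older pandas used here, the equivalent call is:
# mat = prices.as_matrix(stocks)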

Grant, thanks for chiming in. In my personal experience, I would discourage that conversion. DataFrames are almost always more flexible and easier to work with, and I don't think there's anything you can do in NumPy that you can't do in Pandas (the reverse is not true). You can always access the underlying numpy array with .values if you really need it, but that's rarely the case.

Hi Thomas,

I agree that some functionality is lost by converting to the NumPy array representation (e.g. column and row labels). I find working with NumPy closer to MATLAB (http://wiki.scipy.org/NumPy_for_Matlab_Users), so if there's no reason to keep the data in a Pandas DataFrame, I convert it. Also, maybe I'm not up the Pandas learning curve yet, but some (most?) common numerical libraries need a NumPy ndarray as input; a DataFrame won't work.

Grant

Hi Grant,

While it's true that numpy is closer to matlab, I wouldn't say that's a pro ;). Most of the time, the libraries that take numpy arrays also accept dataframes (sklearn is an exception). If they don't, I convert with .values as a final step when passing the data in (which also doesn't copy memory). Learning pandas made me at least 10x more productive in Python, so I can only encourage everyone who works with tabular data to invest the time to learn it. Of course you can code your algorithms any way you like, but I guess I'm responding because I'd hate for suboptimal practices to become ingrained on Quantopian.

Thomas

Scott, thank you for sharing. I'm picking up more and more how powerful pandas can be. For example, sometimes I want to convert daily historical data to weekly, etc. I used to do it very inefficiently. I think the pandas way is as follows, and I'm sharing in part to double-check that I'm doing it correctly:

d_o = history(bar_count=252, frequency='1d', field='open_price')  
d_h = history(bar_count=252, frequency='1d', field='high')  
d_l = history(bar_count=252, frequency='1d', field='low')  
d_c = history(bar_count=252, frequency='1d', field='price')

w_o = d_o.resample('W-Fri', how='first')  
w_h = d_h.resample('W-Fri', how='max')  
w_l = d_l.resample('W-Fri', how='min')  
w_c = d_c.resample('W-Fri', how='last')  
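On newer pandas the how= argument to resample is deprecated in favor of calling an aggregation method on the resampler; reusing the d_o/d_h/d_l/d_c frames above, the same weekly bars would look roughly like this:

# Newer-pandas spelling of the same weekly OHLC resample;
# 'W-FRI' anchors each weekly bar on Friday.
w_o = d_o.resample('W-FRI').first()
w_h = d_h.resample('W-FRI').max()
w_l = d_l.resample('W-FRI').min()
w_c = d_c.resample('W-FRI').last()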

I made this snippet from an Eddie Hebert post somewhere. It's probably old tribal knowledge around here. But it was a revelation for me. It shows some of the speed benefits and tradeoffs between pandas and numpy. I think pandas is great and it has lured me back to doing mostly python from R. But for those doing serious number crunching:

http://www.mediafire.com/download/bj5yb8dfhndt8y8/go-faster-with-numpy.ipynb

Hey Thomas,

Well, your testimonial has convinced me to attempt using Pandas extensively in my next algo effort.

As a specific example, say I want to use http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.zscore.html on a dataframe from history. If I'm not mistaken, I'd need to convert the input to an ndarray, and then convert the output back to a dataframe, correct? Or is there a way to apply a function like zscore to the dataframe directly (without writing my own wrapper function)?

Grant

DataFrames have a few methods for applying numpy/scipy functions to their entries (they work for regular python functions as well, but they're optimized for numpy/scipy).

For example, DataFrame.apply(function) will call your function on each column (or row) of your frame, with columns being the default, which means that

df.apply(zscore)  

will (I think) do what you want.

Since you're using apply with a numpy/scipy function, you can also do df.apply(zscore, raw=True), which will pass the underlying raw numpy array to your function, achieving much better performance.

As I mentioned above, you can also use apply with regular python functions, e.g.
df_with_values_squared = df.apply(lambda x: x ** 2)
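
Here's a small self-contained sketch of the above, using a made-up frame and scipy.stats.zscore (the mstats version linked earlier should behave the same way on frames without missing data):

import numpy as np
import pandas as pd
from scipy.stats import zscore

# A toy frame shaped like a history() result: dates x sids (made-up values).
dates = pd.date_range('2014-01-02', periods=5, freq='D')
df = pd.DataFrame(np.random.rand(5, 3) * 100,
                  index=dates, columns=['A', 'B', 'C'])

scored = df.apply(zscore)                # z-score each column (column-wise is the default)
scored_raw = df.apply(zscore, raw=True)  # pass raw ndarrays instead of Series; faster for numpy/scipy functions
squared = df.apply(lambda x: x ** 2)     # apply also works with plain python functions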

Thanks... yes, I started to think along these lines. There is also a rolling_apply (http://pandas.pydata.org/pandas-docs/stable/computation.html), which should be handy too; a rough sketch is below. --Grant
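
The sketch uses a made-up frame; the 20-bar window and the z-score function are arbitrary, and newer pandas spells rolling_apply as the .rolling(...).apply(...) method:

import numpy as np
import pandas as pd

# Toy price frame: dates x sids (made-up values).
dates = pd.date_range('2014-01-02', periods=60, freq='D')
prices = pd.DataFrame(np.random.rand(60, 2) * 100,
                      index=dates, columns=['A', 'B'])

# z-score of the latest bar relative to a trailing 20-bar window,
# computed per column. Older pandas: pd.rolling_apply(prices, 20, func).
rolling_z = prices.rolling(20).apply(
    lambda x: (x[-1] - x.mean()) / x.std(), raw=True)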

As promised, here's my attempt to use Pandas more extensively. Worked just fine. --Grant

Thanks Grant, code looks great. One minor comment:

dv = pd.DataFrame(c.values*v.values, columns=c.columns, index=c.index)  

could just be:

dv = c * v  
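
Arithmetic between two DataFrames aligns on both the index and the columns, so the labels carry through automatically. A tiny illustration with made-up close and volume frames:

import pandas as pd

c = pd.DataFrame({'A': [10.0, 11.0], 'B': [20.0, 21.0]})   # made-up closes
v = pd.DataFrame({'A': [100, 200],   'B': [300, 400]})     # made-up volumes

dv = c * v   # elementwise product; rows and columns are matched by label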

Hey @Grant,

I just cloned the above algo, but it's not trading. It looks like there may be a problem with the z-score value. Did something change? I'm trying to import the stats library so I don't need to calculate a z-score manually, like an amateur.

Sorry Jamie,

I see the same thing. No time now to dig into it, but it appears that something has changed, either with the Quantopian code or Python/Pandas/etc.

I hope to have a look in the next day or two. In the meantime, maybe the Quantopian folks know what might be going on?

Grant

Jamie,

This one works, after adding .fillna(0) to the volumes:

v = history(2340, '1m', 'volume').fillna(0)  # volumes
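
The NaNs presumably come from bars where a security has no volume data, and they propagate through any arithmetic done afterwards, so filling them with zero up front keeps things finite. A toy illustration of the same idea:

import numpy as np
import pandas as pd

# Toy volume frame with gaps, standing in for the history() result above.
v = pd.DataFrame({'A': [100.0, np.nan, 300.0],
                  'B': [np.nan, 50.0, 60.0]})
v = v.fillna(0)   # missing volumes become 0 instead of poisoning later math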

I made a few other tweaks, as well.

Grant