batch transform decorator with clean_nans=False?

Has anyone tried the batch transform decorator with clean_nans=False?

@batch_transform(refresh_period=R_P, window_length=W_L, clean_nans=False)  

Perhaps someone can provide an example of how to handle missing data when calling the batch transform with multiple sids. As I understand it, this should be straightforward with pandas, but I'm not up the learning curve yet.

One curiosity of mine is to sort out the mechanics of analyzing sets of securities that typically do not trade every minute of the market trading day. For small-time individual traders, it seems that there might be an advantage in looking at thinly traded securities. Any insights or information along these lines would be appreciated.


Grant,

The data panel passed into your batch_transform will have NaNs for the bars that do not have any trades. The pandas docs have a section dedicated to dealing with missing data, and it is a pretty quick read: http://pandas.pydata.org/pandas-docs/version/0.9.1/missing_data.html

Some of the highlights: you can do calculations with DataFrames that include NAs, and pandas has very sensible default behaviors; you can fill NAs by filling forward or filling backward; you can drop rows that have any NA values; you can drop rows that are entirely NA values; you can interpolate to fill NAs.
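A minimal pandas sketch of these operations, using a toy two-sid DataFrame where NaNs stand in for bars with no trades (the sid names and prices are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a batch_transform price frame: rows are bars,
# columns are sids; NaNs mark bars where a sid did not trade.
prices = pd.DataFrame(
    {"sid_a": [10.0, np.nan, 10.2, np.nan],
     "sid_b": [20.0, 20.1, np.nan, 20.3]}
)

# Calculations skip NAs by default (pandas' sensible default behavior).
means = prices.mean()                   # per-sid mean over non-NA bars

ffilled = prices.ffill()                # fill forward from the last trade
bfilled = prices.bfill()                # fill backward
dropped = prices.dropna()               # drop rows with any NA
dropped_all = prices.dropna(how="all")  # drop rows that are entirely NA
interp = prices.interpolate()           # linearly interpolate across gaps
```

For thinly traded securities, forward filling is usually the closest match to "last known price", while dropping rows with any NA can discard most of the window when sids rarely trade in the same bar.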

thanks,
fawce

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Thanks...Grant

By the way, you need to update your documentation on the help page...it still states "In the current implementation, the data is filled forward from the last known value where possible and the data is dropped where it is not. In the future, we plan to make this behavior configurable within your algorithm." There is no mention of the clean_nans flag.

Are there other undocumented batch transform features?

@Grant: Yes: https://github.com/quantopian/zipline/blob/master/zipline/transforms/utils.py#L324 shows some more options.

I also agree that we should add them to the help docs.


Thanks Thomas,

In addition to the help docs, examples of all of the use cases would be helpful, since there are lots of "bells and whistles" here:

def __init__(self,
             func=None,
             refresh_period=None,
             window_length=None,
             clean_nans=True,
             sids=None,
             fields=None,
             create_panel=True,
             compute_only_full=True):
    """Instantiate new batch_transform object.

    :Arguments:
        func : python function <optional>
            If supplied will be called after each refresh_period
            with the data panel and all args and kwargs supplied
            to the handle_data() call.
        refresh_period : int
            Interval to call batch_transform function.
        window_length : int
            How many days the trailing window should have.
        clean_nans : bool <default=True>
            Whether to (forward) fill in nans.
        sids : list <optional>
            Which sids to include in the moving window. If not
            supplied sids will be extracted from incoming
            events.
        fields : list <optional>
            Which fields to include in the moving window
            (e.g. 'price'). If not supplied, fields will be
            extracted from incoming events.
        create_panel : bool <default=True>
            If True, will create a pandas panel every refresh
            period and pass it to the user-defined function.
            If False, will pass the underlying deque reference
            directly to the function which will be significantly
            faster.
        compute_only_full : bool <default=True>
            Only call the user-defined function once the window is
            full. Returns None if window is not full yet.
    """

Specifically, I'm interested in passing the deque reference directly to vectorized numpy functions. Presumably, the clean_nans option works regardless of the other switch settings, correct?

Just curious..."under the hood," does the backtester load all of the data required to run the backtest into memory prior to executing (e.g. into a numpy array or similar)? If so, then it should be very efficient to index over the array and pass by reference, right? Or does the backtester need to stream data from a hard drive?

Grant

Hi Grant,

Passing the deque reference to a vectorized numpy function won't work unless you convert it to a numpy array first, in which case you can just use the pandas DataFrame we already create for you.

The batch_transform only aggregates whatever comes in; it doesn't do any caching. The data gets streamed from a database.

Thomas

Thanks Thomas,

Hmm...I thought that the advantage of create_panel=False would be to speed things up with pass-by-reference, but apparently I'm confused ("If False, will pass the underlying deque reference directly to the function which will be significantly faster."). It sounds like if I want to use numpy in my function, I'll lose the speed advantage. So, how do I gain efficiency with the batch transform?

Grant

I see, maybe that help text can be improved.

The flag means that it will give you a reference to a linked list in memory (i.e. non-contiguous) rather than create an array in (contiguous) memory for you. If you just need to loop over the list you might be able to speed up your computation with this. If you need numpy or any other function that requires contiguous memory this will not help.
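A small sketch of this distinction, using a plain collections.deque as a stand-in for the underlying window (the numbers are made up): looping reads elements where they sit, while handing the data to NumPy forces a copy into a new contiguous array.

```python
from collections import deque
import numpy as np

window = deque([10.0, 10.2, 10.1, 10.4], maxlen=4)

# Looping needs no contiguous copy: fine for simple accumulations.
total = 0.0
for price in window:
    total += price
mean_loop = total / len(window)

# NumPy requires contiguous memory, so converting the deque copies it
# into a freshly allocated array first.
arr = np.asarray(window)   # new contiguous array; data is copied
mean_np = arr.mean()
assert arr.base is None    # arr owns its own (copied) buffer
```

So create_panel=False only pays off when the per-event work is cheap enough to do in the loop itself; the moment you call into NumPy you pay for the copy anyway.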

The batch transform is as efficient as it gets for what you want to do.

Thanks Thomas,

When you get the chance, I'd appreciate a simple example of how to loop over the linked list.

As a curiosity, can one create an array of references in contiguous memory, with the references pointing to values stored in non-contiguous memory? If so, then perhaps one could create a data structure that numpy could crunch on, without having to copy data from the underlying deque into a numpy array. Just a thought...

Grant

It's just an iterator, so you can do for event in data: and each event will be a dict-like object.
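A toy illustration of that loop, with made-up field names standing in for whatever the dict-like events actually carry:

```python
from collections import deque

# Hypothetical stand-in for the window's event stream: each event is a
# dict-like object; the "sid" and "price" fields here are illustrative.
data = deque([
    {"sid": 1, "price": 10.0},
    {"sid": 1, "price": 10.2},
    {"sid": 1, "price": 10.4},
])

# Loop over the iterator of events and read fields directly.
prices = [event["price"] for event in data]
mean_price = sum(prices) / len(prices)
```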

NumPy wouldn't like that very much :). Also, copying pointers or actual data doesn't make a difference.

Do you run into actual problems with the batch transform?

Thomas,

I'll have to keep my eye out for an example on the forum. I am not having any problems with the batch transform...just curious.

Grant

@Grant: OK, see for example here https://www.quantopian.com/posts/inferring-latent-states-using-a-gaussian-hidden-markov-model for an ever-growing window.

Also I think you got the wrong impression of the batch transform. It's not slow at what it does, it's just that what it does is slower than an iterative version. But if you don't have an iterative implementation batch transform is the best you can do.

Thanks Thomas,

I appreciate your guidance in learning how to write algorithms iteratively, if possible. Typically in MATLAB, I would load an entire data set into memory (e.g. as a matrix) and then analyze it. In the context of the Quantopian backtester, as you have pointed out, the data streams from disk. Also, as you have pointed out, if configured to use the pandas data panel, a batch transform will need to re-create an object upon every call (presumably also re-allocating memory).

I haven't tried it since your recent upgrades, but my experience has been that the batch transform works under daily backtesting, but would appear to bog down under minutely backtesting. Now that I understand what's going on, it's no problem, but you might want to provide guidance in the help docs to head off confusion.

Grant

Hello,

Did the batch transform get modified, removing this option:

create_panel : bool
    If True, will create a pandas panel every refresh
    period and pass it to the user-defined function.
    If False, will pass the underlying deque reference
    directly to the function which will be significantly
    faster.

I'm trying to sort out how to speed up the batch transform for minutely data, and thought I'd look into this option.

Grant

Hi Grant,

I'll have to look into what happened to that option; maybe Eddie knows off the top of his head.

In any case, whether there is a potential speed-up from using the deque depends on your use case. If you only loop through the data, you can get speed-ups; if you convert it to an array, the deque should be much slower.

Thanks Thomas,

No rush, but I thought I'd revisit the option, since the history API does not yet support minute-level data.

Grant

Thomas,

Is the deque thingy accessible? Does the backtester automatically maintain a history of the data, as the backtest runs?

Grant

Sorry, I just realized: there is no more deque underneath, so it's not surprising that option is gone.

Thanks...is there anything in place of the deque that would be user-accessible? --Grant