Randomly sampling get_fundamentals results and memory problem

Here is a simple algorithm that exposes the problem. It does nothing but call get_fundamentals, yet after a while it crashes with a MemoryError.
I believe this error was introduced recently.
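
The algorithm is roughly this shape (a trimmed-down sketch of the pattern, not the exact code; the market-cap query is just an arbitrary stand-in):

    import random

    def initialize(context):
        pass

    def before_trading_start(context, data):
        # Query fundamentals for every security; which column is queried
        # doesn't matter for reproducing the crash.
        fundamental_df = get_fundamentals(
            query(fundamentals.valuation.market_cap)
        )
        # Take a small random sample of the returned securities.
        symbols = random.sample(fundamental_df.columns.values, 10)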


Luca, we can reproduce the problem and we have an engineer looking at it. Sorry for the difficulty, and thanks for pointing it out.


I have run some more tests, and it seems the problem lies in the random.sample(...) function. The same test without it runs fine:

    #symbols = random.sample(fundamental_df.columns.values, 10)  # This raises MemoryError eventually
    symbols = fundamental_df.columns[0:10]   # This is fine

The random.sample implementation probably uses too much memory... but really? I can't safely sample from a few thousand securities without running into a MemoryError?

Some more information: I tried different sampling methods, and these are the results.

1 - These two both crash with MemoryError:

    symbols = random.sample(fundamental_df.columns, 10)
    symbols = fundamental_df.sample(n=10, axis=1).columns

2 - This one is more memory-efficient, but it only works with a small number of samples (10 is OK, but 100 and above fails):

    symbols = fundamental_df.iloc[:, random.sample(xrange(len(fundamental_df.columns)), 10)].columns
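
A possibly lighter variant I haven't tried (an untested sketch, not one of the tests above; it assumes the same fundamental_df and that numpy is available): sample integer positions with numpy and index into the existing column index.

    import numpy as np

    # Draw 10 distinct column positions, then index the existing column
    # Index in place instead of copying the columns into a new sequence.
    pick = np.random.choice(len(fundamental_df.columns), size=10, replace=False)
    symbols = fundamental_df.columns[pick]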

Thanks for the pointers, Luca. The data is helpful.

I will check back with the developers in the morning, but I think we have a handle on the cause. What I don't know yet is the fix.

Thanks, Dan. I'll probably be able to find a workaround using the new Pipeline API (great job) anyway.

I reported the problem because an algorithm that used to work no longer does; I just wanted to make you aware of it.

Thanks.

We shipped a fix for this. We also found something that was causing a performance problem for this example, and fixed that too. Your algo runs to completion, and it runs much faster now.

Glad you like Pipeline!

Thank you for the quick fix.

I ran many backtests again and didn't see the MemoryError anymore. Fixed! ;)

I'm not sure about the performance improvement, though. Here is a backtest that simply selects 500 random stocks every day and does nothing with them. Despite doing nothing, it is very slow and cannot successfully run over the whole backtest time frame. I guess a timeout stops it around 2007.

Anyway, I just wanted to make you aware of this performance issue. On my end, I can simply run a shorter backtest.

I have the same problem.
My algorithms with "large" universes (nearly 500 securities) could not enter the contest because the two-year backtest did not manage to finish before the timeout.

Two thoughts on the performance.

  1. Try implementing it using Pipeline. Pipeline does a lot of pre-crunching of data before trading starts, so you might see some performance gains. I'm not quite knowledgeable enough to be sure you'll see the benefit, but a minimal skeleton is sketched after this list.
  2. We're working on some significant performance improvements. It's the next big project queued up after Pipeline. It's too early to give timelines, but there are improvements on the horizon.
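
For item 1, a minimal Pipeline skeleton looks something like this (a sketch only; the 'fundamentals' pipeline name and the market-cap column are placeholders, not anything specific to your algo):

    from quantopian.algorithm import attach_pipeline, pipeline_output
    from quantopian.pipeline import Pipeline
    from quantopian.pipeline.data import morningstar

    def initialize(context):
        # The pipeline's columns are pre-computed in bulk rather than
        # queried day by day inside the simulation.
        pipe = Pipeline(
            columns={'market_cap': morningstar.valuation.market_cap.latest}
        )
        attach_pipeline(pipe, 'fundamentals')

    def before_trading_start(context, data):
        # Hands back a precomputed DataFrame indexed by security each day.
        context.output = pipeline_output('fundamentals')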