Accessing data via fetcher (Fund cloning algorithm)

Hi everyone,

I've been banging my head against the wall because I can't figure out how to access data that is loaded via the fetcher.

A few words about what I'm trying to do:

  • the CSV contains (parsed) information from the SEC filings made by Berkshire Hathaway
  • each line in the CSV describes a new trade or a buy/sell in an existing position: the stock symbol, the date the order should be executed, an indicator of the action we are taking, and a change column, which I want to use to split my available budget
  • as you can probably imagine already, I'm trying to backtest those trades with Quantopian :)

If this were running outside of Quantopian, I would do the following:

  • load the file into pandas
  • check if there are entries where the current trading day and the date match
  • iterate over those entries and execute orders
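Those three steps can be sketched in plain pandas. The column names follow the CSV layout described above; the sample rows and dates are made up for illustration, and `print` stands in for the actual order call:

```python
import pandas as pd

# Hypothetical rows in the same shape as the CSV described above.
trades = pd.DataFrame({
    "symbol": ["FDC", "GPS"],
    "date":   ["08/15/2001", "11/14/2001"],
    "change": [15.07, 19.65],
    "action": ["new", "new"],
})
trades["date"] = pd.to_datetime(trades["date"], format="%m/%d/%Y")

# Step 2: keep the entries whose date matches the current trading day.
today = pd.Timestamp("2001-08-15")
todays_trades = trades[trades["date"] == today]

# Step 3: iterate over those entries and place the orders.
for row in todays_trades.itertuples():
    print(row.symbol, row.action, row.change)
```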

When you check the attached code, you will see that some data is logged, but the pattern doesn't make any sense to me. Perhaps one of you has an idea.

In general, is it possible to test something like this in Quantopian, or am I better off doing this locally with zipline?

Thanks,
Chris

14 responses

Sorry that you're having problems with this. You're showing us some places where we really need to make this method and its documentation better.

I think I mostly fixed it up for you. I put in a pre- and post-func() for logging purposes, and I changed the way you're indexing the data in handle_data, and now I think it's doing what you were looking for.

The problem that I think you're going to hit next is that the data is forward filled. I think the easiest way to fix that is to include a "reset" in your data after each datapoint so that it goes back down to zero.
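One way to read the "reset" suggestion, sketched below (the one-day offset, column names, and values are assumptions, not the thread's actual data): after each real datapoint, emit an extra row with change set back to zero, so the forward-fill carries a zero instead of the last trade.

```python
import pandas as pd

# Two real datapoints for one symbol (made-up values).
df = pd.DataFrame({
    "date":   pd.to_datetime(["2001-08-15", "2001-11-14"]),
    "symbol": ["FDC", "FDC"],
    "change": [15.07, -3.2],
})

# After each datapoint, add a "reset" row one day later with change=0,
# so fetcher's forward-fill carries 0 rather than the stale value.
resets = df.copy()
resets["date"] += pd.Timedelta(days=1)
resets["change"] = 0.0

out = pd.concat([df, resets]).sort_values("date").reset_index(drop=True)
print(out)
```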


Hi Dan,

thanks for your help with this. I've checked the log output and you are indeed correct: the problem is that the data is now present on every date (i.e. the values for 'action' and 'change').

Could you explain what you mean by the "reset"?

It would be really great to get this working :)

-Chris

Edit

I think I figured out what you meant and have changed the code that generates the CSV (pretty tricky, even with pandas). The CSV is now 30 MB with ~1M lines, though, since it contains an entry for every stock on every day to avoid the forward filling.
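For reference, that expansion can be sketched for a single symbol as follows (dates and values are made up; a real version would need this for every symbol and, ideally, only trading days):

```python
import pandas as pd

# Sparse trade history for one symbol.
trades = pd.DataFrame({
    "date":   pd.to_datetime(["2001-08-15", "2001-11-14"]),
    "change": [15.07, -3.2],
}).set_index("date")

# One row per calendar day, zero-filled between trades, so there is
# nothing stale for fetcher's forward-fill to carry.
daily = pd.date_range("2001-08-15", "2001-11-14", freq="D")
dense = trades["change"].reindex(daily, fill_value=0.0)

print(len(dense))  # 92 calendar days for this range
```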

I also cannot get it to load in Quantopian :(

Hello Christian,

I spent a few hours last week looking at your algo and data, and the 'fetcher' forward-fill defeated me. Recent discussions with fawce and Grant here have suggested a solution, since 'fetcher' does not forward-fill the timestamp. The attached is not complete, but it may get you started. You will have to create a mapping between the symbols in your CSV and the Quantopian SIDs - I have done a few of them.
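The timestamp idea can be sketched outside of Quantopian like this (a hypothetical helper, not fetcher's actual API): since a forward-filled row keeps its original timestamp, a datapoint is fresh only when that timestamp falls on the current trading day.

```python
import pandas as pd

def is_fresh(row_timestamp, today):
    """True only when the row's timestamp falls on the current trading day;
    forward-filled rows keep their original (older) timestamp."""
    return pd.Timestamp(row_timestamp).normalize() == pd.Timestamp(today).normalize()

print(is_fresh("2001-08-15 09:31", "2001-08-15"))  # fresh datapoint
print(is_fresh("2001-08-15", "2001-08-16"))        # stale, forward-filled
```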

P.

Hello Christian,

That one gave some odd result with 'LIZ' so maybe she wasn't trading at the time. This looks more plausible.

P.

Hello Dan,

Why are the benchmark figures different in my two backtests?

P.

Hello Christian,

My last go!

P.

Hello Christian,

It was interesting to look at but I admit defeat! After removing any examples of multiple trades on the same day for any symbol - see here - and sorting the CSV rows, I then tried:

def pivot_data(df):  
    # pre_func: print diagnostics, then return the frame unchanged  
    print df.head()  
    print df.tail()  
    pivot = df.pivot(columns="symbol", index="date", values="change")  
    print pivot  
    print pivot['AMP'][0:11]  
    return df  

on the CSV data. I can't really work out what is happening here:

1970-01-01 PRINT symbol date change action  
0 FDC 08/15/2001 15.069360 new  
1 GPS 08/15/2001 19.653780 new  
2 HON 08/15/2001 8.604316 new  
3 LIZ 08/15/2001 6.711568 new  
4 NKE 08/15/2001 22.686257 new

1970-01-01 PRINT symbol date change action  
478 MDLZ 08/15/2013 -91.804615 decrease  
479 NOV 08/15/2013 18.648371 increase  
480 SU 08/15/2013 95.747799 new  
481 USB 08/15/2013 27.366937 increase  
482 VRSN 08/15/2013 33.316451 increase

1970-01-01 PRINT <class 'pandas.core.frame.DataFrame'>  
Index: 49 entries, 02/15/2002 to 11/17/2008           <- Why not to 08/15/2013?  
Data columns (total 91 columns):  
ADM 2 non-null values  
ADP 6 non-null values  
AMP 9 non-null values  
BAC 4 non-null values  
BBY 2 non-null values  
BDX 5 non-null values  
BK 7 non-null values  
BNI 10 non-null values  
BUD 10 non-null values  
CBI 2 non-null values  
CDCO 13 non-null values  
CEG 3 non-null values  
CMCSK 7 non-null values  
COP 15 non-null values  
CVS 4 non-null values  
DE 1 non-null values  
DF 2 non-null values  
DG 5 non-null values  
DISH 1 non-null values  
DOWJB 3 non-null values  
DST 2 non-null values  
DTV 7 non-null values  
DUK 3 non-null values  
DVA 6 non-null values  
ETN 4 non-null values  
FDC 11 non-null values  
FE 2 non-null values  
FISV 3 non-null values  
GD 3 non-null values  
GE 3 non-null values  
GPS 7 non-null values  
GSK 2 non-null values  
HCA ...

1970-01-01 PRINT date  
02/15/2002 NaN  
02/15/2005 NaN  
02/15/2006 NaN  
02/15/2007 -57.050237  
02/15/2008 -54.320470  
02/15/2011 NaN  
02/15/2012 NaN  
02/15/2013 NaN  
02/16/2010 NaN          <- Why out of order?  
02/17/2004 NaN          <- Why out of order?  
02/17/2009 NaN          <- Why out of order?

P.

Hey Peter,

thanks for taking the time to look at this :) I haven't tried the latest code you posted, but I'll give it a go this week; it looks promising. Still, it requires defining beforehand all the stocks that should be traded, which takes a lot of manual work.

I have this data for basically every hedge fund (starting from 2001). It would've been really interesting to check a few of them in Quantopian to see if there is any value in there. Even if I get this working, though, the amount of manual effort required is a big turn-off :)

Regarding your last post: do you think this is a problem with how the data is structured, or with how Quantopian treats it?

-Chris

Edit

A last resort, apart from trying to backtest this offline with a custom script, could be to define the data as a dict directly in the script. Of course, this would involve lots of manual work as well.

Hello Christian,

I think a pivot is the way to go. I've worked out the problem with the sorting: it was caused by sorting US dates as strings, where '02/15/2013' comes before '02/16/2010'. I can get around that in pre_func, and the result of the pivot is then as expected, i.e.

1970-01-01 PRINT <class 'pandas.core.frame.DataFrame'>  
Index: 49 entries, 2001-08-15 to 2013-08-15  
Data columns (total 91 columns):  
ADM 2 non-null values  
ADP 6 non-null values  
AMP 9 non-null values  
.
.
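The string-sorting trap is easy to reproduce, and parsing the dates first (as a pre_func can do) fixes the order. A minimal sketch:

```python
import pandas as pd

dates = ["02/16/2010", "02/15/2013", "02/17/2004"]

# Sorting the raw strings compares character by character, so the
# month/day prefix dominates and the years come out scrambled.
print(sorted(dates))

# Parsing to real timestamps first sorts chronologically.
parsed = sorted(pd.to_datetime(dates, format="%m/%d/%Y"))
print([d.strftime("%m/%d/%Y") for d in parsed])
```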

I'm struggling with actually accessing the pivot data in handle_data but that's my lack of knowledge.

The general problem you have is converting symbols to SIDs. If Quantopian offered a look-up feature you might be able to progress.
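Without a look-up feature, the mapping has to be maintained by hand, something like the sketch below. The two SIDs shown come from the backtest log quoted later in this thread; the rest of the symbols would need to be looked up and added manually.

```python
# Hand-maintained map from CSV ticker symbols to Quantopian SIDs.
# NEE and USB appear in the thread's backtest log; all other symbols
# in the CSV would have to be added the same way.
SYMBOL_TO_SID = {
    "NEE": 2968,
    "USB": 25010,
}

def lookup_sid(sym):
    """Return the SID for a CSV symbol, or None if it isn't mapped yet."""
    return SYMBOL_TO_SID.get(sym)

print(lookup_sid("USB"))  # 25010
print(lookup_sid("LIZ"))  # None - not mapped (or not in the Quantopian data)
```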

If I make any progress with accessing the data I'll post a revised version. Meanwhile I have your data minus some daily 'duplicates' (they stop the pivot happening) here: https://raw.github.com/pcawthron/StockData/master/berkshire_new_indexed.csv

(Ignore the 'indexed' in the file name. That was a dead end I wandered up....)

P.

Peter,

could you explain in more detail what those duplicates are? I found some entries with an "inf" value in the "change" column. Did you mean those?

-Chris

Hello Christian,

'Potentially duplicate column headings' is maybe a better term. We want to pivot

MDLZ    15/02/2008  23.50064225  
SNY     15/02/2008  1.1108326  

to

                 MDLZ                   SNY  
15/02/2008       23.50064225            1.1108326  

but we can't do that with:

TT  15/02/2008  70.78325379  
TT  15/02/2008  -100  
USB 15/02/2008  3.193754143  
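One way to collapse such rows before pivoting is to aggregate them first; summing the same-day changes is one plausible choice (a sketch, not necessarily how the fund data should be merged):

```python
import pandas as pd

# The problem rows from above: two TT trades on the same day.
df = pd.DataFrame({
    "symbol": ["TT", "TT", "USB"],
    "date":   ["15/02/2008", "15/02/2008", "15/02/2008"],
    "change": [70.78325379, -100.0, 3.193754143],
})

# pivot() refuses duplicate (date, symbol) pairs, so collapse them first.
deduped = df.groupby(["date", "symbol"], as_index=False)["change"].sum()
pivot = deduped.pivot(columns="symbol", index="date", values="change")
print(pivot)
```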

P.

There's definitely something wrong :) I'll check the code that generates the CSV and see if it can be sorted out.

Having multiple actions for one stock on the same day doesn't make any sense, since a fund can only do one thing (buy into a new position, increase an existing position, decrease an existing position).

Hello Christian,

Thanks to help from Dan, Fawce and Jess we have some progress. Hopefully my questions to them will help improve the 'fetcher' documentation if nothing else.....

The good news: you can have the same symbol trading more than once a day if you want. Also there is no need to pivot the data into a DataFrame of multiple symbols per day.

More good news: the algo attached can tell you which symbols are not in the Quantopian data, for example 'LIZ' is not. I believe that may be 'LIZWI'.

Not so good news: the algo can tell you which symbols are in the Quantopian data but have not been enumerated by SID in the algo. As you know, there is currently no way to automatically import a symbol into the universe of SIDs used by an algo.

Slightly irritating news: because the backtest end date is not available to the algo you need to enter it manually.

The bad news: I've entered some symbols, but there are quite a few, and of course they will change for every data set you have. Please badger Dan for a solution. I think a 'fetcher' option that automatically adds the SIDs of all matched symbols to the algo's universe should be very easy to implement, since the matching is already being done.

This is the log output:

2002-02-15 handle_data: 35 INFO Buying 46.78793461 of Security(2968 [NEE])  
.
.
2013-08-15 handle_data: 35 INFO Buying 27.36693735 of Security(25010 [USB])

2013-10-23 PRINT Symbols not recognised: ['LIZ', 'CMCSK', 'DOWJB', 'NSRGY', 'STRZA']

2013-10-23 PRINT Valid symbols not in universe: ['HON', 'NKE', 'ODP', 'ADP', 'BBY', 'CDCO', 'PNC', 'DST',  
'DUK', 'HCA', 'PIR', 'BUD', 'DF', 'HD', 'LOW', 'LXK', 'TYC', 'AMP', 'COP', 'GE', 'JNJ', 'UPS', 'SNY', 'TGT', 
'BNI', 'NSC', 'IR', 'UNH', 'BAC', 'MDLZ', 'WLP', 'KMX', 'WHI', 'GSK', 'WBC', 'NRG', 'ETN', 'CEG', 'NLC', 'BDX',
 'XOM', 'RSG', 'TRV', 'FISV', 'BK', 'IBM', 'MA', 'DG', 'VRSK', 'CVS', 'DTV', 'GD', 'INTC', 'V', 'DVA', 'GM',  
'LEE', 'VIAB', 'NOV', 'PSX', 'DE', 'MEG', 'PCP', 'ADM', 'KRFT', 'VRSN', 'CBI', 'LMCA', 'DISH', 'SU']
End of logs.  

P.

Hello Christian,

I couldn't resist....

It seems there are 97 symbols in the CSV data, just under the limit of 100 SIDs that can be enumerated in an algo, although three are only traded before the start of the Quantopian data. And 'fetcher' will be changing - see here - which will solve your symbol-to-SID issue, subject to a limit of 150 concurrent SIDs.

Have a look at the Cash vs. Positions Value plot. This was bold:

2006-02-16 00:00:00     PG  BUY     15773   $60.79  $958,840.67  

P.