Getting Started Tips

Hello,

I am absolutely new to Quantopian. I started by reading the documentation. My understanding so far is that Quantopian has been evolving constantly: some practices are becoming more fashionable while others are being phased out. So as not to waste time, I would like to ask a few questions to clarify what is current and what is not. I would also like to understand the main jargon.

1) The word "research" here refers to the Jupyter notebooks section under Research. Yes/No?
2) The word IDE here refers to the Algorithms section (which is unfortunately under research). Yes/No?
3) What is the difference between datasets under research (from quantopian.research import prices, symbols) and under pipeline?
4) Does "from quantopian.research import prices, symbols" provide different data than get_pricing?
5) All new examples are using pipeline. Are you phasing out research?
6) I see calls to get_pricing, get_fundamentals in community posts without any module import. How is this possible?
7) Which package contains the necessary functions to access historical fundamental data? First of all, I want to have a long history of competition-eligible stocks to do analysis (with all adjustments done up until today), not backtest. It seems pipeline delivers data in a walk-forward manner to do backtesting.
8) Where can I find a detailed description of the fundamental data provided in Quantopian, for example whether the P/E ratio refers to future estimated earnings or trailing 12 months earnings, and when components like Earnings are updated (it is clear that the Price component is updated daily)?

Thanks


Welcome to Quantopian!

These are good questions - you're right to try to get the main jargon right at the beginning. Let me start by saying that we recently launched a new set of documentation for Quantopian which can be found here. The new docs are currently released with an "alpha" label. Soon, we will be promoting it to the official documentation and it will be the only version you can find in the top nav bar.

Now, for your specific questions:

1) Yes, most of the time, when we say "research", we mean the Jupyter notebook environment that we call Research. Sometimes, we will use the term "research" to refer to the process of actually researching an idea. Usually, if we mean the research environment, we'll say something like "do this in research" or "... research environment".
2) Yes, IDE is what we use to refer to the development environment where you can write an Algorithm and backtest it. We also consider the page where you can see the list of all your algorithms part of the IDE. Basically, anything behind the /algorithms URL.
3) Everything in the quantopian.research module is only usable in the research environment (Jupyter notebook environment on Q). There are a few methods within the quantopian.research module that allow you to query data (like get_pricing), but these are mostly useful for pulling in data for the sake of creating a plot or doing some form of quick analysis. Pipeline is a newer API (~4 years old now, but still newer!) and is now what we consider to be the core API of Quantopian. At a high level, Pipeline is an API that is designed to help you express cross-sectional computations (across assets) over large amounts of data. All data that we build into the platform is made available through Pipeline, so my recommendation would be to focus on learning the Pipeline API ahead of the quantopian.research API. The best resources to do that are the new documentation (Pipeline User Guide & Pipeline API Reference) and the Pipeline Tutorial. Pipeline is available in Research and the IDE, meaning you can iterate on ideas with a Pipeline in a Jupyter notebook, and then load it into an Algorithm later. You can see all the data available in Pipeline in the Data Reference.
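To make that a bit more concrete, here's a minimal sketch of what a pipeline looks like (the column and screen choices are just illustrative, not a recommendation):

    from quantopian.pipeline import Pipeline
    from quantopian.pipeline.data import USEquityPricing, Fundamentals
    from quantopian.pipeline.filters import QTradableStocksUS

    def make_pipeline():
        # One pricing column and one fundamental column, screened down
        # to the tradable universe.
        return Pipeline(
            columns={
                'close': USEquityPricing.close.latest,
                'pe_ratio': Fundamentals.pe_ratio.latest,
            },
            screen=QTradableStocksUS(),
        )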
4) from quantopian.research import prices is just a different interface to the same data as get_pricing.
5) Research is actually the best place to iterate on pipelines, so it's here to stay! We actually suggest that people spend most of their time in research and only move to the IDE once they have an idea they'd like to run through a full backtest. I'd highly recommend checking out the Getting Started Tutorial to get a sense for the typical workflow (which begins in Research and later moves to the IDE).
6) Making functions "magically" available is something we used to do to help new folks get started with less code. However, over the years, we have changed our opinion and come to the realization that magically available functions are probably more confusing than they are helpful. Technically, these functions are being imported for you behind the scenes, but we've been trying to move toward examples that explicitly import everything that's used (even if it's already imported for you), to make it more obvious where everything comes from.
7) Pipeline is the only way to access fundamental data. You're correct that Pipeline delivers data in a walk-forward manner. The idea is that whenever you run a pipeline, it delivers data with the knowledge it had of the world on each simulation date. We believe that delivering data in a "point-in-time" fashion like this is crucial to avoiding lookahead bias when researching ideas. That said, based on your earlier questions, it sounds like you may not have used Pipeline in research yet. When you run a pipeline in research, you can choose to run it for a date range and work with the entire output at once, which may be what you're looking for. Check out the resources I linked in earlier answers and let me know if working with Pipeline in Research gets you what you're looking for.
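For example, a rough sketch of running a pipeline over a date range in a research notebook (the dates are illustrative, and make_pipeline is assumed to be something like the sketch above):

    from quantopian.research import run_pipeline

    # Each date in the result reflects what the platform knew on that
    # simulation date (point-in-time / walk-forward).
    result = run_pipeline(make_pipeline(), start_date='2016-01-04', end_date='2016-12-30')
    result.head()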
8) The Data Reference is your best bet for finding out these sorts of details on any data available on Q. Note that we update data as soon as we get it from our vendors. For many data fields, there's a pretty strict update schedule (e.g. daily pricing), but for others like Earnings, it can depend on when the company reports their earnings as well as when our vendor provides us the data.

Let me know if you find these answers helpful. These are great questions, and they highlight the fact that we need to better organize our learning resources to make some of these answers easier to find. We're hoping that the new documentation helps a lot with the learning curve, so if you have any feedback on it (good or bad), we'd love to hear it.

Thanks!


Thank you. Your responses are spot on. Please clarify one thing for me. When I get data from a pipeline in "research" mode, I understand that I can specify a date range, start and end. But shouldn't there be one more date to specify, the "as of which date"? Basically, when you use it in "algo" mode, the system calculates everything as of the simulation day for each callback event. When I run it in "research" mode, how do I specify the "as of" date in addition to the start and end dates? Most of the time my "as of" date will logically be equal to my end date, but this doesn't have to be the case, depending on my analysis.

In the IDE (i.e. an algo), a pipeline returns a 'flat' dataframe indexed only by securities. The data is 'as of' the simulation day (as @Fly Juice observed).

In the research environment (i.e. a notebook), a pipeline returns a multi-indexed dataframe. It has an added date index. It returns the exact same data returned in the IDE, but now there is one 'slice' for each requested day. It's as if all the outputs returned in the IDE were 'stacked' into a single dataframe. One can retrieve the data that an algo would have seen on a particular date by indexing the dataframe on that date.

So, the answer to the question "When I run in 'research' mode how do I specify the 'as of date'?" is... you don't. The 'as of' date is simply each day in the pipeline dataframe.
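For example, to recover the single-day view an algorithm would have seen, index the research output on that date (a sketch, assuming result is the dataframe returned by run_pipeline above; the date is illustrative):

    import pandas as pd

    # The research output is indexed by (date, asset); selecting one date
    # gives back the flat, asset-indexed frame an algo would have seen.
    one_day = result.xs(pd.Timestamp('2016-06-01', tz='UTC'), level=0)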

The obvious use case for this approach is to analyze how an algo will perform by analyzing the data that would have been available to the algo. Moving ideas into a working algo is primarily what the research environment is tailored towards. However, there are other use cases to which this approach may not be entirely applicable. Pipeline data isn't adjusted as of a single point in time and therefore isn't really suitable for plotting price and volume data or for calculating returns. For these use cases it's best to use one of the variations of the get_pricing method. These methods do adjust price and volume data 'as of' the method's end date. The end date is the 'as of' date.
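As a rough sketch (symbols and dates are illustrative; get_pricing is one of the functions pre-imported in research notebooks, as noted earlier in this thread):

    # Prices come back adjusted as of the requested end_date, which is what
    # you want for plotting or for computing returns.
    px = get_pricing(['AAPL', 'MSFT'],
                     start_date='2014-01-02',
                     end_date='2015-12-31',
                     fields='close_price')
    daily_returns = px.pct_change()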

All the above pertains to daily OHLCV data, where the 'as of' date is implied simply by the data date (i.e. the close for today is 'as of' today). However, there are other types of data where the 'as of' date may be some time earlier than today. This is true for data such as fundamentals, estimates, etc. Often these datasets include a separate asof_date field. While the pipeline only exposes data which would have been available as of the pipeline date, that doesn't mean it wasn't available before that. Typically, look at the associated asof_date to know the earliest date on which the data would have been available to impact trading.

There are several posts on this topic for a deeper dive. Maybe start here https://www.quantopian.com/posts/usequitypricing-and-get-pricing-data-difference or here https://www.quantopian.com/posts/research-updates-get-pricing-and-jupyter-notebook-upgrade.

Hope that helps.


Thank you @Dan Whitnable. Your response was very informative. However, based on your description, I see a big problem here.

I would like to run a regression analysis between fundamental factors and returns. Since the only way to access the fundamentals is the Pipeline API, and assuming that I have a 250-day analysis window, my fundamental information will have different adjustments on different days for the same equity. For the analysis to be done properly, all information should be adjusted as of the end date; otherwise, cross-sectional factor analysis is impossible.

Accounting for adjustments is not a problem. There are different tools and different approaches depending upon what one wishes to do.

I'm going to jump immediately to the encouraged, and elegant, solution to factor analysis. There is a module called Alphalens (see https://www.quantopian.com/tutorials/alphalens). This does a lot of the heavy lifting of factor analysis for you. It uses the output of a pipeline to get one's factors or signals, which is important to ensure all the factor data is what would actually have been available each day. It uses get_pricing to fetch adjusted prices, which is important to account for any future splits and/or dividends (which pipeline doesn't have visibility into). The best of both worlds.

Definitely look at the docs in the link above to understand and use Alphalens. However, I've attached a simple example to stoke some curiosity. There's more the Alphalens module can do, but this notebook shows how a few lines of code can produce powerful insights.

The notebook above was looking at the stocktwits.bull_minus_bear factor. Let's see if that single factor has any predictive relationship to future returns. In other words, does the stocktwits.bull_minus_bear signal provide any meaningful trading information. There are really just 4 simple steps to using Alphalens.

Step 1. Get our factor data from a pipeline. In this case the latest sentiment (sentiment = stocktwits.bull_minus_bear.latest). We have a sentiment value for each security and each day in our pipeline. We'll use that as the factor we want to analyze.

Step 2. Get some actual pricing data. Use get_pricing to fetch split-adjusted data from which to calculate returns. Make sure to fetch pricing data for all assets and for at least as long as the pipeline output.

Step 3. Calculate the returns from this pricing data and merge it with the pipeline output data. There is a convenience method one can use in Alphalens called get_clean_factor_and_forward_returns.

Step 4. Run any of the many Alphalens analysis tools, using the output of get_clean_factor_and_forward_returns as the input.

That's it!

One can do this manually, but the process is the same. Get the pipeline factor output. Get pricing using get_pricing, calculate returns, and align them with the pipeline data. Use the merged data for any analysis.
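A minimal sketch of those four steps might look like the following (the attached notebook isn't reproduced here; the factor, dates, quantile count, and horizons are just illustrative):

    import alphalens as al
    from quantopian.pipeline import Pipeline
    from quantopian.pipeline.data.psychsignal import stocktwits
    from quantopian.pipeline.filters import QTradableStocksUS
    from quantopian.research import run_pipeline

    # Step 1: factor data from a pipeline (point-in-time).
    pipe = Pipeline(
        columns={'sentiment': stocktwits.bull_minus_bear.latest},
        screen=QTradableStocksUS(),
    )
    factor = run_pipeline(pipe, '2016-01-04', '2016-12-30')['sentiment']

    # Step 2: adjusted pricing for the same assets, extended a bit past the
    # pipeline end date so forward returns can be computed.
    # (get_pricing is pre-imported in research notebooks.)
    assets = factor.index.levels[1].unique()
    prices = get_pricing(assets, '2016-01-04', '2017-02-28', fields='close_price')

    # Step 3: merge the factor with forward returns.
    factor_data = al.utils.get_clean_factor_and_forward_returns(
        factor, prices, quantiles=5, periods=(1, 5, 10))

    # Step 4: run an Alphalens analysis, e.g. the full tear sheet.
    al.tears.create_full_tear_sheet(factor_data)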

Hope that all makes sense. Take a look at the attached notebook.

@Dan Whitnable

stocktwits.bull_minus_bear is never adjusted as it doesn't depend on price or share count, so this is not an answer to my question. Try sales_per_share, for example, and take Apple in the 2014 period. Alphalens cannot fix that spike in sales_per_share. Even if you get the price with get_pricing (which will probably not be available in "algo" mode once you switch there), you don't have any means to properly adjust sales_per_share or any fundamental data that requires adjustment after a split, dividend, or similar event. Basically, Apple will show up in a different factor quantile right before and after the adjustment, where in reality nothing fundamental happened to the Apple company. Alphalens cannot fix what it doesn't know.

@Fly Juice Perhaps give a specific example of what you are trying to accomplish. There are a number of approaches to normalizing data.

Different factors can pose different challenges. Consider sales_per_share. There are several challenges here. One could have two companies with the same sales, same profit, etc. The only difference is that company A has 1M shares selling at $2/share while company B has 2M shares selling at $1. Company A's sales_per_share will be twice that of company B's. A similar issue arises if there is a stock split. The issue is not with how this value is or isn't adjusted but with the choice of factor. Perhaps a better choice would be Fundamentals.sales_yield, which divides the sales_per_share by the share price. With this choice, company A and company B will have the same value for Fundamentals.sales_yield, as expected. The value will also remain unchanged across a stock split.

One can also accomplish the same thing manually by creating a new factor:

    from quantopian.pipeline.data import Fundamentals, USEquityPricing

    # Dividing sales per share by price makes the factor invariant across splits.
    sales_per_share = Fundamentals.sales_per_share.latest
    close_price = USEquityPricing.close.latest
    my_sales_yield = sales_per_share / close_price

Take a look at the attached notebook and the final plot. The value of my_sales_yield doesn't exhibit any spike around the AAPL stock split.

This is just one example of how to normalize factors so as to present meaningful results. Alphalens, or other analysis, can then work with these normalized values.

Hope that helps.

@Dan Whitnable You have a good point regarding factor selection. So the spike in the Price to Book ratio is just a data quality issue then? If so, how many such issues should one expect to encounter in the point-in-time data available since 2014?

Thanks

@Fly Juice Yes, the Price to Book ratio seems to have a bad datapoint. One will find data issues in all financial data. Quantopian data is no more or less error prone than typical financial data. Errors aren't frequent, but they do occur. I can't give an estimate of how often.

This occurs not only in historical data but also in real live trading data. The same anomalies and data quality issues will show up with about the same frequency in real trade data. So, it's sort of a philosophical question: should one train and analyze on scrubbed 'clean' data (knowing it's not what one would actually see in live trading) or on the actual data (knowing some of it is wrong)?

Good catch and good questions!

Thank you @Dan Whitnable. You have been very helpful.

Hi Fly,

Dan and I spent some time discussing your questions a bit further and I think there's one feature of Pipeline that's missing from our combined answers: if you define a computation that operates over a historical window of data, that entire window will be adjusted as of the current simulation date (a.k.a. the perspective date). So if you define a YoY change in sales factor (see attached notebook), the value from the previous year will include adjustments that happened between the date a year ago and the current perspective date. Any restatements that occurred during that time period will also be applied.

The reason you're seeing a cliff-looking timeseries in the plots based on close price and sales per share is that the factor in the pipeline is always computing the latest value. When you get the results of all the different simulation dates by running your pipeline over a year, the perspective date shifts. But on a single day, if you compute over a window longer than 1 day, the computation will be appropriately made over a fully adjusted window on each day. This concept of perspective dates and sliding lookback windows is unique to Pipeline (and not available in the functions in the Research API), which is one of the reasons we suggest you use it as your primary tool!
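As a hypothetical sketch of a windowed factor along those lines (the attached notebook isn't reproduced; the field and window length are illustrative):

    from quantopian.pipeline import CustomFactor
    from quantopian.pipeline.data import Fundamentals

    class SalesPerShareYoY(CustomFactor):
        # The entire 252-day window is adjusted as of each simulation
        # (perspective) date, so the year-ago value reflects any splits or
        # restatements known by that date.
        inputs = [Fundamentals.sales_per_share]
        window_length = 252

        def compute(self, today, assets, out, sps):
            out[:] = (sps[-1] - sps[0]) / sps[0]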

Does this help?

@Jamie McCorriston I see what you are saying. Nevertheless, we should have the option to look at the past with adjustments done as of today without preparing many custom factors. When prices jump like this, it is very difficult to do things like cointegration analysis between stocks, or to build a model on an analysis window in the past and apply it to a hold-out set to validate it.

Basically we need two versions of "run_pipeline":
1) run_pipeline1: the current version, where everything is adjusted as of the reporting date
2) run_pipeline2: a new version, where everything is adjusted as of the end date

And these should really be available in "algo" mode in order to build models on the fly. Anyhow, we cannot use these to cheat in backtesting, because the end date or reporting date can be restricted to be at most the simulation date.

@Fly: You're right, doing things like cointegration analysis, or training a machine learning model on a training set once and then using the model in an algorithm, is not well supported in Pipeline right now. I believe the best way to support that type of workflow is most likely to perform the one-time analysis/computation (in research, for example), and then load the result of that step into an algorithm.

However, I do not necessarily agree that the one-time analysis or computation should be conducted on data that is entirely adjusted as of the end date. Generally speaking, I would be concerned that adjusting everything in the "training set" as of the end of that period would introduce lookahead bias that isn't representative of the expected out-of-sample case. Instead, I would recommend using fields that are invariant across splits (we usually refer to this property as being "window safe"), for instance, using daily returns instead of daily price. The example that's easiest for me to remember is the 7-for-1 AAPL split in June 2014. If I trained a model in July 2014 using daily close price, my whole model would be under the impression that AAPL's price was ~$100 instead of $700 (which was actually its price for most of that time). As such, I don't think that the model would be trained appropriately.
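For example, rather than training on raw close prices, one could feed a model a split-invariant input like daily returns (a small sketch; the window length is illustrative):

    from quantopian.pipeline.factors import Returns

    # One-day percent change. Unlike the raw close price, this value is
    # unaffected by events like the June 2014 AAPL split.
    daily_returns = Returns(window_length=2)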

What do you think? I'm interested to hear if you have a different opinion. We're always looking for more feedback and new ideas when we design platform improvements!

@Jamie McCorriston In my experience of doing this stuff for a long time on other platforms, backwards adjustments don't introduce any lookahead bias whatsoever (most adjustments involve just multiplying by a coefficient, and statistically this means nothing). In order for a rule set that acts on the price level (not returns) to be applicable in the future, the model should be constructed with the adjustments applied as of today. This is a realistic case, especially for pairs trading (cointegration stuff), and I still think it applies to many other cases. I am so sure about this because, after all the adjustments I did in other systems, the edges I discovered were so minuscule that there cannot be any lookahead bias. Lookahead bias almost immediately results in crazy returns in simulations, or abnormally small drawdowns.