Let's say I don't know which symbols I want to focus on and want to look across all symbols to find those with strong correlation. How can I do that?
Take a look at the pipeline tutorial (https://www.quantopian.com/tutorials/pipeline). The 'pipeline' approach supports exactly the case you are asking about. Namely, one starts with the entire universe of securities for which Quantopian has data (roughly 9000), then selects a smaller subset based upon some data or calculations (in your case, correlation). The smaller subset is then traded.
You may want to look at this post (https://www.quantopian.com/posts/new-correlation-and-linear-regression-factors), which has an algorithm that finds securities with high and low correlation to the market (actually SPY). This post may also be helpful: https://www.quantopian.com/posts/quantopian-lecture-series-updated-spearman-rank-correlation-notebook-1 . If what you are looking for is a way to find securities that are correlated with each other (and not with a single security), you should find this post very insightful: https://www.quantopian.com/posts/hierarchical-risk-parity-comparing-various-portfolio-diversification-techniques (the notebook is very good. CLONE IT).
Any kind of correlation determination will be computationally intensive, and you will probably need to narrow the initial universe of securities just so you don't run out of memory and/or time out the algorithm. Of the 9000 or so securities which Quantopian tracks, roughly half are funds of some sort (ETFs, ETNs, etc.). Of the remaining half, roughly half are not 'well traded' (e.g. pink sheet stocks, preferred stocks, etc.). So, if you are looking at stocks, a good 'universe' from which to begin a narrowed-down search would be the Q1500US. This is roughly equivalent to the top 1500 US stocks by market cap. See https://www.quantopian.com/help#built-in-filters.
Note that you would typically decide up front whether you want the algorithm to trade stocks or funds. If you are trading stocks, you typically would not use a fixed set of stocks, but rather use filters in the pipeline to dynamically select the subset of stocks you want to 'focus on'. Funds, on the other hand, are so varied and have so little data available about them in the Quantopian datasets that you would probably want to manually select a static subset to focus on.
In any case, play around with the pipeline functions in the research (notebook) environment to understand how they work and how to sort and filter securities by, in your case, a correlation factor.
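To make the "sort and filter by a correlation factor" idea concrete outside of Quantopian, here is a minimal sketch in plain Python. It is not the pipeline API: the function names (`pearson`, `rank_by_correlation`), the symbol names, the 0.5 threshold, and the synthetic returns are all assumptions made up for illustration. It scores each symbol by its correlation to a benchmark series, keeps only the strongly correlated ones (positive or negative), and sorts them.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(returns, benchmark, min_abs_corr=0.5):
    """Return (symbol, corr) pairs with |corr| >= min_abs_corr, strongest first."""
    scored = [(sym, pearson(series, benchmark)) for sym, series in returns.items()]
    kept = [(sym, corr) for sym, corr in scored if abs(corr) >= min_abs_corr]
    return sorted(kept, key=lambda item: abs(item[1]), reverse=True)


# Synthetic daily returns: made-up data and hypothetical symbol names.
rng = random.Random(0)
benchmark = [rng.gauss(0, 0.01) for _ in range(250)]
returns = {
    "FOLLOWS": [b + rng.gauss(0, 0.002) for b in benchmark],   # moves with the benchmark
    "INVERSE": [-b + rng.gauss(0, 0.002) for b in benchmark],  # moves against it
    "NOISE": [rng.gauss(0, 0.01) for _ in range(250)],         # unrelated
}

ranked = rank_by_correlation(returns, benchmark)
for sym, corr in ranked:
    print(sym, round(corr, 3))
```

In the research environment you would more likely do the same thing with pandas on real pricing data; the point here is only the score, filter, sort sequence.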
Good luck.
"Any kind of correlation determination will be computationally intensive, and you will probably need to narrow the initial universe of securities just so you don't run out of memory and/or time out the algorithm."
In the research platform, you may be able to manage memory by looping over random pairs of securities, performing the analysis, and storing the result. With random sampling, you can get a feel for what the distributions look like. Otherwise, you are crunching on N*(N-1)/2 = 1500*(1499)/2 = 1,124,250 pairs in memory all at once.
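The random-sampling idea can be sketched as follows. This is an illustration in plain Python under stated assumptions, not Quantopian code: `pair_count`, `pearson`, and `sample_pair_correlations` are hypothetical helper names, and the returns are synthetic. Only the sampled results are stored, so memory grows with the number of samples rather than with the full number of pairs.

```python
import math
import random
import statistics


def pair_count(n):
    """Number of unordered pairs among n securities: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def sample_pair_correlations(returns, n_samples, seed=0):
    """Correlate n_samples distinct random pairs; returns {(sym_a, sym_b): corr}."""
    rng = random.Random(seed)
    symbols = sorted(returns)
    target = min(n_samples, pair_count(len(symbols)))
    results = {}
    while len(results) < target:
        a, b = rng.sample(symbols, 2)            # two distinct symbols
        key = (a, b) if a < b else (b, a)        # unordered pair
        if key not in results:                   # skip pairs already analyzed
            results[key] = pearson(returns[key[0]], returns[key[1]])
    return results


# Synthetic data: 50 hypothetical symbols with 60 days of made-up returns.
data_rng = random.Random(1)
returns = {
    "SYM%03d" % i: [data_rng.gauss(0, 0.01) for _ in range(60)] for i in range(50)
}

sampled = sample_pair_correlations(returns, n_samples=200)
values = sorted(sampled.values())
print("pairs in a 1500-stock universe:", pair_count(1500))
print("sampled:", len(values), "of", pair_count(len(returns)))
print("min / median / max:", values[0], statistics.median(values), values[-1])
```

The rejection loop is only efficient when the sample is small relative to the total number of pairs, which is the situation described above.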
@Dan, can you point me to the section that shows how to query the raw data for all equities? I've looked through the Pipeline tutorial multiple times but couldn't find one.
If you've already gone through the pipeline tutorial, then maybe look over the documentation in the manual (https://www.quantopian.com/help#pipeline-title). Here is a post that may also help: https://www.quantopian.com/posts/screen-vs-filter
A couple of things. Use the pipeline approach to get data from any of the datasets that Quantopian offers. The list of datasets is here: https://www.quantopian.com/data. The two most common datasets are https://www.quantopian.com/data/quantopian/us_equity_pricing and https://www.quantopian.com/data/morningstar/fundamentals. Find a dataset of interest. Each dataset has individual fields of data that you can access. Sometimes there are a lot of fields (e.g. in the Morningstar fundamentals), so the fields are grouped into categories. The fields (technically they are 'BoundColumn' objects) are accessed like this:
# First import any of the datasets you will use
# There is usually info on the specific dataset page on how to import each dataset
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
import quantopian.pipeline.data.morningstar as morningstar
# Second, create instances of each of the specific fields you will use
# The '.latest' method is an easy way to get the 'raw' data
# The individual fields are referenced as dataset_name.field_name
# For example to get the close price and volume
close = USEquityPricing.close.latest
volume = USEquityPricing.volume.latest
# Some datasets are big and need an additional category to reference them (like fundamentals from morningstar)
# So use dataset_name.category_name.field_name like below
eps = morningstar.earnings_report.diluted_eps.latest
# close, volume, and eps are called 'factors' and represent the columns of 'raw' data you want
# Now define a pipeline object to return those data.
my_pipe = Pipeline(
    columns={
        'close': close,
        'volume': volume,
        'eps': eps,
    },
)
Now execute the pipeline (typically in the 'before_trading_start' method) by using the 'pipeline_output' method. The pipeline must first have been attached under the same name (here 'my_pipe') with 'attach_pipeline' in 'initialize'. The output will be a dataframe with all the data you requested. By default there will be data for ALL securities, with each row indexed by a security (so approx. 9000 rows). In this case there will be three columns (close, volume, eps). You can now access the data using any of the pandas dataframe methods and slicing functionality.
# Run the pipeline to retrieve the data
# (assumes attach_pipeline(my_pipe, 'my_pipe') was called in initialize)
context.output = pipeline_output('my_pipe')
# Use the retrieved data in the dataframe as you wish
eps_over_20_stocks = context.output.query('eps > 20')
# To get a list of those stocks, or more specifically 'equity objects', reference the index
my_equity_list = eps_over_20_stocks.index
Hope that helps?
I'd highly recommend starting off in the research (notebook) environment to see how all this works. I also find it very helpful to develop in this environment before porting the code over to an algorithm. It can be very insightful to interact visually with results in real time.
Take a look at the attached notebook.
I'm taking this task offline and using historical data from Yahoo. It'll add some noise, as the data isn't adjusted, but it lets me take the one-day price change for a security and see how accurate a pass with k-means clustering is. I allow only one centroid, then measure accuracy with two parameters: a maximum acceptable distance from the centroid and a required % capture of all data points. I plot that against 5-day and 10-day forward looks, based on penny movement and % movement separately (yes, I'm letting it see the future to check whether there's significant alpha). That way I don't bog down the Q servers or time out, and I get some sets of probable trading pairs. k-means would be too expensive to run in an algo, though, so this is just for filtering down to pairs/baskets in research.
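For readers trying to picture the single-centroid check, here is one possible reading of it as a small Python sketch. No code was shared for it in this thread, so everything here is an assumption: the function names, the choice of Euclidean distance, the interpretation of each point as (one-day change, forward change), and the made-up sample points. With only one centroid, k-means reduces to taking the mean of the points, so the sketch computes the mean directly.

```python
import math


def centroid(points):
    """Mean of 2-D points; with a single cluster this is the k-means centroid."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def capture_fraction(points, max_distance):
    """Fraction of points within max_distance (Euclidean) of the centroid."""
    cx, cy = centroid(points)
    inside = sum(
        1 for x, y in points if math.hypot(x - cx, y - cy) <= max_distance
    )
    return inside / len(points)


def is_tight_cluster(points, max_distance, min_capture):
    """True if at least min_capture of the points fall within max_distance."""
    return capture_fraction(points, max_distance) >= min_capture


# Made-up sample: (one-day % change, forward % change); the last point is an outlier.
points = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.0), (1.0, 1.0), (5.0, -3.0)]

print(capture_fraction(points, max_distance=1.5))
print(is_tight_cluster(points, max_distance=1.5, min_capture=0.75))
```

A pair or basket would then be kept in research only if it passes for the chosen distance and capture parameters.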
With high positive correlation, you'd want the one in the set that has the best % move. With high negative correlation, you'd be finding the other side of your trading pair/hedge. For lower, more consistent returns, I think you'd play all of both sides, weighted by average % move or some other factor.
Just a theory. Not tested yet.