Maybe start with some basics.
A pipeline is an object whose primary function is to fetch data in an optimal way. You define all the data fields you want to get, and the pipeline returns that data with a call to the 'pipeline_output' method. You can define 'raw' data fields (eg 'volume', 'market cap') or calculated fields based upon raw data (eg 'demean', 'rank', 'simple moving average'). The pipeline object's job is to optimize the database queries and any associated calculations, then return the results as a nice, powerful pandas dataframe. The columns of the dataframe are the data fields you defined, and the rows are the securities.
Now, you can implement some simple boolean logic within a pipeline through the use of masks and a final screen. This limits the securities (ie the rows in the returned dataframe) to a particular subset. However, just because you can doesn't mean you always should. It isn't required, and at times isn't even desirable.
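To make the screen idea concrete, here's a small pandas sketch (the tickers and prices are made up) showing that a screen amounts to a boolean row filter on the output dataframe, which is exactly why the same logic can be applied after the fact instead:

```python
import pandas as pd

# Mock of a pipeline output: one row per security, one column per field.
output = pd.DataFrame(
    {'last_close': [12.0, 3.5, 48.0], 'volume': [1e6, 2e5, 5e6]},
    index=['AAPL', 'XYZ', 'MSFT'])

# A screen limits the rows to a subset, just like this boolean filter.
screened = output[output['last_close'] > 5.0]
print(screened.index.tolist())  # → ['AAPL', 'MSFT']
```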
A dataset is, well, a set of data. One example is 'USEquityPricing'. Each dataset typically has multiple fields; in the case of USEquityPricing those fields are 'open', 'high', 'low', 'close', and 'volume'. Quantopian provides a number of datasets (https://www.quantopian.com/data), and any of their fields can be added as columns in a pipeline definition or used as inputs to a calculated pipeline column.
These concepts are introduced in the tutorials https://www.quantopian.com/tutorials/pipeline.
@Martin
You stated "I'm looking to make a strategy from two datasets". That's no problem. Go ahead and pull data fields from as many datasets as you wish (though realistically there are memory and processor constraints) and put them into your pipeline definition.
You also wrote: "I am looking to create a single pipeline created from the two datasets basing on varying conditions, independent conditions for each dataset".
The approach I'd recommend is to separate your data from your logic. Create a pipeline object to return only the data, and create separate code to define your conditions. Don't do any filtering or logic in the pipeline definition. Any logic that can be done with the pipeline methods can just as easily be done using pandas methods on the returned dataframe. Start by defining the data you need: in this case 'last_close', 'gain', and 'sector'.
# Assumed Quantopian imports (the module aliases match the
# Filters/Factors usage below)
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.classifiers.morningstar import Sector
import quantopian.pipeline.filters as Filters
import quantopian.pipeline.factors as Factors

def pipe_definition(context):
    # Limit the universe to the built-in Q1500US base universe
    universe = Filters.Q1500US()
    last_close = USEquityPricing.close.latest
    # 30-day return, computed only for stocks in the universe
    gain = Factors.Returns(inputs=[USEquityPricing.close], window_length=30, mask=universe)
    sector = Sector(mask=universe)
    return Pipeline(
        columns={
            'last_close': last_close,
            'gain': gain,
            'sector': sector,
        },
    )
Then in your 'before_trading_start' method (or elsewhere in the code) use the pandas dataframe methods to manipulate, calculate, and select stocks which are your 'conditions'. You can have as many of these conditions as you wish.
# Assume the pipeline output was assigned to 'context.output' and
# TARGET_STOCKS is a constant defined elsewhere in the algorithm.
condition_a = 'sector == 206 and gain > 0 and last_close > 5.0'
context.a_stocks = (context.output
                    .query(condition_a)
                    .nlargest(TARGET_STOCKS, 'gain')
                    .index)

condition_b = 'sector == 100 and gain < 0'
context.b_stocks = (context.output
                    .query(condition_b)
                    .nsmallest(TARGET_STOCKS, 'gain')
                    .index)
I've attached an algorithm with something like this implemented.
Hope this gives you some ideas. The big takeaway... separate the data from the logic. This is good programming practice in general, and it will certainly make for a much clearer pipeline definition. Then get to know the pandas dataframe methods and use them. There are dataframe methods for just about any sorting, selecting, and calculating one would wish to do. They're your friends.
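As a runnable illustration of that query-then-rank pattern (the sector codes and returns here are made up, and TARGET_STOCKS is a stand-in for the constant defined in the attached algorithm), the same chain works on any dataframe:

```python
import pandas as pd

# Hypothetical pipeline output with made-up sector codes and returns.
output = pd.DataFrame({
    'sector':     [206, 206, 100, 100, 206],
    'gain':       [0.10, 0.25, -0.05, -0.15, -0.02],
    'last_close': [10.0, 22.0, 8.0, 4.0, 55.0]})

TARGET_STOCKS = 1  # stand-in portfolio-size constant

# Same pattern as condition_a: filter with query(), then rank by gain.
a_stocks = (output
            .query('sector == 206 and gain > 0 and last_close > 5.0')
            .nlargest(TARGET_STOCKS, 'gain')
            .index)
print(list(a_stocks))  # → [1]
```

Rows 0 and 1 pass the filter; nlargest keeps row 1, the biggest gainer.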