Understanding the Pipeline

@Aleksei Dremov. Two very good questions.

The first question is easy: "Why can't I just do: attach_pipeline(make_pipeline())". The answer is that you definitely can. The attach_pipeline method expects a reference to a pipeline object. It doesn't care whether that reference comes from a variable (e.g. 'my_pipe') or from the return value of a function (e.g. 'make_pipeline()'). Both work the same. I sometimes explicitly set a variable, as in the former case, simply to highlight that there are really two things happening: first, defining a pipeline, and second, attaching that pipeline to the algo. Just personal preference.
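To make the equivalence concrete, here is a minimal sketch in plain Python. Note that `ToyPipeline`, `registry`, and this `attach_pipeline` are toy stand-ins defined in the block itself, not the real zipline objects; the only thing borrowed from the real API is the (pipeline, name) argument order.

```python
class ToyPipeline:
    """Toy stand-in for a pipeline object (hypothetical, not zipline's class)."""
    def __init__(self, columns):
        self.columns = columns


registry = {}  # hypothetical store: pipeline name -> pipeline object


def attach_pipeline(pipeline, name):
    """Toy stand-in: remember the pipeline object under a user-defined name."""
    registry[name] = pipeline


def make_pipeline():
    """Build and return a new toy pipeline."""
    return ToyPipeline(columns={"close": "latest close price"})


# Form 1: define the pipeline first, then attach it.
my_pipe = make_pipeline()
attach_pipeline(my_pipe, "pipe_a")

# Form 2: attach the return value of the function directly.
attach_pipeline(make_pipeline(), "pipe_b")

# Either way, the attach function just received a reference to a pipeline object.
print(type(registry["pipe_a"]) is type(registry["pipe_b"]))
```

In both forms Python evaluates the argument to an object reference before the call happens, so the function being called cannot tell the two apart.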

The second question requires some explaining: "Why is it not: context.output = pipeline_output(my_pipeline())". Pipelines run asynchronously from the user code. They calculate chunks of results (typically six months' worth) all at one time, while the rest of the algo executes one day at a time. Pipelines do this entirely for speed. It's much faster to fetch a lot of data once than a little bit of data many times. Behind the scenes, zipline (the backtest engine) does something like this:

1. Run the first 'chunk' of each pipeline associated with this algo all at once.
2. Save the output of each pipeline with the pipeline object.
3. Run the user algo code one day at a time.
   - The algo calls pipeline_output each day to fetch the current day's pipeline output.
   - Note that the pipeline output has already been calculated; the fetch above is just a lookup.
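The steps above can be sketched roughly as follows. This is a toy model, not zipline's actual code: `run_chunk`, `ChunkedResults`, and the 5-day chunk size are all made up for illustration, and it uses calendar days rather than trading days to keep things short.

```python
from datetime import date, timedelta


def run_chunk(start, n_days):
    """Pretend to compute n_days of pipeline results in one batch (hypothetical)."""
    return {start + timedelta(days=i): {"day_index": i} for i in range(n_days)}


class ChunkedResults:
    """Toy cache: compute a whole chunk once, then serve daily lookups from it."""

    def __init__(self, chunk_days):
        self.chunk_days = chunk_days
        self.cache = {}
        self.batch_runs = 0  # how many expensive batch computations happened

    def output_for(self, day):
        # Only compute when the requested day is outside the cached chunk.
        if day not in self.cache:
            self.cache = run_chunk(day, self.chunk_days)
            self.batch_runs += 1
        # The common case: a plain dictionary lookup.
        return self.cache[day]


results = ChunkedResults(chunk_days=5)
start = date(2020, 1, 1)

# The "algo" runs one day at a time and asks for that day's output.
for offset in range(10):
    today = start + timedelta(days=offset)
    row = results.output_for(today)

print(results.batch_runs)  # 10 daily lookups, but only 2 batch computations
```

The point of the design is visible in the counter: the expensive work happens once per chunk, while the per-day call is only a lookup.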

So, the first step is to tell zipline that a pipeline needs to be run for the algo. This is what the attach_pipeline method does. This method needs to be called in the initialize method, before any other algo code is run. Zipline will then take care of running the pipeline and ensure all the data is ready when it's needed in the algo. After a pipeline is attached to an algo, it is referenced by a user-defined name. The algo then calls pipeline_output to get the current day's data. At that point, the pipeline_output method isn't actually running anything. It's simply looking up the dataframe which has already been calculated for that day.

For technical reasons, the pipeline_output method uses the pipeline name rather than the object reference. This is mostly due to the way Python handles scope for variables. By referencing just a user-defined name, those scope issues are avoided. So, the simple answer is that pipeline_output is defined to expect a user-defined name rather than the actual pipeline object reference. This is why one cannot do the following:

    # The pipeline_output method expects a pipe name and NOT an object reference  
    # This is why the following doesn't work  
    context.output = pipeline_output(my_pipeline())
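For contrast, here is a minimal sketch of a name-based lookup and why passing an object fails. Again, these are toy stand-ins written for this post (the two dictionaries, the placeholder result, and the KeyError are my assumptions), not zipline's implementation; only the idea of "attach under a name, fetch by that name" comes from the real API.

```python
attached = {}     # hypothetical store: name -> pipeline object
precomputed = {}  # hypothetical store: name -> already-calculated output


def attach_pipeline(pipeline, name):
    """Toy stand-in: register the pipeline and pretend its results were computed."""
    attached[name] = pipeline
    precomputed[name] = {"asset_a": 1.0}  # placeholder result, not real data


def pipeline_output(name):
    """Toy stand-in: look up the stored output by the user-defined name."""
    if name not in attached:
        raise KeyError("no pipeline attached under the name %r" % (name,))
    return precomputed[name]


my_pipeline_object = object()  # stands in for the object make_pipeline() returns
attach_pipeline(my_pipeline_object, "my_pipe")

# Works: look the output up by the name used when attaching.
output = pipeline_output("my_pipe")

# Fails: the pipeline object itself is not a registered name.
try:
    pipeline_output(my_pipeline_object)
    error_raised = False
except KeyError:
    error_raised = True

print(output, error_raised)
```

So in the algo, the working call passes the same string that was given to attach_pipeline, for example pipeline_output('my_pipe').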

There is a little more description of how pipelines work in the docs and also in this post. If you're interested, the actual code for the various pipeline methods can be found on GitHub here.

Hope that answers your questions.