Notebook

The Bean Report

Approaching Soybean Futures on Quantopian


Introduction

Understanding a futures contract, in my opinion, requires considering the inherent economic differences between purchasing a physical good at the prevailing market rate and purchasing the same good with a promise from the seller to deliver it at a future date. These economic differences were best described to me by a former bean trader at the Chicago Board of Trade. My humble attempt to replicate his explanation is below:

Imagine that you are a beef farmer, and you feed your herd of cows using soybean meal. Maintaining soybean meal on inventory is an absolute necessity, as failure to feed your cows would certainly result in the failure of your business. In addition, imagine there is only one soybean farmer in your area. You depend on this farmer to supply this critical input to your business. One day, you notice that your inventory of beans is running low. As a result, you call up the soybean farmer and ask if you can purchase some more beans. You and the farmer agree that 100 dollars per bushel is a fair price, and the farmer says that as long as you pick up the beans within 24 hours, he will honor your agreement on price. The deal works out great for both you and the farmer. The next day, you decide that you do not want to risk having your bean inventory run low again, a prudent decision because the farmer may sell out of beans before the next time you run low. You call up the farmer again and request to buy more beans, but this time you would like him to deliver the beans to you, and you want this delivery to occur in 90 days with cash paid on delivery. The farmer says, "sure thing, I can do that for 120 dollars per bushel…"

What happened? Is your good friend the bean farmer trying to take advantage of you with this higher price? You ask him, and he responds with the following reasons to justify his price increase:

First, he has to keep the beans you ordered stored in his silo for the 90 days before delivery. This adds 5 dollars to his price quote, representing the money he loses by not being able to rent out that space in his silo to other people. Second, he has to deliver the beans, which will cost him 10 dollars in gas and truck depreciation, a cost he will pass on to you. Third, your request for beans in 90 days prevents him from selling those same beans tomorrow. He explains that he could sell them tomorrow and deposit the cash in a 90-day certificate of deposit at the bank, which would earn him 5 dollars in interest income by the end of the 90 days.

It turns out that the farmer is being reasonable. Your request to reserve the beans and receive delivery in 90 days generates additional costs, or economic differences. The farmer must pass these costs on to you, or it would not be rational for him to make the deal. His storage costs, delivery costs, and opportunity cost of capital together make up what is known as the “Cost of Carry”, which, along with the market price of the physical asset that underlies a futures contract, provides the basis for the valuation of the futures contract.
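
To make the arithmetic concrete, here is a minimal sketch of the farmer's 90-day quote (the numbers are the ones from the story above):

# A worked version of the farmer's quote: futures price = spot price + cost of carry
spot_price = 100         # agreed market price per bushel today
storage_cost = 5         # silo space tied up for 90 days
delivery_cost = 10       # gas and truck depreciation
opportunity_cost = 5     # forgone interest on a 90-day certificate of deposit

cost_of_carry = storage_cost + delivery_cost + opportunity_cost
futures_price = spot_price + cost_of_carry
print(futures_price)     # 120, the farmer's quote for delivery in 90 days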

Section 1 - Pulling in Data

The problem... wrangling the data


Pull in some equity pricing, including a soybean ETF

In [49]:
#### Using Quantopian's 'get_pricing()', we create a pandas data frame below
### The 'spot_prices' variable will store the data frame.
## The parameters necessary to return the frame are described below
# After creating the variable, check to ensure you received the data you requested.

spot_prices = get_pricing(
    ['SOYB','USO','TLT','BSV','PSA',
    'SPY','XIV','IBM','MCD','JNJ','XOM'], # A list of equity symbols to include
    fields='close_price',  # the 'field' we want to utilize
    start_date='2017-01-01', # Starting date of the data
    end_date = '2017-03-31', # Ending date of the data
    frequency='daily', # Frequency of the data
)    
spot_prices.head(3) # use the '.head()' method of pandas to view the first 3 rows of the data frame
Out[49]:
Equity(41912 [SOYB]) Equity(28320 [USO]) Equity(23921 [TLT]) Equity(33651 [BSV]) Equity(24962 [PSA]) Equity(8554 [SPY]) Equity(40516 [XIV]) Equity(3766 [IBM]) Equity(4707 [MCD]) Equity(4151 [JNJ]) Equity(8347 [XOM])
2017-01-03 00:00:00+00:00 18.93 11.450 119.153 79.248 221.489 224.195 50.24 165.873 118.741 115.034 90.056
2017-01-04 00:00:00+00:00 19.22 11.570 119.631 79.278 223.144 225.539 52.80 167.912 118.607 114.884 89.065
2017-01-05 00:00:00+00:00 19.24 11.695 121.483 79.397 225.581 225.380 52.99 167.366 118.805 116.107 87.737

Know your Pandas!

In [50]:
print "Number of Rows:",len(spot_prices)
Number of Rows: 62
In [51]:
#Tom is my favorite Panda. He gets it...
spot_prices.tail(3) # Check to ensure the last row matches the end_date requested
Out[51]:
Equity(41912 [SOYB]) Equity(28320 [USO]) Equity(23921 [TLT]) Equity(33651 [BSV]) Equity(24962 [PSA]) Equity(8554 [SPY]) Equity(40516 [XIV]) Equity(3766 [IBM]) Equity(4707 [MCD]) Equity(4151 [JNJ]) Equity(8347 [XOM])
2017-03-29 00:00:00+00:00 18.45 10.38 121.33 79.690 219.95 235.59 74.68 173.96 128.86 124.97 82.03
2017-03-30 00:00:00+00:00 18.33 10.57 120.37 79.675 219.38 236.27 74.42 173.87 129.33 124.65 83.68
2017-03-31 00:00:00+00:00 18.12 10.64 120.72 79.730 218.90 235.72 73.03 174.17 129.61 124.51 82.00

Visualize pricing between all equities in 1Q, 2017

In [52]:
spot_delta = spot_prices.pct_change(periods=1, axis=0) # Get the percent change between closings
spot_delta = spot_delta.dropna() # drop the NaN row (the first row has no previous close)
spot_delta.plot() # quick code for plotting a data frame  
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f59b8cdd090>

Pull in pricing of May and July bean futures for the same time frame.

References To Determine Futures Data To Utilize:

Lifespan of Data Available - https://www.quantopian.com/posts/continuous-future-data-lifespans#590a60804de40b0a781ddbc0

Jamie's Notebook (w/ contract symbols and month symbols) - https://www.quantopian.com/posts/futures-data-now-available-in-research

CME's expiration calendar - http://www.cmegroup.com/tools-information/calendars/expiration-calendar/

In [53]:
from quantopian.research.experimental import history # Dependency needed to pull in futures data

# See References above to determine which futures you want to use
sy_contracts = symbols(['SYK17', 'SYN17'])

# Pandas frame for futures is below, same work as we did with equities. 
# (Excuse the lack of logic of the variable name, I was copying Jamie's code)
sy_consecutive_contract_price = history(
    sy_contracts, # Instead of defining the symbols within this argument, I just pass in the 'sy_contracts' list defined above
    fields='price', 
    frequency='daily', 
    start_date='2017-01-01', 
    end_date='2017-03-31'
)
sy_consecutive_contract_price.head()
Out[53]:
Future(1016201705 [SYK17]) Future(1016201707 [SYN17])
2017-01-03 00:00:00+00:00 10.045 10.110
2017-01-04 00:00:00+00:00 10.220 10.282
2017-01-05 00:00:00+00:00 10.210 10.278
2017-01-06 00:00:00+00:00 10.048 10.120
2017-01-09 00:00:00+00:00 10.138 10.210

View returns between both contracts.

In [54]:
bean_delta = sy_consecutive_contract_price.pct_change(periods=1, axis=0)
bean_delta = bean_delta.dropna()
bean_delta.plot(title='Soy Futures')
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f59b7c13a10>

Combine the equities and futures into one frame.

In [55]:
# Merging pandas frames can be a challenge; the key with this method is to match up the starting and ending dates of both frames
data = spot_prices.join(sy_consecutive_contract_price) 
data.head(3)
Out[55]:
Equity(41912 [SOYB]) Equity(28320 [USO]) Equity(23921 [TLT]) Equity(33651 [BSV]) Equity(24962 [PSA]) Equity(8554 [SPY]) Equity(40516 [XIV]) Equity(3766 [IBM]) Equity(4707 [MCD]) Equity(4151 [JNJ]) Equity(8347 [XOM]) Future(1016201705 [SYK17]) Future(1016201707 [SYN17])
2017-01-03 00:00:00+00:00 18.93 11.450 119.153 79.248 221.489 224.195 50.24 165.873 118.741 115.034 90.056 10.045 10.110
2017-01-04 00:00:00+00:00 19.22 11.570 119.631 79.278 223.144 225.539 52.80 167.912 118.607 114.884 89.065 10.220 10.282
2017-01-05 00:00:00+00:00 19.24 11.695 121.483 79.397 225.581 225.380 52.99 167.366 118.805 116.107 87.737 10.210 10.278
REMINDER!!! - Futures closing prices (in these frames) are recorded later in the day than the equity closes
View column names
In [56]:
print data.columns
Index([      Equity(41912 [SOYB]),        Equity(28320 [USO]),
              Equity(23921 [TLT]),        Equity(33651 [BSV]),
              Equity(24962 [PSA]),         Equity(8554 [SPY]),
              Equity(40516 [XIV]),         Equity(3766 [IBM]),
               Equity(4707 [MCD]),         Equity(4151 [JNJ]),
               Equity(8347 [XOM]), Future(1016201705 [SYK17]),
       Future(1016201707 [SYN17])],
      dtype='object')
Alter column names
In [57]:
data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot',
                'SPY','XIV','IBM','MCD','JNJ','XOM',
                'beans_may','beans_july']
In [58]:
print data.columns
Index([u'beans_spot', u'oil_spot', u'lt_rate_spot', u'st_rate_spot',
       u'storage_spot', u'SPY', u'XIV', u'IBM', u'MCD', u'JNJ', u'XOM',
       u'beans_may', u'beans_july'],
      dtype='object')

Section 2 - Data Insights and Feature Engineering

Make columns that show days to delivery for both contracts

In [59]:
import datetime
data['May_Delivery'] = datetime.datetime(2017,5,12) # May expiration date per CME 
data['July_Delivery'] = datetime.datetime(2017,7,14) # July expiration date per CME
data['may_dtd'] = data.May_Delivery - data.index # Subtract to get days to delivery
data['july_dtd'] = data.July_Delivery - data.index # Subtract to get days to delivery
data['may_dtd'] = data['may_dtd'].astype(int) / 86400000000000 # Convert nanoseconds to days
data['july_dtd'] = data['july_dtd'].astype(int) / 86400000000000 # Convert nanoseconds to days
del data['May_Delivery'] # delete 
del data['July_Delivery'] # delete
data.head() # know your pandas
Out[59]:
beans_spot oil_spot lt_rate_spot st_rate_spot storage_spot SPY XIV IBM MCD JNJ XOM beans_may beans_july may_dtd july_dtd
2017-01-03 00:00:00+00:00 18.930 11.450 119.153 79.248 221.489 224.195 50.24 165.873 118.741 115.034 90.056 10.045 10.110 129.0 192.0
2017-01-04 00:00:00+00:00 19.220 11.570 119.631 79.278 223.144 225.539 52.80 167.912 118.607 114.884 89.065 10.220 10.282 128.0 191.0
2017-01-05 00:00:00+00:00 19.240 11.695 121.483 79.397 225.581 225.380 52.99 167.366 118.805 116.107 87.737 10.210 10.278 127.0 190.0
2017-01-06 00:00:00+00:00 18.940 11.680 120.338 79.297 226.275 226.246 53.68 168.239 119.897 115.535 87.718 10.048 10.120 126.0 189.0
2017-01-09 00:00:00+00:00 19.115 11.320 121.334 79.387 224.907 225.519 53.82 166.404 119.570 115.615 86.291 10.138 10.210 123.0 186.0
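
Aside: dividing by 86,400,000,000,000 converts nanoseconds to days. A hedged alternative, assuming the frame's index is a tz-aware DatetimeIndex as shown above, lets pandas do the conversion (the 'may_dtd_alt' column is only for illustration):

import pandas as pd

may_expiration = pd.Timestamp('2017-05-12', tz='UTC')            # approximate May expiration, as above
data['may_dtd_alt'] = (may_expiration - data.index).days.values  # TimedeltaIndex -> whole days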

Create The Cost Of Carry Feature

In [60]:
data['dCoC_may'] = (data['beans_spot'] - data['beans_may']) / data['may_dtd']  
data['dCoC_july'] = (data['beans_spot'] - data['beans_july']) / data['july_dtd'] 

Chart Cost of Carry For Both Contracts in 1Q 2017

Earlier plots used the '.plot()' method on a pandas frame. Below is a five-step process covering the most important aspects of building a chart with matplotlib.

In [61]:
# Step 1 - import the library (always use plt as the variable because it is easiest to find examples on the internet)
import matplotlib.pyplot as plt
# Step #2 - Select your X and Y data
# For the X values, we use 'data.index' because that holds the dates in our pandas frame
# For the Y values, we use the cost-of-carry columns of 'data'
# Additionally, we add a label to each plot so we can map the colors to the data
plt.plot(data.index,data['dCoC_may'],label="Beans For May Delivery")
plt.plot(data.index,data['dCoC_july'],label="Beans For July Delivery")
# Step #3 - add the legend, this will display the labels we added as the third argument above
plt.legend(loc='upper left')
# Step #4 - add a title to the chart, an X label, and a Y label
plt.title("Comparing Implied Cost of Carry Between May and July Soybeans")
plt.xlabel("Date")
plt.ylabel("Average Daily Implied Cost of Carry")
# Step #5 - Tell the python gods to show the chart on the screen
plt.show()
We saw that May and July futures traded in line with each other, so can we profit from the widening spread between the cost of carry of May beans and the cost of carry of July beans? How about we sell May bean futures on Jan 3 and buy spot beans on Jan 3?
In [62]:
# Walk through the trade using the first and last rows of the frame (positions sized to roughly $10K notional)
print ("Price of Spot Beans on Jan 3:",data['beans_spot'].iloc[0])
print ("Price of May Beans on Jan 3:",data['beans_may'].iloc[0])
print ('Beans shares bought on Jan 3rd ($10K value):',int(10000/18.93))
print ('May beans contracts sold on Jan 3rd ($10K value):',int(10000/10.045))
print ("Price of Spot Beans on March 31:",data['beans_spot'].iloc[-1])
print ("Price of May Beans on March 31:",data['beans_may'].iloc[-1])
print ('Spot Bean Trade Profit:',(10000-18.12*528)*-1)
print ('Futures Trade Profit:',10000-(995*9.448))
print ("Trade Profit:",599.23999999-432.6399999)
('Price of Spot Beans on Jan 3:', 18.93)
('Price of May Beans on Jan 3:', 10.045)
('Beans shares bought on Jan 3rd ($10K value):', 528)
('May beans contracts sold on Jan 3rd ($10K value):', 995)
('Price of Spot Beans on March 31:', 18.120000000000001)
('Price of May Beans on March 31:', 9.4480000000000004)
('Spot Bean Trade Profit:', -432.6399999999994)
('Futures Trade Profit:', 599.2399999999998)
('Trade Profit:', 166.60000008999998)
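
The hard-coded arithmetic above can be wrapped in a small helper; a minimal sketch (hypothetical function name, ignoring contract multipliers and sizing exactly as the cell above does):

# Hypothetical helper: P&L of buying spot and shorting the May future, each at ~$10K notional
def spread_pnl(spot_entry, spot_exit, fut_entry, fut_exit, notional=10000):
    spot_units = int(notional / spot_entry)              # spot ETF shares bought
    fut_units = int(notional / fut_entry)                # futures units sold short
    long_spot_pnl = spot_units * (spot_exit - spot_entry)
    short_fut_pnl = fut_units * (fut_entry - fut_exit)   # a short profits when the price falls
    return long_spot_pnl + short_fut_pnl

print(spread_pnl(18.93, 18.12, 10.045, 9.448))
# Roughly the ~167 dollar combined profit printed above; the small difference comes from the
# cell above treating the entry cost as a flat 10,000 dollars.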

How often does the cost of carry diverge between the front-month contract and the next delivery month, as it did between the May and July beans in 1Q 2017? Chart the cost of carry for 1Q 2014-2016.

2014 Chart

In [63]:
spot_prices = get_pricing(
    ['SOYB'], 
    fields='close_price', 
    start_date='2014-01-01', 
    end_date = '2014-03-31', 
    frequency='daily',
)    
#spot_prices.head(3)

sy_contracts = symbols(['SYK14', 'SYN14'])

sy_consecutive_contract_price = history(
    sy_contracts, 
    fields='price', 
    frequency='daily', 
    start_date='2014-01-01', 
    end_date='2014-03-31'
)
#sy_consecutive_contract_price.head()

data = spot_prices.join(sy_consecutive_contract_price)
#print (data.columns)
data.columns = ['beans_spot','beans_may','beans_july']

data['May_Delivery'] = datetime.datetime(2014,5,12) # fix with real date later
data['July_Delivery'] = datetime.datetime(2014,7,14) # fix with real date later
data['may_dtd'] = data.May_Delivery - data.index
data['july_dtd'] = data.July_Delivery - data.index
data['may_dtd'] = data['may_dtd'].astype(int) / 86400000000000
data['july_dtd'] = data['july_dtd'].astype(int) / 86400000000000
del data['May_Delivery']
del data['July_Delivery']

data['dCoC_may'] = (data['beans_spot'] - data['beans_may']) / data['may_dtd'] 
data['dCoC_july'] = (data['beans_spot'] - data['beans_july']) / data['july_dtd']




plt.plot(data.index,data['dCoC_may'],label="Beans For May Delivery")
plt.plot(data.index,data['dCoC_july'],label="Beans For July Delivery")
plt.legend(loc='upper left')
plt.title("Comparing Implied Cost of Carry Between May and July Soybeans")
plt.xlabel("Date")
plt.ylabel("Average Daily Implied Cost of Carry")
plt.show()
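
The 2015 and 2016 cells that follow repeat this block with only the dates and contract symbols changed. As a hedged sketch (hypothetical function name, reusing the approximate expirations above, and assuming get_pricing, history, symbols, and plt are available as in this notebook), the whole per-year chart could be wrapped in one function:

import pandas as pd

def plot_quarterly_carry(year, may_sym, july_sym):
    start, end = '%d-01-01' % year, '%d-03-31' % year
    spot = get_pricing(['SOYB'], fields='close_price',
                       start_date=start, end_date=end, frequency='daily')
    futs = history(symbols([may_sym, july_sym]), fields='price',
                   frequency='daily', start_date=start, end_date=end)
    df = spot.join(futs)
    df.columns = ['beans_spot', 'beans_may', 'beans_july']
    may_dtd = (pd.Timestamp('%d-05-12' % year, tz='UTC') - df.index).days.values   # approximate expirations
    july_dtd = (pd.Timestamp('%d-07-14' % year, tz='UTC') - df.index).days.values
    plt.plot(df.index, (df['beans_spot'] - df['beans_may']) / may_dtd, label="Beans For May Delivery")
    plt.plot(df.index, (df['beans_spot'] - df['beans_july']) / july_dtd, label="Beans For July Delivery")
    plt.legend(loc='upper left')
    plt.title("Comparing Implied Cost of Carry Between May and July Soybeans (%d)" % year)
    plt.xlabel("Date")
    plt.ylabel("Average Daily Implied Cost of Carry")
    plt.show()

# e.g. plot_quarterly_carry(2015, 'SYK15', 'SYN15')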

2015 Chart

In [64]:
spot_prices = get_pricing(
    ['SOYB'], 
    fields='close_price', 
    start_date='2015-01-01', 
    end_date = '2015-03-31', 
    frequency='daily',
)    
#spot_prices.head(3)

sy_contracts = symbols(['SYK15', 'SYN15'])

sy_consecutive_contract_price = history(
    sy_contracts, 
    fields='price', 
    frequency='daily', 
    start_date='2015-01-01', 
    end_date='2015-03-31'
)
#sy_consecutive_contract_price.head()

data = spot_prices.join(sy_consecutive_contract_price)
#print (data.columns)
data.columns = ['beans_spot','beans_may','beans_july']

data['May_Delivery'] = datetime.datetime(2015,5,12) #fix later
data['July_Delivery'] = datetime.datetime(2015,7,14) # fix later
data['may_dtd'] = data.May_Delivery - data.index
data['july_dtd'] = data.July_Delivery - data.index
data['may_dtd'] = data['may_dtd'].astype(int) / 86400000000000
data['july_dtd'] = data['july_dtd'].astype(int) / 86400000000000
del data['May_Delivery']
del data['July_Delivery']

data['dCoC_may'] = (data['beans_spot'] - data['beans_may']) / data['may_dtd'] 
data['dCoC_july'] = (data['beans_spot'] - data['beans_july']) / data['july_dtd']

plt.plot(data.index,data['dCoC_may'],label="Beans For May Delivery")
plt.plot(data.index,data['dCoC_july'],label="Beans For July Delivery")
plt.legend(loc='upper left')
plt.title("Comparing Implied Cost of Carry Between May and July Soybeans")
plt.xlabel("Date")
plt.ylabel("Average Daily Implied Cost of Carry")
plt.show()

2016 Chart

In [65]:
spot_prices = get_pricing(
    ['SOYB'], 
    fields='close_price', 
    start_date='2016-01-01', 
    end_date = '2016-03-31', 
    frequency='daily',
)    
#spot_prices.head(3)

sy_contracts = symbols(['SYK16', 'SYN16'])

sy_consecutive_contract_price = history(
    sy_contracts, 
    fields='price', 
    frequency='daily', 
    start_date='2016-01-01', 
    end_date='2016-03-31'
)
#sy_consecutive_contract_price.head()

data = spot_prices.join(sy_consecutive_contract_price)
#print (data.columns)
data.columns = ['beans_spot','beans_may','beans_july']

data['May_Delivery'] = datetime.datetime(2016,5,12) #fix later
data['July_Delivery'] = datetime.datetime(2016,7,14) # fix later
data['may_dtd'] = data.May_Delivery - data.index
data['july_dtd'] = data.July_Delivery - data.index
data['may_dtd'] = data['may_dtd'].astype(int) / 86400000000000
data['july_dtd'] = data['july_dtd'].astype(int) / 86400000000000
del data['May_Delivery']
del data['July_Delivery']

data['dCoC_may'] = (data['beans_spot'] - data['beans_may']) / data['may_dtd'] 
data['dCoC_july'] = (data['beans_spot'] - data['beans_july']) / data['july_dtd']

plt.plot(data.index,data['dCoC_may'],label="Beans For May Delivery")
plt.plot(data.index,data['dCoC_july'],label="Beans For July Delivery")
plt.legend(loc='upper left')
plt.title("Comparing Implied Cost of Carry Between May and July Soybeans")
plt.xlabel("Date")
plt.ylabel("Average Daily Implied Cost of Carry")
plt.show()

Chart cost of carry for the entire time frame

In [66]:
spot_prices = get_pricing(
    ['SOYB'], 
    fields='close_price', 
    start_date='2014-01-01', 
    end_date = '2017-03-31', 
    frequency='daily',
)    
#spot_prices.head(3)

sy_contracts = symbols(['SYK17', 'SYN17','SYK16','SYN16',
                       'SYK15', 'SYN15','SYK14','SYN14'])

sy_consecutive_contract_price = history(
    sy_contracts, 
    fields='price', 
    frequency='daily', 
    start_date='2014-01-01', 
    end_date='2017-03-31'
)
#sy_consecutive_contract_price.head()

data = spot_prices.join(sy_consecutive_contract_price)

data.columns = ['beans_spot','beans_may_17','beans_july_17','beans_may_16','beans_july_16',
               'beans_may_15','beans_july_15','beans_may_14','beans_july_14']

data['May_Delivery_17'] = datetime.datetime(2017,5,12)
data['July_Delivery_17'] = datetime.datetime(2017,7,14) 
data['may_dtd_17'] = data.May_Delivery_17 - data.index
data['july_dtd_17'] = data.July_Delivery_17 - data.index
data['may_dtd_17'] = data['may_dtd_17'].astype(int) / 86400000000000
data['july_dtd_17'] = data['july_dtd_17'].astype(int) / 86400000000000
del data['May_Delivery_17']
del data['July_Delivery_17']
data['dCoC_may_17'] = (data['beans_spot'] - data['beans_may_17']) / data['may_dtd_17'] 
data['dCoC_july_17'] = (data['beans_spot'] - data['beans_july_17']) / data['july_dtd_17']

data['May_Delivery_16'] = datetime.datetime(2016,5,12)
data['July_Delivery_16'] = datetime.datetime(2016,7,14) 
data['may_dtd_16'] = data.May_Delivery_16 - data.index
data['july_dtd_16'] = data.July_Delivery_16 - data.index
data['may_dtd_16'] = data['may_dtd_16'].astype(int) / 86400000000000
data['july_dtd_16'] = data['july_dtd_16'].astype(int) / 86400000000000
del data['May_Delivery_16']
del data['July_Delivery_16']
data['dCoC_may_16'] = (data['beans_spot'] - data['beans_may_16']) / data['may_dtd_16'] 
data['dCoC_july_16'] = (data['beans_spot'] - data['beans_july_16']) / data['july_dtd_16']

data['May_Delivery_15'] = datetime.datetime(2015,5,12)
data['July_Delivery_15'] = datetime.datetime(2015,7,14) 
data['may_dtd_15'] = data.May_Delivery_15 - data.index
data['july_dtd_15'] = data.July_Delivery_15 - data.index
data['may_dtd_15'] = data['may_dtd_15'].astype(int) / 86400000000000
data['july_dtd_15'] = data['july_dtd_15'].astype(int) / 86400000000000
del data['May_Delivery_15']
del data['July_Delivery_15']
data['dCoC_may_15'] = (data['beans_spot'] - data['beans_may_15']) / data['may_dtd_15'] 
data['dCoC_july_15'] = (data['beans_spot'] - data['beans_july_15']) / data['july_dtd_15']


data['May_Delivery_14'] = datetime.datetime(2014,5,12)
data['July_Delivery_14'] = datetime.datetime(2014,7,14) 
data['may_dtd_14'] = data.May_Delivery_14 - data.index
data['july_dtd_14'] = data.July_Delivery_14 - data.index
data['may_dtd_14'] = data['may_dtd_14'].astype(int) / 86400000000000
data['july_dtd_14'] = data['july_dtd_14'].astype(int) / 86400000000000
del data['May_Delivery_14']
del data['July_Delivery_14']
data['dCoC_may_14'] = (data['beans_spot'] - data['beans_may_14']) / data['may_dtd_14'] 
data['dCoC_july_14'] = (data['beans_spot'] - data['beans_july_14']) / data['july_dtd_14']


##### Charting Again.....

# Step 2
plt.plot(data.index,data['dCoC_may_17'],label="Beans For May 17 Delivery")
plt.plot(data.index,data['dCoC_july_17'],label="Beans For July 17 Delivery")
plt.plot(data.index,data['dCoC_may_16'],label="Beans For May 16 Delivery")
plt.plot(data.index,data['dCoC_july_16'],label="Beans For July 16 Delivery")
plt.plot(data.index,data['dCoC_may_15'],label="Beans For May 15 Delivery")
plt.plot(data.index,data['dCoC_july_15'],label="Beans For July 15 Delivery")
plt.plot(data.index,data['dCoC_may_14'],label="Beans For May 14 Delivery")
plt.plot(data.index,data['dCoC_july_14'],label="Beans For July 14 Delivery")
# Step 3
plt.legend(loc='upper left')
# Step 4
plt.title("Comparing Implied Cost of Carry Between May and July Soybeans")
plt.xlabel("Date")
plt.ylabel("Average Daily Implied Cost of Carry")
# Something new
plt.ylim(0,1) # Adjust y limits if the chart is not looking good on the default scale
# Step 5 
plt.show()

Futures markets as they should be....


Let's make the 2017 data our 'out of sample' data, and 2011-2016 our 'in sample' data. Models will be trained on 1Q data from 2011-2016 and evaluated on the 2017 data.

To do: get the actual delivery dates for all of these contracts in the future

Copy Paste Append, Repeat
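
The year blocks in the next cell differ only in their dates and contract symbols. As a hedged sketch (hypothetical function name, reusing the approximate May 12 / July 14 expirations used below), the whole copy-paste routine could collapse into one helper plus a concat:

# Hedged sketch: build one quarter's feature frame, assuming get_pricing, history, and symbols
# are available as above
import pandas as pd

EQUITIES = ['SOYB', 'USO', 'TLT', 'BSV', 'PSA', 'SPY', 'XIV', 'IBM', 'MCD', 'JNJ', 'XOM']
COL_NAMES = ['beans_spot', 'oil_spot', 'lt_rate_spot', 'st_rate_spot', 'storage_spot',
             'SPY', 'XIV', 'IBM', 'MCD', 'JNJ', 'XOM', 'beans_may', 'beans_july']

def build_quarter_frame(year, may_sym, july_sym):
    start, end = '%d-01-01' % year, '%d-03-31' % year
    spot = get_pricing(EQUITIES, fields='close_price',
                       start_date=start, end_date=end, frequency='daily')
    futs = history(symbols([may_sym, july_sym]), fields='price',
                   frequency='daily', start_date=start, end_date=end)
    df = spot.join(futs)
    df.columns = COL_NAMES
    df['may_dtd'] = (pd.Timestamp('%d-05-12' % year, tz='UTC') - df.index).days.values
    df['july_dtd'] = (pd.Timestamp('%d-07-14' % year, tz='UTC') - df.index).days.values
    df['dCoC_may'] = (df['beans_spot'] - df['beans_may']) / df['may_dtd']
    df['dCoC_july'] = (df['beans_spot'] - df['beans_july']) / df['july_dtd']
    return df

# e.g. all_data = pd.concat([build_quarter_frame(y, 'SYK%02d' % (y % 100), 'SYN%02d' % (y % 100))
#                            for y in range(2011, 2017)])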

In [67]:
# Build a frame with in sample data 
########## 2011 ###########################################################################
spot_prices = get_pricing(
    ['SOYB','USO','TLT','BSV','PSA',
    'SPY','XIV','IBM','MCD','JNJ','XOM'], 
    fields='close_price', 
    start_date='2011-01-01', 
    end_date = '2011-03-31', 
    frequency='daily',
)    
#spot_prices.head(3)

sy_contracts = symbols(['SYK11', 'SYN11'])

sy_consecutive_contract_price = history(
    sy_contracts, 
    fields='price', 
    frequency='daily', 
    start_date='2011-01-01', 
    end_date='2011-03-31'
)
#sy_consecutive_contract_price.head()

data = spot_prices.join(sy_consecutive_contract_price)
#print (data.columns)
#data.columns = ['beans_spot','beans_may','beans_july']
#data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot','beans_may','beans_july']
data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot',
                'SPY','XIV','IBM','MCD','JNJ','XOM',
                'beans_may','beans_july']

data['May_Delivery'] = datetime.datetime(2011,5,12) # fix with real date later
data['July_Delivery'] = datetime.datetime(2011,7,14) # fix with real date later
data['may_dtd'] = data.May_Delivery - data.index
data['july_dtd'] = data.July_Delivery - data.index
data['may_dtd'] = data['may_dtd'].astype(int) / 86400000000000
data['july_dtd'] = data['july_dtd'].astype(int) / 86400000000000
del data['May_Delivery']
del data['July_Delivery']

data['dCoC_may'] = (data['beans_spot'] - data['beans_may']) / data['may_dtd'] 
data['dCoC_july'] = (data['beans_spot'] - data['beans_july']) / data['july_dtd']
#####################################################################################################

all_data = data # Make a different frame

########## 2012 ###########################################################################
spot_prices = get_pricing(
    ['SOYB','USO','TLT','BSV','PSA',
    'SPY','XIV','IBM','MCD','JNJ','XOM'], 
    fields='close_price', 
    start_date='2012-01-01', 
    end_date = '2012-03-31', 
    frequency='daily',
)    
#spot_prices.head(3)

sy_contracts = symbols(['SYK12', 'SYN12'])

sy_consecutive_contract_price = history(
    sy_contracts, 
    fields='price', 
    frequency='daily', 
    start_date='2012-01-01', 
    end_date='2012-03-31'
)
#sy_consecutive_contract_price.head()

data = spot_prices.join(sy_consecutive_contract_price)
#print (data.columns)
#data.columns = ['beans_spot','beans_may','beans_july']
#data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot','beans_may','beans_july']
data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot',
                'SPY','XIV','IBM','MCD','JNJ','XOM',
                'beans_may','beans_july']

data['May_Delivery'] = datetime.datetime(2012,5,12) # fix with real date later
data['July_Delivery'] = datetime.datetime(2012,7,14) # fix with real date later
data['may_dtd'] = data.May_Delivery - data.index
data['july_dtd'] = data.July_Delivery - data.index
data['may_dtd'] = data['may_dtd'].astype(int) / 86400000000000
data['july_dtd'] = data['july_dtd'].astype(int) / 86400000000000
del data['May_Delivery']
del data['July_Delivery']

data['dCoC_may'] = (data['beans_spot'] - data['beans_may']) / data['may_dtd'] 
data['dCoC_july'] = (data['beans_spot'] - data['beans_july']) / data['july_dtd']
#####################################################################################################

#all_data = data # Make a different frame
all_data = all_data.append(data)

########## 2013 ###########################################################################
spot_prices = get_pricing(
    ['SOYB','USO','TLT','BSV','PSA',
    'SPY','XIV','IBM','MCD','JNJ','XOM'], 
    fields='close_price', 
    start_date='2013-01-01', 
    end_date = '2013-03-31', 
    frequency='daily',
)    
#spot_prices.head(3)

sy_contracts = symbols(['SYK13', 'SYN13'])

sy_consecutive_contract_price = history(
    sy_contracts, 
    fields='price', 
    frequency='daily', 
    start_date='2013-01-01', 
    end_date='2013-03-31'
)
#sy_consecutive_contract_price.head()

data = spot_prices.join(sy_consecutive_contract_price)
#print (data.columns)
#data.columns = ['beans_spot','beans_may','beans_july']
#data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot','beans_may','beans_july']
data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot',
                'SPY','XIV','IBM','MCD','JNJ','XOM',
                'beans_may','beans_july']

data['May_Delivery'] = datetime.datetime(2013,5,12) # fix with real date later
data['July_Delivery'] = datetime.datetime(2013,7,14) # fix with real date later
data['may_dtd'] = data.May_Delivery - data.index
data['july_dtd'] = data.July_Delivery - data.index
data['may_dtd'] = data['may_dtd'].astype(int) / 86400000000000
data['july_dtd'] = data['july_dtd'].astype(int) / 86400000000000
del data['May_Delivery']
del data['July_Delivery']

data['dCoC_may'] = (data['beans_spot'] - data['beans_may']) / data['may_dtd'] 
data['dCoC_july'] = (data['beans_spot'] - data['beans_july']) / data['july_dtd']
#####################################################################################################

#all_data = data # Make a different frame
all_data = all_data.append(data)

########## 2014 ###########################################################################
spot_prices = get_pricing(
    ['SOYB','USO','TLT','BSV','PSA',
    'SPY','XIV','IBM','MCD','JNJ','XOM'], 
    fields='close_price', 
    start_date='2014-01-01', 
    end_date = '2014-03-31', 
    frequency='daily',
)    
#spot_prices.head(3)

sy_contracts = symbols(['SYK14', 'SYN14'])

sy_consecutive_contract_price = history(
    sy_contracts, 
    fields='price', 
    frequency='daily', 
    start_date='2014-01-01', 
    end_date='2014-03-31'
)
#sy_consecutive_contract_price.head()

data = spot_prices.join(sy_consecutive_contract_price)
#print (data.columns)
#data.columns = ['beans_spot','beans_may','beans_july']
#data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot','beans_may','beans_july']
data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot',
                'SPY','XIV','IBM','MCD','JNJ','XOM',
                'beans_may','beans_july']

data['May_Delivery'] = datetime.datetime(2014,5,12) # fix with real date later
data['July_Delivery'] = datetime.datetime(2014,7,14) # fix with real date later
data['may_dtd'] = data.May_Delivery - data.index
data['july_dtd'] = data.July_Delivery - data.index
data['may_dtd'] = data['may_dtd'].astype(int) / 86400000000000
data['july_dtd'] = data['july_dtd'].astype(int) / 86400000000000
del data['May_Delivery']
del data['July_Delivery']

data['dCoC_may'] = (data['beans_spot'] - data['beans_may']) / data['may_dtd'] 
data['dCoC_july'] = (data['beans_spot'] - data['beans_july']) / data['july_dtd']
#####################################################################################################

#all_data = data # Make a different frame
all_data = all_data.append(data)

########## 2015 ###########################################################################
spot_prices = get_pricing(
    ['SOYB','USO','TLT','BSV','PSA',
    'SPY','XIV','IBM','MCD','JNJ','XOM'], 
    fields='close_price', 
    start_date='2015-01-01', 
    end_date = '2015-03-31', 
    frequency='daily',
)    
#spot_prices.head(3)

sy_contracts = symbols(['SYK15', 'SYN15'])

sy_consecutive_contract_price = history(
    sy_contracts, 
    fields='price', 
    frequency='daily', 
    start_date='2015-01-01', 
    end_date='2015-03-31'
)
#sy_consecutive_contract_price.head()

data = spot_prices.join(sy_consecutive_contract_price)
#print (data.columns)
#data.columns = ['beans_spot','beans_may','beans_july']
#data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot','beans_may','beans_july']
data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot',
                'SPY','XIV','IBM','MCD','JNJ','XOM',
                'beans_may','beans_july']

data['May_Delivery'] = datetime.datetime(2015,5,12) # fix with real date later
data['July_Delivery'] = datetime.datetime(2015,7,14) # fix with real date later
data['may_dtd'] = data.May_Delivery - data.index
data['july_dtd'] = data.July_Delivery - data.index
data['may_dtd'] = data['may_dtd'].astype(int) / 86400000000000
data['july_dtd'] = data['july_dtd'].astype(int) / 86400000000000
del data['May_Delivery']
del data['July_Delivery']

data['dCoC_may'] = (data['beans_spot'] - data['beans_may']) / data['may_dtd'] 
data['dCoC_july'] = (data['beans_spot'] - data['beans_july']) / data['july_dtd']
#####################################################################################################

####### Append 2015 to 2014 ########################
all_data = all_data.append(data)
#####################################################

########## 2016 ###########################################################################
spot_prices = get_pricing(
    ['SOYB','USO','TLT','BSV','PSA',
    'SPY','XIV','IBM','MCD','JNJ','XOM'], 
    fields='close_price', 
    start_date='2016-01-01', 
    end_date = '2016-03-31', 
    frequency='daily',
)    
#spot_prices.head(3)

sy_contracts = symbols(['SYK16', 'SYN16'])

sy_consecutive_contract_price = history(
    sy_contracts, 
    fields='price', 
    frequency='daily', 
    start_date='2016-01-01', 
    end_date='2016-03-31'
)
#sy_consecutive_contract_price.head()

data = spot_prices.join(sy_consecutive_contract_price)
#print (data.columns)
#data.columns = ['beans_spot','beans_may','beans_july']
#data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot','beans_may','beans_july']
data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot',
                'SPY','XIV','IBM','MCD','JNJ','XOM',
                'beans_may','beans_july']

data['May_Delivery'] = datetime.datetime(2016,5,12) # fix with real date later
data['July_Delivery'] = datetime.datetime(2016,7,14) # fix with real date later
data['may_dtd'] = data.May_Delivery - data.index
data['july_dtd'] = data.July_Delivery - data.index
data['may_dtd'] = data['may_dtd'].astype(int) / 86400000000000
data['july_dtd'] = data['july_dtd'].astype(int) / 86400000000000
del data['May_Delivery']
del data['July_Delivery']

data['dCoC_may'] = (data['beans_spot'] - data['beans_may']) / data['may_dtd'] 
data['dCoC_july'] = (data['beans_spot'] - data['beans_july']) / data['july_dtd']
#####################################################################################################

####### Append 2016 to 2014 and 2015 ########################
all_data = all_data.append(data)
#####################################################

# play heads and tails to confirm
all_data.head(3)
Out[67]:
beans_spot oil_spot lt_rate_spot st_rate_spot storage_spot SPY XIV IBM MCD JNJ XOM beans_may beans_july may_dtd july_dtd dCoC_may dCoC_july
2011-01-03 00:00:00+00:00 NaN 39.06 92.803 80.144 102.760 126.500 123.80 146.848 75.979 62.237 74.119 13.882 13.918 129.0 192.0 NaN NaN
2011-01-04 00:00:00+00:00 NaN 38.07 92.893 80.124 100.298 126.380 121.35 147.088 73.727 62.742 74.517 13.785 13.810 128.0 191.0 NaN NaN
2011-01-05 00:00:00+00:00 NaN 38.50 90.886 79.945 101.231 127.077 125.99 146.450 74.074 62.742 74.328 13.978 13.998 127.0 190.0 NaN NaN
In [68]:
all_data.tail(3)
Out[68]:
beans_spot oil_spot lt_rate_spot st_rate_spot storage_spot SPY XIV IBM MCD JNJ XOM beans_may beans_july may_dtd july_dtd dCoC_may dCoC_july
2016-03-29 00:00:00+00:00 18.128 9.79 131.06 80.490 274.45 205.14 25.43 149.37 123.99 109.150 84.51 9.170 9.232 44.0 107.0 0.203591 0.083140
2016-03-30 00:00:00+00:00 17.970 9.73 129.72 80.545 273.42 206.07 26.07 148.40 125.83 109.030 84.52 9.082 9.152 43.0 106.0 0.206698 0.083189
2016-03-31 00:00:00+00:00 18.021 9.70 130.53 80.614 275.70 205.56 25.88 151.48 125.86 108.175 83.61 9.098 9.168 42.0 105.0 0.212452 0.084314

Dealing with NaNs

In [69]:
print "Some Nans in the ETF:",all_data.beans_spot.isnull().sum()
print ("Dropping First 70 Rows...")
all_data = all_data.dropna()
print "New Count of Nans in Data:",all_data.beans_spot.isnull().sum()
Some Nans in the ETF: 70
Dropping First 70 Rows...
New Count of Nans in Data: 0
In [22]:
all_data.head(2)
Out[22]:
beans_spot oil_spot lt_rate_spot st_rate_spot storage_spot SPY XIV IBM MCD JNJ XOM beans_may beans_july may_dtd july_dtd dCoC_may dCoC_july
2012-01-03 00:00:00+00:00 22.29 39.69 118.884 80.620 134.002 127.033 6.86 185.538 98.121 65.244 85.472 12.37 12.460 130.0 193.0 0.076308 0.050933
2012-01-04 00:00:00+00:00 22.16 39.79 117.471 80.654 130.284 127.123 7.00 184.820 98.727 64.877 85.582 12.40 12.488 129.0 192.0 0.075659 0.050375
In [23]:
all_data.tail(2)
Out[23]:
beans_spot oil_spot lt_rate_spot st_rate_spot storage_spot SPY XIV IBM MCD JNJ XOM beans_may beans_july may_dtd july_dtd dCoC_may dCoC_july
2016-03-30 00:00:00+00:00 17.970 9.73 129.72 80.545 273.42 206.07 26.07 148.40 125.83 109.030 84.52 9.082 9.152 43.0 106.0 0.206698 0.083189
2016-03-31 00:00:00+00:00 18.021 9.70 130.53 80.614 275.70 205.56 25.88 151.48 125.86 108.175 83.61 9.098 9.168 42.0 105.0 0.212452 0.084314

Let's make a target variable: the spot price of beans on the next day

In [70]:
all_data['beans_next_day'] = all_data['beans_spot'].shift(-1) # Shift the spot price up one row so each row holds the next day's price
all_data = all_data.dropna() # drop the last row, which has no next-day value
all_data.head(3) # Know your pandas
Out[70]:
beans_spot oil_spot lt_rate_spot st_rate_spot storage_spot SPY XIV IBM MCD JNJ XOM beans_may beans_july may_dtd july_dtd dCoC_may dCoC_july beans_next_day
2012-01-03 00:00:00+00:00 22.29 39.69 118.884 80.620 134.002 127.033 6.86 185.538 98.121 65.244 85.472 12.370 12.460 130.0 193.0 0.076308 0.050933 22.160
2012-01-04 00:00:00+00:00 22.16 39.79 117.471 80.654 130.284 127.123 7.00 184.820 98.727 64.877 85.582 12.400 12.488 129.0 192.0 0.075659 0.050375 21.850
2012-01-05 00:00:00+00:00 21.85 39.18 117.262 80.610 131.811 127.501 7.15 183.934 99.134 64.817 85.263 12.183 12.272 128.0 191.0 0.075523 0.050147 21.502
In [71]:
all_data.tail(3)
Out[71]:
beans_spot oil_spot lt_rate_spot st_rate_spot storage_spot SPY XIV IBM MCD JNJ XOM beans_may beans_july may_dtd july_dtd dCoC_may dCoC_july beans_next_day
2016-03-28 00:00:00+00:00 18.020 10.01 129.69 80.270 269.70 203.25 24.03 148.39 123.19 108.23 84.22 9.093 9.162 45.0 108.0 0.198378 0.082019 18.128
2016-03-29 00:00:00+00:00 18.128 9.79 131.06 80.490 274.45 205.14 25.43 149.37 123.99 109.15 84.51 9.170 9.232 44.0 107.0 0.203591 0.083140 17.970
2016-03-30 00:00:00+00:00 17.970 9.73 129.72 80.545 273.42 206.07 26.07 148.40 125.83 109.03 84.52 9.082 9.152 43.0 106.0 0.206698 0.083189 18.021
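
For reference, here is what shift(-1) does on a toy series (made-up values, not notebook data): each row ends up holding the next row's value, and the final row becomes NaN, which is why the dropna() above removes it.

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['day1', 'day2', 'day3'])
print(s.shift(-1))
# day1    2.0   <- next day's value pulled up into this row
# day2    3.0
# day3    NaN   <- no next day, so this row gets dropped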

Plot a heatmap

Reference Seong's Notebook - https://www.quantopian.com/posts/research-clonable-do-you-want-parameter-optimization-click-here-to-get-started-heat-maps-included

In [72]:
import numpy as np
import matplotlib.pyplot as pyplot
import pandas as pd
from collections import defaultdict

X_axis = [x for x in all_data.columns]
Y_axis = [x for x in all_data.columns]

Pearson = defaultdict(dict)

for feature_x in X_axis:
    for feature_y in Y_axis:    
        Pearson_R = np.corrcoef(all_data[feature_x],all_data[feature_y])[0,1]
        Pearson[feature_x][feature_y] = Pearson_R

Pearson = pd.DataFrame(Pearson)
Pearson.index.name = "Y Axis"
Pearson.columns.name = "X Axis"


def heat_map(df):
    fig = pyplot.figure()
    ax = fig.add_subplot(111)
    axim = ax.imshow(df.values,cmap = pyplot.get_cmap('RdYlGn'), interpolation = 'nearest')
    ax.set_xlabel(df.columns.name)
    ax.set_xticks(np.arange(len(df.columns)))
    ax.set_xticklabels(list(df.columns))
    ax.set_ylabel(df.index.name)
    ax.set_yticks(np.arange(len(df.index)))
    ax.set_yticklabels(list(df.index))
    ax.set_title("Pearson R Between All Data")
    plt.xticks(rotation=90)
    pyplot.colorbar(axim)
    
#: Plot our heatmap
heat_map(Pearson)
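
Design note: the nested loop above reproduces what pandas' own corr() computes in one call; a hedged alternative that yields the same Pearson matrix with less code:

# Same correlation matrix straight from pandas; reuse heat_map() from above to plot it
Pearson_alt = all_data.corr(method='pearson')
Pearson_alt.index.name = "Y Axis"
Pearson_alt.columns.name = "X Axis"
heat_map(Pearson_alt)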

Section 3 - Develop Machine Learning Models and Test Predictions

Rise of the Machines

In [73]:
print "Know your Pandas!"
print "Length of Data Frame:",len(all_data)
Know your Pandas!
Length of Data Frame: 296

Train a few ML models.

In [74]:
# Import candidate models; only the four regressors marked below are actually used
from sklearn.ensemble import ExtraTreesRegressor # Extra Trees Model #1
#from sklearn.neural_network import MLPClassifier # Not this one yet :(
from sklearn import linear_model # Lasso model #2
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors.nearest_centroid import NearestCentroid
from sklearn.naive_bayes import GaussianNB
from sklearn import tree # Decision Tree Model #3
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.ensemble import GradientBoostingRegressor # Gradient Boosting Regressor #4
In [75]:
all_data = all_data.dropna()
cols_to_use = ['beans_spot','beans_may','beans_july','may_dtd','july_dtd','dCoC_may','dCoC_july',
              'oil_spot','lt_rate_spot','st_rate_spot','storage_spot',
              'SPY','XIV','IBM','MCD','JNJ','XOM'] # These columns will be plugged into the model
In [76]:
print (len(all_data))
296
In [77]:
#1
#http://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeRegressor.html
rfr = ExtraTreesRegressor(n_estimators=15, max_depth=4, n_jobs=-1, random_state=17, verbose=0)
model_1 = rfr.fit(all_data[cols_to_use], all_data['beans_next_day'].values)
all_data['trees_pred'] = model_1.predict(all_data[cols_to_use])

#2
# http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
reg = linear_model.Lasso(alpha = 0.1)
model_2 = reg.fit(all_data[cols_to_use], all_data['beans_next_day'].values)
all_data['lasso_pred'] = model_2.predict(all_data[cols_to_use])

#3 
# http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
clf = tree.DecisionTreeRegressor()
model_3 = clf.fit(all_data[cols_to_use], all_data['beans_next_day'].values)
all_data['d_tree_pred'] = model_3.predict(all_data[cols_to_use])

#4 
# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
gbr = GradientBoostingRegressor(n_estimators=25, max_depth=1)
model_4 = gbr.fit(all_data[cols_to_use], all_data['beans_next_day'].values)
all_data['gbr_pred'] = model_4.predict(all_data[cols_to_use])

all_data.head()
Out[77]:
beans_spot oil_spot lt_rate_spot st_rate_spot storage_spot SPY XIV IBM MCD JNJ ... beans_july may_dtd july_dtd dCoC_may dCoC_july beans_next_day trees_pred lasso_pred d_tree_pred gbr_pred
2012-01-03 00:00:00+00:00 22.290 39.69 118.884 80.620 134.002 127.033 6.86 185.538 98.121 65.244 ... 12.460 130.0 193.0 0.076308 0.050933 22.160 21.958738 22.335042 22.160 22.094363
2012-01-04 00:00:00+00:00 22.160 39.79 117.471 80.654 130.284 127.123 7.00 184.820 98.727 64.877 ... 12.488 129.0 192.0 0.075659 0.050375 21.850 22.037390 22.231920 21.850 22.094363
2012-01-05 00:00:00+00:00 21.850 39.18 117.262 80.610 131.811 127.501 7.15 183.934 99.134 64.817 ... 12.272 128.0 191.0 0.075523 0.050147 21.502 21.678980 21.994559 21.502 21.750433
2012-01-06 00:00:00+00:00 21.502 39.23 118.167 80.709 131.216 127.232 7.26 181.812 99.888 64.242 ... 12.120 127.0 190.0 0.074567 0.049379 22.140 21.720583 21.670597 22.140 21.750433
2012-01-09 00:00:00+00:00 22.140 39.06 117.968 80.749 131.037 127.441 7.36 180.886 98.925 64.351 ... 12.498 124.0 187.0 0.078403 0.051561 22.072 22.007340 22.128152 22.072 22.094363

5 rows × 22 columns

In [78]:
stack_cols = ['beans_spot','beans_may','beans_july','may_dtd','july_dtd','dCoC_may','dCoC_july',
             'trees_pred','lasso_pred','d_tree_pred','gbr_pred',
             'oil_spot','lt_rate_spot','st_rate_spot','storage_spot',
             'SPY','XIV','IBM','MCD','JNJ','XOM']

stacked = ExtraTreesRegressor(n_estimators=15, max_depth=4, n_jobs=-1, random_state=17, verbose=0)
stacked_model = stacked.fit(all_data[stack_cols], all_data['beans_next_day'].values)
all_data['stack_pred'] = stacked_model.predict(all_data[stack_cols])
all_data.head(3)
Out[78]:
beans_spot oil_spot lt_rate_spot st_rate_spot storage_spot SPY XIV IBM MCD JNJ ... may_dtd july_dtd dCoC_may dCoC_july beans_next_day trees_pred lasso_pred d_tree_pred gbr_pred stack_pred
2012-01-03 00:00:00+00:00 22.29 39.69 118.884 80.620 134.002 127.033 6.86 185.538 98.121 65.244 ... 130.0 193.0 0.076308 0.050933 22.160 21.958738 22.335042 22.160 22.094363 22.133667
2012-01-04 00:00:00+00:00 22.16 39.79 117.471 80.654 130.284 127.123 7.00 184.820 98.727 64.877 ... 129.0 192.0 0.075659 0.050375 21.850 22.037390 22.231920 21.850 22.094363 21.902897
2012-01-05 00:00:00+00:00 21.85 39.18 117.262 80.610 131.811 127.501 7.15 183.934 99.134 64.817 ... 128.0 191.0 0.075523 0.050147 21.502 21.678980 21.994559 21.502 21.750433 21.530734

3 rows × 23 columns
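
One caveat worth flagging: each base model's prediction column was generated on the exact rows that model was trained on, so the stacked model is being fed in-sample predictions, a classic source of leakage in stacking. A minimal sketch of one common remedy, out-of-fold predictions, assuming scikit-learn's cross_val_predict is available (on older installs it lives in sklearn.cross_validation rather than sklearn.model_selection):

from sklearn.model_selection import cross_val_predict  # older scikit-learn: from sklearn.cross_validation import cross_val_predict

# Each row is predicted by a model that never saw that row during training
all_data['trees_pred_oof'] = cross_val_predict(
    ExtraTreesRegressor(n_estimators=15, max_depth=4, random_state=17),
    all_data[cols_to_use], all_data['beans_next_day'].values, cv=5)

Note that contiguous K-fold splits still let some future information influence predictions on earlier dates, so a time-aware split would be a further refinement.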

The Q platform exposes feature importance data for 2 of the 4 models used in the stacked model. Bar charts of these importances are below.

In [79]:
#print stacked_model.feature_importances_ # Not working for extratrees
#print model_1.feature_importances_ # Not working for extratrees
#print model_2.feature_importances_ # Not an attribute for lasso
print model_3.feature_importances_ # Decision Tree Regressor
print model_4.feature_importances_ # Gradient Boosting Regressor
print ("Use these lists for bar charts")
[  1.07775304e-01   6.78169431e-02   8.03035820e-01   5.52975805e-04
   1.47436138e-03   1.50767518e-03   1.06282179e-02   4.41908566e-04
   4.79576457e-04   5.17862089e-04   1.03019158e-03   3.48141522e-04
   4.41415938e-04   1.91422789e-03   1.26456928e-03   5.93656641e-05
   7.11443129e-04]
[ 0.28  0.4   0.04  0.    0.    0.    0.    0.08  0.    0.    0.08  0.    0.
  0.    0.12  0.    0.  ]
Use these lists for bar charts
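
Rather than copying the printed arrays into the cells below by hand (note the 'ALTER THIS' comments), the importances can be plotted straight from the fitted models; a minimal sketch using the Gradient Boosting Regressor already in scope:

# Bar chart of feature importances taken directly from the fitted model
y_pos = np.arange(len(cols_to_use))
plt.bar(y_pos, model_4.feature_importances_, align='center', alpha=0.5)
plt.xticks(y_pos, cols_to_use, rotation=90)
plt.ylabel('Numeric Importance To Model')
plt.title('Feature Importance - Gradient Boosting Regressor Model')
plt.show()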
In [34]:
### Reference - https://pythonspot.com/en/matplotlib-bar-chart/

 
objects = cols_to_use ##### ALTER THIS WITH YOUR COLUMNS
y_pos = np.arange(len(objects))
performance = [  1.08052637e-01,   6.75205609e-02,   8.02818292e-01,   6.20421619e-04,
   1.41083053e-03,   1.47561752e-03,   1.06339815e-02,   4.64172859e-04,
   5.71266355e-04,   3.62098658e-04,   1.02916181e-03,   3.59705139e-04,
   4.59235080e-04,   1.81326427e-03,   1.39243322e-03,   3.48512822e-04,
   6.67808510e-04] ##### ALTER THIS WITH YOUR DATA
 
plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects)

#### ALTER DESCRIPTIONS ############
plt.ylabel('Numeric Importance To Model')
plt.title('Feature Importance - Decision Tree Regressor Model')
plt.xticks(rotation=90) 
plt.show()
In [35]:
import matplotlib.pyplot as plt # plt.rcdefaults()
import numpy as np
#import matplotlib.pyplot as plt
 
objects =cols_to_use
y_pos = np.arange(len(objects))
#performance = [0.48,  0.04,  0.08,  0,    0,    0,    0,    0.12,  0,    0,    0.08,  0,    0,
#  0.08,  0.08,  0,    0.04]

performance = [ 0.28,  0.4,   0.04,  0.,    0.,    0.,    0.,    0.08,  0.,    0.,    0.04,  0.,    0.,
  0.,    0.16,  0.,    0.,  ]
 
plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Numeric Importance To Model')
plt.title('Feature Importance - Gradient Boosting Regressor Model')
plt.xticks(rotation=90) 
plt.show()

Chart model predictions on "in sample" data

In [80]:
plt.scatter(all_data.index,all_data['beans_spot'],color='red',label="Beans Spot Price")
plt.scatter(all_data.index,all_data['stack_pred'],color='blue',label="Predicted Spot Price For Beans")
plt.legend(loc='upper left')
plt.title("Comparing Stacked Model Predictions Against Actual Bean Prices")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()

Test the model on "out of sample" data

In [81]:
spot_prices = get_pricing(
    ['SOYB','USO','TLT','BSV','PSA',
    'SPY','XIV','IBM','MCD','JNJ','XOM'], 
    fields='close_price', 
    start_date='2017-01-01', 
    end_date = '2017-03-31', 
    frequency='daily',
)    
#spot_prices.head(3)

sy_contracts = symbols(['SYK17', 'SYN17'])

sy_consecutive_contract_price = history(
    sy_contracts, 
    fields='price', 
    frequency='daily', 
    start_date='2017-01-01', 
    end_date='2017-03-31'
)
#sy_consecutive_contract_price.head()

data = spot_prices.join(sy_consecutive_contract_price)
#print (data.columns)
#data.columns = ['beans_spot','beans_may','beans_july']
#data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot','beans_may','beans_july']
data.columns = ['beans_spot','oil_spot','lt_rate_spot','st_rate_spot','storage_spot',
                'SPY','XIV','IBM','MCD','JNJ','XOM',
                'beans_may','beans_july']


data['May_Delivery'] = datetime.datetime(2017,5,12) # fix with real date later
data['July_Delivery'] = datetime.datetime(2017,7,14) # fix with real date later
data['may_dtd'] = data.May_Delivery - data.index
data['july_dtd'] = data.July_Delivery - data.index
data['may_dtd'] = data['may_dtd'].astype(int) / 86400000000000
data['july_dtd'] = data['july_dtd'].astype(int) / 86400000000000
del data['May_Delivery']
del data['July_Delivery']

data['dCoC_may'] = (data['beans_spot'] - data['beans_may']) / data['may_dtd'] 
data['dCoC_july'] = (data['beans_spot'] - data['beans_july']) / data['july_dtd']

# Start Predictions
data['trees_pred'] = model_1.predict(data[cols_to_use])
data['lasso_pred'] = model_2.predict(data[cols_to_use])
data['d_tree_pred'] = model_3.predict(data[cols_to_use])
data['gbr_pred'] = model_4.predict(data[cols_to_use])

# stacked Prediction
data['stack_pred'] = stacked_model.predict(data[stack_cols])
In [82]:
test_score = data
test_score['beans_next_day'] = test_score['beans_spot'].shift(-1)
test_score = test_score.dropna()
print "R2 of Model:",stacked_model.score(test_score[stack_cols],test_score['beans_next_day'])

# .score() returns the coefficient of determination R^2 of the prediction.
# R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum()
# and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
# The best possible score is 1.0, and it can be negative (the model can be arbitrarily worse).
# A constant model that always predicts the mean of y, disregarding the input features, would score 0.0.
R2 of Model: 0.484957903699
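
R² on its own can be hard to interpret for a price series; as a hedged supplement, mean absolute error and root-mean-squared error from sklearn.metrics put the miss in price terms:

from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

stack_preds = stacked_model.predict(test_score[stack_cols])
print("MAE: ", mean_absolute_error(test_score['beans_next_day'], stack_preds))
print("RMSE:", np.sqrt(mean_squared_error(test_score['beans_next_day'], stack_preds)))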
In [83]:
plt.scatter(data.index,data['beans_spot'],color='red',label="Beans Spot Price")
plt.scatter(data.index,data['stack_pred'],color='blue',label="Predicted Spot Price For Beans")
plt.legend(loc='upper left')
plt.title("Comparing Stacked Model Predictions Against Actual Bean Prices")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()

Get the day-over-day change between predictions

In [84]:
data['signal_delta'] = data['stack_pred'] - data['stack_pred'].shift(1)
data = data.dropna()
data.tail(2)
Out[84]:
beans_spot oil_spot lt_rate_spot st_rate_spot storage_spot SPY XIV IBM MCD JNJ ... july_dtd dCoC_may dCoC_july trees_pred lasso_pred d_tree_pred gbr_pred stack_pred beans_next_day signal_delta
2017-03-29 00:00:00+00:00 18.45 10.38 121.33 79.690 219.95 235.59 74.68 173.96 128.86 124.97 ... 107.0 0.198977 0.080841 18.615580 18.581606 18.01 18.633327 17.832561 18.33 0.000000
2017-03-30 00:00:00+00:00 18.33 10.57 120.37 79.675 219.38 236.27 74.42 173.87 129.33 124.65 ... 106.0 0.202558 0.081179 18.290208 18.502070 18.01 18.297616 17.981882 18.12 0.149321

2 rows × 24 columns

Chart the deltas between predictions and the corresponding return

In [86]:
plt.plot(data.index,data['signal_delta'],color='red',label="Signal Delta")
#plt.scatter(data.index,data['stack_pred'],color='blue',label="Predicted Spot Price For Beans")
plt.legend(loc='upper left')
plt.title("Examining Large Deltas in Model Predictions")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
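
The heading above mentions the corresponding return, which this cell does not actually plot; a minimal sketch overlaying the realized next-day change in the spot price (the 'realized_delta' column is an addition for illustration, not part of the original frame):

# Realized next-day change in the spot ETF, for visual comparison with the prediction deltas
data['realized_delta'] = data['beans_next_day'] - data['beans_spot']

plt.plot(data.index, data['signal_delta'], color='red', label="Signal Delta")
plt.plot(data.index, data['realized_delta'], color='blue', label="Realized Next-Day Change")
plt.legend(loc='upper left')
plt.title("Prediction Deltas vs. Realized Next-Day Changes")
plt.xlabel("Date")
plt.ylabel("Price Change")
plt.show()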