We are doing the final testing on upgrading the Quantopian platform to use pandas 0.18, NumPy 1.11.1, and SciPy 0.17.1. When that testing is complete we will make the changes on our production servers. We are planning on upgrading the backtest servers first, followed by the live trading servers the following day. The exact date is pending the completion of testing, but we're aiming for Monday/Tuesday.
The vast majority of algorithms are unaffected by this change.
pandas 0.18 and NumPy 1.11 contain many useful additions and improvements. These releases also contain a small number of breaking changes that may affect Quantopian algorithms. Some algorithms will no longer run. Community members with a live algorithm that will be affected (broker-backed, contest, or zipline) have already received an email from us.
We will update this thread as the upgrade is performed.
Here are some of the common breakages that we've seen, and how to work around them. They might be useful if you run across an older algorithm that no longer runs.
Timezone-aware Datetime Columns
The most important change for Quantopian users is the pandas 0.17 addition of timezone-aware datetime columns to pandas DataFrames. Timestamps read from DataFrames that were previously tz-naive may now be tz-aware, leading to errors in comparison operators like == or <. Very few Quantopian APIs are directly affected by this change, but algorithms that construct DataFrames containing datetime columns may behave differently because of this update.
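A minimal sketch of the pitfall (the column name and values here are hypothetical, not from any Quantopian API): once a datetime column carries a timezone, ordering comparisons against tz-naive Timestamps raise, and the fix is to localize the naive side.

```python
import pandas as pd

# A DataFrame built with an explicit timezone now stores a tz-aware column.
df = pd.DataFrame(
    {"when": pd.to_datetime(["2016-01-04", "2016-01-05"]).tz_localize("UTC")}
)

aware = df["when"].iloc[0]           # tz-aware Timestamp (UTC)
naive = pd.Timestamp("2016-01-04")   # tz-naive Timestamp

# Ordering a tz-aware against a tz-naive Timestamp raises TypeError.
try:
    aware < naive
except TypeError as e:
    print("comparison failed:", e)

# Fix: localize the naive side so both sides agree on a timezone.
print(aware == naive.tz_localize("UTC"))  # True
```

Note that equality (==) between mismatched Timestamps simply returns False rather than raising, which can hide the problem; ordering operators (<, >) fail loudly.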
Changes in Broadcasting Behavior Between DataFrame and Series
NOTE: In this section, all the examples that refer to +/add also apply to the other binary arithmetic operators (e.g. -/sub, */mul, etc.).
In pandas, when you add a DataFrame and a Series, the Series is interpreted as a row and broadcast to every row in the DataFrame:
In [8]: df
Out[8]:
a b
x 0 1
y 1 2
z 2 3
In [9]: ab_series
Out[9]:
a 10
b 100
dtype: int64
In [10]: df + ab_series
Out[10]:
a b
x 10 101
y 11 102
z 12 103
This can lead to unexpected behavior when we want pandas to interpret a Series as a column instead of a row:
In [13]: xyz_series
Out[13]:
x 10
y 100
z 1000
dtype: int64
In [14]: df + xyz_series
Out[14]:
a b x y z
x NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN
The right way to add a Series to a DataFrame column-wise is to use DataFrame.add with axis=0:
In [16]: df.add(xyz_series, axis=0)
Out[16]:
a b
x 10 11
y 101 102
z 1002 1003
A particularly common case where you want to add a Series to a DataFrame as a column is when working with timeseries data. This is so common that older versions of pandas special-cased the behavior of adding a Series and a DataFrame when both objects had DatetimeIndexes:
Old Behavior:
In [4]: returns
Out[4]:
AAPL MSFT
2014-01-01 0.10 -0.02
2014-01-02 0.05 0.10
2014-01-03 -0.01 -0.04
In [5]: benchmark
Out[5]:
2014-01-01 0.10
2014-01-02 0.05
2014-01-03 0.06
Freq: D, dtype: float64
In [6]: returns + benchmark
frame.py:3194: FutureWarning: TimeSeries broadcasting along DataFrame index by default is deprecated. Please use DataFrame.<op> to explicitly broadcast arithmetic operations along the index
FutureWarning)
Out[6]:
AAPL MSFT
2014-01-01 0.20 0.08
2014-01-02 0.10 0.15
2014-01-03 0.05 0.02
While often convenient, this special case made it harder for users to understand the rules for broadcasting and led to confusing behavior when an operation that worked with datetimes stopped working with differently-indexed data. For these reasons, the pandas team deprecated the datetime special case in pandas 0.8.0 and finally removed the behavior in pandas 0.17.0. Consequently, trying to add a datetime-indexed DataFrame to a like-indexed Series will no longer implicitly use column-wise addition:
New Behavior:
In [7]: returns + benchmark
Out[7]:
2014-01-01 00:00:00 2014-01-02 00:00:00 2014-01-03 00:00:00 AAPL MSFT
2014-01-01 NaN NaN NaN NaN NaN
2014-01-02 NaN NaN NaN NaN NaN
2014-01-03 NaN NaN NaN NaN NaN
Users whose algorithms do column-wise arithmetic between a Series and a DataFrame should update their code to use the corresponding explicit methods. See the pandas docs for full details.
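For reference, here is a sketch that rebuilds the returns/benchmark frames from the example above and recovers the old behavior with the explicit method:

```python
import pandas as pd

dates = pd.date_range("2014-01-01", periods=3, freq="D")
returns = pd.DataFrame(
    {"AAPL": [0.10, 0.05, -0.01], "MSFT": [-0.02, 0.10, -0.04]},
    index=dates,
)
benchmark = pd.Series([0.10, 0.05, 0.06], index=dates)

# Explicit column-wise broadcast: align the Series on the index (axis=0),
# not on the columns, reproducing the removed special case.
excess = returns.add(benchmark, axis=0)
print(excess)
#             AAPL  MSFT
# 2014-01-01  0.20  0.08
# 2014-01-02  0.10  0.15
# 2014-01-03  0.05  0.02
```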
Stricter Int/Float Type Checking
Several APIs in pandas and numpy used to warn and coerce floats to integers. Many of these APIs now raise errors when they receive floats.
Most notably, using a float key with DataFrame.iloc will now raise an error:
Old Behavior:
In [14]: df
Out[14]:
d e f
a 1 0 0
b 0 1 0
c 0 0 1
In [15]: df.iloc[1.0]
FutureWarning: scalar indexers for index type Index should be integers and not floating point
type(self).__name__),FutureWarning)
Out[15]:
d 0
e 1
f 0
Name: b, dtype: float64
New Behavior:
In [15]: df.iloc[1.0]
...
TypeError: cannot do positional indexing on <class 'pandas.indexes.base.Index'> with these indexers [1.0] of <type 'float'>
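The usual fix is to cast the indexer to an integer before calling iloc. A sketch (the division producing the float position is a hypothetical example of how one sneaks in):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.eye(3, dtype=int), index=list("abc"), columns=list("def"))

pos = len(df) / 2.0   # 1.5 -- float positions often come from arithmetic

# Float positional indexers are rejected in pandas >= 0.18.
try:
    df.iloc[pos]
except TypeError as e:
    print("float indexer rejected:", e)

# Cast to int explicitly before positional indexing.
row = df.iloc[int(pos)]
print(row)
```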
Rolling and Expanding Functions
In pandas 0.18, the rolling_* and expanding_* families of functions (e.g. rolling_mean and expanding_mean) were deprecated in favor of .rolling() and .expanding() methods, whose syntax resembles groupby:
Old Style:
In [13]: df
Out[13]:
a b c
0 1.788628 0.436510 5
1 -1.863493 -0.277388 5
2 -0.082741 -0.627001 5
3 -0.477218 -1.313865 5
4 0.881318 1.709573 5
In [14]: pd.rolling_mean(df, 3)
Out[14]:
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 -0.052535 -0.155960 5
3 -0.807817 -0.739418 5
4 0.107120 -0.077097 5
In [15]: pd.rolling_min(df, 3)
Out[15]:
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.863493 -0.627001 5
3 -1.863493 -1.313865 5
4 -0.477218 -1.313865 5
New Style:
In [6]: df.rolling(3).mean()
Out[6]:
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 -0.052535 -0.155960 5.0
3 -0.807817 -0.739418 5.0
4 0.107120 -0.077097 5.0
In [7]: df.rolling(3).min()
Out[7]:
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.863493 -0.627001 5.0
3 -1.863493 -1.313865 5.0
4 -0.477218 -1.313865 5.0
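The expanding_* family migrates the same way; a small sketch with made-up data:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Old style (deprecated in 0.18): pd.expanding_mean(s)
# New style: the expanding() method, mirroring rolling() above.
expanded = s.expanding().mean()
print(expanded)  # 1.0, 1.5, 2.0, 2.5
```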
For more details on this change, see the pandas changelog.
Bugs
At least one bug introduced in pandas 0.17 is known to affect a small number of algorithms. DataFrame.transpose is broken when the frame being transposed contains a column of the new tz-aware datetime dtype. This issue manifests as an AssertionError with the message: Number of Block dimensions (1) must equal number of axes (2).
Usage of the DataFrame.transpose() method is, in general, discouraged. transpose is very rarely necessary, and it can impose a significant performance penalty when applied to DataFrames containing multiple data types. In cases where transpose() is necessary, users may need to ensure that their frames do not contain tz-aware datetimes.