We are doing the final testing on upgrading the Quantopian platform to use pandas 0.18, NumPy 1.11.1, and SciPy 0.17.1. When that testing is complete we will make the changes on our production servers. We are planning on upgrading the backtest servers first, followed by the live trading servers the following day. The exact date is pending the completion of testing, but we're aiming for Monday/Tuesday.
The vast majority of algorithms are unaffected by this change.
pandas 0.18 and NumPy 1.11 contain many useful additions and improvements. These releases also contain a small number of breaking changes that may affect Quantopian algorithms. Some algorithms will no longer run. Community members with a live algorithm that will be affected (broker-backed, contest, or zipline) have already received an email from us.
We will update this thread as the upgrade is performed.
Here are some of the common breakages that we've seen, and how to work around them. They might be useful if you run across an older algorithm that no longer runs.
Timezone-aware Datetime Columns
The most important change for Quantopian users is the pandas 0.17 addition of timezone-aware datetime columns to pandas DataFrames. Timestamps read from DataFrames that were previously tz-naive may now be tz-aware, leading to errors in comparison operators like == or <. Very few Quantopian APIs are directly affected by this change, but algorithms that construct DataFrames containing datetime columns may behave differently because of this update.
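A minimal sketch of the pitfall (the column name and values here are hypothetical, not from any Quantopian API): once a datetime column carries a timezone, ordering comparisons against tz-naive Timestamps raise, and the fix is to localize the naive side.

```python
import pandas as pd

# A DataFrame built with an explicit timezone now stores a tz-aware column.
df = pd.DataFrame(
    {"when": pd.to_datetime(["2016-01-04", "2016-01-05"]).tz_localize("UTC")}
)

aware = df["when"].iloc[0]           # tz-aware Timestamp (UTC)
naive = pd.Timestamp("2016-01-04")   # tz-naive Timestamp

# Ordering a tz-aware against a tz-naive Timestamp raises TypeError.
try:
    aware < naive
except TypeError as e:
    print("comparison failed:", e)

# Fix: localize the naive side so both sides agree on a timezone.
print(aware == naive.tz_localize("UTC"))  # True
```

Note that equality (==) between mismatched Timestamps simply returns False rather than raising, which can hide the problem; ordering operators (<, >) fail loudly.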
Changes in Broadcasting Behavior Between DataFrame and Series
NOTE: In this section, all the examples that refer to +/add also apply to the other binary arithmetic operators (e.g. -/sub, */mul, etc.).
In pandas, when you add a DataFrame and a Series, the Series is interpreted as a row and broadcast to every row in the DataFrame:
In [8]: df
Out[8]:
a b
x 0 1
y 1 2
z 2 3
In [9]: ab_series
Out[9]:
a 10
b 100
dtype: int64
In [10]: df + ab_series
Out[10]:
a b
x 10 101
y 11 102
z 12 103
This can lead to unexpected behavior when we want pandas to interpret a Series as a column instead of a row:
In [13]: xyz_series
Out[13]:
x 10
y 100
z 1000
dtype: int64
In [14]: df + xyz_series
Out[14]:
a b x y z
x NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN
The right way to add a Series to a DataFrame column-wise is to use DataFrame.add with axis=0:
In [16]: df.add(xyz_series, axis=0)
Out[16]:
a b
x 10 11
y 101 102
z 1002 1003
A particularly common case where you want to add a Series to a DataFrame as a column is when working with timeseries data. This is so common that older versions of pandas special-cased the behavior of adding a Series and a DataFrame when both objects had DatetimeIndexes:
Old Behavior:
In [4]: returns
Out[4]:
AAPL MSFT
2014-01-01 0.10 -0.02
2014-01-02 0.05 0.10
2014-01-03 -0.01 -0.04
In [5]: benchmark
Out[5]:
2014-01-01 0.10
2014-01-02 0.05
2014-01-03 0.06
Freq: D, dtype: float64
In [6]: returns + benchmark
frame.py:3194: FutureWarning: TimeSeries broadcasting along DataFrame index by default is deprecated. Please use DataFrame.<op> to explicitly broadcast arithmetic operations along the index
FutureWarning)
Out[6]:
AAPL MSFT
2014-01-01 0.20 0.08
2014-01-02 0.10 0.15
2014-01-03 0.05 0.02
While often convenient, this special case made it harder for users to understand the rules for broadcasting and led to confusing behavior when an operation that worked with datetimes stopped working with differently-indexed data. For these reasons, the pandas team deprecated the datetime special case in pandas 0.8.0 and finally removed the behavior in pandas 0.17.0. Consequently, trying to add a datetime-indexed DataFrame to a like-indexed Series will no longer implicitly use column-wise addition:
New Behavior:
In [7]: returns + benchmark
Out[7]:
2014-01-01 00:00:00 2014-01-02 00:00:00 2014-01-03 00:00:00 AAPL MSFT
2014-01-01 NaN NaN NaN NaN NaN
2014-01-02 NaN NaN NaN NaN NaN
2014-01-03 NaN NaN NaN NaN NaN
Users whose algorithms do column-wise arithmetic between a Series and a DataFrame should update their code to use the corresponding explicit methods. See the pandas docs for full details.
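For reference, here is a sketch that rebuilds the returns/benchmark frames from the example above and recovers the old behavior with the explicit method:

```python
import pandas as pd

dates = pd.date_range("2014-01-01", periods=3, freq="D")
returns = pd.DataFrame(
    {"AAPL": [0.10, 0.05, -0.01], "MSFT": [-0.02, 0.10, -0.04]},
    index=dates,
)
benchmark = pd.Series([0.10, 0.05, 0.06], index=dates)

# Explicit column-wise broadcast: align the Series on the index (axis=0),
# not on the columns, reproducing the removed special case.
excess = returns.add(benchmark, axis=0)
print(excess)
#             AAPL  MSFT
# 2014-01-01  0.20  0.08
# 2014-01-02  0.10  0.15
# 2014-01-03  0.05  0.02
```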
Stricter Int/Float Type Checking
Several APIs in pandas and numpy used to warn and coerce floats to integers. Many of these APIs now raise errors when they receive floats.
Most notably, using a float key with DataFrame.iloc will now raise an error:
Old Behavior:
In [14]: df
Out[14]:
d e f
a 1 0 0
b 0 1 0
c 0 0 1
In [15]: df.iloc[1.0]
FutureWarning: scalar indexers for index type Index should be integers and not floating point
type(self).__name__),FutureWarning)
Out[15]:
d 0
e 1
f 0
Name: b, dtype: float64
New Behavior:
In [15]: df.iloc[1.0]
...
TypeError: cannot do positional indexing on <class 'pandas.indexes.base.Index'> with these indexers [1.0] of <type 'float'>
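The usual fix is to cast the indexer to an integer before calling iloc. A sketch (the division producing the float position is a hypothetical example of how one sneaks in):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.eye(3, dtype=int), index=list("abc"), columns=list("def"))

pos = len(df) / 2.0   # 1.5 -- float positions often come from arithmetic

# Float positional indexers are rejected in pandas >= 0.18.
try:
    df.iloc[pos]
except TypeError as e:
    print("float indexer rejected:", e)

# Cast to int explicitly before positional indexing.
row = df.iloc[int(pos)]
print(row)
```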
Rolling and Expanding Functions
In pandas 0.18, the rolling_* and expanding_* families of functions (e.g. rolling_mean and expanding_mean) were deprecated in favor of .rolling() and .expanding() methods, whose syntax resembles groupby:
Old Style:
In [13]: df
Out[13]:
a b c
0 1.788628 0.436510 5
1 -1.863493 -0.277388 5
2 -0.082741 -0.627001 5
3 -0.477218 -1.313865 5
4 0.881318 1.709573 5
In [14]: pd.rolling_mean(df, 3)
Out[14]:
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 -0.052535 -0.155960 5
3 -0.807817 -0.739418 5
4 0.107120 -0.077097 5
In [15]: pd.rolling_min(df, 3)
Out[15]:
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.863493 -0.627001 5
3 -1.863493 -1.313865 5
4 -0.477218 -1.313865 5
New Style:
In [6]: df.rolling(3).mean()
Out[6]:
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 -0.052535 -0.155960 5.0
3 -0.807817 -0.739418 5.0
4 0.107120 -0.077097 5.0
In [7]: df.rolling(3).min()
Out[7]:
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.863493 -0.627001 5.0
3 -1.863493 -1.313865 5.0
4 -0.477218 -1.313865 5.0
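The expanding_* family migrates the same way; a small sketch with made-up data:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Old style (deprecated in 0.18): pd.expanding_mean(s)
# New style: the expanding() method, mirroring rolling() above.
expanded = s.expanding().mean()
print(expanded)  # 1.0, 1.5, 2.0, 2.5
```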
For more details on this change, see the pandas changelog.
Bugs
At least one bug introduced in pandas 0.17 is known to affect a small number of algorithms. DataFrame.transpose is broken when the frame being transposed contains a column of the new tz-aware datetime dtype. This issue manifests as an AssertionError with the message: Number of Block dimensions (1) must equal number of axes (2).
Usage of the DataFrame.transpose() method is, in general, discouraged. transpose is very rarely necessary, and it can impose a significant performance penalty when applied to DataFrames containing multiple data types. In cases where transpose() is necessary, users may need to ensure that their frames do not contain tz-aware datetimes.