Google Search Terms predict market movements

Stock prices reflect the trading decisions of many individuals. For the most part, quantitative finance has developed sophisticated methods that try to predict future trading decisions (and the price) based on past trading decisions. But what about the information-gathering phase that precedes a trading decision? Two recent papers in Nature's Scientific Reports suggest that Google searches and Wikipedia usage patterns contain signal about this information-gathering phase that can be exploited in a trading algorithm. As is (unfortunately) very common, the papers come with no published code that we could use to easily replicate the results. The algorithms are very simple, though, so I coded both of them on Quantopian. They indeed seem to perform quite favourably and thus roughly replicate the results of the paper, as you can see below. The original simulations did not model transaction costs or slippage, which we include here. In that regard, we can show that these strategies still seem to work under more realistic settings.

This algorithm looks at the Google Trends data for the word ‘debt’. According to the paper, that word has the most predictive power.

Fetching this data is not easy to automate within Quantopian, but it's relatively easy to handle manually. I downloaded the csv file and edited it to get it into the right format. I uploaded the resulting file here.

If you want to use my data on 'debt' feel free to do so. If you want to use the Google Trends data for a different word, you can download the CSV, edit it to look like mine, and place it in a public Dropbox or on some other webserver.

If there is enough interest we can make this data more accessible (if you want to help me with this, an automated Python script that parses the csv returned by Google Trends into the format I posted would be much appreciated).

http://www.nature.com/srep/2013/130425/srep01684/images/srep01684-f3.jpg

For this algorithm, when the weekly search value is below its moving average over the previous delta_t weeks (in this case delta_t == 5), we buy and hold the S&P 500 for one week. If the weekly value is above the moving average, we sell and re-buy the S&P 500 after one week. The original paper uses the Dow Jones Industrial Average; the S&P 500 is highly correlated with it, however.
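In rough pandas terms (a sketch only, with hypothetical file and column names, not the actual Quantopian implementation), the rule looks something like this:

import pandas as pd

# weekly Google Trends values for 'debt' and weekly S&P 500 closes,
# assumed to share the same week-ending dates (file/column names are made up)
trends = pd.read_csv("debt_trends.csv", parse_dates=["Date"], index_col="Date")["debt"]
spy = pd.read_csv("spy_weekly.csv", parse_dates=["Date"], index_col="Date")["close"]

delta_t = 5
trailing_mean = trends.rolling(delta_t).mean().shift(1)   # average of the previous delta_t weeks
signal = trends < trailing_mean                           # below average -> be long next week
position = signal.shift(1).fillna(False).astype(bool)     # act on the signal the following week

strategy_returns = spy.pct_change().where(position, 0.0)  # flat (in cash) otherwise
print((1 + strategy_returns).cumprod().tail())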

Suggestions for improvement (please post improvements as replies to this thread):

  • The authors used many different search queries, listed here. If you upload different queries in the same csv format as I did, we can explore those as well.
  • delta_t == 3 is what the authors of the paper used. It would be interesting to see how the algorithm performs when this is changed.
  • The underlying algorithm is a very basic moving average cross-over. Certainly a more clever strategy might be able to do a much better job.
Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

111 responses

That's a very interesting result Thomas.

I have attached a tweaked version of your backtest. This one starts with $100k and risks 75% of available margin every week. E.g. $150k the first week and then more or less depending on profit & loss.

If we look at the results yearly it becomes clear that this system performs fairly average except for being very strong in bear markets.

2004: algo 23% SPY 14%
2005: algo 10% SPY 7%
2006: algo 3% SPY 8%
2007: algo -15% SPY 5%
2008: algo 13% SPY -40%
2009: algo 137% SPY 26%
2010: algo 28% SPY 15%
2011: algo 41% SPY 6%
2012: algo 9% SPY 14%

Edit: more years, and starting in December the previous year (to calculate 5 week indicator)

I just looked at your CSV file. I think you have a slight time travel problem.

The date stamps in your CSV file are for the week ahead. Meaning you are getting the Google Trends search results prior to them being collected by Google.

You should instead use the date at the end of the week.

I found a Python script for accessing Google Trends.

https://github.com/suryasev/unofficial-google-trends-api

I wonder if we could plug this directly into our algos.

Hi Dennis,

Thanks for posting the margin version of the algorithm. It's also an interesting observation that the algorithm seems to capitalize on bear markets. You're also correct about the time-shift. I'll upload a new csv file with the fixed dates.

The pyGTrends.py is certainly useful. However, I think it only fetches the csv but doesn't reformat it into the shape we'll require.

Here is a SED script that cleans up the Google Trends CSV.

# skip to Week report  
1,/^Week,/ {

  # process header for Week report  
  /^Week,/ {

    # rename Week column to Date  
    s/^Week,/Date,/

    # print header row  
    P  
  }  
  # delete skipped lines  
  D  
}

# remove start date (leaving end date in first column)  
s/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] - //

# remove row that does not have numeric value in second column  
s/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9],[^0-9]*$//

# remove all lines after Week report (match first blank line)  
/^$/,$ d
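In case it helps, the script above can be run with standard sed, for example sed -f cleanup.sed downloaded_report.csv > debt.csv, where the filenames are just placeholders.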

Thanks Thomas,

I haven't dug into the papers, but might there be a bias here? For example, your backtest is using Google Trends data on the string "debt" over the 2008/2009 debt crisis, an extreme event. Is there any indication that this would be a viable strategy going forward from today, with a different search string? How would we have any idea what string would be predictive?

I've wondered if there were more conventional "big red flags" that all was not well in the financial world. One reference is http://www.efficientfrontier.com/ef/0adhoc/darkside.htm where William Bernstein comments (on a TIPS yield curve dated 3/8/2008):

"Conclusion: The debt markets are so out of whack that we are now at a point where credit risk is being rewarded more than equity risk, something that should never happen in a world where equity investors own only the residual rights to earnings. This cannot last for very long: either spreads will tighten rapidly, equity prices will fall rapidly, or both. (Or, chortle, earnings will grow more rapidly.) Stay tuned."

Grant

Thomas,

Here's an interesting reference:

http://www.newyorkfed.org/research/capital_markets/Prob_Rec.pdf

Note the increasing predicted probability of a recession, starting in 2005-2006.

If you are interested in macro-trends, you might also have a look at the charts and data available here:

http://www.aheadofthecurve-thebook.com/

I read the book awhile back, and my take-away was that when consumers have money in their pockets, they spend it, driving corporate earnings and stock prices. My read is that presently, the U.S. government is putting money into pockets by financing artificially low interest rates (but I ain't no economist, so I could be way off).

Grant

Here's a python script that looks for Google Trends CSV files in an 'input' folder and writes modified CSV files to an 'output' folder. Both folders should already exist.

import glob

# process all CSV files in the 'input' folder, write modified CSV to the 'output' folder
for filename in glob.glob("input/*.csv"):
    # flag to skip input lines until the Week report starts
    skip = 1
    # open input file
    with open(filename) as f:
        # open output file for writing
        with open(filename.replace("input", "output"), "w") as o:
            # get all lines
            lines = f.readlines()
            # process each line
            for line in lines:
                # check for start of Week report
                if skip and line.startswith("Week,"):
                    # stop skipping lines
                    skip = 0
                    # output modified CSV header
                    o.write(line.replace("Week,", "date,"))
                elif not skip:
                    # a blank line marks the end of the Week report
                    if line.strip() == '':
                        # skip remainder of input file
                        skip = 1
                        break
                    else:
                        # remove start date (leaving end date in first column)
                        data = line.split(" ")  # split on space
                        if len(data) == 3:
                            fields = data[2].split(",")  # split on comma
                            # check that a numeric value follows the comma
                            if len(fields) == 2 and fields[1].strip() != '':
                                # numeric value is present, write row to output
                                o.write(data[2])
                            else:
                                # numeric value missing after comma, report done
                                skip = 1
                                break
                        else:
                            # input line doesn't conform
                            skip = 1
                            break

So I just re-ran the 'debt' keyword backtest using an up-to-date CSV file (using the end-of-week dates to avoid time travel).

Sadly it doesn't have the same punch as before.

Dennis: Thanks for doing that. The scripts will also be very helpful in automating this.

I also ran it with the correct timing and got similar (disappointing) results. What is curious is that the above (wrong) result seems to be pretty similar to the results of the paper. Not sure what we did wrong, the paper states:
"We use Google Trends to determine how many searches n(t – 1) have been carried out for a specific search term such as debt in week t – 1, where Google defines weeks as ending on a Sunday, relative to the total number of searches carried out on Google during that time." So it's pretty clear that my above code was not what the authors described.

I had almost the exact same thought process. It's always disappointing to debunk a promising strategy. Hopefully we'll figure out a way to use the effort anyway.

I created a script to automate the downloading and cleaning up of the Google Trends CSV data here: https://gist.github.com/tlmaloney/5650699. Let me know if you run into any issues. I seem to have hit my Google Trends quota limit for number of queries from my IP address.

Thomas, code looks great -- thanks!

Do you know what the quota limit is? Certainly that'd make it very hard for us to integrate this.

The paper by Preis, Moat, and Stanley has numerous flaws in it, and it seems unlikely that it can be replicated by anyone. Chief among them is the datamining bias of selecting the best of 100 search terms based on in-sample performance and expecting that same performance to be an unbiased estimate of future performance. (The authors evidently also do not understand how shorting works, or statistical hypothesis testing.) Even if you can reconcile the biases in their backtests, and somehow get the same Google Trends data that they do, it would be difficult to see through this selection bias.

Has anyone here succeeded in replicating Figure 2? I think that they got 326% by summing returns every week. I tried many ways but failed to get the same result. Only with geometric returns and delta = 4 or 5 can I get a 3.23 return. But in that case, the profit should start at 100, not 0.

I found another interesting discussion of the paper at http://sellthenews.tumblr.com/

@Sangno: Are you trying a 1-to-1 replication? I'm also working on an IPython Notebook to do just that but I'm not confident enough yet. Maybe I should post it so that we can join efforts?

I am so happy that you are also trying to replicate the paper. I used the SAS package to test their paper, but to understand each step more clearly I used Excel too. Here is my approach.

  1. Sample data
    We need two sample datasets: DJIT and the Google Trends series for the 'debt' term. I collected the DJIT data from http://www.djaverages.com/?go=industrial-index-data&report=performance with index price history from Jan 5, 2004 to Feb 22, 2011. I drew a graph of DJIT and got the same result as Figure 1. I don't think the DJIT data has any problems.

I collected the trend of 'debt' from Google Trends, http://www.google.com/trends/explore?q=debt#q=debt&geo=US&date=1%2F2004%2086m&cmpt=q. I restricted my search to the 'debt' term, U.S. geography, and the period from Jan 2004 to Feb 2011. I downloaded this result with 'Download as CSV' on the same site. Here are the first six rows.

Week debt
2004-01-04 - 2004-01-10 63
2004-01-11 - 2004-01-17 66
2004-01-18 - 2004-01-24 63
2004-01-25 - 2004-01-31 66
2004-02-01 - 2004-02-07 61
2004-02-08 - 2004-02-14 62

  2. Matching DJIT data and debt data
    I merged the debt data with DJIT by matching each week's debt value to the nearest DJIT date after that week. For instance, the end date of the first week is Jan 10, 2004, so that week's debt value is matched with the first occurrence of DJIT after Jan 10. Because of holidays and other reasons, some weeks don't have a transaction on Monday, so we need to find the first trading day after the end of each week of the term data. Here are the first several lines.

Week debt sdate edate ddate djit
2004-01-04 - 2004-01-10 63 1/4/2004 1/10/2004 1/12/2004 10485.17702
2004-01-11 - 2004-01-17 66 1/11/2004 1/17/2004 1/20/2004 10528.6635
2004-01-18 - 2004-01-24 63 1/18/2004 1/24/2004 1/26/2004 10702.51163
2004-01-25 - 2004-01-31 66 1/25/2004 1/31/2004 2/2/2004 10499.18265
2004-02-01 - 2004-02-07 61 2/1/2004 2/7/2004 2/9/2004 10579.03279
2004-02-08 - 2004-02-14 62 2/8/2004 2/14/2004 2/17/2004 10714.88173
2004-02-15 - 2004-02-21 61 2/15/2004 2/21/2004 2/23/2004 10609.62473
2004-02-22 - 2004-02-28 62 2/22/2004 2/28/2004 3/1/2004 10678.14178

  3. Moving average
    When Google Trends reports trends data, the numbers do not seem to be finalized exactly on Sunday. Suppose that you collect trend data on Jan 11, 2004 for the week 2004-01-04 to 2004-01-10. On Sunday (Jan 11), you may see "Partial Data" in Google Trends because the trend data for the previous week has not been finalized yet. Note: Google uses a week from Sunday to Saturday. On page 2 of the paper, the authors state "where Google defines weeks as ending on a Sunday..". I think this is not correct.

Because of the incomplete data on Sunday, the authors consider it necessary to use a moving average over the last 3 weekly values. Thus, we make an average of three weeks of data. The first row is just the first value, and the second row is the average of the first and second values. Here are the first few rows; debt3 indicates the 3-week average.

Week debt sdate edate ddate djit debt3
2004-01-04 - 2004-01-10 63 1/4/2004 1/10/2004 1/12/2004 10485.17702 63
2004-01-11 - 2004-01-17 66 1/11/2004 1/17/2004 1/20/2004 10528.6635 64.5
2004-01-18 - 2004-01-24 63 1/18/2004 1/24/2004 1/26/2004 10702.51163 64
2004-01-25 - 2004-01-31 66 1/25/2004 1/31/2004 2/2/2004 10499.18265 65
2004-02-01 - 2004-02-07 61 2/1/2004 2/7/2004 2/9/2004 10579.03279 63.33333333
2004-02-08 - 2004-02-14 62 2/8/2004 2/14/2004 2/17/2004 10714.88173 63
2004-02-15 - 2004-02-21 61 2/15/2004 2/21/2004 2/23/2004 10609.62473 61.33333333
2004-02-22 - 2004-02-28 62 2/22/2004 2/28/2004 3/1/2004 10678.14178 61.66666667

  4. delta (page 2)
    delta(t) = n(t) - debt3(t-1)

Here are the first few rows.

Week debt sdate edate ddate djit debt3 delta
2004-01-04 - 2004-01-10 63 1/4/2004 1/10/2004 1/12/2004 10485.17702 63 0
2004-01-11 - 2004-01-17 66 1/11/2004 1/17/2004 1/20/2004 10528.6635 64.5 3
2004-01-18 - 2004-01-24 63 1/18/2004 1/24/2004 1/26/2004 10702.51163 64 -1.5
2004-01-25 - 2004-01-31 66 1/25/2004 1/31/2004 2/2/2004 10499.18265 65 2
2004-02-01 - 2004-02-07 61 2/1/2004 2/7/2004 2/9/2004 10579.03279 63.33333333 -4
2004-02-08 - 2004-02-14 62 2/8/2004 2/14/2004 2/17/2004 10714.88173 63 -1.333333333
2004-02-15 - 2004-02-21 61 2/15/2004 2/21/2004 2/23/2004 10609.62473 61.33333333 -2
2004-02-22 - 2004-02-28 62 2/22/2004 2/28/2004 3/1/2004 10678.14178 61.66666667 0.666666667
2004-02-29 - 2004-03-06 61 2/29/2004 3/6/2004 3/8/2004 10529.4783 61.33333333 -0.666666667
2004-03-07 - 2004-03-13 61 3/7/2004 3/13/2004 3/15/2004 10102.89483 61.33333333 -0.333333333

  5. Trading and return
    If delta(t-1) > 0, take a short position.
    If delta(t-1) < 0, take a long position.
    If delta(t-1) = 0, take no action.
    I think the authors take no action on ties. When they explain transaction fees, they point out that the maximum number of transactions is only 104. If they took a short or long position on ties, they would have described the number of transactions as exactly 104, rather than as a maximum.

For the short position, the return is log(p_t-1) - log(p_t), and for the long position, the return is log(p_t) - log(p_t-1).

Here are the first few rows.

Week debt sdate edate ddate djit debt3 delta ret
2004-01-04 - 2004-01-10 63 1/4/2004 1/10/2004 1/12/2004 10485.17702 63 0 0
2004-01-11 - 2004-01-17 66 1/11/2004 1/17/2004 1/20/2004 10528.6635 64.5 3 0
2004-01-18 - 2004-01-24 63 1/18/2004 1/24/2004 1/26/2004 10702.51163 64 -1.5 -0.016377051
2004-01-25 - 2004-01-31 66 1/25/2004 1/31/2004 2/2/2004 10499.18265 65 2 -0.019181035
2004-02-01 - 2004-02-07 61 2/1/2004 2/7/2004 2/9/2004 10579.03279 63.33333333 -4 -0.007576593
2004-02-08 - 2004-02-14 62 2/8/2004 2/14/2004 2/17/2004 10714.88173 63 -1.333333333 0.012759588
2004-02-15 - 2004-02-21 61 2/15/2004 2/21/2004 2/23/2004 10609.62473 61.33333333 -2 -0.009872009
2004-02-22 - 2004-02-28 62 2/22/2004 2/28/2004 3/1/2004 10678.14178 61.66666667 0.666666667 0.006437245
2004-02-29 - 2004-03-06 61 2/29/2004 3/6/2004 3/8/2004 10529.4783 61.33333333 -0.666666667 0.014020047
2004-03-07 - 2004-03-13 61 3/7/2004 3/13/2004 3/15/2004 10102.89483 61.33333333 -0.333333333 -0.04135678

  6. Accumulating return
    We have two ways of accumulating returns: summing them from beginning to end, and compounding them geometrically, (1+r)(1+r)... (a pandas sketch of steps 3-6 follows after step 7).
    In my case, I get a geometric return of 2.027 and a sum of the returns of 0.824.

Here are the first and last few rows.
Week debt sdate edate ddate djit debt3 delta ret sret cret
2004-01-04 - 2004-01-10 63 1/4/2004 1/10/2004 1/12/2004 10485.17702 63 0 0 0 1
2004-01-11 - 2004-01-17 66 1/11/2004 1/17/2004 1/20/2004 10528.6635 64.5 3 0 0 1
2004-01-18 - 2004-01-24 63 1/18/2004 1/24/2004 1/26/2004 10702.51163 64 -1.5 -0.016377051 -0.016377051 0.983622949
2004-01-25 - 2004-01-31 66 1/25/2004 1/31/2004 2/2/2004 10499.18265 65 2 -0.019181035 -0.035558085 0.964756044
2004-02-01 - 2004-02-07 61 2/1/2004 2/7/2004 2/9/2004 10579.03279 63.33333333 -4 -0.007576593 -0.043134678 0.95744648
2004-02-08 - 2004-02-14 62 2/8/2004 2/14/2004 2/17/2004 10714.88173 63 -1.333333333 0.012759588 -0.03037509 0.969663103
2004-02-15 - 2004-02-21 61 2/15/2004 2/21/2004 2/23/2004 10609.62473 61.33333333 -2 -0.009872009 -0.040247099 0.96009058
2004-02-22 - 2004-02-28 62 2/22/2004 2/28/2004 3/1/2004 10678.14178 61.66666667 0.666666667 0.006437245 -0.033809854 0.966270918
2004-02-29 - 2004-03-06 61 2/29/2004 3/6/2004 3/8/2004 10529.4783 61.33333333 -0.666666667 0.014020047 -0.019789806 0.979818082
.... 2011-01-16 - 2011-01-22 63 1/16/2011 1/22/2011 1/24/2011 11980.51975 62.66666667 7.666666667 -0.011972994 0.835101399 2.050606652
2011-01-23 - 2011-01-29 67 1/23/2011 1/29/2011 1/31/2011 11891.93241 63 4.333333333 0.007421755 0.842523154 2.065825752
2011-01-30 - 2011-02-05 61 1/30/2011 2/5/2011 2/7/2011 12161.62995 63.66666667 -2 -0.022425689 0.820097466 2.019498187
2011-02-06 - 2011-02-12 61 2/6/2011 2/12/2011 2/14/2011 12268.19208 63 -2.666666667 0.008723994 0.828821459 2.037116276
2011-02-13 - 2011-02-19 67 2/13/2011 2/19/2011 2/22/2011 12212.79189 63 4 -0.004525986 0.824295473 2.027896317

  7. Drawing a graph
    I couldn't get 326%, although the line is similar to Figure 2.
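For anyone following along in Python rather than Excel or SAS, here is a rough pandas sketch of steps 3 to 6 above (my own code, not the authors'; df is assumed to already hold the merged weekly data with the 'debt' and 'djit' columns shown in the tables):

import numpy as np

# 3-week moving average of the search data (step 3)
df['debt3'] = df['debt'].rolling(3, min_periods=1).mean()

# delta(t) = n(t) - debt3(t-1) (step 4)
df['delta'] = df['debt'] - df['debt3'].shift(1)

# short if delta(t-1) > 0, long if delta(t-1) < 0, no action on ties (step 5)
df['position'] = -np.sign(df['delta'].shift(1)).fillna(0)

# weekly log return of the position (step 5)
log_djit = np.log(df['djit'])
df['ret'] = (df['position'] * (log_djit - log_djit.shift(1))).fillna(0)

# summed and geometric cumulative returns (step 6)
df['sret'] = df['ret'].cumsum()
df['cret'] = (1 + df['ret']).cumprod()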

@Thomas -- Sorry, I don't know what the quota limit is. But I may have been mistaken about hitting the quota limit. I was relying on an error message from https://github.com/pedrofaustino/google-trends-csv-downloader which I was using to retrieve the Google Trends CSV, and I realized this may not be the underlying cause. I ended up refactoring my script to use the Python package called mechanize, so it no longer uses pedrofaustino's code, and I am no longer running into a quota limit. Update at https://gist.github.com/tlmaloney/5650699.

Based on https://support.google.com/trends/answer/87282?hl=en&ref_topic=13975, you have to be very careful about how you use Google Trends data as a signal. Since the data is scaled based on the peak level in the time series, the data is not a time series representing point-in-time observations. For instance, let's say we've observed raw (unscaled) data:

time,raw_level
1,0.3
2, 0.6
3, 0.4

would then be scaled to:

time,scaled_level
1,50
2,100
3,67

but if the next point is

time,raw_level
4,0.9

the time series gets rescaled:

time,scaled_level
1,33
2,67
3,44
4,100

You can derive a signal off of the percent changes, but not the levels themselves.
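A tiny illustration of that point (my own toy numbers, ignoring the fact that Trends also rounds to integers, which adds further small discrepancies):

import pandas as pd

raw = pd.Series([0.3, 0.6, 0.4, 0.9])  # hypothetical unscaled search volumes

# each download is rescaled so that the peak observed so far equals 100
first_download = 100 * raw[:3] / raw[:3].max()   # 50, 100, 66.7
later_download = 100 * raw / raw.max()           # 33.3, 66.7, 44.4, 100

# levels differ between downloads, but percent changes over the overlap agree
print(first_download.pct_change())
print(later_download[:3].pct_change())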

@Sangno: I started doing it in Python and pandas: https://github.com/twiecki/replicate_google_trends I'm pretty sure it's not correct yet but if you want to help out that'd be appreciated. I'm also happy to add anyone to the repo if you give me your github account name.

@Thomas: Good news regarding the quota limit. Code looks useful. The scaling certainly is an issue.

Btw. you have to open the .ipynb file in the IPython Notebook http://ipython.org/notebook.html
You can also view it online here: http://nbviewer.ipython.org/urls/raw.github.com/twiecki/replicate_google_trends/master/goog_repl.ipynb

Again, this is very much work-in-progress and I would be very surprised if there weren't obvious bugs. I'll add more comments soon but let me know any questions or suggestions.

@Thomas Understood. I created a fork and a separate analysis at https://github.com/tlmaloney/replicate_google_trends. That's probably all the work I will do, because I don't think this strategy has legs, for reasons mentioned in this conversation. As Grant mentions above, the search term 'debt' itself is a free parameter.

@Thomas: I put my Excel file at https://github.com/leesanglo/Replicating_google_trends/ Because Google Trends data varies according to the search period and geographic area, how about we first replicate the paper exactly and check the results against the same dataset? I am wondering whether we can get 326% by following their trading strategy and the Google Trends data for the debt term. If I can, I would like to help replicate it. My account on github is leesanglo.

@Thomas: Great, thanks. I'll incorporate those changes. I'm actually now more interested in whether the results are reproducible to begin with.

@Sangno: I added you to the repo, thanks for offering to help out. Your analysis at the beginning already seems very promising so maybe we can replicate it in the IPython NB to make it easier to share and present. Note that we should probably include Thomas' changes first.

The sellthenews.tumblr.com blog post says this paper's method of calculating cumulative returns on shorting introduces a (1−p(t+1)/p(t))^2 bias. How is this bias derived? And, more importantly, how much of this "bank error" could contribute to the difference in results that we've seen from the tests above?

The bias is equal to log(p(t)/p(t+1)) - log(2 - p(t+1)/p(t)), which is the 'corrupted' form of log returns minus the real form. Applying the Taylor expansion to both logs gives the order of the error. It turns out this bias is somewhat small, even for weekly returns: on the order of 1 or 2% a year. That said, you should not be able to replicate the Preis paper in Quantopian (unless Quantopian incorrectly deals with shorting, which I doubt). You should, in theory, be able to replicate Figure 2 just using pandas and python, and replicating the bad accounting around shorts.
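A quick numerical check of the size of that bias, writing r for the index's simple return over the week (so p(t+1)/p(t) = 1 + r); this is just my reading of the formula above:

import numpy as np

for r in (-0.05, -0.02, 0.02, 0.05):
    real = np.log(1.0 - r)        # log(2 - p(t+1)/p(t)): wealth-based log return of the short
    corrupted = -np.log1p(r)      # log(p(t)/p(t+1)): the form attributed to the paper
    print(r, corrupted - real, r ** 2)  # the bias is roughly r**2, in the strategy's favour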

The authors of the paper note that they sampled the Google Trends data several times, so it might not be deterministic! If that is the case, I predict you will have great difficulties replicating their work.

@Brent: I think he wants to point out the difference between arithmetic returns and logarithmic returns. In general, the difference between the two measures grows as the rate of return becomes more extreme in either direction. For example, suppose you have a $100 investment. If the stock price increases to $200, the arithmetic return is 100% by (200-100)/100, but the logarithmic return is log(2) = 69.3%, around 70%.

He pointed out that the gap is larger for extreme increases or decreases. For example, if the above stock price falls to 0, the arithmetic gain on a short is (100-0)/100 = 100%, but the logarithmic return log(0/100) is infinite in magnitude. So, under log accounting, short investors theoretically get unlimited gains, but not unlimited losses.

I think the explanation of a positive bias is about the variance between the two measures, not the absolute difference. Please refer to http://www.cdiadvisors.com/papers/CDIArithmeticVsGeometric.pdf, where you will see a basic formula.

@Shabby: I think that when we incorporate Google Trends data into trading rules, the most crucial problem is that on Mondays Google Trends has still not finalized the trend for the previous week. On Mondays, Google Trends shows "partial data" for the previous week, but the trend data from one or two weeks earlier is fixed. Because trend data is normalized by the most frequent term for a given period, if we specify the same period and the same geographic area, we will get the same trend data for everything older than about two weeks. So, I think we should still get around a 300% return from a backtest, even if we acknowledge the incomplete data for the last two or three weeks of Google Trends.

I have emailed the authors for their Google Trends data. I will let you know if I hear back.

@Thomas: What a coincidence! I also emailed the corresponding author of the paper, Dr. Preis, about the Google Trends data yesterday. If I get some news from the authors, I will let you know too. If we do not get any news, I'm going to contact a journal editor.

Haha -- I also wrote the author a couple of days ago but didn't hear back yet.

It seems no one has gotten news from the corresponding author of the paper. So, I have emailed the editors of Scientific Reports about the paper. If I get a response, I will let you know.

@Sangno: Great, let us know if you get any reply. Also, have you had a chance to look at the IPy NB? I'd be more comfortable if we could all agree that it's theoretically doing the right thing. Certainly input from the author would help tremendously...

Meanwhile there is another blog post referring to this thread over at sellthenews which I think hits the nail on the head in regards to replicability.

Hello, guys.

At last, I got a response from an editor at the journal. He will contact the corresponding author and require them to make materials, data, and associated protocols available to readers on request. Before we request the necessary data, I think we need to replicate the paper in exactly the same way and summarize our findings.

@Thomas: Could you make another program, or modify the current one, to use data for the debt term over the same period? They used Google Trends data for the debt term from January 5, 2004 to February 22, 2011, restricted to the U.S. only. Also, would you replicate it with the 'culture' term?

One way to verify the result of our program is whether we get 33% profits when we apply a 'Dow Jones strategy'. If we use DJIA instead of Google trend data for some terms, that is a Dow Jones strategy (p. 5). Please draw cumulative returns and compare the outcome with the authors' outcome. You can see the authors' outcome in the last page of the supplementary information (http://www.nature.com/srep/2013/130425/srep01684/extref/srep01684-s1.pdf).

Using the DJIA data, I got a 32.36% profit for the debt term. But it seems that they employed the (p(t) - p(t-1)) / p(t-1) formula rather than ln(p(t)) - ln(p(t-1)). For the buy-and-hold return, the former formula gives me 16.47%, while the latter gives 15.25%. If we have time, it would be better to write up a report like the http://arxiv.org/abs/1112.1051 paper. If I get more news, I will let you know.

That's great, Sangno. Thanks for sharing. Let's hope he actually sends us the data.

I did some more work on the replication, fixing some bugs, adjusting the time range, etc:
http://nbviewer.ipython.org/urls/raw.github.com/twiecki/replicate_google_trends/master/goog_repl.ipynb

One thing I'm not sure about is whether I should actually do the log calculation at the bottom the other way around. That, however, leads to a losing strategy.

The paper looks very appropriate, I'll give it a read.

Hi Sangno,

Tom and I got the same email as well. Its very forthcoming of the author I think. I looked at the R code and it looks good.

Tom started to compare the data sources here:
http://nbviewer.ipython.org/urls/raw.github.com/twiecki/replicate_google_trends/master/compare_goog_data.ipynb

It doesn't seem that a scaling would influence a moving average but now we can compare the signals directly which should help.

I've looked briefly at the R code, need to think more about their accounting methodology. It's not exactly how they describe it in their paper. I think it's fishy. I'm going through a hypothetical price process. Let's say at week 1 the price is $100, week 2 it's $50. I start with $100 so I can buy or short 1 share in the beginning. Let's say I short at week 1. I end up with $150 after I buy back the shorted share. That's a 50% return, but by their methodology I would have gotten a 100% return. Maybe I'm wrong, but I think their short accounting is incorrect.

My general issue with all systems of this sort is that they have no plausible economic basis. There's no "story" of a market structure defect or behavioral bias that could account for the excess return, especially one this dramatic.

In the absence of that, all you are left with is either gross errors (look-ahead) in the analysis, or simple curve-fitting by another name...

@Simon, while I mostly agree with you there is a flip side. By looking at non-financial keywords we might be able to get insight into subtle shifts in consumer sentiment before it turns into changes in buying habits. Properly harnessed that could allow us to predict future changes in economic activity. Think of it as consumer market research. So perhaps it could be useful in predicting sector dominance.

Maybe, but I am very skeptical. Data mining 100+ terms, multiplied by whatever parameters they optimized for the moving averages, is far too many degrees of freedom compared with ~450 weeks of possible trade entries.

This seems to be the very definition of data mining. I suspect the word "tech" would have been equally impressive in the 1994-2000 timeframe had data been available.

EDIT: and as others have mentioned, if the results disappear when correcting a look-ahead bias, you can be confident the causality was never there, and people were just searching for reasons why the market dropped after they read that it did. What is more plausible:

1) prescient smart money managers search google for "debt", before placing big orders that presage market moves
2) retail masses search google for "debt" (as in "debt ceiling", "debt limit", "sovereign debt" etc), after reading about it in USA Today because that is what the business writers blamed the 10% correction on

I think that, for now, their findings look anecdotal as far as the headline figures go. The implementation in the paper has a double-precision issue and an issue with how the return of short selling is calculated. From the start, a 326% return using the 'debt' term in Google Trends seemed too high to me, even though we acknowledge that Google Trends has some degree of relationship with the Dow Jones.

In terms of their theoretical evidence, other studies support their findings. Mao (http://arxiv.org/pdf/1112.1051.pdf) compared the ability of survey, news, Twitter, and search engine data (i.e., Google Trends) to predict the financial market from 2008 to 2011. Interestingly, the study found that the search data have relatively strong correlations with volatility (VIX), the DJIA, and trading volume. "The Google Insights for Search (GIS) time series have a positive correlation with the VIX and trading volumes, but negative correlations with DJIA, which indicate that as more people search on financial terms, their market will be more volatile (i.e. high VIX), and trading volumes will be higher, while DJIA prices will more lower."

Also, the study ran a Granger causality analysis between GIS and the financial indicators. It turns out that Dow Jones volume doesn't lead GIS, but GIS leads Dow Jones volume by up to 2 weeks. So, that study used a 2-week lag (Preis et al.'s paper used 3-week lags).

Choi and Varian (http://people.ischool.berkeley.edu/~hal/Papers/2011/ptp.pdf) studied forecasting ability using Google Trends. Using an AR(1) model, they found that the MAE during the recession (2007/12 to 2009/01) is 8.86% without Google Trends and 6.96% with Google Trends, an improvement of 21.4%.

Google Finance has already adopted the idea and provided the service using Google Trends. http://www.google.com/finance/domestic_trends

OK, I made more progress on the IPy NB and I can replicate the findings now: http://nbviewer.ipython.org/urls/raw.github.com/twiecki/replicate_google_trends/master/goog_repl.ipynb

Turns out even a simple strategy is very difficult to get right, as the devil really is in the details. Quantopian makes this much easier in general (but see below for why it didn't work here).

My conclusions are as follows:
- There are some minor problems (shorting among other things), as Sangno pointed out, but nothing too major IMHO.
- Google Trends qualitatively changed their data from what Preis based their analysis on. Using more recent data, the results are way less impressive. It'd be interesting to know why the data are actually different; it can't be a scaling issue. It does seem that Trends now reports an integer from 0-100, so that would allow differences to creep in.
- Currently Quantopian has an execution delay of one bar (in our case daily), so orders made on Monday will only get filled Tuesday, thus changing the results.
- The author was very helpful by sending the code and data that allowed us to really track down the bugs in our code. Ideally that code would have been available somewhere online from the get go.

@Sangno: I really liked the Mao paper. Much more thorough analysis. The domestic trends also look cool and worth exploring!

Interesting, thanks for the notebook. I'm curious about this line:

data['log_returns'] = data.order_preis * np.log(data.djia_preis.shift(-1)) - data.order_preis * np.log(data.djia_preis)  

I'm not a python/pandas expert, but my reading of this is that the current bar's order determines what the return was last week? (aka - how is this shift(-1) not introducing future-snooping bias -- I must be misreading this code)

@Simon: Yes, that actually was a bug I had before (this is the fix). The idea is that we are buying (selling) at the price this week and selling (buying) at the price next week. This shift essentially achieves this (-1 will give next week's price). You are correct though that this calculation of returns wouldn't work in a walk-forward setting. But with a fixed strategy (i.e. selling (buying) after a week) that is adhered to no matter what this should be correct (I hope).

Alternatively I could do the returns calculation looking backward but the result should be identical. That would actually be more intuitive...
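For instance, something like this (same frame and column names as in the notebook, so treat it as a sketch) should give the identical series, just attributed to the week the position is closed rather than opened:

data['log_returns_bwd'] = data.order_preis.shift(1) * (
    np.log(data.djia_preis) - np.log(data.djia_preis.shift(1)))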

Does that make sense?

I see. I had had terrible times avoiding future snooping when doing python like this, which is why it raised a red flag. Have you been able to replicate in a zipline?

EDIT: never mind :)

I agree. It's kinda funny -- I started out doing algorithmic trading with zipline/quantopian. Doing this by hand really highlights the difficulties and pitfalls of doing this manually (although pandas helps quite a bit). And this is a super simple strategy!

As to zipline (although you know the answer already ;) ): Not yet, but I think the order execution should be changed to allow for basic testing. I think slippage, commission, execution delay etc are critical components. But in my work-flow I try to get it to work without any complexities to make sure it's doing the right thing first. Then later I add those to see if and where it breaks down. Do others do this similarly?

Yes, when I was trying to reconcile results between a pandas ipython session and a zipline, I turned off all the volume slippage and everything. I still needed to shift series forward in my pandas to align the "trade" dates with when the zipline would end up executing things. I finally got everything to match, but it was a long night.

So many of these academic systems use Market-on-Close and Market-on-Open orders, those will be a real help (if they haven't been implemented already, I haven't been following!). The trick is just to make sure that the event during which one can plan MOO orders has access to just the opening price.

I've run a simple analysis of all the search terms used in the original paper, and find that any 'effect' is simply due to data-mining bias (or 'selection bias'). I should note that while the 'backtest' I use is fairly crude, it is probably only biased in being mildly optimistic (free trades, free shorts, optimal position sizing, trading in an index, etc.). If it were the case that something looked significant, I would want to rerun it in a high-fidelity backtesting framework, but it is not needed in this case.

Excellent analysis! Printing this out for the subway ride home.

Interesting concept, but it does not survive proper statistical analysis and elimination of bias. First, the search term has to make sense. Using "debt" may have provided a good signal for that period of time, but it is skewed by the times and should not be relied upon going forward. I would imagine you'd get similar results using the search terms "Justin Bieber" or "Kim Kardashian".

Search terms like "how to buy stocks", "best brokerage account", "hot penny stocks", or "mortgage my house to buy stocks" would be better correlated to the herd.

Ken Simpson
CrystalBull.com

@shabby shef: Great analysis indeed. I do think the Wikipedia paper is more convincing in this regard, as it's more of a distributional analysis (similar to what you did). I wonder if your analysis would look different for, e.g., the financial terms.

I suspect that the 'Sharpe' reported in the 'Risk Metrics' tab in these conversations is miscomputed by the zipline system. For example, the Sharpe reported in @Dennis C's first backtest above is 8.35, which seems far too high. Poking around the zipline risk module, I do not see the mean of returns computed anywhere, and so it seems that Sharpe is computed by dividing total returns by volatility, instead of mean returns. This is broken. Can anyone confirm this?
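For reference, the textbook calculation I would expect from per-period returns (ignoring the risk-free rate; this is not zipline's code):

import numpy as np

def annualized_sharpe(period_returns, periods_per_year=252):
    # annualized mean of per-period returns over their annualized standard deviation
    r = np.asarray(period_returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)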

When this paper came out I tested it in Excel, which is easy since the data is weekly. I made a couple of modifications in the testing. Most importantly, I found that 'CNBC' was a better search term than 'debt'. My rule was to short the SPY if the current week's search value was higher than the median of the last 11 weeks and to buy the SPY if the value was lower. The purpose of the median is to reduce the impact of extreme values. Even factoring in transaction costs my returns were something like 600+% compared with 40% for the ETF (cumulative returns).

I am uncomfortable with this approach because of quirks I've noticed in the Google Trends data. If you download the data for the same search term a few weeks later it will change. I expected some change due to scaling, which wouldn't be an issue. However it's messier than that. In one download the value may rise from week x to x+1, and in a download a few weeks later it will fall. That doesn't make a lot of sense to me. I'd appreciate any insight you have into the workings of Google Trends.

I apologize for not posting the CSV. This is my first day on Quantopian and I'm just figuring my way around. Also, it appears I'm going to have to learn Python.
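For what it's worth, the rule described above could be sketched in pandas roughly like this (file and column names are placeholders, and I'm assuming the median is taken over the previous 11 weeks):

import pandas as pd

trends = pd.read_csv("cnbc_trends.csv", parse_dates=["date"], index_col="date")["value"]

median_11w = trends.shift(1).rolling(11).median()  # median of the previous 11 weeks

signal = pd.Series(0, index=trends.index)
signal[trends > median_11w] = -1   # short SPY when this week's value is above the median
signal[trends < median_11w] = 1    # long SPY when it is below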

I have noticed such changes in the historical data too. Whatever normalization process they run also adjusts the historical values in some way other than simply scaling the values to 0-100. The documentation page is pretty vague. They only mention that the data is divided by some common variable.

Instead of looking for search terms to apply retrospectively, what if one cross checks the data for "trending" terms on Google with some filtering (eg exclude "kim khardashian"). I know there are companies which do this with twitter and reuters/bloomberg already but don't know about their success rate.

From my experience as a short term (<1 month) trader, information dissemination is an inverse exponential process. Very few know about something, then some people start talking about it, then more and more until it hits the front page of every national newspaper. This dynamic is directly reflected in market volatility leading up to an event or events. I'm speculating it should be possible to determine a range for that inflexion point. Recent examples are Greece, Cyprus and Fukushima.

I am new to the site - just joined yesterday - and am playing around with the Google Trends script to use "recession" as the keyword instead. Not quite sure if I've got the dates right. Thanks!

Excellent posts, thanks for sharing, only just joined Quantopian.

A new analysis of Google Trends based on the Preis paper was uploaded to Arxiv:
http://arxiv.org/abs/1307.4643

Predicting financial markets with Google Trends and not so random keywords

Challet Damien, Bel Hadj Ayed Ahmed

We check the claims that data from Google Trends contain enough data to predict future financial index returns. We first discuss the many subtle (and less subtle) biases that may affect the backtest of a trading strategy, particularly when based on such data. Expectedly, the choice of keywords is crucial: by using an industry-grade backtesting system, we verify that random finance-related keywords do not to contain more exploitable predictive information than random keywords related to illnesses, classic cars and arcade games. We however show that other keywords applied on suitable assets yield robustly profitable strategies, thereby confirming the intuition of Preis et al. (2013)

Hello Thomas!
I have a question: did you test your algorithm with Preis' Google Trends data (I mean the exact same data)? If yes, how did you manage to get it?

Thanks!

Hi Benoit,

Yes, Tobias Preis was kind enough to send the data on which the paper was based. All we had to do was ask. You can find our replication here:
http://nbviewer.ipython.org/github/twiecki/replicate_google_trends/blob/master/goog_repl.ipynb

And the full repo containing the data here: https://github.com/twiecki/replicate_google_trends

Thomas

Thank you very much Thomas! It'll be very helpful for our research.

Benoît

I also updated the repo with a readme now.

In your calculations, did you include some (virtual) broker fees, for instance?

Benoît

The Quantopian version by default simulates things like transaction cost, slippage, and order delay (and the strategy is quite sensitive to that it seems). The IPython replication does not.

Thomas Wiecki, I am a high schooler writing an english paper on algorithmic trading. I have figured that you seem to know a lot about the topic. If you wouldn't mind answering some interview questions, send me an email at [email protected] Thanks

Nathan, Sounds great. Just responded to your email.

@Thomas, in the repo there is some code I can't understand. Could you help me?

data['rolling_mean'] = pd.rolling_mean(data.debt, delta_t).shift(1)  
data['rolling_mean_preis'] = pd.rolling_mean(data.debt_preis, delta_t).shift(1)

data['order'] = 0  
data['order'][data.debt > data.rolling_mean.shift(1)] = -1 # is this a bug for shift .as you already shift  
data['order'][data.debt < data.rolling_mean.shift(1)] = 1 #  
data['order'].ix[:delta_t] = 0

data['order_preis'] = 1  
data['order_preis'][data.debt_preis > data.rolling_mean_preis] = -1 # no shift ?  
data['order_preis'][data.debt_preis < data.rolling_mean_preis] = 1 #  
data['order_preis'].ix[:delta_t] = 0  

My question is: rolling_mean is already shifted when it is created, but you shift it again in this line:
data['order'][data.debt > data.rolling_mean.shift(1)] = -1
and don't shift it in this line:
data['order_preis'][data.debt_preis > data.rolling_mean_preis] = -1

This might be a left-over. I don't think the 'order' signal is used anywhere. Instead 'order_preis' is the actual replication. It should thus be safe to delete the middle code section. But yeah, you are right, it makes no sense to shift it twice!

Feel free to submit a pull request that fixes the issue.

Thomas, thanks for posting this and keeping it updated, especially that follow up paper!

Thomas (or anyone who would know), do you know if Preis' NAV series is available? I'd like to look at it in some way other than via his debt curve.
Benoît

Seen this?:
Do Google Trend Data Contain More Predictability than Price Returns?
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2405804

The csv file was removed so here is an updated version which also uses the original Preis data for the debt search word.

I also refactored the code a bit to clean it up.

There's a follow-up paper by the same group: http://www.pnas.org/content/early/2014/07/23/1324054111.short

Quantifying the semantics of search behavior before stock market moves

Technology is becoming deeply interwoven into the fabric of society. The Internet has become a central source of information for many people when making day-to-day decisions. Here, we present a method to mine the vast data Internet users create when searching for information online, to identify topics of interest before stock market moves. In an analysis of historic data from 2004 until 2012, we draw on records from the search engine Google and online encyclopedia Wikipedia as well as judgments from the service Amazon Mechanical Turk. We find evidence of links between Internet searches relating to politics or business and subsequent stock market moves. In particular, we find that an increase in search volume for these topics tends to precede stock market falls. We suggest that extensions of these analyses could offer insight into large-scale information flow before a range of real-world events.

Here is the backtest result for the first algorithm using Google Trends CSV data for "bankruptcy" instead of "debt". Not quite as good but still exceptional bear market performance and an interesting comparison. One of the few decent keywords I have found so far following the same/slightly modified strategy. (tough finding positive correlations with individual equities/ETFs)

Hunter, thanks for sharing!

It would be interesting to test the political keywords they argue to be predictive in their last paper.

I'm still learning Python, so I'm trying to figure this out: is there a way to use multiple keywords at the same time? You could use both positives and negatives as triggers (like 'bankruptcy' and 'growth').
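One way to sketch that in pandas (hypothetical file names; 'bankruptcy' treated as a bearish keyword and 'growth' as a bullish one, trading only when the two agree):

import pandas as pd

# the two series are assumed to share the same weekly dates
bankruptcy = pd.read_csv("bankruptcy.csv", parse_dates=["date"], index_col="date")["value"]
growth = pd.read_csv("growth.csv", parse_dates=["date"], index_col="date")["value"]

delta_t = 3  # trailing window in weeks

def rising(series, window=delta_t):
    # True where last week's value is above its own trailing average
    return series.shift(1) > series.shift(1).rolling(window).mean()

signal = pd.Series(0, index=bankruptcy.index)
signal[rising(growth) & ~rising(bankruptcy)] = 1    # bullish: growth searches up, bankruptcy searches down
signal[rising(bankruptcy) & ~rising(growth)] = -1   # bearish: the reverse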

Maybe this is just due to a leverage effect: with a smaller order size the performance degrades, and some runs even go into negative territory if the performance is not that strong and you reduce the order size.

IMO a major problem is the data inconsistency of Google Trends. I have found examples (I will document them if you want) where the trend between two dates is reversed (i.e. over one download period the interest for a keyword increases between two dates, while over another it decreases).

This strategy would work way better with absolute values, but Google seems not interested in making them public.

Google search terms is literally the holy grail. Doubtful they ever release the full data set but we can dream.

There's also this recent ECB paper that discusses the predictability of google search terms.

http://www.ecb.europa.eu/pub/pdf/scpsps//ecbsp9.en.pdf?177000b829d4450b007f3d3a612cab18

http://blogs.wsj.com/economics/2015/07/14/social-media-sentiment-presages-market-moves-ecb-paper/

Hi James,

I'm going to read the paper you linked, thank you. My opinion was based on Preis' work. Even if his strategy works well in backtesting, Google always displays its results with a delay of one or two days. For example, let's look at the keyword "debt" on a daily basis: https://www.google.com/trends/explore#q=debt&date=today%201-m&cmpt=q&tz=Etc%2FGMT-2

We can see here that the latest value is Wednesday's. While this has no importance when backtesting (as we're looking at the past, the data is available), my concern is about applying this to real, day-to-day live trading.

Maybe you've heard of a method to solve this problem? In the meantime, I'm going to read the report.

I'm not sure if Preis, with this method, would average out the data and deem it effective: lower search volume means buy and higher volume means sell. I have backtested it after 2011 using an average, unlike the prior version. I'm not sure if Preis would totally agree with this method and deem this strategy effective, since in his example he's not using an average. Which is more effective, what do you think?

I am currently in the process of preparing a paper for publication which utilizes Google Trends to measure the impact generated by international terrorist attacks. I was curious to see what effect my research topic, i.e. the frequency of Google searches related to terrorism, might have on gold stocks. Definitely an interesting result, especially in later years.

@David Gordon
Your leverage needs to be checked, since anything over 3x is unattainable.

Have you tried using random terms, such as puppies, flowers, blue, cards, just to test whether it's not simply related to overall increases in search volume?

@Wiecki I think this is leverage. Can it still outperform without the leverage part? Thanks.. ;)

@chan I'm new to this, so apologies if my questions/comments seem foolish. I ran a minute-mode test of your algorithm, and the results show around a negative-millions-percent return, quite unlike the daily backtest. What creates such a difference between the two tests, and if I use the algorithm in real-time trading, will it trade according to the minute test or the daily one? Thanks.

@Luke it's not my algo, I just cloned it from above. I haven't tried testing it in minute mode though, but it's interesting that you get that kind of result with the same code in the algorithm.

If you download the Google Trends data now and compare it to the one in Wiecki's upload in the backtest, you will notice some numbers have been re-normalized or have small differences here and there.

I am kind of wondering how these values are adjusted.

I get no open position at all on my backtests...
Did anything major change?
Did I forget to add something?

Hi Pedro,

Can you share your backtest so I can take a look?


I am trying to make a simple modification to the code, replacing debt with flu and the S&P with Merck, Sanofi or Pfizer. I think there's a bug, as it keeps buying and selling on the same day. Any help would be appreciated, thanks!

Edit: thanks Jamie, I got it to work. For those interested, there's only a weak signal from buying Merck, Sanofi and Pfizer on flu search increases.

Hi Tim,

It looks like there's multiple order logic in your code between lines 43 and 50. I would expect it to be selling and buying every day! (specifically, line 43 followed by line 48). I would suggest using the order_target() or order_target_percent() functions instead!
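A bare-bones illustration of that suggestion (not the actual algorithm; the symbol and signal here are placeholders, and symbol() / order_target_percent() are Quantopian built-ins):

def initialize(context):
    context.stock = symbol('PFE')  # placeholder security

def handle_data(context, data):
    flu_searches_rising = False  # placeholder for the actual flu-search signal

    if flu_searches_rising:
        order_target_percent(context.stock, 1.0)   # move the position to 100% of the portfolio
    else:
        order_target_percent(context.stock, 0.0)   # flatten the position instead of re-selling every day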

Hi All,

I was just wondering how to modify the algorithms to take into consideration our current cash amount/total current position/maximum allowed leverage and stuff like that.

-Amro

Hi Amro,

You can use attributes of the portfolio object such as context.portfolio.cash or context.portfolio.positions.

You can get your leverage from the account object using context.account.leverage but in order to keep it under a maximum, you will need to include logic in your trades that manually guards against over ordering. For example, you can use order_target_percent() to place orders up to a certain fraction of your portfolio, and use get_open_orders() to ensure that you don't place duplicate orders when fills don't happen instantaneously.
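Roughly, putting those pieces together might look like this (the leverage cap and target weight are arbitrary placeholders; symbol(), order_target_percent(), get_open_orders() and log are Quantopian built-ins):

def initialize(context):
    context.stock = symbol('SPY')
    context.max_leverage = 1.0  # assumed cap

def handle_data(context, data):
    # avoid duplicate orders while earlier ones are still open
    if get_open_orders(context.stock):
        return

    # only add exposure while under the leverage cap
    if context.account.leverage < context.max_leverage:
        order_target_percent(context.stock, 0.75)  # target 75% of portfolio value

    log.info("cash: %s, positions: %s" % (context.portfolio.cash,
                                          len(context.portfolio.positions)))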

I signed up to Quantopian especially to engage in this brilliant discussion.
I have been a fan of the site for months but did not sign up because I'm not good at programming. What I do think I'm good at however is investing.

Now, I like the premise of the studies mentioned, but to be honest I don't care about their outcomes. In other words, it would be more interesting to use this methodology under a different investment philosophy.

That philosophy, or at least mine is based on two notions: 1) Buy and hold is the best strategy for stocks. 2) You should be greedy when others are fearful, and brave when others are greedy. By brave I mean willing to hold on to your stocks even though you know a correction is coming.

Such a philosophy shouldn't be concerned about timing the buy, but timing is what makes you beat or lose to the S&P, not your stock picking ability. If you can time your buys of the S&P 500 right, you will beat all those fund manager hot shots who focus on picking stocks. This is where the Google Trends method comes in handy.

Now the idea is that the public will be keen to know about the stock market when its at an extreme. I just looked up "stock market" trends for the past 12 years and the results are interesting. Unfortunately I don't know how to share that chart so you can either try for yourself or someone can post that chart.

The interest-over-time chart shows sudden spikes of interest in particular months. Because programmers like to quantify things, we will call it a spike when the interest-over-time value exceeds 40. This happened 7 times in the last 12 years:

1) February 2004. 2) January 2008. 3) September-November 2008. 4) March 2009. 5) August 2011. 6) August 2015. 7) January 2016.

With the exception of February 2004, every other incident coincided with either a huge decline in the S&P 500 or a major bottom or both.

There is a simple explanation for this, of course: people care about newsworthy events. So it is only natural that the trend goes up when there is a major event, and that event is more likely to be a crash than a bubble, for one reason:

In the past (at least), news outlets (BBC, CNN, NY Times) treated stock market crashes as top national news, but new stock market highs as business news only, not national news.

The conclusion of this, in my opinion, is to buy stocks and hold them when people are keen to know about the stock market (interest above 40) and remain in cash when they are not. It would be cool if someone ran a backtest to verify this.
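
No one has posted a backtest of this yet, but the rule itself is only a few lines, assuming the weekly "stock market" interest series has been prepared in the same CSV format as the 'debt' file earlier in this thread and loaded with fetch_csv(). The URL and the 'interest' column name below are placeholders; the 40 threshold comes from the post above, not from a tested algorithm.

def initialize(context):
    context.spy = symbol('SPY')
    # Placeholder URL: a CSV of weekly Google Trends values for "stock market",
    # formatted like the 'debt' file used earlier in this thread.
    fetch_csv('https://example.com/stock_market_trends.csv',
              date_column='Week',
              symbol='trends')

def handle_data(context, data):
    interest = data.current('trends', 'interest')
    if interest > 40:
        # Public attention is spiking: be fully invested.
        order_target_percent(context.spy, 1.0)
    else:
        # Otherwise sit in cash.
        order_target_percent(context.spy, 0.0)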

The next step for me is to see whether we can do this on an annual basis, and whether there is a way to select a particular stock or a number of stocks. My premise would be that the stocks people are most worried about are irrationally priced to the downside and should be an even better opportunity than buying the S&P 500 as a whole. I'm not sure how to apply that to Google Trends, however.

This thread seems to have died because of the unreliability of Google Trends data, which is changed and amended after the event. This makes meaningful backtesting impossible.

The concept itself seems to have great promise. But unfortunately the owners of the data (Google) are the only ones able to profit from it, since only they can correct the retrospective changes Thomas and others have discovered.

Or has there been background research not reported on this thread which overcomes this fundamental flaw?

Well, I think it's possible. But as I said, I know very little about programming, so let me explain my reasoning.
The way I see it, the Google Trends graph looks like a stochastic oscillator of search volumes. When the search volume reaches the highest point during the covered time period, that value becomes 100. The remaining points on the graph are rescaled relative to that 100 value. When a new high in searches is hit, a new 100 figure is established, and everything is recalculated from that.

Obviously, backtesting a moving-average strategy would be difficult. However, a breakout strategy should be doable: whenever search interest for "stock market" reaches a new high, buy and hold the S&P 500. And if there is a way to (after the S&P 500 criterion is satisfied) go through each component of the index at the time and see which companies were at their highest share price during the same period, you could then buy those. I think such a breakout strategy can be automated.
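
Offline, the new-high trigger itself is easy to express, assuming the "stock market" series has been exported to a CSV with 'Week' and 'interest' columns (both column names are assumptions about the export format):

import pandas as pd

# Load an exported Google Trends series; the column names are assumed.
trends = pd.read_csv('stock_market_trends.csv', parse_dates=['Week'], index_col='Week')

# A week is a breakout if its value exceeds every earlier value in the export.
prior_max = trends['interest'].cummax().shift(1)
breakouts = trends.index[trends['interest'] > prior_max]

print(breakouts)  # dates on which the rule says "buy and hold the S&P 500"

Picking the index components that were at all-time highs on those dates would need a separate price history for each constituent, which is where a platform like Quantopian is more convenient than a manual check.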

I ran this test manually on the most recent "spike" I referred to in the earlier post. If you had bought the S&P 500 at the beginning of the year, you'd be up 0.41% YTD. If instead you had bought the S&P 500 when Google Trends "spiked" (the open of February 2016), you'd be up 6% as of last Friday's close.

However, if you had bought the select S&P companies that were at an all-time high when Google Trends "spiked" earlier this year (a total of 12 companies), you would be up around 40.7%.

Such a method deserves our best effort at a backtest if we can manage one. Of course, there are downsides to buying select companies this way rather than the S&P index, but I will discuss them in due time.

What is the best way to alter these algorithms now that "@batch_transform" has been discontinued?

Is anyone returning to this thread now that Google have real-time data?

I'm still trying to figure out how to download the data every 15 minutes for a series of 100 search terms. Can anyone help out?

@Jamie Burton

Google Trends has always had real-time data but they expanded the availability recently. I don't think Quantopian includes support for getting Google Trends data in real-time. At least it didn't when I looked into this a while ago.
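
For anyone trying to pull the data outside Quantopian, the unofficial pytrends package is one option (pytrends is not affiliated with Google or Quantopian and its API may change; note that Google accepts at most five terms per request, so 100 terms would need to be split into batches):

import time
from pytrends.request import TrendReq

terms = ['debt', 'stock market', 'flu']   # example batch; at most 5 terms per request

pytrends = TrendReq(hl='en-US', tz=0)

while True:
    # 'now 1-d' asks for the last 24 hours at sub-daily resolution.
    pytrends.build_payload(terms, timeframe='now 1-d')
    df = pytrends.interest_over_time()
    df.to_csv('trends_latest.csv')
    time.sleep(15 * 60)   # refresh every 15 minutes

In practice Google rate-limits aggressive polling, so cycling through 100 terms every 15 minutes may require pauses between batches.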

I am interested in shorting some stocks, mainly ones focused on China (thesis explained: https://forum.basic-capital.com/t/is-china-the-next-big-short-2018/105).

I wonder whether search would give you a sizeable advantage in this case?

Financial news plays an enormous role in dictating stock prices. Large search volume comes from investors looking for more information on current events, and current trends are made so by the news. By the time people begin searching things in high volume, I would imagine the market already knows about the events?

Another thought: I would imagine Google Trends could give you an advantage for a while, until another hedge fund came along. In other words, I think it is a strategy that could be very successful in the short term but might struggle in the long term...

I am wondering how to get the closing prices of the DJIA on the first trading day of every week. I know how to get the daily prices for the entire period, but I don't know how to extract only the first trading day of each week from this. Does anyone know how I could do this in R?

John,

In Q2 it is as simple as:

def initialize(context):
    # Run once per week, shortly before the close of the first trading day of the week.
    schedule_function(get_week_start_closing_prices, date_rules.week_start(), time_rules.market_close())

def get_week_start_closing_prices(context, data):
    # DIA tracks the DJIA; grab its price at the scheduled time.
    price = data.current(symbol('DIA'), 'price')
    record(week_start_closing_prices=price)
    print(price)
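
For working outside Quantopian on an already-downloaded DataFrame of daily closes (the question originally asked about R; here is the pandas equivalent, with the 'date' and 'close' column names assumed):

import pandas as pd

# Daily DJIA prices indexed by trading date; column names are assumed.
daily = pd.read_csv('djia_daily.csv', parse_dates=['date'], index_col='date')

# Default weekly bins run Monday-Sunday, so .first() returns the close of the
# first trading day in each week (the index label is the week-ending Sunday).
weekly_first_close = daily['close'].resample('W').first()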

I found that Google Trends data is inconsistent depending on the timeframe used to retrieve it. Try getting data for overlapping periods and you will see that the proportions within the intersecting area are not constant, even though they should be.

Hi guys. I just joined Quantopian and noticed this brilliant discussion of Preis's paper "Quantifying Trading Behavior in Financial Markets Using Google Trends". However, this thread seems to have been quiet for quite a long time. Is it because of the unreliability of Google Trends data, which is changed and amended after the event? Recently I came across another paper, "Forecasting Stock Market Movements using Google Trend Searches" (https://link.springer.com/article/10.1007/s00181-019-01725-1), which seems to extend Preis's original work. Any thoughts on this paper?

I found that Google Trends data is extremely inconsistent. I tried to fetch a one-year period at daily precision by stitching together overlapping periods and rescaling them - otherwise there is apparently an information leak from the future (you know the series is scaled to its highest spike). So, to me, it's not worth the time.

@Ivan, I agree with you on GT data inconsistency. Is there any way to eliminate or minimize it? For example, let's say we export GT data (say, for "debt") for the period 01/01/2010 - 31/12/2010. If we do the export once a week for ten weeks, we get 10 series of "debt" GT data. Then we take the average, which may be closer to the "true" value. Any thoughts?
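
A quick sketch of that averaging idea, assuming the repeated exports are saved as debt_1.csv through debt_10.csv with 'Week' and 'debt' columns (file and column names are assumptions); each pull is divided by its own mean before averaging, since Google normalizes every export to its own peak:

import pandas as pd

# Load the repeated exports of the same 2010 period (assumed file/column names).
pulls = [pd.read_csv('debt_%d.csv' % i, parse_dates=['Week'], index_col='Week')['debt']
         for i in range(1, 11)]

# Rescale each pull by its own mean so the differing 0-100 scalings line up,
# then average across pulls to smooth out Google's sampling noise.
rescaled = [s / s.mean() for s in pulls]
averaged = pd.concat(rescaled, axis=1).mean(axis=1)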

@David, I don't know; the inconsistencies were huge for the same days when using overlapping periods, so I gave up. At some point I will try to estimate sentiment in Russia using Yandex statistics.

Can we combine something like this with a value-contrarian buy list? The value-contrarian buy list has an upward bias already, and if we trade those stocks well and then combine it with a perceptron or something to that effect, I think we can really get it cooking.