@ Adam Blackwell
I don't have a full list, because it changes, but the sources come mostly from Yahoo Finance's news feeds. The main constituents are listed here: http://sentdex.com/how-sentdex-works/ and they are Reuters, Bloomberg, WSJ, LA Times, CNBC, Forbes, Business Insider, and Yahoo Finance. If you go to finance.yahoo.com and look through the sources there, that's where almost everything is coming from.
The closest source to "blog" or social media that Sentdex touches is Seeking Alpha for finance. It's my opinion that social media is rather useless for finance, and my findings from researching it said the same. There was some promise in what I found with something like StockTwits, but nothing that compares to straight finance journalism.
At the moment, almost nothing from the original source is stored, aside from some noun phrases and related words. That is a relatively new development, an attempt to create a spidering algorithm like what Google does with links, only with concepts instead of links. As part of this, I am also storing the source link, mainly so the consumer can follow up on it, though I suppose it could also be used to filter out sources. This isn't tied into the main sentiment analysis database; it's a separate project at the moment. URLs take up a lot of space in a database over time, which is why I have refrained from storing them in the main database. I used to have them, but had no purpose for them, so removing them wound up being a cost decision.
The crawlers run constantly. I have multiple servers, and the main crawler is threaded into about 30 crawlers, each tracking a different number of companies depending on their update volume. For example, one crawler tracks about 30 of the lesser-reported companies at once, while another single crawler is dedicated solely to tracking AAPL. Usually, updates come in within 60 seconds of being posted, but it can take slightly longer for companies tracked by a crawler that handles multiple slower companies. If, say, 10 of them were updated at once, it may take more than 60 seconds, but it's pretty consistent. Some sources might have changed, especially some of the smaller ones, but the main source has remained the Yahoo Finance feed. (This is not to say that Yahoo itself is the main source. Yahoo syndicates financial news reports, so you can go here: http://finance.yahoo.com/q?s=aapl and see "headlines" from various sources for AAPL.) The Sentdex algorithm has remained unchanged since inception.
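To make the idea of splitting companies across crawlers by update volume concrete, here is a minimal sketch. It is not the actual Sentdex code: the function name `assign_crawlers`, the first-fit-decreasing packing strategy, the `capacity` parameter, and the volume numbers and SLOW tickers are all assumptions made up for illustration. The only detail taken from the post is that a very busy ticker such as AAPL ends up with a crawler to itself, while quieter tickers share one.

```python
def assign_crawlers(update_volume, capacity):
    """Pack tickers into crawlers with a first-fit-decreasing heuristic.

    update_volume: dict of ticker -> expected updates per hour (hypothetical numbers)
    capacity: most updates per hour one crawler should handle (hypothetical limit)
    """
    crawlers = []
    # Place the busiest tickers first so they claim a crawler of their own.
    by_volume = sorted(update_volume.items(), key=lambda kv: kv[1], reverse=True)
    for ticker, volume in by_volume:
        for crawler in crawlers:
            if crawler["load"] + volume <= capacity:
                crawler["tickers"].append(ticker)
                crawler["load"] += volume
                break
        else:
            # No existing crawler has room, so start a new one.
            crawlers.append({"load": volume, "tickers": [ticker]})
    return crawlers


# Made-up volumes: one heavily reported ticker and three quiet ones.
example_volume = {"AAPL": 100, "SLOW1": 4, "SLOW2": 3, "SLOW3": 2}
example_crawlers = assign_crawlers(example_volume, capacity=100)
for number, crawler in enumerate(example_crawlers):
    print(number, crawler["tickers"], crawler["load"])
```

With these numbers, AAPL fills a crawler by itself and the three quiet tickers share a second one, which mirrors the split described above.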
The API is a direct connection to the database, so the API should not have any sort of delay.
Geographic sentiment is solely generated with the Twitter API, so this is social media. You could model it if you wanted. You can also search specific keywords on the globe via this link: http://sentdex.com/geographic-sentiment-search/
It will take some time to load that link, which is why it's not really linked to on my site. It's just something I've been developing, but you could use it to search for various trends. For example, weather can affect various options markets, so you could search for people's sentiment for "weather," like this http://sentdex.com/geographic-sentiment-search/?q=weather, or something like that. That's a 1% firehose, however. To get really epic, I'd need something better than that.
For articles, I do not think geolocation of the author is too relevant. For some topics it can be, but I don't think it carries enough weight in finance, so really the geolocation matters most when using social media data.
You can also search for geo data on a specific company, like Google or something: http://sentdex.com/geographic-sentiment-search/?q=google ... but again, I am not sure how valuable this one would be. So far, the best thing I have thought of is related topics like weather sentiment, where negative would be severe heat, droughts, too much rain... whatever... and how it would affect crops in those areas, and then subsequently affect futures and options prices.
You could also track "economy" or something: http://sentdex.com/geographic-sentiment-search/?q=economy
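The search links above all follow the same pattern: the base page plus a `q` parameter. A small sketch of building them in Python follows. The helper name `geo_search_url` is hypothetical, and only the base URL and the `q` parameter come from the links above. Whether the page handles multi-word queries is an assumption I have not verified.

```python
from urllib.parse import urlencode

# Base URL taken from the links in this post.
GEO_SEARCH_URL = "http://sentdex.com/geographic-sentiment-search/"


def geo_search_url(keyword):
    """Build a geographic sentiment search link for one keyword."""
    # urlencode escapes spaces and special characters in the keyword.
    return GEO_SEARCH_URL + "?" + urlencode({"q": keyword})


geo_urls = [geo_search_url(k) for k in ("weather", "google", "economy")]
for url in geo_urls:
    print(url)
```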
If you track a specific term as well, you will get up to 10% of the firehose. There are a lot of singular finance-based keywords one could track. Unfortunately, Sentdex is a 1-man operation, and the budget is not much. Running Twitter data is actually pretty expensive. Their API sucks up a ton of RAM. On my home PC, I have no problem running tons of data, but on a VPS, I always max out RAM before I reach the limit of the free API.
On to politics data, I have exposed all sources there, actually. The source list for politics is fixed, as the amount of politics data gets absurd fast! Again, they are listed here: http://sentdex.com/how-sentdex-works/, and that list is: CNN, Fox, USA Today, ABC, MSNBC, CNBC, CBS, Huffington Post, Yahoo Politics, Washington Post, and Reuters.
Also, the political sentiment is probably the most source-transparent of everything so far. The aforementioned concept crawler was first born here, and some of the fruits are public there. This same update is actually "live" for stocks, forex, and commodities, though the UI/UX for it isn't yet, and the same goes for FX and commodities in general!
So if we go here: http://sentdex.com/political-analysis/ and choose maybe "war", you wind up here: http://sentdex.com/political-analysis/?i=war&tf=30d, which gives you the 30 days of sentiment for "war."
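The political link above carries two query parameters, `i` for the topic and `tf` for the time frame. Here is a short sketch of building such a link and reading the parameters back out. The helper name `political_url` is hypothetical, and `30d` is the only `tf` value shown in this post, so any other time frame would be a guess.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base URL taken from the link in this post.
POLITICAL_URL = "http://sentdex.com/political-analysis/"


def political_url(topic, timeframe="30d"):
    """Build a political analysis link; `i` is the topic, `tf` the time frame."""
    return POLITICAL_URL + "?" + urlencode({"i": topic, "tf": timeframe})


war_url = political_url("war")
# parse_qs returns each parameter as a list of values.
war_params = parse_qs(urlparse(war_url).query)
print(war_url)
print(war_params)
```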
Then, above the graph, you see those words, which are colored and sized. Color is the sentiment, and the size is volume. You can then click on "Charleston" as an example and actually get sources that contributed to the political topic of "war" around the sub topic of "Charleston." These are the direct source links that you can click and visit.
I have kept the "political" sentiment pretty much separate from the finance stuff, and you are also absolutely right in your assumption that it is US-based. Everything can be expanded, but it's all a question of funding. I am super excited about the geographic sentiment, but running it already accounts for about 95% of the costs to run all of Sentdex. In the grand scheme of things, Sentdex is still very cheap for me to run. At the moment, I am trying to get some APIs and data sales out there to fund some more expansions and ideas of mine. The sky is the limit.
As for political data samples, there's one major database, and I usually just give that out when people want to play around. There's a second, growing database for the concepts work, but that area is still in its infancy.
As I was telling Josh and Seong earlier today, the sentiment signals are very new, as new as my account here. Sentdex has historically just been a pet project of mine, mostly to suit my own specific curiosity, without really focusing on what others might actually want. The Sentdex database of raw sentiment has been available for a while, but almost no one could figure out what to do with the raw sentiment numbers, and almost everyone with a connection was a doctoral student doing a dissertation... or at least someone willing to claim as much so I would go easy on my pricing. People kept asking me how to read sentiment, and I've always had my own rules, but converting them to a super simple strategy took some work and understanding of the system.
I had back tested some strategies with the Sentdex algo, and it did well, but my back testing code was never anything to write home about. Then I decided to try out Quantopian, read about the fetcher, and wrote that signals API real quick so I could work with the fetcher. I realized that the sentiment signals API is probably exactly what people have been hoping for all along: something nice and simple that can be plugged in without needing to really come to grips with the Sentdex system at all.
Same thing with the sentiment graphs on site. That used to be all of what sentdex was, just the historical graphs of the MAs applied to raw sentiment... but people don't seem to like them or want that. I am a data fiend at heart, so I like it... but people just want the signals. Some people want historical stuff to test against, but they want nice clean data, never the raw sentiment, except for a few people. Been trying to satisfy that a bit more lately, mostly so I can satisfy my own interests some more with cool new projects with the data.
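For anyone unsure what "MAs applied to raw sentiment" means in practice, here is a minimal simple moving average sketch. It only illustrates the general technique: the function name, the window length of 3, and the sample readings are made up for the example and are not real Sentdex data or the actual Sentdex settings.

```python
def simple_moving_average(values, window):
    """Return the simple moving average of `values` over `window` points."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_total = 0.0
    for i, value in enumerate(values):
        running_total += value
        if i >= window:
            # Drop the reading that just fell out of the window.
            running_total -= values[i - window]
        if i >= window - 1:
            averages.append(running_total / window)
    return averages


# Made-up raw sentiment readings, purely for illustration.
raw_sentiment = [1, 2, 3, 4, 5]
smoothed = simple_moving_average(raw_sentiment, window=3)
print(smoothed)  # [2.0, 3.0, 4.0]
```

The output has `window - 1` fewer points than the input, because the first average is only available once a full window of readings exists.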
Thanks for your interest!