News Stock Market Prediction with Reddit

Here is the final report from my last research project. We used headlines from the r/worldnews subreddit to predict the DJIA index trend, combining a variety of technologies such as Azure HDInsight with Hive for Big Data processing in the cloud and KNIME for the advanced and text analytics pipeline. The conclusions were not at all what we expected, and they show the high importance and value of early Exploratory Data Analysis and of making sure you have access to the right data.


Free Azure Machine Learning? Yes, Please!

With Azure Machine Learning being released to General Availability this week (Feb 18th, 2015), some more interesting news has come to light.

There are a couple of (somewhat confusing) options to try and use AzureML. Better to be informed before you jump in and register your account with Azure…

AzureML Free Tier

With GA, Microsoft decided to release a free tier to make it easy for you to try the service. The difference from the classic Azure trial is that you don’t need an Azure account (which requires a valid credit card) for this.

Another difference is what you can do with this type of account: you’re not on a time-limited trial (one month, one year), but you are bound by other types of limitations, such as data storage (10GB), number of modules per experiment (100), maximum experiment duration (1 hour) and performance (throttled).

Still, this is the best option if the only thing you want to do is give AzureML a try, or even use it as a development environment before you move into production.

To use this, just go to https://studio.azureml.net/ and sign in with your Microsoft Account.

Azure Free Trial

This is the classic Azure trial: you will be given 1 month and $200 that you can use to try any Azure service, including AzureML. It requires you to register a new Azure account and enter your credit card information.

AzureML Pay-per-use

After your one-month trial expires, you are billed per use; you can check the current prices here.

Different options for different goals

If you just want to play and try some small experiments: use the Free Tier. Most small experiments will run just fine.

If you are ready to take your experiments to the next level and release to production: start with the Azure one-month trial. After one month, you will be billed at the regular rates.

Azure Machine Learning: Data Mining 2.0

Azure Machine Learning (aka AzureML) is one of the new products/services in the bold new world of ‘cloud first, mobile first’ that Microsoft is pursuing. It helps you create predictive analytics from your data in a very quick and simple way, and easily integrate them with all your applications. And you can do that armed with just your browser!

But I think I’ve heard about this before… Haven’t I?

Remember a couple of years ago, when everything was 2.0? Web 2.0 was the paradigm everyone swore by, adding ‘social’ and ‘services’ around everything we already knew by then.

That is how I feel about Azure Machine Learning: it is a great, improved 2.0 version of the old Data Mining concept we’ve known for years (SQL Server implemented this with its SSAS Data Mining feature). Don’t get me wrong, I’m not saying that because this already existed you should quickly dismiss it. I think Microsoft took a page out of its own book, and put a lot of thought into how to bring that into 2015. And that is great!

Out with the old…

If you remember, Analysis Services Data Mining always had a few types of algorithms you could use:

  • Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset.
  • Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset.
  • Segmentation algorithms divide data into groups, or clusters, of items that have similar properties.
  • Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis.
  • Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow.

To use them you would create a model in SSAS, load data (with help from SSIS) to train the model, and then query it through DMX (Data Mining Extensions) queries. Running DMX queries involved connecting to SSAS using native, Windows-only proprietary drivers and then sending the queries to get back your results.
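To make one of those categories concrete, here is a minimal sketch of the idea behind association rules for market basket analysis, in plain Python. The baskets and function names are invented for illustration; this is not how SSAS implements its algorithm, just the underlying counting.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def pair_support(transactions):
    """Count how many transactions contain each pair of items."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent in t)
    return both / len(with_antecedent)


support = pair_support(baskets)
print(support[("bread", "milk")])            # 2 baskets contain both
print(confidence(baskets, "bread", "milk"))  # 2 of the 3 bread baskets have milk
```

A rule like "people who buy bread also buy milk" is then just a pair with high enough support and confidence.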

… and in with the new!

The principle behind AzureML is pretty much the same. A couple of notable differences here:

– You don’t need SSAS: In fact, you don’t even need SQL Server at all: no database, no SSIS, no SSAS. This is a pure online service, born in and for the cloud. There has been talk about bringing it on-premises, but honestly I don’t think that is going to happen any time soon (and nobody would bat an eye either).

– Data loading and manipulation inside the tool: As mentioned before, you don’t need SSIS. Your experiment designer in AzureML has a workflow view that resembles SSIS, in the sense that you have components to scrub and manipulate data before loading it into your model. One less thing to worry about.

– No DMX or weird query languages to use: As this is a cloud service, the output of your model is a web service. Anybody (with the corresponding API key) can call it and make use of your model. This makes your model available and online-ready in really no time.

– Integration with R: R is ‘THE’ language to create models. In the old world, you could still create your own models using the SSAS Data Mining SDK (using C++ or C#), but they would still have to be compiled into native Windows code, deployed, managed and available only through SSAS. Being able to take any available R algorithm and use it as a component makes this very much open for experimentation.

– One-click deployment to Azure: Deploying your old data mining model used to require creating some kind of component (or service) to wrap the SSAS DMX call. Deploying to the cloud is literally done in one click, and you are ready to go. There’s even boilerplate code provided for you to call the production-ready web service from C#, Python and R.

– Really low barrier to entry: No infrastructure setup, no licensing costs, no development tools setup. The only thing you need to do is register for the AzureML service online and pay for the processing cost when you run your model. That’s it!
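To give an idea of what "the output of your model is a web service" means in practice, here is a rough sketch in Python that builds (but does not send) a request to a scoring endpoint. The URL, the payload shape and the helper name are placeholders I made up; the boilerplate that AzureML generates for your own service is the authoritative version.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, rows):
    """Build (but do not send) a JSON POST request for a scoring web service."""
    body = json.dumps({"rows": rows}).encode("utf-8")  # payload shape is an assumption
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,  # API key passed as a bearer token
        },
        method="POST",
    )


request = build_scoring_request(
    "https://example.invalid/score",  # placeholder endpoint, not a real service
    "YOUR_API_KEY",
    [{"age": 42, "income": 50000}],
)
print(request.get_method(), request.full_url)
# To actually call a real service: urllib.request.urlopen(request)
```

Compare that with the old world: no native driver, no DMX, just an HTTP call that any language can make.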

Summary

AzureML is one of those products (services?) that make me excited about the future of Business Intelligence. It is so easy to set up, work with and deploy that it is kind of a crime not to use it!

Now, this is still a 1.0 version of a product. Features that are still missing:

– Heavy data encryption: Training models often involves highly sensitive / private data. Everybody requires a trusted and heavily encrypted transport for this data. This is where most of the asks are going to come from: people coming from the Enterprise world concerned about their data travelling through public networks.

– Easy model retraining: Model retraining is something that should be done frequently. Once you train your model, you need to keep it up to date to respond to changes in the environment and to potentially decreasing accuracy. There is no easy way to automate this right now.

– More algorithms: This is mitigated by the fact that you can expand almost infinitely by using R, but still this is where most of the growth will come from. Also, Microsoft recently bought Revolution Analytics, so I would expect more algorithms and features to be added.

Your next steps

If you’re interested in using AzureML, register a new account (there’s a 1 month, $200 trial) and just start using it. Some resources you can use to start learning it are:

Books

Predictive Analytics with Microsoft Azure Machine Learning: Build and Deploy Actionable Solutions in Minutes

By: Roger Barga; Valentine Fontama; Wee Hyong Tok
Publisher: Apress
Pub. Date: November 26, 2014
Print ISBN-13: 978-1-484-20445-0
Pages in Print Edition: 188

Videos

– If you only have 5 minutes or less, watch this: Azure ML Overview: a great 5-minute overview of what AzureML is.

https://www.youtube.com/watch?v=uJhVZ58b8Fs&list=PL8nfc9haGeb4SjrnQWPuJsSitvxN9hSdc

– If you have one hour, watch this: Intro to Azure Machine Learning: The full product tour, with demos, from TechEd 2014.

https://www.youtube.com/watch?v=kZ04LnSjWek

If you have more time, you can start watching this YouTube video playlist.

mongoDB – What’s great (and not so great) about it

mongoDB is a relatively new database management system and one of the prime examples of the NoSQL database movement (if such a thing exists). In NoSQL databases, which can also be referred to as ‘non-relational databases’, you don’t represent data as tables that store rows and their relations. Each NoSQL database has its own particular way of modelling, storing and representing data.

This NoSQL movement basically promotes shifting the development and logic of querying and processing out of the database system (and the SQL language) and into the developer and programming world. I think programmers never liked the SQL language, or never had the time or patience to understand its declarative nature (a declarative language is one where you express what result you want rather than the program flow that computes it). There were many attempts over the years to lower the impedance mismatch between those worlds: object-oriented databases, ORMs (object-relational mappers) and even LINQ in the .NET world, with equivalents in other languages and platforms like Java. I think NoSQL is just another attempt at that, but a more specific one: it is targeted specifically at managing huge amounts of data (popularly known as “Big Data”). Summarizing: where in a relational database you would use SQL to pull data out of the database, in the NoSQL world you would use your system’s programming language.
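A small sketch of that contrast, using only Python's standard library. The orders data is invented, and the in-memory SQLite database is just a stand-in for "a relational database"; the second half is plain Python standing in for "your system's programming language", not a real mongoDB query.

```python
import sqlite3

# Hypothetical data, invented for illustration.
orders = [
    {"customer": "ana", "total": 120},
    {"customer": "ben", "total": 40},
    {"customer": "ana", "total": 75},
]

# Relational style: declare the result you want and let the engine work out how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total INTEGER)")
conn.executemany("INSERT INTO orders VALUES (:customer, :total)", orders)
sql_result = [
    row[0]
    for row in conn.execute(
        "SELECT total FROM orders WHERE customer = ? AND total > ? ORDER BY total",
        ("ana", 50),
    )
]
conn.close()

# Application style: the same filter written in the host language.
code_result = sorted(
    o["total"] for o in orders if o["customer"] == "ana" and o["total"] > 50
)

print(sql_result)   # [75, 120]
print(code_result)  # [75, 120]
```

Same answer both ways; the difference is who owns the logic, the database engine or your code.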

In the case of mongoDB, data is stored in the form of “documents”, which are basically JSON strings, a sort of object serialization. If you are a JavaScript or web developer, you are in luck, because you are already very familiar with JSON and the way it represents information. If not, you will have a slight learning curve, but nothing too steep, to be honest.

Another interesting characteristic of mongoDB is schema management: in a relational database, you first model a table, where you specify the columns you will be able to store and their data types. In mongoDB there is no such thing: every data item you store is just a serialization, and it can be completely different from any other item stored in the same collection.
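To picture what "no schema" looks like, here is a tiny sketch where a plain Python list stands in for a collection. The documents and the helper are made up for illustration; a real application would go through a mongoDB driver instead.

```python
import json

# A plain list standing in for a mongoDB collection (illustration only).
collection = []

# Two "documents" with different shapes living side by side.
collection.append({"name": "Ana", "email": "ana@example.com"})
collection.append({
    "name": "Ben",
    "phones": ["555-0100", "555-0101"],
    "address": {"city": "Buenos Aires"},
})


def field_names(documents):
    """Collect every top-level field used by any document."""
    return sorted({key for doc in documents for key in doc})


for doc in collection:
    print(json.dumps(doc, sort_keys=True))

print(field_names(collection))  # ['address', 'email', 'name', 'phones']
```

Nothing stops the second document from having fields the first one lacks, which is exactly the flexibility (and the risk) described above.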

I’ve been working with mongoDB for the last couple of months in an experimental way, but now I’m starting to work with it full time for a project.
I had the chance to compare it (more philosophically) with other database systems I have worked with, and I’ve come to like it to some extent, although it still leaves me with some doubts and wishes in several aspects.

The good

Here are some of the things I really like about mongoDB:

– Free and open source: This means you can start small for free, and then keep growing as you need more. The model works well for small projects, but you will find costs as you grow: you will want a more robust infrastructure, and mongoDB requires more hardware than other database systems in order to be fault tolerant. You will also want some kind of support from mongoDB, and you will have to pay for it. Finally, open source means you can take a look at the source code, but mongoDB (the company) still owns the product and the project’s destiny.

– Scalable almost to infinity: This is not to say that you will need that, but it is more scalable than traditional relational database systems. With the SQL Servers and Oracles of the world, if you want to scale, you buy a bigger server (more RAM, more HDD, more processing power): this is called scaling vertically. You can see there is a limit to how big your server can be, right? With mongoDB, you get more inexpensive machines and add them to a cluster that behaves as just one big server to the application layer: this is called scaling horizontally. There is virtually no limit to how many servers you can add to a cluster.

– Simple JSON API: This is what makes it so popular. Everybody and their mothers who know how to program in JS can now use a very simple API to access a database.

– Very good documentation: All the information you could need is available at mongodb.org. If you need some hand holding, they even provide online courses at education.mongodb.com.
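Going back to the horizontal scaling point above, a toy sketch may help: one common way to spread data over several cheap servers is to hash a key and use the result to pick a server. This is only an illustration of the idea with invented names, not how mongoDB actually routes data inside a cluster.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard index for a key using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribute(keys, shard_count):
    """Group keys into one list per shard."""
    shards = [[] for _ in range(shard_count)]
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return shards


# Hypothetical customer ids spread over three servers.
customer_ids = ["customer-%d" % i for i in range(10)]
for index, keys in enumerate(distribute(customer_ids, 3)):
    print("server", index, keys)
```

To the application it still looks like one big database; each key always lands on the same server for a given cluster size, and growing means adding servers rather than buying a bigger one.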

The bad, and the ugly

Things I really don’t like about it:

– Not so great in the enterprise environment: mongoDB (the company) is clearly putting all its effort into pushing this into the Enterprise landscape, with different degrees of success. I’ve seen some really awesome use cases (like implementations of Customer 360 view apps created in incredible record times) but also some very awful implementations.

– JSON: Yep, I think this is their blessing and their curse. The fact that everybody can simply use it makes it very easy for anybody with absolutely no understanding of database modelling or theory to make a mess of things in record time.

– DBA tooling is poor: although this has been improving over time. As mongoDB relies heavily on its community to create management / monitoring / optimization tools, there is no clear path or toolset that one can use to operate it or even to develop. Sometimes, too many options can be a problem.

All in all, I would still recommend that you take a look at it, just to get a glimpse of what the non-relational database world looks like. It is always good to broaden one’s horizons.

The future of Business Intelligence / Big Data

A lot has been said in the last year about Big Data as the “future” of Business Intelligence, but Big Data is a very weird concept to me.

About Big Data

I understand this idea that we’re accumulating more and more data each year, but still Big Data is an elitist concept to me. How many companies in the world have real big data problems? I’m sure large corporations have been facing this type of challenge more often lately, but I think the real revolution in Business Intelligence is hidden somewhere else…

The future of reporting

People are overwhelmed by the amount of information they receive, and sometimes it can be challenging to understand it. This is one of the most creative solutions I have seen in a long time: AT&T created a video bill for cellphone accounts, so when you get your e-bill you also get a link to a personalized video that explains all the items in your bill, and you can follow it at your own pace.

https://www.youtube.com/watch?v=3Mbkyo_Hz0k

This is not only a very innovative way of presenting information but also a clever strategy to reduce calls to the company from people trying to get explanations about the items in their bills. I am not counting on it to replace regular reports, but it is a great complement to e-billing strategies.

The future of search

We need tools that help people retrieve all the information they have. Big or small data, there is no storage or analysis challenge that can’t be solved today… by engineers! We need to put the data in the hands of other people: marketing, sales, designers, artists. We engineers already know how to throw in a couple of SQL queries and get whatever we need, but it is the people who will NEVER learn SQL or use a simplified query tool that need to start finding real uses for all the data we already have.

This is the future of search:

https://www.facebook.com/video/video.php?v=10200156514653891

Spreadmarts…

Some time ago I read a report by TDWI (The Data Warehousing Institute) where they talk about something that’s very common in most companies: Spreadmarts.

What are spreadmarts? In every kind of organization it is usual to see Excel spreadsheets in wide use. Every manager, every employee selfishly saves his or her own data in spreadsheets that they can edit, change and transform as needed. It is this mix between “spreadsheets” and data marts that gives rise to one of the most widespread problems in small and medium-sized companies: Excel fever.

Every company starts using this tool in its early stages, when people think that “everything can be solved by just using Excel”. This later becomes a “temporary … forever” situation, where growth creates a crisis that leads to change. This wide use of spreadsheets also has many consequences, like “information silos” (where every person or department manages its own data), and terrible outcomes like decisions made using outdated or even wrong data.

Excel gives users an ease and flexibility in managing their data that is very hard to get from other tools. That’s why, when faced with users who want to keep using spreadsheets, it’s so important to put in their hands (especially the “owners” of these information islands) new tools that let them use information more easily. In some cases it might be useful to keep using spreadsheets; maybe they can be generated automatically by some system.

One of the most important things to understand is that there’s no silver bullet, no one-size-fits-all solution for this, just general guidelines, like aiming for information integration inside the company. You know that information is one of your company’s most important assets, and special attention must be paid to its care, avoiding information silos as much as possible. Another guideline I always have in mind is gradual tool changes and migrations. It’s natural that users might feel like they have lost control, but it’s our duty to make them understand that collaborative work and integration lead to higher performance levels. Sometimes everybody has to cede a little bit of personal control for the greater good of the whole organization. And always listen to the user. An imposed solution will never be well received and accepted, and sooner rather than later it will be misused or, even worse, not used at all. And this might even reinforce the uses and practices one was trying to change.

Business Intelligence Presentation (Part 1 of 2)

I have wanted to post this for some days now, and I finally had the time to do it. This is a presentation about Business Intelligence that I created for one of our customers back in Buenos Aires. As the original presentation was in Spanish and I then translated it to English, some words or phrases may sound weird or funny.

Their specific question was: “What is business intelligence and what can we do with it?”. So here is the answer in slideshow format!

This is just the first part; the second part (covering Data Mining) will be posted in a few days…