Here is the final report from my last research project. We used headlines from the r/worldnews subreddit to predict the DJIA index trend, combining a variety of technologies: Azure HDInsight with Hive for Big Data processing in the cloud, and KNIME for the advanced and text analytics pipeline. The conclusions are not at all what we expected, and they show the high value of early Exploratory Data Analysis and of making sure you have access to the right data.
This is a summary presentation about the final group project I worked on during this winter for the Data Mining course in the Masters of Data Science and Analytics program at Ryerson University.
In this project we use daily world news (more specifically, the /r/worldnews subreddit) to try to predict trends (up or down) in the Dow Jones Industrial Average daily prices. The idea for this project is not originally mine; it was first posted as part of a Kaggle dataset, with many kernel submissions, and our project changed a couple of things:
- Reprocess the data from the source: Extract the /r/worldnews content directly from the complete reddit dataset, and derive the up/down labels from DJIA data sourced from wsj.com
- Change analytics tool: Use KNIME instead of R, Python or the likes
- Spend some more time on EDA: And it wasn’t even enough; if we had had more time, we might have reached the same conclusion way earlier
Using the complete Reddit dataset available (posts, comments, everything!) to reprocess the data (and arrive at the same data as the Kaggle dataset) was a very interesting exercise: I used Azure HDInsight to rapidly create a cluster and Hive to process and filter the JSON files, extracting just the subreddit content. The DJIA data is much smaller (and simpler to manage), and the two were then joined to obtain a dataset similar to the one from Kaggle.
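As a rough illustration of that labeling-and-joining step, here is a minimal pure-Python sketch (the actual pipeline used Hive and KNIME at scale; the function names, field layout and sample values here are made up for illustration):

```python
# Hypothetical sketch: label each day's headlines with the DJIA trend
# and join them into one (date, text, label) dataset.

def djia_labels(rows):
    """rows: list of (date, close) tuples sorted by date.
    Returns {date: 1 if the close rose vs the previous day, else 0}."""
    labels = {}
    prev = None
    for date, close in rows:
        if prev is not None:
            labels[date] = 1 if close >= prev else 0
        prev = close
    return labels

def join_headlines(headlines, labels):
    """headlines: {date: [headline, ...]} -> list of (date, text, label)."""
    return [(d, " ".join(hs), labels[d])
            for d, hs in sorted(headlines.items()) if d in labels]

# Toy data standing in for wsj.com prices and /r/worldnews headlines
djia = [("2015-03-02", 18288.6), ("2015-03-03", 18203.4), ("2015-03-04", 18096.9)]
news = {"2015-03-03": ["Headline A", "Headline B"],
        "2015-03-04": ["Headline C"]}

dataset = join_headlines(news, djia_labels(djia))
```

The result has one row per trading day, which is the shape the Kaggle dataset uses as well.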
In a future post, I will publish the project report paper we published with our detailed procedure and reports.
Thanks to the Social Media Analytics course I’m taking as part of my Masters in Data Science program, I found a very interesting paper about #FluxFlow that I had to summarize and present.
#FluxFlow is an analytics and data visualization tool that helps identify and understand how ‘anomalous’ information spreads in social media. In the context of social media, “anomalous information” can in most cases be equated to rumors and ‘fake news’. Having a tool like this available to understand how these patterns work can help identify, and take action on, potentially harmful consequences.
The original paper (written by Jian Zhao, Nan Cao, Zhen Wen, Yale Song, Yu-Ru Lin, and Christopher Collins) is available here for you to read, plus a very concise and descriptive video here, and the actual #FluxFlow tool is here for you to see and explore. I created a super simple and brief presentation to summarize the tool and its potential applications to other scenarios.
Two weeks ago I started the second semester of the Masters in Data Science program, and as part of it I am taking a course in Social Media Analytics. The first lab assignment for this course was on January 25, and the objective was to analyze the Bell Let’s Talk social media campaign. Using a proposed tool called Netlytic (a community-supported text and social network analyzer that automatically summarizes and discovers social networks from online conversations on social media sites), created by the course’s professor Dr. Anatoliy Gruzd, I downloaded a tiny slice of #BellLetsTalk hashtagged data and created this super simple Power BI dashboard.
I have been wanting to play with Power BI’s Publish to Web functionality for quite some time and thought this was a great chance to give it a cool use. The data was exported from Netlytic as three CSV files and then imported into Power BI desktop. With the desktop tool I created a couple of simple measures (Total number of tweets and posts, Average number of tweets and posts per minute and so on) and then some simple visualizations.
KNIME is one of the many open source data analytics and blending tools available for free online.
This is a very basic presentation about KNIME I did at one of the labs as part of a Data Mining course in the Masters in Data Science and Analytics program at Ryerson University. The tool is really great and I ended up using it as the main analytics tool to deliver the final project for the same course.
As part of the final exam assignment for my Masters in Data Science course “DS8003 – Management of Big Data Tools”, I created a Big Data TF-IDF index builder and query tool. The tool consists of a script with functions to create a TF-IDF (term frequency-inverse document frequency) index, which is then used to return documents matching a provided list of query terms, up to the number of results expected.
- Developed with PySpark, SparkSQL and DataFrames API for maximum compatibility with Spark 2.0
- Documents to build the TF-IDF index can be on a local or HDFS path
- Index is stored in parquet format in HDFS
- Query terms and number of results are specified via command line arguments
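As a rough sketch of the underlying math (the actual tool uses PySpark, SparkSQL and the DataFrames API; this stdlib-only version just illustrates the scoring, with tf as relative term frequency and idf = log(N/df)):

```python
import math
from collections import Counter

# Toy corpus: document id -> tokenized text (illustrative only)
docs = {"d1": "big data spark".split(),
        "d2": "big data hive".split(),
        "d3": "spark streaming".split()}

n = len(docs)
# Document frequency: in how many documents does each term appear?
df = Counter(t for terms in docs.values() for t in set(terms))

def tfidf(doc_id, term):
    tf = docs[doc_id].count(term) / len(docs[doc_id])
    idf = math.log(n / df[term])
    return tf * idf

def query(terms, k):
    """Score each document as the sum of TF-IDF over the query terms,
    returning the top-k document ids."""
    scores = {d: sum(tfidf(d, t) for t in terms if t in docs[d]) for d in docs}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In the real tool the same idea is expressed as Spark DataFrame transformations and the index is persisted to parquet in HDFS.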
In the previous article in this series, I skipped the part where I downloaded the data. At first I used my laptop and a downloader to get the files locally, which I ended up uploading to the Azure Data Lake Store folders. Another alternative, which I wanted to try and will show you in this post, is downloading the data directly into a file share from an Azure VM.
You can mount file shares inside Linux VMs, with the only restriction that the VM has to be within the Azure infrastructure (apparently this is a limitation caused by the fact that mounting an SMB file share in Linux does not support encryption just yet). That’s the reason we need to spin up an Azure VM to do this; otherwise it would be possible to do it directly from your own laptop (you can do this with a Windows downloader if you mount the Azure File Share in Windows too). In this case I can download all files and have the 160GB of data available, with the goal of moving only the required files to the Data Lake Store when needed to run analytics.
Creating the share to store the data
1. Get a connection string to your storage account. This is the simplest way I could find to create services associated with storage through CLI
azure storage account connectionstring show [STORAGE_ACCOUNT_NAME]
2. Copy the connection string returned and set it to the AZURE_STORAGE_CONNECTION_STRING environment variable. Don’t forget the double quotes!
3. Create the file share. You will be able to mount this from the VM you will create right after. By default, this share will have a limit of 5TB, more than sufficient for the 160GB we will download.
azure storage share create [SHARE_NAME]
Creating an Azure Linux VM using CLI
I’ve been good friends with Ubuntu for quite some time now, so I will create a minimal instance of an Ubuntu Server LTS. I only need to have the VM running while downloading and transferring files into the larger storage.
1. Register the network and compute providers
azure provider register Microsoft.Network
azure provider register Microsoft.Compute
2. Quick create the VM. After several trial and error runs, and reading some hidden documentation, I found the command line option to select the VM size (Basic_A0 is the smallest instance you can get). The command will prompt for the Resource Group Name, Virtual Machine Name, Location Name (it has to be the same as the resource group’s!), Operating System, Username and Password. It will go through several steps (creating a storage account, creating a NIC, creating an IP configuration and a public IP) and finally it will create your VM (I really appreciate that I don’t have to go through all those steps myself!).
azure vm quick-create -z Basic_A0 -Q UbuntuLTS
This command will come back with some info (notably the Public IP address and FQDN) that you can use to connect to your VM right away.
3. Connect to your newly minted VM using SSH, and the credentials you entered in the previous step.
4. Install the tools to mount the file share, then mount it. I used “data” as my mount point, so I ran mkdir data in my home directory first.
sudo apt-get install cifs-utils
sudo mount -t cifs //[ACCOUNT_NAME].file.core.windows.net/[SHARE_NAME] ./[MOUNT_POINT] -o vers=3.0,username=[ACCOUNT_NAME],password=[STORAGE_ACCOUNT_KEY_ENDING_IN_==],dir_mode=0777,file_mode=0777
If you want to check if this is working, you can copy a local file to the mount point and use the Azure Management portal to check if the file was uploaded correctly.
5. Install transmission, get the tracker file and start downloading. The -w option is to indicate where to download the files, in this case all data goes to the file share (as the VM HDD size is just too small).
sudo apt-get install transmission-daemon
sudo /etc/init.d/transmission-daemon start
sudo apt-get install transmission-cli
transmission-cli -w ./data 7690f71ea949b868080401c749e878f98de34d3d.torrent
6. Wait patiently for a couple of hours (around 5 to 6) until your download completes. The next step would be to set up an Azure Data Factory pipeline to move the data from the File Share to the Data Lake Store.
With some free time on my hands in between Coursera courses, and classes not starting for the next couple of weeks, I wanted to use some of the new Azure Data Lake services and build a Big Data analytics proof of concept based on a large public dataset. So I decided to create this series of posts to document the experience and see what can be created with them.
To also play with some shiny new tools recently made available, I did all of these steps using the new Ubuntu Bash on Windows 10 and the Azure CLI. Now that Bash is available on Windows, I think the Azure CLI is the best tool to use, as scripts created with it can run on both Windows and Linux without any modifications. (In other multi-platform and OSS news, Microsoft also recently announced the availability of PowerShell on Linux, but I still think that using bash makes more sense than PowerShell.)
What is Azure Data Lake
Azure Data Lake is a collection of services to help you create your own Data Lake and run analytics on its data. The two services are called “Azure Data Lake Store” and “Azure Data Lake Analytics”. Why would you use this as opposed to creating your own on-premise data lake? Cost is the first reason that comes to mind, as with any cloud-based offering. The smart idea behind these two services is that you can scale storage independently of compute, whereas with an on-prem Hadoop cluster you would be scaling both hand-in-hand. With Azure Data Lake you can store as much data as you need and only use the analytics engine when required.
To use these services you need an Azure subscription and request access to the preview version of the Azure Data Lake Store and Azure Data Lake Analytics services. The turnaround time to get approved is pretty quick, around an hour or so.
What is the difference between Azure Data Lake Store and Blob Storage
Azure Data Lake Store has some advantages over Blob Storage: it overcomes some of its space limitations and can theoretically scale to infinity. You can run Data Lake Analytics jobs using data stored in either Blob Storage or Data Lake Store, but apparently you get much better performance using Data Lake Store.
Cost is another differentiator: Blob Storage is cheaper than Data Lake Store.
Summary: Use Blob Storage for large files that you are going to keep for the long term. Copy your files to the Data Lake Store only when you need to run analytics on them.
Data set: Reddit Public comments
I found this very interesting site called Academic Torrents where you can find a list of large public datasets for academic use. The reddit dataset is about 160GB compressed in bz2 files, composed of about 1.7 billion JSON comment objects from reddit.com between October 2007 and May 2015. The great thing about it is that it is split into monthly chunks (one file per month), so you can download just one month of data and start working right away.
To download the contents you can use your downloader of choice. (I only downloaded the files for year 2007 to run this proof of concept.)
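Each line of these dump files is a single JSON comment object. A minimal sketch of the kind of per-line filtering involved (done with Hive at scale in the HDInsight post above; the fields shown here are illustrative, and the actual schema may vary by month):

```python
import json

# One illustrative line in the shape of the public reddit dump
# (a subset of fields; values are made up).
line = ('{"subreddit": "worldnews", "body": "example comment", '
        '"created_utc": "1192450635", "score": 5}')

def filter_subreddit(lines, name):
    """Keep only the comments belonging to a given subreddit."""
    for raw in lines:
        comment = json.loads(raw)
        if comment.get("subreddit") == name:
            yield comment

worldnews = list(filter_subreddit([line], "worldnews"))
```

The one-object-per-line layout is what makes the dataset easy to process in parallel with Hive or Spark.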
Setting up the Azure Data Lake Store
To run all these steps you first need to have the Azure CLI available in the Ubuntu Bash.
1. First step is to install Node.js. You can skip this if you have node already installed, or you are running this somewhere with Node.js already installed.
curl -sL https://deb.nodesource.com/setup_4.x | sudo -E bash -
sudo apt-get install -y nodejs
2. Then you need to download and install the Azure CLI
wget aka.ms/linux-azure-cli -O azure-cli.tar.gz
gzip -d ./azure-cli.tar.gz
sudo npm install -g ./azure-cli.tar
3. Run some validation to see the CLI got installed correctly
4. Now you need to connect the CLI to your subscription, and set it into Resource Manager Mode
azure config mode arm
5. If you don’t have a resource group, or you want a new one just for this, create it. In this case it is named dataRG
azure group create -n "dataRG" -l "Canada East"
6. Next, you need to register the Data Lake Store and Data Lake Analytics providers with your subscription.
azure provider register Microsoft.DataLakeStore
7. Create an Azure Data Lake store account. Keep in mind the service is only available on the East US 2 region so far. The account name in this case is redditdata
azure datalake store account create redditdata eastus2 dataRG
8. Create a folder. Here, I’m creating a folder “2007” to store the files from that year.
azure datalake store filesystem create redditdata 2007 --folder
9. As the files downloaded are compressed in bz2, first expand them. I only expanded one of them as I may want to try out using Azure Data Factory to do this.
bzip2 -d ./RC_2007-10.bz2
10. Upload files to the Data Lake store folder. In this case the uploads are the expanded file from the previous step and one of the compressed files.
azure datalake store filesystem import redditdata ./RC_2007-10 "/2007/RC_2007-10.json"
azure datalake store filesystem import redditdata ./RC_2007-11.bz2 "/2007/RC_2007-11.bz2"
After all these steps, you should have both files (compressed bzip2 and uncompressed json) uploaded to the Data Lake store.
Setting up Azure Data Lake Analytics
1. Register the Data Lake Analytics provider for your subscription. This is similar to what we did in step 6, but now for Data Lake Analytics. If you already have this enabled, you may not need it at all.
azure provider register Microsoft.DataLakeAnalytics
2. Create an account. In this case I’m calling it “redditanalytics”, the region is still East US 2, and I’m using the dataRG resource group and the redditdata Data Lake Store, both of them created in the previous steps.
azure datalake analytics account create "redditanalytics" eastus2 dataRG redditdata
With all these steps we just set the stage to dive deep into doing analytics on the data. That will come in a future post, as I’m currently figuring out how to do it. But so far we proved that the Azure CLI in the Windows Bash works pretty well, and you can manage most (if not all) of your subscription through it. Azure Data Lake Store seems like a service created to work exclusively paired with Data Lake Analytics, so I still have to see whether the value delivered justifies using it.
I’ve been reading a lot of criticism of the lambda architecture lately, and it reminded me a lot of that famous essay about Software Engineering. This doesn’t mean the Lambda architecture is not good, but rather that just because an architectural pattern exists doesn’t mean you have to use it in every single case.
My brief romance with Lambda
I’ve only worked on a couple of small projects related to Big Data / Modern Data Architecture that involved a Lambda architecture. The one that comes to mind was a proof of concept for IoT using the Azure platform: Azure IoT Hub, Azure Stream Analytics jobs, SQL Azure Data Warehouse and Azure ML.
The goal was to capture telemetry data generated by Raspberry Pi devices (running Windows 10 IoT) and sent to an IoT hub. The data was then read by a Stream Analytics job that sent it to the SQL DWH (batch layer), to a Power BI dashboard, and also to an Event Hub for subsequent feedback to the Pi (speed layer). This case was particularly simple because the dashboard only had to show data from the speed layer (so there were no joins with the batch layer), and the batch layer only reprocessed the AzureML model on a daily basis. So far so good; lambda was my friend.
But this is a rare case where the batch and speed layers go their separate ways. Usually they are combined at the end to show data in a dashboard, thus requiring data to be assembled from both worlds. Plus, the logic in the batch layer has to be rewritten using speed layer tools.
What are the alternatives?
One alternative is using micro-batches. This is probably the one I feel most comfortable with, coming from the world of BI, data warehouses and batch processing.
Another interesting idea is presented by folks from Uber in an O’Reilly Data post: change the Hadoop API to add mini-batches by basically adding just two primitives: upserts and incremental consumption.
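A toy sketch of those two primitives (a dict and a list standing in for distributed storage; this is purely illustrative, not Uber’s actual design):

```python
# Upserts keep one current value per key; incremental consumption lets
# a downstream job read only the changes since its last offset.

class MiniStore:
    def __init__(self):
        self.rows = {}   # key -> current value
        self.log = []    # append-only change log

    def upsert(self, key, value):
        """Insert or update a row, recording the change."""
        self.rows[key] = value
        self.log.append((key, value))

    def consume_incremental(self, offset):
        """Return the changes since `offset`, plus the new offset."""
        return self.log[offset:], len(self.log)

store = MiniStore()
store.upsert("trip:1", {"fare": 12.5})
store.upsert("trip:1", {"fare": 13.0})   # an update, not a duplicate row
changes, offset = store.consume_incremental(0)
```

The point is that a consumer can resume from its saved offset instead of re-reading the whole dataset, which is what makes mini-batches cheap.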
As you can see, not everything has to be Lambda-fied. As with all architectural patterns in software engineering, you just have to match the design to the problem at hand.
With Azure Machine Learning released to General Availability this week (Feb 18th, 2015), more interesting news comes to life.
There are a couple of (somewhat confusing) options to try and use AzureML. Better to be informed before you jump in and register your account with Azure…
AzureML Free Tier
With GA, Microsoft decided to release a free tier to make it easy for you to try the service. The difference from the classic Azure trial is that you don’t need an Azure account for this (which requires a valid credit card).
Another difference is what you can do with this type of account: you’re not on trial time (one month, one year), but bound by other types of limitations, such as data storage (10GB), number of modules per experiment (100), max experiment duration (1 hour) and performance (throttled).
Still, this is the best option if all you want to do is give AzureML a try, or even use it as a development environment before you move into production.
To use this, just go to https://studio.azureml.net/ and sign-in with your Microsoft Account.
Azure Free Trial
This is the classic Azure trial: you are given 1 month and $200 to try any Azure service, including AzureML. It requires you to register a new Azure account and enter your credit card information.
After your one month trial expires, you can check the current prices here.
Different options for different goals
If you just want to play and try some small experiments: use the Free Tier. Most small experiments will run just fine.
If you are ready to take your experiments to the next level and release to production: start with the Azure one month trial. After one month, you will be billed at the regular rates.
I’ve just got a new Dell XPS 13 (2015), and all I can say about it is good things. I’ve been a faithful Mac convert since 2004, but after 10 years, I feel it’s the right time to come back to the PC and Windows.
Even though I always kept working on Windows through all these years, MacOS seemed a more stable and uniform environment; but with Windows 8.1 and the coming Windows 10, I think Microsoft is really coming back. Besides, the quality of Ultrabooks in general now matches (if not surpasses) that of Apple’s.
Image taken from The Verge’s Dell XPS 13 review
Are there other Ultrabook options to consider?
I bought a Yoga 3 Pro earlier this year and ended up returning it after less than a week because of its lousy performance. Don’t get me wrong, I loved the chassis and the design in general, and the 2-in-1 form factor seemed cool at the beginning. But honestly, I couldn’t justify the machine being slow after opening just two or three tabs in IE. Unacceptable.
Now, straight to the Dell XPS 13: this is the Ultrabook to have in 2015. I had been following the XPS 13 for a couple of weeks, and it was nowhere to be found: neither Dell, Microsoft, BestBuy nor any other online retailer had it in stock.
Screen: Touch screen or matte?
The first option you have to deal with is the screen: the touch screen is around $100 more, but the resolution is also awesome: 3200×1800 (even higher than a MacBook Pro Retina Display). The only drawback is the glossiness… I love matte screens; I’m sure I will find a matte screen protector for this.
The brightness at its maximum is really good, and it also has an auto-brightness setting that works pretty well and saves you battery.
Final comment: only go for the non-touch if you must; the real deal is the 3200×1800 QHD touch screen. The resolution is excellent.
i3, i5 or i7?
The processor is the second big decision to make: the i3 is not an option for me (having discarded the Yoga 3 Pro for having a Core M, which is even better than the i3), so the only real options are i5 vs i7. This was a tough call, as I found the i5 reduced by $100, so the gap between the two was $300. That’s too much of a price difference for just a couple more GHz and some cache. Honestly, I don’t think the i7 is worth it, unless you plan to keep your computer for a long time.
SSD Space: 128GB, 256GB or 512GB
128GB is out of the question: you either get 256 or 512. If I had found the 512 in stock I would have bought it, but 256GB was the only thing I could get. Besides, the good news is that (apparently) you can upgrade the storage. If not, you can add more storage via an SD card.
Where to buy? At the Microsoft Store of course!
There are lots of retailers that can sell you this, but your best bet is still the Microsoft Store. Their service is superb, a comparable experience to what you get at an Apple Store. When I was at the store and still undecided between the i5 and the i7, they didn’t try to upsell me straight to the i7, but walked me through the considerations they would have, and ended up recommending the i5. That’s really honest!
Other advantages are:
– Signature Edition PCs: Your Windows is pre-installed by Microsoft and with no manufacturer adware, malware or bloatware. This is excellent, now that we’ve heard what just happened to Lenovo and its infamous Superfish.
– Microsoft Complete for PCs: kind of an extended warranty, but at $129 it definitely makes sense! Apple charges around $300 for the same on their Macs. Whatever problem you have, you can go to the Microsoft Store and they’ll fix it for you. It covers up to two damage incidents during the two year warranty, and they will give you a new PC for just $49. I really hope I don’t have to use it, but you never know…
I’m very happy with the Dell XPS 13, the non-bezel display is gorgeous, the keyboard is very comfortable and the performance of the i5 model is excellent. The portability is very similar to a MacBook Air 11 (and that is not a typo).
Overall, a very minimalistic machine with excellent performance and at a reasonable price.
Azure Machine Learning (aka AzureML) is one of the new products/services in this bold new world of ‘cloud first, mobile first’ that Microsoft is pursuing. It helps you create predictive analytics from your data in a very quick and simple way, and easily integrate them with all your applications. And you can do all that armed with just your browser!
But I think I’ve heard about this before… Haven’t I?
Remember a couple of years ago everything was 2.0? Web 2.0 was the paradigm everyone swore by, adding ‘social’ and ‘services’ around all we already knew by then.
That is how I feel about Azure Machine Learning: it is a great, improved 2.0 version of the old Data Mining concept we’ve known for years (SQL Server implemented this with its SSAS Data Mining feature). Don’t get me wrong, I’m not saying that because this already existed it should be quickly discarded. I think Microsoft took a page out of its own book and put a lot of thought into how to bring it into 2015. And that is great!
Out with the old…
If you remember, Analysis Services Data Mining always had a set of algorithm categories you could use:
- Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset.
- Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset.
- Segmentation algorithms divide data into groups, or clusters, of items that have similar properties.
- Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis.
- Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow.
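To make the association category concrete, here is a toy market-basket support/confidence count (stdlib-only and purely illustrative; this is not how SSAS implements it):

```python
from itertools import combinations
from collections import Counter

# Toy transaction data: each basket is the set of items bought together.
baskets = [{"bread", "milk"},
           {"bread", "butter"},
           {"bread", "milk", "butter"},
           {"milk", "butter"}]

# Support count: how often each pair of items appears together.
pair_support = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_support[pair] += 1

# Confidence of a simple rule, bread -> milk:
# of the baskets containing bread, what fraction also contains milk?
support_bread = sum("bread" in b for b in baskets)
confidence = pair_support[("bread", "milk")] / support_bread
```

Rules whose support and confidence clear chosen thresholds are the ones an association algorithm would surface for the market basket analysis.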
To use them you would create a model in SSAS, load data (with help from SSIS) to train the model, and then use it through DMX (Data Mining eXtensions) queries. Running DMX queries involved connecting to SSAS using native, Windows-only proprietary drivers and then sending the queries to get back your results.
… and in with the new!
The principle behind AzureML is pretty much the same. A couple of notable differences here:
– You don’t need SSAS: in fact, you don’t even need SQL Server at all: no database, no SSIS, no SSAS. This is a pure online service, born in and for the cloud. There has been talk about bringing it on-premise, but honestly I don’t think that is going to happen any time soon (and nobody would blink an eye either).
– Data loading and manipulation inside the tool: as mentioned before, you don’t need SSIS. The experiment designer in AzureML has a workflow view that resembles SSIS in the sense that you have components to scrub and manipulate data before loading it into your model. One less thing to worry about.
– No DMX or weird query languages: as this is a cloud service, the output of your model is a web service. Anybody (with the corresponding API key) can call it and make use of your model. This makes your model available and online-ready in really no time.
– Integration with R: R is ‘THE’ language for creating models. In the old world, you could still create your own models using the SSAS Data Mining SDK (in C++ or C#), but they would have to be compiled into native Windows code, deployed, managed, and made available only through SSAS. Being able to take any available R algorithm and use it as a component makes this very open for experimentation.
– One click deployment to Azure: deploying your old data mining model used to require creating some kind of component (or service) to wrap the SSAS DMX call. Deploying to the cloud is literally done in one click, and you are ready to go. There’s even boilerplate code provided for you to call the production-ready web service from C#, Python and R.
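As a sketch of what such a call involves, here is how a request body for an AzureML request-response service can be assembled in Python (the exact schema, endpoint URL and key come from the boilerplate generated for your model; the names below are placeholders, and no actual network call is made):

```python
import json

API_KEY = "YOUR_API_KEY"  # placeholder: copy the real key from the AzureML portal

def build_request(column_names, rows):
    """Assemble the JSON body and headers for a scoring request."""
    body = {
        "Inputs": {"input1": {"ColumnNames": column_names, "Values": rows}},
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + API_KEY,
    }
    return json.dumps(body), headers

payload, headers = build_request(["age", "income"], [["34", "52000"]])
# `payload` would then be POSTed to the web service URL shown in the portal.
```

The generated boilerplate for C#, Python and R does essentially this, plus the HTTP POST itself.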
– Really low entrance barrier: No infrastructure setup, no licensing costs, no development tools setup. The only thing you need to do is register to the AzureML service online and pay for the processing cost when you run your model. That’s it!
AzureML is one of those products (services?) that make me excited about the future of Business Intelligence. It is so easy to set up, work with and deploy that it is kind of a crime not to use it!
Now, this is still a 1.0 version of a product. Some features are still missing:
– Heavy data encryption: training models often involves highly sensitive / private data. Everybody requires trusted and heavily encrypted transport for this data. This is where most of the asks are going to come from: people from the Enterprise world concerned about their data travelling through public networks.
– Easy model retraining: model retraining is something that should be done frequently. Once you train your model, you need to keep it up to date to respond to environment changes and potentially decreasing accuracy. There is no easy way to automate this right now.
– More algorithms: this is mitigated by the fact that you can expand infinitely using R, but this is still where most of the growth will come from. Also, Microsoft recently bought Revolution Analytics, so I would expect more algorithms and features to be added.
Your next steps
If you’re interested in using AzureML, just register a new account (there’s a 1 month, $200 trial) and start using it. Some resources you can use to start learning are:
– Book: by Roger Barga, Valentine Fontama and Wee Hyong Tok (published November 26, 2014; Print ISBN-13: 978-1-484-20445-0; 188 pages in the print edition).
– If you only have 5 minutes or less, watch Azure ML Overview: a great 5 minute overview of what AzureML is.
– If you have one hour, watch this: Intro to Azure Machine Learning: The full product tour, with demos, from TechEd 2014.
– If you have more time, you can start watching this YouTube video playlist.