Big Data, Bernardo Najlis

Implementing a TF-IDF (term frequency-inverse document frequency) index with Python in Spark

Posted on December 11, 2016 by bnajlis

Introduction

As part of the final exam assignment for my Masters in Data Science course “DS8003 – Management of Big Data Tools”, I created a Big Data TF-IDF index builder and query tool. The tool consists of a script with functions to create a TF-IDF (term frequency-inverse document frequency) index, which is then used to return the best matches for a list of query terms, limited to the number of results requested.

Features Summary

Developed with PySpark, SparkSQL and DataFrames API for maximum compatibility with Spark 2.0
Documents to build the TF-IDF index can be on a local or HDFS path
Index is stored in parquet format in HDFS
Query terms and number of results are specified via command line arguments
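To make the idea concrete, here is a minimal sketch of a TF-IDF index and query in plain Python. It is not the PySpark tool described in this post: the function names, the whitespace tokenizer and the weighting (term frequency times log(N / document frequency)) are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_tfidf_index(docs):
    """Build {term: {doc_id: weight}} from {doc_id: text} (hypothetical helper)."""
    # Count how often each term appears in each document.
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }
    # Count in how many documents each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        total_terms = sum(counts.values())
        for term, count in counts.items():
            tf = count / total_terms
            idf = math.log(n_docs / doc_freq[term])
            index[term][doc_id] = tf * idf
    return index


def query(index, terms, n_results):
    """Sum the weights of the query terms per document and return the top matches."""
    scores = Counter()
    for term in terms:
        for doc_id, weight in index.get(term.lower(), {}).items():
            scores[doc_id] += weight
    return scores.most_common(n_results)
```

Calling build_tfidf_index on a small dictionary of documents and then query(index, ["big", "data"], 2) returns up to two (document id, score) pairs, highest score first.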


Analyzing Reddit Public Comments on Azure Data Lake and Azure Data Analytics (Part 1.5)

Posted on August 24, 2016 by bnajlis

In the previous article in this series, I skipped the part where I downloaded the data. At first I used my laptop and a downloader to get the files locally, and then uploaded them to the Azure Data Lake Store folders. An alternative that I wanted to try, and will show you in this post, is downloading the data directly from within an Azure VM to a file share.

You can mount file shares inside Linux VMs, with the only restriction being that the VM has to be within the Azure infrastructure (apparently this limitation exists because mounting an SMB file share in Linux does not support encryption just yet). That’s why we need to spin up an Azure VM to do this; otherwise it would be possible to do it directly from your own laptop (you can do this with a Windows downloader if you mount the Azure File Share in Windows). This way I can download all the files and have the 160GB of data available, with the goal of moving only the required files to the Data Lake Store when needed to run analytics.

Creating the share to store the data

1. Get a connection string to your storage account. This is the simplest way I could find to create services associated with storage through the CLI.

azure storage account connectionstring show [STORAGE_ACCOUNT_NAME]

2. Copy the connection string returned and set it to the AZURE_STORAGE_CONNECTION_STRING environment variable. Don’t forget the double quotes!

export AZURE_STORAGE_CONNECTION_STRING="[CONNECTION_STRING]"

3. Create the file share. You will be able to mount this from the VM you will create right after. By default, this share will have a limit of 5TB, more than enough for the 160GB we will download.

azure storage share create [SHARE_NAME]

Creating an Azure Linux VM using CLI

I’ve been good friends with Ubuntu for quite some time now, so I will create a minimal instance of an Ubuntu Server LTS. I only need to have the VM running while downloading and transferring files into the larger storage.

1. Register the network and compute providers

azure provider register Microsoft.Network
azure provider register Microsoft.Compute

2. Quick create the VM. After several trial and error runs, and reading some hidden documentation, I found the command line option to select the VM size (Basic_A0 is the smallest instance you can get). The command will prompt for the Resource Group Name, Virtual Machine Name, Location Name (has to be the same as the resource group!), Operating System, Username and Password. It will go through several steps (creating a storage account, creating a NIC, creating an IP configuration and public IP) and finally it will create your VM (I really appreciate that I don’t have to go through all those steps myself!).

azure vm quick-create -z Basic_A0 -Q UbuntuLTS

This command will come back with some info (notably the Public IP address and FQDN) that you can use to connect to your VM right away.

3. Connect to your newly minted VM using SSH, and the credentials you entered in the previous step.

4. Install the tools needed to mount the file share, then mount it. I used “data” as my mount point, so I did a mkdir data in my home directory.

sudo apt-get install cifs-utils
sudo mount -t cifs //[ACCOUNT_NAME].file.core.windows.net/[SHARE_NAME] ./[MOUNT_POINT] -o vers=3.0,username=[ACCOUNT_NAME],password=[STORAGE_ACCOUNT_KEY_ENDING_IN_==],dir_mode=0777,file_mode=0777

If you want to check if this is working, you can copy a local file to the mount point and use the Azure Management portal to check if the file was uploaded correctly.

5. Install transmission, get the torrent file and start downloading. The -w option indicates where to download the files; in this case all data goes to the file share (as the VM HDD is just too small).

sudo apt-get install transmission-daemon
sudo /etc/init.d/transmission-daemon start
sudo apt-get install transmission-cli
wget http://academictorrents.com/download/7690f71ea949b868080401c749e878f98de34d3d.torrent
transmission-cli -w ./data 7690f71ea949b868080401c749e878f98de34d3d.torrent

6. Wait patiently for a few hours (around 5-6) until your download completes. The next step would be to set up an Azure Data Factory pipeline to move the data from the File Share to the Data Lake Store.

Analyzing Reddit Public Comments on Azure Data Lake and Azure Data Analytics (Part 1)

Posted on August 23, 2016 by bnajlis

With some free time on my hands in between Coursera courses and classes not starting for the next couple of weeks, I wanted to use some of the new Azure Data Lake services and build a Big Data analytics proof of concept based on a large public dataset. So I decided to create this series of posts to document the experience and see what can be created with them.

To also play with some shiny new tools recently available, I did all of these steps using the new Ubuntu Bash on Windows 10 and the Azure CLI. Now that Bash is available on Windows, I think the Azure CLI is the best tool to use, as scripts created with it can be run on both Windows and Linux without any modifications. (In other multi-platform and OSS news, Microsoft also recently announced the availability of PowerShell on Linux, but I still think that using Bash makes more sense than PowerShell.)

What is Azure Data Lake

Azure Data Lake is a collection of services to help you create your own Data Lake and run analytics on its data. The two services are called “Azure Data Lake Store” and “Azure Data Lake Analytics”. Why would you use this as opposed to creating your own on-premises data lake? Cost is the first reason that comes to mind, as with any cloud-based offering. The smart idea behind these two services is that you can scale storage independently of compute, whereas with an on-prem Hadoop cluster you would be scaling both hand in hand. With Azure Data Lake you can store as much data as you need and only use the analytics engine when required.

To use these services you need an Azure subscription, and you must request access to the preview version of the Azure Data Lake Store and Azure Data Lake Analytics services. The turnaround time to get approved is pretty quick, around an hour or so.

What is the difference between Azure Data Lake Store and Blob Storage

Azure Data Lake Store has some advantages when compared to Blob Storage: it overcomes some of its space limitations and can theoretically scale without limit. You can run Data Lake Analytics jobs using data stored in either Blob Storage or Data Lake Store, but apparently you should get much better performance using Data Lake Store.

Cost is another difference: Blob Storage is cheaper than Data Lake Store.

Summary: Use Blob Storage for large files that you are going to keep for a long time. Copy your files to the Data Lake Store only when you need to run analytics on them.

Data set: Reddit Public comments

I found this very interesting site called Academic Torrents where you can find a list of large public datasets for academic use. The reddit dataset is about 160GB compressed in bz2 files and composed of about 1.7 billion JSON comment objects from reddit.com between October 2007 and May 2015. The great thing about it is that it is split into monthly chunks (one file per month), so you can just download one month of data and start working right away.

To download the contents you can use your downloader of choice. I only downloaded the files for 2007 to run this proof of concept.
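If you want to peek inside one of the monthly files before uploading anything, a short Python script can stream it without expanding it first. This is only a sketch: it assumes each line of the bz2 file holds one JSON object, and the field name passed to count_by_field (for example "subreddit") is a hypothetical key that may not match the actual data. Both function names are made up for this example.

```python
import bz2
import json
from collections import Counter


def iter_comments(path):
    """Yield one dict per non-empty line of a bz2-compressed file.

    Assumes one JSON object per line (an assumption about the file layout).
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_field(path, field):
    """Count comments per value of `field`; comments missing it count under None."""
    return Counter(comment.get(field) for comment in iter_comments(path))
```

Because the file is read line by line, memory use stays small even for a large monthly chunk; for instance, count_by_field("./RC_2007-10.bz2", "subreddit").most_common(10) would list the ten most frequent values of that field, if it exists.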

Setting up the Azure Data Lake Store

To run all these steps you first need to have the Azure CLI available in the Ubuntu Bash.

1. The first step is to install Node.js. You can skip this if Node.js is already installed where you are running these commands.

curl -sL https://deb.nodesource.com/setup_4.x | sudo -E bash -
sudo apt-get install -y nodejs

2. Then you need to download and install the Azure CLI

wget aka.ms/linux-azure-cli -O azure-cli.tar.gz
gzip -d ./azure-cli.tar.gz
sudo npm install -g ./azure-cli.tar

3. Run some validation to see the CLI got installed correctly

azure help
azure --version

4. Now you need to connect the CLI to your subscription, and set it into Resource Manager Mode

azure login
azure config mode arm

5. If you don’t have a resource group, or you want a new one just for this, create it. In this case, it is named dataRG

azure group create -n "dataRG" -l "Canada East"

6. Next, you need to register the Data Lake Store provider with your subscription.

azure provider register Microsoft.DataLakeStore

7. Create an Azure Data Lake store account. Keep in mind the service is only available on the East US 2 region so far. The account name in this case is redditdata

azure datalake store account create redditdata eastus2 dataRG

8. Create a folder. Here, I’m creating a folder “2007” to store the files from that year.

azure datalake store filesystem create redditdata 2007 --folder

9. As the files downloaded are compressed in bz2, first expand them. I only expanded one of them as I may want to try out using Azure Data Factory to do this.

bzip2 -d ./RC_2007-10.bz2

10. Upload files to the Data Lake store folder. In this case the uploads are the expanded file from the previous step and one of the compressed files.

azure datalake store filesystem import redditdata ./RC_2007-10 "/2007/RC_2007-10.json"
azure datalake store filesystem import redditdata ./RC_2007-11.bz2 "/2007/RC_2007-11.bz2"

After all these steps, you should have both files (compressed bzip2 and uncompressed json) uploaded to the Data Lake store.

Setting up Azure Data Lake Analytics

1. Register the Data Lake Analytics provider for your subscription. This is similar to what we did in step 6 of the previous section, but now for Data Lake Analytics. If you already have it enabled, you may not need this step at all.

azure provider register Microsoft.DataLakeAnalytics

2. Create an account. In this case I’m calling it “redditanalytics”, the region is still East US 2, and I’m using the dataRG resource group and the redditdata Data Lake Store, both of them created in the previous steps.

azure datalake analytics account create "redditanalytics" eastus2 dataRG redditdata

Summary

With all these steps we just set the stage to dive deep into doing analytics on the data. That will come in a future post, as I’m currently figuring out how to do it. But so far we proved that using the Azure CLI in the Windows Bash works pretty well, and you can manage most (if not all) of your subscription through it. Azure Data Lake Store seems like a service created exclusively to work paired with Data Lake Analytics, so I still have to see if the value delivered justifies using it.

Lambda architecture: No Silver Bullet

Posted on August 6, 2016 by bnajlis

I’ve been reading a lot of criticism of the lambda architecture lately, and it reminded me a lot of that famous software engineering essay, “No Silver Bullet”. This doesn’t mean the Lambda architecture is not good, just that the existence of an architectural pattern doesn’t mean you have to use it in every single case.

My brief romance with Lambda

I’ve only worked on a couple of small projects related to Big Data / Modern Data Architecture that involved a Lambda architecture. The one that comes to mind was a proof of concept for IoT using the Azure Platform: Azure IoT hub, Azure Stream Analytics jobs, SQL Azure Data Warehouse and Azure ML.

The goal was to capture telemetry data generated by Raspberry Pi devices (using Windows 10 IoT) and send it to an IoT hub. The data was then read by a Stream Analytics job that sent it to the SQL DWH (batch layer), a PowerBI dashboard and also to an Event Hub for later feedback to the Pi (speed layer). This case was particularly simple because the dashboard only had to show data from the speed layer (so there were no joins with the batch layer), and the batch layer only reprocessed the AzureML model on a daily basis. So far so good, lambda was my friend.

But this is a rare case, where the batch and speed layers go separate ways. Usually they are combined at the end to show data in a dashboard, which requires assembling data from both worlds. Plus, the logic in the batch layer has to be rewritten using speed layer tools.
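A toy example shows both pain points: the same counting logic written twice, and a serving step that stitches the two views together. This is a minimal in-memory sketch in Python, not tied to any Azure service; the function names and the "device" field are assumptions made up for illustration.

```python
from collections import Counter


def batch_count(events):
    """Batch layer: recompute message counts per device from the full history."""
    return Counter(event["device"] for event in events)


def speed_update(view, event):
    """Speed layer: the same counting logic, written again as an incremental update."""
    view[event["device"]] = view.get(event["device"], 0) + 1
    return view


def serve(batch_view, speed_view):
    """Serving step: add recent increments on top of the last batch totals."""
    merged = dict(batch_view)
    for device, count in speed_view.items():
        merged[device] = merged.get(device, 0) + count
    return merged
```

Any change to how events are counted has to be made in both batch_count and speed_update, and serve only gives correct totals if the speed view covers exactly the events the batch view has not seen yet.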

What are the alternatives?

If you haven’t heard about the ‘Kappa architecture’, you can take a look here. Basically, it proposes using streaming as the common layer for both the speed and the batch layer.

Another alternative is using micro-batches. This is probably the one I feel more comfortable with, coming from the world of BI, Data warehouses and batch processing.

Another interesting idea is presented here by folks from Uber in an O’Reilly data post: change the Hadoop API to add mini-batches by basically adding just two primitives: Upserts and Incremental Consumption.

As you can see, not everything has to be Lambda-fied. As with all architectural patterns in software engineering, you just have to design the right solution for the problem at hand.
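To get a feel for what those two primitives mean, here is a toy in-memory model in Python. It is not the Hadoop API nor the implementation described in that post; the class name, the method names and the commit counter are assumptions made up for illustration.

```python
class MiniBatchTable:
    """Toy table supporting upserts and incremental consumption (hypothetical)."""

    def __init__(self):
        self._rows = {}    # key -> (commit id that last wrote it, value)
        self._commit = 0   # increases by one on every upsert batch

    def upsert(self, records):
        """Insert new keys or overwrite existing ones; return the new commit id."""
        self._commit += 1
        for key, value in records.items():
            self._rows[key] = (self._commit, value)
        return self._commit

    def changes_since(self, commit_id):
        """Return only the rows written after the given commit id."""
        return {
            key: value
            for key, (commit, value) in self._rows.items()
            if commit > commit_id
        }
```

A downstream job would remember the last commit id it processed and call changes_since with it on the next run, so each mini-batch only touches rows that changed instead of reprocessing the whole table.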

mongoDB – What’s great (and not so great) about it

Posted on August 25, 2014 by bnajlis

mongoDB is a relatively new database management system, one of the prime examples of the NoSQL database movement (if such a thing exists). In NoSQL databases, which can also be referred to as ‘non-relational databases’, you don’t represent data as tables that store rows and their relations. Each NoSQL database has its own particular way of modelling, storing and representing data.

This NoSQL movement is basically promoting the shift of development and logic on database querying and processing out of the database systems (and the SQL language) and into the developer and programming world. I think programmers never liked the SQL language, or never had the time or patience to understand its declarative nature (a declarative language is one where you express the logic of a computation rather than its control flow). There were many attempts to lower the impedance mismatch between those worlds over the years: object-oriented databases, ORMs (object-relational mappers) and even LINQ in the .NET world and its equivalents in other languages and platforms like Java. I think NoSQL is just another attempt at that, but a more specific one: it is targeted at managing huge amounts of data (popularly known as “Big Data”). Summarizing, where in a relational database you would use SQL to pull data out of the database, in the NoSQL world you would use your system’s programming language.

In the case of mongoDB, data is stored in the form of “documents”, which are basically JSON strings, a sort of object serialization. If you are a JavaScript or web developer, you are in luck, because you are already very familiar with JSON and the way it represents information. If not, you will have a slight learning curve, but nothing too steep to be honest.

Another interesting characteristic of mongoDB is schema management: in a relational database, you first model a table, where you specify the columns you will be able to store and their data types. In mongoDB there is no such thing: every data item you store is just a serialization, and it can be completely different from any other stored in the same collection.
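The following Python sketch mimics that behaviour with a plain list standing in for a collection. It does not use a mongoDB driver; the insert and find helpers are hypothetical names, and real mongoDB queries are far richer than this equality match.

```python
import json


def insert(collection, document):
    """Store a document as a JSON string; no schema is checked."""
    collection.append(json.dumps(document))


def find(collection, **criteria):
    """Return documents that contain every given field with an equal value."""
    results = []
    for raw in collection:
        doc = json.loads(raw)
        if all(key in doc and doc[key] == value for key, value in criteria.items()):
            results.append(doc)
    return results


people = []  # stand-in for a collection (an assumption, not a real driver object)
insert(people, {"name": "Ana", "city": "Toronto"})
insert(people, {"name": "Luis", "languages": ["Python", "SQL"], "age": 41})
```

Both documents live in the same list even though they share only the name field; find(people, city="Toronto") simply skips the document that has no city at all.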

I’ve been working with mongoDB for the last couple of months, in an experimental way, but now I’m starting to work on it for a project full time.
I had the chance to compare it (more philosophically) with other database systems I have worked with, and I’ve come to like it to some extent, although it still leaves me with some doubts and wishes in several aspects.

The good

Here are some of the things I really like about mongoDB:

– Free and open source: You can start small for free, and then keep growing as you need more. This model works well for small projects, but you will find costs as you grow: you will want a more robust infrastructure, and mongoDB requires more hardware than other database systems in order to be fault tolerant. You will also want some kind of support from mongoDB, and you will have to pay for it. Finally, open source means you can take a look at the source code, but mongoDB (the company) still owns the product and the project’s destiny.

– Scalable almost to infinity: this is not to say that you will need that, but it is more scalable than traditional relational database systems. With the SQL Servers and Oracles of the world, if you want to scale, you buy a bigger server (more RAM, more HDD, more processing power): this is called scaling vertically. You can see there is a limit to how big your server can be, right? With mongoDB, you get more inexpensive servers and add them to a cluster that behaves as just one big server to the application layer: this is called scaling horizontally. There is virtually no limit to how many servers you can add to a cluster.

– Simple JSON API: This is what makes it so popular. Everybody and their mothers who know how to program in JS can now use a very simple API to access a database.

– Very good documentation: All the information you can need is available at mongodb.org. If you need some hand holding, they even provide online courses at education.mongodb.com
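To illustrate the horizontal scaling point from the list above, here is a toy Python sketch that spreads keys across several servers using a stable hash. It is only meant to show the idea of many small machines sharing one data set; it is not a description of how mongoDB routes data internally, and the function names are made up for this example.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard number using a stable hash of its string form."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def distribute(keys, shard_count):
    """Group keys by the shard they would be stored on."""
    shards = {number: [] for number in range(shard_count)}
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return shards
```

With distribute(range(1000), 4) each of the four lists holds roughly a quarter of the keys. Note that with this naive modulo scheme, changing the number of shards moves most keys to a different shard, which is one reason real systems use more careful partitioning.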

The bad, and the ugly

Things I really don’t like about it:

– Not so great in the enterprise environment: mongoDB (the company) is clearly putting all its effort into pushing this into the Enterprise landscape, with different degrees of success. I’ve seen some really awesome use cases (like implementations of Customer 360 view apps created in record time) but also some very awful implementations.

– JSON: Yep, I think this is their blessing and curse. The fact that everybody can simply use this makes it very easy for anybody with absolutely no understanding of database modelling or theory to make a mess in record time.

– DBA tooling is poor: This is something that has been improving over time. As mongoDB relies heavily on its community to create management / monitoring / optimization tools, there is no clear path or toolset that one can use to work or even develop. Sometimes, too many options can be a problem.

All in all, I would still recommend that you take a look at it, just to get a glimpse of what the non-relational database world looks like. It is always good to broaden one’s horizons.