This is a project from the Natural Language Processing course in my Masters in Data Science program. The project aimed to create a series of models for extracting named entities (people, locations, organizations, dates) from news headlines obtained online. We created two models: a traditional Natural Language Processing model using Maximum Entropy, and a Deep Neural Network model using pre-trained word embeddings. Both models achieved similar accuracy, but their requirements and limitations differ, which can help determine which type of model is best suited to a specific use case.
The final conclusion is that, because the Deep Learning model depends less on language-specific grammar rules, it is more generalizable (provided embeddings and some labeled corpora are available in the target language), whereas the Maximum Entropy model will perform poorly on a language for which there is no domain knowledge to create the required features.
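To make the point about domain knowledge concrete, here is a minimal sketch of the kind of hand-crafted token features a Maximum Entropy tagger typically relies on. The feature names, the `token_features` helper and the sample headline are hypothetical illustrations, not the features used in the project; note how cues such as capitalization and suffixes assume English-like conventions.

```python
import re


def token_features(tokens, i):
    """Build a feature dict for tokens[i] from simple, English-oriented cues.

    Hypothetical feature set for illustration only.
    """
    word = tokens[i]
    return {
        "lower": word.lower(),
        # Capitalization is a strong cue for names in English headlines,
        # but it carries little signal in languages without letter case.
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        # Four-digit years are a cheap cue for date entities.
        "looks_like_year": bool(re.fullmatch(r"(19|20)\d{2}", word)),
        # Neighbouring words give the classifier some context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


# Made-up headline used only to exercise the function.
headline = "Trudeau visits Washington in 2017".split()
features = [token_features(headline, i) for i in range(len(headline))]
```

Each dict would then be fed to a classifier that predicts one entity label per token. The embedding-based model skips this step, which is why it transfers more easily to languages where nobody has designed such rules.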
This is our deck for the final presentation:
This is our final report / paper with our results and conclusion:
All source code for this project can be found in this GitHub repository: https://github.com/bnajlis/named_entity_recognition
This is a summary presentation about the final group project I worked on during this winter for the Data Mining course in the Masters of Data Science and Analytics program at Ryerson University.
In this project we use daily world news (more specifically, the /r/worldnews subreddit) to try to predict trends (up or down) in the Dow Jones Industrial Average daily prices. The idea for this project is not originally mine: it was first posted as part of a Kaggle dataset, with many kernel submissions. Our project changed a few things:
- Reprocess the data from the source: extract the /r/worldnews posts directly from the complete Reddit dataset, and get the up/down labels from DJIA data from wsj.com
- Change the analytics tool: use KNIME instead of R, Python or the like
- Spend more time on EDA: and even that wasn’t enough; if we had had more time, we might have come to the same conclusion way earlier
Using the complete Reddit dataset available (posts, comments, everything!) to reprocess the data (and arrive at the same data as the Kaggle dataset) was a very interesting exercise: I used Azure HDInsight to quickly create a cluster, and Hive to process and filter the JSON files to extract just the subreddit content. The DJIA data is much smaller (and simpler to manage). The two were then joined to obtain a dataset similar to the one from Kaggle.
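The project did this step in Hive on HDInsight; the same filter-and-join logic can be sketched in plain Python for a small sample. This is only an illustration under assumptions: the field names `subreddit`, `created_utc` and `title`, the helper functions, the rule that a day counts as "up" when the close rose or held versus the previous close, and the toy prices are all mine, not taken from the project.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def headlines_by_day(json_lines, subreddit="worldnews"):
    """Group post titles by UTC date, keeping only the given subreddit."""
    by_day = defaultdict(list)
    for line in json_lines:
        post = json.loads(line)
        if post.get("subreddit") != subreddit:
            continue  # drop everything outside the target subreddit
        created = datetime.fromtimestamp(int(post["created_utc"]), tz=timezone.utc)
        by_day[created.date().isoformat()].append(post["title"])
    return by_day


def label_days(closes):
    """Label each day 1 if its close rose or held vs. the prior close, else 0.

    `closes` is a list of (iso_date, close) pairs sorted by date.
    """
    labels = {}
    for (_, prev_close), (day, close) in zip(closes, closes[1:]):
        labels[day] = 1 if close >= prev_close else 0
    return labels


def join_news_with_labels(by_day, labels):
    """Inner join on date: keep days that have both headlines and a label."""
    return [(day, labels[day], by_day[day]) for day in sorted(labels) if day in by_day]


# Toy input, not real Reddit posts or DJIA prices.
lines = [
    '{"subreddit": "worldnews", "created_utc": 1483401600, "title": "Headline A"}',
    '{"subreddit": "funny", "created_utc": 1483401600, "title": "Not news"}',
    '{"subreddit": "worldnews", "created_utc": 1483488000, "title": "Headline B"}',
]
closes = [("2017-01-02", 100.0), ("2017-01-03", 101.5), ("2017-01-04", 101.0)]

rows = join_news_with_labels(headlines_by_day(lines), label_days(closes))
```

On the full Reddit dump this loop would be far too slow on a single machine, which is the reason a cluster and Hive were a good fit for the real job.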
In a future post, I will share the project report paper we wrote, with our detailed procedure and results.
Two weeks ago I started the second semester of the Masters in Data Science program, and as part of it I am taking a course in Social Media Analytics. The first lab assignment for this course was on January 25, and the objective was to analyze the Bell Let’s Talk social media campaign. Using a proposed tool called Netlytic (a community-supported text and social networks analyzer that automatically summarizes and discovers social networks from online conversations on social media sites), created by the course’s professor, Dr. Anatoliy Gruzd, I downloaded a tiny slice of #BellLetsTalk hashtagged data and created this super simple Power BI dashboard.
I have been wanting to play with Power BI’s Publish to Web functionality for quite some time and thought this was a great chance to put it to good use. The data was exported from Netlytic as three CSV files and then imported into Power BI Desktop. With the desktop tool I created a couple of simple measures (total number of tweets and posts, average number of tweets and posts per minute, and so on) and then some simple visualizations.
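The dashboard measures were built inside Power BI, but the arithmetic behind them is simple enough to sketch in Python. The `summarize` helper and the sample timestamps below are hypothetical; the sketch assumes each exported row carries an ISO-formatted publication time and that "per minute" means the total divided by the time span between the first and last post.

```python
from datetime import datetime


def summarize(timestamps):
    """Return the total number of posts and the average number per minute."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    if not times:
        return {"total": 0, "per_minute": 0.0}
    total = len(times)
    # Span between the first and last post, expressed in minutes.
    span_minutes = (times[-1] - times[0]).total_seconds() / 60
    # Guard against a zero-length span (a single post or identical timestamps).
    per_minute = total / span_minutes if span_minutes > 0 else float(total)
    return {"total": total, "per_minute": per_minute}


# Made-up timestamps, only to exercise the function.
sample = [
    "2017-01-25T12:00:00",
    "2017-01-25T12:00:30",
    "2017-01-25T12:01:00",
    "2017-01-25T12:02:00",
]
summary = summarize(sample)
```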