Named Entity Recognition from Online News

This is a project from the Natural Language Processing course in my Masters in Data Science program. The project aimed to create a series of models for the extraction of Named Entities (People, Locations, Organizations, Dates) from news headlines obtained online. We created two models: a traditional Natural Processing Language Model using Maximum Entropy , and a Deep Neural Network Model using pre-trained word embeddings. Accuracy results of both models show similar performance, but the requirements and limitations of both models are different and can help determine what type of model is best suited for each specific use case.

The final conclusion is that, as the Deep Learning Model is less dependent on specific language grammar rules, it is more generalizable (given embeddings and some labeled corpora is provided in any language) whereas the Maximum Entropy model will perform poorly on an language where there is no Domain Knowledge to create the required features.

This is our deck for the final presentation:

This is our final report / paper with our results and conclusion:

All source code for this project can be found in this GitHub repository:

Implementing a TF-IDF (term frequency-inverse document frequency) index with Python in Spark


As part of the final exam assignment for my Masters in Data Science course “DS8003 – Management of Big Data Tools”, I created a Big Data TF-IDF index builder and query tool. The tool consists a script with functions to create a TF-IDF (term frequency-inverse document frequency) index and it is then used it to return matching queries for a list of terms provided and number of results expected.

Features Summary

  • Developed with PySpark, SparkSQL and DataFrames API for maximum compatibility with Spark 2.0
  • Documents to build the TF-IDF index can be on a local or HDFS path
  • Index is stored in parquet format in HDFS
  • Query terms and number of results are specified via command line arguments/li>

Continue reading “Implementing a TF-IDF (term frequency-inverse document frequency) index with Python in Spark”