Spark, Bernie Najlis

Introduction

As part of the final exam assignment for my Masters in Data Science course “DS8003 – Management of Big Data Tools”, I created a Big Data TF-IDF index builder and query tool. The tool consists of a script with functions that create a TF-IDF (term frequency-inverse document frequency) index and then use it to return the best matches for a given list of query terms, limited to the number of results requested.
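To make the idea concrete, here is a minimal pure-Python sketch of building a TF-IDF index and querying it. It is not the PySpark implementation described in this post; the function names (build_tfidf_index, query), the whitespace tokenizer and the particular weighting (relative term frequency times the natural log of N over document frequency) are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def build_tfidf_index(docs):
    """Build a TF-IDF index from a dict of doc_id -> text.

    Returns a dict mapping (term, doc_id) -> TF-IDF weight.
    Assumed weighting: tf = count / doc length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    index = {}
    for doc_id, counts in term_counts.items():
        total_terms = sum(counts.values())
        for term, count in counts.items():
            tf = count / total_terms
            idf = math.log(n_docs / doc_freq[term])
            index[(term, doc_id)] = tf * idf
    return index


def query(index, terms, n_results):
    """Return up to n_results (doc_id, score) pairs, best score first.

    A document's score is the sum of its TF-IDF weights for the query terms.
    """
    wanted = {t.lower() for t in terms}
    scores = Counter()
    for (term, doc_id), weight in index.items():
        if term in wanted:
            scores[doc_id] += weight
    return scores.most_common(n_results)


if __name__ == "__main__":
    sample_docs = {
        "a": "spark spark hadoop",
        "b": "hadoop hive",
        "c": "hive pig",
    }
    sample_index = build_tfidf_index(sample_docs)
    print(query(sample_index, ["spark"], 1))
```

With this weighting, a term that appears in every document gets a weight of zero, so it does not help any document outrank another.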

Features Summary

Developed with PySpark, SparkSQL and DataFrames API for maximum compatibility with Spark 2.0
Documents to build the TF-IDF index can be on a local or HDFS path
Index is stored in parquet format in HDFS
Query terms and number of results are specified via command line arguments
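As an illustration of the last feature, the sketch below shows one way to accept query terms and a result count on the command line using Python's standard argparse module. The argument names (terms, -n/--num-results) and the default of 10 are hypothetical; the actual tool's argument format is not shown in this excerpt.

```python
import argparse


def parse_args(argv=None):
    """Parse query terms and the number of results from the command line.

    The argument names and the default value are assumptions for illustration.
    """
    parser = argparse.ArgumentParser(description="Query a TF-IDF index")
    parser.add_argument(
        "terms",
        nargs="+",
        help="one or more query terms",
    )
    parser.add_argument(
        "-n",
        "--num-results",
        type=int,
        default=10,
        help="maximum number of results to return",
    )
    # argv=None makes argparse read from sys.argv[1:].
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.terms, args.num_results)
```

Passing an explicit list to parse_args keeps the parsing logic easy to exercise without launching the script.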
