Implementing a TF-IDF (term frequency-inverse document frequency) index with Python in Spark

Introduction

As part of the final exam assignment for my Masters in Data Science course “DS8003 – Management of Big Data Tools”, I created a Big Data TF-IDF index builder and query tool. The tool consists a script with functions to create a TF-IDF (term frequency-inverse document frequency) index and it is then used it to return matching queries for a list of terms provided and number of results expected.

Features Summary

  • Developed with PySpark, SparkSQL and DataFrames API for maximum compatibility with Spark 2.0
  • Documents to build the TF-IDF index can be on a local or HDFS path
  • Index is stored in parquet format in HDFS
  • Query terms and number of results are specified via command line arguments/li>

Continue reading “Implementing a TF-IDF (term frequency-inverse document frequency) index with Python in Spark”