As part of the final exam assignment for my Master's in Data Science course “DS8003 – Management of Big Data Tools”, I created a Big Data TF-IDF index builder and query tool. The tool consists of a script with functions that create a TF-IDF (term frequency-inverse document frequency) index and then use it to return matching results for a provided list of terms and an expected number of results.
- Developed with PySpark, SparkSQL and DataFrames API for maximum compatibility with Spark 2.0
- Documents to build the TF-IDF index can be on a local or HDFS path
- Index is stored in parquet format in HDFS
- Query terms and number of results are specified via command line arguments
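
To make the idea concrete, here is a minimal pure-Python sketch of building a TF-IDF index and querying it for the top results. It is not the Spark implementation described above: the function names `build_index` and `query`, the whitespace tokenizer, the raw-count term frequency, and the `log(N / df)` weighting are all assumptions chosen for illustration, and the sample documents are made up.

```python
import math
from collections import Counter


def build_index(docs):
    """Map (term, doc_id) -> TF-IDF weight for a dict of doc_id -> text."""
    n_docs = len(docs)
    # Term frequency: raw count of each whitespace-separated, lowercased token.
    tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    index = {}
    for doc_id, counts in tf.items():
        for term, count in counts.items():
            # Assumed weighting: tf * log(N / df), with no smoothing.
            index[(term, doc_id)] = count * math.log(n_docs / df[term])
    return index


def query(index, terms, n):
    """Return the n highest-scoring (doc_id, score) pairs for the query terms."""
    wanted = {t.lower() for t in terms}
    scores = Counter()
    for (term, doc_id), weight in index.items():
        if term in wanted:
            scores[doc_id] += weight
    return scores.most_common(n)


# Hypothetical sample documents, for illustration only.
docs = {
    "a": "spark spark hadoop",
    "b": "hadoop hive",
    "c": "hive pig",
}
index = build_index(docs)
top = query(index, ["spark"], 1)
print(top)
```

A document's score here is simply the sum of the TF-IDF weights of the query terms it contains, so documents that contain more of the rarer query terms rank higher.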