Starting with PySpark – configuration

PySpark is a pain to configure.

For this guide I am using macOS Mojave, Spark 2.4.0 and Python 3.

Start by downloading Spark from https://spark.apache.org/downloads.html. Extract the archive wherever you like; your home directory is fine, and the configuration further down assumes ~/spark-2.4.0-bin-hadoop2.7.

Install the Java SDK (JDK). Important: some later versions don’t seem to be compatible with Spark 2.4.0. Version 8 seems to work: https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
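Because the Java version matters, it can help to check which one your shell actually picks up. Below is a minimal sketch, not part of Spark: the function names java_major_version and installed_java_major are hypothetical, and the parsing assumes the usual `java -version` banner (version "1.8.0_201" for Java 8, version "11.0.2" for Java 11).

```python
import re
import subprocess


def java_major_version(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x, for example "1.8.0_201".
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Ask the `java` binary on PATH for its version; None if it cannot be determined."""
    try:
        # `java -version` prints its banner to stderr, not stdout.
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    return java_major_version(result.stderr)
```

Calling installed_java_major() should give 8 if the JDK 8 install is the one in use; any other value suggests a different JDK is being picked up.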

Install pyspark: pip install pyspark

Add the following to your ~/.zshrc or ~/.bash_profile, depending on which shell you use:

export SPARK_PATH=~/spark-2.4.0-bin-hadoop2.7
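export PYSPARK_DRIVER_PYTHON="jupyter"
# Tell the driver to run "jupyter notebook"; without this, jupyter is started with no subcommand
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"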

export PYSPARK_PYTHON=python3
# local[2] is quoted so that zsh does not try to expand the brackets as a glob pattern
alias snotebook='$SPARK_PATH/bin/pyspark --master "local[2]"'

export SPARK_HOME=~/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH

export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"

# -v 1.8 selects JDK 8; without it, macOS returns its default JDK, which may be a newer one
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
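A typo in any of these paths tends to surface later as a confusing import error, so a quick check can save time. This is a rough sketch under assumptions: check_spark_env is a hypothetical helper, and it only looks at the layout the exports above rely on (a bin/pyspark launcher and a py4j-*-src.zip under python/lib).

```python
import glob
import os


def check_spark_env(env):
    """Return a list of problems found in the given environment mapping."""
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    spark_home = os.path.expanduser(spark_home)

    problems = []
    if not os.path.isfile(os.path.join(spark_home, "bin", "pyspark")):
        problems.append("no bin/pyspark launcher under SPARK_HOME")

    # Look the py4j archive up instead of hardcoding its version number.
    zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
    python_path = env.get("PYTHONPATH", "").split(os.pathsep)
    if not zips:
        problems.append("no py4j-*-src.zip under SPARK_HOME/python/lib")
    elif not any(z in python_path for z in zips):
        problems.append("PYTHONPATH does not include " + zips[0])
    return problems
```

Running check_spark_env(os.environ) in a fresh terminal should return an empty list when everything lines up. If it reports the py4j archive, the version in your PYTHONPATH line probably does not match the file shipped with your Spark download.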

Remember to reload your shell: open a new terminal window, or source the file you just edited.

Now, when you run pyspark in your terminal, it’ll open a Jupyter notebook.

You can check that a Spark context is available by running this in your new notebook:

from pyspark import SparkContext
sc = SparkContext.getOrCreate()
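Getting a context back only shows that the JVM started. To see whether Spark can actually run a job, you could push a small computation through it. The sketch below is one way to do that; spark_smoke_test is a hypothetical name, and it expects the sc object created above.

```python
def spark_smoke_test(sc, n=100):
    """Sum the squares of 0..n-1 through Spark and compare with the closed form."""
    # Two partitions, to match the local[2] master used in the configuration.
    total = sc.parallelize(range(n), numSlices=2).map(lambda x: x * x).sum()
    expected = (n - 1) * n * (2 * n - 1) // 6
    return total == expected
```

In the notebook, spark_smoke_test(sc) should return True. If it raises an error instead, the Java or Python worker setup is the first thing to look at.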

References: https://medium.com/@yajieli/installing-spark-pyspark-on-mac-and-fix-of-some-common-errors-355a9050f735