Set up a local Spark cluster on Mac and explore the UI for Master, Workers, and Jobs (pyspark)

What is covered

  • In this blog, I describe my experiments with setting up a local (standalone) Spark cluster on a Mac M1 machine.
  • Then start pyspark with an option to use the master set up in the step above.
  • Explore the UI for:
    • Master
    • Worker
    • Jobs (transformation and action operations, Directed Acyclic Graph)

Why local (standalone) server

  • The local (standalone) server is a simple deployment mode that makes it possible to run all the daemons on a single node.
  • Jobs can be submitted using Python programs and explored using the same UI (this will be covered in a future blog).

Start the master and worker

  • After installing pyspark (using Homebrew), the scripts for starting the master and worker are available at the following location:
cd /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/sbin
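  • a small sketch, in case you prefer not to hardcode the Spark version in the path above: Homebrew can resolve the install prefix for you (this assumes apache-spark was installed via Homebrew)
% brew --prefix apache-spark                       # prints the install prefix for the formula
% cd "$(brew --prefix apache-spark)/libexec/sbin"  # same sbin directory, without the version number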
  • start the master (note that the log file path below will be specific to your machine)
% ./start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/logs/<your-local-logfile>.out
  • check the log file for the Spark master server details (the log file path is obtained from the previous step)
% tail -10 /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/logs/<your-local-logfile>.out
...
22/05/18 07:41:57 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
22/05/18 07:41:57 INFO Master: Starting Spark master at spark://<your-spark-master>:7077
22/05/18 07:41:57 INFO Master: Running Spark version 3.2.1
22/05/18 07:41:57 INFO Utils: Successfully started service 'MasterUI' on port 8080.
...
  • check the master UI, as shown below
  • notice that the number of workers is zero, since we have not yet started any

[Screenshot: Spark master UI showing 0 workers]
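  • the master UI can be opened directly from the terminal; the port comes from the 'MasterUI' line in the log above (a small sketch for macOS)
% open http://localhost:8080   # Spark master web UI (port 8080, per the log output above)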

  • start the worker, selecting the number of cores and memory based on your system configuration
  • check the master log file from the previous step and replace <your-spark-master> below with the appropriate value
 % ./start-worker.sh --cores 2 --memory 2G spark://<your-spark-master>:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/logs/<your-local-worker-logfile>.out
  • check the master UI for worker information (now it shows 1 worker)

[Screenshot: Spark master UI showing 1 registered worker]

  • also, check the worker UI for the cores and memory allocated

[Screenshot: Spark worker UI showing the allocated cores and memory]
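  • as a quick sanity check from the terminal, the two daemons and the worker UI can also be inspected as follows (a sketch; jps needs a JDK on the PATH, and 8081 is only the usual default port for the first worker)
% jps                          # should list a 'Master' and a 'Worker' process for the standalone daemons
% open http://localhost:8081   # worker web UI; check the worker log if a different port was chosen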

Submit jobs using pyspark and observe the Jobs UI

  • we will start pyspark so that it uses the master set up above
  • operations are performed on RDDs (Resilient Distributed Datasets)
  • Spark has two kinds of operations:
    • Transformation → operations such as map, filter, join, or union that are performed on an RDD and yield a new RDD containing the result
    • Action → operations such as reduce, first, or count that return a value after running a computation on an RDD
  • Actions are displayed as jobs in the UI, and transformations can be observed while exploring the DAG (Directed Acyclic Graph) output (a short shell example near the end of this section shows a transformation followed by an action)

  • for pyspark to submit jobs to the master, start it as follows (replace <your-spark-master> with the appropriate value):

    % pyspark --master spark://<your-spark-master>:7077
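
  • optionally, the cores and memory the shell takes from the cluster can be capped when starting it (a sketch; these are standard spark-submit options, and the values here are arbitrary)

    % pyspark --master spark://<your-spark-master>:7077 \
        --total-executor-cores 2 \
        --executor-memory 1G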
    
  • Now observe in the master UI that the application is listed

[Screenshot: Spark master UI with the pyspark application listed]

  • we will write some simple code in the pyspark shell to perform one transformation (parallelize) and one action (count)

    ...
    Using Python version 3.9.7 (default, Sep 16 2021 08:50:36)
    Spark context Web UI available at http://venkatas-mini.lan:4040
    Spark context available as 'sc' (master = spark://Venkatas-Mini.lan:7077, app id = app-20220518222803-0001).
    SparkSession available as 'spark'.
    >>> a=("hello","world","pyspark","local","mac","m1")
    >>> b = sc.parallelize(a)
    >>> b.count()
    6                                                                               
    >>>
    
  • From the master UI, click on the application to check the application UI:

[Screenshot: application UI for the running pyspark application]

  • In the application UI, click on "Application Detail UI" to get details on the jobs

[Screenshot: Application Detail UI listing the jobs for the application]

  • to see the DAG (Directed Acyclic Graph) visualization, click on the job (under the Description column)
  • this is a simple DAG with only one transformation and one action

[Screenshot: DAG visualization for the parallelize + count job]
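
  • to see a slightly richer DAG, the same shell session can be extended with an explicit transformation followed by another action (a small sketch reusing the RDD b created above); the map is lazy, and only the collect action triggers a new job in the Jobs UI

    >>> # transformation: lazily describes a new RDD of (word, length) pairs; no job runs yet
    >>> lengths = b.map(lambda w: (w, len(w)))
    >>> # action: triggers a job on the cluster and returns the results to the driver
    >>> lengths.collect()
    [('hello', 5), ('world', 5), ('pyspark', 7), ('local', 5), ('mac', 3), ('m1', 2)]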

Next steps

  • submit Spark jobs as Python programs and check the UI for master, worker, and jobs (a minimal sketch follows below)
  • try out additional Spark transformations and actions, and check additional details such as Storage in the UI
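
  • as a preview of the first item, a minimal standalone program could look like the sketch below (the file name and app name are made up; the master URL is the same one used for the pyspark shell above)

    # submit_example.py (hypothetical name): a standalone PySpark program that
    # connects to the same standalone master instead of running inside the shell
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("standalone-cluster-example")       # made-up application name
             .master("spark://<your-spark-master>:7077")  # replace with your master URL
             .getOrCreate())

    # same pattern as before: parallelize the data, then run a count action
    rdd = spark.sparkContext.parallelize(["hello", "world", "pyspark"])
    print(rdd.count())

    spark.stop()

  • such a program would typically be launched with spark-submit (for example, % spark-submit submit_example.py), after which it appears as an application in the master UI just like the shell did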

References