Set up a local Spark cluster on Mac and explore the UI for Master, Workers, and Jobs (Python using a Jupyter notebook)

What is covered

  • In this blog, I describe my experiments with setting up a local (standalone) Spark cluster on a Mac M1 machine.
  • Start a Jupyter notebook using Anaconda Navigator.
  • Submit jobs to that master using Python code running in the notebook.
  • Explore the UI for:
    • Master
    • Worker
    • Jobs (transformation and action operations, Directed Acyclic Graph)

Why local (standalone) server

  • A local (standalone) server is a simple deployment model: all daemons run on a single node.
  • Jobs submitted using the pyspark shell were explored with the same UI in the previous blog.
  • Jobs submitted from Python programs are explored in this blog.

Start the master and worker

  • After installing Apache Spark (using Homebrew), scripts for starting the master and worker were available at the following location (the version in the path will match your install):
cd /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/sbin
  • start the master (the log file name in the output is specific to your machine)
% ./start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/logs/<your-local-logfile>.out
  • check the log file for the Spark master details (the log file path is obtained from the output of the previous step)
% tail -10 /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/logs/<your-local-logfile>.out
...
22/05/18 07:41:57 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
22/05/18 07:41:57 INFO Master: Starting Spark master at spark://<your-spark-master>:7077
22/05/18 07:41:57 INFO Master: Running Spark version 3.2.1
22/05/18 07:41:57 INFO Utils: Successfully started service 'MasterUI' on port 8080.
...
  • check the master UI (MasterUI runs on port 8080, per the log above)
  • notice that the worker count is zero, since we have not started any workers yet

[Screenshot: master UI showing zero workers]
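  • besides the browser UI, the master also serves its status as JSON on the same port (path /json); here is a quick sanity check from Python, assuming the master UI is on localhost:8080:

import json
import urllib.request

# Query the standalone master's JSON status endpoint
with urllib.request.urlopen("http://localhost:8080/json/") as resp:
    status = json.load(resp)

print("Workers:", len(status["workers"]))   # 0 at this point; no worker started yet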

  • start the worker by selecting the number of cores and memory based on your system configuration.
  • check the master log file from the previous step and replace <your-spark-master> in the command below with the master host reported there
 % ./start-worker.sh --cores 2 --memory 2G spark://<your-spark-master>:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/logs/<your-local-worker-logfile>.out
  • check the master UI again for worker information (it now shows 1 worker)

[Screenshot: master UI showing one worker]

  • also, check the worker UI for the cores and memory allocated

[Screenshot: worker UI showing 2 cores and 2 GB of memory]

Submit jobs using Python in a Jupyter notebook and observe the jobs UI

  • start a Jupyter notebook using Anaconda Navigator
  • write Python code to submit jobs to the master set up above
  • operations are performed on RDDs (Resilient Distributed Datasets)
  • Spark has two kinds of operations:
    • Transformation → operations such as map, filter, join, or union that are performed on an RDD and yield a new RDD containing the result
    • Action → operations such as reduce, first, or count that return a value after running a computation on an RDD
  • Actions are displayed as jobs in the UI, and transformations can be observed while exploring the DAG (Directed Acyclic Graph) output; a small sketch follows this list
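  • to make the distinction concrete, here is a minimal sketch (assuming the spark session created below already exists): the transformations submit no work to the cluster, only the action does

# Transformations are lazy: these lines submit no work to the cluster
nums = spark.sparkContext.parallelize(range(10))
evens = nums.filter(lambda n: n % 2 == 0)   # transformation
doubled = evens.map(lambda n: n * 2)        # transformation

# The action triggers execution and shows up as a job in the UI
print(doubled.collect())                    # [0, 4, 8, 12, 16]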

  • Jupyter notebook started from Anaconda Navigator:

[Screenshot: Anaconda Navigator launching Jupyter Notebook]

  • create a Spark application in the notebook (replace <your-master> with the master URL from the log, e.g. spark://<your-spark-master>:7077)
import pyspark
from pyspark.sql import SparkSession

# connect to the standalone master started above
spark = SparkSession.builder.master("<your-master>") \
                    .appName('SparkTestPython') \
                    .getOrCreate()

print("First SparkContext:")
print("APP Name :" + spark.sparkContext.appName)
print("Master :" + spark.sparkContext.master)

[Screenshot: notebook output showing the app name and master]
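  • by default, a standalone application requests all available cores on the cluster; if you want to cap what the notebook application takes, the builder accepts config options, as in this sketch (the values are just examples):

spark = (SparkSession.builder.master("<your-master>")
         .appName('SparkTestPython')
         .config("spark.cores.max", "2")          # cap total cores for this application
         .config("spark.executor.memory", "1g")   # memory per executor
         .getOrCreate())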

  • now observe in the master and worker UIs that the application is listed

[Screenshot: master UI listing the running application]

[Screenshot: worker UI listing the application's executor]

  • we will write simple code in the notebook that creates an RDD (parallelize) and runs one action (count)
    a=("hello","world","jupyter","local","mac","m1")
    b=spark.sparkContext.parallelize(a)
    b.count()
    

[Screenshot: notebook output of the count() call]

  • from the master UI, click on the application to open the application page
  • on that page, click on "Application Detail UI" to get details on the jobs

[Screenshot: application detail UI listing the jobs]

  • to see the DAG (Directed Acyclic Graph) visualization, click on the job (under the Description column)
  • this is a simple DAG with a single stage: parallelize followed by count

[Screenshot: DAG visualization for the count job]

  • we will write another piece of code in the notebook that reads a text file into an RDD (textFile), applies one transformation (map), and runs one action (collect); see the note after the code about creating the input file
x = spark.sparkContext.textFile('/tmp/sparktest.txt')   # lazy: builds an RDD of lines
y = x.map(lambda n: n.upper())                          # transformation: uppercase each line
y.collect()                                             # action: returns results to the driver
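  • note that the input file must exist before the job runs; since the master, worker, and driver all run on this one machine, writing it locally from the notebook is enough (the contents are arbitrary):

# Create a small sample input file (any text content works)
with open('/tmp/sparktest.txt', 'w') as f:
    f.write('hello spark\nlocal cluster on mac\n')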

[Screenshot: notebook output of collect() showing the uppercased lines]

  • Check the jobs UI

[Screenshot: jobs UI listing the collect job]

Next steps

  • try out additional Spark transformations and actions, and check further details such as the Storage tab in the UI
  • use DataFrames instead of raw RDDs; a small sketch follows
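  • as a starting point for the DataFrame item above, a minimal sketch (assuming the same spark session; the column names are just examples):

# Build a small DataFrame and run a couple of operations on it
df = spark.createDataFrame(
    [("hello", 1), ("world", 2), ("jupyter", 3)],
    ["word", "n"],
)
df.filter(df.n > 1).show()   # action: triggers a job visible in the jobs UI
print(df.count())            # another action: prints 3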
