Sriram Chintalapati
Sriram's blog

Set up a local Spark cluster on Mac and explore the UI for Master, Workers, and Jobs (Python using Jupyter notebook)

Sriram Chintalapati
·May 28, 2022·

4 min read

Table of contents

  • What is covered
  • Why local (standalone) server
  • Start the master and worker
  • Submit jobs using Python in Jupyter notebook and observe the jobs UI
  • Next steps

What is covered

  • In this blog, I describe my experiments setting up a local (standalone) Spark cluster on a Mac M1 machine.
  • Start a Jupyter notebook using Anaconda Navigator.
  • Submit jobs from Python code executing in the notebook to the master created above.
  • Explore the UI for:
    • Master
    • Worker
    • Jobs (transformation and action operations, Directed Acyclic Graph)

Why local (standalone) server

  • A local (standalone) server is the simplest deployment model: all the daemons can run on a single node
  • Jobs submitted using the pyspark shell were explored using the same UI in the previous blog
  • Jobs submitted from Python programs are explored in this blog

Start the master and worker

  • After installing Apache Spark with Homebrew, the scripts for starting the master and worker are available at the following location:
cd /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/sbin
  • start the master (note that the log file path in the output will be specific to your machine)
% ./start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/logs/<your-local-logfile>.out
  • check the log file for the Spark master server details (the log file path is obtained from the previous step)
% tail -10 /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/logs/<your-local-logfile>.out
...
22/05/18 07:41:57 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
22/05/18 07:41:57 INFO Master: Starting Spark master at spark://<your-spark-master>:7077
22/05/18 07:41:57 INFO Master: Running Spark version 3.2.1
22/05/18 07:41:57 INFO Utils: Successfully started service 'MasterUI' on port 8080.
...
  • check the master UI (on port 8080, as shown in the log above)
  • notice that the number of workers is zero, since we have not started any yet

[Screenshot: master UI showing zero workers]

  • start the worker by selecting the number of cores and memory based on your system configuration.
  • check the master log file from the previous step and replace <your-spark-master> below with the appropriate value
 % ./start-worker.sh --cores 2 --memory 2G spark://<your-spark-master>:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/logs/<your-local-worker-logfile>.out
  • check the master UI for worker information (now it shows 1 worker)

[Screenshot: master UI showing one worker]

  • also, check the worker UI for the cores and memory allocated

[Screenshot: worker UI showing the allocated cores and memory]

Submit jobs using Python in Jupyter notebook and observe the jobs UI

  • write Python code to submit jobs to the master set up above
  • operations are performed on RDDs (Resilient Distributed Datasets)
  • Spark has two kinds of operations:
    • Transformation → operations such as map, filter, join, or union that are performed on an RDD and yield a new RDD containing the result
    • Action → operations such as reduce, first, or count that return a value after running a computation on an RDD
  • Actions are displayed as jobs in the UI, and transformations can be observed while exploring the DAG (Directed Acyclic Graph) output

  • start Jupyter Notebook from Anaconda Navigator

[Screenshot: Anaconda Navigator launching Jupyter Notebook]

  • create the Spark application in the notebook (replace <your-master> with your master URL from the log, e.g. spark://<your-spark-master>:7077)
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("<your-master>") \
                    .appName('SparkTestPython') \
                    .getOrCreate()

print("First SparkContext:")
print("APP Name :" + spark.sparkContext.appName)
print("Master :" + spark.sparkContext.master)

[Screenshot: notebook cell creating the Spark application]

  • Now observe in the master and worker UIs that the application is listed

[Screenshot: master UI listing the application]

[Screenshot: worker UI listing the application]

  • we will write simple code in the notebook to create an RDD (parallelize) and run one action (count)
a = ("hello", "world", "jupyter", "local", "mac", "m1")
b = spark.sparkContext.parallelize(a)
b.count()

[Screenshot: notebook cell running parallelize and count]

  • From the master, click on the application to check the application UI:
  • In the application UI, click on "Application Detail UI" to get details on the jobs

[Screenshot: Application Detail UI showing the jobs]

  • to see the DAG (Directed Acyclic Graph) visualization, click on the job (under the Description column)
  • this is a simple DAG, with just the parallelize step and one action

[Screenshot: DAG visualization for the job]

  • we will write another simple snippet in the notebook that reads a file into an RDD (textFile), applies one transformation (map), and runs one action (collect)
x = spark.sparkContext.textFile('/tmp/sparktest.txt')
y = x.map(lambda n: n.upper())
y.collect()
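The job will fail if `/tmp/sparktest.txt` does not exist, so here is a quick way to create a small sample file first (the contents below are just an assumption for illustration):

```python
# Create the sample input file used by the textFile example above;
# the contents are arbitrary and only need to be plain text lines
sample_lines = ["hello", "world", "jupyter"]
with open("/tmp/sparktest.txt", "w") as f:
    f.write("\n".join(sample_lines) + "\n")
```

With these contents, `y.collect()` returns the same lines uppercased.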

[Screenshot: notebook cell running textFile, map, and collect]

  • Check the jobs UI

[Screenshot: jobs UI for the collect job]

Next steps

  • try out additional Spark operations and actions, and check additional details such as storage in the UI
  • use DataFrames
