
Apache Spark Cluster Setup on NERC OpenStack

Apache Spark Overview

Apache Spark is increasingly recognized as the primary analysis suite for big data, particularly among Python users. Spark offers a robust Python API and includes several valuable built-in libraries, such as MLlib for machine learning and Spark Streaming for real-time analysis. In contrast to Apache Hadoop, Spark performs most computations in main memory, which boosts performance.

Many modern computational tasks utilize the MapReduce parallel paradigm. This computational process comprises two stages: Map and Reduce. Before task execution, all data is distributed across the nodes of the cluster. During the "Map" stage, the master node dispatches the executable task to the other nodes, and each worker processes its respective data. The subsequent "Reduce" stage involves the master node collecting all results from the workers and generating the final results based on the workers' outcomes. Apache Spark also implements this model of computation, which is what gives it its big data processing capabilities.

Apache Spark Cluster Setup

To get a Spark standalone cluster up and running manually, all you need to do is spawn some VMs and start Spark as a master on one of them and as a worker on the others. They will automatically form a cluster that you can connect to from Python, Java, and Scala applications using the IP address of the master VM.

Setup a Master VM

  • To create a master VM for the first time, ensure that the "Image" dropdown option is selected as the boot source. In this example, we selected ubuntu-22.04-x86_64 and used the cpu-su.2 flavor.

  • Make sure you have added rules in the Security Groups to allow SSH access on Port 22 to the instance.

  • Assign a Floating IP to your new instance so that you will be able to ssh into this machine:

    ssh ubuntu@<Floating-IP> -A -i <Path_To_Your_Private_Key>
    

    For example:

    ssh ubuntu@199.94.61.4 -A -i cloud.key
    
  • Upon successfully accessing the machine, run the following commands to install the required dependencies:

    sudo apt-get -y update
    sudo apt install default-jre -y
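
    You can verify that the Java runtime installed correctly before proceeding:

    java -version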
    
  • Download and install Scala:

    wget https://downloads.lightbend.com/scala/2.13.10/scala-2.13.10.deb
    sudo dpkg -i scala-2.13.10.deb
    sudo apt-get install scala
    

    Note

    Installing Scala means installing various command-line tools such as the Scala compiler and build tools.
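
    After installation, you can confirm that the Scala tooling is available:

    scala -version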

  • Download and unpack Apache Spark:

    SPARK_VERSION="3.4.2"
    APACHE_MIRROR="dlcdn.apache.org"
    
    wget https://$APACHE_MIRROR/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop3-scala2.13.tgz
    sudo tar -zxvf spark-$SPARK_VERSION-bin-hadoop3-scala2.13.tgz
    sudo cp -far spark-$SPARK_VERSION-bin-hadoop3-scala2.13 /usr/local/spark
    

    Very Important Note

    Please ensure you are using the latest Spark version by modifying the SPARK_VERSION in the above script. Additionally, verify that the version exists on the APACHE_MIRROR website. Please note the value of SPARK_VERSION as you will need it during Preparing Jobs for Execution and Examination.

  • Create an SSH/RSA Key by running ssh-keygen -t rsa without using any passphrase:

    ssh-keygen -t rsa
    
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/ubuntu/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/ubuntu/.ssh/id_rsa
    Your public key has been saved in /home/ubuntu/.ssh/id_rsa.pub
    The key fingerprint is:
    SHA256:8i/TVSCfrkdV4+Jyqc00RoZZFSHNj8C0QugmBa7RX7U ubuntu@spark-master
    The key's randomart image is:
    +---[RSA 3072]----+
    |      .. ..o..++o|
    |     o  o.. +o.+.|
    |    . +o  .o=+.oo|
    |     +.oo  +o++..|
    |    o EoS  .+oo  |
    |     . o   .+B   |
    |        .. +O .  |
    |        o.o..o   |
    |         o..     |
    +----[SHA256]-----+
    
  • Copy and append the contents of the SSH public key, i.e. ~/.ssh/id_rsa.pub, to the ~/.ssh/authorized_keys file.
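
    One way to do this in a single command from the master VM's terminal:

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys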

Create a Volume Snapshot of the master VM

  • Once you're logged in to NERC's Horizon dashboard, you need to Shut Off the master VM before creating a volume snapshot.

    Click Action -> Shut Off Instance.

    Status will change to Shutoff.

  • Then, create a snapshot of its attached volume by clicking "Create Snapshot" under Project -> Volumes -> Volumes, as described here.

Create Two Worker Instances from the Volume Snapshot

  • Once a snapshot is created and is in "Available" status, you can view and manage it under the Volumes menu in the Horizon dashboard under Volume Snapshots.

    Navigate to Project -> Volumes -> Snapshots.

  • You have the option to directly launch instances from this volume snapshot by clicking on the arrow next to "Create Volume" and selecting "Launch as Instance".

    NOTE: Specify Count: 2 to launch 2 instances using the volume snapshot as shown below:

    Launch 2 Workers From Volume Snapshot

    Naming, Security Group and Flavor for Worker Nodes

    You can specify the "Instance Name" as "spark-worker", and for each instance, it will automatically append an incremental value at the end, such as spark-worker-1 and spark-worker-2. Also, make sure you have attached a Security Group that allows SSH access on Port 22 to the worker instances.

Additionally, during launch, you will have the option to choose your preferred flavor for the worker nodes, which can differ from the master VM based on your computational requirements.

  • Navigate to Project -> Compute -> Instances.

  • Restart the shut-off master VM by clicking Action -> Start Instance.

  • The final setup of our Spark cluster looks like this, with 1 master node and 2 worker nodes:

    Spark Cluster VMs

Configure Spark on the Master VM

  • SSH login into the master VM again.

  • Update the /etc/hosts file to specify all three hostnames with their corresponding internal IP addresses.

    sudo nano /etc/hosts
    

    Ensure all hosts are resolvable by adding them to /etc/hosts. You can modify the following content, specifying each VM's internal IP address, and paste the updated content at the end of the /etc/hosts file. Alternatively, you can append the content directly from the command line by piping it through sudo tee -a /etc/hosts, as shown in the sketch at the end of this step. (Note that sudo cat >> /etc/hosts will not work, because the shell performs the redirection without root privileges.)

    <Master-Internal-IP> master
    <Worker1-Internal-IP> worker1
    <Worker2-Internal-IP> worker2
    

    Very Important Note

    Make sure to append the new content to the end of the file (e.g., with tee -a or the >> redirection operator) rather than overwrite the existing content with > or plain tee.

    For example, the end of the /etc/hosts file looks like this:

    sudo cat /etc/hosts
    ...
    192.168.0.46 master
    192.168.0.26 worker1
    192.168.0.136 worker2
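
    If you prefer to append these entries non-interactively, here is a minimal sketch using sudo tee -a (substitute your VMs' internal IP addresses):

    echo "<Master-Internal-IP> master" | sudo tee -a /etc/hosts
    echo "<Worker1-Internal-IP> worker1" | sudo tee -a /etc/hosts
    echo "<Worker2-Internal-IP> worker2" | sudo tee -a /etc/hosts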
    
  • Verify that you can SSH into both worker nodes by using ssh worker1 and ssh worker2 from the Spark master node's terminal.
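
    For example, you can run the hostname command on each worker remotely (you may be prompted to accept each host key on the first connection):

    ssh worker1 hostname
    ssh worker2 hostname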

  • Copy the sample configuration file for Spark:

    cd /usr/local/spark/conf/
    cp spark-env.sh.template spark-env.sh
    
  • Update the environment variables file, i.e. spark-env.sh, to include the following information:

    export SPARK_MASTER_HOST='<Master-Internal-IP>'
    export JAVA_HOME=<Path_of_JAVA_installation>
    

    Environment Variables

    Executing the command readlink -f $(which java) will display the path to the current Java installation on your VM, for example /usr/lib/jvm/java-11-openjdk-amd64/bin/java. Remove the trailing bin/java part, i.e. use /usr/lib/jvm/java-11-openjdk-amd64, as the value of the JAVA_HOME environment variable (or derive it automatically, as shown in the snippet after the example below). Learn more about other Spark settings that can be configured through environment variables here.

    For example:

    echo "export SPARK_MASTER_HOST='192.168.0.46'" >> spark-env.sh
    echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> spark-env.sh
    
  • Source the updated environment variables file, i.e. spark-env.sh:

    source spark-env.sh
    
  • Create a file named slaves in the Spark configuration directory (i.e., /usr/local/spark/conf/) that lists all 3 hostnames (nodes) as specified in /etc/hosts; once created, it should look like this:

    sudo cat slaves
    master
    worker1
    worker2
    

Run the Spark cluster from the Master VM

  • SSH into the master VM again if you are not already logged in.

  • You need to run the Spark cluster from /usr/local/spark:

    cd /usr/local/spark
    
    # Start all hosts (nodes) including master and workers
    ./sbin/start-all.sh
    

    How to Stop the Spark Cluster

    To stop all of the Spark cluster nodes, execute the ./sbin/stop-all.sh command from /usr/local/spark.
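
    After running ./sbin/start-all.sh, you can confirm that the master and workers came up by checking the daemon logs (a sketch; the exact file names include your username and hostname, and log messages can vary by Spark version):

    ls /usr/local/spark/logs/
    grep -i "registering worker" /usr/local/spark/logs/*Master*.out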

Connect to the Spark WebUI

Apache Spark provides a suite of web user interfaces (WebUIs) that you can use to monitor the status and resource consumption of your Spark cluster.

Different types of Spark Web UI

Apache Spark provides different web UIs: Master web UI, Worker web UI, and Application web UI.

  • You can connect to the Master web UI using SSH Port Forwarding, aka SSH Tunneling (i.e. Local Port Forwarding), from your local machine's terminal by running:

    ssh -N -L <Your_Preferred_Port>:localhost:8080 <User>@<Floating-IP> -i <Path_To_Your_Private_Key>
    

    Here, <Your_Preferred_Port> can be any port that is available on your local machine, <Floating-IP> is the Floating IP assigned to the master VM, and <Path_To_Your_Private_Key> is the path to the private key of the key pair attached to the VM.

    For example:

    ssh -N -L 8080:localhost:8080 ubuntu@199.94.61.4 -i ~/.ssh/cloud.key
    
  • Once the SSH tunnel is established, do not close or stop the terminal where it is running. Instead, open the Master web UI in your web browser at http://localhost:<Your_Preferred_Port>, i.e. http://localhost:8080.
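
    If you prefer the command line, you can also query the cluster state through the same tunnel; this assumes the Master web UI's /json/ endpoint is available in your Spark version:

    curl http://localhost:8080/json/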

The Master web UI offers an overview of the Spark cluster, showcasing the following details:

  • Master URL and REST URL
  • Available CPUs and memory for the Spark cluster
  • Status and allocated resources for each worker
  • Details on active and completed applications, including their status, resources, and duration
  • Details on active and completed drivers, including their status and resources

The Master web UI appears as shown below when you navigate to http://localhost:<Your_Preferred_Port> i.e. http://localhost:8080 from your web browser:

The Master web UI

The Master web UI also provides an overview of the applications. Through the Master web UI, you can easily identify the allocated vCPU (Core) and memory resources for both the Spark cluster and individual applications.

Preparing Jobs for Execution and Examination

  • To run jobs from /usr/local/spark, execute the following commands:

    cd /usr/local/spark
    SPARK_VERSION="3.4.2"
    

    Very Important Note

    Please ensure you are using the same Spark version that you have downloaded and installed previously as the value of SPARK_VERSION in the above script.

  • Single Node Job:

    Let's quickly run a simple job:

    ./bin/spark-submit --driver-memory 2g --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.13-$SPARK_VERSION.jar 50
    
  • Cluster Mode Job:

    Let's submit a longer and more complex job with many tasks that will be distributed across the multi-node cluster, and then view the Master web UI:

    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 examples/jars/spark-examples_2.13-$SPARK_VERSION.jar 1000
    

    While the job is running, you will see a similar view on the Master web UI under the "Running Applications" section:

    Spark Running Application

    Once the job is completed, it will show up under the "Completed Applications" section on the Master web UI as shown below:

    Spark Completed Application
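
    Spark's binary distribution also ships Python examples, so you can submit an equivalent PySpark job to the cluster; this sketch assumes the standard examples/ layout of the distribution and that python3 is available on all nodes:

    ./bin/spark-submit --master spark://master:7077 examples/src/main/python/pi.py 1000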