Apache Spark Cluster Setup on NERC OpenStack
Apache Spark Overview
Apache Spark is increasingly recognized as the primary analysis suite for big data, particularly among Python users. Spark offers a robust Python API and includes several valuable built-in libraries such as MLlib for machine learning and Spark Streaming for real-time analysis. In contrast to Apache Hadoop, Spark performs most computations in main memory, which boosts performance.
Many modern computational tasks utilize the MapReduce parallel paradigm. This computational process comprises two stages: Map and Reduce. Before task execution, all data is distributed across the nodes of the cluster. During the "Map" stage, the master node dispatches the executable task to the other nodes, and each worker processes its respective data. The subsequent "Reduce" stage involves the master node collecting all results from the workers and generating the final result based on the workers' outcomes. Apache Spark also implements this model of computation, which is what gives it its big data processing capability.
Apache Spark Cluster Setup
To get a Spark standalone cluster up and running manually, all you need to do is spawn some VMs and start Spark as a master on one of them and as workers on the others. They will automatically form a cluster that you can connect to from Python, Java, and Scala applications using the IP address of the master VM.
Set up a Master VM
- To create a master VM for the first time, ensure that the "Image" dropdown option is selected. In this example, we selected ubuntu-22.04-x86_64, and the cpu-su.2 flavor is being used.

- Make sure you have added rules in the Security Groups to allow ssh using Port 22 access to the instance.

- Assign a Floating IP to your new instance so that you will be able to ssh into this machine:

      ssh ubuntu@<Floating-IP> -A -i <Path_To_Your_Private_Key>

  For example:

      ssh ubuntu@199.94.61.4 -A -i cloud.key
- Upon successfully accessing the machine, install the following dependencies:

      sudo apt-get -y update
      sudo apt install default-jre -y
- Download and install Scala:

      wget https://downloads.lightbend.com/scala/2.13.10/scala-2.13.10.deb
      sudo dpkg -i scala-2.13.10.deb
      sudo apt-get install scala

  Note

  Installing Scala means installing various command-line tools such as the Scala compiler and build tools.
- Download and unpack Apache Spark:

      SPARK_VERSION="3.4.2"
      APACHE_MIRROR="dlcdn.apache.org"
      wget https://$APACHE_MIRROR/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop3-scala2.13.tgz
      sudo tar -zxvf spark-$SPARK_VERSION-bin-hadoop3-scala2.13.tgz
      sudo cp -far spark-$SPARK_VERSION-bin-hadoop3-scala2.13 /usr/local/spark

  Very Important Note

  Please ensure you are using the latest Spark version by modifying SPARK_VERSION in the above script. Additionally, verify that the version exists on the APACHE_MIRROR website. Please note the value of SPARK_VERSION, as you will need it during Preparing Jobs for Execution and Examination.
- Create an SSH/RSA Key by running ssh-keygen -t rsa without using any passphrase:

      ssh-keygen -t rsa

      Generating public/private rsa key pair.
      Enter file in which to save the key (/home/ubuntu/.ssh/id_rsa):
      Enter passphrase (empty for no passphrase):
      Enter same passphrase again:
      Your identification has been saved in /home/ubuntu/.ssh/id_rsa
      Your public key has been saved in /home/ubuntu/.ssh/id_rsa.pub
      The key fingerprint is:
      SHA256:8i/TVSCfrkdV4+Jyqc00RoZZFSHNj8C0QugmBa7RX7U ubuntu@spark-master
      The key's randomart image is:
      +---[RSA 3072]----+
      | .. ..o..++o     |
      | o o.. +o.+.     |
      | . +o .o=+.oo    |
      | +.oo +o++..     |
      | o EoS .+oo      |
      | . o .+B         |
      | .. +O .         |
      | o.o..o          |
      | o..             |
      +----[SHA256]-----+
- Copy and append the contents of the SSH public key, i.e. ~/.ssh/id_rsa.pub, to the ~/.ssh/authorized_keys file (one way to do this is shown in the sketch right after this list).
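The appending step above can be done with a single command. Below is a minimal sketch that appends the key and then runs a few optional sanity checks on the dependencies installed in the earlier steps; the version checks assume the packages above installed cleanly.

```bash
# Append the master's public key to its own authorized_keys; worker VMs cloned
# from this volume later will then accept SSH from the same key.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# Optional sanity checks of the dependencies installed above.
java -version                                  # Java runtime from default-jre
scala -version                                 # Scala 2.13.x command-line tools
/usr/local/spark/bin/spark-submit --version    # Spark release unpacked to /usr/local/spark
```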
Create a Volume Snapshot of the master VM
- Once you're logged in to NERC's Horizon dashboard, you need to Shut Off the master VM before creating a volume snapshot.

  Click Action -> Shut Off Instance.

  The status will change to Shutoff.

- Then, create a snapshot of its attached volume by clicking "Create Snapshot" from Project -> Volumes -> Volumes, as described here.
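If you prefer the command line and have the OpenStack CLI configured for your NERC project, the same two steps can be scripted roughly as follows. This is a sketch only: the volume ID is a placeholder, and the snapshot name spark-master-snapshot is an arbitrary example, not a value from this guide.

```bash
# Shut off the master VM (equivalent to Action -> Shut Off Instance).
openstack server stop spark-master

# Snapshot the master's attached volume; --force is needed because the volume
# is still attached to the (stopped) instance.
openstack volume snapshot create --volume <Master-Volume-ID> --force spark-master-snapshot
```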
Create Two Worker Instances from the Volume Snapshot
- Once a snapshot is created and is in "Available" status, you can view and manage it under the Volumes menu in the Horizon dashboard under Volume Snapshots.

  Navigate to Project -> Volumes -> Snapshots.

- You have the option to launch this volume snapshot directly as an instance by clicking on the arrow next to "Create Volume" and selecting "Launch as Instance".

  NOTE: Specify Count: 2 to launch 2 instances from the volume snapshot, as shown below:
Naming, Security Group and Flavor for Worker Nodes
You can specify the "Instance Name" as "spark-worker", and for each instance, it will automatically append incremental values at the end, such as spark-worker-1 and spark-worker-2. Also, make sure you have attached the Security Groups to allow ssh using Port 22 access to the worker instances.
Additionally, during launch, you will have the option to choose your preferred flavor for the worker nodes, which can differ from the master VM based on your computational requirements.
- Navigate to Project -> Compute -> Instances.

- Restart the shut-off master VM by clicking Action -> Start Instance.

- The final setup for our Spark cluster looks like this, with 1 master node and 2 worker nodes:
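In addition to the Horizon view, if you have the OpenStack CLI configured for your project, you can cross-check the three instances from a terminal. This is optional; the name filter assumes the default instance names used above.

```bash
# List the master and the two workers by filtering on the instance name.
openstack server list --name spark
```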
Configure Spark on the Master VM
- SSH into the master VM again.

- Update the /etc/hosts file to specify all three hostnames with their corresponding internal IP addresses:

      sudo nano /etc/hosts

  Ensure all hosts are resolvable by adding them to /etc/hosts. You can modify the following content, specifying each VM's internal IP address, and paste the updated content at the end of the /etc/hosts file. Alternatively, you can append the content directly to the end of the /etc/hosts file with sudo bash -c 'cat >> /etc/hosts'.

      <Master-Internal-IP> master
      <Worker1-Internal-IP> worker1
      <Worker2-Internal-IP> worker2
  Very Important Note

  Make sure to use >> instead of > to avoid overwriting the existing content; the new content must be appended at the end of the file.

  For example, the end of the /etc/hosts file looks like this:

      sudo cat /etc/hosts
      ...
      192.168.0.46 master
      192.168.0.26 worker1
      192.168.0.136 worker2
- Verify that you can SSH into both worker nodes by running ssh worker1 and ssh worker2 from the Spark master node's terminal.

- Copy the sample Spark configuration file:

      cd /usr/local/spark/conf/
      cp spark-env.sh.template spark-env.sh
- Update the environment variables file, i.e. spark-env.sh, to include the following information:

      export SPARK_MASTER_HOST='<Master-Internal-IP>'
      export JAVA_HOME=<Path_of_JAVA_installation>

  Environment Variables

  Executing readlink -f $(which java) will display the path to the current Java setup in your VM, for example /usr/lib/jvm/java-11-openjdk-amd64/bin/java. You need to remove the trailing bin/java part, i.e. keep /usr/lib/jvm/java-11-openjdk-amd64, to set it as the JAVA_HOME environment variable. Learn more about other Spark settings that can be configured through environment variables here.

  For example:

      echo "export SPARK_MASTER_HOST='192.168.0.46'" >> spark-env.sh
      echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> spark-env.sh
- Source the changed environment variables file, i.e. spark-env.sh:

      source spark-env.sh
- Create a file named slaves in the Spark configuration directory (i.e., /usr/local/spark/conf/) that specifies all 3 hostnames (nodes) as specified in /etc/hosts (one way to create it is sketched right after this list):

      sudo cat slaves
      master
      worker1
      worker2
Run the Spark cluster from the Master VM
- SSH into the master VM again if you are not already logged in.

- You need to run the Spark cluster from /usr/local/spark (a quick way to verify that the cluster came up is sketched right after this list):

      cd /usr/local/spark

      # Start all hosts (nodes) including master and workers
      ./sbin/start-all.sh

  How to Stop the Spark Cluster

  To stop all of the Spark cluster nodes, execute the ./sbin/stop-all.sh command from /usr/local/spark.
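After start-all.sh returns, a quick way to confirm that the master started and that all workers registered is to inspect the daemon logs written under logs/. This is a sketch; the exact log messages can vary between Spark releases.

```bash
cd /usr/local/spark

# Each daemon writes one .out log file; look for the master start-up line and
# one "Registering worker" line per worker node.
grep -h -E "Starting Spark master|Registering worker" logs/*.out
```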
Connect to the Spark WebUI
Apache Spark provides a suite of web user interfaces (WebUIs) that you can use to monitor the status and resource consumption of your Spark cluster.
Different types of Spark Web UI
Apache Spark provides different web UIs: Master web UI, Worker web UI, and Application web UI.
- You can connect to the Master web UI using SSH Port Forwarding, aka SSH Tunneling (i.e. Local Port Forwarding), from your local machine's terminal by running:

      ssh -N -L <Your_Preferred_Port>:localhost:8080 <User>@<Floating-IP> -i <Path_To_Your_Private_Key>

  Here, you can choose any port that is available on your machine as <Your_Preferred_Port>, the master VM's assigned Floating IP as <Floating-IP>, and the associated Private Key pair attached to the VM as <Path_To_Your_Private_Key>.

  For example:

      ssh -N -L 8080:localhost:8080 ubuntu@199.94.61.4 -i ~/.ssh/cloud.key
- Once the SSH Tunneling is successful, please do not close or stop the terminal where you are running the SSH Tunneling. Instead, log in to the Master web UI using your web browser at http://localhost:<Your_Preferred_Port>, i.e. http://localhost:8080.
The Master web UI offers an overview of the Spark cluster, showcasing the following details:
- Master URL and REST URL
- Available CPUs and memory for the Spark cluster
- Status and allocated resources for each worker
- Details on active and completed applications, including their status, resources, and duration
- Details on active and completed drivers, including their status and resources
The Master web UI appears as shown below when you navigate to http://localhost:<Your_Preferred_Port>, i.e. http://localhost:8080, from your web browser:
The Master web UI also provides an overview of the applications. Through the Master web UI, you can easily identify the allocated vCPU (Core) and memory resources for both the Spark cluster and individual applications.
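The same port-forwarding approach works for the other web UIs mentioned above. As a sketch using the Spark default ports (8081 for a Worker web UI on that worker, 4040 for the Application web UI of a driver running on the master), you can forward several UIs through the master's Floating IP in a single SSH command:

```bash
# Forward the Master UI (8080), the Application UI of a driver running on the
# master (4040), and worker1's Worker UI (8081, reached over the internal network).
ssh -N \
    -L 8080:localhost:8080 \
    -L 4040:localhost:4040 \
    -L 8081:worker1:8081 \
    ubuntu@199.94.61.4 -i ~/.ssh/cloud.key
```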
Preparing Jobs for Execution and Examination
- To run jobs from /usr/local/spark, execute the following commands:

      cd /usr/local/spark
      SPARK_VERSION="3.4.2"

  Very Important Note

  Please ensure you are using the same Spark version that you downloaded and installed previously as the value of SPARK_VERSION in the above script.
- Single Node Job:

  Let's quickly run a simple job:

      ./bin/spark-submit --driver-memory 2g --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.13-$SPARK_VERSION.jar 50
- Cluster Mode Job:

  Let's submit a longer, more complex job with many tasks that will be distributed across the multi-node cluster, and then view the Master web UI:

      ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 examples/jars/spark-examples_2.13-$SPARK_VERSION.jar 1000

  While the job is running, you will see a similar view on the Master web UI under the "Running Applications" section:

  Once the job is completed, it will show up under the "Completed Applications" section on the Master web UI as shown below:
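Both example jobs print their result to the terminal that ran spark-submit. As a small sketch (it assumes SPARK_VERSION is still set in your shell), you can re-run the cluster-mode example and filter the driver output for the Pi estimate; Spark's log lines go to stderr, so discarding stderr leaves just the result line.

```bash
# Re-run the cluster-mode example and keep only the "Pi is roughly ..." line
# printed by the SparkPi driver.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master spark://master:7077 \
    examples/jars/spark-examples_2.13-$SPARK_VERSION.jar 1000 2>/dev/null \
    | grep "Pi is roughly"
```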