Spark cluster and PySpark on Minikube

Prerequisite

  • Python 3.8 (the local Python version must match the version in the PySpark image)
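
The version requirement can be checked by comparing the major.minor part of two version strings. This is a minimal sketch: the helper name same_minor is hypothetical, and the version inside the image has to be looked up separately and passed in by hand.

```shell
# Hypothetical helper: succeed when two version strings share major.minor.
same_minor() {
  a=$(printf '%s\n' "$1" | cut -d. -f1,2)
  b=$(printf '%s\n' "$2" | cut -d. -f1,2)
  [ "$a" = "$b" ]
}

# Example use (assumes python3 is on the PATH; the second argument is the
# version found in the image, entered manually):
# local_version=$(python3 -c 'import platform; print(platform.python_version())')
# same_minor "$local_version" "3.8.10" && echo "versions match"
```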

Installing Minikube (qemu)

I want to run minikube on top of open-source packages and avoid using Docker, so I use qemu and socket_vmnet. The software can be installed with Homebrew:

Preparation
brew install minikube
brew install kubectl
brew install qemu
brew install socket_vmnet
brew tap homebrew/services
HOMEBREW=$(which brew) && sudo ${HOMEBREW} services restart socket_vmnet
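
Before starting the cluster it can help to confirm that the installed tools are actually on the PATH. The sketch below assumes that the qemu package provides a qemu-img binary; the helper name missing_tools is hypothetical.

```shell
# Hypothetical helper: print every named command that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example use: prints nothing when all three tools are installed.
# missing_tools minikube kubectl qemu-img
```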
Start minikube
$ minikube start --addons=ingress,ingress-dns,metrics-server,registry --driver=qemu --memory 6g --cpus 4 --network=socket_vmnet --insecure-registry "10.0.0.0/8" --insecure-registry "192.168.0.0/16"
😄  minikube v1.32.0 auf Darwin 14.3.1
✨  Verwende den Treiber qemu2 basierend auf der Benutzer-Konfiguration
🔑  Your firewall is blocking bootpd which is required for socket_vmnet. The following commands will be executed to unblock bootpd:

    $ sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /usr/libexec/bootpd
    $ sudo /usr/libexec/ApplicationFirewall/socketfilterfw --unblock /usr/libexec/bootpd


💿  Lade VM boot image herunter ...
    > minikube-v1.32.1-amd64.iso....:  65 B / 65 B [---------] 100.00% ? p/s 0s
    > minikube-v1.32.1-amd64.iso:  292.96 MiB / 292.96 MiB  100.00% 39.42 MiB p
👍  Starte Control Plane Node minikube in Cluster minikube
💾  Lade Kubernetes v1.28.3 herunter ...
    > preloaded-images-k8s-v18-v1...:  403.35 MiB / 403.35 MiB  100.00% 20.08 M
🔥  Erstelle qemu2 VM (CPUs=2, Speicher=4000MB, Disk=20000MB ...
🐳  Vorbereiten von Kubernetes v1.28.3 auf Docker 24.0.7...
     Generiere Zertifikate und Schlüssel ...
     Starte Control-Plane ...
     Konfiguriere RBAC Regeln ...
🔗  Konfiguriere bridge CNI (Container Networking Interface) ...
     Verwende Image gcr.io/k8s-minikube/minikube-ingress-dns:0.0.2
🔎  Verifiziere Kubernetes Komponenten...
     Verwende Image gcr.io/k8s-minikube/kube-registry-proxy:0.0.5
     Verwende Image docker.io/registry:2.8.3
     Verwende Image registry.k8s.io/ingress-nginx/controller:v1.9.4
     Verwende Image registry.k8s.io/ingress-nginx/kube-webhook-certgen:v20231011-8b53cabe0
     Verwende Image registry.k8s.io/ingress-nginx/kube-webhook-certgen:v20231011-8b53cabe0
     Verwende Image gcr.io/k8s-minikube/storage-provisioner:v5
     Verwende Image registry.k8s.io/metrics-server/metrics-server:v0.6.4
🔎  Verifiziere registry Addon...
🔎  Verifiziere ingress Addon...
🌟  Addons aktiviert: ingress-dns, storage-provisioner, metrics-server, default-storageclass, registry, ingress
🏄  Fertig! kubectl ist jetzt für die standardmäßige (default) Verwendung des Clusters "minikube" und des Namespaces "default" konfiguriert
Activate the configuration
source <(minikube docker-env)
minikube status

minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
docker-env: in-use
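
For scripted setups, the status output shown above can be checked automatically. This is a sketch that only looks at the host, kubelet and apiserver lines; the helper name cluster_running is hypothetical.

```shell
# Hypothetical helper: read `minikube status` output from stdin and succeed
# only if host, kubelet and apiserver are all reported as Running.
cluster_running() {
  status=$(cat)
  for component in host kubelet apiserver; do
    printf '%s\n' "$status" | grep -q "^${component}: Running\$" || return 1
  done
}

# Example use:
# minikube status | cluster_running && echo "cluster is up"
```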
Start the dashboard
minikube dashboard &

Configuring a local dummy domain (macOS)

Create a local test domain on macOS
# Create the resolver directory
sudo mkdir -pv /etc/resolver

# Add your nameserver to resolvers
sudo bash -c 'echo "nameserver "$(minikube ip) > /etc/resolver/my'
sudo bash -c 'echo "domain my" >> /etc/resolver/my'
sudo bash -c 'echo "search_order 1" >> /etc/resolver/my'
sudo bash -c 'echo "timeout 5" >> /etc/resolver/my'

# If you want to check the configuration, run
scutil --dns
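
As an alternative sketch, the resolver file can be generated in one step and written with tee, so that minikube ip runs as the normal user and only the file write needs sudo. The helper name resolver_config is hypothetical; the four settings are the ones used above.

```shell
# Hypothetical helper: print the resolver file content for a given
# nameserver IP (first argument) and domain (second argument).
resolver_config() {
  printf 'nameserver %s\ndomain %s\nsearch_order 1\ntimeout 5\n' "$1" "$2"
}

# Example use (assumes /etc/resolver already exists):
# resolver_config "$(minikube ip)" my | sudo tee /etc/resolver/my >/dev/null
```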

Testing the minikube installation

Start a test deployment
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  ports:
  - name: http
    port: 80
  selector:
    app: web
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - image: docker.io/nginx
        name: nginx
        ports:
        - containerPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  rules:
  - host: test.my
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80
EOF
kubectl get all
NAME                                  READY   STATUS    RESTARTS   AGE
pod/web-deployment-6b87f678ff-h4qrn   1/1     Running   0          13m
pod/web-deployment-6b87f678ff-nwggr   1/1     Running   0          13m

NAME                  TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
service/kubernetes    ClusterIP   10.96.0.1     <none>        443/TCP        46m
service/web-service   NodePort    10.106.7.74   <none>        80:32387/TCP   13m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/web-deployment   2/2     2            2           13m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/web-deployment-6b87f678ff   2         2         2       13m

The web server should now be reachable from the local machine at http://test.my.
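
Ingress and DNS can take a few seconds to become ready, so a single request right after the deployment may fail. A small retry loop around curl is one way to check reachability; the helper name retry and the chosen attempt count and delay are assumptions.

```shell
# Hypothetical helper: run a command up to N times, sleeping between attempts.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example use: up to 10 attempts, 3 seconds apart.
# retry 10 3 curl -fsS -o /dev/null http://test.my && echo "test.my is reachable"
```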

Providing an image for Spark

Build and push the image
eval $(minikube -p minikube docker-env)
cd ~/spark-3.5/
export SPARK_HOME=~/spark-3.5/

bin/docker-image-tool.sh -m -r localhost:5000/sparkimage -t v1.0.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build

[+] Building 46.1s (19/19) FINISHED
...
[+] Building 61.9s (11/11) FINISHED
...
[+] Building 133.6s (10/10) FINISHED
...

bin/docker-image-tool.sh -m -r localhost:5000/sparkimage -t v1.0.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile push

The push refers to repository [localhost:5000/sparkimage/spark]
...
v1.0.0: digest: sha256:4d9ec282f3b8382cb4e00bae91e2ed4639a37d010f36493b23d2aafabd335bd8 size: 4080
The push refers to repository [localhost:5000/sparkimage/spark-py]
...
v1.0.0: digest: sha256:4ac54e3d92466346a454d768b561b8458fdab046acfa229c560a35f55173b8bb size: 5129
The push refers to repository [localhost:5000/sparkimage/spark-r]
...
v1.0.0: digest: sha256:e64def12de242e7a6270e6384fabe1a25ab3923e08871d8966e2a40bdb58b195 size: 4918
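
As the push output shows, the tool derives three image names from the -r and -t options. The following sketch reproduces that naming so the images can be checked in a loop; the helper name spark_image_names is hypothetical.

```shell
# Hypothetical helper: list the three image references that result from a
# repository prefix (first argument) and a tag (second argument).
spark_image_names() {
  for name in spark spark-py spark-r; do
    printf '%s/%s:%s\n' "$1" "$name" "$2"
  done
}

# Example use (assumes the docker CLI points at the minikube daemon):
# spark_image_names localhost:5000/sparkimage v1.0.0 | while read -r img; do
#   docker image inspect "$img" >/dev/null && echo "found $img"
# done
```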

Preparing K8s for the Spark driver

Create the namespace and role (not ideal, since it uses cluster-admin)
  kubectl create namespace spark
  kubectl create serviceaccount spark --namespace=spark
  kubectl create clusterrolebinding spark-role --clusterrole=cluster-admin --serviceaccount=spark:spark --namespace=spark
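
A narrower alternative to cluster-admin would be to bind the built-in edit ClusterRole only inside the spark namespace. This is an untested sketch that assumes edit is sufficient for the driver to create its executor pods, services and config maps; the helper name spark_rolebinding and the binding name spark-edit are hypothetical.

```shell
# Hypothetical helper: print a namespaced RoleBinding manifest that grants
# the built-in "edit" ClusterRole to a service account.
# Usage: spark_rolebinding <namespace> <serviceaccount>
spark_rolebinding() {
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-edit
  namespace: $1
subjects:
- kind: ServiceAccount
  name: $2
  namespace: $1
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
EOF
}

# Example use:
# spark_rolebinding spark spark | kubectl apply -f -
```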
Reach the Spark UI while a job is running
kubectl port-forward -n spark spark-pi-...-driver 4040:4040
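
The driver pod name contains a generated part, so it has to be looked up for each run. Spark labels driver pods with spark-role=driver, which allows a lookup like the sketch below; the helper name driver_pod is hypothetical, and it simply takes the first running driver it finds.

```shell
# Hypothetical helper: print the first running Spark driver pod
# (as pod/<name>) in the given namespace.
driver_pod() {
  kubectl get pods -n "$1" -l spark-role=driver \
    --field-selector=status.phase=Running -o name | head -n 1
}

# Example use:
# kubectl port-forward -n spark "$(driver_pod spark)" 4040:4040
```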

Start the job and test the image

Submit a test job with spark-submit
kubectl cluster-info

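The API server address printed by kubectl cluster-info is what the --master option needs, prefixed with k8s://. As a sketch, the value can also be read from the active kubeconfig instead of being assembled from the minikube IP and port; the helper name k8s_master is hypothetical.

```shell
# Hypothetical helper: build the Spark master URL from the API server
# address of the currently active kubectl context.
k8s_master() {
  server=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
  printf 'k8s://%s\n' "$server"
}

# Example use:
# ./bin/spark-submit --master "$(k8s_master)" ...
```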

./bin/spark-submit  --master k8s://https://$(minikube ip):8443                                      \
                    --deploy-mode cluster                                                           \
                    --name spark-pi                                                                 \
                    --class org.apache.spark.examples.SparkPi                                       \
                    --conf spark.executor.instances=3                                               \
                    --conf spark.kubernetes.container.image=localhost:5000/sparkimage/spark:v1.0.0  \
                    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark            \
                    --conf spark.kubernetes.authenticate.serviceAccountName=spark                   \
                    --conf spark.kubernetes.namespace=spark                                         \
                    --executor-memory 500m                                                          \
                    --driver-memory 500m                                                            \
                    local:///opt/spark/examples/jars/spark-examples_2.13-3.5.1.jar  100
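
In cluster mode the result is not printed on the submitting machine but ends up in the driver pod's log. SparkPi writes a line starting with "Pi is roughly", so a sketch for extracting it could look like this; the helper name extract_pi is hypothetical.

```shell
# Hypothetical helper: read driver log output from stdin and print the
# first "Pi is roughly ..." result line fragment.
extract_pi() {
  grep -o 'Pi is roughly [0-9.]*' | head -n 1
}

# Example use (takes the first driver pod found in the spark namespace):
# kubectl logs -n spark "$(kubectl get pods -n spark -l spark-role=driver -o name | head -n 1)" | extract_pi
```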