Spark cluster and PySpark on Minikube
Prerequisites
Python 3.8 (the Python version must match the version in the PySpark image)
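The version requirement above can be checked mechanically. The following is a minimal sketch, assuming that both sides report their version in the usual `Python X.Y.Z` form; the helper names `python_minor_version` and `same_python_minor` are made up for illustration, and the image reference in the usage comment assumes the tag built later in this guide.

```shell
# Extract "major.minor" from a version string such as "Python 3.8.10".
python_minor_version() {
    printf '%s\n' "$1" | sed -n 's/^Python \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeed only if both version strings parse and share the same major.minor.
same_python_minor() {
    a=$(python_minor_version "$1")
    b=$(python_minor_version "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (assumption: python3 on the host, and the spark-py image already built):
#   local_ver=$(python3 --version)
#   image_ver=$(docker run --rm --entrypoint python3 \
#       localhost:5000/sparkimage/spark-py:v1.0.0 --version)
#   same_python_minor "$local_ver" "$image_ver" || echo "Python versions differ" >&2
```

Only the major and minor parts are compared here; whether a differing patch level matters is not covered by this sketch.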
Installing Minikube (qemu)
I want to run minikube on open-source packages and avoid Docker. I therefore use qemu and socket_vmnet. The software can be installed via Homebrew:
brew install minikube
brew install kubectl
brew install qemu
brew install socket_vmnet
brew tap homebrew/services
HOMEBREW=$(which brew) && sudo ${HOMEBREW} services restart socket_vmnet
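Before starting the cluster, it can help to confirm that the installed tools are actually on the PATH. This is a small sketch; the helper name `missing_tools` is hypothetical, and checking for `qemu-img` as a stand-in for the qemu installation is an assumption about what the Homebrew package provides.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage (assumption: qemu-img is installed together with qemu):
#   missing=$(missing_tools minikube kubectl qemu-img)
#   [ -z "$missing" ] || echo "missing tools: $missing" >&2
```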
$ minikube start --addons=ingress,ingress-dns,metrics-server,registry --driver=qemu --memory 6g --cpus 4 --network=socket_vmnet --insecure-registry "10.0.0.0/8" --insecure-registry "192.168.0.0/16"
😄 minikube v1.32.0 on Darwin 14.3.1
✨ Using the qemu2 driver based on user configuration
🔑 Your firewall is blocking bootpd which is required for socket_vmnet. The following commands will be executed to unblock bootpd:
$ sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /usr/libexec/bootpd
$ sudo /usr/libexec/ApplicationFirewall/socketfilterfw --unblock /usr/libexec/bootpd
💿 Downloading VM boot image ...
> minikube-v1.32.1-amd64.iso....: 65 B / 65 B [---------] 100.00% ? p/s 0s
> minikube-v1.32.1-amd64.iso: 292.96 MiB / 292.96 MiB 100.00% 39.42 MiB p
👍 Starting control plane node minikube in cluster minikube
💾 Downloading Kubernetes v1.28.3 preload ...
> preloaded-images-k8s-v18-v1...: 403.35 MiB / 403.35 MiB 100.00% 20.08 M
🔥 Creating qemu2 VM (CPUs=2, Memory=4000MB, Disk=20000MB) ...
🐳 Preparing Kubernetes v1.28.3 on Docker 24.0.7 ...
▪ Generating certificates and keys ...
▪ Booting up control plane ...
▪ Configuring RBAC rules ...
🔗 Configuring bridge CNI (Container Networking Interface) ...
▪ Using image gcr.io/k8s-minikube/minikube-ingress-dns:0.0.2
🔎 Verifying Kubernetes components...
▪ Using image gcr.io/k8s-minikube/kube-registry-proxy:0.0.5
▪ Using image docker.io/registry:2.8.3
▪ Using image registry.k8s.io/ingress-nginx/controller:v1.9.4
▪ Using image registry.k8s.io/ingress-nginx/kube-webhook-certgen:v20231011-8b53cabe0
▪ Using image registry.k8s.io/ingress-nginx/kube-webhook-certgen:v20231011-8b53cabe0
▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
▪ Using image registry.k8s.io/metrics-server/metrics-server:v0.6.4
🔎 Verifying registry addon...
🔎 Verifying ingress addon...
🌟 Enabled addons: ingress-dns, storage-provisioner, metrics-server, default-storageclass, registry, ingress
🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
source <(minikube docker-env)
minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
docker-env: in-use
minikube dashboard &
Configuring a local dummy domain (macOS)
# Create the resolver directory
sudo mkdir -pv /etc/resolver
# Add your nameserver to resolvers
sudo bash -c 'echo "nameserver "$(minikube ip) > /etc/resolver/my'
sudo bash -c 'echo "domain my" >> /etc/resolver/my'
sudo bash -c 'echo "search_order 1" >> /etc/resolver/my'
sudo bash -c 'echo "timeout 5" >> /etc/resolver/my'
# If you want to check the configuration, run
scutil --dns
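The four `echo` commands above write one resolver file line by line. As an alternative sketch, the same content can be rendered in a single step and written with `sudo tee`, so that `minikube ip` runs as the normal user. The helper name `render_resolver` is hypothetical; the four settings are the ones used above.

```shell
# Render the contents of a macOS resolver file for one domain.
# $1: nameserver IP, $2: domain served by that nameserver
render_resolver() {
    printf 'nameserver %s\ndomain %s\nsearch_order 1\ntimeout 5\n' "$1" "$2"
}

# Usage (assumption: the minikube VM is already running):
#   sudo mkdir -p /etc/resolver
#   render_resolver "$(minikube ip)" my | sudo tee /etc/resolver/my >/dev/null
```

Since the minikube IP can change when the VM is recreated, the file may need to be regenerated after such a change.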
Testing the minikube installation
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  ports:
    - name: http
      port: 80
  selector:
    app: web
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - image: docker.io/nginx
          name: nginx
          ports:
            - containerPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  rules:
    - host: test.my
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-service
                port:
                  number: 80
EOF
kubectl get all
NAME                                  READY   STATUS    RESTARTS   AGE
pod/web-deployment-6b87f678ff-h4qrn   1/1     Running   0          13m
pod/web-deployment-6b87f678ff-nwggr   1/1     Running   0          13m

NAME                  TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
service/kubernetes    ClusterIP   10.96.0.1     <none>        443/TCP        46m
service/web-service   NodePort    10.106.7.74   <none>        80:32387/TCP   13m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/web-deployment   2/2     2            2           13m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/web-deployment-6b87f678ff   2         2         2       13m
The web server should now be reachable from the local machine at http://test.my.
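The ingress controller and DNS resolver can take a moment to become ready, so a single request may fail at first. The following sketch retries a command until it succeeds; the helper name `wait_until` and the attempt count are assumptions, and the `--resolve` variant is only a fallback idea for the case where the resolver file is not in place yet.

```shell
# Retry a command once per second until it succeeds or the attempts run out.
# $1: maximum number of attempts, remaining arguments: the command to run
wait_until() {
    attempts=$1
    shift
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Usage:
#   wait_until 30 curl -fsS http://test.my || echo "test.my not reachable" >&2
# Without the resolver file, curl can pin the host name to the cluster IP:
#   curl -fsS --resolve "test.my:80:$(minikube ip)" http://test.my
```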
Building an image for Spark
eval $(minikube -p minikube docker-env)
cd ~/spark-3.5/
export SPARK_HOME=~/spark-3.5/
bin/docker-image-tool.sh -m -r localhost:5000/sparkimage -t v1.0.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build
[+] Building 46.1s (19/19) FINISHED
...
[+] Building 61.9s (11/11) FINISHED
...
[+] Building 133.6s (10/10) FINISHED
...
bin/docker-image-tool.sh -m -r localhost:5000/sparkimage -t v1.0.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile push
The push refers to repository [localhost:5000/sparkimage/spark]
...
v1.0.0: digest: sha256:4d9ec282f3b8382cb4e00bae91e2ed4639a37d010f36493b23d2aafabd335bd8 size: 4080
The push refers to repository [localhost:5000/sparkimage/spark-py]
...
v1.0.0: digest: sha256:4ac54e3d92466346a454d768b561b8458fdab046acfa229c560a35f55173b8bb size: 5129
The push refers to repository [localhost:5000/sparkimage/spark-r]
...
v1.0.0: digest: sha256:e64def12de242e7a6270e6384fabe1a25ab3923e08871d8966e2a40bdb58b195 size: 4918
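The push output above shows three repositories: `spark`, `spark-py` and `spark-r`. As a rough sketch, the resulting image references can be generated from the repository prefix and tag and then checked against the Docker daemon inside minikube. The helper name `spark_image_refs` is hypothetical, and the three image names are taken from the output above rather than from the tool's documentation.

```shell
# Print the three image references for a given repository prefix and tag.
# $1: repository prefix (value of -r), $2: tag (value of -t)
spark_image_refs() {
    for name in spark spark-py spark-r; do
        printf '%s/%s:%s\n' "$1" "$name" "$2"
    done
}

# Usage (assumption: the shell points at minikube's Docker daemon):
#   eval "$(minikube -p minikube docker-env)"
#   spark_image_refs localhost:5000/sparkimage v1.0.0 | while read -r ref; do
#       docker image inspect "$ref" >/dev/null 2>&1 || echo "missing: $ref" >&2
#   done
```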
Preparing K8s for the Spark driver
kubectl create namespace spark
kubectl create serviceaccount spark --namespace=spark
kubectl create clusterrolebinding spark-role --clusterrole=cluster-admin --serviceaccount=spark:spark --namespace=spark
kubectl port-forward -n spark spark-pi-...-driver 4040:4040
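Whether the service account created above may really create pods can be checked with `kubectl auth can-i` and impersonation. This is a minimal sketch; the helper name `service_account_subject` is made up, and it only builds the user name that Kubernetes assigns to a service account.

```shell
# Build the user name Kubernetes uses for a service account.
# $1: namespace, $2: service account name
service_account_subject() {
    printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Usage (expected to answer "yes" if the role binding is in effect):
#   kubectl auth can-i create pods -n spark \
#       --as "$(service_account_subject spark spark)"
```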
Starting the job and testing the image
kubectl cluster-info
./bin/spark-submit --master k8s://https://$(minikube ip):8443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.container.image=localhost:5000/sparkimage/spark:v1.0.0 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.authenticate.serviceAccountName=spark \
--conf spark.kubernetes.namespace=spark \
--executor-memory 500m \
--driver-memory 500m \
local:///opt/spark/examples/jars/spark-examples_2.13-3.5.1.jar 100
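Once the job has been submitted, the result ends up in the driver pod's log. The sketch below looks up the driver pod by label and extracts the estimate from the log. It assumes that Spark labels the driver pod with `spark-role=driver`, that only one driver pod exists in the namespace, and that SparkPi prints a line starting with `Pi is roughly`; the helper name `extract_pi` is hypothetical.

```shell
# Read driver log lines on stdin and print the value from SparkPi's result line.
extract_pi() {
    sed -n 's/^Pi is roughly \([0-9.][0-9.]*\).*/\1/p'
}

# Usage (assumption: exactly one driver pod in the spark namespace):
#   driver=$(kubectl get pods -n spark -l spark-role=driver -o name | head -n 1)
#   kubectl logs -n spark "$driver" | extract_pi
# While the job is still running, the same name works for the Spark UI:
#   kubectl port-forward -n spark "$driver" 4040:4040
```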
Links
https://spark.apache.org/docs/latest/running-on-kubernetes.html
https://medium.com/geekculture/build-your-own-big-data-ecosystem-part-1-a19e4c778632
https://minikube.sigs.k8s.io/docs/handbook/registry/#enabling-insecure-registries/
https://gist.github.com/trisberg/37c97b6cc53def9a3e38be6143786589
https://blog.thenets.org/how-to-use-minikube-qemu2-ingress-dns-on-macos/