[Spark] Running Spark Applications on YARN in an Apache Hadoop Cluster: A Complete Guide
I have already set up a Spark Standalone Cluster - Spark Standalone Cluster setup
and built Hadoop in Fully Distributed Mode - Hadoop Fully Distributed Mode setup.
In this post, I walk through running a Spark application on YARN in the Fully Distributed Mode Hadoop cluster built earlier.
### Installing prerequisite packages
```
[root@hadoop-master hadoop]# yum -y install gcc openssl-devel bzip2-devel libffi-devel make
```

### Installing Python and setting up the environment
```
[root@hadoop-master home]# wget https://www.python.org/ftp/python/3.8.8/Python-3.8.8.tgz
[root@hadoop-master home]# tar xvfz Python-3.8.8.tgz
[root@hadoop-master home]# rm -rf Python-3.8.8.tgz
[root@hadoop-master home]# chmod -R 777 Python-3.8.8/
[root@hadoop-master Python-3.8.8]# ./configure --enable-optimizations
[root@hadoop-master Python-3.8.8]# make altinstall
[root@hadoop-master Python-3.8.8]# echo alias python="/usr/local/bin/python3.8" >> /root/.bashrc
[root@hadoop-master Python-3.8.8]# source /root/.bashrc
[root@hadoop-master Python-3.8.8]# python -V
Python 3.8.8
[root@hadoop-master Python-3.8.8]# which python
alias python='/usr/local/bin/python3.8'
        /usr/local/bin/python3.8
```
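As an extra sanity check beyond `python -V`, a tiny script can confirm that the aliased interpreter really is the freshly built 3.8. This is a minimal sketch, not from the original post; the file name `check_python.py` is just an example.

```python
# check_python.py - illustrative sanity check for the newly installed interpreter
import sys

# The alias points `python` at /usr/local/bin/python3.8, so this should report 3.8.x
print("Running on Python %d.%d.%d" % sys.version_info[:3])
assert sys.version_info[:2] == (3, 8), "expected the freshly built Python 3.8"
```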
### Creating and configuring the spark account
```
[root@hadoop-master ~]# useradd spark
[root@hadoop-master ~]# passwd spark
[root@hadoop-master ~]# usermod -G wheel spark
```

### Downloading Spark 3.0.2
```
[root@hadoop-master spark]# wget https://downloads.apache.org/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz
--2021-03-10 01:27:27--  https://downloads.apache.org/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz
```

### Extracting the archive and setting permissions
```
[root@hadoop-master spark]# tar xvfz spark-3.0.2-bin-hadoop2.7.tgz
[root@hadoop-master spark]# mv spark-3.0.2-bin-hadoop2.7 spark
[root@hadoop-master spark]# chown -R spark:spark spark
[root@hadoop-master spark]# chmod -R 777 spark
```
### Setting PATH and JAVA_HOME for the spark account
```
[spark@hadoop-master ~]$ echo export PATH='$PATH':/home/spark/spark/bin >> ~/.bashrc
[spark@hadoop-master ~]$ echo export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.275.b01-1.el8_3.x86_64" >> ~/.bashrc
[spark@hadoop-master ~]$ tail -1 ~/.bashrc
export PATH=$PATH:/home/spark/spark/bin
[spark@hadoop-master ~]$ source ~/.bashrc
[spark@hadoop-master ~]$ echo $PATH
/home/spark/.local/bin:/home/spark/bin:/home/spark/.local/bin:/home/spark/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.275.b01-1.el8_3.x86_64/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin::/root/bin:/home/spark/spark/bin
```
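If you prefer checking the environment from Python rather than `echo`, a short sketch like the one below (hypothetical, not from the original post) prints the variables Spark tooling relies on and confirms that `spark-submit` resolves via the new PATH:

```python
# env_check.py - illustrative check of the spark account's environment
import os
import shutil

# Variables appended to ~/.bashrc above
for var in ("JAVA_HOME", "PATH"):
    print(var, "=", os.environ.get(var))

# spark-submit should now be found under /home/spark/spark/bin
print("spark-submit:", shutil.which("spark-submit"))
```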
#### Add the following to spark-env.sh
```
## nasa setting
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.275.b01-1.el8_3.x86_64"
export SPARK_WORKER_INSTANCES=2
export PYSPARK_PYTHON="/usr/bin/python3"
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
```

#### Configuring spark-defaults.conf
```
[spark@hadoop-master conf]$ cp spark-defaults.conf.template spark-defaults.conf
[spark@hadoop-master conf]$ vim spark-defaults.conf

## Added settings (the deploy-mode property Spark reads is spark.submit.deployMode)
spark.master             yarn
spark.submit.deployMode  client
```
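To confirm that spark-defaults.conf is actually being picked up, a few lines of PySpark can print the effective master and deploy mode. This is a minimal sketch; the app name `conf-check` is arbitrary.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a session; settings from spark-defaults.conf are applied automatically
spark = SparkSession.builder.appName("conf-check").getOrCreate()

conf = spark.sparkContext.getConf()
print(conf.get("spark.master"))                       # expected: yarn
print(conf.get("spark.submit.deployMode", "client"))  # expected: client

spark.stop()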
### Verifying spark-shell
```
[spark@hadoop-master root]$ spark-shell
21/03/10 01:56:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop-master:4040
Spark context available as 'sc' (master = local[*], app id = local-1615341369333).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.2
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_275)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```

### Verifying pyspark
```
[spark@hadoop-master conf]$ pyspark
Python 3.6.8 (default, Apr 16 2020, 01:36:27)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
21/03/10 01:59:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.2
      /_/

Using Python version 3.6.8 (default, Apr 16 2020 01:36:27)
SparkSession available as 'spark'.
>>> 1+2
3
```
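Beyond `1+2`, a tiny RDD job is a slightly stronger smoke test, because it actually schedules tasks rather than just evaluating a Python expression in the driver. A minimal sketch, typed into the same pyspark shell (`sc` is the SparkContext the shell provides):

```python
# Typed into the pyspark shell; `sc` is the SparkContext the shell creates
nums = sc.parallelize(range(1, 101), 4)   # 100 numbers spread across 4 partitions

# 5050 comes back only if tasks are scheduled and their results are collected
print(nums.sum())
```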
The test data set has already been uploaded to HDFS:

```
[hadoop@hadoop-master nasa1515]$ hdfs dfs -ls /
Found 1 items
-rw-r--r--   3 hadoop supergroup  500253789 2021-03-09 08:47 /nasa.jsv
```
Contents of /home/spark/test.py:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

# Read the file from HDFS as CSV with a header row and cache it
# (the dataset on HDFS is named /nasa.jsv, but it is read with the CSV reader;
#  DataFrameReader has no .jsv method)
df = spark.read.option("header", "true").csv('hdfs:/nasa.jsv').cache()
df.show()
```
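Since the column layout of /nasa.jsv isn't shown here, the follow-up below sticks to schema-agnostic calls. It is a sketch of how the cached DataFrame could be inspected further in the same session; the view name `nasa_logs` is mine.

```python
# Continuing in the same session, after test.py has cached the DataFrame
df.printSchema()   # inferred column names and types
print(df.count())  # number of rows read from HDFS

# Register a temporary view so the data can be queried with Spark SQL
df.createOrReplaceTempView("nasa_logs")
spark.sql("SELECT COUNT(*) AS row_count FROM nasa_logs").show()
```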
### Submitting test.py to YARN
```
[spark@hadoop-master conf]$ spark-submit --master yarn --deploy-mode client --executor-memory 1g /home/spark/test.py
2021-03-10 02:20:10,493 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-03-10 02:20:11,137 INFO spark.SparkContext: Running Spark version 3.0.2
2021-03-10 02:20:11,177 INFO resource.ResourceUtils: ==============================================================
2021-03-10 02:20:11,185 INFO resource.ResourceUtils: Resources for spark.driver:
...
...(omitted)
2021-03-10 02:20:40,402 INFO spark.SparkContext: Invoking stop() from shutdown hook
2021-03-10 02:20:40,410 INFO server.AbstractConnector: Stopped Spark@6f05fe89{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2021-03-10 02:20:40,412 INFO ui.SparkUI: Stopped Spark web UI at http://hadoop-master:4040
2021-03-10 02:20:40,416 INFO cluster.YarnClientSchedulerBackend: Interrupting monitor thread
2021-03-10 02:20:40,438 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
2021-03-10 02:20:40,438 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
2021-03-10 02:20:40,443 INFO cluster.YarnClientSchedulerBackend: YARN client scheduler backend Stopped
2021-03-10 02:20:40,452 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
2021-03-10 02:20:40,463 INFO memory.MemoryStore: MemoryStore cleared
2021-03-10 02:20:40,464 INFO storage.BlockManager: BlockManager stopped
2021-03-10 02:20:40,469 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
2021-03-10 02:20:40,475 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
2021-03-10 02:20:40,497 INFO spark.SparkContext: Successfully stopped SparkContext
2021-03-10 02:20:40,498 INFO util.ShutdownHookManager: Shutdown hook called
2021-03-10 02:20:40,498 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-b04d174a-0be5-44ee-87ad-8915e64b3d51
2021-03-10 02:20:40,501 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-b04d174a-0be5-44ee-87ad-8915e64b3d51/pyspark-5cf80ddf-89c1-4330-93c8-28e2f93b6c08
2021-03-10 02:20:40,510 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-dd89ea67-71be-4e21-ba16-ee7623fea72a
```
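The HDFS path is hard-coded in test.py. A slightly more reusable variant (hypothetical, not from the original post; the file name `test_args.py` is mine) takes the input path from the command line, so the same script can be submitted against different files:

```python
# test_args.py - hypothetical variant of test.py; the input path comes from the CLI, e.g.
#   spark-submit --master yarn --deploy-mode client test_args.py hdfs:/nasa.jsv
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()

# Fall back to the original path when no argument is given
input_path = sys.argv[1] if len(sys.argv) > 1 else "hdfs:/nasa.jsv"

df = spark.read.option("header", "true").csv(input_path).cache()
df.show()

spark.stop()
```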
### Starting the Spark standalone master and workers
```
[spark@hadoop-master sbin]$ start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /home/spark/spark/logs/spark-spark-org.apache.spark.deploy.master.Master-1-hadoop-master.out
[spark@hadoop-master sbin]$ start-slave.sh spark://hadoop-master:7077
starting org.apache.spark.deploy.worker.Worker, logging to /home/spark/spark/logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-hadoop-master.out
starting org.apache.spark.deploy.worker.Worker, logging to /home/spark/spark/logs/spark-spark-org.apache.spark.deploy.worker.Worker-2-hadoop-master.out
```

Two Worker processes start because spark-env.sh sets SPARK_WORKER_INSTANCES=2.
### Creating the zeppelin account and installing Zeppelin
```
[root@hadoop-master ~]# useradd zeppelin
[root@hadoop-master ~]# passwd zeppelin
[root@hadoop-master ~]# cd /home/zeppelin/
[root@hadoop-master zeppelin]# wget https://downloads.apache.org/zeppelin/zeppelin-0.9.0-preview2/zeppelin-0.9.0-preview2-bin-all.tgz
[root@hadoop-master zeppelin]# tar xvfz zeppelin-0.9.0-preview2-bin-all.tgz
[root@hadoop-master zeppelin]# mv zeppelin-0.9.0-preview2-bin-all zeppelin
[root@hadoop-master zeppelin]# chown -R zeppelin:zeppelin zeppelin
[root@hadoop-master zeppelin]# chmod -R 777 zeppelin
```
```
[zeppelin@hadoop-master ~]$ echo export PATH="$PATH:/home/zeppelin/zeppelin/bin" >> ~/.bashrc
[zeppelin@hadoop-master ~]$ source ~/.bashrc
```

### Resulting ~/.bashrc contents
```
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.275.b01-1.el8_3.x86_64"
export HADOOP_HOME="/usr/local/hadoop"
export SPARK_HOME="/home/spark/spark"
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
export PATH=$PATH:$SPARK_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```
```
[zeppelin@hadoop-master ~]$ cd /home/zeppelin/zeppelin/conf
[zeppelin@hadoop-master conf]$ cp zeppelin-env.sh.template zeppelin-env.sh
```

### Settings added to zeppelin-env.sh
```
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.275.b01-1.el8_3.x86_64"
export SPARK_HOME="/home/spark/spark"
export MASTER=yarn-client
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
```

### Editing zeppelin-site.xml
```
zeppelin-site.xml
...
...
<property>
  <name>zeppelin.server.addr</name>
  <value>10.0.0.5</value>   <!-- change to the client IP -->
  <description>Server binding address</description>
</property>

<property>
  <name>zeppelin.server.port</name>
  <value>7777</value>       <!-- 8080 is already used by Spark, so use 7777 -->
  <description>Server port.</description>
</property>
...
...
```
### Starting the Zeppelin daemon
```
[zeppelin@hadoop-master conf]$ zeppelin-daemon.sh start
```
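Once the daemon is up, the web UI should be reachable on port 7777 at the address set in zeppelin-site.xml. As a first check, a notebook paragraph using the %pyspark interpreter could run something like the sketch below; `sc` is provided by Zeppelin's Spark interpreter, and the exact output depends on your cluster.

```python
# Body of a Zeppelin %pyspark paragraph; `sc` comes from the Spark interpreter
print(sc.master)    # should reflect the YARN master configured in zeppelin-env.sh
print(sc.version)   # Spark version bound to the interpreter (3.0.2 here)

# A tiny job that actually schedules tasks, to confirm executors respond
print(sc.parallelize(range(10)).count())
```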