Integrating Hadoop and Spark on Debian involves several steps, including installing the required packages, setting environment variables, editing configuration files, and starting the services. The following is a basic integration guide:
1. Install the Java Development Kit (JDK)
Hadoop depends on Java, so install a JDK first.
sudo apt update
sudo apt install openjdk-11-jdk
Verify the Java installation:
java -version
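Both Hadoop and Spark read JAVA_HOME, so it is convenient to export it now, for example in ~/.bashrc. A minimal sketch, assuming the default OpenJDK 11 install path on Debian amd64 (adjust for your architecture):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin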
2. Install Hadoop
Download the Hadoop binary distribution, verify the checksum, and extract it. For example, for Hadoop 3.3.6:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
sha512sum -c hadoop-3.3.6.tar.gz.sha512
sudo mkdir -p /opt/hadoops/hadoop
sudo tar -xzvf hadoop-3.3.6.tar.gz -C /opt/hadoops/hadoop --strip-components=1
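It also helps to export HADOOP_HOME and put the Hadoop scripts on the PATH, again in ~/.bashrc. A sketch, assuming the /opt/hadoops/hadoop layout used above:
export HADOOP_HOME=/opt/hadoops/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Run source ~/.bashrc afterwards so the variables take effect in the current shell.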
3. Configure Hadoop
Edit the Hadoop configuration files under /opt/hadoops/hadoop/etc/hadoop (Hadoop also needs its own JAVA_HOME setting, as sketched after this list):
- core-site.xml:

  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>
  </configuration>

- hdfs-site.xml:

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/opt/hadoops/hadoop/hdfs/namenode</value>
    </property>
  </configuration>

- mapred-site.xml:

  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
  </configuration>

- yarn-site.xml:

  <configuration>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>localhost</value>
    </property>
  </configuration>
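start-dfs.sh and start-yarn.sh also need Hadoop's own environment file to know where Java lives. A minimal addition to /opt/hadoops/hadoop/etc/hadoop/hadoop-env.sh, assuming the same OpenJDK 11 path as above:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64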
4. Set Up Passwordless SSH Access
Generate an SSH key for the Hadoop user and append the public key to the authorized_keys file:
sudo su - hadoopuser
ssh-keygen -t rsa -P ''
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Test the SSH connection:
ssh localhost
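The commands above assume a dedicated hadoopuser account that owns the Hadoop installation. If it does not exist yet, a minimal sketch to create it and hand over the install directory (the ownership change is an assumption about how you want to lay out permissions):
sudo adduser hadoopuser
sudo chown -R hadoopuser:hadoopuser /opt/hadoops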
5. Format HDFS and Start the Hadoop Services
Format the NameNode:
hdfs namenode -format
Start the Hadoop services:
/opt/hadoops/hadoop/sbin/start-dfs.sh
/opt/hadoops/hadoop/sbin/start-yarn.sh
Check the status of the Hadoop services (a quick HDFS smoke test is sketched after this list):

- HDFS status:

  hdfs dfsadmin -report

- YARN ResourceManager status:

  curl http://localhost:8088/cluster/scheduler
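As a quick smoke test, write a small file into HDFS and read it back (the directory and file names below are only illustrative):
hdfs dfs -mkdir -p /user/hadoopuser
echo "hello hdfs" > /tmp/hello.txt
hdfs dfs -put /tmp/hello.txt /user/hadoopuser/
hdfs dfs -cat /user/hadoopuser/hello.txt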
6. Install Spark
Download and extract the Spark distribution. For example, for Spark 3.3.2:
wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
tar -xzvf spark-3.3.2-bin-hadoop3.tgz
cd spark-3.3.2-bin-hadoop3
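As with Hadoop, exporting SPARK_HOME and extending the PATH keeps the later commands short. A sketch for ~/.bashrc, assuming Spark was extracted into your home directory (adjust the path to wherever you unpacked it):
export SPARK_HOME=$HOME/spark-3.3.2-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin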
7. Configure Spark
Edit the Spark configuration files in the conf/ directory (both can be created from the templates shipped with Spark, as shown after this list):
- spark-defaults.conf:

  spark.master            yarn
  spark.executor.memory   4g
  spark.driver.memory     4g

- spark-env.sh:

  export HADOOP_HOME=/opt/hadoops/hadoop
  export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
  export SPARK_DIST_CLASSPATH=$(hadoop classpath)
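Neither file exists in a fresh download; both can be created from the templates shipped in the conf/ directory:
cd spark-3.3.2-bin-hadoop3
cp conf/spark-defaults.conf.template conf/spark-defaults.conf
cp conf/spark-env.sh.template conf/spark-env.sh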
8. Start Spark
Start the Spark standalone daemons (a master and one worker):
./sbin/start-master.sh
./sbin/start-worker.sh spark://localhost:7077
Verify that Spark is running:
Open the Spark Master Web UI:
http://localhost:8080
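Because spark-defaults.conf sets spark.master to yarn, you can also verify the Hadoop/Spark integration end to end by submitting the bundled SparkPi example to YARN; the standalone daemons above are not needed for this. A sketch, run from the Spark directory (the examples jar name assumes the default Scala 2.12 build of Spark 3.3.2):
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  examples/jars/spark-examples_2.12-3.3.2.jar 100
The running job also appears in the YARN ResourceManager UI at http://localhost:8088.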
With the steps above, Hadoop and Spark are integrated on Debian. Adjust the configuration and versions to match your actual requirements.