PyTorch-On-Angel
2021 Tencent Rhino-bird Open-source Training Program, Angel track, by Zeng Shang
Assignment 1
I am honored to have been selected for the Angel project and to start the hands-on open-source phase. Studying the architecture and design principles of the Angel distributed machine learning platform together with the mentors and other students is a rare opportunity. What follows are my practical notes from this program. Given my limited experience, errors and omissions are inevitable; corrections from expert readers are welcome.
Setting Up the Angel Environment
This project reproduces a paper on top of Angel-ML/PyTorch-On-Angel, so before doing anything else we need a working deployment.

PyTorch on Angel's architecture
PyTorch-On-Angel consists of three main modules:
- Python Client: generates the ScriptModule
- Angel PS: the parameter server, responsible for distributed model storage, synchronization, and coordinating computation
- Spark: the Spark Driver and Spark Executors load the ScriptModule, handle data processing, and work with the parameter server to train and serve the model
Sorting out the dependencies:
- Generating a ScriptModule from Python code requires a Python environment and the torch package
- The C++ backend requires libtorch_angel
- Angel PS and the Spark Driver/Executors require Spark
- The project recommends running Spark on YARN, so Hadoop is needed as well
All of the steps below were done on Ubuntu 20.04 LTS. Since this is my daily machine, the environment is not entirely clean, so I cannot guarantee there are no other issues.
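The ScriptModule that the Python Client produces is simply a TorchScript-serialized model that libtorch can later load without a Python runtime. A minimal sketch of what that generation step looks like (the model, its name, and its shapes are made up for illustration; this is not the project's actual model code):

```python
import torch

# Toy stand-in for a model the Python Client would script (illustrative only)
class TinyRanker(torch.nn.Module):
    def __init__(self, dim: int = 4):
        super().__init__()
        self.fc = torch.nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(x))

# torch.jit.script compiles the module into a ScriptModule;
# save() writes the .pt file that the Spark executors load via libtorch
module = torch.jit.script(TinyRanker())
module.save("tiny_ranker.pt")
```

The same pattern is what produces deepfm.pt later in these notes, just with the project's own model definition.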
PyTorch-On-Angel
Step one, naturally:
git clone https://github.com/Angel-ML/PyTorch-On-Angel.git --depth 1
The project documentation describes the build process. For convenience, I prepared mirror configuration files in advance and put them under ./addon:
Debian 9 sources.list:
deb http://mirrors.cloud.tencent.com/debian stretch main contrib non-free
deb http://mirrors.cloud.tencent.com/debian stretch-updates main contrib non-free
#deb http://mirrors.cloud.tencent.com/debian stretch-backports main contrib non-free
#deb http://mirrors.cloud.tencent.com/debian stretch-proposed-updates main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian stretch main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian stretch-updates main contrib non-free
#deb-src http://mirrors.cloud.tencent.com/debian stretch-backports main contrib non-free
#deb-src http://mirrors.cloud.tencent.com/debian stretch-proposed-updates main contrib non-free
maven settings.xml:
<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
<mirrors>
<mirror>
<id>nexus-tencentyun</id>
<mirrorOf>*</mirrorOf>
<name>Nexus tencentyun</name>
<url>http://mirrors.cloud.tencent.com/nexus/repository/maven-public/</url>
</mirror>
</mirrors>
</settings>
Modified Dockerfile:
########################################################################################################################
# DEV #
########################################################################################################################
FROM maven:3.6.1-jdk-8 as DEV
##########################
# install dependencies #
##########################
COPY ./addon/sources.list /etc/apt/sources.list
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
curl=7.52.1-5+deb9u9 \
g++=4:6.3.0-4 \
make=4.1-9.1 \
unzip=6.0-21+deb9u1 \
python3 \
python3-pip \
python3-setuptools \
python3-wheel \
&& rm -rf /var/lib/apt/lists/*
#####################
# Install PyTorch #
#####################
RUN python3 -m pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple \
https://files.pythonhosted.org/packages/24/33/ccfe4e16bfa1f2ca10e22bca05b313cba31800f9597f5f282020cd6ba45e/torch-1.3.1-cp35-cp35m-manylinux1_x86_64.whl \
https://files.pythonhosted.org/packages/1c/f6/e927f7db4f422af037ca3f80b3391e6224ee3ee86473ea05028b2b026f82/torchvision-0.4.0-cp35-cp35m-manylinux1_x86_64.whl
#######################
# install new cmake #
#######################
RUN curl -fsSL --insecure -o /tmp/cmake.tar.gz https://cmake.org/files/v3.13/cmake-3.13.4.tar.gz \
&& tar -xzf /tmp/cmake.tar.gz -C /tmp \
&& rm -rf /tmp/cmake.tar.gz \
&& mv /tmp/cmake-* /tmp/cmake \
&& cd /tmp/cmake \
&& ./bootstrap \
&& make -j8 \
&& make install \
&& rm -rf /tmp/cmake
#######################
# download libtorch #
#######################
WORKDIR /opt
RUN curl -fsSL --insecure -o libtorch.zip https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.3.1%2Bcpu.zip \
&& unzip -q libtorch.zip \
&& rm libtorch.zip
ENV TORCH_HOME=/opt/libtorch
########################################################################################################################
# JAVA BUILDER #
########################################################################################################################
FROM DEV as JAVA_BUILDER
COPY ./addon/settings.xml /usr/share/maven/conf/
WORKDIR /app
COPY ./java/pom.xml /app
RUN mvn -e -B dependency:resolve dependency:resolve-plugins
COPY ./java /app
RUN mvn -e -B -Dmaven.test.skip=true package
########################################################################################################################
# CPP BUILDER #
########################################################################################################################
FROM DEV as CPP_BUILDER
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
zip=3.0-11+b1 \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY ./cpp ./
RUN ./build.sh \
&& cp ./out/*.so "$TORCH_HOME"/lib \
&& cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 "$TORCH_HOME"/lib \
&& ln -s "$TORCH_HOME"/lib torch-lib \
&& zip -qr /torch.zip torch-lib
########################################################################################################################
# Artifacts #
########################################################################################################################
FROM alpine:3.10 as ARTIFACTS
WORKDIR /dist
COPY --from=CPP_BUILDER /torch.zip ./
COPY --from=JAVA_BUILDER /app/target/*.jar ./
VOLUME /output
CMD [ "/bin/sh", "-c", "cp ./* /output" ]
Modify cpp/CMakeLists.txt:
set(TORCH_HOME $ENV{TORCH_HOME})
Run build.sh and wait a while:
./build.sh
If the downloads are slow, you can also fetch the needed files into addon ahead of time and adjust the corresponding parts of the Dockerfile:
cd addon && wget https://cmake.org/files/v3.13/cmake-3.13.4.tar.gz \
https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.3.1%2Bcpu.zip \
https://files.pythonhosted.org/packages/24/33/ccfe4e16bfa1f2ca10e22bca05b313cba31800f9597f5f282020cd6ba45e/torch-1.3.1-cp35-cp35m-manylinux1_x86_64.whl \
https://files.pythonhosted.org/packages/1c/f6/e927f7db4f422af037ca3f80b3391e6224ee3ee86473ea05028b2b026f82/torchvision-0.4.0-cp35-cp35m-manylinux1_x86_64.whl
Modify gen_pt_model.sh, changing python to python3:
docker run -it --rm -v $(pwd)/${MODEL_PATH}:/model.py -v $(pwd)/dist:/output -w /output ${IMAGE_NAME} python3 /model.py ${@:2}
./dist now contains the files we need:
deepfm.pt pytorch-on-angel-0.2.0.jar pytorch-on-angel-0.2.0-jar-with-dependencies.jar torch.zip
That completes step one!
Hadoop
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
My inner perfectionist insists on deleting the many files we will never use:
find . -name '*.cmd' | xargs rm
Edit the configuration files:
hadoop-env.sh
export JAVA_HOME="set to your JDK path"
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
That finishes the HDFS setup; format it:
hdfs namenode -format
Try starting it to check that it works. Startup requires passwordless SSH to master and the workers; SSH setup is omitted here.
./start-dfs.sh
jps
# 105141 DataNode
# 104964 NameNode
# 105385 SecondaryNameNode
# If all three are present, everything is fine; otherwise check the logs
In mapred-site.xml, switch the execution framework to yarn:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml holds YARN's resource configuration. The default is 8 GB, which may not be enough to run Angel; adjust it to match your machine:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>12</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>12</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>30720</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>30720</value>
</property>
</configuration>
Try starting it to check that it works:
./start-yarn.sh
jps
# 107761 ResourceManager
# 108141 NodeManager
# If both are present, everything is fine; otherwise check the logs
Spark
wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
With Hadoop configured, Spark setup is straightforward: Spark on YARN reads Hadoop's configuration directly, so only one change is needed:
spark-env.sh
export HADOOP_CONF_DIR="set to your Hadoop conf directory"
Try starting it to check that it works:
./start-all.sh
jps
# 2273766 Worker
# 2273463 Master
# If both are present, everything is fine; otherwise check the logs
Angel
Mind the JDK version, or later steps will throw errors:
sudo apt install openjdk-8-jdk -y
sudo apt install maven -y
Build and install protobuf 2.5.0 by following its README.txt; remember to run ldconfig at the end:
wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
Build Angel as described in its documentation:
wget https://github.com/Angel-ML/angel/archive/refs/tags/Release-2.4.0.tar.gz
After the build finishes, unpack it and configure:
spark-on-angel-env.sh
export SPARK_HOME="set to your Spark directory"
export ANGEL_HOME="set to your Angel directory"
export ANGEL_HDFS_HOME="set to your Angel HDFS path"
export ANGEL_VERSION=2.4.0
# work around version conflicts in some jars
angel_ps_external_jar=fastutil-7.1.0.jar,htrace-core-2.05.jar,sizeof-0.3.0.jar,kryo-shaded-4.0.0.jar,minlog-1.3.0.jar,memory-0.8.1.jar,commons-pool-1.6.jar,netty-all-4.1.18.Final.jar,hll-1.6.0.jar
sona_external_jar=fastutil-7.1.0.jar,htrace-core-2.05.jar,sizeof-0.3.0.jar,kryo-shaded-4.0.0.jar,minlog-1.3.0.jar,memory-0.8.1.jar,commons-pool-1.6.jar,netty-all-4.1.18.Final.jar,hll-1.6.0.jar,json4s-jackson_2.11-3.2.11.jar,json4s-ast_2.11-3.2.11.jar,json4s-core_2.11-3.2.11.jar
Create a directory and put the needed files on HDFS for later use:
hdfs dfs -mkdir /angel
hdfs dfs -put ./angel/data/census/census_148d_train.libsvm /angel
hdfs dfs -put ./angel/lib /angel
Place the four files generated earlier in a suitable location:
torch.zip pytorch-on-angel-0.2.0.jar pytorch-on-angel-0.2.0-jar-with-dependencies.jar deepfm.pt
Adjust the spark-submit parameters to your environment.
Because --archives torch.zip#torch never took effect for me, and searching for the cause turned up nothing, I unzipped torch.zip and uploaded the libraries with --files instead:
#!/bin/bash
JAVA_LIBRARY_PATH="set to your library path"
source ./angel/bin/spark-on-angel-env.sh
input="set to your input path"
output="set to your output path"
torchlib=torch-lib/libpthreadpool.a,torch-lib/libcpuinfo_internals.a,torch-lib/libCaffe2_perfkernels_avx2.a,torch-lib/libgmock.a,torch-lib/libprotoc.a,torch-lib/libnnpack.a,torch-lib/libgtest.a,torch-lib/libpytorch_qnnpack.a,torch-lib/libcaffe2_detectron_ops.so,torch-lib/libCaffe2_perfkernels_avx512.a,torch-lib/libgomp-753e6e92.so.1,torch-lib/libgloo.a,torch-lib/libonnx.a,torch-lib/libtorch_angel.so,torch-lib/libbenchmark_main.a,torch-lib/libcaffe2_protos.a,torch-lib/libgtest_main.a,torch-lib/libprotobuf-lite.a,torch-lib/libasmjit.a,torch-lib/libCaffe2_perfkernels_avx.a,torch-lib/libonnx_proto.a,torch-lib/libfoxi_loader.a,torch-lib/libfbgemm.a,torch-lib/libc10.so,torch-lib/libclog.a,torch-lib/libbenchmark.a,torch-lib/libgmock_main.a,torch-lib/libnnpack_reference_layers.a,torch-lib/libcaffe2_module_test_dynamic.so,torch-lib/libqnnpack.a,torch-lib/libprotobuf.a,torch-lib/libc10d.a,torch-lib/libtorch.so,torch-lib/libcpuinfo.a,torch-lib/libstdc++.so.6,torch-lib/libmkldnn.a
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.ps.instances=1 \
--conf spark.ps.cores=1 \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.memory=5g \
--conf spark.ps.log.level=INFO \
--conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
--conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
--conf spark.executor.extraLibraryPath=. \
--conf spark.driver.extraLibraryPath=. \
--conf spark.executorEnv.OMP_NUM_THREADS=2 \
--conf spark.executorEnv.MKL_NUM_THREADS=2 \
--name "deepfm for torch on angel" \
--jars $SONA_SPARK_JARS \
--files deepfm.pt,$torchlib \
--driver-memory 5g \
--num-executors 1 \
--executor-cores 1 \
--executor-memory 5g \
--class com.tencent.angel.pytorch.examples.supervised.RecommendationExample pytorch-on-angel-0.2.0.jar \
trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
stepSize:0.001 numEpoch:10 testRatio:0.1 \
angelModelOutputPath:$output
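The long torch-lib file list in the $torchlib variable above does not have to be typed by hand. A small Python sketch (assuming the unzipped torch-lib directory sits in the working directory) can generate the comma-separated value that spark-submit --files expects:

```python
import os

def build_files_arg(lib_dir: str = "torch-lib") -> str:
    """Join every file under lib_dir into the comma-separated
    list format expected by spark-submit --files."""
    names = sorted(os.listdir(lib_dir))
    return ",".join(f"{lib_dir}/{name}" for name in names)

if __name__ == "__main__":
    # Paste the printed value into the torchlib= line of the submit script
    print(build_files_arg())
```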
Then head to http://master:8088/cluster/apps and collect your success!

#106 add mmoe; support multilabel libsvm, multilabelauc
- Add a Python implementation of MMoE;
- Make SampleParser support multilabel libsvm;
- Support multi_forward_out;
- Backport the multilabelauc implementation from 0.3.0 into the current version;
The code above has been tested: MMoE trains and runs correctly in both single-task and multi-task modes, and the example deepfm model is unaffected and still runs correctly.
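For readers unfamiliar with MMoE (Multi-gate Mixture-of-Experts): each task mixes a set of shared experts through its own softmax gate, then feeds the mixture into a task-specific tower. A rough numpy sketch of the forward pass under those assumptions (expert count, dimensions, and activations are illustrative, not the PR's actual implementation):

```python
import numpy as np

def mmoe_forward(x, expert_ws, gate_ws, tower_ws):
    """Sketch of an MMoE forward pass: shared experts, plus one
    softmax gate and one sigmoid tower per task. x: (batch, in_dim)."""
    # Run every shared expert: (batch, n_expert, d)
    experts = np.stack([np.tanh(x @ w) for w in expert_ws], axis=1)
    outs = []
    for wg, wt in zip(gate_ws, tower_ws):        # one gate + tower per task
        logits = x @ wg                          # (batch, n_expert)
        logits -= logits.max(axis=1, keepdims=True)
        gate = np.exp(logits)
        gate /= gate.sum(axis=1, keepdims=True)  # softmax over experts
        mixed = (gate[:, :, None] * experts).sum(axis=1)  # task-specific mixture
        outs.append(1.0 / (1.0 + np.exp(-(mixed @ wt))))  # per-task prediction
    return outs
```

Single-task MMoE, as tested in the PR, is simply the case where gate_ws and tower_ws contain one entry.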
@earlytobed The build fails, and so does running it:
- add_subdirectory(pytorch_scatter-2.0.5) in CMakeLists.txt throws an error
- At runtime, GLIBC_2.23 cannot be found
- The path of the class com.tencent.angel.pytorch.examples.supervised.RecommendationExample has changed, but READ.md was not updated
Hi,
- pytorch_scatter-2.0.5 needs to be downloaded from https://github.com/rusty1s/pytorch_scatter/releases/tag/2.0.5 and unpacked into the corresponding directory
- My example here is based on branch-0.2.0; it works on version 0.2.0