
2021 Tencent Rhino-bird Open-source Training Program — Angel, Zeng Shang

Open earlytobed opened this issue 4 years ago • 3 comments

First Assignment

I am honored to have been selected for the Angel project and to start the hands-on open-source phase. Studying together with the mentors and fellow students, and getting to understand the architecture and design principles of the Angel distributed machine learning platform, is a rare opportunity. Below are my working notes from this open-source program. Given my limited experience, errors and shortcomings are inevitable; corrections from expert readers are welcome.

Setting up the Angel environment

This project reproduces a paper on top of Angel-ML/PyTorch-On-Angel. Before anything else, we need a working deployment.

https://github.com/Angel-ML/PyTorch-On-Angel/blob/master/docs/img/pytorch_on_angel_framework.png?raw=true

PyTorch on Angel's architecture

PyTorch-On-Angel consists of three main modules:

  1. Python Client: generates the ScriptModule
  2. Angel PS: the parameter server, responsible for distributed model storage, synchronization, and coordinating computation
  3. Spark: the Spark Driver and Spark Executors load the ScriptModule, handle data processing, and work with the parameter server to train the model and run prediction

Sorting out the dependencies:

  • Generating the ScriptModule from Python code requires a Python environment and the torch package
  • The C++ backend requires libtorch_angel
  • Angel PS, the Spark Driver, and the Spark Executors require Spark
  • The project recommends Spark on YARN, so Hadoop is needed as well
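To illustrate what the Python client produces, here is a minimal sketch of exporting a TorchScript ScriptModule. The toy model and file name below are my own for illustration, not the project's actual deepfm model:

```python
# Minimal sketch: export a toy model as a TorchScript ScriptModule.
# TinyNet and "tiny.pt" are illustrative stand-ins, not project code.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

model = TinyNet()
scripted = torch.jit.script(model)  # compile the module to a ScriptModule
scripted.save("tiny.pt")            # the kind of .pt file the executors load

# Sanity check: scripted execution matches eager execution
x = torch.randn(2, 4)
assert torch.allclose(scripted(x), model(x))
```

The same pattern (define a module, script it, save a .pt file) is what produces the deepfm.pt used later in this walkthrough.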

Everything below was done on Ubuntu 20.04 LTS. Since this is my daily-use machine, the environment is not entirely clean, so I cannot rule out other issues.

PyTorch-On-Angel

The first step, of course:

git clone https://github.com/Angel-ML/PyTorch-On-Angel.git --depth 1

The project documentation describes how to build it. For convenience, I prepared mirror source files under ./addon:

Debian 9 sources.list

deb http://mirrors.cloud.tencent.com/debian stretch main contrib non-free
deb http://mirrors.cloud.tencent.com/debian stretch-updates main contrib non-free
#deb http://mirrors.cloud.tencent.com/debian stretch-backports main contrib non-free
#deb http://mirrors.cloud.tencent.com/debian stretch-proposed-updates main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian stretch main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian stretch-updates main contrib non-free
#deb-src http://mirrors.cloud.tencent.com/debian stretch-backports main contrib non-free
#deb-src http://mirrors.cloud.tencent.com/debian stretch-proposed-updates main contrib non-free

maven settings.xml

<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <mirrors>
    <mirror>
      <id>nexus-tencentyun</id>
      <mirrorOf>*</mirrorOf>
      <name>Nexus tencentyun</name>
      <url>http://mirrors.cloud.tencent.com/nexus/repository/maven-public/</url>
    </mirror>
  </mirrors>
</settings>

The modified Dockerfile:

########################################################################################################################
#                                                       DEV                                                            #
########################################################################################################################
FROM maven:3.6.1-jdk-8 as DEV

##########################
#  install dependencies  #
##########################
COPY ./addon/sources.list /etc/apt/sources.list
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    curl=7.52.1-5+deb9u9 \
    g++=4:6.3.0-4 \
    make=4.1-9.1 \
    unzip=6.0-21+deb9u1 \
    python3 \
    python3-pip \
    python3-setuptools \
    python3-wheel \
    && rm -rf /var/lib/apt/lists/*

#####################
#  Install PyTorch  #
#####################
RUN python3 -m pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple \
    https://files.pythonhosted.org/packages/24/33/ccfe4e16bfa1f2ca10e22bca05b313cba31800f9597f5f282020cd6ba45e/torch-1.3.1-cp35-cp35m-manylinux1_x86_64.whl \
    https://files.pythonhosted.org/packages/1c/f6/e927f7db4f422af037ca3f80b3391e6224ee3ee86473ea05028b2b026f82/torchvision-0.4.0-cp35-cp35m-manylinux1_x86_64.whl

#######################
#  install new cmake  #
#######################
RUN curl -fsSL --insecure -o /tmp/cmake.tar.gz https://cmake.org/files/v3.13/cmake-3.13.4.tar.gz \
    && tar -xzf /tmp/cmake.tar.gz -C /tmp \
    && rm -rf /tmp/cmake.tar.gz  \
    && mv /tmp/cmake-* /tmp/cmake \
    && cd /tmp/cmake \
    && ./bootstrap \
    && make -j8 \
    && make install \
    && rm -rf /tmp/cmake

#######################
#  download libtorch  #
#######################
WORKDIR /opt
RUN curl -fsSL --insecure -o libtorch.zip https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.3.1%2Bcpu.zip \
    && unzip -q libtorch.zip \
    && rm libtorch.zip

ENV TORCH_HOME=/opt/libtorch

########################################################################################################################
#                                                     JAVA BUILDER                                                     #
########################################################################################################################
FROM DEV as JAVA_BUILDER

COPY ./addon/settings.xml /usr/share/maven/conf/

WORKDIR /app

COPY ./java/pom.xml /app

RUN mvn -e -B dependency:resolve dependency:resolve-plugins

COPY ./java /app

RUN mvn -e -B -Dmaven.test.skip=true package

########################################################################################################################
#                                                     CPP BUILDER                                                      #
########################################################################################################################
FROM DEV as CPP_BUILDER

RUN apt-get update  \
    && apt-get install -y --no-install-recommends \
    zip=3.0-11+b1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY ./cpp ./

RUN ./build.sh \
    && cp ./out/*.so "$TORCH_HOME"/lib \
    && cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 "$TORCH_HOME"/lib \
    && ln -s "$TORCH_HOME"/lib torch-lib \
    && zip -qr /torch.zip torch-lib

########################################################################################################################
#                                                       Artifacts                                                      #
########################################################################################################################
FROM alpine:3.10 as ARTIFACTS

WORKDIR /dist
COPY --from=CPP_BUILDER /torch.zip ./
COPY --from=JAVA_BUILDER /app/target/*.jar ./

VOLUME /output

CMD [ "/bin/sh", "-c", "cp ./* /output" ]

Modify cpp/CMakeLists.txt:

set(TORCH_HOME $ENV{TORCH_HOME})

Run build.sh and wait a bit:

./build.sh

If the downloads are slow, you can instead fetch the required files into addon ahead of time and adjust the corresponding parts of the Dockerfile:

cd addon && wget https://cmake.org/files/v3.13/cmake-3.13.4.tar.gz \
    https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.3.1%2Bcpu.zip \
    https://files.pythonhosted.org/packages/24/33/ccfe4e16bfa1f2ca10e22bca05b313cba31800f9597f5f282020cd6ba45e/torch-1.3.1-cp35-cp35m-manylinux1_x86_64.whl \
    https://files.pythonhosted.org/packages/1c/f6/e927f7db4f422af037ca3f80b3391e6224ee3ee86473ea05028b2b026f82/torchvision-0.4.0-cp35-cp35m-manylinux1_x86_64.whl

Change python to python3 in gen_pt_model.sh:

docker run -it --rm -v "$(pwd)/${MODEL_PATH}":/model.py -v "$(pwd)/dist":/output -w /output "${IMAGE_NAME}" python3 /model.py "${@:2}"

The files we need are now under ./dist:

deepfm.pt  pytorch-on-angel-0.2.0.jar  pytorch-on-angel-0.2.0-jar-with-dependencies.jar  torch.zip

That completes step one!

Hadoop

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

Seeing lots of useless files triggers my OCD, so delete them:

find . -name '*.cmd' | xargs rm

Edit the configuration files:

hadoop-env.sh

export JAVA_HOME="adjust for your environment"

core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

That completes the HDFS setup; format it:

hdfs namenode -format

Try starting it to see whether it works. Startup requires passwordless SSH to master and the workers; SSH setup is omitted here.

./start-dfs.sh
jps
# 105141 DataNode
# 104964 NameNode
# 105385 SecondaryNameNode
# If all three are present, everything is fine; otherwise check the logs

mapred-site.xml: switch the execution framework to yarn

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml: YARN resource settings. The default is 8 GB, which may not be enough for Angel; adjust according to your machine:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>12</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>12</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>30720</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>30720</value>
  </property>
</configuration>
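As a quick sanity check on values like the ones above, the configuration can be parsed programmatically. A small Python sketch (the embedded XML just mirrors a subset of my yarn-site.xml):

```python
# Sketch: verify the yarn-site.xml resource numbers are self-consistent.
import xml.etree.ElementTree as ET

yarn_site = """
<configuration>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>30720</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>30720</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>12</value></property>
</configuration>
"""

# Build a name -> value map from the <property> entries
conf = {p.find("name").text: p.find("value").text
        for p in ET.fromstring(yarn_site).iter("property")}

# A container must never be allowed to request more than the node offers
assert int(conf["yarn.scheduler.maximum-allocation-mb"]) <= \
       int(conf["yarn.nodemanager.resource.memory-mb"])
print(int(conf["yarn.nodemanager.resource.memory-mb"]) // 1024, "GB per node")
```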

Try starting it to see whether it works:

./start-yarn.sh
jps
# 107761 ResourceManager
# 108141 NodeManager
# If both are present, everything is fine; otherwise check the logs

Spark

wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz

With Hadoop configured, Spark configuration is simple: Spark on YARN reads the Hadoop configuration directly, so only one change is needed:

spark-env.sh

export HADOOP_CONF_DIR="adjust for your environment"

Try starting it to see whether it works:

./start-all.sh
jps
# 2273766 Worker
# 2273463 Master
# If both are present, everything is fine; otherwise check the logs

Angel

Mind the JDK version, or you will hit errors later:

sudo apt install openjdk-8-jdk -y
sudo apt install maven -y

Build and install protobuf 2.5.0 following its README.txt; remember to run ldconfig at the end:

wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz

Build Angel as documented:

wget https://github.com/Angel-ML/angel/archive/refs/tags/Release-2.4.0.tar.gz

After the build finishes, unpack it and configure:

spark-on-angel-env.sh

export SPARK_HOME="adjust for your environment"
export ANGEL_HOME="adjust for your environment"
export ANGEL_HDFS_HOME="adjust for your environment"
export ANGEL_VERSION=2.4.0

# Work around version issues with some of the jars
angel_ps_external_jar=fastutil-7.1.0.jar,htrace-core-2.05.jar,sizeof-0.3.0.jar,kryo-shaded-4.0.0.jar,minlog-1.3.0.jar,memory-0.8.1.jar,commons-pool-1.6.jar,netty-all-4.1.18.Final.jar,hll-1.6.0.jar
sona_external_jar=fastutil-7.1.0.jar,htrace-core-2.05.jar,sizeof-0.3.0.jar,kryo-shaded-4.0.0.jar,minlog-1.3.0.jar,memory-0.8.1.jar,commons-pool-1.6.jar,netty-all-4.1.18.Final.jar,hll-1.6.0.jar,json4s-jackson_2.11-3.2.11.jar,json4s-ast_2.11-3.2.11.jar,json4s-core_2.11-3.2.11.jar

Create a directory and upload the required files to HDFS:

hdfs dfs -mkdir /angel
hdfs dfs -put ./angel/data/census/census_148d_train.libsvm /angel
hdfs dfs -put ./angel/lib /angel

Put the four files generated earlier in a suitable place:

torch.zip pytorch-on-angel-0.2.0.jar pytorch-on-angel-0.2.0-jar-with-dependencies.jar deepfm.pt

spark-submit: adjust the parameters to your actual setup.

Because --archives torch.zip#torch never worked for me, and searching turned up nothing, I unpacked torch.zip and uploaded the libraries with --files instead:

#!/bin/bash
JAVA_LIBRARY_PATH="adjust for your environment"
source ./angel/bin/spark-on-angel-env.sh
input="adjust for your environment"
output="adjust for your environment"
torchlib=torch-lib/libpthreadpool.a,torch-lib/libcpuinfo_internals.a,torch-lib/libCaffe2_perfkernels_avx2.a,torch-lib/libgmock.a,torch-lib/libprotoc.a,torch-lib/libnnpack.a,torch-lib/libgtest.a,torch-lib/libpytorch_qnnpack.a,torch-lib/libcaffe2_detectron_ops.so,torch-lib/libCaffe2_perfkernels_avx512.a,torch-lib/libgomp-753e6e92.so.1,torch-lib/libgloo.a,torch-lib/libonnx.a,torch-lib/libtorch_angel.so,torch-lib/libbenchmark_main.a,torch-lib/libcaffe2_protos.a,torch-lib/libgtest_main.a,torch-lib/libprotobuf-lite.a,torch-lib/libasmjit.a,torch-lib/libCaffe2_perfkernels_avx.a,torch-lib/libonnx_proto.a,torch-lib/libfoxi_loader.a,torch-lib/libfbgemm.a,torch-lib/libc10.so,torch-lib/libclog.a,torch-lib/libbenchmark.a,torch-lib/libgmock_main.a,torch-lib/libnnpack_reference_layers.a,torch-lib/libcaffe2_module_test_dynamic.so,torch-lib/libqnnpack.a,torch-lib/libprotobuf.a,torch-lib/libc10d.a,torch-lib/libtorch.so,torch-lib/libcpuinfo.a,torch-lib/libstdc++.so.6,torch-lib/libmkldnn.a

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.ps.instances=1 \
    --conf spark.ps.cores=1 \
    --conf spark.ps.jars=$SONA_ANGEL_JARS \
    --conf spark.ps.memory=5g \
    --conf spark.ps.log.level=INFO \
    --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
    --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
    --conf spark.executor.extraLibraryPath=. \
    --conf spark.driver.extraLibraryPath=. \
    --conf spark.executorEnv.OMP_NUM_THREADS=2 \
    --conf spark.executorEnv.MKL_NUM_THREADS=2 \
    --name "deepfm for torch on angel" \
    --jars $SONA_SPARK_JARS \
    --files deepfm.pt,$torchlib \
    --driver-memory 5g \
    --num-executors 1 \
    --executor-cores 1 \
    --executor-memory 5g \
    --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample pytorch-on-angel-0.2.0.jar \
    trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
    stepSize:0.001 numEpoch:10 testRatio:0.1 \
    angelModelOutputPath:$output
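The hand-written torchlib list above is long and error-prone; it can also be generated from the unpacked directory. A sketch (the helper name build_files_list is my own; the demo uses a throwaway directory standing in for torch-lib):

```python
# Sketch: build the comma-separated --files value from the unpacked torch-lib dir.
import os
import tempfile

def build_files_list(libdir):
    """Return a spark-submit --files value like 'torch-lib/a.so,torch-lib/b.a,...'."""
    names = sorted(os.listdir(libdir))
    return ",".join(os.path.join(os.path.basename(libdir), n) for n in names)

# Demo on a temporary directory with a few dummy library files
tmp = tempfile.mkdtemp()
libdir = os.path.join(tmp, "torch-lib")
os.mkdir(libdir)
for name in ("libc10.so", "libtorch.so", "libtorch_angel.so"):
    open(os.path.join(libdir, name), "w").close()

print(build_files_list(libdir))
# prints: torch-lib/libc10.so,torch-lib/libtorch.so,torch-lib/libtorch_angel.so
```

Run against the real torch-lib directory, the returned string can be spliced into --files alongside deepfm.pt.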

See the successful run at http://master:8088/cluster/apps !

success

earlytobed avatar Aug 04 '21 14:08 earlytobed

#106 add mmoe; support multilabel libsvm, multilabelauc

  1. Add a Python implementation of MMoE;
  2. SampleParser supports multilabel libsvm;
  3. Support multi_forward_out;
  4. Backport the multilabelauc implementation from 0.3.0 to the current version.

The code above has been tested: MMoE trains and runs correctly in both single-task and multi-task settings, and the example deepfm model is unaffected and still runs correctly.

earlytobed avatar Sep 05 '21 13:09 earlytobed

@earlytobed Both the build and the run fail:

  1. add_subdirectory(pytorch_scatter-2.0.5) in CMakeLists.txt fails
  2. At runtime, GLIBC_2.23 cannot be found
  3. The path of the class com.tencent.angel.pytorch.examples.supervised.RecommendationExample has changed, but README.md was not updated

jinqinn avatar Feb 22 '22 10:02 jinqinn

@earlytobed Both the build and the run fail:

  1. add_subdirectory(pytorch_scatter-2.0.5) in CMakeLists.txt fails
  2. At runtime, GLIBC_2.23 cannot be found
  3. The path of the class com.tencent.angel.pytorch.examples.supervised.RecommendationExample has changed, but README.md was not updated

Hi,

  1. You need to download pytorch_scatter-2.0.5 from https://github.com/rusty1s/pytorch_scatter/releases/tag/2.0.5 and extract it into the corresponding directory
  2. My example here is based on branch-0.2.0; it works on version 0.2.0

earlytobed avatar Feb 25 '22 04:02 earlytobed