binbinHan comments

Results 9 comments of binbinHan

compared the output results of acceleration schemes from both the deepcache and onediff versions

@onefish51 DeepCache is a lossy algorithm. If you want results closer to the original algorithm, you can set cache_interval to a smaller value, or adjust cache_layer_id and cache_block_id to...
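
The effect of cache_interval mentioned in this comment can be illustrated with a minimal sketch. It assumes a uniform schedule in which a full forward pass runs on every cache_interval-th denoising step and the other steps reuse cached features; the function names are hypothetical and are not part of the deepcache or onediff API.

```python
from typing import List


def deepcache_schedule(num_steps: int, cache_interval: int) -> List[str]:
    """Label each denoising step as a full forward pass or a cached one."""
    if cache_interval < 1:
        raise ValueError("cache_interval must be >= 1")
    # Assumption: uniform schedule, a full pass on every cache_interval-th step.
    return [
        "full" if step % cache_interval == 0 else "cached"
        for step in range(num_steps)
    ]


def full_pass_ratio(num_steps: int, cache_interval: int) -> float:
    """Fraction of steps that run the full (lossless) forward pass."""
    schedule = deepcache_schedule(num_steps, cache_interval)
    return schedule.count("full") / num_steps
```

A smaller cache_interval raises the fraction of full passes, which moves the output closer to the original algorithm at the cost of speed; cache_interval=1 disables reuse entirely in this sketch.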

dynamic batch size failed

@HydrogenQAQ Sorry, I cannot reproduce the error with your script. Can you tell us the version of diffusers in your env? Or maybe you can update oneflow and onediff then...

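2. 1D parallelism @clackhan [Global tensor](https://docs.oneflow.org/master/parallelism/03_consistent_tensor.html) can easily support any kind of parallelism, including data parallelism and model parallelism, and can run across multiple machines. > **Note:** The code in this tutorial runs on a 2-GPU server, but it can easily be generalized to other environments. - [ ] Data parallelism - Model construction: in data-parallel mode, each GPU holds a complete copy of the model parameters, the parameters on every card are identical, and each rank receives different input data. Next, we train a data-parallel network in Global mode. The first step is to create the model; the code below defines a network with two fully connected layers and extends it to two cards. > **Note:** When the single-device model is extended to two cards via to_global, the parameters of the model on rank 0 are broadcast to the other ranks, so there is no need to worry about different processes starting from different initial parameter values. ```python import oneflow as flow import oneflow.nn as...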

nccl not support for float16?

I cannot reproduce your problem. Can you print the dtypes of input and weight before the matmul to make sure they are the same? If the dtype of input and weight...
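
The check suggested in this comment can be sketched as a small helper. It only assumes that both operands expose a `.dtype` attribute, as tensors do; the helper names and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def dtypes_match(inp, weight) -> bool:
    """Return True when both matmul operands report the same dtype."""
    return inp.dtype == weight.dtype


def dtype_report(inp, weight) -> str:
    """Build a one-line report to print right before the matmul call."""
    return (
        f"input dtype={inp.dtype}, weight dtype={weight.dtype}, "
        f"match={dtypes_match(inp, weight)}"
    )


# Stand-ins for real tensors, used only so the sketch runs without a framework.
fake_input = SimpleNamespace(dtype="float16")
fake_weight = SimpleNamespace(dtype="float32")
print(dtype_report(fake_input, fake_weight))
```

With real tensors, the same call can be placed immediately before the failing matmul to confirm whether a mixed float16/float32 pair is involved.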

Fused llama kernel

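> Is this how fast transformer does it?

fast transformer is a pure C++ implementation and can be regarded as a special-purpose implementation. The code implements a `Llama` class and compiles into an executable binary. At runtime a Llama instance is created; all the memory needed for the computation is allocated at once when the object is created and released at once when it is destructed. Because the computation is pure C++ and no memory allocation happens during the whole process, launching operators is very fast. `Llama` is currently still a third-party PR and has no Python implementation. The more mature implementations in the main fast transformer repository, such as GPT, basically follow the same pattern: their PyTorch and TensorFlow implementations wrap the C++-side `class GptOp` and export it to Python.

> Does the Python implementation of llama need manual changes, or is it done automatically through pattern matching?

Using the fused operators requires manual code changes.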

Fused llama kernel

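> > When this object is created, all the memory needed for the computation is allocated at once and released at once on destruction; because the computation is pure C++ and no memory allocation happens during the whole process

> > You mentioned earlier that there is a dynamic shape problem during inference. Does it allocate memory for the max size?

Yes, it allocates the maximum memory required.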
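
The allocate-once strategy described in these two comments can be sketched in a few lines. This is a minimal illustration with a hypothetical `Workspace` class, not fast transformer's code: the buffer is sized for the largest supported request at construction, and smaller dynamic shapes take views into it.

```python
class Workspace:
    """Hypothetical buffer allocated once and reused for every request."""

    def __init__(self, max_elements: int):
        # Single allocation up front, sized for the largest supported shape.
        self._buffer = bytearray(max_elements)
        self.max_elements = max_elements

    def view(self, num_elements: int) -> memoryview:
        """Return a view over the first num_elements bytes; the storage is shared, not copied."""
        if num_elements > self.max_elements:
            raise ValueError("request exceeds the preallocated maximum")
        return memoryview(self._buffer)[:num_elements]


# Assumed limits for illustration: max batch 4, max sequence length 16.
workspace = Workspace(max_elements=4 * 16)
small_request = workspace.view(2 * 8)   # a smaller dynamic shape reuses the same storage
```

The trade-off is that memory for the maximum shape stays reserved even when requests are small, in exchange for having no allocation on the execution path.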

save_pipe and load_pipe not work

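@forestlet This is because of the force_upcast of vae. You need to execute the following code before load_pipe:

```python
import torch

# `pipe` is the pipeline that will be passed to load_pipe.
# Upcast the VAE when it is in float16 and its config forces upcasting.
if pipe.vae.dtype == torch.float16 and pipe.vae.config.force_upcast:
    pipe.upcast_vae()
```

And we will integrate...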

about Nexfort compile cache

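> This is part of the code from autotune; some errors are reported, but it still completes in the end.

nexfort is already fixing this, and the relevant code is being merged as quickly as possible.

> What makes the autotune process so slow? Is there a way to speed it up?

The purpose of autotune is to select the fastest implementation for an op during compilation. Take matmul(x, y), where x.size is (400, 600) and the size of y is (600, 1000): autotune runs this op with different configurations and picks the fastest one, so the autotune process is fairly slow. For now there is no good way to speed it up.

> Or, better, could the compilation results be stored offline and loaded online?

The new version of nexfort already supports a compilation cache. Try updating nexfort, then set the following two environment variables:

```
export NEXFORT_GRAPH_CACHE=1
export TORCHINDUCTOR_CACHE_DIR=~/torchinductor
```

These two environment variables do the following: NEXFORT_GRAPH_CACHE=1 turns on the compilation cache, and TORCHINDUCTOR_CACHE_DIR sets the location of the compilation cache.

> Why are there some extremely large sizes? (e.g. AUTOTUNE nexfort_geglu(36864x640, 640x2560, 640x2560,...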
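
The selection step described in this comment, and why a cache helps, can be sketched as follows. This is a minimal illustration rather than nexfort's implementation: the `autotune` function, the in-memory dict cache, and the `benchmark` callback are all hypothetical stand-ins, and a real cache would persist on disk.

```python
from typing import Callable, Dict, Hashable, Sequence


def autotune(
    op_key: Hashable,
    configs: Sequence[str],
    benchmark: Callable[[str], float],
    cache: Dict[Hashable, str],
) -> str:
    """Pick the config with the lowest measured time, reusing cached choices."""
    if op_key in cache:
        # Cache hit: skip all benchmarking for this op and shape.
        return cache[op_key]
    # Cache miss: every candidate config is executed once, which is the slow part.
    best = min(configs, key=benchmark)
    cache[op_key] = best
    return best
```

Here `op_key` would identify the op together with its input shapes, for example `("matmul", (400, 600), (600, 1000))`, and `benchmark` would time one run of the op under a given configuration.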

Flux acceleration with nexfort: using only the nexfort.compilers import transform functionality reports no license

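> I am using the community edition's transform functionality, so why does it report that there is no license? I am not using the enterprise edition's quantization feature.

An autotune cache feature was added recently. It is also an enterprise edition feature, so you need to set NEXFORT_ENABLE_TRITON_AUTOTUNE_CACHE=0 to turn it off.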
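
One way to apply this setting from a Python script is shown below. It is a sketch under the assumption that the variable must be set before nexfort is imported or the compile step runs; the comment does not say when the variable is read.

```python
import os

# Turn off the autotune cache, as suggested in the comment above.
# Assumption: this must happen before nexfort is imported or compilation starts.
os.environ["NEXFORT_ENABLE_TRITON_AUTOTUNE_CACHE"] = "0"

# Import nexfort and call the transform step only after this point.
```

Setting the variable in the shell before launching the script avoids any dependence on import order.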