All registered kernel functions are C++ functions now; how do we register an MLIR-compiled function?

Open · shanshanpt opened this issue 5 years ago · 20 comments

All registered kernel functions are C++ functions now; how do we register an MLIR-compiled function? I'm interested in the concurrency of these MLIR-compiled functions. Does TFRT support JIT execution now? I couldn't find how to JIT yet. :)
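For context, the existing C++ registration path referred to above looks roughly like this (a minimal sketch following the pattern in TFRT's tutorial; the kernel name and header paths are illustrative, not taken from this thread):

```cpp
// Minimal sketch of today's hand-written C++ kernel registration in TFRT.
// Names ("my.add.i32", RegisterMyKernels) are illustrative.
#include <cstdint>

#include "tfrt/host_context/kernel_registry.h"  // KernelRegistry
#include "tfrt/host_context/kernel_utils.h"     // TFRT_KERNEL

// A plain C++ function serves as the kernel body.
static int32_t MyAddI32(int32_t a, int32_t b) { return a + b; }

// The BEF executor later looks the kernel up by its string name.
void RegisterMyKernels(tfrt::KernelRegistry* registry) {
  registry->AddKernel("my.add.i32", TFRT_KERNEL(MyAddI32));
}
```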

Thanks.

shanshanpt · May 03 '20

@zhangqiaorjc @martinwicke Hello, could you please help take a look? Thanks.

YongCHN · May 07 '20

Can you please provide some examples / use cases of "MLIR compiled functions"?

TFRT supports some form of JIT (implemented for Google TPU, but not open sourced). Please share your use cases of JIT so that we can assess further.

mhong · May 07 '20

@mhong Thanks for the reply.

We are currently investigating how to leverage the MLIR compiler capability with the new TFRT in order to improve machine learning workload performance inside Alibaba. As we went through the TFRT design and codebase, we found that the functions/kernels defined in the .mlir file are required to be registered in the KernelLibrary, and later the BEFExecutor invokes the related functions/kernels in the work queue. We are wondering how we can JIT a kernel if we want to use MLIR in the new TFRT. Will TFRT open source this capability in the future?
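To make the question concrete, the flow we have in mind is roughly the following (a hypothetical sketch: `JitCompileMlirToFn` stands in for whatever MLIR-based JIT entry point TFRT might expose and is not an existing API; only the `AddKernel`/`TFRT_KERNEL` registration mirrors the open-source code):

```cpp
// Hypothetical sketch: expose a JIT-compiled MLIR function as a TFRT kernel.
// JitCompileMlirToFn is NOT an existing TFRT/MLIR API; it is a placeholder
// for whatever JIT entry point would be provided.
#include <cstdint>

#include "tfrt/host_context/kernel_registry.h"
#include "tfrt/host_context/kernel_utils.h"

using JittedFn = int32_t (*)(int32_t, int32_t);
JittedFn JitCompileMlirToFn(const char* mlir_source);  // hypothetical helper

static JittedFn jitted_add = nullptr;

// Thin wrapper so the jitted code can be registered like any other kernel.
static int32_t JittedAddI32(int32_t a, int32_t b) { return jitted_add(a, b); }

void RegisterJittedKernels(tfrt::KernelRegistry* registry) {
  jitted_add = JitCompileMlirToFn(/*mlir_source=*/"...");
  registry->AddKernel("demo.jitted_add.i32", TFRT_KERNEL(JittedAddI32));
}
```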

Thanks.

YongCHN · May 08 '20

Thanks Mingsheng, we have some use cases. For example, we currently JIT with XLA, and now we want to use MLIR instead of XLA to compile some subgraphs, so we hope TFRT can support JIT. And one more point of confusion: will TFRT rewrite all op kernels? So far I haven't found how TFRT can reuse TF kernels.

Thanks.

shanshanpt · May 08 '20

TFRT will support different flavors of kernels: codegen'ed (via XLA, MLIR, or other technology), hand-written (e.g. via Eigen), and library-based (e.g. by calling into cuDNN).

Codegen'ed kernels can be AOT (ahead-of-time) generated or JITted. Kernel codegen is being worked on, but we don't yet have an open-sourcing timeline for you.

TFRT can call into (most of) the existing TF ops and kernels. We will share some pointers when the work is ready.

mhong · May 08 '20

> TFRT will support different flavors of kernels: codegen'ed (via XLA, MLIR, or other technology), hand-written (e.g. via Eigen), and library-based (e.g. by calling into cuDNN).
>
> Codegen'ed kernels can be AOT (ahead-of-time) generated or JITted. Kernel codegen is being worked on, but we don't yet have an open-sourcing timeline for you.
>
> TFRT can call into (most of) the existing TF ops and kernels. We will share some pointers when the work is ready.

My hunch is that it looks like a linker for binaries, be they AOT or JIT compiled. Assuming we do not need link-time optimization, a linker boils down to a calling convention.

Being interoperable is great for downstream library authors.

byronyi · May 08 '20

> TFRT will support different flavors of kernels: codegen'ed (via XLA, MLIR, or other technology), hand-written (e.g. via Eigen), and library-based (e.g. by calling into cuDNN).
>
> Codegen'ed kernels can be AOT (ahead-of-time) generated or JITted. Kernel codegen is being worked on, but we don't yet have an open-sourcing timeline for you.
>
> TFRT can call into (most of) the existing TF ops and kernels. We will share some pointers when the work is ready.

Thanks a lot for sharing. I'm quite curious about how the new TFRT will invoke the existing kernels while still improving performance. The new TFRT introduces AsyncValue, which means that a conversion between the existing kernels and the new kernels is required and may reduce performance. If possible, could you please elaborate a bit more on this feature? By the way, during the public sharing you mentioned that ResNet-50 model inference speed improved by around 30%; is there any way we can reproduce this example?
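As a concrete illustration of where that conversion cost would come from, a value computed synchronously by an existing kernel has to be wrapped into an AsyncValue before downstream TFRT kernels can consume it; a rough sketch, assuming the helper names in TFRT's async_value_ref.h (exact signatures may differ across versions):

```cpp
// Sketch: wrapping a result computed eagerly by an existing kernel into an
// AsyncValue so downstream TFRT kernels can consume it. Helper names follow
// tfrt/host_context/async_value_ref.h; exact signatures may differ by version.
#include <cstdint>

#include "tfrt/host_context/async_value_ref.h"
#include "tfrt/host_context/host_context.h"

tfrt::AsyncValueRef<int32_t> WrapExistingKernelResult(tfrt::HostContext* host,
                                                      int32_t result) {
  // The value is already available, so an AsyncValue is allocated directly in
  // the "available" state; this allocation and ref-counting is the per-call
  // conversion overhead being asked about.
  return tfrt::MakeAvailableAsyncValueRef<int32_t>(host, result);
}
```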

Thanks

YongCHN · May 09 '20

When calling existing kernels, indeed some overhead will be incurred, but not all kernel invocations will be on the critical path of the model computation. Our goal in that case is to minimize the overhead and offer generality in TFRT's kernel support.

To out-perform the current stack, as in the ResNet-50 inference model, we would write native TFRT kernels.

The TFRT infra used for the ResNet-50 benchmarking will be open sourced over time as a by-product of our on-going development, but making the benchmark itself easy to reproduce would not be our eng focus; instead our focus is to build out TFRT to fulfill the requirements of TF training and serving integration, and land TFRT in those environments. Afterwards, we will also be looking into MLPerf and other opportunities.

mhong · May 12 '20

> When calling existing kernels, indeed some overhead will be incurred, but not all kernel invocations will be on the critical path of the model computation. Our goal in that case is to minimize the overhead and offer generality in TFRT's kernel support.
>
> To out-perform the current stack, as in the ResNet-50 inference model, we would write native TFRT kernels.
>
> The TFRT infra used for the ResNet-50 benchmarking will be open sourced over time as a by-product of our on-going development, but making the benchmark itself easy to reproduce would not be our eng focus; instead our focus is to build out TFRT to fulfill the requirements of TF training and serving integration, and land TFRT in those environments. Afterwards, we will also be looking into MLPerf and other opportunities.

Thanks a lot for sharing. One more question about the kernels. You mentioned that native TFRT kernels will be written in order to get better performance. Since TensorFlow has more than a thousand kernels, do we intend to rewrite most of the widely used kernels by hand? Or do we intend to use MLIR to JIT/AOT the kernels?

thanks.

YongCHN · May 12 '20

We will lazily write native kernels based on performance needs. There is probably a skewed distribution of the kernels, such that we need not rewrite thousands of them. For the native kernels, we will have both hand-written kernels (including wrapping existing libraries such as cuDNN) and codegen'ed kernels via JIT/AOT.

mhong · May 12 '20

> There is probably a skewed distribution of the kernels.

With this, do you mean that fusion passes could do something better than they do now?

bhack · May 12 '20

I'm not sure how you drew the connection. Can you elaborate on your reasoning and/or question?

mhong · May 12 '20

I don't know if it's correct or not, but I connect this with some MLIR-related preliminary observations/experiments at https://gist.github.com/stellaraccident/2c11652cfdee1457921bc7c98807b462, and with the idea that the canned kernel set could be re-analyzed against what has accumulated over time in TF.

bhack · May 12 '20

I know this may not be saying much: independent of the "skewed distribution" observation, TF will continue to improve its kernel technology, including fusion passes.

mhong · May 12 '20

Thanks. I hope you can disclose more soon, also on what kind of positive impact, if any, this could have on users' custom ops.

bhack · May 12 '20

Thanks Mingsheng, could you share some detailed documents about how to JIT? For example, how to plug the JIT into the BEF, maybe with an architecture diagram. These would help us understand some of the design details. Thanks.

shanshanpt · May 14 '20

Will share the info down the road when ready. Thanks.

mhong · May 14 '20

We're interested in your project and already working with MLIR for our own project. I have a question that is very related: if we fuse 3 operations in MLIR because we can come up with a more efficient version, can we add a "never seen before" op for which we provide the code in the BEF file? We would then expect TFRT to invoke this provided code on a CPU once all the inputs are computed, and to notify dependent operations of the completion once the code has finished.

I believe this scenario is of interest to the MLIR community. Is that something that is on your radar? Thanks.

AlexandreEichenberger · May 26 '20

Such kernel fusion scenarios will be supported (design is WIP). I don't think this has to be done by introducing a new op to the graph though; it could be a TFRT BEF kernel (hand-written, or fused/generated by a device compiler like XLA).
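For example, such a fused computation could be exposed as an ordinary hand-written TFRT kernel that the BEF function calls directly, with the executor handling input readiness and completion notification through AsyncValues; a minimal sketch with illustrative names:

```cpp
// Sketch: a small fusion (mul + add + ReLU) exposed as a single hand-written
// TFRT kernel rather than a new graph op. Names are illustrative. The executor
// invokes it once its inputs' AsyncValues are available and makes the result
// available to dependent kernels when it returns.
#include <algorithm>

#include "tfrt/host_context/kernel_registry.h"
#include "tfrt/host_context/kernel_utils.h"

static float FusedMulAddRelu(float a, float b, float c) {
  return std::max(a * b + c, 0.0f);
}

void RegisterFusedKernels(tfrt::KernelRegistry* registry) {
  registry->AddKernel("demo.fused_mul_add_relu.f32",
                      TFRT_KERNEL(FusedMulAddRelu));
}
```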

mhong · May 27 '20

Thanks for your response. Our interest would be to group "islands" of original TF operations that can be mapped to a device, and have these islands interact with any other TF operations that were not isolated into these islands, or with other islands. Such islands would include an arbitrary amount of work involving arbitrary numbers of original TF operations and would, ideally, interact like normal ops within TFRT.

AlexandreEichenberger · May 27 '20