fix: [nvbugs/5066257] serialization improvements
Description
Add an approved list to the pickle deserialization process, reducing the attack surface of pickle to a fixed subset of supported objects.
Test Coverage
If a new, unsupported object is serialized or deserialized, this PR raises a ValueError that clearly states which module is not supported. Fully exercising the zmq interfaces covers this code.
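The approved-list idea can be sketched with a restricted `pickle.Unpickler` whose `find_class` rejects anything not explicitly approved. This is a minimal illustration, not the PR's actual implementation; the `APPROVED` entries and the `restricted_loads` helper are hypothetical names chosen here for the example.

```python
import io
import pickle

# Hypothetical approved list -- the real PR maintains its own set of
# supported modules/classes; these entries are illustrative only.
APPROVED = {
    ("builtins", "dict"),
    ("builtins", "list"),
    ("collections", "OrderedDict"),
}

class ApprovedUnpickler(pickle.Unpickler):
    """Unpickler that only reconstructs objects on an approved list."""

    def find_class(self, module, name):
        # Reject anything not explicitly approved, naming the offending
        # module so a newly added type surfaces as a clear error instead
        # of silently widening the attack surface.
        if (module, name) not in APPROVED:
            raise ValueError(
                f"Unsupported object in pickle stream: {module}.{name}")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    """Deserialize bytes, allowing only approved classes."""
    return ApprovedUnpickler(io.BytesIO(data)).load()
```

Built-in containers deserialize normally, while an unapproved class (say, `datetime.date`) raises a ValueError naming the module, which is how a missing entry makes itself known.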
Thanks a lot for making this PR! Several general comments:
- Could you please follow https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md to sign your commits and run the pre-commit hooks.
- Please add the complicated dataclass serialization unit test we discussed offline to test_executor.py. It would also be good to have additional tests for serialization.py. I can trigger a pipeline test after you add the tests to see if there are any test failures.
@kaiyux This is the "approved list approach" I discussed with you offline. Let me know your thoughts.
cc @Superjomn @litaotju @juney-nvidia
@yibinl-nvidia I fixed the DCO, ran the pre-commit hooks, added the tests, and made some improvements based on the comments. Let me know if there is anything else I should address.
/bot run
PR_Github #3654 [ run ] triggered by Bot
PR_Github #3654 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #2585 completed with status: 'FAILURE'
I did a micro-benchmark comparing the custom serialization and deserialization against pickle. The first result is based on a simple GenerationResult object (the same one as in test_executor.py), approximately 2 KB in size. The second result benchmarks messages of various sizes. This shows that the new approach introduces only trivial, microsecond-level performance degradation.
Serialization and Deserialization Performance Results for GenerationRequest with 10000 iterations:
Pickle (avg/min/max): 199.038μs / 186.559μs / 1271.870μs
Custom (avg/min/max): 204.601μs / 192.718μs / 363.867μs
Average time difference: 5.56 μs
Size Comparison:
Pickle size: 1920 bytes
Custom size: 1920 bytes
Size ratio (custom/pickle): 1.00x
Serialization Performance Across Different Message Sizes:
| Size     | Pickle Avg (μs) | Custom Avg (μs) | Diff (μs) | Size Ratio |
|----------|-----------------|-----------------|-----------|------------|
| 1.0 KB   | 200.348         | 205.710         | 5         | 1.00x      |
| 10.0 KB  | 279.806         | 285.365         | 5         | 1.00x      |
| 100.0 KB | 1044.078        | 1051.731        | 7         | 1.00x      |
| 1.0 MB   | 8487.678        | 8505.488        | 17        | 1.00x      |
| 10.0 MB  | 92523.012       | 93069.062       | 546       | 1.00x      |
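For reference, a micro-benchmark along these lines can be sketched as below. The `custom_dumps`/`custom_loads` stand-ins and the payload are placeholders, not the PR's actual serializer or test objects; only the timing harness itself is shown.

```python
import pickle
import statistics
import time

def custom_dumps(obj):
    # Placeholder for the PR's custom serializer; pickle stand-in only.
    return pickle.dumps(obj)

def custom_loads(data):
    # Placeholder for the PR's custom deserializer.
    return pickle.loads(data)

def bench(dumps, loads, payload, iterations=1000):
    """Time serialize+deserialize round trips; returns (avg, min, max) in μs."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        loads(dumps(payload))
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.mean(samples), min(samples), max(samples)

payload = {"tokens": list(range(256))}  # roughly KB-scale message
p_avg, p_min, p_max = bench(pickle.dumps, pickle.loads, payload)
c_avg, c_min, c_max = bench(custom_dumps, custom_loads, payload)
print(f"Pickle (avg/min/max): {p_avg:.3f} / {p_min:.3f} / {p_max:.3f} μs")
print(f"Custom (avg/min/max): {c_avg:.3f} / {c_min:.3f} / {c_max:.3f} μs")
```

Timing the full round trip (rather than serialization alone) matches how the message path is exercised over the zmq interfaces.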
The last commit should resolve the issue reported by blossom.
/bot run
PR_Github #3744 [ run ] triggered by Bot
PR_Github #3744 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2650 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #3766 [ run ] triggered by Bot
The tests that failed in the basic bot run should be fixed by the new commits.
PR_Github #3766 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2667 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #3873 [ run ] triggered by Bot
PR_Github #3873 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2746 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #3941 [ run ] triggered by Bot
PR_Github #3941 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2793 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #4000 [ run ] triggered by Bot
PR_Github #4000 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2841 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #4012 [ run ] triggered by Bot
PR_Github #4012 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2850 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #4038 [ run ] triggered by Bot
PR_Github #4038 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2868 completed with status: 'FAILURE'