DPL: earlier forwarding
This anticipates the forwarding to the earliest possible moment, i.e. when we are about to insert the messages in a slot. This is the earliest moment we can guarantee messages will be seen only once.
Stack created with Sapling. Best reviewed with ReviewStack.
- -> #14910 (2 commits)
- #14911 (2 commits)
- #14859 (2 commits)
REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-
+async-label <label1>, <label2>, !<label3> ...
This will add <label1> and <label2> and removes <label3>.
The following labels are available async-2023-pbpb-apass4 async-2023-pp-apass4 async-2024-pp-apass1 async-2022-pp-apass7 async-2024-pp-cpass0 async-2024-PbPb-apass1 async-2024-ppRef-apass1 async-2024-PbPb-apass2 async-2023-PbPb-apass5
@shahor02 this works in my synthetic tests (stage/bin/o2-testworkflows-early-forwarding -s --severity detail --early-forward-policy=always) . In the end I refactored the code to find the earliest spot where messages are guaranteed to be seen only once and I moved the early forward there.
@davidrohr @shahor02 I have noticed that the early forwarding is disabled by default. Is this expected?
@jgrosseo @nicolaspoffley I expect this to improve parallelism on hyperloop as well.
@ktf for me it is not expected that the EF is disabled, when I was debugging the slow turnover of Polaris jobs, I thought the forwarding is done at the beginning of run method. Was not this the supposed behaviour of the EF?
@shahor02 I need to have a better look. Maybe it's just my small reproducer to be wrong.
I also see there is some issues with some of the tests. I will debug better tomorrow morning.
Error while checking build/O2/fullCI_slc9 for 13018ab17136df492f574400382d9875f2c20767 at 2025-12-11 02:12:
No log files found
Full log here.
Ok, fixed the off by one issue with multiparts.
Error while checking build/O2/fullCI_slc9 for f6dfccea8b4921a0dc9b67dab536271bd70fffc4 at 2026-01-02 17:50:
## sw/BUILD/O2-full-system-test-latest/log
command /sw/slc9_x86-64/O2/14910-slc9_x86-64-local6/prodtests/full-system-test/dpl-workflow.sh had nonzero exit code 137
[ERROR] Workflow crashed - PID 8320 (EMCALRawToCellConverterSpec) did not exit correctly however it's not clear why. Exit code forced to 128.
[ERROR] Workflow crashed - PID 8876 (FDD-DigitQcTaskFDD-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8885 (GLO-MTCITSTPC-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8964 (ZDC-QcZDCRecTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8935 (TPC-Clusters-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8806 (CPV-PhysicsOnEPNs-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8861 (EMC-RawTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8837 (TRD-RawData-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8825 (TPC-PID-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8896 (ITS-ITSClusterTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8954 (TRD-Tracklets-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8816 (GLO-MUONTracks-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8843 (EMC-CellTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8877 (FT0-DigitQcTaskFT0-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8902 (ITS-ITSTrackTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8841 (TRD-Tracking-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8952 (TPC-Tracks-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8925 (MID-QcTaskMIDDigits-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8932 (TOF-MatchingTOFwTRD-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8931 (PHS-ClusterTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8920 (MID-QcTaskMIDClust-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8824 (MFT-MFTAsyncTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8835 (TRD-PHTrackMatch-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8913 (MFT-MFTClusterTask-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8895 (GLO-Vertexing-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8884 (FV0-DigitQcTaskFV0-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8933 (TOF-TaskDigits-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8912 (MCH-QcTaskMCHDigits-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8965 (internal-dpl-injected-dummy-sink) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8930 (MID-QcTaskMIDTracks-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8953 (TRD-Digits-proxy) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8617 (qc-task-MID-QcTaskMIDTracks) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8613 (qc-task-MID-QcTaskMIDDigits) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8681 (qc-task-TPC-Tracks) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8655 (qc-task-TOF-TaskDigits) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8796 (qc-task-TRD-Tracklets) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8670 (qc-task-TPC-Clusters) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8683 (qc-task-TRD-Digits) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8651 (qc-task-TOF-MatchingTOFwTRD) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8316 (internal-dpl-ccdb-backend) was killed abnormally with Killed and exited code was set to 137.
[ERROR] Workflow crashed - PID 8802 (qc-task-ZDC-QcZDCRecTask) was killed abnormally with Killed and exited code was set to 137.
[ERROR] - Device internal-dpl-ccdb-backend: pid 8316 (exit 137)
[ERROR] - Device EMCALRawToCellConverterSpec: pid 8320 (exit 128)
[ERROR] - Device qc-task-MID-QcTaskMIDDigits: pid 8613 (exit 137)
[ERROR] - Device qc-task-MID-QcTaskMIDTracks: pid 8617 (exit 137)
[ERROR] - Device qc-task-TOF-MatchingTOFwTRD: pid 8651 (exit 137)
[ERROR] - Device qc-task-TOF-TaskDigits: pid 8655 (exit 137)
[0 more errors; see full log]
Full log here.
@shahor02 this works in my synthetic tests (stage/bin/o2-testworkflows-early-forwarding -s --severity detail --early-forward-policy=always) . In the end I refactored the code to find the earliest spot where messages are guaranteed to be seen only once and I moved the early forward there.
@davidrohr @shahor02 I have noticed that the early forwarding is disabled by default. Is this expected?
For online and offline reco we enable it here: https://github.com/davidrohr/O2DPG/blob/a5af1be2a96bbe3b2eeb2cf13d41c4afd1b81e4a/DATA/common/getCommonArgs.sh#L12
@ktf this seems to be genuine crash:
[8369:EMCALRawToCellConverterSpec]: [14:43:54][INFO] Correctly handshaken websocket connection.
[8369:EMCALRawToCellConverterSpec]: [14:43:59][WARN] Timed out sending after 1s. Downstream backpressure detected on from_EMCALRawToCellConverterSpec_to_Dispatcher[0].
[8369:EMCALRawToCellConverterSpec]: [14:44:02][INFO] Downstream backpressure on from_EMCALRawToCellConverterSpec_to_Dispatcher[0] recovered.
[8369:EMCALRawToCellConverterSpec]: *** Program crashed (Segmentation fault)
[8369:EMCALRawToCellConverterSpec]: Backtrace by DPL:
[8369:EMCALRawToCellConverterSpec]: Executable is /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/bin/o2-emcal-reco-workflow
[8369:EMCALRawToCellConverterSpec]: /lib64/libc.so.6: ?? ??:0
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: fair::mq::shmem::Message::Copy(fair::mq::Message const&)
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: o2::framework::DataProcessingHelpers::routeForwardedMessages(o2::framework::FairMQDeviceProxy&, std::span<std::unique_ptr<fair::mq::Message, std::default_delete<fair::mq::Message> >, 18446744073709551615ul>&, std::vector<fair::mq::Parts, std::allocator<fair::mq::Parts> >&, bool, bool)
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: ?? ??:0
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: ?? ??:0
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: o2::framework::DataRelayer::relay(void const*, std::unique_ptr<fair::mq::Message, std::default_delete<fair::mq::Message> >*, o2::framework::DataRelayer::InputInfo const&, unsigned long, unsigned long, std::function<void (o2::framework::ServiceRegistryRef&, std::span<std::unique_ptr<fair::mq::Message, std::default_delete<fair::mq::Message> >, 18446744073709551615ul>&)>, std::function<void (o2::framework::TimesliceSlot, std::vector<o2::framework::MessageSet, std::allocator<o2::framework::MessageSet> >&, o2::framework::TimesliceIndex::OldestOutputInfo)>)
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: ?? ??:0
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: o2::framework::DataProcessingDevice::handleData(o2::framework::ServiceRegistryRef, o2::framework::InputChannelInfo&)
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: o2::framework::DataProcessingDevice::doPrepare(o2::framework::ServiceRegistryRef)
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: o2::framework::run_callback(uv_work_s*)
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: o2::framework::DataProcessingDevice::Run()
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/FairMQ/v1.10.0-7/lib/libfairmq.so.1.10.0: fair::mq::Device::RunWrapper()
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/FairMQ/v1.10.0-7/lib/libfairmq.so.1.10.0: boost::detail::function::void_function_obj_invoker1<std::function<void (fair::mq::State)>, void, fair::mq::State>::invoke(boost::detail::function::function_buffer&, fair::mq::State)
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/FairMQ/v1.10.0-7/lib/libfairmq.so.1.10.0: boost::signals2::detail::signal_impl<void (fair::mq::State), boost::signals2::optional_last_value<void>, int, std::less<int>, boost::function<void (fair::mq::State)>, boost::function<void (boost::signals2::connection const&, fair::mq::State)>, boost::signals2::mutex>::operator()(fair::mq::State)
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/FairMQ/v1.10.0-7/lib/libfairmq.so.1.10.0: fair::mq::fsm::Machine_::ProcessWork()
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/FairMQ/v1.10.0-7/lib/libfairmq.so.1.10.0: fair::mq::StateMachine::ProcessWork()
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/FairMQ/v1.10.0-7/lib/libfairmq.so.1.10.0: fair::mq::DeviceRunner::Run()
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: doChild(int, char**, o2::framework::ServiceRegistry&, o2::framework::RunningWorkflowInfo const&, o2::framework::RunningDeviceRef, o2::framework::DriverConfig const&, o2::framework::ProcessingPolicies, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uv_loop_s*)
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: runStateMachine(std::vector<o2::framework::DataProcessorSpec, std::allocator<o2::framework::DataProcessorSpec> > const&, WorkflowInfo const&, std::vector<o2::framework::DataProcessorInfo, std::allocator<o2::framework::DataProcessorInfo> > const&, o2::framework::CommandInfo const&, o2::framework::DriverControl&, o2::framework::DriverInfo&, o2::framework::DriverConfig&, std::vector<o2::framework::DeviceMetricsInfo, std::allocator<o2::framework::DeviceMetricsInfo> >&, std::vector<o2::framework::ConfigParamSpec, std::allocator<o2::framework::ConfigParamSpec> > const&, boost::program_options::variables_map&, std::vector<o2::framework::ServiceSpec, std::allocator<o2::framework::ServiceSpec> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: doMain(int, char**, std::vector<o2::framework::DataProcessorSpec, std::allocator<o2::framework::DataProcessorSpec> > const&, std::vector<o2::framework::ChannelConfigurationPolicy, std::allocator<o2::framework::ChannelConfigurationPolicy> > const&, std::vector<o2::framework::CompletionPolicy, std::allocator<o2::framework::CompletionPolicy> > const&, std::vector<o2::framework::DispatchPolicy, std::allocator<o2::framework::DispatchPolicy> > const&, std::vector<o2::framework::ResourcePolicy, std::allocator<o2::framework::ResourcePolicy> > const&, std::vector<o2::framework::CallbacksPolicy, std::allocator<o2::framework::CallbacksPolicy> > const&, std::vector<o2::framework::SendingPolicy, std::allocator<o2::framework::SendingPolicy> > const&, std::vector<o2::framework::ConfigParamSpec, std::allocator<o2::framework::ConfigParamSpec> > const&, std::vector<o2::framework::ConfigParamSpec, std::allocator<o2::framework::ConfigParamSpec> > const&, o2::framework::ConfigContext&)
[8369:EMCALRawToCellConverterSpec]: o2-emcal-reco-workflow() [0x407811]: std::vector<o2::framework::ChannelConfigurationPolicy, std::allocator<o2::framework::ChannelConfigurationPolicy> >::~vector() at stl_vector.h:735
[8369:EMCALRawToCellConverterSpec]: /sw/slc9_x86-64/O2/14910-slc9_x86-64-local1/lib/libO2Framework.so: callMain(int, char**, int (*)(int, char**))
[8369:EMCALRawToCellConverterSpec]: o2-emcal-reco-workflow() [0x404c59]: main at runDataProcessing.h:220
[8369:EMCALRawToCellConverterSpec]: /lib64/libc.so.6: ?? ??:0
[8369:EMCALRawToCellConverterSpec]: /lib64/libc.so.6: ?? ??:0
[8369:EMCALRawToCellConverterSpec]: o2-emcal-reco-workflow() [0x404cf5]: _start at ??:?
[8369:EMCALRawToCellConverterSpec]: Backtrace complete.
@shahor02 indeed. I am investigating.
I suspect it's an issue with the back pressure. I will try to replicate.