
DQM bin-by-bin comparison missing after moving to ROOT 6.30

Open smuzaffar opened this issue 2 years ago • 46 comments

It looks like the DQM bin-by-bin comparison plots have not been available since the move to ROOT 6.30 on 13 Dec. ROOT 6.30 was first integrated in the CMSSW_14_0_X_2023-12-13-1100 IB, and all PR tests using this or a later IB have empty DQM plots, e.g. see:

  • https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-12066d/36477/summary.html
  • https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-79a965/36539/summary.html

The log files for the DQM comparison do not show any error (https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_14_0_X_2023-12-17-0000+79a965/60327/dqmBinByBinLog.log).

@cms-sw/dqm-l2, could this be an issue with the DQM GUI not being able to process ROOT 6.30 based plots?

smuzaffar avatar Dec 18 '23 07:12 smuzaffar

type root

smuzaffar avatar Dec 18 '23 07:12 smuzaffar

assign dqm

smuzaffar avatar Dec 18 '23 07:12 smuzaffar

New categories assigned: dqm

@rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Dec 18 '23 07:12 cmsbuild

cms-bot internal usage

cmsbuild avatar Dec 18 '23 07:12 cmsbuild

A new Issue was created by @smuzaffar Malik Shahzad Muzaffar.

@Dr15Jones, @smuzaffar, @makortel, @rappoccio, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Dec 18 '23 07:12 cmsbuild

Hello, this was already reported last week by another group as well (@cms-sw/pdmv-l2), and your suspicion seems to be correct. It's due to the GUI using ROOT 6.14, which produces error messages when opening ROOT files created by 6.30:

Error in <TList::Clear>: A list is accessing an object (0x7f53bc738600) already deleted (list name = TList)

Since there are no plans to update the ROOT package supported by the comp team (here) to newer versions, we will proceed to ignore those messages for now, until we overhaul our DQMGUI deployment procedure.

We are in the process of deploying a fix any day now.

I will post here if we discover anything else.

(on behalf of DQM-DC)
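
To illustrate what ignoring those messages could look like, here is a minimal Python sketch. It is not the actual DQM GUI code. It assumes the diagnostics arrive as text lines in the `Error in <Class::Method>: message` shape quoted above; the set of level names, the `KNOWN_BENIGN` allow-list and the `classify` function are assumptions made for the example.

```python
import re

# Assumed shape of a ROOT diagnostic line: "<Level> in <Class::Method>: message".
ROOT_DIAG = re.compile(r"^(Info|Warning|Error|SysError|Fatal) in <([^>]+)>: (.*)$")

# Hypothetical allow-list of (location, message prefix) pairs to tolerate.
KNOWN_BENIGN = [("TList::Clear", "A list is accessing an object")]


def classify(line):
    """Return 'ok', 'ignored' or 'error' for one line of reader output."""
    match = ROOT_DIAG.match(line.strip())
    if match is None or match.group(1) in ("Info", "Warning"):
        # Not a diagnostic line, or only informational.
        return "ok"
    location, message = match.group(2), match.group(3)
    for known_location, prefix in KNOWN_BENIGN:
        if location == known_location and message.startswith(prefix):
            return "ignored"
    return "error"
```

Keeping the allow-list narrow (location plus message prefix) means any other ROOT error would still be reported rather than silently dropped.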

nothingface0 avatar Dec 18 '23 08:12 nothingface0

Do we understand the origin of this error?

rovere avatar Dec 18 '23 11:12 rovere

Do we understand the origin of this error?

From my side, no. Any input is welcome.

nothingface0 avatar Dec 18 '23 11:12 nothingface0

Let's tag @pcanal here too.

makortel avatar Dec 18 '23 14:12 makortel

Let's tag @pcanal here too.

No need, I already opened a thread on the ROOT forum :)

nothingface0 avatar Dec 18 '23 14:12 nothingface0

@cms-sw/pdmv-l2 Was this error not seen in the 14_0_0_pre0 + ROOT 6.30 RelVal production? Or are those RelVals not impacted?

makortel avatar Dec 18 '23 14:12 makortel

@cms-sw/pdmv-l2 Was this error not seen in the 14_0_0_pre0 + ROOT 6.30 RelVal production? Or are those RelVals not impacted?

Reading again https://mattermost.web.cern.ch/cms-o-and-c/pl/ig7t5innq7b65mr9jns1y95gpo I see those RelVals were impacted.

@smuzaffar I guess we'd need to revert 6.30 for 14_0_0_pre2?

makortel avatar Dec 18 '23 14:12 makortel

@makortel the production itself went smoothly. All the data-tiers have been created and are regularly available on DAS.

Then when we tried to build the RelMon we noticed that the DQM were not actually uploaded to the DQM GUI and this is what started the investigations mentioned by @nothingface0 here.

AdrianoDee avatar Dec 18 '23 14:12 AdrianoDee

Then when we tried to build the RelMon we noticed that the DQM were not actually uploaded to the DQM GUI and this is what started the investigations mentioned by @nothingface0 here.

Thanks. But the DQM GUI is a critical part of the pipeline towards validators, right?

While the problem, in principle, can be addressed from DQM GUI side, I feel the CMSSW-side action should be on the table as well (even if I want ROOT 6.30 in production for 14_0_0).

makortel avatar Dec 18 '23 15:12 makortel

A note, if it helps: when trying to run the HARVESTING step with 14_0_0_pre0 on top of a 14_0_0_pre0_ROOT630 file, I get a Fatal ROOT Error with the same message as the one reported for the visDQMReceiveDaemon.

----- Begin Fatal Exception 18-Dec-2023 16:06:23 CET-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Calling InputSource::readFile_
   [1] Opening DQM Root file
Exception Message:

Input file file:step3_inDQM.root was not found, could not be opened, or is corrupted.
   Additional Info:
      [a] Fatal Root Error: @SUB=TList::Clear
A list is accessing an object (0x7f0738d930c0) already deleted (list name = TList)

----- End Fatal Exception -------------------------------------------------

In case somebody wants to reproduce it:

cmsDriver.py step4 --conditions auto:phase1_2023_realistic --era Run3_2023 --filein /store/relval/CMSSW_14_0_0_pre0_ROOT630/RelValBuMixing_14/DQMIO/PU_133X_mcRun3_2023_realistic_v2_el8_amd64_gcc12-v1/2590000/5892F3A7-1CC2-4989-B5A5-C72D9B038ECE.root --fileout file:step4.root --filetype DQM --geometry DB:Extended --mc --number 10 --python_filename step_4_cfg.py --scenario pp --step HARVESTING:@standardValidation+@standardDQM+@ExtraHLT+@miniAODValidation+@miniAODDQM+@nanoAODDQM

AdrianoDee avatar Dec 18 '23 15:12 AdrianoDee

Thanks. But the DQM GUI is a critical part of the pipeline towards validators, right?

Yes, indeed, it's critical and I agree with the fact that an action on CMSSW side could be on the table. Then, I think, having a solution (preferably not a workaround) on the DQM GUI side could allow us to have, in parallel, the physics validation in place with the samples already produced.

AdrianoDee avatar Dec 18 '23 15:12 AdrianoDee

While the problem, in principle, can be addressed from DQM GUI side, I feel the CMSSW-side action should be on the table as well (even if I want ROOT 6.30 in production for 14_0_0).

Then, I think, having a solution (preferably not a workaround) on the DQM GUI side could allow us to have, in parallel, the physics validation in place with the samples already produced.

Just to be clear (especially because I won't be able to attend ORP tomorrow), I'm specifically thinking about what we should do for CMSSW_14_0_0_pre2.

makortel avatar Dec 18 '23 15:12 makortel

@cms-sw/orp-l2 @makortel , I am fine with reverting to ROOT 6.26 for 14.0.0.pre2. If we decide to go this path then we should do it at least 1 day (i.e. today) before we build 14.0.0.pre2 .

By the way, we also saw this "Error in <TList::Clear>: A list is accessing an object (0x7fffc1cafe80) already deleted (list name = TList)" error during the PowerPC validation a couple of years ago, but that might have been due to an exceeded disk quota.

smuzaffar avatar Dec 18 '23 16:12 smuzaffar

From the ROOT forum, it seems like the fix on the file reading ROOT side might be straightforward https://root-forum.cern.ch/t/error-in-tlist-clear-a-list-is-accessing-an-object-already-deleted-list-name-tlist-when-opening-a-file-created-by-root-6-30-using-root-6-14-09/57588/5

makortel avatar Dec 18 '23 16:12 makortel

From the ROOT forum, it seems like the fix on the file reading ROOT side might be straightforward root-forum.cern.ch/t/error-in-tlist-clear-a-list-is-accessing-an-object-already-deleted-list-name-tlist-when-opening-a-file-created-by-root-6-30-using-root-6-14-09/57588/5

Does that "simply" mean applying the patch and recompiling?

nothingface0 avatar Dec 18 '23 17:12 nothingface0

It should. I am checking.

pcanal avatar Dec 18 '23 17:12 pcanal

The patch applies cleanly, works and has been pushed to the v6-14-00-patches branch.

pcanal avatar Dec 18 '23 17:12 pcanal

The patch applies cleanly, works and has been pushed to the v6-14-00-patches branch.

Much appreciated!

nothingface0 avatar Dec 18 '23 17:12 nothingface0

@nothingface0 , I can open a PR for cmsdist comp_630 branch to include this fix

smuzaffar avatar Dec 18 '23 17:12 smuzaffar

@nothingface0 , I can open a PR for cmsdist comp_630 branch to include this fix

@smuzaffar Already on it!

nothingface0 avatar Dec 18 '23 17:12 nothingface0

Note that https://github.com/cms-sw/cmsdist/blob/comp_gcc630/root.spec is not using the tip of the ROOT 6.14 branch. So I would suggest applying only https://github.com/root-project/root/commit/65ed49a726bd293edd259f2ceccbd7dc8756808f.patch on top of what is already used by comp_gcc630.

smuzaffar avatar Dec 18 '23 17:12 smuzaffar

FYI, a new DQM GUI with patched ROOT 6.14 has been deployed and the DQM bin-by-bin comparisons look good, e.g. see the results of PR cms-sw/cmssw#43596 and its DQM bin-by-bin comparison here

smuzaffar avatar Dec 19 '23 14:12 smuzaffar

Thanks to everyone involved for your inputs!

nothingface0 avatar Dec 19 '23 14:12 nothingface0

Great!

The next question is then: how could we discover this kind of problem earlier? (E.g. when testing a new ROOT version for the first time.)

makortel avatar Dec 19 '23 14:12 makortel

@makortel , I guess we (@cms-sw/externals-l2 ) need to keep an eye on PR tests for special ROOTX updates.
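
One way such a check might be automated is sketched below in Python. It runs an arbitrary reader command (for example, the older ROOT opening a file written by the new one) and reports any ROOT error lines in its output. This matters because the comparison logs in this issue showed no error while the reader itself did. The function names `reader_errors` and `reader_is_clean` are hypothetical, and the assumption is that errors appear as `Error in <Class::Method>: ...` lines, as in the message reported above. Nothing here is part of the existing PR test infrastructure.

```python
import re
import subprocess
import sys

# Assumed shape of ROOT error lines, based on the message seen in this issue.
ROOT_ERROR = re.compile(r"^(?:Error|SysError|Fatal) in <[^>]+>:")


def reader_errors(command):
    """Run a reader command (list of arguments) and return ROOT error lines."""
    result = subprocess.run(command, capture_output=True, text=True)
    # ROOT diagnostics usually go to stderr, but scan both streams to be safe.
    output = result.stdout + "\n" + result.stderr
    return [line for line in output.splitlines() if ROOT_ERROR.match(line)]


def reader_is_clean(command):
    """Print any ROOT error lines to stderr and return True if there were none."""
    errors = reader_errors(command)
    for line in errors:
        print(line, file=sys.stderr)
    return not errors
```

The command is passed as a list so that no shell quoting is involved. The check looks only at the output text, since a reader can print an error like the one above without that error appearing in a downstream log.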

smuzaffar avatar Dec 19 '23 14:12 smuzaffar