
regression works when running sequentially but fails when using multiple threads

Open vperrin59 opened this issue 4 years ago • 19 comments

I'm using VUnit with Verilog/SystemVerilog and Xcelium, and I have a strange issue. When my setup uses multiple threads with the option:

python run.py --num-threads 5

Some tests are randomly failing. After some investigation, it looks like the issue is coming from the vunit_results report. In the failing tests the file is empty, and that's why they are parsed as skipped by the script.

Has anybody faced a similar issue?
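For illustration, here is a minimal sketch (hypothetical, not VUnit's actual parser) of how an empty result file can fall through to "skipped": with no completion marker in the file, there is nothing to classify the test as passed or failed.

```python
# Hypothetical sketch (not VUnit's actual parser) of why an empty
# vunit_results file can look like a skipped test: without a
# "test_suite_done" marker, the status falls through to "skipped".
def parse_results(lines):
    if any("test_suite_done" in line for line in lines):
        return "passed"
    if any("test_suite_error" in line for line in lines):
        return "failed"
    return "skipped"  # an empty file ends up here

print(parse_results([]))                  # empty file -> skipped
print(parse_results(["test_suite_done"])) # normal completion -> passed
```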

vperrin59 avatar Jan 06 '22 13:01 vperrin59

It sounds like a buffer that is not flushed properly. When a VUnit simulation ends properly, the VHDL part of the tool writes test_suite_done to the result file and then closes it: https://github.com/VUnit/vunit/blob/f344c8b5642d7aa13db2e16f6fc7151585ca96d0/vunit/vhdl/core/src/core_pkg.vhd#L43-L49 One would assume that the implementation of file_close also flushes any buffers, but I don't think the VHDL standard is explicit about this. What happens if you add a new line before line 48:

flush(test_results);

Does the error go away? I don't have access to Xcelium, so I can't test it myself.

Could also be related to the operating system. Which one are you using?

LarsAsplund avatar Jan 06 '22 16:01 LarsAsplund

I forgot to mention that I added $fclose to vunit_pkg.sv (I'm using SystemVerilog), hoping it would magically fix the issue, but it did not.

  $fclose(trace_fd);
  exit_without_errors = 1;
  $finish(1);

I don't know of other ways to force flushing of the buffer in Verilog.

The operating system is CentOS.

vperrin59 avatar Jan 06 '22 16:01 vperrin59

To get closer to finding out whether flushing is indeed the issue, you might try running xrun with the -unbuffered option (probably easiest by exporting XRUNOPTS). This will be slower, so it's not recommended for production use.

cmarqu avatar Jan 06 '22 17:01 cmarqu

I added $fflush(trace_fd) before the $fclose and it seems to have made it work. I'm running more iterations to confirm that it fixes the issue.
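For reference, the end-of-simulation cleanup with the explicit flush added would look roughly like this (a sketch based on the snippet quoted above; trace_fd is the result-file handle from vunit_pkg.sv):

```systemverilog
// Sketch of the cleanup in vunit_pkg.sv with the explicit flush added.
// $fflush forces buffered writes out; $fclose alone did not suffice here.
$fflush(trace_fd);
$fclose(trace_fd);
exit_without_errors = 1;
$finish(1);
```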

vperrin59 avatar Jan 06 '22 17:01 vperrin59

> To get closer to finding out whether flushing is indeed the issue, you might try running xrun with the -unbuffered option (probably easiest by exporting XRUNOPTS). This will be slower, so it's not recommended for production use.

Thanks for the tip, useful to know

vperrin59 avatar Jan 06 '22 17:01 vperrin59

@vperrin59 I'm not really a SystemVerilog user, but I think the code we have looks a bit dangerous. There is no $fclose, so it looks like it relies on implicit $fclose rules in SystemVerilog. Are there such rules? Even if that is implemented, it doesn't seem to work, since your explicit $fclose doesn't fix the problem. I would expect $fclose to also do a flush, but if your fix works, that is apparently not the case.

LarsAsplund avatar Jan 06 '22 18:01 LarsAsplund

@LarsAsplund I agree with you. I would indeed have thought $fclose would fix the issue, even though it looks like the simulator automatically closes the file handle when finishing the sim. On my side, it seems only $fflush fixes the issue.

vperrin59 avatar Jan 06 '22 22:01 vperrin59

@vperrin59 I looked into the standard and I couldn't find any guidance. I think we should add both $fflush and $fclose. Want to make a PR?

LarsAsplund avatar Jan 07 '22 08:01 LarsAsplund

So I've run more iterations during the night, and the issue still appears (though less frequently, it seems). I have to debug it further.

vperrin59 avatar Jan 07 '22 10:01 vperrin59

Ok, I made some progress. It looks like the issue comes from the runner_cfg parameter getting assigned the wrong value: it gets assigned the value intended for another test. That is really strange, because the content of xrun_simulate.args looks correct.

vperrin59 avatar Jan 07 '22 14:01 vperrin59

Do you have a way to distinguish between randomly failing to write to the file and randomly failing to complete the simulation for other reasons before VUnit writes to the file? I've had experiences with other simulators failing to complete without any message whatsoever. You could try to write something to another file before completing. If flushing fails completely at random, you would not expect both writes to fail, at least not every time.
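That canary write could be sketched like this (hypothetical file name and placement, not part of VUnit):

```systemverilog
// Hypothetical debug aid: write a marker to a second file just before
// the normal result write. If the canary file has content but the
// result file is empty, the simulation completed and the failure is
// in the result write/flush itself, not in the simulation run.
integer canary_fd;
canary_fd = $fopen("canary.txt", "w");
$fdisplay(canary_fd, "about to write results at time %0t", $time);
$fflush(canary_fd);
$fclose(canary_fd);
```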

LarsAsplund avatar Jan 07 '22 14:01 LarsAsplund

I still haven't found the root cause of the issue. It happens randomly, and it's not always the same test that fails. Something I noticed, though:

For example if I have the following test start order:

test1, test2, test3, test4,

If the issue appears in test3, test3 will get the runner_cfg param value intended for test2.

vperrin59 avatar Jan 07 '22 16:01 vperrin59

It may look like an Xcelium simulator issue when using the gpg arg with multiple threads. I could not find any explanation. I found a workaround by assigning runner_cfg with the $value$plusargs function. Unfortunately I do not have another simulator to test with. It would be interesting to know whether someone with a similar setup (Xcelium, Verilog, multiple threads) is free of this issue.

vperrin59 avatar Jan 07 '22 17:01 vperrin59

@LarsAsplund I finally root-caused the issue (that was a tough debug). It looks like there is a fundamental issue with the VUnit Incisive/Xcelium flow when running multiple threads and overriding parameters. I could fix it locally. The issue is related to the cds.lib file: the parallel sims were pointing to the same work library, and that caused the issue.

vperrin59 avatar Jan 09 '22 12:01 vperrin59

What is your workaround? I'm working on another feature which will require parallel simulations to start from different directories. I have no experience with Xcelium and cds.lib, so it might not make any difference for your issue. @cmarqu, any thoughts on this?

LarsAsplund avatar Jan 09 '22 14:01 LarsAsplund

I'm actually surprised that this issue hasn't been reported before. Maybe not that many people use the same config as me. This is the patch I used to fix the issue:

diff --git a/vunit/sim_if/xcelium.py b/vunit/sim_if/xcelium.py
index 6196a5ee..c6ff6153 100644
--- a/vunit/sim_if/xcelium.py
+++ b/vunit/sim_if/xcelium.py
@@ -311,6 +311,12 @@ define work "{2}/libraries/work"
         else:
             steps = ["elaborate", "simulate"]
 
+        sim_cds_lib = os.path.join(script_path, "cds.lib")
+
+        cds = CDSFile.parse(self._cdslib)
+        # Change work path
+        cds["work"] = f"{script_path}/work"
+        cds.write(sim_cds_lib)
         for step in steps:
             cmd = str(Path(self._prefix) / "xrun")
             args = []
@@ -342,7 +348,7 @@ define work "{2}/libraries/work"
             #     '-xmlibdirname "%s"' % (str(Path(self._output_path) / "libraries"))
             # ]  # @TODO: ugly
             args += config.sim_options.get("xcelium.xrun_sim_flags", [])
-            args += ['-cdslib "%s"' % self._cdslib]
+            args += ['-cdslib "%s"' % sim_cds_lib]
             args += self._hdlvar_args()
             args += ['-log "%s"' % str(Path(script_path) / ("xrun_%s.log" % step))]
             if not self._log_level == "debug":

vperrin59 avatar Jan 09 '22 14:01 vperrin59

@vperrin59 What this is doing (unless I'm wrong) is compiling and elaborating the full snapshot with your design several times. While this works, it's more space- and time-efficient to do something like this in the elaborate step:

xrun -cdslib cds.lib mymodule.vhdl -elaborate -xmlibdirname ./snapshot

and this in the simulate step:

mkdir -p run1 && xrun -cdslib cds.lib -r mymodule -xmlibdirname ./snapshot -cds_implicit_tmpdir run1 -l run1.log ...
mkdir -p run2 && xrun -cdslib cds.lib -r mymodule -xmlibdirname ./snapshot -cds_implicit_tmpdir run2 -l run2.log ...
mkdir -p run3 && xrun -cdslib cds.lib -r mymodule -xmlibdirname ./snapshot -cds_implicit_tmpdir run3 -l run3.log ...

(Some of the options already exist in the vunit code.) There is a tutorial for this in the docs somewhere.

cmarqu avatar Jan 10 '22 17:01 cmarqu

@cmarqu I don't think you can use only one elab command. If you have one tb with a 2-bit param and 4 tests that set 4 different values for the param, then you need to call elaborate 4 times. Do you agree?

Moreover, with that being said, it would be more flexible to have runner_cfg defined not as a param but as a string variable assigned with $value$plusargs. That way we wouldn't need to call elaborate for each test.
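A sketch of that alternative (hypothetical; this is not how VUnit's shipped SystemVerilog testbenches receive runner_cfg today):

```systemverilog
// Hypothetical: receive runner_cfg at run time instead of at
// elaboration time, so a single elaborated snapshot can serve
// every test.
string runner_cfg;
initial begin
  if (!$value$plusargs("runner_cfg=%s", runner_cfg))
    $fatal(1, "runner_cfg plusarg not supplied");
end
// launched as, e.g.: xrun -r mymodule +runner_cfg="..."
```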

vperrin59 avatar Jan 11 '22 20:01 vperrin59

Indeed, the example I was taking this from is using plusargs to differentiate between parallel runs. But it might still work without plusargs - it is possible to stack elaborated snapshots (at least with what they call "Multi-snapshot incremental elaboration (MSIE)" and the related options), though I don't know offhand if this also works with -cds_implicit_tmpdir.

cmarqu avatar Jan 12 '22 07:01 cmarqu