west icon indicating copy to clipboard operation
west copied to clipboard

Setting `zephyr.base` variable breaks parallel builds

Open eugmes opened this issue 5 years ago • 12 comments

I'm attempting to run multiple west build commands in parallel (to save some time used by cmake). This breaks randomly with corrupted .west/config file. I tracked this down to this line:

https://github.com/zephyrproject-rtos/west/blob/618020d1979d07003d6df40ce59b413785ce3015/src/west/app/main.py#L692

Here multiple processes may attempt to write to the same file at the same time in my scenario. It would be good to get rid of this. Right now one have to run a zephyr-provided command with valid arguments before doing the build. west boards seems to work.

Note that I have ZEPHYR_BASE environment variable set during the build, but that does not prevent west from mangling the config file.

eugmes avatar Jun 19 '20 13:06 eugmes

Thanks for your report.

We'd like to solve these types of problems in general by divorcing zephyr base setting from west itself, now that @tejlmand implemented a Zephyr CMake package which makes the environment variable obsolete. But the transition is tricky and we haven't done it yet, and I'm sorry this is causing you trouble.

I'm a bit surprised that this is a continual problem. The code should set zephyr.base once in the config file and then not have to set it again, so as long as your .west/config has a [zephyr] base = ... everything should work in subsequent runs. Is that not what you're seeing?

Is this in a CI environment where the workspace is being set up fresh each time? If so, perhaps you can work around this issue by having the CI setup scripts run west config zephyr.base "$ZEPHYR_BASE" or so, since you say the environment variable is set?

mbolivar-nordic avatar Jun 19 '20 18:06 mbolivar-nordic

@mbolivar I believe @eugmes problem could be that two west processes writes the file simultaneously, and thus the file get corrupted. Which also explains why running west boards first, before running n-west build commands in parallel work, cause then the config is set, and will not be updated by the parallel runs afterwards.

(to save some time used by cmake).

What time you try to save ? west build invokes CMake, just as you would do by hand, and then ninja afterwards. It is simply a convenience instead of doing:

cmake -DBOARD=<board_name> -Bbuild .; ninja -Cbuild

tejlmand avatar Jun 22 '20 05:06 tejlmand

@tejlmand Yes, the problem only occurs when one runs multiple west builds in parallel, when they started almost at the same time. There is a chance of config file corruption then, but only when no other extension command was run before.

I have a test repo that builds multiple simple applications for multiple configurations. All together there are more than 100 binaries to build, and the number is going to increase. For those applications the configuration step using cmake takes almost the same time as building the applications. Almost nothing can be done to speed up building, it is already done in parallel. But the time used to run cmake can be reduced by running west processes in parallel. That's why I'm trying to use ninja for that. That not optimal, but it still results in significant speedup.

eugmes avatar Jun 22 '20 10:06 eugmes

But the time used to run cmake can be reduced by running west processes in parallel.

I'm just being curios, why not simply run n-CMake instances in parallel ? After all, west build invokes CMake, so you would save the overhead of west.

tejlmand avatar Jun 22 '20 12:06 tejlmand

@tejlmand mainly because west seems to be an official way to build zephyr apps.

eugmes avatar Jun 22 '20 14:06 eugmes

@mbolivar I believe @eugmes problem could be that two west processes writes the file simultaneously, and thus the file get corrupted.

I understand that, but west should only write the file if there's not a value already there, hence my follow up questions. If you could provide some answers for those, @eugmes, that would be helpful. Thanks.

mbolivar-nordic avatar Jun 22 '20 19:06 mbolivar-nordic

@mbolivar-nordic The problem is that the process of checking and setting the config variable is not atomic. So if multiple west build processes are started in quick succession then several of them could see that the value is not set. Additionally writing to the config file is also not atomic, so when multiple process try to write to this file, it may end up being corrupted.

Consider this sequence for example:

process 1: is the value set? - no
process 1: compute the default value
process 2: is the value set? - no
process 2: compute the default value
process 1: start writing the default value
process 2: start writing the default value
process 1: write complete
process 2: write complete

In this case the config file will contain both data written by process 1 and 2, in arbitrary order, depending on exact timing of events.

eugmes avatar Jun 22 '20 20:06 eugmes

I understand the race. Are you able to apply any of the suggested workarounds?

mbolivar-nordic avatar Jun 22 '20 20:06 mbolivar-nordic

@mbolivar-nordic yes, running west boards before doing anything else works fine.

eugmes avatar Jun 22 '20 20:06 eugmes

@tejlmand mainly because west seems to be an official way to build zephyr apps.

Thanks for the reason. It is correct that Zephyr docs are describing the build process using west and it is an easy way to get started with Zephyr.

But actually west is an optional (but very convenient tool), so there is no problem in using cmake directly, especially if it solves your problems.

That being said, west should still be fixed so that this issue can be fixed. (we would like to remove zephyr.base from Cmake, but currently we need west to be backward compatible.)

tejlmand avatar Jun 23 '20 08:06 tejlmand

That being said, west should still be fixed so that this issue can be fixed. (we would like to remove zephyr.base from Cmake, but currently we need west to be backward compatible.)

I agree with this.

@mbolivar-nordic yes, running west boards before doing anything else works fine.

Ok, glad to know you have a workaround, since the fix (getting rid of zephyr.base entirely) is something we don't have scheduled.

mbolivar-nordic avatar Jun 23 '20 23:06 mbolivar-nordic

@mbolivar-nordic with Zephyr 2.7.0 LTS out, it now the time to remove this in west ? Users on older Zephyr (for example 1.14 LTS ) can manually set zephyr.base if they need it.

I believe we don't need to set this value automatically anymore.

tejlmand avatar Mar 07 '22 15:03 tejlmand