postgresml icon indicating copy to clipboard operation
postgresml copied to clipboard

Cannot use `pgml.activate_venv()` to set environment for parallel workers.

Open higuoxing opened this issue 2 years ago • 10 comments

Currently, pgml provides a UDF called pgml.activate_venv()^1. However, when a query requires parallel workers, the venv environment cannot be set for parallel workers. This is not very easy to reproduce but parallel queries are not rare in PostgreSQL. Probably we can remove this UDF since we've already had the GUC parameter 'pgml.venv'^2 to control the venv path.

Steps to reproduce:

  1. The package xgboost exists in my venv environment.

  2. Remove the IMMUTABLE qualifier from pgml.validate_python_dependencies, so that this UDF can be execute on parallel workers multiple times.

    diff --git a/pgml-extension/src/api.rs b/pgml-extension/src/api.rs
    index ad952e48..440df23d 100644
    --- a/pgml-extension/src/api.rs
    +++ b/pgml-extension/src/api.rs
    @@ -27,7 +27,7 @@ pub fn activate_venv(venv: &str) -> bool {
     }
    
     #[cfg(feature = "python")]
    -#[pg_extern(immutable, parallel_safe)]
    +#[pg_extern(parallel_safe)]
     pub fn validate_python_dependencies() -> bool {
         unwrap_or_error!(crate::bindings::python::validate_dependencies())
     }
    
  3. Construct a query that involves parallel workers.

    CREATE TABLE t1(i int);
    INSERT INTO t1 VALUES(generate_series(1,500000));
    INSERT INTO t1 VALUES(generate_series(1,500000));
    INSERT INTO t1 VALUES(generate_series(1,500000));
    INSERT INTO t1 VALUES(generate_series(1,500000));
    
    pgml=# select pgml.activate_venv('/tmp/virtualenv');
    activate_venv
    ---------------
     t
    (1 row)
    
    pgml=# explain (analyze) select count(pgml.validate_python_dependencies()) 
    from t1;
    INFO:  Python version: 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801]
    INFO:  Scikit-learn 1.3.0, XGBoost 2.0.1, LightGBM 4.1.0, NumPy 1.26.1
    INFO:  Python version: 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801]
    ERROR:  The xgboost package is missing. Install it with `sudo pip3 install xgboost`
    ModuleNotFoundError: No module named 'xgboost'
    CONTEXT:  parallel worker
    

higuoxing avatar Nov 07 '23 06:11 higuoxing

Is this related to #1146? Can we kill 2 birds with one stone?

montanalow avatar Nov 08 '23 17:11 montanalow

Is this related to #1146? Can we kill 2 birds with one stone?

Probably no? #1146 is adding support for storing cache in a custom location. This issue states that the pgml.activate_venv() isn't implemented correctly that when there're queries involve parallel workers the UDF pgml.activate_venv() cannot setup venv for parallel workers.

One possible solution is removing the pgml.activate_venv() UDF and tell users to use the pgml.venv GUC parameter to set up the virtual environment.

cc @liuxueyang

higuoxing avatar Nov 08 '23 23:11 higuoxing

I think it would be great if we can also remove pgml.activate_venv() in favor of setting pgml.venv at boot, since that covers our known use cases for venvs, without any runtime performance impact or complexity.

montanalow avatar Nov 09 '23 18:11 montanalow

@montanalow Do we need to refactor the pgml.venv in #1146 or create another separate PR?

liuxueyang avatar Nov 10 '23 08:11 liuxueyang

I think #1146 will remove the need to call pgml.activate_venv() all together. We'll leave that function as is for now, since dropping it would be a breaking API change, we'll need to wait for a major version bump.

montanalow avatar Nov 15 '23 01:11 montanalow

pgml.activate_venv

Sorry, I didn't get it. activate_venv is for activating the Python's virtualenv. #1146 is for caching models from huggingface, right? Why #1146 removes the need to call pgml.activate_venv()?

higuoxing avatar Nov 15 '23 01:11 higuoxing

@higuoxing I guess what he said is that we can keep pgml.activate_venv() as is for now. There is no need to remove the UDF now since it would be a breaking change.

liuxueyang avatar Nov 15 '23 01:11 liuxueyang

Right, sorry, I confused comments between #1147 & #1146. #1146 establishes a better pattern for gathering env vars in _PG_init, rather than calling a function at runtime. To fix this issue #1147, I think we should adopt a similar approach that will set the Python venv at server start only, but we don't need to hold up #1146 for this additional fix. cc @levkk

montanalow avatar Nov 15 '23 01:11 montanalow

I believe this was implemented in: https://github.com/postgresml/postgresml/commit/5fdbd5dc15230beb68d7bb04f9c448c6915afb3d

@montanalow can you confirm before I close.

We still have the UDF, but we do initialize the venv at boot: https://github.com/postgresml/postgresml/blob/61cc2858a750949fe2d7208b9ebe4d77f6123161/pgml-extension/src/lib.rs#L28

I'm not sure if removing the UDF is something we want to do now. I don't know what the use case for leaving it is: https://github.com/postgresml/postgresml/blob/61cc2858a750949fe2d7208b9ebe4d77f6123161/pgml-extension/src/api.rs#L25

SilasMarvin avatar Jan 15 '25 20:01 SilasMarvin

I believe this was implemented in: 5fdbd5d

Yes. I think it's fixed.

@montanalow can you confirm before I close.

We still have the UDF, but we do initialize the venv at boot:

https://github.com/postgresml/postgresml/blob/61cc2858a750949fe2d7208b9ebe4d77f6123161/pgml-extension/src/lib.rs#L28

I'm not sure if removing the UDF is something we want to do now. I don't know what the use case for leaving it is:

https://github.com/postgresml/postgresml/blob/61cc2858a750949fe2d7208b9ebe4d77f6123161/pgml-extension/src/api.rs#L25

I think we can emit an error/warning message to stop user using this UDF and finally we can remove it in future.

We can have something like this

warning!("This UDF will be deprecated in future release.");
warning!("Please set the virtual environment by configuring the 'pgml.venv' parameter.");

higuoxing avatar Jan 16 '25 03:01 higuoxing