
GPU status not reset after a taskpool finished on GPU

Open abouteiller opened this issue 5 years ago • 5 comments

Original report by Qinglei Cao (Bitbucket: Qinglei_Cao).


When a taskpool runs on a GPU, the GPU state (some GPU data_copy_t entries, and the load tracked per device) does not appear to be reset when the taskpool finishes. As a result, if another GPU taskpool is then run on the same descriptor, unexpected behavior occurs (perhaps stale GPU data_copy_t entries are reused, because they are never updated as long as the CPU version number does not change?). I encountered this while running Cholesky multiple times (all kernels on GPU) with multiple GPUs; it can be worked around by calling "parsec_devices_release_memory()" and "parsec_devices_reset_load(parsec)" (suggested by Thomas) between the taskpool_add calls.

abouteiller avatar Mar 19 '20 19:03 abouteiller

Original comment by George Bosilca (Bitbucket: bosilca, GitHub: bosilca).


This was designed on purpose. If we reset the GPU, then each taskpool would have to push all the data again, even data that has not been altered. So I think the problem is that you change the underlying matrix data outside of PaRSEC, without updating the data version. PaRSEC therefore assumes there is no need to push the data, because nothing new happened to it.

Can you give me a small example of the code you are running into problems with?

abouteiller avatar Mar 19 '20 20:03 abouteiller

Original comment by Qinglei Cao (Bitbucket: Qinglei_Cao, ).


I made a simple example. It cannot run the second time dplasma_dpotrf is called; it fails with "device_cuda_module.c:1223: parsec_gpu_data_stage_in: Assertion `(gpu_elem->version < in_elem->version) || (gpu_elem->data_transfer_status == ((parsec_data_coherency_t)0x0))' failed.". If I use "parsec_devices_release_memory()" and "parsec_devices_reset_load(parsec)" to reset the GPU state before dplasma_dpotrf, it works.

command: ./testing_issue242 -N 25 -t 5 -g 1 -r 2

#define _GNU_SOURCE  /* for asprintf() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>  /* for getopt() */

#include "parsec/data_dist/matrix/two_dim_rectangle_cyclic.h"
#include "parsec/data_dist/matrix/sym_two_dim_rectangle_cyclic.h"
#include "parsec/utils/mca_param.h"
#include "parsec/private_mempool.h"
#include "parsec/runtime.h"
#include "parsec/data_internal.h"
#include "parsec/execution_stream.h"
#include "parsec/data_dist/matrix/matrix.h"

#include "parsec/mca/device/cuda/device_cuda.h"
#include "parsec/mca/device/cuda/device_cuda_internal.h"
#include "parsec/utils/zone_malloc.h"

#if defined(PARSEC_HAVE_MPI)
#include <mpi.h>
#endif

/* dplasma */
#include "dplasma.h"

static int my_matrix_init_ops(parsec_execution_stream_t *es,
                        const parsec_tiled_matrix_dc_t *descA,
                        void *_A, enum matrix_uplo uplo,
                        int m, int n, void *args) {
    double *A = (double *)_A;
    (void)es; (void)uplo;
    if( m == n ) {
        /* Diagonal tile: args[0] on the diagonal, args[1] elsewhere */
        for( int j = 0; j < descA->nb; j++ ) {
            for( int i = 0; i < descA->mb; i++ ) {
                if( i == j )
                    A[j*descA->mb+i] = ((double *)args)[0];
                else
                    A[j*descA->mb+i] = ((double *)args)[1];
            }
        }
    } else {
        /* Off-diagonal tile: fill with args[1] */
        for( int j = 0; j < descA->nb; j++ ) {
            for( int i = 0; i < descA->mb; i++ ) {
                A[j*descA->mb+i] = ((double *)args)[1];
            }
        }
    }
    return 0;
}

int main(int argc, char *argv[])
{
    parsec_context_t* parsec;
    int rank, nodes, ch, info;
    int pargc = 0;
    char **pargv;
    int i;

    /* Default */
    int N = 8;
    int NB = 4;
    int P = 1;
    int cores = -1;
    int nb_runs = 2;
    int nb_gpus = 1;

    while ((ch = getopt(argc, argv, "N:t:P:c:h:r:g:")) != -1) {
        switch (ch) {
            case 'N': N = atoi(optarg); break;
            case 't': NB = atoi(optarg); break;
            case 'P': P = atoi(optarg); break;
            case 'r': nb_runs = atoi(optarg); break;
            case 'g': nb_gpus = atoi(optarg); break;
            case 'c': cores = atoi(optarg); break;
            case '?': case 'h': default:
                fprintf(stderr,
                        "-N : column dimension (N) of the matrices (default: 8)\n"
                        "-t : row dimension (NB) of the tiles (default: 4)\n"
                        "-P : rows (P) in the PxQ process grid (default: 1)\n"
                        "-r : number of runs (default: 2)\n"
                        "-g : number of gpus (default: 1)\n"
                        "-c : number of cores used (default: -1)\n"
                        "\n");
                 exit(1);
        }
    }

#if defined(PARSEC_HAVE_MPI)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    }
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#else
    nodes = 1;
    rank = 0;
#endif

    pargc = 0; pargv = NULL;
    for(i = 1; i < argc; i++) {
        if( strcmp(argv[i], "--") == 0 ) {
            pargc = argc - i;
            pargv = &argv[i];
            break;
        }
    }

    extern char **environ;
    char *value;
    if(0 == rank ) printf("nb_gpus: %d\n", nb_gpus);
    if( nb_gpus < 1 && 0 == rank ) {
            fprintf(stderr, "Warning: to run on GPUs, set -g to a value greater than 0\n");
    }
    asprintf(&value, "%d", nb_gpus);
    parsec_setenv_mca_param( "device_cuda_enabled", value, &environ );
    free(value);

    /* Initialize PaRSEC */
    parsec = parsec_init(cores, &pargc, &pargv);

    if( NULL == parsec ) {
        /* Failed to correctly initialize. In a correct scenario report
         * upstream, but in this particular case bail out.
         */
        exit(-1);
    }

    /* If the number of cores has not been defined as a parameter earlier
     * update it with the default parameter computed in parsec_init. */
    if(cores <= 0)
    {
        int p, nb_total_comp_threads = 0;
        for(p = 0; p < parsec->nb_vp; p++) {
            nb_total_comp_threads += parsec->virtual_processes[p]->nb_cores;
        }
        cores = nb_total_comp_threads;
    }

    /* initializing matrix structure */
    enum matrix_uplo uplo = matrix_Lower;
    sym_two_dim_block_cyclic_t dcA;
    sym_two_dim_block_cyclic_init(&dcA, matrix_RealDouble,
                                nodes, rank, NB, NB, N, N, 0, 0,
                                N, N, P, uplo);
    dcA.mat = parsec_data_allocate((size_t)dcA.super.nb_local_tiles *
                                   (size_t)dcA.super.bsiz *
                                   (size_t)parsec_datadist_getsizeoftype(dcA.super.mtype));
    parsec_data_collection_set_key((parsec_data_collection_t*)&dcA, "dcA");

    for( i = 0; i < nb_runs; i ++ ) {
	    /* Init dcA: diagonal op_args[0], off-diagonal op_args[1] */
	    double *op_args = (double *)malloc(2 * sizeof(double));
	    if( i % 2 ) {
		    op_args[0] = (double)N * N;
		    op_args[1] = (double)1.0;
	    } else {
		    op_args[0] = (double)N * N * N; 
		    op_args[1] = (double)N;
	    }

	    parsec_apply( parsec, uplo,
			    (parsec_tiled_matrix_dc_t *)&dcA,
			    (tiled_matrix_unary_op_t)my_matrix_init_ops, op_args);

	    int NT = N / NB * 2; 

	    /* Reset cache on GPU before Cholesky */
	    //parsec_devices_release_memory();
	    //parsec_devices_reset_load(parsec);

	    info = dplasma_dpotrf( parsec, uplo, (parsec_tiled_matrix_dc_t *)&dcA );
	    if( 0 != info )
		    fprintf(stderr, "dplasma_dpotrf returned info = %d\n", info);

	    printf("Print matrix of %d :\n", i);
	    dplasma_dprint(parsec, uplo, (parsec_tiled_matrix_dc_t *)&dcA);
	    printf("\n\n");
    }

    parsec_data_free(dcA.mat);
    parsec_tiled_matrix_dc_destroy((parsec_tiled_matrix_dc_t*)&dcA);

    /* Clean up parsec*/
    parsec_fini(&parsec);

#ifdef PARSEC_HAVE_MPI
    MPI_Finalize();
#endif

    return 0;
}

abouteiller avatar Mar 20 '20 15:03 abouteiller

Original comment by Nuria Losada (Bitbucket: nuriallv, GitHub: nuriallv).


I believe this issue is related to the coherence state when a copy is marked as shared between the CPU and the GPU. If the CPU modifies it, the GPU copy is not set as invalid, so we end up with a copy that is PARSEC_DATA_STATUS_COMPLETE_TRANSFER and PARSEC_DATA_COHERENCY_SHARED, but with a version number lower than the original's, triggering the assert at device_cuda_module.c:1227:

assert((gpu_elem->version < in_elem->version) || (gpu_elem->data_transfer_status == PARSEC_DATA_STATUS_NOT_TRANSFER));

Another reproducer where this happens is the check phase of tests/testing_dpotrf. It occurs during the check, in the call to dplasma_zpotrs, when the two consecutive dplasma_ztrsm calls are run:
gdb --args ./testing_dpotrf -c 1 -P 1 -Q 1 -M 2048 -N 4096 -NB 512 -x -- --mca device_cuda_enabled 1

abouteiller avatar Apr 01 '20 20:04 abouteiller

This should be restudied after 3935f31 as it may now behave correctly

abouteiller avatar Apr 24 '24 21:04 abouteiller

I will take a look this weekend. Thanks.

QingleiCao avatar Apr 26 '24 12:04 QingleiCao

I will take a look this weekend. Thanks.

We never documented what was the outcome of this study @QingleiCao

abouteiller avatar Mar 07 '25 18:03 abouteiller

I will take a look this weekend. Thanks.

We never documented what was the outcome of this study @QingleiCao

It seems it's not there.

QingleiCao avatar Mar 13 '25 15:03 QingleiCao