OpenBLAS icon indicating copy to clipboard operation
OpenBLAS copied to clipboard

"BLAS: Bad memory unallocation!" errors on Power

Open cparrott73 opened this issue 5 years ago • 6 comments

We recently encountered some failures in some of our tests on Power, which a colleague of mine traced to an apparent issue in OpenBLAS. For reference, consider the attached test program, test.f. (Remove the .txt extension below - github made me add that:)

test.f.txt

The symptom when you run this program on Power with recent OpenBLAS versions up to and including OpenBLAS 0.3.10 is that OpenBLAS will print the following messages while the program runs:

BLAS : Bad memory unallocation! : 1024  0x7ffe39b40000
BLAS : Bad memory unallocation! : 1024  0x7ffdadb40000
BLAS : Bad memory unallocation! : 1024  0x7ffe1db40000
BLAS : Bad memory unallocation! : 1024  0x7ffd0db40000

We have observed this behavior when compiling OpenBLAS with both gfortran/gcc, as well as the hybrid nvc/nvfortran/gcc build referenced in my comment in issue #2718. Something like the following should be sufficient to expose the bug:

make CC=gcc FC=gfortran USE_OPENMP=1 DYNAMIC_ARCH=1 NUM_THREADS=512 all

My colleague did some analysis, and here is what he found:

On POWER8 and POWER9 systems, there is a race condition in the memory management logic in routines blas_memory_alloc() and blas_memory_free() (driver/others/memory.c) when the library is compiled for use with OpenMP.

At runtime we are seeing the following errors from blas_memory_free():

BLAS : Bad memory unallocation! : 1024 0x2001b9440000

The message comes from line 2906, if "free_area" is not found in table "memory".

[cparrott73 note - I'm not sure if these line numbers are from the 0.3.10 or the 0.3.7 version, but this should get you in the general ballpark of the problem.]

2871
2872   int position;
2873
2874 #ifdef DEBUG
2875   printf("Unmapped Start : %p ...\n", free_area);
2876 #endif
2877
2878   position = 0;
2879 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2880   LOCK_COMMAND(&alloc_lock);
2881 #endif
2882   while ((position < NUM_BUFFERS) && (memory[position].addr != free_area))
2883     position++;
2884
2885   if (memory[position].addr != free_area) goto error;
2886
2887 #ifdef DEBUG
2888   printf("  Position : %d\n", position);
2889 #endif
2890
2891   // arm: ensure all writes are finished before other thread takes this memory
2892   WMB;
2893
2894   memory[position].used = 0;
2895 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2896   UNLOCK_COMMAND(&alloc_lock);
2897 #endif
2898
2899 #ifdef DEBUG
2900   printf("Unmap Succeeded.\n\n");
2901 #endif
2902
2903   return;
2904
2905  error:
2906   printf("BLAS : Bad memory unallocation! : %4d  %p\n", position,  free_area);
2907
2908 #ifdef DEBUG
2909   for (position = 0; position < NUM_BUFFERS; position++)
2910     printf("%4ld  %p : %d\n", position, memory[position].addr, memory[position].used);
2911 #endif
2912 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2913   UNLOCK_COMMAND(&alloc_lock);
2914 #endif
2915   return;
2916 }

After much tracing, the problem with the free turns out to be an issue with allocation (blas_memory_alloc()). There are various paths in that routine depending upon which environment the build targets. For our OpenMP build, the region of code that is of interest is:

2739
2740 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2741   LOCK_COMMAND(&alloc_lock);
2742 #endif
2743   do {
2744       RMB;
2745 #if defined(USE_OPENMP)  
2746     if (!memory[position].used) {
2747       blas_lock(&memory[position].lock);
2748 #endif
2749       if (!memory[position].used) goto allocation;
2750
2751 #if defined(USE_OPENMP)
2752       blas_unlock(&memory[position].lock);
2753     }
2754 #endif
2755     position ++;
2756
2757   } while (position < NUM_BUFFERS);
2758 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2759   UNLOCK_COMMAND(&alloc_lock);
2760 #endif
2761   goto error;
2762
2763   allocation :
2764
2765 #ifdef DEBUG
2766   printf("  Position -> %d\n", position);
2767 #endif
2768
2769   memory[position].used = 1;
2770 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2771   UNLOCK_COMMAND(&alloc_lock);
2772 #else
2773   blas_unlock(&memory[position].lock);
2774 #endif

The code is looping through the table "memory" looking to see if an entry is "used" (line 2746). If the entry is free, a lock is acquired (line 2747), and then the "used" flag is again checked to see whether the entry is still free. If the entry is still free, control transfers to "allocation:", the entry is marked as being "used" (line 2769), and the lock is subsequently released (line 2773).

This is a good implementation which minimizes what has to be serialized the number of locks that have to be acquired/released, thus minimizing the amount of logic that has to be serialized.

The problem is with the implementation of the "blas_lock" routine which is defined in header file common_power.h (top level directory). The lock is coded with inline assembly as:

102   __asm__ __volatile__ (
103            "0:  lwarx %0, 0, %1\n"
104            "    cmpwi %0, 0\n"
105            "    bne- 1f\n"
106            "    stwcx. %2,0, %1\n"
107            "    bne- 0b\n"
108            "1:    "
109         : "=&r"(ret)
110         : "r"(address), "r" (val)
111         : "cr0", "memory");
112 #else  

What is wrong, is that there is a missing "isync" import memory barrier after after acquiring the lock (BTW, this locking does not guard against a lock being acquired multiple times - not sure if that's intended or not).

This is what the assembly for blas_lock should be:

102   __asm__ __volatile__ (
103        "0:  lwarx %0, 0, %1\n"
104        "    cmpwi %0, 0\n"
105        "    bne- 1f\n"
106        "    stwcx. %2,0, %1\n"
107        "    bne- 0b\n"
108        "    isync\n"
109        "1:    "
110     : "=&r"(ret)
111     : "r"(address), "r" (val)
112     : "cr0", "memory");

(Note the isync instruction added at line 108 in this version.)

Without the import barrier, the read of "used" (line 2749) is being speculatively executed, prior to knowing whether the lock has been acquired or not. Here's the supporting documentation from IBM's "Power ISA Version 2.07" manual:

B.2.1.1 Acquire Lock and Import Shared Storage If lwarx and stwcx. instructions are used to obtain the lock, an import barrier can be constructed by placing an isync instruction immediately following the loop containing the lwarx and stwcx.. The following example uses the “Compare and Swap” primitive to acquire the lock. In this example it is assumed that the address of the lock is in GPR 3, the value indicating that the lock is free is in GPR 4, the value to which the lock should be set is in GPR 5, the old value of the lock is returned in GPR 6, and the address of the shared data structure is in GPR 9. loop: lwarx r6,0,r3,1 #load lock and reserve cmpw r4,r6 #skip ahead if bne- wait # lock not free stwcx. r5,0,r3 #try to set lock bne- loop #loop if lost reservation isync #import barrier lwz r7,data1(r9)#load shared data . . wait... #wait for lock to free The hint provided with lwarx indicates that after the program acquires the lock variable (i.e. stwcx. is successful), it will release it (i.e. store to it) prior to another program attempting to modify it. The second bne- does not complete until CR0 has been set by the stwcx.. The stwcx. does not set CR0 until it has completed (successfully or unsuccessfully). The lock is acquired when the stwcx. completes successfully. Together, the second bne- and the subsequent isync create an import barrier that prevents the load from “data1” from being performed until the branch has been resolved not to be taken.

I implemented an alternative version of the alloc and free logic for OpenMP builds using the GNU legacy atomic operation __sync_val_compare_and_swap, which now only references the "used" field of the structure. This implementation makes the "used" field both a lock and a flag (original code disabled using "//"):

2739
2740 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2741   LOCK_COMMAND(&alloc_lock);
2742 #endif
2743   do {
2744       RMB;
2745 #if defined(USE_OPENMP)  
2746 //    if (!memory[position].used) {
2747 //      blas_lock(&memory[position].lock);
2748 #endif
2749 //      if (!memory[position].used) goto allocation;
2750       if (memory[position].used == 0 && __sync_val_compare_and_swap(&memory[position].used, 0, 1) == 0) goto allocation;
2751
2752 #if defined(USE_OPENMP)
2753 //      blas_unlock(&memory[position].lock);      
2754 //    }
2755 #endif
2756     position ++;
2757
2758   } while (position < NUM_BUFFERS);
2759 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2760   UNLOCK_COMMAND(&alloc_lock);
2761 #endif
2762   goto error;
2763
2764   allocation :
2765
2766 #ifdef DEBUG
2767   printf("  Position -> %d\n", position);
2768 #endif
2769
2770 //  memory[position].used = 1;
2771 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2772   UNLOCK_COMMAND(&alloc_lock);
2773 #else
2774 //  blas_unlock(&memory[position].lock);    
2775 #endif

and blas_free():

2880 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2881   LOCK_COMMAND(&alloc_lock);
2882 #endif
2883   while ((position < NUM_BUFFERS) && (memory[position].addr != free_area))
2884     position++;
2885
2886   if (memory[position].addr != free_area) goto error;
2887
2888 #ifdef DEBUG
2889   printf("  Position : %d\n", position);
2890 #endif
2891
2892   // arm: ensure all writes are finished before other thread takes this memory
2893   WMB;
2894
2895 //  memory[position].used = 0;
2896   assert(__sync_val_compare_and_swap(&memory[position].used, 1, 0) == 1);
2897 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2898   UNLOCK_COMMAND(&alloc_lock);
2899 #endif

Line 2896 might be considered "overkill", but it does guarantee table coherency (and abort if it is not).

One last issue - the "memory" table is not properly aligned, though compilers will properly align the structure:

2604   BLASULONG lock;
2605   void *addr;
2606 #if defined(WHEREAMI) && !defined(USE_OPENMP)
2607   int   pos;
2608 #endif
2609   int used;
2610 #ifndef __64BIT__
2611   char dummy[48];
2612 #else
2613   char dummy[40];
2614 #endif
2615
2616 } memory[NUM_BUFFERS];

The "dummy" padding (line 2613) is incorrect for OpenMP builds when BLASULONG is 64 bits and compiling for 64-bits. sizeof(BLASULONG) + sizeof(*addr) + sizeof(int) = 8+8+4 = 20. It should be declared as 44 elements for 64-byte alignment.

But, this brings up a different issue for POWER8 and POWER9 systems. IBM recommends that locks be the only element in an aligned 128-byte (granule) cache line. Here's the programming note from IBM's POWER ISA Version 2.07 reference manual from section 1.7.3.1:

Because the reservation is lost if another processor stores anywhere in the reservation granule, lock words (or bytes, halfwords, or doublewords) should be allocated such that few such stores occur, other than perhaps to the lock word itself. (Stores by other processors to the lock word result from contention for the lock, and are an expected consequence of using locks to control access to shared storage; stores to other locations in the reservation granule can cause needless reservation loss.) Such allocation can most easily be accomplished by allocating an entire reservation granule for the lock and wasting all but one word. Because reservation granule size is implementation-dependent, portable code must do such allocation dynamically. Similar considerations apply to other data that are shared directly using lwarx and stwcx. (e.g., pointers in certain linked lists; see Section B.3, “List Insertion” on page 833).

[cparrott73 note - I have yet not implemented the second recommendation regarding alignment from my colleague, but the first recommendation with regard to locking appears to be sufficient to make the BLAS: Bad memory unallocation! messages go away. I will include these changes in the patches I will submit against OpenBLAS 0.3.10 to work around compilation issues with the new NVIDIA HPC SDK 20.7 compilers, as promised in issue #2718.]

cparrott73 avatar Aug 07 '20 18:08 cparrott73

@cparrott73 - the 128 byte alignment is an optimization which can minimizes the number of times the "stwcx." can possibly fail - that is the reservation is cleared by another processor accessing the same granule (in this situation most likely the same granule has more than one entry of the "memory" structure). The patch as submitted is still relevant.

d-parks avatar Aug 07 '20 19:08 d-parks

Here is an alternative fix, developed by my colleague @d-parks, as described above. This fix implements both locking and usage via the GNU __sync_val_compare_and_swap intrinsic function, using only the used field of the memory management structure. I think this solution is a bit more general, but please take a look and see if you spot any obvious issues with it. You are welcome to use it if you like. Thanks!

pwrmemfix.diff.txt

cparrott73 avatar Aug 11 '20 18:08 cparrott73

Thanks. I have merged the isync fix to protect other users of blas_lock, will need to look at portability of the elegant alternative in more detail as it would affect all architectures. (Besides not sure I like to have an assert() in library code)

martin-frbg avatar Aug 11 '20 20:08 martin-frbg

@martin-frbg The enhanced mod as provided, was developed with only consideration given to the OpenMP path. If you do like the mod, it might just be a special case for OpenMP and not applicable to any other build configuration(s). With regards to using assert - agreed it might not be the best choice in a production environment, the mod was created for a development lab where catching those errors immediately is essential. It can be easily replace with a combination of nothing or an if/fprintf(stderr,....) block, but still something has gone terribly wrong if the "memory" table is no longer consistent.

d-parks avatar Aug 11 '20 21:08 d-parks

I'd like to add 2 comments to the proposed patch which IMO is a very good alternative:

  • The use of the non-standard function could likely be replaced by C11 ones https://en.cppreference.com/w/c/atomic/atomic_compare_exchange (or https://en.cppreference.com/w/c/atomic/atomic_flag_test_and_set) maybe based on availability so OpenBLAS requires either C11 or the extension
  • Using the assert is fine, however it's usage is problematic. Given that NDEBUG is defined (e.g. for optimized release builds) the assert expands to nothing leading to never resetting the "lock" in that case. So I'd recommend running the command and storing it's result into a variable which is then asserted below

Flamefire avatar Oct 01 '20 08:10 Flamefire

@Flamefire , no objections to your comments. The mod was developed to get around the runtime failures we were experiencing in-house, and I expected that the OpenBLAS developers would modify it to follow your guidelines and coding standards.

d-parks avatar Oct 02 '20 11:10 d-parks