[QAT_MEM] page allocation failure issues
Hi !
A number of kernel panic messages were found in syslog while using the QAT module (they occur in qat_mem.ko); qat_mem_ioctl seems to have failed to allocate pages.
Looking at the panic messages, it appears the system hangs due to a memory leak. Do you have any reports of this issue? If it has been reported before, do you know what can be done about it?
[7730502.783337] page allocation failure: order:5, mode:0x24000c0
[7730502.783343] CPU: 18 PID: 22411 Comm: ssl_proxy Tainted: G OE 4.4.0-116-generic #140-Ubuntu
[7730502.783345] Hardware name: Micro-Star International Co., Ltd. KT-S145/KT-S145, BIOS 5.11 08/25/2017
[7730502.783348] 0000000000000286 7aa48184a81eb87b ffff880643953be8 ffffffff813ffc13
[7730502.783352] 00000000024000c0 0000000000000000 ffff880643953c78 ffffffff81198bca
[7730502.783354] 7aa48184a81eb87b 0000000000000005 0000000000000040 ffff88086a521c00
[7730502.783357] Call Trace:
[7730502.783367] [
Kind Regards,
Hi @duadbsgh,
The issue you are seeing is not uncommon. That is one of the reasons why qat_mem/qat_contig_mem is not considered a production-ready driver.
The qat_mem/qat_contig_mem driver requests kernel memory via the __get_free_pages() call, asking for 128KB contiguous slabs. As this memory comes out of the general system allocation it works well to start with, but over time the system's memory can become fragmented (by other usage), eventually leading to failed requests. It is not that memory is leaking; it is that the number of available 128KB contiguous regions decreases over time until none are left.
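To make the failure mode concrete, here is a minimal sketch (not the actual qat_mem source) of the kind of request involved: 128KB of physically contiguous memory is an order-5 allocation (32 contiguous 4KB pages), which is exactly the "order:5" shown in your panic message. Once physical memory is fragmented, the kernel can have plenty of free pages yet no free run of 32 contiguous ones, so the call fails.

#include <linux/gfp.h>
#include <linux/mm.h>

/* Illustrative only: request/release a 128KB physically contiguous slab. */
static void *alloc_128k_slab(void)
{
	unsigned int order = get_order(128 * 1024);   /* order 5 with 4KB pages */
	unsigned long addr = __get_free_pages(GFP_KERNEL, order);

	return addr ? (void *)addr : NULL;            /* NULL => "page allocation failure" */
}

static void free_128k_slab(void *slab)
{
	if (slab)
		free_pages((unsigned long)slab, get_order(128 * 1024));
}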
If possible it is recommended to use the USDM driver that is included with the CPM 1.7 upstream driver. The USDM driver has far more configuration options to help prevent this.
Firstly, the USDM driver by default allocates 2MB slabs rather than 128KB slabs. This may seem counter-intuitive, but it uses them more efficiently, so it reduces overall memory consumption as long as 2MB allocations can be satisfied. If 2MB slabs are not working well you can configure the USDM driver to use 128KB slabs instead, like the qat_mem/qat_contig_mem driver uses.
Additionally, you can configure how the USDM driver caches slabs, which controls whether slabs get released back to the kernel to be used by other applications. If you set the threshold of cached slabs high, memory usage will remain high when traffic is low, but the risk of fragmentation decreases because nothing else can use that memory.
The other thing is that the USDM driver supports using hugepages. You can configure the Linux kernel to set aside an amount of memory at boot time that will only be used by applications making use of hugepages. If only USDM is using hugepages, and hugepages are configured with 2MB slabs, there will be no fragmentation, because USDM (without reconfiguration) always requests and releases 2MB slabs. Obviously there is a trade-off: the memory you set aside for hugepages cannot be used for anything else. There is also the issue that if you reserve too few hugepages you can still exhaust the slabs if utilisation rises too high.
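For illustration only (this is not the USDM source, and the real driver also has to resolve physical addresses through its kernel module), the snippet below shows the general mechanism a user-space memory driver can use to obtain a 2MB hugepage-backed slab once hugepages have been reserved, e.g. via vm.nr_hugepages or the hugepages= kernel command line option. Because each slab is exactly one 2MB hugepage drawn from the reserved pool, ordinary kernel allocations cannot fragment it.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define SLAB_SIZE (2 * 1024 * 1024)   /* one 2MB hugepage */

/* Illustrative only: map a 2MB slab backed by a reserved hugepage. */
static void *alloc_hugepage_slab(void)
{
	void *p = mmap(NULL, SLAB_SIZE, PROT_READ | PROT_WRITE,
	               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");   /* pool not configured or exhausted */
		return NULL;
	}
	return p;
}

static void free_hugepage_slab(void *p)
{
	if (p)
		munmap(p, SLAB_SIZE);
}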
Unfortunately I do not have any documentation on exactly how to configure these options for the USDM driver. I'm not sure if it is documented as part of the QAT 1.7 driver documentation or whether you will need to look at the source code. If you are interested I can look into the details further.
Hope this additional information is helpful,
Steve.
For info on using USDM with hugepages, see section 3.16, "Huge Pages with the Included Memory Driver", of the QAT 1.7 Programmer's Guide.
Kind Regards,
Steve.
Hi @stevelinsell, thank you for your answer.
We are currently unable to use USDM. Is there any way to resolve this in qat_mem? Sorry.
Kind Regards,
Hi @duadbsgh,
I don't have a 100% solution for you unfortunately.
One thing you could try: within qae_mem_utils.c (one of the QAT Engine files) there is a #define:
#define MAX_EMPTY_SLAB 128
This define controls how many empty slabs are cached before they are released back to the kernel. Increasing this value will increase the memory usage of the QAT Engine when it is not busy, but it also means slabs are held on to rather than released, preventing that memory from being fragmented by other use. Setting it very high will prevent any slabs being released until the application exits.
This will help in a scenario where traffic fluctuates between high and low over time: if the high point happens early on, while slabs are still available, they will be kept. It will not help in a scenario where traffic stays low for a long initial period, memory gets fragmented, and then a burst happens, as slabs will still not be available. Any time a new traffic high point is reached, new slabs will be requested, and the error could still occur.
You could also make code changes to qae_mem_utils.c to allocate a large number of slabs up front at application start-up and, by changing the #define mentioned above, prevent those slabs from ever being released. This would help in both scenarios above, but at a permanently higher cost in memory usage.
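To make those two suggestions concrete, here is a simplified sketch of the idea, not the actual QAT Engine code, with hypothetical helper names (slab_alloc/slab_release stand in for the real ioctl path into qat_mem): a raised MAX_EMPTY_SLAB so empty slabs are cached rather than released, plus a start-up loop that claims slabs before the system's memory has had a chance to fragment.

#include <stdlib.h>

#define SLAB_SIZE      (128 * 1024)   /* qat_mem slab size */
#define MAX_EMPTY_SLAB 1024           /* raised from 128 to hold on to empty slabs */
#define PREALLOC_SLABS 256            /* hypothetical: size for the expected peak load */

static void *empty_slab_cache[MAX_EMPTY_SLAB];
static int   cached_empty_slabs;

/* Hypothetical allocation/release paths; the real code goes through the qat_mem ioctl. */
static void *slab_alloc(void)      { return malloc(SLAB_SIZE); }
static void  slab_release(void *s) { free(s); }

/* When a slab becomes completely empty, cache it for reuse or hand it back. */
static void on_slab_empty(void *slab)
{
	if (cached_empty_slabs < MAX_EMPTY_SLAB)
		empty_slab_cache[cached_empty_slabs++] = slab;   /* keep it */
	else
		slab_release(slab);                              /* give memory back */
}

/* Claim slabs once at start-up and park them in the empty-slab cache so they
   are already held when a later traffic burst arrives. */
static int prealloc_slabs(void)
{
	int i;

	for (i = 0; i < PREALLOC_SLABS; i++) {
		void *slab = slab_alloc();
		if (slab == NULL)
			return -1;   /* contiguous memory already unavailable */
		on_slab_empty(slab);
	}
	return 0;
}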
Also consider the scenario where you want to restart the application without restarting the whole system. In that case the memory could already be fragmented, and none of the workarounds suggested above will be of any benefit. It is in those kinds of scenarios that hugepages are really useful. I do not have a version of qat_mem/qat_contig_mem that will make use of hugepages unfortunately, but if you really needed it, it may be possible for you to port the hugepage support from the USDM driver.
I hope some of that might be useful to you,
Steve.
@stevelinsell is there a reason that physically contiguous memory is required? Is the QAT accelerator not behind an IOMMU?