esp-nimble-cpp icon indicating copy to clipboard operation
esp-nimble-cpp copied to clipboard

Scanning, connections dropping & crashing

Open pscs opened this issue 4 years ago • 7 comments

This is with the 'master' branch.

I have a W32-WROVER and am writing quite a complex device with it. It will connect to 4 BLE peripherals if they're available, and display details on a screen, etc.

So, it does a BLE Scan, if any recognised devices are available it connects, but if ANY devices are not available, it should continue scanning to see if any of the other devices are available so that it can connect to those.

What seems to happen when 1 device is available is:

  • scan for 5 seconds use a scanner callback to detect when scan has completed

  • after the scan has finished - 1 device is found

  • connect to that device

  • after a few seconds, start another scan for 5 seconds

  • the connection to the first device is immediately dropped with an error

  • this second scan does NOT stop after 5 seconds and just goes on for ever (I've waited for several minutes).

  • if I force the scan stopped (Scanner->stop()) after, say, 6 seconds, then

    • the first device is available (since its previous connection was disconnected)
    • connect to the first device again
    • after a few seconds, start another scan
    • an immediate crash and restart of the device

If no connection has been made, then scans end after 5 seconds every time, it's only after it seems to get messed up a bit when the established connection is dropped, that the scan never ends, and then crashes after a force stop.

This is a (not very detailed) log from my application:

victron(3) Notify ed8c -110 victron(3) Notify 010e 0 victron(3) Notify ed8c -120 E NimBLERemoteCharacteristic: "<< writeValue, rc: 7" ble(3) BLE Scanner restarting E NimBLERemoteDescriptor: "Disconnected" E NimBLERemoteDescriptor: "Disconnected" E NimBLERemoteDescriptor: "Disconnected" ble(3) BLE Scanner restarted ble(4) BLE advertised device found - address: 58:af:87:6b:5d:5f - 1

All that's between the 'restarting' and 'restarted' logs is 'pScanner->start(5, BLEScanDone, true)'. I've tried setting the 'isContinue' to both 'true' and 'false' with no difference.

The scanner is started with:

pScanner = BLEDevice::getScan();
pScanner->setAdvertisedDeviceCallbacks(new MyAdvertisedDeviceCallbacks(this));
pScanner->setInterval(1349);
pScanner->setWindow(449);
pScanner->start(5, BLEScanDone, false);

I'm fairly sure this didn't happen with 1.2.0, but I had a problem with my BLE service when using that version.

What other information would be useful to diagnose this?

pscs avatar Jul 12 '21 13:07 pscs

More logging info.

(Oddly it doesn't seem to crash when core logging is set to '5', but it still disconnects when the scan starts and the scan still never ends by itself)

D NimBLEClient: "checking service 306b0001-b081-4037-83dc-e59fcc3cdfd0 for handle: 36" D NimBLEClient: "Got Notification for characteristic Characteristic: uuid: 306b0003-b081-4037-83dc-e59fcc3cdfd0, handle: 36 0x0024, props: 0x14 Descriptor: uuid: 0x2902, handle: 37" D NimBLEClient: "Invoking callback for notification on characteristic Characteristic: uuid: 306b0003-b081-4037-83dc-e59fcc3cdfd0, handle: 36 0x0024, props: 0x14 Descriptor: uuid: 0x2902, handle: 37" D NimBLEClient: "Got Client event " D NimBLEClient: "Notify Recieved for handle: 36" D NimBLEClient: "checking service 306b0001-b081-4037-83dc-e59fcc3cdfd0 for handle: 36" D NimBLEClient: "Got Notification for characteristic Characteristic: uuid: 306b0003-b081-4037-83dc-e59fcc3cdfd0, handle: 36 0x0024, props: 0x14 Descriptor: uuid: 0x2902, handle: 37" D NimBLEClient: "Invoking callback for notification on characteristic Characteristic: uuid: 306b0003-b081-4037-83dc-e59fcc3cdfd0, handle: 36 0x0024, props: 0x14 Descriptor: uuid: 0x2902, handle: 37" I NimBLERemoteCharacteristic: "Write complete; status=7 conn_handle=0" E NimBLERemoteCharacteristic: "<< writeValue, rc: 7" ble(3) BLE Scanner restarting D NimBLEClient: "Got Client event " D NimBLEScan: ">> start(duration=5)" I NimBLEClient: "disconnect; reason=531, " D NimBLERemoteCharacteristic: ">> setNotify(): Characteristic: uuid: 306b0002-b081-4037-83dc-e59fcc3cdfd0, handle: 33 0x0021, props: 0x16 Descriptor: uuid: 0x2902, handle: 34, 00" D NimBLERemoteCharacteristic: ">> getDescriptor: uuid: 0x2902" D NimBLERemoteCharacteristic: "<< getDescriptor: found the descriptor with uuid: 0x2902" D NimBLERemoteCharacteristic: "<< setNotify()" D NimBLERemoteDescriptor: ">> Descriptor writeValue: Descriptor: uuid: 0x2902, handle: 34" E NimBLERemoteDescriptor: "Disconnected" D NimBLERemoteCharacteristic: ">> setNotify(): Characteristic: uuid: 306b0003-b081-4037-83dc-e59fcc3cdfd0, handle: 36 0x0024, props: 0x14 Descriptor: uuid: 0x2902, handle: 37, 00" D NimBLERemoteCharacteristic: ">> getDescriptor: uuid: 0x2902" D NimBLERemoteCharacteristic: "<< getDescriptor: found the descriptor with uuid: 0x2902" D NimBLERemoteCharacteristic: "<< setNotify()" D NimBLERemoteDescriptor: ">> Descriptor writeValue: Descriptor: uuid: 0x2902, handle: 37" E NimBLERemoteDescriptor: "Disconnected" D NimBLERemoteCharacteristic: ">> setNotify(): Characteristic: uuid: 306b0004-b081-4037-83dc-e59fcc3cdfd0, handle: 39 0x0027, props: 0x14 Descriptor: uuid: 0x2902, handle: 40, 00" D NimBLERemoteCharacteristic: ">> getDescriptor: uuid: 0x2902" D NimBLERemoteCharacteristic: "<< getDescriptor: found the descriptor with uuid: 0x2902" D NimBLERemoteCharacteristic: "<< setNotify()" D NimBLERemoteDescriptor: ">> Descriptor writeValue: Descriptor: uuid: 0x2902, handle: 40" E NimBLERemoteDescriptor: "Disconnected" D NimBLEScan: "<< start()" ble(3) BLE Scanner restarted I NimBLEScan: "Updated advertiser: c4:17:54:d3:f4:c2" ble(4) BLE advertised device found - address: c4:17:54:d3:f4:c2 - 1 ble(3) Set found - SmartSolar [141443][V][location.cpp:70] Update(): locationUpdate

I NimBLEScan: "Updated advertiser: 59:bf:62:c5:2d:5a" ble(4) BLE advertised device found - address: 59:bf:62:c5:2d:5a - 1 [141526][V][location.cpp:70] Update(): locationUpdate

I NimBLEScan: "Updated advertiser: 59:50:01:62:4a:05" ble(4) BLE advertised device found - address: 59:50:01:62:4a:05 - 1 [141652][V][location.cpp:70] Update(): locationUpdate

[142180][V][location.cpp:70] Update(): locationUpdate

I NimBLEScan: "Updated advertiser: 7e:f8:9f:88:f8:4b" ble(4) BLE advertised device found - address: 7e:f8:9f:88:f8:4b - 1 I NimBLEScan: "Updated advertiser: 6c:dd:7e:0c:cc:48" ble(4) BLE advertised device found - address: 6c:dd:7e:0c:cc:48 - 1 [143018][V][location.cpp:70] Update(): locationUpdate

I NimBLEScan: "Updated advertiser: 59:50:01:62:4a:05" ble(4) BLE advertised device found - address: 59:50:01:62:4a:05 - 1 I NimBLEScan: "Updated advertiser: c4:17:54:d3:f4:c2" ble(4) BLE advertised device found - address: c4:17:54:d3:f4:c2 - 1 ble(3) Set found - SmartSolar [144015][V][location.cpp:70] Update(): locationUpdate

I NimBLEScan: "Updated advertiser: 7e:f8:9f:88:f8:4b" ble(4) BLE advertised device found - address: 7e:f8:9f:88:f8:4b - 1 I NimBLEScan: "Updated advertiser: 59:bf:62:c5:2d:5a" ble(4) BLE advertised device found - address: 59:bf:62:c5:2d:5a - 1 I NimBLEScan: "Updated advertiser: 17:33:db:ed:e7:b3" ble(4) BLE advertised device found - address: 17:33:db:ed:e7:b3 - 1 [145005][V][location.cpp:70] Update(): locationUpdate

[148014][V][location.cpp:70] Update(): locationUpdate

I NimBLEScan: "Updated advertiser: 59:50:01:62:4a:05" ble(4) BLE advertised device found - address: 59:50:01:62:4a:05 - 1 I NimBLEScan: "Updated advertiser: c4:17:54:d3:f4:c2" ble(4) BLE advertised device found - address: c4:17:54:d3:f4:c2 - 1 ble(3) Set found - SmartSolar I NimBLEScan: "Updated advertiser: 17:33:db:ed:e7:b3" ble(4) BLE advertised device found - address: 17:33:db:ed:e7:b3 - 1 I NimBLEScan: "Updated advertiser: 59:bf:62:c5:2d:5a" ble(4) BLE advertised device found - address: 59:bf:62:c5:2d:5a - 1 [149012][V][location.cpp:70] Update(): locationUpdate

I NimBLEScan: "Updated advertiser: 7e:f8:9f:88:f8:4b" ble(4) BLE advertised device found - address: 7e:f8:9f:88:f8:4b - 1 [150017][V][location.cpp:70] Update(): locationUpdate

ble(3) Force stop scanner after 6 seconds D NimBLEScan: ">> stop()" ble(3) BLE Scan Done D NimBLEScan: "<< stop()" [151009][V][location.cpp:70] Update(): locationUpdate

[160005][V][location.cpp:70] Update(): locationUpdate

victron(3) Victron - try connect victron(3) Victron - start connect - c4:17:54:d3:f4:c2 D NimBLEClient: ">> connect(c4:17:54:d3:f4:c2)" D NimBLEClient: "Got Client event " I NimBLEClient: "Connected event" D NimBLEClient: "Got Client event " I NimBLEClient: "mtu update event; conn_handle=0 mtu=23" I NimBLEClient: "Connection established" D NimBLEClient: "<< connect()" victron(3) SmartSolar - Connected to Server

pscs avatar Jul 12 '21 13:07 pscs

From the logs:

E NimBLERemoteCharacteristic: "<< writeValue, rc: 7"

That return code means:

0x07 BLE_HS_ENOTCONN No open connection with the specified handle.

I NimBLEClient: "disconnect; reason=531, "

And this one is:

0x0213 0x13 BLE_ERR_REM_USER_CONN_TERM Remote User Terminated Connection

What kind of device is this? It appears it is disconnecting from you after trying to write to that characteristic.

As for the crash, I have no ideas without more info. If possible, please try to capture a backtrace and I'll have a look.

h2zero avatar Jul 12 '21 14:07 h2zero

OK, I think I was slightly mislead in my diagnosis...

After lots of diagnosis, It appears that I was sending too many writes to the characteristic, which could have been filling a buffer up somewhere. That meant that the next write was blocked. THAT meant that Loop() was never being called, so a keepalive message was never being sent. THEN, when the remote device timed out the connection after 10 seconds or so, the blocked write was being released, and generating the error, and Loop() then continued, triggering the next Scan->start()

It appears that the blocked write was never going to be 'unblocked' until the connection dropped. As I say, it finally unblocked after 10 seconds, which was plenty of time to do the writes (approx 100 bytes over 9 'writeValue' calls). Maybe the blocked one should have returned an error, or the blocking was a bug.

I guess it was working earlier because I was sending a slightly incorrect initialisation sequence which was 36 bytes shorter. That's now been corrected, and I guess it's gone over a threshold somewhere.

Either that, or doing a single WriteValue with 36 bytes is causing the problem (but I thought it was allowed to send up to the MTU value in one go - 255 bytes)

I'll try sending the first batch of Writes (which are to initialise the remote device) and then pause for a few milliseconds then send the second batch to finish off initialisation and see if that fixes the problem.

(Not really knowing what I'm doing, I tried increasing the 'CONFIG_BT_NIMBLE_MSYS1_BLOCK_COUNT' value, but that made no difference)

pscs avatar Jul 12 '21 17:07 pscs

Bit more diagnosis later - sending the last WriteValue as two blocks of 18 bytes works fine, whereas a single block of 36 bytes causes it to freeze. How big a block are you meant to be able to send in one go?

Ah. Got it, it's specified by the peripheral - and it's 23 in this case.

Sorry for the daft questions. I'm still learning about BLE.

Should NimBLE have given an error or something if I tried to send more than 23 bytes in one chunk?

pscs avatar Jul 12 '21 17:07 pscs

Either that, or doing a single WriteValue with 36 bytes is causing the problem (but I thought it was allowed to send up to the MTU value in one go - 255 bytes)

This may be part of the issue, setting the MTU simply allows for up to that number, does not mean you will get that unless both devices support the same MTU size. Many devices will only support 23 bytes, in which case a long write will be performed which is slow and requires responses.

h2zero avatar Jul 12 '21 17:07 h2zero

Yes, I think that was it - but it didn't seem to manage a 'long write' - it just hung. I don't know if that was a limitation of the remote device (a Victron Solar Charger) or a problem with the library. Unfortunately I can't get a BLE sniffer trace because the remote device only supports encrypted connections, so the trace after connection is a bit useless

pscs avatar Jul 12 '21 19:07 pscs

The remote device may not be able to support that much data in it's characteristic buffer. I don't have an answer to that but long writes in the library happen automatically if the data length is greater than the MTU.

h2zero avatar Jul 15 '21 02:07 h2zero