The LCD screen gets corrupted after a while
When you leave the keyboard on for a while, the graphics on the LCD screen can get corrupted.
I don't really know what's causing it, or what could be done to fix it, so ideas are welcome.
Do other people have the same problem?
It seems like not even putting the keyboard to sleep will fix the issue. In fact it seems like the sleep command is ignored and the graphics is still shown on the LCD.
I have the same issue. It starts with the LED backlight on the right half going dark, then the right LCD turns off. Finally, the LCD on the left half gets corrupted.
I have not personally observed the LED backlight going dark. But @tthayer reported a similar issue when he was testing the LED support https://github.com/fredizzimo/infinity_ergodox/issues/5.
But that problem makes the issue even more strange, since the two has nothing to do with each other, and are handled in completely different parts of the code.
Typically memory corruptions could cause issues like this, but since both systems are handled at completely different places, I have a hard time figuring out what variables could get corrupted and how.
An SWD debugger would be really useful. At least I think it could be used, but it seems like it would use the same pins as the interconnect, so I would only be able to debug with only one half connected. But currently I don't have any, and I don't know which one to buy, and from where. So I guess I will have to use print based debugging.
I'm working on getting a SWD debugger up and running. I was able to find a colleague who has a tagconnect and I have a debugger handy. I'm just working out the pin mapping from tagconnect to the debugger.
It looks like SWD_DIO and SWD_CLK are dedicated lines so there is no reason that debugging can't happen on both at the same time.
I'm seeing the LED backlight going dark on the right half, but not the corruption of the screen.
@bremoran, yes you are right. I believe the trace port would be quite useful for debugging this problem though. But it's connected to tx0, which is used for the UART communication.
Yes, if it's down to needing trace, that will be a problem, however, I am hoping to track the problem down with just SWD debugging.
When the right-hand LCD backlight goes dark, lcd_backlight_color() is called with (hue=64 '@', saturation=176 '\260', intensity=255 '\377'), however, current_brightness is set to 0, so there is no backlight.
lcd_backlight_brightness() is only called once, which points to a memory corruption bug.
The actual point where current_brightness gets set to 0 is in a memcpy()
Here is the traceback:
#0 0x000033bc in memcpy ()
#1 0x00008f32 in transport_recv_frame (from=<optimized out>,
data=data@entry=0x1fffa225 <states+5> "\001", size=size@entry=20)
at ./tmk_serial_link/serial_link/protocol/transport.c:89
#2 0x00008d24 in route_incoming_frame (link=<optimized out>,
data=0x1fffa225 <states+5> "\001", size=<optimized out>)
at ./tmk_serial_link/serial_link/protocol/frame_router.c:44
#3 0x00008dea in validator_recv_frame (link=<optimized out>,
data=<optimized out>, size=<optimized out>)
at ./tmk_serial_link/serial_link/protocol/frame_validator.c:112
#4 0x00008c26 in byte_stuffer_recv_byte (link=link@entry=0 '\000',
data=<optimized out>)
at ./tmk_serial_link/serial_link/protocol/byte_stuffer.c:74
#5 0x00009146 in read_from_serial (driver=<optimized out>,
link=link@entry=0 '\000')
at ./tmk_serial_link/serial_link/system/system.c:68
#6 0x00009196 in serialThread (arg=<optimized out>)
at ./tmk_serial_link/serial_link/system/system.c:101
#7 0x000022aa in _port_thread_start ()
In my build, current-brightness exists at 0x1fff86e4
I'm still working out the exact conditions for the failure, but it appears to be a flaw in transport_recv_frame()
Thanks a lot,
There's some seriously dangerous code there in the transport protocol, mostly because I'm used to relying on C++ and type safety. I also wanted to keep the system from allocating.
Anyway this information is probably enough for me to find the bug.
The issue occurs for id == 2 with ptr == 0x1fff86d4. Not sure if that helps
Let me know if you need more help with this and I'll take another look.
I'm not able to look into the issue properly right now, since I'm at work.
One thing that might help, is the map file in the build folder, which should specify where all variables are in memory, as I might not get the same memory locations if I build the code locally. At least I think that's file, but it could also be in some other file in that folder.
Ah, one other piece of information:
remote_objects[2] == 0x1fff8648 <remote_object_current_status> and ptr == 0x1fff86d4
This implies that it's something to do with
void * ptr = triple_buffer_begin_write_internal(obj->object_size, tb);
I generally prefer arm-none-eabi-objdump. Anyway, here's the map file
ch.map.txt
The pointer might not be exactly the same as the object pointer, since as the name implies, there's three different buffers, triple_buffer_begin_write_internal, is supposed to return a valid pointer to one of them.
So either it's not returning a wrong pointer, or the size is too big, so that it's writing over the buffer. It could also be some alignment issue, which causes the pointer to be slightly offset from the size that was allocated.
I have now manually verified that the pointer appears to be calculated correctly. However for some reason the size is 20, while it should be 17. The object in question is 16 bytes, and there's one additional byte for the object id.
So now I have to check why the received size is wrong.
I can confirm that I've gotten this context when there's a failure:
from = 0 '\000'
data = 0x1fffa225 <states+5> "\001"
size = 20, id = 2 '\002'
start = 0x1fff86b0 <remote_object_current_status+104> "\030"
tb = 0x1fff86b0 <remote_object_current_status+104>
ptr = 0x1fff86d4
As a temporary measure, I suggest only copying min(object->size,size)
Yes, I will add that sanity check in any case, even if I fix the bug. But I will probably just ignore the data if the size is wrong. The protocol is lossy any way, so it doesn't really matter.
I just added some print statements, and the sender sends the correct size. And for the objects the received size appears to be correct.
Unfortunately I can't add any printing on the right hand side which receives the object in question, so I don't know if the size is always wrong, or just sometimes.
In validator_recv_frame, I have:
link = 0 '\000', data = 0x1fffa225 <states+5> "\001", size = 25
data:
0x1fffa225 <states+5>: 0x01 0x00 0xff 0xa8 0x5c 0x81 0xd3 0x00
0x1fffa22d <states+13>: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x1fffa235 <states+21>: 0x00 0x00 0x00 0x02 0xff 0x42 0x42 0x2f
0x1fffa23d <states+29>: 0x78
I haven't looked into the protocol much yet, so I'm not sure if that's useful information or not
I was able to confirm the bug on the left hand size. The size suddenly went from 18 to 39. But just for one packet.
So there's either some other data corruption, or the byte stuffer and crc check doesn't detect corrupted packets. They are both quite thoroughly unit tested though.
Is it possible that some lost bytes could produce a valid CRC?
It's certainly possible, the crc32 check is not a very strong one, but still the expected error rate is 2.3E-10. And add that we should expect very few errors on the actual physical link. And we are sending maybe 100 packets per second.
So I don't think we would see the error as often as we see it, if that was the case.
One thing that I noticed, is that it always gets corrupted to the same 39 value. So it's probably something else writing it there, rather than a problem with the link.
Moreover since I don't type anything, and the packet in question is the keyboard state, the data should stay the same. Which means that the byte stuffler should generate the same data to be sent over the link.
So it's probably something else corrupting it... Maybe we are only seeing the secondary cause of another memory corruption.
Is this on the left-hand side?
Yes, the left hand side (or the master), receives the keyboard_matrix object.
Ok, I found out that the link is very bad, or I have some bug in my stuffer code. It's constantly receiving the wrong sizes, and generates invalid CRC checks. So it's probably one of those checks that passes as you suspected.
Yes, that's indeed what's happening. The size is wrong, already at the CRC check. But the check passes. It's a little bit strange though, since statistically this shouldn't happen no where as often as it does.
For the bad link quality itself, the baud rate could be dropped. But on the other hand, I'm using the same speed as Haata claimed he had tested without any errors with the official firmware. Also he doesn't use any other checks than the parity bits, so if there are errors, we should probably see wrong characters being typed. I don't remember were he said that anymore.
Ok it turns out that
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01
and
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01 0x5A 0x9A 0xC9 0x61 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01
generates the same CRC32 (0x61C99A5A). The data is basically duplicated, with the crc value as a part of the actual data. I think the fix for this would be to add the size as an additional criteria.
I still have to figure out why there's so much corruption though. It could be real physical corruption, but it might also be that there's trouble finding the start of the frame, or some bug in the byte stuffling code.
My suspicion is that you're not losing any data on the wire. I suspect that you're losing data by not reading it fast enough. You check for that by registering for the error UARTx_S1_OR in the serial driver's error field
Yes, I do indeed get overrun errors.
However when debugging this, I found out that I don't receive the events for CHN_INPUT_AVAILABLE at all (but I receive the error events from the same event handler). So there might be a bug in the Chibios serial driver, or I'm using it the wrong way.
This would mean that I'm only reading the data when the keyboard has something to send. So that would explain why so much data get lost.
I also noticed two other things,
- I'm only using 16 bytes for SERIAL_BUFFERS_SIZE (the internal Chibios buffer), this should probably be bigger.
- And I'm also running my serial thread with too low priority. It's currently running the same priority as the visualizer, which isn't good, since the visualizer is quite slow for some functionality. That must be something I have forgot to do, since I certainly had plans for having that thread priority order. The actual keyboard loop runs with higher priority as it should though, so it can handle the debouncing properly.