infinity_ergodox The LCD screen gets corrupted after a while

When you leave the keyboard on for a while, the graphics on the LCD screen can get corrupted.

I don't really know what's causing it, or what could be done to fix it, so ideas are welcome.

Do other people have the same problem?

Apr 09 '16 18:04 fredizzimo

It seems like not even putting the keyboard to sleep will fix the issue. In fact it seems like the sleep command is ignored and the graphics is still shown on the LCD.

Apr 09 '16 18:04 fredizzimo

I have the same issue. It starts with the LED backlight on the right half going dark, then the right LCD turns off. Finally, the LCD on the left half gets corrupted.

May 11 '16 10:05 bremoran

I have not personally observed the LED backlight going dark. But @tthayer reported a similar issue when he was testing the LED support https://github.com/fredizzimo/infinity_ergodox/issues/5.

But that problem makes the issue even more strange, since the two has nothing to do with each other, and are handled in completely different parts of the code.

Typically memory corruptions could cause issues like this, but since both systems are handled at completely different places, I have a hard time figuring out what variables could get corrupted and how.

An SWD debugger would be really useful. At least I think it could be used, but it seems like it would use the same pins as the interconnect, so I would only be able to debug with only one half connected. But currently I don't have any, and I don't know which one to buy, and from where. So I guess I will have to use print based debugging.

May 11 '16 19:05 fredizzimo

I'm working on getting a SWD debugger up and running. I was able to find a colleague who has a tagconnect and I have a debugger handy. I'm just working out the pin mapping from tagconnect to the debugger.

It looks like SWD_DIO and SWD_CLK are dedicated lines so there is no reason that debugging can't happen on both at the same time.

May 11 '16 21:05 bremoran

I'm seeing the LED backlight going dark on the right half, but not the corruption of the screen.

May 12 '16 01:05 seancaffery

@bremoran, yes you are right. I believe the trace port would be quite useful for debugging this problem though. But it's connected to tx0, which is used for the UART communication.

May 12 '16 07:05 fredizzimo

Yes, if it's down to needing trace, that will be a problem, however, I am hoping to track the problem down with just SWD debugging.

May 12 '16 07:05 bremoran

When the right-hand LCD backlight goes dark, lcd_backlight_color() is called with (hue=64 '@', saturation=176 '\260', intensity=255 '\377'), however, current_brightness is set to 0, so there is no backlight.

lcd_backlight_brightness() is only called once, which points to a memory corruption bug.

The actual point where current_brightness gets set to 0 is in a memcpy()

Here is the traceback:

#0  0x000033bc in memcpy ()
#1  0x00008f32 in transport_recv_frame (from=<optimized out>, 
    data=data@entry=0x1fffa225 <states+5> "\001", size=size@entry=20)
    at ./tmk_serial_link/serial_link/protocol/transport.c:89
#2  0x00008d24 in route_incoming_frame (link=<optimized out>, 
    data=0x1fffa225 <states+5> "\001", size=<optimized out>)
    at ./tmk_serial_link/serial_link/protocol/frame_router.c:44
#3  0x00008dea in validator_recv_frame (link=<optimized out>, 
    data=<optimized out>, size=<optimized out>)
    at ./tmk_serial_link/serial_link/protocol/frame_validator.c:112
#4  0x00008c26 in byte_stuffer_recv_byte (link=link@entry=0 '\000', 
    data=<optimized out>)
    at ./tmk_serial_link/serial_link/protocol/byte_stuffer.c:74
#5  0x00009146 in read_from_serial (driver=<optimized out>, 
    link=link@entry=0 '\000')
    at ./tmk_serial_link/serial_link/system/system.c:68
#6  0x00009196 in serialThread (arg=<optimized out>)
    at ./tmk_serial_link/serial_link/system/system.c:101
#7  0x000022aa in _port_thread_start ()

In my build, current-brightness exists at 0x1fff86e4

I'm still working out the exact conditions for the failure, but it appears to be a flaw in transport_recv_frame()

May 13 '16 12:05 bremoran

Thanks a lot,

There's some seriously dangerous code there in the transport protocol, mostly because I'm used to relying on C++ and type safety. I also wanted to keep the system from allocating.

Anyway this information is probably enough for me to find the bug.

May 13 '16 12:05 fredizzimo

The issue occurs for id == 2 with ptr == 0x1fff86d4. Not sure if that helps

May 13 '16 12:05 bremoran

Let me know if you need more help with this and I'll take another look.

May 13 '16 13:05 bremoran

I'm not able to look into the issue properly right now, since I'm at work.

One thing that might help, is the map file in the build folder, which should specify where all variables are in memory, as I might not get the same memory locations if I build the code locally. At least I think that's file, but it could also be in some other file in that folder.

May 13 '16 13:05 fredizzimo

Ah, one other piece of information:

remote_objects[2] == 0x1fff8648 <remote_object_current_status> and ptr == 0x1fff86d4

This implies that it's something to do with

void * ptr = triple_buffer_begin_write_internal(obj->object_size, tb);

May 13 '16 13:05 bremoran

I generally prefer arm-none-eabi-objdump. Anyway, here's the map file ch.map.txt

May 13 '16 13:05 bremoran

The pointer might not be exactly the same as the object pointer, since as the name implies, there's three different buffers, triple_buffer_begin_write_internal, is supposed to return a valid pointer to one of them.

So either it's not returning a wrong pointer, or the size is too big, so that it's writing over the buffer. It could also be some alignment issue, which causes the pointer to be slightly offset from the size that was allocated.

May 13 '16 13:05 fredizzimo

I have now manually verified that the pointer appears to be calculated correctly. However for some reason the size is 20, while it should be 17. The object in question is 16 bytes, and there's one additional byte for the object id.

So now I have to check why the received size is wrong.

May 13 '16 21:05 fredizzimo

I can confirm that I've gotten this context when there's a failure:

from = 0 '\000'
data = 0x1fffa225 <states+5> "\001"
size = 20, id = 2 '\002'
start = 0x1fff86b0 <remote_object_current_status+104> "\030"
tb = 0x1fff86b0 <remote_object_current_status+104>
ptr = 0x1fff86d4

As a temporary measure, I suggest only copying min(object->size,size)

May 13 '16 21:05 bremoran

Yes, I will add that sanity check in any case, even if I fix the bug. But I will probably just ignore the data if the size is wrong. The protocol is lossy any way, so it doesn't really matter.

I just added some print statements, and the sender sends the correct size. And for the objects the received size appears to be correct.

Unfortunately I can't add any printing on the right hand side which receives the object in question, so I don't know if the size is always wrong, or just sometimes.

May 13 '16 21:05 fredizzimo

In validator_recv_frame, I have:

link = 0 '\000', data = 0x1fffa225 <states+5> "\001", size = 25

data:

0x1fffa225 <states+5>:  0x01    0x00    0xff    0xa8    0x5c    0x81    0xd3    0x00
0x1fffa22d <states+13>: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x1fffa235 <states+21>: 0x00    0x00    0x00    0x02    0xff    0x42    0x42    0x2f
0x1fffa23d <states+29>: 0x78

I haven't looked into the protocol much yet, so I'm not sure if that's useful information or not

May 13 '16 21:05 bremoran

I was able to confirm the bug on the left hand size. The size suddenly went from 18 to 39. But just for one packet.

So there's either some other data corruption, or the byte stuffer and crc check doesn't detect corrupted packets. They are both quite thoroughly unit tested though.

May 13 '16 21:05 fredizzimo

Is it possible that some lost bytes could produce a valid CRC?

May 13 '16 21:05 bremoran

It's certainly possible, the crc32 check is not a very strong one, but still the expected error rate is 2.3E-10. And add that we should expect very few errors on the actual physical link. And we are sending maybe 100 packets per second.

So I don't think we would see the error as often as we see it, if that was the case.

May 13 '16 21:05 fredizzimo

One thing that I noticed, is that it always gets corrupted to the same 39 value. So it's probably something else writing it there, rather than a problem with the link.

Moreover since I don't type anything, and the packet in question is the keyboard state, the data should stay the same. Which means that the byte stuffler should generate the same data to be sent over the link.

So it's probably something else corrupting it... Maybe we are only seeing the secondary cause of another memory corruption.

May 13 '16 21:05 fredizzimo

Is this on the left-hand side?

May 13 '16 21:05 bremoran

Yes, the left hand side (or the master), receives the keyboard_matrix object.

May 13 '16 22:05 fredizzimo

Ok, I found out that the link is very bad, or I have some bug in my stuffer code. It's constantly receiving the wrong sizes, and generates invalid CRC checks. So it's probably one of those checks that passes as you suspected.

May 13 '16 22:05 fredizzimo

Yes, that's indeed what's happening. The size is wrong, already at the CRC check. But the check passes. It's a little bit strange though, since statistically this shouldn't happen no where as often as it does.

For the bad link quality itself, the baud rate could be dropped. But on the other hand, I'm using the same speed as Haata claimed he had tested without any errors with the official firmware. Also he doesn't use any other checks than the parity bits, so if there are errors, we should probably see wrong characters being typed. I don't remember were he said that anymore.

May 13 '16 22:05 fredizzimo

Ok it turns out that

0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01

and

0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01 0x5A 0x9A 0xC9 0x61 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01

generates the same CRC32 (0x61C99A5A). The data is basically duplicated, with the crc value as a part of the actual data. I think the fix for this would be to add the size as an additional criteria.

I still have to figure out why there's so much corruption though. It could be real physical corruption, but it might also be that there's trouble finding the start of the frame, or some bug in the byte stuffling code.

May 14 '16 00:05 fredizzimo

My suspicion is that you're not losing any data on the wire. I suspect that you're losing data by not reading it fast enough. You check for that by registering for the error UARTx_S1_OR in the serial driver's error field

May 14 '16 06:05 bremoran

Yes, I do indeed get overrun errors.

However when debugging this, I found out that I don't receive the events for CHN_INPUT_AVAILABLE at all (but I receive the error events from the same event handler). So there might be a bug in the Chibios serial driver, or I'm using it the wrong way.

This would mean that I'm only reading the data when the keyboard has something to send. So that would explain why so much data get lost.

I also noticed two other things,

I'm only using 16 bytes for SERIAL_BUFFERS_SIZE (the internal Chibios buffer), this should probably be bigger.
And I'm also running my serial thread with too low priority. It's currently running the same priority as the visualizer, which isn't good, since the visualizer is quite slow for some functionality. That must be something I have forgot to do, since I certainly had plans for having that thread priority order. The actual keyboard loop runs with higher priority as it should though, so it can handle the debouncing properly.

May 14 '16 09:05 fredizzimo