lwip-users

Re: [lwip-users] Out of memory in PCP_PCB pool after 2^32 milliseconds


From: Adam Baron
Subject: Re: [lwip-users] Out of memory in PCP_PCB pool after 2^32 milliseconds
Date: Mon, 31 May 2021 07:56:19 +0200

Hello Trampas,
yes, well, I can only agree with you. But I still consider ChibiOS to be well designed and supported. That means I still place a bit of respect and trust in the code and libraries I start to use. But of course I try to understand them first.

And a nice, well-thought-out article, thank you.

Adam

On Fri, May 28, 2021 at 22:24, Trampas Stern <trampas@gmail.com> wrote:
As far as the ChibiOS time issues go, I have a simple rule:

On my embedded systems every line of code I put into the project becomes my problem! 

That is, if I use LWIP and it has a bug, customers do not care whether the bug is in LWIP or not; it is my problem to fix. Hence every line of code becomes my problem. As such, I try not to use code I do not understand. Often (LWIP as an example) you have to use libraries, but do so knowing that their problems become yours. Yes, LWIP has bitten me more than once where it did not work the way I thought it should/would. That was my fault and my problem to fix.

I often go to extremes: I will not use processor-vendor drivers until I have done a code review and understand them. I have been bitten more than once where a vendor's drivers are just "example code." One vendor told me that their code should never be used in production; another had drivers full of bugs and corner cases where they would fail, but insisted their code was production ready. I have seen vendor drivers violate the datasheet. So detailed code reviews of all code are required.

Hence, if you use ChibiOS and it has a bug, well, it is now your bug to fix. That is, every line of code in ChibiOS is now your problem.

Trampas








On Fri, May 28, 2021 at 4:05 PM Trampas Stern <trampas@gmail.com> wrote:
So a trick I use in my code and libraries is to use typedefs for variables.

typedef uint32_t milliseconds_t; 
milliseconds_t getMillis();  

Then I use milliseconds_t to define all variables.  This allows me to change it to uint64_t in one location depending on the project. 

I have started using more typedefs like this as a form of documentation. That is, code is easier to read and follow when variables are typed according to their use.

A neat fixed-point unsigned math trick applies when doing comparisons:

milliseconds_t start=   getMillis();

// This is bad 
while( getMillis()<(start +10) ){  //wait for 10ms 
.... 
}

To understand why, assume milliseconds_t is uint8_t and that all the arithmetic is done in 8 bits. Say start is 255; then (start+10) wraps to 9, while getMillis() on the first pass is still 255. The comparison becomes while (255<9), so you exit the while loop immediately instead of waiting 10 ms.

A better way to do this is:
milliseconds_t start=   getMillis();

// This is good
while( (getMillis()-start)<10 ){  //wait for 10ms 
.... 
}

Here, if start and getMillis() are both 255, the first pass is while(0<10). On the next millisecond getMillis() has wrapped to 0 and we have (getMillis()-start) = (0-255) = 1. To understand this, look at the math in binary:
  0000 0000
- 1111 1111
= 1 0000 0001, where the leading 1 is the borrow out of the top bit. Since we are 8-bit unsigned, that bit is discarded and the value is 1. This means unsigned subtraction gives you the elapsed time modulo 2^8, which is correct as long as the real elapsed time is less than one full wrap of the counter.
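
To make the two comparisons concrete, here is a minimal sketch in C. The names tick8_t, still_waiting_bad and still_waiting_good are made up for this illustration, and the counter is deliberately only 8 bits wide so the wrap is easy to follow. The explicit casts back to the 8-bit type matter: C promotes uint8_t operands to int before doing arithmetic, so without the casts neither the sum nor the difference would wrap. With a uint32_t counter on a 32-bit target the casts are not needed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustration only: an 8-bit tick type so the counter wraps after 255. */
typedef uint8_t tick8_t;

/* Deadline style: compute start + wait once, then compare against it.
 * The sum wraps (255 + 10 -> 9), so the test can fail immediately. */
static bool still_waiting_bad(tick8_t now, tick8_t start, tick8_t wait)
{
    tick8_t deadline = (tick8_t)(start + wait);
    return now < deadline;
}

/* Elapsed style: subtract first, then compare the elapsed time.
 * The difference is taken modulo 256, so a wrap of "now" is harmless
 * as long as less than 256 ticks have really passed. */
static bool still_waiting_good(tick8_t now, tick8_t start, tick8_t wait)
{
    tick8_t elapsed = (tick8_t)(now - start);
    return elapsed < wait;
}
```

With start = 255 and wait = 10, the bad version reports "done" on the very first call, while the good version keeps waiting until now reaches 9, which is ten ticks after 255.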

Now, with that said, the code works, but other developers might not understand it, and you risk them adding or modifying code in a way that breaks things. Therefore I often just use uint64_t to make sure other developers do not break the code. If speed becomes an issue I can optimize the code to use the fixed-point math tricks, but only as a last resort.

Note that I know many developers who refuse to use unsigned variables because of math issues like the above, so they try to use signed integers for almost everything. You still have overflow issues, but you do not have these math issues.

Here is a blog article I wrote on embedded systems and time: 

Trampas





On Fri, May 28, 2021 at 3:25 PM Adam Baron <vysocan76@gmail.com> wrote:
Hello Trampas,

thanks for the hints. I initialized the sys ticks to 2^32 - 120 seconds, and I got mqtt pbuf=NULL after around 120 seconds plus the 120 keep-alive seconds.

The ChibiOS sys_arch.c port implements sys_now() (current time in milliseconds) as follows, simplified:
  return ((u32_t)chVTGetSystemTimeX() - 1) / 10 + 1;
since the system tick is 100 µs.

I guess this might be causing the problems: the tick counter overflows back to 0, so sys_now() never returns more than about (2^32)/10, leaving the lwIP timers waiting for a value higher than that.

To support my guess, I turned on another debug option, and the last lwIP timer message I see is:
sys_timeout: 2000C5DC abs_time=429497730 handler=ip_reass_tmr arg=805B28C
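
A small sketch of the arithmetic behind this guess, in C. Nothing here calls ChibiOS or lwIP: the raw tick value is passed in as a plain parameter, and ms_clock_t, sys_now_naive and ms_clock_update are names invented for this illustration. It assumes 10 ticks per millisecond, as above, and it assumes that the timer code compares times with unsigned 32-bit differences, which only works if the millisecond value itself wraps at 2^32. The naive division tops out at 429496730, which is below the abs_time=429497730 in the log line above. One possible way around it, shown second, is to accumulate elapsed ticks into a separate millisecond counter.

```c
#include <stdint.h>

#define TICKS_PER_MS 10u /* assumption: 100 us system tick */

/* Same arithmetic as the simplified sys_now() quoted above, with the raw
 * tick count passed in. The result wraps at about 2^32 / 10, not 2^32. */
static uint32_t sys_now_naive(uint32_t ticks)
{
    return (ticks - 1u) / TICKS_PER_MS + 1u;
}

/* Hypothetical alternative: keep a millisecond counter of its own. */
typedef struct {
    uint32_t last_ticks; /* raw tick value seen on the previous call */
    uint32_t rem_ticks;  /* ticks not yet converted into a millisecond */
    uint32_t now_ms;     /* millisecond counter, wraps at 2^32 */
} ms_clock_t;

/* Must be called more often than once per tick-counter wrap. */
static uint32_t ms_clock_update(ms_clock_t *c, uint32_t ticks)
{
    /* Unsigned difference: still correct when "ticks" has wrapped. */
    uint32_t elapsed = (ticks - c->last_ticks) + c->rem_ticks;

    c->last_ticks = ticks;
    c->now_ms += elapsed / TICKS_PER_MS;
    c->rem_ticks = elapsed % TICKS_PER_MS;
    return c->now_ms;
}
```

With the naive version, the step across the tick wrap goes from 429496730 straight to 1, so an unsigned difference between two readings taken 0.1 ms apart comes out in the billions. The accumulating version only ever adds the ticks that really elapsed.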


Adam

On Fri, May 28, 2021 at 13:45, Trampas Stern <trampas@gmail.com> wrote:
Increase the counter to a uint64_t. 

You can also start the counter at something other than zero to prove root cause faster.
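
As a rough sketch of that second idea in C: the helper below computes a start value that makes a 32-bit tick counter wrap a chosen number of seconds after boot. The function name is made up for this example, and how the value gets loaded into the RTOS tick counter is port-specific and not shown here.

```c
#include <stdint.h>

/* Hypothetical helper: value to preload into a 32-bit tick counter so that
 * it wraps to 0 after "seconds" seconds at "tick_hz" ticks per second. */
static uint32_t initial_ticks_before_wrap(uint32_t seconds, uint32_t tick_hz)
{
    /* Unsigned arithmetic is modulo 2^32, so 0 - n equals 2^32 - n. */
    return 0u - seconds * tick_hz;
}
```

For a 10 kHz tick (100 µs) and a two-minute wait, initial_ticks_before_wrap(120u, 10000u) gives 4293767296, which is 2^32 - 1200000.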

Trampas

On Fri, May 28, 2021 at 7:08 AM Adam Baron <vysocan76@gmail.com> wrote:
Czesc Tomek :),

I'll try to add it. Thanks.

However, I feel it is more likely related to a uint32 counter of some kind overflowing, since the TCP_PCBs are no longer freed after 2^32 ticks.

Adam

On Fri, May 28, 2021 at 9:44, Tomasz W <wilkxt@gmail.com> wrote:
Hi (Cześć)
Look at this: https://lists.nongnu.org/archive/html/lwip-devel/2020-12/msg00014.html
In my case it solved the problem of the web server dying after a few days.


On Fri, May 28, 2021 at 08:58, Adam Baron <vysocan76@gmail.com> wrote:
>
> Hello all,
>
> I have a small STM32F4 application running on the devel branch of lwIP. It includes httpd, sntp, an smtp client, and an mqtt client. Everything runs well until the fifth day, when the mqtt client starts to receive pbuf=NULL and disconnects. My reconnect routine reconnects it after a short time, but it receives pbuf=NULL again shortly after.
>
> Later on I also noticed this in the log: memp_malloc: out of memory in pool TCP_PCB.
> I have MEMP_NUM_TCP_PCB defined as 30, which seems enough for normal operation. I also raised it to 50, but ended up with the same problem.
> In the statistics, NUM_TCP_PCB increases and decreases as it should, but once uptime passes 5 days it stays high with an error flag set.
>
> Quite interestingly, it happens exactly after 2^32 milliseconds of uptime. I tried to keep OpenOCD connected so I could peek in, but so far I have not managed to keep OpenOCD running that long without the connection dropping.
>
> Does anyone have any ideas please?
>
> Thanks in advance,
> --
> 731435556
> Adam Baron
> _______________________________________________
> lwip-users mailing list
> lwip-users@nongnu.org
> https://lists.nongnu.org/mailman/listinfo/lwip-users



--
Regards
Tomek
