Re: [certi-dev] Handling crash of federate
Wed, 9 Jul 2014 14:23:26 +0300
I decided to offer the clean version as a patch, since I will be away for
the next three weeks (I will still read this list and my mail, but will not
touch the code during that time).
Since you did not reply, I am just offering the patch against the 3.4.3
source (I think it will apply fine to the main repo too). I did not see much
consistency in how functions etc. are named, so feel free to modify it
(obviously).
I patched ieee1516-2000 and ieee1516-2010 too, but I did not test
them. However, the difference seems completely trivial, as most of the
change is in RTIA, which is the same for all versions. The libRTI side only
has the pipe activation, which looks pretty much identical between all
versions.
A simple way to test this is to kill the parent process and check whether
its rtia.exe dies too. Note that this requires a real kill or crash. If
the federate goes down through the default signal handler etc., the sockets
are closed and this fix is not needed. It is only needed when the federate
really crashes.
BTW, I also fixed a bug that I reported two years ago:
That is quite an important fix after a crash too, since an attribute update
request is usually made when a federate rejoins a running federation
(actually that bug report is for a slightly different issue, but this patch
fixes that bug as well).
Hope to hear how these patches work for you.
2014-07-04 10:31 GMT+03:00 Timi Tuohenmaa <address@hidden>:
> Thanks for your very quick reply. Also late thanks for the white papers
> you linked quite a time ago. They helped a lot with my Master's Thesis
> (which I wrote in Finnish, so I don't bother to link it).
> 2014-07-03 14:38 GMT+03:00 Eric Noulard <address@hidden>:
>> 2014-07-03 10:18 GMT+02:00 Timi Tuohenmaa <address@hidden>:
>>> I have been looking at how to handle the situation where a federate
>>> program crashes in a Windows environment. Currently rtia.exe can't notice
>>> it, since Windows does not inform child processes about a parent crash and
>>> the TCP socket between parent and child is not cut until a long timeout.
>>> This is a problem when trying to make the system as robust as possible and
>>> to recover from some crashes (like a 3D visual which is merely listening
>>> to HLA and would therefore be easy to rejoin to the system).
>> Hi Timi,
>> If you want a historical view of the Windows implementation
>> of Federate<-->RTIA communication you can read this:
>>> In Linux this is probably not a problem, as I think a Unix socket closes
>>> when the parent dies, so rtia gets notified correctly.
>>> Unfortunately this is surprisingly difficult to solve in Windows.
>>> Windows offers the Job Object system, which could terminate rtia.exe when
>>> the parent dies, but as far as I understand it only offers a violent
>>> terminate (like kill -9). That is not good, because rtia.exe's sockets
>>> to rtig.exe would then have to die by timeout.
>>> One way to notice a parent crash would be opening a pipe between them:
>>> This additional pipe could be watched if the select function were
>>> changed to WaitForMultipleObjects WINAPI crap, but that forcefully
>>> changes the sockets to nonblocking state, which is not too good for the
>>> current CERTI logic. By adding a check to a few recv errors I managed to
>>> make this work, but it was CPU heavy as it caused a busy loop.
>> I see...
> I tested WaitForMultipleObjects option and it wasn't good. It triggers from
> stdin-pipe instantly so it does not help to solve this at all.
>>> Another way would be to change the infinite selects to timed ones
>>> internally and to check that pipe now and then. Or to make a thread that
>>> checks the pipe and then uses an additional TCP socket to reset the
>>> select when the parent dies.
>>> Now I wonder if bit more complex patch would even be taken to base
>>> CERTI code at all. I sure hope so as I find this important issue and I
>>> am ready to solve this.
>> Having a robust behavior is worth the effort.
>> If the complexity comes from the platform (i.e. Windows) then the patch
>> and the associated extra complexity should be platform specific.
>> I'm not a Windows specialist so I guess I would ask other CERTI windows
>> users for their expertise?
>> In the meantime I'll try to seek a little about this issue on my own.
> I tried the thread solution and it might actually be quite decent. My
> current test has a separate h-file that contains a simple thread that does
> nothing but wait in a blocking read on stdin (which is actually a pipe from
> the parent executable, aka the federate). When the parent crashes, the
> blocking function fails and then I trigger _exit(EXIT_FAILURE). In practice
> this works very well and requires very little other code than starting the
> thread and some extra setup when rtia.exe is executed.
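A minimal sketch of such a watcher thread is shown below. It is not the code
from the patch: it uses std::thread and POSIX read() on a pipe so that it is
portable and can be tried outside Windows, and startParentWatcher and
demoWatcherFires are hypothetical names. The action taken when the pipe
breaks is passed in as a callback, so the demo can set a flag where the real
rtia would call _exit(EXIT_FAILURE).

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <functional>
#include <thread>
#include <unistd.h>

// Hypothetical helper: starts a thread that blocks reading watchFd and calls
// onParentGone once the read reports end-of-file or an error.
std::thread startParentWatcher(int watchFd, std::function<void()> onParentGone) {
    assert(watchFd >= 0);
    return std::thread([watchFd, onParentGone] {
        char buf[64];
        for (;;) {
            ssize_t n = read(watchFd, buf, sizeof buf);
            if (n > 0) continue;                    // parent wrote something: ignore
            if (n < 0 && errno == EINTR) continue;  // interrupted: retry
            break;                                  // 0 = EOF, < 0 = broken pipe
        }
        onParentGone();
    });
}

// Small self-check: the callback stays quiet while the write end is open and
// fires after it is closed (which is what a parent crash looks like).
bool demoWatcherFires() {
    int fds[2];
    if (pipe(fds) != 0) return false;
    std::atomic<bool> gone{false};
    std::thread watcher = startParentWatcher(fds[0], [&gone] { gone = true; });
    bool quietWhileAlive = !gone.load();
    close(fds[1]);                                  // simulate the parent dying
    watcher.join();
    close(fds[0]);
    return quietWhileAlive && gone.load();
}
```

In the setup described in the mail the watched descriptor would be stdin and
the callback would be something like [] { _exit(EXIT_FAILURE); }, with the
thread detached instead of joined.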
> Using _exit() is of course not perfect, as rtia.exe does not exit with
> correct uninitialization, but it does close the sockets so rtig notices the
> exit. A cleaner solution will need some messaging that wakes up the infinite
> select in Communications::readMessage. Since WINAPI has obsoleted the way to
> trigger WSAEINTR (
> the only option seems to be adding an extra TCP socket that is listened to
> there and closed in the thread. I'm not sure whether it would be legal (and
> would work) to close the rtia socket in the thread, but that could be an
> option too. Both solutions would need some extra public functions or
> function modifications in Communications (like passing an extra handle to
> the constructor or giving the thread access to SocketUN).
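The extra-socket variant can be sketched as follows. Again this is only an
illustration under assumptions: socketpair() stands in for the extra TCP
socket (Windows has no socketpair, so a loopback TCP pair would be needed
there), and waitMessageOrWake and demoWakeBySocketClose are hypothetical
names, not functions of the CERTI Communications class.

```cpp
#include <algorithm>
#include <cassert>
#include <sys/select.h>
#include <sys/socket.h>
#include <thread>
#include <unistd.h>

enum class Wake { Message, Shutdown, Error };

// Blocks without timeout on both sockets. The wake socket becomes readable
// (end-of-file) as soon as its peer is closed by the watcher thread.
Wake waitMessageOrWake(int rtiSock, int wakeSock) {
    assert(rtiSock >= 0 && wakeSock >= 0);
    fd_set readSet;
    FD_ZERO(&readSet);
    FD_SET(rtiSock, &readSet);
    FD_SET(wakeSock, &readSet);
    int n = select(std::max(rtiSock, wakeSock) + 1, &readSet,
                   nullptr, nullptr, nullptr);
    if (n <= 0) return Wake::Error;
    if (FD_ISSET(wakeSock, &readSet)) return Wake::Shutdown;
    return Wake::Message;
}

// Small self-check: no message ever arrives on the "rti" pair, so the only
// thing that can end the infinite select is the watcher closing its end.
bool demoWakeBySocketClose() {
    int rti[2], wake[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, rti) != 0) return false;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, wake) != 0) return false;
    std::thread watcher([&wake] { close(wake[1]); });  // "parent died"
    Wake result = waitMessageOrWake(rti[0], wake[0]);
    watcher.join();
    close(rti[0]);
    close(rti[1]);
    close(wake[0]);
    return result == Wake::Shutdown;
}
```

Compared with calling _exit() from the thread, this lets the main loop see
the shutdown and run its normal cleanup, at the cost of passing one extra
handle into the code that calls select.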
>> Whatever the result I will definitely have a look at your patch and consider
>> it seriously.
>> Please open an entry in the bug tracker or patch tracker for that.
>> Remember that CERTI main repository is now using git:
> I'll open the entry in the bug tracker or patch tracker when I get your
> opinion on whether to keep it very simple (_exit()) or make it cleaner
> (extra socket).
> I did these tests with the 3.4.3 source, but I can do the patch against the
> main repo too.