Re: [certi-dev] Handling crash of federate

From: Timi Tuohenmaa
Subject: Re: [certi-dev] Handling crash of federate
Date: Fri, 4 Jul 2014 10:31:21 +0300


Thanks for your very quick reply, and belated thanks for the white papers
you linked some time ago. They helped a lot with my Master's thesis
(which I wrote in Finnish, so I won't bother linking it).

2014-07-03 14:38 GMT+03:00 Eric Noulard <address@hidden>:
> 2014-07-03 10:18 GMT+02:00 Timi Tuohenmaa <address@hidden>:
>> Hi,
>> I have been looking how to handle situation where federate program
>> crashes in Windows environment. Currently rtia.exe can't notice it
>> since Windows does not inform child processes about parent crash and
>> TCP socket between parent and child does not cut until long timeout.
>> This is a problem when trying to make as robust system as possible and
>> recovering from some crashes (like 3d visual which is merely listening
>> HLA and therefore would be easy to rejoin to system).
> Hi Timi,
> If you want some historical view on the Windows implementation
> of Federate<-->RTIA communication you can read this:
> https://savannah.nongnu.org/patch/?6893
>> In Linux this is probably not a problem as I think Unix Sockets closes
>> when parent dies and therefore gets notified correctly.
>> Unfortunately this is surprisingly difficult to solve in Windows.
>> Windows offers Job Object -system that could terminate rtia.exe when
>> parent dies, but as far as I understand it does only offer option for
>> violent terminate (like kill -9) and it's not good as then rtia.exe's
>> sockets to rtig.exe would need timeout death.
>> One way to notice parent crash would be opening pipe between them:
>> http://stackoverflow.com/questions/3342941/kill-child-process-when-parent-process-is-killed/13614987#13614987
>> This additional pipe could be watched if select-function would be
>> changed to WaitForMultipleObjects WINAPI crap, but that forcefully
>> changes sockets to nonblocking state and it's not too good for current
>> CERTI logic. By adding check to few recv errors I managed to make this
>> work, but it was cpu heavy as it caused busy loop.
> I see...

I tested the WaitForMultipleObjects option and it did not work well: the wait
triggers instantly on the stdin pipe, so it does not help to solve this at all.

>> Other way would be change infinite selects to timed ones internally
>> and to check that pipe now and then. Or making a thread that would be
>> checking pipe and then using additional TCP-socket to reset select
>> when parent dies.
>> Now I wonder if bit more complex patch would even be taken to base
>> CERTI code at all. I sure hope so as I find this important issue and I
>> am ready to solve this.
> Having a robust behavior is worth the effort.
> If the complexity comes from the platform (i.e. Windows) then the patch
> and the associated extra complexity should be platform specific.
> I'm not a Windows specialist so I guess I would ask other CERTI windows
> users for their expertise?
> In the meantime I'll try to seek a little about this issue on my own.

I tried the thread solution and it might actually be quite decent. My current
test has a separate header file containing a simple thread that does nothing
but wait in a blocking read on stdin (which is actually a pipe from the parent
executable, i.e. the federate). When the parent crashes, the blocking read
fails and I then call _exit(EXIT_FAILURE). In practice this works very well
and needs very little code beyond starting the thread and some extra setup
when rtia.exe is launched.
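
To make the idea concrete, here is a minimal sketch of such a watchdog thread.
It is not the code from my test header: the names waitForParentExit and
startParentWatchdog are made up for illustration, and it assumes that the
descriptor passed in is the read end of a pipe whose write end is held only by
the federate.

```cpp
#include <functional>
#include <thread>

#ifdef _WIN32
#include <io.h>
static int readSome(int fd, char* buf, unsigned n) { return _read(fd, buf, n); }
#else
#include <unistd.h>
static int readSome(int fd, char* buf, unsigned n) {
    return static_cast<int>(read(fd, buf, n));
}
#endif

// Blocks until the pipe on `fd` reports end-of-file or an error, which is
// what the reader sees once the process holding the write end is gone.
// Any data that does arrive is discarded.
void waitForParentExit(int fd) {
    const unsigned kBufSize = 64;
    char buf[kBufSize];
    while (readSome(fd, buf, kBufSize) > 0) {
        // keep draining; we only care about EOF or failure
    }
}

// Hypothetical helper: starts a detached thread that calls `onParentGone`
// once the parent's end of `fd` has been closed.
void startParentWatchdog(int fd, std::function<void()> onParentGone) {
    std::thread([fd, onParentGone] {
        waitForParentExit(fd);
        onParentGone();
    }).detach();
}
```

In rtia this would be started once at startup, for example as
startParentWatchdog(0, [] { _exit(EXIT_FAILURE); }); with 0 being stdin.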

Using _exit() is of course not perfect, since rtia.exe exits without proper
cleanup, but it does close the sockets, so rtig notices the exit. A cleaner
solution needs some messaging that wakes the infinite select in
Communications::readMessage. Since WINAPI has made the way to trigger that
obsolete, the only option seems to be adding an extra TCP socket that is
listened to there and closed in the thread. I'm not sure whether it would be
legal (and working) to close the rtia socket from the thread, but that could
be an option too. Both solutions would need some extra public functions or
function modifications in Communications (such as passing an extra socket
handle to the constructor or giving the thread access to SocketUN).
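
The extra-socket idea could look roughly like the sketch below. It is only an
illustration of the select logic, not CERTI code: WaitResult and
waitForMessage are hypothetical names, and the sketch uses POSIX headers for
brevity. On Windows, select accepts only sockets, so the wake-up descriptor
would have to be a connected loopback TCP socket as described above. The
sketch also assumes the watchdog thread closes (or writes to) the peer end of
the wake-up socket rather than the descriptor select is waiting on, since
closing a descriptor that another thread is selecting on is not something I
would rely on.

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

enum class WaitResult { Data, Wakeup, Error };

// Blocks in select() without a timeout until `dataFd` or `wakeFd` becomes
// readable. `wakeFd` becomes readable when its peer writes to it or closes
// it, which is how the watchdog thread asks the reader to stop waiting.
WaitResult waitForMessage(int dataFd, int wakeFd) {
    fd_set readSet;
    FD_ZERO(&readSet);
    FD_SET(dataFd, &readSet);
    FD_SET(wakeFd, &readSet);
    const int maxFd = dataFd > wakeFd ? dataFd : wakeFd;

    // A real implementation would probably retry on EINTR; this sketch
    // reports every select() failure as an error.
    if (select(maxFd + 1, &readSet, nullptr, nullptr, nullptr) < 0) {
        return WaitResult::Error;
    }
    // The wake-up request wins if both descriptors are ready.
    if (FD_ISSET(wakeFd, &readSet)) {
        return WaitResult::Wakeup;
    }
    return WaitResult::Data;
}
```

With this shape the caller can leave its read loop on WaitResult::Wakeup and
run the normal shutdown path instead of calling _exit().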

> Whatever the result I will definitely have a look at your patch and consider
> it seriously.
> Please open an entry in the bug tracker or patch tracker for that.
> Remember that CERTI main repository is now using git:
> https://savannah.nongnu.org/git/?group=certi

I'll open an entry in the bug tracker or patch tracker once I get your opinion
on whether to keep it very simple (_exit()) or make it cleaner (extra socket).
I ran these tests with the 3.4.3 source, but I can prepare the patch against
the main repo too.

> --
> Erk
> L'élection n'est pas la démocratie -- http://www.le-message.org
> --
> CERTI-Devel mailing list
> address@hidden
> https://lists.nongnu.org/mailman/listinfo/certi-devel
