Re: [certi-dev] Handling crash of federate

From: Eric Noulard
Subject: Re: [certi-dev] Handling crash of federate
Date: Thu, 3 Jul 2014 13:38:24 +0200

2014-07-03 10:18 GMT+02:00 Timi Tuohenmaa <address@hidden>:

I have been looking at how to handle the situation where a federate program
crashes in a Windows environment. Currently rtia.exe can't notice it,
since Windows does not inform child processes about a parent crash and
the TCP socket between parent and child is not cut until a long timeout
expires. This is a problem when trying to make the system as robust as
possible and to recover from some crashes (for example a 3D visual that
merely listens to HLA and would therefore be easy to rejoin to the system).

Hi Timi,

If you want some historical background on the Windows implementation
of the Federate<-->RTIA communication, you can read this:

On Linux this is probably not a problem, as I think Unix sockets close
when the parent dies, so the child is notified correctly.

Unfortunately this is surprisingly difficult to solve on Windows.
Windows offers the Job Object system, which could terminate rtia.exe when
the parent dies, but as far as I understand it only offers a forced
termination (like kill -9). That is not good, because rtia.exe's
sockets to rtig.exe would then have to die by timeout.

One way to notice a parent crash would be to open a pipe between them:

This additional pipe could be watched if the select function were
replaced with WaitForMultipleObjects WINAPI crap, but that forcibly
switches the sockets to non-blocking mode, which does not fit well with
the current CERTI logic. By adding checks for a few recv errors I managed
to make this work, but it was CPU-heavy as it caused a busy loop.

I see...

Another way would be to change the infinite selects to timed ones
internally and to check the pipe now and then. Or to create a thread that
checks the pipe and then uses an additional TCP socket to wake up the
select when the parent dies.
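
A minimal sketch of the "timed select plus occasional pipe check" idea
described above. It is POSIX code, not Windows code and not CERTI code: the
names parentIsGone and waitForDataOrParentDeath are hypothetical, and the
pipe is deliberately kept out of the main select set to mirror the Windows
constraint that select only accepts sockets. On Windows the pipe check
itself would presumably need a different call (PeekNamedPipe is one
candidate), which is an assumption and not something tested here.

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>

// Hypothetical helper: true once the parent's write end of the pipe is closed.
// A zero-timeout select() polls the pipe; read() returning 0 means EOF.
static bool parentIsGone(int pipeReadFd) {
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(pipeReadFd, &rfds);
    timeval zero;
    zero.tv_sec = 0;
    zero.tv_usec = 0;
    if (select(pipeReadFd + 1, &rfds, nullptr, nullptr, &zero) <= 0)
        return false;  // nothing readable: the writer still holds its end
    char c;
    return read(pipeReadFd, &c, 1) == 0;  // 0 bytes read means EOF
}

// Hypothetical stand-in for an "infinite" select turned into a timed one.
// sock is expected to be a socket descriptor.
// Returns 1 when sock is readable, 0 when the parent is gone, -1 on error.
static int waitForDataOrParentDeath(int sock, int pipeReadFd, long timeoutMs) {
    assert(timeoutMs > 0);  // a zero timeout would bring back the busy loop
    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(sock, &rfds);
        timeval tv;
        tv.tv_sec = timeoutMs / 1000;
        tv.tv_usec = static_cast<suseconds_t>((timeoutMs % 1000) * 1000);
        int n = select(sock + 1, &rfds, nullptr, nullptr, &tv);
        if (n > 0) return 1;                    // data from the peer
        if (n < 0 && errno != EINTR) return -1; // real select failure
        // Timeout or interrupted call: check the pipe, then wait again.
        if (parentIsGone(pipeReadFd)) return 0;
    }
}
```

The timeout trades detection latency against wake-ups: with, say, 500 ms the
process wakes twice per second while idle and notices the parent's death
within about half a second, instead of spinning as in the busy-loop variant.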

Now I wonder whether a somewhat more complex patch would be accepted into
the base CERTI code at all. I sure hope so, as I find this an important
issue and I am ready to solve it.

Having a robust behavior is worth the effort.
If the complexity comes from the platform (i.e. Windows) then the patch
and the associated extra complexity should be platform-specific.

I'm not a Windows specialist, so I would like to ask other CERTI Windows users for their expertise.

In the meantime I'll try to look into this issue a little on my own.

Whatever the result I will definitely have a look at your patch and consider it seriously.
Please open an entry in the bug tracker or patch tracker for that.

Remember that the CERTI main repository now uses git:

Election is not democracy -- http://www.le-message.org
