mldonkey-bugs
[Mldonkey-bugs] [bugs #11384] Source of Orphaned File Descriptor Bug


From: spiralvoice
Subject: [Mldonkey-bugs] [bugs #11384] Source of Orphaned File Descriptor Bug
Date: Thu, 13 Jan 2005 00:25:08 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20050110 Firefox/1.0 (Debian package 1.0+dfsg.1-2)

This is an automated notification sent by Savannah.
It relates to:
                bugs #11384, project mldonkey, a multi-networks file-sharing 
client

==============================================================================
 LATEST MODIFICATIONS of bugs #11384:
==============================================================================

               Posted by: spiralvoice <spiralvoice>
               Posted on: 2005-01-13 00:25 (Europe/Berlin)
    _______________________________________________________

Follow-up Comment:
Saw that as well, but I don't know how to download the OCaml CVS version. Some
months ago I used these commands:



cvs -d:pserver:address@hidden:/caml login

cvs -z3 -d:pserver:address@hidden:/caml co -P ocaml



as described on this page: http://camlcvs.inria.fr/cvsserver-eng.html



But now I get Ocaml 3.08+2 instead of Ocaml 3.09 CVS:-(

==============================================================================
 OVERVIEW of bugs #11384:
==============================================================================

URL:
  <http://savannah.nongnu.org/bugs/?func=detailitem&item_id=11384>

                 Summary: Source of Orphaned File Descriptor Bug
                 Project: mldonkey, a multi-networks file-sharing client
            Submitted by: shunga
            Submitted on: Thu 12/23/2004 at 15:46
                Category: Core
                Severity: 5 - Average
              Item Group: Program malfunction
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
                 Release: None
                 Release: 2.5-22
        Platform Version: Mac OS X Jaguar
         Binaries Origin: CVS / Self compiled
                CPU type: PowerPC

    _______________________________________________________


On Mac OS X, and I assume on other systems, there are two bugs in mlnet which
together generate hundreds of orphaned file descriptors, causing mlnet to
eventually hang.  I worked with the mlnet 2.5-22 source and ocaml 3.07-p12 to
debug the source of the problem:



I don't know the source code or OCAML well enough to suggest exactly why it
is happening or the best way to fix it, but I have done enough debugging to
figure out the cause of the problem. There are two issues: 



1. The first is in src/daemon/common/commonChat.ml in the routine
send_paquet_to_mlchat. The Unix.connect fails with "Connection refused :
connect" but the error is not trapped and the socket is not closed. Trapping
the error and closing the socket fixes this one. 



2. The rest of the orphaned file descriptors are in
src/utils/net/tcpServerSocket.ml in the routine tcp_handler. The Unix.accept
fails with "Exception tcp_handler: failed: Address family not supported by
protocol family" but apparently has created a new socket which is never
closed. If I trap the exception and issue the following "close t
(Closed_for_error (Printexc2.to_string e));" I find that lsof only shows one
orphaned socket after hours of running. I assume that issuing "close t"
closes the original socket that is being listened to and this stops
Unix.accept from being called again. I don't know why it is getting this
error unless perhaps the previous bind failed and that wasn't trapped, but
maybe there is some other reason. 



So if a developer who knows the code and OCAML can fix these two problems and
get the patches into the current release, then that should solve the orphaned
file descriptor problem which causes mlnet to hang after running for some
hours.
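The build-up of descriptors described in this report was observed with lsof. As a rough illustration only, the same count can be taken from inside a process by probing each descriptor number. The sketch below is plain C, not mldonkey code; the function name count_open_fds and the probing limit are assumptions made for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: count the descriptors open in this process by
   probing every number below `limit` with fcntl(F_GETFD), which returns
   -1 (errno EBADF) for a descriptor that is not open. */
int count_open_fds(int limit)
{
    assert(limit >= 0);
    int n = 0;
    for (int fd = 0; fd < limit; fd++) {
        if (fcntl(fd, F_GETFD) != -1)
            n++;
    }
    return n;
}
```

Logging this count at intervals would show a steady climb if sockets are being orphaned, and a flat line once they are closed on every error path.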

    _______________________________________________________

Follow-up Comments:


-------------------------------------------------------
Date: Wed 01/12/2005 at 21:56       By: Shunga <shunga>
I've received an email from caml that the OCAML issue with the open file
descriptors (id = 3423)  has been fixed in the CVS version of OCAML.

-------------------------------------------------------
Date: Mon 01/10/2005 at 21:20       By: spiralvoice <spiralvoice>
FYI: There are bug reports at Inria so this problem will be fixed sometime.

http://caml.inria.fr/bin/caml-bugs/incoming?id=3422

http://caml.inria.fr/bin/caml-bugs/incoming?id=3423

-------------------------------------------------------
Date: Fri 01/07/2005 at 04:06       By: Shunga <shunga>
I think I've verified that the problem is in the OCAML unix library, not in
mldonkey.  I modified OCAML's unix library and moved the unix_error call that
was in alloc_sockaddr to the routines that called alloc_sockaddr and returned
a value of -1 from within alloc_sockaddr.  That allowed me in unix_accept to
close(retcode) before calling unix_error.  I rebuilt 2.5.29ab with the
begin/end added to tcp_handler, but removed the "close t" because that is
incorrect.  The log file indicates that I've gotten a number of tcp_handler
errors but now I have only one orphaned file descriptor (the one due to
send_paquet_to_mlchat), and I've verified that I still have a high ID (since
the "close t" was removed).



It is probably a good idea to continue to trap the error in the tcp_handler
routine so mine now looks like:



let tcp_handler t sock event =
  match event with
  | CAN_READ
  | CAN_WRITE ->
      begin
        try
          let s,id = Unix.accept (fd sock) in
          if !verbose_bandwidth > 1 then
            lprintf "[BW2 %6d] accept on %s\n" (last_time ()) t.name;
          (match t.accept_control with
             None -> ()
           | Some cc ->
               cc.nconnections_last_second <- cc.nconnections_last_second + 1);
          incr nconnections_last_second;
          t.event_handler t (CONNECTION (s,id))
        with e ->
          lprintf "Exception tcp_handler: %s\n" (Printexc2.to_string e);
          raise e
      end
  | _ -> t.event_handler t (BASIC_EVENT event)



although this doesn't fix the problem, it does keep other lines of code from
executing when an error occurs.



The final orphaned file descriptor can be fixed by changing
send_paquet_to_mlchat in src/daemon/common/commonChat.ml to close chanout on
a connect error.  My routine looks like:



let send_paquet_to_mlchat (p : C.packet) =
  let domain = Unix.PF_INET in
  let sock = Unix.socket domain Unix.SOCK_STREAM 0 in
  let inet_addr =
    let host = !!O.chat_app_host in
    try Unix.inet_addr_of_string host
    with _ ->
        let h = Unix.gethostbyname host in
        h.Unix.h_addr_list.(0)
  in
  let sockaddr = Unix.ADDR_INET (inet_addr, !!O.chat_app_port) in
  let chanout = Unix.out_channel_of_descr sock in
  try
    Unix.connect sock sockaddr;
    Chat_proto.write_packet_channel chanout p;
    flush chanout;
    close_out chanout
  with
  | Unix.Unix_error (e,s1,s2) ->
      let s = (Unix.error_message e)^" : "^s1^" "^s2 in
      lprintf "%s\nchat_app_host=%s chat_app_port=%d\n" s
        !!O.chat_app_host !!O.chat_app_port;
      close_out chanout
  | e ->
      lprintf "%s\nchat_app_host=%s chat_app_port=%d\n"
        (Printexc2.to_string e)
        !!O.chat_app_host !!O.chat_app_port;
      close_out chanout



With the change to the OCAML unix library and these changes to mldonkey, I no
longer have any orphaned file descriptors and continue to have a high ID.



So if someone knows the OCAML developers, maybe you could get them to
change the OCAML unix library.



The other question is on Mac OS X, why is sa_family sometimes not defined in
alloc_sockaddr?  I have not looked at what the value is, but it may be that
the value is AF_INET6 and HAS_IPV6 is not defined in the makefile for making
OCAML on Mac OS X.  It could be that if HAS_IPV6 was defined for Mac OS X
when OCAML was made, that the error would not have occurred.
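One way to answer the open question about the value of sa_family would be to log it, together with the address length, at the point where the conversion fails. The following is only a sketch in plain C under that assumption; family_name and describe_peer are hypothetical names and are not part of the OCAML unix library or of mldonkey.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Map an address family number to a printable name, for logging only. */
const char *family_name(int family)
{
    switch (family) {
    case AF_UNSPEC: return "AF_UNSPEC";
    case AF_UNIX:   return "AF_UNIX";
    case AF_INET:   return "AF_INET";
#ifdef AF_INET6
    case AF_INET6:  return "AF_INET6";
#endif
    default:        return "unknown";
    }
}

/* Hypothetical helper: format the family and the length that accept()
   reported, so the log shows which value reached the default case. */
int describe_peer(char *buf, size_t size,
                  const struct sockaddr *sa, socklen_t len)
{
    assert(buf != NULL && sa != NULL);
    return snprintf(buf, size, "sa_family=%d (%s) addr_len=%lu",
                    (int)sa->sa_family, family_name(sa->sa_family),
                    (unsigned long)len);
}
```

If the logged family turned out to be AF_INET6, that would support the HAS_IPV6 theory; any other value would point to a different cause.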



Shunga





-------------------------------------------------------
Date: Fri 01/07/2005 at 02:26       By: Anonymous
Although I don't understand the source of the problem, I think I now know the
reason for the orphaned socket on Mac OS X.  I believe the source of the
problem is the OCAML Unix.accept routine.  This routine I think calls the
OCAML unix_accept routine defined in ocaml-3.08.0/otherlibs/unix/accept.c and
the code looks like this:



CAMLprim value unix_accept(value sock)
{
  int retcode;
  value res;
  value a;
  union sock_addr_union addr;
  socklen_param_type addr_len;

  addr_len = sizeof(addr);
  enter_blocking_section();
  retcode = accept(Int_val(sock), &addr.s_gen, &addr_len);
  leave_blocking_section();
  if (retcode == -1) uerror("accept", Nothing);
  a = alloc_sockaddr(&addr, addr_len);
  Begin_root (a);
    res = alloc_small(2, 0);
    Field(res, 0) = Val_int(retcode);
    Field(res, 1) = a;
  End_roots();
  return res;
}



Notice that the unix accept routine is called which creates a file descriptor
in retcode (assuming no error).    Then this routine calls alloc_sockaddr
which is defined in ocaml-3.08.0/otherlibs/unix/socketaddr.c and this routine
looks like this:



value alloc_sockaddr(union sock_addr_union * adr /*in*/,
                     socklen_param_type adr_len)
{
  value res;
  switch(adr->s_gen.sa_family) {
#ifndef _WIN32
  case AF_UNIX:
    { value n = copy_string(adr->s_unix.sun_path);
      Begin_root (n);
        res = alloc_small(1, 0);
        Field(res,0) = n;
      End_roots();
      break;
    }
#endif
  case AF_INET:
    { value a = alloc_inet_addr(&adr->s_inet.sin_addr);
      Begin_root (a);
        res = alloc_small(2, 1);
        Field(res,0) = a;
        Field(res,1) = Val_int(ntohs(adr->s_inet.sin_port));
      End_roots();
      break;
    }
#ifdef HAS_IPV6
  case AF_INET6:
    { value a = alloc_inet6_addr(&adr->s_inet6.sin6_addr);
      Begin_root (a);
        res = alloc_small(2, 1);
        Field(res,0) = a;
        Field(res,1) = Val_int(ntohs(adr->s_inet6.sin6_port));
      End_roots();
      break;
    }
#endif
  default:
    unix_error(EAFNOSUPPORT, "", Nothing);
  }
  return res;
}



Note that if the sa_family doesn't match any case, the default is to set the
unix error to "EAFNOSUPPORT" which is exactly the error that is seen in
mldonkey.   Now I'm not sure exactly what happens when unix_error is called,
but if alloc_sockaddr doesn't return in unix_accept that would keep retcode
from being put into the result of unix_accept which would mean that there
would be no way in mldonkey to access the file descriptor to close it when
this error occurs.   If this is true, the problem has to be corrected in the
OCAML library by closing retcode if alloc_sockaddr gets an error.



On the other hand if alloc_sockaddr does return after calling unix_error and
unix_accept does fill in the return value with the file descriptor (retcode),
then the mldonkey code would have to "close s" in the error part of the
routine and I don't know OCAML well enough to figure out how to get to s to
close it.



So, bottom line is that it may be that the OCAML library needs to be modified
to close the file descriptor in retcode if alloc_sockaddr gets an error.
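The proposed correction can be sketched in plain C, assuming (as the first scenario above supposes) that the error path never hands the new descriptor back to the caller. This is not the OCAML runtime code and not necessarily how the library was eventually changed; accept_checked and family_supported are hypothetical names, and the set of accepted families is an assumption for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return 1 if the family is one the caller knows how to convert. */
int family_supported(int family)
{
    switch (family) {
    case AF_UNIX:
    case AF_INET:
#ifdef AF_INET6
    case AF_INET6:
#endif
        return 1;
    default:
        return 0;
    }
}

/* accept() wrapper: when the peer address cannot be converted, close the
   freshly accepted descriptor before reporting EAFNOSUPPORT, so that it
   is never left open with nobody holding a reference to it. */
int accept_checked(int listen_fd, struct sockaddr_storage *addr,
                   socklen_t *len)
{
    assert(addr != NULL && len != NULL);
    memset(addr, 0, sizeof *addr);   /* family reads as AF_UNSPEC if unset */
    *len = sizeof *addr;
    int fd = accept(listen_fd, (struct sockaddr *)addr, len);
    if (fd == -1)
        return -1;                   /* errno was set by accept() */
    if (!family_supported(addr->ss_family)) {
        close(fd);                   /* the step that appears to be missing */
        errno = EAFNOSUPPORT;
        return -1;
    }
    return fd;
}
```

The point of the ordering is that the descriptor is closed while it is still known locally; once the error has been raised, the caller has no value left to close.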



Shunga

-------------------------------------------------------
Date: Thu 01/06/2005 at 16:55       By: spiralvoice <spiralvoice>
Just my observation: with 2-5-28i and this patch I always got LowIDs,
without it HighIDs.

-------------------------------------------------------
Date: Thu 01/06/2005 at 14:47       By: Shunga <shunga>
Adding the begin/end around the "try" in tcp_handler fixed the compilation
warning but didn't change anything.  It still appears that when "close t" is
performed the socket for port 4662 is closed, and from that point on,
because that port is now closed, future server connections get a low ID.   The
error in the log "Exception tcp_handler:  failed: Address family not supported
by protocol family" appears to be an error due to a Unix connect call and not
due to the accept call, although when I added a Unix.Unix_error match to the
try, Unix.Unix_error was never matched, only "e".  What I would like to try
but don't know how to do is, when the error occurs, to close the socket "s" that
was created by the Unix.accept call instead of closing t (which appears to be
the socket for port 4662).  I expect that whatever the problem is, it is not
actually in tcp_handler, but somewhere else and perhaps related to a connect
call.



Shunga



-------------------------------------------------------
Date: Thu 01/06/2005 at 14:06       By: Anonymous
I compiled the tag-2-5-29ab source that you pointed to that schlumpf provided
with no changes.  After I posted the bug and the "try" suggestion to
illustrate the error, I did notice the compilation warning and added the
begin/end combination just as you did in the patch.  As I recall, it didn't
change anything, but I'll try it once more to make sure using the
tag-2-5-29ab source.



Shunga

-------------------------------------------------------
Date: Thu 01/06/2005 at 04:55       By: Amorphous <amorphous>
did you try that with or without the patch i posted in the forum linked in my
last comment to this bug? if without please try with it applied. (oh and no
need to message me through savannah i get notified on changes of bugs i
posted to)



-------------------------------------------------------
Date: Thu 01/06/2005 at 03:33       By: Shunga <shunga>
When I originally suggested the "try" in tcp_handler to illustrate the error,
I noticed the following issue, which also occurs with the patch in 2.5.29ab.
When the error occurs and the "close t" is issued, that close apparently
closes the socket on port 4662.  After that occurs, it is true that there are
no more orphaned file descriptors; however, it appears that any additional
connections to servers result in a lowid.  For example in my console log,
after the error occurs, I start seeing the following when attaching to new
servers:



+-- From server  [193.41.142.148:10000] ------

| WARNING : You have a lowid. Please review your network config and/or your
settings.



+-- From server DonkeyServer No6  [62.241.53.4:4242] ------

| WARNING : You have a lowid. Please review your network config and/or your
settings.



+-- From server www.MESSENGER7.NET [205.209.178.170:12933] ------

| WARNING : Your 4662 port is not reachable. Please review your network
config.

| server version 17.1 (lugdunum)



Before the error occurred, servers did not report lowid.



Shunga.



-------------------------------------------------------
Date: Wed 01/05/2005 at 18:53       By: Amorphous <amorphous>
it's in the svn repository mentioned in another thread in that forum-group. i
added a link to an archive of the source, schlumpf provided.

-------------------------------------------------------
Date: Wed 01/05/2005 at 09:34       By: Anonymous
I would like to try this but with what? I read the post at
http://mldonkey.berlios.de/modules.php?name=Forums&file=viewtopic&t=3201&sid=6c52a2530f6046d72fdfbbb94c0c1d72
and I have looked at the CVS-page but I don't know where the/which source to
download - where is the 29ab-version?


-------------------------------------------------------
Date: Wed 01/05/2005 at 08:23       By: Amorphous <amorphous>
could you confirm if this is fixed with 2.5.29ab? see
http://mldonkey.berlios.de/modules.php?name=Forums&file=viewtopic&t=3201&sid=6c52a2530f6046d72fdfbbb94c0c1d72

-------------------------------------------------------
Date: Thu 12/23/2004 at 20:51       By: Shunga <shunga>
Programmer asleep at the switch :-).  I was wrong about some other software
change.  Turns out tcp_handler does fail if changed as follows:



let tcp_handler t sock event =
  match event with
  | CAN_READ
  | CAN_WRITE ->
      try
        let s,id = Unix.accept (fd sock) in
        if !verbose_bandwidth > 1 then
          lprintf "[BW2 %6d] accept on %s\n" (last_time ()) t.name;
        (match t.accept_control with
           None -> ()
         | Some cc ->
             cc.nconnections_last_second <- cc.nconnections_last_second + 1);
        incr nconnections_last_second;
        t.event_handler t (CONNECTION (s,id))
      with  e ->
        lprintf "Exception tcp_handler: %s\n" (Printexc2.to_string e);
        close t (Closed_for_error (Printexc2.to_string e));
        raise e
  | _ -> t.event_handler t (BASIC_EVENT event)



and it leaves one socket orphaned which I assume is "s".  I don't know how to
get "s" down into the "with e ->" so that it can be closed and I don't know if
I need to "close t" as is indicated in the code, which at the moment is closing
one of the server sockets that is being listened to.  However, with this
change "mlnet" has run for hours with only one orphaned socket.  This plus
the commonChat change should get rid of the orphaned sockets.  I'll leave it
up to the experts to figure out what is really going on and how to best fix
it.

-------------------------------------------------------
Date: Thu 12/23/2004 at 20:03       By: Shunga <shunga>
Well it would appear that the failure in tcp_handler was due to some other
change that I must have made while attempting to debug this.  When I start
over with a fresh copy of the source the handler doesn't fail and the file
descriptors start building up.



Guess I have to go back and see if I can figure out what else it was that I
changed.  :-(







    _______________________________________________________

Carbon-Copy List:

CC Address                          | Comment
------------------------------------+-----------------------------
hgd                                 | 
address@hidden                    | 




==============================================================================

This item URL is:
  <http://savannah.nongnu.org/bugs/?func=detailitem&item_id=11384>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.nongnu.org/




