Re: [Bug-wget] Concurrency and wget
From: Tim Ruehsen
Subject: Re: [Bug-wget] Concurrency and wget
Date: Tue, 3 Apr 2012 11:17:56 +0200
User-agent: KMail/1.13.7 (Linux/3.2.0-2-amd64; KDE/4.7.4; x86_64; ; )
Hi Giuseppe, hi Micah,
While I couldn't sleep last night, I thought about wget and concurrency...
I had the idea of using a top-down approach to outline what wget is doing,
just to get an overview without struggling with the details of the
implementation. As a side effect, we would have a (textual? graphical?)
starting point for contributors to jump into the project, and a chance at a
clear and well-documented design.
Since maintaining a flowchart is time-consuming and requires some extra
skills and tools, plain text in the form of a "programming language"
(pseudocode) seems to fit.
Here is just a beginning, let's say a basis for discussion.
If you don't mind, I would like to take part in the ongoing development.
Basic wget functionality (download given URI/IRI):
main (URI) {
    put <URI> into <queue>
    while <queue> is not empty {
        download_and_analyse(next <queue> entry)
    }
}
download_and_analyse (URI) {
    download URI to FILE
    add URI to <downloaded>
    remove URI from <queue>
    scan FILE and add URIs to <queue> if not already in <downloaded>
}
Extended for simple multitasking (threads, multiple processes, or even
distributed processes). This is just one possible design for concurrent
downloads; maybe you have a more elegant idea.
main (URI) {
    create <N> downloaders
    put <URI> into <queue>
    wait for status message from downloader {
        print status
        if <queue> is empty {
            stop downloaders
            we are done
        }
    }
}

downloader {
    wait for and allocate entry in <queue> {
        download_and_analyse(entry)
    }
}
download_and_analyse (URI) {
    download URI to FILE
    add URI to <downloaded>
    remove URI from <queue>
    scan FILE and add URIs to <queue> if not already in <downloaded>
}
Extended to download a URI from several sources in parallel.
main() and downloader() stay the same; only download_and_analyse() is extended.
download_and_analyse (URI) {
    /* download URI to FILE */
    put <X> chunk entries into <chunk_queue>
    create <X> chunk_loaders
    wait for status message from chunk_loader {
        send modified status message to main
        if <chunk_queue> is empty {
            stop chunk_loaders
            end loop
        }
    }
    add URI to <downloaded>
    remove URI from <queue>
    scan FILE and add URIs to <queue> if not already in <downloaded>
}
chunk_loader {
    wait for and allocate entry in <chunk_queue> {
        download(entry)
        remove entry from <chunk_queue>
    }
}
After some iterations we should come to a point where we can make further
decisions:
- how to implement concurrency (threads, processes, distributed processes,
  (cloud))
- how to implement communication between tasks
- is a wget rewrite reasonable?
- which existing code to recycle?
- creating libraries from existing code (e.g. libwget) or using external
  libraries (e.g. for network stuff, parsing and creating URIs/IRIs, etc.)
- creating a list of test code, especially for the library code
- ... etc etc ...
Tim