Re: [Bug-wget] Concurrency and wget
From: Tim Ruehsen
Subject: Re: [Bug-wget] Concurrency and wget
Date: Tue, 3 Apr 2012 11:17:56 +0200
User-agent: KMail/1.13.7 (Linux/3.2.0-2-amd64; KDE/4.7.4; x86_64; ; )
Hi Giuseppe, hi Micah,
While I couldn't sleep last night, I thought about wget and concurrency...
I had the idea of using a top-down approach to outline what wget is doing,
just to get an overview without struggling with the details of the
implementation. As a side effect, we would have a (textual? graphical?)
starting point for contributors to jump into the project, and a chance at a
clear and well-documented design.
Since maintaining a flowchart is time-consuming and requires some extra
skills and tools, plain text in the form of a "programming language"
(pseudocode) seems to fit.
Here is just a beginning, let's say a basis for discussion.
If you don't mind, I would like to take part in the ongoing development.
Basic wget functionality (download given URI/IRI):
main (URI) {
    put <URI> into <queue>
    while <queue> is not empty {
        download_and_analyse(next <queue> entry)
    }
}
download_and_analyse (URI) {
    download URI to FILE
    add URI to <downloaded>
    remove URI from <queue>
    scan FILE and add URIs to <queue> if not already in <downloaded>
}
Extended for simple multitasking (threads, multiple processes, or even
distributed processes). This is just one possible design for concurrent
downloads; maybe you have a more elegant idea.
main (URI) {
    create <N> downloaders
    put <URI> into <queue>
    wait for status message from downloader {
        print status
        if <queue> is empty {
            stop downloaders
            we are done
        }
    }
}

downloader {
    wait for and allocate entry in <queue> {
        download_and_analyse(entry)
    }
}
download_and_analyse (URI) {
    download URI to FILE
    add URI to <downloaded>
    remove URI from <queue>
    scan FILE and add URIs to <queue> if not already in <downloaded>
}
Extended to download a URI from several sources in parallel.
main() and downloader() stay the same; only download_and_analyse() is extended.
download_and_analyse (URI) {
    /* download URI to FILE */
    put <X> chunk entries into <chunk_queue>
    create <X> chunk_loaders
    wait for status message from chunk_loader {
        send modified status message to main
        if <chunk_queue> is empty {
            stop chunk_loaders
            end loop
        }
    }
    add URI to <downloaded>
    remove URI from <queue>
    scan FILE and add URIs to <queue> if not already in <downloaded>
}
chunk_loader {
    wait for and allocate entry in <chunk_queue> {
        download(entry)
        remove entry from <chunk_queue>
    }
}
After some iterations we should come to a point where we can make further
decisions:
- how to implement concurrency (threads, processes, distributed processes,
  (cloud))
- how to implement communication between tasks
- is a wget rewrite reasonable?
- which existing code to recycle?
- creating libraries from existing code (e.g. libwget) or using external
  libraries (e.g. for network stuff, parsing and creating URIs/IRIs, etc.)
- creating a list of test code, especially for the library code
- ... etc etc ...
Tim