************************************************ * Hget - Web download manager using HarvestMan * * * ************************************************ About ----- Hget is a web downloader written on top of HarvestMan framework. Hget allows one to download file(s) from the Internet quickly by using HTTP multipart, HTTP resume and mirror search features. Hget is not distributed as a separate program. It is part of the HarvestMan source distribution, but works as a separate application. Features -------- Hget 1.0 alpha has the following features. 1. Download URLs from http, https and ftp sites. 2. Download URLs in mutliple parts from HTTP sites which support byte-range downloads. Each piece gets downloaded in its own thread and the final file is assembled by combining the pieces. (multipart download is not available for FTP urls). 3. Mirror downloading for sourceforge.net URLs 4. Mirror search algorithm - Search for common application downloads in mirror search sites and perform multipart downloads. 5. HTTP resuming - Suspend a download and resume it from where it left off later. 6. Mirror failover - For mirror downloads, automatically picks new mirrors if a mirror fails, without interrupting the download. 7. Arbitrary mirror support - Supply "mirror files" to the program for downloading a file from arbitrary mirror URLs. 8. HTTP compression - Supports HTTP compression if the server supports it. 9. Basic Authentication - Support for HTTP basic authentication for URLs requiring authentication. 10. Automatic redirection - Automatic redirection for URLs and mirror URLs if server performs redirection. 11. Auto renaming - Renames the output file smartly depending on redirection, existence of file with the same name etc. Author ------ Anand B Pillai License ------- Hget is released under GNU GPL. See file LICENSE.TXT Version ------- 1.0 alpha NOTE: The entire HarvestMan distribution is currently at 2.0 alpha version. The hget application and related modules which are also part of HarvestMan distribution however are more recent and has a separate version number. WWW --- http://harvestmanontheweb.com Installation ------------ There is only a single installation step for the entire HarvestMan package. See section "How to Install" in Readme.txt for details. Running the program ------------------- Hget requires all options to be passed on the command line. There is no configuration file as in the case of HarvestMan for Hget. If you run Hget without any options, you will see the usage printed. $ hget Usage: /usr/lib/python2.5/site-packages/harvestman/apps/hget.py [options] URL(s) | file(s) Hget 1.0 beta 1: A multithreaded web downloader based on HarvestMan. Author: Anand B Pillai The program accepts URL(s) or an input file(s) containing a number of URLs, one per line. If a file is passed as input, any other program option passed is applied for every URL downloaded using the file. Mail bug reports and suggestions to . Options: -h, --help show this help message and exit -v, --version Print version information and exit -V, --verbose Be verbose -s, --single Single thread mode. If enabled, won't attempt to do multithreaded downloads using byte-range headers -r, --noresume If set, will not try to resume partial downloads -X PROXYSERVER, --proxy=PROXYSERVER Enable and set proxy to PROXYSERVER (host:port) -U USERNAME, --proxyuser=USERNAME Set username for proxy server to USERNAME -W PASSWORD, --proxypass=PASSWORD Set password for proxy server to PASSWORD -u USERNAME, --username=USERNAME Use username USERNAME for HTTP (basic) authentication -p PASSWORD, --passwd=PASSWORD Use password PASSWORD for HTTP (basic) authentication -P NUMPARTS, --numparts=NUMPARTS Force-split download into parts (max 10) -m, --inmem Keep data in memory instead of flushing to disk (Warning: use with care as this might exhaust memory for huge file downloads!) -n, --notemp Use current directory instead of system temp directory for saving intermediate files -f MIRRORFILE, --mirrorfile=MIRRORFILE Load mirror information for the URL(s) from MIRRORFILE (The file must contain a list of valid mirrors for the URL, one per line) -i RELPATHINDEX, --relpathidx=RELPATHINDEX When loading mirrors, use the given index to calculate the relative path of the original URL (If given, the relative path of the original URL will be offset by this value) -N, --norelpath When loading mirrors, do not compute mirror URLs using relative-path (Instead just appends the filename to the mirror URL) -o FILE, --output=FILE Save document to FILE -d DIRECTORY, --outputdir=DIRECTORY Save document to directory Downloading file(s) ------------------ Hget can download files in single thread (single part) and in multiple threads for HTTP downloads (HTTP multipart). We will discuss single part downloads (Simple) and multipart downloads (Advanced) here. By default Hget saves intermediate data in temporary files in $TEMPDIR/harvestman folder where $TEMPDIR is the system temporary directory. For example for UNIX and UNIX like systems this is normally /tmp/harvestman . For Win32 systems it is typically %WINDRIVE%/Documents and Settings/%USER%/Local Settings/Temp/harvestman folder. Hget can also keep the intermediate data in memory. However if this option is used, features like resuming (discussed below) etc will not work. Simple ------ The simplest way to use hget is to simply pass a URL to it. Example, $ hget www.python.org Downloading URL www.python.org ... Connecting to http://www.python.org... Length: 16083 (15K) Type: text/html Content Encoding: plain 12.60K/s (1) eta: 0:00:00 [==================================================>][100%] Saved to index.html. 16083 bytes downloaded in 0:00:03 hours at an average of 4.55 kb/s. Hget displays a progress bar while the download is in progess. This progress bar shows download information on the LHS. This shows the following information, () For example, here is a snapshot of a download in progress. $ hget http://www.python.org/ftp/python/2.5/Python-2.5.tar.bz2 Downloading URL http://www.python.org/ftp/python/2.5/Python-2.5.tar.bz2 ... Connecting to http://www.python.org... Length: 9357099 (8M) Type: application/x-tar Content Encoding: plain 31.89K/s (1) eta: 0:04:32 [=> ]( 4%) The progress bar here means that, * The current download speed is 31.89 KB per second * There is 1 thread doing the download. * The estimated time of completion is 4 minutes and 32 seconds. * So far 4% of the file has been downloaded. HTTP Resuming ------------- If you are downloading a file from an HTTP or HTTPS URL and if the download is interrupted due to some reason (explicit Ctrl-C, killing the process, power shutdown, timeouts etc), then the next time you try the same download, Hget will try to resume from where it left off, if the temporary files of the previous download are not deleted from the temporary folder. This will work in most HTTP downloads since almost every HTTP server supports resuming. Example, $ hget http://www.python.org/ftp/python/2.5/Python-2.5.tar.bz2 Downloading URL http://www.python.org/ftp/python/2.5/Python-2.5.tar.bz2 ... Connecting to http://www.python.org... Length: 9357099 (8M) Type: application/x-tar Content Encoding: plain Caught keyboard interrupt... [========> ]( 18%) Download aborted by user interrupt. Cleaning up temporary files... Waiting for threads to finish... Done. The above download was interrupted using an explicit kill (Ctrl-C). If you try the same download again, $ hget http://www.python.org/ftp/python/2.5/Python-2.5.tar.bz2 Downloading URL http://www.python.org/ftp/python/2.5/Python-2.5.tar.bz2 ... Temporary files from previous download found, trying to resume download... Resuming download... Connecting to http://www.python.org... Length: 7636779 (7M) Type: application/x-tar Content Encoding: plain 17.65K/s (1) eta: 0:06:59 [> ]( 0%) Hget now resume the download from where it left off. The earlier download had completed downloading 1720320 bytes, so only the remaining is downloaded. HTTP resuming works at more than one level. For example, if the above download is interrupted again, the temporary files are stored again and next time download is resumed from that point. At the end of download, the current piece is combined with all earlier pieces in order, to produce the final file. NOTE: HTTP resuming works only if Hget is run in temp-file mode. If run in in-memory mode using the -m or --inmem flag, all temporary data is lost at the end of the program and resuming will not happen. NOTE: If you don't want to resume an interrupted download for some reason, pass the -r flag. NOTE: Resuming does not work for FTP downloads. Advanced -------- Multipart downloads ------------------- Hget can perform HTTP multipart (downloading a file in many pieces) using HTTP/1.0 byte-range requests, if the server supports it. This is done by using the -P or --numparts option and supplying an argument equal to the number of parts you want the file split to. NOTE: The numparts has a maximum value of 20. Any value greater than 20 is truncated to 20. If the server supports multipart downloading, Hget will split the file to the number of pieces requested and schedule each piece to be downloaded it its own thread. After every thread finishes its piece,the final file is produced by combining the pieces. For example, $ hget -P5 http://www.python.org/ftp/python/2.5/Python-2.5.tar.bz2 Force-split switched on, resuming will be disabled Downloading URL http://www.python.org/ftp/python/2.5/Python-2.5.tar.bz2 ... Connecting to http://www.python.org... Length: 9357099 (8M) Type: application/x-tar Content Encoding: plain Forcing download into 5 parts Server supports multipart downloads Trying multipart download... 113.86K/s (5) eta: 0:01:11 [====> ]( 11%) Towards the end of download, $ hget -P5 http://www.python.org/ftp/python/2.5/Python-2.5.tar.bz2 Force-split switched on, resuming will be disabled Downloading URL http://www.python.org/ftp/python/2.5/Python-2.5.tar.bz2 ... Connecting to http://www.python.org... Length: 9357099 (8M) Type: application/x-tar Content Encoding: plain Forcing download into 5 parts Server supports multipart downloads Trying multipart download... 78.24K/s (1) eta: 0:00:00 [==================================================>][100%] Download of http://www.python.org/ftp/python/2.5/Python-2.5.tar.bz2 is complete... Data download completed. Saved to Python-2.5.tar.bz2. 9357099 bytes downloaded in 0:01:59 hours at an average of 76.31 kb/s. NOTE: (5) indicates that currently 5 threads are downloading the file. The progress bar indicates the progress of the complete file, i.e sum of all the pieces. NOTE: When a file is split-downloaded, resuming is disabled, since we do not know whether the same number of parts will be requested in a second download of the same file. So if a split-download is interrupted, all temporary files are cleaned up. NOTE: Hget currently do not support multipart downloads for FTP URLs. So if you request multipart for FTP URLs, it is ignored. $ hget -P5 ftp://ftp.gnu.org/gnu/emacs/emacs-21.4a.tar.gz Force-split switched on, resuming will be disabled Downloading URL ftp://ftp.gnu.org/gnu/emacs/emacs-21.4a.tar.gz ... Connecting to ftp://ftp.gnu.org... Length: 20403483 (19M) Type: application/x-tar Content Encoding: plain FTP request, not trying multipart download, defaulting to single thread 56.58K/s (1) eta: 0:05:45 [> ]( 1%) Sourceforge downloads --------------------- If a URL is a download from sourceforge.net servers and multipart is requested, Hget splits the download across sourceforge.net mirrors instead of downloading all pieces from the same mirror. This is automatically done. To see the mirror information the -V option can be used. For example, $ hget -P5 -V http://downloads.sourceforge.net/numpy/numpy-1.0.4.tar.gz Force-split switched on, resuming will be disabled Downloading URL http://downloads.sourceforge.net/numpy/numpy-1.0.4.tar.gz ... Connecting to http://downloads.sourceforge.net... Redirected to http://jaist.dl.sourceforge.net/sourceforge/numpy/numpy-1.0.4.tar.gz... Length: 1547541 (1M) Type: application/x-gzip Content Encoding: plain Forcing download into 5 parts Server supports multipart downloads Trying multipart download... Splitting download across mirrors... Worker-4: Downloading url http://easynews.dl.sourceforge.net/sourceforge/sourceforge/numpy/numpy-1.0.4.tar.gz, byte range(0 - 309507) Worker-2: Downloading url http://internap.dl.sourceforge.net/sourceforge/sourceforge/numpy/numpy-1.0.4.tar.gz, byte range(309508 - 619015) Worker-1: Downloading url http://superb-east.dl.sourceforge.net/sourceforge/sourceforge/numpy/numpy-1.0.4.tar.gz, byte range(619016 - 928523) Worker-3: Downloading url http://superb-west.dl.sourceforge.net/sourceforge/sourceforge/numpy/numpy-1.0.4.tar.gz, byte range(928524 - 1238031) Worker-5: Downloading url http://ufpr.dl.sourceforge.net/sourceforge/sourceforge/numpy/numpy-1.0.4.tar.gz, byte range(1238032 - 1547540) 59.37K/s (5) eta: 0:00:18 [============> ]( 26%) Mirror search ------------- Hget supports a rudimentary mirror search algorithm. Mirror search is specified by the -M flag. It is useful only for multipart downloads. If you are trying to download a large file and you want to see if the file is available in public mirrors, it can be searched for in mirror search sites. Such sites return URL information for file mirrors. Hget supports searching in the mirror search site www.findfiles.com . This site provides mirror URLs for most projects of the Apahce foundation and also many other popular downloads. To perform mirror search and download, specify the -M flag and the -P flag with the number of pieces requested. If no mirror URLs are found for this file, then the download is aborted. For example, $ hget -M -P5 ftp://ftp.gnu.org/gnu/emacs/emacs-21.4a.tar.gz Force-split switched on, resuming will be disabled Mirror search turned on Downloading URL ftp://ftp.gnu.org/gnu/emacs/emacs-21.4a.tar.gz ... Connecting to ftp://ftp.gnu.org... Length: 20403483 (19M) Type: application/x-tar Content Encoding: plain Server supports multipart downloads Trying multipart download... Splitting download across mirrors... Searching mirrors for emacs-21.4a.tar.gz... Searching site http://www.findfiles.com for mirror URLs... Length: 20403483 (19M) Type: application/x-tar Content Encoding: plain Server supports multipart downloads Trying multipart download... Splitting download across mirrors... Cannot search for new mirrors No mirrors found Download of URL ftp://ftp.gnu.org/gnu/emacs/emacs-21.4a.tar.gz not completed. However if mirrors are found, the download proceeds. If the number of mirror URLs found are less than the number of parts requested, download proceeds with those URLs only. For example, $ hget -M -P5 http://apache.mirrors.timporter.net/ant/binaries/apache-ant-1.7.0-bin.tar.gz Force-split switched on, resuming will be disabled Mirror search turned on Downloading URL http://apache.mirrors.timporter.net/ant/binaries/apache-ant-1.7.0-bin.tar.gz ... Connecting to http://apache.mirrors.timporter.net... Length: 8958777 (8M) Type: application/x-gzip Content Encoding: plain Server supports multipart downloads Trying multipart download... Splitting download across mirrors... Searching mirrors for apache-ant-1.7.0-bin.tar.gz... Searching site http://www.findfiles.com for mirror URLs... 2 mirror URLs found, queuing them for multipart downloads... 10.67K/s (2) eta: 0:12:55 [=> ]( 5%) If mirror search returns only a single mirror, then download is aborted, since a single mirror is equivalent to downloading as a single part anyway. For example, $ hget -M -P5 http://www.apache.org/dyn/closer.cgi/maven/binaries/apache-maven-2.0.8-bin.tar.bz2 Force-split switched on, resuming will be disabled Mirror search turned on Downloading URL http://www.apache.org/dyn/closer.cgi/maven/binaries/apache-maven-2.0.8-bin.tar.bz2 ... Connecting to http://www.apache.org... Length: (Unknown) Type: text/html Content Encoding: plain Server supports multipart downloads Trying multipart download... Splitting download across mirrors... Searching mirrors for apache-maven-2.0.8-bin.tar.bz2... Searching site http://www.findfiles.com for mirror URLs... 1 mirror URLs found, queuing them for multipart downloads... Only single mirror found Download of URL http://www.apache.org/dyn/closer.cgi/maven/binaries/apache-maven-2.0.8-bin.tar.bz2 not completed. Also, for sourceforge.net URLs, if mirror search is specified, it is ignored, and mirror search algorithm defaults to mirroring among the sourceforge.net mirror servers. Mirror files ------------ Mirror URLs can be specified in a file and download can be scheduled among them. However this option is not of a lot of use, since the user has to do a lot of work in finding the mirror URLs and putting them in a file. Some sample mirror files are available in the "apps" folder. The mirror files can be used with options -f, -i and -N. Other options ------------- Here are the other relevant options. 1. -n : If you dont want to use the system temp folder for any reason (folder full etc), this option can be used to save temporary files to the current directory. 2. -X, -U, -W: Proxy server options. If your network goes through a proxy, then use these options to make hget request downloads through the proxy. 3. -u, -p: Username and password for HTTP basic authentication. This can be used for URLs requiring HTTP Basic Auth. 4. -m : For some reason, if you dont want to create temporary files at all, use this option to keep all intermediate data in-memory. Note that this could possibly produce out of memory errors for large data downloads, since everything is kept in memory. If enabling this, make sure you have ample RAM. 5. -d: Save file to another directory instead of the current one. 6. -o: Save file in a different filename instead of the one determined by the URL. 7. -V: Be more verbose.