
[Wget-dev] wget2 | Restricting domains with host-spanning does not work


From: 一郎
Subject: [Wget-dev] wget2 | Restricting domains with host-spanning does not work (#483)
Date: Fri, 18 Oct 2019 19:19:45 +0000


一郎 created an issue: https://gitlab.com/gnuwget/wget2/issues/483



I'm running this command in the hope of crawling subdomains under kedo.gov.cn:

`wget2 -r -w 8 --filter-mime-type="text/html" -a wget_log -H -D kedo.gov.cn http://www.kedo.gov.cn`

As I understand it, when the two are combined, `-H` enables host-spanning and
`-D` restricts the crawl to the listed domains. However, after a minute of
running, I end up with the following folder structure:

```
.
├── story.kedo.gov.cn
│   ├── index.html
│   ├── stories
│   │   └── kxr
│   │       └── index.html
│   └── story
│       └── legend
│           └── classics
│               └── index.html
├── wget_log
├── www.kedo.gov.cn
│   └── index.html
└── www.kepuchina.cn
    ├── index.html
    └── public
        └── 201710
            └── t20171031_253123.shtml
```

While the `www.kedo.gov.cn` and `story.kedo.gov.cn` folders and their contents
are what I want, the `www.kepuchina.cn` folder is *not*. It should clearly be
excluded by `-D`. I'm familiar with these two flags from the original `wget`
documentation, and have used them in the past.
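
To make the expectation concrete, here is a minimal Python sketch of the host
check I assume `-D` should perform: a host is accepted if it equals a listed
domain or is a subdomain of one. This is only an illustration of the expected
behavior, not wget2's actual implementation; `host_allowed` is a hypothetical
helper, and the real tool may match suffixes differently.

```python
from urllib.parse import urlsplit


def host_allowed(url, domains):
    """Return True if the URL's host equals, or is a subdomain of, a listed domain.

    Hypothetical helper: illustrates the assumed -D semantics only.
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    for domain in domains:
        domain = domain.lower().strip(".")
        # Require a dot boundary so "notkedo.gov.cn" does not match "kedo.gov.cn".
        if host == domain or host.endswith("." + domain):
            return True
    return False


if __name__ == "__main__":
    allowed = ["kedo.gov.cn"]
    for url in (
        "http://www.kedo.gov.cn/index.html",
        "http://story.kedo.gov.cn/index.html",
        "http://www.kepuchina.cn/index.html",
    ):
        print(url, host_allowed(url, allowed))
```

Under this assumption, the two `kedo.gov.cn` hosts pass the check and
`www.kepuchina.cn` does not, which is the result I expected from the crawl.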

How do I get wget2 to honor `-D`?
