Re: [Bug-wget] Overly permissive hostname matching


From: Jeffrey Walton
Subject: Re: [Bug-wget] Overly permissive hostname matching
Date: Wed, 19 Mar 2014 11:26:41 -0400

On Wed, Mar 19, 2014 at 10:59 AM, Daniel Kahn Gillmor
<address@hidden> wrote:
> On 03/19/2014 06:19 AM, Tim Ruehsen wrote:
>> As a programmer, I want to have control. E.g. the option to load from a
>> different file, or to switch off loading. Why ? e.g. for testing purposes, or
>> simply imagine a "swiss army knife" client for experts - maybe they want to
>> have control via CLI args. Or you are in a controlled environment and simply
>> don't want to waste CPU cycles when downloading a single file from a trusted
>> server. Just some examples.
>> And then, clients like Wget would like to have access, at least for checking
>> cookies.
>
> i understand, and i think we're probably not disagreeing -- you want the
> ability to control it; i want sane defaults so that people who don't
> touch it get sensible behavior.
>
>> I just took a quick look but I am not sure about the API (i did not have this
>> 'aha' effect). But what I don't like is the dependency on PHP which is used
>> to 'compile' the PSL before the C functions can use it. While the idea of
>> compilation/preprocessing is a good one, it should at least be optional.
>
> pre-compilation/preprocessing is probably a reasonable performance
> optimization for heavy use; we might even want a C library to embed a
> precompiled version of the most recent known list at time of
> compilation, so that it can be used with no initialization step or when
> no file is available.
This may help with seeding thoughts for an implementation. I'm
fortunate because I work in C++.

I have a 'precooked' list with "com", "mil", ... "ak.us", "co.uk",
etc. One entry per line.

There can be multiple dots. For example, "sekikawa.niigata.jp".

I read the list into a vector, sort it once in O(n log n), and then get
O(log n) binary-search lookups for the lifetime of the program. I pay
the cost of the sort because I make frequent lookups.
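
A minimal sketch of the sorted-vector lookup described above. The
function names load_suffixes and is_public_suffix are hypothetical, and
the sketch assumes the list arrives as newline-separated lowercase text:

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split newline-separated text into one suffix per
// entry, skip empty lines, and sort once so later lookups can bisect.
std::vector<std::string> load_suffixes(const std::string& text) {
    std::vector<std::string> out;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) out.push_back(line);
    }
    std::sort(out.begin(), out.end());  // O(n log n), paid once
    return out;
}

// Hypothetical helper: O(log n) membership test on the sorted vector.
bool is_public_suffix(const std::vector<std::string>& sorted,
                      const std::string& name) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    return std::binary_search(sorted.begin(), sorted.end(), name);
}
```

The vector is built once at startup and then passed by const reference
to every lookup.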

When I match names with wildcards, I take a DNS name like
*.example.com, change it to example.com, and see if it's on the list
(and therefore banned). It's a simple algorithm but it's effective.
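
The wildcard check above could look roughly like this. It is only a
sketch: wildcard_is_banned is a hypothetical name, and the code assumes
the name has already been lowercased and the suffix vector is sorted:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns true when dns_name is a wildcard such as "*.co.uk" whose
// remainder (after dropping the leading "*.") is on the sorted suffix
// list. Names without a leading "*." are never flagged by this check.
bool wildcard_is_banned(const std::vector<std::string>& sorted_suffixes,
                        const std::string& dns_name) {
    assert(std::is_sorted(sorted_suffixes.begin(), sorted_suffixes.end()));
    if (dns_name.compare(0, 2, "*.") != 0) {
        return false;  // not a wildcard name
    }
    const std::string base = dns_name.substr(2);  // "*.example.com" -> "example.com"
    return std::binary_search(sorted_suffixes.begin(),
                              sorted_suffixes.end(), base);
}
```

With a list containing "co.uk" and "com", the name "*.com" is flagged
while "*.example.com" is not.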

I embed the list in my executable with GNU's assembler (*.S file). It's
essentially a string with both a length and a NUL terminator:

    /* eff_tld_list.S (C-style comments, since a .S file goes through
       the C preprocessor and ';' is not a comment on every GAS target) */
    .section .rodata

    /* Mozilla's Effective TLD list */
    .global eff_tld_list
    .type   eff_tld_list, @object
    .align  8
eff_tld_list:
eff_tld_list_start:
    .incbin "res/eff_tld_list.lst"
eff_tld_list_end:
    .byte 0

    /* The string's size (if needed) */
    .global eff_tld_list_size
    .type   eff_tld_list_size, @object
    .align  4
eff_tld_list_size:
    .int    eff_tld_list_end - eff_tld_list_start
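
On the C++ side, the embedded blob might be consumed along these lines.
This is a sketch under assumptions: the two extern declarations use
guessed types (the size is emitted with .int, taken here to be a 32-bit
unsigned int), and split_lines is a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Symbols exported by eff_tld_list.S. Declarations only; the types are
// assumptions, not taken from the original program.
extern "C" const char eff_tld_list[];
extern "C" const unsigned int eff_tld_list_size;

// Hypothetical helper: split a newline-separated buffer of known length
// into one string per non-empty line.
std::vector<std::string> split_lines(const char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    std::vector<std::string> lines;
    std::size_t start = 0;
    for (std::size_t i = 0; i <= size; ++i) {
        if (i == size || data[i] == '\n') {
            if (i > start) lines.emplace_back(data + start, i - start);
            start = i + 1;
        }
    }
    return lines;
}
```

When linked against the assembled object file, the call would be
split_lines(eff_tld_list, eff_tld_list_size).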

Below is the script I use to fetch Mozilla's list.

Jeff

**********

#! /bin/bash

MOZILLA_LIST=eff_tld_list.lst

wget "http://publicsuffix.org/list/effective_tld_names.dat" -O $MOZILLA_LIST

# Remove comments
sed "/^\/\//d" $MOZILLA_LIST > temp-1.txt
mv temp-1.txt $MOZILLA_LIST

# Remove empty lines
sed "/^$/d" $MOZILLA_LIST > temp-2.txt
mv temp-2.txt $MOZILLA_LIST

# Strip the leading "!" from exception rules (the rest of the line is kept)
sed "s/^!//g" $MOZILLA_LIST > temp-3.txt
mv temp-3.txt $MOZILLA_LIST

# Strip the leading "*." from wildcard rules (the rest of the line is kept)
sed "s/^\*\.//g" $MOZILLA_LIST > temp-4.txt
mv temp-4.txt $MOZILLA_LIST

# Pre-sort it in byte order, which matches C++ std::string comparison
LC_ALL=C sort $MOZILLA_LIST > temp-8.txt
mv temp-8.txt $MOZILLA_LIST

# Copy it to resources
cp $MOZILLA_LIST ../res
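
For cases where preprocessing at run time is preferred, the same
filtering steps can be sketched in C++. normalize_psl is a hypothetical
name, and the code mirrors only what the script above does (it does not
implement the full Public Suffix List rule semantics):

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Mirrors the script: drop "//" comment lines and empty lines, strip a
// leading "!" and a leading "*.", then sort in byte order.
std::vector<std::string> normalize_psl(const std::string& raw) {
    std::vector<std::string> rules;
    std::istringstream in(raw);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line.compare(0, 2, "//") == 0) continue;
        if (line[0] == '!') line.erase(0, 1);
        if (line.compare(0, 2, "*.") == 0) line.erase(0, 2);
        if (!line.empty()) rules.push_back(line);
    }
    std::sort(rules.begin(), rules.end());
    return rules;
}
```

Like the script, this keeps exception and wildcard rules as plain
entries, so the result is a flat list of names rather than a rule set.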


