www-commits

From: Félicien PILLOT
Subject: www/server/source source.html find_duplicate_li...
Date: Fri, 16 Aug 2019 17:56:23 -0400 (EDT)

CVSROOT:        /web/www
Module name:    www
Changes by:     Félicien PILLOT <felandral>    19/08/16 17:56:22

Modified files:
        server/source  : source.html 
Added files:
        server/source  : find_duplicate_links make_patch_addresses 

Log message:
        Add two new scripts, update source.html

CVSWeb URLs:
http://web.cvs.savannah.gnu.org/viewcvs/www/server/source/source.html?cvsroot=www&r1=1.55&r2=1.56
http://web.cvs.savannah.gnu.org/viewcvs/www/server/source/find_duplicate_links?cvsroot=www&rev=1.1
http://web.cvs.savannah.gnu.org/viewcvs/www/server/source/make_patch_addresses?cvsroot=www&rev=1.1

Patches:
Index: source.html
===================================================================
RCS file: /web/www/www/server/source/source.html,v
retrieving revision 1.55
retrieving revision 1.56
diff -u -b -r1.55 -r1.56
--- source.html 20 Sep 2017 10:43:36 -0000      1.55
+++ source.html 16 Aug 2019 21:56:21 -0000      1.56
@@ -26,6 +26,21 @@
 <a href="https://savannah.gnu.org/cvs/?group=www">Savannah CVS page</a>,
 the &ldquo;Webpages repository&rdquo; information.</p>
 
+<h3><a id="find_duplicate_links">find_duplicate_links</a></h3>
+<ul>
+  <li><a href="http://web.cvs.savannah.gnu.org/viewvc/www/server/source/find_duplicate_links/?root=www">Source code</a></li>
+  <li>Author: <a href="mailto:address@hidden">F&eacute;licien Pillot</a></li>
+</ul>
+
+<p>This Perl script scans every file under <a href="/proprietary">/proprietary</a> to check whether a URL is used twice on the same page. It runs monthly on fencepost, from user felicien's cron.</p>
+
+<h3><a id="make_patch_addresses">make_patch_addresses</a></h3>
+<ul>
+  <li><a href="http://web.cvs.savannah.gnu.org/viewvc/www/server/source/make_patch_addresses/?root=www">Source code</a></li>
+  <li>Author: <a href="mailto:address@hidden">F&eacute;licien Pillot</a></li>
+</ul>
+
+<p>This script can be run from a GNU package's web root to replace incorrect broken-link reporting addresses (address@hidden) with correct ones (e.g., the project's mailing list). Patches can then be applied by webmasters or package maintainers.</p>
 
 <h3><a id="linc">linc</a></h3>
 
@@ -166,7 +181,7 @@
 
 <p class="unprintable">Updated:
 <!-- timestamp start -->
-$Date: 2017/09/20 10:43:36 $
+$Date: 2019/08/16 21:56:21 $
 <!-- timestamp end --></p>
 </div>
 </div>

Index: find_duplicate_links
===================================================================
RCS file: find_duplicate_links
diff -N find_duplicate_links
--- /dev/null   1 Jan 1970 00:00:00 -0000
+++ find_duplicate_links        16 Aug 2019 21:56:21 -0000      1.1
@@ -0,0 +1,200 @@
+#!/usr/bin/env perl
+# find_duplicate_links
+# This Perl script scans HTML files on gnu.org (especially in proprietary/*)
+# to detect duplicate links.
+# Send any comment, correction, optimization, etc. to <address@hidden>
+#
+# Copyright 2017 (C) Felicien PILLOT <address@hidden>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+# 
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+# 
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+#### Modules
+use Cwd;               # get current directory
+use Getopt::Long;      # parse command line options
+use HTML::TreeBuilder; # parse HTML file into a tree
+use strict;            # be strict
+use warnings;          # warn
+
+#### Global variables (as few as possible)
+my (%allItems,         # Hash containing every <a> item from html page
+    %dupItems,         # Hash containing only duplicated <a> items
+    $cron_output,      # Integer specifying the type of output
+    $verbose);         # Integer specifying verbosity
+
+#### Subroutines (sorted alphabetically)
+sub check_element
+{
+    # Declare local variables
+    my ($a,          # HTML::Element: contains a <a> tag
+       $class,      # String: contains the class of the <a>
+       $file,       # String: contains the currently processed file
+       $text1,      # String: contains the whole text from the parent <p>
+       $text2,      # String: same as text1
+       $url);       # String: contains the href (URL) of the <a>
+    # Set variables related to the currently processed element (<a>)
+    $a = $_[0];
+    $file = $_[1];
+    $url = $a->attr ('href');
+    $class = $a->attr ('class');
+
+    # Check whether this URL has already been seen on this page
+    if (exists $allItems{$url}) {
+       if (($class // '') eq "not-a-duplicate" ||
+           ($allItems{$url}->attr ('class') // '') eq "not-a-duplicate") {
+           print "$url is not-a-duplicate\n" if ($verbose > 0);
+       } else {
+           print "$url found twice\n" if ($verbose > 0);
+           # Get context (text from <p>)
+           $text1 = $a->parent ()->as_text ();
+           $text2 = $allItems{$url}->parent ()->as_text ();
+           # Remove beginning spaces
+           $text1 =~ s/^\s+//;
+           $text2 =~ s/^\s+//;
+           # Store it in %dupItems
+           $dupItems{$file}->{$url} = [$text1, $text2];
+       }
+    }
+    # Store every external urls in %allItems
+    $allItems{$url} = $a if ($url =~ /^http/);
+}
+
+# Print top and bottom texts, for mailing the output through the cron job
+sub cron_output
+{
+    print "*** This is an automatic email adressed to webmasters\@gnu.org\n"
+       . "*** It will be sent twice a month, according to #1198654 RT ticket\n"
+       . "*** about duplicate links found in www.gnu.org/proprietary/*\n"
+       . "*** The crontab from which this mail is sent is located at\n"
+       . "*** /var/spool/cron/crontabs/felicien\n"
+       . "*** For any information, contribution, comment, etc., contact "
+       . "<felicien\@gnu.org>\n"
+       . "Here is the output of the Perl script 'find_duplicate_links':\n\n";
+    if (output ()) {
+       print "Please check these occurrences, if they are true duplications,\n"
+           . "remove the redundant one or make a reference to the first item"
+           . " ;\nif they are two different items with the same link, set a"
+           . " class\nattribute containing 'not-a-duplicate' in the <a> tag:\n"
+           . "<a class=\"not-a-duplicate\" href=\"...\">\n";
+    }
+}
+
+# Print help information
+sub help
+{
+    print "help ()\n" if ($verbose > 1);
+    print "Usage: find_duplicate_links.sh [OPTIONS] [PATH]\n"
+       . "OPTIONS:\n"
+       . "  -c, --cron   \t\t\tFormat the output accordingly to a 'cron sent'"
+       . " email\n  -h, --help   \t\t\tDisplay this help message then exit\n"
+       . "  -p, --pattern=FILE_PATTERN\tSet FILE_PATTERN as explained below\n"
+       . "      --verbose\t\t\tIncrease debug & information display\n"
+       . "      --version\t\t\tDisplay version then exit\n"
+       . "FILE_PATTERN is a part (without extension) of one or more filenames"
+       . " from\nthe current directory. Don't worry, the script won't edit "
+       . "translated files.\nFor example: the FILE_PATTERN \"malware\" will "
+       . "check malware-adobe.html,\nmalware-amazon.html, malware-apple.html "
+       . "but not malware-adobe.fr.html.\n";
+    exit;
+}
+
+# Generate HTML::Tree from file, then check each <a> element
+sub main_sub
+{
+    print "main_sub ()\n" if ($verbose > 1);
+    # Declare local variables
+    my ($dir,        # String: contains the current directory
+       $pattern,    # String: given by user -- see help() for details
+       $tree);      # HTML::Tree: contains all elements from the html page
+    # Get variables
+    if ($ARGV[0]) {
+       if ($ARGV[0] =~ /^\//) {
+           $dir = $ARGV[0];
+       } else {
+           $dir = getcwd ()."/".$ARGV[0];
+       }
+    } else {
+       $dir = getcwd ();
+    }
+    $pattern = $_[1] // "";
+    print "Directory: $dir\n" if ($verbose > 0);
+    # Parse files
+    opendir (DIR, $dir) or die $!;
+    while (readdir(DIR)) {
+        next unless (/^$pattern[^\.]*\.html/);
+       print "Selected file: $_\n" if ($verbose > 0);
+       # Reset some variables
+       %allItems = ();
+       $tree = HTML::TreeBuilder->new_from_file ($dir."/".$_);
+       # Walk through each <a> tag
+       foreach $a ($tree->find ('a')) {
+           check_element ($a, $_);
+       }
+    }
+    closedir(DIR);
+    
+    # Display results
+    $cron_output ? cron_output () : output ();
+    print "The script ended successfully\n" if ($verbose > 0);
+    exit 0;
+}
+
+# Display output for terminal stdout
+sub output {
+    # Declare local variables
+    my ($file,       # String: filename contained in %dupItems
+       $url);       # String: url (href) contained in $dupItems{$file}
+    # Walk through %dupItems
+    if (%dupItems) {
+       foreach $file (keys %dupItems) {
+           foreach $url (keys %{$dupItems{$file}}) {
+               # Display results for each duplication
+               print "In $file, these items point to the same link:\n"
+                   . "$url\n"
+                   . "* $dupItems{$file}{$url}[0]\n"
+                   . "* $dupItems{$file}{$url}[1]\n\n";
+           }
+       }
+       return 1;
+    } else {
+       print "No duplicated link has been found.\n";
+       return 0;
+    }
+}
+
+# Print version information
+sub version
+{
+    print "version ()\n" if ($verbose > 1);
+    print "find_duplicate_links 0.7\n"
+       . "Copyright (C) 2017 Felicien Pillot <felicien\@gnu.org>\n"
+       . "License GPLv3+: GNU GPL version 3 or later "
+       . "<http://gnu.org/licenses/gpl.html>\n"
+       . "This is free software: you are free to change and"
+       . "redistribute it.\n"
+       . "There is NO WARRANTY, to the extent permitted by law.\n";
+    exit 0;
+}
+
+### Main
+
+$verbose = 0;
+$cron_output = 0;
+# Parse the command line arguments
+GetOptions ("cron"        => \$cron_output,
+           "help"        => \&help,
+           "pattern=s"   => \&main_sub,
+           "verbose+"    => \$verbose,
+           "version"     => \&version);
+# Even if no --pattern has been given, try the main loop
+main_sub ();
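
For reference, here is a minimal usage sketch of the script above, as a
shell session (the checkout path is hypothetical, and Perl with the
HTML::TreeBuilder module is assumed to be installed):

    # Check every page in the proprietary/ directory of a www checkout
    ./find_duplicate_links ~/www/proprietary

    # Check only the malware-*.html pages, with extra diagnostics
    ./find_duplicate_links --verbose --pattern=malware ~/www/proprietary

    # Produce the email-formatted report used by the cron job
    ./find_duplicate_links --cron ~/www/proprietary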

Index: make_patch_addresses
===================================================================
RCS file: make_patch_addresses
diff -N make_patch_addresses
--- /dev/null   1 Jan 1970 00:00:00 -0000
+++ make_patch_addresses        16 Aug 2019 21:56:21 -0000      1.1
@@ -0,0 +1,55 @@
+#!/bin/sh
+#
+# Generates a patch for replacing <address@hidden> with the
+# correct broken link reporting address.
+#
+# Copyright 2019 (C) Félicien PILLOT <address@hidden>
+# 
+# This is free software: you can redistribute it and/or modify it under
+# the terms of the GNU General Public License as published by the Free
+# Software Foundation, either version 3 of the License, or (at your
+# option) any later version.
+# 
+# This file is distributed in the hope that it will be useful, WITHOUT
+# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+# or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+# License for more details.
+# 
+# You should have received a copy of the GNU General Public License
+# along with this file.  If not, see <http://www.gnu.org/licenses/>.
+
+# If no argument is passed, display a help message and exit
+if [ $# -eq 0 ]
+then
+    echo \
+"This script generates a patch for replacing <address@hidden> with
+the correct broken link reporting address.
+There is no need to apply the patch in the working directory.
+You must provide some arguments:
+1. The package name
+2. The correct address (optional: if not given, bug-<package>@gnu.org is assumed)"
+    exit 1
+fi
+
+# Get the package name -- typically it's $(basename $(pwd))
+PACKAGE_NAME=$1
+
+# If no more argument is passed, build the default ML address
+if [ $# -eq 1 ]
+then
+    NEW_ADDRESS="bug-${PACKAGE_NAME}@gnu.org"
+else
+    NEW_ADDRESS=$2
+fi
+
+# Search for files and lines to edit, replace addresses with sed
+for FILE in $(grep -Rls mailto:webmasters *)
+do
+    sed -i "/mailto:/ s/address@hidden/${NEW_ADDRESS}/g" "$FILE"
+done
+
+# Get a diff from the last commit
+cvs diff -U1 -r1 * > ${PACKAGE_NAME}.patch 2> /dev/null
+
+# Warn the user if nothing has happened
+[ -s "${PACKAGE_NAME}.patch" ] || echo "WARNING: ${PACKAGE_NAME}.patch is empty."
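
For reference, a minimal usage sketch of this script, as a shell session
(the checkout paths and the package name "foo" are hypothetical):

    # From the package's web root, use the default bug-foo@gnu.org address
    cd ~/webpages/foo
    /path/to/www/server/source/make_patch_addresses foo

    # Or name the project's mailing list explicitly
    /path/to/www/server/source/make_patch_addresses foo foo-dev@gnu.org

    # The result is written to foo.patch in the current directory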


