www/server/source source.html find_duplicate_li...
From: Félicien PILLOT
Subject: www/server/source source.html find_duplicate_li...
Date: Fri, 16 Aug 2019 17:56:23 -0400 (EDT)
CVSROOT: /web/www
Module name: www
Changes by: Félicien PILLOT <felandral> 19/08/16 17:56:22
Modified files:
server/source : source.html
Added files:
server/source : find_duplicate_links make_patch_addresses
Log message:
Add two new scripts, update source.html
CVSWeb URLs:
http://web.cvs.savannah.gnu.org/viewcvs/www/server/source/source.html?cvsroot=www&r1=1.55&r2=1.56
http://web.cvs.savannah.gnu.org/viewcvs/www/server/source/find_duplicate_links?cvsroot=www&rev=1.1
http://web.cvs.savannah.gnu.org/viewcvs/www/server/source/make_patch_addresses?cvsroot=www&rev=1.1
Patches:
Index: source.html
===================================================================
RCS file: /web/www/www/server/source/source.html,v
retrieving revision 1.55
retrieving revision 1.56
diff -u -b -r1.55 -r1.56
--- source.html 20 Sep 2017 10:43:36 -0000 1.55
+++ source.html 16 Aug 2019 21:56:21 -0000 1.56
@@ -26,6 +26,21 @@
<a href="https://savannah.gnu.org/cvs/?group=www">Savannah CVS page</a>,
the “Webpages repository” information.</p>
+<h3><a id="find_duplicate_links">find_duplicate_links</a></h3>
+<ul>
+  <li><a href="http://web.cvs.savannah.gnu.org/viewvc/www/server/source/find_duplicate_links/?root=www">Source code</a></li>
+ <li>Author: <a href="mailto:address@hidden">Félicien Pillot</a></li>
+</ul>
+
+<p>This Perl script scans every file under <a href="/proprietary">/proprietary</a> to check whether a URL is used twice on the same page. It runs monthly on fencepost from user felicien's cron.</p>
+
+<h3><a id="make_patch_addresses">make_patch_addresses</a></h3>
+<ul>
+  <li><a href="http://web.cvs.savannah.gnu.org/viewvc/www/server/source/make_patch_addresses/?root=www">Source code</a></li>
+ <li>Author: <a href="mailto:address@hidden">Félicien Pillot</a></li>
+</ul>
+
+<p>This script can be run from a GNU package's webroot to replace incorrect broken-link reporting addresses (address@hidden) with correct ones (i.e., the project's mailing list). Patches can then be applied by webmasters or package maintainers.</p>
<h3><a id="linc">linc</a></h3>
@@ -166,7 +181,7 @@
<p class="unprintable">Updated:
<!-- timestamp start -->
-$Date: 2017/09/20 10:43:36 $
+$Date: 2019/08/16 21:56:21 $
<!-- timestamp end --></p>
</div>
</div>
Index: find_duplicate_links
===================================================================
RCS file: find_duplicate_links
diff -N find_duplicate_links
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ find_duplicate_links 16 Aug 2019 21:56:21 -0000 1.1
@@ -0,0 +1,202 @@
+#!/usr/bin/env perl
+# find_duplicate_links
+# This Perl script scans html files in gnu.org (especially in proprietary/*)
+# to detect duplicated items.
+# Send any comment, correction, optimization, etc. to <address@hidden>
+#
+# Copyright (C) 2017 Felicien PILLOT <address@hidden>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+#### Modules
+use Cwd; # get current directory
+use Getopt::Long; # parse command line options
+use HTML::TreeBuilder; # parse HTML file into a tree
+use strict; # be strict
+use warnings; # warn
+
+#### Global variables (as few as possible)
+my (%allItems, # Hash containing every <a> item from html page
+ %dupItems, # Hash containing only duplicated <a> items
+ $cron_output, # Integer specifying the type of output
+ $verbose); # Integer specifying verbosity
+
+#### Subroutines (sorted alphabetically)
+sub check_element
+{
+ # Declare local variables
+ my ($a, # HTML::Element: contains a <a> tag
+ $class, # String: contains the class of the <a>
+ $file, # String: contains the currently processed file
+ $text1, # String: contains the whole text from the parent <p>
+ $text2, # String: idem as text1
+ $url); # String containing the href (url) of the <a>
+ # Set variables related to the currently processed element (<a>)
+ $a = $_[0];
+ $file = $_[1];
+ $url = $a->attr ('href');
+ $class = $a->attr ('class');
+
+ # Check if there is a duplicated link
+  if (exists $allItems{$url}) {
+    if (($class // '') eq "not-a-duplicate" ||
+        ($allItems{$url}->attr ('class') // '') eq "not-a-duplicate") {
+      print "$url is marked not-a-duplicate\n" if ($verbose > 0);
+    } else {
+      print "$url found twice\n" if ($verbose > 0);
+ # Get context (text from <p>)
+ $text1 = $a->parent ()->as_text ();
+ $text2 = $allItems{$url}->parent ()->as_text ();
+ # Remove beginning spaces
+ $text1 =~ s/^\s+//;
+ $text2 =~ s/^\s+//;
+ # Store it in %dupItems
+ $dupItems{$file}->{$url} = [$text1, $text2];
+ }
+ }
+  # Store every external URL in %allItems
+ $allItems{$url} = $a if ($url =~ /^http/);
+}
+
+# Print top and bottom texts, for mailing the output through the cron job
+sub cron_output
+{
+  print "*** This is an automatic email addressed to webmasters\@gnu.org\n"
+ . "*** It will be sent twice a month, according to #1198654 RT ticket\n"
+ . "*** about duplicate links found in www.gnu.org/proprietary/*\n"
+ . "*** The crontab from which this mail is sent is located at\n"
+ . "*** /var/spool/cron/crontabs/felicien\n"
+ . "*** For any information, contribution, comment, etc., contact "
+ . "<felicien\@gnu.org>\n"
+ . "Here is the output of the Perl script 'find_duplicate_links':\n\n";
+ if (output ()) {
+    print "Please check these occurrences; if they are true duplicates,\n"
+	. "remove the redundant one or make a reference to the first item;\n"
+	. "if they are two different items with the same link, set a"
+	. " class\nattribute containing 'not-a-duplicate' in the <a> tag:\n"
+ . "<a class=\"not-a-duplicate\" href=\"...\">\n";
+ }
+}
+
+# Print help information
+sub help
+{
+ print "help ()\n" if ($verbose > 1);
+  print "Usage: find_duplicate_links [OPTIONS] [PATH]\n"
+	. "OPTIONS:\n"
+	. " -c, --cron \t\t\tFormat the output as a cron-sent"
+	. " email\n -h, --help \t\t\tDisplay this help message then exit\n"
+ . " -p, --pattern=FILE_PATTERN\tSet FILE_PATTERN as explained below\n"
+ . " --verbose\t\t\tIncrease debug & information display\n"
+ . " --version\t\t\tDisplay version then exit\n"
+ . "FILE_PATTERN is a part (without extension) of one or more filenames"
+ . " from\nthe current directory. Don't worry, the script won't edit "
+ . "translated files.\nFor example: the FILE_PATTERN \"malware\" will "
+ . "check malware-adobe.html,\nmalware-amazon.html, malware-apple.html "
+ . "but not malware-adobe.fr.html.\n";
+ exit;
+}
+
+# Generate HTML::Tree from file, then check each <a> element
+sub main_sub
+{
+ print "main_sub ()\n" if ($verbose > 1);
+ # Declare local variables
+ my ($dir, # String: contains the current directory
+ $pattern, # String: given by user -- see help() for details
+ $tree); # HTML::Tree: contains all elements from the html page
+ # Get variables
+ if ($ARGV[0]) {
+ if ($ARGV[0] =~ /^\//) {
+ $dir = $ARGV[0];
+ } else {
+ $dir = getcwd ()."/".$ARGV[0];
+ }
+ } else {
+ $dir = getcwd ();
+ }
+ $pattern = $_[1] // "";
+ print "Directory: $dir\n" if ($verbose > 0);
+ # Parse files
+ opendir (DIR, $dir) or die $!;
+ while (readdir(DIR)) {
+ next unless (/^$pattern[^\.]*\.html/);
+ print "Selected file: $_\n" if ($verbose > 0);
+ # Reset some variables
+ %allItems = ();
+ $tree = HTML::TreeBuilder->new_from_file ($dir."/".$_);
+ # Walk through each <a> tag
+ foreach $a ($tree->find ('a')) {
+ check_element ($a, $_);
+ }
+ }
+ closedir(DIR);
+
+ # Display results
+ $cron_output ? cron_output () : output ();
+ print "The script ended successfully\n" if ($verbose > 0);
+  exit 0;
+}
+
+# Display output for terminal stdout
+sub output {
+ # Declare local variables
+ my ($file, # String: filename contained in %dupItems
+ $url); # String: url (href) contained in $dupItems{$file}
+ # Walk through %dupItems
+ if (%dupItems) {
+ foreach $file (keys %dupItems) {
+ foreach $url (keys %{$dupItems{$file}}) {
+ # Display results for each duplication
+ print "In $file, these items point to the same link:\n"
+ . "$url\n"
+ . "* $dupItems{$file}{$url}[0]\n"
+ . "* $dupItems{$file}{$url}[1]\n\n";
+ }
+ }
+ return 1;
+ } else {
+    print "No duplicate links were found.\n";
+ return 0;
+ }
+}
+
+# Print version information
+sub version
+{
+ print "version ()\n" if ($verbose > 1);
+ print "find_duplicate_links 0.7\n"
+ . "Copyright (C) 2017 Felicien Pillot <felicien\@gnu.org>\n"
+ . "License GPLv3+: GNU GPL version 3 or later "
+ . "<http://gnu.org/licenses/gpl.html>\n"
+	. "This is free software: you are free to change and "
+	. "redistribute it.\n"
+ . "There is NO WARRANTY, to the extent permitted by law.\n";
+  exit 0;
+}
+
+### Main
+
+$verbose = 0;
+$cron_output = 0;
+# Parse the command line arguments
+GetOptions ("cron" => \$cron_output,
+ "help" => \&help,
+ "pattern=s" => \&main_sub,
+ "verbose+" => \$verbose,
+ "version" => \&version);
+# Even if no --pattern has been given, try the main loop
+main_sub ();
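The duplicate-link check above boils down to collecting every external href in a page and flagging the ones that appear more than once. The core idea can be sketched self-contained in shell; the sample file and URLs below are invented for illustration, not taken from gnu.org:

```shell
# Build a small sample page with one duplicated external link.
cat > /tmp/sample.html <<'EOF'
<p><a href="https://example.org/a">first</a></p>
<p><a href="https://example.org/b">other</a></p>
<p><a href="https://example.org/a">again</a></p>
EOF

# Extract href values, keep only external (http...) links, report repeats.
grep -o 'href="[^"]*"' /tmp/sample.html \
  | sed 's/^href="//; s/"$//' \
  | grep '^http' \
  | sort | uniq -d
```

This prints `https://example.org/a`, the one URL used twice. The Perl script does the same walk with HTML::TreeBuilder so it can also recover the surrounding `<p>` text and honor the `not-a-duplicate` class, which a plain grep cannot.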
Index: make_patch_addresses
===================================================================
RCS file: make_patch_addresses
diff -N make_patch_addresses
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ make_patch_addresses 16 Aug 2019 21:56:21 -0000 1.1
@@ -0,0 +1,54 @@
+#!/bin/sh
+#
+# Generates a patch for replacing <address@hidden> with the
+# correct broken link reporting address.
+#
+# Copyright (C) 2019 Félicien PILLOT <address@hidden>
+#
+# This is free software: you can redistribute it and/or modify under
+# the terms of the GNU General Public License as published by the Free
+# Software Foundation, either version 3 of the License, or (at your
+# option) any later version.
+#
+# This file is distributed in the hope that it will be useful, WITHOUT
+# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+# or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+# License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# with this file. If not, see <http://www.gnu.org/licenses/>.
+
+# If no argument is passed, display help message
+if [ $# -eq 0 ]
+then
+    echo \
+"This script generates a patch for replacing <address@hidden> with
+the correct broken link reporting address.
+There is no need to apply the patch in the working dir.
+You must provide some arguments:
+1. The package name
+2. The correct address (optional: if not given, bug-<package>@gnu.org is assumed)"
+    exit 1
+fi
+
+# Get the package name -- typically it's $(basename $(pwd))
+PACKAGE_NAME=$1
+
+# If no more argument is passed, build the default ML address
+if [ $# -eq 1 ]
+then
+ NEW_ADDRESS="bug-${PACKAGE_NAME}@gnu.org"
+else
+ NEW_ADDRESS=$2
+fi
+
+# Search for files and lines to edit, replace addresses with sed
+for FILE in $(grep -Rls mailto:webmasters *)
+do
+    sed -i "/mailto:/ s/address@hidden/${NEW_ADDRESS}/g" "$FILE"
+done
+
+# Get a diff from the last commit
+cvs diff -U1 -r1 * > ${PACKAGE_NAME}.patch 2> /dev/null
+
+# Warn the user if nothing has happened
+[ -s "${PACKAGE_NAME}.patch" ] || echo "WARNING: ${PACKAGE_NAME}.patch is empty."
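The address rewrite itself is a targeted sed substitution restricted to lines containing `mailto:`. A minimal self-contained sketch follows; the file, the placeholder `old-address@example.org` (standing in for the masked address in the script above), and `bug-foo@gnu.org` are all invented for illustration:

```shell
# Sample page: one mailto link plus one plain-text mention of the address.
cat > /tmp/page.html <<'EOF'
<a href="mailto:old-address@example.org">report broken links</a>
<p>Contact old-address@example.org by phone.</p>
EOF

NEW_ADDRESS="bug-foo@gnu.org"
# Substitute only on mailto: lines, mirroring the script's sed address filter.
sed -i "/mailto:/ s/old-address@example\.org/${NEW_ADDRESS}/g" /tmp/page.html

# Count lines now carrying the new address.
grep -c "$NEW_ADDRESS" /tmp/page.html   # → 1
```

Only the mailto line is rewritten; the plain-text mention on the second line is left alone, which is exactly why the `/mailto:/` address filter is there.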