<p>KECK INSTRUMENT TECHNICAL NOTE KITN 0021 PAGE 1 OF 36</p><p>Weblint: A Tool for Checking Instrument Web Pages</p><p>Author | Date | Comment
Gregory D. Wirth | 2011-Aug-17 | Original
Marc Kassis | 2015-Jun-7 | Updated following webserver changes: Modification History added. Updated weblint invocation table. Updated weblint source code. Added source code for additional scripts.</p><p>Synopsis</p><p>This document describes the use of weblint, a locally developed utility which performs various checks on instrument web pages.</p><p>Background</p><p>Keck instrument web pages are designed with a number of considerations:
o Pages should have a similar look and feel
o Any web page should be accessible within 2 clicks from the home page of the respective instrument
o Web pages should have index entries that allow them to be automatically added into the index page for the respective instrument</p><p>Furthermore, there are several maladies which are tedious to diagnose manually but which can easily be detected by a suitable script:
o Web pages should have no bad or invalid links
o Web pages should not refer to non-local images; i.e., all images should be in the same directory as the web page itself, lest they become separated.</p><p>The weblint script was developed to address these issues with the instrument web pages. The remainder of this document describes how the script works.</p><p>Procedure</p><p>For each web page in the current directory, weblint will check whether:
1. the page is linked to the home page, or is linked to a file which is linked to the homepage, thus satisfying the “2-click” criterion;
2. the file includes a valid index entry;
3. the file refers to images which are not in the current directory;
4. the file adheres to the correct format for Keck web instrument pages;
5. the file includes any invalid links.</p><p>Here are more details on what the script does for each of these checks. 
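</p><p>As a rough illustration of check 1, the reachability test can be pictured as a breadth-first walk over the link graph, starting at the home page and counting clicks. The sketch below is not the weblint source: the in-memory %links graph, the page names, and the subs click_depths and find_orphans are all hypothetical, and it counts the home page as 0 clicks, whereas the listing in Appendix B numbers the home page as depth 1.</p>

```perl
#!/usr/bin/perl
# Hypothetical sketch of the reachability ("orphan") test; NOT the weblint source.
use strict;
use warnings;

# Assumed in-memory link graph: page name => pages it links to.
# (weblint itself extracts links from the HTML files; these names are invented.)
my %links = (
    'index.html'      => [ 'observing.html', 'news.html' ],
    'observing.html'  => [ 'filters.html' ],
    'filters.html'    => [ 'throughput.html' ],
    'news.html'       => [],
    'throughput.html' => [],
    'old_notes.html'  => [],    # nothing links here
);

# Breadth-first walk from the home page, recording the number of clicks
# needed to reach each page (home page = 0 clicks).
sub click_depths {
    my ( $home, $graph ) = @_;
    my %depth = ( $home => 0 );
    my @queue = ($home);
    while (@queue) {
        my $page = shift @queue;
        for my $target ( @{ $graph->{$page} || [] } ) {
            next if exists $depth{$target};    # already reached by a shorter path
            $depth{$target} = $depth{$page} + 1;
            push @queue, $target;
        }
    }
    return \%depth;
}

# Pages that are unreachable, or reachable only beyond $max_clicks.
sub find_orphans {
    my ( $home, $graph, $max_clicks ) = @_;
    my $depth   = click_depths( $home, $graph );
    my @orphans = grep {
        !exists( $depth->{$_} ) || $depth->{$_} > $max_clicks
    } sort keys %$graph;
    return @orphans;
}

print "$_\n" for find_orphans( 'index.html', \%links, 2 );
```

<p>With the sample graph and a limit of 2 clicks, the sketch reports old_notes.html (never linked) and throughput.html (3 clicks away).</p><p>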
</p><p>Orphan Check</p><p>A page is considered an “orphan” if it cannot be reached from the instrument’s home page within a specified number of clicks, where the number (or “depth”) is user-configurable. The default depth is 3. In the source code (Appendix B) the home page itself is assigned depth 1, so a depth of 3 corresponds to at most 2 clicks from the home page, consistent with the “2-click” criterion. The script begins by compiling a list of all HREF links on the instrument home page. It then scans the web pages to which those links refer, adds their links to the list, and continues this process until the requested depth is achieved. The script then iterates through all web pages (i.e., files with names of the form *.htm or *.html) and determines whether they can be reached within the specified number of links from the home page. If not, it generates a warning message for those web pages.</p><p>Index Check</p><p>For each web page in the directory, the script will compile a list of index entries and will determine whether they are valid. Every instrument web page is expected to have an index entry which can be used by the makeindex program to generate an index of the web pages in the directory. These index entries appear as comments in the HTML code and take the form:</p><p><!-- index="topic:subtopic" --></p><p>The weblint script will generate a warning message for a given web page if any of the following conditions is met:
1. The index tag is missing.
2. The index tag is set to an invalid value (e.g., “??”) as specified by the user.
3. The index tag duplicates the topic:subtopic entry of another page in the same directory.</p><p>Image Check</p><p>It is generally good practice to keep images in the same directory as any web pages which refer to them, thus preventing the pages from being separated from the images. For each web page, the weblint script will review all links in the file to determine whether the following conditions are both met:
1. The link points to a file whose name ends in .jpg, .jpeg, .gif, or .png, and
2. The link points to a file which is not in the current directory.
If these conditions are both met, weblint will generate a warning message for the given web page.</p><p>Format Check</p><p>Keck instrument web pages are expected to adhere to a certain format which makes use of CSS to define the look and feel. As a rough test of whether a particular page employs this format, the script will check to see whether the file contains a string of the form:</p><p> src="../web_docs/scripts/include_main.js"</p><p>If not, then weblint generates a warning message for that page.</p><p>Invalid Links</p><p>The weblint script will attempt to retrieve every URL referenced in the file. If any cannot be retrieved, the script adds a warning message to the output file.</p><p>Invocation</p><p>Command-Line</p><p>The weblint script has numerous command-line switches which allow the user to control which tests are run and how the tests are performed. 
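</p><p>To illustrate how the check-selection switches combine (as noted in Appendix A, every check is enabled when none is requested explicitly), here is a minimal sketch using Perl's core Getopt::Std module. It is loosely modeled on the getopts call in the Appendix B listing, but the sub name select_checks and the reduced switch string 'O:TMLF' are assumptions made for illustration; the real script accepts more options.</p>

```perl
#!/usr/bin/perl
# Hypothetical sketch of check selection; NOT the weblint source.
use strict;
use warnings;
use Getopt::Std;

# Parse a list of command-line words and return (\%check, $maxdepth).
# Only the check-selection switches are handled in this sketch.
sub select_checks {
    my (@args) = @_;
    local @ARGV = @args;    # getopts() reads @ARGV
    my %opt;
    getopts( 'O:TMLF', \%opt ) or die "unrecognized switch\n";

    my %check;
    $check{Orphans} = 1 if defined $opt{O};
    $check{Tags}    = 1 if $opt{T};
    $check{Images}  = 1 if $opt{M};
    $check{Links}   = 1 if $opt{L};
    $check{Format}  = 1 if $opt{F};

    # If no checks were selected, turn them all on.
    unless (%check) {
        %check = map { $_ => 1 } qw(Orphans Tags Images Links Format);
    }

    # Default depth of 3, as stated in the Orphan Check section.
    my $maxdepth = defined $opt{O} ? $opt{O} : 3;
    return ( \%check, $maxdepth );
}

my ( $check, $maxdepth ) = select_checks( '-TL', '-O', '4' );
print join( ', ', sort keys %$check ), " (maxdepth $maxdepth)\n";
```

<p>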
Please refer to Appendix A for a detailed description of the syntax.</p><p>Makefile</p><p>Each instrument’s directory on the web server has been configured with a Makefile with a target called weblint, allowing the user to run the script by simply typing</p><p> make weblint</p><p>The Makefile entry should be edited as required to define the desired parameters for weblint, as in:</p><p> weblint:
	../bin/weblint -m gwirth,mkassis -t "??,NONE" -O 3 -TLF</p><p>Current Makefile entries are shown in the following table:</p><p>Instrument | Makefile entry as of June 2015
ADC | (none listed)
DEIMOS | ../bin/weblint -m deimos_info -t "??,NONE" -O 3 -TLF -I @weblint_ignore_pages
ESI | ../bin/weblint -m esi_info -t "??,NONE" -O 3 -TLF
HIRES | ../bin/weblint -m hires_info -t "??,NONE" -O 3 -TLF
LRIS | ../bin/weblint -m lris_info -t "??,NONE" -I @weblint_ignore_pages -O 3 -TLF -H lrishome.html
MOSFIRE | ../bin/weblint -m mkassis -t "??,NONE" -I @weblint_ignore_pages -O 3 -TLF
NIRC2 | ../bin/weblint -m nirc2_info -t "??,NONE" -O 3 -TLF -i home.html
NIRSPEC | ../bin/weblint -m nirspec_info -t "??,NONE" -I @weblint.ignorefiles -O 3 -TLF
OSIRIS | ../bin/weblint -m osiris_info -t "??,NONE" -I @weblint.ignorefiles -O 4 -TLF -x ../web_docs/SideMenus/osiris_menu.js
common | ../bin/weblint -m sas -t "??,NONE" -TLF -I @weblint_ignore_pages</p><p>Cron</p><p>Front-end scripts called run_do_weblint and do_weblint enable easy execution of the weblint script as a cron job. The do_weblint script simply goes to the home directory for the specified instrument and executes the make weblint command. The run_do_weblint script is called by cron. The cron jobs are typically invoked via the web service at https://www.keck.hawaii.edu/software/inventory/useCases/useCases.php?useCase=crons&subsystem=WWW. As of June 2015, the cron job runs on the 7th and 23rd of each month. 
</p><p>These scripts are located in /webFiles/www/public/realpublic/inst/bin.</p><p>Output</p><p>The weblint script writes a comprehensive logfile which lists all of the warnings for each web page. The default logfile location is /webFiles/www/public/inst/log/weblint-inst-log.html where inst is the name of the instrument (e.g., lris).</p><p>There is also a summary scorecard in the log directory that compares the web page summaries of all instruments.</p><p>Modifications completed June 2015</p><p>In 2015, the web server was replaced with a more secure one. Modifications were necessary for weblint to work on the new web server; they are summarized below for each of the scripts associated with weblint. These files live in /webFiles/www/public/realpublic/inst/bin:
run_do_weblint – header modified to !/bin/csh
do_weblint – updated path to the correct directories. header modified to !/bin/csh, removed perl dependencies.
weblint – updated path to the correct directories. header modified to !/bin/csh, removed perl dependencies.
web_lint_summary – updated path to the correct directories. header modified to !/bin/csh, removed perl dependencies.
../inst/Makefile for all instruments and common pages: header modified to !/bin/csh,</p><p>/webFiles/www/public/inst/log/Makefile - header modified to /bin/csh</p><p>To execute weblint, a cron is now invoked via the web service at: https://www.keck.hawaii.edu/software/inventory/useCases/useCases.php?useCase=crons&subsystem=WWW</p><p>Appendix A: Invocation</p><p>Usage:
 weblint [-h] [-d] [-v] [-i <indexfile>] \
         [-I <file,file,...,file>] \
         [-S <file,file,...,file>] \
         [-l <logfile>] [-t <tag,tag,...,tag>] [-H <file>] \
         [-m <user,user,...,user>] \
         [-O <maxdepth>] [-T] [-M] [-L] [-F]</p><p>Options:
 -h = help
 -d = debug mode
 -v = verbose mode; print info on all files
 -I = ignore; read comma-delimited list of files to ignore when scanning pages for syntax errors. If the file begins with the "@" character, it is interpreted as the name of a file which is a LIST of files to remove (one line per file). 
 -i = index file; treat specified file as index [default: idx-index.html]
 -l = send output to named logfile (HTML format) [default: weblint-<inst>-log.html]
 -t = read comma-delimited list of bad tags [default: none]
 -H = use named file as home page [default: index.html]
 -m = send email output to named users</p><p>Switches:
 -O = check for orphaned pages at maxdepth N, where N is the maximum number of clicks from the homepage required to get there
 -T = check for bad/missing index tags
 -M = check for non-local images
 -L = check for bad links
 -F = check format
 [NOTE: if none are specified, all are enabled by default]</p><p>Output:
 - To logfile (HTML format)
 - To STDOUT (if debug mode enabled)</p><p>Restrictions:
 - Process must run on a machine with perl5.6 or higher
 - Subindices must be linked to home page</p><p>Exit values:
 0 = normal completion
 1 = wrong number of arguments</p><p>Example:
 1) Check for all pathologies using defaults, with output to default logfile:
 weblint</p><p>2) Check for pages which require more than 4 links to reach, bad tags, and bad links, with "??" as the bad tag type and output sent to file mylog.html:
 weblint -TL -O 4 -t "??" -l mylog.html</p><p>Appendix B: Source Code</p><p>The following are complete copies of the source code for the weblint software (csh and Perl scripts) as of 2015-June-07. 
The current location of these scripts is:
/webFiles/www/public/realpublic/inst/bin/
And the logfiles are located in:
/webFiles/www/public/inst/log/*.html
Run_do_weblint:
#!/bin/csh
/webFiles/www/public/realpublic/inst/bin/do_weblint adc
/webFiles/www/public/realpublic/inst/bin/do_weblint deimos
/webFiles/www/public/realpublic/inst/bin/do_weblint esi
/webFiles/www/public/realpublic/inst/bin/do_weblint hires
/webFiles/www/public/realpublic/inst/bin/do_weblint lris
/webFiles/www/public/realpublic/inst/bin/do_weblint mosfire
/webFiles/www/public/realpublic/inst/bin/do_weblint nirc2
/webFiles/www/public/realpublic/inst/bin/do_weblint nirspec
/webFiles/www/public/realpublic/inst/bin/do_weblint osiris
/webFiles/www/public/realpublic/inst/bin/do_weblint common</p><p>Do_weblint:
#!/bin/csh
#+
# do_weblint -- interface to run weblint task
#
# Purpose:
#       Execute weblint for given instrument
#
# Usage
#       do_weblint <dir>
#
# Arguments:
#       dir = name of directory in inst web area
#
# Output:
#       to stdout
#
# Restrictions:
#       should run from kics in order to assure access to write logfile
#
# Invocation:
#       from run_do_weblint
#
# Exit values:
#       0 = normal completion</p><p>#       1 = wrong number of arguments
#
# Example:
#
#-
# Modification history:
#       2009-Jan-16     GDW     Original version
#       2015-Jul-7      MK      Webserver update: Updated header with
#                               /bin/csh. changed /home to
#                               /webFiles. Removed perl dependencies.
#------
set head = `dirname $0`
if ( "$head" == "" || "$head" == ".") set head = $cwd
set cmd = `basename $0`
set usage = "Usage:\n\t$cmd dir"
#setenv PERL5LIB /home/gwirth/lib/perl
#setenv PERL5LIB /home/gwirth/lib/perl</p><p># verify args...
if ( $#argv != 1 ) then
    printf "\a$usage\n"
    exit 1
endif</p><p>#With the change to a new webserver, we can no longer switch machines
# and it must run on the secure webserver. 2014 nov 10 MK
# this must run on a machine with the right perl...
#if ( ! -e /usr/perl5/5.8.4/bin/perl) then
#    printf "[$cmd] re-invoking self on kealia\n"
#    exec rsh kealia $head/$cmd $*
#endif</p><p># parse args..
if ( $#argv >= 1 ) then
    set subdir = $1
    shift
endif</p><p>#set directory:
set dir = /webFiles/www/public/realpublic/inst/$subdir</p><p># verify access to directory...
if ( ! -d $dir ) then
    printf "[$cmd] ERROR: can't find directory $dir -- abort!\n"
    exit 1
endif</p><p># go to directory...
cd $dir</p><p># verify that a suitable target exists...
gmake -n weblint > /dev/null
if ( $status != 0 ) then
    printf "[$cmd] ERROR: Makefile in $dir has no entry for weblint -- abort!\n"
    exit 1
endif</p><p># generate logfile...
gmake --silent weblint
set s = $status
if ( $s != 0 ) then
    printf "[$cmd] ERROR: make command returned non-zero exit status ($s)\n"
endif</p><p># verify access to summary directory...
set dir = /webFiles/www/public/inst/log
if ( ! -d $dir ) then
    printf "[$cmd] ERROR: can't find directory $dir -- abort!\n"
    exit 1
endif</p><p># go to directory...
cd $dir</p><p># verify that a suitable target exists...
gmake -n weblint-index.html > /dev/null
if ( $status != 0 ) then
    printf "[$cmd] ERROR: Makefile in $dir has no entry for weblint index -- abort!\n"
    exit 1
endif</p><p># generate logfile... 
gmake --silent weblint-index.html
set s = $status
if ( $s != 0 ) then
    printf "[$cmd] ERROR: make command for index returned non-zero exit status ($s)\n"
endif</p><p>weblint:
#!/usr/bin/perl -w
#+</p><p># weblint -- sanity-check web pages for various problems
#
#
# Purpose:
#       For each web page in the current directory, check whether:
#       (1) the page is linked to the home page, or is linked to a
#       file which is linked to the homepage, (2) the file includes a
#       valid index entry (3) the file refers to images which are not
#       in the current directory
#
# Usage:
#       weblint [-h] [-d] [-v] [-i <indexfile>] \
#               [-I <file,file,...,file>] \
#               [-x <file,file,...,file>] \
#               [-l <logfile>] [-t <tag,tag,...,tag>] [-H <file>] \
#               [-m <user,user,...,user>] \
#               [-O <maxdepth>] [-T] [-M] [-L] [-F]
#
# Options:
#       -h = help
#       -d = debug mode
#       -v = verbose mode; print info on all files
#       -i = index file; treat specified file as index
#               [default: idx-index.html]
#       -I = ignore; read comma-delimited list of files to
#               ignore when scanning pages for syntax errors. If the
#               file begins with the "@" character, it is interpreted
#               as the name of a file which is a LIST of files to
#               remove (one line per file).
#       -l = send output to named logfile (HTML format)
#               [default: weblint-<inst>-log.html]
#       -t = read comma-delimited list of bad tags
#               [default: none]
#       -H = use named file as home page [default: index.html]
#       -m = send email output to named users
#       -x = extra; read comma-delimited list of additional files to
#               scan for LINKS ONLY (no format or orphan checks). If
#               the file begins with the "@" character, it is
#               interpreted as the name of a file which is a LIST of
#               files to scan (one line per file). 
#               This option is
#               intended to allow you to scan non-standard files such
#               as javascript side menus which may contain links to
#               pages that are not linked to elsewhere in the tree and
#               thus would customarily show up as being orphaned when
#               in fact they are not (e.g., News page)
#
# Switches:
#       -O = check for orphaned pages at maxdepth N, where N is the
#               maximum number of clicks from the homepage required to get</p><p>#               there
#       -T = check for bad/missing index tags
#       -M = check for non-local images
#       -L = check for bad links
#       -F = check format
#       [NOTE: if none are specified, all are enabled by default]
#
# Output:
#       - To logfile (HTML format)
#       - To STDOUT (if debug mode enabled)
#
# Restrictions:
#       - Process must run on a machine with perl5.6 or higher
#       - Subindices must be linked to home page
#
# Exit values:
#       0 = normal completion
#       1 = wrong number of arguments
#
# Example:
#       1) Check for all pathologies using defaults, with output to
#       default logfile:
#               weblint
#
#       2) Check for pages which require more than 4 links to reach,
#       bad tags, and bad links, with "??" as the bad tag type and
#       output sent to file mylog.html:
#               weblint -TL -O 4 -t "??" -l mylog.html
#
#       3) Include additional files (e.g., javascript side menu files):
#               weblint -x @weblint_extra_pages
#       where the file weblint_extra_pages contains:
#               /webFiles/www/public/realpublic/inst/web_docs/SideMenus/common_menu.js
#               /webFiles/www/public/realpublic/inst/web_docs/SideMenus/adc_menu.js
#-
# Notes:
#       The %linkedPages hash is a key component of the program and
#       understanding its structure is important. The hash has two indices:
#       1) The full URL of a web page which is linked to
#       2) an attribute which is either:
#               'type' = "href", etc. 
#               'referrer' = name of the HTML file linked from
#               'shortName' = name of file relative to base URL
#               'status' = 1 if link valid, 0 otherwise
#-
# Modification history:</p><p>#       2009-Jan-06     GDW     Original version
#       2009-Jan-20     GDW     Add option to scan to certain depth with -O
#       2009-Jan-21     GDW     Add -F option
#       2015-Jul-7      MK      Webserver updated: changed /home to
#                               /webFiles. updated perl path.
#------
use strict;
use Getopt::Std;
use File::Basename;
use CGI::Pretty qw(:standard *ul);
use LWP::Simple qw(!head); # do not import "head"; conflicts with CGI.pm
use HTML::LinkExtor;</p><p># declarations...
$X::debug = 0;
my( $startTime) = time();
my( $runTime);
my( %option);
my( $verbose) = 0;
my( $file, $infile, $dir, $path, $fullpath, $linkpath);
my( $homepage) = "index.html";
my( $webpage);
my( %linkedPages);
my( $sub);
my( @indexTags);
my( %ignoreFiles);
my( %extraFiles);
my( $problem, @problems);
my( $tag);
my( %badIndexTag, %goodIndexTag);
my( @badLinks);
my( %linkInfo, $link);
my( $nlinks);
my( $indexPage) = "idx-index.html";
my( $command) = basename( $0);
my( $logfile);
my( $logdir) = "/webFiles/www/public/inst/log";
my( $logURL);
my( $summaryURL) = "http://www.keck.hawaii.edu/inst/log/weblint-index.html";
my( %check);
my( @goodFiles, @badFiles);
my( $status) = "";
my( @recipients);
my( $subject);
my( $startDate) = scalar localtime();
my( $runhost) = `hostname`;
chomp $runhost;
my( $footer);
my( $buf);
my( $topdir) = `pwd`;
chomp $topdir;
my( $subdir);
my( $myStyle);
my( $maxdepth) = 3;
my( %depth);
my( $scanned);
my( $i);</p><p># generate default logfile name...
if( $topdir =~ m|/([^/]+)$|){
    $subdir = $1;
}
if ( -d $logdir and defined($subdir)) {
    $logfile = "$logdir/$command-$subdir-log.html";
} else {
    $logfile = "$subdir-log.html";
}</p><p>#------
# parse command-line args...
#------
getopts("hdvi:I:l:t:H:S:m:O:TMLFx:", \%option);</p><p># check for help... 
if ( $option{"h"}) { &printHelp };</p><p># check for debug mode...
if ($option{"d"}){
    print "debug mode enabled\a\n";
    $X::debug = 1;
    $verbose = 1;
}</p><p># check for verbose mode...
if ($option{"v"}){
    print "verbose mode enabled\n" if $X::debug;
    $verbose = 1;
}</p><p># check for ignore-file...
if( $option{"I"}){
    print qq(ignoring files: $option{"I"}\n) if $X::debug;
    foreach $file (split(",", $option{"I"})){</p><p>        if( $file =~ m|^\@(.*)$|){</p><p>            $infile = $1;</p><p>            # expand "@" files...
            open INFILE, "<$infile" or die "Can't open ignoreFile $file";
            while (<INFILE>){ chomp; $ignoreFiles{$_}++ }
            close INFILE;
        } else {
            $ignoreFiles{$file}++
        }
    }
}</p><p># check for extra files...
if( $option{"x"}){
    print qq(including files: $option{"x"}\n) if $X::debug;
    foreach $file (split(",", $option{"x"})){</p><p>        if( $file =~ m|^\@(.*)$|){</p><p>            $infile = $1;</p><p>            # expand "@" files...
            open INFILE, "<$infile" or die "Can't open extraFile $file";
            while (<INFILE>){ chomp; $extraFiles{$_}++ }
            close INFILE;
        } else {
            $extraFiles{$file}++
        }
    }
}</p><p># check for index file...
if( $option{"i"}){
    $indexPage = $option{"i"};
    print qq(index file: $indexPage\n) if $X::debug;
}</p><p># check for mail option...
if( $option{"m"}){
    print qq(email to: $option{"m"}\n) if $X::debug;
    for (split(",", $option{"m"})){push @recipients, $_}
}</p><p># check for bad index tags. The badIndexTag hash will be set to the
# list of user-specified strings which are not to be accepted as valid
# index tags.
if( $option{"t"}){
    for (split(",", $option{"t"})){$badIndexTag{$_}++}
    print qq(finding bad tags: $option{"t"}\n) if $X::debug;
}</p><p># check for logfile...
if( $option{"l"}){
    $logfile = $option{"l"};
    print qq(logfile: $logfile\n) if $X::debug;
}</p><p># check for alternate home page... 
if( $option{"H"}){
    $homepage = $option{"H"};</p><p>    print qq(homepage: $homepage\n) if $X::debug;
}</p><p># check scanning options...
if( $option{"O"}){
    $check{"Orphans"} = 1;
    $maxdepth = $option{"O"};
    print qq(maxdepth: $maxdepth\n) if $X::debug;
}
if( $option{"T"}){ $check{"Tags"} = 1}
if( $option{"M"}){ $check{"Images"} = 1}
if( $option{"L"}){ $check{"Links"} = 1}
if( $option{"F"}){ $check{"Format"} = 1}</p><p># if NO checks are selected, turn them all ON...
unless ( scalar(keys(%check)) > 0){
    $check{"Orphans"} = 1;
    $check{"Tags"} = 1;
    $check{"Images"} = 1;
    $check{"Links"} = 1;
    $check{"Format"} = 1;
}</p><p># get current directory...
if( $topdir =~ m|^/webFiles/www/public/(.*)$| ){
    $X::baseURL = "http://www.keck.hawaii.edu/$1/" ;
} else {
    die "Can't convert this directory to a Keck URL: $topdir\n"
}</p><p># define styles...
$myStyle=<<END;
BODY {
    background: white;
}
h1 {
    display:block;
    background-color: steelblue;
    padding:0.2em 0.7em;
    height:auto;
    text-align: center;
}
h2 {
    display:block;
    background-color: lightsteelblue;
    padding:0.2em 0.7em;
    height:auto;
}
a:link { color: steelblue }
a:visited { color: saddlebrown }
a:active { color: red }</p><p>END
;</p><p># initialize logfile...
open LOG, ">$logfile" or die "ERROR: can't open logfile $logfile";
print LOG header,
    start_html( -title=>"$command logfile for $subdir directory",
                -style=>{-code=>$myStyle}),
    h1("Weblint logfile for $subdir directory");
print LOG h2('Parameters'),
    ul(
        li(b("Base: "), a({href=>$X::baseURL},$X::baseURL),),
        li(b("Home Page:"), $homepage),
        li(b("Index Page:"), $indexPage),
        li(b("Ignored Files: "), scalar(keys(%ignoreFiles)) > 0 ? join(",", keys %ignoreFiles) : "(none)"),
        li(b("Extra Files: "), scalar(keys(%extraFiles)) > 0 ? join(",", keys %extraFiles) : "(none)"),
        li(b("Bad index tags: "), scalar(keys(%badIndexTag)) > 0 ? 
            join(",", keys %badIndexTag) : "(none)"),
        li(b( "Check for orphaned pages: "), $check{"Orphans"} ? "yes" : "no"),
        li(b( "Check for bad index tags: "), $check{"Tags"} ? "yes" : "no"),
        li(b( "Check for bad links: "), $check{"Links"} ? "yes" : "no"),
        li(b( "Check for non-local images: "), $check{"Images"} ? "yes" : "no"),
        li(b( "Check for current format: "), $check{"Format"} ? "yes" : "no"),
        li(b("Max Depth: "), $maxdepth),
        li(b("Debug mode: "), $X::debug ? "on" : "off" ),
        li(b("Verbose mode: "), $verbose ? "on" : "off" ),
        li(b("Logfile: "), $logfile),
        li(b("Date: "), $startDate),
        li(b("User: "), $ENV{"USER"}),
    );</p><p># get list of all links in home page...
print "Reading links from home page $homepage\n" if $X::debug;
$status .= h2("Home page and Subindices") .
    "Note: pages linked to these pages are not considered orphans.";
$status .= start_ul . li($homepage);
getLinkInfo( $homepage, \%linkedPages, $X::baseURL, 'href');
$depth{$homepage} = 1;</p><p># we define a minimal entry in linkedPages for the homepage to prevent
# it from later being identified as an unlinked page...
$linkedPages{$X::baseURL . $homepage}{'shortName'} = $homepage;</p><p># also include links from the menu pages...
foreach $file (keys %extraFiles){</p><p>    getLinkInfo( $file, \%linkedPages, $X::baseURL, 'href');
    ## $depth{$file} = 1;
}</p><p># also scan for links on all pages linked on the home page (except the
# index file)...
$i = 1;
while (1) {</p><p>    # increment depth...
    $i++;
    print "Scanning for web pages at depth $i\n" if $X::debug;
    $scanned = 0;</p><p>    for ( keys %linkedPages){</p><p>        # get name of webpage...
        $webpage = $linkedPages{$_}{'shortName'};</p><p>        # skip pages already done...
        if( defined( $depth{$webpage})){
            print "Skipping previously scanned page $webpage\n" if $X::debug;
            next;
        }</p><p>        # skip non-html pages... 
        unless( $webpage =~ m|\.htm[l]?$|){
            print "Skipping non-html page $webpage\n" if $X::debug;
            next;
        }</p><p>        # skip non-local pages...
        unless( basename( $webpage) eq $webpage){
            print "Skipping non-local page $webpage\n" if $X::debug;
            next;
        }</p><p>        # read links...
        print "Reading links from web page $webpage\n" if $X::debug;
        $status .= li($webpage);
        getLinkInfo( $webpage, \%linkedPages, $X::baseURL, 'href' );
        $depth{$webpage} = $i;
        $scanned++;
    }</p><p>    # quit loop if no new files were found...
    last if $scanned == 0;</p><p>}
$status .= end_ul;</p><p># print link list...
if ($X::debug){
    print "List of links found on home page and subpages:\n";
    for ( sort keys %linkedPages) {
        $webpage = $linkedPages{$_}{'shortName'};
        print "\t$webpage is linked from $linkedPages{$_}{'referrer'}\n";
    }
}</p><p># get list of all .html files in this directory...
## my( @infiles) = split( /\s+/, `find . -name '*.html' -print`);
my( @infiles) = split( /\s+/, `ls -1 *.html`);</p><p>#------
# loop over files...
#------
$status = h2('Page Status') ;
if ( $verbose){
    $status .= "Showing both good and bad pages.";
} else {
    $status .= "Showing only bad pages.";
}
$status .= start_ul;
foreach $path( @infiles){</p><p>    chomp $path;
    print "processing file $path\n" if $X::debug;</p><p>    # verify existence...
    unless( -r $path){
        warn "WARNING: unable to open file $path" if $X::debug;
        next
    }</p><p>    # skip this file if it is a symlink to a local file which is already
    # being scanned...
    if( -l $path){
        $linkpath = readlink($path);
        print "basename=" . basename( $linkpath) . " linkpath=" . $linkpath . "\n" if $X::debug;
        if( basename( $linkpath) eq $linkpath){
            warn "WARNING: $path links to local file $linkpath; skipping"</p><p>                if $X::debug;
            next
        }
    }</p><p>    # clear problems array...
    @problems = ();</p><p>    # process filename...
    $fullpath = $X::baseURL . 
        $path;
    $infile = basename( $path);
    $dir = dirname( $path);</p><p>    # skip ignore files...
    if (defined($ignoreFiles{$path})){
        print "$path: skipping file\n" if $X::debug;
        next;
    }</p><p>    #------
    # check for page not listed in home page and friends...
    #------
    if( $check{"Orphans"}){
        if (not defined $linkedPages{$fullpath}){
            print "\tno link exists to page ($fullpath)\n" if $X::debug;
            push @problems, b("orphan; file not linked anywhere");
            $check{"Orphans"}++;
        } elsif ( $depth{$path} > $maxdepth){
            $buf = sprintf( "more than %d clicks required to reach page %d\n",
                            $maxdepth-1, ($depth{$path}));
            print "\t$buf\n" if $X::debug;
            push @problems, $buf;
            $check{"Orphans"}++;
        }
    }</p><p>    #------
    # check for bad index tags...
    #------</p><p>    if( $check{"Tags"}){</p><p>        # get index entries from file...
        @indexTags = getIndexTags( $path);
        if( @indexTags){</p><p>            print "\tindex tags:\n" if $X::debug;</p><p>            # warn about bad tags...
            foreach $tag (@indexTags){
                print "\t\t$tag" if $X::debug;
                if( defined( $badIndexTag{$tag})){
                    print " [BAD]" if $X::debug;
                    push @problems, b("bad index entry: ") . $tag;
                    $check{"Tags"}++;
                } elsif (defined( $goodIndexTag{$tag})){
                    print " [DUPLICATE]" if $X::debug;
                    push @problems, b("duplicate index entry: ") . $tag .
                        " (" . a({href=>"$X::baseURL$path"},$path) . ")";
                    $check{"Tags"}++;
                } else {
                    $goodIndexTag{$tag} = $path;
                }
                print "\n" if $X::debug;
            }
        } else {
            print "\tno index tags\n" if $X::debug;
            push @problems, b("no index entry");
            $check{"Tags"}++
        }
    }</p><p>    #------
    # check for bad links... 
    #------
    undef %linkInfo;
    getLinkInfo( $path, \%linkInfo, $X::baseURL);
    if( $check{"Links"}){</p><p>        $nlinks = 0;
        foreach $link ( keys %linkInfo){
            $nlinks++;
            if ($X::debug){
                print "\tlinks:\n" if $nlinks==1;
                print "\t\t$link";
                print " status=", $linkInfo{$link}{'status'};
                print "\n";
            }
            if( defined($linkInfo{$link}{'status'}) and
                $linkInfo{$link}{'status'} == 0 and
                not &ignoreLink($link) ){</p><p>                push @problems, b("bad link: ") . a({href=>$link},$link);
                $check{"Links"}++;
            }
        }
    }</p><p>    #------
    # check for non-local images...
    #------
    if( $check{"Images"}){
        foreach $link ( keys %linkInfo){
            $_ = $linkInfo{$link}{'shortName'};
            if ( m/jpg$/ or m/jpeg$/ or m/gif$/ or m/png$/ ){
                unless( basename( $_) eq $_){
                    print "\tnon-local image: $_\n" if $X::debug;
                    push @problems, b("non-local image: ") . $_;
                    $check{"Images"}++;
                }
            }
        }
    }</p><p>    #------
    # check for bad format...
    #------
    if( $check{"Format"}){
        unless( checkFormat( $path) ){
            print "\tbad format\n" if $X::debug;
            push @problems, b("bad format");
            $check{"Format"}++;
        }
    }</p><p>    # report troubles...
    if( scalar @problems > 0){
        $status .= li(a({href=>"$X::baseURL$path"},$path));
        $status .= start_ul;
        foreach $problem(@problems){ $status .= li($problem)}
        $status .= end_ul;
        push @badFiles, $path;
    } else {
        $status .= li(a({href=>"$X::baseURL$path"},$path), " [OK]") if $verbose;
        print "\tno problems\n" if $X::debug;
        push @goodFiles, $path;
    }
}</p><p>$status .= end_ul;</p><p># compute good and bad fractions...
my( $goodFiles) = scalar( @goodFiles);
my( $badFiles) = scalar( @badFiles);
my( $numFiles) = $goodFiles + $badFiles;
my( $goodFrac) = sprintf( "%d", 100.*$goodFiles/ $numFiles);
my( $badFrac) = sprintf( "%d", 100.*$badFiles/$numFiles);</p><p># print problem synopsis... 
print LOG h2("Synopsis"),
    ul(
        li( b("Number of good pages: "), scalar @goodFiles, " ($goodFrac%)"),
        li( b("Number of bad pages: "), scalar @badFiles, " ($badFrac%)"),
        li( b("Number of orphaned pages: "), $check{"Orphans"} ? $check{"Orphans"}-1 : "(not checked)"),
        li( b("Number of bad index tags: "), $check{"Tags"} ? $check{"Tags"} -1 : "(not checked)"),
        li( b("Number of bad links in pages: "), $check{"Links"} ? $check{"Links"} -1 : "(not checked)"),
        li( b("Number of non-local images: "), $check{"Images"} ? $check{"Images"} -1 : "(not checked)"),
        li( b("Number of pages not using new format: "), $check{"Format"} ? $check{"Format"} -1 : "(not checked)"),
    ),
    $status;</p><p># close logfile...
$runTime = time() - $startTime;
$footer = "Generated by $command on host $runhost on $startDate ($runTime s)";
print LOG p, hr, $footer, end_html;
close LOG;
print "Logfile $logfile closed.\n" if $X::debug;</p><p>#------
# send email if requested...
#------
if ( scalar @recipients > 0){
    print "Preparing to send email report\n" if $X::debug;</p><p>    # get URL of logfile...
    unless ( $logfile =~ m|^/webFiles/www/public/(.*)$|){
        die "Can't generate URL for logfile $logfile";
    }
    $logURL = "http://www.keck.hawaii.edu/$1";
    print "logURL = $logURL\n" if $X::debug;</p><p>    # set message subject appropriately based on whether errors occurred...
    if ( scalar @badFiles > 0){</p><p>        $subject = "$command scanning $subdir found " . scalar @badFiles . " bad files";
    } else {
        $subject = "$command scanning $subdir all pages OK"
    }
    print "subject = $subject\n" if $X::debug;</p><p>    # build command to execute...
    open MAIL, qq(| /bin/mailx -s "$subject" ) . join(" ", @recipients)
        or die "unable to send mail";</p><p>    # send mail... 
    print MAIL "$subject\n", "\n", "topdir: $topdir\n", "\n", "Synopsis:\n";
    print MAIL "\tNumber of good pages: ", scalar @goodFiles, " ($goodFrac%)\n";
    print MAIL "\tNumber of bad pages: ", scalar @badFiles, " ($badFrac%)\n";
    print MAIL "\tNumber of orphaned pages: ", $check{"Orphans"} ? $check{"Orphans"}-1 : "(not checked)", "\n";
    print MAIL "\tNumber of bad index tags: ", $check{"Tags"} ? $check{"Tags"} -1 : "(not checked)", "\n";
    print MAIL "\tNumber of bad links in pages: ", $check{"Links"} ? $check{"Links"} -1 : "(not checked)", "\n";
    print MAIL "\tNumber of non-local images: ", $check{"Images"} ? $check{"Images"} -1 : "(not checked)", "\n";
    print MAIL "\tNumber of pages not in new format: ", $check{"Format"} ? $check{"Format"} -1 : "(not checked)", "\n";
    print MAIL "\n";
    print MAIL "Please see logfile at:\n", "\t$logURL\n";
    print MAIL "and scorecard at:\n", "\t$summaryURL\n";
    print MAIL "\n", "-" x 72, "\n";
    print MAIL "$footer\n";
    close MAIL;
}
print "Execution completed.\n" if $X::debug;</p><p>#------
# return list of links from file...
#------
sub getLinks {
    my( $file, $array) = @_;</p><p>    my( @element, @links, $linkarray, $elt_type, $attr_name, $attr_value);
    my( $parser);</p><p>    $parser = HTML::LinkExtor->new();
    $parser->parse_file($file);
    @links = $parser->links;
    foreach $linkarray( @links){
        @element = @$linkarray;
        $elt_type = shift @element;
        while( @element){
            ($attr_name, $attr_value) = splice( @element, 0, 2);</p><p>            # skip internal name references...
            next if $attr_value =~ m|^#|;</p><p>            # add element to list of found webpages...
            $$array{ $attr_value} = $file;
        }
    }
}</p><p>#------
# return list of index entries from file...
#------
sub getIndexTags {
    my( $file) = @_;
    my( %seen);</p><p>    undef %seen;
    open FILE, $file or die "can't open file $file";
    while( <FILE> ){</p><p>        # test pattern match. Note the use of parens to store the
        # contents of the quoted string into the variable $1 used
        # below... 
        if ( /<\!\-\-.*index=\"(.*)\".*\-\->/ ) {
            $seen{$1} = $file;
        }
    }
    close FILE;
</p><p>    # put matched topics in array...
    return sort keys %seen;
}
</p><p># #------
# # return list of dead links...
# #------
# sub checkLinks {
#     my( $file) = @_;
#     my( @element, @links, $linkarray, $elt_type, $attr_name, $attr_value);
#     my( $parser);
#     my( $url) = $X::baseURL . $file;
</p><p>#     $parser = HTML::LinkExtor->new( undef, $url);
#     $parser->parse(get($url));
#     @links = $parser->links;
#     foreach $linkarray( @links){
#         @element = @$linkarray;
#         $elt_type = shift @element;
#         while( @element){
#             ($attr_name, $attr_value) = splice( @element, 0, 2);
#             if( $attr_value->scheme =~ /\b(ftp|https?|file)\b/) {
#                 print "name=$attr_name value=", head($attr_value), "\n";
#             }
#         }
#     }
#     return undef
# }
</p><p>#------
sub getLinkInfo {
#------
# Purpose:
#       return a hash with info on links in file.
#
# Inputs:
#       $file:    [I] name of the web page to scan for links
#       $info:    [I/O] reference to a hash to contain results
#       $baseURL: [I] URL corresponding to the current directory
#       $type:    [I] (optional) type of links to obtain (e.g., 'href'
#                 will obtain only href links)
#
# Source: Perl Cookbook examples 20-2 and 20-5
#------
</p><p>    use File::Basename;
</p><p>    my( $file, $info, $baseURL, $type) = @_;
    my( @element, @links, $linkarray, $elt_type, $attr_name, $attr_value);
    my( $parser);
    my( $shortName);
    my( $foo);
    my( $jsfile);
</p><p>    # check access to file...
    unless( -e $file){
        warn "WARNING: can't open file '$file'" if $X::debug;
        return;
    }
</p><p>    # if this is a menu file (*.js), then convert to plain html and scan...
    my($base, $dir, $extn) = fileparse($file, qr/\.[^.]*/);
    $extn = "" unless defined($extn);
    if ( $extn eq '.js' ) {
        $jsfile = "temp_$base.html";
        &convertJavaScriptToHtml( $file, $jsfile);
        $file = $jsfile;
    }
</p><p>    # form url...
    my( $url) = $baseURL . $file;
</p><p>    # get links from file...
    $parser = HTML::LinkExtor->new( undef, $url);
    $parser->parse( LWP::Simple::get($url));
</p><p>    # loop over links in file...
    @links = $parser->links;
    foreach $linkarray( @links){
        @element = @$linkarray;
        $elt_type = shift @element;
        while( @element){
            ($attr_name, $attr_value) = splice( @element, 0, 2);
</p><p>            # skip things of wrong type...
            next if (defined($type) and $attr_name ne $type);
            print "\t$attr_name \-\> $attr_value\n" if $X::debug;
</p><p>            # record type of reference (href, src, etc.)...
            $$info{ $attr_value}{'type'} = $attr_name;
</p><p>            # record name of referring page...
            $$info{ $attr_value}{'referrer'} = $file;
</p><p>            # shorten and store the name...
</p><p>            $shortName = $attr_value;
            $shortName =~ s/$baseURL//;
            $$info{ $attr_value}{'shortName'} = $shortName;
</p><p>            # for valid URLs, check status...
            if( $attr_value->scheme =~ /\b(https?|file)\b/) {
</p><p>                $$info{ $attr_value}{'status'} = LWP::Simple::head($attr_value) ? 1 : 0;
            } elsif( $attr_value->scheme =~ /\b(ftp)\b/) {
</p><p>                # found I had to do this because the "head" method does not
                # work for FTP...
                $$info{ $attr_value}{'status'} = LWP::Simple::get($attr_value) ? 1 : 0;
</p><p>            } else {
                # we end up here with mailto and javascript entries...
                warn "WARNING unsupported scheme " . $attr_value->scheme . " for file " . $attr_value . "\n" if $X::debug;
            }
        }
    }
</p><p>    # remove temp file...
    if(defined $jsfile){ unlink $jsfile};
</p><p>    return undef
}
</p><p>#------
# return a list of links in file...
#------
sub getLinkList {
    my( $info, $list) = @_;
    return keys %$info;
}
</p><p>#------
sub printHelp {
#------
# Print out header
#------
    my( $command) = q(perl -ne 'if(m/^#\+/){$a=1;next LINE};exit if /^#\-/;next LINE unless $a==1; s/^#// ; print' ) .
        $0;
    my( $pager) = $ENV{PAGER} || "more";
    # exec only returns if it fails, so test its return value directly...
    exec( "$command | $pager" ) or die "could not execute command $command";
}
</p><p>#------
# check file format by verifying that it loads the JavaScript format
# script (include_main.js).
# Returns 1 if it finds the specified string, else 0...
#------
sub checkFormat {
    my( $infile) = @_;
    my( @lines);
    my( $status) = 0;
    my( $string) = q(src="../web_docs/scripts/include_main.js");
</p><p>    # open file...
    open INFILE, "<$infile" or die "can't open input file $infile";
</p><p>    # loop over lines...
    while(<INFILE>){
        if (m|$string|){
            $status = 1;
            last;
        }
    }
</p><p>    # close file...
    close INFILE;
</p><p>    # return status...
    return $status;
}
</p><p>#------
# convert javascript file to HTML...
#------
sub convertJavaScriptToHtml{
    my( $infile, $outfile) = @_;
    my( $contents);
</p><p>    # open input file...
    open INFILE, "<$infile" or die "can't open input file $infile";
</p><p>    # slurp file; "local" confines the change to $/ to this subroutine so
    # that line-by-line reads elsewhere in the script are not affected...
    local $/;
    $contents = <INFILE>;
</p><p>    # close file...
</p><p>    close INFILE;
</p><p>    # open output file...
    open OUTFILE, ">$outfile" or die "can't open output file $outfile";
</p><p>    # print header...
    print OUTFILE <<EOF
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Javascript</title>
</head>
</p><p><body>
EOF
    ;
</p><p>    # find matches...
    while( $contents =~ /document\.write\(\'(.*)\'\)/g){
        print OUTFILE "$1\n";
    }
</p><p>    # print footer...
    print OUTFILE <<EOF
</body>
</html>
EOF
    ;
</p><p>    # close file...
    close OUTFILE;
</p><p>}
</p><p>#------
sub ignoreLink {
#------
# Given a URL, return 1 if it should be ignored and 0 if not.
#------
</p><p>    my( $link) = shift;
    my( $l);
</p><p>    # create a list of links which we ignore when checking for bad links
    # because they always cause problems...
    my( @ignoreLink) = (
        'https://www2.keck.hawaii.edu/inst/PILogin/login.php',
</p><p>        'http://en.wikipedia.org',
        'https://docs.google.com'
    );
</p><p>    # if the passed link matches any of the ignorable links, we return 1
    # (\Q...\E makes the URL match literally rather than as a regex)...
    foreach $l (@ignoreLink) {
        if ( $link =~ m/\Q$l\E/ ){
            print "ignoring link $link because it matched $l\n" if $X::debug;
            return 1;
        }
    }
</p><p>    return 0;
}
</p><p>Weblint_summary:
#!/usr/bin/perl -w
#+
# weblint-summary -- summarize results from weblint
#
# Purpose:
#       Create a summary of weblint output
#
# Usage:
#       weblint-summary
#
# Arguments:
#       None
#
# Output:
#       to stdout
#
# Restrictions:
#       Should be run under the kics account in order to create file
#
# Exit values:
#       0 = normal completion
#       1 = wrong number of arguments
#
# Example:
#       1) Generate summary:
#               weblint-summary
#-
# Modification history:
#       2011-Aug-17     GDW     Original version
#------
</p><p># load packages...
use strict;
use Getopt::Std;
use CGI::Pretty qw(:standard *ul);
use File::Basename;
</p><p># declarations...
my( $runTime, $myStyle, $subdir, $infile);
my( $n_good, $pct_good);
my( $logfile) = '/webFiles/www/public/inst/log/weblint-index.html';
my( $command) = basename( $0);
my( $runhost) = `hostname`;
chomp $runhost;
my( $startDate) = scalar localtime();
my( $startTime) = time();
my( $footer);
my( $buf);
my( $null) = qq( );
my( $inst, $n_all, $n_bad, $pct_bad, $n_orphans, $n_bad_index_tags,
    $n_bad_links, $n_non_local_images, $n_bad_format);
my( %rec, @lines, @ordered_lines, $line);
my( $color);
</p><p># define styles...
$myStyle=<<END;
BODY {
    background: white;
}
h1 {
    display:block;
    background-color: steelblue;
    padding:0.2em 0.7em;
    height:auto;
    text-align: center;
}
h2 {
    display:block;
    background-color: lightsteelblue;
    padding:0.2em 0.7em;
    height:auto;
}
a:link { color: steelblue }
a:visited { color: saddlebrown }
a:active { color: red }
END
;
</p><p># initialize logfile...
open LOG, ">$logfile" or die "ERROR: can't open logfile $logfile";
print LOG header,
    start_html( -title=>"Weblint scorecard",
                -style=>{-code=>$myStyle}),
    h1("Weblint Scorecard");
</p><p># initialize table...
print LOG '<center>',
    qq(<table border="1" cellpadding="3">),
    thead(
        Tr( {-bgcolor=>'lightsteelblue'},
            th( 'Inst'),
            th( 'Total Pages'),
            th( 'Good Pages'),
            th( 'Good %'),
            th( 'Bad Pages'),
            th( 'Bad %'),
            th( 'Orphans'),
            th( 'Bad Index Tags'),
            th( 'Bad Links'),
            th( 'Non-Local Images'),
            th( 'Bad Format')
        )),
    qq(<tbody>);
</p><p># get list of all weblint log files in this directory...
my( @infiles) = split( /\s+/, `ls -1 weblint-*-log.html`);
</p><p>#------
# loop over files...
#------
foreach $infile ( @infiles){
</p><p>    if ( $infile =~ m/weblint-(.+)-log.html/ ){ $inst = $1 }
</p><p>    # slurp file...
    open INFILE, "<$infile" or die "can't open infile $infile";
    $buf = "";
    while (<INFILE>){
        $buf .= $_;
    }
    close INFILE;
</p><p>    ## print "\n\ninfile is $buf\n\n";
</p><p>    # grab data...
    if ( $buf =~ m|Number of good pages\:\s*\<\/b\>\s*(\d+)\s*\((\d+)\%\)|) {
        $n_good = $1;
        $pct_good = $2;
</p><p>    } else {
        $n_good = $null;
        $pct_good = $null;
    }
</p><p>    # grab data...
    if ( $buf =~ m|Number of bad pages\:\s*\<\/b\>\s*(\d+)\s*\((\d+)\%\)|) {
        $n_bad = $1;
        $pct_bad = $2;
    } else {
        $n_bad = $null;
        $pct_bad = $null;
    }
</p><p>    if ( $buf =~ m|Number of orphaned pages\:\s*\<\/b\>\s*(\d+)\s*|){
        $n_orphans = $1;
    } else {
        $n_orphans = $null;
    }
</p><p>    if ( $buf =~ m|Number of bad index tags\:\s*\<\/b\>\s*(\d+)\s*|){
        $n_bad_index_tags = $1;
    } else {
        $n_bad_index_tags = $null;
    }
</p><p>    if ( $buf =~ m|Number of bad links in pages\:\s*\<\/b\>\s*(\d+)\s*|){
        $n_bad_links = $1;
    } else {
        $n_bad_links = $null;
    }
</p><p>    if ( $buf =~ m|Number of non-local images\:\s*\<\/b\>\s*(\d+)\s*|){
        $n_non_local_images = $1;
    } else {
        $n_non_local_images = $null;
    }
</p><p>    if ( $buf =~ m|Number of pages not using new format\:\s*\<\/b\>\s*(\d+)\s*|){
        $n_bad_format = $1;
    } else {
        $n_bad_format = $null;
    }
</p><p>    # print "inst=$inst\n";
</p><p>    # print "n_good=$n_good pct_good=$pct_good\n";
    # print "n_bad=$n_bad pct_bad=$pct_bad\n";
    # print "n_orphans=$n_orphans\n";
    # print "n_bad_index_tags=$n_bad_index_tags\n";
    # print "n_bad_links=$n_bad_links\n";
    # print "n_non_local_images=$n_non_local_images\n";
    # print "n_bad_format=$n_bad_format\n";
</p><p>    if ( $n_good ne $null and $n_bad ne $null ){
        $n_all = $n_good + $n_bad;
    } else {
        $n_all = $null;
    }
</p><p>    # save score...
    if ( $pct_good ne $null){
        $rec{score} = $pct_good;
        if ( $pct_good >= 100 ) {
            $color = "#C8CFBD";
        } elsif ( $pct_good >= 50 ) {
            $color = "#E9D6AB";
        } else {
            $color = "#D9BFA8";
        }
    } else {
        $rec{score} = -1;
        $color = "#CCB8AD";
    }
</p><p>    $rec{text} = Tr( {-bgcolor=>$color, -align=>'CENTER'},
        td( a({href=>$infile},$inst)),
        td( $n_all),
        td( $n_good),
        td( "$pct_good%"),
        td( $n_bad),
        td( "$pct_bad%"),
        td( $n_orphans),
        td( $n_bad_index_tags),
        td( $n_bad_links),
        td( $n_non_local_images),
        td( $n_bad_format)
    );
    push @lines, {%rec};
}
</p><p># sort lines...
</p><p>@ordered_lines = sort byscore @lines;
foreach $line (@ordered_lines) {
    print LOG $$line{text};
}
sub byscore {
    $b->{score} <=> $a->{score};
}
</p><p># end table...
print LOG qq(</tbody></table></center>);
</p><p># shut down...
$runTime = time() - $startTime;
$footer = "Generated by $command on host $runhost on $startDate ($runTime s)";
print LOG p, hr, $footer, end_html;
close LOG;</p>