May 09, 2010 Archives

09-05-2010 17:19

Poor man's Specto

I loved the idea of Specto when I stumbled upon it. But since I use CentOS at work and Mandriva at home, I did not have access to it in the repositories. So I simply made my own cron job to monitor a website for changes.

My idea was simple: grab the default home page, store it, then at the next time interval grab it again and do an md5sum comparison between the new page and the previous one. Then I found that wget has a time-stamping switch [-N Turn on time-stamping.]. Using that, I came up with the cron job command below, which checks whether a page has changed using its modification timestamp.
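For comparison, the checksum idea I started with might look something like this. It is only a sketch: page_changed is a made-up helper name, and it assumes the previous and the freshly fetched copy are kept as two separate files.

```shell
# Sketch of the checksum idea: succeed (exit 0) when two saved copies differ.
# page_changed is a hypothetical helper, not part of the cron job in this post.
page_changed() {
    # Read each file on stdin so md5sum prints "<hash>  -"; keep only the hash.
    old_sum=$(md5sum < "$1" | cut -d' ' -f1)
    new_sum=$(md5sum < "$2" | cut -d' ' -f1)
    [ "$old_sum" != "$new_sum" ]
}
```

You would fetch the page to a second file name each hour, run page_changed on the old and new copies, and then move the new copy over the old one. The time-stamping switch avoids that bookkeeping, which is why the cron job below uses it instead.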

01      *       *       *       *       cd /home/david/website_diffs/wsp && wget -N http://wspirates.com/ 2>&1 |grep -q "o newer" || printf "Wspirates web page appears to have updated.\n\nSuggest you check it out.\n\n"|mail -s "Pirates page updated." david@email.com

To break this down

We have this run every hour, at one minute past the hour. We first create the folder /home/david/website_diffs/ and then create a folder in there for each web page: wsp in this example.

  • We change to this folder.
  • We grab the current page using wget with the -N switch on. This checks whether a file of the same name in the working directory has the same or a newer timestamp than the copy on the server. If it does, wget does not download the page; it prints the message "Server file no newer than local file `index.html' -- not retrieving." and exits with a 0 status. The 2>&1 in the command sends wget's error stream to standard output as well, so that all of its output can be piped over to grep.
  • We grep silently (-q) for "o newer", which is a way to search for the message above. If we find this, meaning the page is not newer, we end there.
  • If we do not find this message, grep exits with a non-zero status, so the 'or' (||) control operator kicks in and we run the ensuing command.
  • The final command simply emails someone about our discovery.

So quite simple really. It doesn't work as well as Specto, as Specto allows for a percentage change option, which is good for sites with advertising. This could possibly be done by using diff to compare the previously downloaded page and the new one [every hour or so] and working out the percentage of lines that have changed compared to the whole page. But I did not need this, as all I want to know is whether a page has been updated. I hope this is useful for someone.
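For anyone who does want the percentage idea, a rough sketch with diff could look like the following. I have not run this as part of my own cron job. percent_changed is a made-up name, and it assumes the previous copy was saved under a separate file name before wget fetched the new one, and that both files end with a newline.

```shell
# percent_changed OLD NEW: print the whole-number percentage of lines in NEW
# that diff reports as added or altered relative to OLD.
# Hypothetical helper for illustration only.
percent_changed() {
    # Total number of lines in the new copy (tr strips padding some wc versions add).
    total=$(wc -l < "$2" | tr -d ' ')
    # An empty new page would divide by zero, so report 0 and stop.
    [ "$total" -gt 0 ] || { echo 0; return; }
    # In diff's default output, lines taken from NEW start with ">".
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    changed=$(diff "$1" "$2" | grep -c '^>' || true)
    echo $(( changed * 100 / total ))
}
```

The cron job could then send the email only when the number is above some threshold, say 10, which is a figure you would have to tune for each site.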


Posted by DaveQB | Permanent Link | Categories: IT