User:Kellen/Scripts

Scripts I've written for manipulating Wikibooks data. Be sure to copy from the edit box, not the rendered page, as some HTML entities are used.

No guarantee of efficiency, correctness, or beauty. A guarantee of hacky sed regular expressions.

Isolate cookbook wantedpages:
#!/bin/bash

lynx -source -dump "http://en.wikibooks.org/w/index.php?title=Special:Wantedpages&limit=5000&offset=0" | \
    grep Cookbook | \
    grep edit | \
    grep -v "Talk:Cookbook" | \
    grep -v "Cookbook_talk:" | \
    grep -v " " | \
    sed 's/<[/li]\{2,3\}>//g' | \
    sed 's/_/ /g' | \
    sed 's/(]*>\([^<]*\)[^)]*/\2/g' | \
    sed 's/&amp;target=/\//g' | \
    sed '1~2N;s/\n/ /g' | \
    sed 's/^/*/g'
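The last two sed steps of the pipeline are easier to follow on toy input. This is a minimal sketch, assuming GNU sed (the `1~2` address is a GNU extension) and made-up sample lines; `join_pairs` is a hypothetical helper name, not part of the script above.

```shell
#!/bin/bash
# join_pairs is a hypothetical helper: it reads lines on stdin, joins each
# odd line with the line after it, and prefixes the result with a wiki bullet.
join_pairs() {
    # 1~2N    : on lines 1, 3, 5, ... append the next line to the pattern space
    # s/\n/ / : turn the embedded newline into a space
    # s/^/*/  : put a "*" in front of each joined line
    sed '1~2N;s/\n/ /g' | sed 's/^/*/g'
}

# Made-up input: a title on one line, a link count on the next.
printf '%s\n' 'Cookbook:Foo' '(2 links)' 'Cookbook:Bar' '(1 link)' | join_pairs
# prints:
# *Cookbook:Foo (2 links)
# *Cookbook:Bar (1 link)
```

If the input has an odd number of lines, GNU sed prints the final unpaired line on its own rather than dropping it.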

Get redirects from cookbook allpages:
#!/bin/bash

# filename prefix
FILE=cookbookredirs
# servername
SERVER="http://en.wikibooks.org"
# start page
PAGE="/w/index.php?title=Special%3AAllpages&from=&namespace=102"

# fetch all pages
num=0
while [ -n "$PAGE" ]
do
    num=$(($num+1))
    echo "Getting page number ${num}, ${PAGE}"
    wget -q -O "${FILE}.${num}" "${SERVER}${PAGE}"

    # get the next page url
    PAGE=`grep -o 'Next page ([^)]*)' "${FILE}.${num}" | grep -o 'href="[^"]*"' | grep -o '"[^"]*"' | sed 's/"//g' | sed 's/&amp;/\&/g'`
    #if [ -n "$PAGE" ]; then
    #  echo "Next page is ${PAGE}"
    #fi
done

echo "Got ${num} files."

# strip each file down to only redirects
i=0
while [ $i -lt "$num" ]
do
    i=$(($i+1))
    FN="${FILE}.${i}"
    if [ ! -f $FN ]; then
        echo "Can't find ${FN}"
        break
    fi

    # add a marker to beginning of page list
    sed -i 's/ /\nBREAKHERE\n/' $FN
    # kill everything above page list marker
    sed -i '0,/BREAKHERE/d' $FN
    # find end of page list and kill everything after
    sed -i 's/<\/table>/\n/1' $FN
    sed -i '2,$d' $FN
    # add a linebreak after each item, replacing /td
    sed -i 's| |\n|g' $FN
    # remove all remaining tr's td's and ending tr's
    sed -i 's|<[trd/]\{2,3\}>||g' $FN
    # strip down to just title
    sed -i 's/.*$/\1\n/g' $FN
    # only get redirects
    sed -i -n '/allpagesredirect/p' $FN
    sed -i 's/ \(.*\)$/* \1/g' $FN
done

# Join files together
i=0
CATFILES=""
while [ $i -lt "$num" ]
do
    i=$(($i+1))
    CATFILES="${CATFILES} ${FILE}.${i}"
done
FINAL="${FILE}.final"
cat $CATFILES > $FINAL
rm $CATFILES

# add a blank line so we can easily find where to put columns
lines=`wc -l < $FINAL`
col=`expr $lines / 3`
pattern="${col}~${col}G"
sed -i $pattern $FINAL

echo "Resultant file is ${FINAL}"
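The column-marking step at the end can be tried on its own. This is a rough sketch, assuming GNU sed (for the `first~step` address); `add_column_breaks` is a hypothetical helper name, and the guard for lists shorter than three lines is an addition that the script above does not have.

```shell
#!/bin/bash
# add_column_breaks is a hypothetical helper: it reads a list on stdin and
# writes it to stdout with a blank line after every third of the list.
add_column_breaks() {
    local tmp lines col
    tmp=$(mktemp)
    cat > "$tmp"
    lines=$(wc -l < "$tmp")
    col=$((lines / 3))
    if [ "$col" -ge 1 ]; then
        # COL~COL matches lines COL, 2*COL, 3*COL, ...
        # G appends the (empty) hold space, which shows up as a blank line.
        sed "${col}~${col}G" "$tmp"
    else
        # Fewer than three lines: nothing to split, pass the input through.
        cat "$tmp"
    fi
    rm -f "$tmp"
}

# Made-up six-item list: blank lines land after the 2nd, 4th and 6th items.
printf '%s\n' '* A' '* B' '* C' '* D' '* E' '* F' | add_column_breaks
```

When the line count is an exact multiple of three, a blank line also follows the last item.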