User:Visviva/Bash
From Wikipedia, the free encyclopedia
I'm fairly new to Bash, but if these scripts are of any use to you please feel free to use & adapt them.
If you think you can improve anything on this page, please share your ideas either here or on the Talk page.
Uncat.sh
I find that this script processes only about 300,000 lines per hour on my desktop machine, so it would take roughly 400 hours to work through the entire text of the English Wikipedia.
#!/bin/bash
#This is a bash script for listing uncategorized pages in an English Wikipedia XML dump.
#It takes one argument, the name of the file it will process.
#If you know a way to make this script faster, please share.

#Open the input file on a dedicated file descriptor
exec 3< "$1"
in=0

#Start
while read -r line <&3; do
    #Scan for category (and redirect/disambiguation) markers
    if [ "$in" -eq 1 ]
    then
        case $line in
            *"[[Category:"* | *"[[category:"* | *REDIRECT* | *redirect* | *disambig* | *"dis}}"* | *"CC}}"* | *Disambig* | *Redirect* ) in=0;;
        esac
    fi
    #Scan for a title -- which also tells us the last page is over
    title=$(echo "$line" | grep '<title>')
    if [ -n "$title" ]
    then
        oldtitle=$PAGE_TITLE
        title=$(echo "$title" | sed -e 's@<title>\(.*\)</title>@\1@')
        export PAGE_TITLE=$title
        #If no category marker was seen before this new title,
        #the previous page is uncategorized: list it as a wiki link
        if [ "$in" -eq 1 ]
        then
            echo "*[[$oldtitle]]"
        fi
        in=1
        case $title in
            *deletion* | *Deletion* ) in=0;;
        esac
    fi
done
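On the question of speed: most of the time above goes into spawning a grep and sed subshell for every line of the dump. A sketch of one possible alternative is a single awk pass that applies the same title/category logic with no per-line subprocesses. The sample file, its contents, and the exact marker regex here are made up for the demo, not taken from a real dump.

```shell
#Demo input standing in for a dump fragment (hypothetical pages)
cat > sample.xml <<'EOF'
<page>
<title>Foo</title>
<text>Some text [[Category:Things]]</text>
</page>
<page>
<title>Bar</title>
<text>No category here</text>
</page>
EOF

awk '
  /<title>/ {
      #A new page begins: if the previous one never matched a
      #category/redirect/disambig marker, report it as uncategorized.
      if (intitle) print "*[[" title "]]"
      title = $0
      sub(/.*<title>/, "", title)
      sub(/<\/title>.*/, "", title)
      intitle = (title !~ /[Dd]eletion/)
      next
  }
  intitle && /\[\[[Cc]ategory:|REDIRECT|redirect|[Dd]isambig/ { intitle = 0 }
  #Unlike the bash version, also handle the final page in the file
  END { if (intitle) print "*[[" title "]]" }
' sample.xml
```

Run against the sample above, this prints *[[Bar]], since Foo has a category and Bar does not.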