User talk:Bluemoose/DataBaseSearchTool

From Wikipedia, the free encyclopedia

Hey Bluemoose. It might be useful for the not-so-technical among us if you gave some instructions for the following problem I encountered. First I downloaded the .NET Framework, then your program. When I typed my first query into your program, I got the message:

"Please open up an "Articles" XML data-dump file from the file menu See the About menu for where to download this file".

So I followed these instructions, but when I clicked on "open XML dump" I got a new screen with a file name written in "current or articles XML file", and my only options were "open" or "cancel". When I click on open I get the error message: current or articles XML file does not exist. I have no idea what to do from here. Fuhghettaboutit 01:06, 28 January 2006 (UTC)

The file you want is here, the one called pages-articles.xml.bz2. I haven't linked to that page directly in the program because when a newer dump is available it will be on a different page. Thanks, Martin 09:46, 28 January 2006 (UTC)

Sorry Bluemoose, but still bewildered. I understand now to some extent--the dump has all text files on wikipedia as of a certain date that gets redumped periodically as Wikipedia changes(?), but I still don't know how to access that file with your program. When I try to access the XML file, it looks in my computer--must I download the 997 MB file to my computer in order to do this? I need to be spoonfed here. Thanks for any help. Fuhghettaboutit 15:35, 28 January 2006 (UTC)
OK, download the 997 MB file to your computer and extract it with a program like WinZip or WinRAR (it is a .bz2 file, which is a compressed archive much like a normal .zip file). Then start up the database search tool and "open" the extracted file, which will be called enwiki-20060125-pages-articles.xml, and you are ready to start searching. Hope that helps. Martin 16:33, 28 January 2006 (UTC)
It does indeed. Thank you. In fact, that's the conclusion I had sort of reached above, but I was having trouble swallowing the fact that I needed to download almost 1,000 MB to my computer first. Thanks again. Fuhghettaboutit 16:53, 28 January 2006 (UTC)
I am more than happy to do any searches for you, just let me know. Martin 17:14, 28 January 2006 (UTC)

Thank you

thank you, Thank you, THANK YOU. This software is incredibly useful for my work on the Ancient Egypt project!

-- That Guy, From That Show! (talk) 2006-02-22 03:06Z

Thanks, some technical questions

Hi Bluemoose, thanks for your great tool. I'm working on a tool that processes a full XML dump (with history, ~190 GB) and wondered whether C#'s XML features can handle such a vast amount of data in one file. I'd appreciate it if you could fill me in on some minor details of your SW. It would be a shame to buy a new hard disk and then realize that there's no way to parse the file ;)

MMF (Sorry, no account --> no signing)

Well you certainly won't want to load the whole thing into memory ;-), but yes, I can't see why the xml features will not be able to handle a file of any size. Not sure what "SW" means, but just ask about any other details. Martin 14:56, 28 February 2006 (UTC)


Thanks for your answer. I assume you are using the native C# XML API with XmlTextReader etc.? What I'm interested in is an exchange of experiences about the maximum size .NET's XML API can handle (Microsoft states 2 GB max per file). Or are you using SAX.NET for parsing? (With JAXP (Java SAX) from Sun I should be able to parse 190 GB of XML.) And SW stands for software; I'll try to be more precise, since I'm not writing in my native language (obviously ;) ). The problem: I need the full history of articles for statistical purposes, and parsing the online Wikipedia in HTML seems a bit tough to handle. But 190 GB of XML in one file - OMG :D Maybe you could write a short statement about the classes you use (standard .NET or some kind of SAX for .NET) and the maximum file size you have successfully tested with your tool. Thx in advance
MMF

The reader part of the code looks something like this (using the System.Xml namespace):

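           // Requires: using System; using System.IO; using System.Xml;
           // Both objects are disposed even if reading throws.
           using (Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
           using (XmlTextReader reader = new XmlTextReader(stream))
           {
               // Read() advances one node at a time, so the file is never loaded whole.
               while (reader.Read())
               {
                   // Match opening <title> tags only; ReadString() returns the text
                   // and stops on </title>, so the next Read() skips no node.
                   if (reader.NodeType == XmlNodeType.Element && reader.LocalName.Equals("title"))
                       Console.WriteLine(reader.ReadString());
               }
           }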

This simple example writes every title to the console. I have used it on files of almost 4 GB, and I suspect it will work on any file size, as it reads as it goes rather than loading the whole file into memory. Martin 21:53, 28 February 2006 (UTC)

Thanks, I'll give it a try. MMF
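For the statistical use case described above (working from a full-history dump), the same forward-only reading style could be extended to count revisions per page. The following is only a minimal sketch, not the tool's own code: the class and method names are hypothetical, and it assumes the dump nests `<revision>` elements inside `<page>` elements that each carry one `<title>`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the database search tool.
public static class RevisionCounter
{
    // Streams the input and yields one (title, revision count) pair per <page>.
    // Only the current node is held in memory, so input size is not a concern here.
    public static IEnumerable<KeyValuePair<string, int>> CountRevisions(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            string title = null;
            int revisions = 0;
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    if (reader.LocalName == "title")
                    {
                        // This call already moves past </title>, so skip the
                        // Read() at the bottom to avoid jumping over a node.
                        title = reader.ReadElementContentAsString();
                        continue;
                    }
                    if (reader.LocalName == "revision")
                        revisions++;
                }
                else if (reader.NodeType == XmlNodeType.EndElement && reader.LocalName == "page")
                {
                    yield return new KeyValuePair<string, int>(title, revisions);
                    title = null;
                    revisions = 0;
                }
                reader.Read();
            }
        }
    }
}
```

To run it against a real dump, pass something like `new StreamReader(path)` and iterate the result with `foreach`; because the method is an iterator, pages are reported as they are encountered rather than after the whole file has been read.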

Some ideas

  • Convert < into &lt; and > into &gt; when the user searches for these characters
  • A "title starts with" option (just like the existing "title contains" option)
  • Alphabetic sorting
    • with a header (==B== etc.) whenever the first letter changes. --HartzR 08:46, 5 March 2006 (UTC)
Damn, I was using an old version! So probably these are done already... Version history would be great btw. --HartzR 08:50, 5 March 2006 (UTC)
There could be a stop button. --HartzR 08:57, 5 March 2006 (UTC)
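The three list suggestions above could be sketched roughly as follows. This is an illustration only, with hypothetical names, and makes no claim about how the tool implements (or would implement) them. The escaping helper assumes the search runs against the raw dump text, where `<` and `>` inside article content are stored as `&lt;` and `&gt;`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helpers illustrating the suggestions; not the tool's code.
public static class SearchIdeas
{
    // Rewrites a query so it matches angle brackets as stored in the raw XML.
    public static string EscapeAngleBrackets(string query)
    {
        return query.Replace("<", "&lt;").Replace(">", "&gt;");
    }

    // Case-insensitive "title starts with" test.
    public static bool TitleStartsWith(string title, string prefix)
    {
        return title.StartsWith(prefix, StringComparison.OrdinalIgnoreCase);
    }

    // Sorts titles and inserts a wiki header line whenever the first letter changes.
    public static List<string> SortWithLetterHeaders(IEnumerable<string> titles)
    {
        var sorted = new List<string>(titles);
        sorted.Sort(StringComparer.OrdinalIgnoreCase);

        var lines = new List<string>();
        string currentLetter = null;
        foreach (string title in sorted)
        {
            if (title.Length == 0)
                continue; // nothing to group an empty title under

            string letter = char.ToUpperInvariant(title[0]).ToString();
            if (letter != currentLetter)
            {
                lines.Add("==" + letter + "==");
                currentLetter = letter;
            }
            lines.Add("* [[" + title + "]]");
        }
        return lines;
    }
}
```

Ordinal comparison is used here for predictability; a culture-aware comparer might give a more natural order for non-English wikis.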

Other languages

Does this support unicode? I'd like to use this tool for Bengali wikipedia. --Ragib 21:25, 27 September 2006 (UTC)

Yes, it should work, though I haven't tested it with many other languages. Martin 08:12, 28 September 2006 (UTC)
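One way to check this for a particular language is sketched below. It is not the tool's code, and the class and method names are hypothetical. It relies on two facts: the dumps are UTF-8 encoded, and .NET's XML readers decode that into ordinary .NET strings, so a plain ordinal substring search should treat Bengali text like any other text.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical check, not part of the database search tool.
public static class UnicodeCheck
{
    // Returns true if any <title> in the XML stream contains the given text.
    public static bool AnyTitleContains(Stream dump, string needle)
    {
        using (XmlReader reader = XmlReader.Create(dump))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "title")
                {
                    // The reader has already decoded the UTF-8 bytes at this point.
                    string title = reader.ReadElementContentAsString();
                    if (title.IndexOf(needle, StringComparison.Ordinal) >= 0)
                        return true;
                    continue; // already positioned on the node after </title>
                }
                reader.Read();
            }
        }
        return false;
    }
}
```

An ordinal search compares code units exactly, so text that is visually identical but normalized differently would not match; that caveat applies to any script, not only Bengali.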