User talk:Piotrus/List of Poles

Converting to a list shouldn't be a problem, and won't require a bot. Export the .xls file to a .csv or text format (if possible, only export the column with the name) then run a text editor that will do macros on it to format the data. I can help if you provide a link to the .xls file or the exported data. --ChrisRuvolo (t) 16:48, 9 May 2005 (UTC)

The CSV download is available from the main page. The other columns are useful too, for article creation, for example. I can't write a macro any better than a bot, though. --Piotr Konieczny aka Prokonsul Piotrus Talk 15:01, 10 May 2005 (UTC)

These names are hard to parse. Firstly, they are not normalized. For example, consider the following lines:

  • 14,1,,"ABRAHAM Antoni (1869-1923) działacz społeczny","M",1923
  • 15,1,,"ABRAHAM ben Joszijahu z Trok (1636-1687) lekarz i mistyk","M",1687
  • 16,1,,"ABRAHAM Hirszowicz (XVIII/XIX w.) nadworny faktor Stanisława Augusta","M",
  • 53,1,,"ADAM z Brzezin (zm. 1552) profesor medycyny UJ","M",1552

These all format the name and date differently. The first appears to be the most common format. The second appears to mix the surname "ABRAHAM ben Joszijahu" with the given name "z Trok", if that is indeed the given name (I'm not familiar with Polish names). The third has Roman numerals that are not a birth date. The fourth has only the death date, and its name is also unclear: is it "z Brzezin Adam" or something else?
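To make the differences concrete, here is a minimal sketch that pulls the parenthesised part out of the text column and sorts it into the three date shapes seen above. The helper name `split_entry` and the five-field return layout are my own assumptions, not part of the source data, and it assumes the quoted text column has already been extracted from the CSV line.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: split the text column into
# (name, birth year, death year, date note, description).
# Returns an empty list when there is no parenthesised part.
sub split_entry {
    my ($text) = @_;
    my ($name, $paren, $desc) = $text =~ /^([^(]*?)\s*\(([^)]*)\)\s*(.*)$/
        or return;
    my ($birth, $death, $note) = ('', '', '');
    if ($paren =~ /^(\d{3,4})-(\d{3,4})$/) {      # (1869-1923)
        ($birth, $death) = ($1, $2);
    } elsif ($paren =~ /^zm\.\s*(\d{3,4})$/) {    # (zm. 1552): death year only
        $death = $1;
    } else {                                      # (XVIII/XIX w.) and anything else
        $note = $paren;
    }
    return ($name, $birth, $death, $note, $desc);
}

for my $text (
    'ABRAHAM Antoni (1869-1923) działacz społeczny',
    'ABRAHAM Hirszowicz (XVIII/XIX w.) nadworny faktor Stanisława Augusta',
    'ADAM z Brzezin (zm. 1552) profesor medycyny UJ',
) {
    print join(' | ', split_entry($text)), "\n";
}
```

Anything the two year patterns do not recognise is kept verbatim in the note field, so no information is lost while the formats are still being catalogued.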

These really need to be normalized before they can be useful. Having all that text in one column doesn't work well. Ideally, they should be split up into several columns:

  1. given name
  2. surname
  3. title postfix (e.g. Andrzej II)
  4. birth year
  5. death year
  6. notes

or alternately:

  1. name as it should appear on Wikipedia
  2. birth year
  3. death year
  4. notes

Without this, I don't think it is possible to parse what the proper name order should be. For example:

  • "d' ABANCOURT de Franqueville Karol" should be what? Either:
    • Karol d' Abancourt de Franqueville
    • de Franqueville Karol d' Abancourt
    • ?? something else?
  • "ABRAHAM ben Joszijahu z Trok" should be what? Either:
    • z Trok Abraham ben Joszijahu
    • Trok Abraham ben Joszijahu z
    • ben Joszijahu z Trok Abraham
    • ... or any other combination. It is not clear.

In the current form, I can parse only simple cases, such as:

  • "ABICHT Henryk" -> "Henryk Abicht"
  • "ABICHT Jan Henryk" -> "Jan Henryk Abicht"
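The simple cases can be sketched as below. This is only an illustration under my own assumptions: the function name `reorder_simple` is hypothetical, and it deliberately refuses anything that is not an all-caps surname followed by plain capitalised given names, so the ambiguous entries are left for manual review.

```perl
use strict;
use warnings;
use utf8;
use feature 'unicode_strings';
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: handles only "SURNAME Given [Given ...]".
# Returns undef for everything else (particles, "z", numerals, ...).
sub reorder_simple {
    my ($name) = @_;
    my ($surname, @given) = split ' ', $name;
    return unless defined $surname && @given;
    return unless $surname =~ /^\p{Lu}+$/;           # all-caps surname
    return if grep { !/^\p{Lu}\p{Ll}+$/ } @given;    # plain capitalised given names
    return join ' ', @given, ucfirst lc $surname;
}

for my $name ('ABICHT Henryk', 'ABICHT Jan Henryk', 'ADAM z Brzezin') {
    my $out = reorder_simple($name);
    print "$name -> ", defined $out ? $out : '(left for manual review)', "\n";
}
```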

I hope this is helpful. --ChrisRuvolo (t) 19:32, 10 May 2005 (UTC)

The problem is that the project was started in 1935, so there have been generations of standard changes :( For some names, especially foreign or semi-foreign ones (Jewish), I am not sure myself what the right order should be. But perhaps a few simple IF statements could help us out. If a name contains 'z' (Polish: from), then leave it as it is and just decapitalise the first name (i.e. ABRAHAM ben Joszijahu z Trok becomes Abraham ben Joszijahu z Trok, ADAM z Brzezin becomes Adam z Brzezin). If a name begins with d' or von, move it and the following capitalised name to the end. Granted, it will still give us some errors, but they should be few. As for the text inside brackets, leave it as it is: it is not needed for the name, and we can treat it as part of the text description. --Piotr Konieczny aka Prokonsul Piotrus Talk 19:59, 10 May 2005 (UTC)
Ok, that is a good start. Another question I have is how some of these people are typically referred to in English. For example, does "Adam z Brzezin" become "Adam of Brzezin" in English texts? I'm wondering because "z" by itself doesn't seem consistently pronounceable for English speakers.
I have found three examples on Wiki - John of Kolno, Spytek z Tarnowa i Jaroslawia and Jan z Tarnowa (1367-1433). While 'z' means 'of', there is the problem of grammatical cases - note: Jan z KolnA but Jan of Kolno, Adam z Brzezin - Adam of BrzezinY. Perhaps leaving it in the original would be easiest, and the names would be translated when somebody wants to write an article. --Piotr Konieczny aka Prokonsul Piotrus Talk 10:10, 11 May 2005 (UTC)
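The two IF rules proposed above can be sketched as follows. The function name `apply_rules` is hypothetical and the rules are applied literally as stated, so the d' example comes out in whatever order the rule dictates, which may or may not be the order wanted in the end.

```perl
use strict;
use warnings;

# Hypothetical helper implementing only the two rules proposed above.
sub apply_rules {
    my ($name) = @_;
    my @w = split ' ', $name;

    # Rule 1: a name containing "z" ("from") keeps its order;
    # only the leading all-caps word is decapitalised.
    if (grep { $_ eq 'z' } @w) {
        $w[0] = ucfirst lc $w[0];
        return join ' ', @w;
    }

    # Rule 2: a leading "d'" or "von" moves to the end together with
    # the capitalised name that follows it.
    if (@w > 2 && ($w[0] eq "d'" || $w[0] eq 'von')) {
        my ($particle, $surname) = splice @w, 0, 2;
        return join ' ', @w, $particle, ucfirst lc $surname;
    }

    return $name;    # no rule applies: leave the name unchanged
}

print apply_rules($_), "\n"
    for 'ADAM z Brzezin', "d' ABANCOURT de Franqueville Karol";
```

Names that match neither rule are returned untouched, so this can run before or after the simple surname/given-name reordering without interfering with it.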
Also, what are the roman numerals? e.g.: "(XVIII/XIX w.)"
In Polish, centuries are written in Roman numerals instead of Arabic ones. So I think it means that the exact dates of birth and death are not known, only the centuries. "w." stands for "wiek" (century). --Piotr Konieczny aka Prokonsul Piotrus Talk 10:10, 11 May 2005 (UTC)
So, I think I'm going to write a perl script to do this. I'll post the script here when completed, so others might improve upon it from the same source data in the future. --ChrisRuvolo (t) 23:05, 10 May 2005 (UTC)
Thanks. User Avar did some work on this on IRC before he gave up yesterday. I'll paste what he did here; maybe it can be of some help:
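A century note such as "(XVIII/XIX w.)" can be turned into numbers mechanically. The sketch below is only an illustration: the helper names `roman_to_int` and `centuries` are my own, and it assumes the note is one or more Roman numerals separated by "/" and followed by "w.".

```perl
use strict;
use warnings;

my %ROMAN = (I => 1, V => 5, X => 10, L => 50, C => 100, D => 500, M => 1000);

# Convert a Roman numeral to an integer, reading right to left so that
# a smaller value before a larger one (IX, IV) is subtracted.
sub roman_to_int {
    my ($roman) = @_;
    my ($total, $prev) = (0, 0);
    for my $ch (reverse split //, uc $roman) {
        my $val = $ROMAN{$ch} or return;    # not a Roman numeral
        $total += $val < $prev ? -$val : $val;
        $prev = $val;
    }
    return $total;
}

# Hypothetical helper: "XVIII/XIX w." -> (18, 19).
# Returns an empty list when the note has a different shape.
sub centuries {
    my ($paren) = @_;
    my ($numerals) = $paren =~ m{^([IVXLCDM]+(?:/[IVXLCDM]+)*)\s+w\.\z}
        or return;
    return map { roman_to_int($_) } split m{/}, $numerals;
}

print join(', ', centuries('XVIII/XIX w.')), "\n";
```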
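use utf8;
binmode STDOUT, ':encoding(UTF-8)';
open my $fh, '<:encoding(UTF-8)', 'new' or die "Cannot open 'new': $!";
while (<$fh>) {
  # 852,1,345,"BARZI Stanisław (zm. 1571) wojewoda krakowski","M",1571
  s/^.*?"//;          # drop everything up to and including the first quote
  s/"//g;             # drop the remaining quotes
  ($_) = split /,/;   # keep only the text before the first comma
  # print;
  # $1 = upper-case surname, $2 = given names up to the opening parenthesis
  next unless m/^(.*)(?=[AĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻ][aąbcćdeęfghijklłmnńoóprsśtuwyzźż])(.*)(?=\()/;
  print ucfirst(lc $1) . ' ' . $2;   # parenthesised so the whole expression is printed
  print "\n";
}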

Perl script

Ok, I've got something that is mostly working. One question: does "ze" mean the same as "z" (English: of/from)? I will update the list in the other page. Note that the non-ISO-8859-1 characters are converted to HTML entities by Wikipedia below. --ChrisRuvolo (t) 21:54, 11 May 2005 (UTC)

Yep, 'ze' is the same as 'z'. What do you mean by non-ISO characters? --Piotr Konieczny aka Prokonsul Piotrus Talk 01:04, 12 May 2005 (UTC)
Just that in the wiki source of the page the Eastern European characters (ł, etc.) show up as HTML entities (&#322;, etc.) so be careful cutting & pasting. --ChrisRuvolo (t) 01:44, 12 May 2005 (UTC)
Scrap that, the data is about 2 megabytes. Way too big for me to put it on a wiki page. So, here is the output, you can split it up any way you need. [1] --ChrisRuvolo (t) 22:19, 11 May 2005 (UTC)
Tnx!!! --Piotr Konieczny aka Prokonsul Piotrus Talk 01:12, 12 May 2005 (UTC)
No problem. It was an interesting exercise. There will be problems; I'm sure not everything was converted correctly. Let me know about any adjustments that need to be made. --ChrisRuvolo (t) 01:44, 12 May 2005 (UTC)
      • Files broken up by 1000 lines each are here: [2] --ChrisRuvolo (t) 23:19, 13 May 2005 (UTC)
#!/usr/bin/perl

#input file must be converted to utf-8 and unix linefeed format, piped to stdin

# ©2005 Chris Ruvolo.  GPLv2 only.

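use utf8;                             # this source file contains UTF-8 literals
use open qw(:std :encoding(UTF-8));   # decode input and encode output as UTF-8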
use strict;

my $text;
my $sex;
my $deathyear;
my $birthyear;
my $first;
my $last;
my $desc;
my $name;
my $years;
my $noreorder;

while (<>) {
    chomp;
    $_ =~ s/^[^"]*//;
    while (m/"[^"]+,[^"]+"/) {
        $_ =~ s/("[^"]+),([^"]+")/$1;$2/g;
    }
    ($text, $sex, $deathyear) = split /,/;
    $text =~ s/"//g;
    $sex =~ s/"//g;

    $name = $text;
    $name =~ s/(^[^(]*) .*/$1/;

    $years = $text;
    $years =~ s/^[^(]*\((.*)\).*/$1/;

    $desc = $text;
    $desc =~ s/^[^)]*\) (.*)$/$1/;

    if ($years =~ m/^[0-9]*-[0-9]*$/) {
        ($birthyear, undef) = split(/-/,$years);
    } else {
        $birthyear = "";
    }

    $noreorder = 0;

    my @prepositions = ("de", "d'", "dell'", "del", "van der", "von", "du",
        "von der", "de la", "di", "van", "baron", "kniaź", "ben");
    my @namepieces = split(/ /,$name);
    my $i;


    if ($name =~ m/ z / || $name =~ m/ ze /) {
        $first = "";
        $noreorder = 1;

        $last = ucfirst(lc($namepieces[0]));
        for($i = 1; $i < @namepieces; $i++) {
            $last .= " $namepieces[$i]";
        }
    }

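    # a regnal numeral (e.g. "II") must be the whole second word, not just
    # any word containing I, V or X such as "Ignacy"
    if (defined $namepieces[1] && $namepieces[1] =~ m/^[IVX]+$/) {
        $first = "";    # do not carry over the previous line's given name
        $last = ucfirst(lc($namepieces[0])) . " $namepieces[1]";
        $noreorder = 1;
    }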

    if (!$noreorder) {
        # if name part is already lowercase
        if ($namepieces[0] eq lc $namepieces[0]) {
            if ($namepieces[1] eq lc $namepieces[1]) {
                $last = "$namepieces[0] $namepieces[1] " . ucfirst lc $namepieces[2];
                $i = 2;
            } else {
                $last = "$namepieces[0] " . ucfirst lc $namepieces[1];
                $i = 1;
            }
        } else {
            $last = ucfirst lc $namepieces[0];
            $i = 0;
        }

        $i++;
        for my $prep (@prepositions) {
            if ($prep eq $namepieces[$i] || $prep eq "($namepieces[$i])" ) {
                $last .= " $prep $namepieces[$i+1]";
                $i += 2;
                last;
            }
        }

        $first = "";
        while ($i < @namepieces) {
            $first = $first . "$namepieces[$i] ";
            $i++;
        }
    }

    # trim trailing spaces left over from joining the name pieces
    $last =~ s/ *$//;
    $first =~ s/ *$//;

    # non-ascii chars:
    #$uppers = "ÄĄĆÉĘŁŃÓÖŚŠÚÜŹŻ";
    #$lowers = "äąćéęłńóöśšúüźż";
    #$uppers = "ÄĄĆÉĘŁŃÓÖŚŠÚÜŹŻ";
    #$notL1u=  "ÄACÉELNÓÖSSÚÜZZ";
    #$lowers = "äąćéęłńóöśšúüźż";
    #$notL1u=  "äacéelnóössúüzz";

    # chars in iso-8859-2 but not in iso-8859-1
    my $uppers = "ĄĆĘŁŃŚŠŹŻ";
    my $notL1u = "ACELNSSZZ";
    my $lowers = "ąćęłńśšźż";
    my $notL1l = "acelnsszz";

    my $wikifirst = $first;
    my $wikilast = $last;

    eval "\$wikifirst =~ tr/$uppers $lowers/$notL1u $notL1l/, 1" or die $@;
    eval "\$wikilast  =~ tr/$uppers $lowers/$notL1u $notL1l/, 1" or die $@;

    print "# ";
    if ($first ne "") {
        if ($wikifirst eq $first && $wikilast eq $last) {
            print "[[$first $last]]";
        } else {
            print "[[$wikifirst $wikilast|$first $last]]";
        }
    } else {
        if ($wikilast eq $last) {
            print "[[$last]]";
        } else {
            print "[[$wikilast|$last]]";
        }
        if ($last =~ m/.* ze? .*/) {
            $last =~ s/ ze? / of /;
            $wikilast =~ s/ ze? / of /;
            if ($wikilast eq $last) {
                print " / [[$last]]";
            } else {
                print " / [[$wikilast|$last]]";
            }
        }
    }
    if ($birthyear eq "") {
        if ($deathyear ne "") {
            if ($deathyear =~ m/[^0-9]/) {
                print " d. $deathyear";
            } else {
                print " d. [[$deathyear]]";
            }
        } else {
            print " ($years)";
        }
    } else {
        print " [[$birthyear]] - [[$deathyear]]";
    }
    print " $desc\n";
}   

What shall we do?

So, what do we do with this list? Perhaps it's high time we moved it to the main wiki namespace? Halibutt 08:56, Jun 11, 2005 (UTC)

Hmm, I wanted to go through it and fix it myself first, but I guess if we wait for this we may wait forever - I think I have only fixed the first 1-2 lists so far. So, where do you think we should move it? --Piotr Konieczny aka Prokonsul Piotrus Talk 09:36, 11 Jun 2005 (UTC)
No idea, perhaps we could both leave it here as a to-do list for the Polish WPs Notice Board and at the same time copy it to the main List of Poles. After all a partial list is by no means better than the full list. Halibutt 11:07, Jun 11, 2005 (UTC)
Well, now that the UTF/UNICODE problems are solved, the list is slightly outdated... Halibutt July 3, 2005 23:29 (UTC)
So, how about moving it to main wiki namespace? Halibutt 09:07, 16 September 2005 (UTC)
Well, since nothing is going on here, I guess this is the only choice - to merge it with the mainspace list of Poles. Unfortunately, I doubt I will have the time and will to do it soon. Any volunteers? --Piotr Konieczny aka Prokonsul Piotrus Talk 16:41, 16 September 2005 (UTC)

Script update

Do you need me to modify the script to output both UTF-8 links with Polish characters intact and links with only ISO-8859-1 characters that can be used to create redirects? --ChrisRuvolo (t) 20:58, 16 September 2005 (UTC)

Sure, this would be great. --Piotr Konieczny aka Prokonsul Piotrus Talk 01:26, 17 September 2005 (UTC)
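One possible shape for that output, as a rough sketch only: each name yields the list line with the UTF-8 title, plus a redirect from the diacritic-free title when the two differ. The helper names `latin1_title` and `link_and_redirect` and the "title => redirect text" output layout are my own assumptions; the character table is the one from the script above.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: replace the characters that are in ISO-8859-2
# but not in ISO-8859-1 with their plain ASCII base letters.
sub latin1_title {
    my ($title) = @_;
    (my $plain = $title) =~ tr/ĄĆĘŁŃŚŠŹŻąćęłńśšźż/ACELNSSZZacelnsszz/;
    return $plain;
}

# Returns the list line, plus one "redirect title => page text" line
# when the ISO-8859-1-only title differs from the UTF-8 title.
sub link_and_redirect {
    my ($title) = @_;
    my $plain = latin1_title($title);
    my @out = ("# [[$title]]");
    push @out, "$plain => #REDIRECT [[$title]]" if $plain ne $title;
    return @out;
}

print "$_\n" for link_and_redirect('Stanisław Barzi');
```

Titles that contain only ISO-8859-1 characters (including ó) produce no redirect line, so the redirect list stays limited to the pages that actually need one.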

Su - Ś

List of future PSB articles is here Fjl 14:03, 16 Jun 2005 (UTC)