User talk:Jan Hidders/HTML-free mark-up

That would certainly make the work of the parser (and mine;) a lot easier. But it would also mean automatically replacing HTML markup (which people will use no matter what) with wiki format upon saving, which will be

  1. very tough, especially with tables
  2. a reason for people to cry out loud (I am thinking especially of The Cunctator;)

Also, some HTML things are nice, font tags, for example. Labelling an image is quite neat if the label is the same color as the object in the image.

Magnus Manske

You only have to translate the complete contents of Wikipedia to the new mark-up once. After that you always replace the tag delimiters < and > with the entities & lt ; and & gt ;. (That's what PhpWiki does, for example.) People can then type all the HTML they like; it won't work. I agree about the font color, but you can probably invent some mark-up for that too. Jan Hidders
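A minimal sketch of the escaping step described above. The function name `neutralise_tags` is hypothetical and this is not PhpWiki's actual code; it only illustrates replacing the tag delimiters. Leaving ampersands untouched, so that entities keep working, is an assumption.

```python
def neutralise_tags(text):
    """Replace the tag delimiters so that typed HTML is displayed literally.

    Ampersands are deliberately left alone (an assumption), so that
    entities such as &beta; still reach the browser unchanged.
    """
    return text.replace("<", "&lt;").replace(">", "&gt;")


print(neutralise_tags("<b>bold</b> and &beta;"))
```

With this in place a typed tag is shown as text rather than interpreted, while an entity passes through untouched.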

Jan, why do you think it desirable to banish all HTML markup? Wouldn't it be better to keep the threshold of contributing as low as possible for new users? AxelBoldt

I believe firmly that using HTML actually heightens that threshold. (FWIW, I actually teach XML but still find that it doesn't make sense as a human-readable format.) Remember that the complexity of HTML was exactly the reason that WikiWiki was invented (See "The Wiki Way" by Ward Cunningham, the originator of the concept). The HTML table-syntax, for example, is much more involved and harder to read in ASCII form than PhpWiki/MoinMoin table-syntax. Having two ways to do the same thing (e.g., ' ' and < i >) also doesn't make things simpler. Also remember that accessible does not just mean that it should be easy for people to write something new, but also that it should be easy to adapt something old. The latter becomes more difficult if a previous writer used some nifty HTML stuff. ... I guess I could go on about this but I have to get back to work now. Jan Hidders

FWIW, I agree, especially about the table syntax - take a look at my (still incomplete) list of food additives and think about why my first run is generated by a Python script from a space-separated file on my local machine. It'd be nice not to have to carefully filter HTML, too, so that things like clicking here aren't possible.
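For illustration only, a sketch of the kind of script meant above: it turns space-separated lines into an HTML table. This is not the actual script; the function name `rows_to_html_table` and the sample row are made up, and it assumes that no cell contains a space.

```python
import html


def rows_to_html_table(lines):
    """Build an HTML table from lines of space-separated cells."""
    out = ["<table>"]
    for line in lines:
        cells = line.split()
        if not cells:
            continue  # skip blank lines
        row = "".join("<td>" + html.escape(cell) + "</td>" for cell in cells)
        out.append("<tr>" + row + "</tr>")
    out.append("</table>")
    return "\n".join(out)


# Hypothetical sample data, one record per line.
print(rows_to_html_table(["E100 Curcumin", "E101 Riboflavin"]))
```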

I do have some notes on your proposal, though:

  • We'll still need to be able to enter entities like β as "& beta ;". It'd be nice to be able to enter hexadecimal entities like & #x2019 ; and have them converted to & #8217; on output for older browsers too.
  • Recognising a "_" or "/", etc., that's supposed to be rendered as itself might be tricky. Maybe a double-underscore?
  • I'd like --- to do em-dash, "—", myself. I wonder how many people use strike-out?
  • I will never remember which is superscript or subscript. How about something more mnemonic like {^superscript^} and {_subscript_}?
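The hexadecimal-to-decimal conversion suggested in the first note could look roughly like this. It is a sketch only: the name `hex_entities_to_decimal` is hypothetical, and it assumes the entities are written without the extra spaces used on this page.

```python
import re

# Matches hexadecimal character references such as &#x2019;
HEX_ENTITY = re.compile(r"&#[xX]([0-9a-fA-F]+);")


def hex_entities_to_decimal(text):
    """Rewrite hexadecimal character references as decimal ones."""
    return HEX_ENTITY.sub(lambda m: "&#%d;" % int(m.group(1), 16), text)


print(hex_entities_to_decimal("it&#x2019;s"))
```

Named entities and decimal references are not matched by the pattern, so they pass through unchanged.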

Carey Evans

I agree that the entities et cetera should stay; it's only the tags that I don't like. The problem of escaping special mark-up symbols is usually solved by a special escape symbol like "\". I would advocate that here too. I also agree about the em-dash and, yes, I don't think strike is used very much. I also agree that my symbols for sub and superscript are not very intuitive, but {_sub_} looks a bit too much like _sub_. -- Jan Hidders
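A sketch of how an escape symbol might be handled while scanning the text. The function name `split_escapes` is hypothetical, and treating a trailing backslash as a literal backslash is an assumption, not something settled in this discussion.

```python
def split_escapes(text, escape="\\"):
    """Return (character, is_literal) pairs.

    A character preceded by the escape symbol is flagged as literal, so a
    later mark-up pass can skip it; the escape symbol itself is dropped.
    """
    result = []
    chars = iter(text)
    for ch in chars:
        if ch == escape:
            # Assumption: a trailing escape symbol stands for itself.
            result.append((next(chars, escape), True))
        else:
            result.append((ch, False))
    return result


print(split_escapes(r"a\_b"))
```

A later pass would then treat "_" as mark-up only where its flag is False.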

I have to say I like your proposal. Although I'm generally very comfortable editing HTML by hand in Vim, wiki editing with wiki tags seems very appropriate. I like having different level headings indicated by the number of = signs before and after, for instance. However, that particular convention leads to lots of typos: people forget to leave a space between the section heading's text and the equals signs on either side, or they don't balance the number of equals signs on either side so we see a dangling = on the page. Regarding tables, could there be a way to specify/enforce the number of columns in a table at the beginning? I think pages like List of saints would be much easier to edit using the syntax you suggest. Wesley
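The two heading typos mentioned above could be caught mechanically at save time. A rough sketch, in which the function name `check_heading` and the wording of the messages are made up:

```python
import re

# A heading line: equals signs, some text, equals signs.
HEADING = re.compile(r"^(=+)(.*?)(=+)\s*$")


def check_heading(line):
    """Return a list of problems found in a heading line (empty if none)."""
    match = HEADING.match(line)
    if not match:
        return []  # not a heading at all
    left, body, right = match.groups()
    problems = []
    if len(left) != len(right):
        problems.append("unbalanced equals signs")
    if not (body.startswith(" ") and body.endswith(" ")):
        problems.append("missing space around heading text")
    return problems


print(check_heading("== Tables ="))
```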

Thanks for agreeing with me. Enforcing the number of columns given the first line of the table is possible but not easy to implement; the parser then has to remember the number of columns. -- Jan Hidders
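A sketch of what "remembering the number of columns" could amount to. It assumes a hypothetical `||cell||cell||` row syntax in the style of PhpWiki/MoinMoin and ignores colspan; the function names are made up.

```python
def cell_count(row, sep="||"):
    """Count the cells in one table row written as ||a||b||."""
    row = row.strip()
    if row.startswith(sep):
        row = row[len(sep):]
    if row.endswith(sep):
        row = row[:-len(sep)]
    return len(row.split(sep))


def mismatched_rows(rows, sep="||"):
    """Return the 1-based numbers of rows whose cell count differs from row 1."""
    if not rows:
        return []
    expected = cell_count(rows[0], sep)  # the first row fixes the width
    return [number for number, row in enumerate(rows, start=1)
            if cell_count(row, sep) != expected]


print(mismatched_rows(["||a||b||", "||c||d||", "||e||"]))
```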

One problem with the proposal: How will the new table syntax represent border/borderless cells and rowspan/colspan (necessary for the depiction of the roulette board)? -- Damian Yerrick

Good question. I only gave a notation for colspan. It is inevitable that if you are going to forbid the liberal use of HTML some things will no longer work. On the other hand, if you do want to allow HTML (or a safe subset of it) then you should write a small parser for that if you always want to guarantee correct HTML output and make sure that Magnus's table layout isn't messed up. -- Jan Hidders

Just wondering: why would I care? --The Cunctator


From the mailing list, with replies:

I do disagree with you there, thinking that ''' is more difficult, although only because more newbies will know <b> to begin with — they are inherently of pretty much the same complexity (I see two minor arguments each for relative simplicity). This is minor, and the difference will probably only lessen with time.

I think the relevant arguments have already been mentioned:
  • both are equally easy to learn
  • many newcomers already know <b>
  • other newcomers are a bit intimidated by HTML tags
  • raw text with ''' is slightly easier to read than with <b>
The first isn't an argument, just a denial of the existence of an argument for the other. The second is the argument that I think wins the case for <b>; I'm claiming that they look about even without that (relevant since that will lose its strength over time). So you missed the two minor arguments in favour of <b>:
  • Beginning and ending are clearer; a misplaced tag is easier to spot when it's rendered as a literal <b> in the text.
  • The letter "b" reminds people of the meaning of the markup; many people already associate this letter with boldface thanks to Microsoft Word and similar programs.
But we're getting down to niggliness here. You already know that I prefer ''' in most situations anyway, and prefer it to <strong>, which is how it actually renders.

I argue that the HTML tag itself is the best wiki markup for most of these. It's just a few situations where we have something better, or where the HTML is so complicated that we *need* something better. Then I'm with you; I just wish that this weren't an antiHTML crusade.

That crusade is just me. Please don't let my extreme point of view stop you from agreeing with more reasonable points of views. :-) Even I could probably be convinced to use HTML tags for certain mark-up if we cannot find good Wiki alternatives. However, if there is a good Wiki alternative then we should use that and that alone. But you probably agree with me there.
Except for "alone". Although I am mulling thoughts in that direction. I also suspect that we'll disagree on what's a "good" alternative, but at least that doesn't affect the principle of the thing.

Well, Lee has just informed me that <strong> and <em> are taken care of; it was only Phase II that rendered ''' and '' suboptimally as <b> and <i>.
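To make the rendering concrete, a rough sketch of mapping ''' and '' to <strong> and <em>. This is not the actual Wikipedia code; the function name `render_quotes` is hypothetical and nested or unbalanced quotes are not handled.

```python
import re


def render_quotes(text):
    """Render '''...''' as <strong> and ''...'' as <em>."""
    # Triple quotes must be handled first, otherwise the double-quote
    # pattern would consume part of them.
    text = re.sub(r"'''(.+?)'''", r"<strong>\1</strong>", text)
    text = re.sub(r"''(.+?)''", r"<em>\1</em>", text)
    return text


print(render_quotes("'''bold''' and ''italic''"))
```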

This means that the mark-up has now become more complex again, because a writer has to decide between ''' and <b> and know the difference. If there had been only one notation we wouldn't even have had this discussion and/or the developers would have had to consult Wikipedia-l for adding new mark-up. We are failing in keeping the mark-up simple. That is bad.
There is a difference between <strong> and <b>, and what we write here generally should be <strong>. What we need to do now is to deprecate <b> in ordinary Wikipedia usage, and I'll join you on that. But I will resist getting rid of it entirely, for several reasons.

Toby 13:14 Jul 30, 2002 (PDT)

Jan Hidders
Toby 23:04 Jul 30, 2002 (PDT)

Toby,

Since we both seem to agree that ' ' ' is better than < b > I will leave that discussion be.

What puzzles me is why you would want to keep < b > around if you already have ' ' '. I suspect that this is because you see them as different: one is < strong > and the other is < b >. As I already indicated, I don't think that this distinction is worth the extra complexity. Since simplicity should be our default, it is up to you to prove that we indeed need this distinction. Can you? I would also argue that you need to do the same for < var > and ' '. If your only argument is that Lynx by default will underline variable names as it also does with other italicised text, then that is not enough to convince me that < var > should stay.

-- Jan Hidders 02:23 Aug 2, 2002 (PDT)

First of all, I do not agree that ''' is better than <b>. I think that they're both quite simple and that either would work, but that the wide use of <b> on other web sites (ultimately because of its use in HTML, of course), where others will be familiar with it, makes it preferable. Even after a period of time, when this effect will presumably diminish, I still find them roughly equal. It is <strong> that I consider worse than '''.

But I'm afraid that our difference is more profound. If you check the history of the discussion, you'll see that I argued in favour of keeping both <b> and ''' before I learned that ''' really meant <strong>. Since I preferred <b> but thought that there was no difference in effect, your simplicity argument would have led me to suggest banning '''. I did say that ''' should be banned before <b> was, but I also said that it should not be banned.

Part of this was a desire to see ''' interpreted as <strong>, but I actually thought that there was little chance that this would happen. (I was surprised when Lee told me that he had made this change unilaterally already!) The really good reason to keep ''' is that it's already in wide use in Wikipedia. (<b>, which you would like to ban, is used less often, but it's still fairly common too.) Furthermore, we should keep ''' since people coming from other wikis will expect to be able to use it, just as people coming from some web fora will expect to be able to use <b>.

Generally speaking, I just don't see <b> and ''' as very complicated at all. I've seen technophobes use <b> elsewhere on the web, and I've seen them use ''' here; neither seems very difficult. As for the existence of two different methods of producing the same thing, this is not a problem. On the contrary, it's a benefit! People shouldn't have to remember that you use ''' in one place and <b> in another, and on Wikipedia, they don't have to. Whichever they try works!

Of course, it's also true that highlighting main terms in an article is a use of strongly emphasised text, while denoting the real line with a special font is a use of boldfaced text. But my opinion on <b> vs ''' doesn't depend on this.

Toby 06:16 Aug 3, 2002 (PDT)

Ok, so you think that < b > is slightly better than ' ' '. I think we both have given the arguments why we feel either way, so I still see no point in continuing that discussion and suggest that we agree to disagree there. What seems to me much more important are the following issues:

  1. Why have two equivalent mark-ups for the same thing?
  2. Is the distinction between < strong > and < b > worth the added complexity?
  3. Is the distinction between < var > and < i > worth the added complexity?

You have addressed the first question above and if I understand you correctly you are saying "because newcomers will know either one of them so it makes writing for them easier". But, in the first place, this is a really weak argument: once you start to edit you are very likely to see an example of ' ' ' on the very first page you edit, so the problem of < b > not working is really a very, very small problem. In the second place, you haven't really addressed the argument that I gave against having two notations: it makes people wonder about what the difference is and which should be used where. In fact, you yourself have given the best example of this by coming up with the suggestion that < b > and ' ' ' should not be the same. Apparently this is something that people think about. You did. If we had had only one notation then there would never have been any discussion about that and that would have saved us both time.

I'm also still curious what your reply to the other two issues is, although, quite honestly, I'd even rather get back working on my to-do list for Wikipedia. :-)

-- Jan Hidders 11:36 Aug 4, 2002 (PDT)

Fine, we can drop <b> vs ''' as such; I just didn't want my opinion misrepresented. The issue before us is whether to have both, presumably to mean slightly different things.

Here's my answer for "What's the difference?", which I would envision appearing on Wikipedia:How to edit a page: "Use ''' for strongly emphasised text and '' for emphasised text. You can also use <b> and <i> respectively, if you're more familiar with them. (There is a technical difference between ''' and <b> and between '' and <i>, but it's not a big deal.)" People like me that care about this sort of thing can change <b> to ''' when we see it (or ''' to <b>, although that will be rarer), just as I do now in fact, but more relaxed people can ignore the issue, which is pretty harmless. Meanwhile, those that follow the link to Wikipedia:HTML will (or so I envision) get a nice page explaining all the HTML that Wikipedia renders, when each is appropriately used, and how to get them in our wiki code.

Is the difference important? I think that correct HTML is important, because user agents may not interpret things in the way that we normally expect. Is this important enough? I don't know how to argue that sort of thing. I say yes, you say no, I say /tomeito/, you say /tomAto/. We agree that complexity is added, but nobody has an argument as to how bad this is.

I really need to write Toby Bartels/HTML in Wikipedia. Remind me to get on that.

BTW, what's with the " 65-69 2.13 2.30 60-64 2.33 2.77 "?

Toby 00:30 Aug 5, 2002 (PDT)

Sorry for those funny numbers, I've removed them now. Frankly, I had no idea how they got there, so I did a Google search (you never know) and, lo and behold, they are part of the demographics table of Netherlands in Wikipedia. So some cut'n paste must have gone wrong there. Apologies for that.

Back to our regular program. The default should be to keep things simple. If you cannot argue convincingly that it should be added then it shouldn't be. Period.

But this is precisely the philosophical difference between us. I think that the default should be to allow users (writers, in this case) to do things in multiple ways, whichever feels most comfortable to them. I don't think that we're going to get anywhere arguing over this philosophy anymore. (I also think that backwards compatibility is important, but that's really part of the same philosophy.)

If we really need it then this will become clear in the future and we can always add it then. Such a strict attitude is really the only one that will work because the natural tendency is always to add more and more features. I remember when I started using PhpWiki (that's a conventional Wiki based on PHP with a database back-end) and asked if I could extend it with markup for tables. They said "no". I said I wanted to write articles about the relational model and database normalization, so I really needed tables. They still said no and suggested I use the preformat-markup. I almost got angry at them, but now I understand them and they were right.

<pre> for tables? That's crap. Period.

-- Jan Hidders 07:02 Aug 5, 2002 (PDT)

Toby 16:25 Aug 5, 2002 (PDT)