User:Vegard/Wikitext parsing

Reasons for using an LL/LR (context-free) parser:

  • Parsing efficiency
  • Unambiguity
  • A uniform parser and grammar would also make it much easier for bots and other programs to access documents (think DOM)

Don't try to capture all of today's wikitext constructs in a formal grammar. This would be counter-productive, and there are bound to be differences anyway. Even if the ad-hoc parser allows strange things like nesting links within links, how many articles actually use them? The new parser should be simple and extensible (compare with today's regex hell).

Wiki parser

Recursive descent parser

/* XXX */
document { section* | paragraph* }

/* Container for elements that rely on separated lines for structure (such as
 * lists). */
line { text "\n" }

/* Text may contain mark-up like links and font styles, but only a single
 * contiguous line of text (therefore no lists or other elements that span
 * multiple lines). */
text { text-plain | text-italic | text-bold }
text-italic { "''" text "''" }
text-bold { "'''" text "'''" }

/* Plain-text may not contain additional markup. Plain-text may contain
 * markup that is not to be displayed as markup. Umm. */
text-plain {
        /* XXX: Define this. Make sure to include all UTF-8 characters. */
}

section { heading paragraph* }

/* Headings */
heading {
        heading-1 | heading-2 | heading-3 |
        heading-4 | heading-5 | heading-6
}
heading-1 { "=" text "=" }
heading-2 { "==" text "==" }
heading-3 { "===" text "===" }
heading-4 { "====" text "====" }
heading-5 { "=====" text "=====" }
heading-6 { "======" text "======" }

/* A single paragraph of text. May contain some multi-line constructs like
 * lists, but not headings. */
paragraph {
        (text | list)+
}

/* Signatures */
signature { signature-name | signature-name-date | signature-date }
signature-name { "~~~" }
signature-name-date { "~~~~" }
signature-date { "~~~~~" }

/* XXX: Match beginning/end of line */
ruler { "----" }

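/* At least one element: an empty list would let paragraph's (text | list)+
 * repeat without consuming any input. */
list { list-element+ }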
list-element { ("*" | "#" | ":" | ";")+ line }

comment { "<!--" text-plain "-->" }

tag { "<" /* XXX: What to put here? */ ">" }
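
As a rough illustration of how a few of the rules above could map onto a hand-written recursive descent parser, here is a minimal Python sketch. It covers only document, section, heading, paragraph and the text rules; every name in it (Node, ParseError, parse_text, parse_inline, parse_heading, parse_document) is made up for this example. It also assumes that every non-blank, non-heading line is its own paragraph and that apostrophe runs are tokenised bold-first, neither of which the grammar above settles.

```python
import re
from dataclasses import dataclass, field


class ParseError(ValueError):
    """Raised when the input does not fit the (sketched) grammar."""


@dataclass
class Node:
    kind: str                                      # rule name, e.g. "heading-2"
    value: str = ""                                # literal text of a text-plain leaf
    children: list = field(default_factory=list)


# Tokens: bold marker, italic marker, a run without apostrophes, a lone apostrophe.
TOKEN = re.compile(r"'''|''|[^']+|'")


def parse_text(tokens, i=0, closer=None):
    """text { text-plain | text-italic | text-bold }, repeated until `closer`."""
    nodes = []
    while i < len(tokens):
        tok = tokens[i]
        if tok == closer:
            return nodes, i                        # the caller consumes the closer
        if tok in ("'''", "''"):
            kind = "text-bold" if tok == "'''" else "text-italic"
            children, i = parse_text(tokens, i + 1, tok)   # recurse into the span
            nodes.append(Node(kind, children=children))
        elif nodes and nodes[-1].kind == "text-plain":
            nodes[-1].value += tok                 # merge adjacent plain tokens
        else:
            nodes.append(Node("text-plain", tok))
        i += 1
    if closer is not None:
        raise ParseError(f"missing closing {closer}")
    return nodes, i


def parse_inline(s):
    nodes, _ = parse_text(TOKEN.findall(s))
    return nodes


def parse_heading(line):
    """heading-N { "=" * N  text  "=" * N }; returns None for non-headings."""
    m = re.fullmatch(r"(={1,6})(.+?)\1", line.strip())
    if m is None:
        return None
    level = len(m.group(1))
    return Node(f"heading-{level}", children=parse_inline(m.group(2).strip()))


def parse_document(source):
    """document { section* | paragraph* }, section { heading paragraph* }."""
    document = Node("document")
    container = document                           # paragraphs before any heading
    for line in source.splitlines():
        if not line.strip():
            continue
        heading = parse_heading(line)
        if heading is not None:
            container = Node("section", children=[heading])
            document.children.append(container)
        else:
            container.children.append(
                Node("paragraph", children=parse_inline(line)))
    return document
```

Calling parse_document("== Title ==\nSome '''bold''' text") should give a document node holding one section, whose children are a heading-2 node and a paragraph node. Unbalanced markup such as "''oops" raises ParseError instead of being silently repaired, in keeping with the idea of rejecting obscure input rather than guessing.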

Practical implementation

  • Don't make many exceptions and special cases (for example: "A closing </nowiki> tag is not required. If it is missing then the rest of the supplied text is treated as nowiki." [1]). Deprecate these obscure features and produce warnings, so that pages in violation can be detected and corrected.
  • Allowing HTML was probably always a bad idea. Provide wikitext replacements instead.
  • Use a bot to validate existing pages with the new parser. Maintain a list of pages that are not valid under the new parser.
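
The validation pass suggested in the points above might look roughly like the following Python sketch. It is only an illustration: validate_page and invalid_pages are hypothetical names, pages are passed in as a plain dict of title to wikitext rather than fetched from a wiki, and the two checks (an unclosed <nowiki>, and any other opening HTML-style tag) are assumptions about what the new parser would warn about.

```python
import re

# Opening, closing or self-closing tag: "<name ...>", "</name>", "<name />".
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^<>]*?(/?)>")


def validate_page(text):
    """Return a list of warning strings for one page (empty if it is clean)."""
    warnings = []
    nowiki_open_at = None                  # offset of the currently open <nowiki>
    for m in TAG.finditer(text):
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        self_closing = m.group(3) == "/"
        if name == "nowiki":
            if self_closing:
                continue                   # <nowiki /> opens nothing
            if closing:
                nowiki_open_at = None
            elif nowiki_open_at is None:
                nowiki_open_at = m.start()
        elif nowiki_open_at is None and not closing:
            # Assumption: outside <nowiki>, every HTML-style tag is flagged.
            warnings.append(
                f"offset {m.start()}: HTML tag <{name}> needs a wikitext replacement")
    if nowiki_open_at is not None:
        warnings.append(f"offset {nowiki_open_at}: <nowiki> is never closed")
    return warnings


def invalid_pages(pages):
    """Map each failing page title to its warnings; `pages` is {title: wikitext}."""
    report = {}
    for title, text in pages.items():
        warnings = validate_page(text)
        if warnings:
            report[title] = warnings
    return report
```

A bot could run invalid_pages over a dump and publish the resulting titles as the maintenance list, so that editors can fix pages before the deprecated behaviour is removed.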

See also