User:Vegard/Wikitext parsing
From Wikipedia, the free encyclopedia
Reasons for using an LL/LR (context-free) parser:
- Parsing efficiency
- Unambiguity
- A uniform parser/grammar would also help greatly with accessing documents from bots/programs (think DOM)
Don't try to capture all of today's wikitext constructs in a formal grammar. This would be counter-productive, and there are bound to be differences anyway. Even if strange things like nesting links within links are possible with the ad-hoc parsing, how many articles actually use them? The new parser should be simple and extensible (compare with today's regex hell).
Wiki parser
Recursive descent parser
/* XXX */
document { section* | paragraph* }

/* Container for elements that rely on separated lines for structure (such as
 * lists). */
line { text "\n" }

/* Text may contain mark-up like links and font styles, but only a single
 * contiguous line of text (therefore no lists or other elements that span
 * multiple lines). */
text { text-plain | text-italic | text-bold }
text-italic { "''" text "''" }
text-bold { "'''" text "'''" }

/* Plain-text may not contain additional markup. Plain-text may contain
 * markup that is not to be displayed as markup. Umm. */
text-plain { /* XXX: Define this. Make sure to include all UTF-8 characters. */ }

section { heading paragraph* }

/* Headings */
heading { heading-1 | heading-2 | heading-3 | heading-4 | heading-5 | heading-6 }
heading-1 { "=" text "=" }
heading-2 { "==" text "==" }
heading-3 { "===" text "===" }
heading-4 { "====" text "====" }
heading-5 { "=====" text "=====" }
heading-6 { "======" text "======" }

/* A single paragraph of text. May contain some multi-line constructs like
 * lists, but not headings. */
paragraph { (text | list)+ }

/* Signatures */
signature { signature-name | signature-name-date | signature-date }
signature-name { "~~~" }
signature-name-date { "~~~~" }
signature-date { "~~~~~" }

/* XXX: Match beginning/end of line */
ruler { "----" }

list { list-element* }
list-element { ("*" | "#" | ":" | ";")+ line }

comment { "<!--" text-plain "-->" }

tag { "<" /* XXX: What to put here? */ ">" }
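To make the grammar a little more concrete, here is a minimal recursive descent sketch in Python covering only the text, text-italic, text-bold and heading rules. The names Node, ParseError, parse_text and parse_heading are made up for this illustration and are not part of MediaWiki. It assumes that text-plain is "any run of characters that is not a quote marker", that a missing closing marker is an error rather than something to recover from, and that whitespace around heading text is dropped. It makes no attempt to resolve the ''''' (bold plus italic) ambiguity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node of the parse tree (hypothetical structure for this sketch)."""
    kind: str                                   # "plain", "italic", "bold", "heading"
    children: List["Node"] = field(default_factory=list)
    value: str = ""                             # only used by "plain" nodes
    level: int = 0                              # only used by "heading" nodes


class ParseError(Exception):
    """Raised when markup is opened but never closed."""


# Longest marker first, so that ''' is never mistaken for '' followed by '.
MARKERS = (("'''", "bold"), ("''", "italic"))
KIND_OF = dict(MARKERS)


def parse_text(src: str, pos: int = 0, stop: Optional[str] = None) -> Tuple[List[Node], int]:
    """text { text-plain | text-italic | text-bold }

    Parses from `pos` until the closing marker `stop` (or end of input when
    `stop` is None) and returns (nodes, position after what was consumed).
    """
    nodes: List[Node] = []
    buf: List[str] = []

    def flush() -> None:
        # Turn the plain characters collected so far into a text-plain node.
        if buf:
            nodes.append(Node("plain", value="".join(buf)))
            buf.clear()

    while pos < len(src):
        marker = next((m for m, _ in MARKERS if src.startswith(m, pos)), None)
        if marker is None:
            buf.append(src[pos])
            pos += 1
            continue
        flush()
        pos += len(marker)
        if marker == stop:
            return nodes, pos                   # closing marker of the enclosing rule
        # Opening marker: recurse, exactly as the grammar rule does.
        children, pos = parse_text(src, pos, stop=marker)
        nodes.append(Node(KIND_OF[marker], children=children))

    if stop is not None:
        raise ParseError(f"missing closing {stop}")
    flush()
    return nodes, pos


def parse_heading(line: str) -> Optional[Node]:
    """heading-N { "="*N text "="*N } for N in 1..6; returns None if no match."""
    stripped = line.rstrip("\n")
    level = len(stripped) - len(stripped.lstrip("="))
    if not 1 <= level <= 6 or len(stripped) <= 2 * level:
        return None
    if not stripped.endswith("=" * level):
        return None
    inner = stripped[level:-level]
    if inner.startswith("=") or inner.endswith("="):
        return None                             # unbalanced number of "=" signs
    children, _ = parse_text(inner.strip())
    return Node("heading", children=children, level=level)
```

Each grammar rule maps to one function (or one branch of one), which is the main attraction of this style: adding a rule means adding a function, not untangling a regex. Calling parse_text("a ''b'' c") yields a plain node, an italic node wrapping a plain node, and another plain node; parse_text("''oops") raises ParseError instead of silently guessing.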
Practical implementation
- Don't make many exceptions and special cases (for example: "A closing </nowiki> tag is not required. If it is missing then the rest of the supplied text is treated as nowiki." [1]). Deprecate these obscure features and produce warnings, so that pages in violation can be detected and corrected.
- Allowing HTML was probably always a bad idea. Provide Wikitext replacements.
- Use a bot to validate existing pages with the new parser. Maintain a list of pages that are not valid with the new parser.
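As a rough illustration of the "warn and list the offenders" idea, the sketch below checks one specific special case from above, the unclosed <nowiki> tag, and collects the titles of pages that fail. The function names check_nowiki and validate_pages are hypothetical, and the pages are assumed to be available as a plain title-to-wikitext dictionary; how a real bot would fetch them is left open.

```python
import re
from typing import Dict, List

# Matches <nowiki>, </nowiki> and the self-closing <nowiki/> (case-insensitive).
NOWIKI_TAG = re.compile(r"<(/?)nowiki\s*(/?)>", re.IGNORECASE)


def check_nowiki(text: str) -> List[str]:
    """Return warnings for unbalanced nowiki tags in `text` (empty list if fine)."""
    warnings: List[str] = []
    open_at = None                      # offset of the currently open <nowiki>, if any
    for m in NOWIKI_TAG.finditer(text):
        is_closing, is_self_closing = bool(m.group(1)), bool(m.group(2))
        if is_closing:
            if open_at is None:
                warnings.append(f"offset {m.start()}: </nowiki> without opening tag")
            open_at = None
        elif is_self_closing:
            continue                    # <nowiki/> opens nothing
        elif open_at is None:
            open_at = m.start()         # a nested <nowiki> is just literal text
    if open_at is not None:
        warnings.append(f"offset {open_at}: <nowiki> is never closed")
    return warnings


def validate_pages(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each invalid page title to its warnings; valid pages are omitted."""
    report: Dict[str, List[str]] = {}
    for title, text in pages.items():
        warnings = check_nowiki(text)
        if warnings:
            report[title] = warnings
    return report
```

The returned dictionary doubles as the list of pages to fix: a bot could publish its keys somewhere for editors to work through, and the list shrinks as pages are corrected.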
See also
- http://bugs.wikimedia.org/show_bug.cgi?id=7
- http://en.wikipedia.org/wiki/Help:Editing
- http://meta.wikimedia.org/wiki/EBNF
- http://meta.wikimedia.org/wiki/MediaWiki_lexer
- http://meta.wikimedia.org/wiki/MediaWiki_flexer
- http://meta.wikimedia.org/wiki/Alternative_parsers
- http://meta.wikimedia.org/wiki/One-pass_parser
- http://www.mediawiki.org/wiki/User:HappyDog/WikiText_parsing
- http://www.mediawiki.org/wiki/Markup_spec