Rendering Documents

Kuma’s wiki format is a restricted subset of HTML, augmented with KumaScript macros {{likeThis}} that render to HTML for display. Rendering is done in several steps, and the output of many of these steps is stored in the database for read-time performance.

A revision goes through several rendering steps before it appears on the site:

  1. A user submits new content, and Kuma stores it as Revision content
  2. Kuma bleaches and filters the content to create Bleached content
  3. KumaScript renders macros and returns KumaScript content
  4. Kuma bleaches and filters the content again to create Rendered content
  5. Kuma divides and processes the content into Body HTML, Quick links HTML, ToC HTML, Summary text and HTML

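There are other rendered outputs, each described in a later section:

  • Diff format: Revision content normalized for comparing revisions
  • Triaged content: Revision content lightly filtered for re-editing
  • Preview content: Triaged content rendered by KumaScript for previews
  • Raw content: Bleached content with filters, served for the ?raw parameter
  • Live sample: code sections extracted from Rendered content
  • Search content: text extracted from Rendered content for indexing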

Revision content

CKEditor provides a visual HTML editor for MDN writers. The raw HTML returned from CKEditor is stored in the Kuma database for further processing.

source

User-entered content, usually via CKEditor from the edit view (URLs ending with $edit)

Developer-submitted content via an HTTP PUT to the API view (URLs ending in $api)

displayed on MDN
“Revision Source” section of the revision detail view (URLs ending with $revision/<id>), in a <pre> tag
database
wiki_revision.content
code
kuma.wiki.models.Revision.content

To illustrate rendering, consider a new document published at /en-US/docs/Sandbox/simple with this Revision content:

<p>{{ CSSRef }}</p>

<p>I am a <strong>simple document</strong> with a CSS sidebar.</p>

<p style="color:red">I am red.</p>

<h2>Some Links</h2>

<ul>
 <li><a href="/en-US/docs/Web/HTML">The HTML Reference</a></li>
 <li>{{HTMLElement('div')}}</li>
 <li><a href="/en-US/docs/NewDocument">A new document</a></li>
</ul>

<div class="button" onclick="alert('hacked!');"></div>

<script>
  alert('How about this?');
</script>

This document has elements that highlight different areas of rendering:

  • A sidebar macro CSSRef, which will be rendered by KumaScript and extracted for display

  • A <h2> tag, which will gain an id attribute

  • A list of three links:
    1. An HTML link to an existing document
    2. A reference macro HTMLElement which will be rendered by KumaScript
    3. An HTML link to a new document, which will get rel="nofollow" and class="new" attributes
  • An onclick attribute, added in Source mode, which will be removed

  • A <script> section, added in Source mode, which will be escaped

CKEditor has partial support for restricting content to the HTML subset allowed for display. It also enforces a style where paragraphs (<p>) are split by empty lines, start at the first column, and are closed on the same line. Nested elements are indented one space. Plain text is wrapped in <p> tags by default. KumaScript macros, such as {{CSSRef}}, are treated as plain text by CKEditor, so they are also wrapped in <p> tags.

Writers can also switch to “Source” mode, which permits direct editing of the HTML, avoiding formatting and content restrictions. This can be used to attempt to inject scripts, such as an onclick attribute or a <script> element. These attempts are stored in the revision content.

The PUT API can also be used to add new revisions. This experimental API is for staff only at this time.

Bleached content

A revision can contain invalid markup, or elements that are not allowed on MDN. When a new revision is created, the related document is updated in revision.make_current(). This includes updating the title, path, and tags, and also cleaning the content and saving it on the Document record.

source
Revision content, processed with multiple filters
displayed on MDN
The API view (URLs ending in $api)
database
wiki_document.html for current revision, not stored for historical revisions
code

kuma.wiki.models.Document.get_html() (current revision, cached)

kuma.wiki.models.Revision.content_cleaned (any revision, dynamically generated)

The Bleached content of the simple document looks like this:

<p>{{ CSSRef }}</p>

<p>I am a <strong>simple document</strong> with a CSS sidebar.</p>

<p style="color: red;">I am red.</p>

<h2>Some Links</h2>

<ul>
 <li><a href="/en-US/docs/Web/HTML">The HTML Reference</a></li>
 <li>{{HTMLElement('div')}}</li>
 <li><a href="/en-US/docs/NewDocument">A new document</a></li>
</ul>

<div class="button"></div>

&lt;script&gt;
  alert('How about this?');
&lt;/script&gt;

The first step of cleaning is “bleaching”. The bleach library parses the raw HTML and drops any tags, attributes, or styles that are not on the allowed lists. In the simple document, this step drops the onclick attribute from the <div>, and escapes the <script> section.

Next, the HTML is tokenized by html5lib. The content is parsed for <iframe> elements, and any src attributes that refer to disallowed domains are dropped.

The tokenized document is serialized back to HTML, which may make changes to whitespace or attribute order. In the simple document, this step changes style="color:red" to style="color: red;", adding a space and a trailing semicolon.
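
The allowlist idea behind this cleaning step can be sketched with Python's standard library. This is not the bleach or html5lib API, and it does not normalize style attributes or filter <iframe> sources. The allowlists and the names AllowlistCleaner and clean are hypothetical, chosen only to reproduce the behavior described for the simple document:

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists for illustration only; real lists would be longer.
ALLOWED_TAGS = {"p", "strong", "h2", "ul", "li", "a", "div"}
ALLOWED_ATTRS = {"href", "class", "style"}


class AllowlistCleaner(HTMLParser):
    """Drop attributes that are not allowed, and escape tags that are not allowed."""

    def __init__(self):
        # Report entities such as &amp; as separate events instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            # Escape the whole tag so it is displayed as text.
            self.out.append(escape(self.get_starttag_text(), quote=False))
            return
        kept = []
        for name, value in attrs:
            if name in ALLOWED_ATTRS:
                kept.append(' {}="{}"'.format(name, escape(value or "")))
        self.out.append("<{}{}>".format(tag, "".join(kept)))

    def handle_endtag(self, tag):
        text = "</{}>".format(tag)
        self.out.append(text if tag in ALLOWED_TAGS else escape(text, quote=False))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))


def clean(html):
    cleaner = AllowlistCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

Escaping a disallowed tag, rather than deleting it, keeps the writer's text visible on the page while preventing it from running.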

KumaScript content

KumaScript macros are represented by text content wrapped in double curly braces, so they {{lookLike('this')}}. The KumaScript service processes these macros and replaces them with plain HTML. This intermediate representation is not stored, but instead is further processed to generate the Rendered content.

source
Bleached content, processed by KumaScript
displayed on MDN
not published
database
Errors at wiki_document.rendered_errors, content not stored
code
Errors at kuma.wiki.models.Document.rendered_errors, content not stored

The KumaScript content for the simple document looks like this:

<p><section class="Quick_links" id="Quick_Links"><ol><li><strong><a href="/en-US/docs/Web/CSS">CSS</a></strong></li><li><strong><a href="/en-US/docs/Web/CSS/Reference">CSS Reference</a></strong></li></ol></section></p>

<p>I am a <strong>simple document</strong> with a CSS sidebar.</p>

<p style="color: red;">I am red.</p>

<h2>Some Links</h2>

<ul>
 <li><a href="/en-US/docs/Web/HTML">The HTML Reference</a></li>
 <li><a href="/en-US/docs/Web/HTML/Element/div" title="The HTML Content Division element (&lt;div&gt;) is the generic container for flow content. It has no effect on the content or layout until styled using CSS."><code>&lt;div&gt;</code></a></li>
 <li><a href="/en-US/docs/NewDocument">A new document</a></li>
</ul>

<div class="button"></div>

&lt;script&gt;
  alert('How about this?');
&lt;/script&gt;

In the sample document, the {{CSSRef}} macro renders a sidebar. It uses data from the mdn/data project (fetched from GitHub), and the child pages of the CSS topic index (fetched from Web/CSS$children on the Kuma API server).

Because the sample document isn’t a real CSS reference page, the sidebar is smaller than usual. The data may specify that a page is in one or more groups, and a cross-reference should be added to the sidebar. For example, on Web/CSS/@media, the mdn/data JSON says it is in the “Media Queries” group, and the cross-reference is populated from API data fetched from Web/CSS/Media_queries$children. These data-driven elements can cause the sidebar to grow to several kilobytes.

The {{HTMLElement('div')}} macro also requires metadata from the <div> page, fetched from Web/HTML/Element/div$json on the Kuma API server, to populate the title attribute of the link.

Macros are implemented as Embedded JavaScript templates (.ejs files), which mix JavaScript code with HTML output. The macro dashboard has a list of macros, provided by the KumaScript service, as well as the count of pages using the macros, populated from site search. The macro source is stored in the KumaScript repo, such as CSSRef.ejs and HTMLElement.ejs. Macro names are case-insensitive, so {{CSSRef}} is the same as {{cssref}}.

If KumaScript encounters an issue during rendering, the error is encoded and returned in an HTTP header, in a format compatible with FireLogger. These errors are stored as JSON in wiki_document.rendered_errors. The rendered HTML isn’t stored, but it is passed on for further processing. Moderators frequently review documents with errors, and fix those that they can fix.

Environment variables

KumaScript macros often vary on page metadata, stored in the env object in the render context. The render call is a POST where the body is the Bleached content, and the headers include the encoded page metadata:

id
The database ID of the document, like 233925
locale
The locale of the page, like "en-US"
modified
The timestamp of the document modification time, like 1548278930.0
path
The URL path of the page, like /en-US/docs/Sandbox/simple
review_tags
A list of review tags, like ["technical", "editorial"]
revision_id
The database ID of the revision, like 1438410
slug
The slug section of the URL, like Sandbox/simple
tags
A list of document tags for the page, like [] or ["CSS"]
title
The document title, like "A simple page"
url
The full URL of the page, forced to http, like http://developer.mozilla.org/en-US/docs/Sandbox/simple.
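
How the metadata is encoded into headers is not spelled out here. As a minimal sketch, assume each value is serialized to JSON and base64-encoded into its own header. The X-Example-Env- prefix and both function names are hypothetical:

```python
import base64
import json

# Hypothetical header prefix; the real header names are not given in this document.
HEADER_PREFIX = "X-Example-Env-"


def encode_env_headers(env):
    """Encode each metadata value as base64(JSON) in its own header."""
    headers = {}
    for name, value in env.items():
        payload = json.dumps(value).encode("utf-8")
        headers[HEADER_PREFIX + name] = base64.b64encode(payload).decode("ascii")
    return headers


def decode_env_headers(headers):
    """Reverse encode_env_headers, as a rendering service might on receipt."""
    env = {}
    for name, value in headers.items():
        if name.startswith(HEADER_PREFIX):
            env[name[len(HEADER_PREFIX):]] = json.loads(base64.b64decode(value))
    return env
```

Encoding values as JSON keeps lists such as review_tags and numbers such as modified typed on the receiving side, and base64 keeps non-ASCII titles safe inside a header.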

Macro rendering speed

It is unpredictable how long it will take to render the macros on a page. After editing, a render is requested, and if it returns quickly, then the rendered page is displayed. Otherwise, rendering is queued as a background task, and the user sees a message that rendering is in progress.
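
The "render now, or fall back to the background" decision can be sketched with a time budget. This is an illustration only: it uses a thread pool instead of a task queue, and the render_or_defer name and its parameters are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def render_or_defer(render, content, budget_seconds, executor):
    """Return ("rendered", html) if rendering beats the budget.

    Otherwise return ("in progress", future), and the work continues in the
    background so the caller can show a "rendering in progress" message.
    """
    future = executor.submit(render, content)
    try:
        return "rendered", future.result(timeout=budget_seconds)
    except FutureTimeout:
        return "in progress", future
```

A caller would pass a ThreadPoolExecutor and a render function, then branch on the returned status to pick the response shown to the user.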

Macros vary on rendering time, stability, and ease of testing based on where they get their data. From simplest to most complex:

functional
The output varies only on the macro inputs, like SimpleBadge
environment data
The output varies on the environment variables, like ObsoleteBadge
local data
The output varies on data packaged with KumaScript, like SpecName (from SpecData.json) or Compat (from the npm-installed browser-compat-data project)
Kuma data
The output varies on data gathered from Kuma API calls to an in-cluster dedicated Kuma API server, like Index, which calls the $children API, or HTMLElement, which calls the $json API.
external data
The output varies on data from an external data source, like Bug (loads data from the Bugzilla API) or CSSRef (loads data from the mdn/data project via the GitHub API)

Rendered content

Rendered content is KumaScript content that has been cleaned up using the same process as Bleached content. This ensures that escaping issues in KumaScript macros do not affect the security of users on displayed pages.

source
Bleached KumaScript content
displayed on MDN
not published
database
wiki_document.rendered_html
code
kuma.wiki.models.Document.get_rendered()

The Rendered content for the simple document looks like this:

<p></p><section class="Quick_links" id="Quick_Links"><ol><li><strong><a href="/en-US/docs/Web/CSS">CSS</a></strong></li><li><strong><a href="/en-US/docs/Web/CSS/Reference">CSS Reference</a></strong></li></ol></section><p></p>

<p>I am a <strong>simple document</strong> with a CSS sidebar.</p>

<p style="color: red;">I am red.</p>

<h2>Some Links</h2>

<ul>
 <li><a href="/en-US/docs/Web/HTML">The HTML Reference</a></li>
 <li><a href="/en-US/docs/Web/HTML/Element/div" title="The HTML Content Division element (&lt;div>) is the generic container for flow content. It has no effect on the content or layout until styled using CSS."><code>&lt;div&gt;</code></a></li>
 <li><a href="/en-US/docs/NewDocument">A new document</a></li>
</ul>

<div class="button"></div>

&lt;script&gt;
  alert('How about this?');
&lt;/script&gt;

The parser doesn’t allow <section> as a child element of <p>, so the serializer closes the tag with a </p>, and adds another empty paragraph element after the section. This is a side-effect of the differences between the editing format, where {{CSSRef}} is text that needs to be in a paragraph element, and the rendered content, where the macro is expanded as a <section>.

Body HTML

The “middle” of a wiki document is populated by the Body HTML.

source
Extracted from Rendered content, cached in the database
displayed on MDN
On the displayed page, in an <article> element
database
wiki_document.body_html
code
kuma.wiki.models.Document.get_body_html()

The Body HTML for the simple document looks like this:

<p></p><p></p>

<p>I am a <strong>simple document</strong> with a CSS sidebar.</p>

<p style="color: red;">I am red.</p>

<h2 id="Some_Links">Some Links</h2>

<ul>
 <li><a href="/en-US/docs/Web/HTML">The HTML Reference</a></li>
 <li><a href="/en-US/docs/Web/HTML/Element/div" title="The HTML Content Division element (&lt;div>) is the generic container for flow content. It has no effect on the content or layout until styled using CSS."><code>&lt;div&gt;</code></a></li>
 <li><a rel="nofollow" href="/en-US/docs/NewDocument" class="new">A new document</a></li>
</ul>

<div class="button"></div>

&lt;script&gt;
  alert('How about this?');
&lt;/script&gt;

The <section id="Quick_Links"> element is discarded, leaving the empty <p></p> elements from the Rendered content. This can cause annoying empty space at the top of a document.

IDs are injected into header elements (such as id="Some_Links"), based on the header text.
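
A minimal sketch of ID injection, assuming the ID is the header text with tags removed and whitespace replaced by underscores, and that repeated headers get a numeric suffix. The real rules are not described here, so both function names and the suffix scheme are assumptions:

```python
import re

# Only matches headers without attributes, so existing IDs are left alone.
HEADER_RE = re.compile(r"<(h[1-6])>(.*?)</\1>", re.DOTALL)


def slugify_header(inner_html):
    """Turn header content into an ID: strip tags, join words with underscores."""
    text = re.sub(r"<[^>]+>", "", inner_html)
    return re.sub(r"\s+", "_", text.strip())


def inject_header_ids(html):
    seen = {}

    def add_id(match):
        tag, inner = match.group(1), match.group(2)
        slug = slugify_header(inner)
        seen[slug] = seen.get(slug, 0) + 1
        if seen[slug] > 1:
            # Assumed scheme for keeping IDs unique on the page.
            slug = "{}_{}".format(slug, seen[slug])
        return '<{0} id="{1}">{2}</{0}>'.format(tag, slug, inner)

    return HEADER_RE.sub(add_id, html)
```

For the simple document, inject_header_ids("<h2>Some Links</h2>") yields the <h2 id="Some_Links"> element shown in the Body HTML.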

Any links on the page are checked to see if they are links to other wiki pages, and if the destination page exists. The link to /en-US/docs/NewDocument gains a rel="nofollow" as well as class="new", to tell crawlers and humans that the link is to a page that hasn’t been written yet.
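
The link check can be sketched as a pass over the anchors with a set of known paths. The mark_new_links name and the existing_paths set are hypothetical, and the regular expression only handles anchors whose sole attribute is href:

```python
import re

ANCHOR_RE = re.compile(r'<a href="([^"]*)">')


def mark_new_links(html, existing_paths):
    """Add rel="nofollow" and class="new" to wiki links with no destination page."""

    def annotate(match):
        href = match.group(1)
        is_wiki_link = href.startswith("/") and "/docs/" in href
        if is_wiki_link and href not in existing_paths:
            return '<a rel="nofollow" href="{}" class="new">'.format(href)
        return match.group(0)

    return ANCHOR_RE.sub(annotate, html)
```

In a real system the set of existing paths would come from a database query, ideally one query per page rather than one per link.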

ToC HTML

The table of contents is populated from the <h2> elements of the Rendered content, if any, and appears as a floating “Jump to” bar when included. The “Jump to” bar can be suppressed in editing mode by opening “Edit Page Title and Properties”, and setting TOC to “No table of contents”. The JavaScript can also decide to keep the bar hidden, such as when there is a single heading. Even when not shown, the ToC HTML is generated and cached.

source
Extracted from Rendered content, cached in the database
displayed on MDN
On the displayed page, in an <ol class="toc-links"> element
database
wiki_document.toc_html
code
kuma.wiki.models.Document.get_toc_html()

For the simple document, the ToC HTML looks like this:

<li><a rel="internal" href="#Some_Links">Some Links</a>
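
A minimal sketch of building this list, assuming the <h2> elements already carry id attributes. The build_toc name is hypothetical, and the output deliberately mirrors the sample above, which has no closing </li>:

```python
import re

H2_RE = re.compile(r'<h2 id="([^"]+)">(.*?)</h2>', re.DOTALL)


def build_toc(html):
    """Emit one list item per <h2>, linking to the header's ID."""
    items = []
    for ident, inner in H2_RE.findall(html):
        text = re.sub(r"<[^>]+>", "", inner).strip()  # drop inline markup
        items.append('<li><a rel="internal" href="#{}">{}</a>'.format(ident, text))
    return "\n".join(items)
```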

Summary text and HTML

Summary text is used for SEO purposes. An editor can specify the summary text by adding an id="Summary" attribute to the element that contains the summary. Otherwise, the code extracts a summary from the first non-empty paragraph.

source
Extracted from Rendered content, cached in the database
displayed on MDN (text)

On the displayed page, in the <meta name="description"> element and other elements

In internal search results, as the search hit summary

On some document lists, like Documents by tag

displayed on MDN (HTML)

The page metadata view (URLs ending in $json)

The summary view (URLs with ?summary=1) (currently broken, see bug 1523955)

KumaScript macros that use page metadata, for example to populate title attributes

database

wiki_document.summary_text

wiki_document.summary_html

code

kuma.wiki.models.Document.get_summary_text()

kuma.wiki.models.Document.get_summary_html()

For the simple document, the summary text is:

I am a simple document with a CSS sidebar.

The summary HTML is:

I am a <strong>simple document</strong> with a CSS sidebar.
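
The selection rule (an explicit id="Summary" element, else the first non-empty paragraph) can be sketched with regular expressions. This is an illustration under simplifying assumptions: it does not handle nested elements of the same type, and the function names are hypothetical:

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")
EXPLICIT_RE = re.compile(r'<(\w+)[^>]*\sid="Summary"[^>]*>(.*?)</\1>', re.DOTALL)
PARAGRAPH_RE = re.compile(r"<p(?:\s[^>]*)?>(.*?)</p>", re.DOTALL)


def summary_html(rendered):
    """Prefer the element marked id="Summary", else the first non-empty <p>."""
    explicit = EXPLICIT_RE.search(rendered)
    if explicit:
        return explicit.group(2).strip()
    for inner in PARAGRAPH_RE.findall(rendered):
        if TAG_RE.sub("", inner).strip():
            return inner.strip()
    return ""


def summary_text(rendered):
    """Plain-text version of the summary: tags removed, entities decoded."""
    return unescape(TAG_RE.sub("", summary_html(rendered))).strip()
```

Skipping empty paragraphs matters for the simple document, because the Rendered content starts with the empty <p></p> left around the sidebar.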

Diff format

MDN moderators and localization leaders are interested in the changes to wiki pages. They want to revert spam and vandalism, enforce documentation standards, and learn about the writer community. They are focused on what changed between document revisions. The differences format, or Diff format, is used to highlight content changes.

source
Revision content, pretty-printed with tidylib, and compared to other revisions.
displayed on MDN

Revision comparison views (URLs ending in $compare)

The Revision dashboard

Page watch emails

First edit emails, sent to content moderators

RSS and Atom feeds

database
wiki_revision.tidied_content
code
kuma.wiki.models.Revision.get_tidied_content()

The simple document in Diff format looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
  <head>
    <title></title>
  </head>
  <body>
    <p>
      {{ CSSRef }}
    </p>
    <p>
      I am a <strong>simple document</strong> with a CSS sidebar.
    </p>
    <p style="color:red">
      I am red.
    </p>
    <h2>
      Some Links
    </h2>
    <ul>
      <li>
        <a href="/en-US/docs/Web/HTML">The HTML Reference</a>
      </li>
      <li>{{HTMLElement('div')}}
      </li>
      <li>
        <a href="/en-US/docs/NewDocument">A new document</a>
      </li>
    </ul>
    <div class="button" onclick="alert('hacked!');"></div>
    <script>
    alert('How about this?');
    </script>
  </body>
</html>

The Revision content is normalized using pytidylib, a Python interface to the C tidylib library, which turns the content into a well-structured HTML 4.01 document.

Content difference reports, or “diffs”, are generated by a line-by-line comparison of the content in Diff format of two revisions. Lines that do not differ are dropped, so that the reports focus on just the changed content, often without the wrapping HTML tags like <p></p>. These diffs often contain line numbers from the Diff format, which do not correspond to the line numbers in the Revision content because of differences in formatting and whitespace.

Because the Diff format can contain unsafe content, it is not displayed directly on MDN. On Revision comparison views, the Revision dashboard, and in feeds, the Diff formats of two revisions are processed by difflib.HtmlDiff to generate an HTML <table> showing only the changed lines, and with HTML escaping for the content.

For emails, difflib.unified_diff generates a text-based difference report, and it is sent as a plain-text email without escaping.
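
Both report styles come from Python's standard difflib module. The sketch below shows the two calls on lists of lines. The wrapper names and the "revision 1" and "revision 2" labels are illustrative, and the options Kuma actually passes are not documented here:

```python
import difflib


def text_diff(old_lines, new_lines):
    """Plain-text report, as used for emails. Content is not escaped."""
    return "\n".join(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile="revision 1",
            tofile="revision 2",
            lineterm="",
        )
    )


def html_diff(old_lines, new_lines):
    """HTML <table> report with changed lines and context. Content is escaped."""
    return difflib.HtmlDiff().make_table(old_lines, new_lines, context=True)
```

Given a revision that adds a <script> line, text_diff returns the line verbatim with a leading +, while html_diff returns it with the angle brackets escaped, which is why only the table form is safe to embed in a page.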

Triaged content

When a document is re-edited, the Revision content of the current revision is processed before being sent to the editor. This is a lighter version of the full bleaching process used to create Bleached content and Rendered content.

source
Revision content, with further processing in RevisionForm.
displayed on MDN

Editing <textarea> in the edit view (URLs ending with $edit)

Editing <textarea> in the translate view (URLs ending with $translate)

database
not stored
code
not available

For the simple document, this is the Triaged content:

<p>{{ CSSRef }}</p>

<p>I am a <strong>simple document</strong> with a CSS sidebar.</p>

<p style="color:red">I am red.</p>

<h2 id="Some_Links">Some Links</h2>

<ul>
 <li><a href="/en-US/docs/Web/HTML">The HTML Reference</a></li>
 <li>{{HTMLElement('div')}}</li>
 <li><a href="/en-US/docs/NewDocument">A new document</a></li>
</ul>

<div class="button"></div>

<script>
  alert('How about this?');
</script>

The headers get IDs, based on the content, if they did not have them before. For example, id="Some_Links" is added to the <h2>.

A simple filter is applied that strips any attributes that start with on, such as the scripting attempt onclick. Further bleaching, for example to remove the <script>, is not applied.
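
A filter of this kind can be sketched with two regular expressions. This is not Kuma's implementation: the strip_event_attributes name is hypothetical, and the pattern assumes attribute values do not contain a > character:

```python
import re

# An attribute whose name starts with "on", with a quoted or bare value.
ON_ATTR_RE = re.compile(
    r"""\s+on\w*\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE
)
OPEN_TAG_RE = re.compile(r"<[a-zA-Z][^>]*>")


def strip_event_attributes(html):
    """Remove on* attributes from opening tags, leaving everything else alone."""
    return OPEN_TAG_RE.sub(lambda m: ON_ATTR_RE.sub("", m.group(0)), html)
```

Because only attributes are touched, a <script> element passes through unchanged, matching the Triaged content shown above.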

CKEditor will perform additional parsing and formatting at load time. It will sometimes notice the empty <div> and replace it with <div class="button">&nbsp;</div>, especially if it is the last element on the page. It may also remove the <script> element entirely.

If a writer makes a change, these backend and CKEditor changes will be reflected in the new Revision content. This can confuse writers (“I didn’t add those IDs!”).

Preview content

When editing, a user can request a preview of the document. This sends the in-progress document to KumaScript for rendering, with a smaller list of environment variables.

source
Triaged content, with CKEditor parsing, passed through KumaScript
output
HTML content at /<locale>/docs/preview-wiki-content
database
not stored
code
not available

The Preview content for the simple document is:

<p></p>

<p>I am a <strong>simple document</strong> with a CSS sidebar.</p>

<p style="color: red;">I am red.</p>

<h2>Some Links</h2>

<ul>
 <li><a href="/en-US/docs/Web/HTML">The HTML Reference</a></li>
 <li><a href="/en-US/docs/Web/HTML/Element/div" title="The HTML Content Division element (&lt;div>) is the generic container for flow content. It has no effect on the content or layout until styled using CSS."><code>&lt;div&gt;</code></a></li>
 <li><a href="/en-US/docs/NewDocument">A new document</a></li>
</ul>

<div class="button"></div>

&lt;script&gt;
  alert('How about this?');
&lt;/script&gt;

Fewer environment variables are passed to the KumaScript server for preview than when generating the KumaScript content:

url
The base URL of the website, like https://developer.mozilla.org/
locale
The locale of the request, like "en-US"

Some macros use the absence of an environment variable to detect preview mode, and change their output. For example, {{CSSRef}} notices that env.slug is not defined, and outputs an empty string, leaving <p></p> in the preview output.

Other macros don’t have specific code to detect preview mode, and have KumaScript rendering errors in preview.

Some macros, like {{HTMLElement}}, work as expected in preview.

Raw content

A ?raw parameter can be added to the end of a document URL to request the source for a revision. This is processed in a similar way to the Triaged content, but starting from the Bleached content.

source
Bleached content, with filters
output
The page with a ?raw query parameter
database
not stored
code
not available

For the simple document, this is the raw content:

<p>{{ CSSRef }}</p>

<p>I am a <strong>simple document</strong> with a CSS sidebar.</p>

<p style="color: red;">I am red.</p>

<h2 id="Some_Links">Some Links</h2>

<ul>
 <li><a href="/en-US/docs/Web/HTML">The HTML Reference</a></li>
 <li>{{HTMLElement('div')}}</li>
 <li><a href="/en-US/docs/NewDocument">A new document</a></li>
</ul>

<div class="button"></div>

&lt;script&gt;
  alert('How about this?');
&lt;/script&gt;

The Bleached content is parsed for filtering. The headers get IDs, based on the content, if they did not have them before. For example, id="Some_Links" is added to the <h2>.

A simple filter is applied that strips any attributes that start with on, such as the scripting attempt onclick. However, this step should do nothing, since these attributes are dropped when creating the Bleached content.

Live sample

Live samples are stored in document content. The content is then processed to extract the CSS, JS, and HTML, and reformat them as a stand-alone HTML document suitable for displaying in an <iframe>.

source
A section extracted from Rendered content, with further processing
output
Live sample documents on a separate domain, such as https://mdn.mozillademos.org
database
Not stored in the database, but cached
code
kuma.wiki.Document.extract.code_sample(section_id)

The simple document does not include one of these samples. The Live samples page on MDN describes how the system works for content authors, and includes a live sample demo.

Most live samples are loaded in an <iframe>, inserted by the macro EmbedLiveSample. If the sample doesn’t work as an <iframe>, LiveSampleLink can be used instead. The <iframe src= URL is Kuma, running on a different domain, such as https://mdn.mozillademos.org, and configured to serve live samples (the code sample view) and attachments. A separate domain for user-created content, often served in an <iframe>, mitigates many security issues.

The live sample is generated when first requested, and then cached. The extractor looks for <pre> sections with class="brush: html", "brush: css", and "brush: js", to find the sample content, and then selectively un-escapes some HTML and CSS. These sections are used to populate a basic HTML file.
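
The extraction can be sketched as a search for the three <pre> blocks followed by a small page template. This is an illustration, not the extractor's code: it un-escapes everything rather than selectively, it expects the class attribute to be exactly the brush name, and the function names and template are assumptions:

```python
import re
from html import unescape


def extract_sample(section_html):
    """Collect the HTML, CSS, and JS source from a section's <pre> blocks."""
    parts = {}
    for kind in ("html", "css", "js"):
        pattern = r'<pre class="brush: {}">(.*?)</pre>'.format(kind)
        match = re.search(pattern, section_html, re.DOTALL)
        # The wiki stores sample source escaped, so decode it back to markup.
        parts[kind] = unescape(match.group(1)) if match else ""
    return parts


def build_sample_page(parts):
    """Assemble a stand-alone HTML document suitable for an <iframe>."""
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        "<style>\n" + parts["css"] + "\n</style>\n"
        "</head>\n<body>\n" + parts["html"] + "\n"
        "<script>\n" + parts["js"] + "\n</script>\n"
        "</body>\n</html>\n"
    )
```

Because the assembled page runs writer-supplied script, it only makes sense to serve it from the separate sample domain described above.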

There are other sample types that are not derived from wiki content. These are out-of-scope for this document, but the most significant are listed here for the curious:

Search content

Wiki documents are converted to a JSON format and indexed by ElasticSearch for internal search. This allows searching for words in the wiki content.

source
text extracted from Rendered content
output
text sections from in-content results from internal search
database
stored in ElasticSearch
code
kuma.wiki.search.WikiDocumentType.from_django(Document)

The Django utility strip_tags is used to quickly remove HTML tags. This utility is not guaranteed to generate an HTML-safe string, as highlighted in a security advisory. Kuma does not redisplay this string. ElasticSearch applies the HTML Strip Char Filter to this and other content, which also strips tags and replaces HTML entities like &amp; with the character equivalents like &.

When a search result is picked because of a content match, ElasticSearch returns the matching section, highlighting the matching terms in bold. This HTML is redisplayed on the search results page.

Documents are indexed when created and when updated, as an asynchronous process. Documents are removed from the index when deleted. Administrators can also re-create the entire index, for ElasticSearch upgrades or to freshen the data.

There is additional page metadata sent to ElasticSearch to power internal search. This includes page titles, tags, and locales. It also includes KumaScript macros, CSS class names, and HTML attributes, to allow advanced search queries and to power the macro dashboard.

Future Changes

Rendering evolved over years, and this document describes how it works, rather than how it was designed. There are some potential changes that would simplify rendering:

  • Sidebar macros are heavy users of API data and require post-processing of the Rendered content. Sidebar generation could be moved into Kuma instead of being specified by a macro.
  • The Diff format could be replaced by the Bleached content format, which would be stored for each revision rather than just for the most recent document.
  • Content from editing could be normalized and filtered before storing as the Revision content. This may unify the Triaged content, Diff format, and Bleached content.
  • The views that accept new revisions could add IDs to the content before storing the Revision content, rather than wait for the Triaged content or Body HTML.
  • Developers could refactor the code to consistently access and generate content, rather than repeat filter logic in different forms and views.

History

MDN has used different rendering processes in the past.

Prior to 2004, Netscape’s DevEdge was a statically-generated website, with content stored in a revision control system (CVS or similar). This was shut down for a while, until Mozilla was able to negotiate a license for the content.

From 2005 to 2008, MediaWiki was used as the engine of Mozilla Developer Center. The DevEdge content was converted to MediaWiki Markup.

From 2008 to 2011, MindTouch DekiWiki was used as the engine. MindTouch migrated the MediaWiki content to the DekiWiki format, a restricted subset of HTML, augmented with macros (“DekiScript”). During this period, the site was rebranded as Mozilla Developer Network.

In 2011, Kuma was forked from Kitsune, the Django-based platform for support.mozilla.org. The wiki format was as close as possible to the DekiWiki format. A new service KumaScript was added to implement DekiScript-style macros. The macros, also known as templates, were stored as content in the database. The service had a GET API to render pages, and a POST API to render previews.

In 2013, content zones were added, which allowed a “zone” of pages to have a different style from the rest of the site. For example, the Firefox Zone of all the documents under /Mozilla/Firefox had a logo and a shared sub-navigation sidebar. Sub-navigation was similar to quick links, identified by <section id="Subnav">, but stored on the “zone root” (/Mozilla/Firefox) rather than generated by a macro. Zones were part of an effort to consolidate developer documentation on MDN.

In 2016, the macros were exported from the Kuma database into the macros folder in the KumaScript repository. The historical changes were exported to mdn/archived_kumascript. This made rendering faster, and allowed code reviews and automated tests of macros, at the cost of requiring review and a production push to deploy macro changes.

In 2018, the content zones feature was removed. This was part of an effort to focus MDN Web Docs on common web platform technologies, and away from Mozilla-specific documentation. The sub-navigation feature was dropped.

In 2019, the KumaScript engine and macros were modernized to use current features of JavaScript, such as async / await, rather than libraries common in 2011. The API was also unified, so that both previews and standard renders required a POST.