ContNet Markup specification, version 0.3.1 (2017-08-23)

Overview

CNM is a lightweight markup language primarily meant to be used as the hypertext document markup format for ContNet. It is a line-based Unicode text markup format with indentation-delimited blocks. The primary goals of CNM are simple parsing and composition, as well as being readable and writable by humans.

CNM contains semantic content of hypertext pages. It does not include layout, styles or scripts, as all of that is supposed to be handled by the rendering application. As such, it aims to avoid obfuscating content behind presentation and supports responsive design, as every device can render the content to fit its screen and interface.

Syntax

All parts of CNM use the UTF-8 encoding. Any invalid UTF-8 sequence is replaced with the U+FFFD replacement character.

A CNM document is mainly composed of blocks defined by indentation. The core structure of the document consists of nested blocks containing other blocks, with the leaves being either blocks with no child blocks or some form of text that does not contain any blocks.

Each line in the document ends in a line feed character. All raw (not provided as an escape sequence) carriage return or null characters in the document are ignored. If the document does not end with a line feed character, it is parsed as if it had ended with one.

The contents of each block are parsed according to that block's parsing mode. If the block is not known, it can be parsed as a raw text block or skipped entirely.

When whitespace is mentioned in the specification, it refers to the following ASCII whitespace characters: tab (U+0009), line feed (U+000A), form feed (U+000C) and space (U+0020) in their raw Unicode character form, not as an escape sequence. All other Unicode whitespace characters stand for themselves and are not collapsed or used to split fields.

An empty line is a line consisting of at most as much indentation as the parent block's contents and nothing else. Such lines implicitly belong to the last parsed block regardless of the amount of indentation and act the same as if the indentation depth was the same as the block's contents.

TL;DR: Encoded in UTF-8, line-based. LF is line terminator, CR is ignored. Unknown blocks' contents are skipped.
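
As a non-normative illustration, the global text rules above could be applied once, before any block parsing takes place. The following Python sketch assumes the document arrives as a byte string; the function name normalize_document is hypothetical and not part of CNM.

```python
def normalize_document(data: bytes) -> str:
    """Decode raw bytes into the text that a CNM parser would work on."""
    # Invalid UTF-8 sequences are replaced with U+FFFD.
    text = data.decode("utf-8", errors="replace")
    # Raw carriage returns and null characters are ignored.
    text = text.replace("\r", "").replace("\x00", "")
    # A document without a final line feed is parsed as if it had one.
    if not text.endswith("\n"):
        text += "\n"
    return text


normalized = normalize_document(b"title\r\n\tHello")
```

Doing this up front means that later parsing stages only ever see line feeds as line terminators.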

The following general syntactic contexts are commonly used:

Block mode

In block mode, every nonempty line is parsed as a block name line.

The block name line consists of a list of whitespace-delimited simple text tokens. The line is first split on each sequence of one or more whitespace characters that are not a part of a simple text escape sequence (specifically, not "\ "). If there's any leading or trailing whitespace, the first or last token is an empty string after splitting. If the splitting ends with a single empty token (the entire line was just whitespace), the line is treated the same as an empty line and is skipped.

The first token in the block name line is the block name. It defines the meaning of the block and how its contents are parsed. The remaining tokens, if any, represent the block's arguments. All empty tokens in the arguments should be ignored. Some blocks might use the arguments as one single value; in that case, the arguments are joined together with spaces.

Note that excess tabs or space indentation will result in a block with an empty name. This will usually result in an unknown block, which will then be skipped.
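
The splitting rules above can be sketched in Python (3.9 or later) as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the returned tokens still contain their escape sequences, which would be resolved afterwards as simple text.

```python
from typing import Optional

WHITESPACE = "\t\n\f "  # the four ASCII whitespace characters named by the spec


def split_block_line(line: str) -> list[str]:
    """Split a block name line on runs of whitespace not covered by an escape."""
    tokens: list[str] = []
    current: list[str] = []
    in_gap = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and line[i + 1:i + 2] in ("\\", " "):
            # Keep the pair together: "\ " must not split a token, while
            # "\\ " (escaped backslash, then a raw space) still must.
            current.append(line[i:i + 2])
            i += 2
            in_gap = False
        elif ch in WHITESPACE:
            if not in_gap:
                tokens.append("".join(current))
                current = []
                in_gap = True
            i += 1
        else:
            current.append(ch)
            i += 1
            in_gap = False
    tokens.append("".join(current))
    return tokens


def parse_block_line(line: str) -> Optional[tuple[str, list[str]]]:
    """Return (block name, arguments), or None for a whitespace-only line."""
    tokens = split_block_line(line)
    if not any(tokens):
        return None  # treated the same as an empty line
    # Empty tokens among the arguments are ignored.
    return tokens[0], [token for token in tokens[1:] if token]
```

With this sketch, a line that starts with excess indentation yields an empty first token, which is how the empty block name described above comes about.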

All lines following the block name line that are indented at least one level more than the block name or are empty are parsed as the contents of the named block. For every such line, the initial indentation equal to one level more than the block name's is removed and the remainder of the line is parsed according to the named block's mode (the inner block keeps any tab characters in excess of the indentation). Block mode parsing in the current block resumes on the first nonempty line that has less indentation than the contents of the last named block.

TL;DR: Block mode contains blocks. Each block starts with a line containing a simple text name and optional arguments, split by non-escaped whitespace. All lines indented over the indentation of the block name line are contents of that block.
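
One possible way to group lines into blocks is sketched below in Python (3.9 or later). It assumes that one indentation level is a single tab character and that the input lines have already had their parent's indentation and their line feeds removed; the function name split_blocks is hypothetical.

```python
def split_blocks(lines: list[str]) -> list[tuple[str, list[str]]]:
    """Group lines into (block name line, content lines) pairs."""
    blocks: list[tuple[str, list[str]]] = []
    for line in lines:
        if line.startswith("\t") and blocks:
            # Content line: remove one level of indentation, keep the rest.
            blocks[-1][1].append(line[1:])
        elif line.strip("\t\n\f ") == "":
            # Empty lines belong to the last parsed block, if there is one.
            if blocks:
                blocks[-1][1].append("")
        else:
            # Anything else starts a new block at this level. An indented
            # line with no preceding block ends up with an empty name.
            blocks.append((line, []))
    return blocks


document = "title\n\tHello\n\ncontent\n\ttext\n\t\tBody\n"
top_level = split_blocks(document.split("\n")[:-1])
content_children = split_blocks(top_level[1][1])
```

The content lines of each pair can then be handed to whichever mode the named block uses, for example by calling split_blocks again for a block mode block.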

Simple text mode

Simple text is parsed by collapsing all raw (not provided as an escape sequence) whitespace into a single space and removing any leading or trailing spaces, then resolving escape sequences.

Simple text can contain escape sequences. These are C-style sequences of two or more characters that begin with a backslash and are parsed as the single character they represent. The following escape sequences are currently defined (without quotes):

"\b"          ->  U+0008      backspace
"\t"          ->  U+0009      tab
"\n"          ->  U+000A      line feed
"\v"          ->  U+000B      vertical tab
"\f"          ->  U+000C      form feed
"\r"          ->  U+000D      carriage return
"\ "          ->  U+0020      space
"\\"          ->  U+005C      backslash
"\x##"        ->  U+00##      8-bit Unicode character
"\u####"      ->  U+####      16-bit Unicode character
"\U########"  ->  U+########  32-bit Unicode character

The # characters in \x##, \u#### and \U######## escape sequences are arbitrary hexadecimal digits [0-9a-fA-F]. In \U########, the first two digits should generally be zero, since Unicode only supports 21-bit characters. Invalid codepoints are unescaped into the U+FFFD replacement character.

Any other sequence starting with a backslash that is not in the above table, as well as any of the \x, \u and \U sequences with too few hex digits, is parsed the same as if the backslash itself was escaped: it is left in the text unchanged, with the backslash remaining present.

Simple text mode is mostly used in block mode block names and arguments or as a part of other formats in specific blocks.

TL;DR: Collapse and trim whitespace. Handle C-style escape sequences. Invalid escape sequences are parsed as normal text.
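
The following Python sketch shows one way to implement simple text parsing. It is an illustration under stated assumptions, not a reference implementation: the names are hypothetical, and surrogate code points (U+D800 to U+DFFF) are assumed to count as invalid codepoints alongside values above U+10FFFF.

```python
import re

SIMPLE_ESCAPES = {"b": "\b", "t": "\t", "n": "\n", "v": "\v", "f": "\f",
                  "r": "\r", " ": " ", "\\": "\\"}
HEX_LENGTHS = {"x": 2, "u": 4, "U": 8}
HEX_DIGITS = set("0123456789abcdefABCDEF")
# Pieces: protected pair (escaped backslash or space), whitespace run,
# plain run, lone backslash.
PIECES = re.compile(r"\\[\\ ]|[\t\n\f ]+|[^\\\t\n\f ]+|\\")


def unescape(text: str) -> str:
    """Resolve the escape sequences from the table above."""
    out = []
    i = 0
    while i < len(text):
        kind = text[i + 1:i + 2]
        size = HEX_LENGTHS.get(kind, 0)
        digits = text[i + 2:i + 2 + size]
        if text[i] != "\\":
            out.append(text[i])
            i += 1
        elif kind in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[kind])
            i += 2
        elif size and len(digits) == size and set(digits) <= HEX_DIGITS:
            code = int(digits, 16)
            valid = code <= 0x10FFFF and not 0xD800 <= code <= 0xDFFF
            out.append(chr(code) if valid else "\ufffd")
            i += 2 + size
        else:
            # Unknown or incomplete escape: the backslash stays in the text.
            out.append("\\")
            i += 1
    return "".join(out)


def simple_text(text: str) -> str:
    """Collapse and trim raw whitespace, then resolve escape sequences."""
    pieces = [" " if piece[0] in "\t\n\f " else piece
              for piece in PIECES.findall(text)]
    if pieces and pieces[0] == " ":
        pieces.pop(0)
    if pieces and pieces[-1] == " ":
        pieces.pop()
    return unescape("".join(pieces))
```

Splitting into pieces before collapsing keeps an escaped space from being merged with, or trimmed as, raw whitespace.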

Raw text mode

In raw text mode, all data is parsed as a literal text blob. Whitespace is preserved exactly as-is, including any leading tabs (tabs that are a part of the block's indentation do not count as a part of the block content in block mode) and empty lines inside the content, excluding any leading or trailing empty lines, which are removed. Global text parsing rules (ignoring carriage returns, UTF-8) still apply. Each raw text line also retains its line feed character.

Raw mode is mostly used for the raw block and for the initial parsing of other blocks with their own syntax. In essence, every block could first be parsed in raw mode, with the results then parsed using the block's own parsing mode.

TL;DR: Lines are kept unmodified for later processing.
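
A minimal Python sketch (3.9 or later) of raw text mode follows. It assumes the content lines have already had the block's indentation and their line feeds removed, so that an empty line is an empty string; the function name raw_text is hypothetical.

```python
def raw_text(lines: list[str]) -> str:
    """Join content lines into a raw text blob."""
    start, end = 0, len(lines)
    # Leading and trailing empty lines are removed.
    while start < end and lines[start] == "":
        start += 1
    while end > start and lines[end - 1] == "":
        end -= 1
    # Empty lines inside the content and extra leading tabs are kept,
    # and every line retains its line feed.
    return "".join(line + "\n" for line in lines[start:end])
```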

Structure

The top level of a CNM document is parsed in block mode. It contains blocks containing metadata and the content itself.

None of the top-level blocks in CNM have any arguments.

An empty top-level block is equivalent to an absent one.

If the same top-level block appears multiple times, the contents are merged together. For a block mode block (content, site and links), the child blocks of each later instance are appended after those of the first instance. Blocks that are not parsed in block mode (title) are merged as if their contents were concatenated with empty lines in between.

The following blocks are defined on the top level:

title

Contains the document title. The contents of the block are parsed as simple text.

Note that the title can be of arbitrary length or even absent and may contain characters like line feed and various control codes. Implementations are not required to display them as such and may instead prefer to display the title, or its prefix up to a certain length if it's too long, as a single line with all whitespace collapsed even after resolving escape sequences.

While a title is recommended, a document is not required to have one. Implementations may display that as an empty title (or not show a title at all) or an implementation-defined placeholder or content excerpt of their choice.

Example:

title
	This is a document title.

TL;DR: Simple text. May be very long or not present at all. Make sure to handle e.g. newlines.

links

The links block can contain an arbitrary number of hyperlinks, which are intended to be a page-wide list of links to relevant parts of the website or other websites.

The block contents are parsed in block mode.

Each block inside the contents of the links block should have a URL as the block name and the hyperlink text as the block arguments joined with spaces. If the arguments are not present or empty, the hyperlink text is set to the hyperlink URL. The contents of the URL block are parsed as simple text and represent a link description, which may be optionally displayed by the interactive client (for example, as a title that appears on mouse-over or a footnote), but may as well be hidden.

Links with missing URL (blank block name) are skipped.

Example:

links
	/example Clicking this link leads to /example.
	/test
		The above link has no explicit title,
		so "/test" is used instead.

		However, it has a description.
		Despite the empty line,
		it's displayed as a single line.
	cnp://example.com/ Links can also be absolute URLs.

TL;DR: Block mode. Contains nested blocks with URL in name, link text in argument and description in simple text contents.

site

The site block represents a sitemap. It is used to show a hierarchical tree of the current site. The block contents are parsed in block mode.

Each block inside the site block should have a filename or filepath as the block name, which represents the path on the current site. The arguments, joined together with spaces, are an optional name of the path that is used as the hyperlink text; if not provided, then the path should be used as the name. The contents of each block are parsed in block mode and recursively contain other path blocks.

The path blocks represent an absolute hierarchical filepath within the current site. Each block represents a hyperlink to a certain page. To construct the entire filepath for a specific path block, prepend a slash to its name and the name of every parent block all the way to the site block itself, then join them together into a single string. If a block path contains slashes, it represents several levels of directories; path composition rules are unchanged. If a block path has a trailing slash, it should be preserved in the filepath. The final filepath represents a relative URL based on the document root of the current site.

The client should display these as a list or tree of hyperlinks for navigating the current site. It may assume that a node whose path matches the current page's location is the current page (e.g. shows it in a different color, or shows all other nodes collapsed, etc.). The order of nodes should not be changed and nodes with duplicate path or name should be kept as-is.

Sitemap entries with a missing path (blank block name) are skipped.

Example:

site
	foo This is a link to /foo
		bar And this to /foo/bar
		baz/quux This one leads to /foo/baz/quux
			test And this to /foo/baz/quux/test
		baz
			quux Above link uses "baz" as the name.
				test2 This leads to /foo/baz/quux/test2
	cnp://example.com/ This leads to /cnp:/example.com/

TL;DR: Block mode. Contains recursive block mode blocks with paths as names and hyperlink text as arguments. Join the names from the root site block to the selected child node into a filepath.
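
The path composition rule can be sketched in Python as follows. The node shape used here, a (name, link text, children) tuple, and the function name site_paths are hypothetical. Skipping the children of an entry with a blank name is an assumption as well, since the specification only says that the entry itself is skipped.

```python
def site_paths(nodes, prefix=""):
    """Yield (filepath, link text) pairs for nested (name, text, children) nodes."""
    for name, text, children in nodes:
        if not name:
            continue  # entries with a missing path are skipped
        # Prepend a slash to the name and append it to the parents' path.
        path = prefix + "/" + name
        # Without arguments, the block's own path is used as the link text.
        yield path, text or name
        yield from site_paths(children, path)


tree = [
    ("foo", "This is a link to /foo", [
        ("bar", "And this to /foo/bar", []),
        ("baz/quux", "This one leads to /foo/baz/quux", []),
        ("baz", "", [
            ("quux", "", [("test2", "This leads to /foo/baz/quux/test2", [])]),
        ]),
    ]),
]
entries = list(site_paths(tree))
```

Because the nodes are visited in document order and nothing is deduplicated, the order of entries and any duplicate paths are preserved.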

content

The content top-level block contains the entire body of the document. All of content's child blocks represent the document content.

The block contents are parsed in block mode. The meaning of each child block depends on its name. The following content blocks are currently defined:

section

The section block represents a division of the contents with an optional title.

The contents of the section block are parsed in block mode and can be arbitrary content blocks.

If the block has arguments, they are joined together with spaces and represent the section title. The section title is displayed as a heading and can be used as a content selector inside the document. Nested sections with titles represent subsections.

A section without a title groups the child blocks together without counting as a section (e.g. no table of contents entry). An example use of that is putting multiple text blocks into a list item. As a direct child of the content or section block, a title-less section does nothing and is equivalent to a document that has its child blocks directly inside the parent block in the place of the section block.

Example:

content
	section Section name goes here.

TL;DR: Group of content blocks with a heading.

text

The text block represents text contents.

It is parsed in raw text mode, with additional formatting being applied on top depending on the block arguments.

The text block can be specified with a text format mode as the first argument. The format may be used to add rich text formatting.

Currently, there are three text format modes defined: plain, pre and fmt. If the block argument is empty, the plain format is used. Contents of blocks with unknown format modes can be parsed as if they were raw blocks.

TL;DR: Contains text. Formatting depends on argument.

text plain

The text plain block represents plain text content. It consists of a sequence of paragraphs of simple text. Since it's the default mode for the text block, using the plain argument is not necessary.

A paragraph is a sequence of consecutive nonempty lines of simple text. A paragraph ends with an empty line or the end of the text block. When displaying paragraphs, spacing should be added between them (such as some padding or a blank line). Escaped line feeds in the text itself do not have this spacing.

Example:

content
	text
		This is a paragraph of text.
		This sentence is in the same line as the above.

		This one, however, is a new paragraph.\n
		And the escaped line break above splits this
		sentence into a new line, but not a new paragraph.

		This   is   joined   by   single   spaces.

TL;DR: Contains paragraphs of simple text and escape sequences.
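
Paragraph splitting could look like the Python sketch (3.9 or later) below. It takes the de-indented content lines of a text block, treats whitespace-only lines as empty (an assumption), and deliberately leaves escape sequences unresolved; a complete implementation would also have to keep an escaped space out of the whitespace collapsing. The function name split_paragraphs is hypothetical.

```python
import re


def split_paragraphs(lines: list[str]) -> list[str]:
    """Split text block lines into whitespace-collapsed paragraphs."""
    paragraphs: list[str] = []
    current: list[str] = []
    for line in lines + [""]:  # the extra "" closes the last paragraph
        if line.strip("\t\n\f ") == "":
            if current:
                joined = " ".join(current)
                collapsed = re.sub(r"[\t\n\f ]+", " ", joined).strip(" ")
                paragraphs.append(collapsed)
                current = []
        else:
            current.append(line)
    return paragraphs


paragraphs = split_paragraphs([
    "This is a paragraph of text.",
    "This sentence is in the same line as the above.",
    "",
    "This   is   joined   by   single   spaces.",
])
```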

text pre

The text pre block represents preformatted plain text content.

The text pre block contents are parsed the same way as a raw block's, except that simple text escape sequences are still resolved and no syntax highlighting should be done. Whitespace is left untouched and the whole text block is just a single paragraph regardless of blank lines (which are simply literal line feeds).

Example:

content
	text pre
		This is the first line.
		This is on a new line.
		This sentence is\non two lines.

		The above line is empty, but not a paragraph.
		This   line   contains   triple   spaces.

TL;DR: Contains preformatted raw text and escape sequences.

text fmt

The text fmt block represents text that contains simple inline formatting.

First, the text block is split into paragraphs the same way as a plain text block, with whitespace collapsed as in simple text. After that, the CNMfmt formatting is applied to each paragraph. Finally, escape sequences (including CNMfmt specific ones) are resolved.

See the CNMfmt section below for more information.

Example:

content
	text fmt
		This is **bold**, //italic//, __underlined__,
		``monospaced`` and @@/ a hyperlink to /@@.

		**bold //bold+italic **italic __italic+underlined
		still italic+underlined **italic+underlined+bold

		This is no longer bold, italic, or underlined.
		It is also a new paragraph containing a single
		line without formatting.

		@@# This link contains **bold** text.

		**@@# This hyperlink is bold,**@@ but this isn't.

TL;DR: Contains paragraphs of text containing inline CNMfmt formatting.

raw

The raw block represents preformatted text contents.

The block contents are parsed in raw mode. When possible, the contents should be displayed with a monospaced font with all whitespace preserved.

If present, the first block argument represents the type of the contents. That should generally be the MIME type of the data or lowercased name of the language/syntax in the contents of the raw block (for example, text/html or html, text/javascript or application/javascript or javascript). When rendering the block contents, the type may be used to perform syntax highlighting.

Note that, as in all other blocks, it's not possible to include leading or trailing blank lines in the raw block's contents.

Example:

content
	raw
		this is not **bold**
		this is on a new line
		this line is \n all in one line
		above line contains characters "\" and "n"

		the above line was empty

TL;DR: Raw preformatted text. Argument is type name for optional syntax highlighting.

list

The list block represents a list of items.

The block contents are parsed in block mode and can contain arbitrary content blocks. Each child block represents one list item; several blocks can be grouped into a single item using a section block.

The first block argument represents the list type. Currently, two list types are defined: ordered and unordered. Unordered lists are simple lists of items with e.g. bullet points. Ordered lists use Arabic numbers by default; currently, choosing alternate numbering style is not possible, but it may be added in the future. Nested unordered lists may use different bullet style, but are not required to. Nested ordered lists use the same style of numbering as the parent one; nested numbering style may be configurable in future versions of CNM. Ordered lists always start with 1.

Example:

content
	list
		text
			This is the first item.
		text
			Second item.
		section
			text
				Third item.
			text
				Still third item.
		list
			text
				Nested list, item 4.1.

TL;DR: List of content blocks. Argument: ordered or unordered. Ordered always starts with 1.

table

The table block represents two-dimensional tabular data.

The contents are parsed in block mode. A table can contain two different types of blocks: header and row. The header and row blocks both act like a section block without an argument: they can contain arbitrary content blocks. Each of their child blocks represents one table cell; to group multiple blocks into one cell, a section block without a title can be used.

The width of the table depends on the longest header or row. Any headers or rows with fewer cells than that are padded with empty cells on the right side.

Currently, there is no support for multi-column or multi-row cells.

header

The header block represents a table header row.

It is parsed the same way as a section block without a title and can contain arbitrary content blocks. Each child block represents a column header cell.

A header row should be displayed in a more emphasized manner and may, optionally, allow sorting all following rows, up to the next header or the end of the table, by column. A table is not required to start with a header, nor to include one at all.

row

The row block represents a table data row.

It is parsed the same way as a section block without a title and can contain arbitrary content blocks. Each child block represents a table body cell.

The row block represents a row of the table contents.

Example:

content
	table
		header
			text
				Header of column 1
			text
				Header of column 2
			text
				Header of column 3
		row
			text
				Row 1 column 1
			text
				Row 1 column 2
		row
			text
				Row 2 column 1
			text
				Row 2 column 2
		row
			section
				text
					Row 3 column 1
				text
					Still Row 3 column 1
			text
				Row 3 column 2
			text
				Row 3 column 3

				Row 1 column 3 and row 2 column 3 are empty.

TL;DR: Contains headers and rows. Child blocks of these are cells.
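
The padding rule for short headers and rows can be sketched in Python (3.9 or later) as below. Cells are represented as plain strings purely for illustration, and the function name pad_rows is hypothetical.

```python
def pad_rows(rows: list[list[str]]) -> list[list[str]]:
    """Pad every header or row with empty cells up to the table width."""
    # The table width is the length of the longest header or row.
    width = max((len(row) for row in rows), default=0)
    # Shorter rows receive empty cells on the right side.
    return [row + [""] * (width - len(row)) for row in rows]


padded = pad_rows([
    ["Header 1", "Header 2", "Header 3"],
    ["Row 1 column 1", "Row 1 column 2"],
])
```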

embed

The embed block is used to embed external content into the document.

The first block argument represents the MIME type of the embedded content. It can be used by the user agent to decide how to handle it. Graphical browsers are recommended to display at least common image types (e.g. image/png, image/jpeg and image/webp) inside the page by default. An empty argument or invalid MIME type can be treated as an application/octet-stream type and not be embedded.

The second argument is the URL pointing to the embedded content. An embed block without a URL should be ignored. The URL may also be a data URI.

The contents of the block are parsed in simple text mode and represent the description of the embedded content. If present, the description can be displayed as e.g. a caption, mouse-over title, placeholder when the content cannot be embedded, etc., but may as well be hidden.

If the content type is unknown or cannot be embedded within the page, the embedded content should be presented as a hyperlink instead.

Example:

content
	embed image/png /static/example.png
		This is an embedded image's caption/title/hover text.

TL;DR: Arguments are MIME type and URL, contents are description. Embed inside page if possible, otherwise provide hyperlink.

The CNMfmt inline formatting submarkup

The CNMfmt markup is used within text fmt content blocks to provide inline formatting of text.

CNMfmt extends the CNM text plain block by introducing toggles of various format options. These toggles consist of two symbol characters. If the format of the toggle is currently not in effect, the toggle enables it. Otherwise, the format is disabled. Formats do not have to be toggled in LIFO order. All formats are implicitly closed with the end of the paragraph.

The following toggles and formats are currently defined:

**  bold
//  italic
__  underlined
``  monospaced
@@  hyperlink
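
The toggle behaviour can be illustrated with the Python sketch (3.9 or later) below. It is deliberately incomplete: it covers only the four simple formats, treats @@ as plain text, and leaves escape sequences unresolved in the output. The function name format_spans is hypothetical.

```python
TOGGLES = {"**": "bold", "//": "italic", "__": "underlined", "``": "monospaced"}


def format_spans(paragraph: str) -> list[tuple[str, frozenset[str]]]:
    """Split one paragraph into (text, active formats) spans."""
    spans: list[tuple[str, frozenset[str]]] = []
    active: set[str] = set()
    buffer: list[str] = []
    i = 0

    def flush() -> None:
        if buffer:
            spans.append(("".join(buffer), frozenset(active)))
            buffer.clear()

    while i < len(paragraph):
        pair = paragraph[i:i + 2]
        if pair[0] == "\\" and len(pair) == 2:
            # A backslash pair never toggles; it is kept for later unescaping.
            buffer.append(pair)
            i += 2
        elif pair in TOGGLES:
            flush()
            # Enable the format if it is off, disable it if it is on.
            active ^= {TOGGLES[pair]}
            i += 2
        else:
            buffer.append(pair[0])
            i += 1
    flush()  # all formats are implicitly closed at the end of the paragraph
    return spans


spans = format_spans("**bold //bold+italic **italic __italic+underlined")
```

Keeping the active formats in a set, rather than on a stack, is what allows formats to be closed in any order.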

Bold

The **bold** format makes all text inside it bold. It uses two asterisks (**) as the toggle.

Italic

The //italic// format makes all text inside it italic. It uses two slashes (//) as the toggle.

Underlined

The __underlined__ format makes all text inside it underlined. It uses two underscores (__) as the toggle.

Monospaced

The contents of the ``monospaced`` format should be rendered using a monospaced font, if possible. Whitespace is not preserved; it is collapsed the same way as in the rest of the text fmt block. It uses two grave accents (``) as the toggle.

Hyperlink

The @@cnp://example.com/ hyperlink@@ format represents an inline hyperlink. It uses two at signs (@@) as the toggle.

The hyperlink consists of two parts: the URL and the link text.

The URL is the first non-whitespace word inside the formatted text. The URL does not contain any CNMfmt toggles except @@, which ends the entire hyperlink format (for example, the // inside the URL does not toggle the italic format). Note that the URL can still contain CNM simple text and CNMfmt escape sequences; these can be used to supply Unicode characters and spaces instead of manually percent-encoding the URL.

If the hyperlink format consists of more than one word, the remainder of the content is used as the hyperlink text. It may contain arbitrary CNMfmt formatting. If the link text is blank, the URL is used as link text instead.
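
Splitting the text between two @@ toggles into its URL and link text could be sketched in Python (3.9 or later) as follows. The sketch assumes whitespace has already been collapsed to single spaces and that escape sequences are still unresolved, so an escaped space stays inside the URL; the function name split_hyperlink is hypothetical.

```python
def split_hyperlink(body: str) -> tuple[str, str]:
    """Split hyperlink contents into (URL, link text)."""
    body = body.lstrip(" ")
    i = 0
    while i < len(body) and body[i] != " ":
        # Step over backslash pairs so that an escaped space does not
        # end the URL.
        i += 2 if body[i] == "\\" else 1
    url = body[:i]
    text = body[i:].lstrip(" ")
    # A blank link text falls back to the URL.
    return url, text or url
```

The link text returned here may still contain CNMfmt toggles, which would be processed like any other formatted text.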

Any other sequences of two symbols stand for themselves as text.

The CNMfmt markup also includes several new escapes alongside the standard CNM ones to allow including the toggle characters as text:

"\*"  ->  U+002A  asterisk
"\/"  ->  U+002F  slash
"\_"  ->  U+005F  underscore
"\`"  ->  U+0060  grave accent
"\@"  ->  U+0040  at sign