ContNet Markup specification, version 0.4 (2017-09-04)

Overview

CNM is a lightweight markup language primarily meant to be used as the hypertext document markup format for ContNet. It is a line-based Unicode text markup format with indentation-delimited blocks. The primary goals of CNM are simple parsing and composition, as well as being readable and writable by humans.

CNM contains semantic content of hypertext pages. It does not include layout, styles or scripts, as all of that is supposed to be handled by the rendering application. As such, it aims to avoid obfuscating content behind presentation and supports responsive design, as every device can render the content to fit its screen and interface.

Syntax

All parts of CNM use the UTF-8 encoding. Any invalid UTF-8 sequence is replaced with the U+FFFD replacement character.

A CNM document is mainly composed of blocks defined by indentation. The core structure of the document consists of nested blocks containing other blocks, with the leaves being either blocks with no child blocks or some form of text that does not contain any blocks.

Each line in the document ends in a line feed character. All raw (not provided as an escape sequence) carriage return or null characters in the document are ignored. If the document does not end with a line feed character, it is parsed as if it had ended with one.

The parsing method for the contents of a block depends on which block it is. If the block is not known, it should be ignored and all of its contents skipped by advancing until the next nonempty line with less indentation than the unknown block's contents.

When whitespace is mentioned in the specification, it refers to the following ASCII whitespace characters: tab (U+0009), line feed (U+000A), form feed (U+000C) and space (U+0020) in their raw Unicode character form, not as an escape sequence. All other Unicode whitespace characters stand for themselves and are not collapsed or used to split fields.

An empty line is a line consisting of at most as much indentation as the parent block's contents and nothing else. Such lines implicitly belong to the last parsed block regardless of the amount of indentation and act the same as if the indentation depth was the same as the block's contents.

TL;DR: Encoded in UTF-8, line-based. LF is line terminator, CR is ignored. Unknown blocks' contents are skipped.
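
As an illustration of the global rules above, the following minimal Python sketch turns raw document bytes into a list of lines. The function name split_lines is hypothetical and not part of CNM; the sketch only covers decoding and line splitting, not block parsing.

```python
def split_lines(data: bytes) -> list[str]:
    """Decode raw document bytes and split them into lines without the LF."""
    # Invalid UTF-8 sequences are replaced with U+FFFD.
    text = data.decode("utf-8", errors="replace")
    # Raw carriage returns and null characters are ignored.
    text = text.replace("\r", "").replace("\x00", "")
    # A document without a final line feed is parsed as if it had one.
    if not text.endswith("\n"):
        text += "\n"
    # Every line now ends in a line feed, so the last split element is empty.
    return text.split("\n")[:-1]
```

Escaped carriage returns such as "\r" in simple text are not affected by this step, since they are still a backslash and a letter at this point.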

The following general syntactic contexts are commonly used:

Block mode

In block mode, every nonempty line is parsed as a block name line.

The block name line consists of a list of whitespace-delimited simple text tokens. The line is first split on each sequence of one or more whitespace characters that are not a part of a simple text escape sequence (specifically, not "\ "). If there's any leading or trailing whitespace, the first or last token is an empty string after splitting. If the splitting ends with a single empty token (the entire line was just whitespace), the line is treated the same as an empty line and is skipped.

The first token in the block name line is the block name. It defines the meaning of the block and how its contents are parsed. The remaining tokens, if any, represent the block's arguments. All empty tokens in the arguments should be ignored. Some blocks might use the arguments as one single value; in that case, the arguments are joined together with spaces.

Note that excess tabs or space indentation will result in a block with an empty name. This will usually result in an unknown block, which will then be skipped.
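
The splitting rules above can be sketched in Python as follows. The names split_block_line and parse_block_line are hypothetical. The sketch assumes the block's own indentation has already been removed, treats a line whose tokens are all empty as an empty line, and leaves escape sequences in the tokens unresolved (they would be resolved afterwards as simple text).

```python
WHITESPACE = "\t\n\x0c "  # tab, line feed, form feed, space


def split_block_line(line: str) -> list[str]:
    """Split a block name line on runs of non-escaped whitespace."""
    tokens = [""]
    in_gap = False  # True while inside a run of separator whitespace
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] in "\\ ":
            # Keep "\\" and "\ " together so an escaped space never splits.
            tokens[-1] += line[i:i + 2]
            in_gap = False
            i += 2
            continue
        if ch in WHITESPACE:
            if not in_gap:
                tokens.append("")  # start the next token
                in_gap = True
        else:
            tokens[-1] += ch
            in_gap = False
        i += 1
    return tokens


def parse_block_line(line: str):
    """Return (name, arguments), or None for a whitespace-only line."""
    tokens = split_block_line(line)
    if not any(tokens):
        return None
    # The first token is the name; empty argument tokens are ignored.
    return tokens[0], [t for t in tokens[1:] if t]
```

With this sketch, a line with one excess tab such as "\ttext fmt" yields an empty block name with the arguments "text" and "fmt", which matches the note about excess indentation.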

All lines following the block name line that are indented at least one level more than the block name or are empty are parsed as the contents of the named block. For every such line, the initial indentation equal to one level more than the block name's is removed and the remainder of the line parsed according to the named block's mode (the inner block keeps any tab characters in excess of the indentation). Block mode parsing in the current block resumes on the first nonempty line that has less indentation than the contents of the last named block.

TL;DR: Block mode contains blocks. Each block starts with a line containing a simple text name and optional arguments, split by non-escaped whitespace. All lines indented over the indentation of the block name line are contents of that block.
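
The way a block claims its content lines can be sketched as below. This is only an illustration: the function name take_block_contents is hypothetical, and the sketch assumes that one indentation level is exactly one tab character, as in the examples of this specification.

```python
def take_block_contents(lines: list[str], start: int, depth: int):
    """Collect the contents of a block whose name line sits at `depth` tabs.

    `start` is the index of the first line after the block name line.
    Returns (content_lines, index_of_first_line_after_the_block).
    """
    prefix = "\t" * (depth + 1)
    contents = []
    i = start
    while i < len(lines):
        line = lines[i]
        if line.startswith(prefix):
            # Strip one level more than the name line; excess tabs are kept.
            contents.append(line[len(prefix):])
        elif line.strip("\t") == "":
            # Empty line (indentation only): belongs to the current block.
            contents.append("")
        else:
            # First nonempty line with less indentation ends the block.
            break
        i += 1
    return contents, i
```

The same function can also be used to skip an unknown block: the returned contents are discarded and parsing resumes at the returned index.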

Simple text mode

Simple text is parsed by collapsing all raw (not provided as an escape sequence) whitespace into a single space and removing any leading or trailing spaces, then resolving escape sequences.

Simple text can contain escape sequences. These are C-style sequences of two or more characters that begin with a backslash and are parsed as a single character they represent. The following escape sequences are currently defined (without quotes):

"\b"          ->  U+0008      backspace
"\t"          ->  U+0009      tab
"\n"          ->  U+000A      line feed
"\v"          ->  U+000B      vertical tab
"\f"          ->  U+000C      form feed
"\r"          ->  U+000D      carriage return
"\ "          ->  U+0020      space
"\\"          ->  U+005C      backslash
"\x##"        ->  U+00##      8-bit Unicode character
"\u####"      ->  U+####      16-bit Unicode character
"\U########"  ->  U+########  32-bit Unicode character

The # characters in \x##, \u#### and \U######## escape sequences are arbitrary hexadecimal digits [0-9a-fA-F]. In \U########, the first two digits should generally be zero, since Unicode only supports 21-bit characters. Invalid codepoints are unescaped into the U+FFFD replacement character.

Any other sequence starting with a backslash that is not in the above table, including a \x, \u or \U sequence with too few hex digits, is parsed as if the backslash itself were escaped: it is left in the text unchanged, with the backslash remaining present.

Simple text mode is mostly used in block mode block names and arguments or as a part of other formats in specific blocks.

TL;DR: Collapse and trim whitespace. Handle C-style escape sequences. Invalid escape sequences are parsed as normal text.
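
A minimal Python sketch of simple text parsing is given below. It processes the text in a single pass so that an escaped space ("\ ") is never collapsed or trimmed. The names parse_simple_text and decode_codepoint are hypothetical, and the sketch assumes that "invalid codepoints" means values above U+10FFFF and surrogates.

```python
import string

WHITESPACE = "\t\n\x0c "
SIMPLE_ESCAPES = {"b": "\b", "t": "\t", "n": "\n", "v": "\v", "f": "\f",
                  "r": "\r", " ": " ", "\\": "\\"}
HEX_ESCAPES = {"x": 2, "u": 4, "U": 8}  # escape letter -> number of hex digits


def decode_codepoint(cp: int) -> str:
    # Assumption: values above U+10FFFF and surrogates count as invalid.
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return "\ufffd"
    return chr(cp)


def parse_simple_text(text: str) -> str:
    out = []
    pending_space = False  # a raw whitespace run was seen since last output
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in WHITESPACE:
            pending_space = True
            i += 1
            continue
        # Emit one space for the collapsed run, except at the very start.
        if pending_space and out:
            out.append(" ")
        pending_space = False
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt in SIMPLE_ESCAPES:
                out.append(SIMPLE_ESCAPES[nxt])
                i += 2
                continue
            if nxt in HEX_ESCAPES:
                n = HEX_ESCAPES[nxt]
                digits = text[i + 2:i + 2 + n]
                if len(digits) == n and all(c in string.hexdigits for c in digits):
                    out.append(decode_codepoint(int(digits, 16)))
                    i += 2 + n
                    continue
        # Ordinary character, or a backslash that starts no valid escape.
        out.append(ch)
        i += 1
    return "".join(out)
```

Trailing raw whitespace is dropped because a pending space is only emitted when another character follows it.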

Raw text mode

In raw text mode, all data is parsed as a literal text blob. Whitespace is preserved exactly as-is, including any leading tabs (tabs that are a part of the block's indentation do not count as a part of the block content in block mode) and empty lines inside the content, excluding any leading or trailing empty lines, which are removed. Global text parsing rules (ignoring carriage returns, UTF-8) still apply. Each raw text line also retains its line feed character.

Raw mode is mostly used for the raw block and for the initial parsing of other blocks with their own syntax. In essence, every block could first be parsed in raw mode, with the result then parsed again using the block's own parsing mode.

TL;DR: Lines are kept unmodified for later processing.
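
As a sketch of the raw mode rules, the following Python function takes the content lines of a block (with the block's indentation already removed and without their line feeds) and produces the raw text blob. The name parse_raw_text is hypothetical.

```python
def parse_raw_text(lines: list[str]) -> str:
    """Join content lines into a raw text blob."""
    start, end = 0, len(lines)
    # Leading and trailing empty lines are removed.
    while start < end and lines[start] == "":
        start += 1
    while end > start and lines[end - 1] == "":
        end -= 1
    # Inner empty lines and leading tabs are kept; each line retains its LF.
    return "".join(line + "\n" for line in lines[start:end])
```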

Structure

The top level of a CNM document is parsed in block mode. It contains blocks containing metadata and the content itself.

None of the top-level blocks in CNM have any arguments.

An empty top-level block is equivalent to an absent one.

If the same top-level block appears multiple times in the document, the contents of all instances are merged together. The content merging happens after parsing, so all child blocks end with the end of each instance of a top-level block. This means that a child block of one of multiple instances of container blocks (content, site and links) is fully contained in its parent top-level block and cannot extend into the next one. Simple text blocks (title) can just merge their contents as if all of their lines belonged to a single block, since simple text collapses whitespace anyway.

The following blocks are defined on the top level:

title

Contains the document title. The contents of the block are parsed as simple text.

Note that the title can be of arbitrary length or even absent and may contain characters like line feed and various control codes. Implementations are not required to display them as such and may instead prefer to display the title, or its prefix up to a certain length if it's too long, as a single line with all whitespace collapsed even after resolving escape sequences.

While a title is recommended, a document is not required to have one. Implementations may display that as an empty title (or not show a title at all) or an implementation-defined placeholder or content excerpt of their choice.

Example:

title
	This is a document title.

TL;DR: Simple text. May be very long or not present at all. Make sure to handle e.g. newlines.

links

The links block can contain an arbitrary number of hyperlinks, which are intended to be a page-wide list of links to relevant parts of the website or other websites.

The block contents are parsed in block mode.

Each block inside the contents of the links block should have a URL as the block name and the hyperlink text as the block arguments joined with spaces. If the argument is not present or empty, the hyperlink name is set to the hyperlink URL. The contents of the URL block are parsed as simple text and represent a link description, which may be optionally displayed by the interactive client (for example, as a title that appears on mouse-over or a footnote), but may as well be hidden.

Links with missing URL (blank block name) are skipped.

Example:

links
	/example Clicking this link leads to /example.
	/test
		The above link has no explicit title,
		so "/test" is used instead.

		However, it has a description.
		Despite the empty line,
		it's displayed as a single line.
	cnp://example.com/ Links can also be absolute URLs.

TL;DR: Block mode. Contains nested blocks with URL in name, link text in argument and description in simple text contents.

site

The site block represents a sitemap. It is used to show a hierarchical tree of the current site. The block contents are parsed in block mode.

Each block inside the site block should have a filename or filepath as the block name, which represents the path on the current site. The arguments, joined together with spaces, are an optional name of the path that is used as the hyperlink text; if not provided, then the path should be used as the name. The contents of each block are parsed in block mode and recursively contain other path blocks.

The path blocks represent an absolute hierarchical filepath within the current site. Each block represents a hyperlink to a certain page. To construct the entire filepath for a specific path block, prepend a slash to its name and the name of every parent block all the way to the site block itself, then join them together into a single string. If a block path contains slashes, it represents several levels of directories; path composition rules are unchanged. If a block path has a trailing slash, it should be preserved in the filepath. The final filepath represents a relative URL based on the document root of the current site.

The client should display these as a list or tree of hyperlinks for navigating the current site. It may assume that a node whose path matches the current page's location is the current page (e.g. shows it in a different color, or shows all other nodes collapsed, etc.). The order of nodes should not be changed and nodes with duplicate path or name should be kept as-is.

Sitemap entries with a missing path (blank block name) are skipped.

Example:

site
	foo This is a link to /foo
		bar And this to /foo/bar
		baz/quux This one leads to /foo/baz/quux
			test And this to /foo/baz/quux/test
		baz
			quux Above link uses "baz" as the name.
				test2 This leads to /foo/baz/quux/test2
	cnp://example.com/ This leads to /cnp:/example.com/

TL;DR: Block mode. Contains recursive block mode blocks with paths as names and hyperlink text as descriptions. Join the names from the root site block to the selected child node into a filepath.
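
The path composition can be sketched in Python as below. The data model is an assumption made for the example: each sitemap node is a (name, link text, children) tuple, and site_paths is a hypothetical helper. The sketch also assumes that the children of a skipped entry are skipped with it, and it does not apply any URL normalization to the resulting path.

```python
def site_paths(nodes, parent=""):
    """Yield (filepath, link_text) for every sitemap node, in document order."""
    for name, text, children in nodes:
        if not name:
            # Entries with a missing path are skipped (children too, by assumption).
            continue
        # Prepend a slash to the block name and append it to the parent's path.
        path = parent + "/" + name
        # Without arguments, the block name is used as the link text.
        yield path, text or name
        yield from site_paths(children, path)


site = [
    ("foo", "This is a link to /foo", [
        ("bar", "And this to /foo/bar", []),
        ("baz/quux", "", [("test", "", [])]),
    ]),
    ("docs/", "", []),
]
for path, text in site_paths(site):
    print(path, "->", text)
```

Because names are joined as they are, a name containing slashes adds several directory levels at once and a trailing slash is preserved.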

content

The content top-level block contains the entire body of the document. All of content's child blocks represent the document content.

The block contents are parsed in block mode. The meaning of each child block depends on its name. The following content blocks are currently defined:

section

The section block represents a division of the contents with an optional title.

The contents of the section block are parsed in block mode and can be arbitrary content blocks.

If the block has arguments, they are joined together with spaces and represent the section title. The section title is displayed as a heading and can be used as a content selector inside the document. Nested sections with titles represent subsections.

A section without a title groups the child blocks together without counting as a section (e.g. no table of contents entry). An example use of that is putting multiple text blocks into a list item. As a direct child of the content or section block, a title-less section does nothing and is equivalent to a document that has its child blocks directly inside the parent block in the place of the section block.

Example:

content
	section Section name goes here.

TL;DR: Group of content blocks with a heading.

text

The text block represents text contents.

It is parsed in raw text mode, with additional formatting being applied on top depending on the block arguments.

The text block can be specified with a text format mode as the first argument. The format may be used to add rich text formatting.

Currently, there are three text format modes defined: plain, pre and fmt. If the block argument is empty, the plain format is used. Contents of blocks with unknown format modes can be parsed as if they were raw blocks.

TL;DR: Contains text. Formatting depends on argument.

text plain

The text plain block represents plain text content. It consists of a sequence of paragraphs of simple text. Since it's the default mode for the text block, using the plain argument is not necessary.

A paragraph is a sequence of consecutive nonempty lines of simple text. A paragraph ends with an empty line or the end of the text block. When displaying paragraphs, spacing should be added between them (such as some padding or a blank line). Escaped line feeds in the text itself do not have this spacing.

Example:

content
	text
		This is a paragraph of text.
		This sentence is in the same line as the above.

		This one, however, is a new paragraph.\n
		And the escaped line break above splits this
		sentence into a new line, but not a new paragraph.

		This   is   joined   by   single   spaces.

TL;DR: Contains paragraphs of simple text and escape sequences.
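
Paragraph splitting can be sketched as follows. The function name split_paragraphs is hypothetical. The sketch assumes that a line containing only whitespace counts as empty, and it only groups the lines: each returned paragraph would still have to be parsed as simple text to collapse whitespace and resolve escape sequences.

```python
WHITESPACE = "\t\n\x0c "


def split_paragraphs(lines: list[str]) -> list[str]:
    """Group consecutive nonempty lines into paragraphs."""
    paragraphs, current = [], []
    for line in lines:
        if line.strip(WHITESPACE) == "":
            # An empty line ends the current paragraph, if there is one.
            if current:
                paragraphs.append(" ".join(current))
                current = []
        else:
            current.append(line)
    # The end of the text block also ends the last paragraph.
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Joining the lines with a space is equivalent to keeping the raw line feeds here, since both are whitespace that simple text parsing collapses into a single space.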

text pre

The text pre block represents preformatted plain text content.

The text pre block contents are parsed the same way as a raw block's, except that simple text escape sequences are still resolved and no syntax highlighting should be done. Whitespace is left untouched and the whole text block is just a single paragraph regardless of blank lines (which are simply literal line feeds).

Example:

content
	text pre
		This is the first line.
		This is on a new line.
		This sentence is\non two lines.

		The above line is empty, but not a paragraph.
		This   line   contains   triple   spaces.

TL;DR: Contains preformatted raw text and escape sequences.

text fmt

The text fmt block represents text that contains simple inline formatting.

First, the text block is split into paragraphs the same way as a plain text block, with whitespace collapsed as in simple text. After that, the CNMfmt formatting is applied to each paragraph. Finally, escape sequences (including CNMfmt specific ones) are resolved.

See the CNMfmt section below for more information.

Example:

content
	text fmt
		This is **emphasized**, __alternate__, ``code``,
		""quoted"" and @@/ a hyperlink to /@@.

		**emphasized __emphasized+alternate **alternate ""alternate+quoted
		still alternate+quoted **alternate+quoted+emphasized

		This is no longer emphasized, alternate, or quoted.
		It is also a new paragraph containing a single
		line without formatting.

		@@# This link contains **emphasized** text.

		**@@# This hyperlink is emphasized,**@@ but this text isn't.

TL;DR: Contains paragraphs of text containing inline CNMfmt formatting.

raw

The raw block represents preformatted text contents.

The block contents are parsed in raw mode. When possible, the contents should be displayed with a monospaced font with all whitespace preserved.

If present, the first block argument represents the type of the contents. That should generally be the MIME type of the data or lowercased name of the language/syntax in the contents of the raw block (for example, text/html or html, text/javascript or application/javascript or javascript). When rendering the block contents, the type may be used to perform syntax highlighting.

Note that, as in all other blocks, it's not possible to include leading or trailing blank lines in the raw block's contents.

Example:

content
	raw
		this is not **emphasized**
		this is on a new line
		this line is \n all in one line
		above line contains characters "\" and "n"

		the above line was empty

TL;DR: Raw preformatted text. Argument is type name for optional syntax highlighting.

list

The list block represents a list of items.

The block contents are parsed in block mode and can contain arbitrary content blocks. Each child block represents one list item; several blocks can be grouped into a single item using a section block.

The first block argument represents the list type. Currently, two list types are defined: ordered and unordered. Unordered lists are simple lists of items with e.g. bullet points. Ordered lists use Arabic numbers by default; currently, choosing an alternate numbering style is not possible, but it may be added in the future. Nested unordered lists may use a different bullet style, but are not required to. Nested ordered lists use the same style of numbering as the parent one; nested numbering style may be configurable in future versions of CNM. Ordered lists always start with 1.

Example:

content
	list
		text
			This is the first item.
		text
			Second item.
		section
			text
				Third item.
			text
				Still third item.
		list
			text
				Nested list, item 4.1.

TL;DR: List of content blocks. Argument: ordered or unordered. Ordered always starts with 1.

table

The table block represents two-dimensional tabular data.

The contents are parsed in block mode. A table can contain two different types of blocks: header and row. The header and row blocks both act like a section block without an argument: they can contain arbitrary content blocks. Each of their child blocks represents one table cell; to group multiple blocks into one cell, a section block without a title can be used.

The width of the table depends on the longest header or row. Any headers or rows with fewer cells than that are padded with empty cells on the right side.
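
A minimal sketch of the width rule, assuming each header or row is already available as a list of cells. The name pad_rows and the use of None for an empty cell are choices made for this example only.

```python
def pad_rows(rows: list[list], empty=None) -> list[list]:
    """Pad every header or row on the right to the width of the longest one."""
    width = max((len(row) for row in rows), default=0)
    return [row + [empty] * (width - len(row)) for row in rows]
```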

Currently, there is no support for multi-column or multi-row cells.

header

The header block represents a table header row.

It is parsed the same way as a section block without a title and can contain arbitrary content blocks. Each child block represents a column header cell.

A header row should be displayed in a more emphasized manner than data rows. Optionally, the client may allow sorting by column all rows that follow the header, up to the next header or the end of the table. A table is not required to start with a header, nor to include one at all.

row

The row block represents a table data row.

It is parsed the same way as a section block without a title and can contain arbitrary content blocks. Each child block represents a table body cell.

The row block represents a row of the table contents.

Example:

content
	table
		header
			text
				Header of column 1
			text
				Header of column 2
			text
				Header of column 3
		row
			text
				Row 1 column 1
			text
				Row 1 column 2
		row
			text
				Row 2 column 1
			text
				Row 2 column 2
		row
			section
				text
					Row 3 column 1
				text
					Still Row 3 column 1
			text
				Row 3 column 2
			text
				Row 3 column 3

				Row 1 column 3 and row 2 column 3 are empty.

TL;DR: Contains headers and rows. Child blocks of these are cells.

embed

The embed block is used to embed external content into the document.

The first block argument represents the MIME type of the embedded content. It can be used by the user agent to decide how to handle it. Graphical browsers are recommended to display at least common image types (e.g. image/png, image/jpeg, image/webp and image/svg+xml) inside the page by default. An empty argument or invalid MIME type can be treated as an application/octet-stream type and not be embedded.

The second argument is the URL pointing to the embedded content. An embed block without a URL should be ignored. The URL may also be a data URI.

The contents of the block are parsed in simple text mode and represent the description of the embedded content. If present, the description can be displayed as e.g. a caption, mouse-over title, placeholder when the content cannot be embedded, etc., but may as well be hidden.

If the content type is unknown or cannot be embedded within the page, the embedded content should be presented as a hyperlink instead.

Example:

content
	embed image/png /static/example.png
		This is an embedded image's caption/title/hover text.

TL;DR: Argument is MIME type and URL, contents are description. Embed inside page if possible, otherwise provide hyperlink.

Selectors

CNM selector queries can be used to identify specific sections in a CNM document.

Selectors can be used to select a section in the document (e.g. to move an open document so that it's visible) or filter a document to only show certain sections and their content.

Section selector

A section selector query identifies a specific section in the document. It's usually used in the hash fragment part of a URL to move the visible document to the named section. Section title selectors are case-sensitive.

Section selectors can select sections either by a section title, a path of section titles or a path of section indices. A section without a title does not count as a section and cannot be selected by section selectors; any mention of sections in the specification of selectors refers exclusively to sections with non-empty titles. A section with an empty title can essentially be regarded as a generic container block.

Title selector

#{title}

The title selector selects the first section with the given title ({title}) in the document. The section order is defined by their vertical position; block depth is irrelevant. If multiple sections in the document have the same title, this selector only selects the first one. The title must use URL percent-encoding where at least the slash character (U+002F) is encoded into %2F or %2f.

An empty title matches the top of the document contents.

Note that the # character (U+0023) in the selector is not the same as the one separating the URL hash fragment. An example URL with a title selector is cnp://example.com/file.cnm##title.
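
Building such a URL can be sketched in Python with the standard urllib.parse.quote function. The helper name title_selector_url is hypothetical. Passing safe="" makes quote() percent-encode the slash as well, which the title selector requires.

```python
from urllib.parse import quote


def title_selector_url(page_url: str, title: str) -> str:
    """Append a title selector to a page URL as its hash fragment."""
    # First "#": URL fragment separator. Second "#": title selector prefix.
    return page_url + "#" + "#" + quote(title, safe="")
```

For example, the title "A/B" is encoded as A%2FB, so the slash cannot be mistaken for a path separator.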

Title path selector

/{path}

The title path selector selects a section based on a path of section titles. The {path} part of the query consists of zero or more section titles (escaped just like in the title selector) separated by a single slash character.

Each title in the path selects a section using the same method as the title selector, but only considers sections that aren't a child block of another section in the current context (are accessible from the current context without passing through another section). The initial context is the top-level content block. Each time a section in the path is matched, the new context becomes this section's contents.

If any part of the path fails to find a matching section, the query does not match anything.

An empty path matches the top of the document contents. An empty title in a non-empty path does not match anything.

Index path selector

${indices}

The index path selector selects a section based on a path of section indices. The {indices} part of the query is a dot-separated path of zero or more section indices represented by decimal numbers.

Each index in the path selects a section within the current context (as in the title path selector). The first section has the index 1.

If any index in the path is zero or higher than the number of the sections in its context, the query does not match anything.

An empty path matches the top of the document contents.

Content selector

A content selector is a selector that selects a subset of the document contents based on a section.

The content selectors have the same syntax as the section selectors, but may be optionally prefixed with an exclamation mark (U+0021) for a shallow selector.

Using a content selector query on a document returns a new document consisting of only the named section, all of its contents and all parent block names up to the top-level without any of their sibling blocks or other contents.

A shallow selector selects a similar document, but excludes the contents of any child sections of the selected section (the section block name lines and any non-section blocks with their contents are kept).

For the cases where a specific selector selects the top of the document contents, the entire content block with all of its contents is selected (or, in the case of a shallow selector, without child section contents).

An empty content selector selects the entire document with all of its contents, including non-content top-level blocks, unmodified (though the actual document may be recomposed, as long as the contents aren't changed). A content selector consisting only of the shallow selector modifier ! selects the same document, but without the contents of any sections.

Examples

Example CNM document:

title
	Test
content
	section A
		text
			T1
		section B
			text
				T2
			list
				text
					T3
				section C
					text
						T4
		section C
			text
				T5
	list
		section
			text
				T6
		section E
			text
				T7
		text
			T8
	section E
		text
			T9

Section selectors:

  • #A selects the section A containing the text T1, section B and section C.

  • #C selects the section C containing the text T4.

  • #F does not select anything.

  • /A selects the section A containing the text T1, section B and section C.

  • /A/B/C selects the section C containing the text T4.

  • /A/C selects the section C containing the text T5.

  • /E selects the section E containing the text T7.

  • /B does not select anything.

  • $1 selects the section A containing the text T1, section B and section C.

  • $2 selects the section E containing the text T7.

  • $3 selects the section E containing the text T9.

  • $1.1.1 selects the section C containing the text T4.

  • $1.3 does not select anything.
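
The section selector examples above can be reproduced with the following Python sketch. The block model is an assumption made for the example: every block is a (name, argument, children) tuple built by the hypothetical helper blk, and a text block stores its text in the argument slot for brevity. The query is passed without the URL fragment separator, and the content block itself is returned when the top of the document contents is selected.

```python
from urllib.parse import unquote


def blk(name, arg="", *children):
    """Hypothetical block model: (name, argument, child blocks)."""
    return (name, arg, list(children))


def is_section(block):
    # Only sections with a non-empty title count for selectors.
    return block[0] == "section" and block[1] != ""


def all_sections(blocks):
    """Titled sections in document order, at any depth."""
    for block in blocks:
        if is_section(block):
            yield block
        yield from all_sections(block[2])


def context_sections(blocks):
    """Titled sections reachable without passing through another one."""
    for block in blocks:
        if is_section(block):
            yield block
        else:
            yield from context_sections(block[2])


def select(root, query):
    """Return the selected block, `root` for the top of contents, or None."""
    kind, rest = query[:1], query[1:]
    if kind == "#":
        if rest == "":
            return root
        title = unquote(rest)
        return next((s for s in all_sections(root[2]) if s[1] == title), None)
    if kind == "/":
        parts = [unquote(p) for p in rest.split("/")] if rest else []
    elif kind == "$":
        parts = rest.split(".") if rest else []
    else:
        return None
    current = root
    for part in parts:
        sections = list(context_sections(current[2]))
        if kind == "/":
            # An empty title in a non-empty path matches nothing.
            current = next((s for s in sections if part and s[1] == part), None)
        else:
            ok = part.isascii() and part.isdigit() and 1 <= int(part) <= len(sections)
            current = sections[int(part) - 1] if ok else None
        if current is None:
            return None
    return current


doc = blk("content", "",
    blk("section", "A",
        blk("text", "T1"),
        blk("section", "B",
            blk("text", "T2"),
            blk("list", "", blk("text", "T3"),
                blk("section", "C", blk("text", "T4")))),
        blk("section", "C", blk("text", "T5"))),
    blk("list", "",
        blk("section", "", blk("text", "T6")),
        blk("section", "E", blk("text", "T7")),
        blk("text", "T8")),
    blk("section", "E", blk("text", "T9")))
```

The only difference between the title selector and the two path selectors is the traversal: all_sections descends into every block, while context_sections stops at each titled section it finds.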

Content selectors:

  • #C selects the following document:

    content
    	section A
    		section B
    			list
    				section C
    					text
    						T4
  • !/A selects the following document:

    content
    	section A
    		text
    			T1
    		section B
    		section C
  • !/ selects the following document:

    content
    	section A
    	list
    		section
    			text
    				T6
    		section E
    		text
    			T8
    	section E
  • ! selects the following document:

    title
    	Test
    content
    	section A
    	list
    		section
    			text
    				T6
    		section E
    		text
    			T8
    	section E
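
The filtering performed by content selectors can be sketched in Python as below. The block model is an assumption made for the example: every block is a (name, argument, children) tuple, with a text block storing its text in the argument slot. The hypothetical function select_content takes the content block and the section already found by a section selector, and does not handle the empty selector that also keeps non-content top-level blocks.

```python
def is_section(block):
    # Only sections with a non-empty title count as sections.
    return block[0] == "section" and block[1] != ""


def shallow_copy(block):
    """Copy a block, keeping titled child sections without their contents."""
    name, arg, children = block
    kept = [(c[0], c[1], []) if is_section(c) else shallow_copy(c)
            for c in children]
    return (name, arg, kept)


def select_content(block, target, shallow=False):
    """Return `block` reduced to the chain of parents leading to `target`."""
    if block is target:
        return shallow_copy(block) if shallow else block
    for child in block[2]:
        found = select_content(child, target, shallow)
        if found is not None:
            # Keep only the parent's name line and the matching child.
            return (block[0], block[1], [found])
    return None


doc = ("content", "", [
    ("section", "A", [
        ("text", "T1", []),
        ("section", "B", [("text", "T2", [])]),
    ]),
    ("text", "T3", []),
])
section_a = doc[2][0]
section_b = section_a[2][1]
print(select_content(doc, section_b))
print(select_content(doc, section_a, shallow=True))
```

Passing the content block itself as the target covers the case where the selector matches the top of the document contents.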

The CNMfmt inline formatting submarkup

The CNMfmt markup is used within text fmt content blocks to provide inline formatting of text.

CNMfmt extends the CNM text plain block by introducing toggles of various format options. These toggles consist of two symbol characters. If the format of the toggle is currently not in effect, the toggle enables it. Otherwise, the format is disabled. Formats do not have to be toggled in LIFO order. All formats are implicitly closed with the end of the paragraph.

The following toggles and formats are currently defined:

**  emphasized
__  alternate
``  code
""  quotation
@@  hyperlink

Emphasized

The **emphasized** format indicates emphasized text. It uses two asterisks (**) as the toggle. The usual way to style emphasized text is with a bold font, but implementations may choose to use a different style.

Alternate

The __alternate__ format indicates text in an alternate voice that is offset from the normal text. It uses two underscores (__) as the toggle. The usual way to style alternate text is with an italic font, but implementations may choose to use a different style.

Code

The contents of the ``code`` format represent computer code or similar text that is usually not in a spoken language. It uses two grave accents (``) as the toggle. Note that whitespace in this tag is not preserved; it is collapsed the same way as in the rest of the text fmt block. Code should be displayed in a monospaced font, if possible.

Quote

The ""quote"" format represents a quotation. It uses two quote marks ("") as the toggle. The usual way to style quoted text is to include quote marks on the beginning and end and/or frame it, but implementations may choose a different style.

Hyperlink

The @@cnp://example.com/ hyperlink@@ format represents an inline hyperlink. It uses two at signs (@@) as the toggle.

The hyperlink consists of two parts: the URL and the link text.

The URL is the first non-whitespace word inside the formatted text. The URL does not contain any CNMfmt toggles excluding @@, which ends the entire hyperlink format (for example, if a __ appears inside the URL, it does not toggle the alternate format). Note that the URL can still contain CNM simple text and CNMfmt escape sequences; these can be used to supply Unicode characters and spaces instead of manually percent-encoding the URL.

If the hyperlink format consists of more than one word, the remainder of the content is used as the hyperlink text. It may contain arbitrary CNMfmt formatting. If the link text is blank, the URL is used as link text instead.

Any other sequences of two symbols stand for themselves as text.
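
The toggle behaviour can be sketched in Python as below. This is a deliberately partial illustration: it handles only the emphasized, alternate, code and quotation toggles, treats @@ as plain text (hyperlink parsing is not covered), and copies backslash pairs through unresolved so that an escaped toggle character is never matched. The name parse_spans and the span representation are assumptions made for the example.

```python
TOGGLES = {"**": "emphasized", "__": "alternate",
           "``": "code", '""': "quotation"}


def parse_spans(paragraph: str):
    """Split one paragraph into (text, active_formats) spans."""
    spans = []
    active = set()
    buf = []

    def flush():
        # Close the current span with the formats active while it was read.
        if buf:
            spans.append(("".join(buf), frozenset(active)))
            buf.clear()

    i = 0
    while i < len(paragraph):
        pair = paragraph[i:i + 2]
        if paragraph[i] == "\\" and i + 1 < len(paragraph):
            # Backslash pair: copy both characters; resolved in a later step.
            buf.append(pair)
            i += 2
        elif pair in TOGGLES:
            flush()
            # Enable the format if it is off, disable it if it is on.
            active ^= {TOGGLES[pair]}
            i += 2
        else:
            buf.append(paragraph[i])
            i += 1
    # All formats are implicitly closed with the end of the paragraph.
    flush()
    return spans
```

Keeping the active formats in a set rather than a stack is what allows formats to be closed in any order.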

The CNMfmt markup also includes several new escapes alongside the standard CNM ones to allow including the toggle characters as text:

"\*"  ->  U+002A  asterisk
"\_"  ->  U+005F  underscore
"\`"  ->  U+0060  grave accent
"\""  ->  U+0022  quotation mark
"\@"  ->  U+0040  at sign