Warrior Tang (tangaroa) wrote,

HTML 6 wish list

I had another nifty thought about where I think HTML6 ought to go, so I'll add it to that "spec" that I spilled some of my thoughts onto earlier. Call this Tang's Awesome Markup Language version 0.002.

Major changes from my last post:

  • New "aka" attribute replaces proposed "define" node.
  • Added idea of "src" attribute merging retrieved contents into the current tag... and removed it.
  • Added idea of copying DocBook's nodes for HTML.
  • Renamed proposed "span" and "block" nodes to "a" and "div". I'm trying to strike a balance between building on the current standard and intentionally throwing it all out, and for now I'm erring on the side of ease. It's difficult. Expect v.0.003 to change "div" back to "block".
  • Sections have been reordered and renumbered.

And I still haven't gotten around to properly reading the HTML5 specs. TODO, TODO, TODO...


Section 1 - Basic Node Object Representation

Basic Node Object Representation is a vague description of a web page.

1.01 - Nodes and Attributes

Nodes and attributes have this relationship:

  • Nodes can contain nodes.
  • Nodes can have attributes.

Nodes can contain nodes OR data -- free text is not allowed. Readers that encounter free text should wrap it in a node of the most appropriate type for the given context.

Attributes are always metadata.

Pre-defined attributes for Node:

(New ones:)

aka
Define a new node type which shall have all of the content and non-unique attributes of the current node. The value of this attribute must meet the syntax guidelines for a node name.
src
The node pulls its data from the given URL. If no data is available, the contents of the node are used as alternate data.
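Here is a rough Python sketch of how a reader might handle "aka", using the standard library's ElementTree as a stand-in for the node tree. The names collect_aka, instantiate, and UNIQUE_ATTRS are made up for this example, and treating only "id" and "aka" as the unique attributes is my assumption; so is the "color" attribute in the usage below.

```python
import copy
import xml.etree.ElementTree as ET

# Assumption: these are the attributes a new node type does not inherit.
UNIQUE_ATTRS = {"id", "aka"}

def collect_aka(root):
    """Map each aka name to the node that defines it."""
    return {node.get("aka"): node for node in root.iter() if node.get("aka")}

def instantiate(name, registry):
    """Build a node of the new type from the node that defined it."""
    node = copy.deepcopy(registry[name])  # copies content and attributes
    node.tag = name
    node.tail = None                      # text after the defining node is not content
    for attr in UNIQUE_ATTRS:
        node.attrib.pop(attr, None)
    return node

page = ET.fromstring('<div><p id="w1" aka="warning" color="red">Beware</p></div>')
registry = collect_aka(page)
print(ET.tostring(instantiate("warning", registry), encoding="unicode"))
```

With this, a later "warning" node could be filled in from the "p" that defined it, minus its id.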

(Old ones:)

encoding
The text encoding, like "UTF-8" or "ASCII".
type
A MIME content-type to assist systems that do not use file magic or name extension autodetection.

(I don't want to list all of the old ones being kept yet, so I'll stop here for now.)

1.02 - The Basic Nodes: Object and Meta

Two basic types of Node are defined:

Meta

Information which describes the parent node but is not meant to be read as data by the reader. For example, Meta nodes would not be displayed on a screen in the normal use of a web browser.

Object
The data part of the document: content that is meant to be read by the reader. For example, Object nodes would be displayed on a screen by a web browser.

Pre-defined attributes for Object:

href
The object has the given hypertext reference.

1.03 - Basic Nodes for the Web

Other nodes may be defined as descended from one of the basic node types of Object and Meta. Descendant nodes will have all of the properties of the parent node that are defined in the standard.

a

Extended from: Object

Short for "alphanumeric". The contents may contain only text or "a" nodes and are to be treated like text by the reader. This serves the purpose of HTML's "span".

(Note: the "anchor" tag is already made obsolete by allowing any tag to carry an href, so we may as well repurpose it and save typing a few characters. Westerners will better understand that the first letter of the alphabet == text before they can guess what a span is supposed to be.)

div

Extended from: Object

Short for "Divided" or "Division"; in the absence of styling instructions, viewers should separate Div nodes from each other. This serves the purpose of HTML's "div" and the CSS "block" display style. The contents may include either Divided or Alphanumeric nodes, but not both. The contents may also include Meta nodes.

html

Extended from: Div

Says that the data is an HTML page.

Attributes:

version
Specifies the version of HTML as a decimal number.
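The inheritance and content rules in this section can be sketched as a small Python check. The EXTENDS table and the function names is_a and valid_children are invented for this example, and the table holds only a handful of node names; a real reader would need the full set.

```python
# Assumed table: each node name maps to the node type it extends.
EXTENDS = {
    "object": None, "meta": None,
    "a": "object", "div": "object",
    "html": "div", "p": "div", "table": "div",
    "head": "meta", "title": "meta", "script": "meta",
}

def is_a(name, ancestor):
    """True if `name` is `ancestor` or descends from it."""
    while name is not None:
        if name == ancestor:
            return True
        name = EXTENDS.get(name)  # unknown names have no parent
    return False

def valid_children(parent, children):
    """Check the content rules for "a" and "div" nodes."""
    if is_a(parent, "a"):
        # Alphanumeric nodes may hold only other alphanumeric nodes.
        return all(is_a(child, "a") for child in children)
    if is_a(parent, "div"):
        # Meta nodes are always allowed; Div and A children may not be mixed.
        data = [child for child in children if not is_a(child, "meta")]
        has_div = any(is_a(child, "div") for child in data)
        has_a = any(is_a(child, "a") for child in data)
        return not (has_div and has_a)
    return True

print(valid_children("div", ["p", "title", "table"]))
print(valid_children("div", ["p", "a"]))
```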

1.04 - Common Nodes for the Web

Generally, every tag that exists in HTML 4 or 5 can be defined as a descendant of the basic tags.

Descendants of Div would include the html tag, p, img, table, and the new canvas tag.

Descendants of Meta would include head, link, title, and script.

It is acknowledged that "meta" and "object" tags already exist in HTML. The older meanings are considered deprecated because:

  • The purpose of this document is to imagine throwing all that out and starting over.
  • Major version upgrades are expected to be incompatible with previous versions.
  • I am so awesome.

HTML tags with their own unique domain of related children are their own sub-languages independent of the rest of the language. Table, Canvas, and List (replacing/parenting UL and LI) are each their own type of object.

1.05 - Implementation

Ha, ha!

This spec does not define an implementation.

However, end-user implementations should support XML syntax as a serialization.

Implementations may also support alternate serializations such as YAML or a compressed binary format.

Serialization interpreters meant for general use should not be case sensitive. Browsers are slow enough that running a tolower() on every string should not damage performance too much further.

Serialization interpreters meant for general use should collapse each run of white space between words to a single space and ignore all other white space.
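A minimal Python sketch of those two rules, working on plain strings. The function names are mine, and leaving attribute values untouched is an assumption (URLs, for one, are case sensitive).

```python
def normalize_name(name: str) -> str:
    """Node and attribute names are compared without regard to case."""
    return name.lower()

def normalize_text(text: str) -> str:
    """Keep one space between words and drop all other white space."""
    return " ".join(text.split())

def normalize_attributes(attrs: dict) -> dict:
    """Lower-case attribute names; values are left alone (assumption)."""
    return {normalize_name(key): value for key, value in attrs.items()}

print(normalize_name("DIV"))
print(normalize_text("  Chicken \n Chicken\tChicken "))
print(normalize_attributes({"HREF": "Foo.html"}))
```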

Appendices to Section 1

Appendix: Example of XML Serialization

This example is a few valid nodes in an XML-style format:

<object>
<html src="foo.html" version="4.1" />
<html src="bar.html" version="5.0" />
</object>

Note that multiple HTML pages can exist as content inside another page, and a page can pull in nodes from another source at any point.

Appendix on Nodes and Text

Where alphanumeric nodes and free text are mixed with blocks, they should be wrapped in a block.

The example serialization:

<html version="6.0">
<body>
Alphabet Soup
<p>Bacon</p>
Chicken Chicken Chicken
<list>
foo
<li>bar</li> baz
quux
</list>
<div>
Text <span>text text</span> text
<p>More text</p>
</div>
</body>
</html>

May be internally represented like:

<html version="6.0">
<body>
<p>Alphabet Soup</p>
<p>Bacon</p>
<p>Chicken Chicken Chicken</p>
<list>
<li>foo</li>
<li>bar</li>
<li>baz quux</li>
</list>
<div>
<div>Text <span>text text</span> text</div>
<p>More text</p>
</div>
</body>
</html>
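For the curious, the wrapping shown above can be sketched in Python with the standard library's ElementTree. This is one possible reading of the rule, not a definition: the INLINE set, the WRAPPER table (li inside list, div inside div, p everywhere else), and the name wrap_free_text are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

INLINE = {"a", "span"}                  # assumed names of text-like nodes
WRAPPER = {"list": "li", "div": "div"}  # assumed wrapper per parent, "p" otherwise

SOURCE = """<html version="6.0">
<body>
Alphabet Soup
<p>Bacon</p>
Chicken Chicken Chicken
<list>
foo
<li>bar</li> baz
quux
</list>
<div>
Text <span>text text</span> text
<p>More text</p>
</div>
</body>
</html>"""

def wrap_free_text(parent):
    """Wrap runs of free text and inline nodes that are mixed with blocks."""
    children = list(parent)
    for child in children:
        wrap_free_text(child)
    if all(child.tag in INLINE for child in children):
        return  # only text and inline nodes here, so nothing is mixed
    # Flatten the content into text and nodes, in document order.
    items = [parent.text]
    for child in children:
        items += [child, child.tail]
        child.tail = None
        parent.remove(child)
    parent.text = None
    wrapper_tag = WRAPPER.get(parent.tag, "p")
    run, wrappers = None, []
    for item in items:
        if isinstance(item, ET.Element) and item.tag not in INLINE:
            run = None  # a block node ends the current run
            parent.append(item)
            continue
        if not isinstance(item, ET.Element):
            item = re.sub(r"\s+", " ", item or "")
            if not item or (run is None and not item.strip()):
                continue  # white space between blocks is dropped
        if run is None:
            run = ET.SubElement(parent, wrapper_tag)
            wrappers.append(run)
        if isinstance(item, ET.Element):
            run.append(item)
        elif len(run):
            run[-1].tail = (run[-1].tail or "") + item
        else:
            run.text = (run.text or "") + item
    # Trim the outer edges of each wrapper that was created.
    for wrapper in wrappers:
        if len(wrapper):
            wrapper.text = (wrapper.text or "").lstrip() or None
            wrapper[-1].tail = (wrapper[-1].tail or "").rstrip() or None
        else:
            wrapper.text = (wrapper.text or "").strip()

root = ET.fromstring(SOURCE)
wrap_free_text(root)
print(ET.tostring(root, encoding="unicode"))
```

The printed tree has no white space left between blocks, which matches the rule that only spaces between words are kept.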

Appendix: Proposed src merging behaviour

Merging imported nodes: If the retrieved data is a single node with the same type as the current node, the imported node replaces this node and adopts its attributes.

Given the example node with src attribute:

<p foo="bar" src="http://example.com/path/to/file" >
 Alternate text - to be displayed if src retrieval fails.
</p>

Example 1:

<p>Hello World</p>

Produces the internal representation:

<p foo="bar">Hello World</p>

Example 2:

<a>Hello World</a>

Produces the internal representation:

<p foo="bar"><a>Hello World</a></p>

Example 3:

<p>Hello</p>
<p>World</p>

Produces the internal representation:

<p foo="bar">
  <p>Hello</p>
  <p>World</p>
</p>
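The three examples can be reproduced with a short Python sketch. The function name merge_src is invented, the retrieved data is passed in as a list of already-parsed nodes rather than fetched, and two details are my assumptions: the host node's attributes win over the imported node's on a name clash, and a failed retrieval (None) leaves the host node untouched.

```python
import copy
import xml.etree.ElementTree as ET

def merge_src(host, retrieved):
    """Merge retrieved nodes into the node that carries the src attribute."""
    if retrieved is None:
        return host  # retrieval failed: keep the alternate content
    attrs = {key: value for key, value in host.attrib.items() if key != "src"}
    if len(retrieved) == 1 and retrieved[0].tag == host.tag:
        # A single node of the same type replaces the host and adopts its attributes.
        merged = copy.deepcopy(retrieved[0])
        merged.attrib.update(attrs)  # assumption: host attributes win
        return merged
    # Anything else is wrapped inside a copy of the host.
    merged = ET.Element(host.tag, attrs)
    merged.extend([copy.deepcopy(node) for node in retrieved])
    return merged

host = ET.fromstring('<p foo="bar" src="http://example.com/path/to/file">Alternate text</p>')
for fetched in (["<p>Hello World</p>"], ["<a>Hello World</a>"], ["<p>Hello</p>", "<p>World</p>"]):
    nodes = [ET.fromstring(text) for text in fetched]
    print(ET.tostring(merge_src(host, nodes), encoding="unicode"))
```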

Additional Thoughts:

  • If the contents are not a node, we wrap them inside this node.
  • If the contents are a node (and additional conditions are met), that node becomes this node.

This feels right but causes problems.

  • This produces an inconsistency in behaviour. What if it is not known whether the retrieved data will be a single node (meeting conditions) or not, but scripting depends on the node not being merged?
  • What if you want to discourage the behaviour?

Appendix: notes on the "id" attribute and scope

The "id" attribute in HTML 4 is supposed to be unique to a page. This guarantee can break when importing things; a given page and a piece of imported content may each have an id that is unique within it but conflicts when the two are brought together.

This will be a serious problem when pulling in JavaScript-enabled objects that use IDs, or when pulling in the same object twice; imagine that you want two JavaScript clocks on a page. There should be a user-friendly solution that does not make web developers use namespaces everywhere.

When an ID is requested, it can come from different contexts:

  • Node context (a script that is metadata of a node)
  • Whole-page context (a #fragment in a URL)

If the request comes from a whole-page context, the first ID on the page should be the one selected. If it comes from a node context, the ID to be selected should depend on the node.

When conflicting IDs are discovered, they can be in different contexts:

  1. Same context as requester's context
  2. Child node imported from requester's context
  3. Parent that imported the node, or its parent

To resolve the issue intelligently, we must remember the node context of the requester. The internal representation of an ID will probably pair the ID string (or the address of the pointer to it) with a context ID. A full context match will be the first selected, then a match to a child context, then a match to any context.

There may still be a conflict in these circumstances. I am not sure if the selection should then be undefined or based on distance from requester.
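To make the lookup order concrete, here is a Python sketch. Everything in it is hypothetical: IdEntry, resolve, and the idea of writing a context as a tuple that spells out the import path. Where several entries tie, it simply takes the first in page order, which is only one of the two options mentioned above.

```python
from typing import NamedTuple, Optional

class IdEntry(NamedTuple):
    id: str
    context: tuple  # import path, for example ("page", "clock1")
    node: str       # stand-in for the real node object

def resolve(entries, wanted, requester: Optional[tuple] = None):
    """Pick the entry for `wanted` as seen from the requester's context.

    `entries` must be in page order. A requester of None means a
    whole-page request, such as a fragment in a URL.
    """
    matches = [entry for entry in entries if entry.id == wanted]
    if not matches:
        return None
    if requester is not None:
        # 1. Same context as the requester.
        for entry in matches:
            if entry.context == requester:
                return entry
        # 2. A context imported, directly or indirectly, by the requester.
        for entry in matches:
            if entry.context[:len(requester)] == requester:
                return entry
    # 3. Any context: the first match on the page.
    return matches[0]

entries = [
    IdEntry("hands", ("page", "clock1"), "first clock"),
    IdEntry("hands", ("page", "clock2"), "second clock"),
    IdEntry("title", ("page",), "page title"),
]
print(resolve(entries, "hands", ("page", "clock2")).node)
```

A script inside the second clock gets its own "hands", while a request from the page itself gets the first one.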

Idea: Force all scripts to be scope-limited to the node that they are meta for. This will make HTML5 developers very unhappy. Outer scripts should be able to whitelist functions for import from descendants.

Idea: On import, automagically rewrite all IDs. The tricky part is to also rewrite all ID references in script.
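The easy half of that idea fits in a few lines of Python. The function name rewrite_ids and the prefix scheme are assumptions for this sketch, and it does nothing about ID references inside scripts; it only returns the old-to-new map that such a rewrite would need.

```python
import xml.etree.ElementTree as ET

def rewrite_ids(imported, prefix):
    """Prefix every id in an imported subtree and return the old-to-new map."""
    mapping = {}
    for node in imported.iter():
        old = node.get("id")
        if old is not None:
            new = f"{prefix}-{old}"
            node.set("id", new)
            mapping[old] = new
    return mapping

clock = ET.fromstring('<div id="clock"><a id="hands">12:00</a></div>')
print(rewrite_ids(clock, "import1"))
print(ET.tostring(clock, encoding="unicode"))
```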

Idea: Give up and use namespaces. If foo imports bar which imports baz which imports quux which has the id blarg, the namespace string to access blarg from the root is foo::bar::baz::quux::blarg. Scripts in blarg only need to say blarg, those in quux only need to say quux::blarg, and so on. This requires giving relatively-unique ids to the contexts.

A long time ago I had the concept of an attribute "id_rel" that is guaranteed to be unique to all children of the node and its siblings. I did not see a use for it.


Section 2 - HTML

2.01 - HTML

Oh, all right, here's an implementation:

  • An HTML tag is an XML serialization of a node.

Ta da!

I suppose this is what is called a "Parsed Internal Entity" in the XML specification, but the spec is not clear.

All the rest of XML? Dump it.

2.02 - Alternate node sets

Let's support DocBook as an alternate node set descended from the basic node object representation. Why not? DocBook is a more robust markup language than HTML and covers much of its feature set.

DocBook 6 == HTML 6. It could happen.

Appendices to Section 2

Appendix - Notes on Full Serialization

Given any web page, it should be possible to produce a serialization containing everything in the page including all objects imported from elsewhere on the web.

Binary data formats such as images can be serialized as base64.
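As a sketch of the base64 part in Python: the function inline_binary and its fetch callback are hypothetical, and marking the result with encoding="base64" stretches the "encoding" attribute beyond text encodings, which is my assumption and not something defined above.

```python
import base64
import xml.etree.ElementTree as ET

def inline_binary(node, fetch):
    """Replace a src reference with the fetched bytes, encoded as base64."""
    data = fetch(node.get("src"))       # fetch is supplied by the caller
    del node.attrib["src"]
    node.set("encoding", "base64")      # assumption: reuse the encoding attribute
    node.text = base64.b64encode(data).decode("ascii")
    return node

image = ET.fromstring('<img src="logo.png" type="image/png" />')
inline_binary(image, lambda url: b"\x89PNG")  # stand-in for a real download
print(ET.tostring(image, encoding="unicode"))
```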


Section 3 - Attribute Style Sheets

I hereby declare that CSS properties are valid node attributes.

I hereby declare that node attributes are valid CSS properties.

CSS is now a way of mass-assigning arbitrary attributes to HTML.

The pre-defined CSS properties are now optional HTML attribute sets.

  • Fonts are no longer CSS but are the standard Font Attribute Set.
  • The box model is no longer CSS but is the standard Box Model Attribute Set.
  • And so on.

This leaves the job of sorting them.
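In code, a style sheet then becomes little more than a loop that hands out attributes. This Python sketch assumes the simplest possible selector (a node name), invents the name apply_stylesheet, and lets an attribute written on the node itself win over the style sheet; none of that is settled above.

```python
import xml.etree.ElementTree as ET

def apply_stylesheet(root, rules):
    """Assign each rule's attributes to every node with the matching name."""
    for name, attributes in rules:
        for node in root.iter(name):
            for key, value in attributes.items():
                # Assumption: an attribute already on the node wins.
                node.attrib.setdefault(key, value)
    return root

page = ET.fromstring('<html version="6"><p>One</p><p font-size="10pt">Two</p></html>')
apply_stylesheet(page, [("p", {"font-size": "12pt", "color": "navy"})])
print(ET.tostring(page, encoding="unicode"))
```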


And that's that so far.

In the spirit of the HTML 5 Boilerplate, I present the HTML 6 Boilerplate. This should be everything you need to create an HTML 6 web page:

<html version="6">

</html>