My ongoing work on making the site less gross under the hood has reached the point where it needs to handle BBCode. While I could adapt an existing library, it would be nice to use the same code both client-side (for instance, to preview posts in your browser) and server-side (to render them for viewing). In addition, traditional BBCode parsers are extremely regex-heavy: the one currently powering the site (based on phpBB 2) basically does one pass per tag that it understands, and it also modifies the text you give it before storing it, which makes any other handling of that text more difficult.
So what have I done? I implemented a Rust library that can be used just about anywhere:
https://gitlab.com/cemetech/bbcode
Not all of the tags that I want to eventually support are implemented right now, but the remaining ones should be quite easy.
To support a variety of uses (for instance, counting words in a post in addition to converting to HTML), the core parser implements an API similar to SAX-mode XML parsers: it emits start and end events for each tag, and text events wherever there is text. It does a single pass through the text and doesn't backtrack, so it should be very fast. A disadvantage of this approach as implemented is that it may be difficult to adapt to dialects of BBCode other than the one we use here, but for Cemetech-adjacent uses that's not a concern. BBCode isn't the most self-consistent language, so some tags have special behavior coded in: especially lists, but also images and some kinds of links.
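As a rough illustration of what a SAX-style interface like this can look like, here is a minimal sketch. It is not the library's actual API: the `Event` enum, the `tokenize` and `count_words` functions, and the three-tag `KNOWN_TAGS` list are all names made up for this example, and it ignores tag arguments and the special cases (lists, images, links) mentioned above.

```rust
/// One parse event, in the spirit of a SAX-style XML parser.
/// (Hypothetical type for this sketch, not the library's real one.)
#[derive(Debug, PartialEq)]
pub enum Event<'a> {
    Start(&'a str),
    End(&'a str),
    Text(&'a str),
}

/// Tiny, assumed tag set; the real dialect has many more tags.
const KNOWN_TAGS: [&str; 3] = ["b", "i", "u"];

/// Try to read `[tag]` or `[/tag]` at the very start of `s`.
/// Returns the event and the number of bytes it occupies.
fn read_tag(s: &str) -> Option<(Event<'_>, usize)> {
    let body = s.strip_prefix('[')?;
    let (closing, body) = match body.strip_prefix('/') {
        Some(after_slash) => (true, after_slash),
        None => (false, body),
    };
    for tag in KNOWN_TAGS.iter() {
        if body.strip_prefix(*tag).map_or(false, |r| r.starts_with(']')) {
            let name = &body[..tag.len()];
            // '[' + optional '/' + name + ']'
            let len = 1 + closing as usize + tag.len() + 1;
            let event = if closing { Event::End(name) } else { Event::Start(name) };
            return Some((event, len));
        }
    }
    None
}

/// Walk `input` once, left to right, handing each event to `sink`.
/// Brackets that do not form a known tag stay part of the text.
pub fn tokenize<'a>(input: &'a str, mut sink: impl FnMut(Event<'a>)) {
    let mut text_start = 0; // start of the pending run of plain text
    let mut pos = 0; // current scan position
    while let Some(offset) = input[pos..].find('[') {
        let open = pos + offset;
        match read_tag(&input[open..]) {
            Some((event, len)) => {
                if text_start < open {
                    sink(Event::Text(&input[text_start..open]));
                }
                sink(event);
                pos = open + len;
                text_start = pos;
            }
            // Not a tag: keep scanning, leaving the bracket in the text run.
            None => pos = open + 1,
        }
    }
    if text_start < input.len() {
        sink(Event::Text(&input[text_start..]));
    }
}

/// A consumer that only cares about text events: a naive word counter.
pub fn count_words(input: &str) -> usize {
    let mut words = 0;
    tokenize(input, |event| {
        if let Event::Text(text) = event {
            words += text.split_whitespace().count();
        }
    });
    words
}
```

The point of the event-based shape is that an HTML renderer and a word counter can share the same tokenizer and differ only in the closure they pass in. Note that this word counter treats text on either side of a tag as separate words, which a real one might not want.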
The one major departure from the status quo is in how unclosed tags are handled. To avoid a need for unbounded backtracking, the parser assumes that any valid open tag also has a valid close tag somewhere; if it reaches a point where a tag should have been closed but was not, it synthesizes a close tag. For instance, the markup:
Code:
[u][b]Hello, you[/u]
would be displayed as ［b］Hello, you by a traditional parser (illustrated with fullwidth brackets U+FF3B, U+FF3D to avoid ambiguity), while this one translates it as Hello, you (bolding the text as well as underlining it), because the valid open bold tag must be closed when its parent underline is closed.
(Further supporting that choice, trying to illustrate the behavior here revealed something very strange: the first unmatched tag gets closed by a much later one, so the span of bold is much longer than expected.)
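The close-tag synthesis described above can be sketched with a stack of open tags. This is only an illustration under assumptions, not the library's code: the `Token` type and `render` function are hypothetical, tags are mapped straight to same-named HTML elements, text is not HTML-escaped, and a close tag with no matching open tag is emitted as literal text (the post does not say how the real parser treats that case).

```rust
/// Already-tokenized input (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Token<'a> {
    Open(&'a str),
    Close(&'a str),
    Text(&'a str),
}

/// Render tokens to HTML-like output in one forward pass.
/// Whenever a tag is closed while tags opened inside it are still open,
/// close tags for those inner tags are synthesized first.
pub fn render(tokens: &[Token<'_>]) -> String {
    let mut out = String::new();
    let mut open: Vec<&str> = Vec::new(); // currently open tags, outermost first

    for token in tokens {
        match *token {
            // Assumption: text is copied as-is; a real renderer would escape it.
            Token::Text(text) => out.push_str(text),
            Token::Open(name) => {
                open.push(name);
                out.push_str(&format!("<{}>", name));
            }
            Token::Close(name) => {
                // Find the innermost open tag with this name, if any.
                if let Some(depth) = open.iter().rposition(|n| *n == name) {
                    // Pop everything down to and including that tag,
                    // synthesizing a close tag for each unclosed child.
                    while open.len() > depth {
                        let inner = open.pop().unwrap();
                        out.push_str(&format!("</{}>", inner));
                    }
                } else {
                    // Assumption: a stray close tag is shown as literal text.
                    out.push_str(&format!("[/{}]", name));
                }
            }
        }
    }

    // End of input: close whatever is still open, innermost first.
    while let Some(name) = open.pop() {
        out.push_str(&format!("</{}>", name));
    }
    out
}
```

Because the stack only ever grows on an open tag and shrinks on a close tag, the renderer never has to look back at text it has already emitted, which is what makes the no-backtracking guarantee possible.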
Because it's implemented in Rust, it's easy to embed as a library. It currently has Python bindings (which I expect to use on the server) and a version that can be built to WebAssembly/JavaScript, which can easily be used in a browser.