Attempt of getting <pre> to not parse inner contents similar to <script> #582

qknight · 2025-03-11T18:45:06Z

This branch implements the fix for issue #580.

Motivation

With the current implementation, the parser evaluates arbitrary HTML tags inside a <pre>...</pre>. With this patch, <pre> behaves more like <script>.

This behaviour should be optional, because sometimes it does make sense to parse tags inside a <pre>, for instance for styling. Most often, though, the content inside a <pre> should be left alone: copied 1:1 from the source document into the generated output document without reformatting (no removal of spaces, newlines or tabs), and the parsed content should not have any influence on the overall consistency of the document.

That said:

<html><pre></html>test foo</pre></html> should not be fixed into
<html><pre>test foo</pre></html>
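The "<script>-like" handling described above could be sketched roughly as follows. This is a standalone illustration, not html5ever's tokenizer: `split_raw_pre` is a hypothetical helper that takes the input following an opening `<pre>` tag and treats everything up to the first `</pre>` as opaque text, so a stray `</html>` inside it is never "fixed".

```rust
/// Hypothetical helper (not part of html5ever): given the input that follows
/// an opening `<pre>` tag, return the raw content up to the first `</pre>`
/// and the remaining input after that closing tag.
fn split_raw_pre(after_open: &str) -> Option<(&str, &str)> {
    const CLOSE: &str = "</pre>";
    let bytes = after_open.as_bytes();
    // The needle is pure ASCII and starts with '<', so any match begins and
    // ends on a UTF-8 character boundary and slicing the &str is safe.
    let pos = bytes
        .windows(CLOSE.len())
        .position(|w| w.eq_ignore_ascii_case(CLOSE.as_bytes()))?;
    Some((&after_open[..pos], &after_open[pos + CLOSE.len()..]))
}
```

For the example above, the input after `<html><pre>` would yield the raw content `</html>test foo` and the remainder `</html>`. The sketch assumes the closing tag is written exactly as `</pre>` (in any letter case) with no whitespace before the `>`.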

Status

branch: servo_issue_580 with hash: 2094a85

This evaluates:

<hello>XML</hello><pre>\n<bad> </bad>text-in pre</pre><p>asdf</p><script>script</html> magic string</script>

into

<html><head></head><body><hello>XML</hello><pre>\n<bad> </bad>text-in pre</pre><p>asdf</p><script>script</html> magic string</script></body></html>

This shows that the content inside the <pre>...</pre> is already captured without being parsed. However, the result should not be an HTML-escaped string but a 1:1 copy of the original tags.

This can be evaluated by running:

clear && cargo run --example html2html
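To make the escaping problem from this Status section concrete, here is a minimal sketch of the two possible outputs for captured <pre> content. `escape_text` and `write_pre` are hypothetical names for illustration only and do not correspond to html5ever's serializer API.

```rust
/// Escape text the way an HTML serializer typically escapes text nodes.
fn escape_text(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(c),
        }
    }
    out
}

/// Hypothetical serializer step for captured `<pre>` content.
/// `raw == true` copies the content 1:1; `raw == false` escapes it.
fn write_pre(content: &str, raw: bool) -> String {
    let body = if raw {
        content.to_string()
    } else {
        escape_text(content)
    };
    format!("<pre>{}</pre>", body)
}
```

With `raw` set to `false`, `<bad>` comes out as `&lt;bad&gt;`, which is the unwanted result; with `raw` set to `true`, the tags and all whitespace are written back unchanged.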

Todo

Figure out why: process_to_completion is called for <script> but not for <pre>
Implement an option to the parser to include parsing of <pre>...</pre> content or not
Write the PreData as String and not HTML escaped.
Write a bunch of tests to make sure it works
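The option from the Todo list might look roughly like the sketch below. `SketchOpts` is a stand-in struct, not the real `TreeBuilderOpts`, and the tag/text split is a toy that only looks for `<` and `>`. The default of `parse_pre: true` is an assumption chosen here so that existing behaviour would stay unchanged.

```rust
/// What the toy handler produces for `<pre>` content.
#[derive(Debug, PartialEq)]
enum Piece {
    Raw(String),  // unparsed 1:1 content
    Tag(String),  // something that looks like a tag, e.g. "<span>"
    Text(String), // text between tags
}

/// Stand-in for parser options; only the `parse_pre` name comes from the PR.
struct SketchOpts {
    parse_pre: bool,
}

impl Default for SketchOpts {
    fn default() -> Self {
        // Assumption: keep the current (parsing) behaviour unless asked otherwise.
        SketchOpts { parse_pre: true }
    }
}

fn handle_pre_content(content: &str, opts: &SketchOpts) -> Vec<Piece> {
    if !opts.parse_pre {
        // Script-like mode: hand the content through untouched.
        return vec![Piece::Raw(content.to_string())];
    }
    let mut pieces = Vec::new();
    let mut rest = content;
    while !rest.is_empty() {
        if rest.starts_with('<') {
            match rest.find('>') {
                Some(end) => {
                    pieces.push(Piece::Tag(rest[..=end].to_string()));
                    rest = &rest[end + 1..];
                }
                None => {
                    // Unterminated "<...": treat the remainder as text.
                    pieces.push(Piece::Text(rest.to_string()));
                    rest = "";
                }
            }
        } else {
            let end = rest.find('<').unwrap_or(rest.len());
            pieces.push(Piece::Text(rest[..end].to_string()));
            rest = &rest[end..];
        }
    }
    pieces
}
```

A test for the option can then assert that the same input yields a single `Piece::Raw` when `parse_pre` is `false` and a sequence of tags and text otherwise.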

jdm · 2025-03-11T19:10:56Z

Is this behavior specified in the HTML parsing specification?

qknight · 2025-03-12T01:39:06Z

@jdm your question is hard to answer!

html standard related to `<pre>`

i like the grok summary i created https://x.com/i/grok/share/AI7crMuXH2BoIAxC57P9v8VIg but it does not have sources.

my new understanding is now:

everything in <pre>...</pre> needs to have a fixed layout, no changes on spaces, tabs or newlines
the parser 'can' parse tags but must not do any 'fixes' if incorrect

something along these lines. i'll try to figure out how virtual-dom does it.

virtual-dom (works)

i write a technical blog at https://lastlog.de/blog/libnix_volth's_work.html and use pandoc to generate <pre><code> sections. when i serialize and deserialize the html document using https://github.com/Matt-Esch/virtual-dom it just works correctly.

the motivation to move away from this is the usage of rust compiled to WASM. i always wanted to make modifications to the way 'new virtual-dom patches are applied' with visual cues which i can't do with virtual-dom.

rphtml (fails)

first i tried to replace virtual-dom with rphtml. but i discovered problems with rphtml: fefit/rphtml#4
i tried to fix them but the code is very hard to read and after a few days of hacking i gave up.

notable mention: the rphtml issue was very hard to track down because it works 'half' of the time: text nodes in combination with tags only sometimes yield correct html documents after doc.render(...)

html5ever (fails)

the <pre> handling in html5ever also breaks the code generated by pandoc after a serialize/deserialize run but for a slightly different reason. it removes all the newlines from the <pre>... \n <span>...</span> \n ... </pre>.

the first attempt at fixing it, using a string for all the <pre>...</pre> content, might work but still might not implement the html parsing standard correctly.
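A round-trip check for the newline problem described above could be sketched like this. It compares the <pre> bodies of a document before and after a serialize/deserialize run. `pre_bodies` and `round_trip_preserves_pre` are hypothetical helpers, and the scan is deliberately naive: it only recognises lowercase `<pre` and `</pre>` and would also match other tag names that start with `pre`.

```rust
/// Collect the text between each `<pre ...>` and the first following `</pre>`.
fn pre_bodies(html: &str) -> Vec<&str> {
    let mut bodies = Vec::new();
    let mut rest = html;
    while let Some(open) = rest.find("<pre") {
        let after_name = &rest[open + "<pre".len()..];
        // Skip any attributes up to the end of the opening tag.
        let gt = match after_name.find('>') {
            Some(i) => i,
            None => break,
        };
        let body_start = &after_name[gt + 1..];
        let close = match body_start.find("</pre>") {
            Some(i) => i,
            None => break,
        };
        bodies.push(&body_start[..close]);
        rest = &body_start[close + "</pre>".len()..];
    }
    bodies
}

/// True if every `<pre>` body is byte-for-byte identical in both documents.
fn round_trip_preserves_pre(before: &str, after: &str) -> bool {
    pre_bodies(before) == pre_bodies(after)
}
```

Feeding it the pandoc output as `before` and the re-serialized document as `after` would flag exactly the case where newlines around a <span> inside <pre> get dropped.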

Attempt of getting <pre> to not parse inner contents similar to <script>

2094a85

qknight added 2 commits March 11, 2025 20:45

process_to_completion now processes PreData correctly

35b479e

parse_pre option support for TreeBuilderOpts

f0e4e4a
