Attempt of getting <pre> to not parse inner contents similar to <script> #582

Draft
wants to merge 3 commits into
base: main

Conversation

@qknight qknight commented Mar 11, 2025

This branch implements the fix for issue #580.

Motivation

With the current implementation, the parser evaluates arbitrary HTML tags inside a <pre>...</pre>. With this patch, <pre> behaves more like <script>.

This behaviour should be optional, since it sometimes makes sense to parse tags inside a <pre>, for instance for styling. Most often, though, the content inside a <pre> should be copied 1:1 from the source document into the generated output document: it should not be reformatted (no removal of spaces, newlines or tabs), nor should the parsed content have any influence on the overall consistency of the document.

That said:

  • <html><pre></html>test foo</pre></html> should not be fixed into
  • <html><pre>test foo</pre></html>
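A minimal sketch of what "behave more like <script>" could mean for the example above, assuming the content is simply taken verbatim up to the first closing tag. The function name `raw_text_content` is hypothetical, it uses only the standard library, it is not html5ever code, and it ignores attributes on the start tag:

```rust
/// Returns the verbatim text between `<tag>` and the first `</tag`,
/// without interpreting any markup in between.
/// Simplification: the start tag must have no attributes.
fn raw_text_content<'a>(input: &'a str, tag: &str) -> Option<&'a str> {
    // ASCII lowercasing keeps byte offsets identical to the original input,
    // so indices found in `lower` are valid for slicing `input`.
    let lower = input.to_ascii_lowercase();
    let open = format!("<{}>", tag.to_ascii_lowercase());
    let close = format!("</{}", tag.to_ascii_lowercase());

    let start = lower.find(&open)? + open.len();
    let end = start + lower[start..].find(&close)?;
    Some(&input[start..end])
}
```

Applied to the first bullet with `tag = "pre"`, this yields the stray `</html>` together with the text, instead of the tag being dropped as a parse error.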

Status

branch: servo_issue_580 with hash: 2094a85

This evaluates:

<hello>XML</hello><pre>\n<bad> </bad>text-in pre</pre><p>asdf</p><script>script</html> magic string</script>

into

<html><head></head><body><hello>XML</hello><pre>\n&lt;bad&gt; &lt;/bad&gt;text-in pre</pre><p>asdf</p><script>script</html> magic string</script></body></html>

This shows that the content inside the <pre>...</pre> is already captured without being parsed. However, the result should not be an HTML-escaped string but a 1:1 copy of the original tags.
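The difference between the current output and the desired one can be sketched as follows. This is a simplified, standard-library-only illustration, not the serializer used in this branch; `escape_text`, `TextMode` and `write_pre_text` are hypothetical names, and only `&`, `<` and `>` are escaped here:

```rust
/// Escapes the characters that would otherwise be read as markup.
fn escape_text(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(c),
        }
    }
    out
}

/// How the captured <pre> content is written back out.
enum TextMode {
    /// What the branch currently produces.
    Escaped,
    /// What the Todo list asks for: a 1:1 copy.
    Raw,
}

fn write_pre_text(content: &str, mode: TextMode) -> String {
    match mode {
        TextMode::Escaped => escape_text(content),
        TextMode::Raw => content.to_string(),
    }
}
```

With the <pre> content from the input above, `TextMode::Escaped` gives the `&lt;bad&gt; &lt;/bad&gt;` form shown in the output, while `TextMode::Raw` returns the content unchanged.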

This can be evaluated by running:

clear && cargo run --example html2html

Todo

  • Figure out why process_to_completion is called for <script> but not for <pre>
  • Implement a parser option that controls whether <pre>...</pre> content is parsed
  • Write the PreData as a String, not HTML-escaped
  • Write a bunch of tests to make sure it works
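One possible shape for the parser option from the list above, as a rough sketch only. `PreHandling`, `HypotheticalOpts` and `is_raw_text_element` are made-up names for illustration and do not exist in html5ever; the default keeps today's behaviour so the new mode is opt-in:

```rust
/// Hypothetical switch for how <pre> content is treated.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Default)]
enum PreHandling {
    /// Parse tags inside <pre> as usual (current behaviour).
    #[default]
    ParseMarkup,
    /// Treat <pre> content as opaque text, similar to <script>.
    RawText,
}

/// Hypothetical options struct; not an existing html5ever type.
#[derive(Debug, Default)]
struct HypotheticalOpts {
    pre_handling: PreHandling,
}

/// Decides whether the content of `tag` should be captured verbatim.
fn is_raw_text_element(tag: &str, opts: &HypotheticalOpts) -> bool {
    match tag {
        "script" | "style" => true,
        "pre" => opts.pre_handling == PreHandling::RawText,
        _ => false,
    }
}
```

Keeping `ParseMarkup` as the default means existing users would see no change unless they ask for the raw mode.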

@jdm
Member

jdm commented Mar 11, 2025

Is this behavior specified in the HTML parsing specification?

@qknight
Author

qknight commented Mar 12, 2025

@jdm your question is hard to answer!

html standard related to <pre>

i like the grok summary i created https://x.com/i/grok/share/AI7crMuXH2BoIAxC57P9v8VIg but it does not have sources.

my new understanding is now:

  • everything in <pre>...</pre> needs to have a fixed layout, no changes on spaces, tabs or newlines
  • the parser 'can' parse tags but must not do any 'fixes' if incorrect

something along these lines. i'll try to figure out how virtual-dom does it.

virtual-dom (works)

i write this technical blog at https://lastlog.de/blog/libnix_volth's_work.html and i'm using pandoc to generate <pre><code> sections. when i serialize and deserialize the html document using https://github.com/Matt-Esch/virtual-dom it just works correctly.

the motivation to move away from it is the use of rust compiled to WASM. i always wanted to modify the way 'new virtual-dom patches are applied' by adding visual cues, which i can't do with virtual-dom.

rphtml (fails)

first i tried to replace virtual-dom with rphtml. but i discovered problems with rphtml: fefit/rphtml#4
i tried to fix them but the code is very hard to read and after a few days of hacking i gave up.

worth noting: the rphtml issue was very hard to track down because it works 'half' of the time: text nodes combined with tags only sometimes yield correct html documents after doc.render(...)

html5ever (fails)

the <pre> handling in html5ever also breaks the code generated by pandoc after a serialize/deserialize run, but for a slightly different reason: it removes all the newlines from the <pre>... \n <span>...</span> \n ... </pre>.

the first attempt at fixing it, using a string for all the <pre>...</pre> content, might work but still might not implement the html parsing standard correctly.
