escaped HTML entities in markdown content are not being honored when prerendering Light DOM HTML #1375

Open
thescientist13 opened this issue Jan 6, 2025 · 1 comment
Assignees
Labels
bug Something isn't working CLI needs upstream question Further information is requested SSR
Milestone

thescientist13 commented Jan 6, 2025

Type of Change

Bug

Summary

It was observed in ProjectEvergreen/www.greenwoodjs.dev#120 (comment) that markdown content like the following is not prerendered correctly:

# Server Rendering

<app-ctc-block variant="snippet" heading="src/pages/users.js">

  ```js
  export async function getBody(compilation, page, request) {
    const timestamp = new Date().getTime();

    return `
      <h1>Hello from the server rendered users page! 👋</h1>
      <table>
        <tr>
          <th>Name</th>
          <th>Image</th>
        </tr>
      </table>
      <h6>Last Updated: ${timestamp}</h6>
    `;
  }
  ```

</app-ctc-block>

The output from unified is correct and properly escaped:

<h1>Server Rendering</h1>
<app-ctc-block variant="snippet" heading="src/pages/users.js">
    <pre><code class="language-js">export async function getBody(compilation, page, request) {
      const timestamp = new Date().getTime();
    
      return `
        &#x3C;h1>Hello from the server rendered users page! 👋&#x3C;/h1>
        &#x3C;table>
          &#x3C;tr>
            &#x3C;th&gt;Name&#x3C;/th>
            &#x3C;th&gt;Image&#x3C;/th>
          &#x3C;/tr>
        &#x3C;/table>
        &#x3C;h6>Last Updated: ${timestamp}&#x3C;/h6>
      `;
    }
    </code></pre>
</app-ctc-block>

The output from WCC / parse5, on the other hand, has all the HTML entities decoded back into literal HTML:

<h1>Server Rendering</h1>
<app-ctc-block variant="snippet" heading="src/pages/users.js">
  <pre><code class="language-js">
    export async function getBody(compilation, page, request) {
      const timestamp = new Date().getTime();
    
      return `
        <h1>Hello from the server rendered users page! 👋</h1>
        <table>
          <tbody><tr>
            <th>Name</th>
            <th>Image</th>
          </tr>
        </tbody></table>
        <h6>Last Updated: ${timestamp}</h6>
      `;
    }
  </code></pre>
</app-ctc-block>

This means that instead of the snippet rendering as text:
(image)

the HTML is rendered literally, breaking the output:
(image)

Details

The main issue seems to be in WCC (parse5 specifically): parse5 automatically decodes HTML entities when you access the "raw" value of a node.
(screenshot)

This means that when building the HTML back out to do the innerHTML work in WCC, instead of getting something like &lt;h1>Some text&lt;/h1>, we end up with the literal HTML <h1>Some text</h1>. This does seem to be the expected behavior, as they state, so unfortunately it will be up to application consumers to manage this preservation.

A seemingly simple solution would be to manually escape < when building innerHTML in WCC, though I'm not sure this is the best solution or, more likely, the best place for it:

} else if (nodeName === '#text') {
  // escape < brackets
  innerHTML += value.replace(/</g, '&lt;');
}
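
As a rough, dependency-free sketch of that idea (not WCC's actual code), the snippet below walks a parse5-style tree of plain objects and re-escapes text nodes while serializing. The `escapeText` and `serialize` names and the hand-built tree are hypothetical; only the `nodeName`, `tagName`, `attrs`, `childNodes` and `value` property names are meant to mirror parse5's default tree shape. It escapes `&` and `>` as well as `<`, and it ignores void elements and comments for brevity.

```javascript
// Hypothetical helper: re-escape characters that are significant in HTML text.
// `&` is replaced first so the entities added afterwards are not double-escaped.
function escapeText(value) {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical serializer over a parse5-like tree of plain objects.
function serialize(node) {
  if (node.nodeName === '#text') {
    return escapeText(node.value);
  }

  const attrs = (node.attrs || [])
    .map(({ name, value }) => {
      const escaped = value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
      return ` ${name}="${escaped}"`;
    })
    .join('');
  const children = (node.childNodes || []).map(serialize).join('');

  return `<${node.tagName}${attrs}>${children}</${node.tagName}>`;
}

// Hand-built stand-in for what a parser hands back: the text value is already decoded.
const tree = {
  nodeName: 'code',
  tagName: 'code',
  attrs: [{ name: 'class', value: 'language-js' }],
  childNodes: [{ nodeName: '#text', value: '<h1>Hello</h1>' }]
};

const html = serialize(tree);
console.log(html);
```

Escaping every text node this way also rewrites a bare `<` that was never an entity in the source, which is one reason the right place for the fix is unclear.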

The challenge is that, as far as Greenwood is concerned, the input to WCC is correct, so how would we know where to do the substitution on the way out? From reading through similar issues in the parse5 repo, we would have to parse twice and convert based on source locations, and from what I understand, enabling location tracking adds a pretty significant performance overhead.
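
To make the difficulty concrete, here is a small illustration of the information loss. The `decodeEntities` function is a hypothetical, deliberately tiny stand-in for the decoding a parser performs (it only knows the entities that appear in this issue); it is not parse5's implementation.

```javascript
// Hypothetical decoder covering only the entities used in this issue.
const ENTITIES = { '&lt;': '<', '&#x3C;': '<', '&gt;': '>', '&amp;': '&' };

function decodeEntities(text) {
  return text.replace(/&(?:lt|gt|amp|#x3C);/g, (match) => ENTITIES[match]);
}

// Escaped markup from the markdown pipeline...
const escapedMarkup = decodeEntities('&#x3C;h1>Hello&#x3C;/h1>');
// ...and prose that contained a bare `<` all along.
const bareBracket = decodeEntities('users < page');

console.log(escapedMarkup);
console.log(bareBracket);
```

Once decoded, both strings simply contain a raw `<`, and nothing in the value records whether it started life as `&#x3C;`, `&lt;`, or a literal character, so the original form cannot be recovered from the value alone.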


thescientist13 added the bug, CLI, and SSR labels on Jan 6, 2025
thescientist13 added this to the 1.0 milestone on Jan 6, 2025
thescientist13 self-assigned this on Jan 6, 2025
thescientist13 added the question and needs upstream labels on Jan 6, 2025
thescientist13 changed the title from "escaped HTML entities from markdown content are not being honored when prerendering" to "escaped HTML entities in markdown content are not being honored when prerendering Light DOM HTML" on Jan 7, 2025

thescientist13 commented Jan 7, 2025

Another complication is that parse5 also seems to encode entities even when the character isn't part of any HTML (e.g. a bare < in text), which makes this workaround in WCC even more unpredictable :/
ProjectEvergreen/wcc#182

{
  value: '\n' +
    '          <h1>Hello from the server rendered users < page! 👋</h1>\n' +
    '        '
}
{
  html: '\n' +
    '        <x-ctc>\n' +
    '          <h1>Hello from the server rendered users &lt; page! 👋</h1>\n' +
    '        </x-ctc>\n' +
    '        '
}

I wonder if we'll have to do something from the Greenwood side instead, e.g.
https://github.com/ProjectEvergreen/greenwood/blob/v0.31.0-alpha.2/packages/cli/src/lifecycles/prerender.js#L98

body = await new Promise((resolve, reject) => {
  pool.runTask({
    executeModuleUrl: workerPrerender.executeModuleUrl.href,
    modulePath: null,
    compilation: JSON.stringify(compilation),
    page: JSON.stringify(page),
    prerender: true,
    htmlContents: body.replace(/&#x3C;/g, 'custom-left-bracket'),
    scripts: JSON.stringify(scripts)
  }, (err, result) => {
    if (err) {
      return reject(err);
    }

    return resolve(result.html);
  });
});

body = body.replace(/custom-left-bracket/g, '&#x3C;');
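
The placeholder round trip can be tried in isolation with a minimal sketch like the one below. The `protectEntities`, `restoreEntities` and `fakePrerender` names are hypothetical; `fakePrerender` only simulates the entity decoding described in this issue and is not the real worker. The token is the one from the snippet above, and a real implementation would presumably need one that can never appear in page content.

```javascript
// Token taken from the snippet above; assumed never to occur in real content.
const PLACEHOLDER = 'custom-left-bracket';

// Swap the escaped bracket for an inert token before prerendering...
function protectEntities(html) {
  return html.replace(/&#x3C;/g, PLACEHOLDER);
}

// ...and swap it back once prerendering is done.
function restoreEntities(html) {
  return html.split(PLACEHOLDER).join('&#x3C;');
}

// Stand-in for the prerender step: decodes `&#x3C;` the way this issue describes.
function fakePrerender(html) {
  return html.replace(/&#x3C;/g, '<');
}

const input = '<pre><code>&#x3C;h1>Hello&#x3C;/h1></code></pre>';

// Without protection the entity is lost.
const unprotected = fakePrerender(input);
// With protection the entity survives the round trip.
const roundTripped = restoreEntities(fakePrerender(protectEntities(input)));

console.log(unprotected);
console.log(roundTripped);
```

This only covers the `&#x3C;` form that unified emits in the example; other entity forms such as `&lt;` or `&amp;` would need the same treatment if they must survive prerendering.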
