Html Reader Process Titles as Headings Not Paragraphs #2533

Open. oleibman wants to merge 7 commits into PHPOffice:master from oleibman:word1692.

Conversation

oleibman
Contributor

Fixes #1692. Builds on work started some time ago by @0b10011, to whom primary credit is due.

The Html Reader does not process the `head` section of the document and, in particular, ignores its `style` section. It does process inline styles, so 0b10011's model of adding the title as a text run (with styles) will work well once this change is applied. That model does not, however, cover the alternative of assigning a Title Style and adding the title as plain text. To accommodate that case, I have removed the heading font style declarations from the `head` section, and now generate them all inline in the body. An added benefit is that a document can be read as html and then saved as docx while preserving, at least in part, any user-defined font styles. Note that html has pre-defined title styles, but docx does not.
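
To make the head-versus-inline distinction concrete, here is a minimal sketch in Python rather than PHPWord's actual PHP code. The function name `render_title` and the `TITLE_STYLES` dictionary are hypothetical; the only point is that the font style travels on the heading element itself, so a reader that ignores `head` still sees it.

```python
from html import escape

# Hypothetical per-depth title font styles; not PHPWord's real data model.
TITLE_STYLES = {
    1: {"font-size": "20pt", "font-weight": "bold"},
    2: {"font-size": "16pt", "font-style": "italic"},
}


def render_title(text: str, depth: int) -> str:
    """Render a title as <hN> with its font style inline, not in <head>."""
    css = "; ".join(f"{k}: {v}" for k, v in TITLE_STYLES.get(depth, {}).items())
    # Omit the attribute entirely when no style is defined for this depth.
    attr = f' style="{css}"' if css else ""
    return f"<h{depth}{attr}>{escape(text)}</h{depth}>"


print(render_title("Chapter 1", 1))
# <h1 style="font-size: 20pt; font-weight: bold">Chapter 1</h1>
```

A depth with no entry in the dictionary simply yields a bare heading tag, which falls back on html's pre-defined heading appearance.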

@constip suggests in the original issue that top and bottom margins are being applied too frequently. I believe that was addressed by the recently merged PR #2475. The issue also suggests dropping the `*` css selector in favor of `body`. PR #2475 added the `body` selector; I agree that this makes the `*` selector unnecessary and that, as stated in the issue, it can cause problems, so this PR drops it. Finally, the issue suggests using `loadHTML` instead of `loadXML`. That change is not as easy as it seems, because `loadHTML` uses the ISO-8859-1 charset rather than UTF-8, so I will not attempt it here.
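
The charset concern can be shown without PHP at all. The sketch below is Python, not the DOM extension itself, and the sample string is made up; it only demonstrates what happens to non-ASCII text when UTF-8 bytes are interpreted as ISO-8859-1, which is the kind of corruption a naive switch between the two loaders would risk.

```python
text = "Café"

# How a UTF-8 encoded html file stores the text on disk.
utf8_bytes = text.encode("utf-8")

# What a parser sees if it assumes ISO-8859-1 for those same bytes.
misread = utf8_bytes.decode("iso-8859-1")

print(misread)                     # CafÃ©
print(utf8_bytes.decode("utf-8"))  # Café
```

Each accented character becomes two wrong characters, so any heading or paragraph containing non-ASCII text would be imported incorrectly unless the charset were handled explicitly.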


@coveralls

coveralls commented Dec 25, 2023

Coverage Status

coverage: 96.947% (+0.001%) from 96.946%
when pulling 03ad7ec on oleibman:word1692
into 639f396 on PHPOffice:master.

Member

Progi1984 left a comment


@oleibman Could you move the changes to the 2.0.0.md file, please?

It seems that this PR is not finished. Is that right?

@oleibman
Contributor Author

oleibman commented Jan 7, 2024

@Progi1984 I have made the code change and moved the change notes to the new log. But ...

> It seems that this PR is not finished. Is that right?

I'm not sure what you mean. What work do you think is still undone?


Successfully merging this pull request may close these issues.

Importing HTML Headings inserts text, not heading and export to html wrong elements