UnicodeEncodeError: 'gbk' codec can't encode character '\u2009' in position 390: illegal multibyte sequence #227

ZhuPingFei · 2024-12-28T18:02:45Z

UnicodeEncodeError: 'gbk' codec can't encode character '\u2009' in position 390: illegal multibyte sequence

codeicu · 2024-12-31T01:29:51Z

same issue

SEU-zxj · 2024-12-31T08:20:31Z

It seems that Issue #198 has solved this problem.
If you still encounter this problem, here is my solution:

Write a .py file (say, converter.py):

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("your-file-name")
print(result.text_content)

Then redirect the output into a Markdown file:

$ python converter.py > your-file-name.md

That worked for me.
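
If `print` itself still raises the error, one possible variation is to switch the output stream to UTF-8 before printing. This is only a sketch, assuming Python 3.7+ (where text streams have `reconfigure()`); the helper name `force_utf8` is made up for illustration and is not part of markitdown.

```python
import sys


def force_utf8(stream):
    """Try to switch a text stream to UTF-8; return True if it worked.

    Hypothetical helper for illustration only. It relies on
    io.TextIOWrapper.reconfigure(), available since Python 3.7.
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        # Stream was replaced by something without reconfigure().
        return False
    reconfigure(encoding="utf-8")
    return True


# Only print the problematic character once stdout is known to be UTF-8.
if force_utf8(sys.stdout):
    print("thin\u2009space")
```

In converter.py, this would mean calling `force_utf8(sys.stdout)` before `print(result.text_content)`. Setting the environment variable `PYTHONIOENCODING=utf-8` before running the script should have a similar effect without code changes.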

SEU-zxj · 2024-12-31T08:35:38Z

If the code above does not work, you can try the following:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("your-file-name.pptx")
text = result.text_content
    
# Save to a file
with open("your-file-name.md", "w", encoding="utf-8") as f:
    f.write(text)

# Print to the console
# print(text)
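
For context on why passing `encoding="utf-8"` helps: the character in the error message, `'\u2009'` (THIN SPACE), has no mapping in Python's `gbk` codec but can be encoded as UTF-8. A small check, independent of markitdown (the `can_encode` helper and the sample string are just for illustration):

```python
THIN_SPACE = "\u2009"  # the character named in the error message
sample = "10" + THIN_SPACE + "kg"


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(sample, "gbk"))    # False
print(can_encode(sample, "utf-8"))  # True
```

On a system whose default encoding is GBK, `open(..., "w")` without an explicit `encoding` would most likely hit the same error, which is why the snippet above sets it explicitly.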

codeicu · 2024-12-31T08:39:45Z

This works for me:

markitdown < 1.pdf
