hwp-parser

📁 harifatherkr/hwp-parser 📅 4 days ago

#59501

Site-wide rank

Install command

npx skills add https://github.com/harifatherkr/hwp-parser --skill hwp-parser

Agent install distribution

mcpjam 3

claude-code 3

replit 3

junie 3

windsurf 3

zencoder 3

Skill documentation

HWP Parser Development Helper

Helps with every task that involves reading and converting Hangul (HWP/HWPX) documents. It provides a Python API and a CLI tool, and supports workflows for LLM use.

Environment check

Verify that the virtual environment is active

source venv/bin/activate  # or .venv/bin/activate
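
If the check needs to happen from inside Python rather than the shell, a minimal sketch is shown here. The helper name `in_virtualenv` is hypothetical and not part of hwpparser; it relies only on the standard library.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv/virtualenv."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtualenv active" if in_virtualenv() else "no virtualenv detected")
```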

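Verify package installation

pip install -e .  # development mode
# or
pip install "hwpparser[all]"  # all features (quoted so shells such as zsh do not expand the brackets)

Check system dependencies

# PDF conversion (uses headless Chrome)
# macOS - no separate install needed if Chrome is already installed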
brew install --cask google-chrome

# Ubuntu/Debian
sudo apt install google-chrome-stable
# or Chromium
sudo apt install chromium-browser

# HWPX generation (optional)
brew install pandoc  # macOS
sudo apt install pandoc  # Ubuntu
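
To confirm that these tools are reachable before attempting a conversion, something like the following sketch may help. The executable names in `CHROME_CANDIDATES` and both helper functions are assumptions for illustration, not part of hwpparser. On macOS a Chrome app bundle is normally not on `PATH`, so this check can report Chrome as missing even when it is installed.

```python
import shutil

# Executable names are assumptions; adjust them for your platform.
CHROME_CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
)


def find_first(candidates):
    """Return the path of the first executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def check_dependencies():
    """Map each optional system dependency to its resolved path (or None)."""
    return {
        "chrome": find_first(CHROME_CANDIDATES),
        "pandoc": find_first(("pandoc",)),
    }


if __name__ == "__main__":
    for tool, path in check_dependencies().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```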

Request classification

Analyze the user's request to identify the matching feature:

Keyword	Feature	Reference
Text extraction, reading, parsing	HWP → Text	`hwp_to_text()`
HTML conversion, web page	HWP → HTML	`hwp_to_html()`
ODT, OpenDocument	HWP → ODT	`hwp_to_odt()`
PDF conversion	HWP → PDF	`hwp_to_pdf()`
Markdown, Hangul document creation	Markdown → HWPX	`markdown_to_hwpx()`
Chunking, RAG, vector DB	Document chunking	`hwp_to_chunks()`
LangChain, document loader	LangChain integration	`HWPLoader`
Batch conversion, folder processing	Batch conversion	`batch_convert()`
Search indexing, JSONL	Index generation	`export_to_jsonl()`
Metadata, information extraction	Document metadata	`extract_metadata()`
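
The keyword table above can be read as a simple routing rule. The sketch below is one hypothetical way to express it: `ROUTES` and `classify_request` are illustrative names, the English keywords stand in for the original Korean ones, and the ordering (specific workflows before plain format names) is an assumption made so that "convert a folder to PDF" resolves to batch conversion.

```python
import re

# (keywords, target) pairs transcribed from the table above.
# More specific workflows come first so they win over plain format names.
ROUTES = [
    (("batch", "folder"), "batch_convert"),
    (("langchain", "loader"), "HWPLoader"),
    (("index", "indexing", "jsonl"), "export_to_jsonl"),
    (("chunk", "chunking", "rag", "vector"), "hwp_to_chunks"),
    (("metadata", "info", "information"), "extract_metadata"),
    (("markdown", "hwpx"), "markdown_to_hwpx"),
    (("pdf",), "hwp_to_pdf"),
    (("odt", "opendocument"), "hwp_to_odt"),
    (("html", "web", "webpage"), "hwp_to_html"),
    (("text", "read", "parse"), "hwp_to_text"),
]


def classify_request(request: str):
    """Return the name of the first matching feature, or None if nothing matches."""
    # Match whole words so that, for example, "rag" does not fire on "paragraph".
    tokens = set(re.findall(r"[a-z]+", request.lower()))
    for keywords, target in ROUTES:
        if tokens.intersection(keywords):
            return target
    return None
```

For example, `classify_request("Make a JSONL file for Elasticsearch")` resolves to `"export_to_jsonl"`, while a request with no known keyword returns `None` and should be clarified with the user.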

Workflow

1. Single-file conversion

import hwpparser

# Read an HWP file
doc = hwpparser.read_hwp("document.hwp")
print(doc.text)  # text
print(doc.html)  # HTML

# Save to a file
doc.to_odt("output.odt")
doc.to_pdf("output.pdf")

# Quick conversion
text = hwpparser.hwp_to_text("document.hwp")

2. CLI usage

# Extract text
hwpparser text document.hwp

# Convert formats
hwpparser convert document.hwp output.txt
hwpparser convert document.hwp output.pdf

# Batch conversion
hwpparser batch ./hwp_files/ -f text -o ./text_files/

# List supported formats
hwpparser formats

3. LLM/RAG workflow

# Chunking (for a vector DB)
# embed() and vector_db are placeholders for your own embedding function and store
chunks = hwpparser.hwp_to_chunks("document.hwp", chunk_size=1000)
for chunk in chunks:
    embedding = embed(chunk.text)
    vector_db.insert(embedding, chunk.metadata)

# LangChain integration
from hwpparser import HWPLoader, DirectoryHWPLoader

loader = HWPLoader("document.hwp")
docs = loader.load()

# Entire folder
loader = DirectoryHWPLoader("./documents", recursive=True)
docs = loader.load()

# Search indexing (Elasticsearch/Algolia)
hwpparser.export_to_jsonl("./documents", "./index.jsonl", chunk_size=1000)

4. Batch processing

# Every HWP in a folder → TXT
result = hwpparser.batch_convert("./hwp_files", "./text_files", "txt")
print(f"Converted: {result.success}/{result.total}")

# Merge all text
all_text = hwpparser.batch_extract_text("./documents")

5. HWPX generation

# Markdown → HWPX
hwpparser.markdown_to_hwpx("# Title\nBody", "output.hwpx")

# HTML → HWPX
hwpparser.html_to_hwpx("<h1>Title</h1><p>Body</p>", "output.hwpx")

# Unified conversion interface
hwpparser.convert("input.md", "output.hwpx")
hwpparser.convert("input.docx", "output.hwpx")

Supported conversions

Input → Output	Function/CLI
HWP → Text	`hwp_to_text()`, `convert ... -f text`
HWP → HTML	`hwp_to_html()`, `convert ... -f html`
HWP → ODT	`hwp_to_odt()`, `convert ... -f odt`
HWP → PDF	`hwp_to_pdf()`, `convert ... -f pdf`
Markdown → HWPX	`markdown_to_hwpx()`, `convert file.md file.hwpx`
HTML → HWPX	`html_to_hwpx()`
DOCX → HWPX	`convert file.docx file.hwpx`
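
Before calling a converter, it can be useful to check whether an input/output pair appears in the table above. The following is a rough sketch only: `SUPPORTED` and `is_supported` are hypothetical names, and the mapping from formats to file extensions (for example Text to `.txt`) is an assumption rather than something hwpparser documents here.

```python
from pathlib import Path

# (input extension, output extension) pairs transcribed from the table above.
# The extension spellings are assumptions made for illustration.
SUPPORTED = {
    (".hwp", ".txt"),
    (".hwp", ".html"),
    (".hwp", ".odt"),
    (".hwp", ".pdf"),
    (".md", ".hwpx"),
    (".html", ".hwpx"),
    (".docx", ".hwpx"),
}


def is_supported(src: str, dst: str) -> bool:
    """Return True if converting src to dst matches a row of the table."""
    pair = (Path(src).suffix.lower(), Path(dst).suffix.lower())
    return pair in SUPPORTED
```

A caller might use `is_supported("report.hwp", "report.pdf")` as a cheap guard and fall back to `hwpparser formats` on the command line for the authoritative list.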

Common task examples

Single-file text extraction

Request: “Read the contents of this HWP file”

text = hwpparser.hwp_to_text("document.hwp")
print(text)

Batch conversion of an entire folder

Request: “Convert every HWP in the documents folder to PDF”

result = hwpparser.batch_convert("./documents", "./pdf_output", "pdf")
print(f"Succeeded: {result.success}, failed: {result.failed}")

RAG pipeline setup

Request: “Chunk the HWP documents so they can be loaded into a vector DB”

# Chunking
chunks = hwpparser.hwp_to_chunks("document.hwp", chunk_size=1000)

# Embed and store
for chunk in chunks:
    embedding = your_embed_function(chunk.text)
    vector_db.insert({
        'embedding': embedding,
        'text': chunk.text,
        'metadata': chunk.metadata  # file, page, offset, etc.
    })

LangChain document loader

from hwpparser import DirectoryHWPLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents
loader = DirectoryHWPLoader("./documents", recursive=True)
docs = loader.load()

# Chunking
splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
chunks = splitter.split_documents(docs)

Search index generation

Request: “Create a JSONL file to load into Elasticsearch”

hwpparser.export_to_jsonl(
    "./documents",
    "./search_index.jsonl",
    chunk_size=1000  # include chunking
)
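
Once the file exists, it can be inspected with the standard library alone. The sketch below makes no assumption about which fields each record contains; `iter_jsonl` is a hypothetical helper, not an hwpparser function.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Usage sketch (the path matches the example above):
# with open("./search_index.jsonl", encoding="utf-8") as handle:
#     for record in iter_jsonl(handle):
#         print(sorted(record))  # list the field names of each record
```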

Converting Markdown to a Hangul document

Request: “Turn the README into HWPX”

hwpparser.convert("README.md", "README.hwpx")

Metadata extraction

Request: “Tell me about this HWP file”

meta = hwpparser.extract_metadata("document.hwp")
print(f"Characters: {meta['char_count']}")
print(f"Words: {meta['word_count']}")

Exception handling

from hwpparser.exceptions import (
    HWPFileNotFoundError,
    ConversionError,
    DependencyError,
    UnsupportedFormatError
)

try:
    result = hwpparser.convert("document.hwp", "output.pdf")
except HWPFileNotFoundError:
    print("File not found")
except UnsupportedFormatError as e:
    print(f"Unsupported format: {e}")
except DependencyError as e:
    print(f"Missing dependency: {e}")
except ConversionError as e:
    print(f"Conversion failed: {e}")

Tests

# Run all tests
pytest tests/ -v

# Test a specific module
pytest tests/test_reader.py -v

# Coverage
pytest tests/ --cov=hwpparser

References

Dependency license

License: GNU Affero General Public License v3
Repository: https://github.com/mete0r/pyhwp
