Import project files
docling/README.zh-CN.md (new file)
@@ -0,0 +1,154 @@
# Local Installation and Startup Guide (Docling + FastAPI Service)

This document describes how to install and start this repository's conversion service on your local machine, so that a frontend can call it to generate and download PDFs.

## Requirements

- Operating system: macOS (verified); Linux/Windows should also work
- Python: 3.9–3.13
- Recommended tooling: `python -m venv` or [uv](https://docs.astral.sh/uv/)

## Create a Virtual Environment

- Using venv:

```bash
cd /Users/fanyang/Desktop/docling
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
```

- Or using uv:

```bash
cd /Users/fanyang/Desktop/docling
uv venv
source .venv/bin/activate
```

## Install Dependencies

- Install the local Docling library (editable mode):

```bash
python -m pip install -e ./docling
```

- Install the backend service dependencies:

```bash
python -m pip install fastapi uvicorn minio weasyprint pytest
```

- If WeasyPrint reports missing system libraries on macOS, install them with Homebrew:

```bash
brew install cairo pango gdk-pixbuf libffi
```

## Start the Service

- From the project root, run:

```bash
PYTHONPATH=. python -m uvicorn app.server:app --host 127.0.0.1 --port 8000
```

- Access:
  - Home UI: `http://127.0.0.1:8000/`
  - Health check: `http://127.0.0.1:8000/health` (returns `{"status":"ok"}`; see the sketch below)
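A quick smoke test of the health endpoint from Python, using only the standard library; this is a minimal sketch that assumes the service is already running on the host and port shown above.

```python
import json
from urllib.request import urlopen

# Probe the health endpoint; per the README above it returns {"status": "ok"}.
with urlopen("http://127.0.0.1:8000/health", timeout=5) as resp:
    payload = json.loads(resp.read().decode("utf-8"))

assert payload.get("status") == "ok", f"unexpected health payload: {payload}"
print("service is up")
```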
### Endpoint Overview

- `GET /` local UI (static files)
- `GET /health` service health check
- `POST /md/convert` Markdown/HTML → `docx|pdf` (core endpoint; returns a MinIO download link)
- `POST /md/convert-folder` batch-converts the `.md` files in a local folder and uploads the results to MinIO
- `POST /md/upload-folder` batch-uploads the contents of a folder packaged by the frontend and converts the `.md` files inside it
- MinIO configuration:
  - `POST /config/minio` set connection info and prefix
  - `POST /config/minio/test` verify the connection
  - `GET /config/minio/buckets` list buckets
  - `POST /config/minio/create-bucket` create a bucket

## MinIO Configuration

- Via environment variables (recommended):

```bash
export MINIO_ENDPOINT=127.0.0.1:9000
export MINIO_ACCESS_KEY=minioadmin
export MINIO_SECRET_KEY=minioadmin
export MINIO_BUCKET=docling-target
export MINIO_SECURE=false
export MINIO_PUBLIC_ENDPOINT=http://127.0.0.1:9000
export MINIO_PREFIX=cms-files
```

- Via the runtime API (see the sketch below):
  - `POST /config/minio` set connection info and prefix
  - `POST /config/minio/test` test connectivity
  - `GET /config/minio/buckets` list buckets
  - `POST /config/minio/create-bucket` create a bucket
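A minimal sketch of driving the runtime configuration endpoints from Python. The form-field names (`endpoint`, `access`, `secret`, `bucket`, `secure`, `prefix`) are an assumption, mirroring the keys used by the profile JSON files under `app/configs/`; the `requests` package is also assumed to be installed (it is not in the service's own dependency list).

```python
import requests  # assumed available

BASE = "http://127.0.0.1:8000"

# Hypothetical field names, mirroring the profile JSON keys below.
cfg = {
    "endpoint": "127.0.0.1:9000",
    "access": "minioadmin",
    "secret": "minioadmin",
    "bucket": "docling-target",
    "secure": "false",
    "prefix": "cms-files",
}

requests.post(f"{BASE}/config/minio", data=cfg, timeout=10).raise_for_status()
print(requests.post(f"{BASE}/config/minio/test", timeout=10).json())    # verify connectivity
print(requests.get(f"{BASE}/config/minio/buckets", timeout=10).json())  # list buckets
```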
## Downloading PDFs from the Frontend (API Notes)

- Core endpoint: `POST /md/convert`
- Purpose: converts Markdown/HTML to PDF, uploads it to MinIO, and returns a downloadable link
- Parameters (FormData; provide the document source via exactly one of the following three):
  - `md_file`: upload a Markdown file
  - `markdown_text`: pass Markdown text directly
  - `markdown_url`: document URL (recommended)
- Target format: `target=pdf`
- Optional parameters: `toc`, `header_text`, `footer_text`, `logo_url|logo_file`, `cover_url|cover_file`, `product_name`, `document_name`, `product_version`, `document_version`, `css_name|css_text`
- Returned JSON fields: `minio_presigned_url` (time-limited download link) or `minio_url` (public link), plus `name` and `media_type`
### Frontend Example (TypeScript)

```ts
async function downloadPdf(markdownUrl: string) {
  const fd = new FormData();
  fd.append('markdown_url', markdownUrl);
  fd.append('target', 'pdf');
  fd.append('toc', 'true');
  // Optional branding parameters:
  // fd.append('header_text', 'Product name|Document title');
  // fd.append('footer_text', '© Company');

  const resp = await fetch('http://127.0.0.1:8000/md/convert', { method: 'POST', body: fd });
  if (!resp.ok) throw new Error('Conversion failed');
  const data = await resp.json();
  const url = data.minio_presigned_url || data.minio_url;
  if (!url) throw new Error('No download link returned; check the MinIO configuration');
  window.location.href = url; // trigger the download
}
```
### cURL Example (URL → PDF)

```bash
curl -s -X POST \
  -F 'markdown_url=http://127.0.0.1:9000/docs/assets/rewritten/DMDRS_Build_Manual_Oracle/DMDRS搭建手册-Oracle.md' \
  -F 'target=pdf' \
  -F 'toc=true' \
  -F 'header_text=Product name|Document title' \
  -F 'footer_text=© 2025 Company' \
  http://127.0.0.1:8000/md/convert
```

Example response:

```json
{
  "minio_url": "http://127.0.0.1:9000/docling-target/cms-files/converted/DMDRS搭建手册-Oracle.pdf",
  "minio_presigned_url": "http://127.0.0.1:9000/...presigned...",
  "name": "DMDRS搭建手册-Oracle.pdf",
  "media_type": "application/pdf"
}
```

### Batch Conversion (Folder)

- Convert every `.md` file in a local folder and upload the results:

```bash
curl -s -X POST -F 'folder_path=/Users/you/docs' http://127.0.0.1:8000/md/convert-folder
```

- For `POST /md/upload-folder` (folders packaged by the frontend), see the sketch below.
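No example is given above for `POST /md/upload-folder`, so the sketch below is only an assumption about its shape: in particular, the `files` field name and the one-multipart-part-per-file layout are hypothetical and not confirmed by this README.

```python
from pathlib import Path

import requests  # assumed available

folder = Path("/Users/you/docs")
# Hypothetical multipart layout: one "files" part per Markdown document.
parts = [("files", (p.name, p.read_bytes(), "text/markdown"))
         for p in sorted(folder.glob("*.md"))]

resp = requests.post("http://127.0.0.1:8000/md/upload-folder", files=parts, timeout=120)
resp.raise_for_status()
print(resp.json())
```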
### Converting Directly to DOCX (Optional)

```bash
curl -s -X POST \
  -F 'markdown_url=http://127.0.0.1:9000/docs/assets/rewritten/DMDRS_Build_Manual_Oracle/DMDRS搭建手册-Oracle.md' \
  -F 'target=docx' \
  http://127.0.0.1:8000/md/convert
```

## FAQ

- `ModuleNotFoundError: No module named 'app' / 'docling'`
  - Set `PYTHONPATH=.` before the startup command, or launch directly as `PYTHONPATH=. python -m uvicorn ...` in the current shell.
- No download URL is returned
  - Check the MinIO environment variables or configure the service via `/config/minio`; make sure the bucket exists and the server has `store_final=true` enabled.
- Broken images or styles
  - Confirm the assets have been rewritten to public URLs (the service uploads and rewrites them automatically), and check `css_name`/`css_text` (the default PDF style is `default`, located at `app/configs/styles/default.css`).
- Missing WeasyPrint dependencies (macOS)
  - Run `brew install cairo pango gdk-pixbuf libffi` and retry; if it still fails, check `PATH`/`DYLD_LIBRARY_PATH`.

## Related Documents

- Server API notes (Chinese): `docling/README.zh-CN.md`
docling/app/__init__.py (new file, empty)
@@ -0,0 +1 @@
docling/app/configs/dm.json (new file)
@@ -0,0 +1,17 @@
{
  "minio": {
    "endpoint": "127.0.0.1:9000",
    "public": "http://127.0.0.1:9000",
    "access": "minioadmin",
    "secret": "minioadmin123",
    "bucket": "doctest",
    "secure": "false",
    "prefix": "assets",
    "store_final": "true",
    "public_read": "true"
  },
  "db": {
    "webhook_url": null,
    "token": null
  }
}
docling/app/configs/linkmap/linkmap.json (new file)
@@ -0,0 +1 @@
{}
docling/app/configs/profiles/active.json (new file)
@@ -0,0 +1,17 @@
{
  "minio": {
    "endpoint": "127.0.0.1:9000",
    "public": "http://127.0.0.1:9000",
    "access": "minioadmin",
    "secret": "minioadmin123",
    "bucket": "doctest",
    "secure": "false",
    "prefix": "assets",
    "store_final": "true",
    "public_read": "true"
  },
  "db": {
    "webhook_url": null,
    "token": null
  }
}
docling/app/configs/profiles/default2.json (new file)
@@ -0,0 +1,17 @@
{
  "minio": {
    "endpoint": "127.0.0.1:9001",
    "public": "127.0.0.1:9001",
    "access": "minioadmin",
    "secret": "minioadmin123",
    "bucket": "doctest",
    "secure": "true",
    "prefix": "assets",
    "store_final": "true",
    "public_read": "true"
  },
  "db": {
    "webhook_url": null,
    "token": null
  }
}
docling/app/configs/profiles/local2.json (new file)
@@ -0,0 +1,17 @@
{
  "minio": {
    "endpoint": "127.0.0.1:9000",
    "public": "127.0.0.1:9000",
    "access": "minioadmin",
    "secret": "minioadmin123",
    "bucket": "doctest",
    "secure": "false",
    "prefix": "assets",
    "store_final": "true",
    "public_read": "true"
  },
  "db": {
    "webhook_url": null,
    "token": null
  }
}
docling/app/configs/profiles/localhost.json (new file)
@@ -0,0 +1,17 @@
{
  "minio": {
    "endpoint": "127.0.0.1:9000",
    "public": "127.0.0.1:9000",
    "access": "minioadmin",
    "secret": "minioadmin123",
    "bucket": "doctest",
    "secure": "true",
    "prefix": "assets",
    "store_final": "true",
    "public_read": "true"
  },
  "db": {
    "webhook_url": null,
    "token": null
  }
}
docling/app/configs/profiles/localhost3.json (new file)
@@ -0,0 +1,17 @@
{
  "minio": {
    "endpoint": "127.0.0.1:9000",
    "public": "http://127.0.0.1:9000",
    "access": "minioadmin",
    "secret": "minioadmin123",
    "bucket": "doctest",
    "secure": "false",
    "prefix": "assets",
    "store_final": "true",
    "public_read": "true"
  },
  "db": {
    "webhook_url": null,
    "token": null
  }
}
docling/app/configs/profiles/test.json (new file)
@@ -0,0 +1,17 @@
{
  "minio": {
    "endpoint": "8.163.40.177:9000",
    "public": "http://8.163.40.177:9000",
    "access": "minioadmin",
    "secret": "minioadmin",
    "bucket": "cms-files",
    "secure": "false",
    "prefix": "assets",
    "store_final": "true",
    "public_read": "true"
  },
  "db": {
    "webhook_url": null,
    "token": null
  }
}
docling/app/configs/styles/default.css (new file)
@@ -0,0 +1,88 @@
@page {
  size: A4;
  margin: 20mm 15mm 20mm 15mm;
  @top-left { content: none; }
  @top-center { content: element(header); }
  @bottom-left { content: element(copyright); }
  @bottom-center { content: element(footer); }
  @bottom-right { content: counter(page); font-size: 10pt; color: #444; }
}

html { font-family: "Noto Sans CJK SC", "Noto Sans", "Source Han Sans SC", "DejaVu Sans", sans-serif; font-size: 12pt; line-height: 1.6; }

body { color: #111; }

h1 { font-size: 20pt; margin: 0 0 8pt; page-break-before: always; }
h2 { font-size: 16pt; margin: 16pt 0 8pt; }
h3 { font-size: 14pt; margin: 12pt 0 6pt; }

h1, h2, h3 { page-break-after: avoid; break-after: avoid-page; }

p { margin: 0 0 8pt; }

pre, code { font-family: "DejaVu Sans Mono", "Noto Sans Mono", monospace; font-size: 10pt; }

table { width: 100%; border-collapse: collapse; margin: 8pt 0; table-layout: fixed; }
th, td { border: 1px solid #ddd; padding: 6pt 8pt; }
thead { display: table-header-group; }
tfoot { display: table-footer-group; }
table, thead, tbody, tr, th, td { page-break-inside: avoid; break-inside: avoid-page; }

th, td { white-space: normal; overflow-wrap: anywhere; word-break: break-word; hyphens: auto; }
.table-block { page-break-inside: avoid; break-inside: avoid-page; }

pre { background: #f6f8fa; border: 1px solid #e5e7eb; border-radius: 6pt; padding: 8pt 10pt; white-space: pre-wrap; overflow-wrap: anywhere; word-break: break-word; }
code { background: #f6f8fa; border-radius: 4pt; padding: 0 3pt; }

a { color: #0366d6; text-decoration: underline; }
a:hover { text-decoration: underline; }

.break-before { page-break-before: always; }
.break-after { page-break-after: always; }
.doc-meta { height: 0; overflow: hidden; }
.doc-header-text { position: running(header); }
.doc-footer-text { position: running(footer); }
.doc-copyright { position: running(copyright); }
img#brand-logo { display: none; }

.toc { page-break-after: always; }
.toc h1 { font-size: 18pt; margin: 0 0 8pt; }
.toc ul { list-style: none; padding: 0; }
.toc li { margin: 4pt 0; display: grid; grid-template-columns: auto 1fr 30pt; column-gap: 8pt; align-items: baseline; }
.toc li.toc-h1 .toc-text { font-weight: 600; }
.toc li.toc-h2 .toc-text { margin-left: 8pt; }
.toc li.toc-h3 .toc-text { margin-left: 16pt; }
.toc .toc-dots { border-bottom: 1px dotted currentColor; height: 0.9em; transform: translateY(-0.1em); }
.toc .toc-page { text-align: right; }
.toc .toc-page::before { content: target-counter(attr(data-target), page); }
@page { @bottom-right { content: counter(page); font-size: 10pt; color: #444; } }
.doc-header-text { position: running(header); display: flex; justify-content: space-between; align-items: center; font-size: 11pt; color: #444; border-bottom: 1px solid #e5e7eb; padding-bottom: 6pt; min-height: 26pt; }
.doc-header-left { font-weight: 500; }
.doc-header-right { font-size: 10pt; color: #666; }
.doc-header-text img.logo-inline { height: 26pt; margin-right: 8pt; }
.doc-footer-text { position: running(footer); display: block; text-align: center; font-size: 10pt; color: #444; border-top: 1px solid #e5e7eb; padding-top: 6pt; }
.toc a { color: #0366d6; text-decoration: underline; }
.toc li { grid-template-columns: auto 1fr 48pt; }
.toc li.toc-h2 .toc-text { margin-left: 12pt; }
.toc li.toc-h3 .toc-text { margin-left: 24pt; }
table { max-width: 100%; box-sizing: border-box; }
tr, th, td { page-break-inside: avoid; break-inside: avoid-page; }
img, svg, canvas {
  display: block;
  max-width: 100%;
  height: auto;
  box-sizing: border-box;
  page-break-inside: avoid;
  break-inside: avoid-page;
}
p > img { margin: 6pt auto; }
td img, th img { max-width: 100%; height: auto; }
@page cover { size: A4; margin: 0; }
.cover { page: cover; position: relative; width: 210mm; height: 297mm; overflow: hidden; page-break-after: always; }
.cover .cover-bg { position: absolute; left: 0; top: 0; width: 100%; height: 100%; object-fit: cover; }
.cover .cover-brand { position: absolute; top: 20mm; left: 20mm; font-size: 18pt; font-weight: 700; color: #1d4ed8; }
.cover .cover-footer { position: absolute; left: 0; right: 0; bottom: 0; background: #1d4ed8; color: #fff; padding: 12mm 20mm; }
.cover .cover-title { font-size: 24pt; font-weight: 700; margin: 0; }
.cover .cover-subtitle { font-size: 13pt; margin-top: 4pt; }
.cover .cover-meta { margin-top: 8pt; font-size: 11pt; display: flex; gap: 20mm; }
docling/app/configs/test02.json (new file)
@@ -0,0 +1,17 @@
{
  "minio": {
    "endpoint": "127.0.0.1:9000",
    "public": "http://127.0.0.1:9000",
    "access": "minioadmin",
    "secret": "minioadmin123",
    "bucket": "doctest",
    "secure": "false",
    "prefix": "assets",
    "store_final": "true",
    "public_read": "true"
  },
  "db": {
    "webhook_url": null,
    "token": null
  }
}
docling/app/server.py (new file, 2993 lines)
File diff suppressed because it is too large.
docling/app/services/__init__.py (new file, empty)
@@ -0,0 +1 @@
docling/app/services/docling_adapter.py (new file)
@@ -0,0 +1,709 @@
from pathlib import Path
from typing import Optional, Tuple, Dict, List, Any
from urllib.parse import urlparse, unquote
import os
import re
import io
from bs4 import BeautifulSoup
from bs4.element import PageElement
import marko
import sys

try:
    _DOC_BASE = Path(__file__).resolve().parents[2] / "docling"
    p = str(_DOC_BASE)
    if p not in sys.path:
        sys.path.insert(0, p)
except Exception:
    pass

try:
    from docling.document_converter import DocumentConverter
except Exception:
    class DocumentConverter:  # type: ignore
        def __init__(self, *args, **kwargs):
            pass

        def convert(self, source):
            raise RuntimeError("docling not available")

from docx import Document
from docx.shared import Mm, Pt
from docx.enum.section import WD_SECTION
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from urllib.request import urlopen
import json

try:
    from weasyprint import HTML, CSS  # type: ignore
except Exception:
    HTML = None
    CSS = None

_mdit: Any = None
_tasklists_plugin: Any = None
_deflist_plugin: Any = None
_footnote_plugin: Any = None
_attrs_plugin: Any = None
_HAS_MD_IT: bool = False
try:
    import markdown_it as _mdit  # type: ignore
    from mdit_py_plugins.tasklists import tasklists_plugin as _tasklists_plugin  # type: ignore
    from mdit_py_plugins.deflist import deflist_plugin as _deflist_plugin  # type: ignore
    from mdit_py_plugins.footnote import footnote_plugin as _footnote_plugin  # type: ignore
    from mdit_py_plugins.attrs import attrs_plugin as _attrs_plugin  # type: ignore
    _HAS_MD_IT = True
except Exception:
    pass

converter = DocumentConverter()
LINKMAP_PATH = Path(__file__).resolve().parent.parent / "configs" / "linkmap" / "linkmap.json"
_LINKMAP: Dict[str, str] = {}


def load_linkmap() -> Dict[str, str]:
    global _LINKMAP
    try:
        if LINKMAP_PATH.exists():
            _LINKMAP = json.loads(LINKMAP_PATH.read_text("utf-8")) or {}
    except Exception:
        _LINKMAP = {}
    return _LINKMAP


def save_linkmap(mapping: Dict[str, str]) -> None:
    LINKMAP_PATH.parent.mkdir(parents=True, exist_ok=True)
    LINKMAP_PATH.write_text(json.dumps(mapping, ensure_ascii=False, indent=2), "utf-8")
    load_linkmap()


def resolve_link(href: Optional[str], data_doc: Optional[str]) -> Optional[str]:
    if href:
        return href
    if not _LINKMAP:
        load_linkmap()
    if data_doc and data_doc in _LINKMAP:
        return _LINKMAP[data_doc]
    return None


def export_payload(doc, fmt: str) -> Tuple[str, str]:
    f = fmt.lower()
    if f == "markdown":
        return doc.export_to_markdown(), "text/markdown"
    if f == "html":
        return doc.export_to_html(), "text/html"
    if f == "json":
        return doc.export_to_json(), "application/json"
    if f == "doctags":
        return doc.export_to_doctags(), "application/json"
    raise ValueError("unsupported export")


def infer_basename(source_url: Optional[str], upload_name: Optional[str]) -> str:
    if source_url:
        path = urlparse(source_url).path
        name = os.path.basename(path) or "document"
        name = unquote(name)
        return os.path.splitext(name)[0] or "document"
    if upload_name:
        name = os.path.splitext(os.path.basename(upload_name))[0] or "document"
        return name
    return "document"


def sanitize_filename(name: Optional[str]) -> str:
    if not name:
        return "document"
    name = name.strip()[:128]
    name = re.sub(r'[<>:"/\\|?*\x00-\x1F]', "_", name) or "document"
    return name


def convert_source(source: str, export: str) -> Tuple[str, str]:
    result = converter.convert(source)
    return export_payload(result.document, export)

def md_to_docx_bytes(md: str, toc: bool = False, header_text: Optional[str] = None, footer_text: Optional[str] = None, logo_url: Optional[str] = None, copyright_text: Optional[str] = None, filename_text: Optional[str] = None, cover_src: Optional[str] = None, product_name: Optional[str] = None, document_name: Optional[str] = None, product_version: Optional[str] = None, document_version: Optional[str] = None) -> bytes:
    try:
        import logging as _log
        _log.info(f"md_to_docx_bytes start toc={toc} header={bool(header_text)} footer={bool(footer_text)} logo={bool(logo_url)} cover={bool(cover_src)}")
    except Exception:
        pass

    def _add_field(paragraph, instr: str):
        r1 = paragraph.add_run()
        b = OxmlElement('w:fldChar')
        b.set(qn('w:fldCharType'), 'begin')
        r1._r.append(b)
        r2 = paragraph.add_run()
        t = OxmlElement('w:instrText')
        t.set(qn('xml:space'), 'preserve')
        t.text = instr
        r2._r.append(t)
        r3 = paragraph.add_run()
        e = OxmlElement('w:fldChar')
        e.set(qn('w:fldCharType'), 'end')
        r3._r.append(e)

    def _available_width(section) -> int:
        return section.page_width - section.left_margin - section.right_margin

    def _fetch_bytes(u: str) -> Optional[bytes]:
        try:
            if u.lower().startswith('http://') or u.lower().startswith('https://'):
                with urlopen(u, timeout=10) as r:
                    return r.read()
            p = Path(u)
            if p.exists() and p.is_file():
                return p.read_bytes()
        except Exception:
            return None
        return None

    html = normalize_html(md, options={
        "toc": "1" if toc else "",
        "header_text": header_text,
        "footer_text": footer_text,
        "logo_url": logo_url,
        "copyright_text": copyright_text,
        "filename_text": filename_text,
        "cover_src": cover_src,
        "product_name": product_name,
        "document_name": document_name,
        "product_version": product_version,
        "document_version": document_version,
    })
    try:
        import logging as _log
        _log.info(f"md_to_docx_bytes normalize_html length={len(html)}")
    except Exception:
        pass
    soup = BeautifulSoup(html, "html.parser")
    doc = Document()
    sec0 = doc.sections[0]
    sec0.page_width = Mm(210)
    sec0.page_height = Mm(297)
    sec0.left_margin = Mm(15)
    sec0.right_margin = Mm(15)
    sec0.top_margin = Mm(20)
    sec0.bottom_margin = Mm(20)
    has_cover = bool(cover_src or (soup.find('section', class_='cover') is not None))
    if has_cover:
        sec0.left_margin = Mm(0)
        sec0.right_margin = Mm(0)
        sec0.top_margin = Mm(0)
        sec0.bottom_margin = Mm(0)
        if cover_src:
            b = _fetch_bytes(cover_src)
            if b:
                bio = io.BytesIO(b)
                doc.add_picture(bio, width=_available_width(sec0))
        if product_name:
            p = doc.add_paragraph()
            r = p.add_run(product_name)
            r.font.size = Pt(18)
            r.bold = True
        t = document_name or None
        if not t:
            h1 = soup.body.find('h1') if soup.body else soup.find('h1')
            t = h1.get_text(strip=True) if h1 else '文档'
        p2 = doc.add_paragraph()
        r2 = p2.add_run(t or '文档')
        r2.font.size = Pt(24)
        r2.bold = True
        if filename_text:
            p3 = doc.add_paragraph()
            r3 = p3.add_run(filename_text)
            r3.font.size = Pt(13)
        meta_parts = []
        if product_version:
            meta_parts.append("产品版本:" + product_version)
        if document_version:
            meta_parts.append("文档版本:" + document_version)
        if meta_parts:
            doc.add_paragraph("  ".join(meta_parts))
        doc.add_section(WD_SECTION.NEW_PAGE)
        sec = doc.sections[-1]
        sec.page_width = Mm(210)
        sec.page_height = Mm(297)
        sec.left_margin = Mm(15)
        sec.right_margin = Mm(15)
        sec.top_margin = Mm(20)
        sec.bottom_margin = Mm(20)
    else:
        sec = sec0
    if header_text or logo_url or filename_text:
        hp = sec.header.add_paragraph()
        left = header_text or ''
        right = ''
        if '||' in left:
            parts = left.split('||', 1)
            left, right = parts[0], parts[1]
        elif '|' in left:
            parts = left.split('|', 1)
            left, right = parts[0], parts[1]
        if left.strip():
            hp.add_run(left.strip())
        if right.strip():
            rp = sec.header.add_paragraph()
            rp.alignment = WD_PARAGRAPH_ALIGNMENT.RIGHT
            rp.add_run(right.strip())
        elif filename_text:
            rp = sec.header.add_paragraph()
            rp.alignment = WD_PARAGRAPH_ALIGNMENT.RIGHT
            rp.add_run(filename_text)
    if footer_text or copyright_text:
        fp = sec.footer.add_paragraph()
        if footer_text:
            fp.add_run(footer_text)
        if copyright_text:
            cp = sec.footer.add_paragraph()
            cp.add_run(copyright_text)
    pn = sec.footer.add_paragraph()
    pn.alignment = WD_PARAGRAPH_ALIGNMENT.RIGHT
    _add_field(pn, 'PAGE')
    if toc:
        doc.add_paragraph('目录')
        _add_field(doc.add_paragraph(), 'TOC \\o "1-3" \\h \\z \\u')
        doc.add_page_break()

    def add_inline(p, node):
        if isinstance(node, str):
            p.add_run(node)
            return
        if node.name in ['strong', 'b']:
            r = p.add_run(node.get_text())
            r.bold = True
            return
        if node.name in ['em', 'i']:
            r = p.add_run(node.get_text())
            r.italic = True
            return
        if node.name == 'code':
            r = p.add_run(node.get_text())
            r.font.name = 'Courier New'
            return
        if node.name == 'a':
            text = node.get_text()
            href = node.get('href')
            extra = node.get('data-doc')
            resolved = resolve_link(href, extra)
            if resolved:
                p.add_run(text + ' [' + resolved + ']')
            else:
                p.add_run(text)
            return
        if node.name == 'img':
            src = node.get('src') or ''
            b = _fetch_bytes(src)
            if b:
                bio = io.BytesIO(b)
                try:
                    doc.add_picture(bio, width=_available_width(sec))
                except Exception:
                    pass
            return
        for c in getattr(node, 'children', []):
            add_inline(p, c)

    def process_block(el):
        name = getattr(el, 'name', None)
        if name is None:
            return
        cls = el.get('class') or []
        if name == 'div' and 'doc-meta' in cls:
            return
        if name == 'section' and 'cover' in cls:
            return
        if name == 'nav' and 'toc' in cls:
            return
        if name == 'div':
            for child in el.children:
                process_block(child)
            return
        if name == 'h1':
            doc.add_heading(el.get_text(), level=1)
            return
        if name == 'h2' or (name == 'strong' and 'subtitle' in cls):
            doc.add_heading(el.get_text(), level=2)
            return
        if name == 'h3':
            doc.add_heading(el.get_text(), level=3)
            return
        if name == 'p':
            p = doc.add_paragraph()
            for c in el.children:
                add_inline(p, c)
            return
        if name in ['ul', 'ol']:
            for li in el.find_all('li', recursive=False):
                p = doc.add_paragraph(style='List Bullet')
                for c in li.children:
                    add_inline(p, c)
            return
        if name == 'pre':
            code = el.get_text() or ''
            p = doc.add_paragraph()
            run = p.add_run(code)
            run.font.name = 'Courier New'
            return
        if name == 'blockquote':
            p = doc.add_paragraph(el.get_text())
            p.paragraph_format.left_indent = Mm(10)
            return
        if name == 'table':
            rows = []
            thead = el.find('thead')
            tbody = el.find('tbody')
            if thead:
                hdrs = [th.get_text(strip=True) for th in thead.find_all('th')]
            else:
                hdrs = [cell.get_text(strip=True) for cell in el.find_all('tr')[0].find_all(['th', 'td'])] if el.find_all('tr') else []
            trs = tbody.find_all('tr') if tbody else el.find_all('tr')[1:]
            for tr in trs:
                tds = [td.get_text(strip=True) for td in tr.find_all('td')]
                rows.append(tds)
            tbl = doc.add_table(rows=1 + len(rows), cols=len(hdrs) or 1)
            hdr = tbl.rows[0].cells
            for k, h in enumerate(hdrs or ['']):
                hdr[k].text = h
            for r_idx, row in enumerate(rows):
                cells = tbl.rows[1 + r_idx].cells
                for c_idx in range(len(hdrs) or 1):
                    cells[c_idx].text = (row[c_idx] if c_idx < len(row) else '')
            return
        if name == 'img':
            src = el.get('src') or ''
            b = _fetch_bytes(src)
            if b:
                bio = io.BytesIO(b)
                try:
                    doc.add_picture(bio, width=_available_width(sec))
                except Exception:
                    pass
            return

    body = soup.body or soup
    for el in body.children:
        process_block(el)
    bio = io.BytesIO()
    try:
        import logging as _log
        _log.info("md_to_docx_bytes saving doc")
    except Exception:
        pass
    doc.save(bio)
    try:
        import logging as _log
        _log.info(f"md_to_docx_bytes done size={bio.tell()}")
    except Exception:
        pass
    return bio.getvalue()

def md_to_pdf_bytes(md: str) -> bytes:
    return md_to_pdf_bytes_with_renderer(md, renderer="weasyprint")


def _md_with_tables_to_html(md_text: str) -> str:
    lines = md_text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]

        def is_sep(s: str) -> bool:
            s = s.strip()
            if "|" not in s:
                return False
            s = s.strip("|")
            return all(set(seg.strip()) <= set("-: ") and len(seg.strip()) >= 1 for seg in s.split("|"))

        if "|" in line and i + 1 < len(lines) and is_sep(lines[i + 1]):
            headers = [c.strip() for c in line.strip().strip("|").split("|")]
            j = i + 2
            rows = []
            while j < len(lines) and "|" in lines[j]:
                rows.append([c.strip() for c in lines[j].strip().strip("|").split("|")])
                j += 1
            tbl = ["<table>", "<thead><tr>"]
            for h in headers:
                tbl.append(f"<th>{h}</th>")
            tbl.append("</tr></thead><tbody>")
            for row in rows:
                tbl.append("<tr>")
                for idx in range(len(headers)):
                    cell = row[idx] if idx < len(row) else ""
                    tbl.append(f"<td>{cell}</td>")
                tbl.append("</tr>")
            tbl.append("</tbody></table>")
            out.append("".join(tbl))
            i = j
            continue
        out.append(line)
        i += 1
    return marko.convert("\n".join(out))


def _render_markdown_html(md_text: str) -> str:
    if _HAS_MD_IT and _mdit is not None:
        try:
            md = _mdit.MarkdownIt("commonmark").enable(["table", "strikethrough"])
            if _tasklists_plugin:
                md.use(_tasklists_plugin)
            if _deflist_plugin:
                md.use(_deflist_plugin)
            if _footnote_plugin:
                md.use(_footnote_plugin)
            if _attrs_plugin:
                md.use(_attrs_plugin)
            return md.render(md_text)
        except Exception:
            pass
    return _md_with_tables_to_html(md_text)


def normalize_html(md_or_html: str, options: Optional[Dict[str, Optional[str]]] = None) -> str:
    html = _render_markdown_html(md_or_html)
    soup = BeautifulSoup(html, "html.parser")
    for s in soup.find_all("strong", class_="subtitle"):
        s.name = "h2"
        s.attrs = {"data-origin": "subtitle"}
    for a in soup.find_all("a"):
        href_val = a.get("href")
        extra_val = a.get("data-doc")
        href = href_val if isinstance(href_val, str) else None
        extra = extra_val if isinstance(extra_val, str) else None
        resolved = resolve_link(href, extra)
        if resolved:
            a["href"] = resolved
        elif not href and extra:
            a.replace_with(a.get_text() + " [" + extra + "]")
    opts = options or {}
    header_text = opts.get("header_text") or None
    footer_text = opts.get("footer_text") or None
    logo_url = opts.get("logo_url") or None
    copyright_text = opts.get("copyright_text") or None
    cover_src = opts.get("cover_src") or None
    product_name_opt = opts.get("product_name") or None
    document_name_opt = opts.get("document_name") or None
    product_version_opt = opts.get("product_version") or None
    document_version_opt = opts.get("document_version") or None
    toc_flag = bool(opts.get("toc"))
    meta = soup.new_tag("div", attrs={"class": "doc-meta"})
    if header_text:
        ht = soup.new_tag("div", attrs={"class": "doc-header-text"})
        text = header_text
        left = text
        right = ""
        if "||" in text:
            parts = text.split("||", 1)
            left, right = parts[0], parts[1]
        elif "|" in text:
            parts = text.split("|", 1)
            left, right = parts[0], parts[1]
        if logo_url:
            img = soup.new_tag("img", attrs={"class": "logo-inline", "src": logo_url})
            ht.append(img)
        hl = soup.new_tag("span", attrs={"class": "doc-header-left"})
        hl.string = left
        ht.append(hl)
        if right.strip():
            hr = soup.new_tag("span", attrs={"class": "doc-header-right"})
            hr.string = right
            ht.append(hr)
        meta.append(ht)
    else:
        first_h1 = None
        if soup.body:
            first_h1 = soup.body.find("h1")
        else:
            first_h1 = soup.find("h1")
        left = (first_h1.get_text(strip=True) if first_h1 else "文档")
        right = opts.get("filename_text") or ""
        ht = soup.new_tag("div", attrs={"class": "doc-header-text"})
        if logo_url:
            img = soup.new_tag("img", attrs={"class": "logo-inline", "src": logo_url})
            ht.append(img)
        hl = soup.new_tag("span", attrs={"class": "doc-header-left"})
        hl.string = left
        ht.append(hl)
        if right:
            hr = soup.new_tag("span", attrs={"class": "doc-header-right"})
            hr.string = right
            ht.append(hr)
        meta.append(ht)
    if footer_text:
        ft = soup.new_tag("div", attrs={"class": "doc-footer-text"})
        ft.string = footer_text
        meta.append(ft)
    page_header_val = (header_text or (document_name_opt or None))
    if not page_header_val:
        first_h1_for_header = None
        if soup.body:
            first_h1_for_header = soup.body.find("h1")
        else:
            first_h1_for_header = soup.find("h1")
        page_header_val = (first_h1_for_header.get_text(strip=True) if first_h1_for_header else "文档")
    page_footer_val = (footer_text or "FunMD")
    ph = soup.new_tag("div", attrs={"class": "doc-page-header"})
    if logo_url:
        logo_inline = soup.new_tag("img", attrs={"src": logo_url, "class": "doc-page-header-logo"})
        ph.append(logo_inline)
    ht_inline = soup.new_tag("span", attrs={"class": "doc-page-header-text"})
    ht_inline.string = page_header_val
    ph.append(ht_inline)
    meta.append(ph)
    pf = soup.new_tag("div", attrs={"class": "doc-page-footer"})
    pf.string = page_footer_val
    meta.append(pf)
    if copyright_text:
        cp = soup.new_tag("div", attrs={"class": "doc-copyright"})
        cp.string = copyright_text
        meta.append(cp)
    # brand logo is rendered inline within header; no separate top-left element
    if soup.body:
        soup.body.insert(0, meta)
    else:
        soup.insert(0, meta)
    if not soup.head:
        head = soup.new_tag("head")
        soup.insert(0, head)
    else:
        head = soup.head
    style_run = soup.new_tag("style")
    style_run.string = "@page{margin:20mm}@page{\n @top-center{content: element(page-header)}\n @bottom-center{content: element(page-footer)}\n}\n.doc-page-header{position: running(page-header); font-size:10pt; color:#666; display:block; text-align:center; width:100%}\n.doc-page-header::after{content:''; display:block; width:80%; border-bottom:1px solid #d9d9d9; margin:4px auto 0}\n.doc-page-header-logo{height:20px; vertical-align:middle; margin-right:4px}\n.doc-page-header-text{vertical-align:middle}\n.doc-page-footer{position: running(page-footer); font-size:10pt; color:#666}\n.doc-page-footer::before{content:''; display:block; width:80%; border-top:1px solid #d9d9d9; margin:0 auto 4px}"
    head.append(style_run)
    # Fallback inline styles for cover to ensure visibility even if external CSS isn't loaded
    if (cover_src or product_name_opt or document_name_opt or product_version_opt or document_version_opt):
        if not soup.head:
            head = soup.new_tag("head")
            soup.insert(0, head)
        else:
            head = soup.head
        style = soup.new_tag("style")
        style.string = "@page:first{margin:0} html,body{margin:0;padding:0}.cover{position:relative;width:210mm;height:297mm;overflow:hidden;page-break-after:always}.cover .cover-bg{position:absolute;left:0;top:0;right:0;bottom:0;width:100%;height:100%;object-fit:cover;display:block}.cover .cover-brand{position:absolute;top:20mm;left:20mm;font-size:18pt;font-weight:700;color:#1d4ed8}.cover .cover-footer{position:absolute;left:0;right:0;bottom:0;background:#1d4ed8;color:#fff;padding:12mm 20mm}.cover .cover-title{font-size:24pt;font-weight:700;margin:0}.cover .cover-subtitle{font-size:13pt;margin-top:4pt}.cover .cover-meta{margin-top:8pt;font-size:11pt;display:flex;gap:20mm}"
        head.append(style)
    if cover_src or product_name_opt or document_name_opt or product_version_opt or document_version_opt:
        cov = soup.new_tag("section", attrs={"class": "cover"})
        if cover_src:
            bg = soup.new_tag("img", attrs={"class": "cover-bg", "src": cover_src})
            cov.append(bg)
        if product_name_opt:
            brand_el = soup.new_tag("div", attrs={"class": "cover-brand"})
            brand_el.string = product_name_opt
            cov.append(brand_el)
        footer = soup.new_tag("div", attrs={"class": "cover-footer"})
        title_text = document_name_opt or None
        if not title_text:
            first_h1 = soup.body.find("h1") if soup.body else soup.find("h1")
            if first_h1:
                title_text = first_h1.get_text(strip=True)
        title_el = soup.new_tag("div", attrs={"class": "cover-title"})
        title_el.string = title_text or "文档"
        footer.append(title_el)
        subtitle_val = opts.get("filename_text") or ""
        if subtitle_val:
            subtitle_el = soup.new_tag("div", attrs={"class": "cover-subtitle"})
            subtitle_el.string = subtitle_val
            footer.append(subtitle_el)
        meta_el = soup.new_tag("div", attrs={"class": "cover-meta"})
        if product_version_opt:
            pv = soup.new_tag("span")
            pv.string = f"产品版本:{product_version_opt}"
            meta_el.append(pv)
        if document_version_opt:
            dv = soup.new_tag("span")
            dv.string = f"文档版本:{document_version_opt}"
            meta_el.append(dv)
        footer.append(meta_el)
        cov.append(footer)
        if soup.body:
            soup.body.insert(1, cov)
        else:
            soup.insert(1, cov)
    if toc_flag:
        headings = [
            el for el in (soup.find_all(["h1", "h2", "h3"]) or [])
            if el.get("data-origin") != "subtitle"
        ]
        if headings:
            ul = soup.new_tag("ul")
            idx = 1
            for el in headings:
                text = el.get_text(strip=True)
                if not text:
                    continue
                hid = el.get("id")
                if not hid:
                    hid = f"sec-{idx}"
                    el["id"] = hid
                    idx += 1
                li = soup.new_tag("li", attrs={"class": f"toc-{el.name}"})
                a = soup.new_tag("a", attrs={"href": f"#{hid}", "class": "toc-text"})
                a.string = text
                dots = soup.new_tag("span", attrs={"class": "toc-dots"})
                page = soup.new_tag("span", attrs={"class": "toc-page", "data-target": f"#{hid}"})
                li.append(a)
                li.append(dots)
                li.append(page)
                ul.append(li)
            nav = soup.new_tag("nav", attrs={"class": "toc"})
            h = soup.new_tag("h1")
            h.string = "目录"
            nav.append(h)
            nav.append(ul)
            if soup.body:
                soup.body.insert(2, nav)
            else:
                soup.insert(2, nav)
    if soup.body:
        for h in soup.body.find_all(["h1", "h2", "h3"]):
            sib: Optional[PageElement] = h.find_next_sibling()
            blocks: List[Any] = []
            first_table: Optional[Any] = None
            while sib is not None:
                # Skip pure whitespace nodes
                if getattr(sib, "name", None) is None:
                    try:
                        if str(sib).strip() == "":
                            sib = sib.next_sibling
                            continue
                    except Exception:
                        break
                # Stop if next heading encountered
                name = getattr(sib, "name", None)
                if name in ["h1", "h2", "h3"]:
                    break
                # Collect explanatory blocks until first table
                if name == "table":
                    first_table = sib
                    break
                if name in ["p", "blockquote", "ul", "ol"]:
                    blocks.append(sib)
                    sib = sib.next_sibling
                    continue
                # Unknown block: stop grouping to avoid wrapping unrelated content
                break
            if first_table is not None:
                wrap = soup.new_tag("div", attrs={"class": "table-block"})
                h.insert_before(wrap)
                wrap.append(h.extract())
                for el in blocks:
                    wrap.append(el.extract())
                wrap.append(first_table.extract())
    return str(soup)


def _stylesheets_for(css_name: Optional[str], css_text: Optional[str]):
    sheets: List[Any] = []
    if CSS is None:
        return sheets
    if css_text:
        sheets.append(CSS(string=css_text))
    if css_name:
        css_path = Path(__file__).resolve().parent.parent / "configs" / "styles" / f"{css_name}.css"
        if css_path.exists():
            sheets.append(CSS(filename=str(css_path)))
    return sheets


def md_to_pdf_bytes_with_renderer(md: str, renderer: str = "weasyprint", css_name: Optional[str] = None, css_text: Optional[str] = None, toc: bool = False, header_text: Optional[str] = None, footer_text: Optional[str] = None, logo_url: Optional[str] = None, copyright_text: Optional[str] = None, filename_text: Optional[str] = None, cover_src: Optional[str] = None, product_name: Optional[str] = None, document_name: Optional[str] = None, product_version: Optional[str] = None, document_version: Optional[str] = None) -> bytes:
    html = normalize_html(md, options={
        "toc": "1" if toc else "",
        "header_text": header_text,
        "footer_text": footer_text,
        "logo_url": logo_url,
        "copyright_text": copyright_text,
        "filename_text": filename_text,
        "cover_src": cover_src,
        "product_name": product_name,
        "document_name": document_name,
        "product_version": product_version,
        "document_version": document_version,
    })
    if HTML is not None:
        stylesheets = _stylesheets_for(css_name, css_text)
        pdf_bytes = HTML(string=html).write_pdf(stylesheets=stylesheets or None)
        return pdf_bytes
    raise RuntimeError("WeasyPrint is not available")
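A minimal sketch of exercising the two exporters this adapter defines, run from the repository root with the service dependencies installed; the input document and output filenames are made up for illustration.

```python
# Sketch: render a small Markdown document through both exporters above.
# Assumes WeasyPrint and python-docx import successfully, per the module's imports.
from app.services.docling_adapter import md_to_docx_bytes, md_to_pdf_bytes_with_renderer

md = "# Demo\n\nHello **world**.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"

pdf = md_to_pdf_bytes_with_renderer(md, css_name="default", toc=True,
                                    header_text="Product|Title", footer_text="© Company")
docx = md_to_docx_bytes(md, toc=True)

with open("demo.pdf", "wb") as f:
    f.write(pdf)
with open("demo.docx", "wb") as f:
    f.write(docx)
```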
docling/app/services/minio_utils.py (new file)
@@ -0,0 +1,190 @@
from typing import Optional, Tuple, Dict
import os
import logging
from urllib.request import urlopen

try:
    from minio import Minio  # type: ignore
    import urllib3  # type: ignore
except Exception:
    Minio = None
    urllib3 = None  # type: ignore


def minio_head_bucket(client: object, bucket: str) -> bool:
    try:
        if hasattr(client, "bucket_exists"):
            try:
                return bool(client.bucket_exists(bucket))  # type: ignore
            except Exception:
                pass
        try:
            region = client._get_region(bucket)  # type: ignore
        except Exception:
            region = "us-east-1"
        client._url_open(method="HEAD", region=region, bucket_name=bucket)  # type: ignore
        return True
    except Exception:
        try:
            names = [getattr(b, "name", None) for b in client.list_buckets()]  # type: ignore
            return bucket in set(n for n in names if n)
        except Exception:
            return False


def minio_create_bucket(client: object, bucket: str) -> bool:
    try:
        if hasattr(client, "bucket_exists"):
            try:
                if client.bucket_exists(bucket):  # type: ignore
                    return True
            except Exception:
                pass
        if hasattr(client, "make_bucket"):
            try:
                client.make_bucket(bucket)  # type: ignore
                return True
            except Exception:
                try:
                    region = client._get_region(bucket)  # type: ignore
                except Exception:
                    region = "us-east-1"
                try:
                    client.make_bucket(bucket, location=region)  # type: ignore
                    return True
                except Exception:
                    pass
        try:
            try:
                region = client._get_region(bucket)  # type: ignore
            except Exception:
                region = "us-east-1"
            client._url_open(method="PUT", region=region, bucket_name=bucket)  # type: ignore
            return True
        except Exception as ce:
            if "BucketAlreadyOwnedByYou" in str(ce) or "BucketAlreadyExists" in str(ce):
                return True
            raise
    except Exception as e:
        raise e


def minio_client(endpoint: str, access: str, secret: str, secure: bool):
    if urllib3 is not None:
        try:
            http = urllib3.PoolManager(timeout=urllib3.Timeout(connect=3.0, read=20.0))
            return Minio(endpoint=endpoint, access_key=access, secret_key=secret, secure=secure, http_client=http)  # type: ignore
        except Exception:
            return Minio(endpoint=endpoint, access_key=access, secret_key=secret, secure=secure)  # type: ignore
    return Minio(endpoint=endpoint, access_key=access, secret_key=secret, secure=secure)  # type: ignore


def minio_time_hint(endpoint: str, secure: bool) -> Optional[str]:
    try:
        scheme = "https" if secure else "http"
        r = urlopen(f"{scheme}://{endpoint}", timeout=3)
        srv_date = r.headers.get("Date")
        if not srv_date:
            return None
        from email.utils import parsedate_to_datetime
        from datetime import datetime, timezone
        dt = parsedate_to_datetime(srv_date)
        now = datetime.now(timezone.utc)
        diff = abs((now - dt).total_seconds())
        return f"服务器时间与本机相差约 {int(diff)} 秒"
    except Exception:
        return None


def join_prefix(prefix: str, rel: str) -> str:
    pre = (prefix or "").strip("/")
    r = rel.lstrip("/")
    if pre and r.startswith(pre + "/"):
        return r
    return f"{pre}/{r}" if pre else r


def presigned_read(client: object, bucket: str, obj: str, expires_seconds: int) -> Optional[str]:
    try:
        from datetime import timedelta
        exp = expires_seconds
        try:
            exp = int(exp)
        except Exception:
            pass
        td = timedelta(seconds=exp)
        try:
            return client.get_presigned_url("GET", bucket, obj, expires=td)  # type: ignore
        except Exception:
            return client.presigned_get_object(bucket, obj, expires=td)  # type: ignore
    except Exception:
        return None


def minio_current(runtime_cfg: Dict[str, Dict[str, Optional[str]]]) -> Tuple[Optional[object], Optional[str], Optional[str], str]:
    rc = runtime_cfg.get("minio", {})
    endpoint_raw = rc.get("endpoint") or os.environ.get("MINIO_ENDPOINT")
    access_raw = rc.get("access") or os.environ.get("MINIO_ACCESS_KEY")
    secret_raw = rc.get("secret") or os.environ.get("MINIO_SECRET_KEY")
    bucket_raw = rc.get("bucket") or os.environ.get("MINIO_BUCKET")
    secure_flag = rc.get("secure") or os.environ.get("MINIO_SECURE", "false")
    secure = str(secure_flag or "false").lower() in {"1", "true", "yes", "on"}
    public_raw = rc.get("public") or os.environ.get("MINIO_PUBLIC_ENDPOINT")
    endpoint = (str(endpoint_raw).strip() if endpoint_raw else None)
    try:
        if isinstance(endpoint, str) and ":9001" in endpoint:
            h = endpoint.split("/")[0]
            if ":" in h:
                parts = h.split(":")
                endpoint = f"{parts[0]}:9000"
            else:
                endpoint = h
    except Exception:
        endpoint = endpoint
    access = (str(access_raw).strip() if access_raw else None)
    secret = (str(secret_raw).strip() if secret_raw else None)
    bucket = (str(bucket_raw).strip() if bucket_raw else None)
    public_base = (str(public_raw).strip() if public_raw else None)
    try:
        if isinstance(public_base, str) and (":9001" in public_base or "/browser" in public_base or "/minio" in public_base):
            host = public_base.strip().split("/")[0]
            scheme = "https" if secure else "http"
            if ":" in host:
                host = host.split("/")[0]
                base_host = host.split(":")[0]
                public_base = f"{scheme}://{base_host}:9000"
            else:
                public_base = f"{scheme}://{host}:9000"
    except Exception:
        public_base = public_base
    if not public_base and endpoint:
        public_base = f"https://{endpoint}" if secure else f"http://{endpoint}"
    missing = []
    if Minio is None:
        missing.append("client")
    if not endpoint:
        missing.append("endpoint")
    if not access:
        missing.append("access")
    if not secret:
        missing.append("secret")
    if not bucket:
        missing.append("bucket")
    if not public_base:
        missing.append("public")
    if missing:
        try:
            logging.error(f"minio config invalid: missing={missing}")
        except Exception:
            pass
        return None, None, None, ""
    client = minio_client(endpoint=endpoint, access=access, secret=secret, secure=secure)
    try:
        try:
            client.list_buckets()  # type: ignore
        except Exception as e:
            if secure and ("SSL" in str(e) or "HTTPSConnectionPool" in str(e) or "SSLError" in str(e)):
                client = minio_client(endpoint=endpoint, access=access, secret=secret, secure=False)
    except Exception:
        pass
    try:
        exists = minio_head_bucket(client, bucket)
        if not exists:
            minio_create_bucket(client, bucket)
    except Exception:
        pass
    prefix = rc.get("prefix") or os.environ.get("MINIO_PREFIX", "")
    return client, bucket, public_base, prefix
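A sketch of how these helpers compose, using the same runtime-config shape `minio_current` reads: a `{"minio": {...}}` dict that falls back to the `MINIO_*` environment variables when keys are absent. The object name below is illustrative.

```python
# Sketch: resolve a client from runtime config, then mint a presigned GET URL.
from app.services.minio_utils import join_prefix, minio_current, presigned_read

runtime_cfg = {"minio": {}}  # empty: fall back to MINIO_* environment variables
client, bucket, public_base, prefix = minio_current(runtime_cfg)

if client is not None and bucket:
    obj = join_prefix(prefix, "converted/demo.pdf")  # hypothetical object name
    url = presigned_read(client, bucket, obj, expires_seconds=3600)
    print(url or f"{public_base}/{bucket}/{obj}")  # fall back to the public URL
```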
492
docling/app/services/unified_converter.py
Normal file
492
docling/app/services/unified_converter.py
Normal file
@@ -0,0 +1,492 @@
|
||||
from pathlib import Path
|
||||
from typing import Optional, Tuple
|
||||
import re
|
||||
|
||||
import tempfile
|
||||
import sys
|
||||
from urllib.parse import urlsplit
|
||||
from urllib.request import urlopen
|
||||
from urllib.error import HTTPError, URLError
|
||||
import io
|
||||
_DOC_AVAILABLE = True
|
||||
try:
|
||||
_DOC_BASE = Path(__file__).resolve().parents[2] / "docling"
|
||||
p = str(_DOC_BASE)
|
||||
if p not in sys.path:
|
||||
sys.path.insert(0, p)
|
||||
except Exception:
|
||||
pass
|
||||
try:
|
||||
from docling.document_converter import DocumentConverter
|
||||
from docling.datamodel.base_models import InputFormat
|
||||
from docling.document_converter import PdfFormatOption
|
||||
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
|
||||
from docling.datamodel.pipeline_options import PdfPipelineOptions
|
||||
from docling_core.types.doc import ImageRefMode
|
||||
except Exception:
|
||||
_DOC_AVAILABLE = False
|
||||
class DocumentConverter: # type: ignore
|
||||
def __init__(self, *args, **kwargs):
|
||||
pass
|
||||
def convert(self, source):
|
||||
raise RuntimeError("docling unavailable")
|
||||
class InputFormat: # type: ignore
|
||||
PDF = "pdf"
|
||||
class PdfFormatOption: # type: ignore
|
||||
def __init__(self, *args, **kwargs):
|
||||
pass
|
||||
class StandardPdfPipeline: # type: ignore
|
||||
pass
|
||||
class PdfPipelineOptions: # type: ignore
|
||||
def __init__(self):
|
||||
pass
|
||||
class ImageRefMode: # type: ignore
|
||||
EMBEDDED = None
|
||||
|
||||
"""
|
||||
@api Unified Converter Service
|
||||
@description Provides core document conversion logic unifying Docling and word2markdown engines
|
||||
"""
|
||||
|
||||
_W2M_AVAILABLE = False
|
||||
try:
|
||||
from app.services.word2markdown import convert_any as _w2m_convert_any # type: ignore
|
||||
_W2M_AVAILABLE = True
|
||||
except Exception:
|
||||
_W2M_AVAILABLE = False
|
||||
|
||||
try:
|
||||
from bs4 import BeautifulSoup # type: ignore
|
||||
except Exception:
|
||||
BeautifulSoup = None # type: ignore
|
||||
try:
|
||||
from app.services.docling_adapter import normalize_html as _normalize_html # type: ignore
|
||||
from app.services.docling_adapter import resolve_link as _resolve_link # type: ignore
|
||||
from app.services.docling_adapter import _render_markdown_html as _render_md_html # type: ignore
|
||||
except Exception:
|
||||
_normalize_html = None # type: ignore
|
||||
_resolve_link = None # type: ignore
|
||||
_render_md_html = None # type: ignore
|
||||
|
||||
def _is_http(s: str) -> bool:
|
||||
t = (s or "").lower()
|
||||
return t.startswith("http://") or t.startswith("https://")
|
||||
|
||||
def _read_bytes(source: str) -> Tuple[bytes, str]:
|
||||
ct = ""
|
||||
try:
|
||||
if _is_http(source):
|
||||
from urllib.request import urlopen
|
||||
with urlopen(source, timeout=10) as r:
|
||||
ct = r.headers.get("Content-Type") or ""
|
||||
return r.read() or b"", ct
|
||||
p = Path(source)
|
||||
if p.exists() and p.is_file():
|
||||
return p.read_bytes(), ct
|
||||
except Exception:
|
||||
return b"", ct
|
||||
return b"", ct
|
||||
|
||||
def _decode_to_utf8(raw: bytes, ct: str = "") -> str:
|
||||
if not raw:
|
||||
return ""
|
||||
if raw.startswith(b"\xef\xbb\xbf"):
|
||||
try:
|
||||
return raw[3:].decode("utf-8")
|
||||
except Exception:
|
||||
pass
|
||||
if raw.startswith(b"\xff\xfe"):
|
||||
try:
|
||||
return raw[2:].decode("utf-16le")
|
||||
except Exception:
|
||||
pass
|
||||
if raw.startswith(b"\xfe\xff"):
|
||||
try:
|
||||
return raw[2:].decode("utf-16be")
|
||||
except Exception:
|
||||
pass
|
||||
try:
|
||||
m = re.search(r"charset=([\w-]+)", ct or "", re.IGNORECASE)
|
||||
if m:
|
||||
enc = m.group(1).strip().lower()
|
||||
try:
|
||||
return raw.decode(enc)
|
||||
except Exception:
|
||||
pass
|
||||
except Exception:
|
||||
pass
|
||||
candidates = [
|
||||
"utf-8", "gb18030", "gbk", "big5", "shift_jis", "iso-8859-1", "windows-1252",
|
||||
]
|
||||
for enc in candidates:
|
||||
try:
|
||||
return raw.decode(enc)
|
||||
except Exception:
|
||||
continue
|
||||
return raw.decode("utf-8", errors="replace")
|
||||
|
||||
def _normalize_newlines(s: str) -> str:
|
||||
return (s or "").replace("\r\n", "\n").replace("\r", "\n")


def _html_to_markdown(html: str) -> str:
    if not html:
        return ""
    if BeautifulSoup is None:
        return html
    soup = BeautifulSoup(html, "html.parser")
    out: list[str] = []

    def txt(node) -> str:
        return (getattr(node, "get_text", lambda **kwargs: str(node))(strip=True) if node else "")

    def inline(node) -> str:
        if isinstance(node, str):
            return node
        name = getattr(node, "name", None)
        if name is None:
            return str(node)
        if name in {"strong", "b"}:
            return "**" + txt(node) + "**"
        if name in {"em", "i"}:
            return "*" + txt(node) + "*"
        if name == "code":
            return "`" + txt(node) + "`"
        if name == "a":
            href_val = node.get("href")
            extra_val = node.get("data-doc")
            href = href_val if isinstance(href_val, str) else None
            extra = extra_val if isinstance(extra_val, str) else None
            resolved = _resolve_link(href, extra) if _resolve_link else (href or extra)
            url = resolved or ""
            text = txt(node)
            if url:
                return f"[{text}]({url})"
            return text
        if name == "img":
            alt = node.get("alt") or "image"
            src = node.get("src") or ""
            return f"![{alt}]({src})"
        res = []
        for c in getattr(node, "children", []):
            res.append(inline(c))
        return "".join(res)

    def block(node):
        name = getattr(node, "name", None)
        if name is None:
            s = str(node).strip()
            if s:
                out.append(s)
            return
        if name in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            lvl = int(name[1])
            out.append("#" * lvl + " " + txt(node))
            out.append("")
            return
        if name == "p":
            segs = [inline(c) for c in node.children]
            out.append("".join(segs))
            out.append("")
            return
        if name == "br":
            out.append("")
            return
        if name in {"ul", "ol"}:
            is_ol = name == "ol"
            idx = 1
            for li in node.find_all("li", recursive=False):
                text = "".join(inline(c) for c in li.children)
                if is_ol:
                    out.append(f"{idx}. {text}")
                    idx += 1
                else:
                    out.append(f"- {text}")
            out.append("")
            return
        if name == "pre":
            code_node = node.find("code")
            code_text = code_node.get_text() if code_node else node.get_text()
            lang = ""
            cls = (code_node.get("class") if code_node else node.get("class")) or []
            for c in cls:
                s = str(c)
                if s.startswith("language-"):
                    lang = s.split("-", 1)[-1]
                    break
            out.append(f"```{lang}\n{code_text}\n```\n")
            return
        if name == "blockquote":
            lines = [l for l in txt(node).splitlines() if l.strip()]
            for l in lines:
                out.append("> " + l)
            out.append("")
            return
        if name == "table":
            rows = node.find_all("tr")
            if not rows:
                return
            headers = [h.get_text(strip=True) for h in (rows[0].find_all(["th", "td"]) or [])]
            if headers:
                out.append("|" + "|".join(headers) + "|")
                sep = "|" + "|".join(["---" for _ in headers]) + "|"
                out.append(sep)
            for tr in rows[1:]:
                cells = [td.get_text(strip=True) for td in tr.find_all("td")]
                if cells:
                    out.append("|" + "|".join(cells) + "|")
            out.append("")
            return
        if name == "div":
            for c in node.children:
                block(c)
            return
        segs = [inline(c) for c in node.children]
        if segs:
            out.append("".join(segs))
            out.append("")

    root = soup.body or soup
    for ch in getattr(root, "children", []):
        block(ch)
    return _normalize_newlines("\n".join(out)).strip()
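
# Fallback example (this path is used only when Docling's HTML pipeline raises):
#   _html_to_markdown("<h2>API</h2><p><b>POST</b> /md/convert</p>")
#   # -> "## API\n\n**POST** /md/convert"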


def _lower_html_table_tags(html: str) -> str:
    """
    @function _lower_html_table_tags
    @description Normalizes HTML table tags to lowercase
    @param html Input HTML string
    @return Normalized HTML string
    """
    if not html:
        return html
    tags = ["TABLE", "THEAD", "TBODY", "TFOOT", "TR", "TH", "TD"]
    out = html
    for t in tags:
        out = re.sub(r"</?" + t + r"\b", lambda m: m.group(0).lower(), out)
    out = re.sub(r">\s*\n+\s*", ">\n", out)
    return out
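
# e.g. _lower_html_table_tags("<TABLE><TR><TD>x</TD></TR></TABLE>")
#   -> "<table><tr><td>x</td></tr></table>"; non-table tags are left untouched.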


def _replace_admonitions(md: str) -> str:
    """
    @function _replace_admonitions
    @description Replaces ::: style admonitions with !!! style
    @param md Input markdown string
    @return Processed markdown string
    """
    if not md:
        return md
    lines = md.split("\n")
    out = []
    in_block = False
    for raw in lines:
        t = raw.strip()
        if t.startswith(":::"):
            if not in_block:
                name = t[3:].strip()
                if not name:
                    out.append("!!!")
                else:
                    out.append("!!! " + name)
                in_block = True
            else:
                out.append("!!!")
                in_block = False
            continue
        out.append(raw)
    return "\n".join(out)


def _enhance_codeblocks(md: str) -> str:
    if not md:
        return md
    lines = md.split("\n")
    res = []
    in_fence = False
    fence_lang = ""
    i = 0
    while i < len(lines):
        line = lines[i]
        t = line.strip()
        if t.startswith("```"):
            in_fence = not in_fence
            try:
                fence_lang = (t[3:] or "").strip() if in_fence else ""
            except Exception:
                fence_lang = ""
            res.append(line)
            i += 1
            continue
        if in_fence:
            res.append(line)
            i += 1
            continue
        if t.startswith("{") or t.startswith("["):
            buf = [line]
            j = i + 1
            closed = False
            depth = t.count("{") - t.count("}")
            while j < len(lines):
                buf.append(lines[j])
                s = lines[j].strip()
                depth += s.count("{") - s.count("}")
                if depth <= 0 and s.endswith("}"):
                    closed = True
                    break
                j += 1
            if closed and len(buf) >= 3:
                lang = "json"
                res.append("```" + lang)
                res.extend(buf)
                res.append("```")
                i = j + 1
                continue
        code_sig = (
            ("public static" in t) or ("private static" in t) or ("class " in t) or ("return " in t) or ("package " in t) or ("import " in t)
        )
        if code_sig:
            buf = [line]
            j = i + 1
            while j < len(lines):
                s = lines[j].strip()
                if not s:
                    break
                if s.startswith("# ") or s.startswith("## ") or s.startswith("### "):
                    break
                buf.append(lines[j])
                j += 1
            if len(buf) >= 3:
                res.append("```")
                res.extend(buf)
                res.append("```")
                i = j + 1
                continue
        res.append(line)
        i += 1
    return "\n".join(res)


class FormatConverter:
    """
    @class FormatConverter
    @description Unified converter class wrapping Docling and word2markdown
    """
    def __init__(self) -> None:
        self._docling = DocumentConverter()

    def convert(self, source: str, export: str = "markdown", engine: Optional[str] = None, mdx_safe_mode_enabled: bool = True) -> Tuple[str, str, Optional[str]]:
        """
        @function convert
        @description Convert a document source to specified format
        @param source Path or URL to source document
        @param export Output format (markdown, html, json, doctags)
        @param engine Optional engine override (word2markdown/docling)
        @param mdx_safe_mode_enabled Toggle safe mode for MDX
        @return Tuple of (encoding, content, artifacts_dir)
        """

        # Prefer custom word2markdown engine for DOC/DOCX when available
        auto_engine = None
        try:
            from pathlib import Path as _P
            suf = _P(source).suffix.lower()
            if not engine and suf in {".doc", ".docx"} and _W2M_AVAILABLE:
                auto_engine = "word2markdown"
        except Exception:
            auto_engine = None
        use_engine = (engine or auto_engine or "").lower()
        try:
            from urllib.parse import urlsplit
            path = source
            if _is_http(source):
                path = urlsplit(source).path or ""
            ext = Path(path).suffix.lower()
        except Exception:
            ext = Path(source).suffix.lower()
        if ext in {".txt"}:
            raw, ct = _read_bytes(source)
            text = _normalize_newlines(_decode_to_utf8(raw, ct))
            if export.lower() == "html":
                if _render_md_html is not None:
                    html = _render_md_html(text)
                else:
                    try:
                        import marko
                        html = marko.convert(text)
                    except Exception:
                        html = f"<pre>{text}</pre>"
                return "utf-8", _lower_html_table_tags(html), None
            md = _enhance_codeblocks(text)
            return "utf-8", md, None
        if ext in {".md"}:
            raw, ct = _read_bytes(source)
            text = _normalize_newlines(_decode_to_utf8(raw, ct))
            if export.lower() == "html":
                if _render_md_html is not None:
                    html = _render_md_html(text)
                else:
                    try:
                        import marko
                        html = marko.convert(text)
                    except Exception:
                        html = text
                return "utf-8", _lower_html_table_tags(html), None
            return "utf-8", text, None
        if ext in {".html", ".htm"}:
            try:
                conv = DocumentConverter(allowed_formats=[InputFormat.HTML])
                result = conv.convert(source)
                if export.lower() == "html":
                    html = result.document.export_to_html()
                    html = _lower_html_table_tags(html)
                    return "utf-8", html, None
                md = result.document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
                md = _replace_admonitions(md)
                md = _enhance_codeblocks(md)
                return "utf-8", md, None
            except Exception:
                raw, ct = _read_bytes(source)
                html_in = _normalize_newlines(_decode_to_utf8(raw, ct))
                if export.lower() == "html":
                    html = _normalize_html(html_in) if _normalize_html is not None else html_in
                    return "utf-8", _lower_html_table_tags(html), None
                md = _html_to_markdown(html_in)
                md = _replace_admonitions(md)
                md = _enhance_codeblocks(md)
                return "utf-8", md, None
        if use_engine in {"pandoc", "custom", "word2markdown"} and _W2M_AVAILABLE:
            enc, md = _w2m_convert_any(Path(source), mdx_safe_mode_enabled=mdx_safe_mode_enabled)
            md = _replace_admonitions(md)
            md = _enhance_codeblocks(md)
            return enc or "utf-8", md, None
        # Configure PDF pipeline to generate picture images into a per-call artifacts directory
        artifacts_dir = tempfile.mkdtemp(prefix="docling_artifacts_")
        pdf_opts = PdfPipelineOptions()
        pdf_opts.generate_picture_images = True
        pdf_opts.generate_page_images = True
        pdf_opts.images_scale = 2.0
        pdf_opts.do_code_enrichment = True
        pdf_opts.do_formula_enrichment = True
        self._docling = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_cls=StandardPdfPipeline,
                    pipeline_options=pdf_opts,
                )
            }
        )
        result = self._docling.convert(source)
        if export.lower() == "markdown":
            md = result.document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
            md = _replace_admonitions(md)
            md = _enhance_codeblocks(md)
            return "utf-8", md, artifacts_dir
        if export.lower() == "html":
            html = result.document.export_to_html()
            html = _lower_html_table_tags(html)
            return "utf-8", html, artifacts_dir
        if export.lower() == "json":
            js = result.document.export_to_json()
            return "utf-8", js, artifacts_dir
        if export.lower() == "doctags":
            dt = result.document.export_to_doctags()
            return "utf-8", dt, artifacts_dir
        raise RuntimeError("unsupported export")
429
docling/app/services/word2markdown.py
Normal file
@@ -0,0 +1,429 @@
from pathlib import Path
from typing import Tuple, List

from docx import Document
from docx.table import Table
from docx.text.paragraph import Paragraph
import re
import base64
import hashlib
import tempfile
import subprocess
from lxml import etree


def _iter_blocks(doc: Document):
    parent = doc
    parent_elm = parent.element.body
    for child in parent_elm.iterchildren():
        tag = child.tag.split('}')[-1]
        if tag == 'p':
            yield Paragraph(child, parent)
        elif tag == 'tbl':
            yield Table(child, parent)


def _cell_text(cell) -> str:
    parts = []
    for p in cell.paragraphs:
        t = p.text or ""
        parts.append(t)
    return "\n".join([s for s in parts if s is not None])


def _guess_lang(text: str) -> str:
    t = (text or "").strip()
    head = t[:512]
    if re.search(r"\b(package|import\s+java\.|public\s+class|public\s+static|private\s+static|@Override)\b", head):
        return "java"
    if re.search(r"\b(def\s+\w+\(|import\s+\w+|print\(|from\s+\w+\s+import)\b", head):
        return "python"
    if re.search(r"\b(function\s+\w+\(|console\.log|let\s+\w+|const\s+\w+|=>)\b", head):
        return "javascript"
    if re.search(r"^#include|\bint\s+main\s*\(\)", head):
        return "c"
    if re.search(r"\busing\s+namespace\b|\bstd::\b|\btemplate\b", head):
        return "cpp"
    if re.search(r"\b(SELECT|INSERT|UPDATE|DELETE|CREATE\s+TABLE|DROP\s+TABLE|ALTER\s+TABLE)\b", head, re.IGNORECASE):
        return "sql"
    if head.startswith("{") or head.startswith("["):
        return "json"
    if re.search(r"<html|<div|<span|<table|<code|<pre", head, re.IGNORECASE):
        return "html"
    if re.search(r"<\?xml|</?[A-Za-z0-9:_-]+>", head):
        return "xml"
    return ""


def _table_to_md(tbl: Table) -> str:
    rows = tbl.rows
    cols = tbl.columns
    if len(rows) == 1 and len(cols) == 1:
        txt = _cell_text(rows[0].cells[0]).strip()
        lang = _guess_lang(txt)
        return f"```{lang}\n{txt}\n```\n"

    def _cell_inline_md(doc: Document, paragraph: Paragraph) -> str:
        el = paragraph._element
        parts: List[str] = []
        try:
            for ch in el.iterchildren():
                tag = ch.tag.split('}')[-1]
                if tag == 'r':
                    for rc in ch.iterchildren():
                        rtag = rc.tag.split('}')[-1]
                        if rtag == 't':
                            s = rc.text or ''
                            if s:
                                parts.append(s)
                        elif rtag == 'br':
                            parts.append('\n')
                        elif rtag == 'drawing':
                            try:
                                for node in rc.iter():
                                    local = node.tag.split('}')[-1]
                                    rid = None
                                    if local == 'blip':
                                        rid = node.get(f"{{{NS['r']}}}embed") or node.get(f"{{{NS['r']}}}link")
                                    elif local == 'imagedata':
                                        rid = node.get(f"{{{NS['r']}}}id")
                                    if not rid:
                                        continue
                                    try:
                                        part = None
                                        rp = getattr(doc.part, 'related_parts', None)
                                        if isinstance(rp, dict) and rid in rp:
                                            part = rp.get(rid)
                                        if part is None:
                                            rels = getattr(doc.part, 'rels', None)
                                            if rels is not None and hasattr(rels, 'get'):
                                                rel = rels.get(rid)
                                                part = getattr(rel, 'target_part', None)
                                        if part is None:
                                            rel = getattr(doc.part, '_rels', {}).get(rid)
                                            part = getattr(rel, 'target_part', None)
                                        ct = getattr(part, 'content_type', '') if part is not None else ''
                                        data = part.blob if part is not None and hasattr(part, 'blob') else None
                                        if data:
                                            b64 = base64.b64encode(data).decode('ascii')
                                            parts.append(f"![image](data:{ct};base64,{b64})")
                                    except Exception:
                                        pass
                            except Exception:
                                pass
        except Exception:
            pass
        return ''.join(parts)

    out = []
    # python-docx table parent is the Document
    doc = getattr(tbl, '_parent', None) or getattr(tbl, 'part', None)
    for r_i, r in enumerate(rows):
        vals = []
        for c in r.cells:
            segs: List[str] = []
            for p in c.paragraphs:
                s = _cell_inline_md(doc, p)
                if s:
                    segs.append(s)
            cell_text = '<br>'.join([x for x in segs if x is not None])
            vals.append((cell_text or '').replace('|', '\\|').strip())
        line = "| " + " | ".join(vals) + " |"
        out.append(line)
        if r_i == 0:
            sep = "| " + " | ".join(["---" for _ in vals]) + " |"
            out.append(sep)
    return "\n".join(out) + "\n"


def _paragraph_to_md(p: Paragraph) -> str:
    return (p.text or "").strip() + "\n\n"


def convert_any(path: Path, mdx_safe_mode_enabled: bool = True) -> Tuple[str, str]:
    ext = path.suffix.lower()
    use_path = path
    if ext == ".doc":
        use_path = _convert_doc_to_docx_cross_platform(path)
    if use_path.suffix.lower() not in {".docx"}:
        raise RuntimeError("unsupported input for word2markdown")
    doc = Document(str(use_path))
    out: List[str] = []
    in_code = False
    code_lines: List[str] = []
    lang_hint: str = ''
    for blk in _iter_blocks(doc):
        if isinstance(blk, Table):
            out.append(_table_to_md(blk))
        elif isinstance(blk, Paragraph):
            tboxes = _paragraph_textboxes(blk)
            for tb in tboxes:
                if tb.strip():
                    out.append(_md_code_block(tb.strip()))
            sdts = _paragraph_sdts(blk)
            for s in sdts:
                if s.strip():
                    out.append(_md_code_block(s.strip()))
            btx = _paragraph_bordered_text(blk)
            for s in btx:
                if s.strip():
                    out.append(_md_code_block(s.strip()))
            ftx = _paragraph_framed(blk)
            for s in ftx:
                if s.strip():
                    out.append(_md_code_block(s.strip()))
            raw = (blk.text or "")
            sraw = raw.strip()
            if _looks_like_code_paragraph(sraw) or (in_code and sraw == ""):
                if not in_code:
                    in_code = True
                    lang_hint = _guess_lang(sraw)
                    code_lines = []
                code_lines.append(raw)
                continue
            if in_code and code_lines:
                text = "\n".join(code_lines)
                use_lang = lang_hint or _guess_lang(text)
                out.append(f"```{use_lang}\n{text}\n```\n")
                in_code = False
                code_lines = []
                lang_hint = ''

            def _paragraph_with_images(doc: Document, p: Paragraph) -> str:
                el = p._element
                parts: List[str] = []
                try:
                    for ch in el.iterchildren():
                        tag = ch.tag.split('}')[-1]
                        if tag == 'r':
                            for rc in ch.iterchildren():
                                rtag = rc.tag.split('}')[-1]
                                if rtag == 't':
                                    s = rc.text or ''
                                    if s:
                                        parts.append(s)
                                elif rtag == 'br':
                                    parts.append('\n')
                                elif rtag == 'drawing':
                                    for node in rc.iter():
                                        local = node.tag.split('}')[-1]
                                        rid = None
                                        if local == 'blip':
                                            rid = node.get(f"{{{NS['r']}}}embed") or node.get(f"{{{NS['r']}}}link")
                                        elif local == 'imagedata':
                                            rid = node.get(f"{{{NS['r']}}}id")
                                        if not rid:
                                            continue
                                        try:
                                            part = None
                                            rp = getattr(doc.part, 'related_parts', None)
                                            if isinstance(rp, dict) and rid in rp:
                                                part = rp.get(rid)
                                            if part is None:
                                                rels = getattr(doc.part, 'rels', None)
                                                if rels is not None and hasattr(rels, 'get'):
                                                    rel = rels.get(rid)
                                                    part = getattr(rel, 'target_part', None)
                                            if part is None:
                                                rel = getattr(doc.part, '_rels', {}).get(rid)
                                                part = getattr(rel, 'target_part', None)
                                            ct = getattr(part, 'content_type', '') if part is not None else ''
                                            data = part.blob if part is not None and hasattr(part, 'blob') else None
                                            if data:
                                                b64 = base64.b64encode(data).decode('ascii')
                                                parts.append(f"![image](data:{ct};base64,{b64})")
                                        except Exception:
                                            pass
                except Exception:
                    pass
                s = ''.join(parts).strip()
                return (s + '\n\n') if s else ''

            txt = _paragraph_with_images(doc, blk)
            if txt.strip():
                out.append(txt)
    if in_code and code_lines:
        text = "\n".join(code_lines)
        use_lang = lang_hint or _guess_lang(text)
        out.append(f"```{use_lang}\n{text}\n```\n")
    try:
        boxes = _doclevel_textboxes(doc)
        existing_texts = set()
        try:
            for seg in out:
                if isinstance(seg, str):
                    ss = seg.strip()
                    if ss.startswith("```"):
                        m = re.search(r"^```[\w-]*\n([\s\S]*?)\n```\s*$", ss)
                        if m:
                            existing_texts.add(m.group(1).strip())
                        continue
                    existing_texts.add(ss)
        except Exception:
            pass
        for tb in boxes:
            s = (tb or '').strip()
            if not s:
                continue
            if s in existing_texts:
                continue
            out.append(_md_code_block(s))
            existing_texts.add(s)
    except Exception:
        pass
    md = "".join(out)
    return "utf-8", md


NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "wps": "http://schemas.microsoft.com/office/word/2010/wordprocessingShape",
    "v": "urn:schemas-microsoft-com:vml",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
    "pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
}


def _paragraph_textboxes(p: Paragraph) -> List[str]:
    try:
        el = p._element
        texts: List[str] = []
        for tbox in el.xpath('.//wps:txbx/w:txbxContent', namespaces=NS):
            paras = tbox.xpath('.//w:p', namespaces=NS)
            buf: List[str] = []
            for w_p in paras:
                ts = w_p.xpath('.//w:t', namespaces=NS)
                s = ''.join([t.text or '' for t in ts]).strip()
                if s:
                    buf.append(s)
            if buf:
                texts.append('\n'.join(buf))
        for tbox in el.xpath('.//v:textbox/w:txbxContent', namespaces=NS):
            paras = tbox.xpath('.//w:p', namespaces=NS)
            buf: List[str] = []
            for w_p in paras:
                ts = w_p.xpath('.//w:t', namespaces=NS)
                s = ''.join([t.text or '' for t in ts]).strip()
                if s:
                    buf.append(s)
            if buf:
                texts.append('\n'.join(buf))
        return texts
    except Exception:
        return []


def _paragraph_sdts(p: Paragraph) -> List[str]:
    try:
        el = p._element
        texts: List[str] = []
        for sdt in el.xpath('.//w:sdt/w:sdtContent', namespaces=NS):
            paras = sdt.xpath('.//w:p', namespaces=NS)
            buf: List[str] = []
            for w_p in paras:
                ts = w_p.xpath('.//w:t', namespaces=NS)
                s = ''.join([t.text or '' for t in ts]).strip()
                if s:
                    buf.append(s)
            if buf:
                texts.append('\n'.join(buf))
        return texts
    except Exception:
        return []


def _paragraph_bordered_text(p: Paragraph) -> List[str]:
    try:
        el = p._element
        has_border = bool(el.xpath('./w:pPr/w:pBdr', namespaces=NS))
        t = (p.text or '').strip()
        if has_border and t:
            return [t]
    except Exception:
        pass
    return []


def _paragraph_framed(p: Paragraph) -> List[str]:
    try:
        el = p._element
        has_frame = bool(el.xpath('./w:pPr/w:framePr', namespaces=NS))
        t = (p.text or '').strip()
        if has_frame and t:
            return [t]
    except Exception:
        pass
    return []


def _md_code_block(text: str) -> str:
    lang = _guess_lang(text)
    return f"```{lang}\n{text}\n```\n"


def _looks_like_code_paragraph(t: str) -> bool:
    s = (t or '').strip()
    if not s:
        return False
    if s.startswith('{') or s.startswith('[') or s.endswith('}'):
        return True
    # Indentation must be checked on the unstripped text; s has already lost it.
    if (t or '').startswith(' ') or (t or '').startswith('\t'):
        return True
    if ';' in s or '{' in s or '}' in s:
        return True
    keywords = ['public static', 'private static', 'class ', 'return ', 'import ', 'package ', 'byte[]', 'String ', 'Cipher', 'KeyFactory']
    return any(k in s for k in keywords)
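
# e.g. _looks_like_code_paragraph("byte[] key = decode(b64);")  # True (';' plus keyword)
#      _looks_like_code_paragraph("普通的正文段落")               # False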


def _doclevel_textboxes(doc: Document) -> List[str]:
    texts: List[str] = []
    try:
        el = doc.element.body
        for tbox in el.xpath('.//wps:txbx/w:txbxContent', namespaces=NS):
            paras = tbox.xpath('.//w:p', namespaces=NS)
            buf: List[str] = []
            for w_p in paras:
                ts = w_p.xpath('.//w:t', namespaces=NS)
                s = ''.join([(t.text or '') for t in ts]).strip()
                if s:
                    buf.append(s)
            if buf:
                texts.append('\n'.join(buf))
        for tbox in el.xpath('.//v:textbox/w:txbxContent', namespaces=NS):
            paras = tbox.xpath('.//w:p', namespaces=NS)
            buf: List[str] = []
            for w_p in paras:
                ts = w_p.xpath('.//w:t', namespaces=NS)
                s = ''.join([(t.text or '') for t in ts]).strip()
                if s:
                    buf.append(s)
            if buf:
                texts.append('\n'.join(buf))
    except Exception:
        pass
    return texts


def _convert_doc_to_docx_cross_platform(path: Path) -> Path:
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".docx") as tmp:
            tmp.close()
            subprocess.run(["textutil", "-convert", "docx", str(path), "-output", tmp.name], check=True)
            return Path(tmp.name)
    except Exception:
        pass
    try:
        outdir = Path(tempfile.mkdtemp(prefix="doc2docx_"))
        subprocess.run(["soffice", "--headless", "--convert-to", "docx", "--outdir", str(outdir), str(path)], check=True)
        candidate = outdir / (path.stem + ".docx")
        if candidate.exists():
            return candidate
    except Exception:
        pass
    try:
        out = Path(tempfile.NamedTemporaryFile(delete=False, suffix=".docx").name)
        subprocess.run(["unoconv", "-f", "docx", "-o", str(out), str(path)], check=True)
        if out.exists():
            return out
    except Exception:
        pass
    raise RuntimeError("doc to docx conversion failed; please install 'soffice' or 'unoconv' or convert manually")
80
docling/app/tests/run_batch_upload_debug.py
Normal file
@@ -0,0 +1,80 @@
import io
import os
import zipfile
from pathlib import Path
from fastapi.testclient import TestClient
import sys
from pathlib import Path as _Path
base = _Path(__file__).resolve().parents[2]
sys.path.insert(0, str(base))
sys.path.insert(0, str(base / "docling"))
import app.server as server


class FakeMinio:
    def __init__(self):
        self.objs = {}

    def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO, length: int, content_type: str):
        self.objs[(bucket_name, object_name)] = data.read(length)

    def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}"

    def presigned_get_object(self, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}"


def setup():
    server.RUNTIME_CONFIG["minio"].update({
        "endpoint": "127.0.0.1:9000",
        "public": "http://127.0.0.1:9000",
        "access": "ak",
        "secret": "sk",
        "bucket": "test",
        "secure": "false",
        "prefix": "assets",
        "store_final": "true",
        "public_read": "true",
    })
    fake = FakeMinio()
    def _cur():
        return fake, "test", "http://127.0.0.1:9000", "assets"
    server._minio_current = _cur  # type: ignore
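    # The FakeMinio + server._minio_current override above keeps this debug run
    # fully offline: uploads land in an in-memory dict and presigned URLs are
    # synthesized strings.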


def main():
    setup()
    app = server.app
    c = TestClient(app)
    tmp = Path("/tmp/run_batch_upload_debug")
    tmp.mkdir(parents=True, exist_ok=True)

    zpath = tmp / "pkg.zip"
    md_dir = tmp / "docs"
    img_dir = md_dir / "images"
    img_dir.mkdir(parents=True, exist_ok=True)
    (img_dir / "p.png").write_bytes(b"PNG")
    (md_dir / "a.md").write_text("![p](images/p.png)", "utf-8")

    with zipfile.ZipFile(str(zpath), "w") as zf:
        zf.write(str(md_dir / "a.md"), arcname="a.md")
        zf.write(str(img_dir / "p.png"), arcname="images/p.png")

    with open(zpath, "rb") as fp:
        files = {"file": ("pkg.zip", fp.read())}
        r1 = c.post("/api/archive/stage", files=files)
    print("stage status:", r1.status_code, r1.json())
    sid = r1.json()["data"]["id"]

    r2 = c.post("/api/archive/process", data={"id": sid, "prefix": "assets", "versionId": "1001"})
    print("process status:", r2.status_code, r2.json())

    list_text = str(md_dir / "a.md")
    lf = io.BytesIO(list_text.encode("utf-8"))
    r3 = c.post("/api/upload-list", files={"list_file": ("list.txt", lf.getvalue())}, data={"prefix": "assets", "versionId": "1002"})
    print("upload-list status:", r3.status_code, r3.json())


if __name__ == "__main__":
    main()
75
docling/app/tests/run_convert_folder_debug.py
Normal file
@@ -0,0 +1,75 @@
import io
import os
from pathlib import Path
from fastapi.testclient import TestClient
import sys
from pathlib import Path as _Path
base = _Path(__file__).resolve().parents[2]
sys.path.insert(0, str(base))
sys.path.insert(0, str(base / "docling"))
import app.server as server


class FakeMinio:
    def __init__(self):
        self.objs = {}

    def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO, length: int, content_type: str):
        self.objs[(bucket_name, object_name)] = data.read(length)

    def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}"

    def presigned_get_object(self, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}"


def setup():
    server.RUNTIME_CONFIG["minio"].update({
        "endpoint": "127.0.0.1:9000",
        "public": "http://127.0.0.1:9000",
        "access": "ak",
        "secret": "sk",
        "bucket": "test",
        "secure": "false",
        "prefix": "assets",
        "store_final": "true",
        "public_read": "true",
    })
    fake = FakeMinio()
    def _cur():
        return fake, "test", "http://127.0.0.1:9000", "assets"
    server._minio_current = _cur  # type: ignore


def main():
    setup()
    app = server.app
    c = TestClient(app)
    tmp = Path("/tmp/run_convert_folder_debug")
    if tmp.exists():
        for p in tmp.rglob("*"):
            try:
                p.unlink()
            except Exception:
                pass
        try:
            tmp.rmdir()
        except Exception:
            pass
    tmp.mkdir(parents=True, exist_ok=True)

    root = tmp / "数+产品手册-MD源文件"
    sub = root / "DMDRS_DRS_Language_User_Manual"
    img = sub / "images"
    img.mkdir(parents=True, exist_ok=True)
    (img / "p.png").write_bytes(b"PNG")
    (sub / "a.md").write_text("# Title\n\n![p](images/p.png)", "utf-8")

    r = c.post("/md/convert-folder", data={"folder_path": str(root), "prefix": "assets"})
    print("convert-folder:", r.status_code)
    print(r.json())


if __name__ == "__main__":
    main()
97
docling/app/tests/run_edge_cases_debug.py
Normal file
@@ -0,0 +1,97 @@
import io
import zipfile
from pathlib import Path
from fastapi.testclient import TestClient
import sys
from pathlib import Path as _Path
base = _Path(__file__).resolve().parents[2]
sys.path.insert(0, str(base))
sys.path.insert(0, str(base / "docling"))
import app.server as server


class FakeMinio:
    def __init__(self):
        self.objs = {}

    def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO, length: int, content_type: str):
        self.objs[(bucket_name, object_name)] = data.read(length)

    def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}"

    def presigned_get_object(self, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}"


def setup():
    server.RUNTIME_CONFIG["minio"].update({
        "endpoint": "127.0.0.1:9000",
        "public": "http://127.0.0.1:9000",
        "access": "ak",
        "secret": "sk",
        "bucket": "test",
        "secure": "false",
        "prefix": "assets",
        "store_final": "true",
        "public_read": "true",
    })
    fake = FakeMinio()
    def _cur():
        return fake, "test", "http://127.0.0.1:9000", "assets"
    server._minio_current = _cur  # type: ignore


def run():
    setup()
    app = server.app
    c = TestClient(app)

    r = c.post("/api/archive/process", data={"id": "missing"})
    print("invalid-id:", r.status_code, r.json())

    tmp = Path("/tmp/run_edge_cases_debug")
    tmp.mkdir(parents=True, exist_ok=True)
    rar_path = tmp / "pkg.rar"
    rar_path.write_bytes(b"RAR")
    with open(rar_path, "rb") as fp:
        files = {"file": ("pkg.rar", fp.read())}
        r1 = c.post("/api/archive/stage", files=files)
    sid = r1.json()["data"]["id"]
    r2 = c.post("/api/archive/process", data={"id": sid})
    print("rar-process:", r2.status_code, r2.json())
    r3 = c.post("/api/archive/process", data={"id": sid})
    print("rar-reprocess:", r3.status_code, r3.json())

    root = tmp / "listcase2"
    root.mkdir(parents=True, exist_ok=True)
    (root / "img.png").write_bytes(b"PNG")
    (root / "a.md").write_text("", "utf-8")
    (root / "b.txt").write_text("", "utf-8")
    lines = ["", "# comment", "http://example.com/x.md", str(root / "a.md"), str(root / "b.txt")]
    data_bytes = "\n".join(lines).encode("utf-8")
    files = {"list_file": ("list.txt", data_bytes)}
    r4 = c.post("/api/upload-list", files=files, data={"prefix": "assets", "versionId": "1005"})
    print("upload-list:", r4.status_code, r4.json())

    zpath = tmp / "dup.zip"
    base = tmp / "src"
    sub = base / "sub"
    sub.mkdir(parents=True, exist_ok=True)
    (base / "a.md").write_text("", "utf-8")
    (base / "img.png").write_bytes(b"PNG")
    (sub / "a.md").write_text("", "utf-8")
    with zipfile.ZipFile(str(zpath), "w") as zf:
        zf.write(str(base / "a.md"), arcname="a.md")
        zf.write(str(base / "img.png"), arcname="img.png")
        zf.write(str(sub / "a.md"), arcname="sub/a.md")
    with open(zpath, "rb") as fp:
        files = {"file": ("dup.zip", fp.read())}
        r5 = c.post("/api/archive/stage", files=files)
    sid2 = r5.json()["data"]["id"]
    r6 = c.post("/api/archive/process", data={"id": sid2, "prefix": "assets", "versionId": "1006"})
    print("archive-dup:", r6.status_code, r6.json())


if __name__ == "__main__":
    run()
77
docling/app/tests/run_minio_object_debug.py
Normal file
@@ -0,0 +1,77 @@
from fastapi.testclient import TestClient
import sys
from pathlib import Path as _Path
base = _Path(__file__).resolve().parents[2]
sys.path.insert(0, str(base))
sys.path.insert(0, str(base / "docling"))
import app.server as server


class _Resp:
    def __init__(self, data: bytes):
        self._data = data
    def read(self) -> bytes:
        return self._data
    def close(self):
        pass


class FakeMinio:
    def __init__(self):
        self.store = {
            ("doctest", "assets/rewritten/x.md"): (b"# Title\n\nhello", "text/markdown; charset=utf-8")
        }
    def stat_object(self, bucket: str, object_name: str):
        class S:
            def __init__(self, ct: str):
                self.content_type = ct
        k = (bucket, object_name)
        if k in self.store:
            return S(self.store[k][1])
        return S("application/octet-stream")
    def get_object(self, bucket: str, object_name: str):
        k = (bucket, object_name)
        if k in self.store:
            return _Resp(self.store[k][0])
        return _Resp(b"")


def setup():
    server.RUNTIME_CONFIG["minio"].update({
        "endpoint": "127.0.0.1:9000",
        "public": "http://127.0.0.1:9000",
        "access": "ak",
        "secret": "sk",
        "bucket": "doctest",
        "secure": "false",
        "prefix": "assets",
        "store_final": "true",
        "public_read": "true",
    })
    fake = FakeMinio()
    def _cur():
        return fake, "doctest", "http://127.0.0.1:9000", "assets"
    server._minio_current = _cur  # type: ignore


def run():
    setup()
    app = server.app
    c = TestClient(app)
    r = c.get("/minio/object", params={"bucket": "doctest", "object": "assets/rewritten/x.md"})
    print("status:", r.status_code)
    print("ct:", r.headers.get("Content-Type"))
    print(r.text)

    import urllib.parse as _u
    enc = _u.quote("assets/rewritten/数字+产品手册-MD源文件/x.md")
    cur_client, _, _, _ = server._minio_current()  # type: ignore
    cur_client.store[("doctest", "assets/rewritten/数字+产品手册-MD源文件/x.md")] = ("hello 中文+plus".encode("utf-8"), "text/markdown; charset=utf-8")
    r2 = c.get("/minio/object", params={"bucket": "doctest", "object": enc})
    print("status2:", r2.status_code)
    print("ct2:", r2.headers.get("Content-Type"))
    print(r2.text)


if __name__ == "__main__":
    run()
50
docling/app/tests/run_minio_presign_debug.py
Normal file
@@ -0,0 +1,50 @@
import io
from fastapi.testclient import TestClient
import sys
from pathlib import Path as _Path
base = _Path(__file__).resolve().parents[2]
sys.path.insert(0, str(base))
sys.path.insert(0, str(base / "docling"))
import app.server as server


class FakeMinio:
    def __init__(self):
        pass
    def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}?e={expires}"
    def presigned_get_object(self, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}?e={expires}"


def setup():
    server.RUNTIME_CONFIG["minio"].update({
        "endpoint": "127.0.0.1:9000",
        "public": "http://127.0.0.1:9000",
        "access": "ak",
        "secret": "sk",
        "bucket": "doctest",
        "secure": "false",
        "prefix": "assets",
        "store_final": "true",
        "public_read": "true",
    })
    fake = FakeMinio()
    def _cur():
        return fake, "doctest", "http://127.0.0.1:9000", "assets"
    server._minio_current = _cur  # type: ignore


def run():
    setup()
    app = server.app
    c = TestClient(app)
    url = "http://127.0.0.1:9000/doctest/assets/rewritten/%E6%B5%8B%E8%AF%95/a.md"
    r = c.post("/minio/presign", data={"url": url, "expires": 7200})
    print("status:", r.status_code)
    print(r.json())


if __name__ == "__main__":
    run()
74
docling/app/tests/run_slash_path_debug.py
Normal file
@@ -0,0 +1,74 @@
import io
import zipfile
from pathlib import Path
from fastapi.testclient import TestClient
import sys
from pathlib import Path as _Path
base = _Path(__file__).resolve().parents[2]
sys.path.insert(0, str(base))
sys.path.insert(0, str(base / "docling"))
import app.server as server


class FakeMinio:
    def __init__(self):
        self.objs = {}

    def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO, length: int, content_type: str):
        self.objs[(bucket_name, object_name)] = data.read(length)

    def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}"

    def presigned_get_object(self, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}"


def setup():
    server.RUNTIME_CONFIG["minio"].update({
        "endpoint": "127.0.0.1:9000",
        "public": "http://127.0.0.1:9000",
        "access": "ak",
        "secret": "sk",
        "bucket": "test",
        "secure": "false",
        "prefix": "assets",
        "store_final": "true",
        "public_read": "true",
    })
    fake = FakeMinio()
    def _cur():
        return fake, "test", "http://127.0.0.1:9000", "assets"
    server._minio_current = _cur  # type: ignore


def main():
    setup()
    app = server.app
    c = TestClient(app)
    tmp = Path("/tmp/run_slash_path_debug")
    tmp.mkdir(parents=True, exist_ok=True)

    zpath = tmp / "pkg.zip"
    md_dir = tmp / "docs"
    img_dir = md_dir / "images"
    img_dir.mkdir(parents=True, exist_ok=True)
    (img_dir / "p.png").write_bytes(b"PNG")
    (md_dir / "a.md").write_text("![p](images/p.png)", "utf-8")

    with zipfile.ZipFile(str(zpath), "w") as zf:
        zf.write(str(md_dir / "a.md"), arcname="a.md")
        zf.write(str(img_dir / "p.png"), arcname="images/p.png")

    with open(zpath, "rb") as fp:
        files = {"file": ("pkg.zip", fp.read())}
        r1 = c.post("/api/archive/stage", files=files)
    sid = r1.json()["data"]["id"]

    r2 = c.post("/api/archive/process", data={"id": sid, "prefix": "assets", "versionId": "1007"})
    print("process:", r2.status_code)
    print(r2.json())


if __name__ == "__main__":
    main()
29
docling/app/tests/test_api_convert.py
Normal file
@@ -0,0 +1,29 @@
import unittest
from fastapi.testclient import TestClient
from pathlib import Path
import io

from app.server import app


class ApiConvertTest(unittest.TestCase):
    def setUp(self):
        self.client = TestClient(app)

    def test_api_convert_markdown_file(self):
        tmpdir = Path("./scratch_unittest")
        tmpdir.mkdir(exist_ok=True)
        p = tmpdir / "sample.md"
        p.write_text("# Title\n\n::: note\nBody\n:::\n", "utf-8")
        with open(p, "rb") as f:
            files = {"file": (p.name, io.BytesIO(f.read()), "text/markdown")}
            r = self.client.post("/api/convert", files=files, data={"export": "markdown"})
        self.assertEqual(r.status_code, 200)
        j = r.json()
        self.assertEqual(j.get("code"), 0)
        self.assertIsInstance(j.get("data", {}).get("content"), str)
        self.assertIn("!!! note", j["data"]["content"])


if __name__ == "__main__":
    unittest.main()
113
docling/app/tests/test_batch_upload_edge_cases.py
Normal file
@@ -0,0 +1,113 @@
import io
import zipfile
from pathlib import Path
from fastapi.testclient import TestClient

import app.server as server


class FakeMinio:
    def __init__(self):
        self.objs = {}

    def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO, length: int, content_type: str):
        self.objs[(bucket_name, object_name)] = data.read(length)

    def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}"

    def presigned_get_object(self, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}"


def setup_module(module=None):
    server.RUNTIME_CONFIG["minio"].update({
        "endpoint": "127.0.0.1:9000",
        "public": "http://127.0.0.1:9000",
        "access": "ak",
        "secret": "sk",
        "bucket": "test",
        "secure": "false",
        "prefix": "assets",
        "store_final": "true",
        "public_read": "true",
    })
    fake = FakeMinio()
    def _cur():
        return fake, "test", "http://127.0.0.1:9000", "assets"
    server._minio_current = _cur  # type: ignore


def test_process_invalid_id():
    app = server.app
    c = TestClient(app)
    r = c.post("/api/archive/process", data={"id": "missing"})
    assert r.status_code == 200
    j = r.json()
    assert j["code"] != 0


def test_stage_unsupported_format_and_cleanup(tmp_path: Path):
    app = server.app
    c = TestClient(app)
    rar_path = tmp_path / "pkg.rar"
    rar_path.write_bytes(b"RAR")
    with open(rar_path, "rb") as fp:
        files = {"file": ("pkg.rar", fp.read())}
        r1 = c.post("/api/archive/stage", files=files)
    assert r1.status_code == 200
    sid = r1.json()["data"]["id"]
    r2 = c.post("/api/archive/process", data={"id": sid})
    assert r2.status_code == 200
    j2 = r2.json()
    assert j2["code"] != 0
    r3 = c.post("/api/archive/process", data={"id": sid})
    assert r3.status_code == 200
    j3 = r3.json()
    assert j3["code"] != 0


def test_upload_list_empty_lines_comments_and_urls(tmp_path: Path):
    app = server.app
    c = TestClient(app)
    root = tmp_path / "listcase2"
    root.mkdir(parents=True, exist_ok=True)
    (root / "img.png").write_bytes(b"PNG")
    (root / "a.md").write_text("", "utf-8")
    (root / "b.txt").write_text("", "utf-8")
    lines = ["", "# comment", "http://example.com/x.md", str(root / "a.md"), str(root / "b.txt")]
    data_bytes = "\n".join(lines).encode("utf-8")
    files = {"list_file": ("list.txt", data_bytes)}
    r = c.post("/api/upload-list", files=files, data={"prefix": "assets", "versionId": "1005"})
    assert r.status_code == 200
    j = r.json()
    assert j["code"] == 0
    assert j["data"]["count"] >= 2


def test_archive_duplicate_filenames_tree(tmp_path: Path):
    app = server.app
    c = TestClient(app)
    zpath = tmp_path / "dup.zip"
    base = tmp_path / "src"
    sub = base / "sub"
    sub.mkdir(parents=True, exist_ok=True)
    (base / "a.md").write_text("", "utf-8")
    (base / "img.png").write_bytes(b"PNG")
    (sub / "a.md").write_text("", "utf-8")
    with zipfile.ZipFile(str(zpath), "w") as zf:
        zf.write(str(base / "a.md"), arcname="a.md")
        zf.write(str(base / "img.png"), arcname="img.png")
        zf.write(str(sub / "a.md"), arcname="sub/a.md")
    with open(zpath, "rb") as fp:
        files = {"file": ("dup.zip", fp.read())}
        r1 = c.post("/api/archive/stage", files=files)
    assert r1.status_code == 200
    sid = r1.json()["data"]["id"]
    r2 = c.post("/api/archive/process", data={"id": sid, "prefix": "assets", "versionId": "1006"})
    assert r2.status_code == 200
    j = r2.json()
    assert j["code"] == 0
    tree = j["data"]["import"]["tree"]
    names = [n["name"] for n in tree]
    assert "sub" in names or any((isinstance(n, dict) and n.get("type") == "FOLDER" and n.get("name") == "sub") for n in tree)
185
docling/app/tests/test_batch_upload_endpoints.py
Normal file
@@ -0,0 +1,185 @@
import io
import os
import zipfile
from pathlib import Path
from fastapi.testclient import TestClient

import app.server as server


class FakeMinio:
    def __init__(self):
        self.objs = {}

    def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO, length: int, content_type: str):
        self.objs[(bucket_name, object_name)] = data.read(length)

    def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}"

    def presigned_get_object(self, bucket: str, obj: str, expires: int):
        return f"http://minio.test/presigned/{bucket}/{obj}"


def setup_module(module=None):
    server.RUNTIME_CONFIG["minio"].update({
        "endpoint": "127.0.0.1:9000",
        "public": "http://127.0.0.1:9000",
        "access": "ak",
        "secret": "sk",
        "bucket": "test",
        "secure": "false",
        "prefix": "assets",
        "store_final": "true",
        "public_read": "true",
    })

    fake = FakeMinio()

    def _cur_cfg(_cfg):
        return fake, "test", "http://127.0.0.1:9000", "assets"
    server.minio_current = _cur_cfg  # type: ignore
    try:
        server._minio_current = lambda: _cur_cfg(None)  # type: ignore
    except Exception:
        pass


def test_archive_stage_and_process(tmp_path: Path):
    app = server.app
    c = TestClient(app)

    zpath = tmp_path / "pkg.zip"
    md_dir = tmp_path / "docs"
    img_dir = md_dir / "images"
    img_dir.mkdir(parents=True, exist_ok=True)
    (img_dir / "p.png").write_bytes(b"PNG")
    (md_dir / "a.md").write_text("![p](images/p.png)", "utf-8")

    with zipfile.ZipFile(str(zpath), "w") as zf:
        zf.write(str(md_dir / "a.md"), arcname="a.md")
        zf.write(str(img_dir / "p.png"), arcname="images/p.png")

    with open(zpath, "rb") as fp:
        files = {"file": ("pkg.zip", fp.read())}
        r1 = c.post("/api/archive/stage", files=files)
    assert r1.status_code == 200
    j1 = r1.json()
    assert j1["code"] == 0 and j1["data"]["id"]
    sid = j1["data"]["id"]

    r2 = c.post("/api/archive/process", data={"id": sid, "prefix": "assets", "versionId": "1001"})
    assert r2.status_code == 200
    j2 = r2.json()
    assert j2["code"] == 0
    assert j2["data"]["count"] >= 1
    assert "import" in j2["data"]


def test_upload_list(tmp_path: Path):
    app = server.app
    c = TestClient(app)

    root = tmp_path / "listcase"
    root.mkdir(parents=True, exist_ok=True)
    (root / "img.png").write_bytes(b"PNG")
    (root / "b.md").write_text("", "utf-8")

    list_text = str(root / "b.md")
    lf = io.BytesIO(list_text.encode("utf-8"))

    files = {"list_file": ("list.txt", lf.getvalue())}
    r = c.post("/api/upload-list", files=files, data={"prefix": "assets", "versionId": "1002"})
    assert r.status_code == 200
    j = r.json()
    assert j["code"] == 0
    assert j["data"]["count"] >= 1
    assert "import" in j["data"]


def test_archive_process_html_conversion(tmp_path: Path):
    app = server.app
    c = TestClient(app)

    zpath = tmp_path / "web.zip"
    root = tmp_path / "web"
    static = root / "static"
    static.mkdir(parents=True, exist_ok=True)
    (static / "pic.png").write_bytes(b"PNG")

    (root / "index.html").write_text("<html><body><h1>T</h1><img src='static/pic.png'/></body></html>", "utf-8")
    pages = root / "pages"
    pages.mkdir(parents=True, exist_ok=True)
    (pages / "a.html").write_text("<img src='../static/pic.png'>", "utf-8")

    with zipfile.ZipFile(str(zpath), "w") as zf:
        for p in root.rglob("*"):
            if p.is_file():
                zf.write(str(p), arcname=p.relative_to(root).as_posix())

    with open(zpath, "rb") as fp:
        files = {"file": ("web.zip", fp.read())}
        r1 = c.post("/api/archive/stage", files=files)
    assert r1.status_code == 200
    sid = r1.json()["data"]["id"]

    r2 = c.post("/api/archive/process", data={"id": sid, "prefix": "assets", "versionId": "1003"})
    assert r2.status_code == 200
    j = r2.json()
    assert j["code"] == 0

    files_list = j["data"]["files"]
    names = {Path(str(f.get("source") or "")).name for f in files_list}
    assert "index.md" in names
    assert "a.md" in names
    for f in files_list:
        n = Path(str(f.get("source") or "")).name
        if n in {"index.md", "a.md"}:
            assert f.get("minio_url")
            assert str(f.get("object_name") or "").startswith("assets/rewritten/")

    imp = j["data"]["import"]
    nodes = []
    def walk(children):
        for n in children:
            if n.get("type") == "FILE":
                nodes.append(n.get("name"))
            elif n.get("type") == "FOLDER":
                walk(n.get("children", []))
    walk(imp["tree"])
    assert "index" in nodes
    assert "a" in nodes


def test_archive_process_html_abs_uppercase(tmp_path: Path):
    app = server.app
    c = TestClient(app)

    zpath = tmp_path / "web2.zip"
    root = tmp_path / "web2"
    (root / "static").mkdir(parents=True, exist_ok=True)
    (root / "static" / "p.png").write_bytes(b"PNG")

    (root / "INDEX.HTML").write_text("<img src='/static/p.png'>", "utf-8")
    (root / "pages").mkdir(parents=True, exist_ok=True)
    (root / "pages" / "A.HTM").write_text("<img src='/static/p.png'>", "utf-8")

    with zipfile.ZipFile(str(zpath), "w") as zf:
        for p in root.rglob("*"):
            if p.is_file():
                zf.write(str(p), arcname=p.relative_to(root).as_posix())

    with open(zpath, "rb") as fp:
        files = {"file": ("web2.zip", fp.read())}
        r1 = c.post("/api/archive/stage", files=files)
    assert r1.status_code == 200
    sid = r1.json()["data"]["id"]

    r2 = c.post("/api/archive/process", data={"id": sid, "prefix": "assets", "versionId": "1004"})
    assert r2.status_code == 200
    j = r2.json()
    assert j["code"] == 0
    files_list = j["data"]["files"]
    names = {Path(str(f.get("source") or "")).name for f in files_list}
    assert "INDEX.md" in names
    assert "A.md" in names
53
docling/app/tests/test_md_to_docx.py
Normal file
@@ -0,0 +1,53 @@
import io
import os
import base64
from pathlib import Path
from zipfile import ZipFile

from app.services.docling_adapter import md_to_docx_bytes


def _make_png(tmpdir: Path) -> Path:
    # Minimal 1x1 PNG
    data = base64.b64decode(
        b"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR4nGNgYAAAAAMAASsJTYQAAAAASUVORK5CYII="
    )
    p = tmpdir / "tiny.png"
    p.write_bytes(data)
    return p


def test_md_to_docx_renders_blocks_and_media(tmp_path: Path):
    png = _make_png(tmp_path)
    html = (
        f"<h1>标题</h1>"
        f"<p>内容</p>"
        f"<pre><code>print(\"hello\")\n</code></pre>"
        f"<img src='{png.as_posix()}'>"
        f"<table><thead><tr><th>A</th><th>B</th></tr></thead>"
        f"<tbody><tr><td>1</td><td>2</td></tr></tbody></table>"
    )

    docx = md_to_docx_bytes(
        html,
        toc=True,
        header_text="Left|Right",
        footer_text="Footer",
        filename_text="FileName",
        product_name="Product",
        document_name="DocName",
        product_version="1.0",
        document_version="2.0",
    )

    assert isinstance(docx, (bytes, bytearray)) and len(docx) > 0
    zf = ZipFile(io.BytesIO(docx))
    names = set(zf.namelist())
    assert any(n.startswith("word/") for n in names)
    # Document XML should contain core texts
    doc_xml = zf.read("word/document.xml").decode("utf-8")
    for tok in ["标题", "内容", "print(\"hello\")", "A", "B", "1", "2"]:
        assert tok in doc_xml
    # Media should be present for the image
    assert any(n.startswith("word/media/") for n in names)
51
docling/app/tests/test_word2markdown_inline_images.py
Normal file
@@ -0,0 +1,51 @@
import unittest
from pathlib import Path
import base64
import tempfile
import sys

# ensure 'app' package is importable
try:
    root = Path(__file__).resolve().parents[2]
    p = str(root)
    if p not in sys.path:
        sys.path.insert(0, p)
except Exception:
    pass

from docx import Document

from app.services.word2markdown import convert_any


def _tiny_png_bytes() -> bytes:
    return base64.b64decode(
        b"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR4nGNgYAAAAAMAASsJTYQAAAAASUVORK5CYII="
    )


class InlineImagesTest(unittest.TestCase):
    def test_paragraph_image_order(self):
        tmp = Path(tempfile.mkdtemp(prefix="w2m_inline_test_"))
        img = tmp / "tiny.png"
        img.write_bytes(_tiny_png_bytes())

        docx = tmp / "sample.docx"
        doc = Document()
        doc.add_paragraph("前文A")
        doc.add_picture(str(img))  # the image sits in its own paragraph
        doc.add_paragraph("后文B")
        doc.save(str(docx))

        enc, md = convert_any(docx)
        self.assertEqual(enc, "utf-8")
        a_pos = md.find("前文A")
        img_pos = md.find("![image](data:")
        b_pos = md.find("后文B")
        # expected order: A -> image -> B
        self.assertTrue(a_pos != -1 and img_pos != -1 and b_pos != -1)
        self.assertTrue(a_pos < img_pos < b_pos)


if __name__ == "__main__":
    unittest.main()
1
docling/docling
Submodule
Submodule docling/docling added at ad97e52851
28
docling/requirements.txt
Normal file
@@ -0,0 +1,28 @@
fastapi
uvicorn
python-multipart
minio
beautifulsoup4
marko
markdown-it-py
mdit-py-plugins
pydantic-settings
filetype
python-docx
openpyxl
mammoth
weasyprint
reportlab
pypdfium2
python-pptx
pluggy
requests
docling-core
docling-parse
docling-ibm-models
transformers
sentencepiece
safetensors
scipy
opencv-python
pymupdf
17
docling/tests/debug_api.py
Normal file
@@ -0,0 +1,17 @@
import sys
from pathlib import Path
from fastapi.testclient import TestClient

# make the app package under docling/ importable
root = Path(__file__).resolve().parents[2] / "docling"
sys.path.insert(0, str(root))
import app.server as server

from docling.tests.test_api_prd import setup_module, PNG

# install the stubbed MinIO client and converter before issuing requests
setup_module()

app = server.app
c = TestClient(app)
files = {"file": ("管理端使用说明 (1).pdf", b"%PDF-1.4\n")}
data = {"export": "markdown", "save": "true", "filename": "管理端使用说明 (1)"}
r = c.post("/api/convert", files=files, data=data)
print(r.json())
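Because setup_module() swaps in the stubbed converter and MinIO client, the script needs no running services; a sketch of invoking it directly, assuming the repository root as working directory:

```bash
PYTHONPATH=. python docling/tests/debug_api.py
```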
131
docling/tests/test_api_prd.py
Normal file
@@ -0,0 +1,131 @@
import os
import sys
import tempfile
from pathlib import Path
from fastapi.testclient import TestClient
import types

root = Path(__file__).resolve().parents[2] / "docling"
sys.path.insert(0, str(root))

# Stub out the heavy docling modules so app.server can be imported
# without the real converter stack installed.
dc = types.ModuleType('docling.document_converter')
class _DC:
    def __init__(self, *a, **k):
        pass
    def convert(self, src):
        class R:
            class D:
                def export_to_markdown(self, image_mode=None):
                    return ""
                def export_to_html(self):
                    return ""
                def export_to_json(self):
                    return "{}"
                def export_to_doctags(self):
                    return "{}"
            document = D()
        return R()
class _PF:
    def __init__(self, *a, **k):
        pass
dc.DocumentConverter = _DC
dc.PdfFormatOption = _PF
sys.modules['docling.document_converter'] = dc

bm = types.ModuleType('docling.datamodel.base_models')
class _IF:
    PDF = 'pdf'
bm.InputFormat = _IF
sys.modules['docling.datamodel.base_models'] = bm

pl = types.ModuleType('docling.pipeline.standard_pdf_pipeline')
class _SP:
    def __init__(self, *a, **k):
        pass
pl.StandardPdfPipeline = _SP
sys.modules['docling.pipeline.standard_pdf_pipeline'] = pl

po = types.ModuleType('docling.datamodel.pipeline_options')
class _PPO:
    def __init__(self, *a, **k):
        pass
po.PdfPipelineOptions = _PPO
sys.modules['docling.datamodel.pipeline_options'] = po

ct = types.ModuleType('docling_core.types.doc')
class _IRM:
    PLACEHOLDER = 'placeholder'
ct.ImageRefMode = _IRM
sys.modules['docling_core.types.doc'] = ct

# Stub the adapter layer with no-op conversions so the API routes
# can be exercised without real rendering.
da = types.ModuleType('app.services.docling_adapter')
def _convert_source(src, export):
    return ("", "text/markdown")
def _md2docx(md, **k):
    return b""
def _md2pdf(md, *a, **k):
    return b""
def _infer(source_url, upload_name):
    return "document"
def _san(name):
    return name or "document"
def _load():
    return {}
def _save(m):
    return None
da.convert_source = _convert_source
da.md_to_docx_bytes = _md2docx
da.md_to_pdf_bytes_with_renderer = _md2pdf
da.infer_basename = _infer
da.sanitize_filename = _san
da.load_linkmap = _load
da.save_linkmap = _save
sys.modules['app.services.docling_adapter'] = da

import app.server as server

class DummyMinio:
    """In-memory stand-in for the MinIO client used by the server."""
    def __init__(self):
        self.objs = []
    def put_object(self, bucket_name, object_name, data, length, content_type):
        self.objs.append((bucket_name, object_name, length, content_type))
    def get_presigned_url(self, method, bucket, obj, expires=None):
        return f"http://127.0.0.1:9000/{bucket}/{obj}"
    def presigned_get_object(self, bucket, obj, expires=None):
        return f"http://127.0.0.1:9000/{bucket}/{obj}"


# minimal valid PNG bytes used as fake extracted image data
PNG = (b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01\x00\x00\x00\x01\x08\x02\x00\x00\x00\x90wS\xde\x00\x00\x00\nIDATx\x9cc\xf8\x0f\x00\x01\x01\x01\x00\x18\xdd\xdc\xa4\x00\x00\x00\x00IEND\xaeB`\x82")


def setup_module(module=None):
    # point the server at the dummy MinIO client and a stubbed converter
    server._minio_current = lambda: (DummyMinio(), "doctest", "http://127.0.0.1:9000", "assets")
    def fake_convert(src, export="markdown", engine=None):
        d = Path(tempfile.mkdtemp(prefix="artifacts_"))
        (d / "img.png").write_bytes(PNG)
        return ("utf-8", "A\n<!-- image -->\nB", str(d))
    server._converter_v2.convert = fake_convert
    server._extract_pdf_images = lambda pdf_path: [("png", PNG), ("png", PNG)]

import unittest


class TestApiConvert(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        setup_module()

    def test_api_convert_save_true_returns_md_url(self):
        app = server.app
        mc = server._minio_current()
        assert mc[1] == 'doctest'
        c = TestClient(app)
        files = {"file": ("管理端使用说明 (1).pdf", b"%PDF-1.4\n")}
        data = {"export": "markdown", "save": "true", "filename": "管理端使用说明 (1)"}
        r = c.post("/api/convert", files=files, data=data)
        j = r.json()
        self.assertEqual(j["code"], 0, str(j))
        self.assertTrue(j["data"]["name"].lower().endswith(".md"))
        self.assertTrue(j["data"]["minio_url"].lower().endswith(".md"))

    def test_api_convert_save_false_returns_content_and_md_name(self):
        app = server.app
        mc = server._minio_current()
        assert mc[1] == 'doctest'
        c = TestClient(app)
        files = {"file": ("文档.pdf", b"%PDF-1.4\n")}
        data = {"export": "markdown", "save": "false", "filename": "文档"}
        r = c.post("/api/convert", files=files, data=data)
        j = r.json()
        self.assertEqual(j["code"], 0, str(j))
        self.assertTrue(j["data"]["name"].lower().endswith(".md"))
self.assertIn("