Import project files

2026-01-07 17:18:26 +08:00
parent 7d9fff2c34
commit 0b07e63b76
66 changed files with 11497 additions and 0 deletions

16
.gitignore vendored Normal file

@@ -0,0 +1,16 @@
# OS
.DS_Store
# Node/Vite
node_modules/
frontend/dist/
# Python
__pycache__/
*.pyc
.env
# Local archives (do not push huge files)
FunMD_Convert.tar
FunMD_Convert_Image.tar

3
.vscode/settings.json vendored Normal file

@@ -0,0 +1,3 @@
{
"git.ignoreLimitWarning": true
}

58
Dockerfile Normal file

@@ -0,0 +1,58 @@
########## Frontend build stage ##########
FROM node:20-alpine AS frontend
WORKDIR /frontend
# Install dependencies and build
COPY frontend/package*.json ./
RUN npm ci
COPY frontend/ .
# Allow overriding API base at build time if needed
ARG VITE_API_BASE_URL=
ENV VITE_API_BASE_URL=${VITE_API_BASE_URL}
RUN npm run build
########## Backend runtime stage ##########
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
libgl1 \
libglib2.0-0 \
pandoc \
libreoffice \
fonts-noto \
fonts-noto-cjk \
&& rm -rf /var/lib/apt/lists/*
# Copy backend requirements and install
COPY docling/requirements.txt ./requirements.txt
ARG PIP_INDEX_URL
RUN if [ -n "$PIP_INDEX_URL" ]; then pip install --no-cache-dir -i "$PIP_INDEX_URL" --timeout 120 -r requirements.txt; else pip install --no-cache-dir --timeout 120 -r requirements.txt; fi
# Copy backend code
COPY docling/ /app/docling/
# Copy built frontend into expected location
COPY --from=frontend /frontend/dist /app/frontend/dist
# Prefetch models for offline use
ENV DOCLING_CACHE_DIR=/root/.cache/docling
ENV PYTHONPATH=/app:/app/docling:/app/docling/docling
RUN python - <<'PY'
from docling.utils.model_downloader import download_models
print('Prefetching Docling models (layout, table, picture-classifier, code-formula, rapidocr)...')
download_models(progress=False)
print('Models downloaded.')
PY
# Expose port
ENV PORT=8000
EXPOSE 8000
# Start backend (serves API and /ui)
CMD ["uvicorn", "app.server:app", "--host", "0.0.0.0", "--port", "8000"]

238
api.md Normal file

@@ -0,0 +1,238 @@
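# FunMD Document Processing API and Test Guide
## Basic information
- Base URL (intranet): `http://192.168.110.58:8000`
- Frontend intranet test link: `http://192.168.110.58:8000/ui/`
- Unified response envelope (API v2): success returns `{"code":0,"msg":"ok","data":{...}}`; failure returns `{"code":<error code>,"msg":<error message>,"data":null}`. The HTTP status stays 200 in both cases.
- Recommended frontend setting: `localStorage.setItem('app.api.base','http://192.168.110.58:8000')`
- Important convention: when the MinIO bucket is private, prefer the returned `minio_presigned_url` for downloads; the direct link `minio_url` may return 403.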
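
Because failures also arrive with HTTP 200, a client has to inspect `code` instead of the status line. The following is a minimal Python sketch of that check, assuming only the envelope shape described above; `ApiError`, `unwrap`, and `pick_download_url` are hypothetical helper names, not part of the project.

```python
from typing import Any, Dict, Optional


class ApiError(Exception):
    """Hypothetical exception for an envelope whose `code` is not 0."""

    def __init__(self, code: int, msg: str) -> None:
        super().__init__(f"API error {code}: {msg}")
        self.code = code
        self.msg = msg


def unwrap(envelope: Dict[str, Any]) -> Dict[str, Any]:
    """Return `data` when `code` is 0, otherwise raise ApiError."""
    code = envelope.get("code", -1)
    if code != 0:
        raise ApiError(code, str(envelope.get("msg")))
    return envelope.get("data") or {}


def pick_download_url(data: Dict[str, Any]) -> Optional[str]:
    """Prefer the presigned link; fall back to the direct link."""
    return data.get("minio_presigned_url") or data.get("minio_url")
```

A caller would typically run `pick_download_url(unwrap(response_json))` and treat `None` as "nothing was saved to MinIO".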
## Endpoints
### Health check
- Method: `GET /health`
- Returns: `{"status":"ok"}`
- Reference: `docling/app/server.py:99`
### Unified conversion: DOCX/PDF → Markdown/HTML/JSON
- Method: `POST /api/convert`
- Form fields:
  - `file` or `source_url` (exactly one)
  - `export`: `markdown|html|json|doctags`, default `markdown`
  - `engine` (optional): `word2markdown|docling`
  - `save` (optional): `true|false`
  - `filename` (optional): base name of the output
- Returns:
  - Not saved: `data.content` holds the text and `data.media_type` indicates its type
  - Saved: `data.minio_url` and `data.minio_presigned_url`
- Examples:
```bash
# Local PDF to Markdown (not saved)
curl -X POST http://192.168.110.58:8000/api/convert \
  -F file=@/path/to/file.pdf \
  -F export=markdown
# Remote URL to HTML (saved)
curl -X POST http://192.168.110.58:8000/api/convert \
  -F source_url=https://example.com/page.pdf \
  -F export=html -F save=true -F filename=example
```
- Reference: `docling/app/server.py:2296`
### Markdown → DOCX/PDF (with advanced styling)
- Method: `POST /md/convert`
- Input (exactly one of three): `md_file` | `markdown_text` | `markdown_url`
- Required: `target=docx|pdf`
- Optional (advanced settings):
  - Layout: `css_name`, `css_text`, `toc=true|false`, `header_text`, `footer_text`
  - Cover and logo: `cover_url|cover_file`, `logo_url|logo_file`
  - Cover text: `product_name|document_name|product_version|document_version`
  - Copyright: `copyright_text`
  - Saving: `save=true|false`
- Behavior:
  - With `save=false`, the cover and logo are embedded as `data:` URIs, which avoids 403 errors from private buckets; with `save=true`, MinIO links are returned.
- Examples:
```bash
# Text to PDF (cover, logo, table of contents, header and footer)
# $'...' lets bash turn \n into real newlines; plain single quotes would send a literal backslash-n
curl -X POST http://192.168.110.58:8000/md/convert \
  -F markdown_text=$'# 标题\n\n内容' \
  -F target=pdf -F toc=true \
  -F header_text='Internal' -F footer_text='Confidential' \
  -F product_name='CMS' -F document_name='周报' \
  -F product_version='v1.0' -F document_version='2025-W48' \
  -F cover_file=@/path/to/cover.png -F logo_file=@/path/to/logo.png
# File to DOCX (saved to MinIO)
curl -X POST http://192.168.110.58:8000/md/convert \
  -F md_file=@/path/to/doc.md \
  -F target=docx -F save=true -F filename='周报'
```
- Reference: `docling/app/server.py:1198`
### Batch processing of a local folder (rewrite MD assets and upload)
- Method: `POST /md/convert-folder`
- Form fields:
  - `folder_path` (required): absolute path of a local folder on the backend machine
  - `prefix` (optional): MinIO prefix (for example `assets`)
- Returns: `{ ok, count, files: [{ source, minio_url, minio_presigned_url, asset_ok, asset_fail, mappings }] }`
- Example:
```bash
curl -X POST http://192.168.110.58:8000/md/convert-folder \
  -F folder_path='/Users/fanyang/Desktop/Others/CMS/达梦数据-各类示范文档/数+产品手册-MD源文件/DMDRS_Build_Manual_DM8' \
  -F prefix='assets'
```
- Reference: `docling/app/server.py:2075`
### Batch processing of an uploaded archive
- Method: `POST /api/upload-archive`
- Form fields: `file` (zip/tar.gz/tgz), `prefix` (optional)
- Returns: `{ code, msg, data: { count, files: [{ source, minio_url, minio_presigned_url, mappings }] } }`
- Example:
```bash
curl -X POST http://192.168.110.58:8000/api/upload-archive \
  -F file=@/path/to/archive.zip -F prefix='assets'
```
- Reference: `docling/app/server.py:2571`
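### Staged archive processing
- Stage the upload: `POST /api/archive/stage`, returns `{ id, name, size }`
- Batch process: `POST /api/archive/process`, fields: `id` (required), `prefix` (optional), `versionId` (optional)
- Note: HTML files go through a two-phase rewrite (upload the HTML's assets to MinIO and rewrite the links → convert to Markdown → rewrite the assets and links in the Markdown again). Inline `data:image/*;base64,` images are uploaded and replaced with MinIO links.
- Reference: `docling/app/server.py:2714,2728`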
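
The inline-image step of that rewrite can be pictured with the sketch below. It is an illustration under stated assumptions, not the code in `server.py`: `rewrite_data_images` and the `upload` callback are hypothetical names, the object name is derived from a content hash purely for the example, and the regular expression only covers plain base64 payloads.

```python
import base64
import hashlib
import re
from typing import Callable, Dict, Tuple

# Matches data:image/<subtype>;base64,<payload> (payload without whitespace).
_DATA_IMG = re.compile(r"data:image/([A-Za-z0-9.+-]+);base64,([A-Za-z0-9+/=]+)")


def rewrite_data_images(
    html: str, upload: Callable[[str, bytes], str]
) -> Tuple[str, Dict[str, str]]:
    """Replace inline base64 images with the URL returned by `upload`.

    `upload(name, raw_bytes)` is a hypothetical callback that stores the
    bytes (for example in MinIO) and returns a reachable URL.
    """
    mappings: Dict[str, str] = {}

    def _replace(match: "re.Match[str]") -> str:
        subtype = match.group(1).lower()
        raw = base64.b64decode(match.group(2))
        # Content hash as object name, so identical images map to one object.
        name = hashlib.sha256(raw).hexdigest()[:16] + "." + subtype
        mappings[name] = upload(name, raw)
        return mappings[name]

    return _DATA_IMG.sub(_replace, html), mappings
```

The returned `mappings` dictionary mirrors the idea of the `mappings` field in the batch responses, although its exact structure there is not shown in this document.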
### MinIO configuration and testing
- Set the configuration: `POST /config/minio`
  - Fields: `endpoint`, `public`, `access`, `secret`, `bucket`, `secure=true|false`, `prefix`, `store_final=true|false`, `public_read=true|false`
  - Example:
```bash
curl -X POST http://192.168.110.58:8000/config/minio \
  -F endpoint='127.0.0.1:9000' -F public='127.0.0.1:9000' \
  -F access='minioadmin' -F secret='minioadmin123' \
  -F bucket='doctest' -F secure=false -F prefix='assets' \
  -F store_final=true -F public_read=true
```
  - Note: use the MinIO API port `9000`, not the console port `9001`; a console address or `:9001` is rejected
  - Reference: `docling/app/server.py:488`
- Connectivity test and policy application: `POST /config/minio/test`
  - Same fields as above, plus an optional `create_if_missing=true`
  - Returns: `{ ok, connected, bucket_exists, created, error?, hint? }`
  - Reference: `docling/app/server.py:577`
- Get a configuration snapshot: `GET /config` (reference: `docling/app/server.py:1047`)
- Configuration profiles: `GET /config/profiles`, `POST /config/save_profile`, `GET /config/load_profile?name=xxx` (reference: `docling/app/server.py:1058,1068,1084`)
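
A client can catch the console-port mistake before calling `/config/minio`. The check below is a sketch of one way to do that, assuming the rejection rules stated in this document (`:9001`, or a path starting with `/browser` or `/minio`); `should_reject_endpoint` is a hypothetical name and the server's own validation may differ.

```python
from urllib.parse import urlparse


def should_reject_endpoint(endpoint: str) -> bool:
    """Return True when the value looks like a MinIO console address."""
    value = endpoint.strip()
    # Bare "host:port" values need a leading "//" so urlparse sees a netloc.
    parsed = urlparse(value if "://" in value else "//" + value)
    try:
        port = parsed.port
    except ValueError:
        # Non-numeric or out-of-range port: unusable as an API endpoint.
        return True
    path = parsed.path.lower()
    return port == 9001 or path.startswith("/browser") or path.startswith("/minio")
```

For example, `should_reject_endpoint("127.0.0.1:9000")` is `False`, while `"127.0.0.1:9001"` and `"http://10.0.0.1:9000/browser"` are both flagged.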
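### System time check (troubleshooting MinIO clock skew)
- Method: `GET /system/time/check`
- Query parameters: `endpoint`, `public`, `secure` (optional; the current runtime configuration is used when omitted)
- Returns: `{ ok, server_time, local_time, diff_sec, hint }`
- Reference: `docling/app/server.py:720`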
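
How a value like `diff_sec` can be derived is sketched below, assuming the server time comes from an HTTP `Date` header. This is an illustration only: the function names are hypothetical, and the 15-minute threshold is an assumption borrowed from the usual S3 request-time tolerance, not a value taken from this project.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumption: S3-compatible services commonly reject requests whose
# timestamp is more than about 15 minutes off.
MAX_SKEW_SEC = 15 * 60


def clock_skew_seconds(date_header: str, local_now: datetime) -> float:
    """Local time minus server time, in seconds (positive: local clock is ahead)."""
    server_time = parsedate_to_datetime(date_header)
    return (local_now - server_time).total_seconds()


def skew_too_large(diff_sec: float, limit: float = MAX_SKEW_SEC) -> bool:
    """True when the absolute skew exceeds the assumed tolerance."""
    return abs(diff_sec) > limit
```

`local_now` must be timezone-aware, for example `datetime.now(timezone.utc)`.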
### Asset mapping and proxy download (optional)
- Linkmap: `GET /config/linkmap`, `POST /config/linkmap` (reference: `docling/app/server.py:1583,1587`)
- Proxy download: `POST /proxy/download` (reference: `docling/app/server.py:1635`)
## Frontend integration notes
- Base URL resolution: `frontend/src/services/api.ts:56-64` (localStorage `app.api.base` takes priority, then `VITE_API_BASE_URL`)
- Provided methods:
  - `convertDoc` → `/api/convert` (`frontend/src/services/api.ts:96`)
  - `uploadArchive` → `/api/upload-archive` (`frontend/src/services/api.ts:104`)
  - `stageArchive` → `/api/archive/stage` (`frontend/src/services/api.ts:185`)
  - `processArchive` → `/api/archive/process` (`frontend/src/services/api.ts:193`)
  - `convertMd` → `/md/convert` (`frontend/src/services/api.ts:157`)
  - `convertFolder` → `/md/convert-folder` (`frontend/src/services/api.ts:164`)
  - MinIO configuration: `setMinioConfig` (`frontend/src/services/api.ts:112`), `testMinioConfig` (`frontend/src/services/api.ts:128`), `createBucket` (`frontend/src/services/api.ts:145`)
- Private buckets: direct links may return 403, so the frontend should prefer `minio_presigned_url` (the UI already supports this).
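## Test guide (covers every capability)
### 1. Health check
- Request: `GET /health`
- Assert: returns `{"status":"ok"}`.
### 2. DOCX/PDF → Markdown/HTML/JSON
- Case A: local PDF → Markdown (not saved)
  - `POST /api/convert`, `file=@/path/to/file.pdf`, `export=markdown`
  - Assert: `code=0`, `data.content` contains Markdown text, and `data.media_type` is `text/markdown; charset=utf-8`.
- Case B: remote PDF → HTML (saved)
  - `POST /api/convert`, `source_url=http(s)://...pdf`, `export=html`, `save=true`, `filename=example`
  - Assert: the returned `minio_url` and `minio_presigned_url` are reachable; Chinese paths are encoded correctly.
### 3. Markdown → DOCX/PDF
- Case C: text → PDF (advanced parameters, `save=false`)
  - Fields: `markdown_text`, `target=pdf`, `toc=true`, `header_text`, `footer_text`, cover/logo files and cover text
  - Assert: the returned PDF binary opens; the cover and logo are visible. The `word-break: break-word` warning in the log does not affect generation.
- Case D: file → DOCX (`save=true`)
  - Fields: `md_file`, `target=docx`, `save=true`
  - Assert: `minio_presigned_url` is downloadable; the Chinese file name is encoded correctly.
- Case E: URL → PDF
  - Fields: `markdown_url=http(s)://...md`, `target=pdf`
  - Assert: generation succeeds; the cover and logo load correctly (through signed links if the bucket is private).
### 4. Batch processing
- Case F: rewrite and upload a local folder in batch
  - `POST /md/convert-folder`, `folder_path='/Users/fanyang/Desktop/Others/CMS/达梦数据-各类示范文档/数+产品手册-MD源文件/DMDRS_Build_Manual_DM8'`, `prefix='assets'`
  - Assert: `count>0`; `asset_ok/asset_fail` are plausible for each file; `minio_presigned_url` is downloadable.
- Case G: batch processing of an uploaded archive
  - `POST /api/upload-archive`, `file=@/path/to/archive.zip`, `prefix='assets'`
  - Assert: `data.count` is correct; every file link works.
### 5. MinIO configuration and policy
- Case H: set the configuration
  - `POST /config/minio` (real parameters)
  - Assert: returns `ok:true`.
- Case I: connectivity test and policy application
  - `POST /config/minio/test`, `public_read=true|false`, `create_if_missing=true`
  - Assert: the connection status is returned; with a private bucket, `minio_presigned_url` is reachable.
### 6. Asset mapping and proxy (optional)
- Case J: set a static mapping with `GET/POST /config/linkmap`; verify proxy downloading with `POST /proxy/download`.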
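## Compatibility and notes
- Path encoding: all returned object paths are already encoded, so Chinese characters, spaces, parentheses, and similar characters work.
- Private buckets: direct links may return 403; use `minio_presigned_url` for frontend testing.
- Style warning: WeasyPrint does not support `word-break: break-word`; use `overflow-wrap: break-word` or `word-break: break-all` instead.
- Safe extraction: ZIP/TAR extraction guards against path traversal and confines the extraction target to the working directory; common file-name encoding mojibake is repaired automatically.
- HTML asset rewriting: batch processing rewrites HTML asset links in two phases and uploads the assets to MinIO; embedded Base64 images are uploaded automatically and replaced with reachable links.
- Console port restriction: `/config/minio` and `/config/minio/test` reject `:9001` and console addresses containing `/browser` or `/minio`; use the API port `9000`.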
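
The path-traversal guard mentioned in the safe-extraction note usually comes down to one comparison per archive member. The following is a generic sketch of that check, not the project's implementation; `is_within_directory` is a hypothetical helper and the example assumes a POSIX-style file system.

```python
import os


def is_within_directory(base_dir: str, member_name: str) -> bool:
    """True when extracting `member_name` under `base_dir` stays inside it."""
    base = os.path.realpath(base_dir)
    # join() discards `base` for absolute member names, so those fail the check.
    target = os.path.realpath(os.path.join(base, member_name))
    return os.path.commonpath([base, target]) == base
```

An extractor would skip (or abort on) any member for which this returns `False` before writing anything to disk.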
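## Running locally (backend and frontend)
- Backend (FastAPI, port `8000`):
  - `cd /Users/fanyang/Desktop/FunMD_Convert/docling`
  - `PYTHONPATH=. python -m uvicorn app.server:app --host 127.0.0.1 --port 8000`
- Frontend (Vite):
  - `cd /Users/fanyang/Desktop/FunMD_Convert/frontend`
  - `npm install`
  - `VITE_API_BASE_URL=http://127.0.0.1:8000 npm run dev`
- Access:
  - Backend UI home page: `http://127.0.0.1:8000/ui/`
  - Frontend dev page: the local address printed by the Vite console (usually `http://127.0.0.1:5173/`)
### Configure the API base in the frontend (avoids empty proxy responses)
- Open the "数据库配置" (Database configuration) dialog in the top-right corner of the frontend page, enter `http://127.0.0.1:8000` in the "接口地址" (API address) field, and click "保存配置" (Save configuration).
- After saving, the frontend talks to the backend on port `8000` directly instead of going through the Vite proxy, which avoids `ERR_EMPTY_RESPONSE` on `5173` for long-running requests.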


@@ -0,0 +1,141 @@
toolName: search_by_regex
status: success
query: StaticFiles
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/server.py
toolName: view_folder
status: success
/Users/fanyang/Desktop/FunMD_Convert
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/Dockerfile
toolName: view_folder
status: success
/Users/fanyang/Desktop/FunMD_Convert
toolName: edit_file_update
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/Dockerfile
changes:
toolName: ApplyPatch
status: failed
toolName: write_to_file
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/package_offline.sh
content: undefined
toolName: todo_write
status: success
Todos updated: 2 items
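Below is a complete "package → transfer → load → start" plan that follows the reference flow you provided and matches how this project starts (the FastAPI backend also serves the frontend at `/ui`).
**Package the image**
- Run in the project root:
  - Build the image: `docker build -t funmd-convert:latest .`
  - Save the image: `docker save funmd-convert:latest -o FunMD_Convert.tar`
  - Check the size: `ls -lh FunMD_Convert.tar`
  - Compute the checksum: `shasum -a 256 FunMD_Convert.tar | awk '{print $1}'`
- Or use the script directly:
  - `chmod +x package_offline.sh && ./package_offline.sh`
  - The script builds the image, generates `FunMD_Convert.tar`, prints its size and `sha256`, and shows hints for loading and running it on the server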
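
On a server without `shasum`, the same checksum can be computed with the Python standard library. This is a small sketch; `sha256_of_file` is a hypothetical helper name, and the file is read in chunks because the image archive can be several gigabytes.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `sha256_of_file("FunMD_Convert.tar")` on both machines confirms that the transfer did not corrupt the archive.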
**Load on the offline server**
- Copy the image archive to the server: `scp FunMD_Convert.tar user@server:/opt/FunMD_Convert/`
- Load the image: `docker load -i /opt/FunMD_Convert/FunMD_Convert.tar`
- Verify the image: `docker images | grep funmd-convert`
**Start the container**
- Start the backend (which also serves the frontend at `/ui`):
  `docker run -d -p 8000:8000 --name FunMD_Convert --restart unless-stopped funmd-convert:latest`
- Addresses:
  - Backend health check: `http://<server IP>:8000/health`
  - Frontend page: `http://<server IP>:8000/ui/`
  - API base URL: `http://<server IP>:8000/api/…`
**MinIO configuration (first deployment or after changes)**
- Configure MinIO through the backend endpoint (make sure `public` points to port 9000 and is reachable on the intranet):
  - `curl -X POST -F endpoint=10.9.35.31:9000 -F public=http://10.9.35.31:9000 -F access=<AK> -F secret=<SK> -F bucket=file-cms -F secure=false -F public_read=true http://<server IP>:8000/config/minio`
  - Verify the connection: `curl -X POST -F endpoint=10.9.35.31:9000 -F bucket=file-cms -F access=<AK> -F secret=<SK> -F secure=false http://<server IP>:8000/config/minio/test`
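**Notes and alignment**
- The image already contains:
  - The FastAPI backend service (started with `uvicorn`, port `8000`)
  - The built frontend static assets, which the backend mounts automatically at `/ui` and `/assets`
  - The Docling models, pre-downloaded for offline use during the build stage and usable directly inside the container
- Frontend request base:
  - When the backend serves the frontend, the frontend falls back to `http://<current host>:8000` automatically, so no extra configuration is needed
- Chinese paths and URL display:
  - The backend fixes garbled Chinese names when extracting Zip/Tar archives and returns both `minio_url` (safely encoded) and `minio_url_display` (human-readable Chinese); the frontend shows the latter as text and loads assets with the former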
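
The relationship between the two URL forms can be illustrated with percent-encoding from the Python standard library. This is a sketch of the idea only: `encode_object_url` and `display_url` are hypothetical helpers, and the backend's actual encoding rules are not shown in this document.

```python
from urllib.parse import quote, unquote


def encode_object_url(base: str, bucket: str, key: str) -> str:
    """Build a percent-encoded object URL; '/' in the key is kept as a separator."""
    return f"{base.rstrip('/')}/{quote(bucket)}/{quote(key, safe='/')}"


def display_url(encoded_url: str) -> str:
    """Decode percent-escapes so the URL reads naturally in the UI."""
    return unquote(encoded_url)
```

The encoded form is what a browser or HTTP client should request; the decoded form is only meant for display.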
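**Recommended packaging checklist**
- After the build completes:
  - `ls -lh offline-funmd-convert.tar` to confirm the size is reasonable (it includes the models and the frontend)
  - `shasum -a 256 offline-funmd-convert.tar | awk '{print $1}'` to record the checksum
- On the server:
  - `docker load -i ...`, then `docker images | grep offline-funmd-convert` shows the image
  - Once `docker ps` shows the container running, `http://<server IP>:8000/health` returns `{"status":"ok"}`
The checklist above keeps the reference flow's `offline-funmd-convert` naming. To align the image name and examples with the project name (`FunMD_Convert`), use these build and run commands instead:
- Build: `docker build -t funmd-convert:latest .`
- Save: `docker save funmd-convert:latest -o FunMD_Convert.tar`
- Load: `docker load -i /opt/FunMD_Convert/FunMD_Convert.tar`
- Start: `docker run -d -p 8000:8000 --name FunMD_Convert --restart unless-stopped funmd-convert:latest`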

154
docling/README.zh-CN.md Normal file

@@ -0,0 +1,154 @@
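# Local Installation and Startup Guide (Docling + FastAPI service)
This document explains how to install and start this repository's conversion service on a local machine so that the frontend can call it to generate and download PDFs.
## Requirements
- Operating system: macOS (verified); Linux and Windows should also work
- Python: 3.9 to 3.13
- Recommended tooling: `python -m venv` or [uv](https://docs.astral.sh/uv/)
## Create a virtual environment
- With venv:
```bash
cd /Users/fanyang/Desktop/docling
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
```
- Or with uv:
```bash
cd /Users/fanyang/Desktop/docling
uv venv
source .venv/bin/activate
```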
## Install dependencies
- Install the local Docling library (editable mode):
```bash
python -m pip install -e ./docling
```
- Install the backend service dependencies:
```bash
python -m pip install fastapi uvicorn minio weasyprint pytest
```
- If WeasyPrint reports missing system libraries on macOS, install them with Homebrew:
```bash
brew install cairo pango gdk-pixbuf libffi
```
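## Start the service
- Run in the project root:
```bash
PYTHONPATH=. python -m uvicorn app.server:app --host 127.0.0.1 --port 8000
```
- Access:
  - Home page UI: `http://127.0.0.1:8000/`
  - Health check: `http://127.0.0.1:8000/health` (returns `{"status":"ok"}`)
### Endpoint overview
- `GET /` local UI (static files)
- `GET /health` service health check
- `POST /md/convert` Markdown/HTML → `docx|pdf` (core endpoint, returns a MinIO download link)
- `POST /md/convert-folder` batch-converts the `.md` files in a local folder and uploads the results to MinIO
- `POST /md/upload-folder` batch-uploads folder contents packaged by the frontend and converts the `.md` files inside
- MinIO configuration:
  - `POST /config/minio` sets the connection details and prefix
  - `POST /config/minio/test` verifies the connection
  - `GET /config/minio/buckets` lists buckets
  - `POST /config/minio/create-bucket` creates a bucket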
## MinIO configuration
- Through environment variables (recommended):
```bash
export MINIO_ENDPOINT=127.0.0.1:9000
export MINIO_ACCESS_KEY=minioadmin
export MINIO_SECRET_KEY=minioadmin
export MINIO_BUCKET=docling-target
export MINIO_SECURE=false
export MINIO_PUBLIC_ENDPOINT=http://127.0.0.1:9000
export MINIO_PREFIX=cms-files
```
- Through the runtime endpoints:
  - `POST /config/minio` sets the connection details and prefix
  - `POST /config/minio/test` tests connectivity
  - `GET /config/minio/buckets` lists buckets
  - `POST /config/minio/create-bucket` creates a bucket
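
The two configuration routes carry the same information, which the sketch below makes explicit by mapping the environment variables onto the `/config/minio` form field names. This mapping is an assumption made for illustration (`minio_settings_from_env` is a hypothetical helper); how the server itself reads these variables is not shown here.

```python
import os
from typing import Dict, Mapping, Optional, Union


def _as_bool(value: Optional[str], default: bool = False) -> bool:
    """Interpret common truthy strings; anything else counts as False."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def minio_settings_from_env(
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Union[str, bool, None]]:
    """Collect MINIO_* variables under the form field names of /config/minio."""
    return {
        "endpoint": env.get("MINIO_ENDPOINT"),
        "public": env.get("MINIO_PUBLIC_ENDPOINT"),
        "access": env.get("MINIO_ACCESS_KEY"),
        "secret": env.get("MINIO_SECRET_KEY"),
        "bucket": env.get("MINIO_BUCKET"),
        "secure": _as_bool(env.get("MINIO_SECURE")),
        "prefix": env.get("MINIO_PREFIX"),
    }
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests.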
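## Downloading PDFs from the frontend (endpoint description)
- Core endpoint: `POST /md/convert`
- Purpose: converts Markdown/HTML to PDF, uploads it to MinIO, and returns a downloadable link
- Parameters (FormData); provide the document source through exactly one of these three:
  - `md_file`: upload a Markdown file
  - `markdown_text`: pass Markdown text directly
  - `markdown_url`: document URL (recommended)
- Target format: `target=pdf`
- Optional parameters: `toc`, `header_text`, `footer_text`, `logo_url|logo_file`, `cover_url|cover_file`, `product_name`, `document_name`, `product_version`, `document_version`, `css_name|css_text`
- Returned JSON fields: `minio_presigned_url` (time-limited download link) or `minio_url` (public link), `name`, `media_type`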
### Frontend call example (TypeScript)
```ts
async function downloadPdf(markdownUrl: string) {
  const fd = new FormData();
  fd.append('markdown_url', markdownUrl);
  fd.append('target', 'pdf');
  fd.append('toc', 'true');
  // Optional branding parameters:
  // fd.append('header_text', '产品名|文档标题');
  // fd.append('footer_text', '© 公司');
  const resp = await fetch('http://127.0.0.1:8000/md/convert', { method: 'POST', body: fd });
  if (!resp.ok) throw new Error('Conversion failed');
  const data = await resp.json();
  // Prefer the presigned link; fall back to the public link
  const url = data.minio_presigned_url || data.minio_url;
  if (!url) throw new Error('No download link returned; check the MinIO configuration');
  window.location.href = url; // triggers the download
}
```
### cURL example (URL → PDF)
```bash
curl -s -X POST \
  -F 'markdown_url=http://127.0.0.1:9000/docs/assets/rewritten/DMDRS_Build_Manual_Oracle/DMDRS搭建手册-Oracle.md' \
  -F 'target=pdf' \
  -F 'toc=true' \
  -F 'header_text=产品名|文档标题' \
  -F 'footer_text=© 2025 公司' \
  http://127.0.0.1:8000/md/convert
```
Example response:
```json
{
  "minio_url": "http://127.0.0.1:9000/docling-target/cms-files/converted/DMDRS搭建手册-Oracle.pdf",
  "minio_presigned_url": "http://127.0.0.1:9000/...presigned...",
  "name": "DMDRS搭建手册-Oracle.pdf",
  "media_type": "application/pdf"
}
```
### Batch conversion (folder)
- Convert every `.md` file in a local folder and upload the results:
```bash
curl -s -X POST -F 'folder_path=/Users/you/docs' http://127.0.0.1:8000/md/convert-folder
```
### Convert directly to DOCX (as needed)
```bash
curl -s -X POST \
  -F 'markdown_url=http://127.0.0.1:9000/docs/assets/rewritten/DMDRS_Build_Manual_Oracle/DMDRS搭建手册-Oracle.md' \
  -F 'target=docx' \
  http://127.0.0.1:8000/md/convert
```
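## FAQ
- `ModuleNotFoundError: No module named 'app' / 'docling'`
  - Set `PYTHONPATH=.` before the start command, or start directly in the current shell with `PYTHONPATH=. python -m uvicorn ...`.
- No download URL is returned
  - Check the MinIO environment variables or configure MinIO through `/config/minio`; make sure the bucket exists and `store_final=true` is enabled on the server.
- Images or styles look wrong
  - Confirm that assets have been rewritten to public URLs (the service uploads and rewrites them automatically), and check `css_name`/`css_text` (the default PDF style is `default`, located at `app/configs/styles/default.css`).
- WeasyPrint dependencies are missing (macOS)
  - Run `brew install cairo pango gdk-pixbuf libffi` and retry; if it still fails, check `PATH`/`DYLD_LIBRARY_PATH`.
## Related documents
- Server API description in Chinese: `docling/README.zh-CN.md`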

1
docling/app/__init__.py Normal file

@@ -0,0 +1 @@


@@ -0,0 +1,17 @@
{
"minio": {
"endpoint": "127.0.0.1:9000",
"public": "http://127.0.0.1:9000",
"access": "minioadmin",
"secret": "minioadmin123",
"bucket": "doctest",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true"
},
"db": {
"webhook_url": null,
"token": null
}
}


@@ -0,0 +1 @@
{}


@@ -0,0 +1,17 @@
{
"minio": {
"endpoint": "127.0.0.1:9000",
"public": "http://127.0.0.1:9000",
"access": "minioadmin",
"secret": "minioadmin123",
"bucket": "doctest",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true"
},
"db": {
"webhook_url": null,
"token": null
}
}


@@ -0,0 +1,17 @@
{
"minio": {
"endpoint": "127.0.0.1:9001",
"public": "127.0.0.1:9001",
"access": "minioadmin",
"secret": "minioadmin123",
"bucket": "doctest",
"secure": "true",
"prefix": "assets",
"store_final": "true",
"public_read": "true"
},
"db": {
"webhook_url": null,
"token": null
}
}


@@ -0,0 +1,17 @@
{
"minio": {
"endpoint": "127.0.0.1:9000",
"public": "127.0.0.1:9000",
"access": "minioadmin",
"secret": "minioadmin123",
"bucket": "doctest",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true"
},
"db": {
"webhook_url": null,
"token": null
}
}


@@ -0,0 +1,17 @@
{
"minio": {
"endpoint": "127.0.0.1:9000",
"public": "127.0.0.1:9000",
"access": "minioadmin",
"secret": "minioadmin123",
"bucket": "doctest",
"secure": "true",
"prefix": "assets",
"store_final": "true",
"public_read": "true"
},
"db": {
"webhook_url": null,
"token": null
}
}


@@ -0,0 +1,17 @@
{
"minio": {
"endpoint": "127.0.0.1:9000",
"public": "http://127.0.0.1:9000",
"access": "minioadmin",
"secret": "minioadmin123",
"bucket": "doctest",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true"
},
"db": {
"webhook_url": null,
"token": null
}
}


@@ -0,0 +1,17 @@
{
"minio": {
"endpoint": "8.163.40.177:9000",
"public": "http://8.163.40.177:9000",
"access": "minioadmin",
"secret": "minioadmin",
"bucket": "cms-files",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true"
},
"db": {
"webhook_url": null,
"token": null
}
}


@@ -0,0 +1,88 @@
@page {
size: A4;
margin: 20mm 15mm 20mm 15mm;
@top-left { content: none; }
@top-center { content: element(header); }
@bottom-left { content: element(copyright); }
@bottom-center { content: element(footer); }
@bottom-right { content: counter(page); font-size: 10pt; color: #444; }
}
html { font-family: "Noto Sans CJK SC", "Noto Sans", "Source Han Sans SC", "DejaVu Sans", sans-serif; font-size: 12pt; line-height: 1.6; }
body { color: #111; }
h1 { font-size: 20pt; margin: 0 0 8pt; page-break-before: always; }
h2 { font-size: 16pt; margin: 16pt 0 8pt; }
h3 { font-size: 14pt; margin: 12pt 0 6pt; }
h1, h2, h3 { page-break-after: avoid; break-after: avoid-page; }
p { margin: 0 0 8pt; }
pre, code { font-family: "DejaVu Sans Mono", "Noto Sans Mono", monospace; font-size: 10pt; }
table { width: 100%; border-collapse: collapse; margin: 8pt 0; table-layout: fixed; }
th, td { border: 1px solid #ddd; padding: 6pt 8pt; }
thead { display: table-header-group; }
tfoot { display: table-footer-group; }
table, thead, tbody, tr, th, td { page-break-inside: avoid; break-inside: avoid-page; }
th, td { white-space: normal; overflow-wrap: anywhere; word-break: break-word; hyphens: auto; }
.table-block { page-break-inside: avoid; break-inside: avoid-page; }
pre { background: #f6f8fa; border: 1px solid #e5e7eb; border-radius: 6pt; padding: 8pt 10pt; white-space: pre-wrap; overflow-wrap: anywhere; word-break: break-word; }
code { background: #f6f8fa; border-radius: 4pt; padding: 0 3pt; }
a { color: #0366d6; text-decoration: underline; }
a:hover { text-decoration: underline; }
.break-before { page-break-before: always; }
.break-after { page-break-after: always; }
.doc-meta { height: 0; overflow: hidden; }
.doc-header-text { position: running(header); }
.doc-footer-text { position: running(footer); }
.doc-copyright { position: running(copyright); }
img#brand-logo { display: none; }
.toc { page-break-after: always; }
.toc h1 { font-size: 18pt; margin: 0 0 8pt; }
.toc ul { list-style: none; padding: 0; }
.toc li { margin: 4pt 0; display: grid; grid-template-columns: auto 1fr 30pt; column-gap: 8pt; align-items: baseline; }
.toc li.toc-h1 .toc-text { font-weight: 600; }
.toc li.toc-h2 .toc-text { margin-left: 8pt; }
.toc li.toc-h3 .toc-text { margin-left: 16pt; }
.toc .toc-dots { border-bottom: 1px dotted currentColor; height: 0.9em; transform: translateY(-0.1em); }
.toc .toc-page { text-align: right; }
.toc .toc-page::before { content: target-counter(attr(data-target), page); }
@page { @bottom-right { content: counter(page); font-size: 10pt; color: #444; } }
.doc-header-text { position: running(header); display: flex; justify-content: space-between; align-items: center; font-size: 11pt; color: #444; border-bottom: 1px solid #e5e7eb; padding-bottom: 6pt; min-height: 26pt; }
.doc-header-left { font-weight: 500; }
.doc-header-right { font-size: 10pt; color: #666; }
.doc-header-text img.logo-inline { height: 26pt; margin-right: 8pt; }
.doc-footer-text { position: running(footer); display: block; text-align: center; font-size: 10pt; color: #444; border-top: 1px solid #e5e7eb; padding-top: 6pt; }
.toc a { color: #0366d6; text-decoration: underline; }
.toc li { grid-template-columns: auto 1fr 48pt; }
.toc li.toc-h2 .toc-text { margin-left: 12pt; }
.toc li.toc-h3 .toc-text { margin-left: 24pt; }
table { max-width: 100%; box-sizing: border-box; }
tr, th, td { page-break-inside: avoid; break-inside: avoid-page; }
img, svg, canvas {
display: block;
max-width: 100%;
height: auto;
box-sizing: border-box;
page-break-inside: avoid;
break-inside: avoid-page;
}
p > img { margin: 6pt auto; }
td img, th img { max-width: 100%; height: auto; }
@page cover { size: A4; margin: 0; }
.cover { page: cover; position: relative; width: 210mm; height: 297mm; overflow: hidden; page-break-after: always; }
.cover .cover-bg { position: absolute; left: 0; top: 0; width: 100%; height: 100%; object-fit: cover; }
.cover .cover-brand { position: absolute; top: 20mm; left: 20mm; font-size: 18pt; font-weight: 700; color: #1d4ed8; }
.cover .cover-footer { position: absolute; left: 0; right: 0; bottom: 0; background: #1d4ed8; color: #fff; padding: 12mm 20mm; }
.cover .cover-title { font-size: 24pt; font-weight: 700; margin: 0; }
.cover .cover-subtitle { font-size: 13pt; margin-top: 4pt; }
.cover .cover-meta { margin-top: 8pt; font-size: 11pt; display: flex; gap: 20mm; }


@@ -0,0 +1,17 @@
{
"minio": {
"endpoint": "127.0.0.1:9000",
"public": "http://127.0.0.1:9000",
"access": "minioadmin",
"secret": "minioadmin123",
"bucket": "doctest",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true"
},
"db": {
"webhook_url": null,
"token": null
}
}

2993
docling/app/server.py Normal file

File diff suppressed because it is too large


@@ -0,0 +1 @@


@@ -0,0 +1,709 @@
from pathlib import Path
from typing import Optional, Tuple, Dict, List, Any
from urllib.parse import urlparse, unquote
import os
import re
import io
from bs4 import BeautifulSoup
from bs4.element import PageElement
import marko
import sys

try:
    _DOC_BASE = Path(__file__).resolve().parents[2] / "docling"
    p = str(_DOC_BASE)
    if p not in sys.path:
        sys.path.insert(0, p)
except Exception:
    pass

try:
    from docling.document_converter import DocumentConverter
except Exception:
    class DocumentConverter:  # type: ignore
        def __init__(self, *args, **kwargs):
            pass

        def convert(self, source):
            raise RuntimeError("docling not available")

from docx import Document
from docx.shared import Mm, Pt
from docx.enum.section import WD_SECTION
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from urllib.request import urlopen
import json

try:
    from weasyprint import HTML, CSS  # type: ignore
except Exception:
    HTML = None
    CSS = None

_mdit: Any = None
_tasklists_plugin: Any = None
_deflist_plugin: Any = None
_footnote_plugin: Any = None
_attrs_plugin: Any = None
_HAS_MD_IT: bool = False
try:
    import markdown_it as _mdit  # type: ignore
    from mdit_py_plugins.tasklists import tasklists_plugin as _tasklists_plugin  # type: ignore
    from mdit_py_plugins.deflist import deflist_plugin as _deflist_plugin  # type: ignore
    from mdit_py_plugins.footnote import footnote_plugin as _footnote_plugin  # type: ignore
    from mdit_py_plugins.attrs import attrs_plugin as _attrs_plugin  # type: ignore
    _HAS_MD_IT = True
except Exception:
    pass

converter = DocumentConverter()
LINKMAP_PATH = Path(__file__).resolve().parent.parent / "configs" / "linkmap" / "linkmap.json"
_LINKMAP: Dict[str, str] = {}


def load_linkmap() -> Dict[str, str]:
    global _LINKMAP
    try:
        if LINKMAP_PATH.exists():
            _LINKMAP = json.loads(LINKMAP_PATH.read_text("utf-8")) or {}
    except Exception:
        _LINKMAP = {}
    return _LINKMAP


def save_linkmap(mapping: Dict[str, str]) -> None:
    LINKMAP_PATH.parent.mkdir(parents=True, exist_ok=True)
    LINKMAP_PATH.write_text(json.dumps(mapping, ensure_ascii=False, indent=2), "utf-8")


load_linkmap()


def resolve_link(href: Optional[str], data_doc: Optional[str]) -> Optional[str]:
    if href:
        return href
    if not _LINKMAP:
        load_linkmap()
    if data_doc and data_doc in _LINKMAP:
        return _LINKMAP[data_doc]
    return None
def export_payload(doc, fmt: str) -> Tuple[str, str]:
    f = fmt.lower()
    if f == "markdown":
        return doc.export_to_markdown(), "text/markdown"
    if f == "html":
        return doc.export_to_html(), "text/html"
    if f == "json":
        return doc.export_to_json(), "application/json"
    if f == "doctags":
        return doc.export_to_doctags(), "application/json"
    raise ValueError("unsupported export")


def infer_basename(source_url: Optional[str], upload_name: Optional[str]) -> str:
    if source_url:
        path = urlparse(source_url).path
        name = os.path.basename(path) or "document"
        name = unquote(name)
        return os.path.splitext(name)[0] or "document"
    if upload_name:
        name = os.path.splitext(os.path.basename(upload_name))[0] or "document"
        return name
    return "document"


def sanitize_filename(name: Optional[str]) -> str:
    if not name:
        return "document"
    name = name.strip()[:128]
    name = re.sub(r'[<>:"/\\|?*\x00-\x1F]', "_", name) or "document"
    return name


def convert_source(source: str, export: str) -> Tuple[str, str]:
    result = converter.convert(source)
    return export_payload(result.document, export)
def md_to_docx_bytes(
    md: str,
    toc: bool = False,
    header_text: Optional[str] = None,
    footer_text: Optional[str] = None,
    logo_url: Optional[str] = None,
    copyright_text: Optional[str] = None,
    filename_text: Optional[str] = None,
    cover_src: Optional[str] = None,
    product_name: Optional[str] = None,
    document_name: Optional[str] = None,
    product_version: Optional[str] = None,
    document_version: Optional[str] = None,
) -> bytes:
    try:
        import logging as _log
        _log.info(f"md_to_docx_bytes start toc={toc} header={bool(header_text)} footer={bool(footer_text)} logo={bool(logo_url)} cover={bool(cover_src)}")
    except Exception:
        pass

    def _add_field(paragraph, instr: str):
        r1 = paragraph.add_run()
        b = OxmlElement('w:fldChar')
        b.set(qn('w:fldCharType'), 'begin')
        r1._r.append(b)
        r2 = paragraph.add_run()
        t = OxmlElement('w:instrText')
        t.set(qn('xml:space'), 'preserve')
        t.text = instr
        r2._r.append(t)
        r3 = paragraph.add_run()
        e = OxmlElement('w:fldChar')
        e.set(qn('w:fldCharType'), 'end')
        r3._r.append(e)

    def _available_width(section) -> int:
        return section.page_width - section.left_margin - section.right_margin

    def _fetch_bytes(u: str) -> Optional[bytes]:
        try:
            if u.lower().startswith('http://') or u.lower().startswith('https://'):
                with urlopen(u, timeout=10) as r:
                    return r.read()
            p = Path(u)
            if p.exists() and p.is_file():
                return p.read_bytes()
        except Exception:
            return None
        return None

    html = normalize_html(md, options={
        "toc": "1" if toc else "",
        "header_text": header_text,
        "footer_text": footer_text,
        "logo_url": logo_url,
        "copyright_text": copyright_text,
        "filename_text": filename_text,
        "cover_src": cover_src,
        "product_name": product_name,
        "document_name": document_name,
        "product_version": product_version,
        "document_version": document_version,
    })
    try:
        import logging as _log
        _log.info(f"md_to_docx_bytes normalize_html length={len(html)}")
    except Exception:
        pass
    soup = BeautifulSoup(html, "html.parser")
    doc = Document()
    sec0 = doc.sections[0]
    sec0.page_width = Mm(210)
    sec0.page_height = Mm(297)
    sec0.left_margin = Mm(15)
    sec0.right_margin = Mm(15)
    sec0.top_margin = Mm(20)
    sec0.bottom_margin = Mm(20)
    has_cover = bool(cover_src or (soup.find('section', class_='cover') is not None))
    if has_cover:
        sec0.left_margin = Mm(0)
        sec0.right_margin = Mm(0)
        sec0.top_margin = Mm(0)
        sec0.bottom_margin = Mm(0)
        if cover_src:
            b = _fetch_bytes(cover_src)
            if b:
                bio = io.BytesIO(b)
                doc.add_picture(bio, width=_available_width(sec0))
        if product_name:
            p = doc.add_paragraph()
            r = p.add_run(product_name)
            r.font.size = Pt(18)
            r.bold = True
        t = document_name or None
        if not t:
            h1 = soup.body.find('h1') if soup.body else soup.find('h1')
            t = h1.get_text(strip=True) if h1 else '文档'
        p2 = doc.add_paragraph()
        r2 = p2.add_run(t or '文档')
        r2.font.size = Pt(24)
        r2.bold = True
        if filename_text:
            p3 = doc.add_paragraph()
            r3 = p3.add_run(filename_text)
            r3.font.size = Pt(13)
        meta_parts = []
        if product_version:
            meta_parts.append("产品版本:" + product_version)
        if document_version:
            meta_parts.append("文档版本:" + document_version)
        if meta_parts:
            pm = doc.add_paragraph(" ".join(meta_parts))
            pm.font = None
        doc.add_section(WD_SECTION.NEW_PAGE)
        sec = doc.sections[-1]
        sec.page_width = Mm(210)
        sec.page_height = Mm(297)
        sec.left_margin = Mm(15)
        sec.right_margin = Mm(15)
        sec.top_margin = Mm(20)
        sec.bottom_margin = Mm(20)
    else:
        sec = sec0
    if header_text or logo_url or filename_text:
        hp = sec.header.add_paragraph()
        left = header_text or ''
        right = ''
        if '||' in left:
            parts = left.split('||', 1)
            left, right = parts[0], parts[1]
        elif '|' in left:
            parts = left.split('|', 1)
            left, right = parts[0], parts[1]
        if left.strip():
            hp.add_run(left.strip())
        if right.strip():
            rp = sec.header.add_paragraph()
            rp.alignment = WD_PARAGRAPH_ALIGNMENT.RIGHT
            rp.add_run(right.strip())
        elif filename_text:
            rp = sec.header.add_paragraph()
            rp.alignment = WD_PARAGRAPH_ALIGNMENT.RIGHT
            rp.add_run(filename_text)
    if footer_text or copyright_text:
        fp = sec.footer.add_paragraph()
        if footer_text:
            fp.add_run(footer_text)
        if copyright_text:
            cp = sec.footer.add_paragraph()
            cp.add_run(copyright_text)
    pn = sec.footer.add_paragraph()
    pn.alignment = WD_PARAGRAPH_ALIGNMENT.RIGHT
    _add_field(pn, 'PAGE')
    if toc:
        doc.add_paragraph('目录')
        _add_field(doc.add_paragraph(), 'TOC \\o "1-3" \\h \\z \\u')
        doc.add_page_break()
def add_inline(p, node):
if isinstance(node, str):
p.add_run(node)
return
if node.name in ['strong', 'b']:
r = p.add_run(node.get_text())
r.bold = True
return
if node.name in ['em', 'i']:
r = p.add_run(node.get_text())
r.italic = True
return
if node.name == 'code':
r = p.add_run(node.get_text())
r.font.name = 'Courier New'
return
if node.name == 'a':
text = node.get_text()
href = node.get('href')
extra = node.get('data-doc')
resolved = resolve_link(href, extra)
if resolved:
p.add_run(text + ' [' + resolved + ']')
else:
p.add_run(text)
return
if node.name == 'img':
src = node.get('src') or ''
b = _fetch_bytes(src)
if b:
bio = io.BytesIO(b)
try:
# Embed the image in the current paragraph so it stays in reading order
p.add_run().add_picture(bio, width=_available_width(sec))
except Exception:
pass
return
for c in getattr(node, 'children', []):
add_inline(p, c)
def process_block(el):
name = getattr(el, 'name', None)
if name is None:
return
cls = el.get('class') or []
if name == 'div' and 'doc-meta' in cls:
return
if name == 'section' and 'cover' in cls:
return
if name == 'nav' and 'toc' in cls:
return
if name == 'div':
for child in el.children:
process_block(child)
return
if name == 'h1':
doc.add_heading(el.get_text(), level=1)
return
if name == 'h2' or (name == 'strong' and 'subtitle' in cls):
doc.add_heading(el.get_text(), level=2)
return
if name == 'h3':
doc.add_heading(el.get_text(), level=3)
return
if name == 'p':
p = doc.add_paragraph()
for c in el.children:
add_inline(p, c)
return
if name in ['ul', 'ol']:
for li in el.find_all('li', recursive=False):
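# Ordered lists get numbered items; unordered lists keep bullets
p = doc.add_paragraph(style='List Number' if name == 'ol' else 'List Bullet')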
for c in li.children:
add_inline(p, c)
return
if name == 'pre':
code = el.get_text() or ''
p = doc.add_paragraph()
run = p.add_run(code)
run.font.name = 'Courier New'
return
if name == 'blockquote':
p = doc.add_paragraph(el.get_text())
p.paragraph_format.left_indent = Mm(10)
return
if name == 'table':
rows = []
thead = el.find('thead')
tbody = el.find('tbody')
if thead:
hdrs = [th.get_text(strip=True) for th in thead.find_all('th')]
else:
hdrs = [cell.get_text(strip=True) for cell in el.find_all('tr')[0].find_all(['th','td'])] if el.find_all('tr') else []
trs = tbody.find_all('tr') if tbody else el.find_all('tr')[1:]
for tr in trs:
tds = [td.get_text(strip=True) for td in tr.find_all('td')]
rows.append(tds)
tbl = doc.add_table(rows=1 + len(rows), cols=len(hdrs) or 1)
hdr = tbl.rows[0].cells
for k, h in enumerate(hdrs or ['']):
hdr[k].text = h
for r_idx, row in enumerate(rows):
cells = tbl.rows[1 + r_idx].cells
for c_idx in range(len(hdrs) or 1):
cells[c_idx].text = (row[c_idx] if c_idx < len(row) else '')
return
if name == 'img':
src = el.get('src') or ''
b = _fetch_bytes(src)
if b:
bio = io.BytesIO(b)
try:
doc.add_picture(bio, width=_available_width(sec))
except Exception:
pass
return
body = soup.body or soup
for el in body.children:
process_block(el)
bio = io.BytesIO()
try:
import logging as _log
_log.info("md_to_docx_bytes saving doc")
except Exception:
pass
doc.save(bio)
try:
import logging as _log
_log.info(f"md_to_docx_bytes done size={bio.tell()}")
except Exception:
pass
return bio.getvalue()
def md_to_pdf_bytes(md: str) -> bytes:
return md_to_pdf_bytes_with_renderer(md, renderer="weasyprint")
def _md_with_tables_to_html(md_text: str) -> str:
lines = md_text.splitlines()
out = []
i = 0
while i < len(lines):
line = lines[i]
def is_sep(s: str) -> bool:
s = s.strip()
if "|" not in s:
return False
s = s.strip("|")
return all(set(seg.strip()) <= set("-: ") and len(seg.strip()) >= 1 for seg in s.split("|"))
if "|" in line and i + 1 < len(lines) and is_sep(lines[i + 1]):
headers = [c.strip() for c in line.strip().strip("|").split("|")]
j = i + 2
rows = []
while j < len(lines) and "|" in lines[j]:
rows.append([c.strip() for c in lines[j].strip().strip("|").split("|")])
j += 1
tbl = ["<table>", "<thead><tr>"]
for h in headers:
tbl.append(f"<th>{h}</th>")
tbl.append("</tr></thead><tbody>")
for row in rows:
tbl.append("<tr>")
for idx in range(len(headers)):
cell = row[idx] if idx < len(row) else ""
tbl.append(f"<td>{cell}</td>")
tbl.append("</tr>")
tbl.append("</tbody></table>")
out.append("".join(tbl))
i = j
continue
out.append(line)
i += 1
return marko.convert("\n".join(out))
def _render_markdown_html(md_text: str) -> str:
if _HAS_MD_IT and _mdit is not None:
try:
md = _mdit.MarkdownIt("commonmark").enable(["table", "strikethrough"])
if _tasklists_plugin:
md.use(_tasklists_plugin)
if _deflist_plugin:
md.use(_deflist_plugin)
if _footnote_plugin:
md.use(_footnote_plugin)
if _attrs_plugin:
md.use(_attrs_plugin)
return md.render(md_text)
except Exception:
pass
return _md_with_tables_to_html(md_text)
def normalize_html(md_or_html: str, options: Optional[Dict[str, Optional[str]]] = None) -> str:
html = _render_markdown_html(md_or_html)
soup = BeautifulSoup(html, "html.parser")
for s in soup.find_all("strong", class_="subtitle"):
s.name = "h2"
s.attrs = {"data-origin": "subtitle"}
for a in soup.find_all("a"):
href_val = a.get("href")
extra_val = a.get("data-doc")
href = href_val if isinstance(href_val, str) else None
extra = extra_val if isinstance(extra_val, str) else None
resolved = resolve_link(href, extra)
if resolved:
a["href"] = resolved
elif not href and extra:
a.replace_with(a.get_text() + " [" + extra + "]")
opts = options or {}
header_text = opts.get("header_text") or None
footer_text = opts.get("footer_text") or None
logo_url = opts.get("logo_url") or None
copyright_text = opts.get("copyright_text") or None
cover_src = opts.get("cover_src") or None
product_name_opt = opts.get("product_name") or None
document_name_opt = opts.get("document_name") or None
product_version_opt = opts.get("product_version") or None
document_version_opt = opts.get("document_version") or None
toc_flag = bool(opts.get("toc"))
meta = soup.new_tag("div", attrs={"class": "doc-meta"})
if header_text:
ht = soup.new_tag("div", attrs={"class": "doc-header-text"})
text = header_text
left = text
right = ""
if "||" in text:
parts = text.split("||", 1)
left, right = parts[0], parts[1]
elif "|" in text:
parts = text.split("|", 1)
left, right = parts[0], parts[1]
if logo_url:
img = soup.new_tag("img", attrs={"class": "logo-inline", "src": logo_url})
ht.append(img)
hl = soup.new_tag("span", attrs={"class": "doc-header-left"})
hl.string = left
ht.append(hl)
if right.strip():
hr = soup.new_tag("span", attrs={"class": "doc-header-right"})
hr.string = right
ht.append(hr)
meta.append(ht)
else:
first_h1 = None
if soup.body:
first_h1 = soup.body.find("h1")
else:
first_h1 = soup.find("h1")
left = (first_h1.get_text(strip=True) if first_h1 else "文档")
right = opts.get("filename_text") or ""
ht = soup.new_tag("div", attrs={"class": "doc-header-text"})
if logo_url:
img = soup.new_tag("img", attrs={"class": "logo-inline", "src": logo_url})
ht.append(img)
hl = soup.new_tag("span", attrs={"class": "doc-header-left"})
hl.string = left
ht.append(hl)
if right:
hr = soup.new_tag("span", attrs={"class": "doc-header-right"})
hr.string = right
ht.append(hr)
meta.append(ht)
if footer_text:
ft = soup.new_tag("div", attrs={"class": "doc-footer-text"})
ft.string = footer_text
meta.append(ft)
page_header_val = (header_text or (document_name_opt or None))
if not page_header_val:
first_h1_for_header = None
if soup.body:
first_h1_for_header = soup.body.find("h1")
else:
first_h1_for_header = soup.find("h1")
page_header_val = (first_h1_for_header.get_text(strip=True) if first_h1_for_header else "文档")
page_footer_val = (footer_text or "FunMD")
ph = soup.new_tag("div", attrs={"class": "doc-page-header"})
if logo_url:
logo_inline = soup.new_tag("img", attrs={"src": logo_url, "class": "doc-page-header-logo"})
ph.append(logo_inline)
ht_inline = soup.new_tag("span", attrs={"class": "doc-page-header-text"})
ht_inline.string = page_header_val
ph.append(ht_inline)
meta.append(ph)
pf = soup.new_tag("div", attrs={"class": "doc-page-footer"})
pf.string = page_footer_val
meta.append(pf)
if copyright_text:
cp = soup.new_tag("div", attrs={"class": "doc-copyright"})
cp.string = copyright_text
meta.append(cp)
# brand logo is rendered inline within header; no separate top-left element
if soup.body:
soup.body.insert(0, meta)
else:
soup.insert(0, meta)
if not soup.head:
head = soup.new_tag("head")
soup.insert(0, head)
else:
head = soup.head
style_run = soup.new_tag("style")
style_run.string = "@page{margin:20mm}@page{\n @top-center{content: element(page-header)}\n @bottom-center{content: element(page-footer)}\n}\n.doc-page-header{position: running(page-header); font-size:10pt; color:#666; display:block; text-align:center; width:100%}\n.doc-page-header::after{content:''; display:block; width:80%; border-bottom:1px solid #d9d9d9; margin:4px auto 0}\n.doc-page-header-logo{height:20px; vertical-align:middle; margin-right:4px}\n.doc-page-header-text{vertical-align:middle}\n.doc-page-footer{position: running(page-footer); font-size:10pt; color:#666}\n.doc-page-footer::before{content:''; display:block; width:80%; border-top:1px solid #d9d9d9; margin:0 auto 4px}"
head.append(style_run)
# Fallback inline styles for cover to ensure visibility even if external CSS isn't loaded
if (cover_src or product_name_opt or document_name_opt or product_version_opt or document_version_opt):
if not soup.head:
head = soup.new_tag("head")
soup.insert(0, head)
else:
head = soup.head
style = soup.new_tag("style")
style.string = "@page:first{margin:0} html,body{margin:0;padding:0}.cover{position:relative;width:210mm;height:297mm;overflow:hidden;page-break-after:always}.cover .cover-bg{position:absolute;left:0;top:0;right:0;bottom:0;width:100%;height:100%;object-fit:cover;display:block}.cover .cover-brand{position:absolute;top:20mm;left:20mm;font-size:18pt;font-weight:700;color:#1d4ed8}.cover .cover-footer{position:absolute;left:0;right:0;bottom:0;background:#1d4ed8;color:#fff;padding:12mm 20mm}.cover .cover-title{font-size:24pt;font-weight:700;margin:0}.cover .cover-subtitle{font-size:13pt;margin-top:4pt}.cover .cover-meta{margin-top:8pt;font-size:11pt;display:flex;gap:20mm}"
head.append(style)
if cover_src or product_name_opt or document_name_opt or product_version_opt or document_version_opt:
cov = soup.new_tag("section", attrs={"class": "cover"})
if cover_src:
bg = soup.new_tag("img", attrs={"class": "cover-bg", "src": cover_src})
cov.append(bg)
if product_name_opt:
brand_el = soup.new_tag("div", attrs={"class": "cover-brand"})
brand_el.string = product_name_opt
cov.append(brand_el)
footer = soup.new_tag("div", attrs={"class": "cover-footer"})
title_text = document_name_opt or None
if not title_text:
first_h1 = soup.body.find("h1") if soup.body else soup.find("h1")
if first_h1:
title_text = first_h1.get_text(strip=True)
title_el = soup.new_tag("div", attrs={"class": "cover-title"})
title_el.string = title_text or "文档"
footer.append(title_el)
subtitle_val = opts.get("filename_text") or ""
if subtitle_val:
subtitle_el = soup.new_tag("div", attrs={"class": "cover-subtitle"})
subtitle_el.string = subtitle_val
footer.append(subtitle_el)
meta_el = soup.new_tag("div", attrs={"class": "cover-meta"})
if product_version_opt:
pv = soup.new_tag("span")
pv.string = f"产品版本:{product_version_opt}"
meta_el.append(pv)
if document_version_opt:
dv = soup.new_tag("span")
dv.string = f"文档版本:{document_version_opt}"
meta_el.append(dv)
footer.append(meta_el)
cov.append(footer)
if soup.body:
soup.body.insert(1, cov)
else:
soup.insert(1, cov)
if toc_flag:
headings = [
el for el in (soup.find_all(["h1", "h2", "h3"]) or [])
if el.get("data-origin") != "subtitle"
]
if headings:
ul = soup.new_tag("ul")
idx = 1
for el in headings:
text = el.get_text(strip=True)
if not text:
continue
hid = el.get("id")
if not hid:
hid = f"sec-{idx}"
el["id"] = hid
idx += 1
li = soup.new_tag("li", attrs={"class": f"toc-{el.name}"})
a = soup.new_tag("a", attrs={"href": f"#{hid}", "class": "toc-text"})
a.string = text
dots = soup.new_tag("span", attrs={"class": "toc-dots"})
page = soup.new_tag("span", attrs={"class": "toc-page", "data-target": f"#{hid}"})
li.append(a)
li.append(dots)
li.append(page)
ul.append(li)
nav = soup.new_tag("nav", attrs={"class": "toc"})
h = soup.new_tag("h1")
h.string = "目录"
nav.append(h)
nav.append(ul)
if soup.body:
soup.body.insert(2, nav)
else:
soup.insert(2, nav)
if soup.body:
for h in soup.body.find_all(["h1", "h2", "h3"]):
sib: Optional[PageElement] = h.find_next_sibling()
blocks: List[Any] = []
first_table: Optional[Any] = None
while sib is not None:
# Skip pure whitespace nodes
if getattr(sib, "name", None) is None:
try:
if str(sib).strip() == "":
sib = sib.next_sibling
continue
except Exception:
break
# Stop if next heading encountered
name = getattr(sib, "name", None)
if name in ["h1", "h2", "h3"]:
break
# Collect explanatory blocks until first table
if name == "table":
first_table = sib
break
if name in ["p", "blockquote", "ul", "ol"]:
blocks.append(sib)
sib = sib.next_sibling
continue
# Unknown block: stop grouping to avoid wrapping unrelated content
break
if first_table is not None:
wrap = soup.new_tag("div", attrs={"class": "table-block"})
h.insert_before(wrap)
wrap.append(h.extract())
for el in blocks:
wrap.append(el.extract())
wrap.append(first_table.extract())
return str(soup)
def _stylesheets_for(css_name: Optional[str], css_text: Optional[str]):
sheets: List[Any] = []
if CSS is None:
return sheets
if css_text:
sheets.append(CSS(string=css_text))
if css_name:
css_path = Path(__file__).resolve().parent.parent / "configs" / "styles" / f"{css_name}.css"
if css_path.exists():
sheets.append(CSS(filename=str(css_path)))
return sheets
def md_to_pdf_bytes_with_renderer(md: str, renderer: str = "weasyprint", css_name: Optional[str] = None, css_text: Optional[str] = None, toc: bool = False, header_text: Optional[str] = None, footer_text: Optional[str] = None, logo_url: Optional[str] = None, copyright_text: Optional[str] = None, filename_text: Optional[str] = None, cover_src: Optional[str] = None, product_name: Optional[str] = None, document_name: Optional[str] = None, product_version: Optional[str] = None, document_version: Optional[str] = None) -> bytes:
html = normalize_html(md, options={
"toc": "1" if toc else "",
"header_text": header_text,
"footer_text": footer_text,
"logo_url": logo_url,
"copyright_text": copyright_text,
"filename_text": filename_text,
"cover_src": cover_src,
"product_name": product_name,
"document_name": document_name,
"product_version": product_version,
"document_version": document_version,
})
if HTML is not None:
stylesheets = _stylesheets_for(css_name, css_text)
pdf_bytes = HTML(string=html).write_pdf(stylesheets=stylesheets or None)
return pdf_bytes
raise RuntimeError("WeasyPrint is not available")
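For reference, the separator-row check that `_md_with_tables_to_html` relies on can be sketched in isolation. This is a minimal standalone sketch, not the project's code: the name `is_separator_row` is hypothetical, and it assumes the same rule as the nested `is_sep` helper (every cell between pipes is non-empty and made only of `-`, `:` and spaces).

```python
def is_separator_row(line: str) -> bool:
    """Return True if `line` looks like a Markdown table separator, e.g. |---|:--:|."""
    s = line.strip()
    if "|" not in s:
        # A separator row must contain at least one pipe.
        return False
    cells = s.strip("|").split("|")
    # Each cell must be non-empty and consist only of dashes, colons, or spaces.
    return all(c.strip() and set(c.strip()) <= set("-: ") for c in cells)


if __name__ == "__main__":
    for sample in ["| --- | :---: |", "| a | b |", "---"]:
        print(repr(sample), is_separator_row(sample))
```

Under this rule a cell containing only `:` also passes, which is looser than the GFM table rule; a stricter check would likely require at least one `-` per cell.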


@@ -0,0 +1,190 @@
from typing import Optional, Tuple, Dict
import os
import logging
from urllib.request import urlopen
try:
from minio import Minio # type: ignore
import urllib3 # type: ignore
except Exception:
Minio = None
urllib3 = None # type: ignore
def minio_head_bucket(client: object, bucket: str) -> bool:
try:
if hasattr(client, "bucket_exists"):
try:
return bool(client.bucket_exists(bucket)) # type: ignore
except Exception:
pass
try:
region = client._get_region(bucket) # type: ignore
except Exception:
region = "us-east-1"
client._url_open(method="HEAD", region=region, bucket_name=bucket) # type: ignore
return True
except Exception:
try:
names = [getattr(b, "name", None) for b in client.list_buckets()] # type: ignore
return bucket in set(n for n in names if n)
except Exception:
return False
def minio_create_bucket(client: object, bucket: str) -> bool:
try:
if hasattr(client, "bucket_exists"):
try:
if client.bucket_exists(bucket): # type: ignore
return True
except Exception:
pass
if hasattr(client, "make_bucket"):
try:
client.make_bucket(bucket) # type: ignore
return True
except Exception:
try:
region = client._get_region(bucket) # type: ignore
except Exception:
region = "us-east-1"
try:
client.make_bucket(bucket, location=region) # type: ignore
return True
except Exception:
pass
try:
try:
region = client._get_region(bucket) # type: ignore
except Exception:
region = "us-east-1"
client._url_open(method="PUT", region=region, bucket_name=bucket) # type: ignore
return True
except Exception as ce:
if "BucketAlreadyOwnedByYou" in str(ce) or "BucketAlreadyExists" in str(ce):
return True
raise
except Exception as e:
raise e
def minio_client(endpoint: str, access: str, secret: str, secure: bool):
if urllib3 is not None:
try:
http = urllib3.PoolManager(timeout=urllib3.Timeout(connect=3.0, read=20.0))
return Minio(endpoint=endpoint, access_key=access, secret_key=secret, secure=secure, http_client=http) # type: ignore
except Exception:
return Minio(endpoint=endpoint, access_key=access, secret_key=secret, secure=secure) # type: ignore
return Minio(endpoint=endpoint, access_key=access, secret_key=secret, secure=secure) # type: ignore
def minio_time_hint(endpoint: str, secure: bool) -> Optional[str]:
try:
scheme = "https" if secure else "http"
r = urlopen(f"{scheme}://{endpoint}", timeout=3)
srv_date = r.headers.get("Date")
if not srv_date:
return None
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone
dt = parsedate_to_datetime(srv_date)
now = datetime.now(timezone.utc)
diff = abs((now - dt).total_seconds())
return f"服务器时间与本机相差约 {int(diff)} 秒"
except Exception:
return None
def join_prefix(prefix: str, rel: str) -> str:
pre = (prefix or "").strip("/")
r = rel.lstrip("/")
if pre and r.startswith(pre + "/"):
return r
return f"{pre}/{r}" if pre else r
def presigned_read(client: object, bucket: str, obj: str, expires_seconds: int) -> Optional[str]:
try:
from datetime import timedelta
exp = expires_seconds
try:
exp = int(exp)
except Exception:
pass
td = timedelta(seconds=exp)
try:
return client.get_presigned_url("GET", bucket, obj, expires=td) # type: ignore
except Exception:
return client.presigned_get_object(bucket, obj, expires=td) # type: ignore
except Exception:
return None
def minio_current(runtime_cfg: Dict[str, Dict[str, Optional[str]]]) -> Tuple[Optional[object], Optional[str], Optional[str], str]:
rc = runtime_cfg.get("minio", {})
endpoint_raw = rc.get("endpoint") or os.environ.get("MINIO_ENDPOINT")
access_raw = rc.get("access") or os.environ.get("MINIO_ACCESS_KEY")
secret_raw = rc.get("secret") or os.environ.get("MINIO_SECRET_KEY")
bucket_raw = rc.get("bucket") or os.environ.get("MINIO_BUCKET")
secure_flag = rc.get("secure") or os.environ.get("MINIO_SECURE", "false")
secure = str(secure_flag or "false").lower() in {"1","true","yes","on"}
public_raw = rc.get("public") or os.environ.get("MINIO_PUBLIC_ENDPOINT")
endpoint = (str(endpoint_raw).strip() if endpoint_raw else None)
try:
if isinstance(endpoint, str) and ":9001" in endpoint:
h = endpoint.split("/")[0]
if ":" in h:
parts = h.split(":")
endpoint = f"{parts[0]}:9000"
else:
endpoint = h
except Exception:
endpoint = endpoint
access = (str(access_raw).strip() if access_raw else None)
secret = (str(secret_raw).strip() if secret_raw else None)
bucket = (str(bucket_raw).strip() if bucket_raw else None)
public_base = (str(public_raw).strip() if public_raw else None)
try:
if isinstance(public_base, str) and (":9001" in public_base or "/browser" in public_base or "/minio" in public_base):
# Drop any scheme first so "http://host:9001/browser" yields "host", not "http:"
host = public_base.strip().split("://", 1)[-1].split("/")[0]
scheme = "https" if secure else "http"
base_host = host.split(":")[0]
public_base = f"{scheme}://{base_host}:9000"
except Exception:
public_base = public_base
if not public_base and endpoint:
public_base = f"https://{endpoint}" if secure else f"http://{endpoint}"
missing = []
if Minio is None:
missing.append("client")
if not endpoint:
missing.append("endpoint")
if not access:
missing.append("access")
if not secret:
missing.append("secret")
if not bucket:
missing.append("bucket")
if not public_base:
missing.append("public")
if missing:
try:
logging.error(f"minio config invalid: missing={missing}")
except Exception:
pass
return None, None, None, ""
client = minio_client(endpoint=endpoint, access=access, secret=secret, secure=secure)
try:
try:
client.list_buckets() # type: ignore
except Exception as e:
if secure and ("SSL" in str(e) or "HTTPSConnectionPool" in str(e) or "SSLError" in str(e)):
client = minio_client(endpoint=endpoint, access=access, secret=secret, secure=False)
except Exception:
pass
try:
exists = minio_head_bucket(client, bucket)
if not exists:
minio_create_bucket(client, bucket)
except Exception:
pass
prefix = rc.get("prefix") or os.environ.get("MINIO_PREFIX", "")
return client, bucket, public_base, prefix
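The console-URL handling in `minio_current` (port 9001, `/browser`, `/minio`) could also be sketched with `urllib.parse.urlsplit` instead of manual string splitting. This is a hedged sketch only: `to_api_base` and its `api_port` parameter are hypothetical names, and it assumes, as the code above does, that the API listens on port 9000 and that the scheme follows the `secure` flag rather than the input URL.

```python
from urllib.parse import urlsplit


def to_api_base(public: str, secure: bool, api_port: int = 9000) -> str:
    """Map a MinIO console-style address to an API base URL (hypothetical helper)."""
    raw = public.strip()
    # urlsplit only recognises the host when a scheme or a leading "//" is present.
    parts = urlsplit(raw if "://" in raw else "//" + raw)
    host = parts.hostname or ""
    scheme = "https" if secure else "http"
    return f"{scheme}://{host}:{api_port}"


if __name__ == "__main__":
    print(to_api_base("http://minio.local:9001/browser", secure=False))
    print(to_api_base("minio.local:9001", secure=True))
```

Parsing with `urlsplit` handles inputs with and without a scheme in one path; IPv6 literals would need their brackets restored before formatting, which this sketch does not attempt.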


@@ -0,0 +1,492 @@
from pathlib import Path
from typing import Optional, Tuple
import re
import tempfile
import sys
from urllib.parse import urlsplit
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
import io
_DOC_AVAILABLE = True
try:
_DOC_BASE = Path(__file__).resolve().parents[2] / "docling"
p = str(_DOC_BASE)
if p not in sys.path:
sys.path.insert(0, p)
except Exception:
pass
try:
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.document_converter import PdfFormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling_core.types.doc import ImageRefMode
except Exception:
_DOC_AVAILABLE = False
class DocumentConverter: # type: ignore
def __init__(self, *args, **kwargs):
pass
def convert(self, source):
raise RuntimeError("docling unavailable")
class InputFormat: # type: ignore
PDF = "pdf"
class PdfFormatOption: # type: ignore
def __init__(self, *args, **kwargs):
pass
class StandardPdfPipeline: # type: ignore
pass
class PdfPipelineOptions: # type: ignore
def __init__(self):
pass
class ImageRefMode: # type: ignore
EMBEDDED = None
"""
@api Unified Converter Service
@description Provides core document conversion logic unifying Docling and word2markdown engines
"""
_W2M_AVAILABLE = False
try:
from app.services.word2markdown import convert_any as _w2m_convert_any # type: ignore
_W2M_AVAILABLE = True
except Exception:
_W2M_AVAILABLE = False
try:
from bs4 import BeautifulSoup # type: ignore
except Exception:
BeautifulSoup = None # type: ignore
try:
from app.services.docling_adapter import normalize_html as _normalize_html # type: ignore
from app.services.docling_adapter import resolve_link as _resolve_link # type: ignore
from app.services.docling_adapter import _render_markdown_html as _render_md_html # type: ignore
except Exception:
_normalize_html = None # type: ignore
_resolve_link = None # type: ignore
_render_md_html = None # type: ignore
def _is_http(s: str) -> bool:
t = (s or "").lower()
return t.startswith("http://") or t.startswith("https://")
def _read_bytes(source: str) -> Tuple[bytes, str]:
ct = ""
try:
if _is_http(source):
from urllib.request import urlopen
with urlopen(source, timeout=10) as r:
ct = r.headers.get("Content-Type") or ""
return r.read() or b"", ct
p = Path(source)
if p.exists() and p.is_file():
return p.read_bytes(), ct
except Exception:
return b"", ct
return b"", ct
def _decode_to_utf8(raw: bytes, ct: str = "") -> str:
if not raw:
return ""
if raw.startswith(b"\xef\xbb\xbf"):
try:
return raw[3:].decode("utf-8")
except Exception:
pass
if raw.startswith(b"\xff\xfe"):
try:
return raw[2:].decode("utf-16le")
except Exception:
pass
if raw.startswith(b"\xfe\xff"):
try:
return raw[2:].decode("utf-16be")
except Exception:
pass
try:
m = re.search(r"charset=([\w-]+)", ct or "", re.IGNORECASE)
if m:
enc = m.group(1).strip().lower()
try:
return raw.decode(enc)
except Exception:
pass
except Exception:
pass
candidates = [
"utf-8", "gb18030", "gbk", "big5", "shift_jis", "iso-8859-1", "windows-1252",
]
for enc in candidates:
try:
return raw.decode(enc)
except Exception:
continue
return raw.decode("utf-8", errors="replace")
def _normalize_newlines(s: str) -> str:
return (s or "").replace("\r\n", "\n").replace("\r", "\n")
def _html_to_markdown(html: str) -> str:
if not html:
return ""
if BeautifulSoup is None:
return html
soup = BeautifulSoup(html, "html.parser")
out: list[str] = []
def txt(node) -> str:
return (getattr(node, "get_text", lambda **kwargs: str(node))(strip=True) if node else "")
def inline(node) -> str:
if isinstance(node, str):
return node
name = getattr(node, "name", None)
if name in {None}: # type: ignore
return str(node)
if name in {"strong", "b"}:
return "**" + txt(node) + "**"
if name in {"em", "i"}:
return "*" + txt(node) + "*"
if name == "code":
return "`" + txt(node) + "`"
if name == "a":
href_val = node.get("href")
extra_val = node.get("data-doc")
href = href_val if isinstance(href_val, str) else None
extra = extra_val if isinstance(extra_val, str) else None
resolved = _resolve_link(href, extra) if _resolve_link else (href or extra)
url = resolved or ""
text = txt(node)
if url:
return f"[{text}]({url})"
return text
if name == "img":
alt = node.get("alt") or "image"
src = node.get("src") or ""
return f"![{alt}]({src})"
res = []
for c in getattr(node, "children", []):
res.append(inline(c))
return "".join(res)
def block(node):
name = getattr(node, "name", None)
if name is None:
s = str(node).strip()
if s:
out.append(s)
return
if name in {"h1", "h2", "h3", "h4", "h5", "h6"}:
lvl = int(name[1])
out.append("#" * lvl + " " + txt(node))
out.append("")
return
if name == "p":
segs = [inline(c) for c in node.children]
out.append("".join(segs))
out.append("")
return
if name == "br":
out.append("")
return
if name in {"ul", "ol"}:
is_ol = name == "ol"
idx = 1
for li in node.find_all("li", recursive=False):
text = "".join(inline(c) for c in li.children)
if is_ol:
out.append(f"{idx}. {text}")
idx += 1
else:
out.append(f"- {text}")
out.append("")
return
if name == "pre":
code_node = node.find("code")
code_text = code_node.get_text() if code_node else node.get_text()
lang = ""
cls = (code_node.get("class") if code_node else node.get("class")) or []
for c in cls:
s = str(c)
if s.startswith("language-"):
lang = s.split("-", 1)[-1]
break
out.append(f"```{lang}\n{code_text}\n```\n")
return
if name == "blockquote":
lines = [l for l in txt(node).splitlines() if l.strip()]
for l in lines:
out.append("> " + l)
out.append("")
return
if name == "table":
rows = node.find_all("tr")
if not rows:
return
headers = [h.get_text(strip=True) for h in (rows[0].find_all(["th","td"]) or [])]
if headers:
out.append("|" + "|".join(headers) + "|")
sep = "|" + "|".join(["---" for _ in headers]) + "|"
out.append(sep)
for tr in rows[1:]:
cells = [td.get_text(strip=True) for td in tr.find_all("td")]
if cells:
out.append("|" + "|".join(cells) + "|")
out.append("")
return
if name == "div":
for c in node.children:
block(c)
return
segs = [inline(c) for c in node.children]
if segs:
out.append("".join(segs))
out.append("")
root = soup.body or soup
for ch in getattr(root, "children", []):
block(ch)
return _normalize_newlines("\n".join(out)).strip()
def _lower_html_table_tags(html: str) -> str:
"""
@function _lower_html_table_tags
@description Normalizes HTML table tags to lowercase
@param html Input HTML string
@return Normalized HTML string
"""
if not html:
return html
tags = ["TABLE", "THEAD", "TBODY", "TFOOT", "TR", "TH", "TD"]
out = html
for t in tags:
out = re.sub(r"</?" + t + r"\b", lambda m: m.group(0).lower(), out)
out = re.sub(r">\s*\n+\s*", ">\n", out)
return out
def _replace_admonitions(md: str) -> str:
"""
@function _replace_admonitions
@description Replaces ::: style admonitions with !!! style
@param md Input markdown string
@return Processed markdown string
"""
if not md:
return md
lines = md.split("\n")
out = []
in_block = False
for raw in lines:
t = raw.strip()
if t.startswith(":::"):
if not in_block:
name = t[3:].strip()
if not name:
out.append("!!!")
else:
out.append("!!! " + name)
in_block = True
else:
out.append("!!!")
in_block = False
continue
out.append(raw)
return "\n".join(out)
def _enhance_codeblocks(md: str) -> str:
if not md:
return md
lines = md.split("\n")
res = []
in_fence = False
fence_lang = ""
i = 0
while i < len(lines):
line = lines[i]
t = line.strip()
if t.startswith("```"):
in_fence = not in_fence
try:
fence_lang = (t[3:] or "").strip() if in_fence else ""
except Exception:
fence_lang = ""
res.append(line)
i += 1
continue
if in_fence:
res.append(line)
i += 1
continue
if t.startswith("{") or t.startswith("["):
buf = [line]
j = i + 1
closed = False
depth = t.count("{") - t.count("}")
while j < len(lines):
buf.append(lines[j])
s = lines[j].strip()
depth += s.count("{") - s.count("}")
if depth <= 0 and s.endswith("}"):
closed = True
break
j += 1
if closed and len(buf) >= 3:
lang = "json"
res.append("```" + lang)
res.extend(buf)
res.append("```")
i = j + 1
continue
code_sig = (
("public static" in t) or ("private static" in t) or ("class " in t) or ("return " in t) or ("package " in t) or ("import " in t)
)
if code_sig:
buf = [line]
j = i + 1
while j < len(lines):
s = lines[j].strip()
if not s:
break
if s.startswith("# ") or s.startswith("## ") or s.startswith("### "):
break
buf.append(lines[j])
j += 1
if len(buf) >= 3:
res.append("```")
res.extend(buf)
res.append("```")
i = j + 1
continue
res.append(line)
i += 1
return "\n".join(res)
class FormatConverter:
"""
@class FormatConverter
@description Unified converter class wrapping Docling and word2markdown
"""
def __init__(self) -> None:
self._docling = DocumentConverter()
def convert(self, source: str, export: str = "markdown", engine: Optional[str] = None, mdx_safe_mode_enabled: bool = True) -> Tuple[str, str, Optional[str]]:
"""
@function convert
@description Convert a document source to specified format
@param source Path or URL to source document
@param export Output format (markdown, html, json, doctags)
@param engine Optional engine override (word2markdown/docling)
@param mdx_safe_mode_enabled Toggle safe mode for MDX
@return Tuple of (encoding, content, artifacts_dir); artifacts_dir is None unless the Docling pipeline ran
"""
# Prefer custom word2markdown engine for DOC/DOCX when available
auto_engine = None
try:
from pathlib import Path as _P
suf = _P(source).suffix.lower()
if not engine and suf in {".doc", ".docx"} and _W2M_AVAILABLE:
auto_engine = "word2markdown"
except Exception:
auto_engine = None
use_engine = (engine or auto_engine or "").lower()
try:
from urllib.parse import urlsplit
path = source
if _is_http(source):
path = urlsplit(source).path or ""
ext = Path(path).suffix.lower()
except Exception:
ext = Path(source).suffix.lower()
if ext in {".txt"}:
raw, ct = _read_bytes(source)
text = _normalize_newlines(_decode_to_utf8(raw, ct))
if export.lower() == "html":
if _render_md_html is not None:
html = _render_md_html(text)
else:
try:
import marko
html = marko.convert(text)
except Exception:
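# Escape the raw text so "<" and "&" cannot break the fallback markup
from html import escape as _escape
html = f"<pre>{_escape(text)}</pre>"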
return "utf-8", _lower_html_table_tags(html), None
md = _enhance_codeblocks(text)
return "utf-8", md, None
if ext in {".md"}:
raw, ct = _read_bytes(source)
text = _normalize_newlines(_decode_to_utf8(raw, ct))
if export.lower() == "html":
if _render_md_html is not None:
html = _render_md_html(text)
else:
try:
import marko
html = marko.convert(text)
except Exception:
html = text
return "utf-8", _lower_html_table_tags(html), None
return "utf-8", text, None
if ext in {".html", ".htm"}:
try:
conv = DocumentConverter(allowed_formats=[InputFormat.HTML])
result = conv.convert(source)
if export.lower() == "html":
html = result.document.export_to_html()
html = _lower_html_table_tags(html)
return "utf-8", html, None
md = result.document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
md = _replace_admonitions(md)
md = _enhance_codeblocks(md)
return "utf-8", md, None
except Exception:
raw, ct = _read_bytes(source)
html_in = _normalize_newlines(_decode_to_utf8(raw, ct))
if export.lower() == "html":
html = _normalize_html(html_in) if _normalize_html is not None else html_in
return "utf-8", _lower_html_table_tags(html), None
md = _html_to_markdown(html_in)
md = _replace_admonitions(md)
md = _enhance_codeblocks(md)
return "utf-8", md, None
if use_engine in {"pandoc", "custom", "word2markdown"} and _W2M_AVAILABLE:
enc, md = _w2m_convert_any(Path(source), mdx_safe_mode_enabled=mdx_safe_mode_enabled)
md = _replace_admonitions(md)
md = _enhance_codeblocks(md)
return enc or "utf-8", md, None
# Configure PDF pipeline to generate picture images into a per-call artifacts directory
artifacts_dir = tempfile.mkdtemp(prefix="docling_artifacts_")
pdf_opts = PdfPipelineOptions()
pdf_opts.generate_picture_images = True
pdf_opts.generate_page_images = True
pdf_opts.images_scale = 2.0
pdf_opts.do_code_enrichment = True
pdf_opts.do_formula_enrichment = True
self._docling = DocumentConverter(
format_options={
InputFormat.PDF: PdfFormatOption(
pipeline_cls=StandardPdfPipeline,
pipeline_options=pdf_opts,
)
}
)
result = self._docling.convert(source)
if export.lower() == "markdown":
md = result.document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
md = _replace_admonitions(md)
md = _enhance_codeblocks(md)
return "utf-8", md, artifacts_dir
if export.lower() == "html":
html = result.document.export_to_html()
html = _lower_html_table_tags(html)
return "utf-8", html, artifacts_dir
if export.lower() == "json":
js = result.document.export_to_json()
return "utf-8", js, artifacts_dir
if export.lower() == "doctags":
dt = result.document.export_to_doctags()
return "utf-8", dt, artifacts_dir
raise RuntimeError("unsupported export")
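
The branch selection above depends on reading the file extension from either a local path or an http(s) URL. The following is a minimal sketch of that step only, not the project's implementation: `source_extension` and `pick_branch` are hypothetical names, the URL test is a plain prefix check standing in for `_is_http`, and the word2markdown engine branch is left out.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def source_extension(source: str) -> str:
    """Return the lowercase file suffix of a local path or an http(s) URL."""
    # For URLs, look only at the path component so that a query string such as
    # "?v=1" or a fragment does not end up inside the suffix.
    if source.lower().startswith(("http://", "https://")):
        source = urlsplit(source).path or ""
    return PurePosixPath(source).suffix.lower()


def pick_branch(source: str) -> str:
    """Map a source to the name of the branch that would handle it (illustrative)."""
    ext = source_extension(source)
    if ext == ".txt":
        return "plain-text"
    if ext == ".md":
        return "markdown"
    if ext in {".html", ".htm"}:
        return "html"
    # Everything else falls through to the Docling pipeline.
    return "docling"
```

Taking the suffix from the URL path rather than from the whole URL is what keeps `report.html?v=1` on the HTML branch.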


@@ -0,0 +1,429 @@
from pathlib import Path
from typing import Tuple, List
from docx import Document
from docx.table import Table
from docx.text.paragraph import Paragraph
import re
import base64
import hashlib
import tempfile
import subprocess
from lxml import etree
def _iter_blocks(doc: Document):
    """Yield the top-level paragraphs and tables of the document body in order."""
    parent = doc
    parent_elm = parent.element.body
    for child in parent_elm.iterchildren():
        # Tags are namespace-qualified ("{ns}p"); keep only the local name.
        tag = child.tag.split('}')[-1]
        if tag == 'p':
            yield Paragraph(child, parent)
        elif tag == 'tbl':
            yield Table(child, parent)


def _cell_text(cell) -> str:
    """Return the text of a table cell, one line per paragraph."""
    return "\n".join(p.text or "" for p in cell.paragraphs)
def _guess_lang(text: str) -> str:
    """Guess a fence language from the first 512 characters; "" if unknown."""
    t = (text or "").strip()
    head = t[:512]
    # \b is attached per keyword rather than around each whole group: tokens that
    # begin or end with punctuation ("@Override", "print(", "=>") have no word
    # boundary on that side, so a shared \b(...)\b wrapper keeps them from matching.
    if re.search(r"\b(package|import\s+java\.|public\s+class|public\s+static|private\s+static)\b|@Override\b", head):
        return "java"
    if re.search(r"\b(def\s+\w+\(|print\(|import\s+\w+|from\s+\w+\s+import\b)", head):
        return "python"
    if re.search(r"\b(function\s+\w+\(|console\.log\b|let\s+\w+|const\s+\w+)|=>", head):
        return "javascript"
    # C++ is tested before C so that "#include" plus "std::" is not reported as C.
    if re.search(r"\busing\s+namespace\b|\bstd::|\btemplate\b", head):
        return "cpp"
    if re.search(r"^\s*#include|\bint\s+main\s*\(", head, re.MULTILINE):
        return "c"
    if re.search(r"\b(SELECT|INSERT|UPDATE|DELETE|CREATE\s+TABLE|DROP\s+TABLE|ALTER\s+TABLE)\b", head, re.IGNORECASE):
        return "sql"
    if head.startswith("{") or head.startswith("["):
        return "json"
    if re.search(r"<html|<div|<span|<table|<code|<pre", head, re.IGNORECASE):
        return "html"
    if re.search(r"<\?xml|</?[A-Za-z0-9:_-]+>", head):
        return "xml"
    return ""
def _table_to_md(tbl: Table) -> str:
rows = tbl.rows
cols = tbl.columns
if len(rows) == 1 and len(cols) == 1:
txt = _cell_text(rows[0].cells[0]).strip()
lang = _guess_lang(txt)
return f"```{lang}\n{txt}\n```\n"
def _cell_inline_md(doc: Document, paragraph: Paragraph) -> str:
el = paragraph._element
parts: List[str] = []
try:
for ch in el.iterchildren():
tag = ch.tag.split('}')[-1]
if tag == 'r':
for rc in ch.iterchildren():
rtag = rc.tag.split('}')[-1]
if rtag == 't':
s = rc.text or ''
if s:
parts.append(s)
elif rtag == 'br':
parts.append('\n')
elif rtag == 'drawing':
try:
for node in rc.iter():
local = node.tag.split('}')[-1]
rid = None
if local == 'blip':
rid = node.get(f"{{{NS['r']}}}embed") or node.get(f"{{{NS['r']}}}link")
elif local == 'imagedata':
rid = node.get(f"{{{NS['r']}}}id")
if not rid:
continue
try:
part = None
rp = getattr(doc.part, 'related_parts', None)
if isinstance(rp, dict) and rid in rp:
part = rp.get(rid)
if part is None:
rels = getattr(doc.part, 'rels', None)
if rels is not None and hasattr(rels, 'get'):
rel = rels.get(rid)
part = getattr(rel, 'target_part', None)
if part is None:
rel = getattr(doc.part, '_rels', {}).get(rid)
part = getattr(rel, 'target_part', None)
ct = getattr(part, 'content_type', '') if part is not None else ''
data = part.blob if part is not None and hasattr(part, 'blob') else None
if data:
b64 = base64.b64encode(data).decode('ascii')
parts.append(f"![Image](data:{ct};base64,{b64})")
except Exception:
pass
except Exception:
pass
except Exception:
pass
return ''.join(parts)
out = []
# python-docx table parent is the Document
doc = getattr(tbl, '_parent', None) or getattr(tbl, 'part', None)
for r_i, r in enumerate(rows):
vals = []
for c in r.cells:
segs: List[str] = []
for p in c.paragraphs:
s = _cell_inline_md(doc, p)
if s:
segs.append(s)
cell_text = '<br>'.join(segs)
# A raw newline (from w:br) would end the table row, so render it as <br> too
cell_text = cell_text.replace('\n', '<br>')
vals.append(cell_text.replace('|', '\\|').strip())
line = "| " + " | ".join(vals) + " |"
out.append(line)
if r_i == 0:
sep = "| " + " | ".join(["---" for _ in vals]) + " |"
out.append(sep)
return "\n".join(out) + "\n"
def _paragraph_to_md(p: Paragraph) -> str:
return (p.text or "").strip() + "\n\n"
def convert_any(path: Path, mdx_safe_mode_enabled: bool = True) -> Tuple[str, str]:
ext = path.suffix.lower()
use_path = path
if ext == ".doc":
use_path = _convert_doc_to_docx_cross_platform(path)
if use_path.suffix.lower() not in {".docx"}:
raise RuntimeError("unsupported input for word2markdown")
doc = Document(str(use_path))
out: List[str] = []
in_code = False
code_lines: List[str] = []
lang_hint: str = ''
for blk in _iter_blocks(doc):
if isinstance(blk, Table):
out.append(_table_to_md(blk))
elif isinstance(blk, Paragraph):
tboxes = _paragraph_textboxes(blk)
for tb in tboxes:
if tb.strip():
out.append(_md_code_block(tb.strip()))
sdts = _paragraph_sdts(blk)
for s in sdts:
if s.strip():
out.append(_md_code_block(s.strip()))
btx = _paragraph_bordered_text(blk)
for s in btx:
if s.strip():
out.append(_md_code_block(s.strip()))
ftx = _paragraph_framed(blk)
for s in ftx:
if s.strip():
out.append(_md_code_block(s.strip()))
raw = (blk.text or "")
sraw = raw.strip()
if _looks_like_code_paragraph(sraw) or (in_code and sraw == ""):
if not in_code:
in_code = True
lang_hint = _guess_lang(sraw)
code_lines = []
code_lines.append(raw)
continue
if in_code and code_lines:
text = "\n".join(code_lines)
use_lang = lang_hint or _guess_lang(text)
out.append(f"```{use_lang}\n{text}\n```\n")
in_code = False
code_lines = []
lang_hint = ''
def _paragraph_with_images(doc: Document, p: Paragraph) -> str:
el = p._element
parts: List[str] = []
try:
for ch in el.iterchildren():
tag = ch.tag.split('}')[-1]
if tag == 'r':
for rc in ch.iterchildren():
rtag = rc.tag.split('}')[-1]
if rtag == 't':
s = rc.text or ''
if s:
parts.append(s)
elif rtag == 'br':
parts.append('\n')
elif rtag == 'drawing':
for node in rc.iter():
local = node.tag.split('}')[-1]
rid = None
if local == 'blip':
rid = node.get(f"{{{NS['r']}}}embed") or node.get(f"{{{NS['r']}}}link")
elif local == 'imagedata':
rid = node.get(f"{{{NS['r']}}}id")
if not rid:
continue
try:
part = None
rp = getattr(doc.part, 'related_parts', None)
if isinstance(rp, dict) and rid in rp:
part = rp.get(rid)
if part is None:
rels = getattr(doc.part, 'rels', None)
if rels is not None and hasattr(rels, 'get'):
rel = rels.get(rid)
part = getattr(rel, 'target_part', None)
if part is None:
rel = getattr(doc.part, '_rels', {}).get(rid)
part = getattr(rel, 'target_part', None)
ct = getattr(part, 'content_type', '') if part is not None else ''
data = part.blob if part is not None and hasattr(part, 'blob') else None
if data:
b64 = base64.b64encode(data).decode('ascii')
parts.append(f"![Image](data:{ct};base64,{b64})")
except Exception:
pass
except Exception:
pass
s = ''.join(parts).strip()
return (s + '\n\n') if s else ''
txt = _paragraph_with_images(doc, blk)
if txt.strip():
out.append(txt)
if in_code and code_lines:
text = "\n".join(code_lines)
use_lang = lang_hint or _guess_lang(text)
out.append(f"```{use_lang}\n{text}\n```\n")
try:
boxes = _doclevel_textboxes(doc)
existing_texts = set()
try:
for seg in out:
if isinstance(seg, str):
ss = seg.strip()
if ss.startswith("```"):
m = re.search(r"^```[\w-]*\n([\s\S]*?)\n```\s*$", ss)
if m:
existing_texts.add(m.group(1).strip())
continue
existing_texts.add(ss)
except Exception:
pass
for tb in boxes:
s = (tb or '').strip()
if not s:
continue
if s in existing_texts:
continue
out.append(_md_code_block(s))
existing_texts.add(s)
except Exception:
pass
md = "".join(out)
return "utf-8", md
NS = {
"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
"wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
"a": "http://schemas.openxmlformats.org/drawingml/2006/main",
"wps": "http://schemas.microsoft.com/office/word/2010/wordprocessingShape",
"v": "urn:schemas-microsoft-com:vml",
"r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
"pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
}
def _paragraph_textboxes(p: Paragraph) -> List[str]:
    """Return the text of each text box (DrawingML or VML) anchored in a paragraph."""
    try:
        el = p._element
        texts: List[str] = []
        # Compiled lxml XPath objects are used instead of el.xpath(..., namespaces=NS):
        # python-docx's element classes override xpath() to use the library's own
        # namespace map, so passing namespaces= there may raise TypeError, which the
        # except below would silently turn into an empty result.
        find_paras = etree.XPath('.//w:p', namespaces=NS)
        find_text = etree.XPath('.//w:t', namespaces=NS)
        for expr in ('.//wps:txbx/w:txbxContent', './/v:textbox/w:txbxContent'):
            for tbox in etree.XPath(expr, namespaces=NS)(el):
                buf: List[str] = []
                for w_p in find_paras(tbox):
                    s = ''.join(t.text or '' for t in find_text(w_p)).strip()
                    if s:
                        buf.append(s)
                if buf:
                    texts.append('\n'.join(buf))
        return texts
    except Exception:
        return []


def _paragraph_sdts(p: Paragraph) -> List[str]:
    """Return the text of each structured document tag (w:sdt) in a paragraph."""
    try:
        el = p._element
        texts: List[str] = []
        # etree.XPath is used for the same reason as in _paragraph_textboxes:
        # el.xpath(..., namespaces=NS) may be rejected by python-docx elements.
        find_content = etree.XPath('.//w:sdt/w:sdtContent', namespaces=NS)
        find_paras = etree.XPath('.//w:p', namespaces=NS)
        find_text = etree.XPath('.//w:t', namespaces=NS)
        for sdt in find_content(el):
            buf: List[str] = []
            for w_p in find_paras(sdt):
                s = ''.join(t.text or '' for t in find_text(w_p)).strip()
                if s:
                    buf.append(s)
            if buf:
                texts.append('\n'.join(buf))
        return texts
    except Exception:
        return []
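def _paragraph_bordered_text(p: Paragraph) -> List[str]:
    """Return the paragraph text in a one-item list if it has a border (w:pBdr)."""
    try:
        # A compiled lxml XPath avoids passing namespaces= to el.xpath(), which
        # python-docx elements may reject with TypeError.
        has_border = bool(etree.XPath('./w:pPr/w:pBdr', namespaces=NS)(p._element))
        t = (p.text or '').strip()
        if has_border and t:
            return [t]
    except Exception:
        pass
    return []


def _paragraph_framed(p: Paragraph) -> List[str]:
    """Return the paragraph text in a one-item list if it is framed (w:framePr)."""
    try:
        # Same approach as _paragraph_bordered_text: compiled XPath, explicit namespaces.
        has_frame = bool(etree.XPath('./w:pPr/w:framePr', namespaces=NS)(p._element))
        t = (p.text or '').strip()
        if has_frame and t:
            return [t]
    except Exception:
        pass
    return []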
def _md_code_block(text: str) -> str:
    """Wrap text in a fenced block, tagged with a guessed language if any."""
    lang = _guess_lang(text)
    return f"```{lang}\n{text}\n```\n"


def _looks_like_code_paragraph(t: str) -> bool:
    """Heuristic: does this paragraph look like a line of source code?"""
    raw = t or ''
    s = raw.strip()
    if not s:
        return False
    if s.startswith('{') or s.startswith('[') or s.endswith('}'):
        return True
    # Leading indentation has to be tested on the unstripped text; after
    # strip() it can never be present.
    if raw.startswith(' ') or raw.startswith('\t'):
        return True
    if ';' in s or '{' in s or '}' in s:
        return True
    keywords = ['public static', 'private static', 'class ', 'return ', 'import ', 'package ', 'byte[]', 'String ', 'Cipher', 'KeyFactory']
    return any(k in s for k in keywords)
def _doclevel_textboxes(doc: Document) -> List[str]:
    """Return the text of every text box found anywhere in the document body."""
    texts: List[str] = []
    try:
        el = doc.element.body
        # Compiled lxml XPath objects are used instead of el.xpath(..., namespaces=NS):
        # python-docx's element classes override xpath() to use the library's own
        # namespace map, so passing namespaces= there may raise TypeError, which the
        # except below would silently turn into an empty result.
        find_paras = etree.XPath('.//w:p', namespaces=NS)
        find_text = etree.XPath('.//w:t', namespaces=NS)
        for expr in ('.//wps:txbx/w:txbxContent', './/v:textbox/w:txbxContent'):
            for tbox in etree.XPath(expr, namespaces=NS)(el):
                buf: List[str] = []
                for w_p in find_paras(tbox):
                    s = ''.join((t.text or '') for t in find_text(w_p)).strip()
                    if s:
                        buf.append(s)
                if buf:
                    texts.append('\n'.join(buf))
    except Exception:
        pass
    return texts
def _convert_doc_to_docx_cross_platform(path: Path) -> Path:
    """Convert a legacy .doc file to .docx with the first tool that succeeds."""
    # 1) macOS: textutil ships with the operating system.
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".docx") as tmp:
            tmp.close()
        subprocess.run(["textutil", "-convert", "docx", str(path), "-output", tmp.name], check=True)
        return Path(tmp.name)
    except Exception:
        pass
    # 2) LibreOffice: writes <stem>.docx into the given output directory.
    try:
        outdir = Path(tempfile.mkdtemp(prefix="doc2docx_"))
        subprocess.run(["soffice", "--headless", "--convert-to", "docx", "--outdir", str(outdir), str(path)], check=True)
        candidate = outdir / (path.stem + ".docx")
        if candidate.exists():
            return candidate
    except Exception:
        pass
    # 3) unoconv: last resort. The temporary file is closed before unoconv
    #    writes to it so that no open handle is left behind.
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".docx") as tmp:
            out = Path(tmp.name)
        subprocess.run(["unoconv", "-f", "docx", "-o", str(out), str(path)], check=True)
        if out.exists():
            return out
    except Exception:
        pass
    raise RuntimeError("doc to docx conversion failed; please install 'soffice' or 'unoconv' or convert manually")
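
The fallback chain in `_convert_doc_to_docx_cross_platform` tries each tool by running it and catching the failure. A possible variation, sketched here under assumptions, is to build the command line separately and to check which tool is on `PATH` first. `build_doc_to_docx_command` and `pick_converter` are hypothetical helpers that do not exist in this commit; the command-line flags are the ones used in the function above.

```python
import shutil
from pathlib import PurePath
from typing import List, Optional, Sequence


def build_doc_to_docx_command(tool: str, src: PurePath, out_dir: PurePath) -> List[str]:
    """Return the argv for one converter, using the same flags as the code above."""
    target = out_dir / (src.stem + ".docx")
    if tool == "textutil":  # macOS only
        return ["textutil", "-convert", "docx", str(src), "-output", str(target)]
    if tool == "soffice":  # LibreOffice picks the output name itself: <stem>.docx
        return ["soffice", "--headless", "--convert-to", "docx", "--outdir", str(out_dir), str(src)]
    if tool == "unoconv":
        return ["unoconv", "-f", "docx", "-o", str(target), str(src)]
    raise ValueError(f"unknown converter: {tool}")


def pick_converter(candidates: Sequence[str] = ("textutil", "soffice", "unoconv")) -> Optional[str]:
    """Return the first converter found on PATH, or None if none is installed."""
    for tool in candidates:
        if shutil.which(tool):
            return tool
    return None
```

Checking `PATH` up front would let the caller report which tool is missing instead of swallowing three separate exceptions, at the cost of not catching a tool that is installed but fails on a particular file.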


@@ -0,0 +1,80 @@
import io
import os
import zipfile
from pathlib import Path
from fastapi.testclient import TestClient
import sys
from pathlib import Path as _Path
base = _Path(__file__).resolve().parents[2]
sys.path.insert(0, str(base))
sys.path.insert(0, str(base / "docling"))
import app.server as server
class FakeMinio:
def __init__(self):
self.objs = {}
def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO, length: int, content_type: str):
self.objs[(bucket_name, object_name)] = data.read(length)
def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}"
def presigned_get_object(self, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}"
def setup():
server.RUNTIME_CONFIG["minio"].update({
"endpoint": "127.0.0.1:9000",
"public": "http://127.0.0.1:9000",
"access": "ak",
"secret": "sk",
"bucket": "test",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true",
})
fake = FakeMinio()
def _cur():
return fake, "test", "http://127.0.0.1:9000", "assets"
server._minio_current = _cur # type: ignore
def main():
setup()
app = server.app
c = TestClient(app)
tmp = Path("/tmp/run_batch_upload_debug")
tmp.mkdir(parents=True, exist_ok=True)
zpath = tmp / "pkg.zip"
md_dir = tmp / "docs"
img_dir = md_dir / "images"
img_dir.mkdir(parents=True, exist_ok=True)
(img_dir / "p.png").write_bytes(b"PNG")
(md_dir / "a.md").write_text("![](images/p.png)", "utf-8")
with zipfile.ZipFile(str(zpath), "w") as zf:
zf.write(str(md_dir / "a.md"), arcname="a.md")
zf.write(str(img_dir / "p.png"), arcname="images/p.png")
with open(zpath, "rb") as fp:
files = {"file": ("pkg.zip", fp.read())}
r1 = c.post("/api/archive/stage", files=files)
print("stage status:", r1.status_code, r1.json())
sid = r1.json()["data"]["id"]
r2 = c.post("/api/archive/process", data={"id": sid, "prefix": "assets", "versionId": "1001"})
print("process status:", r2.status_code, r2.json())
list_text = str(md_dir / "a.md")
lf = io.BytesIO(list_text.encode("utf-8"))
r3 = c.post("/api/upload-list", files={"list_file": ("list.txt", lf.getvalue())}, data={"prefix": "assets", "versionId": "1002"})
print("upload-list status:", r3.status_code, r3.json())
if __name__ == "__main__":
main()
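
The debug scripts and tests in this commit replace the real MinIO client with a small in-memory fake and patch it in through `server._minio_current`. The sketch below shows that fake on its own, without the server, so its behaviour can be checked in isolation. `InMemoryObjectStore` is a hypothetical name for illustration; the method names and the presigned URL format follow the `FakeMinio` class above, and the default argument values are assumptions.

```python
import io
from typing import Dict, Tuple


class InMemoryObjectStore:
    """Hypothetical stand-in that records uploads instead of contacting MinIO."""

    def __init__(self) -> None:
        # Keyed by (bucket, object name), mirroring FakeMinio.objs.
        self.objs: Dict[Tuple[str, str], bytes] = {}

    def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO,
                   length: int, content_type: str = "application/octet-stream") -> None:
        # Read exactly `length` bytes, as the real client would upload.
        self.objs[(bucket_name, object_name)] = data.read(length)

    def presigned_get_object(self, bucket: str, obj: str, expires: int = 3600) -> str:
        # No signing takes place; the URL only needs to be recognisable in assertions.
        return f"http://minio.test/presigned/{bucket}/{obj}"


store = InMemoryObjectStore()
payload = b"PNG"
store.put_object("test", "assets/images/p.png", io.BytesIO(payload), len(payload), "image/png")
```

Because the store keeps every upload in `objs`, a test can assert on the exact object names the server produced rather than only on the HTTP response.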


@@ -0,0 +1,75 @@
import io
import os
from pathlib import Path
from fastapi.testclient import TestClient
import sys
from pathlib import Path as _Path
base = _Path(__file__).resolve().parents[2]
sys.path.insert(0, str(base))
sys.path.insert(0, str(base / "docling"))
import app.server as server
class FakeMinio:
def __init__(self):
self.objs = {}
def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO, length: int, content_type: str):
self.objs[(bucket_name, object_name)] = data.read(length)
def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}"
def presigned_get_object(self, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}"
def setup():
server.RUNTIME_CONFIG["minio"].update({
"endpoint": "127.0.0.1:9000",
"public": "http://127.0.0.1:9000",
"access": "ak",
"secret": "sk",
"bucket": "test",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true",
})
fake = FakeMinio()
def _cur():
return fake, "test", "http://127.0.0.1:9000", "assets"
server._minio_current = _cur # type: ignore
def main():
setup()
app = server.app
c = TestClient(app)
tmp = Path("/tmp/run_convert_folder_debug")
# Start from a clean directory (rmtree also removes nested folders)
import shutil
shutil.rmtree(tmp, ignore_errors=True)
tmp.mkdir(parents=True, exist_ok=True)
root = tmp / "数+产品手册-MD源文件"
sub = root / "DMDRS_DRS_Language_User_Manual"
img = sub / "images"
img.mkdir(parents=True, exist_ok=True)
(img / "p.png").write_bytes(b"PNG")
(sub / "a.md").write_text("# Title\n\n![](images/p.png)", "utf-8")
r = c.post("/md/convert-folder", data={"folder_path": str(root), "prefix": "assets"})
print("convert-folder:", r.status_code)
print(r.json())
if __name__ == "__main__":
main()


@@ -0,0 +1,97 @@
import io
import zipfile
from pathlib import Path
from fastapi.testclient import TestClient
import sys
from pathlib import Path as _Path
base = _Path(__file__).resolve().parents[2]
sys.path.insert(0, str(base))
sys.path.insert(0, str(base / "docling"))
import app.server as server
class FakeMinio:
def __init__(self):
self.objs = {}
def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO, length: int, content_type: str):
self.objs[(bucket_name, object_name)] = data.read(length)
def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}"
def presigned_get_object(self, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}"
def setup():
server.RUNTIME_CONFIG["minio"].update({
"endpoint": "127.0.0.1:9000",
"public": "http://127.0.0.1:9000",
"access": "ak",
"secret": "sk",
"bucket": "test",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true",
})
fake = FakeMinio()
def _cur():
return fake, "test", "http://127.0.0.1:9000", "assets"
server._minio_current = _cur # type: ignore
def run():
setup()
app = server.app
c = TestClient(app)
r = c.post("/api/archive/process", data={"id": "missing"})
print("invalid-id:", r.status_code, r.json())
tmp = Path("/tmp/run_edge_cases_debug")
tmp.mkdir(parents=True, exist_ok=True)
rar_path = tmp / "pkg.rar"
rar_path.write_bytes(b"RAR")
with open(rar_path, "rb") as fp:
files = {"file": ("pkg.rar", fp.read())}
r1 = c.post("/api/archive/stage", files=files)
sid = r1.json()["data"]["id"]
r2 = c.post("/api/archive/process", data={"id": sid})
print("rar-process:", r2.status_code, r2.json())
r3 = c.post("/api/archive/process", data={"id": sid})
print("rar-reprocess:", r3.status_code, r3.json())
root = tmp / "listcase2"
root.mkdir(parents=True, exist_ok=True)
(root / "img.png").write_bytes(b"PNG")
(root / "a.md").write_text("![](img.png)", "utf-8")
(root / "b.txt").write_text("![](img.png)", "utf-8")
lines = ["", "# comment", "http://example.com/x.md", str(root / "a.md"), str(root / "b.txt")]
data_bytes = "\n".join(lines).encode("utf-8")
files = {"list_file": ("list.txt", data_bytes)}
r4 = c.post("/api/upload-list", files=files, data={"prefix": "assets", "versionId": "1005"})
print("upload-list:", r4.status_code, r4.json())
zpath = tmp / "dup.zip"
base = tmp / "src"
sub = base / "sub"
sub.mkdir(parents=True, exist_ok=True)
(base / "a.md").write_text("![](img.png)", "utf-8")
(base / "img.png").write_bytes(b"PNG")
(sub / "a.md").write_text("![](../img.png)", "utf-8")
with zipfile.ZipFile(str(zpath), "w") as zf:
zf.write(str(base / "a.md"), arcname="a.md")
zf.write(str(base / "img.png"), arcname="img.png")
zf.write(str(sub / "a.md"), arcname="sub/a.md")
with open(zpath, "rb") as fp:
files = {"file": ("dup.zip", fp.read())}
r5 = c.post("/api/archive/stage", files=files)
sid2 = r5.json()["data"]["id"]
r6 = c.post("/api/archive/process", data={"id": sid2, "prefix": "assets", "versionId": "1006"})
print("archive-dup:", r6.status_code, r6.json())
if __name__ == "__main__":
run()


@@ -0,0 +1,77 @@
from fastapi.testclient import TestClient
import sys
from pathlib import Path as _Path
base = _Path(__file__).resolve().parents[2]
sys.path.insert(0, str(base))
sys.path.insert(0, str(base / "docling"))
import app.server as server
class _Resp:
def __init__(self, data: bytes):
self._data = data
def read(self) -> bytes:
return self._data
def close(self):
pass
class FakeMinio:
def __init__(self):
self.store = {
("doctest", "assets/rewritten/x.md"): (b"# Title\n\nhello", "text/markdown; charset=utf-8")
}
def stat_object(self, bucket: str, object_name: str):
class S:
def __init__(self, ct: str):
self.content_type = ct
k = (bucket, object_name)
if k in self.store:
return S(self.store[k][1])
return S("application/octet-stream")
def get_object(self, bucket: str, object_name: str):
k = (bucket, object_name)
if k in self.store:
return _Resp(self.store[k][0])
return _Resp(b"")
def setup():
server.RUNTIME_CONFIG["minio"].update({
"endpoint": "127.0.0.1:9000",
"public": "http://127.0.0.1:9000",
"access": "ak",
"secret": "sk",
"bucket": "doctest",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true",
})
fake = FakeMinio()
def _cur():
return fake, "doctest", "http://127.0.0.1:9000", "assets"
server._minio_current = _cur # type: ignore
def run():
setup()
app = server.app
c = TestClient(app)
r = c.get("/minio/object", params={"bucket": "doctest", "object": "assets/rewritten/x.md"})
print("status:", r.status_code)
print("ct:", r.headers.get("Content-Type"))
print(r.text)
import urllib.parse as _u
enc = _u.quote("assets/rewritten/数字+产品手册-MD源文件/x.md")
cur_client, _, _, _ = server._minio_current() # type: ignore
cur_client.store[("doctest", "assets/rewritten/数字+产品手册-MD源文件/x.md")] = ("hello 中文+plus".encode("utf-8"), "text/markdown; charset=utf-8")
r2 = c.get("/minio/object", params={"bucket": "doctest", "object": enc})
print("status2:", r2.status_code)
print("ct2:", r2.headers.get("Content-Type"))
print(r2.text)
if __name__ == "__main__":
run()


@@ -0,0 +1,50 @@
import io
from fastapi.testclient import TestClient
import sys
from pathlib import Path as _Path
base = _Path(__file__).resolve().parents[2]
sys.path.insert(0, str(base))
sys.path.insert(0, str(base / "docling"))
import app.server as server
class FakeMinio:
def __init__(self):
pass
def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}?e={expires}"
def presigned_get_object(self, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}?e={expires}"
def setup():
server.RUNTIME_CONFIG["minio"].update({
"endpoint": "127.0.0.1:9000",
"public": "http://127.0.0.1:9000",
"access": "ak",
"secret": "sk",
"bucket": "doctest",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true",
})
fake = FakeMinio()
def _cur():
return fake, "doctest", "http://127.0.0.1:9000", "assets"
server._minio_current = _cur # type: ignore
def run():
setup()
app = server.app
c = TestClient(app)
url = "http://127.0.0.1:9000/doctest/assets/rewritten/%E6%B5%8B%E8%AF%95/a.md"
r = c.post("/minio/presign", data={"url": url, "expires": 7200})
print("status:", r.status_code)
print(r.json())
if __name__ == "__main__":
run()


@@ -0,0 +1,74 @@
import io
import zipfile
from pathlib import Path
from fastapi.testclient import TestClient
import sys
from pathlib import Path as _Path
base = _Path(__file__).resolve().parents[2]
sys.path.insert(0, str(base))
sys.path.insert(0, str(base / "docling"))
import app.server as server
class FakeMinio:
def __init__(self):
self.objs = {}
def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO, length: int, content_type: str):
self.objs[(bucket_name, object_name)] = data.read(length)
def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}"
def presigned_get_object(self, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}"
def setup():
server.RUNTIME_CONFIG["minio"].update({
"endpoint": "127.0.0.1:9000",
"public": "http://127.0.0.1:9000",
"access": "ak",
"secret": "sk",
"bucket": "test",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true",
})
fake = FakeMinio()
def _cur():
return fake, "test", "http://127.0.0.1:9000", "assets"
server._minio_current = _cur # type: ignore
def main():
setup()
app = server.app
c = TestClient(app)
tmp = Path("/tmp/run_slash_path_debug")
tmp.mkdir(parents=True, exist_ok=True)
zpath = tmp / "pkg.zip"
md_dir = tmp / "docs"
img_dir = md_dir / "images"
img_dir.mkdir(parents=True, exist_ok=True)
(img_dir / "p.png").write_bytes(b"PNG")
(md_dir / "a.md").write_text("![](/images/p.png)", "utf-8")
with zipfile.ZipFile(str(zpath), "w") as zf:
zf.write(str(md_dir / "a.md"), arcname="a.md")
zf.write(str(img_dir / "p.png"), arcname="images/p.png")
with open(zpath, "rb") as fp:
files = {"file": ("pkg.zip", fp.read())}
r1 = c.post("/api/archive/stage", files=files)
sid = r1.json()["data"]["id"]
r2 = c.post("/api/archive/process", data={"id": sid, "prefix": "assets", "versionId": "1007"})
print("process:", r2.status_code)
print(r2.json())
if __name__ == "__main__":
main()


@@ -0,0 +1,29 @@
import unittest
from fastapi.testclient import TestClient
from pathlib import Path
import io
from app.server import app
class ApiConvertTest(unittest.TestCase):
def setUp(self):
self.client = TestClient(app)
def test_api_convert_markdown_file(self):
tmpdir = Path("./scratch_unittest")
tmpdir.mkdir(exist_ok=True)
p = tmpdir / "sample.md"
p.write_text("# Title\n\n::: note\nBody\n:::\n", "utf-8")
with open(p, "rb") as f:
files = {"file": (p.name, io.BytesIO(f.read()), "text/markdown")}
r = self.client.post("/api/convert", files=files, data={"export": "markdown"})
self.assertEqual(r.status_code, 200)
j = r.json()
self.assertEqual(j.get("code"), 0)
self.assertIsInstance(j.get("data", {}).get("content"), str)
self.assertIn("!!! note", j["data"]["content"])
if __name__ == "__main__":
unittest.main()


@@ -0,0 +1,113 @@
import io
import zipfile
from pathlib import Path
from fastapi.testclient import TestClient
import app.server as server
class FakeMinio:
def __init__(self):
self.objs = {}
def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO, length: int, content_type: str):
self.objs[(bucket_name, object_name)] = data.read(length)
def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}"
def presigned_get_object(self, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}"
def setup_module(module=None):
server.RUNTIME_CONFIG["minio"].update({
"endpoint": "127.0.0.1:9000",
"public": "http://127.0.0.1:9000",
"access": "ak",
"secret": "sk",
"bucket": "test",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true",
})
fake = FakeMinio()
def _cur():
return fake, "test", "http://127.0.0.1:9000", "assets"
server._minio_current = _cur # type: ignore
def test_process_invalid_id():
app = server.app
c = TestClient(app)
r = c.post("/api/archive/process", data={"id": "missing"})
assert r.status_code == 200
j = r.json()
assert j["code"] != 0
def test_stage_unsupported_format_and_cleanup(tmp_path: Path):
app = server.app
c = TestClient(app)
rar_path = tmp_path / "pkg.rar"
rar_path.write_bytes(b"RAR")
with open(rar_path, "rb") as fp:
files = {"file": ("pkg.rar", fp.read())}
r1 = c.post("/api/archive/stage", files=files)
assert r1.status_code == 200
sid = r1.json()["data"]["id"]
r2 = c.post("/api/archive/process", data={"id": sid})
assert r2.status_code == 200
j2 = r2.json()
assert j2["code"] != 0
r3 = c.post("/api/archive/process", data={"id": sid})
assert r3.status_code == 200
j3 = r3.json()
assert j3["code"] != 0
def test_upload_list_empty_lines_comments_and_urls(tmp_path: Path):
app = server.app
c = TestClient(app)
root = tmp_path / "listcase2"
root.mkdir(parents=True, exist_ok=True)
(root / "img.png").write_bytes(b"PNG")
(root / "a.md").write_text("![](img.png)", "utf-8")
(root / "b.txt").write_text("![](img.png)", "utf-8")
lines = ["", "# comment", "http://example.com/x.md", str(root / "a.md"), str(root / "b.txt")]
data_bytes = "\n".join(lines).encode("utf-8")
files = {"list_file": ("list.txt", data_bytes)}
r = c.post("/api/upload-list", files=files, data={"prefix": "assets", "versionId": "1005"})
assert r.status_code == 200
j = r.json()
assert j["code"] == 0
assert j["data"]["count"] >= 2
def test_archive_duplicate_filenames_tree(tmp_path: Path):
app = server.app
c = TestClient(app)
zpath = tmp_path / "dup.zip"
base = tmp_path / "src"
sub = base / "sub"
sub.mkdir(parents=True, exist_ok=True)
(base / "a.md").write_text("![](img.png)", "utf-8")
(base / "img.png").write_bytes(b"PNG")
(sub / "a.md").write_text("![](../img.png)", "utf-8")
with zipfile.ZipFile(str(zpath), "w") as zf:
zf.write(str(base / "a.md"), arcname="a.md")
zf.write(str(base / "img.png"), arcname="img.png")
zf.write(str(sub / "a.md"), arcname="sub/a.md")
with open(zpath, "rb") as fp:
files = {"file": ("dup.zip", fp.read())}
r1 = c.post("/api/archive/stage", files=files)
assert r1.status_code == 200
sid = r1.json()["data"]["id"]
r2 = c.post("/api/archive/process", data={"id": sid, "prefix": "assets", "versionId": "1006"})
assert r2.status_code == 200
j = r2.json()
assert j["code"] == 0
tree = j["data"]["import"]["tree"]
names = [n["name"] for n in tree]
assert "sub" in names or any((isinstance(n, dict) and n.get("type") == "FOLDER" and n.get("name") == "sub") for n in tree)


@@ -0,0 +1,185 @@
import io
import os
import zipfile
from pathlib import Path
from fastapi.testclient import TestClient
import app.server as server
class FakeMinio:
def __init__(self):
self.objs = {}
def put_object(self, bucket_name: str, object_name: str, data: io.BytesIO, length: int, content_type: str):
self.objs[(bucket_name, object_name)] = data.read(length)
def get_presigned_url(self, method: str, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}"
def presigned_get_object(self, bucket: str, obj: str, expires: int):
return f"http://minio.test/presigned/{bucket}/{obj}"
def setup_module(module=None):
server.RUNTIME_CONFIG["minio"].update({
"endpoint": "127.0.0.1:9000",
"public": "http://127.0.0.1:9000",
"access": "ak",
"secret": "sk",
"bucket": "test",
"secure": "false",
"prefix": "assets",
"store_final": "true",
"public_read": "true",
})
fake = FakeMinio()
def _cur_cfg(_cfg):
return fake, "test", "http://127.0.0.1:9000", "assets"
server.minio_current = _cur_cfg # type: ignore
try:
server._minio_current = lambda: _cur_cfg(None) # type: ignore
except Exception:
pass
def test_archive_stage_and_process(tmp_path: Path):
app = server.app
c = TestClient(app)
zpath = tmp_path / "pkg.zip"
md_dir = tmp_path / "docs"
img_dir = md_dir / "images"
img_dir.mkdir(parents=True, exist_ok=True)
(img_dir / "p.png").write_bytes(b"PNG")
(md_dir / "a.md").write_text("![](images/p.png)", "utf-8")
with zipfile.ZipFile(str(zpath), "w") as zf:
zf.write(str(md_dir / "a.md"), arcname="a.md")
zf.write(str(img_dir / "p.png"), arcname="images/p.png")
with open(zpath, "rb") as fp:
files = {"file": ("pkg.zip", fp.read())}
r1 = c.post("/api/archive/stage", files=files)
assert r1.status_code == 200
j1 = r1.json()
assert j1["code"] == 0 and j1["data"]["id"]
sid = j1["data"]["id"]
r2 = c.post("/api/archive/process", data={"id": sid, "prefix": "assets", "versionId": "1001"})
assert r2.status_code == 200
j2 = r2.json()
assert j2["code"] == 0
assert j2["data"]["count"] >= 1
assert "import" in j2["data"]


def test_upload_list(tmp_path: Path):
    app = server.app
    c = TestClient(app)
    root = tmp_path / "listcase"
    root.mkdir(parents=True, exist_ok=True)
    (root / "img.png").write_bytes(b"PNG")
    (root / "b.md").write_text("![](img.png)", "utf-8")
    list_text = str(root / "b.md")
    lf = io.BytesIO(list_text.encode("utf-8"))
    files = {"list_file": ("list.txt", lf.getvalue())}
    r = c.post("/api/upload-list", files=files, data={"prefix": "assets", "versionId": "1002"})
    assert r.status_code == 200
    j = r.json()
    assert j["code"] == 0
    assert j["data"]["count"] >= 1
    assert "import" in j["data"]


def test_archive_process_html_conversion(tmp_path: Path):
    app = server.app
    c = TestClient(app)
    zpath = tmp_path / "web.zip"
    root = tmp_path / "web"
    static = root / "static"
    static.mkdir(parents=True, exist_ok=True)
    (static / "pic.png").write_bytes(b"PNG")
    (root / "index.html").write_text("<html><body><h1>T</h1><img src='static/pic.png'/></body></html>", "utf-8")
    pages = root / "pages"
    pages.mkdir(parents=True, exist_ok=True)
    (pages / "a.html").write_text("<img src='../static/pic.png'>", "utf-8")
    with zipfile.ZipFile(str(zpath), "w") as zf:
        for p in root.rglob("*"):
            if p.is_file():
                zf.write(str(p), arcname=p.relative_to(root).as_posix())
    with open(zpath, "rb") as fp:
        files = {"file": ("web.zip", fp.read())}
    r1 = c.post("/api/archive/stage", files=files)
    assert r1.status_code == 200
    sid = r1.json()["data"]["id"]
    r2 = c.post("/api/archive/process", data={"id": sid, "prefix": "assets", "versionId": "1003"})
    assert r2.status_code == 200
    j = r2.json()
    assert j["code"] == 0
    files_list = j["data"]["files"]
    names = {Path(str(f.get("source") or "")).name for f in files_list}
    assert "index.md" in names
    assert "a.md" in names
    for f in files_list:
        n = Path(str(f.get("source") or "")).name
        if n in {"index.md", "a.md"}:
            assert f.get("minio_url")
            assert str(f.get("object_name") or "").startswith("assets/rewritten/")
    imp = j["data"]["import"]
    nodes = []

    def walk(children):
        for n in children:
            if n.get("type") == "FILE":
                nodes.append(n.get("name"))
            elif n.get("type") == "FOLDER":
                walk(n.get("children", []))

    walk(imp["tree"])
    assert "index" in nodes
    assert "a" in nodes


def test_archive_process_html_abs_uppercase(tmp_path: Path):
    app = server.app
    c = TestClient(app)
    zpath = tmp_path / "web2.zip"
    root = tmp_path / "web2"
    (root / "static").mkdir(parents=True, exist_ok=True)
    (root / "static" / "p.png").write_bytes(b"PNG")
    (root / "INDEX.HTML").write_text("<img src='/static/p.png'>", "utf-8")
    (root / "pages").mkdir(parents=True, exist_ok=True)
    (root / "pages" / "A.HTM").write_text("<img src='/static/p.png'>", "utf-8")
    with zipfile.ZipFile(str(zpath), "w") as zf:
        for p in root.rglob("*"):
            if p.is_file():
                zf.write(str(p), arcname=p.relative_to(root).as_posix())
    with open(zpath, "rb") as fp:
        files = {"file": ("web2.zip", fp.read())}
    r1 = c.post("/api/archive/stage", files=files)
    assert r1.status_code == 200
    sid = r1.json()["data"]["id"]
    r2 = c.post("/api/archive/process", data={"id": sid, "prefix": "assets", "versionId": "1004"})
    assert r2.status_code == 200
    j = r2.json()
    assert j["code"] == 0
    files_list = j["data"]["files"]
    names = {Path(str(f.get("source") or "")).name for f in files_list}
    assert "INDEX.md" in names
    assert "A.md" in names
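
Note (illustrative, not part of the commit): the archive tests in this file build their fixture zips with the same `rglob` loop. A minimal sketch of how that loop might be factored out is shown below; the helper name `zip_tree` and its signature are assumptions made for illustration, and the sketch assumes the zip file is written outside the directory being archived, as it is in the tests.

```python
import zipfile
from pathlib import Path


def zip_tree(root: Path, zpath: Path) -> list[str]:
    """Write every file under `root` into `zpath` with POSIX-style relative names.

    Hypothetical helper: returns the archive member names in sorted order.
    """
    names: list[str] = []
    with zipfile.ZipFile(str(zpath), "w") as zf:
        # sorted() makes the member order deterministic across platforms
        for p in sorted(root.rglob("*")):
            if p.is_file():
                arcname = p.relative_to(root).as_posix()
                zf.write(str(p), arcname=arcname)
                names.append(arcname)
    return names
```

With such a helper, a test could call `zip_tree(root, zpath)` in place of the inline `with zipfile.ZipFile(...)` loop; whether that is worth the indirection for a handful of tests is a judgment call.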

View File

@@ -0,0 +1,53 @@
import io
import os
import base64
from pathlib import Path
from zipfile import ZipFile
from app.services.docling_adapter import md_to_docx_bytes


def _make_png(tmpdir: Path) -> Path:
    # Minimal 1x1 PNG
    data = base64.b64decode(
        b"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR4nGNgYAAAAAMAASsJTYQAAAAASUVORK5CYII="
    )
    p = tmpdir / "tiny.png"
    p.write_bytes(data)
    return p


def test_md_to_docx_renders_blocks_and_media(tmp_path: Path):
    png = _make_png(tmp_path)
    html = (
        f"<h1>标题</h1>"
        f"<p>内容</p>"
        f"<pre><code>print(\"hello\")\n</code></pre>"
        f"<img src='{png.as_posix()}'>"
        f"<table><thead><tr><th>A</th><th>B</th></tr></thead>"
        f"<tbody><tr><td>1</td><td>2</td></tr></tbody></table>"
    )
    docx = md_to_docx_bytes(
        html,
        toc=True,
        header_text="Left|Right",
        footer_text="Footer",
        filename_text="FileName",
        product_name="Product",
        document_name="DocName",
        product_version="1.0",
        document_version="2.0",
    )
    assert isinstance(docx, (bytes, bytearray)) and len(docx) > 0
    zf = ZipFile(io.BytesIO(docx))
    names = set(zf.namelist())
    assert any(n.startswith("word/") for n in names)
    # Document XML should contain core texts
    doc_xml = zf.read("word/document.xml").decode("utf-8")
    for tok in ["标题", "内容", "print(\"hello\")", "A", "B", "1", "2"]:
        assert tok in doc_xml
    # Media should be present for the image
    assert any(n.startswith("word/media/") for n in names)

View File

@@ -0,0 +1,51 @@
import unittest
from pathlib import Path
import base64
import tempfile
import sys

# ensure 'app' package is importable
try:
    root = Path(__file__).resolve().parents[2]
    p = str(root)
    if p not in sys.path:
        sys.path.insert(0, p)
except Exception:
    pass

from docx import Document
from app.services.word2markdown import convert_any


def _tiny_png_bytes() -> bytes:
    return base64.b64decode(
        b"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR4nGNgYAAAAAMAASsJTYQAAAAASUVORK5CYII="
    )


class InlineImagesTest(unittest.TestCase):
    def test_paragraph_image_order(self):
        tmp = Path(tempfile.mkdtemp(prefix="w2m_inline_test_"))
        img = tmp / "tiny.png"
        img.write_bytes(_tiny_png_bytes())
        docx = tmp / "sample.docx"
        doc = Document()
        doc.add_paragraph("前文A")
        doc.add_picture(str(img))  # the picture goes in its own paragraph
        doc.add_paragraph("后文B")
        doc.save(str(docx))
        enc, md = convert_any(docx)
        self.assertEqual(enc, "utf-8")
        a_pos = md.find("前文A")
        img_pos = md.find("![Image](data:")
        b_pos = md.find("后文B")
        # Expected order: A -> image -> B
        self.assertTrue(a_pos != -1 and img_pos != -1 and b_pos != -1)
        self.assertTrue(a_pos < img_pos < b_pos)


if __name__ == "__main__":
    unittest.main()

1
docling/docling Submodule

Submodule docling/docling added at ad97e52851

28
docling/requirements.txt Normal file
View File

@@ -0,0 +1,28 @@
fastapi
uvicorn
python-multipart
minio
beautifulsoup4
marko
markdown-it-py
mdit-py-plugins
pydantic-settings
filetype
python-docx
openpyxl
mammoth
weasyprint
reportlab
pypdfium2
python-pptx
pluggy
requests
docling-core
docling-parse
docling-ibm-models
transformers
sentencepiece
safetensors
scipy
opencv-python
pymupdf

View File

@@ -0,0 +1,17 @@
import sys
from pathlib import Path
from fastapi.testclient import TestClient
root = Path(__file__).resolve().parents[2] / "docling"
sys.path.insert(0, str(root))
import app.server as server
from docling.tests.test_api_prd import setup_module, PNG
setup_module()
app = server.app
c = TestClient(app)
files = {"file": ("管理端使用说明 (1).pdf", b"%PDF-1.4\n")}
data = {"export": "markdown", "save": "true", "filename": "管理端使用说明 (1)"}
r = c.post("/api/convert", files=files, data=data)
print(r.json())

View File

@@ -0,0 +1,131 @@
import os
import sys
import tempfile
from pathlib import Path
from fastapi.testclient import TestClient
import types
root = Path(__file__).resolve().parents[2] / "docling"
sys.path.insert(0, str(root))

dc = types.ModuleType('docling.document_converter')


class _DC:
    def __init__(self, *a, **k):
        pass

    def convert(self, src):
        class R:
            class D:
                def export_to_markdown(self, image_mode=None):
                    return ""

                def export_to_html(self):
                    return ""

                def export_to_json(self):
                    return "{}"

                def export_to_doctags(self):
                    return "{}"

            document = D()

        return R()


class _PF:
    def __init__(self, *a, **k):
        pass


dc.DocumentConverter = _DC
dc.PdfFormatOption = _PF
sys.modules['docling.document_converter'] = dc

bm = types.ModuleType('docling.datamodel.base_models')


class _IF:
    PDF = 'pdf'


bm.InputFormat = _IF
sys.modules['docling.datamodel.base_models'] = bm

pl = types.ModuleType('docling.pipeline.standard_pdf_pipeline')


class _SP:
    def __init__(self, *a, **k):
        pass


pl.StandardPdfPipeline = _SP
sys.modules['docling.pipeline.standard_pdf_pipeline'] = pl

po = types.ModuleType('docling.datamodel.pipeline_options')


class _PPO:
    def __init__(self, *a, **k):
        pass


po.PdfPipelineOptions = _PPO
sys.modules['docling.datamodel.pipeline_options'] = po

ct = types.ModuleType('docling_core.types.doc')


class _IRM:
    PLACEHOLDER = 'placeholder'


ct.ImageRefMode = _IRM
sys.modules['docling_core.types.doc'] = ct

da = types.ModuleType('app.services.docling_adapter')


def _convert_source(src, export):
    return ("", "text/markdown")


def _md2docx(md, **k):
    return b""


def _md2pdf(md, *a, **k):
    return b""


def _infer(source_url, upload_name):
    return "document"


def _san(name):
    return name or "document"


def _load():
    return {}


def _save(m):
    return None


da.convert_source = _convert_source
da.md_to_docx_bytes = _md2docx
da.md_to_pdf_bytes_with_renderer = _md2pdf
da.infer_basename = _infer
da.sanitize_filename = _san
da.load_linkmap = _load
da.save_linkmap = _save
sys.modules['app.services.docling_adapter'] = da

import app.server as server


class DummyMinio:
    def __init__(self):
        self.objs = []

    def put_object(self, bucket_name, object_name, data, length, content_type):
        self.objs.append((bucket_name, object_name, length, content_type))

    def get_presigned_url(self, method, bucket, obj, expires=None):
        return f"http://127.0.0.1:9000/{bucket}/{obj}"

    def presigned_get_object(self, bucket, obj, expires=None):
        return f"http://127.0.0.1:9000/{bucket}/{obj}"


PNG = (b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01\x00\x00\x00\x01\x08\x02\x00\x00\x00\x90wS\xde\x00\x00\x00\nIDATx\x9cc\xf8\x0f\x00\x01\x01\x01\x00\x18\xdd\xdc\xa4\x00\x00\x00\x00IEND\xaeB`\x82")


def setup_module(module=None):
    server._minio_current = lambda: (DummyMinio(), "doctest", "http://127.0.0.1:9000", "assets")

    def fake_convert(src, export="markdown", engine=None):
        d = Path(tempfile.mkdtemp(prefix="artifacts_"))
        (d / "img.png").write_bytes(PNG)
        return ("utf-8", "A\n<!-- image -->\nB", str(d))

    server._converter_v2.convert = fake_convert
    server._extract_pdf_images = lambda pdf_path: [("png", PNG), ("png", PNG)]


import unittest


class TestApiConvert(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        setup_module()

    def test_api_convert_save_true_returns_md_url(self):
        app = server.app
        mc = server._minio_current()
        assert mc[1] == 'doctest'
        c = TestClient(app)
        files = {"file": ("管理端使用说明 (1).pdf", b"%PDF-1.4\n")}
        data = {"export": "markdown", "save": "true", "filename": "管理端使用说明 (1)"}
        r = c.post("/api/convert", files=files, data=data)
        j = r.json()
        self.assertEqual(j["code"], 0, str(j))
        self.assertTrue(j["data"]["name"].lower().endswith(".md"))
        self.assertTrue(j["data"]["minio_url"].lower().endswith(".md"))

    def test_api_convert_save_false_returns_content_and_md_name(self):
        app = server.app
        mc = server._minio_current()
        assert mc[1] == 'doctest'
        c = TestClient(app)
        files = {"file": ("文档.pdf", b"%PDF-1.4\n")}
        data = {"export": "markdown", "save": "false", "filename": "文档"}
        r = c.post("/api/convert", files=files, data=data)
        j = r.json()
        self.assertEqual(j["code"], 0, str(j))
        self.assertTrue(j["data"]["name"].lower().endswith(".md"))
        self.assertIn("![image](", j["data"]["content"])
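
Note (illustrative, not part of the commit): this test file avoids importing the heavy Docling stack by registering throwaway modules in `sys.modules` before `app.server` is imported. The sketch below shows that general technique in isolation; `install_stub` and the module name `heavy_converter_stub` are hypothetical names chosen for the example, not identifiers from the repository.

```python
import sys
import types


def install_stub(name: str, **attrs) -> types.ModuleType:
    """Register a throwaway module under `name` so later imports resolve to it."""
    mod = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(mod, key, value)
    sys.modules[name] = mod
    return mod


# A top-level name is used here on purpose: for a dotted name such as
# "pkg.sub", the parent package "pkg" must also be importable or stubbed.
install_stub("heavy_converter_stub", convert=lambda src: ("", "text/markdown"))

# The import system checks sys.modules first, so no file on disk is needed.
from heavy_converter_stub import convert
```

The stub has to be installed before the first import of any module that depends on it, which is presumably why the test file performs its stubbing ahead of `import app.server as server`.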

24
frontend/.gitignore vendored Normal file
View File

@@ -0,0 +1,24 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*
node_modules
dist
dist-ssr
*.local
# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?

3
frontend/.vscode/extensions.json vendored Normal file
View File

@@ -0,0 +1,3 @@
{
"recommendations": ["Vue.volar"]
}

5
frontend/README.md Normal file
View File

@@ -0,0 +1,5 @@
# Vue 3 + TypeScript + Vite
This template should help get you started developing with Vue 3 and TypeScript in Vite. The template uses Vue 3 `<script setup>` SFCs; check out the [script setup docs](https://v3.vuejs.org/api/sfc-script-setup.html#sfc-script-setup) to learn more.
Learn more about the recommended Project Setup and IDE Support in the [Vue Docs TypeScript Guide](https://vuejs.org/guide/typescript/overview.html#project-setup).

13
frontend/index.html Normal file
View File

@@ -0,0 +1,13 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<link rel="icon" type="image/svg+xml" href="/vite.svg" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>frontend</title>
</head>
<body>
<div id="app"></div>
<script type="module" src="/src/main.ts"></script>
</body>
</html>

1454
frontend/package-lock.json generated Normal file

File diff suppressed because it is too large

23
frontend/package.json Normal file
View File

@@ -0,0 +1,23 @@
{
"name": "frontend",
"private": true,
"version": "0.0.0",
"type": "module",
"scripts": {
"dev": "vite",
"build": "vue-tsc -b && vite build",
"preview": "vite preview"
},
"dependencies": {
"marked": "^17.0.1",
"vue": "^3.5.24"
},
"devDependencies": {
"@types/node": "^24.10.1",
"@vitejs/plugin-vue": "^6.0.1",
"@vue/tsconfig": "^0.8.1",
"typescript": "~5.9.3",
"vite": "^7.2.4",
"vue-tsc": "^3.1.4"
}
}

1
frontend/public/vite.svg Normal file
View File

@@ -0,0 +1 @@
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" class="iconify iconify--logos" width="31.88" height="32" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 257"><defs><linearGradient id="IconifyId1813088fe1fbc01fb466" x1="-.828%" x2="57.636%" y1="7.652%" y2="78.411%"><stop offset="0%" stop-color="#41D1FF"></stop><stop offset="100%" stop-color="#BD34FE"></stop></linearGradient><linearGradient id="IconifyId1813088fe1fbc01fb467" x1="43.376%" x2="50.316%" y1="2.242%" y2="89.03%"><stop offset="0%" stop-color="#FFEA83"></stop><stop offset="8.333%" stop-color="#FFDD35"></stop><stop offset="100%" stop-color="#FFA800"></stop></linearGradient></defs><path fill="url(#IconifyId1813088fe1fbc01fb466)" d="M255.153 37.938L134.897 252.976c-2.483 4.44-8.862 4.466-11.382.048L.875 37.958c-2.746-4.814 1.371-10.646 6.827-9.67l120.385 21.517a6.537 6.537 0 0 0 2.322-.004l117.867-21.483c5.438-.991 9.574 4.796 6.877 9.62Z"></path><path fill="url(#IconifyId1813088fe1fbc01fb467)" d="M185.432.063L96.44 17.501a3.268 3.268 0 0 0-2.634 3.014l-5.474 92.456a3.268 3.268 0 0 0 3.997 3.378l24.777-5.718c2.318-.535 4.413 1.507 3.936 3.838l-7.361 36.047c-.495 2.426 1.782 4.5 4.151 3.78l15.304-4.649c2.372-.72 4.652 1.36 4.15 3.788l-11.698 56.621c-.732 3.542 3.979 5.473 5.943 2.437l1.313-2.028l72.516-144.72c1.215-2.423-.88-5.186-3.54-4.672l-25.505 4.922c-2.396.462-4.435-1.77-3.759-4.114l16.646-57.705c.677-2.35-1.37-4.583-3.769-4.113Z"></path></svg>

101
frontend/src/App.vue Normal file
View File

@@ -0,0 +1,101 @@
<script setup lang="ts">
import { ref } from 'vue'
import DocToMd from './components/DocToMd.vue'
import BatchProcess from './components/BatchProcess.vue'
import MdToDoc from './components/MdToDoc.vue'
import ConfigModal from './components/ConfigModal.vue'
const showConfig = ref(false)
const activePage = ref<'doc-to-md' | 'batch' | 'md-to-doc'>('doc-to-md')
</script>
<template>
<div class="app-container">
<div class="hero">
<div class="hero-title">FunMD文档处理接口</div>
<button class="hero-btn" @click="showConfig = true">数据库配置</button>
</div>
<div class="top-tabs">
<div class="top-tab" :class="{active: activePage === 'doc-to-md'}" @click="activePage = 'doc-to-md'">DOCX/PDF Markdown</div>
<div class="top-tab" :class="{active: activePage === 'batch'}" @click="activePage = 'batch'">批量处理</div>
<div class="top-tab" :class="{active: activePage === 'md-to-doc'}" @click="activePage = 'md-to-doc'">Markdown DOCX/PDF</div>
</div>
<div class="main-content">
<DocToMd v-if="activePage === 'doc-to-md'" />
<BatchProcess v-if="activePage === 'batch'" />
<MdToDoc v-if="activePage === 'md-to-doc'" />
</div>
<ConfigModal v-model="showConfig" />
</div>
</template>
<style>
body {
font-family: -apple-system, system-ui, Segoe UI, Roboto, Helvetica, Arial, sans-serif;
margin: 0;
padding: 24px;
background: #f6f7f9;
}
* {
box-sizing: border-box;
}
</style>
<style scoped>
.app-container {
max-width: 920px;
margin: 0 auto;
}
.hero {
text-align: center;
margin: 12px 0 16px;
}
.hero-title {
font-size: 28px;
font-weight: 800;
color: #111827;
letter-spacing: 0.5px;
}
.hero-btn {
margin-top: 8px;
background: linear-gradient(90deg, #1d4ed8, #2563eb);
color: #fff;
border: none;
border-radius: 10px;
padding: 10px 16px;
cursor: pointer;
box-shadow: 0 2px 10px rgba(29, 78, 216, 0.25);
font-weight: 500;
}
.spacer {
height: 16px;
}
.top-tabs {
display: flex;
gap: 8px;
justify-content: center;
margin: 8px 0 16px;
}
.top-tab {
padding: 8px 14px;
border: 1px solid #d1d5db;
border-radius: 999px;
cursor: pointer;
background: #f9fafb;
font-size: 14px;
}
.top-tab.active {
background: #2563eb;
color: #fff;
border-color: #2563eb;
}
</style>

View File

@@ -0,0 +1 @@
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" class="iconify iconify--logos" width="37.07" height="36" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 198"><path fill="#41B883" d="M204.8 0H256L128 220.8L0 0h97.92L128 51.2L157.44 0h47.36Z"></path><path fill="#41B883" d="m0 0l128 220.8L256 0h-51.2L128 132.48L50.56 0H0Z"></path><path fill="#35495E" d="M50.56 0L128 133.12L204.8 0h-47.36L128 51.2L97.92 0H50.56Z"></path></svg>

View File

@@ -0,0 +1,448 @@
<script setup lang="ts">
import { ref } from 'vue'
import { convertFolder, stageArchive, processArchive, uploadList, sendImportToCms, setCmsConfig } from '../services/api'
const mode = ref<'folder' | 'archive' | 'list'>('folder')
const folderPath = ref('')
const prefix = ref('')
const file = ref<File | null>(null)
const stagedId = ref('')
const versionId = ref<number>(1001)
const listFile = ref<File | null>(null)
const loading = ref(false)
const result = ref<any>(null)
const error = ref('')
const showRaw = ref(false)
const cmsConfigured = ref(false)
function onFileChange(e: Event) {
const target = e.target as HTMLInputElement
const selectedFile = target.files?.[0]
if (selectedFile) {
file.value = selectedFile
}
}
function onListFileChange(e: Event) {
const target = e.target as HTMLInputElement
const selectedFile = target.files?.[0]
if (selectedFile) {
listFile.value = selectedFile
}
}
async function ensureCmsConfig(): Promise<boolean> {
try {
const base = (localStorage.getItem('cms.api.base') || '').trim()
if (base) { cmsConfigured.value = true; return true }
const b = window.prompt('请输入 CMS 接口地址(如 http://127.0.0.1:8080)')
if (!b) return false
const t = window.prompt('可选:请输入 Bearer Token若接口需要认证') || ''
setCmsConfig(b, t)
cmsConfigured.value = true
return true
} catch { return false }
}
async function sendToCms() {
if (!result.value || !result.value.import) {
alert('没有导入JSON可发送')
return
}
const ok = await ensureCmsConfig()
if (!ok) return
const r = await sendImportToCms(result.value.import)
if (r.ok) {
alert('已发送到 CMS 导入接口')
} else {
alert(`发送失败:${r.error || '未知错误'}${r.status ? ' (HTTP '+r.status+')' : ''}`)
}
}
async function handleStageArchive() {
if (!file.value) {
error.value = '请选择一个压缩包zip、tar、tgz'
return
}
loading.value = true
error.value = ''
result.value = null
try {
const res = await stageArchive(file.value, prefix.value || undefined)
if (res.code === 0) {
stagedId.value = res.data.id
} else {
error.value = res.msg || '上传失败'
}
} catch (e) {
error.value = '网络错误或上传失败'
} finally {
loading.value = false
}
}
async function handleProcessArchive() {
if (!stagedId.value) {
error.value = '请先上传压缩包'
return
}
loading.value = true
error.value = ''
result.value = null
try {
const res = await processArchive(stagedId.value, prefix.value || undefined, versionId.value)
if (res.code === 0) {
result.value = res.data
} else {
error.value = res.msg || '处理失败'
}
} catch (e) {
error.value = '网络错误或处理失败'
} finally {
loading.value = false
}
}
async function handleConvertFolder() {
if (!folderPath.value.trim()) {
error.value = '请输入本地文件夹路径'
return
}
loading.value = true
error.value = ''
result.value = null
try {
const res = await convertFolder(folderPath.value, prefix.value || undefined)
if (res.ok) {
result.value = res
} else {
error.value = '处理失败'
}
} catch (e) {
error.value = '网络错误或处理失败'
} finally {
loading.value = false
}
}
async function handleUploadList() {
if (!listFile.value) {
error.value = '请选择一个包含路径或URL的文本文件'
return
}
loading.value = true
error.value = ''
result.value = null
try {
const res = await uploadList(listFile.value, prefix.value || undefined, versionId.value)
if (res.code === 0) {
result.value = res.data
} else {
error.value = res.msg || '处理失败'
}
} catch (e) {
error.value = '网络错误或处理失败'
} finally {
loading.value = false
}
}
</script>
<template>
<div class="card">
<h1>批量处理</h1>
<div class="row">
<label>模式</label>
<div class="radio-group">
<label><input type="radio" v-model="mode" value="folder"> 本地文件夹路径</label>
<label><input type="radio" v-model="mode" value="archive"> 上传压缩包</label>
<label><input type="radio" v-model="mode" value="list"> 路径/URL 列表</label>
</div>
</div>
<template v-if="mode === 'folder'">
<p class="description">输入本地文件夹路径服务端将批量重写 Markdown 引用并上传到 MinIO</p>
<div class="row">
<label>文件夹路径</label>
<input type="text" v-model="folderPath" placeholder="/Users/xxx/Docs/Manuals" />
</div>
<div class="row">
<label>MinIO 前缀可选</label>
<input type="text" v-model="prefix" placeholder="assets" />
</div>
<div class="actions">
<button @click="handleConvertFolder" :disabled="loading">
{{ loading ? '正在处理...' : '开始处理' }}
</button>
</div>
<div v-if="error" class="error-msg">{{ error }}</div>
<div v-if="result" class="result-area">
<h4>处理完成</h4>
<div class="actions" style="margin-top:8px;">
<button class="btn-secondary" @click="showRaw = !showRaw">{{ showRaw ? '隐藏返回JSON' : '显示返回JSON' }}</button>
<button class="btn-secondary" v-if="result.import" @click="sendToCms">发送到 CMS</button>
</div>
<div class="info-item">
<span class="label">文件数</span>
<span>{{ result.count }}</span>
</div>
<div class="files">
<div class="file" v-for="(f, i) in result.files" :key="i">
<div class="file-row">
<span class="label">源文件</span>
<span>{{ f.source }}</span>
</div>
<div class="file-row" v-if="(f.minio_presigned_url || f.minio_url)">
<span class="label">打开</span>
<a :href="f.minio_presigned_url || f.minio_url" target="_blank">查看</a>
</div>
<div class="file-row" v-if="f.minio_url">
<span class="label">MinIO</span>
<a :href="f.minio_url" target="_blank">{{ f.minio_url }}</a>
</div>
<div class="file-row" v-if="f.minio_presigned_url">
<span class="label">临时下载</span>
<a :href="f.minio_presigned_url" target="_blank">下载链接</a>
</div>
<div class="file-row">
<span class="label">资源重写</span>
<span>成功 {{ f.asset_ok }}失败 {{ f.asset_fail }}</span>
</div>
</div>
</div>
<pre v-if="showRaw" class="json-view">{{ JSON.stringify(result, null, 2) }}</pre>
</div>
</template>
<template v-else-if="mode === 'archive'">
<p class="description">分两步先上传压缩包再点击开始转换</p>
<div class="row">
<label>压缩包文件</label>
<input type="file" accept=".zip,.tar,.gz,.tgz" @change="onFileChange" />
</div>
<div class="actions">
<button @click="handleStageArchive" :disabled="loading">上传压缩包</button>
<button @click="handleProcessArchive" :disabled="loading || !stagedId">开始转换</button>
</div>
<div v-if="error" class="error-msg">{{ error }}</div>
<div v-if="result" class="result-area">
<h4>处理完成</h4>
<div class="actions" style="margin-top:8px;">
<button class="btn-secondary" @click="showRaw = !showRaw">{{ showRaw ? '隐藏返回JSON' : '显示返回JSON' }}</button>
<button class="btn-secondary" v-if="result.import" @click="sendToCms">发送到 CMS</button>
</div>
<div class="info-item">
<span class="label">文件数</span>
<span>{{ result.count }}</span>
</div>
<div class="files">
<div class="file" v-for="(f, i) in result.files" :key="i">
<div class="file-row">
<span class="label">源文件</span>
<span>{{ f.source }}</span>
</div>
<div class="file-row" v-if="(f.minio_presigned_url || f.minio_url)">
<span class="label">打开</span>
<a :href="f.minio_presigned_url || f.minio_url" target="_blank">查看</a>
</div>
<div class="file-row" v-if="f.minio_url">
<span class="label">MinIO</span>
<a :href="f.minio_url" target="_blank">{{ f.minio_url }}</a>
</div>
<div class="file-row" v-if="f.minio_presigned_url">
<span class="label">临时下载</span>
<a :href="f.minio_presigned_url" target="_blank">下载链接</a>
</div>
</div>
</div>
<div class="info-item" v-if="result.import">
<span class="label">导入JSON</span>
<span>{{ JSON.stringify(result.import) }}</span>
</div>
<pre v-if="showRaw" class="json-view">{{ JSON.stringify(result, null, 2) }}</pre>
</div>
</template>
<template v-else>
<p class="description">上传一个包含本地路径或 URL 的文本文件每行一个</p>
<div class="row">
<label>列表文件</label>
<input type="file" accept="text/plain" @change="onListFileChange" />
</div>
<div class="row">
<label>MinIO 前缀可选</label>
<input type="text" v-model="prefix" placeholder="assets" />
</div>
<div class="row">
<label>版本ID</label>
<input type="number" v-model.number="versionId" />
</div>
<div class="actions">
<button @click="handleUploadList" :disabled="loading">开始处理</button>
</div>
<div v-if="error" class="error-msg">{{ error }}</div>
<div v-if="result" class="result-area">
<h4>处理完成</h4>
<div class="actions" style="margin-top:8px;">
<button class="btn-secondary" @click="showRaw = !showRaw">{{ showRaw ? '隐藏返回JSON' : '显示返回JSON' }}</button>
<button class="btn-secondary" v-if="result.import" @click="sendToCms">发送到 CMS</button>
</div>
<div class="info-item">
<span class="label">文件数</span>
<span>{{ result.count }}</span>
</div>
<div class="files">
<div class="file" v-for="(f, i) in result.files" :key="i">
<div class="file-row">
<span class="label">源文件</span>
<span>{{ f.source }}</span>
</div>
<div class="file-row" v-if="(f.minio_presigned_url || f.minio_url)">
<span class="label">打开</span>
<a :href="f.minio_presigned_url || f.minio_url" target="_blank">查看</a>
</div>
<div class="file-row" v-if="f.minio_url">
<span class="label">MinIO</span>
<a :href="f.minio_url" target="_blank">{{ f.minio_url }}</a>
</div>
<div class="file-row" v-if="f.minio_presigned_url">
<span class="label">临时下载</span>
<a :href="f.minio_presigned_url" target="_blank">下载链接</a>
</div>
</div>
</div>
<div class="info-item" v-if="result.import">
<span class="label">导入JSON</span>
<span>{{ JSON.stringify(result.import) }}</span>
</div>
<pre v-if="showRaw" class="json-view">{{ JSON.stringify(result, null, 2) }}</pre>
</div>
</template>
</div>
</template>
<style scoped>
.card {
background: #fff;
border: 1px solid #e5e7eb;
border-radius: 12px;
padding: 20px;
box-shadow: 0 1px 2px rgba(0,0,0,0.05);
}
h1 {
font-size: 20px;
margin: 0 0 12px;
color: #111827;
}
.description {
color: #6b7280;
margin-bottom: 20px;
font-size: 14px;
}
.row {
display: flex;
gap: 12px;
align-items: center;
margin: 12px 0;
}
label {
min-width: 120px;
color: #374151;
font-weight: 500;
}
input[type="file"] {
flex: 1;
}
.actions {
margin-top: 20px;
}
button {
background: #2563eb;
color: #fff;
border: none;
border-radius: 8px;
padding: 10px 20px;
cursor: pointer;
font-weight: 500;
}
.btn-secondary {
background: #6b7280;
}
.json-view {
margin-top: 12px;
border: 1px solid #e5e7eb;
border-radius: 8px;
padding: 12px;
background: #f9fafb;
color: #111827;
font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
max-height: 300px;
overflow: auto;
}
button:disabled {
background: #9ca3af;
cursor: not-allowed;
}
.error-msg {
color: #dc2626;
margin-top: 12px;
}
.result-area {
margin-top: 24px;
background: #f0fdf4;
border: 1px solid #bbf7d0;
padding: 16px;
border-radius: 8px;
}
.files {
margin-top: 12px;
display: grid;
gap: 10px;
}
.file {
background: #fff;
border: 1px solid #e5e7eb;
border-radius: 8px;
padding: 10px;
}
.file-row {
display: flex;
gap: 8px;
align-items: center;
margin: 6px 0;
}
.info-item {
margin-bottom: 8px;
display: flex;
gap: 8px;
}
.label {
font-weight: 600;
}
.note {
margin-top: 12px;
font-size: 13px;
color: #6b7280;
}
</style>

View File

@@ -0,0 +1,494 @@
<script setup lang="ts">
import { ref, reactive, onMounted, computed } from 'vue'
import { setMinioConfig, testMinioConfig, listProfiles, saveProfile, loadProfile, getConfigSnapshot, type MinioConfig, setApiBase, createBucket, checkServerTime, syncServerTime } from '../services/api'
defineProps<{
modelValue: boolean
}>()
const emit = defineEmits<{
(e: 'update:modelValue', value: boolean): void
}>()
const config = reactive<MinioConfig>({
endpoint: '',
public: '',
access: '',
secret: '',
bucket: 'docs',
secure: false,
prefix: 'assets',
store_final: true,
public_read: true
})
const statusMsg = ref('未验证')
const statusColor = ref('#6b7280')
const showSecret = ref(false)
const secretType = computed(() => (showSecret.value ? 'text' : 'password'))
const profiles = ref<string[]>([])
const selectedProfile = ref('')
const saveName = ref('')
const apiBase = ref('')
const timeMsg = ref('未校准')
const timeColor = ref('#6b7280')
const syncing = ref(false)
function toBool(v: any): boolean {
const s = String(v ?? '').toLowerCase()
return s === '1' || s === 'true' || s === 'yes' || s === 'on'
}
function close() {
emit('update:modelValue', false)
}
async function testConnection() {
statusMsg.value = '正在测试...'
statusColor.value = '#2563eb'
try {
const res = await testMinioConfig(config)
if (res.ok && res.bucket_exists) {
statusMsg.value = '连接正常'
statusColor.value = '#059669'
} else if (res.ok && !res.bucket_exists) {
// Automatically try to create the bucket and apply the policy
const mk = await createBucket(config)
if (mk.ok) {
// After a successful create, verify again
const ver = await testMinioConfig(config)
if (ver.ok && ver.bucket_exists) {
statusMsg.value = '桶已创建并应用策略'
statusColor.value = '#059669'
} else {
statusMsg.value = '创建后校验失败(请检查端口、凭据或网络)'
statusColor.value = '#dc2626'
}
} else {
// Create reported failure; verify once more and treat an existing bucket as success
const ver = await testMinioConfig(config)
if (ver.ok && ver.bucket_exists) {
statusMsg.value = '桶已存在(创建返回失败但验证通过)'
statusColor.value = '#059669'
} else {
statusMsg.value = `桶不存在(创建失败):${mk.error || '未知错误'}`
statusColor.value = '#dc2626'
}
}
} else if (!res.ok && res.error) {
statusMsg.value = res.hint ? `错误:${res.error}${res.hint}` : `错误:${res.error}`
statusColor.value = '#dc2626'
} else {
statusMsg.value = '连接失败'
statusColor.value = '#dc2626'
}
} catch (e) {
statusMsg.value = '网络错误'
statusColor.value = '#dc2626'
}
}
async function saveConfig() {
try {
setApiBase(apiBase.value.trim())
await setMinioConfig(config)
await fetchProfiles()
alert('配置已保存!')
close()
} catch (e) {
alert('保存失败')
}
}
async function checkTime() {
try {
timeMsg.value = '正在检查...'
timeColor.value = '#2563eb'
const r = await checkServerTime(config)
if (r.ok) {
const d = r.diff_sec ?? null
if (d !== null) {
timeMsg.value = d === 0 ? '时间一致' : `偏差约 ${d}${r.hint ? '(' + r.hint + ')' : ''}`
timeColor.value = d <= 2 ? '#059669' : '#f59e0b'
} else {
timeMsg.value = r.hint ? `无法获取时间(${r.hint})` : '无法获取时间'
timeColor.value = '#dc2626'
}
} else {
timeMsg.value = r.error ? `错误:${r.error}` : '检查失败'
timeColor.value = '#dc2626'
}
} catch {
timeMsg.value = '网络错误'
timeColor.value = '#dc2626'
}
}
async function syncTime() {
if (syncing.value) return
syncing.value = true
timeMsg.value = '正在同步...'
timeColor.value = '#2563eb'
try {
const r = await syncServerTime('auto')
const ok = !!(r && r.ok)
await checkTime()
if (!ok) {
alert('时间同步可能需要管理员权限,请手动授予或在服务端执行')
}
} catch {
timeMsg.value = '同步失败'
timeColor.value = '#dc2626'
} finally {
syncing.value = false
}
}
async function fetchProfiles() {
try {
const res = await listProfiles()
profiles.value = res.profiles || []
} catch {
profiles.value = []
}
}
async function applyProfile() {
if (!selectedProfile.value) return
const res = await loadProfile(selectedProfile.value)
if (res.ok && res.config && res.config.minio) {
const m = res.config.minio
config.endpoint = m.endpoint || ''
config.public = m.public || ''
config.access = m.access || ''
config.secret = m.secret || ''
config.bucket = m.bucket || ''
config.secure = toBool(m.secure)
config.prefix = m.prefix || ''
config.store_final = toBool(m.store_final)
config.public_read = toBool(m.public_read)
}
}
async function saveAsProfile() {
if (!saveName.value.trim()) return
await setMinioConfig(config)
const res = await saveProfile(saveName.value.trim())
if (res.ok) {
await fetchProfiles()
alert('Configuration saved as: ' + (res.name || saveName.value))
} else {
alert('Save failed')
}
}
onMounted(async () => {
await fetchProfiles()
try {
const snap = await getConfigSnapshot()
const m = snap.minio || {}
config.endpoint = m.endpoint || config.endpoint
config.public = m.public || config.public
config.access = m.access || config.access
config.secret = m.secret || config.secret
config.bucket = m.bucket || config.bucket
config.secure = toBool(m.secure)
config.prefix = m.prefix || config.prefix
config.store_final = toBool(m.store_final)
config.public_read = toBool(m.public_read)
} catch {}
try {
const v = localStorage.getItem('app.api.base') || ''
apiBase.value = v
} catch {}
try { await checkTime() } catch {}
})
</script>
<template>
<div v-if="modelValue" class="modal-overlay" @click.self="close">
<div class="modal-content">
<div class="modal-header">
<h3>Connection Settings</h3>
<button class="close-btn" @click="close">×</button>
</div>
<div class="modal-body">
<h4 class="section-title">General</h4>
<div class="form-row">
<label>API base URL</label>
<input v-model="apiBase" placeholder="http://127.0.0.1:8000" />
</div>
<h4 class="section-title">MinIO Settings</h4>
<div class="form-row">
<label>Endpoint</label>
<input v-model="config.endpoint" placeholder="127.0.0.1:9000" />
</div>
<div class="form-row">
<label>Public URL</label>
<input v-model="config.public" placeholder="http://127.0.0.1:9000" />
</div>
<div class="form-row">
<label>Access key</label>
<input v-model="config.access" />
</div>
<div class="form-row">
<label>Secret key</label>
<div class="password-row">
<input v-model="config.secret" :type="secretType" autocomplete="off" />
<button type="button" class="btn-secondary" @click="showSecret = !showSecret">{{ showSecret ? 'Hide' : 'Show' }}</button>
</div>
</div>
<div class="form-row">
<label>Bucket</label>
<input v-model="config.bucket" placeholder="docs" />
</div>
<div class="form-row checkbox-row">
<label>
<input type="checkbox" v-model="config.secure" />
Use HTTPS
</label>
</div>
<div class="form-row">
<label>Object prefix</label>
<input v-model="config.prefix" placeholder="assets" />
</div>
<div class="form-row checkbox-row">
<label>
<input type="checkbox" v-model="config.store_final" />
Store final files
</label>
<label style="margin-left: 16px;">
<input type="checkbox" v-model="config.public_read" />
Public-read bucket
</label>
</div>
<div class="form-row">
<label>Saved profiles</label>
<div class="profiles-row">
<select v-model="selectedProfile">
<option value="">Select a profile</option>
<option v-for="p in profiles" :key="p" :value="p">{{ p }}</option>
</select>
<button type="button" class="btn-secondary" @click="applyProfile" :disabled="!selectedProfile">Apply profile</button>
</div>
</div>
<div class="form-row">
<label>Profile name</label>
<div class="profiles-row">
<input v-model="saveName" placeholder="e.g. test or default" />
<button type="button" class="btn-secondary" @click="saveAsProfile" :disabled="!saveName.trim()">Save as profile</button>
</div>
</div>
</div>
<div class="modal-footer">
<div class="status" :style="{ color: statusColor }">{{ statusMsg }}</div>
<div class="actions">
<button class="btn-cancel" @click="close">Cancel</button>
<button class="btn-primary" @click="testConnection">Test connection</button>
<button class="btn-primary" @click="saveConfig">Save settings</button>
<button class="btn-secondary" @click="checkTime">Check time</button>
<button class="btn-secondary" @click="syncTime" :disabled="syncing">Sync time</button>
<div class="status" :style="{ color: timeColor }">{{ timeMsg }}</div>
</div>
</div>
</div>
</div>
</template>
<style scoped>
.modal-overlay {
position: fixed;
top: 0;
left: 0;
width: 100%;
height: 100%;
background: rgba(0, 0, 0, 0.5);
display: flex;
justify-content: center;
align-items: center;
z-index: 1000;
}
.modal-content {
background: white;
border-radius: 12px;
width: 640px;
max-width: 90%;
box-shadow: 0 8px 24px rgba(0, 0, 0, 0.12);
display: flex;
flex-direction: column;
}
.modal-header {
padding: 16px;
border-bottom: 1px solid #e5e7eb;
display: flex;
justify-content: space-between;
align-items: center;
}
.modal-header h3 {
margin: 0;
font-size: 18px;
font-weight: 600;
}
.close-btn {
background: none;
border: none;
font-size: 24px;
cursor: pointer;
color: #6b7280;
}
.modal-body {
padding: 16px;
overflow-y: auto;
max-height: 60vh;
}
.section-title {
margin: 0 0 12px 0;
font-size: 14px;
color: #6b7280;
text-transform: uppercase;
letter-spacing: 0.05em;
}
.form-row {
margin-bottom: 12px;
display: grid;
grid-template-columns: 140px 1fr auto;
align-items: center;
gap: 8px 12px;
}
.form-row label {
display: block;
margin: 0;
font-size: 14px;
color: #374151;
font-weight: 500;
text-align: right;
padding-right: 8px;
}
.form-row input:not([type]),
.form-row input[type="text"],
.form-row input[type="password"],
.form-row select {
width: 100%;
padding: 8px 12px;
border: 1px solid #d1d5db;
border-radius: 6px;
font-size: 14px;
box-sizing: border-box;
}
.password-row {
display: grid;
grid-template-columns: 1fr auto;
align-items: center;
gap: 8px;
}
.profiles-row {
display: flex;
gap: 8px;
}
.checkbox-row {
display: flex;
align-items: center;
padding-left: 140px;
}
.checkbox-row label {
display: flex;
align-items: center;
gap: 6px;
margin-bottom: 0;
cursor: pointer;
}
.modal-footer {
padding: 16px;
border-top: 1px solid #e5e7eb;
display: flex;
justify-content: space-between;
align-items: center;
}
.actions {
display: flex;
gap: 8px;
}
.btn-cancel {
padding: 8px 16px;
background: #f3f4f6;
border: 1px solid #d1d5db;
border-radius: 6px;
cursor: pointer;
color: #111827;
font-size: 14px;
}
.btn-primary {
padding: 8px 16px;
background: #2563eb;
color: white;
border: none;
border-radius: 6px;
cursor: pointer;
font-size: 14px;
}
.btn-primary:hover {
background: #1d4ed8;
}
.btn-secondary {
padding: 8px 12px;
background: #f3f4f6;
border: 1px solid #d1d5db;
border-radius: 6px;
cursor: pointer;
height: 36px;
display: inline-flex;
align-items: center;
justify-content: center;
font-size: 13px;
color: #111827;
}
.btn-secondary:disabled {
background: #f9fafb;
border-color: #e5e7eb;
color: #9ca3af;
cursor: not-allowed;
opacity: 1;
}
.btn-cancel {
height: 36px;
}
.status {
font-size: 14px;
font-weight: 500;
background: #f9fafb;
border: 1px solid #e5e7eb;
border-radius: 999px;
padding: 6px 12px;
}
</style>
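
The status logic in `checkTime` above can be isolated as a small pure function, which makes the threshold easier to test. This is only a sketch: `describeSkew` and `SkewStatus` are hypothetical names that do not exist in this commit, and it assumes that `diff_sec` may be negative, which the source does not confirm. The 2-second tolerance and the color values are taken from the component.

```typescript
// Hypothetical helper for illustration; not part of this commit.
interface SkewStatus {
  text: string
  color: string
}

function describeSkew(diffSec: number | null | undefined, hint?: string): SkewStatus {
  const suffix = hint ? ` (${hint})` : ''
  if (diffSec === null || diffSec === undefined) {
    // No measurement available: report it in red.
    return { text: `Unable to get time${suffix}`, color: '#dc2626' }
  }
  if (diffSec === 0) {
    return { text: 'Clocks match', color: '#059669' }
  }
  // Assumption: diffSec may be negative, so compare its magnitude with the threshold.
  const withinTolerance = Math.abs(diffSec) <= 2
  return {
    text: `Offset about ${diffSec} s${suffix}`,
    color: withinTolerance ? '#059669' : '#f59e0b'
  }
}
```

With such a helper the component would only assign `text` and `color` to `timeMsg` and `timeColor`, and the branches could be unit-tested without mounting the modal.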


@@ -0,0 +1,337 @@
<script setup lang="ts">
import { ref, computed } from 'vue'
import { convertDoc } from '../services/api'
import { marked } from 'marked'
const mode = ref<'url' | 'file'>('url')
const sourceUrl = ref('')
const file = ref<File | null>(null)
const exportFormat = ref('markdown')
const saveToServer = ref(true)
const filename = ref('')
const loading = ref(false)
const result = ref<any>(null)
const error = ref('')
const activeTab = ref<'preview' | 'raw' | 'trace'>('preview')
function onFileChange(e: Event) {
const target = e.target as HTMLInputElement
const selectedFile = target.files?.[0]
if (selectedFile) {
file.value = selectedFile
}
}
async function handleConvert() {
loading.value = true
error.value = ''
result.value = null
const formData = new FormData()
if (mode.value === 'url') {
if (!sourceUrl.value) {
error.value = 'Please enter a URL'
loading.value = false
return
}
formData.append('source_url', sourceUrl.value)
} else {
if (!file.value) {
error.value = 'Please select a file'
loading.value = false
return
}
formData.append('file', file.value)
}
formData.append('export', exportFormat.value)
formData.append('save', String(saveToServer.value))
if (filename.value) {
formData.append('filename', filename.value)
}
try {
const res = await convertDoc(formData)
if (res.code === 0) {
result.value = res.data
if (!result.value.content) {
const tryFetch = async (url?: string) => {
if (!url) return false
try {
const r = await fetch(url)
if (r.ok) {
result.value.content = await r.text()
return true
}
} catch {}
return false
}
// Try the direct URL first; if that fails, fall back to the presigned download URL (works when the bucket is not public-read)
const ok = await tryFetch(result.value.minio_url)
if (!ok) {
await tryFetch(result.value.minio_presigned_url)
}
}
} else {
error.value = res.msg || 'Conversion failed'
}
} catch (e) {
error.value = 'Network error'
} finally {
loading.value = false
}
}
const renderedContent = computed(() => {
const mt = String(result.value?.media_type || '').toLowerCase()
const isMd = exportFormat.value === 'markdown' || mt.startsWith('text/markdown')
const isHtml = mt.startsWith('text/html')
let c = result.value?.content as string | undefined
if (!c) return ''
if (isMd) {
try {
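c = c.replace(/!\[[^\]]*\]\(([^)]+)\)/g, (m, u) => {
try {
// new URL() already percent-encodes the path; passing it through encodeURI again
// would turn existing escapes such as %20 into %2520
return m.replace(u, new URL(u).toString())
} catch {
try {
// Not an absolute URL: escape it as a relative path
const enc = encodeURI(u)
return m.replace(u, enc)
} catch {
return m
}
}
})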
} catch {}
return marked.parse(c as string)
}
if (isHtml) return c
return c
})
</script>
<template>
<div class="card">
<h1>DOCX/PDF to Markdown</h1>
<div class="row">
<label>Input mode</label>
<div class="radio-group">
<label><input type="radio" v-model="mode" value="url"> URL</label>
<label><input type="radio" v-model="mode" value="file"> File</label>
</div>
</div>
<div class="row" v-if="mode === 'url'">
<label>URL</label>
<input type="text" v-model="sourceUrl" placeholder="https://..." />
</div>
<div class="row" v-if="mode === 'file'">
<label>File</label>
<input type="file" @change="onFileChange" />
</div>
<div class="row">
<label>Output format</label>
<select v-model="exportFormat">
<option value="markdown">Markdown</option>
<option value="html">HTML</option>
<option value="json">JSON</option>
<option value="doctags">DocTags</option>
</select>
</div>
<div class="row">
<label>Options</label>
<div class="checkbox-group">
<label><input type="checkbox" v-model="saveToServer"> Save to server (MinIO)</label>
</div>
</div>
<div class="row">
<label>Filename (optional)</label>
<input type="text" v-model="filename" placeholder="Inferred from the source by default" />
</div>
<div class="actions">
<button @click="handleConvert" :disabled="loading">
{{ loading ? 'Converting...' : 'Convert' }}
</button>
</div>
<div v-if="error" class="error-msg">{{ error }}</div>
<div v-if="result" class="result-area">
<div class="tabs">
<div
class="tab"
:class="{ active: activeTab === 'preview' }"
@click="activeTab = 'preview'"
>Preview</div>
<div
class="tab"
:class="{ active: activeTab === 'raw' }"
@click="activeTab = 'raw'"
>Source</div>
<div
class="tab"
:class="{ active: activeTab === 'trace' }"
@click="activeTab = 'trace'"
>Trace log</div>
</div>
<div class="output-container">
<div v-if="activeTab === 'preview'" class="preview-content markdown-body" v-html="renderedContent"></div>
<div v-if="activeTab === 'raw'" class="raw-content">{{ result.content }}</div>
<div v-if="activeTab === 'trace'" class="raw-content">
<div v-for="(line, idx) in (result.trace || [])" :key="idx">{{ line }}</div>
<div v-if="(result.mappings || []).length">
<div>mappings:</div>
<div v-for="(m, i) in result.mappings" :key="i">{{ JSON.stringify(m) }}</div>
</div>
</div>
</div>
<div class="links-section" v-if="result.minio_url || result.minio_presigned_url">
<h4>Document links</h4>
<div class="link-item" v-if="result.minio_presigned_url">
<span class="label">Open</span>
<a :href="result.minio_presigned_url" target="_blank">View</a>
</div>
<div class="link-item" v-if="result.minio_url && !result.minio_presigned_url">
<span class="label">MinIO URL</span>
<a :href="result.minio_url" target="_blank">{{ result.minio_url }}</a>
</div>
</div>
</div>
</div>
</template>
<style scoped>
.card {
background: #fff;
border: 1px solid #e5e7eb;
border-radius: 12px;
padding: 20px;
box-shadow: 0 1px 2px rgba(0,0,0,0.05);
}
h1 {
font-size: 20px;
margin: 0 0 12px;
color: #111827;
}
.row {
display: flex;
gap: 12px;
align-items: center;
margin: 12px 0;
}
label {
min-width: 120px;
color: #374151;
font-weight: 500;
}
input[type="text"], select {
flex: 1;
padding: 8px 10px;
border: 1px solid #d1d5db;
border-radius: 6px;
}
.radio-group, .checkbox-group {
display: flex;
gap: 12px;
}
.actions {
margin-top: 20px;
}
button {
background: #2563eb;
color: #fff;
border: none;
border-radius: 8px;
padding: 10px 20px;
cursor: pointer;
font-weight: 500;
}
button:disabled {
background: #9ca3af;
cursor: not-allowed;
}
.error-msg {
color: #dc2626;
margin-top: 12px;
}
.result-area {
margin-top: 24px;
border-top: 1px solid #e5e7eb;
padding-top: 16px;
}
.tabs {
display: flex;
gap: 8px;
margin-bottom: 12px;
}
.tab {
padding: 8px 16px;
border-radius: 6px;
border: 1px solid #d1d5db;
background: #f9fafb;
cursor: pointer;
}
.tab.active {
background: #2563eb;
color: white;
border-color: #2563eb;
}
.output-container {
border: 1px solid #e5e7eb;
border-radius: 8px;
padding: 16px;
min-height: 200px;
background: #fff;
overflow: auto;
max-height: 500px;
}
.raw-content {
white-space: pre-wrap;
font-family: monospace;
}
.links-section {
margin-top: 16px;
background: #f3f4f6;
padding: 12px;
border-radius: 8px;
}
.link-item {
margin-top: 4px;
display: flex;
gap: 8px;
}
.link-item .label {
font-weight: 600;
min-width: auto;
}
</style>
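
`renderedContent` above escapes image URLs in the Markdown before rendering. One pitfall with `encodeURI` is that it also escapes `%`, so a URL that is already escaped gets escaped twice. The sketch below shows one possible way to escape a URL at most once, using only built-in functions. `escapeUrlOnce`, `escapeMarkdownImageUrls` and `NEEDS_ESCAPE` are hypothetical names, not part of this commit, and the allowed-character set is an assumption based on the usual reserved and unreserved URL characters.

```typescript
// Hypothetical helpers for illustration; not part of this commit.

// Runs of characters that are neither unreserved nor reserved URL characters.
// "%" is treated as allowed so that existing escapes such as %20 are left alone.
const NEEDS_ESCAPE = /[^A-Za-z0-9\-._~:\/?#\[\]@!$&'()*+,;=%]+/g

function escapeUrlOnce(raw: string): string {
  try {
    return raw.replace(NEEDS_ESCAPE, (run: string) => encodeURIComponent(run))
  } catch {
    // encodeURIComponent throws URIError on lone surrogates; keep the input as-is.
    return raw
  }
}

// Applies escapeUrlOnce to the target of every Markdown image: ![alt](target)
function escapeMarkdownImageUrls(md: string): string {
  return md.replace(
    /(!\[[^\]]*\]\()([^)]+)(\))/g,
    (_match: string, open: string, url: string, close: string) =>
      `${open}${escapeUrlOnce(url)}${close}`
  )
}
```

Because `%` is never re-escaped, running the function on its own output returns the same string, so it is safe to apply to content that may already have been processed.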


@@ -0,0 +1,41 @@
<script setup lang="ts">
import { ref } from 'vue'
defineProps<{ msg: string }>()
const count = ref(0)
</script>
<template>
<h1>{{ msg }}</h1>
<div class="card">
<button type="button" @click="count++">count is {{ count }}</button>
<p>
Edit
<code>components/HelloWorld.vue</code> to test HMR
</p>
</div>
<p>
Check out
<a href="https://vuejs.org/guide/quick-start.html#local" target="_blank"
>create-vue</a
>, the official Vue + Vite starter
</p>
<p>
Learn more about IDE Support for Vue in the
<a
href="https://vuejs.org/guide/scaling-up/tooling.html#ide-support"
target="_blank"
>Vue Docs Scaling up Guide</a
>.
</p>
<p class="read-the-docs">Click on the Vite and Vue logos to learn more</p>
</template>
<style scoped>
.read-the-docs {
color: #888;
}
</style>


@@ -0,0 +1,384 @@
<script setup lang="ts">
import { ref, watch } from 'vue'
import { convertMd } from '../services/api'
const mode = ref<'text' | 'file' | 'url'>('text')
const mdText = ref('')
const file = ref<File | null>(null)
const url = ref('')
const targetFormat = ref('pdf')
const saveToServer = ref(false)
const filename = ref('')
// Advanced
const showAdvanced = ref(false)
const toc = ref(true)
const headerText = ref('')
const footerText = ref('')
const cssName = ref('default')
const cssText = ref('')
const logoUrl = ref('')
const logoFile = ref<File | null>(null)
const coverUrl = ref('')
const coverFile = ref<File | null>(null)
const productName = ref('')
const documentName = ref('')
const productVersion = ref('')
const documentVersion = ref('')
const copyrightText = ref('')
const loading = ref(false)
const result = ref<any>(null)
const downloadUrl = ref('')
const error = ref('')
watch(targetFormat, (val) => {
// Advanced options apply only to DOCX and PDF output
showAdvanced.value = val === 'docx' || val === 'pdf'
}, { immediate: true })
function onFileChange(e: Event) {
const target = e.target as HTMLInputElement
const selectedFile = target.files?.[0]
if (selectedFile) {
file.value = selectedFile
}
}
function onLogoFileChange(e: Event) {
const target = e.target as HTMLInputElement
const selectedFile = target.files?.[0]
if (selectedFile) {
logoFile.value = selectedFile
}
}
function onCoverFileChange(e: Event) {
const target = e.target as HTMLInputElement
const selectedFile = target.files?.[0]
if (selectedFile) {
coverFile.value = selectedFile
}
}
async function handleConvert() {
loading.value = true
error.value = ''
result.value = null
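// Release the previous blob URL, if any, before starting a new conversion
if (downloadUrl.value && downloadUrl.value.startsWith('blob:')) URL.revokeObjectURL(downloadUrl.value)
downloadUrl.value = ''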
const fd = new FormData()
if (mode.value === 'text') {
if (!mdText.value.trim()) {
error.value = 'Please enter Markdown text'
loading.value = false
return
}
fd.append('markdown_text', mdText.value)
} else if (mode.value === 'file') {
if (!file.value) {
error.value = 'Please select a file'
loading.value = false
return
}
fd.append('md_file', file.value)
} else {
if (!url.value.trim()) {
error.value = 'Please enter a URL'
loading.value = false
return
}
fd.append('markdown_url', url.value)
}
fd.append('target', targetFormat.value)
if (saveToServer.value) fd.append('save', 'true')
if (filename.value) fd.append('filename', filename.value)
// Advanced params
if (showAdvanced.value) {
fd.append('toc', String(toc.value))
if (headerText.value) fd.append('header_text', headerText.value)
if (footerText.value) fd.append('footer_text', footerText.value)
if (cssName.value) fd.append('css_name', cssName.value)
if (cssText.value) fd.append('css_text', cssText.value)
if (logoUrl.value) fd.append('logo_url', logoUrl.value)
if (logoFile.value) fd.append('logo_file', logoFile.value)
if (coverUrl.value) fd.append('cover_url', coverUrl.value)
if (coverFile.value) fd.append('cover_file', coverFile.value)
if (productName.value) fd.append('product_name', productName.value)
if (documentName.value) fd.append('document_name', documentName.value)
if (productVersion.value) fd.append('product_version', productVersion.value)
if (documentVersion.value) fd.append('document_version', documentVersion.value)
if (copyrightText.value) fd.append('copyright_text', copyrightText.value)
}
try {
const res = await convertMd(fd)
if (!res.ok) {
throw new Error(`HTTP ${res.status}`)
}
const ct = res.headers.get('content-type') || ''
if (ct.includes('application/json')) {
const json = await res.json()
result.value = json
downloadUrl.value = json.minio_presigned_url || json.minio_url || ''
} else {
const blob = await res.blob()
downloadUrl.value = URL.createObjectURL(blob)
}
} catch (e) {
error.value = 'Conversion failed: ' + String(e)
} finally {
loading.value = false
}
}
</script>
<template>
<div class="card">
<h1>Markdown to DOCX/PDF</h1>
<div class="row">
<label>Input mode</label>
<div class="radio-group">
<label><input type="radio" v-model="mode" value="text"> Text</label>
<label><input type="radio" v-model="mode" value="file"> File</label>
<label><input type="radio" v-model="mode" value="url"> URL</label>
</div>
</div>
<div class="row" v-if="mode === 'text'">
<label>Markdown content</label>
<textarea v-model="mdText" rows="8" placeholder="# Enter Markdown here..."></textarea>
</div>
<div class="row" v-if="mode === 'file'">
<label>File</label>
<input type="file" accept=".md,.markdown,.txt" @change="onFileChange" />
</div>
<div class="row" v-if="mode === 'url'">
<label>URL</label>
<input type="text" v-model="url" placeholder="http(s)://..." />
</div>
<div class="row">
<label>Target format</label>
<select v-model="targetFormat">
<option value="docx">DOCX</option>
<option value="pdf">PDF</option>
</select>
</div>
<div class="advanced-section" v-if="showAdvanced">
<div class="row">
<label>CSS template</label>
<select v-model="cssName">
<option value="default">Default</option>
<option value="">None</option>
</select>
</div>
<div class="row">
<label>Custom CSS</label>
<textarea v-model="cssText" rows="6" placeholder="/* Enter custom CSS here */"></textarea>
</div>
<div class="row">
<label>Table of contents</label>
<select v-model="toc">
<option :value="true">On</option>
<option :value="false">Off</option>
</select>
</div>
<div class="row">
<label>Header text</label>
<input type="text" v-model="headerText" placeholder="e.g. Internal Document" />
</div>
<div class="row">
<label>Footer text</label>
<input type="text" v-model="footerText" placeholder="e.g. Confidential" />
</div>
<div class="row">
<label>Logo</label>
<div class="input-group">
<input type="text" v-model="logoUrl" placeholder="http(s) or /absolute/path" />
<input type="file" accept="image/png,image/jpeg,image/svg+xml,image/webp" @change="onLogoFileChange" />
</div>
</div>
<div class="row">
<label>Cover image</label>
<div class="input-group">
<input type="text" v-model="coverUrl" placeholder="http(s) or /absolute/path" />
<input type="file" accept="image/png,image/jpeg,image/svg+xml,image/webp" @change="onCoverFileChange" />
</div>
</div>
<div class="row">
<label>Cover text</label>
<div class="grid-group">
<input type="text" v-model="productName" placeholder="Product Name" />
<input type="text" v-model="documentName" placeholder="Document Name" />
<input type="text" v-model="productVersion" placeholder="Product Version" />
<input type="text" v-model="documentVersion" placeholder="Document Version" />
<div class="db-note">These values should be loaded automatically from the database API once that API is available.</div>
</div>
</div>
<div class="row">
<label>Copyright</label>
<input type="text" v-model="copyrightText" placeholder="© Company Name" />
</div>
</div>
<div class="row">
<label>Options</label>
<div class="checkbox-group">
<label><input type="checkbox" v-model="saveToServer"> Save to server</label>
</div>
</div>
<div class="row">
<label>Filename (optional)</label>
<input type="text" v-model="filename" placeholder="Default document" />
</div>
<div class="actions">
<button @click="handleConvert" :disabled="loading">
{{ loading ? 'Converting...' : 'Convert' }}
</button>
<a v-if="downloadUrl" :href="downloadUrl" class="download-btn" download target="_blank">Download file</a>
</div>
<div v-if="error" class="error-msg">{{ error }}</div>
<div v-if="result" class="result-area">
<h4>Result</h4>
<pre>{{ JSON.stringify(result, null, 2) }}</pre>
</div>
</div>
</template>
<style scoped>
.card {
background: #fff;
border: 1px solid #e5e7eb;
border-radius: 12px;
padding: 20px;
box-shadow: 0 1px 2px rgba(0,0,0,0.05);
}
h1 {
font-size: 20px;
margin: 0 0 12px;
color: #111827;
}
.row {
display: flex;
gap: 12px;
align-items: center;
margin: 12px 0;
}
label {
min-width: 120px;
color: #374151;
font-weight: 500;
}
input[type="text"], select, textarea {
flex: 1;
padding: 8px 10px;
border: 1px solid #d1d5db;
border-radius: 6px;
font-family: inherit;
}
textarea {
font-family: monospace;
white-space: pre;
}
.radio-group, .checkbox-group {
display: flex;
gap: 12px;
}
.actions {
margin-top: 20px;
display: flex;
gap: 12px;
}
button, .download-btn {
background: #2563eb;
color: #fff;
border: none;
border-radius: 8px;
padding: 10px 20px;
cursor: pointer;
font-weight: 500;
text-decoration: none;
display: inline-flex;
align-items: center;
}
button:disabled {
background: #9ca3af;
cursor: not-allowed;
}
.download-btn {
background: #059669;
}
.error-msg {
color: #dc2626;
margin-top: 12px;
}
.result-area {
margin-top: 24px;
border-top: 1px solid #e5e7eb;
padding-top: 16px;
background: #f9fafb;
padding: 12px;
border-radius: 8px;
overflow: auto;
}
.input-group {
display: flex;
gap: 8px;
align-items: center;
flex: 1;
}
.grid-group {
flex: 1;
display: grid;
grid-template-columns: repeat(2, 1fr);
gap: 8px;
}
.db-note {
color: red;
font-size: 12px;
grid-column: 1/-1;
margin-top: 4px;
}
</style>
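
The advanced-parameter section of `handleConvert` above repeats the same `if (x.value) fd.append(...)` pattern for each optional field. A table-driven variant is sketched below as one possible alternative. `collectFormFields` and `FieldValue` are hypothetical names, not part of this commit. The sketch returns plain name/value pairs instead of touching `FormData`, so it can run outside a browser.

```typescript
// Hypothetical helper for illustration; not part of this commit.
type FieldValue = string | boolean | undefined | null

// Returns [name, value] pairs for every field that is set.
// Empty strings, undefined and null are skipped; booleans are always sent as "true"/"false".
function collectFormFields(fields: { [name: string]: FieldValue }): Array<[string, string]> {
  const out: Array<[string, string]> = []
  for (const name of Object.keys(fields)) {
    const value = fields[name]
    if (typeof value === 'boolean') {
      out.push([name, String(value)])
    } else if (value) {
      out.push([name, value])
    }
  }
  return out
}
```

In the component, the resulting pairs could then be appended in a single loop, for example `collectFormFields({ toc: toc.value, header_text: headerText.value }).forEach(([k, v]) => fd.append(k, v))`. File inputs such as `logo_file` would still need to be appended separately.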

5
frontend/src/main.ts Normal file

@@ -0,0 +1,5 @@
import { createApp } from 'vue'
import './style.css'
import App from './App.vue'
createApp(App).mount('#app')


@@ -0,0 +1,305 @@
export interface ConvertResponse {
code: number
msg: string
data: {
encoding?: string
content?: string
name?: string
minio_url?: string
minio_presigned_url?: string
export?: string
media_type?: string
}
}
export interface ArchiveResponse {
code: number
msg: string
data: {
count: number
files: Array<{ source: string, minio_url?: string, minio_presigned_url?: string, object_name?: string, size?: number }>
import?: { versionId: number, tree: any[] }
}
}
export interface MinioConfig {
endpoint: string
public?: string
access: string
secret: string
bucket: string
secure?: boolean
prefix?: string
store_final?: boolean
public_read?: boolean
}
const API_BASE = '/api'
const CONFIG_BASE = '/config'
const API_BASE_KEY = 'app.api.base'
const CMS_BASE_KEY = 'cms.api.base'
const CMS_TOKEN_KEY = 'cms.api.token'
function normalizeApiBase(v: string): string {
let s = String(v || '').trim()
if (!s) return ''
if (s.startsWith('//')) s = s.slice(2)
if (s.startsWith('/')) s = s.slice(1)
if (!/^https?:\/\//i.test(s)) s = `http://${s}`
return s.replace(/\/+$/, '')
}
export function setApiBase(v: string) {
try { localStorage.setItem(API_BASE_KEY, normalizeApiBase(v)) } catch {}
}
function baseUrl(): string {
try {
const ls = normalizeApiBase(localStorage.getItem(API_BASE_KEY) || '')
const env = normalizeApiBase((import.meta as any)?.env?.VITE_API_BASE_URL || '')
if (ls) {
console.debug('[API] using localStorage base:', ls)
return ls
}
if (env) {
console.debug('[API] using env base:', env)
return env
}
// No auto-fallback: use same-origin relative paths when not configured
return ''
} catch {
return ''
}
}
function joinUrl(base: string, path: string): string {
const b = (base || '').replace(/\/+$/, '')
const p = path.startsWith('/') ? path : `/${path}`
return `${b}${p}`
}
function apiFetch(path: string, init?: RequestInit) {
const b = baseUrl()
const url = b ? joinUrl(b, path) : path
console.debug('[API] fetch:', url)
return fetch(url, init)
}
function normalizeEndpoint(ep: string): string {
let s = String(ep || '').trim()
if (!s) return ''
try {
const hasScheme = /^https?:\/\//i.test(s)
if (hasScheme) {
const u = new URL(s)
s = u.host
}
const first = s.split('/')[0] || ''
s = first
} catch {
const first = s.split('/')[0] || ''
s = first
}
return s
}
export async function convertDoc(formData: FormData): Promise<ConvertResponse> {
const res = await apiFetch(`${API_BASE}/convert`, {
method: 'POST',
body: formData
})
return res.json()
}
export async function uploadArchive(formData: FormData): Promise<ArchiveResponse> {
const res = await apiFetch(`${API_BASE}/upload-archive`, {
method: 'POST',
body: formData
})
return res.json()
}
export async function setMinioConfig(config: MinioConfig): Promise<{ ok: boolean }> {
const formData = new FormData()
Object.entries(config).forEach(([key, value]) => {
if (value !== undefined) {
const v = key === 'endpoint' ? normalizeEndpoint(String(value)) : String(value)
formData.append(key, v)
}
})
const res = await apiFetch(`${CONFIG_BASE}/minio`, {
method: 'POST',
body: formData
})
return res.json()
}
export async function testMinioConfig(config: MinioConfig): Promise<{ ok: boolean, connected: boolean, bucket_exists: boolean, error?: string, created?: boolean, hint?: string }> {
const formData = new FormData()
Object.entries(config).forEach(([key, value]) => {
if (value !== undefined) {
const v = key === 'endpoint' ? normalizeEndpoint(String(value)) : String(value)
formData.append(key, v)
}
})
formData.append('create_if_missing', 'true')
const res = await apiFetch(`${CONFIG_BASE}/minio/test`, {
method: 'POST',
body: formData
})
return res.json()
}
export async function createBucket(config: MinioConfig): Promise<{ ok: boolean, bucket_exists?: boolean, error?: string, hint?: string }> {
const formData = new FormData()
formData.append('endpoint', normalizeEndpoint(String(config.endpoint)))
formData.append('access', String(config.access))
formData.append('secret', String(config.secret))
formData.append('bucket', String(config.bucket))
if (config.secure !== undefined) formData.append('secure', String(config.secure))
if (config.public_read !== undefined) formData.append('public_read', String(config.public_read))
const res = await apiFetch(`/config/minio/create-bucket`, { method: 'POST', body: formData })
return res.json()
}
export async function convertMd(formData: FormData): Promise<Response> {
return apiFetch(`/md/convert`, {
method: 'POST',
body: formData
})
}
export async function convertFolder(folderPath: string, prefix?: string): Promise<{ ok: boolean, count: number, files: any[] }> {
const form = new FormData()
form.append('folder_path', folderPath)
if (prefix) form.append('prefix', prefix)
const res = await apiFetch(`/md/convert-folder`, { method: 'POST', body: form })
return res.json()
}
export async function listProfiles(): Promise<{ ok: boolean, profiles: string[] }> {
try {
const res = await apiFetch(`/config/profiles`)
try {
return await res.json()
} catch {
return { ok: false, profiles: [] }
}
} catch {
return { ok: false, profiles: [] }
}
}
export async function stageArchive(file: File, prefix?: string): Promise<{ code: number, msg: string, data: { id: string, name: string, size: number } }> {
const fd = new FormData()
fd.append('file', file)
if (prefix) fd.append('prefix', prefix)
const res = await apiFetch(`/api/archive/stage`, { method: 'POST', body: fd })
return res.json()
}
export async function processArchive(id: string, prefix?: string, versionId?: number): Promise<ArchiveResponse> {
const fd = new FormData()
fd.append('id', id)
if (prefix) fd.append('prefix', prefix)
if (versionId !== undefined) fd.append('versionId', String(versionId))
const res = await apiFetch(`/api/archive/process`, { method: 'POST', body: fd })
return res.json()
}
export async function uploadList(file: File, prefix?: string, versionId?: number): Promise<ArchiveResponse> {
const fd = new FormData()
fd.append('list_file', file)
if (prefix) fd.append('prefix', prefix)
if (versionId !== undefined) fd.append('versionId', String(versionId))
const res = await apiFetch(`/api/upload-list`, { method: 'POST', body: fd })
return res.json()
}
function cmsBaseUrl(): string {
try {
const val = localStorage.getItem(CMS_BASE_KEY) || ''
return normalizeApiBase(val)
} catch { return '' }
}
export function setCmsConfig(base?: string, token?: string) {
try {
if (base !== undefined) localStorage.setItem(CMS_BASE_KEY, normalizeApiBase(base))
if (token !== undefined) localStorage.setItem(CMS_TOKEN_KEY, String(token))
} catch {}
}
export async function sendImportToCms(payload: any): Promise<{ ok: boolean, status?: number, error?: string }> {
const base = cmsBaseUrl()
if (!base) return { ok: false, error: 'CMS API base URL is not configured' }
const url = joinUrl(base, '/cms/api/v1/document/directory/import')
const token = (localStorage.getItem(CMS_TOKEN_KEY) || '').trim()
const headers: Record<string, string> = { 'Content-Type': 'application/json' }
if (token) headers['Authorization'] = `Bearer ${token}`
const res = await fetch(url, { method: 'POST', headers, body: JSON.stringify(payload) })
if (!res.ok) return { ok: false, status: res.status, error: `HTTP ${res.status}` }
return { ok: true }
}
export async function saveProfile(name: string): Promise<{ ok: boolean, name?: string }> {
const form = new FormData()
form.append('name', name)
const res = await apiFetch(`/config/save_profile`, { method: 'POST', body: form })
return res.json()
}
export async function loadProfile(name: string): Promise<{ ok: boolean, config?: any }> {
const res = await apiFetch(`/config/load_profile?name=${encodeURIComponent(name)}`)
return res.json()
}
export async function getConfigSnapshot(): Promise<{ minio: MinioConfig, db: Record<string, any> }> {
const res = await apiFetch(`/config`)
return res.json()
}
export async function checkServerTime(config?: Partial<MinioConfig>): Promise<{ ok: boolean, diff_sec?: number, server_time?: string, local_time?: string, hint?: string, error?: string }>{
try {
const ep = config?.endpoint ? normalizeEndpoint(String(config?.endpoint)) : ''
const pub = String(config?.public || '').trim()
const sec = config?.secure !== undefined ? String(!!config?.secure) : ''
const qs: string[] = []
if (ep) qs.push(`endpoint=${encodeURIComponent(ep)}`)
if (pub) qs.push(`public=${encodeURIComponent(pub)}`)
if (sec) qs.push(`secure=${encodeURIComponent(sec)}`)
const q = qs.length ? `?${qs.join('&')}` : ''
let res = await apiFetch(`/system/time/check${q}`)
if (res.ok) {
try { return await res.json() } catch {}
}
res = await apiFetch(`/api/system/time/check${q}`)
if (res.ok) {
try { return await res.json() } catch {}
}
return { ok: false, error: `HTTP ${res.status}` }
} catch (e: any) {
return { ok: false, error: 'NETWORK' }
}
}
export async function syncServerTime(method?: string, ntpServer?: string): Promise<{ ok: boolean, result?: any, check?: any }>{
const fd = new FormData()
if (method) fd.append('method', method)
if (ntpServer) fd.append('ntp_server', ntpServer)
try {
let res = await apiFetch(`/system/time/sync`, { method: 'POST', body: fd })
if (res.ok) {
try { return await res.json() } catch {}
}
res = await apiFetch(`/api/system/time/sync`, { method: 'POST', body: fd })
if (res.ok) {
try { return await res.json() } catch {}
}
return { ok: false }
} catch {
return { ok: false }
}
}
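
The request URL in the file above is resolved in three steps: a base stored in `localStorage` wins, then the build-time `VITE_API_BASE_URL`, and with neither set the path is used as a same-origin relative URL. The sketch below restates that precedence as a pure function so the rules can be checked in isolation. `normalizeBase` mirrors `normalizeApiBase`, and `resolveRequestUrl` is a hypothetical name that does not exist in this commit. The stored and build-time values are passed in as plain strings rather than read from the environment.

```typescript
// Sketch for illustration; resolveRequestUrl is hypothetical and not part of this commit.

// Mirrors normalizeApiBase: trims, drops leading slashes, adds a default scheme,
// and strips trailing slashes.
function normalizeBase(v: string): string {
  let s = String(v || '').trim()
  if (!s) return ''
  if (s.indexOf('//') === 0) s = s.slice(2)
  if (s.indexOf('/') === 0) s = s.slice(1)
  if (!/^https?:\/\//i.test(s)) s = `http://${s}`
  return s.replace(/\/+$/, '')
}

// Precedence: stored base, then build-time base, then the bare path (same origin).
function resolveRequestUrl(path: string, storedBase: string, envBase: string): string {
  const base = normalizeBase(storedBase) || normalizeBase(envBase)
  if (!base) return path
  const p = path.charAt(0) === '/' ? path : `/${path}`
  return `${base}${p}`
}
```

For example, `resolveRequestUrl('/api/convert', '127.0.0.1:8000/', '')` yields `http://127.0.0.1:8000/api/convert`, while the same call with two empty bases returns `/api/convert` unchanged.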

72
frontend/src/style.css Normal file

@@ -0,0 +1,72 @@
:root {
font-family: -apple-system, system-ui, Segoe UI, Roboto, Helvetica, Arial, sans-serif;
line-height: 1.5;
font-weight: 400;
color-scheme: light;
color: #111827;
background-color: #ffffff;
font-synthesis: none;
text-rendering: optimizeLegibility;
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
}
a {
font-weight: 500;
color: #646cff;
text-decoration: inherit;
}
a:hover {
color: #535bf2;
}
body {
margin: 0;
min-width: 320px;
min-height: 100vh;
}
h1 {
font-size: 3.2em;
line-height: 1.1;
}
button {
border-radius: 8px;
border: 1px solid transparent;
padding: 0.6em 1.2em;
font-size: 1em;
font-weight: 500;
font-family: inherit;
background-color: #2563eb;
color: #fff;
cursor: pointer;
transition: border-color 0.25s;
}
button:hover {
border-color: #646cff;
}
button:focus,
button:focus-visible {
outline: 4px auto -webkit-focus-ring-color;
}
.card {
padding: 2em;
}
#app {
max-width: 1280px;
margin: 0 auto;
padding: 2rem;
text-align: center;
}
input, select, textarea {
background: #ffffff;
color: #111827;
}
input::placeholder, textarea::placeholder {
color: #9ca3af;
}

View File

@@ -0,0 +1,14 @@
import fs from 'node:fs'
import path from 'node:path'
const p = path.resolve(process.cwd(), 'frontend/src/components/DocToMd.vue')
const s = fs.readFileSync(p, 'utf-8')
if (!s.includes('const saveToServer = ref(true)')) {
console.error('saveToServer does not default to true')
process.exit(1)
}
if (!s.includes("mt.startsWith('text/markdown')")) {
console.error('renderedContent does not detect Markdown by media_type')
process.exit(1)
}
console.log('Frontend source checks passed')

View File

@@ -0,0 +1,16 @@
{
"extends": "@vue/tsconfig/tsconfig.dom.json",
"compilerOptions": {
"tsBuildInfoFile": "./node_modules/.tmp/tsconfig.app.tsbuildinfo",
"types": ["vite/client"],
/* Linting */
"strict": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"erasableSyntaxOnly": true,
"noFallthroughCasesInSwitch": true,
"noUncheckedSideEffectImports": true
},
"include": ["src/**/*.ts", "src/**/*.tsx", "src/**/*.vue"]
}

7
frontend/tsconfig.json Normal file
View File

@@ -0,0 +1,7 @@
{
"files": [],
"references": [
{ "path": "./tsconfig.app.json" },
{ "path": "./tsconfig.node.json" }
]
}

View File

@@ -0,0 +1,26 @@
{
"compilerOptions": {
"tsBuildInfoFile": "./node_modules/.tmp/tsconfig.node.tsbuildinfo",
"target": "ES2023",
"lib": ["ES2023"],
"module": "ESNext",
"types": ["node"],
"skipLibCheck": true,
/* Bundler mode */
"moduleResolution": "bundler",
"allowImportingTsExtensions": true,
"verbatimModuleSyntax": true,
"moduleDetection": "force",
"noEmit": true,
/* Linting */
"strict": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"erasableSyntaxOnly": true,
"noFallthroughCasesInSwitch": true,
"noUncheckedSideEffectImports": true
},
"include": ["vite.config.ts"]
}

47
frontend/vite.config.ts Normal file
View File

@@ -0,0 +1,47 @@
import { defineConfig } from 'vite'
import vue from '@vitejs/plugin-vue'
// https://vite.dev/config/
export default defineConfig({
plugins: [vue()],
server: {
proxy: {
'/api': {
target: 'http://localhost:8000',
changeOrigin: true,
configure: (proxy) => {
const p = proxy as any
p.timeout = 120000
p.proxyTimeout = 120000
},
},
'/config': {
target: 'http://localhost:8000',
changeOrigin: true,
configure: (proxy) => {
const p = proxy as any
p.timeout = 120000
p.proxyTimeout = 120000
},
},
'/md': {
target: 'http://localhost:8000',
changeOrigin: true,
configure: (proxy) => {
const p = proxy as any
p.timeout = 120000
p.proxyTimeout = 120000
},
},
'/refresh.js': {
target: 'http://localhost:8000',
changeOrigin: true,
configure: (proxy) => {
const p = proxy as any
p.timeout = 120000
p.proxyTimeout = 120000
},
}
}
}
})

129
import.json Normal file
View File

@@ -0,0 +1,129 @@
{
"versionId": 1001,
"tree": [
{
"name": "数+产品手册-MD源文件",
"type": "FOLDER",
"children": [
{
"name": "DMDRS诊断工具使用手册",
"type": "FILE",
"sortOrder": 100,
"files": [
{
"languageId": 1,
"objectName": "assets/rewritten/数+产品手册-MD源文件/DMDRS诊断工具使用手册.md",
"fileName": "DMDRS诊断工具使用手册.md",
"fileSize": 16402
}
]
},
{
"name": "DMDRS控制台命令手册",
"type": "FILE",
"sortOrder": 101,
"files": [
{
"languageId": 1,
"objectName": "assets/rewritten/数+产品手册-MD源文件/DMDRS控制台命令手册.md",
"fileName": "DMDRS控制台命令手册.md",
"fileSize": 314014
}
]
},
{
"name": "DMDRS搭建手册-Oracle",
"type": "FILE",
"sortOrder": 102,
"files": [
{
"languageId": 1,
"objectName": "assets/rewritten/数+产品手册-MD源文件/DMDRS搭建手册-Oracle.md",
"fileName": "DMDRS搭建手册-Oracle.md",
"fileSize": 159147
}
]
},
{
"name": "DMDRS DRS API使用手册",
"type": "FILE",
"sortOrder": 103,
"files": [
{
"languageId": 1,
"objectName": "assets/rewritten/数+产品手册-MD源文件/DMDRS DRS API使用手册.md",
"fileName": "DMDRS DRS API使用手册.md",
"fileSize": 51475
}
]
},
{
"name": "DMDRS参考手册",
"type": "FILE",
"sortOrder": 104,
"files": [
{
"languageId": 1,
"objectName": "assets/rewritten/数+产品手册-MD源文件/DMDRS参考手册.md",
"fileName": "DMDRS参考手册.md",
"fileSize": 265225
}
]
},
{
"name": "定时调度工具使用手册",
"type": "FILE",
"sortOrder": 105,
"files": [
{
"languageId": 1,
"objectName": "assets/rewritten/数+产品手册-MD源文件/定时调度工具使用手册.md",
"fileName": "定时调度工具使用手册.md",
"fileSize": 104637
}
]
},
{
"name": "DMDRS搭建手册-DM8",
"type": "FILE",
"sortOrder": 106,
"files": [
{
"languageId": 1,
"objectName": "assets/rewritten/数+产品手册-MD源文件/DMDRS搭建手册-DM8.md",
"fileName": "DMDRS搭建手册-DM8.md",
"fileSize": 217027
}
]
},
{
"name": "DMDRS产品介绍",
"type": "FILE",
"sortOrder": 107,
"files": [
{
"languageId": 1,
"objectName": "assets/rewritten/数+产品手册-MD源文件/DMDRS产品介绍.md",
"fileName": "DMDRS产品介绍.md",
"fileSize": 94882
}
]
},
{
"name": "DMDRS DRS语言使用手册",
"type": "FILE",
"sortOrder": 108,
"files": [
{
"languageId": 1,
"objectName": "assets/rewritten/数+产品手册-MD源文件/DMDRS DRS语言使用手册.md",
"fileName": "DMDRS DRS语言使用手册.md",
"fileSize": 177757
}
]
}
],
"sortOrder": 100
}
]
}

51
k8s/deployment.yaml Normal file
View File

@@ -0,0 +1,51 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: funmd-convert
  namespace: default
  labels:
    app: funmd-convert
spec:
  replicas: 1
  selector:
    matchLabels:
      app: funmd-convert
  template:
    metadata:
      labels:
        app: funmd-convert
    spec:
      containers:
        - name: funmd-convert
          image: funmd-convert:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
          env:
            - name: MINIO_ENDPOINT
              value: "minio-service:9000"
            - name: MINIO_ACCESS_KEY
              value: "minioadmin"
            - name: MINIO_SECRET_KEY
              value: "minioadmin"
            - name: MINIO_BUCKET
              value: "funmd"
          resources:
            limits:
              cpu: "1000m"
              memory: "1Gi"
            requests:
              cpu: "200m"
              memory: "256Mi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5

60
package_offline.sh Normal file
View File

@@ -0,0 +1,60 @@
#!/bin/bash
set -euo pipefail
# Image naming
# Note: Docker image names must be lowercase (FunMD_Convert would be rejected), so funmd-convert is used here
IMAGE_NAME="funmd-convert"
IMAGE_TAG="latest"
OUTPUT_DIR="$(pwd)"
# The exported tar file is named after the project, FunMD_Convert, so it is easy to identify
OUTPUT_TAR="${OUTPUT_DIR}/FunMD_Convert.tar"
echo "[1/4] Building Docker image ${IMAGE_NAME}:${IMAGE_TAG}"
docker build -t ${IMAGE_NAME}:${IMAGE_TAG} .
echo "[2/4] Saving image to ${OUTPUT_TAR}"
docker save ${IMAGE_NAME}:${IMAGE_TAG} -o "${OUTPUT_TAR}"
echo "[3/4] Image size and sha256"
ls -lh "${OUTPUT_TAR}"
SHA256=$(shasum -a 256 "${OUTPUT_TAR}" | awk '{print $1}')
echo "sha256=${SHA256}"
cat <<'EON'
[4/4] Transfer and run on offline server:
1) Copy the image archive to the server, e.g. /opt/FunMD_Convert/FunMD_Convert.tar
scp FunMD_Convert.tar user@server:/opt/FunMD_Convert/
2) Load the image:
docker load -i /opt/FunMD_Convert/FunMD_Convert.tar
3) Verify the image:
docker images | grep funmd-convert
4) Start the container (backend on port 8000, which also serves the frontend at /ui):
docker run -d \
-p 8000:8000 \
--name FunMD_Convert \
--restart unless-stopped \
funmd-convert:latest
5) Access:
Backend health check: http://<server-ip>:8000/health
Frontend page: http://<server-ip>:8000/ui/
6) (Optional) Configure MinIO:
curl -X POST \
-F endpoint=10.9.35.31:9000 \
-F public=http://10.9.35.31:9000 \
-F access=<your-access-key> \
-F secret=<your-secret-key> \
-F bucket=file-cms \
-F secure=false \
-F public_read=true \
http://<server-ip>:8000/config/minio
EON
echo "[Done] Offline package ready: ${OUTPUT_TAR}"

63
prd.md Normal file
View File

@@ -0,0 +1,63 @@
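There are currently two project folders here, docling and word2markdown; both handle document format conversion.
Conversion here mainly covers two directions:
I. Converting other formats to Markdown
1. General-purpose converter: docling/docling contains the conversion functions, mainly for converting formats such as docx and pdf to Markdown;
2. Custom converter: word2markdown contains the conversion functions, mainly handling the following special cases:
- Any document not encoded in UTF-8 must first be converted to UTF-8 before further processing;
- A single-row, single-column table in a doc/docx file must be converted to a highlighted code block in Markdown;
- All table tag names in HTML must be lowercase;
- Remove redundant line breaks after tags in HTML;
- Convert every ::: admonition block in Markdown to a !!! admonition block;
- Rendering of all Markdown files can be set to half-width or full-width characters; the default is full-width;
- Images and other static assets in converted files are stored in MinIO, and the corresponding image or asset URLs are returned;
- All relative URL paths in converted files must be rewritten to MinIO URLs;
3. File upload covers several cases:
- Single-file upload;
- Multi-file upload, via resource paths or URLs;
- Upload of non-encrypted archive files;
4. Batch upload:
- Multiple files can be uploaded in one batch by uploading a text file that lists file paths or URLs;
- Archives (zip, tar.gz, etc.) can be uploaded, in steps: first upload the archive, and the frontend shows that the upload succeeded; clicking the Start Conversion button then extracts it, rewrites every relative path to images and other static assets in the Markdown files to MinIO addresses, stores the converted md files in MinIO following the original file structure, and returns their URLs in the format of the sample JSON file; the frontend shows the corresponding processing information to ease debugging;
- The sample JSON file is at /Users/fanyang/Desktop/FunMD_Convert/批量导入目录树.json;
- After processing, delete both the archive and the files extracted locally on the server, but make sure the converted md files and their assets are stored in MinIO and that their URLs are returned in the correct JSON;
- Make sure images and other relative-path assets are uploaded to MinIO and their URLs are returned correctly;
- When a Markdown file and its image assets are in the same directory, the converted path is lifted one level: the Markdown file is placed in the parent directory, the original folder is no longer needed, and import.json is returned according to this rule;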
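The path-lifting rule and the relative-link rewriting described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: `lifted_path`, `rewrite_relative_links`, and `IMAGE_EXTS` are hypothetical names, "has image assets in the same directory" is approximated by looking at sibling file extensions, and `base_url` stands in for whatever public MinIO URL is configured.

```python
import posixpath
import re

# Hypothetical set of extensions treated as image assets.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp"}

# Matches inline links and images of the form [text](target) or ![alt](target).
# Links that carry a title, e.g. (a.png "title"), are not matched and stay unchanged.
LINK_RE = re.compile(r"(!?\[[^\]]*\]\()([^)\s]+)(\))")


def lifted_path(md_path, siblings):
    """Return the object path for a Markdown file, lifted one level up
    when image assets live in the same directory as the file."""
    parent, filename = posixpath.split(md_path)
    has_images = any(
        posixpath.splitext(name)[1].lower() in IMAGE_EXTS for name in siblings
    )
    if parent and has_images:
        return posixpath.join(posixpath.dirname(parent), filename)
    return md_path


def rewrite_relative_links(markdown, md_dir, base_url):
    """Rewrite relative link targets to absolute URLs under base_url."""

    def repl(match):
        target = match.group(2)
        # Leave absolute URLs (scheme:), in-page anchors, and root paths alone.
        if re.match(r"^[a-zA-Z][a-zA-Z0-9+.-]*:", target) or target.startswith(("#", "/")):
            return match.group(0)
        object_name = posixpath.normpath(posixpath.join(md_dir, target))
        return f"{match.group(1)}{base_url.rstrip('/')}/{object_name}{match.group(3)}"

    return LINK_RE.sub(repl, markdown)
```

The sketch does not percent-encode non-ASCII object names and does not upload anything; a real pipeline would upload each asset first and only rewrite links whose upload succeeded.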
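II. API specification
All of the above capabilities are implemented with Python's FastAPI. The API specification is as follows:
1. Every endpoint must be called with the POST method;
2. Every endpoint must return a JSON response body;
3. Every response body must include a code field indicating whether the call succeeded;
4. Every response body must include a msg field carrying the result message of the call;
5. Every response body must include a data field carrying the result data of the call;
III. API implementation
1. Every endpoint must be implemented in FastAPI;
2. Every endpoint implementation must include a try...except... statement that catches exceptions and returns the corresponding error message;
3. Every endpoint implementation must include a return statement that returns the corresponding response body;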
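The code/msg/data envelope and the try...except requirement can be illustrated with a small framework-free sketch. The names `envelope`, `api_handler`, and `convert_stub` are hypothetical, and the values 0 for success and 500 for failure are assumptions for illustration; this document does not fix the exact codes at this point. A real FastAPI route would likely need an `async` variant of the wrapper.

```python
from functools import wraps
from typing import Any, Callable, Dict


def envelope(code: int, msg: str, data: Any = None) -> Dict[str, Any]:
    """Build the uniform response body with code, msg, and data fields."""
    return {"code": code, "msg": msg, "data": data}


def api_handler(func: Callable[..., Any]) -> Callable[..., Dict[str, Any]]:
    """Wrap a handler so it always returns an envelope, even on failure."""

    @wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
        try:
            return envelope(0, "ok", func(*args, **kwargs))
        except Exception as exc:  # deliberately broad: the contract is "never raise"
            return envelope(500, str(exc), None)

    return wrapper


@api_handler
def convert_stub(text: str) -> Dict[str, str]:
    """Stand-in for a real conversion; fails on empty input."""
    if not text:
        raise ValueError("empty input")
    return {"content": text.upper()}
```

Keeping the envelope logic in one decorator means each handler only returns its data or raises, which makes it harder for an individual endpoint to drift from the contract.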
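IV. API refactoring
1. Wrap the conversion functions in docling and word2markdown in a class with a convert method that performs the format conversion;
2. Wrap all the endpoints in a class with a convert method that performs the API call;
3. Wrap all the classes in a module that exposes an app object used to start the FastAPI application;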
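One possible shape for the converter class in item 1 is a thin facade that dispatches to named engines. The sketch below is an assumption-laden illustration: `ConverterFacade` is a hypothetical name, and the engines are plain callables standing in for the docling and word2markdown conversion functions, whose real signatures are not shown here.

```python
from typing import Callable, Dict, Optional

# An engine takes raw document bytes and returns converted text.
Engine = Callable[[bytes], str]


class ConverterFacade:
    """Expose a single convert() method over several conversion engines."""

    def __init__(self, engines: Dict[str, Engine], default: str) -> None:
        if default not in engines:
            raise ValueError(f"default engine not registered: {default}")
        self._engines = dict(engines)
        self._default = default

    def convert(self, payload: bytes, engine: Optional[str] = None) -> str:
        """Convert payload with the named engine, or the default one."""
        name = engine or self._default
        backend = self._engines.get(name)
        if backend is None:
            raise ValueError(f"unknown engine: {name}")
        return backend(payload)
```

A caller would register the real conversion functions once at startup and then use `convert(payload, engine="word2markdown")` only where the custom rules are needed.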
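V. API testing
1. Write test cases with Python's unittest module to check that every endpoint conforms to the specification;
2. Use Python's requests module to call every endpoint and check that it works correctly;
VI. Provide a frontend page, implemented and improved with docling/app/static/index.html as a reference
1. Rebuild this page with Vue 3;
2. The capabilities above are configurable: different configuration files for different users provide different capabilities;
VII. Deployment
1. Package all the code into a single Docker image;
2. The Docker image must be deployable from a local machine to the server over SSH, because the server cannot pull images;
3. Configure an ingress to expose the application;
VIII. Miscellaneous
1. All code must conform to PEP 8;
2. Every endpoint must include an @api tag in its comments to mark it as an endpoint;
3. Every class must include a @class tag in its comments to mark it as a class;
4. Every function must include a @function tag in its comments to mark it as a function;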

460
修改总结.md Normal file
View File

@@ -0,0 +1,460 @@
toolName: search_codebase
status: success
query: Locate existing Python code for docling and word2markdown converters, FastAPI app, and any existing front-end reference at docling/app/static/index.html. Identify project structure and existing functions for converting docx/pdf to markdown and custom conversions.
toolName: view_folder
status: success
/Users/fanyang/Desktop/FunMD_Convert
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/server.py
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/services/docling_adapter.py
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/services/docling_adapter.py
toolName: search_by_regex
status: success
query: def .*\(
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/word2markdown/convert_to_md.py
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/word2markdown/convert_to_md.py
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/static/index.html
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/static/index.html
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/static/index.html
toolName: search_by_regex
status: success
query: convert-folder
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/server.py
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/server.py
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/server.py
toolName: search_by_regex
status: success
query: Dockerfile
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/tests/test_md_to_docx.py
toolName: view_folder
status: success
/Users/fanyang/Desktop/FunMD_Convert/word2markdown
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/word2markdown/app.py
toolName: todo_write
status: success
Todos updated: 5 items
toolName: ApplyPatch
status: running
toolName: write_to_file
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/services/unified_converter.py
content: undefined
toolName: ApplyPatch
status: running
toolName: edit_file_update
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/server.py
changes:
toolName: ApplyPatch
status: running
toolName: edit_file_update
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/server.py
changes:
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/server.py
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/server.py
toolName: ApplyPatch
status: running
toolName: edit_file_update
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/server.py
changes:
toolName: ApplyPatch
status: running
toolName: write_to_file
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/tests/test_api_convert.py
content: undefined
toolName: run_command
status: success
command: python -m pytest -q undefined
toolName: ApplyPatch
status: running
toolName: edit_file_update
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/tests/test_api_convert.py
changes:
toolName: run_command
status: success
command: python -m unittest -q undefined
toolName: run_command
status: success
command: python -m unittest discover -s tests -p "test_*.py" -v undefined
toolName: todo_write
status: success
Todos updated: 5 items
toolName: view_files
status: success
filePath: /Users/fanyang/Desktop/FunMD_Convert/docling/app/services/unified_converter.py
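**Refactoring overview**
- Unified the two conversion capabilities (Docling general-purpose conversion + Word2Markdown custom conversion), added a single entry-point class, and exposed standardized endpoints in the backend.
- Unified the API output structure per the spec (POST, JSON, with code/msg/data) and wrapped error handling in try/except.
- Strengthened Markdown/HTML post-processing: `:::` admonition blocks are converted to `!!!`, HTML table tags are forced to lowercase, and meaningless line breaks after tags are removed.
- Extended batch upload: ZIP/TAR archives can be uploaded; the backend extracts them, uploads assets to MinIO, rewrites Markdown asset links, and returns usable URLs.
- Added sample unit tests covering unified conversion and the API response structure; they can be run once dependencies are installed.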
**Core changes**
- Unified converter class
  - Added `FormatConverter`, which selects the Docling or Word2Markdown engine in one call and returns the target format uniformly.
  - Location: `docling/app/services/unified_converter.py:57`; core method `convert` (`docling/app/services/unified_converter.py:61`).
  - Key behavior:
    - Engine selection: passing `engine=word2markdown|pandoc|custom` routes to `word2markdown.convert_any`; the default is `Docling.DocumentConverter`.
    - Markdown post-processing: `:::` admonition blocks are converted to `!!!` (`docling/app/services/unified_converter.py:33`); HTML table tags are lowercased (`docling/app/services/unified_converter.py:22`).
- Standardized API
  - Added the unified conversion endpoint `POST /api/convert`, which takes either file or source_url; optional `export=markdown|html|json|doctags` and `engine`; `save=true` uploads the result to MinIO.
  - Location: `docling/app/server.py:1413`; the unified response structure is at `docling/app/server.py:1380`, `docling/app/server.py:1396`.
  - Response fields:
    - Without saving: `data = {encoding, content, name, media_type}`
    - Saved to MinIO: `data = {encoding, name, media_type, minio_url, minio_presigned_url}`
  - All responses follow `{"code": 0|500, "msg": "...", "data": ...}`, implemented at `docling/app/server.py:1363`, `docling/app/server.py:1368`.
- Archive upload
  - Added `POST /api/upload-archive`: upload a ZIP/TAR archive; the backend extracts it → bulk-uploads assets to MinIO → rewrites relative asset links inside `*.md` to MinIO URLs → returns the list.
  - Location: `docling/app/server.py:1413` (the function starts at `docling/app/server.py:1414`).
  - Response data: `data={count, files:[{source, minio_url, minio_presigned_url, mappings}]}`
  - Reuses the existing asset upload and link rewriting helpers: `_bulk_upload_assets` (`docling/app/server.py:1079`) and `_rewrite_md_assets_to_minio` (`docling/app/server.py:885`).
- MinIO configuration and management
  - The runtime configuration object `RUNTIME_CONFIG` is unchanged; the endpoints for dynamically loading/saving configuration and for managing buckets are all available. Relevant locations:
    - Load/save configuration: `docling/app/server.py:349`, `docling/app/server.py:369`, `docling/app/server.py:386`
    - List/create buckets: `docling/app/server.py:287`, `docling/app/server.py:304`
- Markdown → DOCX/PDF
  - Kept the existing `POST /md/convert` endpoint (headers and footers, logo, cover page, CSS templates, table of contents, MinIO storage, etc.); see `docling/app/server.py:473`.
  - The actual conversion uses `md_to_docx_bytes` and `md_to_pdf_bytes_with_renderer` (`docling/app/services/docling_adapter.py:105`, `docling/app/services/docling_adapter.py:676`) and already supports rendering header/footer and cover metadata (the `normalize_html` pipeline is at `docling/app/services/docling_adapter.py:433`, `docling/app/services/docling_adapter.py:662`).
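The two post-processing steps named above (`:::` blocks to `!!!`, and lowercasing HTML table tags) could look roughly like the following. This is a sketch of the technique, not the code in `unified_converter.py`: the function names are hypothetical, the `!!!` output assumes the common convention of a 4-space-indented body, and `:::` lines inside fenced code blocks are not special-cased.

```python
import re

# Opening line of a ":::" block: ":::" + kind + optional title.
_OPEN = re.compile(r"^:::\s*([A-Za-z]+)(?:\s+(.+))?$")

# Table-related tags, in either case, opening or closing.
_TABLE_TAG = re.compile(r"<(/?)(table|thead|tbody|tfoot|tr|th|td)\b", re.IGNORECASE)


def convert_admonitions(markdown: str) -> str:
    """Rewrite ':::' blocks as '!!!' blocks with an indented body."""
    out = []
    inside = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if not inside:
            match = _OPEN.match(stripped)
            if match:
                kind, title = match.group(1), match.group(2)
                out.append(f'!!! {kind} "{title}"' if title else f"!!! {kind}")
                inside = True
            else:
                out.append(line)
        elif stripped == ":::":
            inside = False  # closing marker is dropped
        else:
            out.append(f"    {line}" if line else "")
    return "\n".join(out)


def lowercase_table_tags(html: str) -> str:
    """Lowercase table tag names while leaving attributes untouched."""
    return _TABLE_TAG.sub(lambda m: f"<{m.group(1)}{m.group(2).lower()}", html)
```

Only the tag name is lowercased, so attribute values such as `class="A"` keep their case.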
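**Key points that satisfy the PRD**
- Unified POST + JSON + code/msg/data: added `api_convert` and `api_upload-archive`; every response body includes `code/msg/data` (`docling/app/server.py:1363`, `docling/app/server.py:1368`, `docling/app/server.py:1413`).
- Exception handling: the unified endpoints use `try...except...` and convert errors to the standard structure (`docling/app/server.py:1408`, `docling/app/server.py:1476`); a global exception middleware already exists (`docling/app/server.py:1231`, `docling/app/server.py:1245`, `docling/app/server.py:1255`).
- Conversion features:
  - Detection and normalization of non-UTF-8 document encodings is built into `word2markdown.convert_to_md` (for example `word2markdown/convert_to_md.py:309`, `word2markdown/convert_to_md.py:326`).
  - Recognition of single-row, single-column DOC/DOCX tables as code blocks is done at `word2markdown/convert_to_md.py:196`, `word2markdown/convert_to_md.py:200`.
  - Lowercasing of HTML `<table>` tags is done by the HTML post-processing in the unified converter class (`docling/app/services/unified_converter.py:22`).
  - Removal of redundant line breaks after HTML tags is done in the same lowercasing pass (`docling/app/services/unified_converter.py:29`).
  - Conversion of Markdown `:::` admonition blocks to `!!!` is done by the unified converter class (`docling/app/services/unified_converter.py:33`, `docling/app/services/unified_converter.py:55`).
  - Uploading images/static assets to MinIO and rewriting URLs: single files are handled inside `md_convert`, batches in `md_convert_folder` and `md_upload_folder` (`docling/app/server.py:1099`, `docling/app/server.py:1151`); the new `api_upload-archive` does the same (`docling/app/server.py:1413`).
- File upload scenarios:
  - Single-file upload: `POST /api/convert` with `file` (`docling/app/server.py:1369`).
  - URL upload: `POST /api/convert` with `source_url` (`docling/app/server.py:1369`).
  - Archive: `POST /api/upload-archive` (`docling/app/server.py:1413`).
- Frontend page
  - The existing test page still works (`docling/app/static/index.html`). The Vue3 rewrite is planned for a later stage.
- Deployment
  - The existing FastAPI service is integrated with the static page; `docling/app` can later be packaged as an image, and the MinIO configuration already supports dynamic adjustment at runtime.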
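The encoding normalization mentioned under the conversion features can be approximated with a decode-and-fall-back loop. The sketch below is not the logic in `word2markdown/convert_to_md.py`, which is not shown here and may use a proper detector; `to_utf8` and its fallback list are assumptions chosen for illustration.

```python
from typing import Sequence


def to_utf8(raw: bytes, fallbacks: Sequence[str] = ("gb18030",)) -> bytes:
    """Return raw re-encoded as UTF-8, trying UTF-8 first, then fallbacks.

    'utf-8-sig' accepts plain UTF-8 and also strips a leading byte-order mark.
    """
    for encoding in ("utf-8-sig", *fallbacks):
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort never raises.
    return raw.decode("latin-1").encode("utf-8")
```

A fallback codec can decode bytes from an unrelated legacy encoding without raising and produce the wrong characters, so a statistical detector is the safer choice when the input sources are mixed.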
**API usage examples**
- Convert a single file to Markdown (without saving)
  - `curl -s -X POST http://localhost:8000/api/convert -F file=@/path/sample.docx -F export=markdown -F save=false`
- Convert a URL to Markdown (saved to MinIO)
  - `curl -s -X POST http://localhost:8000/api/convert -F source_url="https://example.com/sample.pdf" -F export=markdown -F save=true`
- Batch-upload an archive and rewrite links
  - `curl -s -X POST http://localhost:8000/api/upload-archive -F file=@/path/docs.zip -F prefix=assets`
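On the client side, the body returned by these calls can be unpacked with a small helper. This is a minimal sketch assuming the `{"code", "msg", "data"}` envelope with `code == 0` meaning success; `ApiError` and `unwrap` are hypothetical names, and the helper works on the response text so it stays independent of any HTTP library.

```python
import json
from typing import Any


class ApiError(RuntimeError):
    """Raised when the envelope reports a non-zero code."""


def unwrap(body: str) -> Any:
    """Parse an envelope and return its data, or raise ApiError with msg."""
    payload = json.loads(body)
    if payload.get("code") != 0:
        raise ApiError(payload.get("msg") or "unknown error")
    return payload.get("data")
```

For example, the output of one of the curl commands above could be passed to `unwrap` to obtain the `data` object or a single exception carrying `msg`.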
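**Explanatory notes**
- Engine routing and a unified contract
  - The unified converter class hides the differences between backends: it defaults to Docling and switches to Word2Markdown when Pandoc or custom rules are needed, giving stable "contract-first" output.
- Markdown/HTML post-processing
  - Converting `:::` to `!!!` keeps admonition blocks consistent across the unified renderers (including the PDF/DOCX pipeline); lowercasing HTML table tags avoids case-sensitivity differences in some renderers.
- Asset rewriting and MinIO
  - Multi-level fallback: image assets are bulk-uploaded first, then relative links are resolved by path and search root; any failed mapping is kept in `mappings` so it can be fixed later.
- Defensive exception handling
  - Responses always carry `code/msg/data`, so even an exception does not break the frontend call flow; for detailed troubleshooting, enable the `debug` query parameter or request header (`docling/app/server.py:1216`).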
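**Tests added**
- Location: `docling/app/tests/test_api_convert.py:1`
- Test case: upload a Markdown file containing `:::` and verify that the response has `code=0` and that `data.content` exists and contains `!!! note`.
- How to run (dependencies must be installed):
  - `python -m unittest discover -s tests -p "test_*.py" -v`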
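**Suggested next steps**
- Vue3 frontend rewrite: rebuild the interactions in `docling/app/static/index.html` with Vue3, and abstract the configuration forms and the state management for batch jobs.
- Containerization and deployment: tidy up `requirements`, add a Dockerfile and build scripts, and expose the service through an Ingress.
- Configuration management: finish multi-environment switching and access control for configuration files, and support Profile import/export (backend support already exists: `docling/app/server.py:349`, `docling/app/server.py:369`, `docling/app/server.py:386`).
- Test matrix: add more file types and failure scenarios under `tests` (corrupted DOCX, RTF, fake DOCX, MDX with complex JSX, etc.), plus integration tests for MinIO interaction.
If you want to start integration testing now, you can call the new endpoints with the curl commands above; if you would like me to go on to update the frontend test page to call the new endpoints, or to move forward with the Vue3 page, I can start right away.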

202
批量导入目录树.json Normal file
View File

@@ -0,0 +1,202 @@
{
"openapi": "3.0.1",
"info": {
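"title": "Default module",
"description": "API documentation for the authentication and authorization microservice, covering user management, role management, login authentication, and related features",
"version": "1.0.0",
"contact": {
"name": "Development team",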
"email": "dev@example.com"
},
"license": {
"name": "Apache 2.0",
"url": "https://www.apache.org/licenses/LICENSE-2.0.html"
}
},
"tags": [
{
"name": "文档目录管理"
}
],
"paths": {
"/cms/api/v1/document/directory/import": {
"post": {
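"summary": "Batch-import a directory tree",
"deprecated": false,
"description": "Batch-creates directories and files from the supplied directory-tree JSON; by default it overwrites the draft content under that version",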
"operationId": "importDirectoryTree",
"tags": [
"文档目录管理"
],
"parameters": [],
"requestBody": {
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/DirectoryImportRequest"
}
}
},
"required": true
},
"responses": {
"200": {
"description": "OK",
"content": {
"*/*": {
"schema": {
"$ref": "#/components/schemas/ResultVoid"
}
}
},
"headers": {}
}
},
"security": [
{
"Bearer Authentication": []
}
]
}
}
},
"components": {
"schemas": {
"ResultVoid": {
"type": "object",
"properties": {
"code": {
"type": "integer",
"format": "int32"
},
"message": {
"type": "string"
},
"data": {
"type": "object",
"properties": {}
}
}
},
"DirectoryImportFile": {
"required": [
"languageId",
"objectName"
],
"type": "object",
"properties": {
"languageId": {
"type": "integer",
"description": "Language ID",
"format": "int64",
"example": 1
},
"objectName": {
"type": "string",
"description": "MinIO object name",
"example": "version_1001/dir_10/xxx.md"
},
"fileName": {
"type": "string",
"description": "File name (for display)",
"example": "install.md"
},
"fileSize": {
"type": "integer",
"description": "File size in bytes",
"format": "int64",
"example": 1024
}
},
"description": "Directory file information"
},
"DirectoryImportNode": {
"required": [
"name",
"type"
],
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Directory name",
"example": "Installation Guide"
},
"type": {
"type": "string",
"description": "Node type: FOLDER/FILE",
"enum": [
"FOLDER",
"FILE"
],
"example": "FOLDER"
},
"sortOrder": {
"type": "integer",
"description": "Sort order; smaller values come first",
"format": "int32",
"example": 100
},
"description": {
"type": "string",
"description": "Directory description (only takes effect on FOLDER nodes)",
"example": "This chapter contains the quick-start instructions"
},
"children": {
"type": "array",
"description": "List of child directories (used only for the FOLDER type)",
"items": {
"$ref": "#/components/schemas/DirectoryImportNode"
}
},
"files": {
"type": "array",
"description": "List of files (used only for the FILE type)",
"items": {
"$ref": "#/components/schemas/DirectoryImportFile"
}
}
},
"description": "Directory import node"
},
"DirectoryImportRequest": {
"required": [
"tree",
"versionId"
],
"type": "object",
"properties": {
"versionId": {
"type": "integer",
"description": "Document version ID",
"format": "int64",
"example": 1001
},
"tree": {
"type": "array",
"description": "Directory tree",
"items": {
"$ref": "#/components/schemas/DirectoryImportNode"
}
}
},
"description": "Directory batch-import request"
}
},
"responses": {},
"securitySchemes": {
"Bearer Authentication": {
"type": "http",
"description": "Enter the token in the format: Bearer {token}",
"scheme": "bearer",
"bearerFormat": "JWT"
}
}
},
"servers": [],
"security": [
{
"Bearer Authentication": []
}
]
}