feat: add inbound media understanding

Co-authored-by: Tristan Manchester <tmanchester96@gmail.com>
Peter Steinberger
2026-01-17 03:52:37 +00:00
parent 4b749f1b8f
commit 1b973f7506
42 changed files with 2547 additions and 101 deletions
+11 -1
@@ -38,13 +38,23 @@ The WhatsApp channel runs via **Baileys Web**. This document captures the curren
- `{{MediaUrl}}` pseudo-URL for the inbound media.
- `{{MediaPath}}` local temp path written before running the command.
- When a per-session Docker sandbox is enabled, inbound media is copied into the sandbox workspace and `MediaPath`/`MediaUrl` are rewritten to a relative path like `media/inbound/<filename>`.
- Audio transcription (if configured via `tools.audio.transcription`) runs before templating and can replace `Body` with the transcript.
- Media understanding (if configured via `tools.media.*`) runs before templating and can insert `[Image]`, `[Audio]`, and `[Video]` blocks into `Body`.
- Audio sets `{{Transcript}}` and uses the transcript for command parsing so slash commands still work.
- Video and image descriptions preserve any caption text for command parsing.
- Only the first matching image/audio/video attachment is processed; remaining attachments are left untouched.
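
The sandbox rewrite described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `MediaContext`, `rewriteForSandbox`, and `SANDBOX_INBOUND_DIR` are hypothetical names, and the sketch covers only the path rewrite, not the copy into the sandbox workspace.

```typescript
// Hypothetical shape: only the two template fields named in the doc.
interface MediaContext {
  MediaPath: string;
  MediaUrl: string;
}

// Relative directory taken from the doc's example `media/inbound/<filename>`.
const SANDBOX_INBOUND_DIR = "media/inbound";

// Rewrites both fields to the same sandbox-relative path.
// Assumes the file has already been copied into the sandbox workspace.
function rewriteForSandbox(ctx: MediaContext): MediaContext {
  // Take the last path segment, accepting POSIX and Windows separators.
  const filename = ctx.MediaPath.split(/[\\/]/).pop() ?? "";
  if (filename === "") {
    throw new Error(`cannot derive a filename from "${ctx.MediaPath}"`);
  }
  const relative = `${SANDBOX_INBOUND_DIR}/${filename}`;
  return { MediaPath: relative, MediaUrl: relative };
}
```

With this sketch, a host temp path such as `/tmp/inbound/voice.ogg` would become `media/inbound/voice.ogg` in both fields, so the templated command sees a path that resolves inside the sandbox.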
## Limits & Errors
**Outbound send caps (WhatsApp web send)**
- Images: ~6MB cap after recompression.
- Audio/voice/video: 16MB cap; documents: 100MB cap.
- Oversize or unreadable media produces a clear error in the logs, and the reply is skipped.
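
As a rough illustration of the outbound caps above, a pre-send size check might look like the sketch below. The names (`OutboundKind`, `OUTBOUND_CAPS`, `checkOutboundSize`) are hypothetical, and it assumes 1MB = 1024 × 1024 bytes and treats the approximate image cap as exactly 6MB; the actual send path may differ on both points.

```typescript
type OutboundKind = "image" | "audio" | "video" | "document";

// Assumption: binary megabytes. The doc does not say which unit is used.
const MB = 1024 * 1024;

// Caps from the list above. The image cap applies after recompression.
const OUTBOUND_CAPS: Record<OutboundKind, number> = {
  image: 6 * MB,
  audio: 16 * MB, // also covers voice notes
  video: 16 * MB,
  document: 100 * MB,
};

type SizeCheck = { ok: true } | { ok: false; reason: string };

// Returns a failure with a loggable reason when the payload exceeds its cap.
function checkOutboundSize(kind: OutboundKind, sizeBytes: number): SizeCheck {
  const cap = OUTBOUND_CAPS[kind];
  if (sizeBytes > cap) {
    return {
      ok: false,
      reason: `${kind} is ${sizeBytes} bytes, over the ${cap}-byte cap`,
    };
  }
  return { ok: true };
}
```

A caller following the behavior described above would log `reason` and skip the reply when `ok` is `false`.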
**Media understanding caps (transcription/description)**
- Image default: 10MB (`tools.media.image.maxBytes`).
- Audio default: 20MB (`tools.media.audio.maxBytes`).
- Video default: 50MB (`tools.media.video.maxBytes`).
- Oversize media skips understanding, but replies still go through with the original body.
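
The cap lookup and skip behavior above could be sketched like this. Only the `tools.media.<kind>.maxBytes` keys and the default sizes come from the list; the `MediaToolsConfig` type, the function names, and the 1MB = 1024 × 1024 bytes convention are assumptions made for illustration.

```typescript
type MediaKind = "image" | "audio" | "video";

// Hypothetical view of the `tools.media` config section.
type MediaToolsConfig = Partial<Record<MediaKind, { maxBytes?: number }>>;

// Assumption: binary megabytes.
const MB = 1024 * 1024;

// Defaults from the list above.
const DEFAULT_MAX_BYTES: Record<MediaKind, number> = {
  image: 10 * MB,
  audio: 20 * MB,
  video: 50 * MB,
};

// A configured `maxBytes` overrides the default for that kind.
function resolveMaxBytes(kind: MediaKind, config: MediaToolsConfig = {}): number {
  return config[kind]?.maxBytes ?? DEFAULT_MAX_BYTES[kind];
}

// False means: skip understanding and reply with the original body.
function shouldRunUnderstanding(
  kind: MediaKind,
  sizeBytes: number,
  config: MediaToolsConfig = {},
): boolean {
  return sizeBytes <= resolveMaxBytes(kind, config);
}
```

Unlike the outbound caps, exceeding one of these limits does not suppress the reply. It only means the body is passed through without a transcript or description.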
## Notes for Tests
- Cover send + reply flows for image/audio/document cases.
- Validate recompression for images (size bound) and voice-note flag for audio.