test_multimodal_agent.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4kdmk3opy52",
   "source": "# Multimodal Agent - Test Notebook\n\nThis notebook tests the Strands Agent with AgentCore Memory **before deployment**.\n\nIt validates that:\n1. The agent can process **text** messages\n2. The agent can process **images** and store the understanding as text in memory\n3. The agent can process **documents** (PDF) and store summaries in memory\n4. The agent can handle **audio transcripts**\n5. The agent can analyze **videos** using TwelveLabs Pegasus via Amazon Bedrock\n6. Memory persists across turns - the agent can answer questions about previously shared multimedia\n\n> **Prerequisites**: Install dependencies with `uv pip install strands-agents strands-agents-tools boto3`. Pegasus is invoked through the Bedrock runtime API, so no separate TwelveLabs SDK is needed.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "ohrarsqxe3q",
   "source": "%pip install strands-agents strands-agents-tools boto3 --quiet",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "jukhasn1xhe",
   "source": "## 1. Setup - Create the Agent (without AgentCore Memory)\n\nFor local testing, we create the agent **without** AgentCore Memory.\nThe memory integration is tested separately once the agent is deployed to AgentCore Runtime.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "lkmpe28j31",
   "source": "import os\nimport json\nimport base64\nfrom pathlib import Path\n\nimport boto3\nfrom strands import Agent, tool\nfrom strands.models import BedrockModel\n\n\n# --- Video Analysis Tool (TwelveLabs Pegasus via Bedrock) ---\n@tool\ndef video_analysis(\n    s3_uri: str,\n    prompt: str = \"Describe this video in detail including visual content, actions, text on screen, and any spoken words.\",\n    temperature: float = 0.2,\n) -> dict:\n    \"\"\"Analyze a video using the TwelveLabs Pegasus model via Amazon Bedrock.\n\n    Args:\n        s3_uri: S3 URI of the video (e.g., s3://bucket/key.mp4).\n        prompt: Question or instruction about the video content.\n        temperature: Model temperature for response generation.\n\n    Returns:\n        Dict with video analysis results.\n    \"\"\"\n    aws_region = os.environ.get(\"AWS_REGION\", \"us-east-1\")\n    bedrock = boto3.client(\"bedrock-runtime\", region_name=aws_region)\n    # Pegasus needs the bucket owner's account ID alongside the S3 URI\n    sts = boto3.client(\"sts\", region_name=aws_region)\n    account_id = sts.get_caller_identity()[\"Account\"]\n\n    body = {\n        \"inputPrompt\": prompt,\n        \"mediaSource\": {\"s3Location\": {\"uri\": s3_uri, \"bucketOwner\": account_id}},\n        \"temperature\": temperature,\n    }\n\n    response = bedrock.invoke_model(\n        modelId=\"twelvelabs.pegasus-1-2-v1:0\",\n        body=json.dumps(body),\n        contentType=\"application/json\",\n        accept=\"application/json\",\n    )\n\n    response_body = json.loads(response[\"body\"].read())\n    return {\n        \"status\": \"success\",\n        \"content\": [{\"json\": {\"s3_uri\": s3_uri, \"analysis\": response_body.get(\"message\", \"\")}}],\n    }\n\n\n# --- System Prompt ---\nSYSTEM_PROMPT = \"\"\"You are a helpful assistant on WhatsApp. You can process text, images, documents, audio, and videos.\n\nAlways respond in the same language the user writes to you.\n\nWhen you receive multimedia, describe or summarize the content in detail so it is preserved in your memory for future questions.\n\nFor videos, use the video_analysis tool with the provided S3 URI.\n\nKeep responses under 4000 characters and use WhatsApp-friendly formatting.\n\"\"\"\n\nmodel = BedrockModel(model_id=\"us.anthropic.claude-sonnet-4-20250514-v1:0\")\nagent = Agent(model=model, system_prompt=SYSTEM_PROMPT, tools=[video_analysis])\nprint(\"Agent created with video_analysis tool\")",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
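  {
   "cell_type": "markdown",
   "id": "awscredcheckmd",
   "source": "### Optional: Verify AWS Credentials\n\nBefore calling Bedrock, it can help to confirm that boto3 resolves credentials and a region. This is a minimal sanity check; it assumes your AWS credentials (environment variables, `~/.aws`, or an IAM role) are already configured.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "awscredcheckcode",
   "source": "# Sanity check: confirm boto3 can resolve credentials before any Bedrock call.\n# Assumes credentials are configured via env vars, ~/.aws, or an IAM role.\nsts_check = boto3.client(\"sts\", region_name=os.environ.get(\"AWS_REGION\", \"us-east-1\"))\nidentity = sts_check.get_caller_identity()\nprint(f\"Account: {identity['Account']}\")\nprint(f\"Caller ARN: {identity['Arn']}\")",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },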
  {
   "cell_type": "markdown",
   "id": "p26qv4girr",
   "source": "## 2. Test Text Messages\n\nBasic text conversation to verify the agent responds correctly.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "k2p7ymakfb",
   "source": "result = agent(\"Hello! What can you help me with?\")\nprint(str(result))",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "qz10h8mgawi",
   "source": "## 3. Test Image Processing\n\nSimulates sending an image to the agent. The agent should describe it in detail\nso the understanding can be stored in memory (text-only).\n\n> Replace `IMAGE_PATH` with a real image path on your system, or generate one with the helper cell below this test.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "324b31fd1tg",
   "source": "def build_image_message(image_path: str, prompt: str = \"Analyze this image in detail.\"):\n    \"\"\"Build a multimodal message with an image for the agent.\"\"\"\n    image_bytes = Path(image_path).read_bytes()\n\n    # Detect the format from magic bytes\n    if image_bytes[:3] == b\"\\xff\\xd8\\xff\":  # JPEG (FF D8 FF)\n        fmt = \"jpeg\"\n    elif image_bytes[:8] == b\"\\x89PNG\\r\\n\\x1a\\n\":  # PNG signature\n        fmt = \"png\"\n    elif image_bytes[:4] == b\"GIF8\":  # GIF87a / GIF89a\n        fmt = \"gif\"\n    elif image_bytes[:4] == b\"RIFF\" and image_bytes[8:12] == b\"WEBP\":  # WebP container\n        fmt = \"webp\"\n    else:\n        fmt = \"jpeg\"  # fall back to JPEG for unknown formats\n\n    return [\n        {\n            \"image\": {\n                \"format\": fmt,\n                \"source\": {\"bytes\": image_bytes},\n            }\n        },\n        {\n            \"text\": (\n                f\"{prompt}\\n\\n\"\n                \"IMPORTANT: Provide a detailed description of what you see in this image. \"\n                \"This description will be stored in memory so you can answer future questions \"\n                \"about this image even when you no longer have access to it.\"\n            ),\n        },\n    ]\n\n# --- Test with a sample image ---\n# Replace with an actual image path to test\nIMAGE_PATH = \"sample_image.jpg\"  # Change this\n\nif Path(IMAGE_PATH).exists():\n    content = build_image_message(IMAGE_PATH, \"What do you see in this image?\")\n    result = agent(content)\n    print(str(result))\nelse:\n    print(f\"Image not found: {IMAGE_PATH}\")\n    print(\"Skipping image test. Place a test image and update IMAGE_PATH.\")",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
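  {
   "cell_type": "markdown",
   "id": "gentestimagemd",
   "source": "### Helper: Generate a Test Image\n\nIf you don't have an image handy, this cell writes a small JPEG to `sample_image.jpg`. It is a convenience sketch that assumes Pillow is installed (`%pip install pillow`); re-run the image test cell above after generating it.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "gentestimagecode",
   "source": "# Generate a small test JPEG so the image test above can run.\n# Assumes Pillow is installed: %pip install pillow\nfrom PIL import Image, ImageDraw\n\nimg = Image.new(\"RGB\", (400, 300), color=(30, 120, 200))\ndraw = ImageDraw.Draw(img)\ndraw.rectangle([100, 75, 300, 225], fill=(255, 200, 0))\ndraw.text((110, 85), \"Test image\", fill=(0, 0, 0))\nimg.save(\"sample_image.jpg\", \"JPEG\")\nprint(\"Wrote sample_image.jpg - re-run the image test cell above.\")",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },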
  {
   "cell_type": "markdown",
   "id": "4i68pzkaru4",
   "source": "## 4. Test Memory Recall (Follow-up Question)\n\nAfter processing an image, the agent should be able to answer follow-up questions\nabout it using conversational context (and, in production, AgentCore Memory).",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "q8cb8r2kht",
   "source": "# Follow-up question about the previously shared image\nresult = agent(\"What were the main things you saw in the image I just sent?\")\nprint(str(result))",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "aj6h9crh9b",
   "source": "## 5. Test Audio Transcript Processing\n\nSimulates receiving a transcription of an audio message.\nThe agent processes the text and stores the understanding in memory.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "yke7rifwnee",
   "source": "# Simulate an audio transcription\naudio_transcript = \"Hey, I need to schedule a meeting for next Tuesday at 3 PM with the marketing team to discuss the Q4 campaign strategy.\"\n\naudio_message = [\n    {\n        \"text\": (\n            \"The user sent an audio message. Here is the transcription:\\n\\n\"\n            f'\"{audio_transcript}\"\\n\\n'\n            \"Respond to the transcribed audio message. The transcription content \"\n            \"will be stored in memory for future reference.\"\n        ),\n    }\n]\n\nresult = agent(audio_message)\nprint(str(result))",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "5zpj5dy5i8v",
   "source": "## 6. Test Document Processing\n\nSimulates sending a PDF document. The agent extracts key content and creates\na text summary for memory storage.\n\n> Replace `DOCUMENT_PATH` with a real PDF path, or generate one with the helper cell below this test.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "j5k0zj2jui",
   "source": "def build_document_message(doc_path: str, prompt: str = \"Analyze this document.\"):\n    \"\"\"Build a multimodal message with a document for the agent.\"\"\"\n    doc_bytes = Path(doc_path).read_bytes()\n    # Bedrock expects a supported format (e.g., pdf, csv, doc, docx, xls, xlsx, html, txt, md)\n    ext = Path(doc_path).suffix.lstrip(\".\")\n    name = Path(doc_path).stem\n\n    return [\n        {\n            \"document\": {\n                \"format\": ext,\n                \"name\": name,\n                \"source\": {\"bytes\": doc_bytes},\n            }\n        },\n        {\n            \"text\": (\n                f\"{prompt}\\n\\n\"\n                \"IMPORTANT: Extract and summarize the key content from this document. \"\n                \"This summary will be stored in memory so you can answer future questions \"\n                \"about this document even when you no longer have access to it.\"\n            ),\n        },\n    ]\n\n# --- Test with a sample document ---\nDOCUMENT_PATH = \"sample_document.pdf\"  # Change this\n\nif Path(DOCUMENT_PATH).exists():\n    content = build_document_message(DOCUMENT_PATH, \"Summarize this document for me.\")\n    result = agent(content)\n    print(str(result))\nelse:\n    print(f\"Document not found: {DOCUMENT_PATH}\")\n    print(\"Skipping document test. Place a test PDF and update DOCUMENT_PATH.\")",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
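  {
   "cell_type": "markdown",
   "id": "gentestpdfmd",
   "source": "### Helper: Generate a Test PDF\n\nIf you don't have a PDF handy, this cell writes a one-page sample to `sample_document.pdf`. It is a convenience sketch that assumes reportlab is installed (`%pip install reportlab`); re-run the document test cell above after generating it.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "gentestpdfcode",
   "source": "# Generate a one-page sample PDF so the document test above can run.\n# Assumes reportlab is installed: %pip install reportlab\nfrom reportlab.pdfgen import canvas\n\nc = canvas.Canvas(\"sample_document.pdf\")\nc.drawString(72, 750, \"Quarterly Report (Sample)\")\nc.drawString(72, 720, \"Revenue grew 12% quarter over quarter.\")\nc.drawString(72, 700, \"Key initiative: launch of the Q4 marketing campaign.\")\nc.save()\nprint(\"Wrote sample_document.pdf - re-run the document test cell above.\")",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },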
  {
   "cell_type": "markdown",
   "id": "bs9em3yau1c",
   "source": "## 7. Test Video Analysis (TwelveLabs Pegasus via Bedrock)\n\nAnalyzes a video using the `video_analysis` tool, which calls the TwelveLabs Pegasus\nmodel via Amazon Bedrock. This yields combined visual and audio understanding,\nricher than what an audio-only transcription provides.\n\n> **Prerequisites**: Upload a test video to S3 and update `VIDEO_S3_URI` below.\n> Ensure the TwelveLabs Pegasus model (`twelvelabs.pegasus-1-2-v1:0`) is enabled in your Bedrock console.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "nrsuyrhqhbd",
   "source": "# --- Test video analysis with TwelveLabs Pegasus ---\n# Replace with an actual S3 URI of a video file\nVIDEO_S3_URI = \"s3://your-bucket/videos/sample-video.mp4\"  # Change this\n\n# To upload a local video to S3 for testing:\n# s3 = boto3.client(\"s3\")\n# s3.upload_file(\"local_video.mp4\", \"your-bucket\", \"videos/sample-video.mp4\")\n\n# Simulate the prompt the agent would receive when a video is sent via WhatsApp\nvideo_prompt = (\n    f\"The user sent a video stored at: {VIDEO_S3_URI}\\n\\n\"\n    \"User's message: What is this video about?\\n\\n\"\n    \"Use the video_analysis tool with the S3 URI above to analyze \"\n    \"the video's visual content and audio. Provide a comprehensive \"\n    \"description that includes visual elements, actions, spoken words, \"\n    \"and any on-screen text.\"\n)\n\n# The agent will automatically call the video_analysis tool\nresult = agent(video_prompt)\nprint(str(result))",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "yetauglyc8j",
   "source": "## 8. Test Conversation Memory\n\nAsk the agent about things from earlier in the conversation to verify it retains context.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "zqkp2i1j1zs",
   "source": "# Ask about the audio message from earlier\nresult = agent(\"What meeting did I ask about in my audio message?\")\nprint(str(result))",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "id": "ya08ku2ao",
   "source": "# Ask about the video from earlier (adjust the question to match your test video's content)\nresult = agent(\"What was the revenue increase mentioned in the video?\")\nprint(str(result))",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "0bfyjdf6ojrf",
   "source": "## 9. Simulate Full Payload (as AgentCore would receive)\n\nThis simulates the exact payload format that the Lambda handler sends to AgentCore Runtime. A sketch of how a handler might turn these payloads into agent content follows the validation cell.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "iajpxsoslf9",
   "source": "# Simulate text-only payload\ntext_payload = {\n    \"prompt\": \"What is the capital of France?\"\n}\n\n# Simulate image payload (would include base64 data in production)\nimage_payload = {\n    \"prompt\": \"What do you see?\",\n    \"media\": {\n        \"type\": \"image\",\n        \"format\": \"jpeg\",\n        \"data\": \"<base64_encoded_image_data>\",\n    }\n}\n\n# Simulate audio transcript payload\naudio_payload = {\n    \"prompt\": \"Process this audio\",\n    \"media\": {\n        \"type\": \"audio_transcript\",\n        \"data\": \"The user said: please remind me to buy groceries tomorrow.\",\n    }\n}\n\n# Simulate video payload (TwelveLabs Pegasus via Bedrock)\nvideo_payload = {\n    \"prompt\": \"Analyze this video in detail.\",\n    \"media\": {\n        \"type\": \"video\",\n        \"s3_uri\": \"s3://bucket/videos/sample.mp4\",\n    }\n}\n\n# Simulate document payload\ndocument_payload = {\n    \"prompt\": \"Summarize this document\",\n    \"media\": {\n        \"type\": \"document\",\n        \"format\": \"pdf\",\n        \"data\": \"<base64_encoded_pdf_data>\",\n        \"name\": \"quarterly_report\",\n    }\n}\n\nprint(\"Payload formats validated:\")\nfor name, payload in [\n    (\"text\", text_payload),\n    (\"image\", image_payload),\n    (\"audio_transcript\", audio_payload),\n    (\"video\", video_payload),\n    (\"document\", document_payload),\n]:\n    has_media = \"media\" in payload\n    media_type = payload.get(\"media\", {}).get(\"type\", \"none\")\n    print(f\"  {name}: prompt='{payload['prompt'][:40]}...' media={has_media} type={media_type}\")",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  }
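,
  {
   "cell_type": "markdown",
   "id": "payloadmappingmd",
   "source": "### Sketch: Mapping Payloads to Agent Content\n\nTo make the payload contract concrete, here is a hypothetical helper showing how a handler *might* translate these payloads into the content blocks used earlier in this notebook. The function name and mapping are illustrative assumptions, not the deployed handler's actual code.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "payloadmappingcode",
   "source": "# Hypothetical sketch: map the payloads above to agent content blocks,\n# mirroring build_image_message / build_document_message. This is NOT the\n# deployed Lambda/AgentCore handler's actual code.\ndef payload_to_agent_content(payload: dict):\n    media = payload.get(\"media\")\n    if not media:\n        return payload[\"prompt\"]  # plain text message\n\n    mtype = media[\"type\"]\n    if mtype == \"image\":\n        return [\n            {\"image\": {\"format\": media[\"format\"], \"source\": {\"bytes\": base64.b64decode(media[\"data\"])}}},\n            {\"text\": payload[\"prompt\"]},\n        ]\n    if mtype == \"document\":\n        return [\n            {\"document\": {\"format\": media[\"format\"], \"name\": media[\"name\"], \"source\": {\"bytes\": base64.b64decode(media[\"data\"])}}},\n            {\"text\": payload[\"prompt\"]},\n        ]\n    if mtype == \"audio_transcript\":\n        return [{\"text\": f'Audio transcription: \"{media[\"data\"]}\"\\n\\n{payload[\"prompt\"]}'}]\n    if mtype == \"video\":\n        return [{\"text\": f\"Video at {media['s3_uri']}. {payload['prompt']} Use the video_analysis tool.\"}]\n    raise ValueError(f\"Unknown media type: {mtype}\")\n\n# The base64 placeholders above are not decodable, so demo with the\n# text, audio, and video payloads only.\nfor p in (text_payload, audio_payload, video_payload):\n    print(payload_to_agent_content(p))",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  }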
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}