Document processing & screen sharing
Documents and screen shares are two of the most reliable multimodal inputs. PDFs arrive via data channels and get parsed into text. Screen shares arrive as high-resolution video Tracks that multimodal LLMs read with high accuracy. In this chapter you will build both capabilities and connect them with annotation overlays.
Document uploads via data channels
Documents arrive as binary payloads on LiveKit data channels. The frontend sends the file, the agent receives and processes it based on file type.
User uploads a file
The frontend provides a file picker. When the user selects a PDF or image, the app reads the file and sends it as a data message with the reliable flag.
Agent receives and processes
The agent listens for incoming data messages on the document-upload topic, detects the file type, and extracts text content.
Content enters context on demand
Extracted text is stored and exposed through a function tool. The LLM retrieves document content only when it needs to reference it, keeping token usage proportional to actual use.
```python
import io

from livekit.agents import Agent, AgentSession, function_tool
from livekit.plugins import openai, deepgram, cartesia


class DocumentAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a study partner that can read documents.
            When the user uploads a PDF or image, read its contents and help them
            understand the material. Always refer to specific sections, page numbers,
            or passages when answering questions about a document.""",
        )
        self._document_text = ""

    @function_tool
    async def get_document_content(self, context):
        """Retrieve the contents of the most recently uploaded document.
        Call this when the user asks about their uploaded document."""
        if not self._document_text:
            return "No document has been uploaded yet. Ask the user to upload a file."
        if len(self._document_text) < 4000:
            return self._document_text
        return self._document_text[:4000] + "\n[Document truncated — ask about specific sections]"

    @function_tool
    async def search_document(self, context, query: str):
        """Search the uploaded document for information related to the query.
        Returns relevant passages from the document."""
        if not self._document_text:
            return "No document uploaded yet."
        chunks = self._document_text.split("\n")
        query_lower = query.lower()
        relevant = [c for c in chunks if any(w in c.lower() for w in query_lower.split())]
        return "\n".join(relevant[:5]) if relevant else self._document_text[:4000]

    async def on_data_received(self, payload: bytes, topic: str):
        if topic == "document-upload":
            self._document_text = await self._extract_text(payload)
            await self.session.say("I have received your document. Ask me anything about it.")

    async def _extract_text(self, payload: bytes) -> str:
        import PyPDF2

        reader = PyPDF2.PdfReader(io.BytesIO(payload))
        pages = []
        for i, page in enumerate(reader.pages):
            text = page.extract_text() or ""
            pages.append(f"--- Page {i + 1} ---\n{text}")
        return "\n".join(pages)
```

```typescript
import { Agent, AgentSession, functionTool } from "@livekit/agents";
import { OpenAI } from "@livekit/agents-plugin-openai";

class DocumentAgent extends Agent {
  private documentText = "";

  constructor() {
    super({
      instructions: `You are a study partner that can read documents.
When the user uploads a PDF or image, read its contents and help them
understand the material. Always refer to specific sections, page numbers,
or passages when answering questions about a document.`,
    });
  }

  @functionTool({
    description: "Retrieve the contents of the most recently uploaded document.",
  })
  async getDocumentContent(context) {
    if (!this.documentText) {
      return "No document has been uploaded yet. Ask the user to upload a file.";
    }
    return this.documentText.length < 4000
      ? this.documentText
      : this.documentText.slice(0, 4000) + "\n[Document truncated]";
  }

  async onDataReceived(payload: Uint8Array, topic: string) {
    if (topic === "document-upload") {
      const pdfParse = await import("pdf-parse");
      const result = await pdfParse.default(Buffer.from(payload));
      this.documentText = result.text;
      await this.session.say("I have received your document. Ask me anything about it.");
    }
  }
}
```

The document text is stored in an instance variable and exposed through a function tool. The LLM does not receive the entire document in every message — it calls the tool when it needs to reference the document. This keeps token usage under control for long documents.
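The any-word match in search_document treats every chunk that shares even one word with the query as relevant, so common words can surface unrelated passages. A small scoring pass that ranks chunks by how many distinct query words they contain usually returns better material. A minimal sketch in plain Python (the function name `rank_chunks` is illustrative, not part of the LiveKit API):

```python
def rank_chunks(chunks: list[str], query: str, top_k: int = 5) -> list[str]:
    """Rank text chunks by how many distinct query words each contains."""
    query_words = set(query.lower().split())

    def score(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    # Keep only chunks that match at least one query word, best matches first
    scored = [(score(c), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

A tool like search_document could call this in place of its list comprehension, keeping the same inputs and return type while ordering results by relevance instead of document position.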
Image OCR with vision models
Not every document is a clean PDF with extractable text. Scanned pages, handwritten notes, and photos of textbooks require OCR. The multimodal LLM itself is a powerful OCR engine.
```python
import base64

from livekit.plugins import openai


async def ocr_with_vision(image_bytes: bytes, llm: openai.LLM) -> str:
    """Use a multimodal LLM to extract text from an image."""
    b64_image = base64.b64encode(image_bytes).decode("utf-8")
    response = await llm.chat(
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Extract all visible text from this image. "
                        "Preserve the original formatting as closely as possible. "
                        "If there are diagrams, describe them briefly.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64_image}"},
                    },
                ],
            }
        ]
    )
    return response.choices[0].message.content
```

```typescript
import { OpenAI } from "@livekit/agents-plugin-openai";

async function ocrWithVision(imageBytes: Uint8Array, llm: OpenAI.LLM): Promise<string> {
  const b64Image = Buffer.from(imageBytes).toString("base64");
  const response = await llm.chat({
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Extract all visible text from this image. "
              + "Preserve the original formatting as closely as possible. "
              + "If there are diagrams, describe them briefly.",
          },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${b64Image}` },
          },
        ],
      },
    ],
  });
  return response.choices[0].message.content;
}
```

Vision models outperform traditional OCR on messy inputs
For clean printed text, traditional OCR libraries are faster and cheaper. But for handwritten notes, photos taken at angles, or documents with mixed text and diagrams, multimodal LLMs produce significantly better results. Choose based on your input quality.
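That choice can be made at runtime rather than up front. The sketch below routes clean scans to a cheap OCR backend and falls back to the vision model when the input is messy or the fast path returns almost nothing. The backends are injected as callables — `fast_ocr` and `vision_ocr` are hypothetical names, standing in for something like a Tesseract wrapper and the `ocr_with_vision` pattern above:

```python
from typing import Awaitable, Callable

OcrBackend = Callable[[bytes], Awaitable[str]]


async def extract_text(
    image_bytes: bytes,
    *,
    clean_scan: bool,
    fast_ocr: OcrBackend,
    vision_ocr: OcrBackend,
) -> str:
    """Route clean scans to cheap OCR; use the vision model otherwise,
    or when the fast path produces almost no text."""
    if clean_scan:
        text = await fast_ocr(image_bytes)
        # Threshold is arbitrary; tune it against your own inputs
        if len(text.strip()) > 20:
            return text
    return await vision_ocr(image_bytes)
```

The empty-output fallback matters in practice: traditional OCR tends to fail silently on photos and handwriting, returning a few stray characters rather than raising an error.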
Frontend document upload component
The frontend sends files to the agent via a LiveKit data message with reliable: true to guarantee complete delivery.
```tsx
import { useDataChannel } from "@livekit/components-react";

function DocumentUpload() {
  const { send } = useDataChannel("document-upload");

  async function handleFileChange(e: React.ChangeEvent<HTMLInputElement>) {
    const file = e.target.files?.[0];
    if (!file) return;
    const buffer = await file.arrayBuffer();
    send(new Uint8Array(buffer), { reliable: true });
  }

  return (
    <div className="upload-area">
      <label htmlFor="doc-upload">Upload a document (PDF or image)</label>
      <input
        id="doc-upload"
        type="file"
        accept=".pdf,.png,.jpg,.jpeg"
        onChange={handleFileChange}
      />
    </div>
  );
}
```

Data messages in LiveKit can carry arbitrary binary payloads. The reliable: true option ensures the file arrives completely — unlike video frames, you cannot afford to drop parts of a PDF. The topic string "document-upload" lets the agent distinguish document uploads from other data messages.
Screen share analysis
A screen share is another video Track in the Room, but with different characteristics than a camera feed. Screen content is high-resolution text and UI elements. The LLM reads screen shares with higher accuracy than camera feeds, making this one of the most reliable vision use cases.
```python
from livekit.agents import Agent, AgentSession, RoomInputOptions, VideoFrameOptions
from livekit.plugins import deepgram, openai, cartesia


class ScreenAssistantAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You can see the user's shared screen. Help them with whatever
            they are working on.

            When reading screen content:
            - Read error messages exactly as written — do not paraphrase
            - Reference specific UI elements by name and location
            - Give step-by-step directions: "click the Settings icon in the top-right corner"
            - If the screen content is unclear, ask the user to scroll or zoom in

            You are patient and precise. Walk the user through each step one at a time.""",
        )


async def entrypoint(ctx):
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(),
    )
    await session.start(
        agent=ScreenAssistantAgent(),
        room=ctx.room,
        room_input_options=RoomInputOptions(
            video_enabled=True,
            video_frame_options=VideoFrameOptions(
                capture_interval=2.0,
                max_frames_in_context=3,
                max_width=1920,
                max_height=1080,
            ),
        ),
    )
```

```typescript
import { Agent, AgentSession } from "@livekit/agents";
import { OpenAI } from "@livekit/agents-plugin-openai";
import { Deepgram } from "@livekit/agents-plugin-deepgram";
import { Cartesia } from "@livekit/agents-plugin-cartesia";

class ScreenAssistantAgent extends Agent {
  constructor() {
    super({
      instructions: `You can see the user's shared screen. Help them with whatever
they are working on.

When reading screen content:
- Read error messages exactly as written — do not paraphrase
- Reference specific UI elements by name and location
- Give step-by-step directions: "click the Settings icon in the top-right corner"
- If the screen content is unclear, ask the user to scroll or zoom in

You are patient and precise. Walk the user through each step one at a time.`,
    });
  }
}

async function entrypoint(ctx) {
  const session = new AgentSession({
    stt: new Deepgram.STT(),
    llm: new OpenAI.LLM({ model: "gpt-4o" }),
    tts: new Cartesia.TTS(),
  });
  await session.start({
    agent: new ScreenAssistantAgent(),
    room: ctx.room,
    roomInputOptions: {
      videoEnabled: true,
      videoFrameOptions: {
        captureInterval: 2.0,
        maxFramesInContext: 3,
        maxWidth: 1920,
        maxHeight: 1080,
      },
    },
  });
}
```

Higher resolution for screen shares
Notice the max_width of 1920 instead of 1024 used for camera feeds. Screen content contains small text that becomes unreadable if downscaled too aggressively. The tradeoff is higher token cost — plan for 1,500-2,000 tokens per full-HD screen capture.
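The max_width and max_height options amount to an aspect-preserving downscale, which is why the 1024 cap used for camera feeds is too aggressive for screens. A helper like the one below (the name `fit_within` is illustrative, not part of the LiveKit SDK) shows the arithmetic: capping a 1080p screen share at 1024 wide nearly halves its linear resolution, shrinking typical 12-14px UI text below legibility:

```python
def fit_within(width: int, height: int, max_width: int, max_height: int) -> tuple[int, int]:
    """Downscale (never upscale) to fit within the bounds, preserving aspect ratio."""
    scale = min(max_width / width, max_height / height, 1.0)
    return round(width * scale), round(height * scale)
```

For example, a 1080p capture under a 1024-wide cap comes out at 1024x576, while the 1920x1080 cap leaves it untouched.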
Annotation overlay
Voice directions like "click the button on the right" can be ambiguous. Annotations let the agent highlight specific regions of the screen using normalized coordinates (0-1) sent via data channels.
```python
import json

from livekit.agents import Agent, function_tool


class AnnotatingAssistant(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You can see the user's screen and highlight areas using
            the annotate tool. When guiding the user, highlight the specific element
            you are referring to so they can find it easily.""",
        )

    @function_tool
    async def annotate_screen(self, context, x: float, y: float, width: float, height: float, label: str):
        """Highlight a rectangular region on the user's screen.
        Coordinates are normalized 0-1 relative to the screen dimensions.
        x, y is the top-left corner. width and height define the box size."""
        annotation = {
            "type": "highlight",
            "x": x, "y": y,
            "width": width, "height": height,
            "label": label,
        }
        await self.session.room.local_participant.publish_data(
            json.dumps(annotation).encode(),
            topic="screen-annotation",
        )
        return f"Highlighted region at ({x:.0%}, {y:.0%}) with label: {label}"
```

```typescript
import { Agent, functionTool } from "@livekit/agents";

class AnnotatingAssistant extends Agent {
  constructor() {
    super({
      instructions: `You can see the user's screen and highlight areas using
the annotate tool. When guiding the user, highlight the specific element
you are referring to so they can find it easily.`,
    });
  }

  @functionTool({
    description: "Highlight a rectangular region on the user's screen. Coordinates are normalized 0-1.",
  })
  async annotateScreen(context, { x, y, width, height, label }) {
    const annotation = { type: "highlight", x, y, width, height, label };
    await this.session.room.localParticipant.publishData(
      new TextEncoder().encode(JSON.stringify(annotation)),
      { topic: "screen-annotation" },
    );
    return `Highlighted region at (${(x * 100).toFixed(0)}%, ${(y * 100).toFixed(0)}%) with label: ${label}`;
  }
}
```

The frontend renders annotations as overlays on top of the screen share video:
```tsx
import { useDataChannel } from "@livekit/components-react";
import { useState, useEffect } from "react";

function ScreenAnnotationOverlay({ screenWidth, screenHeight }) {
  const [annotations, setAnnotations] = useState([]);
  const { message } = useDataChannel("screen-annotation");

  useEffect(() => {
    if (!message) return;
    const annotation = JSON.parse(new TextDecoder().decode(message.payload));
    const id = Date.now();
    setAnnotations((prev) => [...prev, { ...annotation, id }]);
    // Auto-clear after 5 seconds
    setTimeout(() => {
      setAnnotations((prev) => prev.filter((a) => a.id !== id));
    }, 5000);
  }, [message]);

  return (
    <div className="annotation-layer" style={{ position: "absolute", inset: 0 }}>
      {annotations.map((a) => (
        <div
          key={a.id}
          className="annotation-highlight"
          style={{
            position: "absolute",
            left: `${a.x * 100}%`,
            top: `${a.y * 100}%`,
            width: `${a.width * 100}%`,
            height: `${a.height * 100}%`,
            border: "3px solid #ff4444",
            borderRadius: 4,
          }}
        >
          <span className="annotation-label">{a.label}</span>
        </div>
      ))}
    </div>
  );
}
```

Annotation coordinates are approximate
The LLM estimates where UI elements are based on the captured frame. These estimates are usually close but not pixel-perfect. Use generous highlight boxes and clear labels rather than trying to pinpoint exact pixel coordinates.
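One way to be generous programmatically is to pad whatever box the model proposes by a fixed margin, then clamp it back into the visible 0-1 range before publishing. A small sketch (`pad_box` is a name invented here, not part of any SDK):

```python
def pad_box(
    x: float, y: float, width: float, height: float, margin: float = 0.02
) -> tuple[float, float, float, float]:
    """Expand a normalized (0-1) box by `margin` on every side, clamped to the screen."""
    x0 = max(0.0, x - margin)
    y0 = max(0.0, y - margin)
    x1 = min(1.0, x + width + margin)
    y1 = min(1.0, y + height + margin)
    return x0, y0, x1 - x0, y1 - y0
```

An annotate tool could run the model's arguments through this before building the annotation payload, so a slightly off-target estimate still covers the element it meant to highlight.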
Test your knowledge
Why is document text exposed through a function tool rather than injected directly into the LLM context with every message?
What is next
In the final chapter, you will bring everything together into a complete multimodal study partner agent that combines voice, vision, documents, screen sharing, and whiteboard drawing — then test it end to end.