ENGINEERING · 5 min read

Tracing vs logging: why print() debugging fails for LLM apps

The Currai team, Engineering — May 22, 2026

Every LLM app starts the same way: a print() of the prompt, a print() of the response, and a scroll through the terminal when something looks off. It works until the app does more than one thing — and then it falls apart fast.

What logging can't capture

A log line is a point-in-time string. That's fine for "request received" or "cache miss," but an LLM request is a tree: a prompt is assembled from a template plus retrieved documents plus chat history, the model is called, it emits tool calls, those tools run, and the model is called again. Flatten that into log lines and you lose the structure that explains the behavior.

[INFO] prompt: You are a helpful assistant...
[INFO] response: I'll check that for you
[INFO] tool: search_orders(...)
[INFO] response: Your order shipped Tuesday

Which prompt produced which tool call? How long did the search take? What did the final prompt look like after the tool result was appended? The logs can't say.

A trace keeps the shape of the work

Tracing models the request as it actually happened — a root trace with nested spans and generations, each carrying its own input, output, timing, and metadata.

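# One root trace per user request.
trace = currai.trace(name="order-status", user_id="user-1")

# First model call: the model decides to call a tool.
gen1 = trace.generation(name="plan", model="gpt-4o-mini", input=messages)
gen1.end(output=tool_call)

# The tool runs as its own span, with its own input, output, and timing.
tool = trace.span(name="search_orders", input={"q": "latest"})
tool.end(output=orders)

# Second model call: the prompt now includes the tool result.
gen2 = trace.generation(name="answer", model="gpt-4o-mini", input=messages_with_tool)
gen2.end(output=final_reply)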

Open that trace and the whole turn is laid out in order, with the exact payloads and the time each step took. No reconstruction required.
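To make the "tree, not lines" idea concrete, here is a minimal in-memory sketch of what a trace might hold. It is not the Currai SDK: the Span class, its child() helper, and render() are hypothetical names invented for illustration, and the payloads are made up. The point is only that each step keeps a reference to its parent and its own timing.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class Span:
    """One node in the trace tree (hypothetical, not the Currai SDK)."""
    name: str
    input: Any = None
    output: Any = None
    started: float = 0.0
    ended: float = 0.0
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return (self.ended - self.started) * 1000.0

    @contextmanager
    def child(self, name: str, input: Any = None) -> Iterator["Span"]:
        """Open a nested span; the end time is recorded even if the body raises."""
        span = Span(name=name, input=input, started=time.perf_counter())
        self.children.append(span)
        try:
            yield span
        finally:
            span.ended = time.perf_counter()


def render(span: Span, depth: int = 0) -> list[str]:
    """Turn the tree into indented lines, keeping parent/child order."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:.1f} ms)"]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


def run_order_status() -> Span:
    """Simulate one turn with made-up payloads instead of real model calls."""
    root = Span(name="order-status", started=time.perf_counter())
    with root.child("plan", input="Where is my order?") as plan:
        plan.output = {"tool": "search_orders", "args": {"q": "latest"}}
        # The tool call is nested under the generation that requested it.
        with plan.child("search_orders", input=plan.output["args"]) as tool:
            tool.output = [{"id": 1, "status": "shipped"}]
    with root.child("answer", input=tool.output) as answer:
        answer.output = "Your order shipped Tuesday"
    root.ended = time.perf_counter()
    return root
```

Calling print("\n".join(render(run_order_status()))) prints the turn as an indented outline. Because search_orders sits under plan, the question "which prompt produced which tool call?" is answered by the structure itself.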

The three things logs make you give up

Correlation. Logs from concurrent requests interleave; a trace groups everything for one request under one id automatically.
Replay. A log line holds only the string you chose to print. A trace stores the full input and output, so you can re-run the exact request months later.
Cost and latency. Token usage and per-step timing live on the trace, ready to roll up — they were never going to fit in a log line.
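As a rough illustration of the cost and latency roll-up, the sketch below sums token counts and timings across the generations of one trace. The Generation record, the "example-model" name, and the prices are all assumptions made up for this example. They are not Currai's data model or real pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Generation:
    """One model call recorded on a trace (hypothetical record shape)."""
    name: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


# Illustrative prices in USD per 1M tokens; NOT real pricing.
PRICE_PER_MILLION = {"example-model": {"input": 1.00, "output": 2.00}}


def rollup(generations: list[Generation]) -> dict[str, float]:
    """Sum tokens, latency, and estimated cost across one trace."""
    cost = sum(
        g.input_tokens * PRICE_PER_MILLION[g.model]["input"] / 1_000_000
        + g.output_tokens * PRICE_PER_MILLION[g.model]["output"] / 1_000_000
        for g in generations
    )
    return {
        "input_tokens": sum(g.input_tokens for g in generations),
        "output_tokens": sum(g.output_tokens for g in generations),
        "latency_ms": sum(g.latency_ms for g in generations),
        "cost_usd": round(cost, 6),
    }


# Two generations from a single made-up trace.
trace = [
    Generation("plan", "example-model", 1200, 40, 310.0),
    Generation("answer", "example-model", 1500, 60, 420.0),
]
summary = rollup(trace)
```

Because every generation carries its own counts, the same function works for one trace or for every trace belonging to a user. Getting the same numbers from log lines would mean parsing them back out of strings.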

You don't have to choose

Keep your logs for infrastructure events — they're great at that. Add tracing for the model calls, where structure and replay are the difference between a five-minute fix and an afternoon of guessing. The day you trade print() for a real trace is the day LLM debugging stops feeling like archaeology.
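One way logs and traces might coexist is to stamp each infrastructure log line with the id of the trace it belongs to. The sketch below uses only the Python standard library (contextvars and logging). The current_trace_id variable, the TraceIdFilter class, and the uuid-based id are hypothetical conventions for illustration, not part of any Currai API.

```python
import contextvars
import io
import logging
import uuid

# Holds the id of the trace the current request belongs to (hypothetical convention).
current_trace_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "current_trace_id", default="-"
)


class TraceIdFilter(logging.Filter):
    """Copy the active trace id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream: io.StringIO) -> logging.Logger:
    """Create a logger whose output includes the trace id."""
    logger = logging.getLogger("app.infra")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("[%(levelname)s] trace=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def handle_request(logger: logging.Logger) -> str:
    """Log an infrastructure event while a (stand-in) trace id is active."""
    trace_id = uuid.uuid4().hex  # stand-in for the id a tracing SDK would assign
    token = current_trace_id.set(trace_id)
    try:
        logger.info("cache miss")
    finally:
        current_trace_id.reset(token)
    return trace_id
```

With this in place, a "cache miss" line can be matched to the trace for the same request, while the prompts, tool calls, and timings stay on the trace itself.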