Why I Use Temporal for Long-Running AI Workflows
Why I Use Temporal for Long-Running AI Workflows
Quick answer: Temporal is a durable execution platform that runs your code as fault-tolerant workflows. If a step fails — an LLM timeout, a flaky API, a crashed worker — Temporal retries exactly that step, not the whole pipeline. For AI workflows specifically, where individual calls can hang for 30 seconds or silently fail, this changes how reliable your system can actually be.
We had an Argo workflow that synced connector metadata from Fivetran into our internal catalog. It ran fine, mostly. But "mostly fine" in data pipelines means "broken in ways you don't find out about until 3am."
The workflow would fail midway. Argo would mark it as failed, but the partial writes had already gone to the database. The next run would start from scratch, sometimes doubling records, sometimes skipping updates. Retry logic was hand-rolled in the YAML. Every edge case was someone's personal contribution to a growing ball of undocumented YAML mud.
We were migrating this to Temporal as part of moving our connector infrastructure to Automation Engine. That's when I got a proper look at what durable execution actually means in practice.
What broke with Argo
Argo Workflows is great for batch jobs and simple DAGs. It's Kubernetes-native, has a clean UI, and the community is solid. The problems showed up when we started building workflows that had real external dependencies.
Silent partial failures. A workflow would complete steps 1 through 6, fail on step 7, and Argo would flag the whole thing as failed. But steps 1 through 6 had already written data. The next retry would run from step 1, giving us duplicate writes, missing foreign keys, or stale state depending on which step had side effects.
Manual retry logic. Every step that needed retry behavior needed its own retry block in the YAML. Want exponential backoff with jitter? That's a custom wrapper or a templating hack.
No state between steps. Passing data between Argo steps meant writing to shared storage (usually S3) and reading it back. It added latency, a failure point, and made debugging a treasure hunt across logs and buckets.
The killer issue for AI workloads: LLM calls. A step that calls GPT-4 might take 10 seconds, might take 30, might time out, might return a rate-limit error at peak hours. Argo's timeout and retry model wasn't designed for this.
What Temporal actually is
Temporal splits your workflow into two concepts worth understanding clearly before you write a single line of code.
Workflows are the orchestration layer. They define the sequence of steps, the logic, the branching. Workflows are durable — Temporal records every event to a history log. If your worker process crashes mid-workflow, when it comes back up, Temporal replays the history and resumes exactly where it left off. The Workflow code must be deterministic for this to work.
Activities are the actual units of work. Each Activity is a function that does something: calls an API, writes to a database, invokes an LLM. Activities can be retried independently. If Activity 8 fails, Temporal retries Activity 8. Activities 1 through 7 are done. Their results are cached in the workflow history.
This distinction is the whole game. Retries become surgical instead of blunt. Failures don't cascade backwards.
The pattern I use for AI pipelines
Here's the concrete pattern for an LLM-powered data pipeline. Polling an external API, enriching the response with an LLM, writing the result to a database:
`python
@workflow.defn
class EnrichmentWorkflow:
@workflow.run
async def run(self, connector_id: str) -> dict:
raw_data = await workflow.execute_activity(
fetch_connector_metadata,
connector_id,
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(maximum_attempts=3),
)
enriched = await workflow.execute_activity(
call_llm_enrichment,
raw_data,
start_to_close_timeout=timedelta(seconds=60),
retry_policy=RetryPolicy(
maximum_attempts=5,
initial_interval=timedelta(seconds=2),
backoff_coefficient=2.0,
non_retryable_error_types=["InvalidInputError"],
),
)
result = await workflow.execute_activity(
write_to_catalog,
enriched,
start_to_close_timeout=timedelta(seconds=15),
retry_policy=RetryPolicy(maximum_attempts=3),
)
return result
`
The LLM activity gets a 60-second timeout and up to 5 retries with exponential backoff. If it hits a rate limit on attempt 2, Temporal waits and retries. If it gets an InvalidInputError, it stops immediately — no point retrying bad input.
If the worker process crashes between step 2 and step 3, Temporal knows step 2 completed (it's in the history). When the worker restarts, it picks up at step 3. The LLM call doesn't re-run. That's the part that actually saves you.
Why Temporal fits AI workloads specifically
Timeouts that don't lie. Temporal's start_to_close_timeout is per Activity execution. You set it at the call site, close to the code making the HTTP request. When something times out, you know exactly which Activity timed out.
Long waits without blocking. Some AI pipelines need to wait on an async process — a batch job, a human review, an external enrichment service that takes hours. Temporal Workflows can sleep using workflow.sleep(). The worker process isn't blocked during that wait.
Audit history for free. Every Activity, every input, every result is recorded in the workflow history. When something goes wrong, you open the Temporal UI and read what happened. No reconstructing from logs across six services.
Before and after: one Fivetran connector step
Before (Argo YAML):
`yaml
- name: enrich-metadata
retryStrategy:
limit: 3
backoff: "2m"
container:
image: connector-worker:latest
command: [python, enrich.py, "{{inputs.parameters.connector_id}}"]
outputs:
artifacts:
- name: enriched-metadata
path: /tmp/enriched.json
s3:
bucket: connector-artifacts
key: "{{workflow.name}}/enriched.json"
`
Failure here meant the artifact might or might not exist in S3. The next step would try to read it. If the artifact was corrupt or half-written, the next step would fail with a cryptic S3 error, not an enrichment error.
After (Temporal Activity): The Activity function returns the enriched data directly. Temporal stores it in the workflow history. The next Activity receives it as a typed Python argument. No S3. No artifact path. No ambiguity about whether the file exists.
When the Activity fails, you see a Python exception in the Temporal UI, not an S3 key-not-found error that sends you on a 45-minute debugging detour.
What I found hard
Local dev is a chore. You need to run the Temporal server locally (or use their cloud). Docker Compose with a separate service, a UI, and a database. It's overhead on every developer's machine.
Workflow determinism is a constraint you feel. No datetime.now(), no random number generation, no direct I/O in Workflow code. Push all side effects into Activities. This is the right design, but it takes a mental shift.
Signals and Queries. The pattern for implementing them in the Python SDK is not obvious the first time. I spent a day building a signal handler before I understood that the Workflow had to explicitly loop and wait for signals using workflow.wait_condition.
When I'd reach for Temporal vs something simpler
Reach for Temporal when:
- The pipeline has 5+ distinct steps with external dependencies
- Any step calls an LLM, a slow API, or something that fails unpredictably
- Partial failures leave your data in inconsistent states
- You need long waits (hours, days) inside the pipeline
Reach for a simpler queue (Celery, BullMQ, SQS + Lambda) when:
- The workflow is really just "run this job and retry on failure" — one or two steps
- The team doesn't have bandwidth to learn a new SDK and run a new service
- Idempotency is easy and failures are recoverable by re-running from scratch
The migration from Argo to Temporal wasn't free. It took time to learn the SDK, time to set up the infrastructure, and time to rewrite the workflows. The bet is that the operational reliability pays back that investment when you're not debugging silent partial failures at 3am.
So far, that bet is paying off.