Design decision - correct approach for costly pod starts #11719

RonaldGalea · 2023-08-30T13:13:36Z

RonaldGalea
Aug 30, 2023

Hello,

I wish to better understand the design decisions behind Argo. Specifically, the trade-offs between creating a completely separate pod instance to run each task as opposed to having a warm (but still dynamically scalable) pool of worker pods ready to accept requests (in some form).

Here are the points I can see for each approach.

Pod per task

Pros:

simple implementation
simple monitoring & error handling (watch what happens to the pod running the task)
strong isolation (no two tasks can interfere with each other)
very generic (a task just needs a container image and command to execute)

Cons:

start-up overheads: the pod start up time must be paid for each and every task

Pod worker pool

To keep things simple, let's assume single-threaded workers, so task concurrency per worker is just 1.

Pros:

warm instances are available, smaller latencies, better scalability

Cons:

complex implementation
monitoring and error handling becomes more complicated (clients probably will have to have some logic to support this)
though less likely when concurrency=1, tasks could still interfere with one another in some cases
harder to make things generic, one might need to implement clients in multiple languages.

My particular use case

I have a number of services that I need to chain together in some logical way which fits a DAG, so a Workflow Management tool like Argo seems to fit my use case well. However, some of these services have a relatively high start-up time (large container images, preparing machine learning models, etc.), and for these it would be really painful to tear down the warm environment and recreate it constantly.

Argo, as well as other Workflow Management systems that I've seen only support the Pod per task approach - but surely I'm not the only one with a use case that doesn't quite fit that well due to costly start-ups. So I have the following questions:

Why does Argo not support pod worker pools? Is this just not common enough to make it worth the implementation complexity or is there some fundamental anti-pattern that makes it unfeasible?
What is the correct system design in the case where some services are too expensive to recreate the environment for? One idea that comes to mind is to replace the actual service in the workflow with a "shell"/"dummy" task that calls the actual service which is backed by a pool of workers elsewhere. However, I'm not sure if such gymnastics are a clean system design.

Edit: The idea of a "warm pool of workers" is very general and well-known, so I find it counter-intuitive that it's just not there for Kubernetes pods. For instance, there could be a lightweight client that listens for requests or messages and calls a user-defined callback. Is there some fundamental reason/limitation why this is not done?

I would be very thankful for insights regarding the above highlighted considerations.

MrMikeFloyd · 2025-01-17T07:47:35Z

MrMikeFloyd
Jan 17, 2025

Hi @RonaldGalea,

thanks for raising these points and very interesting question indeed. Did you find any answers or workarounds that helped you?

I have a similar issue that sounds a lot like your points. We are running Argo Workflows in our org and are generally happy with it. However, we have a bunch of use cases where we need to keep the overall runtime of a workflow as low as possible, while the startup time of the containers for tasks participating in that workflow is substancial. If there was a way to ensure that a set of container instances for that type of task is always "hot" would cut our overall runtime substantially, but this is nothing that Argo Workflows supports in any way.
One idea for us to tackle this is to encapsulate this "complex, heavyweight container" in a long-running service outside of Argo, and then call it from a lightweight task in Argo that just sends an API request. That way, we would cut down the time a bit, but would still "pay" for the startup time of the lightweight pod.

So yeah, in a perfect world, we could just ensure that we have a hot pool of pods for complex execution steps and be happy forever. I'd therefore also like to understand what the design decisions were and how others tackled this issue

2 replies

RonaldGalea Jan 17, 2025
Author

Hey, ultimately, I could achieve what I needed using Temporal, a durable execution platform: https://temporal.io

Best of luck!

MrMikeFloyd Jan 17, 2025

Thanks Ronald! I also looked at Temporal and Windmill, and they both seem to be able to solve this issue. Not get sure though if I want to maintain another workflow engine, but that'll be part of a longer discussion.
Anyways, thanks for your input, and all the best for you.

jswxstw · 2025-01-21T13:14:53Z

jswxstw
Jan 21, 2025
Collaborator

There have been multiple discussions regarding the issue of pod reuse, #7144 and #12255 etc.

0 replies

Gryhyphen · 2025-05-29T00:43:27Z

Gryhyphen
May 29, 2025

@jswxstw

This isn't necessarily related to pod reuse. I.e. could costly pod starts not also be solved by using wasm containers, which are built for fast start? (likely not as fast as a pod pool, but still faster than linux based containers afaik)

I.e. using wasmtime? https://docs.docker.com/engine/daemon/alternative-runtimes/

Been poking around to see if argo can support different container runtimes to containerd, ideally we could use the same...

ahhhh I see #12255

OH SO THAT'S WHAT THE WASM-workflows-plugin does! I was dismissing it because I didn't understand that it was actually utilizing a different container runtime. I thought it was running the wasm code in a runtime on the argo agent, not actually creating new (wasm) containers in Kubernetes... that's pretty sick that it does that! It's actually leveraging a way to run pods where the pod is a WASM runtime, not a linux container that contains a WASM runtime.

tbh the whole user story around costly pod starts in argo could really use an uplift. WASM is very cutting edge so I can understand wanting to wait for it to be stable - but the vision of cloud native and making argo faster and more performant could really unlock it's potential for applications in new domains like streaming workloads (like financial transactions) which would let it compete with temporal and windmill. (Yes I know numaflow exists but from a cursory look it's not great at tracking, retrying and human-in-the-loop scenarios with suspensions, and argo events already can consume kafka messages so it's not like it's not a path that argo hasn't invested anything into.).

Personally, I'd rather see how far container-first approach can go because imo it's a simpler model than managing a pod pool and really leans into the 'serverless' paradigm. And in absence of wasm containers, it would be nice if we could at least 'attach' to a hot pod and capture the logs of a run (tho that's probably not possible without a worker sdk... man I wish there was a standardized 'worker'... wait, numaflow sdks...? That would be exciting).

At the end of the day I might still end up just choosing argo simply because the strong container-based isolation makes me happy and I can reuse my knowledge of Kubernetes. (not to mention it's cncf status makes me feel safer than options like temporal and windmill).

Also, I am mostly deluding myself and daydreaming about tech stacks as currently I don't get enough greenfield projects where I have the authority to make architectural decisions, so I just imagine setting up enterprise-grade open-source stuff on a spare laptop. 🥲

One day. one day...

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Design decision - correct approach for costly pod starts #11719

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 3 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Design decision - correct approach for costly pod starts #11719

Uh oh!

Uh oh!

RonaldGalea Aug 30, 2023

Pod per task

Pod worker pool

My particular use case

Replies: 3 comments · 2 replies

Uh oh!

MrMikeFloyd Jan 17, 2025

Uh oh!

RonaldGalea Jan 17, 2025 Author

Uh oh!

Uh oh!

MrMikeFloyd Jan 17, 2025

Uh oh!

jswxstw Jan 21, 2025 Collaborator

Uh oh!

Uh oh!

Gryhyphen May 29, 2025

RonaldGalea
Aug 30, 2023

Replies: 3 comments 2 replies

MrMikeFloyd
Jan 17, 2025

RonaldGalea Jan 17, 2025
Author

jswxstw
Jan 21, 2025
Collaborator

Gryhyphen
May 29, 2025