What led to the decision to not contribute enhancements to the k8s cluster autoscaler? Seems like a lot of the features would be useful and in scope of the cluster autoscaler...
Former Yelp engineer here (I left recently). Clusterman was built roughly 18 months ago for our primarily Mesos- and Marathon-based workloads. At the time, the Kubernetes cluster autoscaler was in its infancy and there was no comparable solution that worked for Mesos.
Yelp is currently migrating to Kubernetes which is why support for Kubernetes autoscaling was recently added to Clusterman.
That's more or less what I was trying to convey. Responsible engineers will look around for mature solutions to their problems. Sometimes there aren't any and sometimes it will be the most economically-defensible decision to roll your own.
Of course, we notoriously underestimate the costs of rolling your own. So a fair and common question is: why did you roll your own? What cost was so compelling?
What sometimes happens though is that the roll-your-own decision, made at time A, is later critiqued based on the options available at time B. If the decision was being made at time B, then the roll-your-own decision might be a symptom of "Not Invented Here Syndrome". But accusing someone of NIHS without accounting for when the original decision was made is unfair.
Hence my joke name, "Not Invented Yet Syndrome". Why didn't you use the alternative? Because it didn't exist or wasn't applicable.
>Kubernetes allows us to run workloads (Flink, Cassandra, Spark, and Kafka, among others) that were once difficult to manage under Mesos (due to local state requirements).
Just wanted to clarify that running Cassandra & Kafka on Mesos is much easier now with the DC/OS Commons SDK.[1]
Spark has always been supported on Mesos.[2]
Small point of clarification: the DC/OS SDK works well for these applications when running on DC/OS. It did not work well when run on open source Mesos, which is what Yelp uses.
DC/OS is also open source, Apache 2.0, same as Mesos.
In fact DC/OS includes open source Mesos with no modifications; what they add is a packaged bunch of different API providers running as Mesos apps (Marathon, DNS, etc.), plus a somewhat dumb installer.
In my org, we have run open source Marathon/Mesos deployments for years without any DC/OS additions, along with slightly modified versions of the tooling below:
Several other open-source autoscalers for Marathon/Mesos are also available apart from these two.
This new one from Yelp seems promising, since it's battle-tested within an organization that has many different kinds of workloads, so I would give it a try for a new cluster.
It would be really cool to see these features mainlined.
Side question: I know in the article it was mentioned that they liked their signal approach to scaling because it allows them to preemptively scale. I'm just not sure why that wouldn't be achievable by scaling the replica counts of your deployments based on signals.
If scaling happens based on pending pods then just scale your pods so that they're configured to handle your predicted traffic. Then the cluster will obtain your desired state. Am I missing something?
Scaling happens at two levels: both for the deployments and for the size of the cluster itself. Clusterman operates on the cluster itself.
If you scale just the number of pods and there isn't enough available capacity in the cluster, they can't be scheduled until that capacity is brought online. By emitting a signal to increase the number of nodes in the cluster just before we think that capacity is about to be needed, we can ensure that the new pods launch near instantly when the deployment is actually scaled up. This is mostly useful when you're scaling by large increments (e.g. hundreds of pods) that far exceed the spare capacity available in your cluster.
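The arithmetic behind that kind of preemptive node provisioning can be sketched in a few lines. This is a hypothetical illustration, not Clusterman's actual signal API: the function name, parameters, and the CPU-only capacity model are all assumptions for the sake of the example.

```python
import math

def extra_nodes_needed(predicted_pods, pod_cpu, spare_cpu, node_cpu):
    """Hypothetical sketch: how many nodes to add ahead of a predicted scale-up.

    predicted_pods: pods we expect to launch shortly (from some traffic forecast)
    pod_cpu:        CPU each pod requests
    spare_cpu:      unallocated CPU currently available in the cluster
    node_cpu:       allocatable CPU per node
    """
    shortfall_cpu = predicted_pods * pod_cpu - spare_cpu
    if shortfall_cpu <= 0:
        return 0  # existing spare capacity already covers the prediction
    return math.ceil(shortfall_cpu / node_cpu)

# e.g. a forecast of 300 pods at 0.5 CPU each, with 20 CPUs spare and
# 16-CPU nodes, is 130 CPUs short -> request 9 nodes before the pods arrive
print(extra_nodes_needed(300, 0.5, 20, 16))
```

A pending-pods autoscaler would only start this node request after the pods were created; emitting it ahead of time is what saves the boot-up wait.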
What benefit does that distinction give you? If the cluster scales to be able to run all of your pending pods, and scales down when you have extra room, and you scale your pods correctly (preemptively and with custom metrics), what do you gain?
Mostly time. For workloads that are pseudo-interactive (e.g. continuous integration), you're saving developers 10-20 minutes here or there. That can add up to quite a bit for a medium to large organisation.
Because who knows what kind of private company information might otherwise be leaked in commit messages. From a legal point of view, it's a lot easier to start with a clean slate.
Not as sensitive as accidentally dropping information about your internal network. An attacker can then take the long, patient route of infiltrating an upstream provider to attack a juicy target (the build system of a Fortune 500? yes please).
Or maybe catching wind of some dev keys that are really root keys...
There are many reasons to sanitize git history before open-sourcing. In fact, many organizations I have worked with still maintain two separate repos, one internal and one open source, using fancy magic (either with git or with additional tools) to sanitize and sync commits between the two. I've seen code commits at a large organization that are then packaged up and inspected for license and security violations in an untrusted environment. There are many reasons to keep two (or more) running copies.
Security-wise: from a tactical standpoint, the obvious risks are access keys, etc. From a strategic standpoint, I can tell who did what, and those people themselves become great attack vectors.
Argument in the past: Because the best alternative is a complete audit of all past commits for IP issues, trade secrets, keys, network internals, ... It's just much easier to remove that hurdle from the process and only look at the current state of the code.
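Part of why that audit is so costly is that even the simplest check, scanning every historical blob for key material, has to run over the full history, not just the current tree. A minimal sketch of such a scan, assuming just two illustrative patterns (real scanners like gitleaks or trufflehog use far larger rule sets plus entropy checks):

```python
import re

# Two illustrative rules; a real audit would use many more.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def find_secrets(text):
    """Return (rule_name, matched_text) pairs for every suspected secret."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits

# In practice you would feed this every blob from every commit
# (e.g. by walking `git log -p` output), which is what makes
# auditing a long history so much slower than auditing HEAD.
```

And that only covers mechanically detectable secrets; IP issues and trade secrets still need human review of each commit.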
I guess there are also spiky offline jobs, like Apache Spark, as mentioned in the article. Likely for things like search indexing, training ML models, etc.
"Eschew flamebait. Don't introduce flamewar topics unless you have something genuinely new to say. Avoid unrelated controversies and generic tangents."
Btw, that rule doesn't mean tangential topics aren't important—often they're more important than the thread topic. We have the rule to prevent interesting/unusual discussions from getting supplanted by boring/predictable ones.
They aren't held in a negative light... After the war, they were forgiven, and I think society in general has accepted that they were just following orders. The leaders, though, are definitely thought of in a negative light.
Nazi soldiers killed innocent men, women and children en masse. The association with engineers who have done nothing unethical is really inappropriate and unnecessary.