What led to the decision to not contribute enhancements to the k8s cluster autoscaler? Seems like a lot of the features would be useful and in scope of the cluster autoscaler...
Former Yelp engineer here (I left recently). Clusterman was built roughly 18 months ago for our primarily Mesos- and Marathon-based workloads. At the time, the Kubernetes cluster autoscaler was in its infancy and there was no comparable solution that worked for Mesos.
Yelp is currently migrating to Kubernetes which is why support for Kubernetes autoscaling was recently added to Clusterman.
That's more or less what I was trying to convey. Responsible engineers will look around for mature solutions to their problems. Sometimes there aren't any and sometimes it will be the most economically-defensible decision to roll your own.
Of course, we notoriously underestimate the costs of rolling your own. So a fair and common question is: why did you roll your own? What cost was so compelling?
What sometimes happens though is that the roll-your-own decision, made at time A, is later critiqued based on the options available at time B. If the decision was being made at time B, then the roll-your-own decision might be a symptom of "Not Invented Here Syndrome". But accusing someone of NIHS without accounting for when the original decision was made is unfair.
Hence my joke name, "Not Invented Yet Syndrome". Why didn't you use the alternative? Because it didn't exist or wasn't applicable.
>Kubernetes allows us to run workloads (Flink, Cassandra, Spark, and Kafka, among others) that were once difficult to manage under Mesos (due to local state requirements).
Just wanted to clarify that running Cassandra & Kafka on Mesos is much easier now with the DC/OS Commons SDK.[1]
Spark has always been supported on Mesos.[2]
Small point of clarification: the DC/OS SDK works well for these applications when running on DC/OS. It did not work well when run on open source Mesos, which is what Yelp uses.
DC/OS is also open source, Apache 2.0, same as Mesos.
In fact DC/OS includes open source Mesos with no modifications; what they add is a packaged bunch of different API providers running as Mesos apps (Marathon, DNS, etc.), plus a somewhat dumb installer.
In my org, we have run open source Marathon/Mesos deployments for years without any DC/OS additions, along with slightly modified versions of the tooling below:
Several other open-source autoscalers for Marathon/Mesos are also available apart from these two.
This new one from Yelp seems promising, since it's battle-tested within an organization that has many different kinds of workloads, so I would give it a try for a new cluster.
It would be really cool to see these features mainlined.
Side question: I know in the article it was mentioned that they liked their signal approach to scaling because it allows them to preemptively scale. I'm just not sure why that wouldn't be achievable by scaling the replica counts of your deployments based on signals.
If scaling happens based on pending pods then just scale your pods so that they're configured to handle your predicted traffic. Then the cluster will obtain your desired state. Am I missing something?
Scaling happens at two levels: both for the deployments and for the size of the cluster itself. Clusterman operates on the cluster itself.
If you scale just the number of pods and there isn't enough available capacity in the cluster, they can't be scheduled until that capacity is brought online. By emitting a signal to increase the number of nodes in the cluster just before we think that capacity is about to be needed, we can ensure that the new pods launch near instantly when the deployment is actually scaled up. This is mostly useful when you're scaling by large increments (e.g. hundreds of pods) that far exceed the spare capacity available in your cluster.
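The arithmetic behind that kind of preemptive node provisioning can be sketched in a few lines. This is a hypothetical illustration, not Clusterman's actual signal API: the function name, parameters, and the CPU-only capacity model are all assumptions for the sake of the example.

```python
import math

def extra_nodes_needed(predicted_pods, pod_cpu, spare_cpu, node_cpu):
    """Hypothetical sketch: how many nodes to add ahead of a predicted scale-up.

    predicted_pods: pods we expect to launch shortly (from some traffic forecast)
    pod_cpu:        CPU each pod requests
    spare_cpu:      unallocated CPU currently available in the cluster
    node_cpu:       allocatable CPU per node
    """
    shortfall_cpu = predicted_pods * pod_cpu - spare_cpu
    if shortfall_cpu <= 0:
        return 0  # existing spare capacity already covers the prediction
    return math.ceil(shortfall_cpu / node_cpu)

# e.g. a forecast of 300 pods at 0.5 CPU each, with 20 CPUs spare and
# 16-CPU nodes, is 130 CPUs short -> request 9 nodes before the pods arrive
print(extra_nodes_needed(300, 0.5, 20, 16))
```

A pending-pods autoscaler would only start this node request after the pods were created; emitting it ahead of time is what saves the boot-up wait.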
What benefit does that distinction give you? If the cluster scales to be able to run all of your pending pods, and scales down when you have extra room, and you scale your pods correctly (preemptively and with custom metrics), what do you gain?
Mostly time. For workloads that are pseudo-interactive (e.g. continuous integration), you're saving developers 10-20 minutes here or there. That can add up to quite a bit for a medium to large organisation.
Because who knows what kind of private company information might otherwise be leaked in commit messages. From a legal point of view, it's a lot easier to start with a clean slate.
Not as sensitive as accidentally dropping information about your internal network. An attacker can then take the long, patient route of infiltrating an upstream provider to attack a juicy target (the build system of a Fortune 500? yes please).
Or maybe catching wind of some dev keys that are really root keys...
There are many reasons to sanitize git history before open-sourcing. In fact, many organizations I have worked with still maintain two separate repos, one internal and one open source, using fancy magic (either with git or with additional tools) to sanitize and sync commits between the two. I've seen code commits at a large organization that are then packaged up and inspected for license and security violations in an untrusted environment. There are many reasons to keep two (or more) running copies.
Security-wise: from a tactical standpoint, the obvious risks are access keys, etc. From a strategic standpoint, I can tell who did what, and those people themselves become great attack vectors.
Argument in the past: Because the best alternative is a complete audit of all past commits for IP issues, trade secrets, keys, network internals, ... It's just much easier to remove that hurdle from the process and only look at the current state of the code.
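Part of why that audit is so costly is that even the simplest check, scanning every historical blob for key material, has to run over the full history, not just the current tree. A minimal sketch of such a scan, assuming just two illustrative patterns (real scanners like gitleaks or trufflehog use far larger rule sets plus entropy checks):

```python
import re

# Two illustrative rules; a real audit would use many more.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def find_secrets(text):
    """Return (rule_name, matched_text) pairs for every suspected secret."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits

# In practice you would feed this every blob from every commit
# (e.g. by walking `git log -p` output), which is what makes
# auditing a long history so much slower than auditing HEAD.
```

And that only covers mechanically detectable secrets; IP issues and trade secrets still need human review of each commit.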
I guess there are also spiky offline jobs, like Apache Spark, as mentioned in the article. Likely for things like search indexing, training ML models, etc.
"Eschew flamebait. Don't introduce flamewar topics unless you have something genuinely new to say. Avoid unrelated controversies and generic tangents."
Btw, that rule doesn't mean tangential topics aren't important—often they're more important than the thread topic. We have the rule to prevent interesting/unusual discussions from getting supplanted by boring/predictable ones.
They aren't held in a negative light... After the war, they were forgiven, and I think society in general has accepted that they were just following orders. The leaders, though, are definitely thought of in a negative light.
Nazi soldiers killed innocent men, women and children en masse. The association with engineers who have done nothing unethical is really inappropriate and unnecessary.