It's not so much about having a "sufficiently smart compiler" in the case of GPUs doing compiler-assisted scheduling. It's about not having to implement that logic in hardware at all. The more smarts they push into the core hardware, the more silicon each core needs, the fewer cores you can fit, and the more power you spend on figuring out what to run rather than crunching numbers.

Doing the work in the compiler may produce less optimal scheduling than what is theoretically possible, but with the number of "cores" in a GPU, doing it in hardware would mean paying that power cost in every single one.
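
To make the idea concrete, here's a toy sketch of what "doing the work in the compiler" can look like. This is my own illustration, not how any real GPU toolchain works: the `Instr` type, the `assign_stalls` function, and the latency numbers are all made up. It assumes a simple in-order core where every instruction has a fixed, known latency, so the compiler can work out ahead of time how long each instruction has to wait for its inputs.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str        # register written by this instruction
    srcs: tuple      # registers read by this instruction
    latency: int     # cycles until dest is readable (hypothetical numbers)


def assign_stalls(program):
    """Compute, at compile time, how many cycles each instruction must wait.

    The hardware then only has to count down the encoded stall value
    instead of tracking register dependencies itself.
    """
    ready_at = {}    # register -> cycle at which its value becomes available
    cycle = 0        # earliest cycle at which the next instruction could issue
    stalls = []
    for ins in program:
        # An instruction can issue once all of its sources are ready.
        earliest = max([cycle] + [ready_at.get(r, 0) for r in ins.srcs])
        stalls.append(earliest - cycle)
        ready_at[ins.dest] = earliest + ins.latency
        cycle = earliest + 1   # in-order: at most one issue per cycle
    return stalls


program = [
    Instr("r0", (), 4),            # e.g. a load
    Instr("r1", (), 4),            # another load
    Instr("r2", ("r0", "r1"), 2),  # needs both loads
    Instr("r3", ("r2",), 2),       # needs the previous result
]

stalls = assign_stalls(program)
print(stalls)
```

The point of the sketch is that the dependency bookkeeping (`ready_at`) lives entirely in the compiler. Real latencies aren't always fixed (memory accesses, for example), so a fixed stall count like this can be pessimistic or needs some fallback, which is the "less optimal than theoretically possible" part.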