The perils of optimization

A tale of two orgs: As Q2 wraps up, two very senior managers, Alice and Bob, at two similar and competing companies (companies A and B) are deciding what their teams should work towards for the next two quarters. Their departments deal with the systems and performance aspects of a deep learning application that is fairly central to the company's core product and has tight latency requirements. Both Alice and Bob have been receiving numerous complaints from the department they work with, the "research" team, and from the executives (with whom this "research" team is closest, as measured by org chart and influence): the latency is not up to snuff, the performance needs to be higher, the software is too slow, and so on.

They've been hearing complaints from their own teams too. The tools they use are unstable; horrendous error messages, weird edge cases, and even compiler bugs appear on a regular basis, grinding production to a halt. Their companies mainly use Python for the machine learning, with a Rust service layer that wraps the models. Unfortunately, to reconcile the idea of Python with the idea of "performance", their code is painfully contorted. The half-assed compiler they use demands code of exactly that quality - shoehorning a blasted ambiguous, unpredictable, and possibly unsound type system into Python - defiling the very idea of "Pythonic". Needless to say, Alice and Bob are looking for options.

Bob's performance review is due next quarter, and he decides that the expectations on his team are perfectly justified, if slightly unreasonable. After all, isn't that what he and his team were hired to do - performance optimization? He puts his head down, signals to his team that he understands their pain, but work is work and the work needs to be done. His team continues to use the same unpredictable tools, sometimes delivering non-negligible speedups, but always at the cost of much debugging and frustration.

Alice, however, decides to take a stand against the executives' demands. "Yes," she tells them, "we need the latency to come down," and that is her team's responsibility. She understands that the company's core value is the R&D department producing features (hence their closeness to the CEO), and that her team's ancillary duty is serving the needs of the research team. But even so, she remembers the story about seat belts: adding seat belts to cars didn't make the probability of accidents go down, because people, it seems, have a stable appetite for risk and naturally compensated by driving faster. The result, she recalls vaguely, was about the same number of accidents, each one more fatal. Alice's intuition tells her that something similar might be happening here. So she pushes back against the demands for more performance. She tells her manager and the research team not to expect drastic magical speedups (which have happened occasionally in the past, when the unpredictable tooling worked), and that the research team must also be performance oriented, producing models that are faster by design and lighter weight, not just aiming for better accuracy metrics. She moves her team to make the tooling more predictable, not relying on magical compilers that only work some of the time and whose performance output suffers from high variance.

Bob got his pay raise by the start of Q4: through a lot of elbow grease (in this case, sore butts and bloodshot eyes), his team managed to deliver speedups by massaging their hodge-podge of tools and compilers. By then, however, the research team, given this new latency budget, had begun to add more features. The difference is that they now depend on the unpredictable tools working every time a new model is pushed to production. Production grinds to a halt: the research team produces some new model, asks Bob's team to "optimize" it, and the tools fail more often than not - but now the base model without optimizations is unacceptable for the required latency budget. There are back-and-forths between Bob's team and the research team - the compilers don't work on this model architecture, performance is degraded, the architecture needs to change.

Alice's team spent Q3 on predictable, if boring and manual, optimizations. Which is to say, nothing on the level of Bob's team. The research team reluctantly acquiesced and allocated time towards efficiency as well: removing layers, reducing parameter count, finding faster operations, quantizing some parts, things of this nature. The research team's feature output wasn't as rapid as at Bob's company, and the CEO is getting antsy as investors eye company B for their next funding round. B's output is more visible, but they have made themselves more fragile to randomness. Both A's and B's value resides in their R&D, a highly sequential process with many levels of dependence. As Q1 of the next year rolls around, the success of B's R&D department falters as the latency of the research pipeline itself is magnified heavily by its sensitivity to noise in a dependent subprocess - Bob's team - while A's R&D team marches on steadily, overtaking B.

Don't trust magical compilers

From my observations as an outsider to GPU systems research, much of the work in deep learning has been on simplifying GPU programming for researchers: on compilers that take a high-level description of a program and try to optimize it as much as possible to produce efficient executables. Nvidia's TensorRT is one such tool used in industry. It takes in a high-level tensor program - at the level of individual deep learning operators like matrix multiplication, convolution, and softmax - and optimizes it in a black-box way to produce an executable in a custom format. Documentation is sparse and compile times are horrible (10-30 min for a medium-sized model). Moreover, the performance is very unpredictable: sometimes the speedup is on the order of 2x, sometimes 10x, and sometimes it fails completely (segfault) or produces incorrect results. None of this disparages the quality of the team behind TensorRT; the task of optimizing such a high-level and expressive description for GPUs is truly herculean. XLA, another compiler from Google, I've observed to be more stable but to deliver more modest speedups.
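To make the workflow concrete, here is a minimal sketch of the usual TensorRT flow - parse an exported ONNX model and build an engine. It assumes a TensorRT 8.x-style Python API (names and defaults shift between versions) and a hypothetical "model.onnx"; treat it as an illustration, not a reference.

```python
import tensorrt as trt

# Hypothetical input: an ONNX export of the model at "model.onnx".
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        # Parse failures on new architectures are part of the unpredictability.
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # let the builder pick FP16 kernels where it sees fit

# The expensive, black-box step: kernel selection and timing happen here.
# This is where the long builds and the run-to-run variance come from.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```

Everything interesting happens inside that one `build_serialized_network` call, which is precisely the problem: there is no dial for "be predictable".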

The true difficulty, in my opinion, is the non-linearity of GPU performance. Naive and simple optimizations can get you to (say) 80% of the theoretical device limit, but past that, subtle changes to your program - simple changes - can result in highly non-linear swings in performance. For example, reducing register usage by a couple of registers can result in a whole-number jump in the number of warps per warp scheduler, increasing occupancy and hiding memory latencies, and thus lead to a discrete jump in performance. This non-linear effect is exacerbated by Nvidia adding tensor cores, async instructions, and ever more features that must be used in specific ways to approach the theoretical maximum. The task of taking a high-level description to the optimal assembly for a GPU kernel is intractable - there is no algorithm that does this correctly and in reasonable time for all possible programs. In practice, this means that any compiler that purports to take your high-level description close to the device maximum - the road from 80% to 100% - necessarily relies on heuristic algorithms, and is necessarily non-deterministic.
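A back-of-the-envelope occupancy calculation illustrates the cliff. The numbers below - a 64K-entry register file per SM, a 64-warp ceiling, 256-thread blocks, and registers allocated per whole block - are illustrative assumptions for the sketch, not a model of any particular GPU:

```python
def warps_per_sm(regs_per_thread: int,
                 threads_per_block: int = 256,
                 regfile_per_sm: int = 65536,
                 max_warps_per_sm: int = 64) -> int:
    """Resident warps per SM when the register file is the limiting resource."""
    regs_per_block = regs_per_thread * threads_per_block
    blocks_by_regs = regfile_per_sm // regs_per_block      # only whole blocks fit
    warps = blocks_by_regs * (threads_per_block // 32)
    return min(warps, max_warps_per_sm)

# Two extra registers per thread cost a whole block's worth of warps:
print(warps_per_sm(64))  # 32 resident warps (4 blocks fit)
print(warps_per_sm(66))  # 24 resident warps (only 3 blocks fit)
```

Performance as a function of register count is a step function, not a smooth curve, and the same staircase behaviour shows up in shared memory usage, tile sizes, and instruction mix.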

This difficulty is one of the reasons why, I think, polyhedral and fully automatic methods are not used in industry: their papers report good results, but in practice they either fall short of great performance or reach it only occasionally rather than consistently. Triton took off because it is more deterministic, as the language is lower-level - it guarantees certain device semantics, like the location of variables in the memory hierarchy, which makes its output more predictable. But it still suffers from non-determinism and sub-optimality due to the complexity Nvidia has introduced to GPU programming (since Turing).
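As a small sketch of what "lower-level, but still Python" means, here is the canonical Triton vector-add kernel (essentially the tutorial example): the programmer explicitly chooses the block size and issues the loads and stores, rather than hoping a black-box compiler discovers the tiling.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    # Each program instance handles one explicitly sized block of elements.
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)  # explicit loads from global memory
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x and y are assumed to be CUDA tensors of the same shape.
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The blocking, masking, and data movement are spelled out by the programmer, so the compiler has far less room to surprise you - which is exactly the trade I am arguing for.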

If my admittedly non-rigorous reasoning is accepted - I can formulate it as a law: the further a compiler tries to push toward the theoretical maximum, the more non-deterministic its resulting performance - then I can suggest the following corollaries:

  1. Do not depend on magical compilers in latency-sensitive applications: robotics, autonomous vehicles, edge devices. Do not count on reaching 90-100% of the device's performance using only these compilers. Do so and the research team will get you...
  2. Invest in deterministic tooling, and in deterministic if boring optimizations. I think the insights from "The Germ Theory of Management" are very applicable here - this entire post could be read as an elaboration of its ideas in the field of performance optimization.

That is not to say there is no value in magical compilers - when they work, it is magical, after all. Every compiler engineer wants to work on the magic compiler, on their own DSL, but by default no one wants to use another person's DSL or magic compiler (certainly no other compiler engineer who thinks their own DSL is superior). The problem is that they build a reliance on, and fragility to, their functioning: when they don't work, the team has no spare capacity to deal with the failure, introducing latency stalls that compound in the highly sequential R&D process.

To a lesser degree this is relevant to compilers for CPUs and the high-level languages surrounding them. C++, Rust, and other lower-level languages have better-specified semantics that map onto the underlying execution model, so the compiler's output is predictable. Haskell, on the other hand, has non-strict evaluation semantics, and a program's final performance can vary by orders of magnitude depending on whether compiler optimizations fire (loop fusion and strictness analysis, for example). Haskell is nonetheless still valuable for the individual or small team, as it is a joy to write, so the analogy carries over to the realm of GPU languages and GPU programming. Doubly so, since CPUs are much more forgiving than GPUs, and GPU programs face strict resource constraints that do not bind CPU programs.

To this end, I think systems researchers should focus on predictable, deterministic abstractions for GPU programming - higher-level than CUDA, but less ambitious than the level of "magic" that, it seems, gets you into top conferences. Triton was a great start, but surely it is not the end?