Modern GPUs are powerful beasts increasingly capable of taking on tasks that were previously not their forte. It may sound like a gross oversimplification to equate GPUs with high performance (and also unfair towards FPGAs). However, I still advise companies dealing with complex computations to investigate whether GPUs could have an application in their workflow. Often, we are talking about an order of magnitude potential speedup or even more compared to multi-threaded CPU programs, nothing to scoff at.
Throughout my career, I was lucky enough to work with companies that were well aware of the potential advantages. However, on many occasions, my involvement started after a failed initial attempt. In my experience, most of these failures originate from misguided expectations of GPU-naïve companies, as well as incomplete scopes for initial investigations leading to design choices severely hindering the chance of unlocking the full potential of GPUs. In this article, I tried to list some of my favourite pet peeves that can easily lead to disappointment.
1) You Didn’t Optimise the Rest of Your Application Enough
Most GPU-naïve companies would like to think of GPUs as CPUs with many more cores and wider SIMD lanes, but unfortunately, that understanding is missing some crucial differences. CPU cores are so-called latency-oriented designs to give results faster, whilst GPUs are throughput-oriented designs, meaning that they are intended to provide more results over the same time. Albeit the two things may seem very similar, let me illustrate the difference with a practical example of our other favourite performance beasts: cars!
Let’s say we have a 100km distance between cities A and B. Should we allow for a speed limit of 100km/h, and we’d have a single lane, that would mean that from one car’s point of view, the drive would take 1 hour, so this is our latency figure. If we reduced the speed limit to 50km/h, it is easy to see that it would double our latency. However, if we had four lanes instead of 1, then assuming heavy traffic, even though each car’s drive would take twice as long. Still, twice as many cars could get from city A to B over time; hence our throughput of vehicles would be twice as much.
Understanding this example would probably give you an idea that any expectations about keeping the same overall program design and expecting it to be just magically faster are a bit naïve, as we are not speaking about generally quicker hardware, but of a different kind.
 
  Returning to our previous example, having four lanes between City A and City B will not help our poor commuters much if the roads of City A are so jammed that they cannot fill up all four lanes with cars. It is usually the “highway” part that is a target of optimisation, and the rest of the program, which may deal with seemingly uninteresting things like reading/writing files of standard formats etc. could easily become the new bottleneck as a result.
Asynchronous, pipeline-oriented designs of (almost) the whole applications, which would otherwise be considered nice-to-haves or “advanced optimisations” for CPU programs, are the 101 prerequisites for software to stand a chance to utilise ever-more powerful discrete GPUs fully.
2) You Expected Too Much From GPU-backed Libraries, Tools and Wrappers
Originally GPU programming used to be an arcane art of the lucky few who could pull it off, the first being graphics programmers who could achieve beautiful things by bastardising the shading languages that were available back then. With the advent of GPGPU platforms like CUDA and OpenCL, the programming model became more direct and closer to the hardware, allowing for more optimisation opportunities. Since then, the focus has been on expanding the ecosystem and increasing the accessibility of GPU programming. Although it has many advantages from the perspective of democratising access to technology, at the same time, it is also creating the false impression as if GPU programming isn’t a pain & gain genre anymore.
 
  For instance, many libraries have some level of GPU acceleration under the hood. Still, they don’t always come with a consistent design on the API level that could enable the user to at least minimise transactions over the PCI-Express bus. Even if that was taken care of, kernel fusion opportunities reducing the number of video memory accesses are way out of reach. Since, by and large, most optimised GPU programs are ending up being bandwidth limited, that’s quite a bit of a hit in itself.
An even harsher version of the illusion of accessibility is the one that is offered via wrappers enabling the access of GPUs from languages that (in the context of high-performance computations) are mainly meant for fast prototyping rather than operational code. Of course, accessing CUDA from an existing Python code is a fantastic way of accelerating early development. Yet, realistically, how much can you conclude about the final performance from an environment so easily bottlenecked by the Global Interpreter Lock, if not from the painful raw performance of Python alone? Still, many engineering teams attempt to do just that, to determine if further work would be worthwhile or if the initiative must be shot down.
 
  Car analogy again: moving intermediate results unnecessarily between computation steps (and, to top it off, with Python/OpenCV) is not far off from towing a Lamborghini with a horse.
3) You Did Only Half of the Job
It might seem tempting to see GPU optimisation as a natural extension of the iterative optimisation process that started on CPUs, always driving attention to the hottest hotspot only. The problem with this attitude is that it might lead to a fragmented design, with a sub-optimal number of CPU-GPU back-and-forth and synchronisation points. There might indeed be parts of every algorithm that in itself would not make much sense to port to the CPU, as they are more sequential in nature (remember, one fast lane may win the race!), yet, sometimes a locally suboptimal solution can lead to better overall performance for the whole system. Under ideal circumstances, GPUs should be fed from and fed into pipelines asynchronously. Every diversion from this logic may have unexpected consequences. I mean, not that unexpected; otherwise, I would not have raised this point…
Am I advocating porting as much from the core calculations as possible to GPUs? Yes, but not in the ordinary sense. Let’s imagine you had a sorting algorithm in the middle of the original design, but that wasn’t really a problem on a CPU either. The exact same algorithm may not be the best fit on GPU. Should you turn back? No, you should just ask more questions. In this particular case, you may drop quick sort and use a massively parallel version of merge sort instead after some research. Still, the ultimate takeaway is that as long as you can keep an open mind and not aim to port everything the same way but allow yourself to reinvent bits and pieces of the original concept, you stand a chance to end up with a pure GPU-friendly design, 100% utilising even the most giant GPUs out there. No car analogies here; seriously, just do the whole job properly!
After reading this article, it may have become apparent that GPU programming is not as much of a problem of learning new tools and new languages but being intricately familiar with hardware architectures and our old friends: algorithms and data structures. This is the main reason we, at TechnoLynx, unlike most companies, do not separate the roles of an algorithm researcher and an embedded programmer. Instead, we are trying to recruit/train people capable of confidently addressing both angles of the usual acceleration problems. Should you have any difficult (over here at TechnoLynx, also known as “fun”) acceleration problem to solve, we’d be more than happy to hear about it, and in the meantime, please follow us on LinkedIn and Medium.com. We are only getting started on sharing our learnings with the community!
 
     
			 
			 
			 
			 
			 
			 
			 
			 
			 
			 
			