This article is part of the VB Lab Microsoft / NVIDIA GTC insight series.

With AI and machine learning technology changing so rapidly, it’s no surprise that Microsoft had its usual strong presence at this year’s NVIDIA GTC event.

Representatives of the company shared their latest machine learning innovations in multiple sessions, covering inferencing at scale, a new capability to train machine learning models across hybrid environments, and the debut of the new PyTorch Profiler that will help data scientists be more efficient when they’re analyzing and troubleshooting ML performance issues.

In all three cases, Microsoft has paired its own technologies, like Azure, with open source tools and NVIDIA’s GPU hardware and technologies to create these powerful new innovations.

Inferencing at scale

Much is made of the costs associated with collecting data and training machine learning models. Indeed, the bill for computation can be high, especially with large projects — into the millions of dollars. Inferencing, which is essentially the application of a trained model, is discussed less often in the conversation about the compute costs associated with AI. But as deep learning models become increasingly complex, they involve huge mathematical expressions and many floating point operations, even at inference time.

Inferencing is an exciting wing of AI to be in, because it’s the step at which teams like Microsoft Azure are delivering an actual experience to a user. For instance, the Azure team worked with NVIDIA to improve the AI-powered grammar checker in Microsoft Word. The task is not about training a model to offer better grammar checking; it’s about powering the inferencing engine that actually performs the grammar checking.

Given Word’s massive user base, that’s a computationally intensive task, one that has involved billions of inferences. Two interrelated concerns arise, one technical and one financial: to reduce costs, you need more powerful and efficient technology.

NVIDIA developed the Triton Inference Server to harness the horsepower of its GPUs and marry it with Azure Machine Learning for inferencing. Together, they help you get your workload tuned and running well, and they support all of the popular frameworks, like PyTorch, TensorFlow, MXNet, and ONNX.

ONNX Runtime is a high-performance inference engine that uses various hardware accelerators to achieve optimal performance on different hardware configurations. Microsoft collaborated closely with NVIDIA on integrating the TensorRT accelerator into ONNX Runtime for model acceleration on NVIDIA GPUs. ONNX Runtime is available as a backend in Triton Inference Server.

Azure Machine Learning is a managed platform-as-a-service offering that does most of the management work for users. This speaks to scale, which is the point at which too many AI projects flounder or even perish. It’s where technological concerns sometimes crash into financial ones, and Triton and Azure Machine Learning are built to solve that pain point.

Making ML model training across on-premises, hybrid, and multi-cloud environments easier with Kubernetes

Creating a hybrid environment can be challenging, and the need to scale resource-intensive ML model training can complicate matters further. Flexibility, agility, and governance are key needs.

The Azure Arc infrastructure lets customers with Kubernetes assets apply policies, perform security monitoring, and more, all in a “single pane of glass.” Now, the Azure Machine Learning integration with Kubernetes builds on this infrastructure by extending the Kubernetes API. On top of that, there are native Kubernetes concepts like operators and CI/CD, and an “agent” that runs on the cluster enables customers to do ML training using Azure Machine Learning.

Regardless of a user’s mix of clusters, Azure Machine Learning lets users easily switch between targets. Frameworks that the Azure Machine Learning Kubernetes native agent supports include scikit-learn, TensorFlow, PyTorch, and MPI.

The native agent smooths organizational gears, too. It removes the need for data scientists to learn Kubernetes, and the IT operators who do know Kubernetes don’t have to learn machine learning.

PyTorch Profiler

The new PyTorch Profiler, an open source contribution from Microsoft and Facebook, offers GPU performance tuning for the popular machine learning framework PyTorch. The debugging tool promises to help data scientists and developers analyze and troubleshoot large-scale deep learning model performance more efficiently, so they can get the most out of expensive computational resources.

In machine learning, profiling is the task of examining the performance of your models. This is distinct from looking at model accuracy; performance, in this case, is about how efficiently and thoroughly a model is using hardware compute resources.

PyTorch Profiler builds on the existing PyTorch autograd profiler, enhancing it with a high-fidelity GPU profiling engine that allows users to capture information about PyTorch operations and correlate it with detailed GPU hardware-level information.

PyTorch Profiler requires minimal effort to set up and use. It’s fully integrated, spanning the new profiler module, the new libkineto library, and the PyTorch TensorBoard Profiler plugin. You can also visualize it all in Visual Studio Code. It’s meant for beginners and experts alike, across use cases from research to production, and it’s complementary to NVIDIA’s more advanced Nsight.
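To give a sense of what that setup can look like, here is a minimal sketch that assumes the torch.profiler module shipped with recent PyTorch releases. The tiny model, the input sizes, and the helper name profile_forward are illustrative assumptions, not taken from the GTC sessions; GPU activity is recorded only if a CUDA device is present.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function


def profile_forward(steps: int = 3):
    """Profile a few forward passes of a small, hypothetical model."""
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 128),
        torch.nn.ReLU(),
        torch.nn.Linear(128, 10),
    )
    inputs = torch.randn(32, 64)

    # Always record CPU activity; add GPU activity when CUDA is available.
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        model, inputs = model.cuda(), inputs.cuda()
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities, record_shapes=True) as prof:
        for _ in range(steps):
            # Label this region so it appears by name in the results.
            with record_function("forward_pass"):
                model(inputs)
    return prof


prof = profile_forward()
# Summarize the most expensive operators by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

The printed table is only a text summary; the timeline views described in this article come from the TensorBoard plugin or Visual Studio Code.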

One of PyTorch Profiler’s key features is its timeline tracing. Essentially, it shows CPU and GPU activities and lets users zoom in on what’s happening with each. You can see the typical PyTorch operators, as well as more high-level Python models and the GPU timeline.

One common scenario that users may see in the PyTorch Profiler is instances of low GPU utilization. A tiny gap in the GPU visualization represents, say, 40 milliseconds when the GPU was not busy. Users want to optimize that empty space and give the GPU something to do. PyTorch Profiler enables them to drill down and see what the dependencies were and what events preceded that idle gap. They could trace the issue back to the CPU and see that it was the bottleneck; the GPU was sitting there waiting for data to be read by another part of the system.

Examining inefficiencies at such a microscopic level may seem utterly trivial, but if a step is only 150 milliseconds, a 40-millisecond gap in GPU activity is a rather large percentage of the whole step. Now consider that a project may run for hours, or even weeks at a time, and it’s clear why losing such a large chunk of every step is woefully inefficient in terms of getting your money’s worth from the compute cycles you’re paying for.
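The arithmetic behind that claim can be sketched in a few lines of Python. The 150-millisecond step and 40-millisecond gap come from the example above; the one-week run length (168 hours) and the function names are illustrative assumptions.

```python
def idle_fraction(step_ms: float, idle_ms: float) -> float:
    """Return the share of each training step during which the GPU is idle."""
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    return idle_ms / step_ms


def idle_hours(run_hours: float, step_ms: float, idle_ms: float) -> float:
    """Estimate GPU hours lost to idling, assuming every step has the same gap."""
    return run_hours * idle_fraction(step_ms, idle_ms)


fraction = idle_fraction(step_ms=150, idle_ms=40)
print(f"Idle share of each step: {fraction:.1%}")  # 26.7%

# Hypothetical one-week training run (168 hours).
lost = idle_hours(run_hours=168, step_ms=150, idle_ms=40)
print(f"GPU hours idle over one week: {lost:.1f}")  # 44.8
```

In other words, more than a quarter of the paid GPU time in this scenario would be spent waiting.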

PyTorch Profiler also comes with built-in recommendations that guide model builders through common problems and possible solutions. In the above example, you may simply need to tweak DataLoader’s number of workers to ensure the GPU stays busy at all times.
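As a rough sketch of that tweak, the snippet below builds two PyTorch DataLoader objects that differ only in their num_workers setting. The synthetic dataset, the batch size, and the worker count of 4 are illustrative assumptions, not recommendations from the profiler; the right value depends on the machine and the input pipeline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(num_workers: int = 0, batch_size: int = 32) -> DataLoader:
    """Build a DataLoader over a small synthetic dataset (hypothetical data)."""
    features = torch.randn(256, 64)
    labels = torch.randint(0, 10, (256,))
    return DataLoader(
        TensorDataset(features, labels),
        batch_size=batch_size,
        shuffle=True,
        # 0 loads batches in the main process; higher values use
        # background worker processes to prepare batches in parallel.
        num_workers=num_workers,
        # Pinned host memory can speed up copies to the GPU when one exists.
        pin_memory=torch.cuda.is_available(),
    )


baseline = make_loader(num_workers=0)  # data loading blocks the training loop
tuned = make_loader(num_workers=4)     # workers start when iteration begins

# Iterate the baseline loader once to confirm the batch shapes.
batch_features, batch_labels = next(iter(baseline))
print(batch_features.shape, batch_labels.shape)
print(baseline.num_workers, tuned.num_workers)
```

Re-running the profiler after a change like this would show whether the idle gaps in the GPU timeline actually shrink.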

Don’t miss these GTC 2021 sessions. Watch on demand at the links below:

VB Lab Insights content is created in collaboration with a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. Content produced by our editorial team is never influenced by advertisers or sponsors in any way. For more information, contact

