AU Class

Using AI-Enabled Speech Control to Increase Immersion for XR Design Review


Description

When using immersive systems such as virtual reality (VR), augmented reality (AR), and mixed reality (MR) for design review, a key attribute of the system is seamless interaction with at-scale, realistic digital models. VRED software is a well-accepted manufacturing design-review application offering close-to-truth photorealism and a highly immersive extended reality (XR) interface. Using artificial intelligence (AI)-enabled speech control with VRED software can increase the level of immersion, allowing a user to interact directly with the digital model without the need for scene-occluding graphical user interfaces (GUIs)—and also allowing the user to naively interact with VRED, enabling more users to perform unassisted design reviews in XR. Project Mellon is NVIDIA's internal speech-enablement project that uses the innovative Riva automatic speech recognition (ASR) software with a prototype dialogue manager and zero-shot natural language processing (NLP) to achieve a developer-friendly integration of AI-enabled speech. In this session, we’ll show Mellon and VRED, and we’ll discuss how Mellon is used to easily update command lists without the need for extensive NLP training.

Key Learnings

  • Learn about how speech was used to drive immersion with VRED in XR.
  • Learn about how NLP uses an architecture of intents and slots to understand system commands and command variables.
  • Learn how AI is used in dialogue manager (DM) and NLP models.
  • Discover how a unique user experience can be built using variant sets in VRED combined with NVIDIA's Project Mellon.

Speaker

  • Greg Jones
    Greg Jones is the director of global business development and product management for XR at NVIDIA. He focuses on partnerships and projects that use NVIDIA’s professional XR products to bring the benefits of real-time, immersive rendering to enterprise environments. Before coming to NVIDIA, Greg was the associate director of a 200-person academic research institute focused on scientific computing, image analysis, computer graphics and data visualization at the University of Utah.
      Transcript

      GREG JONES: Hey, welcome to Autodesk University. I'm Greg Jones with NVIDIA's XR team. And I'm going to be talking about using AI-enabled speech control to increase immersion for XR design review. So really looking at how you can control graphics applications with voice, the AI builds that we put together to show some examples of that, and where we're heading with that. So this idea of AI assistants, artificial intelligence assistants, in XR and spatial-computing-type environments, is part of our NVIDIA platform for XR. And that platform is supported by three pillars. I want to really quickly go over those three pillars.

      So for the pillars of our XR strategy at NVIDIA, it won't surprise you that photorealism is one of our primary pillars. We're known for photorealism from our gaming roots all the way up to the professional visualization work that we've done for years now. Of course, RTX ray tracing is the ultimate in photorealism. We think photorealism in XR is extremely important, whether it be in design review or architectural review, where it's quite obvious. But even in training, you want to train in the most photoreal, most physically realistic environment you can to transfer that learning to the real job in the real world.

      The second pillar, which we discovered with a tech demo a couple of years ago called Holodeck, is collaboration in XR. XR by yourself can be a lonely place, but XR with collaboration is really powerful, right. This collaboration, this ability to come in from across the world and work together, or just in the same room and work together, around a virtual digital object or a digital twin, is really a stunning capability and a key value add for XR.

      And I might add now, going to our third pillar of artificial intelligence, that collaboration doesn't just have to be with colleagues. It can also be with artificial intelligence assistants. And so we feel AI is a significant pillar coming up for the next generation of XR. And this idea of speech AI, controlling an application with your voice, is just an example of the various AI assistants you will end up working with in XR and AR. And you can imagine AI assistants looking out from the virtual world into our real world, taking data back, analyzing it, and providing us with contextual information as we navigate our real world or our real builds. So AI is a huge pillar in NVIDIA's strategy for XR.

      And then finally, to get the compute you need for photorealism and artificial intelligence, you really need to be on a large compute system. And that means streaming. And it's really timely that the mobile headsets have come out, the Oculus Quest and the Focus 3, and we've created CloudXR to stream to those from data centers, CSPs, and edge compute, for instance on the 5G telco edge. This streaming not only gives you a great render from a large GPU that is remote, but it also allows you to get information back into compute for these AIs, these AI assistants, to work with you.

      So those are our pillars. And with that, I want to jump into a problem statement about where we think VR is right now. So this is what we think of when we think of VR. We think of this person: this is a VRED scene, and this person is wearing a Varjo headset. And when they look at this car, it's a beautiful scene. They're totally immersed in a photoreal, full-scale digital model that looks real.

      But here's where the problem comes up. This is what we think of when we say jumping into a virtual world, and this is a shot from Holodeck a couple of years ago: a lot of my time in the virtual world is spent not actually working on or looking at my data, my digital model, this beautiful photoreal scene I've created. It's spent looking at GUIs, right: graphical user interfaces, with these clunky controllers that are great, but that's where we're at now. That's not a native or natural interaction. And these GUIs occlude the view of our data. They break the immersion. So one of the things we think speech AI can help with is removing the GUI and getting a more natural interaction with our digital objects, our digital twins, our virtual worlds, per se.

      So I'm going to show an example, and in this we're using VRED. Later on in the talk, I'll talk about why we're using VRED. And I do want to mention this is some work that Sean Wagstaff has done. This is Project Mellon, which I'll be talking about a lot during this talk. The car model we're going to use in the video I'm about to show is by Florian Cohen of Autodesk. I appreciate the loan of this model, Florian. And with that, let me pull up a video and show you VRED with speech AI enabled by NVIDIA's Project Mellon.

      SEAN WAGSTAFF: OK, turn on RTX ray tracing.

      VOICE ASSISTANT 1: OK. RTX enabled.

      SEAN WAGSTAFF: OK. Enable depth of field.

      VOICE ASSISTANT 1: OK. Enabling depth of field.

      SEAN WAGSTAFF: OK. Change the camera to front end.

      VOICE ASSISTANT 1: Changing view to front end.

      SEAN WAGSTAFF: OK. Enable the fully textured model.

      VOICE ASSISTANT 1: OK. Enabled textured model.

      SEAN WAGSTAFF: OK. Change the view to rear end.

      VOICE ASSISTANT 1: Changing view to rear end.

      SEAN WAGSTAFF: OK. Look at the rear wheel.

      VOICE ASSISTANT 1: Rear wheel.

      SEAN WAGSTAFF: OK. Change the view to left side.

      VOICE ASSISTANT 1: Sure. Left side.

      SEAN WAGSTAFF: OK. Switch the display mode to virtual reality.

      VOICE ASSISTANT 1: Switching mode to virtual reality.

      SEAN WAGSTAFF: Scale the car to 50%.

      VOICE ASSISTANT 1: Changing size to half size.

      SEAN WAGSTAFF: Switch display mode to augmented reality.

      VOICE ASSISTANT 1: Switching mode to augmented reality.

      SEAN WAGSTAFF: OK. Open the hatch.

      VOICE ASSISTANT 1: Opening the hatch.

      SEAN WAGSTAFF: OK. Close the cockpit.

      VOICE ASSISTANT 1: Closing hatch.

      SEAN WAGSTAFF: Switch the wheels to generative design. Switch the wheels to generative design.

      VOICE ASSISTANT 1: Switching to generative design.

      SEAN WAGSTAFF: OK. Paint the car blue.

      VOICE ASSISTANT 1: Which color or material?

      SEAN WAGSTAFF: OK. Paint the car blue.

      VOICE ASSISTANT 1: Blue.

      SEAN WAGSTAFF: Try painting it silver.

      VOICE ASSISTANT 1: Sure. Silver.

      SEAN WAGSTAFF: OK. Change the logo to NVIDIA.

      VOICE ASSISTANT 1: Switching to NVIDIA.

      SEAN WAGSTAFF: OK. Try changing the logo to Lenovo.

      VOICE ASSISTANT 1: OK. Lenovo.

      SEAN WAGSTAFF: OK. That was great. Thank you.

      VOICE ASSISTANT 1: Thank you. I am trying my best to help you.

      SEAN WAGSTAFF: Reset the scene.

      VOICE ASSISTANT 1: OK. Resetting scene.

      GREG JONES: So that was NVIDIA Project Mellon in action with VRED. And again, thanks to Florian for that car. It's a gorgeous model.

      I want to next talk about not just a VR or XR application, but what happens when you look at a flat screen application. So this is Omniverse Create, and we've hooked Mellon to this also. And Sean Wagstaff, again, is going to run this through. He's going to basically do a review of the scene with some camera movements, some focus changes. And he's going to do this not in VR, but on a flat screen. And let me pull that up.

      So in this video, what I'm going to show, this is Create from Omniverse, from NVIDIA Omniverse. And Sean is going to move some scene around. He's going to basically do some camera adjustments, some focal length changes, some aperture changes. And he's using the GUI. And the Omniverse GUI is great. It's a rich GUI. It has a lot of features. Sean is quite astute at using this GUI. But you can see, it's taking some time.

      He's also going to zoom in to the horse figure over on the left hand side of this video and take the horse out of the scene. So watch for those actions. And again, this isn't XR. This is a regular scene. This is a flat screen navigation.

      And he's going to grab the horse. Few mouse clicks around. And that horse is going to go away. And I'll stop the video actually there once he takes the horse out. Going to focus in on it first.

      But what I want to ask you to do is remember the time it took, and the steps and the expertise it took, to navigate this GUI. So I'm going to stop the video right there and bring up another video that shows Sean doing the same actions, only this time with the voice assistant.

      SEAN WAGSTAFF: Update metadata. Look at the wall--

      GREG JONES: I'll stay quiet during this one while Sean talks to Omniverse.

      SEAN WAGSTAFF: Zoom to 40 millimeter. Push in 5 units. Look at the TV. Set the focus distance to horse. Change the aperture to 0.1. Hide the horse. Hide the blocks. And hide the horse.

      GREG JONES: I might point out that over here in the screen, I'll move my mouse there where it says Show Horse, I'm circling that. This is the dialogue that Sean's giving the system. And you're seeing essentially NVIDIA Riva recognize the words in the sentence and type them out as--

      SEAN WAGSTAFF: Switch to a small camera.

      GREG JONES: So that's a place to watch the commands actually happen as Sean navigates this scene.

      SEAN WAGSTAFF: Set the exposure to 4. Set the exposure to 12. Switch to the boom camera. Reset the scene. Thank you.

      GREG JONES: Very good. Let me switch back to my slides quickly.

      So what I want to talk about now is how we actually did that. So that was Project Mellon, hooked to both VRED in the first film with Florian's car. And then a couple of examples using Omniverse Create.

      So this slide I'm showing right now talks about the general case of speech AI. This is the most general build of speech AIs. And this could be a speech AI that's a chatbot, a kind of call center menu, just a variety of things. If you're talking to a kiosk, it's going to have a very similar build.

      And what they're going to have, fundamentally, is a system doing automatic speech recognition. And the way to think of that is speech to text. So we're going to have an utterance here on the left hand side, as audio. It's going to go into this automatic speech recognition engine, and then it's going to dump out some text over here on the right hand side. And what it's doing is basically running an FFT, a fast Fourier transform, on the sounds. It's convolving, deconvolving, assembling sounds and then words. And then it's building a sentence with punctuation and capitalization. It's trying to build a whole sentence. It may not understand what the sentence means, but it's building a text sentence from that speech utterance.
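
      As a minimal sketch of that front-end step, the NumPy snippet below slices a waveform into frames, windows each frame, and takes an FFT to get a magnitude spectrogram. It only illustrates the "FFT on the sounds" idea; Riva's actual feature extraction and acoustic modeling are far more involved, and none of this is Riva code.

# Minimal sketch of the spectral front end an ASR engine starts from:
# frame the waveform, window each frame, and take an FFT of it.
import numpy as np

def stft_magnitude(audio: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Return a (num_frames, frame_len // 2 + 1) magnitude spectrogram."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frames.append(np.abs(np.fft.rfft(audio[start:start + frame_len] * window)))
    return np.array(frames)

# One second of silence at 16 kHz, just to show the shapes involved.
print(stft_magnitude(np.zeros(16000)).shape)  # (98, 201)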

      And this is Riva, basically, our automatic speech recognition. The other part of Riva is text to speech. So now I'm going to have text. I'm going to give that text to some kind of dialogue manager, some kind of natural language understanding algorithm, and it's going to give me text out.

      So I'm going to give it some text and expect a response. That response is going to come out as text. And then I'm going to put that text back into the Riva engine, and it's going to convert that text back into speech. And so that's fundamentally Riva in a nutshell. The fact that it can do this with multiple languages, that it's a huge model, that it's very accurate and very fast: all those things aside, the fundamental piece of Riva is automatic speech recognition and text to speech. And that's the function it plays in these speech AI, or conversational AI, types of applications.

      So I'm going to basically hit the slides and talk about the Riva components and the fact that they're extremely flexible. So you can take Riva out of the box and use it. Or you can use a variety of features it has for keyword boosting or special speech help. And this is just to help it build that sentence, that text, that much better. And you can see there's plenty of places where AI experts can come in and fine tune this, train it some more. So there's just a variety of use cases for Riva.

      For Project Mellon, we use it out of the box, both the ASR part and the text to speech. Same thing: we use the straight Riva model, the Riva system, straight out of the box. No extra training and no extra boosting for our use case.

      Now, the parts that are not Riva in that general speech AI picture are this idea of the natural language understanding or natural language processing, NLU or NLP, library or libraries, and the dialogue manager. These are often very specific to use cases. So if I've got something like a really large chatbot or an avatar that's going to help me answer questions, you can imagine that my dialogue manager or my NLU is going to be attached to, or is going to contain, a really large knowledge base.

      And so that really brings me to the statement that this is the general build of basic conversational AI, right. We've got the speech to text, the text to speech, and then the dialogue manager, which really is the quarterback that manages all these things working together. It manages handing that text to the NLU. The NLU understands that text and gives a response. The dialogue manager gives that response to the text to speech, and you get your response from the engine.

      And so this whole engine contains basically these parts. They're obviously more complex, far more complex than I've described, but this is the basic schema.
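
      To make that schema concrete, here is a minimal, self-contained sketch of one conversation turn. The EchoASR, RuleNLU, SimpleDM, and PrintTTS classes are hypothetical stand-ins so the flow runs end to end; they are not the Riva or Project Mellon APIs.

class EchoASR:
    def transcribe(self, audio: str) -> str:
        # Stand-in for Riva ASR: pretend the "audio" is already the spoken text.
        return audio

class RuleNLU:
    def interpret(self, text: str) -> str:
        # Stand-in for the NLU: map a sentence to a known command.
        return "light_on" if "light" in text.lower() else "unknown"

class SimpleDM:
    def respond(self, intent: str) -> str:
        # The dialogue manager picks the reply and would also trigger the command.
        return "OK. Turning the lights on." if intent == "light_on" else "Sorry, I didn't get that."

class PrintTTS:
    def synthesize(self, text: str) -> None:
        # Stand-in for Riva TTS: print instead of producing audio.
        print(f"[spoken] {text}")

def conversation_turn(audio, asr=EchoASR(), nlu=RuleNLU(), dm=SimpleDM(), tts=PrintTTS()):
    text = asr.transcribe(audio)   # speech -> text
    intent = nlu.interpret(text)   # text -> command
    reply = dm.respond(intent)     # dialogue manager quarterbacks the turn
    tts.synthesize(reply)          # text -> speech

conversation_turn("turn on the kitchen lights")  # [spoken] OK. Turning the lights on.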

      So what I want to do next is talk about a fairly complex example of one of these conversational AIs. And this is our Toy Jensen, which was built using our Omniverse Avatar Cloud Engine, or ACE. Toy Jensen was featured in a keynote by Jensen. And I'm going to switch screens and move over to a video of Jensen having a dialogue with Toy Jensen, this speech-enabled avatar.

      JENSEN: Let me welcome back Toy Jensen, an application we built with Omniverse Avatar. Remember, Toy Jensen is not recorded. He's completely real time: making eye contact, making conversation, and fully animated.

      Hey, TJ. I got a few questions for you.

      TOY JENSEN: Sure. Ask me anything.

      JENSEN: What is synthetic biology?

      TOY JENSEN: Synthetic biology is about designing biological systems at multiple levels from individual molecules up to whole cells and even multicellular assemblies like tissues and organs to perform specific functions. It is also known as synbio, which stands for synthetic life.

      JENSEN: Can you give me some examples?

      TOY JENSEN: Some current uses of synthetic biology involve creating biofuels, engineering bacteria with useful properties not found in naturally occurring organisms, developing biosensors capable of detecting environmental toxins, producing biopharmaceuticals, and manufacturing chemicals via fermentation rather than petrochemicals.

      JENSEN: What are its medical applications?

      TOY JENSEN: Synthetic biology has many medical applications. For example, synthetic biology is being used to develop new vaccines, new antibiotics, and new treatments for cancer.

      JENSEN: TJ, that was great. Can you tell us how you were made?

      TOY JENSEN: Avatar is a framework built on the Omniverse platform. It lets you quickly build and deploy avatars like me for a wide range of uses. It starts with my voice. It is fully synthesized by Riva's text to speech and is also used to drive my facial animation using Omniverse Audio2Face.

      Next, Omniverse Animation's animation graph is used to define and govern my movement, from hand and arm movements to subtle head and body motion. NVIDIA's open source Material Definition Language, MDL, adds the touches that make my cool jacket look like synthetic leather and not just plastic, while the RTX renderer brings me to life in high fidelity in real time.

      Finally, I can listen and talk to you, thanks to the latest in conversational AI technologies from Riva and our Megatron 530-billion-parameter NLP model, one of the largest language models ever trained. Megatron helps me answer all those tough questions Jensen throws at me.

      What's also exciting is that I can be run from the cloud, the data center, or any other disaggregated system, all thanks to Tokkio. Tokkio is an application built with Omniverse Avatar, and it brings customer service AI to retail stores, quick service restaurants, and even the web. It comes to life using NVIDIA AI models and technology, like computer vision, Riva speech AI, and NVIDIA NeMo. And because it runs on our Unified Compute Framework, or UCF, Tokkio can scale out from the cloud and go wherever customers need helpful avatars like me, with senses that are fully acute, responsive, and above all natural. I hope you enjoyed this quick overview of how I was made.

      GREG JONES: That's great. So let me bring back the slide deck. So if we look at what it took to make Toy Jensen, a schematic of the Toy Jensen conversational AI, what we see is that section right up here on the upper right hand side, where I talk about the Riva ASR, the natural language processing system, and the dialogue manager. And then coming out with the text to speech, we get the speech. And that speech is then used to drive the 3D model through the Audio2Face and Audio2Gesture libraries. These are all part of our Unified Compute Framework, or UCF. And so these are all components available for people to use to build avatars in this ACE system.

      But as you can tell, it's a relatively complex system, full of microservices, and a really fascinating use case. But this is really what I think of as the pinnacle of building a conversational AI, so a pretty high bar. A lot of expertise is needed to work with it. And to see the reason for that, let's look at the components.

      I'm going to focus in on the NLP and dialogue manager. So first, I'm going to talk about the natural language processing. In that video, Toy Jensen was answering questions about synthetic biology, right. So if I'm going to randomly ask an avatar about synthetic biology and get reasonable answers back from that AI, it's going to have to have a pretty broad knowledge base. So that NLP is going to have to be a trained system with a very large set of knowledge, basically annotations and such. Now, transformers have made this easier, but it's still a pretty large lift, and you need a pretty big model for that knowledge base to be represented within that NLP. So training these NLPs is pretty expensive. It takes time and it takes some expertise.

      The second thing I want to look at is the dialogue manager. That dialogue manager, I have it in a linear box with the ASR and TTS. But the dialogue manager, as I mentioned earlier, is the quarterback of the system. So really, when all this is going on, and you saw the little snippets of architecture or schematics in the video while Toy Jensen was describing how he was made, all those actions are being managed by the dialogue manager.

      Additionally, the dialogue manager is pretty aware of state. It's maintaining state so it can maintain the flow of the conversation. So that's a pretty heavy piece of software: state management, management of all these functions, and organization of this system. So the dialogue manager is very specific to the use case, and it can be very complex. And again, that means expertise and implementation effort.

      Now, if I want to create a speech AI or a conversational AI for a specific 3D graphics application like VRED, that big use case offers some challenges. As I mentioned earlier, training a conversational AI NLP can be really expensive. It can be time consuming, and it requires a certain degree of technical expertise. It's a pretty high bar. Dialogue managers also can be pretty heavy, pretty significant pieces of code. And that's great if you're building a bot that knows synthetic biology. But if you're just trying to get your application to be voice activated, maybe we can make these problems a little bit more tractable. And that's what we've done with Project Mellon here at NVIDIA.

      So for graphics applications, we can look at this expense of NLPs. And for Mellon, we actually use something with a really constrained knowledge base. So instead of having to know about quantum mechanics and synthetic biology, our knowledge base really just needs to know what the commands are in that application. So if I can change paint and I can change tires, that's really the whole set of knowledge I need to have.

      What that also means is that my dialogue manager really doesn't have a huge role. It can be pretty lightweight. It can take these text snippets that come out of my speech and compare those to commands, with the NLP doing most of the heavy lifting there, and then all it has to do is write that command in a form my application can understand and act on.

      So Project Mellon uses a lightweight dialogue manager. It's a codebase in Python. And it's really built to inform Project Mellon's NLP, which is a zero-shot NLP, and I'll talk about that in a second, with this limited application knowledge base, this KB. And again, the KB is really just the commands, the command parameters, and maybe a little information about the scene graph.
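
      As a sketch of what "lightweight" means here, assume the NLP has already scored the user's utterance against the known commands; the dialogue manager then just picks the best-scoring command and formats it for the application. The function names and command format below are illustrative, not Mellon's actual code.

from typing import Dict, Optional

def pick_command(scores: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the best-scoring command, or None if nothing is confident enough."""
    best_cmd, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_cmd if best_score >= threshold else None

def to_app_call(command: str, slots: Dict[str, str]) -> str:
    """Format the chosen command the way the target application expects it."""
    args = ", ".join(f"{name}={value!r}" for name, value in slots.items())
    return f"{command}({args})"

# Hypothetical zero-shot scores for the utterance "paint the car blue".
scores = {"change_paint": 0.86, "change_tires": 0.07, "no_command": 0.02}
command = pick_command(scores)
if command is not None:
    print(to_app_call(command, {"color": "blue"}))  # change_paint(color='blue')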

      Part of the work we're doing on Project Mellon is developing APIs, where we can give you a code sample and Project Mellon can go in and scrape your command registry, scrape your scene graph, and really learn the knowledge base automatically. But that's future work down the road. I'll show you another way that we're inputting these types of knowledge bases into the system right now.

      So I want to talk about the NLP, the natural language processing library. We're using zero-shot text classification. Zero-shot text classification in this case has been trained on entailment, basically whether one statement implies another. And in this case, we're using it to ask: given an utterance from a user, a command from a user, does that imply a command in the system?

      So here's an example, and I've drawn these out on this slide. "People formed a line" is right here in this box. This is the information it's trained on, right: "At the other end of Pennsylvania Avenue, people began to line up for a White House tour" implies "people formed a line at the end of Pennsylvania Avenue." So I can train the system with a bunch of these types of statement pairs, and then also train it with ones that don't imply one another. And I get a system that looks for "does X imply Y?" using the whole statement. So it's not just using a single word, or trying to understand that word and its context; it's using the whole statement.

      So if I move that into my application space, and that's what I'm showing down here at the bottom of the slide, I have a command of "turn on the lights." And now it's going to score different commands in the system. So let's say I have a system that has commands where I can turn sound on or off, I can turn the lights on or off, or maybe it's not a command at all. And those are my choices.

      Now, of course, an application will have many more commands. So does "turn on the kitchen lights" imply "sound on", "sound off", "light on", "light off"? My NLP, my zero-shot NLP, is going to score that implication, right. So here's a relatively low score: "turn on the lights" does not imply "sound on." It does imply "light on"; "turn on the lights" implies that with a 0.8 score.
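
      The same kind of scoring can be reproduced with an off-the-shelf zero-shot classifier, for example the Hugging Face pipeline backed by a public NLI (entailment) model. This only illustrates the technique; it is not Project Mellon's model or wiring.

# Score an utterance against candidate commands with zero-shot classification.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "turn on the kitchen lights",
    candidate_labels=["sound on", "sound off", "light on", "light off", "not a command"],
)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")  # "light on" should come out on top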

      And so the dialogue manager, the Project Mellon dialogue manager, is going to take that high score and build an application command from this "light on" command. And I want to introduce some nomenclature. When we talk about commands, in the nomenclature of AI, or at least of the zero-shot model, the commands are intents and the command parameters are slots. For instance, "change paint" would be an intent and "blue" would be a parameter. So we're going to start talking about intents and slots.

      So slots are parameters; intents are the verbs. And here again is that utterance, "turn on the kitchen lights." We compare it to all the commands in the registry, the commands that the NLP knows about, score those, and then give the dialogue manager the answer: this is the command the person is most likely trying to use.

      So how do we get these intents and slots into the NLP? Well, one way, and a pretty mechanical way right now, is that since it's a fairly limited number of commands, we can actually put them in a spreadsheet. This ends up being a JSON file. We're also scraping the command registry; that ends up in a JSON file, too. And it informs the dialogue manager what the potential intents and slots are, what the potential verbs and adjectives of this system are.

      So for Autodesk VRED, for instance: move car, turn wheels, open door, change paint. And you can see the intent labels over here. And then the slots. For a direction slot, the slot question might be "which direction?" The slot question helps the NLP understand what question to ask if someone left a slot out. Say they said "move the car" but didn't give a direction. It would ask "which direction would you like to go?" and it would ask it in natural language. We can add other metadata into the system, too. Like we can let the NLP or the dialogue manager know that you can move the car, and we can give it a variety of slot values. You can change the paint; we can give it a variety of slot values. And again, these are just parameters within the command system of the application.
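
      Below is a hypothetical example of the kind of entries that spreadsheet-turned-JSON might carry: an intent, its slots, the slot question to ask when a value is missing, and the allowed slot values. The field names and helper function are illustrative, not Mellon's actual schema.

from typing import Dict, Optional

# Illustrative knowledge-base entries: intents, slots, slot questions, slot values.
knowledge_base = {
    "move_car": {
        "slots": {
            "direction": {
                "question": "Which direction would you like to move the car?",
                "values": ["forward", "backward"],
            }
        }
    },
    "change_paint": {
        "slots": {
            "color": {
                "question": "Which color or material?",
                "values": ["white", "yellow", "red", "black", "chocolate", "blue", "silver"],
            }
        }
    },
}

def missing_slot_question(intent: str, filled_slots: Dict[str, str]) -> Optional[str]:
    """If the user left a slot out, return the natural-language question to ask."""
    for slot_name, spec in knowledge_base[intent]["slots"].items():
        if slot_name not in filled_slots:
            return spec["question"]
    return None

print(missing_slot_question("move_car", {}))  # Which direction would you like to move the car?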

      So this spreadsheet goes in, the dialogue manager takes it in and informs the NLP, and that helps it use these commands to do its comparison: does this imply that, and what's the score? And the dialogue manager takes that answer and turns it into a command. And that is basically what I'm going to show next.

      This is another video I'm going to show, of VRED with Project Mellon. What I'm going to show here is some of that metadata. So not just "command, do something; command, do something," but asking about the commands, asking what the slots are. And so this natural language understanding can work with questions as well as commands. And I'll give you a little sample of that right now. I'll switch screens and show you what that means.

      So this is work that Hayden Wolff did. And oddly enough, he started this as an intern. Now he's a full-time tech marketing person at NVIDIA. So let me hit play. This is Hayden Wolff creating a use case with metadata and some really interesting interactions with this model in VRED.

      HAYDEN WOLFF: Tell me a bit about what colors I can paint the car.

      VOICE ASSISTANT 2: White, yellow, red, black, chocolate, and more colors.

      HAYDEN WOLFF: OK. I want to see the car in a chocolate color.

      VOICE ASSISTANT 2: Painting car in chocolate color.

      HAYDEN WOLFF: Actually, please show me the car in black.

      VOICE ASSISTANT 2: Painting car in black color.

      HAYDEN WOLFF: Could you tell me what this paint is?

      VOICE ASSISTANT 2: Metallic black.

      HAYDEN WOLFF: How can the car be moved?

      VOICE ASSISTANT 2: You can move the car forward and backward.

      HAYDEN WOLFF: Move the car forward.

      VOICE ASSISTANT 2: Moving car forward.

      HAYDEN WOLFF: Move the car backward instead.

      VOICE ASSISTANT 2: Moving car backward.

      HAYDEN WOLFF: Can I open the driver door?

      VOICE ASSISTANT 2: You can open and close the driver, passenger, and all doors.

      HAYDEN WOLFF: All right. Open door. Could you tell me--

      VOICE ASSISTANT 2: Opening driver door.

      HAYDEN WOLFF: Now let me see the passenger door open.

      VOICE ASSISTANT 2: Opening passenger door.

      HAYDEN WOLFF: Open all of the doors.

      VOICE ASSISTANT 2: Opening open all of the doors.

      HAYDEN WOLFF: What is on top of the roof right now?

      VOICE ASSISTANT 2: Surfboards.

      HAYDEN WOLFF: What can I change the interior to?

      VOICE ASSISTANT 2: You can change car interior colors to white, red, black, red-white, white-black, and more.

      HAYDEN WOLFF: Sounds good. Please make the seats red.

      VOICE ASSISTANT 2: Changing car interior color to red.

      HAYDEN WOLFF: Is ray tracing on right now?

      VOICE ASSISTANT 2: Ray tracing is set to false.

      GREG JONES: Very good. So I'm going to switch back over to my slides now.

      So one of the questions that might be popping into people's heads is, why VRED, and how hard would this be to do with something else? Why VRED? Omniverse is obvious: it's one of our internal software programs and we can work with it directly. With VRED, we really wanted to do an external program and see what we could do with an actively used design review program. And VRED offers some really compelling pieces that make this work pretty straightforward, and I would say it's our template for other applications we'd like to help do this with.

      So first off, it's widely used. It's a de facto piece of software in the automotive OEM business. I don't think there's an automotive group that doesn't use VRED for design review. Internally, VRED has this idea of variant sets. And I'll touch base on what variant sets are in my next slide. But also, it has Python scripting to trigger these variant sets. That Python scripting layer lets our dialogue manager actually build a Python script, hand it to VRED, and VRED just executes that Python script. So that interface is a really handy interface, that Python interface.

      VRED fits with our pillars. It's got beautiful graphics. It's got a rich UI. And I'll say this: when VRED is used in the wild, a VRED expert will design these variant sets and build a design review, and then you'll hand it to someone who may be a relatively naive VRED user, maybe even an executive, to review. So as the expertise level with that deep, rich GUI lessens, it gets harder and harder to just hand someone that package and let them do a design review without the experts in the room. So it has this rich UI, but can we also use voice to simplify that UI? VRED is just a great piece of software to do this use case with. Plus, the VRED team has been really aligned with this since the beginning of Project Mellon. It's actually the first piece of software we started working with, and they've been great to work with on this.

      So let me talk about those variant sets in VRED. Variant sets are basically just defined changes in the scene graph. So here are three examples. I've got my car down on the left hand side: it's a dark color, in a mountain environment. I can make a variant set where I change the car color and I change the environment. That's the same scene graph, just a variant of it. And then in my lower right hand example, again, I've changed the color of the car. Just another variant in my scene graph.

      And I can use Python scripts to trigger these variants. And that's a really nice interface for us to think about building standard APIs on, for this kind of Python scripting layer. So I've shown this schematic of the conversational AI flow before, and this is what it looks like with VRED.

      So we've actually modified Riva a little bit. We've put in a speech client. We're streaming AR and VR, as Sean Wagstaff demonstrated, so we've actually put a speech client in our streaming library, NVIDIA CloudXR. The speech then gets into Riva through that speech client. Riva does the automatic speech recognition, so we have text coming out. We then have the dialogue manager handing that to the zero-shot NLU, where we're scoring that utterance relative to the available intents and slots. And those intents and slots map to the command set and command parameters of VRED.

      Then the dialogue manager takes that output from the zero-shot NLU and creates a Python script using variant sets and values. Command fulfillment is then done through that Python script activating the variant sets, and you see the change in the scene. And the dialogue manager is also pulling in that variant set key and metadata, the intents and slots; it's either pulling it out of the application or we're putting it in through that spreadsheet. So that's the basic schematic for Project Mellon working with VRED.
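
      As a sketch, the script the dialogue manager hands to VRED can be as small as a couple of calls to the variant-set API. This assumes VRED's embedded Python environment, where selectVariantSet() is available; the variant-set names are made up for this example.

# Runs inside VRED's Python environment, where selectVariantSet() is built in.
# The variant-set names below are hypothetical, not names in a shipped scene.
def apply_review_state(paint_variant, environment_variant):
    selectVariantSet(paint_variant)        # e.g. switch the car's paint variant
    selectVariantSet(environment_variant)  # e.g. swap the environment variant

apply_review_state("Paint_Blue", "Env_Mountain")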

      Now, I showed it with Omniverse, and I want to pop this up also to talk about, especially, this section right here. So the CloudXR, the language, and the use of Riva are all the same. It's essentially identical. The zero-shot NLU is still scoring the utterance relative to the available intents and slots. But the dialogue manager is now working to get these from what we call, in Omniverse, action graph nodes, and, again, another Python script.

      So they've built an action graph layer that's executed via Python script. We pull node data that are equivalent or similar to variant sets. We pull that scene graph data; we actually, in this case, scrape the scene graph. And then we create the Python scripts using these action graph nodes. And then the fulfillment is through that Python script triggering those action graph nodes, and you see the changes in the scene. And so that's a very similar schematic with a very similar function. And this dialogue manager and this fulfillment piece are where we're really looking at building standard APIs and offering sample code as we move this toward maturity and develop Project Mellon further.

      So when we scraped that scene graph, I wanted to show this image one more time and remind everybody: when we targeted that horse and said hide the horse or bring the horse back, we'd learned "horse" from Project Mellon, the dialogue manager, going in and scraping that scene graph and finding a geometry, a data entry in that scene graph, that said horse, right. So there are other ways to get into the scene graph. We could even do spreadsheets on a scene graph. But that's the idea: the more you can tell that natural language understanding or natural language processing library, the more you can do with voice in these systems.
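
      A hedged sketch of that scraping step: walk a USD stage and collect prim names so the NLP knows that things like "horse" exist in the scene. This uses the open-source USD Python API (pxr); inside Omniverse you would query the live stage rather than open a file, and the file path here is hypothetical.

# Collect prim names from a USD scene so they can be offered to the NLP as slot values.
from pxr import Usd

def scrape_prim_names(usd_path):
    stage = Usd.Stage.Open(usd_path)
    return [prim.GetName() for prim in stage.Traverse()]  # e.g. "horse", "blocks", "tv"

print(scrape_prim_names("review_scene.usd"))  # hypothetical file path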

      That wraps up the Mellon presentation. I want to kind of close it out by saying Project Mellon enables voice control of scene graphs and graphics applications. And we're trying to really make it a general system so many people can use it with their applications.

      The reason we're doing this is that we really think voice interaction is more efficient than GUIs, especially in XR applications. But it also lowers the expertise needed for people to use the software. And I think this pipeline of building the design experience with experts, and then letting relatively naive users experience those designs with voice, is just a really clean way to start thinking about how to get people into more rich graphics, more design reviews, more XR environments.

      Project Mellon uses a zero-shot NLU and a lightweight dialogue manager, which we think will make it really transferable for people to use on their own: bring in these lightweight APIs, or even get into the dialogue manager and modify that Python code to do very specific things.

      It employs NVIDIA Riva. And there's just a ton of work going on in Riva at NVIDIA, so it's going to be one of the best speech recognition offerings out there. It's highly accurate and super flexible. You can customize Riva to a really large degree if you so choose. But we use it, like I said, out of the box. And I think that's great.

      So for the future, and I've mentioned it a couple of times, we're working on APIs both for scraping command registry and scene graph data, and also for building the Python script command fulfillment piece. We've also, and I haven't mentioned this, created a generic VRED sample with their Genesis model. What this sample does is it runs in AWS. We'd love to have users play with it, especially design review experts in automotive. And what you can do is actually run a scene and do a design review.

      When it works, great. You can say "feedback" and Mellon will stop and say "give me your feedback," and you can say it was great. Or you can run a section and say "feedback, that didn't work as well, here was the problem." And then we're going to ask you for those logs so we can see where this lightweight dialogue manager and zero-shot NLU are accurate, where they're doing well, and where we need to dress them up a bit. So that's our call to action: if you want to be in this test group, let me know at jjones@nvidia.com.

      And then I want to end by saying thanks. The team that's been working on this has been great. Sean Wagstaff, I mentioned his name a bunch of times. Vlad is the Project Mellon architect; he figured out this lightweight dialogue manager and the zero-shot NLU usage. He's really a mad genius in this conversational AI space. Hayden Wolff came to us as an intern and penned the name Mellon. If you're curious where Mellon came from, do a search on Mellon and Gandalf and you'll find it.

      Prasoon was an intern. He just left going back to Carnegie Mellon for his master's degree. And he did a lot of the recent Omniverse work. Just excellent work. Christopher Parisien has been an advisor. He worked on the speech client. And he's just been a great guide.

      Tian has done the fundamental Omniverse work in teaching us how to use Omniverse. Damien is part of that group also and has guided us, and did a couple of Omniverse demos that really woke us up. Purmendu left NVIDIA and has his own startup now, but he did some of the early zero-shot work for us. And then Florian Cohen at Autodesk, thanks for the car model.

      And with that, I'll say thank you to Autodesk for having me. It's been great. Enjoy the rest of your AU experience and the on-demand content. And thank you for paying attention and listening to me. Cheers.