Description
Key Learnings
- Learn about the Azure OpenAI services.
- Learn how to prepare and augment your data set for prompt-to-code generation purposes.
- Learn how to design the proper context for prompt engineering.
- Learn how to maximize custom structures using embeddings for more-complex use cases.
Speaker
- Bruno Roy is a Principal Research Scientist at the Autodesk AI Lab in Montreal. His research focuses primarily on current challenges in computer graphics by combining machine learning with numerical methods. Prior to joining Autodesk, Bruno's involvement in the computer graphics community was reflected through various experiences in industrial and academic research in the media and entertainment (M&E) space. Bruno holds a Ph.D. in Computer Science from the University of Montreal, where he explored ways to improve numerical simulations for particle-based fluids by leveraging hybrid and data-driven methods using iterative solvers.
BRUNO ROY: All right. Hello, everyone, and thanks for attending our session on AI-powered workflows in Maya. In this talk, actually, we're going to discuss how we leverage large language models for content creation and how powerful these models are, even when scratching the surface of their potential.
In this session, we will present our work from last year in collaboration with the Acceleration Studio team at Microsoft. Our first proof of concept was presented during the Microsoft Build conference in 2022. Seeing our work included in the CEO keynote was great for the visibility of this project and for giving a rough idea of what these foundation models are capable of.
In this video, we show a brief demo of how we manage to interact with Maya using exclusively English sentences as inputs. Using your own words, you can control, edit, create content in Maya without the requirement of knowing the software or any technical language used in the field.
So here's our agenda for this presentation today. First, I will briefly cover the context and motivations to better position this project. A brief architectural overview of the approach will be presented. Then, we will discuss several experiments, and the data used with them, in more detail. We will showcase a quick demo of how it is used in Maya. And lastly, we're going to highlight a few promising avenues for future work.
So let me first introduce a few of the fundamental concepts behind this project. Our goal is to leverage natural language processing and AI to offer more intuitive input methods, such as text and voice, for content creation. These two combined are extremely powerful when used with our expert knowledge, such as in 3D. With recent advances in large language models, such as GPT and ChatGPT, we can finally develop more generalized and flexible solutions.
Although GPT was initially intended to generate text, other derived models, such as DALL-E, have been introduced to combine different modalities, such as text and images. And more closely to our work, and as opposed to GPT, Codex is a model that has been trained and purposed to translate natural language to code. GitHub Copilot is the first commercial use of that model.
Our work, which was initially called Maya Codex, lies between DALL-E and Codex in a sense that we're taking advantage of the ability of these models to generate code in order to create content. So in summary, our goals with this project are to, first, generate more intuitive tools using text-based methods, as opposed to traditional ways with mouse and keyboard.
Second, encourage learning along the way. So with this plugin, you can also show the code generated by our model, and then reuse it in different projects. And lastly, it's to enable automation with this kind of model. So basically, to take care of repetitive and tedious tasks and let you focus on creating content.
So how are we connecting the dots to make all of this happen? In this project, we demonstrate how you can combine your cloud-based service with Azure's and offer capabilities in your products. For instance, in this project, through a Maya plugin, we're able to interact with a service layer, linking the Forge microservice and that of Azure OpenAI.
So let's quickly deep dive into the Azure OpenAI services to better understand how we use, evaluate, and update our model. So here's an overview of the pipeline we use in this project to adapt our model to the content creation task in Maya. Based on the feedback service at the inference stage, we update our data sets to improve our model. That updated data set is actually used with different experiments to evaluate the performance of the model.
Once we are satisfied with the model's performance, we stage the new version of it and replace the current one in place. This part can also be automated to improve low-performing use cases not covered in the deployed model. So essentially, to make our model more resilient and flexible to unseen scenarios using, for example, different phrasing and technical jargon.
Here's one simple example of the ins and outs of this project. So it goes something like this. The user types in add a cube to the scene. Our model returns the polyCube command, which is actually the one executed by our plugin to create geometry. As a follow-up action, the user types rotate the cube 45 degrees on the x-axis, and our plugin returns and executes the corresponding command with the proper arguments.
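The exchange described above can be written down as prompt and command pairs. This is only an illustrative sketch: the command strings assume the standard maya.cmds Python module and a default object name of pCube1, and the helper run_in_maya is hypothetical. The talk does not show the exact text the model returns.

```python
# Hypothetical prompt/command pairs; the command strings are assumptions
# written against the standard maya.cmds module, not the model's real output.
EXCHANGE = [
    ("add a cube to the scene", "cmds.polyCube()"),
    ("rotate the cube 45 degrees on the x-axis",
     "cmds.rotate('45deg', 0, 0, 'pCube1')"),
]


def run_in_maya(command: str) -> None:
    """Execute one generated command string (only works inside a Maya session)."""
    import maya.cmds as cmds  # importable only from Maya's Python interpreter
    exec(command, {"cmds": cmds})


# Outside Maya we only print the pairs instead of executing them.
for prompt, command in EXCHANGE:
    print(f"{prompt!r} -> {command}")
```

Inside Maya, calling run_in_maya on each command in order would create the cube and then rotate it.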
Unsurprisingly, high-quality data is crucial for the success of this type of capability. Creating a proper evaluation data set is important to measure your progress with these models. Our evaluation set is composed of pairs of natural language prompts and commands, covering supported use cases, which are, essentially, camera, materials, primitives, and what we call whole-object manipulations.
These sample pairs are naturally reviewed by field experts to make sure that they produce the expected results. And as shown on the top right table, most of our collected samples are one-to-one, meaning that for one prompt, we have a single command generated.
Naturally, for validation and fine-tuning purposes, we need to grow our data set with more variations. For that, we use a synthetic data generation framework, where samples are parsed and converted into more generic and customizable expressions. These generic expressions are then used to generate synthetic natural language prompts, using synonyms, different values, and so on. Corresponding code is generated alongside them to produce new pairs of samples for training.
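As a rough illustration of that idea, here is a minimal sketch that expands one generic expression into many prompt and code pairs. The template, the synonym list, the values, and the object name pCube1 are all assumptions made for this example; the actual framework is not shown in the talk.

```python
import itertools

# Hypothetical generic expression with customizable slots.
PROMPT_TEMPLATE = "{verb} the cube {angle} degrees on the {axis}-axis"
VERBS = ["rotate", "turn", "spin"]   # assumed synonyms
AXES = ["x", "y", "z"]
ANGLES = [15, 45, 90]                # assumed values


def rotation_code(axis, angle):
    """Build the command string that matches one generated prompt."""
    values = {a: (angle if a == axis else 0) for a in "xyz"}
    return "cmds.rotate({x}, {y}, {z}, 'pCube1')".format(**values)


def generate_pairs():
    """Expand the template over every verb, axis, and angle combination."""
    pairs = []
    for verb, axis, angle in itertools.product(VERBS, AXES, ANGLES):
        prompt = PROMPT_TEMPLATE.format(verb=verb, axis=axis, angle=angle)
        pairs.append((prompt, rotation_code(axis, angle)))
    return pairs


pairs = generate_pairs()
print(len(pairs), "synthetic pairs, for example:", pairs[0])
```

One template with three synonyms, three axes, and three values already yields 27 pairs, which is how a small reviewed set can grow into a much larger one.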
So in the next few slides, we will discuss throughout experiments how we achieve an acceptable level of accuracy for production and also to highlight what can be accomplished with these large language models without the need of fine tuning. Providing meaningful samples along the prompt, often called few-shot learning, has proven to improve the accuracy of the response at the inference stage.
Since finding the meaningful ones can be tricky in some cases, we also provide a previous request as what we call a historical context. Something interesting to highlight in this table is the fact that, regardless of the model used, historical context seems to have more impact than few-shot examples on the model's precision. And when combining the few-shot examples and historical context, you get an even better-performing model, as shown at the bottom part of this table.
As shown in this chart, we achieve above 80% precision in all supported categories, compared to 50% with the out-of-the-box GPT-3.5 for the supported categories and also in this specific task. Being able to measure how well your model is performing in production is essential. So in this project, we use the exact match and other distance metrics to measure the similarities between the expected and generated code.
As shown in this example, these two significantly different commands can produce the same outcome, showing how important text normalization is. This means that, even with a low score on the exact match metric, you might end up with the right outcome. It also shows the importance of sample normalization for training and validation.
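To make the metrics concrete, here is a minimal sketch of an exact match check and one possible distance-style score. The normalization rules and the use of Python's difflib are assumptions chosen for illustration; the talk does not say which normalization or which distance metrics were used.

```python
import difflib
import re


def normalize(code: str) -> str:
    """Crude normalization: unify quotes and drop all whitespace.
    (Assumed rules; this would also alter whitespace inside string literals.)"""
    code = code.strip().replace('"', "'")
    return re.sub(r"\s+", "", code)


def exact_match(expected: str, generated: str) -> bool:
    """True when the two commands are identical after normalization."""
    return normalize(expected) == normalize(generated)


def similarity(expected: str, generated: str) -> float:
    """Character-level similarity in [0, 1] from difflib.SequenceMatcher."""
    matcher = difflib.SequenceMatcher(None, normalize(expected), normalize(generated))
    return matcher.ratio()


# Hypothetical pair: same outcome in Maya, different argument order.
expected = "cmds.polyCube(width=2, height=2)"
generated = 'cmds.polyCube( height = 2, width = 2 )'

print("exact match:", exact_match(expected, generated))
print("similarity:", round(similarity(expected, generated), 2))
```

The pair fails the exact match even after normalization, yet both commands create the same cube, which is why a softer similarity score is worth tracking alongside it.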
Also, as we use few-shot learning to guide the predicted command, we have looked into different experiments to better identify the closest samples to the one requested by the user. Without surprise, high-level categories have been revealed to be too broad for such a task. Command-level categories are also biased as our data set is extremely imbalanced. This is mostly due to the differences in complexity of each command.
As a result, we've used a density-based spatial clustering method to encourage natural grouping in the data. So that way, we end up with more fitting clusters for this task, especially to find the closest command to the one requested by the user.
Remember when I mentioned that most of our samples are one-to-one? Well, it turns out that, in many cases, a task can be too complex to be represented with a single command, even when enriched with few-shot samples. So instead of predicting code using the Maya API, we lean on what we call the high-level APIs to encapsulate more complex tasks. In short, we add these high-level API samples to the few-shot bank.
So like in this example on the right, where you can replace a three-command task with a single call from our high-level API. So for example, take this task of changing the color of an object, which usually requires, one, creating a new material, two, assigning that material to the object, and lastly, changing the color of the assigned material. We only need to predict the call to the high-level API change color with the proper arguments.
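For readers unfamiliar with density-based clustering, here is a small pure-Python sketch of DBSCAN, the best-known method of that family. The talk does not name the exact algorithm or its parameters, so the choice of DBSCAN, the eps and min_pts values, and the 2D points standing in for prompt embeddings are all assumptions.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id >= 0, or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # point i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # noise reached from a core point: border
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Hypothetical 2D "embeddings" of prompts: two dense groups and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (9.0, 0.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

At request time, the few-shot examples could then be drawn from the cluster nearest to the user's prompt rather than from a fixed, hand-defined category.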
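The three steps just listed can be wrapped in a single function, which is the kind of call the model would then predict. This is a sketch under assumptions: the name change_color, its signature, and the choice of a lambert material are hypothetical, while shadingNode, sets, connectAttr, and setAttr are standard maya.cmds commands. The cmds module is passed in as an argument so the sketch can also run outside Maya with a recording stand-in.

```python
def change_color(cmds, obj, rgb):
    """Hypothetical high-level call wrapping the three steps named in the talk."""
    # 1. Create a new material (and the shading group that carries it).
    shader = cmds.shadingNode("lambert", asShader=True)
    group = cmds.sets(renderable=True, noSurfaceShader=True, empty=True)
    cmds.connectAttr(shader + ".outColor", group + ".surfaceShader")
    # 2. Assign that material to the object.
    cmds.sets(obj, edit=True, forceElement=group)
    # 3. Change the color of the assigned material.
    cmds.setAttr(shader + ".color", rgb[0], rgb[1], rgb[2], type="double3")
    return shader


class FakeCmds:
    """Stand-in for maya.cmds that records call names, for running outside Maya."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(*args, **kwargs):
            self.calls.append(name)
            return name + "1"  # fake node name
        return record


fake = FakeCmds()
change_color(fake, "pCube1", (1.0, 0.0, 0.0))
print(fake.calls)
```

Inside Maya, the same function would be called with the real module, for example change_color(cmds, "pCube1", (1.0, 0.0, 0.0)) after import maya.cmds as cmds, so the model has one short line to get right instead of several.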
Something else we did to improve the precision of our model is to use Docstrings as context. Docstrings provide the function signature, description, and flag descriptions for a particular command. Similarly to few-shot learning, we provide this information with the request.
So to summarize, here's an overview of how we design the prompt sent to our inference model. The high-level context is inserted on top with the prompt to focus on Maya Python API code, making sure that our model predicts the right API and the latest version of it.
In addition, we use an extended context layer as guidance with our prompt. The extended context includes task descriptions, such as command Docstrings, a few examples similar to the task requested in the user prompt, and lastly, recall context, such as previous commands, session history, and scene description.
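Putting those layers together, a prompt builder might look like the sketch below. The layout, the comment prefixes, the header text, and the function name build_prompt are all assumptions for illustration, and the docstring entry is placeholder text; the talk only describes which kinds of context are included, not how they are formatted.

```python
# Assumed high-level context line placed on top of every prompt.
HIGH_LEVEL_CONTEXT = "# Maya Python API (maya.cmds), latest version"


def build_prompt(request, docstrings, few_shots, history, scene):
    """Assemble the prompt: high-level context, then extended context, then request."""
    parts = [HIGH_LEVEL_CONTEXT]
    # Task descriptions: command Docstrings.
    parts += ["# Doc: " + doc for doc in docstrings]
    # Few-shot examples similar to the requested task.
    for prompt, code in few_shots:
        parts += ["# Example: " + prompt, code]
    # Recall context: scene description and previous commands.
    parts.append("# Scene: " + ", ".join(scene))
    for prompt, code in history:
        parts += ["# Previous: " + prompt, code]
    # The user's request goes last so the model completes the code after it.
    parts.append("# Request: " + request)
    return "\n".join(parts)


prompt = build_prompt(
    request="rotate the cube 45 degrees on the x-axis",
    docstrings=["cmds.rotate: placeholder docstring text for the rotate command"],
    few_shots=[("turn the sphere 90 degrees on the y-axis",
                "cmds.rotate(0, '90deg', 0, 'pSphere1')")],
    history=[("add a cube to the scene", "cmds.polyCube()")],
    scene=["pCube1"],
)
print(prompt)
```

Keeping each layer in its own labeled section makes it easy to switch one off, which is how the effect of few-shot examples and historical context can be compared in separate experiments.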
So in this demo, we will show a few capabilities of our current plugin on a complex scene graciously provided by KitBash3D. Almost every interaction in Maya presented in this demo is performed by typing natural language inputs in our plugin. So everything I'm going to describe is, basically, how I phrased it using this plugin. So as I will show a few lighting-related features, let me first enable the light in the scene. So we're going to be in pitch black first before adding any light sources.
But first, let me add a sun by typing into our plugin, add a sun in the scene, and rename it sun. Something pretty cool about this is the fact that you don't need to know what you're actually asking for, which is a directional light simulating a very distant light source. You can also ask to rotate that light source 90 degrees to get that nice lighting on the facade and also to tilt it down a little bit, 15 degrees, to simulate how the sun would rise on this scene.
And also, I can ask something like, can you reduce the brightness to 50%, without knowing where to go to change these parameters, and to highlight other light sources later. So let me create a camera and show you how I created it. So I ask, actually, create a camera from the current viewpoint, without knowing what the actual manipulations are to do so in Maya, which is pretty cool.
And I'm going to ask them to move it in front of the streetlamp on the ground, just in front of this building. There you go. And then I'm going to ask this plug-in to point that camera towards that front door structure, so I can have a nice frame to play with. There you go. So now let's see through it to see if we have something good. Seems good. All right. And I might ask to hide that streetlamp in front of us just because it's occluding my current view. I might replace it later.
And then, of course, I'm going to add more light sources, especially on these walls. So I'm going to add light sources, especially spotlights directly on the lamp on the wall. All right. Adding the spotlight. And then I'm going to ask to duplicate that spotlight, so I can add a different one on the other side of that column, the right. And rename it spotlight on this corner right. Then move it on the right position here, without knowing the actual position of this object. There we go.
And it's already pointing up. So that's good. And then I'm going to ask to change the color of the left. I'm going to go look back at the camera and ask to change the color of the left spotlight to red. This one. All right. Cool. And do the same. And ask exactly do the same for the right spotlight, so it knows what to do here. All right. It works.
Maybe make them 10 times brighter, without knowing where to actually change this. OK. It seems to work. I'm going to show you in the Attribute Editor here where the change has been made. We can see it worked. So you can see the value 10. So basically, the default value is 1 for these lights. And then I'm going to ask maybe to change the material of the front faces here.
So basically, what I want to do is assign a material existing in this scene from the column to those faces. And there you go. Just something pretty. This is a very good example of a few lines generated out of a single prompt.
And lastly, I'm going to bring back the streetlamp. Maybe move it a little bit on the left and rotate it 90 degrees on its right just to frame that door and duplicate that light to add the same in front of the right column, the right, on the right of the door, right.
You can probably see here in the explorer of the objects. And then I'm going to ask move it to the right. There we go. Again, maybe a little bit more. And there you have it. And lastly, rotate it 180 degrees on its left. And there you go. Let me show you the scene here. Yep. So here you have it, like a fully edited scene using only natural language, without any knowledge of this software, which is something very cool to do with these foundation models.
All right. So let me give you a brief recap and open up on promising avenues. So we presented a generative approach tailored especially for content creation. But this kind of model can be applied to different use cases, which is something very interesting with these models. Our approach is fully controllable with natural language. And again, we only scratched the surface of what is possible for creators. Much more to come. It shows how powerful our tools can become when combined with data-driven methods.
And yes, Maya was an excellent first candidate to demonstrate a concrete use case of the Azure OpenAI services. We also showcased the benefits of using custom data structures to avoid the necessity of fine-tuning these models, which can be pretty expensive.
As future work, we will actually look into different ways to give our model a better understanding of the 3D scene. Among other things, we plan to use the USD format to learn the scene hierarchy. And of course, by leveraging various modalities available to us, such as geometry and various viewport buffers. By combining our current model with such data, we can enrich the possibilities for creation.
We also plan to propose multiple completions in the form of similar suggestions, so basically providing different historical use cases that other users did or how they phrased it. We also want to enable autocompletion to provide insightful suggestions as you type, so basically knowing what would be the natural next step in your workflow.
And for now, as I mentioned earlier, the supported use cases referred to as phase 1 and extended phase 1.1 mostly cover the camera, the lighting, materials, and object manipulation, which is basically moving, using rigid transformation on these objects in the scene, or browsing these objects, or knowing where they are compared to a different hierarchy.
But in future phases, we plan to include more complex use cases, such as rendering UI layouts and, hopefully, animation and modeling. And yeah, something cool is that we already have a functional plugin available through a private beta program. So please come talk to me or reach out to get any more details on this if you're interested.
And yeah, with that, thank you very much for joining us today, and I'm now ready for your questions.