Description
Key Learnings
- Master the utilization of graph databases on Azure.
- Develop expertise in mapping ontology in graph databases within the AEC industry.
- Familiarize yourself with the new Autodesk Platform Services AEC DM API.
- Explore how machine learning can optimize and automate the standardization process.
Speakers
- Amy Cai: I'm Amy Cai, Arcadis' lead software developer, with 3 years of cloud expertise and the title of Microsoft Certified Azure Solution Architect Expert. I have a passion for reading books and traveling, immersing myself in local cultures, and connecting with people from all walks of life.
- Josha Van Reij: Innovation is important, but it is just as important that innovative ideas are properly implemented: an implementation that delivers gains from ordering to delivery and from cooperation to maintenance. I am a consultant, and my passion is helping companies in their search for innovative techniques and ideas for their business. My goal is to investigate, implement, and optimally integrate them into the entire business. Some of the subjects I engage with are enterprise architecture (EA), building information modeling (BIM), IT governance, and data management, in combination with practical solutions like virtual reality, augmented reality, 3D printing, and machine learning.
AMY CAI: Welcome to the digital adventure. Please grab your favorite caffeinated beverage and find a comfy chair or a beanbag if you're feeling extra fancy. Let's dive headfirst into the wonderful world of online learning.
Today, in this class, we will explore data standardization in the AEC industry with the new AEC Data Model API from Autodesk and discover how machine learning can help handle the more tedious aspects of the whole process. I trust this will provide you with fresh insights into this API. And most importantly, I hope you will enjoy this journey we are about to embark on together.
So let's begin. Before we dive into the class, let's have a quick round of meeting the speakers. I'm Amy, and I'm the Technical Lead for Digital Solutions at Arcadis.
I relocated to the Netherlands four years ago. I love the culture, the heritage, and the tulips in the Netherlands. My passion for coding matches my love for reading, but I don't think I've earned the title of nerd just yet. Together with my colleague Josha, who is the product owner of the OTL solution, we will guide you through this class.
So this class, it's about data standardization. And we keep talking about data standardization, data quality. Why is it so important to standardize the data?
So let me share with you a painful story that happened in 1999, the so-called costly conversion mistake at NASA. In December 1998, NASA's Mars Climate Orbiter was launched on a mission to study the Martian atmosphere and climate. The spacecraft was expected to enter orbit around Mars and provide crucial data for future missions, including the Mars Polar Lander.
However, the mission ended in failure in September 1999, and the Mars Climate Orbiter was lost in space. The root cause of this catastrophic failure was a data quality issue, specifically a unit conversion error.
NASA's Jet Propulsion Laboratory was responsible for sending commands to the spacecraft, and it used the metric system, like meters and kilometers, for its calculations and measurements. But the contractor who built the spacecraft was using the imperial system, like feet and pounds. Throughout the mission, those two units of measurement were not reconciled properly, which rendered the mission a total loss. So as the quote goes, "you can have all the fancy technology you want, but if the data quality is poor it will lead to a train wreck."
So with that said, let me briefly present today's agenda to you. First, Josha will give you a demo of the solution we built for data standardization. Then, I will walk you through the data architecture behind this solution and how we connect the ontology to the solution database. Next to that, I will share with you the implementation of the AEC Data Model API in the OTL and how it helped to improve the workflow and extract the quantity data from the model. Finally, I will touch on machine learning and see how that drives this solution further.
So from this class, here is what you will acquire: you will learn how to utilize an ontology in AEC with a graph database on Azure, and you will also learn to use the new Autodesk Platform Services AEC Data Model API. There will also be hands-on experience with machine learning. So, Josha, could you give us a bit of background on this application and show us how it works?
JOSHA VAN REIJ: So maybe to start with Arcadis as a company. Very short. Of course, I'm not going to go through all these digits.
But most of you might know Arcadis, a large engineering company. Maybe a fun fact to mention is that Arcadis comes from Arcadia, which originates from Greek mythology and stands for well-being and wellness. And it's also one of the reasons, of course, why our company keeps growing.
And due to the size of our company, I can imagine many of you also encounter these challenges: we have a lot of fragmented data standardization, which results in not having cross-project data insights. To get that data out, we do a lot of manual data extraction so that we can use it for analytics and for client delivery. And with all the different engineers using different local engineering applications, we have a huge number of different applications and a high time investment to process low-quality data.
So these are the main challenges we have been encountering over the past years. And as Amy mentioned, I'll do a quick demo of the application where we try to solve these challenges, before Amy goes into technical depth.
First maybe, what is the OTL? What is the Arcadis OTL? Well, actually, it's an ontology. And you could compare it to a recipe. So you could see the lasagna as one of the different asset types within the company, and you have the different ingredients, which are the information requirements, and the structure of the recipe.
Well, within Arcadis, we have the Object Type Library, which is very similar. It contains different asset types, and those different asset types contain different information requirements and structures.
So in this example, you can see the bridge decomposition, the bridge ontology, where you see that we have defined certain object types and information requirements for that bridge. In our case, the information requirements are used for sustainability and cost management. So that's the standardization within Arcadis, but of course, we apply it project-specifically. For that, we created the OTL Optimizer, formerly known as PIMS, where you can make that ontology project specific. And that's what a part of this class will be about: how Amy applied the different Autodesk technologies in this optimizer solution.
Looking in more depth at this optimizer solution, and specifically the part this class is about: this example shows a bridge model inside our cloud-based optimizer solution, where the model has been OTL-standardized. So the geometry in this model has been classified with certain object types from the Arcadis ontology. By using the APIs of Autodesk, we're analyzing this model and extracting the different geometry elements. And we do that based on a parameter in the model called the Arc OTL object type.
Based on the information that we find, we create different instances within this optimizer project. So what you currently see happening is the creation of the different instances, the application of the project structure, and the assignment of the different object types to those instances. If you click on one of those instances, you'll see that the information requirements are gathered and you could actually fill in that information, in this case for curve one.
But, of course, we want to automate that. So inside the edit project settings, we have a calculation configuration, where we configure the mapping between the attributes from our ontology, that is, from our project-specific decomposition and information requirements, and the parameters inside the 3D model. In this sample, we have mapped a couple, and it's a standard that you can reuse for upcoming projects.
And after you have accomplished that mapping, we can extract all the quantities from this model into the different instances in the cloud, as you can see right here, by clicking the calculate button. So you see it's now extracting all the quantities. So why do we need this information?
Well, that's basically for using the data for all kinds of purposes within Arcadis. We connected this data to our common data platform within Arcadis to have sustainability insights, cost estimates, asset management, and many more. And with these, we can do project-overarching and cross-project analytics and handle the client delivery itself. How we have done that technically is something that Amy will now show in the next part of this session.
AMY CAI: Now, you have seen the application and how it extracts the quantity values from the model and feeds that model information into the right place. But how is the data mapped behind the scenes? How do we map the ontology into the solution?
Here is an architecture overview showing how the ontology is constructed. As you probably saw in the previous video, within Arcadis we have different assets being created, like a bridge, a road, and a building. And here is a simple bridge ontology showing you how everything is connected with each other.
So for example, a suspension bridge is a type of bridge. It can connect to a certain project and has multiple documents. And it has its own function. Every bridge has its own parts, like a deck, foundation, and beams. Each part can have its own subparts, like pavement, for example. And each part can also have its own other attributes, like a sensor or movement data, and its own 3D geometry shape.
And then, each part also has its own material. And the material can have quantity attributes, like volume, for example. So with this bridge ontology, we also map the ontology into our database, so we can make the solution project specific.
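To make that decomposition concrete, here is a minimal sketch of how such a bridge ontology could be represented as labeled nodes and relationships. The node and relationship names are illustrative placeholders, not the actual Arcadis OTL schema.

```python
# Minimal illustrative sketch of a bridge ontology as a labeled graph.
# Node and relationship names are hypothetical, not the actual Arcadis OTL.

nodes = {
    "suspension_bridge": {"label": "ObjectType", "name": "Suspension bridge"},
    "bridge":            {"label": "ObjectType", "name": "Bridge"},
    "deck":              {"label": "ObjectType", "name": "Deck"},
    "pavement":          {"label": "ObjectType", "name": "Pavement"},
    "concrete":          {"label": "Material",   "name": "Concrete"},
    "volume":            {"label": "Attribute",  "name": "Volume", "unit": "m3"},
}

edges = [
    ("suspension_bridge", "is_a",          "bridge"),     # type hierarchy
    ("bridge",            "has_part",      "deck"),       # decomposition
    ("deck",              "has_part",      "pavement"),   # sub-part
    ("deck",              "has_material",  "concrete"),   # material link
    ("concrete",          "has_attribute", "volume"),     # quantity attribute
]

# Walk the decomposition starting from a node to list its parts.
def parts_of(node_id):
    return [dst for src, rel, dst in edges if src == node_id and rel == "has_part"]

print(parts_of("bridge"))   # ['deck']
print(parts_of("deck"))     # ['pavement']
```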
So here is the database architecture we have for the OTL solution. In the OTL solution, we're using Azure Cosmos DB together with the Gremlin API. So here is where the data starts. First, we have a project node. This project node contains project-related information, like the project name and version.
And then on the left side here, you see two blue nodes, which stand for the calculation configuration that you saw in the previous video. These two configure how the attributes actually map to the model parameters.
Connected with the project, you also see these two green nodes. Those are related to the project unit system, basically to configure which unit system certain attributes, like length and width, would use. That could be meters, centimeters, or other units.
And for every project, we also have the option of model storage. The model storage is the purple node that you see here. It can be ACC storage, it can be local storage, and we also support ProjectWise storage. And each storage can have multiple models within it.
Now, the yellow nodes you see here are related to the project decomposition. Each project can have project attributes, as you can see here at the bottom. And each project contains a list of objects, ordered by the project decomposition. Every object has its own parts, which are other objects, as you can compare with the previous ontology architecture.
And every object has its own material and its own attributes, like length and width, for example, and activities, such as transportation information. So that's the overview of the data architecture.
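As a rough illustration of how a graph like this could be populated in Azure Cosmos DB from Python through its Gremlin endpoint, here is a sketch using the gremlinpython driver. The account name, partition key property, vertex labels, and property names (including the externalId used later to link back to model elements) are assumptions made for the example, not the actual OTL database schema.

```python
from gremlin_python.driver import client, serializer

# Connection details are placeholders; a Cosmos DB Gremlin account exposes a
# wss endpoint and authenticates with /dbs/<db>/colls/<graph> plus a key.
gremlin_client = client.Client(
    "wss://<your-account>.gremlin.cosmos.azure.com:443/",
    "g",
    username="/dbs/<database>/colls/<graph>",
    password="<primary-key>",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# Vertex labels, property names, and the 'pk' partition key are assumptions
# for this sketch, not the actual OTL schema.
queries = [
    # Project node with name and version.
    "g.addV('project').property('id','proj-001')"
    ".property('name','Bridge A12').property('version','1').property('pk','proj-001')",
    # Model storage node (ACC in this case) linked to the project.
    "g.addV('modelStorage').property('id','storage-001')"
    ".property('kind','ACC').property('pk','proj-001')",
    "g.V('proj-001').addE('hasStorage').to(g.V('storage-001'))",
    # Decomposition object carrying the external ID used to link to a model element.
    "g.addV('object').property('id','obj-001').property('otlType','Deck')"
    ".property('externalId','<revit-element-guid>').property('pk','proj-001')",
    "g.V('proj-001').addE('hasObject').to(g.V('obj-001'))",
]

for q in queries:
    gremlin_client.submit(q).all().result()

gremlin_client.close()
```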
Now, I want to expand on the database structure that supports the calculation functionality, especially related to the 3D geometry. So how do we establish the connection between an OTL instance and the model? In the previous video, you saw that we have that connection established. But how do we manage to do that?
We're using the unique ID from the object to maintain the relationship between the model and the OTL instance. The reason is that this unique ID will not change even when the object is updated. So let's say a model object is updated; the object information, of course, is also updated. Because we have this connection, the attribute fields for those OTL instances will also be updated.
Yeah, so that was the data architecture behind the solution. But how do we actually extract those quantity values from the model and input them into the attribute field for each object in the OTL project? Before I start the introduction of the AEC Data Model API, which is what we use to extract the quantity values, I would like to share with you the old flow, how it actually worked before we integrated this API.
So here is the old OTL architecture overview, before we integrated the AEC Data Model API. In the past, we had a website on the cloud that included the more general functionality, like user management, project configuration, and project decomposition. We didn't have the calculation functionality on the cloud.
That's because this calculation functionality could only be developed locally, to extract the information from the model and do the calculation work. That's why you see at the top here the different connectors we developed. Now with the integration of the AEC Data Model API, we are able to bring this local functionality to the cloud, which reduces our development effort of maintaining the different connectors separately, as well as improving the user experience. And based on this, we can now build more functionality on the cloud.
So yeah, I've briefly mentioned that we integrated the AEC Data Model API. But what is it? The AEC Data Model API is an API that allows developers to read, write, and extend subsets of models through cloud-based workflows. In other words, you can query the model information however you want.
And here is the architecture overview of this API. You will find the same overview in the API documentation. From this overview, you can see that the AEC data model starts from a design. Each design stored in ACC will have different versions, which means that you can also create new versions.
And each design, of course, contains many, many objects; here they are called elements. Each element has its associated properties and property definitions. So by writing a query, you can retrieve a single element or whatever subset of the model you want.
And together with the API, they also developed the Data Model Explorer. It's an interactive browser-based user interface, which I use very often to generate my own queries and validate the results before I implement them in code. I believe there is another class that talks in more detail about this API, so I won't dive further here. But before you start using the API, I'd like to mention the API preparation itself.
So this API works for Revit 2024 models that are uploaded to ACC. Before you use the API, you need to have the model ready on ACC first, and then you can retrieve the model information. So yeah, that was the introduction to the API itself.
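In practice, calling the API from code comes down to posting GraphQL queries with a valid APS access token. Below is a minimal helper sketch in Python; the endpoint URL and the token handling are assumptions to verify against the current AEC Data Model API documentation.

```python
import requests

# Assumed AEC Data Model GraphQL endpoint; verify it against the current APS documentation.
AEC_DM_ENDPOINT = "https://developer.api.autodesk.com/aec/graphql"

def run_query(access_token, query, variables=None):
    """POST a GraphQL query to the AEC Data Model API and return the 'data' payload."""
    response = requests.post(
        AEC_DM_ENDPOINT,
        json={"query": query, "variables": variables or {}},
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()
    if "errors" in payload:  # GraphQL errors can come back with an HTTP 200 status
        raise RuntimeError(payload["errors"])
    return payload["data"]
```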
And in the OTL tooling, where do we use this API? And how do we use it? Within the OTL, we developed three queries for retrieving the model information.
The first query you see here basically retrieves the project models, in other words the project designs, with the project ID. On the left side, you see that I have a project ID provided, and on the right side, you see that I retrieved the model information with the model ID. With this model ID, I can go to the second query, which uses the model ID retrieved from the previous query, and get all of the elements of this specific model.
So if you recall the overview of the design, I'm now retrieving all of the elements of a certain design of a certain model. One important piece of information here is the external ID. In the data architecture part, I mentioned that we establish the connection between the OTL object and the model with this unique ID. So that's the ID we're retrieving here and using to establish the connection.
Now with this ID, I can go to the third query. Using the filter, I can narrow down to a specific element and retrieve all of the quantity values bound to this object, as you can see in the result on the right side. Now, what I really like about this AEC Data Model API is its powerful filtering functionality.
You already saw me using it in the previous queries. With this filter functionality, you can narrow the model information down to any subset you want, and you can also build your own search functionality on top of it.
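Putting those three queries together, the workflow could look roughly like the sketch below, reusing the run_query helper from before. The GraphQL field and filter names (designsByProject, elementsByDesign, and so on) are illustrative placeholders for the query shapes described here, not the exact schema; use the Data Model Explorer to confirm the real names before implementing.

```python
def extract_quantities(token, project_id, external_id):
    """Illustrative three-query flow; GraphQL field names are placeholders, not the real schema."""
    # Query 1: project ID -> the designs (models) in that project.
    designs = run_query(token, """
        query ($projectId: ID!) {
          designsByProject(projectId: $projectId) { results { id name } }
        }""", {"projectId": project_id})["designsByProject"]["results"]
    model_id = designs[0]["id"]

    # Query 2: design ID -> all elements, including the external ID that links to OTL instances.
    elements = run_query(token, """
        query ($designId: ID!) {
          elementsByDesign(designId: $designId) { results { id name externalId } }
        }""", {"designId": model_id})["elementsByDesign"]["results"]
    target = next(e for e in elements if e["externalId"] == external_id)

    # Query 3: filter down to that single element and read its quantity properties.
    result = run_query(token, """
        query ($designId: ID!, $elementId: ID!) {
          elementsByDesign(designId: $designId, filter: {elementId: $elementId}) {
            results { properties { results { name value definition { units } } } }
          }
        }""", {"designId": model_id, "elementId": target["id"]})
    return result["elementsByDesign"]["results"][0]["properties"]["results"]
```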
So I've shown you how we extract the model information and map it to the right place for each object. Is that it for this whole data standardization process? No, we went further with automation, combining machine learning with the AEC Data Model API. So we come to the next step.
In the previous video, you saw that we have two manual steps. One is basically inputting a customized parameter in the model to specify the specific OTL type. The other manual step is the calculation mapping. Now with machine learning, we have developed an AI model, a type recognition model, that can predict a certain object to be a certain OTL type based on its features. How do we train this type recognition model?
So here is how we did it. First, we gather the standardized model data with the AEC Data Model API. From that manually input data that you see in the model, we get all of the objects with their types. Then we utilize machine learning to train the model to recognize the type based on what it has learned, and then we deploy this model and use it in the other solutions.
So, to share with you the first implementation of this AI model: we are still learning here, by the way, but currently we have integrated the first version of it. I'll share it with you in this video.
The flow works the same as what you saw in the previous video. But with this type recognition model integrated, we can now go to a project and select the model that we want to run the type recognition on. For example, I'm selecting a building here. And in this building, there is no manual step to input a specific parameter to say, OK, this object should belong to this OTL type. That step is now replaced by the AI model recognition.
With that, it scans the whole model, determines which object should be which OTL type, and then creates the whole list of instances based on the information it retrieved from the model, establishing the connection between the OTL objects and the model automatically. So all of the green you see here basically means that the model recognized the type of that object. And the rest remains the same.
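For a rough idea of what that automated recognition could look like in code, the sketch below loads a previously trained classifier and predicts an OTL type for each element retrieved from the model. The file names and feature columns are assumptions that mirror the training setup described later in this class.

```python
import joblib
import pandas as pd

# File names are illustrative; they would be whatever the training step saved.
clf = joblib.load("otl_type_recognition.joblib")             # trained classifier
feature_columns = joblib.load("otl_feature_columns.joblib")  # columns seen during training

def predict_otl_types(elements: pd.DataFrame) -> pd.Series:
    """Predict an OTL object type for each model element.

    `elements` is assumed to contain the same raw feature columns used in
    training (name, parameter, and type name).
    """
    features = pd.get_dummies(elements[["name", "parameter", "type_name"]])
    # Align with the training columns; dummies unseen at prediction time become 0.
    features = features.reindex(columns=feature_columns, fill_value=0)
    return pd.Series(clf.predict(features), index=elements.index, name="otl_type")
```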
So the next step for us is to also automate the calculation configuration mapping, so we can continue with the machine learning. So how do we actually train this type recognition model? Here is the process I use. First, I ask myself, OK, for this type recognition model, what's my goal? What's the purpose of this model?
Once the goal is clear, I start to retrieve the data from the different platforms and gather all the data. Once the data is there, I select a certain algorithm and fit the algorithm with the data that I prepared. And then I start training the model, and then test the model to see if it's indeed accurate or close to the accuracy target that I set at the beginning.
So, asking the right question: what to ask. Here, I have three steps to help you define your own goal. First, I ask myself, what's the main target? For this type recognition model, my target is to generate a predictive model that's capable of forecasting the type.
The next question is, what is the scope? What is the scope of those types? For me, in my case, the scope is just recognizing the OTL ontology object types. And the last question is, what is the performance target you want to achieve? For me, it was at least 80% accuracy when forecasting.
With all of these questions answered, I come to the training statement, which is: utilize the machine learning workflow to preprocess the Arcadis 3D model data, generating a predictive model capable of forecasting OTL types with an accuracy of at least 80%. This statement provides guidance throughout the whole model training process. So once the goal is set, the next step is to prepare the data.
At the beginning, I didn't believe that the data preparation would consume anywhere from 50% to 80% of the time. But now, I can assure you that the estimation is correct. So here is the data preparation process. First, I collect all of the data from the different platforms.
In my case, I just store all the data in an Excel sheet first. And then, next to that, I start the data cleaning, because some of the data we collected contains errors or no information. Those rows have to be removed from the data sheet first.
The next step is data transformation, which is basically encoding the categorical variables and performing other data transformations to make the data suitable. Once the data is cleaned up, we will move to the next step. But before moving on, I would like to mention two principles for data cleanup.
The first one is dealing with null data. The second one is about correlated data removal. Dealing with correlated data is an important consideration in data analytics and modeling. Highly correlated features will impact the performance of the model you are training, and there are numerous techniques for handling correlated data, so I won't dive into great detail here.
However, I would like to share with you some common methods of handling null data. Here, I have an example with numerical data. As you can see in the second row, Bob's age is null. There are three common ways to handle this null data.
The first way is to remove the entire row, so it won't appear in your data set. The second way is that if you have a domain-specific expert with you, they can advise a certain value for the null data. And the third way is that if you don't have an expert with you, you can replace the missing values by mean, median, or mode imputation. So how does that work?
Take this as an example. If I use mean imputation to replace this null value, it basically takes all of the known ages and divides by their count. In my case, I have three known ages, so I add up 25, 28, and 30, divide by 3, and get 27.67. That's the value I can use to replace the null value.
Another way to deal with it is median imputation, where you sort all of the known ages in order and take the middle value; for me, that's 28, so I put 28 in the null space. Now, that was an example with numerical data. What about categorical data? How do we deal with those null values?
In this example, you see that I have two columns, name and gender, in the table on the left side. Bob's gender is missing, so we can use mode imputation, which checks the whole gender column to see which value occurs most often. In my case, it's female, so we fill in female in the gender column for Bob.
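As a small, concrete illustration of these options, here is a pandas sketch that reproduces the age and gender examples above; the tiny DataFrame and the extra names in it are made up for the example.

```python
import pandas as pd

# Tiny illustrative table; only Bob comes from the example above, the rest is made up.
df = pd.DataFrame({
    "name":   ["Alice", "Bob", "Carol", "Dave"],
    "age":    [25, None, 28, 30],                   # Bob's age is null
    "gender": ["female", None, "female", "male"],   # Bob's gender is null
})

# Option 1: drop any row that contains a null value.
dropped = df.dropna()

# Options for the numerical column: mean or median imputation.
mean_filled   = df["age"].fillna(df["age"].mean())    # (25 + 28 + 30) / 3 = 27.67
median_filled = df["age"].fillna(df["age"].median())  # middle of 25, 28, 30 -> 28

# Mode imputation for the categorical column: the most frequent value wins.
mode_filled = df["gender"].fillna(df["gender"].mode()[0])  # "female"

print(mean_filled.round(2).tolist(), median_filled.tolist(), mode_filled.tolist())
```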
Yeah, so that was the data preparation. Now, with the data collected and prepared, we come to the third step in this whole machine learning process: we need to select the candidate algorithms. There are so many algorithms out there already. How do we select one?
I believe there are many different opinions about which factors are more important when it comes to algorithm selection. I'm sure you will also develop your own factors through experience. From my experience, there are three factors that help me narrow down the algorithm selection.
The first factor is the learning type. When I started the training of this type recognition model, I asked myself which learning type it is: is it a supervised learning type or an unsupervised learning type? For the type recognition model, we provide a list of data labeled with a certain OTL type, so that is supervised learning.
The second factor: I ask myself, what is the result I want to get from this predictive model? Is it a Boolean or is it a category? In my case, it was a category.
And the third factor is the complexity of the algorithm. You can start with a basic algorithm, or you can start with an enhanced algorithm. For me, I started with a basic algorithm. With all these factors taken into consideration, I narrowed my selection down to three candidate algorithms.
Once the candidate algorithms are chosen, I can start training the model. And here is the training process. First, I need to split the data that I prepared and cleaned into two sets. In my case, I split the data into a 70% training set and a 30% testing set.
However, you can also divide the data set 80/20 or 90/10 depending on your scenario, and you can also keep an extra data set for validation. After that, I start training the model and then evaluate the model to see if it's close to the goal that I set at the beginning.
So here is the code that I have for this whole model training. I trained this model in a Jupyter Notebook. For those who are new to machine learning, Jupyter Notebook might sound very new to you. It's basically an open-source web application that allows you to create and share documents containing live code and visualizations.
It's widely used for tasks like data cleaning, data transformation, and machine learning. JupyterLab, which I'm using, is an evolution of the Jupyter Notebook. You can find more information in their official documentation.
So let's have a look at the code. In my data set, I have an Excel file that consists of four columns: name, parameter, type name, and OTL object type. The first three columns are basically the features of each object, and the features define which type it belongs to. With that, I call the function get_dummies to do one-hot encoding for the feature columns.
Once that's done, I split the data into 70% and 30%. That's why you see test_size equal to 0.3, meaning the testing data set is 30%. And random_state equal to 42 sets up the random state, which is used for reproducibility. Once that's done, I also reshape my training data, because that was required by the algorithm I selected.
Once the data split is done, I have some additional code to verify that the data was actually split into the percentages I wanted. Then I start with the first candidate algorithm; in my case, it's a RandomForestClassifier. I use this algorithm to start the training. And once the training process is done, I come to the final part of validating this whole type recognition model, to see if the accuracy is close to the goal I set at the beginning.
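The notebook code itself is not reproduced in this handout, so here is a minimal reconstruction of the flow as described: load the Excel sheet, one-hot encode the three feature columns with get_dummies, split 70/30 with a random state of 42, train a RandomForestClassifier, and check the accuracy against the 80% target. The file name and column identifiers are assumptions based on the description above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the prepared data set; column names follow the description above
# (name, parameter, type name as features, OTL object type as the label).
data = pd.read_excel("otl_training_data.xlsx")

features = pd.get_dummies(data[["name", "parameter", "type_name"]])  # one-hot encoding
labels = data["otl_object_type"]

# 70% training / 30% testing split, fixed random state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # verify the split proportions

# First candidate algorithm: a random forest classifier.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Validate against the goal of at least 80% accuracy.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Accuracy: {accuracy:.2%}")
```

From here, a stratified split or cross-validation would be a natural refinement once this baseline runs.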
So we have come to the end of this class. You have now learned how to utilize a graph database on Azure for an ontology in AEC. You are now also able to apply the AEC Data Model API in your own solution. And you also know where to start with machine learning.
I hope this class was helpful, and I look forward to a future session with you. Thank you.