Introduction
2024 has really felt like the year of Generative AI. Everyday use of tools like Copilot has increased massively, and ChatGPT has become a household name. I have certainly embraced Gen AI to increase my general efficiency, both at work and in my personal life. However, one thing I had never previously been involved with was building a Gen AI platform - until now!
Whilst my company has been a specialist and pioneer in the AI space for a number of years now, recently more and more clients have been seeking the productionisation of their AI platforms (and specifically of Gen AI platforms). My personal opinion is that the introduction and use of Gen AI in everyday life has generally increased people’s confidence in AI, and businesses are feeling more comfortable embedding it into their day-to-day processes.
This year I have had the awesome experience of collaborating closely with my company’s AI team to build out an FMOps* (Foundation Model Operations) platform for a client, for the purpose of hosting a user-facing Gen AI data interrogation tool. Think ChatGPT, but for your own data!
*FMOps is essentially an extension of MLOps (Machine Learning Operations) and is used to productionise Gen AI platforms.
As a Data Engineer, this was a huge learning curve for me. However, I have a passion for Platform Engineering, Infrastructure, and Security, so why shouldn’t this extend to AI platforms? Whilst I learned a tonne, I also brought a fresh perspective to this AI project - the engineer’s perspective! So, here are 5 tips on designing Production-Ready Gen AI Platforms, from an engineer.
1. Understand Internet Access Requirements
Every environment’s security is different, and this can be for a variety of reasons. Organisations have different security postures, and may have to adhere to different regulations depending on the data they carry. As a general assumption, however, Production environments have increased network security. Increased network security and internet access do not easily coincide, although of course combining them is not impossible. From my (limited) experience, it is quite likely that a Gen AI platform will want to make use of publicly available data in order to enrich or enhance private data. This could mean using Azure services like Bing Search Service. Another requirement could be utilising open source resources, such as tiktoken encodings, which are stored and updated in a managed Azure Blob Storage Account and therefore require access to specific public blob.core.windows.net domains.
When designing a production data platform, my first instinct is to lock everything down. Anyone who knows me knows I love a hub-and-spoke architecture and a private endpoint! If internet access is something that is required, it is important to bake this into the architecture design from the start, as trying to add it in afterwards will likely lead to bad practices and shortcuts.
My recommendation is that Gen AI platforms should slot into a hub-and-spoke topology in the same way any other platform or application should (with the caveat described in point 2). In this design, all ingress and egress is controlled via the hub by use of a central firewall; no internet access should be allowed directly from the environment spoke network. In practical terms, this means:
- No Allow Outbound Internet Access rules directly on the platform’s Network Security Groups (NSGs).
- All platform subnets should have a route table attached, which routes all traffic to the IP address of the firewall. Don’t override or remove the routes.
- Restrict which internet domains can be accessed on which ports on the firewall. Be sure to document these required firewall rules, alongside their business justification.
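These firewall rules can even be documented as reviewable data in source control, so that every entry carries its business justification. A minimal Python sketch - the domains, ports, and justifications here are illustrative assumptions, not a definitive list for any real platform:

```python
# Hypothetical illustration: keep required firewall egress rules as reviewable
# data, with a business justification attached to every entry.
from dataclasses import dataclass


@dataclass(frozen=True)
class EgressRule:
    destination: str        # FQDN the platform needs to reach
    ports: tuple[int, ...]  # allowed destination ports
    justification: str      # why the rule exists - reviewed alongside the rule


# Example entries only; your platform's real list will differ.
REQUIRED_EGRESS = [
    EgressRule("api.bing.microsoft.com", (443,),
               "Bing Search enrichment of private data"),
    EgressRule("openaipublic.blob.core.windows.net", (443,),
               "tiktoken encoding files hosted in public blob storage"),
]


def undocumented(rules: list[EgressRule]) -> list[EgressRule]:
    """Return any rule missing a justification, so a review can fail fast."""
    return [r for r in rules if not r.justification.strip()]
```

A check like `undocumented` can run in CI, so a pull request adding an unjustified rule is rejected before it ever reaches the firewall.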
2. Using Prod Data in Lower Environments
A huge difference between AI platforms and Data Platforms used for engineering (such as lakehouses) is the use of production data in lower environments. Whilst this is sometimes a requirement for data engineering, it almost always is in the AI and ML space. This is because in order to train your models effectively, they need to be trained on the correct data, regardless of environment. This is not a new, FMOps-specific concept; it has been integral to MLOps for years.
There are multiple ways to address this problem, however the main takeaway from this point is to plan for it in the architecture. How you decide to share data across environments could have a significant impact on the design of the platform’s infrastructure. In a traditional hub-and-spoke network architecture, compute in Dev cannot (and should not) access data in Prod. This becomes a problem if you need to process Prod data in your Dev environment.
There are more qualified people out there than me to tell you about MLOps best practices, however what I will say is this:
If you are using Production data in an environment, then that environment should be considered a Production environment from a security perspective and adhere to Production security controls.
This means that if you have 3 environments (perhaps labeled Dev, Test, and Prod) which all store and process Production data, then you have 3 environments with production-level security requirements.
In terms of how your compute can access Production data, there are obviously multiple ways to handle this from an infrastructure perspective. Some options are:
- You may decide to separate the infrastructure and model lifecycles, which is a common approach within MLOps. This could mean having separate Dev, Test, and Prod environments (e.g. with separate spoke Virtual Networks (VNets), resource groups, etc.) with standard CI/CD processes in order to test platform-level changes, whilst also having additional Dev and Test “mini environments” within the Prod platform in order to test and train models. These Dev and Test mini environments are duplicate versions of whatever tool is being used for the platform (Azure AI Services, Databricks, an application hosted in a Function App), but they live inside the main Production spoke and therefore, crucially, inside the Production network, with access to Production data and secured with Production-level controls. This approach may be overkill for a Gen AI platform, however, since these mostly make use of pre-trained models such as GPT-4.
- You could duplicate the data across all three environments. Before doing this, consider whether it is necessary, as it will duplicate your storage costs. The answer may well be yes. For example, you may need Production data in Dev in order to test that your code correctly interprets results. You may need Production data in Test in order to perform a business-facing User Acceptance Testing (UAT) phase. If you choose this approach, you must consider within the design how you are going to ingest the data into all of the environments. Often these platforms require huge amounts of data to be effective, and moving that amount of data can be expensive. Additionally, if your environments are networked correctly, they will not have network access to each other, so moving data from Dev to Prod directly, or vice versa, may not be possible. Additional Data Engineering work may be required here (this isn’t an upsell, I promise!).
A similar but cheaper and more streamlined approach would be to use targeted sample data in lower environments, combined with a set of targeted test prompt questions. This is usually sufficient for both technical testing and UAT.
- The other option is to only store your data in the Production environment, but allow network access to the Production data from your Dev and Test environments. If doing this, I recommend the following:
- Ensure the Dev and Test environments are still treated as Production-level environments, since they have access to Production data. Carefully consider user access.
- Restrict the firewall rules as much as possible. Consider the source compute required to access the data. If possible, restrict to specific subnets or even IP addresses, so that not everything within the Dev and Test networks can access the Production data by default.
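If you go down the targeted-sample route described above, the sampling itself can be simple and deterministic, so every environment tests against the same slice. A hypothetical sketch - the record fields, category names, and prompts are all made up for illustration:

```python
# Hypothetical sketch: build a small, targeted sample per document category
# for lower environments, instead of copying full Production data.
import random
from collections import defaultdict


def targeted_sample(records: list[dict], per_category: int = 5,
                    seed: int = 42) -> list[dict]:
    """Pick up to `per_category` records from each category, deterministically."""
    rng = random.Random(seed)  # fixed seed -> same sample on every run
    by_category: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record)
    sample: list[dict] = []
    for _, items in sorted(by_category.items()):
        sample.extend(rng.sample(items, min(per_category, len(items))))
    return sample


# Fixed prompts reused for both technical testing and UAT.
TEST_PROMPTS = [
    "Summarise the latest invoice for customer X.",
    "Which contracts expire this quarter?",
]
```

Pairing a stable sample with a fixed prompt set means test results are comparable between runs, which is exactly what a UAT phase needs.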
3. Consider End-User Front-End Application Access
Another difference between AI platforms and Data Platforms is front-ends! Occasionally in the Data Engineering space there are requirements for User Interfaces (UIs), however I have never had to build a web app*. For Gen AI platforms this is quite a crucial component, since for the most part these platforms are designed for end-user interaction.
The introduction of a web app sounds cool and definitely appeals to clients, however deploying and hosting production web apps is not a walk in the park. Some considerations:
- It’s important to understand whether your web app will be containerised. It is my understanding that in the AI space, they mostly are. Consider the infrastructure implications:
- Where will the image be hosted? Do you need to introduce a container registry into the architecture?
- How will the image be built and pushed to the registry? Do you need to create a CI/CD pipeline for the image? Consider that if your platform is within a private network, self-hosted build agents are required for deployments. If using Docker, do you have a Linux agent that you can use?
- How do you ensure the web app can pull the image? Consider whether you will need a Managed Identity (MI) for the web app, what Role-based Access Control (RBAC) roles it will need in the registry, and whether it has network access.
- In a production setting, it’s crucial that Authentication and Authorisation are handled correctly for the web app, so that only genuine users within the business can use it, and can only perform the operations they are allowed to. Consider the infrastructure implications:
- If using Azure App Service, there are properties that can be set on the app service itself which specify the identity provider. I recommend using Entra. If using Entra, you must then configure the required Application Registration which will be used by your web app to authenticate users. This is required on a per environment basis.
- For Authorisation, you must define roles within your Entra App Registration and assign users to these roles (ideally via Entra Security Groups). These roles then need to be inspected and handled within the web app’s application code itself.
*I should clarify here that I most definitely did not build a web app, I just deployed an empty app service and configured the authentication!
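Once the Entra roles arrive in the validated token’s “roles” claim, handling them in application code can be as simple as a guard before each privileged operation. A framework-agnostic sketch - the role names and operations are made-up examples, not a prescribed scheme:

```python
# Hypothetical sketch of authorisation handling: roles defined on the Entra
# App Registration arrive in the "roles" claim of the validated token, and
# the app checks them before allowing an operation. Role names are made up.
class Forbidden(Exception):
    """Raised when the caller lacks a required role."""


def require_role(claims: dict, required: str) -> None:
    """Raise Forbidden unless the token's roles claim contains `required`."""
    if required not in claims.get("roles", []):
        raise Forbidden(f"missing role: {required}")


def ask_question(claims: dict, prompt: str) -> str:
    require_role(claims, "ChatUser")   # any genuine end user may ask questions
    return f"answering: {prompt}"      # placeholder for the real model call


def reindex_data(claims: dict) -> str:
    require_role(claims, "ChatAdmin")  # admin-only operation
    return "reindex started"
```

The key design point is that the check happens server-side against the token, not in the UI - hiding a button in the front-end is not authorisation.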
4. Plan Your Network Address Space
Planning your network address space is not unique to Gen AI platforms; it is an important part of designing any production platform. It is crucial to know which components will make up the platform during the design phase, so that the address space can be allocated appropriately. Remember: all compute needs address space! So the first step is to understand how much compute is required, which can be a tricky task. This could involve right-sizing your Databricks clusters, or deciding how many scale units you may need for your Function Apps. Whatever the technology, if it uses compute, we must understand the required size.
Whatever the estimated sizing requirement, always plan for scaling.
As an example:
Function Apps always have one inbound IP address and, by default, a small handful of outbound IP addresses (roughly 4-8). Therefore let’s assume a maximum of 9 IP addresses is needed. When scaling out Function Apps, Azure may need to add more outbound IP addresses. Assuming we want to scale out by a handful of units, let’s triple the required outbound IPs. This gives us 25 IPs (24 outbound plus 1 inbound), meaning a /27 sized subnet - giving 27 usable IP addresses, since Azure reserves 5 addresses in every subnet - would give us enough space for the app plus a decent amount of space for scaling, without being too wasteful.
That was just an example of one app; it is important to consider the platform as a whole: perhaps you’re using Function Apps, Databricks, and everything is Private Endpoint-ed (each Private Endpoint is one IP address). Understanding the sizing requirements of all of this holistically enables you to get a good estimation of the required VNet size, as well as an appropriate subnet split.
This is my favourite tool for subnet splitting.
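The arithmetic in the example above generalises: given an estimated IP requirement, you can compute the smallest Azure subnet that fits, remembering that Azure reserves 5 addresses in every subnet. A small sketch:

```python
# Sketch: find the smallest Azure subnet that satisfies an IP requirement.
# Azure reserves 5 addresses per subnet (network address, gateway, two DNS
# addresses, and broadcast), so a /p subnet yields 2**(32 - p) - 5 usable IPs.
def usable_ips(prefix: int) -> int:
    """Usable addresses in an Azure subnet of the given prefix length."""
    return 2 ** (32 - prefix) - 5


def smallest_subnet(required_ips: int) -> int:
    """Return the largest prefix (i.e. smallest subnet) with enough usable IPs."""
    for prefix in range(29, 0, -1):  # /29 is the smallest subnet Azure allows
        if usable_ips(prefix) >= required_ips:
            return prefix
    raise ValueError("requirement exceeds available address space")
```

For the Function App example, `smallest_subnet(25)` lands on a /27 - and running the same calculation per component gives you the subnet split, and therefore the overall VNet size.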
5. Don’t Neglect the DevOps Part of FMOps
As with any platform, DevOps is a key part of productionisation. My top tips are:
- Use a good branching strategy and stick to it. I recommend GitFlow: isolate feature code into feature branches and maintain a clean main branch. Use a development branch as a gateway between features and main.
- Use a good naming convention for branches, such as including your initials and work item ID to increase auditability.
- Implement CI/CD pipelines and prevent local deployments. All deployments should be monitored and auditable. This goes for all data plane application code as well as infrastructure.
- Implement security controls, such as a minimum of 2 required reviewers on Pull Requests into the main branch, and approval gates on deployments to Production.
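A branch naming convention like the one above can even be enforced automatically as a CI check. A toy sketch - the exact pattern (initials, then a work item ID, then a short description) is an assumption, so adapt it to your own convention:

```python
# Hypothetical sketch: validate branch names of the form
# feature/<initials>-<work item ID>-<description>, e.g. feature/jd-1234-add-auth.
import re

BRANCH_PATTERN = re.compile(r"^feature/[a-z]{2,3}-\d+-[a-z0-9-]+$")


def is_valid_branch(name: str) -> bool:
    """Return True if the branch name follows the team's naming convention."""
    return bool(BRANCH_PATTERN.match(name))
```

Wired into a pipeline as a failing check, this makes the convention self-enforcing: the work item ID in every branch name gives you the audit trail for free.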
Conclusion
These are my 5 tips for designing Production-ready Gen AI Platforms! Gen AI is flashy and fun, but - unfortunately - a production platform always needs the “boring” foundations of robust security and solid processes.
P.S. I’d like to thank the Advancing Analytics AI Team for teaching me so much about Gen AI over the last few months!