As AI and machine learning capabilities are increasingly viewed as critical to driving innovation and efficiency at companies of all sizes, organizations must determine how best to provide their application developers with access to those capabilities. In many instances, an AI inference service will be used—that being: “a service which invokes AI models to produce an output based on a given input” (such as predict a classification or generate text), most typically in response to an API call from other application code.
These inference services may be built and consumed in a wide variety of ways (even within a single organization). These services may be developed entirely in-house or be consumed as a SaaS, leverage open-source projects, and use open or commercial models. They may come as part of a fully functional “Machine Learning Platform” that supports wider data science and model creation efforts, or they might provide you with nothing but an API endpoint to invoke the models. They might run in a private data center, on a public cloud, in a colocation facility, or as a mix of these. Larger organizations are likely to rely on more than one single option for AI inference (especially at lower levels of organizational maturity in this space). While there’s no one-size-fits-all approach that will work for everybody, a handful of common patterns are emerging.
We’ll examine three broad AI inference consumption patterns, as well as their various strengths and weaknesses:
When evaluating the tradeoffs between the following operating patterns, there are a few key factors to keep in mind:
Organizations may choose to consume services from a dedicated SaaS provider which focuses on hosting AI models for inference (such as OpenAI or Cohere). In this case, workloads running on an organization’s infrastructure (be it private data center, colocation, or public cloud) leverage APIs published by the SaaS provider. With this pattern, the burden associated with operating the infrastructure required for hosting the models falls entirely on the SaaS provider, and the time it takes to get up and running can be very brief. These benefits, however, come at the cost of control, with this pattern being the least “flexible” of the ones we’ll cover. Likewise control over sensitive data is typically lowest with this model, where the public internet is usually used to connect with the service, and the external SaaS vendor is less likely to be able to address stringent regulatory compliance concerns.
In this pattern, organizations leverage managed services provided by public cloud providers (for example, Azure OpenAI, Google Vertex, AWS Bedrock). As with the first pattern, the operational burden for deploying, scaling, and maintaining the AI models falls on the Cloud Provider rather than the consuming organization. The major difference between this pattern and the SaaS pattern above, is that in this case the inference API endpoints are generally accessible via private networks, and AI consumer workloads can be collocated with the AI Model service (e.g., same zone, region). As a result, the data typically doesn’t transit the public internet, latency is likely to be lower, and mechanisms for connecting to these services are similar to those for other cloud provider managed services (i.e., service accounts, access policies, network ACLs). Organizations leveraging multi-cloud networking may be able to realize some of these same benefits even in cases where workloads and AI inference services are hosted in different clouds.
For the self-managed model, we’ll first discuss the case where AI model inference is run as a centralized “shared service,” then examine the case where no centralized service exists, and operational concerns are left up to individual application teams.
In the “Shared Service” variant of the Self-Managed pattern, an organization may have dedicated team(s) responsible for maintaining inference infrastructure, the models that run on it, the APIs used to connect to it, and all operational aspects of the inference service lifecycle. As with the other patterns, AI application workloads would consume inference services through API calls. The model serving infrastructure might run in private data centers, public cloud, or at a colocation facility. The teams responsible for operating the inference service may take advantage of self-hosted machine learning platforms (such as Anyscale Ray or MLFlow). Alternatively, they might rely on more narrowly focused inference serving tools like Hugging Face’s Text Generation Inference server or software developed in-house. With self-managed inference, organizations are limited to using models that either they’ve trained themselves or are available to run locally (so proprietary models from a SaaS oriented service may not be an option).
Organizations without a central team responsible for running AI inference services for applications to consume is the other variant of the Self-Managed pattern. In this version, individual application teams might run their models on the same infrastructure that’s been allocated to them for the application workload. They may choose to access those models via API or consume them directly in an “offline” fashion. With this pattern, developers may be able to realize some of the same benefits as the “Shared Service” self-managed model at a smaller scale.
At their core, AI applications are a lot like any other modern app. They’re built with a mix of user-facing and backend components, often rely on external data stores, make heavy use of API calls, etc. These applications inherit the same security challenges common to all modern apps and need the same protections applied in the same manner (e.g., AuthNZ, DDoS protections, WAF, secure development practices, etc.). However, AI apps, and especially those leveraging generative AI, are also subject to a number of unique concerns that may not apply to other applications (for example, see OWASP Top 10 For LLMs). The generally unpredictable nature of these models can lead to novel problems, especially in use cases where the models are given significant agency.
Fortunately, due to the heavy reliance on API-driven interactions with AI model inference in the patterns discussed above, many organizations are already well positioned to implement these new security measures. For example, an API gateway that’s providing traditional rate limiting, authentication, authorization, and traffic management functionality might be extended to support token-based rate limits to help with cost control (either outbound at the app level to a SaaS/provider managed service or inbound RBAC authorization as a protection in front of a self-managed inference service). Likewise, infrastructure that’s currently performing traditional Web Application Firewall (WAF) services like checking for SQL injection or XSS, might be a logical place to add protections against prompt injection and similar attacks. Most modern observability practices are directly applicable to API-driven AI inference specific use cases, and teams should be able to make good use of existing tooling and practices in this space.
While the buzz surrounding AI powered applications and inference services is certainly new, the foundational principles at play when deploying and securing them have been around for a long time. Organizations will need to weigh tradeoffs as they determine how best to leverage AI capabilities, much as they do when consuming any other technology; based on their specific use cases, available resources, and long-term goals. Similarly, even though AI (and particularly generative AI) use cases do present some novel security challenges, the needs they share with other modern applications and the heavy reliance on API driven model interaction presents a great opportunity to use and extend existing patterns, practices, and tooling. Whether self-hosted or managed service, ensuring developers have access to a secure, scalable, performant solution that meets your specific business requirements is critical to maximizing the value and impact of your AI investments.