Backend#6: AI inference, AI gateway, Cloudflare Workers AI, serverless observability issues and more
Cloudflare Workers AI
Cloudflare recently leveled up Workers AI and AI Gateway, making them generally available and augmenting them with new capabilities.
Until now, to leverage the power of LLMs (Large Language Models), we either had to train and deploy models ourselves, managing the infrastructure, or rely on a proprietary AI platform running closed models, moving our data over to their servers in the process.
Cloudflare Workers AI is a serverless, GPU-powered AI inference service that empowers devs to run open-source AI models (for tasks such as image classification, text generation, object detection, etc.) on Cloudflare's global network of GPUs. You can browse the available models here.
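To make this concrete, here's a minimal sketch of what calling a text-generation model from a Worker could look like. The binding name, model ID and types are illustrative assumptions based on Cloudflare's documented Workers AI binding, so treat it as a sketch rather than copy-paste code.

```typescript
// Minimal sketch of a Worker calling a Workers AI text-generation model.
// The binding name "AI" and the model ID are illustrative; the binding is
// configured in wrangler.toml and models are picked from the catalog.
export interface Env {
  AI: Ai; // Workers AI binding (type from @cloudflare/workers-types)
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = (await request.json()) as { prompt: string };

    // Inference runs on Cloudflare's GPU network; we manage no model servers.
    const result = await env.AI.run("@cf/meta/llama-3-8b-instruct", { prompt });

    return Response.json(result);
  },
};
```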
What is AI inference?
AI inference
AI inference refers to a deployed, pre-trained AI model producing predictions or conclusions, for instance, a text-based LLM pre-trained on hundreds of thousands of poems generating a new poem. This is primarily what all popular large language models do.
Training is the first phase of an AI model's lifecycle. After being trained on extensive data and going through trial and error and fine-tuning, the model is ready to make inferences. The better trained the model is, the better it will infer.
Similar to data processing, there are two approaches to generating inferences: real-time and batch.
Real-time inference models produce output in real time, like ChatGPT or card payment fraud detection models. In contrast, with batch inference, the inferences are generated offline over batches of data, for instance, movie recommendations based on what a user has watched over a period of time, business analytics and forecasting, etc.
The two approaches may call for different system designs, with real-time inference typically demanding very low latency, data caching and the like.
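As a rough, hypothetical sketch of how the two shapes differ (the functions below are stand-ins, not a real API): a real-time path answers a single request synchronously, while a batch path walks over accumulated data offline.

```typescript
// Hypothetical stand-ins for a model call and data access, just to show shape.
declare function callModel(model: string, input: unknown): Promise<unknown>;
declare function loadWatchHistory(userId: string): Promise<unknown>;
declare function storeRecommendations(userId: string, recs: unknown): Promise<void>;

// Real-time: one request in, one inference out; latency-sensitive.
async function handleFraudCheck(payment: { amount: number; card: string }) {
  return callModel("fraud-detector", payment); // answer returned to the caller immediately
}

// Batch: runs offline over accumulated data; latency-tolerant.
async function nightlyRecommendations(userIds: string[]) {
  for (const userId of userIds) {
    const history = await loadWatchHistory(userId);      // gather a batch of data
    const recs = await callModel("recommender", history); // infer offline
    await storeRecommendations(userId, recs);             // persist for later use
  }
}
```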
Cloudflare AI Gateway
Cloudflare AI Gateway is now generally available as well, alongside Workers AI.
The AI gateway, much like an API gateway, acts as a single point of entry for requests to AI models. It intercepts requests to cache responses, apply rate limits and retries, provide analytics and more.
An AI gateway acts as a single, consistent control plane for your AI services, regardless of how many different models are used on the backend.
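In practice, putting AI Gateway in front of an existing provider is largely a matter of pointing your client at the gateway URL. Here's a sketch using the OpenAI SDK with placeholder account and gateway IDs; check Cloudflare's AI Gateway docs for the exact endpoint format for your provider.

```typescript
// Sketch: routing OpenAI requests through a Cloudflare AI Gateway by overriding
// the SDK's base URL. ACCOUNT_ID and GATEWAY_ID are placeholders.
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  // Requests now pass through the gateway, which can cache responses,
  // apply rate limits and retries, and record analytics before hitting the provider.
  baseURL: "https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_ID/openai",
});

const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Hello through the gateway" }],
});

console.log(completion.choices[0].message.content);
```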
Next up:
Serverless observability is hard. Why?
Serverless observability is challenging due to the inherent nature of serverless architectures. With the infrastructure abstracted away, traditional infrastructure-level metrics (CPU, memory usage, etc.) are mostly unavailable, limiting our visibility into the system's performance.
The functions are primarily stateless and short-lived, making it difficult to track their state over time. In addition, services can be composed of hundreds or even thousands of serverless functions, making it all the more challenging to trace a request's flow. Cold starts can further skew performance metrics.
Imagine 2000+ serverless functions talking to each other through HTTP calls, queues, notification systems, event buses and database streams, all in a giant system impossible to map out.
Baselime, an observability and error-tracking platform, shares its insights on why serverless observability is so hard. Do give it a read.
Baselime is designed for high-cardinality and high-dimensionality data, from logs to distributed tracing, helping engineering teams understand the behavior of their cloud applications by making observability as easy as possible to implement.
Btw, what are high cardinality and high dimensionality in observability data?
High cardinality
Cardinality in observability refers to the number of unique values a given field or metric can take. For instance, if our system is hit with requests from a large number of unique users, it will log unique userIDs, IP addresses, sessionIDs, transactionIDs, etc. Since each of these fields has a huge number of distinct values, this is high cardinality data.
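To make that tangible, here's an illustrative structured log event (the field names and values are made up); every one of these fields can take a practically unbounded number of distinct values, which is what makes the data high cardinality.

```typescript
// Illustrative structured log event with several high-cardinality fields.
const logEvent = {
  timestamp: new Date().toISOString(),
  level: "info",
  message: "checkout completed",
  userId: "usr_8f3a21",        // unique per user
  sessionId: "sess_b91c7d",    // unique per session
  transactionId: "txn_47e0ff", // unique per transaction
  clientIp: "203.0.113.42",    // very large value space
};

console.log(JSON.stringify(logEvent));
```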
In system observability, data cardinality significantly impacts the cost and performance of monitoring systems.
For a detailed read on observability, check out my newsletter post on observability-driven development.
How does high cardinality data impact costs and performance?
High cardinality metrics require significant storage space. Each unique value generates a new time series, thus requiring more storage space.
A time series is a sequence of data points recorded over time. With unique users, every userID gets its own time series, and each unique time series requires its own storage, driving up storage costs.
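A rough back-of-the-envelope illustration (the numbers are hypothetical): the number of time series for a metric is the product of the unique values of each label attached to it, so a single high-cardinality label blows the count up quickly.

```typescript
// Hypothetical label cardinalities for a single metric, e.g. request latency.
const uniqueValues = {
  region: 5,
  endpoint: 50,
  userId: 100_000, // the high-cardinality label
};

// Every combination of label values is a separate time series to store.
const seriesCount = Object.values(uniqueValues).reduce((a, b) => a * b, 1);

console.log(seriesCount); // 25,000,000 series from a single metric
```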
Also, the increase in data puts additional load on the monitoring system, increasing complexity, requiring more memory and compute, and making observability queries slower.
Systems built to handle high cardinality data leverage techniques such as sampling (collecting and processing a subset of the data rather than all of it), aggregating data points over time to reduce the number of unique time series, filtering data down to only the key metrics, structuring related metrics in a hierarchy, and so on, to handle large volumes of observability data efficiently.
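As one small example of those techniques, here's a head-sampling sketch (the 10% rate and the event shape are arbitrary) that keeps only a fraction of events so the backend stores far fewer data points.

```typescript
// Minimal head-sampling sketch: keep ~10% of events, drop the rest up front.
const SAMPLE_RATE = 0.1; // arbitrary example rate

function record(event: Record<string, unknown>): void {
  if (Math.random() >= SAMPLE_RATE) return; // drop ~90% of events
  // Forward the sampled event to the observability backend (stubbed here).
  console.log(JSON.stringify(event));
}

record({ userId: "usr_8f3a21", endpoint: "/checkout", durationMs: 182 });
```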
Though high cardinality data impacts performance, it makes debugging easier by providing a more vivid picture of the system. When an incident happens, the speed of locating and fixing it is imperative; that demands deep system visibility, and this is where high cardinality data shines.
High dimensionality
High dimensionality refers to the number of different dimensions, or attributes, attached to the data. Dimensions provide additional context to metrics, logs and traces, and help in filtering, grouping and analyzing data more effectively.
For instance, we can attach a few dimensions to every metric a microservice emits, such as instanceID, data center, cloud region, environment, HTTP method, endpoint, transactionID, etc., making it high dimensional.
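An illustrative metric data point with such dimensions attached might look like this (names and values are examples); each dimension becomes something you can filter or group by.

```typescript
// Illustrative metric data point enriched with dimensions for filtering/grouping.
const metricPoint = {
  name: "http_request_duration_ms",
  value: 182,
  dimensions: {
    instanceId: "i-0abc123",
    dataCenter: "fra1",
    cloudRegion: "eu-central-1",
    environment: "production",
    httpMethod: "POST",
    endpoint: "/checkout",
  },
};

console.log(JSON.stringify(metricPoint));
```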
The pros and cons of dealing with high-dimensionality data are largely the same as those of high-cardinality data.
I'll wrap up the post here. If you found it insightful, please do share it with your network for more reach. You can connect with me on LinkedIn & X, in addition to replying to this email.
If you wish to learn software architecture from the bare bones, check out the Zero to Software Architecture Proficiency learning path I've authored that educates you, step by step, on the domain of software architecture, cloud infrastructure and distributed system design.
I'll see you around in my next post. Until then, Cheers!