Photo by https://unsplash.com/@shanerounce

How we got here

LLMs are not the be-all & end-all of artificial intelligence, though that's pretty much how it's been hyped in recent memory (from around 2022/2023 when chatgpt became synonymous with everything AI in common parlance).

We got an interface to communicate in natural language, where the system creates contextual answers from a foundation model that has scoured the internet, trained on it's data to provide probabalistic appropriate responses on a best effort basis.

This is almost the best way to achieve simplicity in terms of how to use a product, extracting out complexities, the knobs & params, and just interact in plain language without any domain specific incantations.

And so began its evolution in business.

Initially it was about how performant and correct the incremental models became.

Most users were just wanting that novel feeling of driving a conversation on the (T/G)UI.

The AI behemoths started focusing on building AI infrastructure to provision for it's heavy demand and expected widespread usage.

Meanwhile companies have been embedding LLM capabilities in their products and creating greenfield business propositions sometimes solely banking on AI play.

Something missing

Eventually, we began noticing limitations of a purely static & central knowledge system.

Enterprises need to be selective what data could be shared with these models, if any at all.
How to integrate prompt calls in different applications, is the interfacing too custom, is it secure enough?

LLMs can suggest action items, but no scope of executing autonomously.
Most practical use cases need to break the loop to make external interactions and if anything requires follow up, must be fed back to the LLM again to continue on.

If the model hallucinates, what can be done to reaffirm and align with the user's requirements/intention reliably.

Through prompts (text) how do we map it to correct programmatic actions, intended targets or APIs reliably.

Protocoling the way to develop AI apps

Somewhere down the line, it was a natural step:
A standard process for allowing models to iteratively take predefined actions, and thus become a native way to fit LLM utilization in AI applications.

MCP (Model Context Protocol) provided that: JSON-RPC comms for AI applications to utilise specific tools/functions provided from target systems.
This paves the way for further automation where AI is not dependent on just static information/training and can access stuff externally.

However, building complex workflows can become quite token heavy since there are set instructions at client that load valid, available tools from server into the LLM context, even if the workflow may not need it upfront or use all of them.
Then there are intermediate results/processing from one MCP tool to another that are again passed through the context.

The way forward?

To overcome this problem, Anthropic proposed an approach wherein the agent can decide what tool/code is needed to interface with server capabilities at any given point, via 'skill' based intents mentioned in a file.

It banks on the premise, that agent/LLMs are good enough to generate code, traverse files/folders to realise which tool/s are needed at any juncture and process the interaction on it's own (albeit with some human oversight as the disclaimer always is).

This models what the request/response is going to be about, not how it must be done. This intent based abstraction gives the idea and guardrails for agents to proceed and take tool action when necessary.

https://platform.claude.com/docs/images/agent-skills-architecture.png

This seems closer to the ideal concept of intent based API (in natural language too).

What is the desired state/action is more apparent and modelled as a contract of sorts, instead of procedural/rigid steps where client code is tightly coupled with the domain logic.

So this might be where we are heading to, more natural pragmatic contract designs for LLMs to base their actions on, chained as a series of capabilities left to the agent to decide when to use/create.


Elephant/s in the room 🐘🐘

There are elephant/s in the room though in general for all agentic automation.

Since we expect instructions/intents to be in natural language mostly, what is instruction, and what is content (data) gets more blurred than it usually is when dealing with ex - say HTML / javascript where there are atleast some measures built over the years to distinguish executables.

This leaves fundamental security risks from potential hidden instructions masquerading as content in some website/github repo etc, allowed to be accessed via the flexible agents.

Here again, agent will try to honour the 'intent' (even from a malicious external prompt).

It will be interesting to see how this space evolves.

While the 'skills' model seems a great way of adding flexibility and cleaner intent instead of following predefined logic,
the same move reopens the data-instruction boundary problem,
and securing such systems become a huge challenge, especially with the industry pushing to make autonomous, privileged agents widely used in tech asap.