Apple launched "Apple Intelligence" yesterday – their take on AI.
I want to zoom in on the new Siri, but first here's my mental model of the whole thing.
Overview
Here's the Apple Intelligence marketing page. Lots of pics!
Here's the Apple Intelligence press release. It's an easy read too.
Apple Intelligence is (a) a platform and (b) a bundle of user-facing features.
The platform is Apple's take on AI infra to meet their values – on-device models, Private Cloud Compute, and the rest.
The user-facing features fall into five buckets:
- Generation/Summarisation. Bounded to avoid hallucination, deepfakes, and IP risks (no making a picture in the style of a particular artist).
- Agents. This is what underpins Siri: on-device tasks using high personal context. (They call it "orchestration.")
- Natural interfaces. Voice, handwriting, nodding/shaking the head with AirPods Pro.
- Do what I mean. This is a combination of gen-AI and traditional ML: recognising people in photos, knowing which notifications are important, spotting salient data in emails.
- World knowledge. Cautiously delivered as an integration with ChatGPT – think of this as web search++. Also used to turbo-charge text and image generation, if the user opts in.
Buckets 1–4 are delivered using Apple's own models.
Apple's terminology distinguishes between "personal intelligence," on-device and under their control, and "world knowledge," which is prone to hallucinations – but is also what consumers expect when they use AI, and it's what may replace Google search as the "point of first intent" one day soon.
It's wise for them to keep world knowledge separate, behind a very clear gate, but still engage with it. Protects the brand and hedges their bets.
There are also a couple of early experiments:
- Attach points for inter-op. How do you integrate your own image generation models? How could the user choose their own chatbot? There's a promise to allow integration of models other than OpenAI's GPT-4o.
- Copilots. A copilot is an AI UX that is deeply integrated into an app, allowing for context-aware generation and refinement, chat, app-specific actions, and more. There's the beginning of a copilot UX in Xcode in the form of Swift Assist – I'd love to see this across the OS eventually.
A few areas weren't touched on:
- Multiplayer. I feel like solving for multiplayer is a prerequisite for really great human-AI collaboration, and their app Freeform looks like a sandbox for it.
- Long-running or off-device agent tasks. Say, booking a restaurant. That's where Google Assistant ran to. But having taken a stab at this in old client projects, I'm of the opinion that we'll need whole new UX primitives to do a good job of it. (Progress bars??)
- Character/vibe. Large language models have personality, and people love chatting with them. ChatGPT has a vibe and character.ai is hugely popular… but nobody really talks about this. I think it's awkwardly close to virtual-girlfriend territory? Still, Anthropic are taking character seriously now, so I'm hopeful for some real research in this area.
- Refining, tuning, steering. Note that Apple's main use cases are prompt-led and one-and-done. Steering is a cutting-edge research topic with barely-understood tech, let alone UX; there are hard problems.
Gotta leave something for iOS 19.
Architecture
Someone shared the Apple Intelligence high-level architecture – I snagged it as it went by on the socials but forget who shared it, sorry.
Here's the architecture slide.
The boxes I want to point out, so I can come back to them in a sec:
- Semantic index. This must be something like a vector database with embeddings of all your texts, emails, appointments, and so on. Your personal context. I talked about embeddings the other day – imagine a really effective search engine that you can query by meaning.
- App Intents toolbox. That's the list of functions or tools offered by all the apps on your phone, and whatever else is required to make it work. Apple apps now, but open to everyone.
- Orchestration. That's the agent runtime: the part that takes a user request, breaks it into actions, and performs them. I imagine this serves both generation tasks, which take a number of operations behind the scenes, and the more obvious multi-step agent tasks via Siri.
What's neat about the Apple Intelligence platform is how clearly buildable it all is.
Each component is straightforwardly specific (we know what a vector database is), improvable over time along an obvious gradient (you can put an engineering team on making generation real-time and they'll manage themselves), and scalable across the ecosystem and future features (it's obvious how App Intents could be extended to the entire App Store).
A very deft architecture.
And the user-facing features are chosen to minimise hallucination, avoid prompt injection/data exfiltration, and dodge other risks. Good job.
Siri
Siri – the voice assistant that was once terrible and is now, well, looking pretty good actually.
Iâve been immersed in agents recently.
(Here's my recent paper: Lares smart home assistant: A toy AI agent demonstrating emergent behavior.)
So I'm seeing everything through that lens. Three observations/speculations.
1. Siri is now a runtime for micro agents, programmed in plain English.
Take another look at the Apple Intelligence release and look at the requests that Siri can handle now: "Send the photos from the barbecue on Saturday to Malia" (hi you) or "Add this address to his contact card."
These are multi-step tasks across multiple apps.
The App Intents database (the database of operations that Siri can use in each app) is almost good enough to run this. But my experience is that a GPT-3.5-level model is not always reliable… especially when there are many possible actions to choose from…
You know what massively improves reliability? When the prompt includes the exact steps to perform.
Oh and look at that, Siri now includes a detailed device guide:
Siri can now give users device support everywhere they go, and answer thousands of questions about how to do something on iPhone, iPad, and Mac.
The example given is "Here's how to schedule a text message to send later," and the instructions have four steps.
Handy for users!
BUT.
Look. This is not aimed at humans. These are instructions written to be consumed by Siri itself, for use in the Orchestration agent runtime.
Given these instructions, even a 3.5-level agent is capable of combining steps and performing basic reasoning.
It's a gorgeously clever solution. I love that Apple just wrote thousands of step-by-step guides to achieving everything on your phone, which, sure, you can read if you ask. But then also: embed them, RAG the right ones in against a user request, and run the steps via App Intents. Such a straightforward approach with minimal code.
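The speculated pipeline above can be sketched in a few lines. Everything here is invented for illustration – the guide titles, the steps, the retrieval – and the "retrieval" is toy word-overlap rather than real embeddings, but it shows the shape: retrieve the right guide for a request, then hand its steps to the agent runtime.

```python
# Hypothetical device guides: each maps a task title to step-by-step instructions,
# written in plain English but consumable by an agent.
GUIDES = {
    "schedule a text message to send later": [
        "open Messages", "compose the message", "long-press send", "pick a time",
    ],
    "set a timer": ["open Clock", "choose Timer", "set duration", "start"],
}

def overlap(a: str, b: str) -> int:
    # Toy retrieval: word overlap instead of embeddings + a vector database.
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve_guide(request: str) -> str:
    # "RAG the right one in": pick the guide whose title best matches the request.
    return max(GUIDES, key=lambda title: overlap(title, request))

def run_agent(request: str) -> list:
    title = retrieve_guide(request)
    # A real orchestrator would map each step to an App Intent call;
    # here we just return the plan it would follow.
    return GUIDES[title]

print(run_agent("please schedule this text to send later tonight"))
```

The model never has to invent the procedure – it just follows retrieved steps, which is exactly why reliability jumps.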
i.e. Siri's new capabilities are programmed in plain English.
Can I prove it? No. But I'll eat my hat if it's not something like that.
2. Semantic indexing isn't enough. You need salience too, and we got a glimpse of that in the Journal app.
Siriâs instruction manual is an example of how Apple often surfaces technical capabilities as user-facing features.
Here's another one I can't prove: the prototype of the "personal context" in the semantic index.
It's not enough just to know that you went to such-and-such location yesterday, or happened to be in the same room as X and Y, or listened to whatever podcast. Semantic search isn't enough.
You also need salience.
Was it notable that you went to such-and-such location? Like, is meeting up in whatever bookshop with whatever person unusual and significant? Did you deliberately play whatever podcast, or did it just run on from the one before?
Thatâs tough to figure out.
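One way to picture the problem: salience as a score layered on top of the semantic index, combining signals like novelty (how unusual is this event for you?) and deliberateness (did you choose it, or did it just happen?). The signals and weights below are entirely invented – a sketch of the idea, not anyone's actual system.

```python
def salience(event: dict, history: list) -> float:
    # Novelty: events of a kind you rarely log score higher.
    same_kind = [e for e in history if e["kind"] == event["kind"]]
    novelty = 1.0 / (1 + len(same_kind))
    # Deliberateness: did the user initiate this, or did it autoplay/run on?
    deliberate = 1.0 if event.get("user_initiated") else 0.2
    # Invented weights; a real system would learn these from engagement signals.
    return 0.6 * novelty + 0.4 * deliberate

history = [{"kind": "podcast"}] * 50  # the podcast just keeps running on

outing = {"kind": "bookshop_visit", "user_initiated": True}
autoplay = {"kind": "podcast", "user_initiated": False}

# The one-off, deliberate bookshop outing outranks episode 51 of autoplay.
print(salience(outing, history) > salience(autoplay, history))
```

Clicks on Journal prompts are exactly the kind of engagement signal you'd use to tune those weights.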
Fortunately Apple has been testing this for many months: Apple launched their Journal app in December 2023 as part of the OS, and it includes "intelligently curated personalised suggestions" as daily writing prompts.
Like, you had an outing with someone – that kind of thing is the kind of suggestion they give you. It's all exposed by the Journaling Suggestions API.
Imagine the training data that comes from seeing whether people click on the prompts or not. Valuable for training the salience engine, I'm sure. You don't need to train with the actual data, just get a signal that the weights are right.
Again, nothing I can prove. But!
3. App Intents? How about Web App Intents?
AI agents use tools or functions.
Siri uses "App Intents," which developers declare as part of their app, and Siri stores them all in a database. "Intent" is also the term of art on Android for "a meaningful operation that an app can do." App Intents aren't new for this generation of AI; Apple and Android both laid the groundwork for this many, many years ago.
Intents == agent tools.
It is useful that there is a language for this now!
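"Intents == agent tools" can be shown in a dozen lines. This is a generic tool-registry sketch, not Apple's App Intents API: apps register named operations with parameters, and an orchestrator resolves a planned step to a registered tool and calls it. All the intent names and apps are invented.

```python
# Registry mapping intent names to callables – the "App Intents database."
INTENTS = {}

def app_intent(name: str):
    # Decorator an app would use to declare an operation it offers.
    def register(fn):
        INTENTS[name] = fn
        return fn
    return register

@app_intent("contacts.add_address")
def add_address(contact: str, address: str) -> str:
    return f"Added {address} to {contact}'s card"

@app_intent("photos.send")
def send_photos(query: str, recipient: str) -> str:
    return f"Sent photos matching '{query}' to {recipient}"

def orchestrate(intent: str, **params) -> str:
    # The agent runtime resolves a planned step to a registered tool and invokes it.
    return INTENTS[intent](**params)

print(orchestrate("photos.send", query="barbecue on Saturday", recipient="Malia"))
```

From the agent's point of view there's no difference between this and the function-calling/tools interface of any LLM runtime – which is exactly why the old intents groundwork pays off now.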
The new importance of App Intents to AI-powered Siri provokes a bunch of follow-up questions:
- What about intents that can only be fulfilled off-device, like booking a restaurant? In the future, do you need an app to advertise that intent to Siri, or could Siri index "Web App Intents" too, accessed remotely, no app required?
- How will new intents be discovered? Like, if I want to use the smart TV in an Airbnb and I don't have the app yet? Or book a train ticket in a country I'm visiting for the first time?
- When there are competing intents, how will Siri decide who wins? Like, Google Maps and Resy can both recommend restaurants – who gets to respond when I ask for dinner suggestions?
- How will personal information be shared and protected?
I unpack a lot of these questions in my post about search engines for personal AI agents, from March this year. Siri's new powers make them more relevant.
On a more technical level: in the Speculations section of my recent agent paper, I suggested that systems will need an agent-facing API – we can reframe that now as future Web App Intents.
In that paper, I started sketching out some technical requirements for that agent-facing API, and now I can add a new one: in addition to an API, any system (like Google Maps for restaurant booking) will need to publish a large collection of instruction cards – something that parallels Siri's device guides.
Good to know!
I'm impressed with Apple Intelligence.
It will have taken a ton of work to make it so straightforward, and also to align it so well with what users want, with the brand, and with the strategy.
Let me add one more exceptionally speculative speculation, seeing as I keep accusing Apple of hiding the future in plain sight…
Go back to the Apple Intelligence page and check out the way Siri appears now. No longer a glowing orb, it's an iridescent ring on the perimeter of the phone screen.
Another perimeter feature: in iOS 18, pressing the volume button pushes in the display bezel.
I bet the upcoming iPhones have curved screens à la the Samsung Galaxy S6 Edge from 2015.
Or at least that it has been strongly considered.
But iPhones with Siri AI should totally have curved glass. Because that would look sick.