Apple introduced its third-generation foundation models at WWDC26, designed to overcome the memory constraints that have kept on-device AI models small. By keeping a model's full weight set in NAND flash rather than in DRAM, the new AFM 3 family allows for far larger on-device models without relying on constant cloud connectivity.
The AFM 3 family includes five models: two on-device and three server-based, all operating within Apple’s Private Cloud Compute boundary. The server-side models, including AFM 3 Cloud Pro for agentic tool use and complex reasoning, run on Nvidia GPUs in Google Cloud, while the on-device architecture is Apple’s own. AFM 3 Core Advanced, a 20-billion-parameter model, stores its weights in NAND flash instead of DRAM, a departure from the usual practice of loading the full model into working memory.
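To put the 20-billion-parameter figure in context, the short Python sketch below estimates how much storage the weights alone would need at a few precisions. Apple's actual weight precision is not stated here, so the bit widths are illustrative assumptions, and the calculation ignores activations, caches, and any compression.

```python
# Back-of-the-envelope weight footprint for a 20-billion-parameter model.
# The bit widths are illustrative assumptions, not disclosed figures.
PARAMS = 20_000_000_000


def weight_footprint_gb(params: int, bits_per_weight: int) -> float:
    """Return the size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_footprint_gb(PARAMS, bits):.1f} GB")
```

Even at the smallest of these assumed precisions, the weights would occupy on the order of 10 GB, which is why keeping them all resident in DRAM is difficult on a phone or laptop.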
“Instead of forcing the entire model into DRAM, the full model is stored in flash memory,” Apple’s research team wrote. “Because NAND-to-DRAM bandwidth is too slow to swap weights token by token, as standard MoE models require, AFM 3 Core Advanced makes routing decisions per prompt.” In other words, the parts of the model a prompt needs are chosen once and read from flash up front, rather than being swapped in for each generated token. This architectural shift enables more capable on-device agents, such as those performing complex tool use and reasoning, without the latency and privacy trade-offs of cloud-dependent systems.
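The difference between per-token and per-prompt routing can be illustrated with a toy Python sketch that counts how many expert loads from flash each strategy triggers. Everything in it is hypothetical: the expert count, the top-k value, the stand-in router, and the choice to pick experts by summing scores over the prompt are assumptions made for illustration, not Apple's published method.

```python
import random

NUM_EXPERTS = 8  # hypothetical number of experts held in flash
TOP_K = 2        # hypothetical number of experts that fit in DRAM at once


def router_scores(token_id: int) -> list[float]:
    """Toy stand-in for a learned router: repeatable scores per token."""
    rng = random.Random(token_id)
    return [rng.random() for _ in range(NUM_EXPERTS)]


def top_k(scores: list[float], k: int = TOP_K) -> frozenset[int]:
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return frozenset(ranked[:k])


def per_token_loads(tokens: list[int]) -> int:
    """Count flash-to-DRAM expert loads when every token is routed afresh."""
    resident: frozenset[int] = frozenset()
    loads = 0
    for tok in tokens:
        wanted = top_k(router_scores(tok))
        loads += len(wanted - resident)  # fetch only experts not already in DRAM
        resident = wanted                # DRAM holds just the current top-k
    return loads


def per_prompt_loads(tokens: list[int]) -> int:
    """Count loads when experts are chosen once for the whole prompt."""
    totals = [0.0] * NUM_EXPERTS
    for tok in tokens:
        for i, score in enumerate(router_scores(tok)):
            totals[i] += score           # assumed aggregation: sum over the prompt
    return len(top_k(totals))            # loaded once, reused for every token


prompt = list(range(1, 33))  # 32 made-up token ids
print("per-token loads: ", per_token_loads(prompt))
print("per-prompt loads:", per_prompt_loads(prompt))
```

In this sketch the per-prompt strategy always loads exactly `TOP_K` experts, while the per-token count grows with prompt length whenever consecutive tokens prefer different experts. That growth is the flash traffic the quoted passage says NAND-to-DRAM bandwidth cannot sustain.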
The move signals a broader industry push to make on-device AI more competitive with server-side deployments. Enterprise architects evaluating agentic workloads have historically had to choose between capable cloud-dependent models and limited on-device ones. By decoupling model size from DRAM capacity, Apple’s approach could reshape what’s possible for privacy-preserving AI applications on edge devices.
The partnership with Google for server-side compute underscores the scale of Apple’s ambition, though Apple did not disclose financial terms or the exact timeline for public release. The AFM 3 family is still early-stage, and it remains to be seen how third-party developers will adopt the on-device architecture for their own agentic workloads.