Microsoft adds three new foundational AI models
Microsoft AI on April 2 unveiled three new foundational models spanning speech transcription, audio generation and image generation, marking a notable step in the company’s effort to deepen its own multimodal AI stack. The launch came from Microsoft’s MAI Superintelligence team, the research group led by Microsoft AI chief executive Mustafa Suleyman and formed in late 2025.
The new releases are MAI-Transcribe-1, MAI-Voice-1 and MAI-Image-2. Together they cover three practical model categories that are increasingly central to commercial AI systems: turning speech into text, generating synthetic audio and creating visual content from prompts.
What Microsoft says the models do
According to Microsoft, MAI-Transcribe-1 supports speech-to-text transcription across 25 languages and runs 2.5 times faster than the company’s Azure Fast offering. MAI-Voice-1 is positioned as an audio-generation model that can produce 60 seconds of audio in one second and can also create a custom voice. MAI-Image-2 is described as a video-generating model, though its naming places it in Microsoft’s image model line.
Microsoft is making all three available through Microsoft Foundry. The transcription and voice models are also available through MAI Playground, a testing environment for large language model experimentation. MAI-Image-2 had already appeared in MAI Playground on March 19 before Thursday’s broader rollout.
A strategic signal beyond product features
The release matters less as a single product update than as a strategic statement. Microsoft remains commercially tied to OpenAI, but these launches show the company continuing to invest in its own foundation-model capabilities rather than relying exclusively on outside partners. That makes the announcement relevant for developers, enterprise buyers and rivals watching how major platform companies are balancing partnership and in-house model development.
Suleyman framed the effort as part of what he called a “Humanist AI” approach, saying Microsoft is training models for practical use and optimizing for how people actually communicate. Microsoft also says more MAI models are expected soon in Foundry and directly inside its products.
Pricing and positioning
Microsoft is also competing on price. The company says the new models are cheaper than offerings from Google and OpenAI, an important claim in a market where inference cost is becoming a major differentiator. Starting prices are listed at $0.36 per hour for MAI-Transcribe-1, $22 per 1 million characters for MAI-Voice-1, and $5 per 1 million text-input tokens plus $33 per 1 million image-output tokens for MAI-Image-2.
Those price points suggest Microsoft is targeting both direct developer adoption and broader enterprise integration, especially through existing Azure and Foundry relationships. Lower-cost multimodal services could be especially attractive for businesses building customer support tools, media workflows, multilingual transcription pipelines and branded content systems.
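For teams weighing those price points, a back-of-envelope estimate is straightforward. The sketch below uses the listed starting prices; the usage figures are hypothetical, chosen purely for illustration, and real bills would depend on Microsoft's actual metering and any tiered discounts.

```python
# Back-of-envelope cost estimate using the listed starting prices.
# All usage numbers below are hypothetical illustrations, not real workloads.

PRICES = {
    "transcribe_per_hour": 0.36,         # MAI-Transcribe-1: $ per hour of audio
    "voice_per_million_chars": 22.0,     # MAI-Voice-1: $ per 1M characters
    "image_text_in_per_million": 5.0,    # MAI-Image-2: $ per 1M text-input tokens
    "image_out_per_million": 33.0,       # MAI-Image-2: $ per 1M image-output tokens
}

def estimate_cost(hours_transcribed, voice_chars, text_in_tokens, image_out_tokens):
    """Total dollars for one hypothetical month of usage across all three models."""
    return (
        hours_transcribed * PRICES["transcribe_per_hour"]
        + voice_chars / 1e6 * PRICES["voice_per_million_chars"]
        + text_in_tokens / 1e6 * PRICES["image_text_in_per_million"]
        + image_out_tokens / 1e6 * PRICES["image_out_per_million"]
    )

# Example: 100 hours of audio, 2M voice characters, 1M text-input tokens,
# and 0.5M image-output tokens → $36 + $44 + $5 + $16.50
total = estimate_cost(100, 2_000_000, 1_000_000, 500_000)
print(f"${total:.2f}")  # → $101.50
```

Even at modest volumes like these, per-unit pricing differences of a few dollars per million tokens compound quickly, which is why inference cost has become a front-line competitive claim.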
Why this launch stands out
Foundational model launches are no longer unusual, but this one stands out because it shows Microsoft broadening its capabilities across multiple media formats at once. Instead of releasing a single point model, the company is establishing a portfolio that touches text, voice and visuals in parallel.
That matters in a market moving toward multimodal systems rather than isolated single-purpose models. Developers increasingly want tools that can hear, speak, read and generate visual assets inside one platform. Microsoft’s move suggests it sees that convergence as central to its next phase of AI competition.
It also reinforces a wider industry pattern: the largest AI platforms are trying to own more of the full stack, from model creation to testing environments to enterprise deployment channels. Microsoft’s latest launch fits neatly into that trajectory.
What comes next
No independent benchmarks accompany the launch beyond Microsoft’s own claims, so the real test will be how these models perform once developers begin using them in production. Speed, cost, quality and reliability will determine whether the MAI line becomes a serious alternative in a crowded market.
For now, the clearest takeaway is that Microsoft is not standing still as multimodal AI competition intensifies. By shipping new models across transcription, audio and image generation, the company is making a stronger case that it intends to be both a platform host and a model builder in its own right.
This article is based on reporting by TechCrunch.