In this article, you will learn how to build multimodal AI capabilities — image classification, image captioning, and speech transcription — that run entirely in the browser using Transformers.js, with no server, no API key, and no data leaving the user’s device.








