Multimodal Browser AI with Transformers.js for Images and Speech

In this article, you will learn how to build multimodal AI capabilities — image classification, image captioning, and speech transcription — that run entirely in the browser using Transformers.js, with no server, no API key, and no data leaving the user’s device.

Topics we will cover include:

  • How to set up and run image classification and image captioning pipelines using Vision Transformer models in the browser.
  • How to implement browser-based speech transcription using OpenAI’s Whisper architecture via the Web Audio API.
  • How to combine all three pipelines into a single multimodal media analyzer that loads models in parallel and presents results in a unified dashboard.

Introduction

Most browser AI tutorials cover text because it is a natural starting point, but the applications people actually want to build are rarely text-only. Users take photos, record voice notes, upload screenshots. The data is multimodal and the AI should be too.

Transformers.js handles this natively. It supports computer vision (image classification, object detection, segmentation), audio (automatic speech recognition, audio classification, text-to-speech), and multimodal tasks, all running locally in the browser, with no server, no API key, and no data leaving the user’s device.

This tutorial builds three capabilities in sequence: image classification, image captioning, and speech transcription. Each is a self-contained HTML file you can open in a browser. The final section combines all three into a single multimodal media analyzer.

What You Need

  • A modern browser: Chrome 109+, Edge 109+, or Firefox 90+. These versions support ES modules and WebAssembly, both of which Transformers.js requires.
  • A local web server: Browser security policies block ES module imports from file:// URLs — opening the HTML files directly by double-clicking will not work. You need to serve them over HTTP.

You do not need Node.js, npm, or any build tools. The CDN import handles the library.

Starting a Local Server

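Pick whichever option matches what you already have installed. For example, with Python 3, run python3 -m http.server 8080 from the project folder; with Node.js available, npx http-server -p 8080 does the same job.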

Once the server is running, open http://localhost:8080 in your browser.

Project Structure

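Create one folder for the project. Each task gets its own HTML file: image-classifier.html, image-captioner.html, and speech-transcriber.html, plus media-analyzer.html for the combined app.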

Models and Download Sizes

Every model downloads once on the first run and caches in the browser. Subsequent loads are instant and work offline. Here is what to expect on the first run:

| Task | Model | Pipeline task | First-run download |
| --- | --- | --- | --- |
| Image Classification | Xenova/vit-base-patch16-224 | image-classification | ~88 MB |
| Image Captioning | Xenova/vit-gpt2-image-captioning | image-to-text | ~246 MB |
| Speech Transcription | Xenova/whisper-tiny.en | automatic-speech-recognition | ~78 MB |

The combined app loads all three, roughly 400 MB total on first run. A progress indicator for each model is non-negotiable UX.

Task 1: Image Classification

Image classification assigns labels from a fixed set to an input image. The model used here is ViT-Base/16, a Vision Transformer trained by Google on ImageNet-21k and fine-tuned on ImageNet-1k, converted to ONNX format for browser use. It classifies images into 1,000 ImageNet categories and returns a ranked list with confidence scores.

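What the output looks like: an array of objects such as { label: 'golden retriever', score: 0.942 }, ranked from most to least likely.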

Each object has a label string (the ImageNet class name) and a score float between 0 and 1. By default, the pipeline returns 5 results. Set top_k in the call to get more or fewer.
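As a minimal sketch of the call and the result handling described above: the CDN URL, the helper names loadClassifier and formatPredictions, and the imageUrl variable are illustrative assumptions, not taken from the original demo.

```javascript
// Assumed CDN build of Transformers.js v3; pin whichever version you actually use.
const TRANSFORMERS_CDN = 'https://cdn.jsdelivr.net/npm/@huggingface/transformers';

// Load the classifier once. Intended for a <script type="module"> in the browser.
async function loadClassifier() {
  const { pipeline } = await import(TRANSFORMERS_CDN);
  return pipeline('image-classification', 'Xenova/vit-base-patch16-224');
}

// Turn [{ label, score }, ...] into display strings, highest score first.
function formatPredictions(results) {
  return [...results]
    .sort((a, b) => b.score - a.score)
    .map(({ label, score }) => `${label}: ${(score * 100).toFixed(1)}%`);
}

// Browser usage (not executed here; imageUrl is hypothetical):
//   const classifier = await loadClassifier();
//   const results = await classifier(imageUrl, { top_k: 5 });
//   formatPredictions(results).forEach((row) => console.log(row));
```

Keeping the formatting step separate from the model call makes it easy to check the display logic without downloading the model.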

Full Working Demo

Save this file as image-classifier.html in your project folder. Copy the code below and open it on your localhost.

Browser mockup of the image classifier running at localhost

What this code does:

  1. The pipeline() call starts downloading the model immediately when the page opens.
  2. The progress callback updates the status text so the user can see the download progressing.
  3. Once classifier is assigned, the drop zone border turns green as a visual cue.
  4. When a file is dropped or selected, FileReader converts it to a base64 data URL, which the pipeline accepts directly as image input — no manual preprocessing needed.
  5. The classifier returns an array of { label, score } objects, which the rendering loop converts into a horizontal bar chart. The top_k: 5 option limits results to the five most likely classes.
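The progress and file-reading steps above could be sketched roughly as follows. This assumes progress events carry status, file, and progress (0 to 100) fields; describeProgress, fileToDataUrl, statusEl, and MODEL_ID are hypothetical names introduced for illustration.

```javascript
// Convert a Transformers.js progress event into a short status string.
// Assumes events shaped like { status, file, progress } with progress in 0-100.
function describeProgress(event) {
  switch (event.status) {
    case 'progress':
      return `Downloading ${event.file}: ${Math.round(event.progress ?? 0)}%`;
    case 'done':
      return `Finished ${event.file}`;
    case 'ready':
      return 'Model ready';
    default:
      return 'Loading model...';
  }
}

// Read a dropped or selected File as a base64 data URL (browser only).
function fileToDataUrl(file) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result);
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(file);
  });
}

// Browser usage (not executed here):
//   const classifier = await pipeline('image-classification', MODEL_ID, {
//     progress_callback: (e) => { statusEl.textContent = describeProgress(e); },
//   });
//   const results = await classifier(await fileToDataUrl(file), { top_k: 5 });
```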

Task 2: Image Captioning

Image captioning generates a natural language sentence describing what is in an image. It is meaningfully different from classification: instead of picking from 1,000 fixed labels, the model generates free-form text. “A golden retriever running through a field of tall grass” versus just “golden retriever.” More descriptive, more flexible, larger model.

The model used here is Xenova/vit-gpt2-image-captioning, a Vision Transformer encoder that reads the image paired with a GPT-2 decoder that generates the caption. The ONNX version weighs in at 246 MB, noticeably larger than the classifier, because the generative decoder is a full language model.

What the output looks like: a one-element array such as [{ generated_text: 'a dog is playing on a tennis court' }].

The output is an array with one object containing a generated_text string. It is always an array even for a single image, because the pipeline supports batching.
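A small sketch of reading that result safely; extractCaption and imageUrl are hypothetical names, and the helper simply takes the first array entry for the single-image case.

```javascript
// Pull the caption string out of image-to-text output.
// The pipeline returns an array, so take the first entry for a single image.
function extractCaption(output) {
  const first = Array.isArray(output) ? output[0] : output;
  return first?.generated_text?.trim() ?? '';
}

// Browser usage (not executed here), with captioner created via
// pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning'):
//   const output = await captioner(imageUrl);
//   console.log(extractCaption(output));
```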

Full Working Demo

Save this file as image-captioner.html and open it through your local server (for example, http://localhost:8080/image-captioner.html).

Browser mockup of the image captioning running at localhost

What this code does:

  1. Both pipelines load in parallel using Promise.all — the downloads overlap, saving time compared to sequential loading.
  2. The onLoaded counter updates the status line as each model finishes, so users know which is still downloading.
  3. When an image is uploaded, both inferences also run in parallel: classification and captioning use different model weights and do not share any state, so there is no reason to wait for one before starting the other.
  4. The side-by-side comparison layout makes the difference between the two tasks immediately visible on the same image. Classification returns labels like “golden retriever, 94.2%” while captioning might return “a dog is playing on a tennis court.”
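The parallel loading and parallel inference described in this list might look something like the sketch below. The names loadInParallel and analyzeImage, and the shape of the loaders object, are assumptions made for illustration rather than the original demo's code.

```javascript
// Load several models in parallel and report each one as it finishes.
// `loaders` maps a name to a function that returns a promise for a pipeline.
async function loadInParallel(loaders, onLoaded) {
  let loaded = 0;
  const names = Object.keys(loaders);
  const pipes = await Promise.all(
    names.map(async (name) => {
      const pipe = await loaders[name]();
      loaded += 1;
      onLoaded(name, loaded, names.length);
      return pipe;
    })
  );
  return Object.fromEntries(names.map((name, i) => [name, pipes[i]]));
}

// Start classification and captioning on the same image without waiting
// for one to finish before the other begins.
async function analyzeImage({ classifier, captioner }, image) {
  const [labels, captions] = await Promise.all([
    classifier(image, { top_k: 5 }),
    captioner(image),
  ]);
  return { labels, caption: captions[0]?.generated_text ?? '' };
}

// Browser usage (not executed here; pipeline comes from Transformers.js):
//   const pipes = await loadInParallel({
//     classifier: () => pipeline('image-classification', 'Xenova/vit-base-patch16-224'),
//     captioner: () => pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning'),
//   }, (name, n, total) => console.log(`${name} ready (${n}/${total})`));
//   const result = await analyzeImage(pipes, imageUrl);
```

Promise.all starts both calls immediately; how much the work actually overlaps depends on the backend and on whether inference runs on the main thread.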

Task 3: Speech Transcription

Speech transcription converts audio to text using OpenAI’s Whisper architecture. Browser-based Whisper implementations use WebAssembly to run the transformer architecture directly in JavaScript, with the Transformers.js library handling model loading, WASM compilation, and tensor operations. The model used here is Xenova/whisper-tiny.en — English-only, ~78 MB quantized, designed specifically for browser deployment.

How Audio Input Works

The automatic-speech-recognition pipeline expects a Float32Array of audio samples at 16,000 Hz. The browser’s Web Audio API handles the conversion:

AudioContext.decodeAudioData() accepts any supported audio format (WAV, MP3, MP4, OGG, FLAC) and returns an AudioBuffer, and audioBuffer.getChannelData(0) extracts the raw PCM samples from the first channel. If the audio was recorded or stored at a different sample rate, you need to resample to 16,000 Hz. The AudioContext constructor accepts a sampleRate parameter that does this automatically.
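A minimal sketch of that conversion, assuming the input is an ArrayBuffer read from a file or a recording. The function name decodeAudio matches the one mentioned later in this article, but the body here is a reconstruction under those assumptions, and the injectable AudioContextClass parameter is added purely so the function can be exercised outside a browser.

```javascript
const WHISPER_SAMPLE_RATE = 16000;

// Decode audio bytes into Float32Array samples at 16 kHz.
// Creating the context with sampleRate: 16000 makes decodeAudioData
// resample the decoded audio to that rate.
async function decodeAudio(arrayBuffer, AudioContextClass = globalThis.AudioContext) {
  const ctx = new AudioContextClass({ sampleRate: WHISPER_SAMPLE_RATE });
  try {
    const audioBuffer = await ctx.decodeAudioData(arrayBuffer);
    // First channel only; slice() copies the samples (a defensive choice).
    return audioBuffer.getChannelData(0).slice();
  } finally {
    // Release the audio hardware handle once decoding is done.
    await ctx.close?.();
  }
}

// Browser usage (not executed here; file is a File from an <input> or drop event):
//   const samples = await decodeAudio(await file.arrayBuffer());
//   const output = await transcriber(samples);
```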

Full Working Demo

Save this as speech-transcriber.html. To test, I used this sample audio file. Microphone access requires a secure context, meaning HTTPS or http://localhost, so opening the file directly from disk will not work.

Browser mockup of the speech transcriber at localhost

What this code does:

  1. The decodeAudio function is the key step most tutorials skip explaining. new AudioContext({ sampleRate: 16000 }) tells the Web Audio API to resample whatever audio it decodes to 16,000 Hz — Whisper’s required input rate.
  2. decodeAudioData handles WAV, MP3, MP4, OGG, and FLAC transparently.
  3. getChannelData(0) extracts the left channel as a Float32Array, which is exactly what the transcription pipeline needs.
  4. The chunk_length_s: 30 option tells the pipeline to process audio in 30-second windows. Neighboring windows overlap by a stride (stride_length_s, which defaults to one sixth of the chunk length, so 5 seconds here), which prevents words at chunk boundaries from being cut off.
  5. The waveform visualizer uses AnalyserNode from the Web Audio API to read live frequency data and draws it to a canvas during recording.
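The transcription call with chunking could be sketched as below. The wrapper name transcribe is hypothetical, and stride_length_s is written out explicitly here only to make the overlap visible; the sketch assumes the pipeline returns an object with a text field for a single input.

```javascript
// Transcribe 16 kHz mono samples, chunking long recordings.
// `transcriber` is an automatic-speech-recognition pipeline instance.
async function transcribe(transcriber, samples) {
  const output = await transcriber(samples, {
    chunk_length_s: 30, // process audio in 30-second windows
    stride_length_s: 5, // overlapping context at chunk boundaries
  });
  return output.text.trim();
}

// Browser usage (not executed here):
//   const transcriber = await pipeline(
//     'automatic-speech-recognition', 'Xenova/whisper-tiny.en');
//   const text = await transcribe(transcriber, samples);
```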

The Combined App: Multimodal Media Analyzer

With all three pipelines individually working, this section combines them into a single-page application. It accepts either an image or microphone audio, runs the appropriate AI pipelines, and presents the results in a unified dashboard. All three models load in parallel on page open.

Save this as media-analyzer.html and open it on your localhost:

Browser mockup of the media-analyzer.html page

What this code does:

  1. All three pipelines start loading simultaneously on page open, downloading in parallel.
  2. The badge system in the header gives the user a clear visual of which models are ready.
  3. markReady increments a counter and enables the record button only when all three are loaded, preventing partial-state errors.
  4. Image and speech modes share the same results grid but show different cards — the image mode shows classification and caption cards, the speech mode shows the transcription card. This keeps the layout consistent without duplicating DOM.
  5. The analyzeImage function runs classification and captioning in parallel, same as the standalone demo, because there is no dependency between the two inferences.
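The readiness gating described above can be sketched as a small tracker. This is an illustrative variant, not the original markReady: it uses a Set rather than a bare counter so that a model reported twice is not counted twice, and createReadyTracker and recordButton are hypothetical names.

```javascript
// Track which models have loaded and fire a callback once all are ready.
function createReadyTracker(modelNames, onAllReady) {
  const ready = new Set();
  let fired = false;
  return function markReady(name) {
    if (!modelNames.includes(name)) throw new Error(`Unknown model: ${name}`);
    ready.add(name);
    if (!fired && ready.size === modelNames.length) {
      fired = true;
      onAllReady();
    }
    return ready.size; // number of distinct models ready so far
  };
}

// Browser usage (not executed here):
//   const markReady = createReadyTracker(
//     ['classifier', 'captioner', 'transcriber'],
//     () => { recordButton.disabled = false; }
//   );
//   pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning')
//     .then(() => markReady('captioner'));
```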

Performance, Limits, and Next Steps

The application is not perfect, and some parts warrant improvement before a production deployment.

1. Realistic Inference Speed on WASM

Running on CPU via WebAssembly is usable but not instant. On a modern laptop (Apple M2 or equivalent Intel):

  • ViT image classification: 200–500ms per image after model load.
  • ViT-GPT2 captioning: 2–5 seconds per image (text generation is iterative — each token requires one forward pass).
  • Whisper tiny: approximately 2–5x slower than real time on WASM, meaning a 10-second audio clip takes 20–50 seconds to transcribe.

WebGPU reduces this significantly. On Chrome 113+ with a capable GPU, adding device: 'webgpu' to the options of any pipeline can cut inference time by 3–5x. Check support before enabling:
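One way to perform that check, as a sketch: the helper name pickDevice is hypothetical, and falling back to 'wasm' assumes a Transformers.js version that accepts the device option.

```javascript
// Return 'webgpu' when the browser exposes a usable GPU adapter, else 'wasm'.
// `nav` is injectable so the function can be exercised outside a browser.
async function pickDevice(nav = globalThis.navigator) {
  if (!nav || !('gpu' in nav)) return 'wasm';
  try {
    // requestAdapter() resolves to null when no suitable adapter exists.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm';
  } catch {
    return 'wasm';
  }
}

// Browser usage (not executed here):
//   const device = await pickDevice();
//   const classifier = await pipeline(
//     'image-classification', 'Xenova/vit-base-patch16-224', { device });
```

Requesting an adapter, rather than only testing for navigator.gpu, covers browsers that expose the API but have no compatible GPU.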

2. Web Workers for Production

All three demos run inference on the main thread. For production, move model loading and inference into Web Workers to keep the UI responsive during heavy computation. The pattern is the same for all three pipelines: send the input via postMessage, receive results in the main thread. Transformers.js tensors are not transferable, so convert them to plain arrays before posting.
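A rough sketch of the worker side of that pattern follows. The message shape ({ id, task, input, options }) and the name createWorkerHandler are assumptions for illustration. The JSON round-trip is a simple guard that suits the plain { label, score }, { generated_text }, and { text } results used in this article; raw tensor output would need an explicit conversion instead.

```javascript
// Build a message handler for a worker that owns the pipelines.
// `pipelines` maps a task name to an async inference function;
// `post` is the reply function (self.postMessage in a real worker).
function createWorkerHandler(pipelines, post) {
  return async function onMessage(event) {
    const { id, task, input, options } = event.data;
    try {
      const run = pipelines[task];
      if (!run) throw new Error(`Unknown task: ${task}`);
      const output = await run(input, options);
      // Round-trip through JSON so only plain objects and arrays are posted.
      post({ id, ok: true, output: JSON.parse(JSON.stringify(output)) });
    } catch (err) {
      post({ id, ok: false, error: String(err?.message ?? err) });
    }
  };
}

// Worker usage (not executed here; classifier is a loaded pipeline):
//   self.onmessage = createWorkerHandler(
//     { classify: classifier }, (msg) => self.postMessage(msg));
// Main thread:
//   worker.postMessage({ id: 1, task: 'classify', input: dataUrl, options: { top_k: 5 } });
```

The id field lets the main thread match each reply to the request that produced it when several are in flight.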

3. Model Size Trade-offs

If 246 MB for the captioner is too large for your use case, Xenova/vit-base-patch16-224 (88 MB) for classification is substantially faster and lighter. For speech, Xenova/whisper-base.en (145 MB) produces noticeably better transcription than whisper-tiny.en on accented speech and technical vocabulary — worth the extra 67 MB if accuracy matters.

Only models that have ONNX-compatible weights work with Transformers.js. Many popular architectures like DistilBERT, Whisper, and T5 already have ONNX versions on the Hugging Face Hub. To check which models are available, filter by the transformers.js library tag on the Hub.

Wrapping Up

Multimodal AI in the browser is not experimental. It runs today, on hardware your users already have, using an API that fits in four lines of JavaScript per task. Image classification, captioning, and speech transcription cover a meaningful range of real applications: accessibility tools that describe images for screen readers, voice-driven interfaces that work offline, content moderation pre-screening that never sends data to a server, and media analysis dashboards that run entirely client-side.

Every demo in this article is a single HTML file. Start the local server, open any of them, wait for the first-run download, and they work. Extend the demos, swap the models, and add the Web Worker pattern when you are ready for production. The Transformers.js documentation and the examples repository are the best next stops, since every task listed there follows the same pipeline() pattern you just built.
