Narrate your life with Generative AI
A few weeks ago Charlie Holtz (@charliebholtz) showed a pretty cool demo that used GPT-4’s multimodal abilities to caption and narrate a screenshot of you, which was then passed to ElevenLabs’ voice-cloning service to be spoken aloud in the voice of David Attenborough.
That demo required you to run the code on your machine using Python. Since all the generative AI capabilities were accessed through APIs, I wanted to see if it could be done entirely in the browser.
All that was needed was the ability to capture a stream from a user’s webcam and take a snapshot at regular intervals, which is fairly trivial in the browser these days.
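A minimal sketch of that capture loop looks something like the below. The element lookup, the 5-second interval, and the snapshot handling are all my own assumptions, not the notebook’s exact code:

const video = document.querySelector('video');
const canvas = document.createElement('canvas');

async function startCapture() {
  // Ask permission for the webcam and pipe the stream into the <video> element
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  video.srcObject = stream;
  await video.play();

  // Grab a frame at a fixed interval as a JPEG blob ready for captioning
  setInterval(() => {
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    canvas.getContext('2d').drawImage(video, 0, 0);
    canvas.toBlob((snapshot) => {
      // snapshot is a Blob you can POST to the captioning API
    }, 'image/jpeg');
  }, 5000);
}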
Try it out
You can try the demo directly in this post. It embeds a notebook that you can peruse to see how it works. All that’s needed is your HuggingFace & ElevenLabs API keys (both are free with limits).
The tokens are stored in the browser’s localStorage and never get sent anywhere server-side. But don’t take my word for it: go read the code in the notebook!
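The storage itself is a one-liner each way; the key name here is illustrative, not what the notebook actually uses:

// Persist the key client-side only; it never leaves the browser
localStorage.setItem('hf_token', tokenInput.value);

// Read it back whenever an API call is made
const hfToken = localStorage.getItem('hf_token');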
Example
If you don’t have time to get Hugging Face and ElevenLabs API keys, you can check out a recording below of it working on me.
Here's a good demo of it insulting me with a David Attenborough like voice pic.twitter.com/01FjeNysE9
— Ryan Seddon (@ryanseddon) November 29, 2023
How it works
The original demo used OpenAI’s GPT-4 multimodal capabilities to describe screenshots, but I wanted to experiment with open-source models, breaking the process down into three parts (a sketch of the full pipeline follows the list):
- Image Captioning: I used nlpconnect/vit-gpt2-image-captioning to generate a simple caption for the screenshot.
- Whimsical Text: To infuse some whimsy into the narration, I turned to mistralai/Mistral-7B-Instruct-v0.1 with a David Attenborough prompt.
- Audio Integration: To bring it all together, I fed the generated narration into ElevenLabs for audio synthesis.
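Here’s a rough sketch of the whole pipeline using the @huggingface/inference client and ElevenLabs’ public text-to-speech REST endpoint. The prompt wording, voiceId, and token variables are placeholders of mine, not the notebook’s exact values:

import { HfInference } from '@huggingface/inference';

const hf = new HfInference(hfToken); // token read from localStorage

async function narrate(snapshot) {
  // 1. Caption the webcam snapshot
  const { generated_text: caption } = await hf.imageToText({
    model: 'nlpconnect/vit-gpt2-image-captioning',
    data: snapshot,
  });

  // 2. Rewrite the caption as whimsical Attenborough-style narration
  const { generated_text: narration } = await hf.textGeneration({
    model: 'mistralai/Mistral-7B-Instruct-v0.1',
    inputs: `You are David Attenborough. Narrate this scene: ${caption}`,
    parameters: { return_full_text: false, max_new_tokens: 100 },
  });

  // 3. Synthesise speech via ElevenLabs and get the audio back as binary
  const res = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`, {
    method: 'POST',
    headers: { 'xi-api-key': elevenLabsToken, 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: narration }),
  });
  return res.blob(); // binary audio for the <audio> tag
}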
Surprisingly, the web technologies I used aren’t cutting-edge. The webcam feed employs getUserMedia, the screenshot leverages canvas, and the audio is a simple audio tag: the binary response from ElevenLabs is converted to an object URL, which is set as the audio tag’s src.
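The playback end of that looks roughly like this, reusing the narrate sketch from above:

// Turn ElevenLabs' binary response into something an <audio> tag can play
const audioBlob = await narrate(snapshot);
const audio = document.querySelector('audio');
audio.src = URL.createObjectURL(audioBlob);
audio.play();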
Some findings
There is, as far as I know, no good open-source multimodal model that works with the HuggingFace API, though I’m keeping an eye on Open Hermes 2.5 as it’s planning a multimodal version.
To stop the Mistral-7B-Instruct model returning the prompt in its response, and to keep the narration snappy, I had to pass some parameters that weren’t obvious from the HfInference docs.
hf.textGeneration({
  // …
  parameters: {
    return_full_text: false, // don't echo the prompt back in the response
    max_new_tokens: 100 // cap the length so the narration stays snappy
  }
});
This just caps the response at 100 tokens and instructs the model to drop the prompt.
Good captioning
For a while I was getting wild and very bad captions from the vit-gpt2-image-captioning model. It turns out my screenshots weren’t quite in the right aspect ratio relative to the video feed; once I fixed that and reduced the snapshot to a quarter of the video feed’s size, the model did a much better job.
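Concretely, that means deriving the canvas size from the video’s own dimensions rather than hardcoding it, then scaling the draw call down. A sketch, assuming the same video and canvas elements as earlier:

// Quarter the snapshot while preserving the feed's aspect ratio
canvas.width = video.videoWidth / 4;
canvas.height = video.videoHeight / 4;
canvas.getContext('2d').drawImage(video, 0, 0, canvas.width, canvas.height);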
Open source generative AI is moving fast
The generative AI space is moving at an incredibly fast pace, though I’ve lived through being a frontend engineer for the last 20 years, so I know how to handle breakneck change!
I think it’s only a matter of time before we get a good multimodal LLM running via the free Hugging Face API, and even a good voice generation model too.
As an aside, I posted this for fun the other day. I didn’t think this meme would be a legitimate series of steps even a year ago!
Well that's it folks, this meme can now be retired. I can now draw a great owl by following these 2 simple steps. 😂 pic.twitter.com/BVOIpS791U
— Ryan Seddon (@ryanseddon) December 4, 2023