A few months ago, Apple unveiled FastVLM, a Visual Language Model (VLM) designed for near-instant, high-resolution image processing. A live demo is now available to anyone with an Apple Silicon Mac, and in this article we’ll walk through how to try FastVLM for yourself and what it can do.
When we first covered FastVLM, we highlighted its use of MLX, Apple’s open-source machine learning framework built for Apple Silicon. That combination lets FastVLM deliver video captioning up to 85 times faster than comparable models, while taking up less than a third of the storage.
Since its initial release, Apple has refined the project and made it available not only on GitHub but also on Hugging Face, where you can load the lighter FastVLM-0.5B model directly in your browser and try it firsthand. Loading times will vary with your hardware; it took a couple of minutes on my 16GB M2 Pro MacBook Pro.
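If you’d rather drive the model from code than from the browser, the same checkpoints on Hugging Face ship their own modeling code for use with the `transformers` library. Here’s a minimal sketch condensed from the pattern on the model card; the `-200` image-token placeholder and the `get_vision_tower()` call come from FastVLM’s bundled remote code, and the image path is made up, so double-check the details against the model card before relying on this:

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "apple/FastVLM-0.5B"
IMAGE_TOKEN_INDEX = -200  # placeholder id FastVLM's remote code expects

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",  # lands on the Apple Silicon GPU ("mps") when available
    trust_remote_code=True,
)

def caption(image: Image.Image,
            prompt: str = "Describe what you see in one sentence.") -> str:
    # Render the chat template to a string so we can splice the image token in
    messages = [{"role": "user", "content": f"<image>\n{prompt}"}]
    rendered = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False
    )
    pre, post = rendered.split("<image>", 1)
    pre_ids = tokenizer(pre, return_tensors="pt", add_special_tokens=False).input_ids
    post_ids = tokenizer(post, return_tensors="pt", add_special_tokens=False).input_ids
    image_token = torch.tensor([[IMAGE_TOKEN_INDEX]], dtype=pre_ids.dtype)
    input_ids = torch.cat([pre_ids, image_token, post_ids], dim=1).to(model.device)

    # Preprocess the frame with the model's own vision-tower processor
    pixels = model.get_vision_tower().image_processor(
        images=image, return_tensors="pt"
    )["pixel_values"].to(model.device, dtype=model.dtype)

    with torch.no_grad():
        out = model.generate(
            inputs=input_ids,
            attention_mask=torch.ones_like(input_ids),
            images=pixels,
            max_new_tokens=64,
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(caption(Image.open("desk_photo.jpg").convert("RGB")))  # hypothetical image
```

Swapping the `prompt` string here mirrors tweaking the prompt box in the browser demo.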
Once the demo had loaded, I was impressed by how accurately it described my appearance, the background, my expressions, and any objects I held up. In the bottom left corner of the interface, you can tweak the prompt the model responds to and watch the caption update in real time, either freeform or by picking from suggestions such as:
- Describe what you see in one sentence.
- What is the color of my shirt?
- Identify any visible text or written content.
- What emotions or actions are being portrayed?
- Name the object I am holding in my hand.

If you’re eager to explore further, try using a virtual camera app to stream video to FastVLM. The model will describe scene after scene almost instantly, which makes its speed easy to appreciate. The practical applications of this technology extend well beyond experimentation, but the demo is a vivid showcase of how quickly FastVLM processes visual information.
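And if you want to recreate the virtual-camera experiment from code rather than with a separate app, a short loop over camera frames does the trick. This sketch reuses the hypothetical `caption()` helper defined above and relies on OpenCV, which is not part of the FastVLM project:

```python
# pip install opencv-python
import time

import cv2
from PIL import Image

# `caption` is the hypothetical helper from the earlier sketch.

cap = cv2.VideoCapture(0)  # 0 = first camera; a virtual camera shows up as another index
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV yields BGR numpy arrays; the model expects RGB PIL images
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        print(caption(image))
        time.sleep(0.5)  # simple throttle; the browser demo recaptions continuously
except KeyboardInterrupt:
    pass
finally:
    cap.release()
```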
One of the most compelling aspects of FastVLM is that it runs locally in the browser, so no data ever leaves the device and it keeps working even offline. That makes it a strong fit for wearables and assistive technology, where low latency and lightweight processing are critical.
It’s worth noting that the demo uses the 0.5-billion-parameter model. The FastVLM family also includes larger, more capable models with 1.5 billion and 7 billion parameters. Those should produce more accurate output, but running them directly in the browser may not be practical.
Have you tested FastVLM yet? We’d love to hear your thoughts in the comments below! If you’re interested in enhancing your tech setup, don’t forget to check out the latest accessory deals on Amazon.