Magma: A Foundation Model for Multimodal AI Agents

[arXiv Paper]   [Project Page]   [GitHub Repo]   [Hugging Face Model]

This demo is powered by Gradio and uses OmniParserv2 to generate Set-of-Mark prompts.

The demo supports three modes:

  1. Empty text input: the demo falls back to an OmniParser demo.
  2. Text input starting with "Q:": the demo runs visual question answering.
  3. Text input for UI navigation: the demo runs UI navigation.
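
The mode selection described above can be sketched as a small dispatch function. This is a minimal illustration, not the demo's actual code: the names `DemoMode` and `select_mode` are hypothetical, and it assumes that whitespace-only input counts as empty, that the "Q:" prefix is matched case-sensitively, and that any other non-empty text is treated as a UI navigation instruction.

```python
from enum import Enum


class DemoMode(Enum):
    """Hypothetical labels for the three modes listed above."""
    OMNIPARSER = "omniparser"
    VQA = "vqa"
    UI_NAVIGATION = "ui_navigation"


def select_mode(text: str) -> DemoMode:
    """Pick a demo mode from the raw text input.

    Assumptions (not confirmed by the demo source): surrounding
    whitespace is ignored, and any non-empty text without the "Q:"
    prefix is treated as a UI navigation instruction.
    """
    stripped = text.strip()
    if not stripped:
        # Mode 1: no text, so only the OmniParser output is shown.
        return DemoMode.OMNIPARSER
    if stripped.startswith("Q:"):
        # Mode 2: visual question answering.
        return DemoMode.VQA
    # Mode 3: everything else is handled as UI navigation.
    return DemoMode.UI_NAVIGATION
```

For example, `select_mode("Q: What is shown on the screen?")` selects the visual question answering mode, while `select_mode("Open the settings menu")` selects UI navigation.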