Multimodality with Gemini 2.0 Flash

Google AI's new Gemini 2.0 model supports multimodality, meaning we can build both text and image-based AI applications. Moving beyond text-only intelligence is a big step forward for AI, and is particularly exciting for AI engineers looking to build more feature-rich software.

In this article, we'll explore some of Gemini's multimodal capabilities by building an AI agent that can describe underwater scenes and identify fish and coral species.

Setup Instructions

The code in this article has been tested both locally with Python 3.12.7 and in Google Colab with Python 3.10.12.

To run locally, please refer to the setup instructions with uv here. To run in Google Colab, simply run all cells in the provided notebook.

Loading Images

We're going to test Gemini against a few underwater images. The content of these images is fairly challenging: they're from a relatively uncommon environment, and the image quality is okay but certainly not great. That makes them a good test of how Gemini might perform on real-world data.

To begin, we will load our images from the ./images directory.

python

import os
from pathlib import Path
import requests

# check if the images directory exists
if not os.path.exists("./images"):
    os.mkdir("./images")

png_paths = [str(x) for x in Path("./images").glob("*.png")]

# check if we have expected images, otherwise download
if len(png_paths) >= 4:
    print("Images already downloaded")
else:
    print("Downloading images...")
    # download images from the web
    files = ["clown-fish.png", "dotted-fish.png", "many-fish.png", "fish-home.png"]
    for file in files:
        url = f"https://github.com/aurelio-labs/cookbook/blob/main/gen-ai/google-ai/gemini-2/images/{file}?raw=true"
        response = requests.get(url, stream=True)
        # fail loudly instead of saving an error page as a .png
        response.raise_for_status()
        with open(f"./images/{file}", "wb") as f:
            for block in response.iter_content(1024):
                if not block:
                    break
                f.write(block)
    png_paths = [str(x) for x in Path("./images").glob("*.png")]

print(png_paths)

text

    Images already downloaded
    ['images/clown-fish.png', 'images/dotted-fish.png', 'images/many-fish.png', 'images/fish-home.png']
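
Later cells pick images by position (`png_paths[0]`, `png_paths[1]`, and so on), while the order returned by `Path.glob` depends on the filesystem. As a minimal sketch, assuming we want the order to follow the explicit list of file names, a hypothetical helper (`ordered_paths` is not part of the original notebook) could build the list directly:

```python
from pathlib import Path


def ordered_paths(directory: str, names: list[str]) -> list[str]:
    """Return paths for `names` in the given order, skipping files that are missing."""
    base = Path(directory)
    return [str(base / name) for name in names if (base / name).is_file()]


# hypothetical usage with the file names from the download step
files = ["clown-fish.png", "dotted-fish.png", "many-fish.png", "fish-home.png"]
print(ordered_paths("./images", files))
```

With this, index `0` always refers to the first name in `files`, regardless of how the directory happens to be listed.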

Let's see each of these images:

python

import matplotlib.pyplot as plt
from PIL import Image

# we use matplotlib to arrange the images in a grid
fig, axs = plt.subplots(2, 2, figsize=(14, 8))
for ax, path in zip(axs.flat, png_paths):
    img = Image.open(path)
    ax.imshow(img)
    ax.axis('off')
    ax.set_title(path)
plt.tight_layout()

png

We'll use Gemini to describe these images, detect the various fish and corals, and see how precisely Gemini can identify the various objects.

Describing Images

Let's start simple by asking Gemini to describe what it finds in an image.

python

from io import BytesIO

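with BytesIO(open(png_paths[0], "rb").read()) as img_bytes:
    # open the image first so the aspect ratio comes from this image,
    # not from whichever image was opened last in the plotting loop
    img = Image.open(img_bytes)
    # note: resizing is optional, but it helps with performance
    image = img.resize(
        (1024, int(1024 * img.size[1] / img.size[0])),
        Image.Resampling.LANCZOS
    )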
image

png
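
The resize above fixes the width at 1024 pixels and scales the height by the same factor, so the aspect ratio is preserved. The arithmetic can be sketched on its own, with `scaled_size` as a hypothetical helper name rather than anything from PIL:

```python
def scaled_size(width: int, height: int, target_width: int = 1024) -> tuple[int, int]:
    """Return (target_width, new_height), preserving the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # scale the height by the same factor applied to the width
    return target_width, int(target_width * height / width)


print(scaled_size(2048, 1536))  # (1024, 768)
```

The resulting tuple is what would be passed as the first argument to `Image.resize`.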

We set up our config. Within it we need:

The system_instruction telling the LLM what to do; here, to describe the image and to identify and count any fish or coral species.
Our safety_settings, which we keep relatively loose to avoid overly sensitive guardrails against our inputs.
A temperature, which we raise or lower for more or less creative output.

python

from google.genai import types

system_instruction = (
    "Describe what you see in this image, identify any fish or coral species "
    "in the image and tell us how many of each you can see."
)

safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]

config = types.GenerateContentConfig(
    system_instruction=system_instruction,
    temperature=0.1,
    safety_settings=safety_settings,
)

Before generating anything we need to initialize our client, which requires a Google API key. To get a key, you can set up an account in Google AI Studio.

After you have your account and API key, we initialize our google.genai client:

python

import os
from getpass import getpass
from google import genai

# pass your API key here
os.environ["GOOGLE_API_KEY"] = os.getenv("GOOGLE_API_KEY") or getpass(
    "Enter Google API Key: "
)
# initialize our client
client = genai.Client()

Now let's see what we get.

python

from IPython.display import Markdown

model_id = "gemini-2.0-flash-exp"

# run our query against the clownfish image
response = client.models.generate_content(
    model=model_id,
    contents=[
        "Tell me what is here",
        image
    ],
    config=config
)

# Check output
response.text

text

Certainly!

**Overall Scene:**

The image shows an underwater scene, likely a coral reef. The water is clear enough to
see the various marine life and coral formations. The lighting suggests it's daytime,
with natural light filtering through the water.

**Fish Species:**

1.  **Clownfish (Amphiprion sp.):** There are two clownfish visible in the image. They
are characterized by their orange bodies with white stripes and black markings. They
are nestled within the anemone. Based on the black markings, these are likely Clark's
Clownfish (Amphiprion clarkii).
2.  **Wrasse:** There is a small, slender fish with a blue stripe along its body, which
is likely a wrasse. It is swimming in the background.

**Coral Species:**

1.  **Anemone:** The large, tentacled structures in the foreground are anemones. These
are not corals but are often found in coral reef environments. The clownfish are living
within the anemone.
2.  **Hard Coral:** There are various types of hard corals visible in the background.
These include branching corals, plate corals, and some massive corals. The specific
species are difficult to identify without a closer view, but they contribute to the
overall structure of the reef.

**Counts:**

*   **Clownfish:** 2
*   **Wrasse:** 1
*   **Anemone:** 1 (large cluster)
*   **Hard Coral:** Multiple, various types

If you have any other questions or images you'd like me to analyze, feel free to ask!

That looks pretty good. Let's make this more interesting by asking Gemini to draw bounding boxes around the fish in the image. We will need to modify the system_instruction to explain how Gemini should do this.

python

system_instruction = (
    "Return bounding boxes as a JSON array with labels. Never "
    "return masks or code fencing. Limit to 25 objects. "
    "If an object is present multiple times, label them according "
    "to their scientific and popular name."
)  # modifying this prompt much seems to damage performance

config = types.GenerateContentConfig(
    system_instruction=system_instruction,
    temperature=0.1,
    safety_settings=safety_settings,
)

If we generate now, we will receive a JSON array, returned as a string, containing all we need to programmatically plot the bounding boxes.

python

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight the different fish in the image",
        image
    ],
    config=config
)

Markdown(response.text)

json

[
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 689, 705], "label": "Clark's anemonefish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [250, 20, 281, 48], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [100, 458, 133, 486], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [437, 86, 467, 139], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [100, 401, 151, 435], "label": "Unknown fish"},
    {"box_2d": [296, 43, 326, 75], "label": "Labroides dimidiatus"},
    {"box_2d": [296, 43, 326, 75], "label": "Cleaner wrasse"},
    {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},
    {"box_2d": [519, 549, 68

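Okay, we got a lot of repetition, and the output was cut off mid-object. We can reduce the repetition by increasing the frequency_penalty in our config (the config below also lowers the temperature to 0.05).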

python

system_instruction = (
    "Return bounding boxes as a JSON array with labels. Never "
    "return masks or code fencing. Limit to 25 objects. "
    "If an object is present multiple times, label them according "
    "to their scientific and popular name."
)  # modifying this prompt much seems to damage performance

config = types.GenerateContentConfig(
    system_instruction=system_instruction,
    temperature=0.05,
    safety_settings=safety_settings,
    frequency_penalty=1.0,  # reduce repetition
)

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight the different fish in the image",
        image
    ],
    config=config
)

response.text  # let's see the labels

json

[
    {"box_2d": [104, 458, 139, 486], "label": "fish"},
    {"box_2d": [279, 41, 318, 76], "label": "fish"},
    {"box_2d": [439, 87, 465, 130], "label": "fish"},
    {"box_2d": [106, 398, 150, 437], "label": "fish"},
    {"box_2d": [279, 159, 320, 187], "label": "fish"},
    {"box_2d": [518, 549, 679, 703], "label": "Amphiprion clarkii, Clark's anemonefish"},
    {"box_2d": [497, 418, 631, 468], "label": "Amphiprion ocellaris, Ocellaris clownfish"},
    {"box_2d": [106, 437, 135, 458], "label": "fish"},
    {"box_2d": [106, 437, 135, 458], "label": "fish"},
    {"box_2d": [279, 41, 318, 76], "label": "fish"},
    {"box_2d": [439, 87, 465, 130], "label": "fish"}
]
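
Even with the penalty, a few entries in this response repeat exactly (same coordinates and same label). As a post-processing sketch, assuming the response has already been parsed into a list of dicts with `box_2d` and `label` keys, a hypothetical `dedupe_boxes` helper could drop the exact repeats while keeping the original order:

```python
def dedupe_boxes(boxes: list[dict]) -> list[dict]:
    """Drop exact repeats (same coordinates and label), keeping first occurrences."""
    seen = set()
    unique = []
    for box in boxes:
        # lists are not hashable, so convert the coordinates to a tuple
        key = (tuple(box["box_2d"]), box.get("label"))
        if key not in seen:
            seen.add(key)
            unique.append(box)
    return unique


sample = [
    {"box_2d": [106, 437, 135, 458], "label": "fish"},
    {"box_2d": [106, 437, 135, 458], "label": "fish"},
    {"box_2d": [279, 41, 318, 76], "label": "fish"},
]
print(len(dedupe_boxes(sample)))  # 2
```

This only removes identical entries; near-duplicate boxes that differ by a few units would need an overlap-based check instead.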

Interesting. Let's try plotting this and see what we get. The first thing we need to do is extract the JSON from our response, which we do by matching the expected pattern with a regex.

python

import re
import json

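# the response may wrap the JSON in a ```json fence; fall back to the raw text if not
json_pattern = re.compile(r'```json\n(.*?)```', re.DOTALL)
match = json_pattern.search(response.text)
json_output = match.group(1) if match else response.text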

# convert our json string to a list of dicts
bounding_boxes = json.loads(json_output)
bounding_boxes

json

[
    {'box_2d': [104, 458, 139, 486], 'label': 'fish'},
    {'box_2d': [279, 41, 318, 76], 'label': 'fish'},
    {'box_2d': [439, 87, 465, 130], 'label': 'fish'},
    {'box_2d': [106, 398, 150, 437], 'label': 'fish'},
    {'box_2d': [279, 159, 320, 187], 'label': 'fish'},
    {'box_2d': [518, 549, 679, 703],
     'label': "Amphiprion clarkii, Clark's anemonefish"},
    {'box_2d': [497, 418, 631, 468],
     'label': 'Amphiprion ocellaris, Ocellaris clownfish'},
    {'box_2d': [106, 437, 135, 458], 'label': 'fish'},
    {'box_2d': [106, 437, 135, 458], 'label': 'fish'},
    {'box_2d': [279, 41, 318, 76], 'label': 'fish'},
    {'box_2d': [439, 87, 465, 130], 'label': 'fish'}
]

We'll wrap this into a parse_json function to make it easier to use.

python

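def parse_json(llm_output: str) -> list[dict]:
    # extract the JSON inside a ```json fence, falling back to the raw text
    match = json_pattern.search(llm_output)
    json_output = match.group(1) if match else llm_output
    return json.loads(json_output)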
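
Parsing fails outright when a response is cut off mid-array, as happened with the repetitive output earlier. One possible fallback, sketched here under the assumption that each entry is a flat object with no nested braces, is to collect only the complete objects. `salvage_objects` is a hypothetical helper, not part of the original notebook:

```python
import json
import re

# matches one flat JSON object (no nested braces), e.g. a single bounding box entry
_object_pattern = re.compile(r"\{[^{}]*\}")


def salvage_objects(text: str) -> list[dict]:
    """Collect every complete, parseable flat object from possibly truncated JSON."""
    objects = []
    for match in _object_pattern.finditer(text):
        try:
            objects.append(json.loads(match.group(0)))
        except json.JSONDecodeError:
            continue  # skip fragments that are not valid JSON
    return objects


truncated = (
    '[\n'
    '  {"box_2d": [519, 549, 689, 705], "label": "Amphiprion clarkii"},\n'
    '  {"box_2d": [519, 549, 68'
)
print(salvage_objects(truncated))
# [{'box_2d': [519, 549, 689, 705], 'label': 'Amphiprion clarkii'}]
```

The pattern is deliberately simple, so a label that itself contains a brace would not be handled; for the bounding box format shown here that seems an acceptable trade-off.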

Finally, we create a plot_bounding_boxes function to plot the bounding boxes.

python

from PIL import ImageDraw, ImageColor

colors = [colorname for (colorname, colorcode) in ImageColor.colormap.items()]

def plot_bounding_boxes(image: Image.Image, llm_output: str) -> Image.Image:
    # avoid modifying the original image
    img = image.copy()
    # we need the image size to convert normalized coords to absolute below
    width, height = img.size
    # init drawing object
    draw = ImageDraw.Draw(img)
    # parse out the bounding boxes JSON from markdown
    bounding_boxes = parse_json(llm_output=llm_output)

    # iterate over LLM defined bounding boxes
    for i, bounding_box in enumerate(bounding_boxes):
        # set diff color for each box
        color = colors[i % len(colors)]

        # from normalized to absolute coords
        abs_y1 = int(bounding_box["box_2d"][0]/1000 * height)
        abs_x1 = int(bounding_box["box_2d"][1]/1000 * width)
        abs_y2 = int(bounding_box["box_2d"][2]/1000 * height)
        abs_x2 = int(bounding_box["box_2d"][3]/1000 * width)

        # coords might be going right to left, swap if so
        if abs_x1 > abs_x2:
            abs_x1, abs_x2 = abs_x2, abs_x1
        if abs_y1 > abs_y2:
            abs_y1, abs_y2 = abs_y2, abs_y1

        # draw the bounding boxes on our Draw object
        draw.rectangle(
            ((abs_x1, abs_y1), (abs_x2, abs_y2)), outline=color, width=2
        )

        # draw text labels
        if "label" in bounding_box:
            draw.text((abs_x1 + 2, abs_y1 - 14), bounding_box["label"], fill=color)

    return img

plot_bounding_boxes(image, response.text)

png
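
The coordinate handling in plot_bounding_boxes treats each `box_2d` as `[y1, x1, y2, x2]` on a 0 to 1000 scale, and rescales it to the pixel size of the image. A standalone sketch of that conversion, with `to_pixel_box` as a hypothetical helper that uses integer division in place of `int(...)` on a float:

```python
def to_pixel_box(box_2d: list[int], width: int, height: int) -> tuple[int, int, int, int]:
    """Convert [y1, x1, y2, x2] on a 0-1000 scale to pixel (left, top, right, bottom)."""
    y1, x1, y2, x2 = box_2d
    # sorting handles boxes whose corners arrive in reverse order
    left, right = sorted((x1 * width // 1000, x2 * width // 1000))
    top, bottom = sorted((y1 * height // 1000, y2 * height // 1000))
    return left, top, right, bottom


print(to_pixel_box([500, 250, 1000, 750], width=1000, height=400))
# (250, 200, 750, 400)
```

Note the axis order: the y values come first in `box_2d`, while most drawing APIs (including `ImageDraw.rectangle`) expect x first.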

Gemini is doing well, but struggling to precisely label the various types of fish. Let's try asking Gemini for specific types of fish and corals in the image.

python

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight the different corals in the image",
        image
    ],
    config=config
)

python

response.text

json

[
    {"box_2d": [189, 172, 305, 316], "label": "Acropora coral"},
    {"box_2d": [409, 398, 625, 470], "label": "Heteractis magnifica"},
    {"box_2d": [189, 307, 345, 468], "label": "Acropora coral"},
    {"box_2d": [175, 460, 319, 589], "label": "Acropora coral"},
    {"box_2d": [305, 468, 479, 611], "label": "Acropora coral"},
    {"box_2d": [305, 608, 479, 748], "label": "Acropora coral"},
    {"box_2d": [163, 590, 305, 719], "label": "Acropora coral"},
    {"box_2d": [468, 549, 687, 705], "label": "Heteractis magnifica"},
    {"box_2d": [468, 713, 609, 845], "label": "Acropora coral"},
    {"box_2d": [163, 709, 305, 842], "label": "Acropora coral"},
    {"box_2d": [163, 840, 305, 972], "label": "Acropora coral"},
    {"box_2d": [305, 741, 479, 884], "label": "Acropora coral"},
    {"box_2d": [163, 939, 305, 1000], "label": "Acropora coral"},
    {"box_2d": [305, 879, 479, 1000], "label": "Acropora coral"},
    {"box_2d": [468, 839, 609, 972], "label": "Acropora coral"},
    {"box_2d": [468, 937, 609, 1000], "label": "Acropora coral"},
    {"box_2d": [609, 175, 743, 328], "label": "Acropora coral"},
    {"box_2d": [609, 318, 743, 471], "label": "Acropora coral"},
    {"box_2d": [609, 458, 743, 611], "label": "Acropora coral"},
    {"box_2d": [508, 590, 687, 705], "label":"Heteractis magnifica"}
]
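
With this many boxes, a quick tally of the labels can be easier to read than the raw list. A small sketch using the standard library's `Counter`, assuming the response has been parsed into a list of dicts (`count_labels` is a hypothetical helper, and the sample holds only the first three entries above):

```python
from collections import Counter


def count_labels(boxes: list[dict]) -> Counter:
    """Tally how often each label appears; entries without a label count as 'unlabeled'."""
    return Counter(box.get("label", "unlabeled") for box in boxes)


sample = [
    {"box_2d": [189, 172, 305, 316], "label": "Acropora coral"},
    {"box_2d": [409, 398, 625, 470], "label": "Heteractis magnifica"},
    {"box_2d": [189, 307, 345, 468], "label": "Acropora coral"},
]
print(count_labels(sample).most_common())
# [('Acropora coral', 2), ('Heteractis magnifica', 1)]
```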

Let's view the image and bounding boxes:

python

plot_bounding_boxes(image, response.text)

png

Most of these labels are incorrect, but interestingly there are two Heteractis magnifica (i.e. magnificent sea anemone) correctly identified. Both of these bounding boxes surround both the clownfish and the anemone itself. Given that clownfish tend to live amongst anemones, it is plausible that Gemini associates an anemone that appears alongside a clownfish with Heteractis magnifica, which would make the labelling task much easier.
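
One way to check how much an anemone box encloses a clownfish box is to measure what fraction of the fish box lies inside it. The sketch below assumes both boxes use the same `[y1, x1, y2, x2]` scale and come from the same image. It compares the Amphiprion clarkii box from the earlier fish response with one of the Heteractis magnifica boxes above; `overlap_fraction` is a hypothetical helper:

```python
def overlap_fraction(inner: list[int], outer: list[int]) -> float:
    """Fraction of `inner`'s area covered by `outer`; boxes are [y1, x1, y2, x2]."""
    iy1, ix1, iy2, ix2 = inner
    oy1, ox1, oy2, ox2 = outer
    # height and width of the intersection, clamped at zero when the boxes do not overlap
    overlap_h = max(0, min(iy2, oy2) - max(iy1, oy1))
    overlap_w = max(0, min(ix2, ox2) - max(ix1, ox1))
    inner_area = (iy2 - iy1) * (ix2 - ix1)
    return (overlap_h * overlap_w) / inner_area if inner_area > 0 else 0.0


clownfish = [518, 549, 679, 703]  # "Amphiprion clarkii, Clark's anemonefish"
anemone = [468, 549, 687, 705]    # "Heteractis magnifica"
print(overlap_fraction(clownfish, anemone))  # 1.0
```

A value of 1.0 means the fish box sits entirely within the anemone box. Because the two boxes come from separate requests, this is only a rough consistency check.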

Let's try asking Gemini to label the clownfish in the image.

python

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight the different clownfish in the image",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)

png

We can see if Gemini can identify the cleaner wrasse in the image.

python

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight any cleaner wrasse in this image",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)

png

Surprisingly, Gemini accurately labels the wrasse (Labroides dimidiatus) to the left despite the limited resolution of the image. Let's try some more images:

python

with BytesIO(open(png_paths[1], "rb").read()) as img_bytes:
    # open the image first so the aspect ratio comes from this image
    img = Image.open(img_bytes)
    # note: resizing is optional, but it helps with performance
    image = img.resize(
        (1024, int(1024 * img.size[1] / img.size[0])),
        Image.Resampling.LANCZOS
    )

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight any fish in this image",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)

png

Here Gemini labels the large fish in the middle of the image as Diagramma pictum, a sweetlips. Sweetlips is correct for the genus, but Diagramma pictum is incorrect. Nonetheless, this is very close and a great start. Gemini also highlights several other fish in the background.

python

response = client.models.generate_content(
    model=model_id,
    contents=[
        "What is the big fish in the middle of the image? Please highlight it.",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)

png

By specifying that we want to focus on the central fish, Gemini does so and labels it as a sweetlips — but this time, Gemini does not highlight the other fish in the background.

Let's try another image:

python

with BytesIO(open(png_paths[2], "rb").read()) as img_bytes:
    # open the image first so the aspect ratio comes from this image
    img = Image.open(img_bytes)
    # note: resizing is optional, but it helps with performance
    image = img.resize(
        (1024, int(1024 * img.size[1] / img.size[0])),
        Image.Resampling.LANCZOS
    )

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight any fish in this image",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)

png

This is interesting. There are many fish in the image and Gemini catches the majority of them. However, Gemini doesn't label them with any level of precision. Nonetheless, Gemini did label the two Naso lituratus (i.e. unicornfish).

Let's try asking Gemini to label the corals in the image.

python

with BytesIO(open(png_paths[2], "rb").read()) as img_bytes:
    # open the image first so the aspect ratio comes from this image
    img = Image.open(img_bytes)
    # note: resizing is optional, but it helps with performance
    image = img.resize(
        (1024, int(1024 * img.size[1] / img.size[0])),
        Image.Resampling.LANCZOS
    )

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Highlight the corals in this image",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)

png

It's hard to read the labels from the image, so we can print them directly as before:

python

response.text

json

[
    {"box_2d": [158, 530, 476, 709], "label": "Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [469, 679, 628, 815], "label": "Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [604, 397, 715, 493], "label": "Staghorn Coral (Acropora cervicornis)"},
    {"box_2d": [680, 453, 810, 574], "label": "Staghorn Coral (Acropora cervicornis)"},
    {"box_2d": [690, 574, 839, 682], "label": "Staghorn Coral (Acropora cervicornis)"},
    {"box_2d": [170, 690, 354, 815], "label": "Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [369, 470, 502, 563], "label": "Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [481, 809, 677, 994], "label": "Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [315, 235, 469, 378], "label": "Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [405, 361, 528, 470], "label": "Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [675, 10, 849, 139], "label": "Staghorn Coral (Acropora cervicornis)"},
    {"box_2d": [847, 105, 998, 306], "label": "Staghorn Coral (Acropora cervicornis)"},
    {"box_2d": [764, 305, 918, 453], "label":"Staghorn Coral (Acropora cervicornis)"},
    {"box_2d": [875, 413, 998, 560], "label":"Staghorn Coral (Acropora cervicornis)"},
    {"box_2d": [195, 37, 408, 206], "label":"Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [569, 137, 748, 315], "label":"Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [690, 684, 810, 732], "label":"Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [675, 753, 748, 780], "label":"Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [195, 306, 391, 469], "label":"Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [481, 530, 654, 648], "label":"Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [748, 139, 825, 190], "label":"Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [748, 341, 796, 415], "label":"Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [476, 78, 518, 115], "label":"Brain Coral (Diploria labyrinthiformis)"},
    {"box_2d": [267, 450, 343, 476], "label":"Brain Coral (Diploria labyrinthiformis)"}
]
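Reading two dozen boxes by eye is tedious, so it can help to parse the JSON and summarize it. The sketch below is one way to do that, under a few assumptions: `parse_boxes` and `to_pixel_box` are hypothetical helper names, the response may or may not arrive wrapped in a markdown code fence, each `box_2d` follows the `[y_min, x_min, y_max, x_max]` order on a 0 to 1000 scale described in Google's Gemini documentation, and the 1024x768 image size is chosen purely for illustration.

```python
import json
from collections import Counter

FENCE = "`" * 3  # markdown code fence marker


def parse_boxes(response_text: str) -> list[dict]:
    """Parse a JSON list of {"box_2d": [...], "label": "..."} items."""
    text = response_text.strip()
    # assumption: the model sometimes wraps its JSON in a markdown code fence
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)


def to_pixel_box(box_2d, width, height):
    """Convert [y_min, x_min, y_max, x_max] on a 0-1000 scale to pixel (x1, y1, x2, y2)."""
    y_min, x_min, y_max, x_max = box_2d
    return (
        x_min * width // 1000,
        y_min * height // 1000,
        x_max * width // 1000,
        y_max * height // 1000,
    )


# three entries copied from the response above
sample = """[
  {"box_2d": [158, 530, 476, 709], "label": "Brain Coral (Diploria labyrinthiformis)"},
  {"box_2d": [469, 679, 628, 815], "label": "Brain Coral (Diploria labyrinthiformis)"},
  {"box_2d": [604, 397, 715, 493], "label": "Staghorn Coral (Acropora cervicornis)"}
]"""

boxes = parse_boxes(sample)
label_counts = Counter(box["label"] for box in boxes)
print(label_counts.most_common())
print(to_pixel_box(boxes[0]["box_2d"], width=1024, height=768))
```

Applied to the full response above, the same tally would report 17 brain coral boxes and 7 staghorn coral boxes.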

Here we see a lot of brain and staghorn coral labels. For the most part they seem to be labelled incorrectly, so Gemini still seems to be struggling with corals. Let's try one final image:

python

with BytesIO(open(png_paths[3], "rb").read()) as img_bytes:
    # open first so the aspect ratio comes from this image's own size
    img = Image.open(img_bytes)
    # note: resizing is optional, but it helps with performance
    image = img.resize(
        (1024, int(1024 * img.size[1] / img.size[0])),
        Image.Resampling.LANCZOS
    )

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Where is the fish hiding in this image?",
        image
    ],
    config=config
)
plot_bounding_boxes(image, response.text)

png

Surprisingly, Gemini does a good job of highlighting a few almost hidden fish. The central fish is accurately labeled as a damselfish. Finally, let's ask Gemini to tell us what we're looking at with this final image.

python

system_instruction = (
    "Describe what you see in this image, identify any fish or coral species "
    "in the image and tell us how many of each you can see."
)

config = types.GenerateContentConfig(
    system_instruction=system_instruction,
    temperature=0.1,
    safety_settings=safety_settings,
)

response = client.models.generate_content(
    model=model_id,
    contents=[
        "Explain what this image contains, what is happening, and what is the location?",
        image
    ],
    config=config
)

Markdown(response.text)

text

Certainly!

The image shows an underwater scene featuring a large, cylindrical object with a hole in the center. The object appears to be made of metal and is covered in marine growth, giving it a textured, orange-brown appearance. There are several small fish swimming around the object and in the surrounding water. The water is a clear, turquoise color.

Based on the appearance of the object and the surrounding environment, it is likely that this is a part of a shipwreck. The cylindrical object could be a gun barrel or some other structural component of the ship. The location is underwater, likely in a tropical or subtropical region given the clear water and the presence of marine life.

I can see 1 fish inside the hole and many more swimming around the object. I cannot identify the species of fish or coral in the image.

Gemini does a good job of describing the scene. Let's challenge Gemini to tell us the exact location of the shipwreck.

python

response = client.models.generate_content(
    model=model_id,
    contents=[
        "What is your best guess as to the exact location of this shipwreck?",
        image
    ],
    config=config
)

Markdown(response.text)

text

Certainly!

In the image, I see a section of a shipwreck underwater. The main focus is a large,
circular opening, possibly a gun port or a pipe, that is heavily encrusted with marine
growth, giving it a rough, orange-brown texture. The surrounding structure appears to
be part of the ship's hull or deck, with visible metal beams and panels. The water is a
clear, turquoise color, and there are numerous small fish swimming around the structure.

I can identify the following:

*   **Fish:** There are many small, silvery fish, possibly baitfish, and a few larger,
darker fish. I can count at least 30 small fish and 3 larger fish.
*   **Coral:** I do not see any coral in this image. The orange-brown growth on the
shipwreck appears to be encrusting organisms like sponges or algae, not coral.

I cannot determine the exact location of the shipwreck from the image alone. I do not
have access to external databases or the ability to make inferences about the location
based on the visual information.

Unsurprisingly, Gemini does not manage to identify the exact location of the shipwreck. We'll keep this specific image and question as a test for Gemini's future iterations.
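One lightweight way to keep a question like this around for later models is to log each answer alongside the model name and prompt. The sketch below is only an illustration: `record_probe` and `load_probes` are hypothetical helpers, the JSON Lines layout is an arbitrary choice, and the model ID, image name, and answer are placeholders rather than values from this article.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_probe(log_path, model_id, prompt, image_name, answer):
    """Append one model answer to a JSON Lines log and return the stored record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "prompt": prompt,
        "image": image_name,
        "answer": answer,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_probes(log_path):
    """Read every stored record back as a list of dicts."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# demo with placeholder values; in the notebook, `answer` would be response.text
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "probes.jsonl"
    record_probe(
        log_file,
        model_id="example-model",  # hypothetical placeholder
        prompt="What is your best guess as to the exact location of this shipwreck?",
        image_name="shipwreck.png",  # hypothetical placeholder
        answer="I cannot determine the exact location...",
    )
    history = load_probes(log_file)
    print(len(history), history[0]["model_id"])
```

Rerunning the same prompt against a newer model and appending its answer would make the two responses easy to compare side by side.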

That's it for our intro to working with text and images in Gemini 2.0 Flash. The model is already impressive, but there are certainly places where Gemini can improve. Nonetheless, the multimodal capabilities are more than enough for us to build some strong multimodal AI applications.