
Multi-Modal Prompting

Multi-modal prompting encompasses techniques for working with AI models that process and generate images, video, audio, and combinations of media types. As AI systems become increasingly capable across modalities, understanding how to craft effective prompts for visual and multi-modal tasks is essential.

This guide covers prompting strategies for image generation, image understanding, video creation, and cross-modal applications.


Text-to-Image Fundamentals

Text-to-image models translate natural language descriptions into visual outputs. The specificity and structure of your prompt strongly influence the quality of the generated image.

The Progressive Detail Approach

Start simple and add detail iteratively to understand how each element affects the output.

Level 1 - Basic Concept:

A lighthouse on a cliff

Level 2 - Add Setting and Mood:

A lighthouse on a rocky cliff at sunset, dramatic clouds in the sky, waves crashing below

Level 3 - Specify Style and Details:

A lighthouse on a rocky cliff at golden hour sunset, dramatic orange and purple clouds, powerful waves crashing against the rocks below, cinematic composition, photorealistic style, volumetric lighting

Level 4 - Technical Refinement:

A solitary white lighthouse perched on dramatic rocky cliffs during golden hour, the setting sun casting long shadows and painting the clouds in vibrant oranges and deep purples, powerful Atlantic waves crashing against weathered rocks creating white spray, shot with a 35mm lens, photorealistic rendering, volumetric god rays, 8K resolution, National Geographic photography style


Prompt Structure for Image Generation

A well-structured image prompt typically includes:

[Subject] + [Setting/Environment] + [Style/Medium] + [Lighting] + [Composition] + [Technical Parameters]

Component Breakdown

Component       Examples
Subject         A woman, a dragon, a coffee cup, an abstract pattern
Setting         In a forest, on Mars, in a cozy café, floating in space
Style/Medium    Oil painting, 3D render, watercolor, photorealistic, anime
Lighting        Golden hour, neon lights, soft diffused, dramatic chiaroscuro
Composition     Close-up portrait, wide establishing shot, bird's eye view
Technical       8K, highly detailed, sharp focus, depth of field
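
The component structure above can be assembled programmatically when you generate many prompts. The following is a minimal sketch, not a standard API: the `ImagePrompt` class and its field names are hypothetical, chosen to mirror the six components.

```python
from dataclasses import dataclass


@dataclass
class ImagePrompt:
    """Hypothetical container for the six prompt components."""
    subject: str
    setting: str = ""
    style: str = ""
    lighting: str = ""
    composition: str = ""
    technical: str = ""

    def render(self) -> str:
        # Keep the documented order and skip any component left empty.
        parts = [self.subject, self.setting, self.style,
                 self.lighting, self.composition, self.technical]
        return ", ".join(p.strip() for p in parts if p.strip())


prompt = ImagePrompt(
    subject="a ceramic coffee mug",
    setting="on a marble surface",
    style="commercial advertising style",
    lighting="soft morning light from the left",
    technical="sharp focus",
)
print(prompt.render())
```

Keeping each component in its own field makes it easy to vary one element (for example, lighting) while holding the rest constant, which fits the progressive detail approach.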

Example Compositions

Portrait:

Close-up portrait of an elderly fisherman with weathered skin and kind eyes, wearing a cable-knit sweater, soft natural window lighting, shallow depth of field, shot on medium format film, intimate and dignified mood

Landscape:

Sweeping vista of terraced rice paddies in Vietnam at sunrise, morning mist rising from the valleys, a farmer in a conical hat walking along a path, lush green tones, travel photography style, golden hour lighting

Product:

Minimalist product photography of a ceramic coffee mug on a marble surface, steam rising from fresh coffee, soft morning light from the left, clean white background, commercial advertising style, sharp focus

Abstract:

Abstract fluid art composition in deep ocean blues and metallic gold, organic flowing shapes suggesting underwater currents, high contrast, luminescent quality, suitable for large canvas print


Negative Prompts

Negative prompts specify what you don't want in the image, helping avoid common issues and unwanted elements.

Common Negative Prompt Elements

Negative prompt: blurry, low quality, distorted, deformed, ugly, duplicate, 
watermark, text, signature, out of frame, cropped, low resolution, artifacts, 
noise, oversaturated, underexposed

Use Case-Specific Negatives

For Portraits:

Negative: deformed face, extra limbs, mutated hands, poorly drawn hands, 
poorly drawn face, long neck, bad anatomy, bad proportions, cross-eyed

For Architecture:

Negative: impossible geometry, floating elements, inconsistent perspective, 
warped lines, physically impossible structure

For Product Photography:

Negative: cluttered background, harsh shadows, overexposed highlights, 
visible imperfections, distracting elements, uneven lighting
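
One way to manage these lists is to keep a shared base list and merge in use-case terms on demand. The sketch below assumes hypothetical names (`COMMON_NEGATIVES`, `USE_CASE_NEGATIVES`, `build_negative_prompt`) and uses abbreviated versions of the lists above; it only builds the text and does not call any image model.

```python
from typing import List, Optional

# Abbreviated versions of the lists above, for illustration only.
COMMON_NEGATIVES = ["blurry", "low quality", "distorted", "watermark", "text"]

USE_CASE_NEGATIVES = {
    "portrait": ["deformed face", "extra limbs", "mutated hands", "bad anatomy"],
    "architecture": ["impossible geometry", "floating elements", "warped lines"],
    "product": ["cluttered background", "harsh shadows", "uneven lighting"],
}


def build_negative_prompt(use_case: str, extra: Optional[List[str]] = None) -> str:
    """Merge common, use-case, and ad hoc terms, dropping duplicates in order."""
    terms = COMMON_NEGATIVES + USE_CASE_NEGATIVES.get(use_case, []) + (extra or [])
    seen = set()
    unique = []
    for term in terms:
        key = term.lower().strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(term.strip())
    return ", ".join(unique)


print(build_negative_prompt("product", extra=["text", "reflections"]))
```

Deduplicating keeps the negative prompt short, which matters on platforms that limit prompt length.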

Model-Specific Syntax

Different platforms have unique syntax for controlling generation parameters.

Midjourney Parameters

/imagine prompt: A cyberpunk street market in Tokyo --ar 16:9 --v 6 --style raw --q 2

Parameters:
--ar 16:9       Aspect ratio (width:height)
--v 6           Model version
--style raw     Less stylized, more literal interpretation
--q 2           Quality level (higher = more detail, slower; accepted values depend on the model version)
--s 250         Stylization amount (0-1000)
--c 25          Chaos/variation (0-100)
--no trees      Exclude specific elements
--seed 12345    Reproducible results
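
If you assemble Midjourney commands from a script or spreadsheet, a small formatter keeps the flags consistent. This is a sketch only: `midjourney_command` is a hypothetical helper that builds the command text to paste into Midjourney. It does not call any Midjourney API and does not check that the values are valid for a given model version.

```python
def midjourney_command(prompt: str, **params) -> str:
    """Append --flag value pairs to a prompt; a value of True emits a bare flag."""
    parts = ["/imagine prompt:", prompt.strip()]
    for flag, value in params.items():
        if value is None or value is False:
            continue  # skip parameters that were left unset
        parts.append(f"--{flag}")
        if value is not True:
            parts.append(str(value))
    return " ".join(parts)


cmd = midjourney_command(
    "A cyberpunk street market in Tokyo",
    ar="16:9", v=6, style="raw", s=250, no="trees", seed=12345,
)
print(cmd)
```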

DALL-E 3 Best Practices

DALL-E 3 works best with natural language descriptions rather than keyword lists.

Less effective: "cat, orange, sitting, window, sunlight, photorealistic, 8K"

More effective: "A fluffy orange tabby cat sitting contentedly on a windowsill, 
bathed in warm afternoon sunlight streaming through the glass. The cat's fur 
glows golden in the light. Photorealistic style with fine detail."

Stable Diffusion Weighting

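Syntax: (element:weight) where 1.0 is default. This notation comes from the AUTOMATIC1111 web UI and compatible front ends; other Stable Diffusion interfaces may use different syntax.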

Example: A (beautiful sunset:1.3) over the ocean, with (dramatic clouds:1.2) 
and a (small sailboat:0.8) in the distance

Higher weights (>1.0) = more emphasis
Lower weights (<1.0) = less emphasis
Plain parentheses multiply the weight by 1.1 per pair, so ((very important element)) is about 1.21
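
The weighting syntax is easy to mistype by hand, so it can help to generate it. The sketch below uses two hypothetical helpers: `weighted` formats a term, and `nested_paren_weight` computes the effective weight of nested plain parentheses under the assumption that each pair multiplies attention by 1.1, as in the AUTOMATIC1111 web UI.

```python
def weighted(term: str, weight: float = 1.0) -> str:
    """Format a term in (term:weight) syntax; weight 1.0 needs no wrapper."""
    if weight == 1.0:
        return term
    return f"({term}:{weight:g})"


def nested_paren_weight(depth: int, factor: float = 1.1) -> float:
    """Effective weight of a term wrapped in `depth` pairs of plain parentheses,
    assuming each pair multiplies attention by `factor`."""
    return round(factor ** depth, 4)


prompt = (
    f"A {weighted('beautiful sunset', 1.3)} over the ocean, "
    f"with {weighted('dramatic clouds', 1.2)} "
    f"and a {weighted('small sailboat', 0.8)} in the distance"
)
print(prompt)
print(nested_paren_weight(2))  # 1.21
```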

Image-to-Text Prompting

Vision-language models can analyze, describe, and reason about images. Specific, well-scoped prompts improve the accuracy and relevance of the analysis.

Basic Image Analysis

Analyze this image and provide:
1. A detailed description of what you see
2. The apparent setting and context
3. Notable objects and their spatial relationships
4. The mood or atmosphere conveyed
5. Any text visible in the image

Structured Image Extraction

Extract information from this receipt image:

<output_format>
{
  "store_name": "",
  "date": "",
  "items": [
    {"name": "", "quantity": 0, "price": 0.00}
  ],
  "subtotal": 0.00,
  "tax": 0.00,
  "total": 0.00,
  "payment_method": ""
}
</output_format>

If any field is unclear or not visible, use null.
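
Model replies do not always match the requested format, so it is worth validating them before use. The following sketch checks a reply against the receipt format above using only the standard library. The function name `validate_receipt`, the `EXPECTED_FIELDS` table, and the sample reply are illustrative assumptions, not output from a real model.

```python
import json

# Expected top-level fields and their JSON types; null is always allowed.
EXPECTED_FIELDS = {
    "store_name": str, "date": str, "items": list,
    "subtotal": (int, float), "tax": (int, float),
    "total": (int, float), "payment_method": str,
}


def validate_receipt(raw: str) -> list:
    """Return a list of problems found in the model's JSON reply (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected in EXPECTED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif data[field] is not None and not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}")
    # Cross-check the arithmetic only when all three amounts are numbers.
    amounts = [data.get(k) for k in ("subtotal", "tax", "total")]
    if all(isinstance(a, (int, float)) for a in amounts):
        if abs(amounts[0] + amounts[1] - amounts[2]) > 0.01:
            problems.append("subtotal + tax does not equal total")
    return problems


# Made-up reply with a deliberate arithmetic mismatch.
reply = (
    '{"store_name": "Corner Market", "date": null, '
    '"items": [{"name": "Milk", "quantity": 1, "price": 2.5}], '
    '"subtotal": 2.5, "tax": 0.2, "total": 2.8, "payment_method": null}'
)
print(validate_receipt(reply))
```

A failed check is a reasonable trigger for re-prompting the model with the list of problems attached.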

Comparative Image Analysis

Compare these two images and identify:
1. Similarities in composition, color, or subject matter
2. Key differences between them
3. Which image is more effective for [specific purpose] and why
4. Technical quality comparison (lighting, focus, exposure)

Visual Reasoning

Look at this image of a room and answer:
1. What time of day does this appear to be? What visual cues indicate this?
2. What season might it be? Why?
3. What can you infer about the person who lives here?
4. What is happening or has recently happened in this scene?

Image Editing Prompts

Modern models support various image editing operations through natural language instructions.

Inpainting (Editing Regions)

Edit the selected region of this image:
- Replace the cloudy sky with a vibrant sunset
- Keep the foreground buildings exactly as they are
- Ensure the lighting on the buildings is consistent with sunset lighting
- Blend the edges naturally

Style Transfer

Transform this photograph into the style of [artist/movement]:
- Maintain the original composition and subject matter
- Apply the characteristic brushwork/technique of [style]
- Keep colors within the typical palette of [style]
- Preserve recognizable features of the original subject

Outpainting (Extending Images)

Extend this image to the [left/right/top/bottom]:
- Continue the natural environment seamlessly
- Maintain consistent lighting and perspective
- Add contextually appropriate elements
- Ensure the extension feels like a natural part of the original scene

Video Generation Prompts

Text-to-video models require additional considerations for temporal consistency and motion.

Video Prompt Structure

[Scene description] + [Motion/Action] + [Camera movement] + [Duration/Pacing] + [Style]

Example Video Prompts

Establishing Shot:

A slow cinematic drone shot ascending over a misty forest at dawn, revealing a mountain range in the distance as the camera rises above the treeline, 10-second duration, smooth movement, nature documentary style

Character Action:

A chef in a professional kitchen carefully plates a gourmet dish, hands moving with precision, close-up on the hands transitioning to medium shot revealing the focused expression, warm kitchen lighting, 8 seconds, slight camera drift

Abstract/Motion Graphics:

Flowing liquid metal morphing between geometric shapes, reflecting rainbow iridescent colors, seamless looping animation, 6 seconds, smooth continuous transformation, dark background

Video-Specific Considerations

Motion guidance:
- Specify speed: "slow-motion," "real-time," "time-lapse"
- Describe transitions: "dissolve to," "cut to," "smooth transition"
- Camera movement: "static," "pan left," "zoom in," "tracking shot"
- Subject motion: "walking toward camera," "rotating slowly," "bouncing"

Temporal consistency:
- Maintain character appearance throughout
- Keep lighting consistent across frames
- Specify if environment should change or remain static
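
Because video prompts tend to fail when motion or camera direction is left out, a builder that requires every part of the structure can catch omissions early. This is a minimal sketch with a hypothetical function name and parameters; the duration phrasing follows the examples above.

```python
def build_video_prompt(scene: str, motion: str, camera: str,
                       seconds: int, style: str) -> str:
    """Join the five parts of the video prompt structure, rejecting empty ones."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    names = ("scene", "motion", "camera", "duration", "style")
    parts = [scene, motion, camera, f"{seconds}-second duration", style]
    missing = [name for name, part in zip(names, parts) if not part.strip()]
    if missing:
        raise ValueError(f"missing components: {', '.join(missing)}")
    return ", ".join(part.strip() for part in parts)


video = build_video_prompt(
    scene="a misty forest at dawn",
    motion="fog drifting between the trees",
    camera="slow drone shot ascending above the treeline",
    seconds=10,
    style="nature documentary style",
)
print(video)
```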

Multi-Modal Chains

Multi-modal chains combine image and text capabilities in sequence to handle complex workflows.

Image → Analysis → Generation

Step 1: Analyze this photograph of a room interior
Step 2: Identify the design style, color palette, and key furniture
Step 3: Generate a prompt for a similar room with [specified modifications]
Step 4: Generate the new image

Text → Image → Critique → Refine

Step 1: Generate image from initial prompt
Step 2: Analyze the generated image for issues
Step 3: Create improved prompt addressing issues
Step 4: Regenerate with refined prompt
Step 5: Compare and select best result
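
The critique-and-refine steps above can be sketched as a loop over three pluggable functions. Everything here is an assumption for illustration: `refine_loop` and the `fake_*` stubs are hypothetical, and in practice `generate`, `critique`, and `revise` would wrap calls to an image model and a vision-language model of your choice.

```python
from typing import Callable, List


def refine_loop(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
):
    """Run generate -> critique -> revise until no issues remain or rounds run out.

    Returns (best_attempt, history); each attempt is (prompt, image, issues).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    history = []
    for _ in range(max_rounds):
        image = generate(prompt)           # Steps 1 and 4
        issues = critique(image)           # Step 2
        history.append((prompt, image, issues))
        if not issues:
            break
        prompt = revise(prompt, issues)    # Step 3
    # Step 5: pick the attempt with the fewest reported issues.
    best = min(history, key=lambda attempt: len(attempt[2]))
    return best, history


# Stub functions standing in for real model calls.
def fake_generate(p: str) -> str:
    return f"<image for: {p}>"


def fake_critique(image: str) -> List[str]:
    return [] if "sharp focus" in image else ["soft focus"]


def fake_revise(p: str, issues: List[str]) -> str:
    return p + ", sharp focus"


best, history = refine_loop("A lighthouse on a cliff",
                            fake_generate, fake_critique, fake_revise)
print(best[0])
print(len(history))
```

Capping the number of rounds keeps cost bounded, and keeping the full history lets you fall back to an earlier attempt if a revision makes things worse.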

Document → Visual Summary

Step 1: Read and analyze this research paper
Step 2: Identify 4-5 key concepts that would benefit from visualization
Step 3: Create image prompts for each concept
Step 4: Generate illustrations
Step 5: Assemble into visual summary/infographic layout

Prompt Templates by Use Case

Product Photography

Professional product photography of [product], centered on [surface material] 
surface, [lighting type] lighting from [direction], [background description], 
commercial advertising style, sharp focus on product, [mood/feeling], 
suitable for [platform: e-commerce/print/social media]

Portrait/Character

[Shot type: close-up/medium/full body] portrait of [subject description], 
[age/expression/pose], wearing [clothing/accessories], [setting/background], 
[lighting type and direction], [art style or photography style], 
[mood/atmosphere], [additional technical details]

Environment/Landscape

[View type: panoramic/establishing/intimate] of [location type], 
[time of day] with [weather/atmospheric conditions], [key elements in 
foreground/midground/background], [color palette], [style: photorealistic/
painterly/stylized], evoking [mood/feeling], [compositional notes]

Concept Art

Concept art of [subject] for [context: game/film/book], [design style], 
[key visual elements and details], [color scheme], [reference influences], 
[functional/narrative purpose], professional quality, suitable for 
production reference
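
The bracketed slots in these templates can be filled in code, which also makes it easy to spot slots you forgot. The sketch below assumes a hypothetical `fill_template` helper; a slot such as [platform: e-commerce/print/social media] is keyed by the text before the colon.

```python
import re

SLOT = re.compile(r"\[([^\]]+)\]")


def fill_template(template: str, values: dict) -> tuple:
    """Replace [slot] markers with values; return (text, unfilled slot names)."""
    unfilled = []

    def substitute(match):
        key = match.group(1).split(":", 1)[0].strip()
        if key in values:
            return str(values[key])
        unfilled.append(key)
        return match.group(0)  # leave the marker in place so the gap stays visible

    return SLOT.sub(substitute, template), unfilled


template = (
    "Professional product photography of [product], centered on "
    "[surface material] surface, [lighting type] lighting from [direction], "
    "suitable for [platform: e-commerce/print/social media]"
)
values = {
    "product": "a ceramic mug",
    "surface material": "marble",
    "lighting type": "soft",
    "platform": "e-commerce",
}
text, unfilled = fill_template(template, values)
print(text)
print(unfilled)  # ['direction']
```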

Quality Control Checklist

Before finalizing your prompt:


Quick Reference: Image Generation Models

Model               Strengths                               Best For
DALL-E 3            Natural language understanding, safety  General purpose, beginners
Midjourney v6       Artistic quality, aesthetics            Art, design, creative work
Stable Diffusion 3  Control, customization, open-source     Technical users, fine-tuning
Ideogram            Text rendering, typography              Logos, graphics with text
Flux                Photorealism, speed                     Realistic images, iteration

Notes


Multi-modal AI is evolving rapidly. These techniques provide a foundation, but experimentation with specific models is essential for mastering their unique capabilities.
