How We Generate Art

Building consistent AI-generated assets at scale with Gemini

596+ game assets, all visually coherent, fully AI-generated. This page explains the system that makes it possible.

The Core Problem

Without explicit guidance, AI image models produce wildly inconsistent styles. Here's what happens when you just ask for "cyberpunk character portrait":

Without references: photorealistic, anime, oil painting, 3D render. Same prompt, four different styles. Unusable for a game.

With the reference system: four different characters, same style. Game-ready.

💡
The Insight: Treat AI image generation as a style transfer problem, not a pure generation problem. Every image we generate receives explicit visual references that define what "correct" looks like.

The Reference Frame Budget

Gemini accepts up to 14 reference images per generation request. This isn't a design choice—it's an API constraint that forces strategic allocation.

Why 14? This is a hard limit in the Gemini image generation API. More references = better consistency but less creative freedom for the model. We treat it as a budget to allocate wisely.
14 reference slots, typical allocation for a story scene:
  • Style: gold standard (1 slot)
  • Scene: scene type (1 slot)
  • Character 1: portrait, face, body, action (4 slots)
  • Character 2: portrait, face, body (3 slots)
  • Remaining: 5 slots free

A two-character scene with full references uses 9 of 14 slots. Single-character scenes have more headroom for additional style or environmental references.
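The slot arithmetic above can be expressed as a small allocator. This is a minimal sketch, not the pipeline's actual code: the function and reference names are illustrative, while the per-tier ref counts (4/3/2/1) follow the character tiers described later on this page.

```python
# Illustrative 14-slot budget allocator; names are assumptions,
# per-tier ref counts (4/3/2/1) are from this article.
MAX_REF_SLOTS = 14
TIER_REFS = {"major": 4, "supporting": 3, "minor": 2, "background": 1}
PARTS = ("portrait", "face", "body", "action")

def allocate_refs(scene_type, characters):
    """Build the ordered reference list for one generation request."""
    refs = ["gold_standard", scene_type]  # style anchor + composition anchor
    for name, tier in characters:
        refs += [f"{name}_{part}" for part in PARTS[: TIER_REFS[tier]]]
    if len(refs) > MAX_REF_SLOTS:
        raise ValueError(f"{len(refs)} refs exceed the {MAX_REF_SLOTS}-slot budget")
    return refs

# A two-character scene: 2 anchors + 4 (major) + 3 (supporting) = 9 slots
refs = allocate_refs("terminal_scene", [("gg", "major"), ("cipher", "supporting")])
```

Raising on overrun (rather than silently dropping refs) makes a scene that blows the budget fail loudly, so the allocation can be rethought instead of degraded.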

Three-Layer Reference System

1

Gold Standard (Style)

Gold standard reference: age1_b_rooftop.png, selected through style calibration.
Palette anchors: #0f0f23, #1a1a2e, #00FFFF, #FF0080, #f59e0b

Always included. This specific image was selected through our style calibration process—a 120-image interview that determined exactly which visual characteristics define "CyberIdle style."

Without this anchor, Gemini drifts toward whatever style it "feels like" generating. With it, every image inherits the same DNA: NES-era pixel art aesthetic, limited color palette, clean linework.

"Use this reference image as a style guide. Match its pixel art style, color palette (magenta #FF0080, cyan #00FFFF, dark teal #1a1a2e), pixel density, and overall aesthetic exactly."
2

Scene-Type References (Composition)

Different scene types need different composition rules. A cityscape wide shot has completely different framing than a character portrait or an action sequence. We calibrated 12 scene categories through 120-image interviews.

  • Cityscape: wide establishing shots
  • Alley: street-level perspective
  • Terminal: interior tech spaces
  • Action: dynamic sequences
  • Portrait Mood: character emotion focus
  • Cyberspace: digital environments
  • Interior: indoor spaces
  • Rooftop: elevated views
  • Tech Detail: close-up technology
  • Dramatic: high-contrast lighting
  • Rain: weather effects
  • Warmth: amber-lit scenes

Each reference was selected from 4 interview options based on composition quality, style consistency, and how well it guides generation for that scene category.

3

Character References (Identity)

Major characters get a 4-reference set. Here's GG's complete reference allocation:

GG's 4-Reference Set
  • Portrait: primary reference
  • Face: close-up details
  • Body: full figure
  • Action: dynamic pose

Reference counts by character tier:
  • Major characters (4 refs): Portrait + Face + Body + Action (e.g. Cyber Chomp)
  • Supporting (3 refs): Portrait + Face + Body (e.g. Cipher)
  • Minor (2 refs): Portrait + Face (e.g. Ghost)
  • Background (1 ref): primary portrait only

Real Generation Examples

Here's exactly what goes into generating different types of images—the input scene, all reference images, the complete prompt, and the output.

1

Age 1 Building: Street Hacker Setup

Building sprite • No character refs needed
Scene Description

"A cramped street-level hacker den with exposed wiring, multiple flickering monitors, and makeshift tech equipment. Age 1 starter building vibe."

References Used (2 of 14 slots)
  • Gold Standard (style)
  • Terminal (scene type)
Complete Prompt to Gemini
Use this reference image as a style guide. Match its pixel art style, color palette (magenta #FF0080, cyan #00FFFF, dark teal #1a1a2e), pixel density, and overall aesthetic exactly.

SCENE: A cramped street-level hacker den with exposed wiring, multiple flickering monitors, and makeshift tech equipment. Age 1 starter building vibe.

FRAMING: Square composition for building sprite. Clean edges, no character subjects. Focus on environmental detail.

STYLE CONSTRAINTS: NES-era pixel art. No photorealism. No gradients. Clean pixel edges. Dark background for transparency cropping.
Generated Output
Resolution: 2K • Aspect: 1:1 • Cost: ~$0.04
2

Story Hero: The Sick Day

Story banner • Atmospheric scene with silhouette
Scene Description

"The Sprawl during a brutal storm - lightning cracks across the skyline while GG's silhouette watches from a window. Cinematic ultra-wide composition."

References Used (4 of 14 slots)
  • Gold Standard (style)
  • Rain (scene type)
  • Cityscape (scene type)
  • Dramatic (scene type)
Generated Output
Resolution: 2K • Aspect: 21:9 • Cost: ~$0.04

Queue-Based Generation System

At scale, we don't generate images one at a time. Jobs are defined in JSON queues and processed in batches—either sequentially for safety or in parallel for speed.

📋

1. Queue Definition

Jobs defined in JSON with prompt, output path, references, aspect ratio, and resolution.

⚙️

2. Generation Mode

Sequential (~25s/image, safe) or parallel with 4 workers (~6-7s/image, 4x speedup).

3. Checkpoint & Upload

Status tracked per job. Git commit every 3 images. Automatic CDN upload.

Queue job format (generation-queue.json)
{
  "jobs": [
    {
      "id": "age1_building_01",
      "prompt": "Cramped street-level hacker den...",
      "output": "docs-site/public/images/buildings/age1_hacker.png",
      "refs": ["gold_standard", "terminal_scene"],
      "aspect": "1:1",
      "resolution": "2K",
      "status": "pending"
    }
  ]
}

Sequential Mode

./generate-from-queue.sh --all
  • ~25 seconds per image
  • Safe for long batches
  • Checkpoint commits every 3
  • Best for overnight runs

Parallel Mode

./generate-parallel.sh --workers 4
  • ~6-7 seconds per image
  • 4x throughput speedup
  • Respects API rate limits
  • Best for batch sprints

Art Interview Philosophy

Why do we use multi-round interviews to calibrate art style instead of just writing detailed prompts? Because visual art direction exists in a space that language can't fully capture.

"Humans cannot successfully communicate visual art guidance through language alone. Art direction exists in a high-dimensional space that gets lossy-compressed when projected into the low-dimensional space of English tokens."

The Problem with Text-Only Prompts

"Make it more cyberpunk"
→ Infinite valid interpretations
"Better composition"
→ Countless framing options
"More dynamic"
→ Motion blur? Pose? Angle?
"Darker mood"
→ Lighting? Palette? Expression?

The Solution: Selection Over Description

Instead of trying to describe what you want, select from options. This is the same approach we use to prompt Gemini (reference images) and to prompt the human art director (interview choices).

1
Generate 4 dramatically different options (A/B/C/D)

Each should explore a different direction. Include at least one you think they'll hate—negative feedback is valuable.

2
Designer selects and explains WHY

Not just "I like B" but "B because the lighting feels more intimate and the pose suggests confidence not aggression."

3
Extract dimensional preferences

Convert subjective feedback into concrete rules: "Prefer warm accent lighting over cool. Neutral poses over aggressive."

4
Update guides and gold standards

Selected images become future references. Rules get added to style guides. The system learns.
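The four steps above can be captured as a small record per round. Every field name here is a hypothetical illustration of what such a system might store, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class InterviewRound:
    """One A/B/C/D selection round; field names are illustrative."""
    subject: str
    options: list            # 4 candidate image paths, labeled A-D
    selected: str            # e.g. "B"
    reason: str              # the designer's WHY, kept verbatim
    rules: list = field(default_factory=list)  # extracted dimensional prefs

r = InterviewRound(
    subject="gg_portrait",
    options=["a.png", "b.png", "c.png", "d.png"],
    selected="B",
    reason="the lighting feels more intimate; the pose suggests confidence",
    rules=["Prefer warm accent lighting over cool.",
           "Neutral poses over aggressive."],
)
```

Keeping the verbatim `reason` alongside the extracted `rules` preserves the raw signal in case a rule later turns out to be the wrong generalization.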

🎯
Discovery Example: Cyber Chomp
Through the interview process, we discovered that Cyber Chomp should appear "zoned out while chaos happens around him"—a personality trait that emerged from selection, not description. We never would have written that in a prompt.

Prompt Construction Pipeline

User scene descriptions get transformed through multiple enrichment stages before hitting the API.

User Input
"GG confronts Helena Voss in Nexus lobby"
+ Style Instruction
"Match pixel art style, color palette..."
+ Framing Hint
"Wide cinematic composition with horizontal emphasis..."
+ Character Note
"Match distinctive visual features EXACTLY..."
Final Prompt
~400 tokens + 6 reference images
Prompt transformation (simplified from generate-from-queue.sh)
STYLE_INSTRUCTION = (
    "Use this reference image as a style guide. Match its pixel art style, "
    "color palette (magenta #FF0080, cyan #00FFFF, dark teal #1a1a2e), "
    "pixel density, and overall aesthetic exactly."
)

prompt = STYLE_INSTRUCTION

# Add framing based on aspect ratio
if aspect_ratio in ("21:9", "16:9"):
    prompt += """
FRAMING: Wide cinematic composition with horizontal emphasis.
Key subjects should have breathing room. Use the full width."""

# Add character consistency instruction when character refs are present
if char_refs:
    prompt += """
CHARACTER CONSISTENCY: Match their distinctive visual
features EXACTLY as shown - same face structure, hair,
clothing, augmentations, and distinctive features."""

The Iteration Process

Major characters go through extensive calibration. Here's Cyber Chomp's reference development:

Round 1: Face variants (A/B/C/D). Selected: Option C, best eye rendering.

Round 2: Body variants (A/B/C/D). Selected: Option B, proportions match the face.

Round 3: Action variants (A/B/C/D). Selected: Option A, most dynamic pose.

Final reference set: 4 references locked for all future generations.

The interview process took 52 iterations for Cyber Chomp alone. The investment pays off: every future image of this character will be consistent.

Strategies by Art Type

  • Character Portrait: 0-1 refs, 1:1. NES pixel style locked, contextual background.
  • Story Scene: 3-5 refs, 16:9. Multi-ref: style + characters + location.
  • Hero Image: 2-3 refs, 21:9. Ultra-wide; subjects must be centered.
  • Building Sprite: 1 ref, aspect varies. Transparent background, isometric, no rain.
  • Resource Icon: 1 ref, 1:1. Readable at 64px, dark teal background.

Color Palette & Style Rules

  • Deep Navy #0f0f23: backgrounds, void spaces
  • Dark Teal #1a1a2e: structures, panels, tech
  • Cyan #00FFFF: tech highlights, data, UI
  • Magenta #FF0080: neon accents, danger, energy
  • Amber #f59e0b: warm light, fire, hope
  • Neon Green #00ff88: success states, growth, nature
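For tooling or automated palette checks, the colors above can live as constants. A small sketch; the key names are ours, the hex values are from the table:

```python
# Palette constants; key names are illustrative, hex values are from the guide.
PALETTE = {
    "deep_navy": "#0f0f23",   # backgrounds, void spaces
    "dark_teal": "#1a1a2e",   # structures, panels, tech
    "cyan": "#00FFFF",        # tech highlights, data, UI
    "magenta": "#FF0080",     # neon accents, danger, energy
    "amber": "#f59e0b",       # warm light, fire, hope
    "neon_green": "#00ff88",  # success states, growth, nature
}

def hex_to_rgb(hex_color):
    """'#FF0080' -> (255, 0, 128); useful for pixel-level palette checks."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))
```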

Hard Rejections (Never Use)

  • Photorealistic rendering or textures
  • Gradients that break pixel aesthetic
  • Pure white backgrounds (#FFFFFF)
  • Pastel or washed-out colors
  • 3D perspective inconsistency
  • Text/watermarks in generated images

Technical Implementation

For practitioners building similar systems:

API Details

  • Model: gemini-3-pro-image-preview
  • Cost: ~$0.04 per image at 2K
  • Max refs: 14 images per request
  • Response: Base64-encoded PNG

Resolution Options

  • 1K (~1024px): quick iterations
  • 2K (~2048px): production default
  • 4K (~4096px): hero images only

Aspect Ratios

  • 21:9: ultra-wide banners
  • 16:9: story scenes, standard
  • 3:4 / 9:16: vertical portraits
  • 1:1: icons, avatars

Queue-Based Workflow

Jobs defined in JSON, processed sequentially with checkpointing:

./generate-from-queue.sh --all --changelog
Request payload structure
{
  "contents": [{
    "parts": [
      {"inlineData": {"mimeType": "image/png", "data": "...base64..."}},
      {"inlineData": {"mimeType": "image/png", "data": "...base64..."}},
      {"text": "Style instruction + scene prompt + framing + character notes"}
    ]
  }],
  "generationConfig": {
    "responseModalities": ["TEXT", "IMAGE"],
    "imageConfig": {
      "aspectRatio": "16:9",
      "imageSize": "2K"
    }
  }
}
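Assembling that payload and decoding the response can be sketched in a few lines of Python. The helper names are ours; the response handling assumes the standard candidates/parts structure in which the image arrives as a base64-encoded inlineData part.

```python
import base64

def build_payload(prompt, ref_blobs, aspect="16:9", size="2K"):
    """Reference images go first as inlineData parts; the text prompt goes last."""
    parts = [{"inlineData": {"mimeType": "image/png",
                             "data": base64.b64encode(blob).decode()}}
             for blob in ref_blobs]
    parts.append({"text": prompt})
    return {
        "contents": [{"parts": parts}],
        "generationConfig": {
            "responseModalities": ["TEXT", "IMAGE"],
            "imageConfig": {"aspectRatio": aspect, "imageSize": size},
        },
    }

def extract_png(response):
    """Return the decoded PNG bytes from the first candidate's parts."""
    for part in response["candidates"][0]["content"]["parts"]:
        if "inlineData" in part:
            return base64.b64decode(part["inlineData"]["data"])
    raise RuntimeError("no image part in response")
```

Putting the text part last mirrors the payload structure shown above, so the style references are established before the scene instruction.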

Results Gallery

A sample of outputs demonstrating style consistency across different subjects and scenes.

Key Takeaways

1

Always include a gold standard style reference. Without it, you get random styles.

2

Budget your 14 reference slots strategically. More character refs = better consistency.

3

Use selection over description. Art interviews extract preferences that language can't capture.

4

Invest in character calibration upfront. 52 iterations saves thousands of inconsistent outputs later.

Questions about our art pipeline?

Open an Issue