Personal project
Gestro Hand-Tracking AI Paint Studio
Next.js, TypeScript, MediaPipe, Firebase, Cloudflare R2, fal.ai, Stripe
gestro is a browser-based paint studio where the user draws with their hands in front of a webcam and an AI model refines the drawing on demand, in a loop. Hold a hand up to the camera, point a finger, and a line appears on the canvas where the fingertip moves. Pick a brush, draw, press Generate to send the canvas to an image model that polishes the picture without throwing away what was drawn, then paint on top and generate again. The loop ends only when the user decides the picture is done.
I built gestro solo at HackUP 2026 (University of Portland), where it placed 1st in the Innovator Track and 3rd Best Project Overall. Every line of code, every cloud account, and every dependency choice is mine. The project runs in the browser with no install: no app store, no GPU setup, no driver to download.
The Infinite Inspiration Loop
Almost every AI art tool removes the user from the creative process. The user types words, the model invents the picture, and the result is something the user did not actually draw. gestro flips that. The user is the artist, and the AI is an assistant that polishes the user's drawing. Single-shot generators get boring fast because the loop ends after one round, so gestro is built around continuous iteration: draw, generate, paint over, generate again, until the picture is done.
That product framing drove every technical choice in the rest of the system. Hands beat tablets and styluses as the input modality because people learn to finger-paint before they learn to hold a pencil, and a webcam is the most natural input device a browser already has. Server-side credit transactions guard each AI call so a double-click cannot accidentally double-charge a user. Per-user Firestore rules plus server-only credit fields keep the balance math out of the browser entirely. None of those pieces is interesting on its own. What matters is that all of them are required for the loop to feel safe and fast.
The hand-tracking pipeline
The browser opens the webcam at 30 frames per second and runs MediaPipe (Google's in-browser machine-learning library) on the CPU directly on those frames, producing 21 landmarks per hand per frame without sending any video off the device. Raw landmarks jitter, so a small tracking pipeline sits between MediaPipe and the canvas: an exponential moving-average smoother de-jitters the raw point stream, a velocity tracker measures how fast each fingertip is moving, a gesture debouncer turns short flickers between gestures into stable states, and a proximity detector catches the case where two fingertips visually overlap.
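A minimal sketch of the smoothing and debouncing stages, assuming landmarks arrive as normalized {x, y} points; the class names and constants here are illustrative, not gestro's actual code:

```ts
type Point = { x: number; y: number };

// Exponential moving average: alpha near 1 trusts the new sample,
// alpha near 0 trusts history. 0.4 is an illustrative default.
class EmaSmoother {
  private last: Point | null = null;
  constructor(private alpha = 0.4) {}

  next(raw: Point): Point {
    if (!this.last) {
      this.last = raw;
      return raw;
    }
    this.last = {
      x: this.alpha * raw.x + (1 - this.alpha) * this.last.x,
      y: this.alpha * raw.y + (1 - this.alpha) * this.last.y,
    };
    return this.last;
  }
}

// Debouncer: a new gesture label must persist for `holdFrames`
// consecutive frames before it replaces the current stable state,
// so a one-frame flicker never changes what the canvas sees.
class GestureDebouncer {
  private stable = "none";
  private candidate = "none";
  private streak = 0;
  constructor(private holdFrames = 5) {}

  next(label: string): string {
    if (label === this.stable) {
      this.streak = 0;
    } else if (label === this.candidate) {
      if (++this.streak >= this.holdFrames) this.stable = label;
    } else {
      this.candidate = label;
      this.streak = 1;
    }
    return this.stable;
  }
}
```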
A small router (handInputBridge.ts) collects the resulting fingertip positions and forwards up to ten active fingertips to the canvas engine: both hands, up to five fingers per hand, configurable per session. A closed fist on either hand pauses that hand without disabling the other one, so the user can rest one hand while keeping the other on the canvas. Webcam frames never leave the browser; only the rendered canvas ever travels to the server, and only when the user explicitly presses Generate.
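A hypothetical shape for what handInputBridge.ts forwards each frame; the type and function names are assumptions for illustration:

```ts
type Fingertip = { x: number; y: number; hand: "left" | "right"; finger: number };
type TrackedHand = { side: "left" | "right"; isFist: boolean; tips: Fingertip[] };

// Forward up to `maxPerHand` fingertips per open hand (ten max across both).
// A closed fist pauses that hand without disabling the other one.
function routeFingertips(hands: TrackedHand[], maxPerHand = 5): Fingertip[] {
  return hands
    .filter((h) => !h.isFist)
    .flatMap((h) => h.tips.slice(0, maxPerHand));
}
```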
The brush engine
The drawing surface is a Canvas2D engine I named MastercraftEngine. The choice of Canvas2D over WebGL is deliberate. The math the engine runs is CPU-bound (smoothing, spline math, material lookups), the per-pixel work is simple, and Canvas2D's API is faster to drive for this workload than the equivalent WebGL setup would be. Strokes between fingertip samples are drawn as Catmull-Rom splines so the curve passes exactly through every sampled point, which keeps lines looking natural rather than poly-lined when a finger moves quickly.
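One standard way to draw a Catmull-Rom segment with Canvas2D is to convert each segment into an equivalent cubic Bezier; the identity below is the textbook uniform conversion, not necessarily the exact code inside MastercraftEngine:

```ts
type Pt = { x: number; y: number };

// Uniform Catmull-Rom -> cubic Bezier: the curve between p1 and p2 passes
// exactly through both, with p0 and p3 shaping the tangents.
function catmullRomSegment(
  ctx: CanvasRenderingContext2D,
  p0: Pt, p1: Pt, p2: Pt, p3: Pt
): void {
  const c1 = { x: p1.x + (p2.x - p0.x) / 6, y: p1.y + (p2.y - p0.y) / 6 };
  const c2 = { x: p2.x - (p3.x - p1.x) / 6, y: p2.y - (p3.y - p1.y) / 6 };
  ctx.moveTo(p1.x, p1.y);
  ctx.bezierCurveTo(c1.x, c1.y, c2.x, c2.y, p2.x, p2.y);
}

// Slide a window of four samples along the stroke so the rendered curve
// passes through every interior sample point.
function drawStroke(ctx: CanvasRenderingContext2D, pts: Pt[]): void {
  ctx.beginPath();
  for (let i = 0; i + 3 < pts.length; i++) {
    catmullRomSegment(ctx, pts[i], pts[i + 1], pts[i + 2], pts[i + 3]);
  }
  ctx.stroke();
}
```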
The engine ships with six brush materials (pencil, oil, marker, acrylic, airbrush, eraser). A parallel material map (a second canvas keyed by stroke material) tracks which kind of paint is at each pixel, which is what lets the eraser respect what is actually painted underneath instead of erasing pixels indiscriminately. Velocity and pressure are modeled into stroke width so a fast stroke is thinner than a slow one. A 30-level undo and redo stack covers the common case of "I did not mean that one".
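The velocity-to-width mapping might look roughly like this; the clamp range and falloff constant are assumptions, not gestro's tuned values:

```ts
// Faster strokes get thinner lines; pressure scales the whole stroke.
function strokeWidth(
  baseWidth: number,
  velocityPxPerMs: number,
  pressure = 1 // 0..1
): number {
  const speedFactor = 1 / (1 + velocityPxPerMs * 0.5); // falls toward 0 as speed grows
  const w = baseWidth * pressure * (0.35 + 0.65 * speedFactor);
  return Math.max(0.5, Math.min(w, baseWidth));
}
```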
The AI refinement loop
When the user presses Generate, the browser exports the canvas as a PNG and POSTs it to a Next.js API route. The route runs four things in sequence and bails out cleanly if any of them fails. First, it verifies the user's Firebase ID token. Second, it opens a Firestore transaction that reads the user's credit balance, checks the balance is at least the cost of this generation, debits the credits, and writes a generation-log document, all atomically. Third, it uploads the source canvas to Cloudflare R2 (S3-compatible object storage with zero egress fees, which matters when a session can produce dozens of images) through a server-signed PUT. Fourth, it dispatches to fal.ai (a hosted image-model service) with a model chosen by the requested mode: Flux for the default Refine, Flux Kontext for HD Refine, and Nano Banana 2 for Reimagine.
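A sketch of the pre-debit transaction (step two), assuming firebase-admin and illustrative collection and field names; credits are taken atomically before the model is ever called:

```ts
import { getFirestore, FieldValue } from "firebase-admin/firestore";

const db = getFirestore();

async function debitCredits(uid: string, cost: number): Promise<string> {
  const userRef = db.collection("users").doc(uid);
  const logRef = db.collection("generations").doc(); // pre-allocate the log id

  await db.runTransaction(async (tx) => {
    // Read, check, debit, and log in one atomic unit: a concurrent
    // double-click sees the already-debited balance and fails the check.
    const snap = await tx.get(userRef);
    const balance = snap.get("credits") ?? 0;
    if (balance < cost) throw new Error("insufficient-credits");
    tx.update(userRef, { credits: FieldValue.increment(-cost) });
    tx.set(logRef, {
      uid,
      cost,
      status: "pending", // later marked done, or refunded on model failure
      createdAt: FieldValue.serverTimestamp(),
    });
  });

  return logRef.id;
}
```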
The result is uploaded to R2 and a snapshot record (user, source URL, result URL, mode, timestamp) is written back to Firestore. The browser pulls the result image and composites it onto the canvas, where the user can paint on top and generate again. The pre-debit ordering is the part that matters: a double-click cannot double-charge because the credits are taken before the model is ever called, and a model error surfaces back to the client with the credits already reserved, refundable through the same transaction path.
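On the client, the composite step is roughly the following, with hypothetical names; the result image is fetched and drawn onto the same canvas the user keeps painting on:

```ts
async function compositeResult(canvas: HTMLCanvasElement, url: string): Promise<void> {
  const img = new Image();
  img.crossOrigin = "anonymous"; // the R2 result is served cross-origin
  await new Promise<void>((resolve, reject) => {
    img.onload = () => resolve();
    img.onerror = () => reject(new Error("result image failed to load"));
    img.src = url;
  });
  const ctx = canvas.getContext("2d");
  if (!ctx) throw new Error("2d context unavailable");
  ctx.drawImage(img, 0, 0, canvas.width, canvas.height);
}
```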
Server-side correctness for billing
Credits are sold in packs through Stripe Checkout (a credit-pack model rather than a subscription, so a user can buy 50 credits and keep them). Stripe webhooks deliver the purchase confirmation, and the webhook handler is idempotent by design. Every event id is recorded in a Firestore stripeEvents collection inside the same transaction that grants the credits, and the transaction reads the collection first to no-op duplicate deliveries. A retried webhook on a transient failure cannot grant the credits twice. All credit math runs on the server through firebase-admin Firestore transactions; the browser cannot fake a balance because the relevant fields are server-only by Firestore rule.
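A sketch of the idempotent grant, again with illustrative collection names; because the event id is written in the same transaction that grants the credits, a retried delivery no-ops:

```ts
import { getFirestore, FieldValue } from "firebase-admin/firestore";

const db = getFirestore();

async function grantCredits(eventId: string, uid: string, credits: number) {
  const eventRef = db.collection("stripeEvents").doc(eventId);
  const userRef = db.collection("users").doc(uid); // assumed to exist already

  await db.runTransaction(async (tx) => {
    const seen = await tx.get(eventRef);
    if (seen.exists) return; // duplicate webhook delivery: do nothing
    tx.update(userRef, { credits: FieldValue.increment(credits) });
    tx.set(eventRef, { processedAt: FieldValue.serverTimestamp() });
  });
}
```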
The same correctness pattern (server-side transactions, an idempotent ledger, server-only fields) is the spine of the project. The hand-tracking pipeline and the brush engine are interesting on their own, but the reason the loop feels safe to use is that the billing layer underneath does not double-charge and does not lose credits.
Design and the case against the AI-startup look
The visual language is deliberately not the neon AI-startup look. The reference points are Procreate, Linear, and Apple's Human Interface Guidelines: a light-first palette, typography-forward layout, and Liquid Glass surfaces only on floating navigation (the top bar and the side panel) rather than stacked glass on glass. Page transitions, panel reveals, and the animated blue glow around the canvas all run through one motion library so the timing feels coherent across the whole app.
A small but important decision: everything is reachable without a webcam. Hand tracking is a superpower for users who have a camera, not a requirement. A mouse fallback covers the user on a desktop without a camera, the user who has not granted webcam permission yet, or the user who simply wants to fine-tune a stroke without holding their arm up.