Personal project
Gestro Hand-Tracking AI Paint Studio
Next.js, TypeScript, MediaPipe, Firebase, Cloudflare R2, fal.ai, Stripe
gestro is a browser-based paint studio where the user draws with their hands in front of a webcam and an AI model refines the drawing on demand, in a loop. Hold a hand up to the camera, point a finger, and a line appears on the canvas where the fingertip moves. Pick a brush, draw, press Generate to send the canvas to an image model that polishes the picture without throwing away what was drawn, then paint on top and generate again. The loop ends only when the user decides the picture is done.
I built gestro solo at HackUP 2026 (University of Portland), where it placed 1st in the Innovator Track and 3rd Best Project Overall. Every line of code, every cloud account, and every dependency choice is mine. The project runs in the browser with no install: no app store, no GPU setup, no driver to download.
The Infinite Inspiration Loop
Almost every AI art tool removes the user from the creative process. The user types words, the model invents the picture, and the result is something the user did not actually draw. gestro flips that. The user is the artist, and the AI is an assistant that polishes the user's drawing. Single-shot generators get boring fast because the loop ends after one round, so gestro is built around continuous iteration: draw, generate, paint over, generate again, until the picture is done.
That product framing drove every technical choice in the rest of the system. Hands beat tablets and styluses as the input modality because people learn to finger-paint before they learn to hold a pencil, and a webcam is the most natural input device a browser already has. Server-side credit transactions guard each AI call so a double-click cannot accidentally double-charge a user. Per-user Firestore rules plus server-only credit fields keep the balance math out of the browser entirely. None of those pieces is interesting on its own. What matters is that all of them are required for the loop to feel safe and fast.
The hand-tracking pipeline
The browser opens the webcam at 30 frames per second and runs MediaPipe (Google's in-browser machine-learning library) on the CPU directly on those frames, producing 21 landmarks per hand per frame without sending any video off the device. Raw landmarks jitter, so a small tracking pipeline sits between MediaPipe and the canvas: an exponential moving-average smoother de-jitters the raw point stream, a velocity tracker measures how fast each fingertip is moving, a gesture debouncer turns short flickers between gestures into stable states, and a proximity detector catches the case where two fingertips visually overlap.
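A minimal sketch of the smoothing and debouncing stages, assuming landmarks arrive as normalized {x, y} points; the class names and constants here are illustrative, not gestro's actual code:

```ts
type Point = { x: number; y: number };

// Exponential moving average: alpha near 1 trusts the new sample,
// alpha near 0 trusts history. 0.4 is an illustrative default.
class EmaSmoother {
  private last: Point | null = null;
  constructor(private alpha = 0.4) {}

  next(raw: Point): Point {
    if (!this.last) {
      this.last = raw;
      return raw;
    }
    this.last = {
      x: this.alpha * raw.x + (1 - this.alpha) * this.last.x,
      y: this.alpha * raw.y + (1 - this.alpha) * this.last.y,
    };
    return this.last;
  }
}

// Debouncer: a new gesture label must persist for `holdFrames`
// consecutive frames before it replaces the current stable state,
// so a one-frame flicker never changes what the canvas sees.
class GestureDebouncer {
  private stable = "none";
  private candidate = "none";
  private streak = 0;
  constructor(private holdFrames = 5) {}

  next(label: string): string {
    if (label === this.stable) {
      this.streak = 0;
    } else if (label === this.candidate) {
      if (++this.streak >= this.holdFrames) this.stable = label;
    } else {
      this.candidate = label;
      this.streak = 1;
    }
    return this.stable;
  }
}
```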
A small router (handInputBridge.ts) collects the resulting fingertip positions and forwards up to ten active fingertips to the canvas engine: both hands, up to five fingers per hand, configurable per session. A closed fist on either hand pauses that hand without disabling the other one, so the user can rest one hand while keeping the other on the canvas. Webcam frames never leave the browser; only the rendered canvas ever travels to the server, and only when the user explicitly presses Generate.
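A hypothetical shape for what handInputBridge.ts forwards each frame; the type and function names are assumptions for illustration:

```ts
type Fingertip = { x: number; y: number; hand: "left" | "right"; finger: number };
type TrackedHand = { side: "left" | "right"; isFist: boolean; tips: Fingertip[] };

// Forward up to `maxPerHand` fingertips per open hand (ten max across both).
// A closed fist pauses that hand without disabling the other one.
function routeFingertips(hands: TrackedHand[], maxPerHand = 5): Fingertip[] {
  return hands
    .filter((h) => !h.isFist)
    .flatMap((h) => h.tips.slice(0, maxPerHand));
}
```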
The brush engine
The drawing surface is a Canvas2D engine I named MastercraftEngine. The choice of Canvas2D over WebGL is deliberate. The math the engine runs is CPU-bound (smoothing, spline math, material lookups), the per-pixel work is simple, and Canvas2D's API is faster to drive for this workload than the equivalent WebGL setup would be. Strokes between fingertip samples are drawn as Catmull-Rom splines so the curve passes exactly through every sampled point, which keeps lines looking natural rather than poly-lined when a finger moves quickly.
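One standard way to draw a Catmull-Rom segment with Canvas2D is to convert each segment into an equivalent cubic Bezier; the identity below is the textbook uniform conversion, not necessarily the exact code inside MastercraftEngine:

```ts
type Pt = { x: number; y: number };

// Uniform Catmull-Rom -> cubic Bezier: the curve between p1 and p2 passes
// exactly through both, with p0 and p3 shaping the tangents.
function catmullRomSegment(
  ctx: CanvasRenderingContext2D,
  p0: Pt, p1: Pt, p2: Pt, p3: Pt
): void {
  const c1 = { x: p1.x + (p2.x - p0.x) / 6, y: p1.y + (p2.y - p0.y) / 6 };
  const c2 = { x: p2.x - (p3.x - p1.x) / 6, y: p2.y - (p3.y - p1.y) / 6 };
  ctx.moveTo(p1.x, p1.y);
  ctx.bezierCurveTo(c1.x, c1.y, c2.x, c2.y, p2.x, p2.y);
}

// Slide a window of four samples along the stroke so the rendered curve
// passes through every interior sample point.
function drawStroke(ctx: CanvasRenderingContext2D, pts: Pt[]): void {
  ctx.beginPath();
  for (let i = 0; i + 3 < pts.length; i++) {
    catmullRomSegment(ctx, pts[i], pts[i + 1], pts[i + 2], pts[i + 3]);
  }
  ctx.stroke();
}
```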
The engine ships with six brush materials (pencil, oil, marker, acrylic, airbrush, eraser). A parallel material map (a second canvas keyed by stroke material) tracks which kind of paint is at each pixel, which is what lets the eraser respect what is actually painted underneath instead of erasing pixels indiscriminately. Velocity and pressure are modeled into stroke width so a fast stroke is thinner than a slow one. A 30-level undo and redo stack covers the common case of "I did not mean that one".
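The velocity-to-width mapping might look roughly like this; the clamp range and falloff constant are assumptions, not gestro's tuned values:

```ts
// Faster strokes get thinner lines; pressure scales the whole stroke.
function strokeWidth(
  baseWidth: number,
  velocityPxPerMs: number,
  pressure = 1 // 0..1
): number {
  const speedFactor = 1 / (1 + velocityPxPerMs * 0.5); // falls toward 0 as speed grows
  const w = baseWidth * pressure * (0.35 + 0.65 * speedFactor);
  return Math.max(0.5, Math.min(w, baseWidth));
}
```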
The AI refinement loop
When the user presses Generate, the browser exports the canvas as a PNG and POSTs it to a Next.js API route. The route runs four things in sequence and bails out cleanly if any of them fails. First, it verifies the user's Firebase ID token. Second, it opens a Firestore transaction that reads the user's credit balance, checks the balance is at least the cost of this generation, debits the credits, and writes a generation-log document, all atomically. Third, it uploads the source canvas to Cloudflare R2 (S3-compatible object storage with zero egress fees, which matters when a session can produce dozens of images) through a server-signed PUT. Fourth, it dispatches to fal.ai (a hosted image-model service) with a model chosen by the requested mode: Flux for the default Refine, Flux Kontext for HD Refine, and Nano Banana 2 for Reimagine.
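A sketch of the pre-debit transaction (step two), assuming firebase-admin and illustrative collection and field names; credits are taken atomically before the model is ever called:

```ts
import { getFirestore, FieldValue } from "firebase-admin/firestore";

const db = getFirestore();

async function debitCredits(uid: string, cost: number): Promise<string> {
  const userRef = db.collection("users").doc(uid);
  const logRef = db.collection("generations").doc(); // pre-allocate the log id

  await db.runTransaction(async (tx) => {
    // Read, check, debit, and log in one atomic unit: a concurrent
    // double-click sees the already-debited balance and fails the check.
    const snap = await tx.get(userRef);
    const balance = snap.get("credits") ?? 0;
    if (balance < cost) throw new Error("insufficient-credits");
    tx.update(userRef, { credits: FieldValue.increment(-cost) });
    tx.set(logRef, {
      uid,
      cost,
      status: "pending", // later marked done, or refunded on model failure
      createdAt: FieldValue.serverTimestamp(),
    });
  });

  return logRef.id;
}
```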
The result is uploaded to R2 and a snapshot record (user, source URL, result URL, mode, timestamp) is written back to Firestore. The browser pulls the result image and composites it onto the canvas, where the user can paint on top and generate again. The pre-debit ordering is the part that matters: a double-click cannot double-charge because the credits are taken before the model is ever called, and a model error surfaces back to the client with the credits already reserved, refundable through the same transaction path.
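On the client, the composite step is roughly the following, with hypothetical names; the result image is fetched and drawn onto the same canvas the user keeps painting on:

```ts
async function compositeResult(canvas: HTMLCanvasElement, url: string): Promise<void> {
  const img = new Image();
  img.crossOrigin = "anonymous"; // the R2 result is served cross-origin
  await new Promise<void>((resolve, reject) => {
    img.onload = () => resolve();
    img.onerror = () => reject(new Error("result image failed to load"));
    img.src = url;
  });
  const ctx = canvas.getContext("2d");
  if (!ctx) throw new Error("2d context unavailable");
  ctx.drawImage(img, 0, 0, canvas.width, canvas.height);
}
```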
Server-side correctness for billing
Credits are sold in packs through Stripe Checkout (a credit-pack model rather than a subscription, so a user can buy 50 credits and keep them). Stripe webhooks deliver the purchase confirmation, and the webhook handler is idempotent by design. Every event id is recorded in a Firestore stripeEvents collection inside the same transaction that grants the credits, and the transaction reads the collection first to no-op duplicate deliveries. A retried webhook on a transient failure cannot grant the credits twice. All credit math runs on the server through firebase-admin Firestore transactions; the browser cannot fake a balance because the relevant fields are server-only by Firestore rule.
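A sketch of the idempotent grant, again with illustrative collection names; because the event id is written in the same transaction that grants the credits, a retried delivery no-ops:

```ts
import { getFirestore, FieldValue } from "firebase-admin/firestore";

const db = getFirestore();

async function grantCredits(eventId: string, uid: string, credits: number) {
  const eventRef = db.collection("stripeEvents").doc(eventId);
  const userRef = db.collection("users").doc(uid); // assumed to exist already

  await db.runTransaction(async (tx) => {
    const seen = await tx.get(eventRef);
    if (seen.exists) return; // duplicate webhook delivery: do nothing
    tx.update(userRef, { credits: FieldValue.increment(credits) });
    tx.set(eventRef, { processedAt: FieldValue.serverTimestamp() });
  });
}
```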
The same correctness pattern (server-side transactions, an idempotent ledger, server-only fields) is the spine of the project. The hand-tracking pipeline and the brush engine are interesting on their own, but the reason the loop feels safe to use is that the billing layer underneath does not double-charge and does not lose credits.
Design and the case against the AI-startup look
The visual language is deliberately not the neon AI-startup look. The reference points are Procreate, Linear, and Apple's Human Interface Guidelines: a light-first palette, typography-forward layout, and Liquid Glass surfaces only on floating navigation (the top bar and the side panel) rather than stacked glass on glass. Page transitions, panel reveals, and the animated blue glow around the canvas all run through one motion library so the timing feels coherent across the whole app.
A small but important decision: everything is reachable without a webcam. Hand tracking is a superpower for users who have a camera, not a requirement. A mouse fallback covers the user on a desktop without a camera, the user who has not granted webcam permission yet, or the user who simply wants to fine-tune a stroke without holding their arm up.