Hey HN, I’ve been working on a side project called Exocor.
It’s a React SDK that lets you control your app with voice, gaze, and hand gestures: no mouse or keyboard needed.
The idea is simple: instead of figuring out where to click, you can just look at something and say what you want.
Examples:
- look at a row → “open this”
- say “navigate to equipment”
- say “create a ticket” → it builds it
Some interactions are instant (navigation, selection), while more complex ones use an LLM and take a few seconds.
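To make that split concrete, here’s a rough sketch of how I think about it. The names (`handleCommand`, `askLlm`, the intent table) are illustrative stand-ins, not the actual SDK API: known simple intents resolve locally and instantly, anything else falls through to an async LLM call.

```typescript
// Hedged sketch of the fast-path / slow-path split. Simple intents resolve
// locally and synchronously; everything else goes to an LLM and takes longer.
// `askLlm` is a stand-in for whatever model call the app wires in.

type Resolution = { source: "local" | "llm"; action: string };

// Intents that never need a model round-trip.
const LOCAL_INTENTS: Record<string, string> = {
  "go back": "router.back",
  "open this": "open.gazeTarget",
};

async function handleCommand(
  cmd: string,
  askLlm: (cmd: string) => Promise<string>,
): Promise<Resolution> {
  const local = LOCAL_INTENTS[cmd.trim().toLowerCase()];
  if (local) return { source: "local", action: local }; // instant path
  return { source: "llm", action: await askLlm(cmd) };  // few-seconds path
}
```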
What makes it different from most “AI agents” is that it runs inside the app, not outside of it.
It has access to React state, routing, and the visible UI.
So it doesn’t rely on screenshots or DOM guessing; it actually understands what’s on screen.
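As a rough illustration of what that buys you: a spoken command gets resolved against in-app context rather than pixels. Everything below (`GazeTarget`, `resolveCommand`) is a simplified stand-in for explanation, not the real exocor API.

```typescript
// Hypothetical sketch: resolving a spoken command against in-app context.
// Deictic commands ("open this") bind to whatever the user is looking at;
// explicit commands ("navigate to equipment") don't need a gaze target.

type GazeTarget = { id: string; kind: "row" | "button" | "link" };

type AppContext = {
  route: string;           // current router location
  gaze: GazeTarget | null; // element the user is currently looking at
};

type Action =
  | { type: "open"; targetId: string }
  | { type: "navigate"; to: string }
  | { type: "unknown" };

function resolveCommand(cmd: string, ctx: AppContext): Action {
  const text = cmd.trim().toLowerCase();
  if (text === "open this" && ctx.gaze) {
    return { type: "open", targetId: ctx.gaze.id };
  }
  const nav = text.match(/^navigate to (.+)$/);
  if (nav) {
    return { type: "navigate", to: `/${nav[1].replace(/\s+/g, "-")}` };
  }
  return { type: "unknown" };
}
```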
I originally started this thinking about environments where mouse/keyboard don’t work well (gloves, operating rooms, field work), but it’s also interesting for internal tools and dashboards.
This is v0.1, still rough in places, but the core flow works.
GitHub: https://github.com/haelo-labs/exocor
npm: https://www.npmjs.com/package/exocor
Curious what you think, especially whether this feels useful or just like a gimmick.
I see a huge accessibility opportunity for this. Gaze + voice running inside the app (with actual React state access) is way more reliable than screen-reader bolt-ons for hands-free use. Curious if you've thought about other nonverbal inputs, head nods for confirm/cancel, blink patterns, facial expressions since you already have the webcam feed.
That’s a really interesting angle.
Accessibility wasn’t the starting point, but the more I work on this the more it feels like a natural fit.
On nonverbal inputs, I’ve focused on gaze and gestures so far. I’ve thought about things like head nods or blink patterns for simple confirm/cancel, but not explored them deeply yet.
Right now the main challenge is keeping everything reliable without adding too much complexity.
Curious how you’d see this used in practice?
For motor disability accessibility, the architecture advantage is real. Most assistive tech sits outside the app and navigates by DOM tree or pixel position, which is brittle. Since you're inside the React tree, you could expose semantic actions — not just "click the third button" but "open this ticket" — which is what users actually want to do. That beats anything screen readers offer today.
And it's not just permanent disability. Temporary and situational cases are everywhere and constantly overlooked — a parent holding a child, someone with a broken arm, post-surgery recovery. These people aren't going to install a full assistive tech stack for a few weeks or a few minutes. But gaze + voice built into the app they're already using? That's zero-friction.
The real value is combining inputs. Gaze to set context, voice for commands, and simple nonverbal signals (blink, nod) for confirm/cancel. That covers users who have voice but limited mobility and users who have gaze control but inconsistent speech. Most assistive tools force you to pick one input mode. Having all three with shared app context is the differentiator.
Even starting with head nod as a binary yes/no would unlock a lot. Reduces the voice dependency for simple interactions and makes the whole system more resilient when one input channel is unreliable.
Really appreciate this! This is one of the strongest framings I’ve seen of where this could go.
The semantic action point is exactly where I think the architecture wants to evolve: less “infer everything from the DOM,” more explicit app-level capabilities like opening, filtering, confirming, assigning, etc.
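A toy sketch of what that could look like, purely to show the shape of the idea (`registerCapability` / `invoke` are hypothetical names, not the actual exocor API): the app declares actions by meaning, and the voice layer dispatches to them by name instead of by selector.

```typescript
// Hypothetical semantic capability registry: the app exposes actions as
// first-class, named operations; the voice/LLM layer invokes them by intent,
// not by DOM position.

type Capability = { name: string; run: (args: Record<string, string>) => string };

const registry = new Map<string, Capability>();

function registerCapability(cap: Capability): void {
  registry.set(cap.name, cap);
}

// Called with a resolved intent, never with "click the third button".
function invoke(name: string, args: Record<string, string> = {}): string {
  const cap = registry.get(name);
  if (!cap) return `no capability named "${name}"`;
  return cap.run(args);
}

// The app opts in by declaring what "open a ticket" means:
registerCapability({
  name: "openTicket",
  run: (args) => `opened ticket ${args.id}`,
});
```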
And I think you’re right on the temporary / situational accessibility angle too. I didn’t start from accessibility, but the more I build this, the more it feels like a natural fit for those cases because it removes the need to install a separate assistive stack.
Head nod as a simple yes/no is also a very interesting idea. I probably wouldn’t start there before hardening the core loop, but it feels like a strong extension once the underlying interaction model is solid.
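For the curious, a minimal sketch of what nod-as-confirmation might look like over webcam head-pose samples. The thresholds and the sampling model here are made-up assumptions, not tuned values from the SDK: it just looks for a down-then-up pitch excursion.

```typescript
// Illustrative nod detector: a nod is a head-pitch dip past `downThreshold`
// followed by a return below `returnThreshold` later in the window.
// Pitch is in degrees; positive means the head tilted down. One sample per frame.

function detectNod(
  pitchSamples: number[],
  downThreshold = 12,
  returnThreshold = 4,
): boolean {
  let wentDown = false;
  for (const pitch of pitchSamples) {
    if (!wentDown && pitch >= downThreshold) {
      wentDown = true;                 // head dipped
    } else if (wentDown && pitch <= returnThreshold) {
      return true;                     // and came back up: count it as a nod
    }
  }
  return false;                        // no dip, or dip without recovery
}
```

A shake-for-no detector could follow the same pattern on yaw, which would give the binary confirm/cancel channel without touching the voice pipeline.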
nice job, bro
Appreciate it! Curious what you’d try this on?