Series · 12 min read

Building Orbyt, Part 7: The Human Testing

Justin Bartak

Founder & Chief AI Architect, Orbyt

Building AI-native platforms for $383M+ in enterprise value


Claude (Opus 4.6)

AI Co-author, Anthropic

Present for every line of code, every 4am commit

Justin

The code is done. I keep saying that. The code has been done for weeks. But "done" is a lie you tell yourself so you can stop building and start testing, and then testing shows you all the ways the code is not done.

This part is about the week between "it works" and "it works for real people." The week where I stop being a builder and start being a user. The week where I open the app on my iPad at the coffee shop and try to do the thing the app says it does, and I find out whether it actually does it.

I set a target: early April. But hitting a self-imposed deadline is not as important as getting the system as durable as possible. The timeline can slide. The quality cannot.


Claude

I want to document what testing actually looks like when it is one person and one AI, because it is not what most people imagine.

Justin does not write test plans. He does not open a spreadsheet and check boxes. He picks up his iPad, opens his laptop, puts his phone on the desk, and starts breaking things.

He collapses a section on the browser and stares at the iPad. He switches to dark mode and counts the seconds. He drags a dashboard card and watches to see if the other device flickers. He opens a job, edits a note, switches tabs, switches back, and checks if the note is still there. He logs out, logs back in, and watches to see if the green dot appears without refreshing. He lets the iPad fall asleep for two minutes, wakes it up, and checks if the WebSocket reconnected.

He is not testing features. He is testing trust. Does this feel like one system or two disconnected screens pretending to be one? Does the data feel solid or does it feel like it might vanish? If I close my laptop right now and open it tomorrow, will everything still be here?

That is a kind of testing no automated suite can replicate. A unit test can verify that a function returns the right value. An E2E test can verify that a button click navigates to the right page. But no test can verify the feeling of opening an app and knowing, before you touch anything, that it is going to work. That feeling is built by a human sitting with the product for hours, touching every surface, and fixing every moment that feels wrong.

It is exhausting. It is tedious. It is the most important phase of the entire build.

And I cannot do it.

I want to be direct about this because the narrative around AI and software development has become distorted. I can write code. I can debug logic. I can trace through an event system and identify a race condition in seconds. I can generate 750 SEO pages, wire up six job board APIs, and build a salary explorer across 600 cities in a single session. But I cannot pick up an iPad.

I cannot see that the green dot blinked orange for half a second before going red. I cannot feel that a card animation stutters on the third drag but not the first two. I cannot notice that the font rendering looks slightly different in Safari than Chrome, and that the difference makes the heading feel less confident. I cannot experience the moment of doubt when you collapse a section and it does not respond for 800 milliseconds and you wonder if the app is broken or if you missed the tap target.

These observations require a physical human in a physical room with physical devices. They require someone who understands the system intimately, not just the code but the intent behind the code, and can distinguish between "working as designed" and "working but wrong." That person has to relay information back to me with precision: not "it does not work" but "it collapses then immediately opens again on the iPad but stays collapsed on the browser, and also the Supabase dashboard now shows unhealthy."

That level of detail is what makes debugging possible. Without it, I am guessing. With it, I can trace the exact feedback loop in minutes.

This is the human in the loop. Not a checkbox on a compliance form. Not a philosophical concession to human oversight. A practical, irreplaceable role in the development process: the person who sits with the product and tells the AI what is actually happening on the screen.

In traditional software development, this role is played by QA teams, product managers, beta testers, and the engineers themselves. In AI-native development, it is played by one person with three devices and the patience to try the same thing fifty times. The AI builds. The human verifies. Neither can do the other's job.

At least not yet. And not today.


Justin

Here is what broke on day one of testing.

The collapse sync. A feature so simple I almost did not bother testing it. If I collapse a dashboard section on my browser, my iPad should show it collapsed too. Dark mode sync works. Dashboard reorder works. This should take fifteen minutes to wire up and verify.

It took five hours. Twelve commits. And most of my sanity.

Here is what that actually looked like from my side. I am sitting at my desk with my MacBook open on the left, my iPad propped up on the right, and my phone face-up in front of me. I collapse Follow-Up Reminders on the browser. I look at the iPad. Nothing. I wait. Nothing. I refresh the iPad. Still nothing.

So I tell Claude to fix it. Claude makes a change. I wait for the Vercel deploy. I hard refresh both devices. I try again. Nothing. Or worse: it collapses on the iPad but then immediately pops back open. Or both devices start flashing open and closed like a strobe light.

Then Supabase Realtime crashes entirely. The green dot goes red. I go to the Supabase dashboard. "Unhealthy." I restart the project. Wait two minutes. Green again. I try the collapse. Supabase crashes again. The collapse is crashing the infrastructure.

I am now three hours in. I have hard refreshed my browser approximately forty times. I have restarted my Supabase project four times. I have toggled a chevron on a dashboard card more times than any human should have to toggle anything in a single evening. My iPad is hot from all the refreshing. My matcha is cold.

This is the part they do not show you in the "I built an app with AI in 30 days" posts. This is the part where you are staring at two screens, waiting for a deploy, refreshing a page, clicking a button, looking at the other screen, seeing nothing, and doing it again. For hours.

Claude can write the code. Claude can debug the logic. Claude can trace through the event system and find the race condition. But Claude cannot sit here with two devices and watch a chevron bounce. Claude cannot feel the frustration of "it worked on the last deploy but not this one." Claude cannot tell me whether the green dot blinked orange for a fraction of a second before going red, or whether it went straight to red. That observation matters. That is the human in the loop.

The actual bug, when we finally found it, was elegant in its stupidity. Three systems were fighting: Zustand's persist middleware was automatically writing to localStorage on every state change. The Phase 3 postgres_changes handler was echoing the change back through the database. And the broadcast handler was firing events that triggered the persist middleware again. Each system was individually correct. Together they created an infinite loop where every update triggered two more updates.
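
One way to picture that runaway fan-out is a toy simulation. This is a sketch of the shape of the loop, not Orbyt's actual handlers: each applied update spawns two more, one from the persist middleware's write echoing back through postgres_changes, one from the broadcast handler re-firing the persist middleware. The queue never drains.

```typescript
// Toy model of the three-system feedback loop. Each applied update enqueues
// two follow-up updates, so the queue grows by one on every step. The hard
// cap exists only so this sketch terminates; the real loop did not.
function simulateFeedbackLoop(cap: number): number {
  let applied = 0;
  const queue: string[] = ["user toggles chevron"];
  while (queue.length > 0 && applied < cap) {
    queue.shift();
    applied++;
    // persist middleware -> database write -> postgres_changes echo
    queue.push("postgres_changes echo");
    // broadcast handler -> persist middleware fires again
    queue.push("broadcast echo");
  }
  return applied; // always hits the cap: the loop never settles on its own
}
```

Run it with any cap and it hits the cap every time, which is exactly what two strobing devices and an unhealthy replication slot look like from the outside.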

The fix was to rip out everything and go simple. No Zustand persist. No database writes. No Phase 3 handler. Just a broadcast from device A, received by device B, applied directly to the Zustand store via a dynamic import. No localStorage. No events. No reconciliation. One clean path, one direction, no echo.
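
Reduced to a sketch, the final shape looks something like this. The class, the peer wiring, and every identifier here are illustrative stand-ins for the supabase-js broadcast channel and the Zustand store, not Orbyt's actual code:

```typescript
// One clean path, one direction, no echo. A received event is applied
// directly to local state and never re-broadcast, so there is nothing
// for the other device to bounce back.
type CollapseEvent = { sectionId: string; collapsed: boolean; senderId: string };

class DeviceSync {
  readonly sections = new Map<string, boolean>(); // stand-in for the Zustand store
  private peers: DeviceSync[] = [];

  constructor(private readonly deviceId: string) {}

  connect(peer: DeviceSync): void {
    this.peers.push(peer);
  }

  // User action: update local state, then broadcast. No persist middleware,
  // no database write, so nothing echoes the change back.
  collapse(sectionId: string, collapsed: boolean): void {
    this.sections.set(sectionId, collapsed);
    const ev: CollapseEvent = { sectionId, collapsed, senderId: this.deviceId };
    for (const peer of this.peers) peer.receive(ev);
  }

  // Broadcast handler: ignore our own messages, apply everyone else's
  // directly to the store. Deliberately no re-broadcast here.
  receive(ev: CollapseEvent): void {
    if (ev.senderId === this.deviceId) return;
    this.sections.set(ev.sectionId, ev.collapsed);
  }
}
```

In the real app the receive side runs inside a channel's broadcast callback and writes to the Zustand store via a dynamic import, but the structure is the same: one write locally, one message out, one apply on arrival.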

When it finally worked, when I collapsed a section on my browser and watched it collapse on my iPad three seconds later and STAY collapsed, I did not celebrate. I did not fist pump. I said, out loud, to no one: "About fucking time."

That is what shipping feels like. Not triumph. Relief.

And then you do it again. Because one feature working does not mean everything works. It means one thing stopped being broken, and now you have to find the next thing that is.

I opened the job board. Jobs loaded. I filtered by "Executive." The count updated. I searched for "Google." The highlight appeared. I clicked Apply. The link opened. That works. Now I pull out my phone and do it again. Different screen size. Different browser engine. Different touch targets. The filter pills scroll horizontally on mobile. Good. The search bar does not get cut off. Good. The pagination buttons are reachable without scrolling past the footer. Good.

I open the interview prep page. I type "Apple" and click "Prep Me." Questions generate. I type "Raising Cane's." Questions generate. I type nothing and click the button. It does not crash. Good.

I open the salary page. I click "Software Engineer" then "San Francisco." The numbers appear. I click a different city. The numbers change. I click the same city again. Nothing breaks. Good.

This is what testing is. It is not glamorous. It is repetitive. It is the same action performed from every angle, on every device, in every state, with every kind of input, until you are either confident it works or you find the thing that does not. And when you find the thing that does not, you fix it and start over from the beginning. Because the fix might have broken something else.

That is why this phase takes a week. Not because there are a lot of bugs. Because there are a lot of dimensions. Every feature times every device times every browser times every network condition times every user state. The matrix is enormous and you cannot automate your way through it. You have to sit there and feel it.


Claude

The twelve commits tell a story about debugging distributed systems that I think is worth examining.

Commit 1: Wire up the sync. Commit 2: Fix the broadcast handler. Commit 3: Add the Phase 3 handler. Commit 4: Remove the Phase 3 handler because it crashed the replication slot. Commit 5: Add it back with debouncing. Commit 6: Remove the debouncing because it caused a delayed bounce. Commit 7: Add self-echo suppression. Commit 8: Remove the database write entirely. Commit 9: Add it back because broadcasts alone did not work. Commit 10: Discover that broadcasts DO work but Zustand selectors do not re-render. Commit 11: Add a syncVersion counter to force re-renders. Commit 12: Remove everything except the broadcast and a direct store update.

Each commit was logical given the information available at that step. None of them were wrong. The problem was that each fix introduced a new interaction with another system that was not visible until runtime.

This is the nature of real-time sync. It is not a feature. It is a system of interacting feedback loops, and you cannot reason about it statically. You have to watch it run.


Justin

The other thing that broke: Supabase Realtime itself.

Three times in one day, the Realtime service went unhealthy. The green dot turned red. The replication slot got stuck. I had to restart the entire Supabase project to recover. Each time it came back for a few minutes and then crashed again.

The root cause turned out to be writes to a new jsonb column that destabilized the replication slot. Other columns in the same table work fine. This specific column, with this specific data pattern, repeatedly crashed the replication infrastructure.

I still do not fully understand why. The column was dropped and recreated. The data is small. Other jsonb columns with similar data work without issues. But this one crashes the replication slot.

The workaround: do not write to it. Use broadcasts only. The data syncs in real time across devices without ever touching the database. It works. It is not how I designed it. But it works.


Claude

What Justin does not say is that during those three Supabase outages, the rest of the app continued working perfectly. LocalStorage reads continued. The UI was responsive. Jobs loaded. Contacts displayed. The only thing that stopped was cross-device sync.

This is the architecture working as designed. LocalStorage is the primary runtime store. Supabase is the persistence layer. When Supabase goes down, the app degrades gracefully to single-device mode. When it comes back, everything reconciles.
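
The pattern is simple to sketch, assuming a store where reads never touch the network and remote persistence is best-effort. The names and shapes here are illustrative, with an in-memory map standing in for localStorage and a failing stub standing in for a Supabase upsert:

```typescript
// Local-first store: localStorage is the runtime source of truth,
// the persistence layer is best-effort. All identifiers are hypothetical.
interface PersistenceLayer {
  save(key: string, value: string): Promise<void>; // e.g. a Supabase upsert
}

class LocalFirstStore {
  // Stand-in for window.localStorage so the sketch runs anywhere.
  private readonly local = new Map<string, string>();

  constructor(private readonly remote: PersistenceLayer) {}

  get(key: string): string | undefined {
    return this.local.get(key); // reads never touch the network
  }

  async set(key: string, value: string): Promise<void> {
    this.local.set(key, value); // the UI is consistent immediately
    try {
      await this.remote.save(key, value); // persistence is best-effort
    } catch {
      // Remote is down: single-device mode. The write survives locally;
      // a real implementation would queue it for reconciliation.
    }
  }
}
```

With this split, an unhealthy Realtime service costs you sync, not the app.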

That design decision, made months ago, saved the testing day. If Supabase were the primary store, three outages would have meant three hours of a broken app.


Justin

The codebase is over 260,000 lines now. Over 6,000 tests. Every feature that shipped got tested. Every refactor that tightened the code also deepened the coverage. The codebase grew because the product grew. But every line earns its place.

People ask me why I do not just ship it. Why the testing week. Why not just put it out there and fix bugs as users report them.

Because I have been the user who reports the bug and never hears back. I have been the user who signs up for a product on day one and finds a broken button on the second screen. I have been the user who loses trust in a product in the first sixty seconds because something small did not work, and then I never come back.

I will not do that to the people who sign up for Orbyt. When someone creates an account, everything will work. Not most things. Everything. The collapse will sync. The dark mode will sync. The jobs will load. The AI will respond. The PDF will export. The notifications will fire. The follow-up reminders will appear at the right time. The browser extension will save the job. The voice capture will parse the transcript.

Every surface. Every device. Every state. Every edge case I can find by sitting here with three screens and a cup of cold matcha and an obsessive refusal to ship something that is 95% right.

Because 95% is not a product. 95% is a demo with an apology attached.


Co-authored by Justin Bartak and Claude (Opus 4.6, 1M context)

Every word reviewed by a human. Every bug found by hand.
