On Vibe Coding

Apr 14, 2025

It’s 2025, which means a good deal of my time is spent talking with people about Artificial Intelligence.

I’ve been working with machine learning and artificial intelligence since fairly early in my career. I was fortunate to be at a small university doing some cutting-edge research in robotics and manufacturing, where those techniques were experiencing a sort of renaissance (SDSM&T, go Hardrockers).

Even before that, the idea that developers would be replaced overnight by some sort of technology anyone could pick up and run with was a recurring theme. Those types of solutions certainly exist (and have existed), but they were (and continue to be) akin to scissors. Sure, anyone can pick them up and use them, but running with them was never really advisable. Think of any of your “low-code solutions” going back to VB 6, COBOL, or whatever.

All that said, I don’t know of any group of people working quite so hard to eradicate their role in the marketplace as engineers. I’m bullish that engineers will eventually code themselves out of a job, at which point we will all cash out our winnings to start our hobby farms, airsoft arenas, or whatever, so we can watch every dreamer vibe-code their way into a never-ending spiral of SaaS applications.

So I decided to try one

Recently, I upgraded my old desktop to one with 12GB of VRAM on a new-to-me RTX 3060 graphics card. This gave me just enough oomph to start playing with some of the open-weight models like DeepSeek and Llama without the constant drain of OpenAI or Anthropic subscription fees.

I’ve been using GitHub Copilot for several years to great effect. I was an early adopter, and it was rough going at first. I’m certain the algorithm has been improving as the models grow bigger and more refined, but I swear 50% of it all is just learning how to talk to the robot in a way that gets results.

Then, Anthropic decided to drop their open beta of Claude Code, a CLI-based agent that lives outside my IDE entirely. Would I like that? I had to find out.

First Test: A Side Project

For the last few years, I’ve been plugging away at a hobby project. It’s not live yet, but thanks to Claude Code, it is much closer to V1.

The first thing it did was get its bearings and form a little document to encapsulate its discoveries. Most of this document was about how to act like a member of the team based on the project structure and patterns. CLAUDE.md contained very thoughtful notes on every design pattern, architecture pattern, and design goal I had put into the project up to that point. The quality of the document was remarkable – something I would have handed off to another human.

Up to this point, this hobby project of mine has had the benefit of being built by a single brain with consistent patterns and practices. Because I’ve been plugging away at it off and on for years, it’s chock-full of documented methods and classes. It’s been periodically refactored with each volley to simplify some things and make new features easier to add. In some ways, it’s turned into the ideal DDD application, with near-real-time consistency and clear isolation of composable functional units.

Claude Code picked up on all of it. It gave itself guidance on code coverage goals (90% and above), on isolating data-change operations into command structures, and on using data storage patterns that keep the simulation state separate from the business logic; it understood the overall nature of the project and its goals.

One of the pieces of documentation I had maintained over the years is a README file with a checklist of features I wanted to implement before go-live. My next step had to do with some property ownership mechanics that I knew how to build but just hadn’t had the gumption to complete on my own. I figured… why not let Claude try?

No more than 20 minutes later, Claude refactored and added new classes to maintain entity ownership. It followed every pattern and practice I had put in place up to that point, kept the already-proven functionality of other features intact, and topped it all off with a thorough test suite to show that everything was working as expected. It even checked the box for me when it was satisfied with its progress.

I kept this going through a whole host of checklist items…

  • DevOps infrastructure
  • Docker compose and containers
  • Finding and fixing test automation edge cases
  • Refactoring
  • Server mechanics
  • GraphQL and OData APIs

The list goes on. At this point, I could hand this off to a friend and tell them to run a single command to install all dependencies for a fully-operational cluster on their own dev machine, not to mention a full suite of unit and integration tests.

All the while, it prompted me for permissions in a way I can only describe as polite. I could give it free rein to modify documents without constant prompting (which I did), but for anything having to do with the command line (cat, grep, rm), it came back to make sure its next move was OK.

I occasionally would help get it back on track if it seemed to be going down an unnecessary rabbit hole, but that was as simple as interrupting its “thought” and asking it to remember its core task. It would snap out of whatever analysis paralysis it was in and proceed with its marching orders dutifully.

Each 30-minute session seemed to cost me about $10-$20 in API tokens, or roughly $20-$40 per hour of agent time. By comparison, a developer of comparable skill in India doing the same work would likely run about $30 per hour.

The Second Test: Celery Ranch

At my last gig, my team had a years-long wrestling match with Django’s historical approach to async tasks. The weapon of choice was Celery. On its own, it’s a great toolkit for queued tasks, but it clearly wasn’t built with this company’s particular use case in mind.

The client had about 200 customers, many of them small businesses. At the other extreme, a handful of whales made up 90% of the actual day-to-day SaaS operational volume.

Celery was used to manage many processes, but for the most part, it was handling large data imports and exports to and from CSV files. These files would usually measure in the millions of rows. This was an embarrassingly parallel task, and Celery certainly knew how to chop it up, but a client uploading a file or exporting a massive report was handled strictly first-come, first-served. The timing meant a small client could wait days for the queue to clear before getting their report. Support tickets abounded.

Celery has no native construct akin to an LRU or a token bucket. What we talked about for ages was a Celery-based system that would let us assign a label to each parallel task so Celery could prioritize based on utilization. In other words, if Massive Client A was uploading a 50 million record file, Little Client B could slide in there real quick to export their 30 transactions for the month without stopping the presses.
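To make that scheduling idea concrete, here’s a minimal sketch in plain Python (no Celery involved) of least-recently-served dispatch across client labels. The class and names are purely illustrative, not from any real system.

```python
from collections import defaultdict, deque


class LruDispatcher:
    """Toy illustration of least-recently-served scheduling across client labels."""

    def __init__(self):
        self.queues = defaultdict(deque)  # label -> pending tasks
        self.order = deque()              # labels, longest-waiting first

    def submit(self, label, task):
        # The first task for an idle label puts that label back into the rotation.
        if not self.queues[label] and label not in self.order:
            self.order.append(label)
        self.queues[label].append(task)

    def next_task(self):
        # Serve whichever label has gone the longest without service.
        if not self.order:
            return None
        label = self.order.popleft()
        queue = self.queues[label]
        task = queue.popleft()
        if queue:  # label still has work, so it goes to the back of the line
            self.order.append(label)
        return label, task


# Massive Client A enqueues five import chunks; Little Client B enqueues one export.
dispatcher = LruDispatcher()
for i in range(5):
    dispatcher.submit("client_a", f"import-chunk-{i}")
dispatcher.submit("client_b", "monthly-export")

print(dispatcher.next_task())  # ('client_a', 'import-chunk-0')
print(dispatcher.next_task())  # ('client_b', 'monthly-export'); B is not stuck behind A
```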

I had tried Claude on an existing project with some great successes. Could it start from scratch?

I gave it some advantages. I started out by describing not just the next file I wanted it to build, but the whole business problem and its context. I told it what versions of the software I wanted to use, which other tools might make the project more convenient, and what quality and maintainability standards I wanted to hit. It dutifully added its CLAUDE.md file with all its notes, first thing, in the new directory. From there, it created a skeleton project with all the core components, then drove forward with an implementation, one functional unit and test at a time.

I did have to correct it periodically on what I wanted the ergonomics of the tool to be: someone already running Celery should be able to adopt it with minimal refactoring, swapping the standard “delay” function for an alternative that accepted an LRU label.

But once it had that correction jotted down in its notes, it went above and beyond. Not only did it create the simple ergonomic replacement, it added Pythonic hooks for custom prioritization algorithms, configuration settings, and more. As with my side project, I wanted it to pursue as much test coverage as it could, and that only enhanced the implementation process. Claude regularly checked its coverage, behavior, and edge cases, correcting its code and failed assumptions as it went. It even created a Docker-based test suite that simulated a semi-live environment with Redis and Postgres. As if showing off, it also added an examples directory documenting how to use each feature it implemented.
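To give a flavor of those ergonomics, here is a hypothetical sketch of the “drop-in alternative to delay()” idea. The lru_delay helper below is illustrative only and not necessarily how the released library spells it; here the label just rides along in the message headers.

```python
from celery import Celery

app = Celery("imports", broker="redis://localhost:6379/0")


@app.task
def import_csv_chunk(client_id, chunk_path):
    ...  # parse and load one chunk of an uploaded file


def lru_delay(task, lru_key, *args, **kwargs):
    """Illustrative alternative to task.delay() that carries an LRU label.

    A real implementation would prioritize or interleave work based on the
    label; this sketch only attaches it to the outgoing message headers.
    """
    return task.apply_async(args=args, kwargs=kwargs, headers={"lru_key": lru_key})


# Before: standard Celery, strict first-come, first-served.
import_csv_chunk.delay("client_b", "/tmp/small_export.csv")

# After: the same call shape, plus a label the scheduler can use to stay fair
# across clients.
lru_delay(import_csv_chunk, "client_b", "client_b", "/tmp/small_export.csv")
```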

All told, it was a great experience. In an hour, I had solved an issue that plagued my team for years.

I was shocked at how good it could be

The main takeaway was that with the kind of guidance I would give a junior engineer, it was very capable of understanding a goal and how to get there. I really appreciated how it would use the tools I encouraged it to use. If there were tests, it used the tests. If it wrote more code, it wrote more tests to match.

One of the greatest “life hacks” I discovered was telling it to prefer modifying code with lower test coverage as opposed to code with high coverage. This led to less code churn and kept the tool on task. In some cases, Claude would discover a flaw in the original implementation, then ask me if it was OK to break the rule once or twice to overcome the original assumptions in the older code.

I was shocked at how terrible it could be

I’ve made a lot of glowing comments so far, and I think those are deserved, but here is where I’m going to drop the “well, actually” all over this technology.

I was very hands-on with it. There were certainly moments where I could let it run for 10-15 minutes without interruption, but my eyes were on the output and changes the whole time. I might have looked a bit like a Matrix technician, watching the green rain float across the screen. The long-term side project I had it running on has some very involved and realistic market economics. While the TO-DO list was very clear about the ways those market mechanics were simplified for the simulation, Claude would progressively drift away from its original understanding of those limits and venture off into much more realistic or complicated mechanics.

It was only as grounded as it was in the two examples above because of the strong encouragement to follow basic engineering best practices. At times, I would manually edit the CLAUDE.md file with some front-and-center advice: make small, incremental changes; use dependency injection where appropriate; add a unit test before you implement the next unit of behavior (TDD). When given such restrictive instructions, it would make its best effort.

Even then… like a naughty dev… it did occasionally try to take shortcuts that were counter to the goals of the session. If it consistently had issues with a particular set of tests (particularly ones that weren’t testing the code it was working on), it would disable the tests and commit its changes. When I wanted it to simply expand code coverage, it would often get “frustrated” at a failing piece of code and mark it out of scope for the coverage report. In particular, it preferred to ignore the nullability warnings in .NET. In the end, I had to add some very explicit instructions to the markdown files that it was to prefer modifying untested code over tested code, and to never take a shortcut (like disabling a test) before merging to main.
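Boiled down, that front-and-center guidance ended up looking roughly like this in CLAUDE.md (paraphrased here, not the full file):

```markdown
## Working agreements

- Make small, incremental changes; keep each step working.
- Write or update a unit test before implementing the next unit of behavior (TDD).
- Use dependency injection where appropriate.
- Prefer modifying code with low test coverage over code with high coverage.
- Never disable, skip, or scope-out a failing test to get a change merged to main.
- Do not ignore nullability warnings; fix the underlying issue.
```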

Sometimes it would get into an analysis paralysis loop, seemingly getting stuck on the behavior of a particular test that may not have been necessary to begin with. Other times, it would break some existing piece of code to make writing the next function better. That would lead it to an oscillation of breaking one test to fix another, back and forth in apparent perpetuity. None of these things are inherently unusual in software development – humans do this all the time. But humans also have accountability structures, and it’s clear to me that if you’re going to be using this kind of technology, you need that break-glass rollback and human accountability too.

Generally speaking, it’s about good management

I find that people who are bad managers are also bad LLM/AI users. The worst managers I’ve seen in the wild seem to be very keen on getting results without putting much into the people who are actually responsible for those results. The sink or swim approach always results in unnecessary frustration for both sides of the management relationship because the side giving the orders discounts the ability of the worker to perform any sort of critical thinking about their role. Imagine if all children were taught to swim by throwing them into a pool and seeing which ones drown. Sure, the natural swimmers would come out on top and every adult going forward would be guaranteed to swim… but you’ve now wasted a good horde of kids who could have been great swimmers if given a more thoughtful learning curve.

The opposite of that mindset is viewing your employees as an extension of your own brain. Early in my management career, a good friend and advisor told me that my main goal is to turn my reports into clones of myself. To make that work well, your team needs as much context as you do: desired outcomes, yes, but also why those outcomes are desired, how to achieve them, and what is and isn’t important. In other words, they need the whole context. The same goes for an agent: if you tell it to go after a goal without any more direction, the outcomes will be less than ideal one way or another.

These tools also need good automated companions they can cling to. Just like a human, an agent gets tremendous value from being able to automate its own compliance with linters, formatters, static analysis, and code coverage enforcement. It takes some of the cognitive overhead out of the LLM and puts it into the process. “Here are the standards, and you’re not allowed to merge until those standards are met.” I found Claude in particular to be very keen on using those tools early and often. The same thing happened when I enforced this on my human engineering teams: eventually, coding discipline naturally converged with the standards to avoid getting dinged by the merge checks.
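As a concrete example of “not allowed to merge until the standards are met,” a minimal gate script might look like the sketch below. The tool choices (ruff, pytest-cov) and the 90% floor are illustrative, not prescriptive.

```python
# Minimal pre-merge gate: run the formatter check, the linter, and the test
# suite with a coverage floor, and fail loudly if any of them fail.
import subprocess
import sys

CHECKS = [
    ["ruff", "format", "--check", "."],
    ["ruff", "check", "."],
    ["pytest", "--cov=.", "--cov-fail-under=90"],
]


def main() -> int:
    for cmd in CHECKS:
        print(f"$ {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"Gate failed on: {' '.join(cmd)}")
            return result.returncode
    print("All gates passed; safe to merge.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```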

What about the other tools?

GitHub Copilot is emerging as a very mature tool, and I still enjoy using it as a smarter auto-complete. I also started checking out the beta agent mode, and it does what it says on the tin. Frankly, though, I found the CLI-based Claude Code to be much closer to what I imagine an AI agent should be.

I’ve also started playing around with some local models, but this is where I’m really seeing the limitations of running local. DeepSeek Coder and the rest do a good job, but not a great job. They require a lot more finagling, and their integration with the CLI or IDE is nowhere close to Claude Code’s ability to synthesize context and execution together.
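For context, “playing around” here mostly means poking at a local model through Ollama’s Python client, along these lines (the model name and prompt are just examples):

```python
# Assumes the Ollama daemon is running and the model has been pulled,
# e.g. `ollama pull deepseek-coder`.
import ollama

response = ollama.chat(
    model="deepseek-coder",
    messages=[
        {
            "role": "user",
            "content": "Write a Celery task that parses one chunk of a CSV import.",
        }
    ],
)
print(response["message"]["content"])
```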

If the distilled models get better on 12GB of VRAM and someone comes out with a CLI competitor to Claude Code that supports Ollama, I might dive a bit deeper.

What did it cost?

Claude Code runs directly against the API, which means it charges by the token. All told, I used about 500,000,000 tokens in and 3,000,000 tokens out, at a rough total price of $500.

When you’re charging by the hour, this may not make sense. If you’re competing on quality and velocity, I think there’s an argument to be made that this is worth it.

Some resources…

Accompanying this post, I released two ways of consuming the final CLAUDE.md file I ended up with after a couple weeks of fiddling…

You can see the current and historical versions of the Celery Ranch project here…

And the agents…

If you’d like me to help your team understand the current AI landscape and how to adapt to it, please reach out on the contact form or LinkedIn.
