$ cat i-ran-my-first-lora-experiment-and-made-the-ai-stupid.md
I ran my first LoRA experiment and made the AI stupid

My first LoRA-tuned model got 7% json validity where the untuned base model got 27%. On action recall, it went from 2% to 0.25%. Every single metric I checked, the fine-tuned adapter was worse than the raw base model it was built on top of.
And the training loss which is what I had been told to watch had been happily going down the whole time.
Turns out, I fine-tuned a model and made it worse. This was a learning adventure and I'm glad to have the lessons.
Why was I doing this in the first place?
I did not set out to become an accidental AI researcher. However, I have become an accidental AI home lab enthusiast, which is adjacent and cheaper and still not an outdoor activity which people claim is good for you.
My nightly knowledge graph pipeline runs gemma 3 27b on a mac studio. gemma 27b is a hungry model. while it's running, I can't do other fun things on the studio. Such as, no voice pipeline experiments, no rubber-duck agent, no parallel jobs. It parks for hours, extracts meeting notes into memgraph, and then releases the machine.
So I was curious if a smaller model could replace gemma 27b for this one specific job? if a 7b or even a 1.5b model could do 85% as well on the narrow task of "extract action items from a transcript," I'd get a big chunk of machine time back, and the studio would actually be available for other experiments.
LoRA felt like the right first experiment because it's narrow. LoRA (low-rank adaptation) is the cheap way to teach an existing model a new trick. Instead of retraining the whole model, you bolt on a small set of new weights and only train those. Tiny, fast, doesn't melt your gpu.
Rather than try to replace gemma on the full 12-type knowledge graph schema, I'd pick one slice. I chose actions. Fine-tune a tiny 1.5b model on one specific job, see if it learns, use that as a learning ramp before scaling up to the bigger question.
Start small, learn the tools, get honest numbers, decide what's next. That was my plan.
happy to have put in the time to build my knowledge graph
Every LoRA tutorial I read assumed the next step was generating training data. The standard move was to re-run gemma 27b on 700 transcripts, capture its outputs as the "right answers," spend four hours and an out-of-memory scare doing it. I had already done something like this once for a different project and the studio crashed partway through. Thus, I was not excited to do it again.
The real win here is that my memgraph knowledge graph had been ingesting nightly against 1,900+ transcripts for weeks. It had 5,284 action nodes, 5,355 assigned relationships, 1,797 meeting nodes. Every action carried a _source property pointing back to the transcript it came from, plus a summary, plus a link to the person it was assigned to.
What I'm saying is that the data I needed was already sitting in a running production graph.
One cypher query is all I needed. Group the rows by source file, load the matching transcript from disk, wrap each record in mlx's chat-format jsonl, write out train/valid/test splits. The whole script ran in about 3 seconds and produced 1,331 clean records. I split them into 500 train, 100 validation, 100 test. If this sounds nerdy it's because it is. Side note, AI can help you walk through this and learn it. It's a patient teacher with a lot of knowledge on this topic.
I'll probably get more mileage out of that 40-line script than out of the adapter it helped me train.
what I actually trained
I pointed mlx_lm.lora at mlx-community/Qwen2.5-1.5B-Instruct-4bit, a 1.5 billion parameter model, 4-bit quantized, about a gigabyte on disk. MLX's LoRA defaults gave me 5.276M trainable parameters, or 0.342% of the base model weights. The idea is that you're nudging a tiny fraction of the model and leaving the rest alone.
Got to work and within the first hundred iterations (iter), training loss dropped from 3.32 to 3.18. The textbook "model is learning" curve. If you'd like to dig in a bit on model training you can head to https://www.ibm.com/think/topics/stochastic-gradient-descent.
What happened next at iter 100, it went to 3.60. iter 110, 4.97. iter 120, 9.44. Started to hit loss territory where the model is essentially outputting gibberish. it plateaued up there for the next 70 iterations and I killed the run.
The model had lost the plot. The usual fix is to lower the learning rate, so I did.
Attempt two underway. This one looked better for a long stretch. Loss went from 3.24 at iter 10 down to 3.12 by iter 80. This was the absolute lowest point of the entire experiment. The curve flattened around 3.2 and stayed there through iter 180, which is roughly what a healthy training run looks like when it's approaching whatever it can learn. Which is all new to me but alas here I am learning alongside the model.
Then at iter 200, validation loss had crept up to 3.366.
What this means is the training loss is the model's score on examples it's been studying. Validation loss is the model's score on examples it hasn't seen. For reference, the very first validation loss at iter 1, before the model had been adjusted at all, was 3.253.
What happened is that after 200 iterations of training, on data the model hadn't been trained on, the fine-tuned model was predicting its targets worse than the untouched base model. Training loss looked fine. The model was memorizing patterns from the training set that did not generalize, and validation loss was the first honest signal.
From iter 200 through 250, both losses climbed steadily so I killed the run.
At this point I could have kept tinkering. try a lower rank, try a higher rank, try different layers, try 3b instead of 1.5b. but I wanted to finish what I started before pivoting. Not always the same approach I use at work but this is me at home learning. I wanted to run the eval so I could run it on both the base model and the best adapter checkpoint. Then compare.
I used the checkpoint from iter 75, where training loss was near its best. I ran both the bare base model and the base-plus-adapter against the held-out 100-record test set. The gold data contained 399 action items across those 100 transcripts.
| metric | baseline (no adapter) | iter-75 adapter |
|---|---|---|
| json validity rate | 27% | 7% |
| action recall | 2.0% (8/399) | 0.25% (1/399) |
| action precision | 4.6% (8/173) | 2.4% (1/41) |
| action f1 | 2.8% | 0.45% |
| assignee accuracy | 0% | 0% |
| generation time per record | 4.8s | 8.2s |
First, the baseline was already bad. a 1.5b base model with no fine-tuning got 2% recall on this task, which is the honest signal that this task is outside the capacity of a 1.5b model regardless of what I did to it.
Second, the adapter was worse. not marginally worse but categorically worse. Lower on every single metric.
So what did the model learn? I opened up a handful of the adapter's outputs on the test set to see what was going on. The adapter learned three things right. It learned the json envelope. It learned that an action summary follows an "X to do Y" pattern, and it learned that the assignee field should hold a person's name.
It did not learn when to stop.
I hadn't explicitly terminated each training example with the right end-of-sequence token, a little marker that tells the model "you're done now". It never learned that the actions list should close after the real actions were listed. It learned the shape of an action, failed to learn the boundary of the list, and now produces one valid-looking action repeated however many times it takes to hit the generation limit, after which the json is truncated mid-sentence and fails to parse.
That's also why generation time nearly doubled. the adapter was always filling up the full 800-token budget. The baseline at least sometimes produced short outputs, valid or not, and stopped.
things I've learned
Training loss going down is not evidence the model is working. At iter 30, loss had reached its best value of the whole experiment, and the model had already learned to repeat itself. Loss is a token-level metric. It rewards the model for putting the next character in the right place most of the time. It says nothing about whether the output, as a complete artifact, is useful. Knowing this in the abstract is quite different than seeing it in action.
Eval is the experiment. Running a fine-tuning job and watching the loss curve is the setup. The actual experiment starts when you compare the adapter's outputs against the base model's outputs on data neither has seen. If I had skipped that step, I would have believed my training had worked.
If I try this kind of fine-tune again, the first change is explicit end-of-sequence tokens on every training example, so the array has a reason to close. It's the single most fixable mistake in the whole run.
The experiment cost 3.5 hours and zero dollars. The adapter is not useful and I will not be deploying it. What I do have now is a script that turns my production knowledge graph into training data in 3 seconds, an honest baseline number for what a 1.5b model can do on this task with no training, and a concrete feel for what divergence and repetition collapse look like when they happen to you instead of when you read about them happening to someone else.
My blog and website is all about "adventures in pretending to know what I'm doing." In this case, the model pretended to learn, the loss curve pretended to cooperate, and then eval walked in and called all of us out.
$ _
LIKED THIS?
I write about AI in plain English every other Sunday. No hype, no jargon — just the stuff that actually helps.
I'M IN →