Breaking Down Large Data Sets to Get Good Results From AI
Episode Transcript
The Context Window Problem
Good day. My name is Mike from Lone Wolf Unleashed and today we’re going to talk about how to interact with AI agents and your AI tools to get the best possible outcome for what you’re trying to do with them.
This week I had a data set we were looking at with 103,000 lines. That’s a lot of data. And there is a particular problem when you’re dealing with data sets this large — it will consume your AI context window.
The context window is basically how much information the AI can read before it needs to compact it and reload, or you need to start a new conversation. A few months ago, Claude would just run out of context window and you'd need to start again from scratch. It has since got a bit better — it now auto-compacts and provides a summary of the conversation so it can continue.
A couple of weeks ago, Anthropic increased my context window to about a million tokens, which was five times what I was on at the time. It’s a game changer — I haven’t seen it compact since.
When we’re dealing with data sets this large, we need to be thinking about how much room the AI has to work with.
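A quick way to get a feel for how much room your data will take up is a rough character-based estimate. The sketch below uses the common rule of thumb of roughly four characters per token for English text — actual tokenisers vary by model, so treat this as a ballpark, not a precise count.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English.
    Real tokenisers (Claude's, Copilot's) will differ, but this is
    close enough to tell you whether a data set will blow the window."""
    return max(1, len(text) // 4)

# Example: a single case row of ~60 characters is roughly 15 tokens;
# multiply by 103,000 rows and you can see the problem immediately.
row = "4812,Billing dispute,Customer charged twice for the same invoice"
total_estimate = estimate_tokens(row) * 103_000
```

Multiplying a typical row's estimate across all 103,000 records tells you up front whether you are anywhere near the window, before you waste a pass finding out the hard way.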
Working Within Constraints
The customer I’m working with for this one is constrained to — yep, you guessed it — Microsoft Copilot. It’s not my favourite AI tool. Let’s just say it’s near the bottom of the list. But we have to deal with constraints every day and this is the one the customer is dealing with.
The idea was to feed Copilot this data set and have it return a good set of outputs based on our objective. Specifically: we wanted it to scan over a bunch of categories and the descriptions of the cases in there, and return a new categorisation based on a different parameter driven by a legislative requirement.
Doing this manually in the past, you would have to search for trigger words — a very, very long exercise for that many records. The idea is that Copilot runs itself over those fields and completes the task. The problem is it cannot consume that much information at a time.
Making the Data Token-Efficient
So first — let’s look at our token usage and make the data set as token-efficient as possible, and make our prompt as efficient as possible.
I’ve spoken before about how to build AI agents, and we want to articulate what the agent or AI is supposed to do. The principles are the same whether you’re dealing with an agent team or just interacting in a chat window — we want to make all of those things token-efficient.
Step one: strip out the IDs. They were fairly long. We can just number them one through 103,000 and match back to the original data set with a matching field. We've saved a few tokens already.
What other columns do we not need? Strip out every single piece of information we don’t need.
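Those two steps — swapping long IDs for short sequential ones and dropping every column the task doesn't need — can be sketched like this. The column names (`case_id`, `category`, `description`, `owner`) are hypothetical stand-ins for whatever your data set actually contains.

```python
def slim_rows(rows):
    """Replace long case IDs with 1..N and keep only the fields the
    AI needs. Returns (slim, id_map); id_map lets you join results
    back to the original data set afterwards."""
    id_map = {}
    slim = []
    for n, row in enumerate(rows, start=1):
        id_map[n] = row["case_id"]  # keep the original ID for the join back
        slim.append({
            "row_id": n,
            "category": row["category"],
            "description": row["description"],
            # everything else (owner, timestamps, etc.) is dropped
        })
    return slim, id_map

rows = [{"case_id": "CASE-2024-000187-XK", "category": "Billing",
         "description": "Charged twice", "owner": "J. Smith"}]
slim, id_map = slim_rows(rows)
```

Every character you don't send is a token you get back for the actual analysis, and the `id_map` means the slimming is fully reversible.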
Then we articulate the outcome in a prompt. And we do some testing on how much information it can actually handle. We tried the first time — results were rubbish. It tried to consume too much information and the outputs weren't great.
Chunking the Data
Now we chunk the data set down into manageable pieces.
Let’s do this the long way around just to make sure it works. Let’s break this down into 20 data sets — split it out into several Excel workbooks with a master workbook where it returns its summaries. So we have 20 workbooks with just over 5,000 records each.
This is much more manageable for the AI tool because now we’re not consuming nearly as many tokens for each pass.
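The split itself is simple. This sketch divides the rows into 20 near-equal chunks — with 103,000 records that works out to 5,150 per chunk, in line with the "just over 5,000 each" above. Writing each chunk out to its own workbook is left to whatever Excel tooling you prefer.

```python
def chunk(rows, n_chunks=20):
    """Split rows into n_chunks near-equal pieces (the last chunk
    may be smaller if the total doesn't divide evenly)."""
    size = -(-len(rows) // n_chunks)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

parts = chunk(list(range(103_000)), n_chunks=20)
# 103,000 records / 20 chunks = 5,150 records per workbook
```

Each chunk then becomes one workbook, and each pass of the AI only ever sees one chunk's worth of tokens.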
If you’re on a pro version of Microsoft Copilot, you can point it towards your documents as knowledge sources. What I’d do is set up an agent to take care of this — there’s a “create agent” option in there. In your agent instructions, put in what your original prompt was for what it needs to do, then direct it on how it’s supposed to interact with the files.
The initial part is easy — hey, look at this data, provide me some insights based on these new categories. Now we direct it: okay, here is data set 1. We're going to take the result and put it in a new workbook so we can consolidate everything again. The first thing it does is record the name of the file it was working from, then return the total number of cases in that file. In the original workbook, it writes out the categorisation for every record.
This creates a quick check for us — we can open the new summary workbook to see that it’s working. We don’t have to open all 20 of the other ones. One open workbook tells us the totals are coming back correctly.
Once it’s done all 20, we should have our total case count with the new categories, and we can then combine the 20 back into one main data set.
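The summary-workbook check and the final recombination boil down to this: one summary row per chunk (file name plus case count), then a flatten back into one data set with the totals verified against each other. The `dataset_NN.xlsx` naming is a hypothetical convention, not something Copilot produces on its own.

```python
def summarise_and_recombine(parts):
    """Build the summary rows (one per chunk: file name + case count),
    flatten the chunks back into one data set, and check that the
    summary totals match the recombined record count."""
    summary = [{"file": f"dataset_{i:02d}.xlsx", "cases": len(part)}
               for i, part in enumerate(parts, start=1)]
    combined = [row for part in parts for row in part]
    # The quick check from the summary workbook, done in code:
    assert sum(s["cases"] for s in summary) == len(combined)
    return summary, combined

parts = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
summary, combined = summarise_and_recombine(parts)
```

That single assertion is the programmatic version of opening one summary workbook instead of twenty: if the per-chunk counts don't add up to the combined total, something got dropped along the way.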
Yes, this is still a bit manual. I could probably run this through Claude in my current setup in one shot — but if you’re dealing with constraints, that’s what you need to do.
Why This Matters
In this case, the client is trying to make a determination around the type of resourcing needed within teams based on case volume. There’s an understanding that there are invisible cases not being raised because operations are fragmented, or there’s not enough governance or ownership — and because of that, reporting is very limited.
So we’re coming from a different angle and asking: if we’ve got all these cases and case information, maybe they’ve been categorised incorrectly in the past, or maybe there are other layers to these cases that we haven’t previously understood. Can we figure out the order of magnitude of any new resourcing that would be needed? That’s the idea.
This exercise wasn’t about drilling into specific case-by-case matters — it was about figuring out the bigger picture.
Check the Work
The basic premise when doing exercises like this — particularly around data analysis — is that we need to check that it’s doing it right. We need to scrutinise its outputs. We can’t just take what it has as gospel.
It’s also possible to run a multi-stage exercise here. We feed it the original stuff, it returns. We then have a manager agent run over the top and go: check the work of this, provide us with the gaps around the assumptions or the outputs. It returns to you and you can test and refine from there.
I always highly recommend a battle test or manager review type exercise over AI outputs. Now you might say — well, we're just getting it to check its own homework. Yes and no. The role it used to produce the original content isn't the role you've given it to do the review. It does cut down the time needed to spot-check, which is really helpful when you're doing your own review.
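The two-pass structure looks roughly like this. `call_model` here is a stub standing in for whatever chat API or tool you actually use — it is not a real library call — and both prompt texts are illustrative, not the exact prompts from this exercise.

```python
# Placeholder for your real AI tool -- replace the body with an
# actual API call (Claude, Copilot, etc.).
def call_model(prompt: str) -> str:
    return f"[model response to {len(prompt)} chars of prompt]"

WORKER_PROMPT = (
    "Recategorise each case against the new legislative categories:\n{data}"
)
MANAGER_PROMPT = (
    "You are reviewing another agent's categorisation work. "
    "List the gaps in the assumptions or outputs:\n{work}"
)

def categorise_with_review(data: str) -> tuple[str, str]:
    """First pass does the work; second pass reviews it in a
    different role. Both outputs come back for your own spot-check."""
    work = call_model(WORKER_PROMPT.format(data=data))
    review = call_model(MANAGER_PROMPT.format(work=work))
    return work, review
```

The point of the separate manager prompt is exactly the "different role" argument above: the reviewer is told to hunt for gaps, not to defend the original output.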
Use the Right Model
The other thing to think about is the type of model you're using. If you're in a tool like ChatGPT or Claude, make sure it's in the right mode for these really heavy tasks. I'd be using Opus 4.6 with extended thinking mode on. Yes, it consumes more tokens — but you get a way better outcome, and that's what we're looking for. Better outcomes mean less of your time spent on admin work.
What We Covered
We’ve covered today that we need to be thinking about the constraints of working with AI tools in your business — maybe a tool decision was made earlier and you’re stuck with it.
We’ve covered how to break down really large data sets when you want to get insights on unstructured data or data that needs an AI to run over it — breaking those down in a token-efficient way to get it to work properly and produce good outcomes.
And we’ve covered how to make sure it’s given you the right output by running a manager-type role over it and then spot-checking ourselves.
Thank you so much for joining me. You could be doing so many other things, but instead you decided to hang out with me and learn how to break down these data sets to use in AI tools when you’re trying to make big decisions. This is all about saving time.
You can find more resources at lonewolfunleashed.com/resources — there’s AI stuff, process stuff, procedure stuff, presentations, heaps of stuff. Go and check that out and I’ll see you next week.
Want to go deeper?
Join the Pack for exclusive content, community access, and live discussions about episodes like this one.