Thyag's Blog

drawbacks of SKILL.md

while skill.md has gained a lot of traction lately due to its usage in claude code and other coding agents, i believe its application is still very limited, particularly in long-horizon tasks.

skill.md files come in handy when you want the agent to use specific tools for a particular task. for example, if we are analyzing sec filing data, we already have structured data in json. it is easy to exploit that structure with python scripts to get clean outputs. you write dedicated scripts for your task, the agent provides the data as an input parameter, and the script runs!
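to make that concrete, here is a minimal sketch of the kind of script a skill might point the agent at. the field names (`company`, `financials`, `revenue`, `net_income`) are made up for illustration; a real filing schema will differ.

```python
import json

# hypothetical skill helper: pull a few headline metrics out of a
# structured filing dict. the agent loads the json and passes it in.
def summarize_filing(filing: dict) -> dict:
    financials = filing.get("financials", {})
    return {
        "company": filing.get("company"),
        "revenue": financials.get("revenue"),
        "net_income": financials.get("net_income"),
    }

# example input the agent might pass after loading a filing json
example = {"company": "acme corp", "financials": {"revenue": 1200, "net_income": 150}}
summary = summarize_filing(example)
print(json.dumps(summary, indent=2))
```

because the structure is fixed, the output is deterministic and clean, which is exactly the case where skill.md shines.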

however, there are failure modes of skill.md files that are rarely discussed. below are the issues that get overlooked:


1. requiring the llm to think in a specific direction

a lot of the skill.md files you find online are dedicated to tool usage and code generation. but what if i ask a domain expert to document their workflow for a particular task and give it to an agent to replicate the analysis rigor? will it be able to do it? it depends.

lately, agents have become very good at instruction following, but they struggle to replicate complex human analysis. a major reason is the inherent bias introduced during their training. additionally, analysis requirements often fall into "grey areas." if you ask the agent to analyze cash flow against growing profit, it performs well. however, if you ask it to analyze a corporate management structure and flag x, y, and z, it often overfits and fails. a human wouldn't have this issue because they don't overfit in the same way; they notice nuances that are out-of-distribution, whereas an llm fails to interpret those nuances properly.

2. failed skill execution

imagine you give your agent 10 analysis goals and 15 specific "if-else" instructions wrapped inside your skill.md file. the llm loads them but only performs 5 out of 10 analyses and follows 10 out of 15 steps. the task fails.
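the frustrating part is that nothing flags this. a minimal sketch of a post-run coverage check, assuming each required analysis writes a named result into the agent's output (all names below are hypothetical, not part of any skill.md spec):

```python
# the analyses the skill file requires (illustrative names)
REQUIRED_ANALYSES = {"cash_flow", "profit_trend", "debt_ratio", "management", "segment_mix"}

def find_skipped(agent_output: dict) -> set:
    # anything the skill required but the agent never produced
    return REQUIRED_ANALYSES - set(agent_output)

# a partial run: the agent quietly did only two of the five analyses
partial_run = {"cash_flow": "...", "profit_trend": "..."}
skipped = find_skipped(partial_run)
```

without a check like this, the only signal of a skipped step is you noticing its absence.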

at this point, you have two options: re-run the whole task and hope the agent covers everything this time, or manually audit the output and fill the gaps yourself. neither scales.

3. overfitting and hallucinations

one of the things people rarely talk about when using skill.md files is overfitting. when an agent loads skills dynamically as required by the task, you give it a free hand in how to perform that task; as a result, it can exercise its own intelligence if something new comes up.

however, when you dump rigid instructions into your skill.md, the model overfits. it tries to meet every required step as instructed, which can actually end up incentivising the agent to hallucinate. it may generate its own analysis or "find" insights that simply do not exist in the provided data, just to satisfy the constraints of your file.

conclusion

while skill.md files have proven useful for structured tool orchestration, their application to long-horizon analytical tasks still brings significant failure modes. the core issue is that skill.md files are rigid by design, and complex analysis is not.

when instructions are too sparse, the agent underperforms, skipping steps, ignoring constraints, and producing incomplete work. when instructions are too detailed, the agent overfits. it hallucinates insights, fabricates data points, and produces outputs that satisfy the list rather than actively engaging with the source material. between these two extremes we end up settling for a compromised middle ground where "good enough" becomes the standard and the remaining gaps are left to chance.

perhaps the most concerning aspect is how invisible these failures are. a skipped analysis step does not throw an error. a hallucinated insight looks identical to a real one. the agent never tells you it struggled. this makes skill.md failures uniquely dangerous in high-stakes domains where the cost of a wrong answer is not a formatting issue but a flawed decision.
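one crude way to make fabricated data points visible is a grounding check: flag any number in the agent's prose that never appears in the source data. a rough, purely illustrative sketch:

```python
import re

# flag numbers in the agent's output that don't exist in the source
# data. crude (ignores derived figures like ratios), but it turns a
# silent hallucination into a visible diff.
def ungrounded_numbers(insight: str, source_values: set) -> list:
    found = re.findall(r"\d+(?:\.\d+)?", insight)
    return [n for n in found if float(n) not in source_values]

source = {1200.0, 150.0}
claim = "revenue grew to 1200 while margins hit 23 percent"
suspect = ungrounded_numbers(claim, source)  # "23" has no source
```

it won't catch every invented insight, but it is the kind of cheap guardrail these files currently lack.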

note: thanks to gemini for fixing the grammar, and to opus for helping me write the conclusion