🤖 AI Onboarding
We are really excited to have you join the team. The goal of this document is to help you, and future employees, get up to speed as fast as possible. If you notice anything missing, please create a PR in the internal-docs repository to help future generations!
Principles
Start by reviewing the Legion principles to understand the vision and values of the company. The goal of an AI engineer in Legion is to build magical AI-powered software that radically improves the experience of our customers. Our goal is to build an AI security analyst that thinks, sees, and acts like a human analyst, and everything flows from that.
AI Principles
1. Data > Models
We live in a time of rapid progress in model capabilities. Large fine-tuned models are taking over the world, demonstrating very strong performance on many tasks, and there are good reasons to believe that we are still early in this trend.
For many problems, models can be used as the base of a more complex system with other components built around them, while for others they prove to be a strong base for finetuning. To a large extent, the toolset of a data scientist is morphing from model building to “AI Engineering” - the art of using large pre-trained models, engineering their context and usage to get good performance on your own tasks while meeting your own constraints: latency, cost, reliability, etc.
More and more problems are solved by model choice, model chaining, RAG, prompting, pre and post processing, "agentic flows" and Agent-computer interface design. And that's a great thing! We get more capabilities and a faster iteration cycle, allowing us to build better products faster.
But data science is not dead. Although we have different tools, we still have the same modus operandi:
- Identify tasks - It often helps to break down big problems into smaller, self-contained tasks that can be solved independently and measured.
- Collect data - Collect high quality data for your tasks, from the same distribution as the data you expect to see in production.
- Label data - Find ways to couple your data points with expected outcomes.
- Evaluations - Develop evaluation scripts that measure how well an AI system is doing at solving your tasks. Good evals are highly correlated with the desired behavior of your system, and are easy and fast to run. With LLMs you will sometimes, but not always, have to resort to evals that look more like integration tests (asserts on the output of the system) or use an LLM as a judge; you might need to get creative. Read this. You want evals for both each separate task and the whole system, and you want to make sure your evals include some really hard samples. An eval where you already score 99% is cool, but doesn’t really help anyone :).
- Develop AI Solution - Fun part! Build your system and evaluate it on your evals!
- Iterate - Analyse the errors from your evals, think about a way to improve the system, build it and try it out. Rinse and repeat until desired performance is achieved.
- Log - Keep a tidy research log across tasks, detailing your ideas, iterations, and plans. When researching and developing a complex system, it’s easy to get stuck in loops. Proper logs and justifications for ideas and changes will help us stay on track and improve both our system and ourselves as researchers.
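As a sketch, the cycle above might look like this in miniature. Everything here is hypothetical for illustration: the task, the dataset, and the `run_system` stand-in for a real AI system.

```python
# Minimal eval-harness sketch: labeled examples, a system under test,
# and a score that is easy and fast to compute and log.

def run_system(question: str) -> str:
    # Stand-in for the real AI system (LLM call, agent, etc.).
    return {"capital of france?": "Paris"}.get(question.lower(), "unknown")

# Labeled data: inputs paired with expected outcomes, ideally from the
# same distribution as production data.
dataset = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Capital of Spain?", "expected": "Madrid"},  # a hard sample
]

def evaluate(system, dataset) -> float:
    # Exact-match eval; real evals may need softer asserts or an LLM judge.
    hits = sum(system(ex["input"]) == ex["expected"] for ex in dataset)
    return hits / len(dataset)

score = evaluate(run_system, dataset)
print(f"accuracy: {score:.2f}")  # record this in your research log
```

The point of keeping `evaluate` this dumb is that it runs in milliseconds, so the iterate step stays cheap.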
Last point - while models are getting a lot better very quickly, some data is still hard to find on the internet, so collecting unique domain-specific data is probably still very valuable. If anything, better models give you more leverage on your data, so collecting data is more valuable than ever.
2. Start small with a PoC
The systematic cycle of evals-based improvement is the only way to build performant AI systems that work well in real-world applications. But intuition is important, and trying out ideas quickly helps develop good intuition. It also helps ignite the imagination of the team and the customer, and communicate ideas. So start with a small PoC, and iterate from there.
Be careful not to overengineer in your imagination, solving problems that might never realistically occur. A good PoC system is malleable.
3. Deploy quickly
We learn a lot more from our systems when they are in production than when they are in development. This is mainly because:
- Data - We get real data, and we can see how our system behaves on it.
- Feedback - We get feedback from the users, and we can see how they interact with our system.
We are a startup and our customers understand that some of our features are experiments. Prioritize getting your system to production, even if it's not perfect yet.
4. Inspect the data manually and often
Yes, running experiments that push the metrics up is the best way to systematically improve our systems, but looking at the data is important too. It gives us a good sense of what is not working, and what we can do to fix it. Every hour spent looking at the data is an hour well spent. There's no need to be as extreme as Karpathy, but looking at data is generally so valuable that it's even worth developing custom interfaces that help you look at more data faster.
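A tiny sketch of what such a custom inspection helper could look like; the trace records and their fields here are made up for illustration (in practice they would come from Langfuse or Axiom):

```python
import random
import textwrap

# Hypothetical trace records for illustration only.
traces = [
    {"input": "alert: 50 failed logins from one IP", "output": "escalate", "ok": True},
    {"input": "alert: scheduled vulnerability scan", "output": "close", "ok": True},
    {"input": "alert: new admin user created at 3am", "output": "close", "ok": False},
]

def show_sample(traces, n=2, seed=0):
    """Print a random sample of traces for quick manual review; returns the sample."""
    rng = random.Random(seed)
    sample = rng.sample(traces, min(n, len(traces)))
    for t in sample:
        print("-" * 40)
        print("in: ", textwrap.shorten(t["input"], 60))
        print("out:", t["output"], "" if t["ok"] else "  <-- BAD")
    return sample

show_sample(traces)
```

Even twenty lines like this, pointed at real traces, beat scrolling raw JSON.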
Tools
AI engineers in Legion are first and foremost engineers that contribute code to the product. As such, start with the New Employee Onboarding guide to get your environment set up.
There are also some dedicated ML tools to get familiar with:
- Langfuse - An LLM tracing platform. In it, we log all LLM calls and any LLM-related logic, like agentic routing, tool calls, and completions. It allows us to see the inputs and outputs of observed functions, and to tag specific traces as dataset examples. We also use Langfuse to manage our datasets and evaluation runs.
- Axiom - Langfuse’s counterpart: any trace that doesn’t appear in Langfuse will be here. This covers things related to the actual operation of the extension & backend - requests, logs, etc.
- OpenAI - Our main model provider. You should be familiar with the main models and features - especially Function Calling, Structured Outputs, and the finetuning APIs. Most of our usage goes through the LiteLLM wrapper, so we’re not constrained to OpenAI, but their API sets the template for all the others.
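For illustration, this is the general shape of an OpenAI-style function-calling request as it would pass through a LiteLLM-style wrapper. The `lookup_alert` tool and its parameters are hypothetical, and nothing is actually sent here:

```python
# An OpenAI-style tool definition: a JSON Schema describing a callable function.
lookup_alert_tool = {
    "type": "function",
    "function": {
        "name": "lookup_alert",
        "description": "Fetch details for a security alert by id.",
        "parameters": {
            "type": "object",
            "properties": {"alert_id": {"type": "string"}},
            "required": ["alert_id"],
        },
    },
}

# The request payload; built but not sent in this sketch.
request = {
    "model": "gpt-4o-mini",  # the wrapper can route this to any provider
    "messages": [{"role": "user", "content": "Summarize alert A-123"}],
    "tools": [lookup_alert_tool],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}

# With litellm this would be sent as: litellm.completion(**request)
print(request["tools"][0]["function"]["name"])
```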
What we work on
The AI team at Legion has several ongoing projects. Some are short-term, while others are long-term. Many proceed in short sprints of working on them, deploying, and waiting for user feedback.
Our (current) major goals are:
1. AI Skills
Our system executes operations through cyber security tools. We call each operation we support a skill. Developing skills takes time and effort, slows down customer onboarding, and is too hard for customers to do by themselves. To enable quicker skill creation, we have created the browser-use agent AI Skill (SPA Agent - SeePlanAct). It includes guardrails to make sure the system doesn't do anything harmful when unwatched. This is known as "Browser Use" in the LLM world, and many people are working on similar things, though we do have some niche requirements. References: Browser Use, SeePlanAct, SeeAct, BrowserGym.
Once AI skills are robust and effective, we can store AI Skill actions as tailored deterministic skills, dropping the AI component and allowing very quick robust skill creation.
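A minimal sketch of a see-plan-act loop, assuming a toy environment. This is not the actual SPA Agent; the environment, planner rule, and guardrails (an action allow-list and a bounded loop) are all illustrative:

```python
# Skeleton of a see-plan-act loop over a toy "browser".

class FakeBrowser:
    """Toy environment: a 'page' that changes when clicked."""
    def __init__(self):
        self.page = "login"
    def observe(self) -> str:
        return self.page
    def click(self, target: str):
        self.page = target

def see(env) -> str:
    return env.observe()  # a screenshot / DOM snapshot in reality

def plan(observation: str, goal: str) -> str:
    # An LLM call in reality; here a trivial rule.
    return goal if observation != goal else "done"

def act(env, action: str, allowed: set):
    if action not in allowed:  # guardrail: refuse actions outside the allow-list
        raise ValueError(f"blocked action: {action}")
    env.click(action)

env = FakeBrowser()
allowed = {"dashboard", "done"}
for _ in range(5):  # bounded loop: another guardrail against runaway agents
    action = plan(see(env), goal="dashboard")
    if action == "done":
        break
    act(env, action, allowed)
print(see(env))
```

Recording the (observation, action) pairs from a successful run is what would let us later replay the trajectory as a deterministic skill.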
2. Automatic workflow extraction
If we can automatically create skills, the next step up is to automatically identify workflows that appear in recordings and create automations from them. Classically this is known as "Process Mining", but we think classical process-mining methods are of little help to us. This is a problem we can solve in stages:
- For each new recording, figure out if it's a new workflow or a variation of an existing one.
- If it's an existing workflow, check if it needs updating.
- If it's a new workflow, create a new automation for it.
- If a new skill is required, create it (using AI skill).
Some of the mappings here can probably be done based on heuristics and simple ML, but some will require a good ability to map operations from Audit Nodes to Skills.
It's okay if initially the workflows we create are not perfect, as long as they decrease the time it takes our team to create automations. Our end goal is full automation, but just like our principles - we want to deploy early and iterate.
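The first matching stage above could be sketched roughly like this, with a naive Jaccard similarity standing in for real heuristics or ML over Audit Nodes and Skills (the workflows, skills, and threshold are all made up):

```python
# Staged matching sketch: is a new recording a variation of a known
# workflow, or a new workflow that needs a new automation?

known_workflows = {
    "phishing-triage": ["open_alert", "check_sender", "quarantine"],
}

def similarity(a: list, b: list) -> float:
    # Jaccard overlap of skill sets; a stand-in for a real matcher.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def classify(recording: list, threshold: float = 0.6):
    best = max(known_workflows, key=lambda w: similarity(recording, known_workflows[w]))
    if similarity(recording, known_workflows[best]) >= threshold:
        return ("variation", best)  # existing workflow: check if it needs updating
    return ("new", None)            # new workflow: create a new automation

print(classify(["open_alert", "check_sender", "quarantine", "notify"]))
print(classify(["reset_password", "notify_user"]))
```

A recording that overlaps heavily with a known workflow routes to the update path; anything below the threshold falls through to automation (and, if needed, AI skill) creation.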
3. Triage decisions
At the heart of our operation lies the decision node. Almost all workflows can be seen as two parts - information collection and containment actions. Between these two is the triage node, where the alert is triaged and we choose appropriate actions. The triage node is super important - it defines a lot of the value Legion brings to the table, and bad decisions can lead to bad security consequences.
Triage is very much a reasoning task, where most problems are due to bad reasoning or not enough information.
This means our main vector for improvement is improving the reasoning, through prompts and few shot examples.
Since it’s a super important component, Triage nodes come with a feedback gate in the extension that forces the user to select the correct decision. This leads to a lot of tagged data on whether the triage was correct or not. Using this data we can improve the prompts dynamically and select few shot examples for each run, boosting our performance.
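For illustration, dynamic few-shot selection from that tagged feedback might look like this sketch, with a naive word-overlap similarity and made-up feedback records:

```python
# Sketch: pick few-shot examples for a triage prompt from analyst feedback.

# Hypothetical feedback records; "correct" is the analyst's verdict
# from the feedback gate.
feedback = [
    {"alert": "failed login burst from one IP", "decision": "escalate", "correct": True},
    {"alert": "scheduled vulnerability scan", "decision": "close", "correct": True},
    {"alert": "new admin user at 3am", "decision": "close", "correct": False},
]

def overlap(a: str, b: str) -> int:
    # Word overlap; a real system might use embeddings instead.
    return len(set(a.split()) & set(b.split()))

def pick_few_shot(new_alert: str, k: int = 2) -> list:
    # Only examples the analyst confirmed as correct are eligible.
    good = [f for f in feedback if f["correct"]]
    return sorted(good, key=lambda f: overlap(new_alert, f["alert"]), reverse=True)[:k]

shots = pick_few_shot("failed login attempts from unknown IP")
print([s["decision"] for s in shots])
```

The selected examples would then be interpolated into the triage prompt for that specific run.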
Leveraging data for LLM tasks is still an ongoing challenge. While few-shot and prompt-optimization methods see wide success, it’s still uncertain how well things like model finetuning or more complex frameworks compare.
4. Dynamic Investigations
On the other side of the fence, one of the large goals for Legion is being able to investigate autonomously. In many cases, it’s not clear from the get-go what the investigation workflow would be, so deterministic, pre-determined investigations are a bad fit. Instead, we need a dynamic actor that understands our goal, can dynamically gather information, triage, and finally commit actions.
This goal can include:
- Dynamic search across knowledge bases with a fuzzy goal.
- Recommending next investigation steps, based on current investigation history
- Leading an entire investigation (can be seen as 1, just from an empty investigation)
- Improving workflows using past investigations, determining what was missing (similar to Recommendation systems).
This is the classic case for an “agent”. I (Ohav) define an “agent” as a system that can observe and interact with an environment. This is not limited to LLMs, though they are definitely the best fit currently. Despite all the online influencers, actual agent research is still in its early steps, with open questions like:
- How to leverage past trajectories?
- Few shot? Training?
- How good are multi agent frameworks?
- Why are agents so noisy? How can we mitigate this noise?
- Implementing guardrails
- How can we properly evaluate long term tasks?
- Can we interpret the actions agents took?
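Under the definition above, the minimal shape of an agent is just observe + interact. A toy sketch, where the `Investigation` environment and the agent's fixed plan are stand-ins for a real environment and an LLM planner:

```python
# Minimal agent/environment interface; everything here is illustrative.

class Investigation:
    """Toy environment: facts are revealed as the agent queries sources."""
    def __init__(self):
        self.sources = {"edr": "process spawned by office macro",
                        "idp": "no MFA on account"}
        self.gathered = []
    def observe(self):
        return list(self.gathered)
    def query(self, source: str):
        if source in self.sources:
            self.gathered.append(self.sources.pop(source))

class Agent:
    def __init__(self, plan):
        self.plan = plan  # fixed plan; an LLM would choose steps dynamically
    def step(self, env):
        if self.plan:
            env.query(self.plan.pop(0))

env = Investigation()
agent = Agent(plan=["edr", "idp"])
while agent.plan:  # run until the plan is exhausted
    agent.step(env)
print(len(env.observe()))
```

Swapping the fixed plan for a model that reads `env.observe()` and picks the next source is exactly the "recommending next investigation steps" goal above.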
We obviously have many more goals, and each of these goals contains many different approaches, ideas, and final features that will find their way into the product. As prioritization changes, so will the shape of the goals and the projects therein.
Todo List
This is a list of tasks that should help you get up to speed with the current state of the product and the team.
Get up to speed with what the team does (ask Michael for help):
- Handle a small dev issue
- Get to know the Session Viewer
- Create and run your own workflow, with an AI Skill
- Look it up on Langfuse!
- Solve an issue from the bug research backlog
- Set up an AI coding environment (if you want)
Get up to speed with the AI stack:
- Read the LiteLLM (& LLMProvider) docs
- Talk to Ohav about agentic design.
- Get familiar with Langfuse.
- Create a temporary dataset
- Add your own temporary tracing to some function
- Add your trace to the new dataset
- Interact with the example using the Langfuse API.
Optional additional materials:
Context Engineering Guide
Agentic Design Patterns (long, recommended)