Architecture
DFIR graphs are divided into two layers: the outer scheduled layer and inner compiled layer.
In the mermaid diagrams generated for DFIR, the scheduler is responsible for the outermost gray boxes (the subgraphs) and handoffs, while the insides of the gray boxes are compiled via the compiled layer. As a reminder, here is the graph from the reachability chapter, which we will use a running example:
The DFIR Architecture Design Doc contains a more detailed explanation of this section. Note that some aspects of the design doc are not implemented (e.g. early yielding) or may become out of date as time passes.
Scheduled Layer
The scheduled layer is dynamic: it stores a list of operators (or "subgraphs") as well as buffers ("handoffs") between them. The scheduled layer chooses how to schedule the operators (naturally), and when each operator runs it pulls from its input handoffs and pushes to its output handoffs. This setup is extremely flexible: operators can have any number of input or output handoffs so we can easily represent any graph topology. However this flexibility comes at a performance cost due to the overhead of scheduling, buffering, and lack of inter-operator compiler optimization.
The compiled layer helps avoids the costs of the scheduled layer. We can combine several operators into a subgraph which are compiled and optimized as a single scheduled subgraph. The scheduled layer then runs the entire subgraph as if it was one operator with many inputs and outputs. Note the layering here: the compiled layer does not replace the scheduled layer but instead exists within it.
Compiled Layer
Rust already has a built-in API similar to dataflow: Iterator
s.
When they work well, they work really well and tend to be as fast as for loops.
However, they are limited in the flow graphs they can represent. Each operator
in a Iterator
takes ownership of, and pulls from, the previous operator.
Taking two iterators as inputs and merging them together (e.g. with
.chain(...)
or .zip(...)
)
is natural and performant as we can pull from both. However if we want an
iterator to split into multiple outputs then things become tricky. Itertools::tee()
does just this by cloning each incoming item into two output buffers, but this
requires allocation and could cause unbounded buffer growth if the outputs are
not read from at relatively even rates.
However, if instead iterators were push-based, where each operator owns one or more output operators, then teeing is very easy, just clone each element and push to (i.e. run) both outputs. So that's what we did, created push-based iterators to allow fast teeing or splitting in the compiled layer.
Pull-based iterators can be connected to push-based iterators at a single "pivot" point. Together, this pull-to-push setup dictates the shape compiled subgraphs can take. Informally, this is like the roots and leaves of a tree. Water flows from the roots (the pull inputs), eventually all join together in the trunk (the pull-to-push pivot), the split up into multiple outputs in the leaves. We refer to this structure as an in-out tree.
The Oaklandish Logo. |
In the reachability example above, the pivot is the flat_map()
operator: it pulls from the join that feeds it, and pushes to the tee that follows it. In essence that flat_map()
serves as the main thread of the compiled component.
See Subgraph In-Out Trees for more, including how to convert a graph into in-out trees.
Flow Syntax and APIs
DFIR Flow Syntax hides
the distinction between these two layers. It offers a natural Iterator
-like chaining syntax for building
graphs that get parsed and compiled into a scheduled graph of one or more compiled subgraphs. Please see the Surface Syntax docs for more information.
Alternatively, the Core API allows you to interact with handoffs directly at a low
level. It doesn't provide any notion of chainable operators. You can use Rust Iterator
s
or any other arbitrary Rust code to implement the operators.
The Core API lives in dfir_rs::scheduled
,
mainly in methods on the DFIR
struct.
Compiled push-based iterators live in dfir_rs::compiled
.
When working with DFIR, we intend users to use the Flow Syntax as it is much more friendly, but as
DFIR is in active development some operators might not be available in
the Flow Syntax, in which case the Core API can be used instead. If you find
yourself in this situation be sure to submit an issue!