Persistent Write Ahead Log
Starting with .NEXT 1.0.0 the library shipped with the general-purpose high-performance persistent Write Ahead Log(WAL) providing the following features:
- Log compaction based on snapshotting
- File-based persistent storage for the log entries
- Caching
- Fast writes
- Parallel reads
- Automatic replays
However, it is not used as default audit trail by Raft implementation. You need to register it explicitly using Dependency Injection or configurator object.
Typically, PersistentState
class is not used directly because it is not aware how to interpret commands contained in the log entries. This is the responsibility of the data state machine. It can be defined through overriding of the two methods:
ValueTask ApplyAsync(LogEntry entry)
method is responsible for interpreting committed log entries and applying them to the underlying persistent data storage.SnapshotBuilder CreateSnapshotBuilder()
method is required if you want to enable log compaction. The returned builder squashes the series of log entries into the single log entry called snapshot. Then the snapshot can be persisted by the infrastructure automatically. By default, this method always returns null which means that compaction is not supported.
Note
.NEXT library doesn't provide default implementation of the database or persistent data storage based on Raft replication
Internally, persistent WAL uses files to store the state of cluster member and log entries. The journal with log entries is not continuous. Each file represents the partition of the entire journal. Each partition is a chunk of sequential log entries. The maximum number of log entries per partition depends on the settings.
PersistentState
has a rich set of tunable configuration parameters and overridable methods to achieve the best performance according with application needs:
recordsPerPartition
allows to define maximum number of log entries that can be stored continuously in the single partition. Log compaction algorithm depends on this value directly. When all records from the partition are committed and applied to the underlying state machine the infrastructure calls the snapshot builder and squashed all the entries in such partition. After that, write ahead log records the snapshot into the separated file and removes the partition file from the file system. The log compaction is expensive operation. So if you want to reduce the number of compactions then you need to increase the maximum number of log entries per partition. However, the partition file will take more disk space.BufferSize
is the numbers of bytes that is allocated by persistent WAL in the memory to perform I/O operations. Set it to the maximum expected log entry size to achieve the best performance.InitialPartitionSize
represents the initial pre-allocated size, in bytes, of the empty partition file. This parameter allows to avoid fragmentation of the partition file at file-system level.UseCaching
isbool
flag that allows to enable or disable in-memory caching of log entries metadata.true
value allows to improve the performance or read/write operations by the cost of additional heap memory.false
reduces the memory footprint by the cost of the read/write performanceCreateMemoryPool
generic method used for renting memory and can be overriddenMaxConcurrentReads
is a number of concurrent asynchronous operations which can perform reads in parallel. Write operations are always sequential. Ideally, the value should be equal to the number of nodes. However, the larger value consumes more system resources (e.g. file handles) and heap memory.ReplayOnInitialize
is a flag indicating that state of underlying database engine should be reconstructed whenInitializeAsync
is called by infrastructure. It can be done manually usingReplayAsync
method.
Choose recordsPerPartition
value with care because it cannot be changed for the existing persistent WAL.
Let's write a simple custom audit trail based on the PersistentState
to demonstrate basics of Write Ahead Log. Our state machine stores only the single long value as the only possible persistent state.
The example below additionally requires DotNext.IO library to simplify I/O work.
using DotNext.IO;
using DotNext.IO.Pipelines;
using DotNext.Net.Cluster.Consensus.Raft;
using System.Threading.Tasks;
using static System.Buffers.Binary.BinaryPrimitives;
sealed class SimpleAuditTrail : PersistentState
{
internal long Value; //the current int64 value synchronized across all cluster nodes
//snapshot builder
private sealed class SimpleSnapshotBuilder : SnapshotBuilder
{
private long currentValue;
private readonly Memory<byte> sharedBuffer;
internal SimpleSnapshotBuilder(Memory<byte> buffer) => sharedBuffer = buffer;
//2.1
public override ValueTask CopyToAsync(Stream output, CancellationToken token)
{
return output.WriteAsync(currentValue, buffer, token);
}
//2.2
public override async ValueTask CopyToAsync(PipeWriter output, CancellationToken token)
{
return output.WriteAsync(currentValue, token);
}
//1
protected override async ValueTask ApplyAsync(LogEntry entry) => currentValue = await Decode(entry);
}
//3
private static async Task<long> Decode(LogEntry entry) => ReadInt64LittleEndian((await entry.ReadAsync(sizeof(long))).Span);
//4
protected override async ValueTask ApplyAsync(LogEntry entry) => Value = await Decode(entry);
//5
protected override SnapshotBuilder CreateSnapshotBuilder() => new SimpleSnapshotBuilder(Buffer);
}
1)Aggregates the commited entry with the existing state; 2)called by infrastructure to serialize the aggregated state into stream; 3)Decodes the command from the log entry; 4) Applies the log entry to the state machine; 5)Creates snapshot builder
In the reality, the state machine should persist its state in reliable way, e.g. disk. The example above ignores this requirement for simplicity and maintain its state in the form of the field of type long
.
API Surface
PersistentState
can be used as general purpose Write Ahead Log. Any changes in the cluster node state can be represented as a series for log entries that can be appended to the log. Newly added entries are not committed. It means that there is no confirmation from other cluster nodes about consistent state of the log. When the consistency is reached across cluster then the all appended entries marked as committed and the commands contained in the committed log entries can be applied to the underlying database engine.
The following methods allows to implement this scenario:
AppendAsync
adds a series of log entries to the log. All appended entries are in uncommitted state. Additionally, it can be used to replace entries with another entriesDropAsync
removes the uncommitted entries from the logCommitAsync
marks appended entries as committed. Optionally, it can force log compactionEnsureConsistencyAsync
suspends the caller and waits until the last committed entry is from leader's termWaitForCommitAsync
waits for the specific or any commit
ReadAsync
method can be used to obtain committed or uncommitted entries in stream-like manner.
State Reconstruction
PersistentState
is designed with assumption that underlying state machine can be reconstructed through sequential interpretation of each committed log entry stored in the log. When persistent WAL used in combination with other Raft infrastructure such as extensions for ASP.NET Core provided by DotNext.AspNetCore.Cluster library then this action performed automatically in host initialization code. However, if WAL used separately then reconstruction process should be initiated manually. To do that you need to call ReplayAsync
method which reads all committed log entry and pass each entry to ApplyAsync
protected method. Usually, ApplyAsync
method implementation implements data state machine logic so sequential processing of all committed entries can restore its state correctly.