SSTable - Write
SSTables are created in the following situations:
- When a
Flush (Minor Compaction)
occurs from the MemTable - When
Compaction
occurs in storage
This document focuses on how SSTables are created when a Flush
occurs from the MemTable.
When a Flush Occurs from the MemTable
The process of a Flush
occurring from the MemTable is as follows:
CompactMemTable
is called, which in turn calls WriteLevel0Table
, resulting in the creation of an SSTable. At this point, BuildTable
is the function that actually creates the SSTable.
The flow of BuildTable
is as follows:
Overall Process
Steps
- Create an instance of
TableBuilder
- Add key-value pairs from the MemTable one by one using the
Add
method ofTableBuilder
- Complete the process of creating the SSTable using the
Finish
method ofTableBuilder
- Write the contents in the
WritableFile
to storage - Load the SSTable stored in storage into the cache and check if it is available
Status BuildTable(const std::string& dbname, Env* env, const Options& options,
TableCache* table_cache, Iterator* iter, FileMetaData* meta) {
Status s;
meta->file_size = 0;
// Make sure the Iterator points to the first element
iter->SeekToFirst();
std::string fname = TableFileName(dbname, meta->number);
if (iter->Valid()) {
WritableFile* file;
s = env->NewWritableFile(fname, &file);
if (!s.ok()) {
return s;
}
// 1. Create an instance of TableBuilder
TableBuilder* builder = new TableBuilder(options, file);
meta->smallest.DecodeFrom(iter->key());
Slice key;
// 2. Add key-value pairs of MemTable one by one via TableBuilder's "Add" method
for (; iter->Valid(); iter->Next()) {
key = iter->key();
builder->Add(key, iter->value());
}
if (!key.empty()) {
meta->largest.DecodeFrom(key);
}
// 3. Complete the process of creating SSTable via TableBuilder's "Finish" method
s = builder->Finish();
if (s.ok()) {
meta->file_size = builder->FileSize();
assert(meta->file_size > 0);
}
delete builder;
// 4. Write the contents in the WritableFile to storage
if (s.ok()) {
s = file->Sync();
}
if (s.ok()) {
s = file->Close();
}
delete file;
file = nullptr;
if (s.ok()) {
// 5. Put the SSTable stored in storage into cache and check if it is available
Iterator* it = table_cache->NewIterator(ReadOptions(), meta->number,
meta->file_size);
s = it->status();
delete it;
}
}
// Check for Iterator related errors
if (!iter->status().ok()) {
s = iter->status();
}
if (s.ok() && meta->file_size > 0) {
// Keep it
} else {
env->RemoveFile(fname);
}
return s;
}
TableBuilder::Add
The role of passing the key-value pair currently referenced by the Iterator to each BlockBuilder within the TableBuilder.
- If the Data Block being created by the
BlockBuilder
is empty, meaning a new Data Block is being started, add a new entry to the Index Block. The entry added at this time is not for the newly started Data Block but for the Data Block that was being created just before. - If using a Bloom Filter, update the Filter Block as well.
- Add data to the Data Block.
- If the Data Block being created is full (i.e., exceeds the block size specified by the option), call
Flush
.
void TableBuilder::Add(const Slice& key, const Slice& value) {
Rep* r = rep_;
// ...
// 1. If the Data Block that BlockBuilder is creating is empty,
// add a new entry to the Index Block
if (r->pending_index_entry) {
assert(r->data_block.empty());
r->options.comparator->FindShortestSeparator(&r->last_key, key);
std::string handle_encoding;
r->pending_handle.EncodeTo(&handle_encoding);
r->index_block.Add(r->last_key, Slice(handle_encoding));
r->pending_index_entry = false;
}
// 2. If using Bloom Filter, update the Filter Block as well
if (r->filter_block != nullptr) {
r->filter_block->AddKey(key);
}
r->last_key.assign(key.data(), key.size());
r->num_entries++;
// 3. Add data to the Data Block
r->data_block.Add(key, value);
const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
// 4. If the Data Block being created is full, call "Flush"
if (estimated_block_size >= r->options.block_size) {
Flush();
}
}
TableBuilder::Finish
Called when
Add
is finished for all key-value pairs in the MemTable, and it finalizes the SSTable being written.
- Call
Flush
. - If using a Bloom Filter, add the Filter Block to the
WritableFile
usingFilterBlockBuilder
. - Add the Meta Index Block to the
WritableFile
. - Add the Index Block being created by
BlockBuilder
to theWritableFile
. - Add the Footer to the
WritableFile
.
Status TableBuilder::Finish() {
Rep* r = rep_;
// 1. Call "Flush"
Flush();
assert(!r->closed);
r->closed = true;
BlockHandle filter_block_handle, metaindex_block_handle, index_block_handle;
// 2. If using Bloom Filter, add the Filter Block to the WritableFile.
if (ok() && r->filter_block != nullptr) {
WriteRawBlock(r->filter_block->Finish(), kNoCompression,
&filter_block_handle);
}
// 3. Add the Meta Index Block to the WritableFile
if (ok()) {
BlockBuilder meta_index_block(&r->options);
if (r->filter_block != nullptr) {
std::string key = "filter.";
key.append(r->options.filter_policy->Name());
std::string handle_encoding;
filter_block_handle.EncodeTo(&handle_encoding);
meta_index_block.Add(key, handle_encoding);
}
WriteBlock(&meta_index_block, &metaindex_block_handle);
}
// 4. Add the Index Block to the WritableFile.
if (ok()) {
if (r->pending_index_entry) {
r->options.comparator->FindShortSuccessor(&r->last_key);
std::string handle_encoding;
r->pending_handle.EncodeTo(&handle_encoding);
r->index_block.Add(r->last_key, Slice(handle_encoding));
r->pending_index_entry = false;
}
WriteBlock(&r->index_block, &index_block_handle);
}
// 5. Add the Footer to the WritableFile.
if (ok()) {
Footer footer;
footer.set_metaindex_handle(metaindex_block_handle);
footer.set_index_handle(index_block_handle);
std::string footer_encoding;
footer.EncodeTo(&footer_encoding);
r->status = r->file->Append(footer_encoding);
if (r->status.ok()) {
r->offset += footer_encoding.size();
}
}
return r->status;
}
TableBuilder::Flush
The role of writing the Data Block being created to storage.
- Add the contents of the Data Block being created to the
WritableFile
. - Write the contents of the
WritableFile
to storage. - If using a Bloom Filter, create a new Bloom Filter.
void TableBuilder::Flush() {
Rep* r = rep_;
// ...
// 1. Add the contents of the Data Block being created to the WritableFile
WriteBlock(&r->data_block, &r->pending_handle);
if (ok()) {
r->pending_index_entry = true;
// 2. Write the contents of WritableFile to storage
r->status = r->file->Flush();
}
// 3. If using Bloom Filter, create a new Bloom Filter
if (r->filter_block != nullptr) {
r->filter_block->StartBlock(r->offset);
}
}