/ src / leveldb / doc / table_format.md
table_format.md
  1  leveldb File format
  2  ===================
  3  
  4      <beginning_of_file>
  5      [data block 1]
  6      [data block 2]
  7      ...
  8      [data block N]
  9      [meta block 1]
 10      ...
 11      [meta block K]
 12      [metaindex block]
 13      [index block]
 14      [Footer]        (fixed size; starts at file_size - sizeof(Footer))
 15      <end_of_file>
 16  
 17  The file contains internal pointers.  Each such pointer is called
 18  a BlockHandle and contains the following information:
 19  
 20      offset:   varint64
 21      size:     varint64
 22  
 23  See [varints](https://developers.google.com/protocol-buffers/docs/encoding#varints)
 24  for an explanation of varint64 format.
 25  
 26  1.  The sequence of key/value pairs in the file are stored in sorted
 27  order and partitioned into a sequence of data blocks.  These blocks
 28  come one after another at the beginning of the file.  Each data block
 29  is formatted according to the code in `block_builder.cc`, and then
 30  optionally compressed.
 31  
 32  2. After the data blocks we store a bunch of meta blocks.  The
 33  supported meta block types are described below.  More meta block types
 34  may be added in the future.  Each meta block is again formatted using
 35  `block_builder.cc` and then optionally compressed.
 36  
 37  3. A "metaindex" block.  It contains one entry for every other meta
 38  block where the key is the name of the meta block and the value is a
 39  BlockHandle pointing to that meta block.
 40  
 41  4. An "index" block.  This block contains one entry per data block,
 42  where the key is a string >= last key in that data block and before
 43  the first key in the successive data block.  The value is the
 44  BlockHandle for the data block.
 45  
 46  5. At the very end of the file is a fixed length footer that contains
 47  the BlockHandle of the metaindex and index blocks as well as a magic number.
 48  
 49          metaindex_handle: char[p];     // Block handle for metaindex
 50          index_handle:     char[q];     // Block handle for index
 51          padding:          char[40-p-q];// zeroed bytes to make fixed length
 52                                         // (40==2*BlockHandle::kMaxEncodedLength)
 53          magic:            fixed64;     // == 0xdb4775248b80fb57 (little-endian)
 54  
 55  ## "filter" Meta Block
 56  
 57  If a `FilterPolicy` was specified when the database was opened, a
 58  filter block is stored in each table.  The "metaindex" block contains
 59  an entry that maps from `filter.<N>` to the BlockHandle for the filter
 60  block where `<N>` is the string returned by the filter policy's
 61  `Name()` method.
 62  
 63  The filter block stores a sequence of filters, where filter i contains
 64  the output of `FilterPolicy::CreateFilter()` on all keys that are stored
 65  in a block whose file offset falls within the range
 66  
 67      [ i*base ... (i+1)*base-1 ]
 68  
 69  Currently, "base" is 2KB.  So for example, if blocks X and Y start in
 70  the range `[ 0KB .. 2KB-1 ]`, all of the keys in X and Y will be
 71  converted to a filter by calling `FilterPolicy::CreateFilter()`, and the
 72  resulting filter will be stored as the first filter in the filter
 73  block.
 74  
 75  The filter block is formatted as follows:
 76  
 77      [filter 0]
 78      [filter 1]
 79      [filter 2]
 80      ...
 81      [filter N-1]
 82  
 83      [offset of filter 0]                  : 4 bytes
 84      [offset of filter 1]                  : 4 bytes
 85      [offset of filter 2]                  : 4 bytes
 86      ...
 87      [offset of filter N-1]                : 4 bytes
 88  
 89      [offset of beginning of offset array] : 4 bytes
 90      lg(base)                              : 1 byte
 91  
 92  The offset array at the end of the filter block allows efficient
 93  mapping from a data block offset to the corresponding filter.
 94  
 95  ## "stats" Meta Block
 96  
 97  This meta block contains a bunch of stats.  The key is the name
 98  of the statistic.  The value contains the statistic.
 99  
100  TODO(postrelease): record following stats.
101  
102      data size
103      index size
104      key size (uncompressed)
105      value size (uncompressed)
106      number of entries
107      number of data blocks