Low-level data about MMDB

Open svbatalov opened this issue 2 years ago • 2 comments

Hey @UmanShahzad.

To make mmdbctl even more awesome, it would be great to be able to display some low-level data about an MMDB file, such as

  • Tree size in bytes
  • Data section start/end offsets
  • Data section size in bytes
  • Metadata section start offset

This is helpful, for example, if you want to inspect the actual data section (with hexdump), or if you want to estimate the relative contribution of the tree and data sections to the file size.

A simple example: let's say we want to find out whether the actual MMDB writer deduplicates written objects (replaces repeats with pointers) or not. I'll use my MMDB parser to display the above-mentioned offsets.

  • Case 1 -- write two different objects
$ echo -e '{"range":"1.0.0.0/24","value":{"col":"nested1"}}\n{"range":"2.0.0.0/24", "value":{"col":"nested2"}}' | mmdbctl import --no-network -j -o test.mmdb
writing to test.mmdb (2 entries)

$ python3 ./parser.py  test.mmdb
Namespace(file='test.mmdb', meta=False, data=None, ip=None)
Data section offset 1096 (data starts at 1112)  # <===
Metadata section offset: 1146 (metadata starts at 1160)
Data section size 34 bytes (3.4e-05 MB)  # <===
Record size: 32
Node count: 137
Tree size: 1096 (bytes)
ip_version: 6
First data record at 153 pointer

# Knowing the offset and size, we can inspect a specific portion of the file:
$ hd -s 1112 -n 34 test.mmdb
00000458  e1 45 76 61 6c 75 65 e1  43 63 6f 6c 47 6e 65 73  |.Evalue.CcolGnes|
00000468  74 65 64 31 e1 20 01 e1  20 08 47 6e 65 73 74 65  |ted1. .. .Gneste|
00000478  64 32                                             |d2|
0000047a
  • Case 2 -- write duplicate objects:
$ echo -e '{"range":"1.0.0.0/24","value":{"col":"nested1"}}\n{"range":"2.0.0.0/24", "value":{"col":"nested1"}}' | mmdbctl import --no-network -j -o test.mmdb
writing to test.mmdb (2 entries)

$ python3 ./parser.py  test.mmdb
Namespace(file='test.mmdb', meta=False, data=None, ip=None)
Data section offset 1096 (data starts at 1112)  # <===
Metadata section offset: 1132 (metadata starts at 1146)
Data section size 20 bytes (2e-05 MB)   # <===
Record size: 32
Node count: 137
Tree size: 1096 (bytes)
ip_version: 6
First data record at 153 pointer

$ hd -s 1112 -n 20 test.mmdb
00000458  e1 45 76 61 6c 75 65 e1  43 63 6f 6c 47 6e 65 73  |.Evalue.CcolGnes|
00000468  74 65 64 31                                       |ted1|   # Note: the whole second object is gone and the tree points directly to the first one
0000046c

So it does deduplicate objects. Looks like it even deduplicates nested objects, which is great.
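
To see the pointers at work, here is a minimal Python sketch that decodes the Case 1 dump by hand. It only handles the three MMDB data types that appear in that dump (map, UTF-8 string, pointer) and no extended sizes; the `decode` helper is hypothetical and is not part of mmdbctl or the linked parser.

```python
# The 34 bytes of the Case 1 data section, copied from the hexdump above.
DATA = bytes.fromhex(
    "e1 45 76 61 6c 75 65 e1 43 63 6f 6c 47 6e 65 73"
    "74 65 64 31 e1 20 01 e1 20 08 47 6e 65 73 74 65"
    "64 32"
)

POINTER, UTF8_STRING, MAP = 1, 2, 7  # type numbers from the MMDB spec


def decode(buf, pos):
    """Decode one value starting at buf[pos]; return (value, next_pos)."""
    ctrl = buf[pos]
    pos += 1
    kind = ctrl >> 5  # top three bits of the control byte are the type

    if kind == POINTER:
        size = (ctrl >> 3) & 0x3  # two size bits, then three value bits
        raw = buf[pos:pos + size + 1]
        pos += size + 1
        if size == 3:
            target = int.from_bytes(raw, "big")
        else:
            base = (0, 2048, 526336)[size]
            high = (ctrl & 0x7) << (8 * (size + 1))
            target = base + (high | int.from_bytes(raw, "big"))
        # Pointers are offsets relative to the start of the data section.
        value, _ = decode(buf, target)
        return value, pos

    length = ctrl & 0x1F
    if length >= 29:
        raise NotImplementedError("extended sizes are not handled in this sketch")

    if kind == UTF8_STRING:
        return buf[pos:pos + length].decode("utf-8"), pos + length

    if kind == MAP:
        result = {}
        for _ in range(length):  # for maps, length is the number of entries
            key, pos = decode(buf, pos)
            val, pos = decode(buf, pos)
            result[key] = val
        return result, pos

    raise NotImplementedError(f"type {kind} is not handled in this sketch")


print(decode(DATA, 0))   # ({'value': {'col': 'nested1'}}, 20)
print(decode(DATA, 20))  # ({'value': {'col': 'nested2'}}, 34)
```

In the second record, both keys are stored as pointers (`20 01` and `20 08`) back to the `value` and `col` strings of the first record, so only `nested2` is written out in full.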

The point is it is really convenient to know those offsets when doing stuff like this.

Not sure if the Go MMDB reader exposes this data, but it should be easy to find the section separators (see the spec) even without parsing the file, e.g. by mmap-ing the file and using string search functions: https://github.com/svbatalov/construct_mmdb_parser/blob/11b13ef946b7d85cec4e21a538af49b5b44f22a1/parser.py#L13-L19
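
A rough Python sketch of that string-search idea, not necessarily what the linked lines do: the metadata marker (`\xab\xcd\xefMaxMind.com`) and the 16 zero bytes between the tree and the data section come from the MMDB spec, while `section_offsets` and `section_offsets_from_file` are hypothetical helper names. Searching for the first run of 16 zero bytes is only a heuristic, since a tree that contains or ends in zero bytes would shift the match.

```python
import mmap

METADATA_MARKER = b"\xab\xcd\xefMaxMind.com"  # starts the metadata section
DATA_SEPARATOR = b"\x00" * 16                 # sits between tree and data


def section_offsets(buf):
    """Locate MMDB sections in a bytes-like object by plain string search."""
    # The spec says to use the LAST occurrence of the marker in the file.
    marker = buf.rfind(METADATA_MARKER)
    if marker < 0:
        raise ValueError("metadata marker not found")

    # Heuristic: take the first run of 16 zero bytes as the separator.
    sep = buf.find(DATA_SEPARATOR, 0, marker)
    if sep < 0:
        raise ValueError("data section separator not found")

    data_start = sep + len(DATA_SEPARATOR)
    return {
        "tree_size": sep,
        "data_start": data_start,
        "data_size": marker - data_start,
        "metadata_marker": marker,
        "metadata_start": marker + len(METADATA_MARKER),
    }


def section_offsets_from_file(path):
    """Same search, run over a read-only memory map of the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return section_offsets(m)
```

A sturdier variant would read the node count and record size from the metadata and compute the tree size instead of searching for the separator.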

Thanks, Sergey

svbatalov avatar Dec 10 '23 08:12 svbatalov

Great feedback and thanks for those feature requests @svbatalov !

The data's definitely going to be available within the MMDB library. I'll check whether it's exposed; if not, we could try to get a PR merged to expose it and/or temporarily use a fork.

We can add this data to the mmdbctl metadata output - is that the ideal place to expose it for you @svbatalov ?

cc @coderholic

UmanShahzad avatar Dec 12 '23 21:12 UmanShahzad

@UmanShahzad Yeah, sounds great!

svbatalov avatar Dec 13 '23 06:12 svbatalov

The metadata has been included. Closing issue:

$ mmdbctl metadata ip_geolocation_sample.mmdb 
- Binary Format 2.0 
- Database Type ipinfo ip_geolocation_sample.mmdb 
- IP Version    6 
- Record Size   32 
- Node Count    2927 (2.86 KB)
- Tree Size     23416 (22.87 KB)
- Data Section Size 10790 (10.54 KB)
- Data Section Start Offset 23432 (22.88 KB)
- Data Section End Offset 34222 (33.42 KB)
- Metadata Section Start Offset 34236 (33.43 KB)
- Description    
    en ipinfo ip_geolocation_sample.mmdb
- Languages     en 
- Build Epoch   1722965173 
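
For anyone cross-checking these numbers, the offsets follow from the node count and record size. Below is a small Python sketch of that arithmetic, assuming the layout from the MMDB spec (a 16-byte separator after the tree, a 14-byte marker before the metadata). `section_layout` is a hypothetical helper, and the data section size is taken as an input rather than derived.

```python
SEPARATOR_SIZE = 16                             # zero bytes after the search tree
MARKER_SIZE = len(b"\xab\xcd\xefMaxMind.com")   # 14 bytes before the metadata


def section_layout(node_count, record_size, data_size):
    """Derive section offsets (in bytes) from the metadata fields."""
    # Each node holds two records of record_size bits.
    tree_size = node_count * record_size * 2 // 8
    data_start = tree_size + SEPARATOR_SIZE
    data_end = data_start + data_size
    return {
        "tree_size": tree_size,
        "data_start": data_start,
        "data_end": data_end,
        "metadata_start": data_end + MARKER_SIZE,
    }


# Values from the mmdbctl metadata output above.
print(section_layout(node_count=2927, record_size=32, data_size=10790))
# {'tree_size': 23416, 'data_start': 23432, 'data_end': 34222, 'metadata_start': 34236}
```

The same formula reproduces the earlier test.mmdb figures: 137 nodes at record size 32 give a tree size of 1096 bytes and a data section starting at 1112.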

max-ipinfo avatar Aug 13 '24 23:08 max-ipinfo