Another solution is to use an MMDB (“MaxMind DB”) file [1] which is essentially a binary tree + deduplicated values (same as idea 3.1).
There are several free ASN MMDBs [2,3] but you can also build your own MMDB files from any Prefix->Value mapping with the mmdbwriter library [4] or a CLI tool built on top of it like mmdbctl [5].
Assuming the ASN MMDB is fully loaded in memory, it would use around 60MB.
[1] https://maxmind.github.io/MaxMind-DB/
[2] https://dev.maxmind.com/geoip/docs/databases/asn/
[3] https://ipinfo.io/products/free-ip-data-downloads
[4] https://github.com/maxmind/mmdbwriter
[5] https://github.com/ipinfo/mmdbctl
(I work for IPinfo, but there are lots of other companies offering MMDB files).
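For illustration, building your own file with mmdbwriter looks roughly like this (a minimal sketch; the record field names and the example prefix/ASN are arbitrary choices, not a required schema):

    // Minimal sketch: build a custom ASN MMDB from a prefix -> value mapping.
    package main

    import (
        "net"
        "os"

        "github.com/maxmind/mmdbwriter"
        "github.com/maxmind/mmdbwriter/mmdbtype"
    )

    func main() {
        tree, err := mmdbwriter.New(mmdbwriter.Options{DatabaseType: "ip2asn"})
        if err != nil {
            panic(err)
        }

        _, network, _ := net.ParseCIDR("1.1.1.0/24")
        record := mmdbtype.Map{
            "asn":  mmdbtype.Uint32(13335),
            "name": mmdbtype.String("CLOUDFLARENET"),
        }
        if err := tree.Insert(network, record); err != nil {
            panic(err)
        }

        f, err := os.Create("ip2asn.mmdb")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        if _, err := tree.WriteTo(f); err != nil {
            panic(err)
        }
    }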
This is the same idea I used for my Cloud IP lookup tool [1]; it lets it all work in the browser with a small file to search against.
[1] https://cloud-ips.s3-us-west-2.amazonaws.com/index.html
Oh this is nice, and a cool use of HTTP range requests!
Oh, then you'll love: Hosting SQLite databases on GitHub Pages or any static file hoster (2021), https://news.ycombinator.com/item?id=27016630
cool
https://github.com/adulau/mmdb-server
"MaxMouchet DB"
If you want to have fun with it you can actually fit the entire IP range in a single 32 bit value. On the internet networks larger than /24s aren't considered valid advertisements so you only need to store 3 bytes (the first 24 bits) of the address to hold the start. If you store that in a uint32 you can use the remaining 8 bits to hold the netmask (you don't actually need all of them but it keeps things byte aligned), which lets you dynamically calculate the end address in just a few CPU cycles. The search still occurs on the base address portion, so you never have to calculate the end address until you've already found the right base address.
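A rough sketch of that packing (function names and layout choices are mine, just to make the idea concrete):

    // Pack a /24-aligned base address into the top 24 bits and the prefix
    // length into the low 8 bits; the end address is derived on demand.
    package main

    import "fmt"

    func pack(addr uint32, prefixLen uint8) uint32 {
        return (addr &^ 0xFF) | uint32(prefixLen) // drop the host byte, keep the mask in the low byte
    }

    func unpack(v uint32) (start, end uint32) {
        prefixLen := v & 0xFF
        start = v &^ 0xFF
        end = start | (^uint32(0) >> prefixLen) // last address covered by the prefix
        return
    }

    func main() {
        p := pack(0xC0A80000, 16) // 192.168.0.0/16
        s, e := unpack(p)
        fmt.Printf("%08X-%08X\n", s, e) // C0A80000-C0A8FFFF
    }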
> fit the entire IP range in a single 32 bit value
… only if you pretend IPv6 does not exist, which isn't the case in the original blogpost.
But the idea can be extended to fit entire IP subnets (both v4 and v6) in a uint64. Given that the longest IPv6 prefix that is globally routed is /48, you could use the first byte of the uint64 to mark the address type (e.g. 0x4 for IPv4 and 0x6 for IPv6), the next 3 (for IPv4) or 6 (for IPv6) bytes for the actual prefix, and the last byte for the prefix length.
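A sketch of one possible uint64 layout along those lines (the exact bit positions below are my own choice, just to make it concrete):

    // Byte 0: family tag (0x04 or 0x06); next bytes: the routable prefix bits
    // (24 for IPv4, 48 for IPv6), left-aligned; last byte: prefix length.
    package ipenc

    func packV4(addr uint32, plen uint8) uint64 {
        return 0x04<<56 | uint64(addr>>8)<<32 | uint64(plen)
    }

    func packV6(top48 uint64, plen uint8) uint64 {
        // top48 holds the first 48 bits of the IPv6 address in its low bits.
        return 0x06<<56 | top48<<8 | uint64(plen)
    }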
> On the internet networks larger than /24s aren't considered valid advertisements
The number 24 is larger, but the network is smaller.
Good catch, thanks! This should have been "netmasks" instead of "networks" but it's too late for me to edit now.
A netmask is different and deprecated terminology.
The "/24" you refer to is a "prefix length", i.e. a network prefix being advertised by BGP.
for ipv4 you can do this with just a single lookup, so real O(1)
store the entire thing in one big array, indexed by the uint24 value of the /24
so ASN for 0.0.0.0 stored in position 0, ASN for 0.0.1.0 stored in position 1, etc
and you can drop 0.0.0.0/8, 10.0.0.0/8, 127.0.0.0/8, 224.0.0.0/4, 240.0.0.0/4 to save 35 /8s
(256-16-16-3) * 256 * 256 * 4 bytes needed -> ~55 MiB
mmap it if necessary
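A minimal sketch of the flat-array idea (names are illustrative; this allocates the full 2^24-entry table rather than dropping the reserved /8s):

    package main

    import "fmt"

    // One uint32 ASN per /24, indexed by the top 24 bits of the IPv4 address.
    var asnBySlash24 = make([]uint32, 1<<24) // 64 MiB before dropping reserved ranges

    func lookup(a, b, c byte) uint32 {
        return asnBySlash24[uint32(a)<<16|uint32(b)<<8|uint32(c)]
    }

    func main() {
        asnBySlash24[1<<16|1<<8|1] = 13335 // 1.1.1.0/24 -> AS13335
        fmt.Println(lookup(1, 1, 1))       // 13335
    }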
If you do it this way I think you lose all of the prefix size info, i.e. every match will claim to be part of a /24. Whether that matters depends on whether you care about showing that part.
Hey, Julia! I know this isn't really material, but: `hallpass`, our SSH server, probably shouldn't be using 40MB. We had a bug in our `init` that we fixed recently: `hallpass` is a Go program, and we were passing user environment variables to it, so users who were passing Go memory settings were also passing settings through to `hallpass`.
If you're still seeing `hallpass` using that much memory, let me know! We'll figure out what's going on.
Since this is fairly static data, preprocess the on-disk format to match your memory layout and just mmap the file so that your memory usage for the DB is effectively 0 and the OS kernel pages it in/out for you based on system-level needs?
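A minimal sketch of that, assuming Linux/macOS and a hypothetical preprocessed file of fixed-width little-endian records (start, end, asn as uint32s); the file name and layout are made up for illustration:

    package main

    import (
        "encoding/binary"
        "fmt"
        "os"
        "syscall"
    )

    func main() {
        f, err := os.Open("ranges.bin") // hypothetical preprocessed file
        if err != nil {
            panic(err)
        }
        defer f.Close()

        fi, _ := f.Stat()
        data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
            syscall.PROT_READ, syscall.MAP_SHARED)
        if err != nil {
            panic(err)
        }
        defer syscall.Munmap(data)

        // The kernel pages the file in and out on demand; resident memory stays tiny.
        const recSize = 12 // start(4) + end(4) + asn(4)
        fmt.Println("records:", len(data)/recSize,
            "first start:", binary.LittleEndian.Uint32(data[0:4]))
    }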
Or get a ready-made solution like libloc that implements this already with a tree, deduplication and blazing-fast lookups.
See https://location.ipfire.org/
I feel like you should add a slight disclaimer that you are involved in the project in some capacity :)
Would be nice to see revisited SQLite solution.
The rest of the described approaches rely on the data being in memory rather than on disk - I'm not sure whether it was a microSD card, an SSD over USB, etc., but there's an in-memory option for SQLite.
With that in mind it's time to optimize storage: instead of text, store the range bounds as two 64 bit integers. It was already noticed that two indexes are not very helpful - just create a single one that includes both columns, preferably as the primary key, so you don't waste space and the rest of the selected data is already available in a single lookup.
I'd be happy to see revisited results, and I'm still not sure this is the best approach for your problem - but these seem like obvious improvements to the generic SQL approach.
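Roughly something like this (the table and column names are my own, and the driver is just one option):

    package main

    import (
        "database/sql"
        "fmt"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", "asn.db")
        if err != nil {
            panic(err)
        }
        defer db.Close()

        // Integer bounds instead of text, single composite primary key,
        // WITHOUT ROWID so the table itself is the index.
        _, err = db.Exec(`CREATE TABLE IF NOT EXISTS ranges (
            start_ip INTEGER NOT NULL,
            end_ip   INTEGER NOT NULL,
            asn      INTEGER NOT NULL,
            name     TEXT,
            PRIMARY KEY (start_ip, end_ip)
        ) WITHOUT ROWID`)
        if err != nil {
            panic(err)
        }

        ip := uint32(16843009) // 1.1.1.1 as an integer
        var asn uint32
        var name string
        err = db.QueryRow(`SELECT asn, name FROM ranges
            WHERE start_ip <= ? AND end_ip >= ?
            ORDER BY start_ip DESC LIMIT 1`, ip, ip).Scan(&asn, &name)
        if err == nil {
            fmt.Println(asn, name)
        }
    }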
> As an aside – I’m storing the ASN in a uint32, is that right? I looked in the ip2asn file and the biggest one seems to be 401307, though there are a few lines that say 4294901931 which is much bigger and would not fit in a uint32. I decided to just ignore the gigantic ASNs and assume I could fit everything in a uint32.
Does anyone have a theory as to how that 4294967296 might have made its way into the data? I'm getting 0 matches on the actual internet with:

    birdc show route protocol peer4 | grep 4294967296 -c

And ASNs are certainly not >32 bit.

Edit: actually, I'm not seeing it in the linked iptoasn data either?
https://datatracker.ietf.org/doc/html/rfc6996#section-5 - it's in the range reserved for private use, so it's probably internal to whatever network the list originally comes from?
Bogus announcements are probably filtered by your upstream(s) (see [1] for a common list of filters).
IP-to-ASN mappings are typically built from route collectors [2,3] that peer with various networks and receive their announcements. AFAIK route collectors don't filter anything and it's easy to find bogus announcements (e.g. private ASNs) in the data.
I can't find 4294967296 from a quick glance at the latest RouteViews data but I can find other private ASNs. For example AS7594 - AS2764 - AS4294901866 for 210.10.189.0/24 seen by the route-views.perth collector.
I don't know what kind of filtering iptoasn.com is doing but at work (ipinfo.io) we do filter bogus origins, as well as a bunch of other things like RPKI/IRR-invalid routes and hyper-specific prefixes (> /24 or /48) [4].
[1] https://bgpfilterguide.nlnog.net
[2] https://www.routeviews.org/routeviews/
[3] https://www.ripe.net/analyse/internet-measurements/routing-i...
[4] https://hyperspecifics.io
Actually 4294967296 couldn't ever appear, as the maximum value you can fit in the protocol field is 1 less than that... my problem here was I couldn't manage to keep the 2 numbers I was comparing (the one in the article and 2^32) straight haha! This mistake was noted by a commenter here https://news.ycombinator.com/item?id=41963745
That said you're ultimately right that my upstream provider is filtering the 4294901866 value from the article as well anyways for the reasons you stated.
Ah right haha. Thanks for the heads up, I should have checked ^^
Also 4_294_901_931 is actually just small enough to fit in a u32 whose max value is 4_294_967_295
Perhaps hex notation helps to see things better: 0xFFFF00AB. It's actually nearly 64K below the maximum. 0xAB is 171.
Ha, this is exactly right! It's 100% in u32 range and part of the private use reservations so wouldn't appear in the internet table unless such values aren't filtered by the provider and/or it's being used locally.
Originally I went to write your exact comment, because at a glance it seemed the value in the article should fit, but then I must have done the comparison backwards: I started pasting the 2^32 value into the rest of my comment and concluded it was actually too large, when really I had just jumbled things up.
Thanks for setting my mind straight!
The ip2asn-v4-u32.tsv file has
Interesting read - but how cool is Mess with DNS! What a nice live learning tool.
You need an array of 64-bit values: a 32-bit end address, and a 32-bit ASN. Use 0 for unallocated ranges - it's an invalid ASN.
For IPv6 you need more. It just barely fits if you consider that all public Internet addresses start with 001 and only up to 48 bits can be a published prefix - that's 45 bits - and the public ASNs go up to 19 bits according to your article - that's exactly 64 in total (those high ASNs are bad data from wherever you're getting it from). But it won't work once the next 100000 ASNs get assigned, so you'd better go up to 96 or 128 bits and store the ASN properly in 32 bits.
For cache efficiency you should have one table of addresses and another parallel table of ASNs. Then every access to the address table that pulls in a cache line doesn't waste half the cache line with ASNs you aren't looking for. It won't affect the total memory use, however.
A separate table would hold the name and country for each ASN.
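A small sketch of the parallel-table lookup (toy data, illustrative names):

    package main

    import (
        "fmt"
        "sort"
    )

    // Ranges are contiguous, so only each range's end address is stored;
    // the ASN for the same index lives in a separate parallel slice.
    var (
        ends = []uint32{0x010100FF, 0x010101FF, 0xFFFFFFFF} // sorted end addresses
        asns = []uint32{0, 13335, 0}                        // 0 = unallocated
    )

    func lookup(ip uint32) uint32 {
        i := sort.Search(len(ends), func(i int) bool { return ends[i] >= ip })
        if i == len(ends) {
            return 0
        }
        return asns[i]
    }

    func main() {
        fmt.Println(lookup(0x01010101)) // 1.1.1.1 -> 13335
    }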
For even more savings: One netip.Prefix will use less memory than two netip.Addr
Better still, use the free geolite ASN MMDB with geoip2-golang[0]. Or the lower-level maxminddb-golang[1] if you only need certain fields.
0 - https://github.com/oschwald/geoip2-golang
1 - https://github.com/oschwald/maxminddb-golang
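A lookup with geoip2-golang then looks roughly like this (a sketch; the database path is wherever you downloaded the GeoLite2 ASN file to):

    package main

    import (
        "fmt"
        "net"

        geoip2 "github.com/oschwald/geoip2-golang"
    )

    func main() {
        db, err := geoip2.Open("GeoLite2-ASN.mmdb")
        if err != nil {
            panic(err)
        }
        defer db.Close()

        record, err := db.ASN(net.ParseIP("1.1.1.1"))
        if err != nil {
            panic(err)
        }
        fmt.Println(record.AutonomousSystemNumber, record.AutonomousSystemOrganization)
    }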
A good example of how to use better data structures, instead of the usual rewrite into something else.
SQL is not going to be of much help when the dataset easily fits into memory, but I'm a bit surprised by the lack of performance of the trie. I guess the overhead is too much when the dataset has a lot of single IP addresses (only one byte difference), but I imagine it would be helpful for sparse IPv6 addresses.
Go has an excellent standard library, but the solutions in there rarely compete with others writing a dedicated library to solve a hard problem they really had to solve.
The trie approach seemed a bit plain. Tries are particularly good for storing netblocks and checking whether a particular IP belongs to some block stored there. But it's better to use binary values instead of strings, and to use Patricia or radix trees, since they can compress branch-free runs of bits into a single node. And, anyway, PostgreSQL already has network block types (inet/cidr) with efficient indexes for them.
But if it's used for individual IPs, without worrying about the blocks they belong to, you probably won't see big gains in that area.
For hierarchical data like IP addresses, I've seen programs use specialized data structures like crit-bit tries [0] and Allotment Routing Tables [1].
[0] https://news.ycombinator.com/item?id=3015246
[1] https://github.com/openbsd/src/blob/2bd42e97200bee/sys/net/a...
Had the same issue with restic holding a stale lock after an unsuccessful operation. I have a very simple workaround for it.
ExecStartPre=/usr/bin/restic unlock
ExecStart=/usr/bin/restic backup
That has the benefit of Systemd not letting more than one instance run at the same time, which I'm guessing is one of the reasons the lock was needed in the first place.
You just have to make sure that the type of Service used is correct, so that Systemd can track whether Restic has actually stopped running.
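For example, something like this (a sketch; repository settings, credentials and a timer are omitted):

    [Service]
    Type=oneshot
    ExecStartPre=/usr/bin/restic unlock
    ExecStart=/usr/bin/restic backup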
I wonder if DuckDB could do better than SQLite?