Is this just searching certificate transparency logs?
I'd imagine it's a combination of:
- CT log monitoring (https://github.com/CaliDog/CertStream-Server)
- Mass-scanning the IPv4 space on ports 80/443, at the least
- Brute-forcing subdomains on wildcards with a large DNS wordlist (like something from Assetnote: https://wordlists-cdn.assetnote.io/data/manual/best-dns-word...)
- Scraping/extracting subdomains/domains from JS
But I've never attempted to enumerate subdomains on this scale before, so I could be missing something obvious.
I think it's a mix of different sources. Certainly, some of my subdomains there never had an SSL certificate.
Well, CT logs are a data dump; they are not searchable. Ingesting all that data in near-real time and making it searchable in a useful and fast way (especially with wildcards) is actually quite challenging!
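To illustrate why suffix and wildcard search takes some thought, here is a minimal sketch of one common trick: store hostnames with their labels reversed, so that "everything under example.com" becomes a prefix scan over sorted keys. The `SuffixIndex` class and its method names are hypothetical, and a sorted in-memory list stands in for whatever storage engine a real service would use; nothing here is claimed to be how this particular site works.

```python
import bisect


def reverse_labels(hostname: str) -> str:
    """'a.example.com' -> 'com.example.a', so suffix matches become prefix matches."""
    return ".".join(reversed(hostname.lower().rstrip(".").split(".")))


class SuffixIndex:
    """Hypothetical in-memory index; a real system would use an on-disk sorted store."""

    def __init__(self):
        self._keys = []  # sorted list of reversed hostnames, no duplicates

    def add(self, hostname: str) -> None:
        key = reverse_labels(hostname)
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def subdomains(self, domain: str) -> list:
        """Return every indexed hostname equal to `domain` or below it."""
        base = reverse_labels(domain)
        prefix = base + "."
        out = []
        i = bisect.bisect_left(self._keys, base)
        while i < len(self._keys):
            key = self._keys[i]
            if key == base or key.startswith(prefix):
                out.append(reverse_labels(key))
            elif not key.startswith(base):
                break  # left the contiguous block of keys sharing the prefix
            i += 1
        return out
```

With this layout, `example-foo.com` and `notexample.com` are skipped when querying `example.com`, because only the exact key or keys starting with `com.example.` count as matches.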
Where does one ingest them from?
https://github.com/google/certificate-transparency-community...
Thanks!
I have subdomains with (non-wildcard) certificates that aren't on there.
Have you considered adding a monitoring feature where a user can enter a domain to be monitored and then be notified if a "similar" domain comes across the ingestion pipeline?
This would be useful for early detection of potential impersonation/typo-squatting domains typically used for phishing/scams.
Something as simple as a configurable Levenshtein distance/Jaro-Winkler similarity check across the CN and SAN of all new certs, maybe? (Users could set a threshold to control how "noisy" they want their feed.)
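A minimal sketch of what such a check could look like, using plain Levenshtein distance only (Jaro-Winkler is left out). The function names `levenshtein` and `lookalikes` and the `max_distance` threshold are hypothetical, chosen for illustration. It compares whole hostnames, so a lookalike hidden in a longer name (say, a subdomain of a squatted domain) would need the registrable domain extracted first, which this sketch does not attempt.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = cur
    return prev[-1]


def lookalikes(watched: str, cert_names, max_distance: int = 2):
    """Return (name, distance) pairs from a cert's CN/SAN list that are close to
    `watched`, skipping the watched domain itself and its own subdomains."""
    watched = watched.lower()
    hits = []
    for name in cert_names:
        candidate = name.lower()
        if candidate.startswith("*."):
            candidate = candidate[2:]  # treat a wildcard like its parent name
        if candidate == watched or candidate.endswith("." + watched):
            continue
        distance = levenshtein(watched, candidate)
        if distance <= max_distance:
            hits.append((candidate, distance))
    return sorted(hits, key=lambda hit: hit[1])
```

Raising `max_distance` makes the feed noisier; a threshold of 1 or 2 already catches single-character swaps such as `examp1e.com` for `example.com`.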
Try https://dnstwister.report/
For sure, it was on my todo list :)
Awesome, I will keep my eye on this for sure. I've spent the past few months tinkering with ingesting CT logs for bug bounty automation.
Curious if you're running your own CertStream server, or just continuously polling known CT logs with your own implementation?
I also noticed you are ingesting/storing flowers-to-the-world.com certs. Not sure what stage of optimization you are at, but blacklisting/ignoring these certs in my ingestion pipeline helped with avoiding storing unnecessary data.
I'm not sure, but I believe that domain is used by Google internally for testing purposes.
For example, if you search for "google", it returns 120k+ results, and these useless results are at the front.
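For anyone who does want to skip those entries at ingestion time, a minimal sketch of a suffix-based ignore list follows. The names `IGNORED_SUFFIXES` and `should_store` are hypothetical, and the policy (drop a cert only when every name on it falls under an ignored domain) is an assumption, not a description of either pipeline discussed here.

```python
# Hypothetical ignore list; extend with any other test domains you want to drop.
IGNORED_SUFFIXES = ("flowers-to-the-world.com",)


def should_store(cert_names, ignored=IGNORED_SUFFIXES) -> bool:
    """Keep a cert unless all of its CN/SAN names fall under an ignored domain."""
    if not cert_names:
        return True  # nothing to judge by, so keep it

    def is_ignored(name: str) -> bool:
        name = name.lower().rstrip(".")
        # Match the domain itself or a true subdomain, not a mere string suffix.
        return any(name == s or name.endswith("." + s) for s in ignored)

    return not all(is_ignored(name) for name in cert_names)
```

Matching on `"." + suffix` rather than a bare `endswith` avoids dropping unrelated names that merely end in the same characters.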
> I also noticed you are ingesting/storing flowers-to-the-world.com certs. Not sure what stage of optimization you are at, but blacklisting/ignoring these certs in my ingestion pipeline helped with avoiding storing unnecessary data.
The goal is to have something exhaustive, so I'll keep them. But you are right that I probably should not put them at the front. Not sure how important it is, though, as these results shouldn't match many queries.
Exhaustive/Robust is the way for sure.
Minimizing storage was a priority for me since it's just a small side-project/automation.
I've looked for information on what the hell the `flowers-to-the-world` entries that pop up are and have found nothing. Curious what's going on there.
It's actually a Google thing!
I found this back when I wondered the same thing: https://medium.com/@hadfieldp/hey-ryan-c0fee84b5c39
Ahhhh, that tracks, cheers mate.
I am not using CertStream, as we'd lose data on the first network error. The way it's designed is more "rsync for CT logs" than a stream => storage system.
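As a rough illustration of the "rsync for CT logs" idea (not the actual implementation, which isn't shown in this thread), here is a minimal sketch of a resumable sync loop. It assumes the RFC 6962 log API, where `get-sth` reports the current tree size and `get-entries` takes an inclusive `start`/`end` range and may return fewer entries than requested. The function `sync_log`, its callable parameters, and the JSON checkpoint file are all hypothetical; the network calls are injected so the loop itself stays testable offline.

```python
import json
import os


def sync_log(get_tree_size, get_entries, store, checkpoint_path, batch=256):
    """Copy entries [next_index, tree_size) and persist progress after each batch.

    get_tree_size() -> int               wraps the log's get-sth call
    get_entries(start, end) -> list      wraps get-entries (end is inclusive)
    store(start_index, entries)          writes a batch to local storage
    """
    next_index = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            next_index = json.load(f)["next_index"]

    tree_size = get_tree_size()
    while next_index < tree_size:
        end = min(next_index + batch, tree_size) - 1
        entries = get_entries(next_index, end)
        if not entries:
            break  # log returned nothing; try again on the next run
        store(next_index, entries)
        next_index += len(entries)  # logs may return fewer than requested

        # Write the checkpoint atomically so a crash never leaves it half-written.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_index": next_index}, f)
        os.replace(tmp, checkpoint_path)
    return next_index
```

If a network error is raised mid-run, the next invocation picks up from the last checkpoint instead of losing the entries that arrived in between, which is the property a pure stream consumer lacks.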
Btw, you can get our feed like that:
Doesn't seem to work if you have wildcard records, e.g.
https://www.merklemap.com/search?query=marginalia.nu&page=1
doesn't catch the fact that I have like 20 viable subdomains for marginalia.nu.
Yup. I can see the ones I registered without a wildcard, but all the ones I created after that are hidden.
How strange. I just tried this out and I see two unauthorised subdomains, with one being an actual "spam" website. However, I don't even know how to delete a subdomain that doesn't show up in my domain registrar or Cloudflare!
Thanks for the tool.
Found only 43 subdomains for *.wordpress.com.