This project was (I assume) now taken down after widespread backlash from Discord users who never consented to having their messages scraped, indexed, and published. The opt-out system was not "one click"—it required navigating a server where users were mocked and dismissed for objecting to their inclusion.
While the developer claimed to be protecting privacy, the system displayed usernames and full message content from community spaces that were never intended for public indexing. The burden was placed on users to remove themselves from something they never opted into.
Discord’s own Developer Policy explicitly forbids scraping or mass data harvesting. This wasn’t innovation—it was exploitation dressed in pseudo-academic language. If you can still search by channel + user ID, that’s traceable content. That's not “privacy preserving,” that’s thinly veiled exposure.
The developer's status message (extremely concerning and very inappropriate) and behaviour during take-down tickets further emphasized the lack of empathy behind the project. This wasn’t a public service—it was a boundary violation, and the shutdown was well-earned.
I also want to challenge the idea that this tool “solves the problem of not being able to easily search Discord servers.” That’s not a problem—that’s a design choice. Discord isn’t built for global indexing on purpose. Private communities, support groups, fandoms, and sensitive spaces rely on that separation to feel safe. Treating the lack of global search as a bug instead of a boundary shows a complete disregard for how real people use the platform.
The "community spaces" opted in to the discovery platform. It not only requires confirmation from the server owners but also requires the server to meet specific metrics to even be eligible. If the users don't want their messages to be public, they should complain to the server owners instead.
The problem of "not being able to easily search Discord servers" is a real problem. Not too long ago I was unable to find any information about UE4 modding and had to dig deep until I found a few Discord servers centered on it. They were the only source of information aside from two small docs sites.
> This is my first large scale project, so I'd love to hear your feedback!
> I have placed restrictions on searching directly by user ID to prevent doxing. I also made the opt out process one click, for those who do not want to be archived.
1) I'd suggest anonymizing the usernames / author ids to something more privacy friendly such as how some image sites were generating 3-4 random words as a human readable unique id. This removes a lot of the reason people would opt out (i.e. posts being tracked down years later)
2) You do not seem to have clear rate limit documentation. If you are asking people to pay for commercial use, I'd suggest making it clear what the rough original limits are as well as the rough price range of what you'd offer.
3) Tbh, the only real thing I want from this project is basically narrative / roleplay / writing content for LLM reasons as I'm trying to build a rules-oriented system that narrates via LLM. If you don't want people using this data for this purpose, I'd suggest making that clear.
> 1) I'd suggest anonymizing the usernames / author ids to something more privacy friendly such as how some image sites were generating 3-4 random words as a human readable unique id. This removes a lot of the reason people would opt out (i.e. posts being tracked down years later)
In the original iteration of Searchcord, it used to work similarly to that. The username was `sha256(userid+guildid)`, truncated to the first 8 characters. Unfortunately, it was pretty hard to follow chats. I will try your idea and see how it works, though.
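The two pseudonym schemes discussed here can be sketched roughly as follows. This is a minimal sketch, not Searchcord's actual code: the function names, the tiny `WORDS` list, and the choice to treat IDs as UTF-8 strings are all assumptions for illustration. Only `sha256(userid+guildid)` truncated to the first 8 characters comes from the thread.

```python
import hashlib

# Hypothetical word list for illustration; a real one would be far larger
# so that collisions between users stay rare.
WORDS = [
    "amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "harbor",
    "iris", "juniper", "kelp", "lotus", "maple", "nectar", "onyx", "pine",
]


def hex_pseudonym(user_id: str, guild_id: str) -> str:
    """sha256(userid+guildid), truncated to the first 8 hex characters."""
    digest = hashlib.sha256((user_id + guild_id).encode("utf-8")).hexdigest()
    return digest[:8]


def word_pseudonym(user_id: str, guild_id: str, n_words: int = 3) -> str:
    """Map the first bytes of the same hash onto human-readable words."""
    digest = hashlib.sha256((user_id + guild_id).encode("utf-8")).digest()
    return "-".join(WORDS[byte % len(WORDS)] for byte in digest[:n_words])


if __name__ == "__main__":
    print(hex_pseudonym("1234", "5678"))
    print(word_pseudonym("1234", "5678"))
```

Because the guild ID is part of the hash input, the same user gets a different pseudonym in each server. An unsalted hash over numeric IDs can be brute-forced by anyone able to enumerate candidate user IDs, so a secret key (for example via `hmac`) would likely be needed for this to offer real protection.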
> 2) You do not seem to have clear rate limit documentation.
This is a good idea. The rate limit varies by endpoint, and I haven't gotten around to documenting each one.
> If you are asking people to pay for commercial use, I'd suggest making it clear what the rough original limits are as well as the rough price range of what you'd offer.
I have absolutely zero idea what industry would be interested in this, in what form, and if anyone would even pay.
> 3) Tbh, the only real thing I want from this project is basically narrative / roleplay / writing content for LLM reasons as I'm trying to build a rules-oriented system that narrates via LLM. If you don't want people using this data for this purpose, I'd suggest making that clear.
I really don't care what people do with the data, as long as they are not spamming requests or using the data for commercial purposes without permission.
The sheer audacity here is quite something. You're stating people can't use your scraped data for commercial purposes "without permission," while your entire project is built on vacuuming up content from countless users without their permission, and in direct violation of Discord's ToS. That's not just a double standard; it's bordering on next-level cognitive dissonance.
And "privacy preserving"? With a one-click opt-out, that 99.999% of the affected users will never even know exists because they have no idea their conversations are now part of your archive, and you want it indexed by search engines? That's not "privacy preserving" - that's a bad joke. If privacy was a genuine concern, this project wouldn't exist in its current form. What you're offering is an opt-out fig leaf for a mass data harvesting operation.
Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context. It's a fundamental misunderstanding (or willful dismissal) of user expectations on what is essentially a semi-public, yet distinctly siloed, platform. This isn't an open-web forum where content is implicitly intended for broad public consumption and indexing.
Look, I get the frustration that (likely) motivated this. Discord has become an information black hole for many communities, and the shift away from open, searchable forums for project support is a genuine problem I've been incredibly frustrated with myself. But this "solution" - creating a massive, non-consensual archive that tramples over user privacy (and platform terms) - creates far graver ethical and practical issues than the one it purports to solve.
> Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context
Honestly, maybe they should. Maybe we need more stuff like this, until people finally wake up about the privacy catastrophe. The now defunct service spy.pet used to sell this kind of data with the stated purpose of doxxing people. There’s black markets for this. And it’s the same kind of data the service providers themselves have full access to.
> The sheer audacity here is quite something. You're stating people can't use your scraped data for commercial purposes "without permission," while your entire project is built on vacuuming up content from countless users without their permission, and in direct violation of Discord's ToS. That's not just a double standard; it's bordering on next-level cognitive dissonance.
Not really, it is not free to host and serve this data. If they want to get the data for free, they can get it directly from Discord. I did that work for them.
> And "privacy preserving"? With a one-click opt-out, that 99.999% of the affected users will never even know exists because they have no idea their conversations are now part of your archive, and you want it indexed by search engines? That's not "privacy preserving" - that's a bad joke. If privacy was a genuine concern, this project wouldn't exist in its current form. What you're offering is an opt-out fig leaf for a mass data harvesting operation.
Again, not really. It's impossible to search for users without already knowing what server they are in. This is functionally identical to Discord's in-built search feature.
> Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context. It's a fundamental misunderstanding (or willful dismissal) of user expectations on what is essentially a semi-public, yet distinctly siloed, platform. This isn't an open-web forum where content is implicitly intended for broad public consumption and indexing.
I believe that people need to realize that their messages were already being logged by many different moderation bots, just not publicized. This also happens on platforms like Telegram, look at the SangMata_BOT for example. Unless the messages are end to end encrypted, it was just a matter of time before they were scooped up and archived.
Thanks for your input, though, I really do want to build a platform that balances privacy and usability.
> I believe that people need to realize that their messages were already being logged by many different moderation bots, just not publicized. Unless the messages are end to end encrypted, it was just a matter of time before they were scooped up and archived.
And that makes it OK for you to do as well? Bots storing all the messages is also not OK, but they also don't publish it, so it is way less problematic.
Okay, the "not really" and "I'll solve that problem if and when" responses are... something else. It feels like you're speedrunning how to get into a world of trouble while hand-waving away every legitimate concern. Let's try to unpack this again, because your justifications are frankly baffling.
> Again, not really. It's impossible to search for users without already knowing what server they are in. This is functionally identical to Discord's in-built search feature.
That's not quite correct, and frankly it borders on willful obfuscation. In your own words elsewhere in this thread, you're eager for search engines to index this archive. That "privacy preserving" barrier of needing to know both a user ID and a server/channel id evaporates the moment Google or any other search engine hoovers up your pages. At that point, any combination of keywords, usernames, aliases, or snippets could reveal someone's posting history, across contexts and years. How is that "functionally identical" to Discord's walled-garden search or "privacy preserving"?
> I believe that people need to realize that their messages were already being logged by many different moderation bots, just not publicized.
This is a disingenuous deflection.
- Moderation bots operate within a specific server, for a specific purpose (moderation, utility) defined by the server admins. Their logs are typically for admin/moderator use, not for creating a global, publicly searchable archive.
- Users joining a server often see these bots, understand their function, and server admins explicitly add these bots. It's a known quantity. What you're doing is orders of magnitude different - an external, uninvited entity scraping everything discoverable and making it universally public.
- "Just a matter of time" is a lazy, fatalistic excuse for unethical data harvesting. Just because something can be technically scraped doesn't mean it should be, or that you doing so is fine.
Your "I really do want to build a platform that balances privacy and usability" line sounds utterly hollow when the entire foundation of the project demonstrates a profound misunderstanding, or disregard, for basic privacy, consent, and intellectual property.
Speaking of which... have you actually thought about the legal Pandora's Box you're prying open? Your casual "I'll deal with Discord's ToS issues if they arise" attitude is quaint, because Discord's ToS is likely the tip of a colossal iceberg of legal trouble.
You're not just 'breaking ToS', you're potentially looking at:
- Data Protection Law Violations (GDPR, CCPA, etc.) because you're scraping personal data of EU/California (and other) residents without any lawful basis. The fines can be astronomical. "Opt-out" after the fact for data you had no right to take in the first place isn't how this works.
- COPPA Violations if you scraped any messages from a 12-year-old on a "public, discoverable" server before their account was deleted by Discord. Guess who's holding that data now without parental consent? You.
- Every original, creative message is copyrighted by its author. Roleplay, detailed discussions, code snippets, even well-crafted tirades – you're republishing millions of these. While not every "lol" is copyrightable, a massive volume of content on Discord absolutely is. "Fair use" for wholesale, non-transformative republication on this scale? Unlikely.
- And last but not least, CSAM (Child Sexual Abuse Material): This is the nightmare scenario. You are scraping public Discord. Some public, poorly moderated Discords inevitably contain links to or text-based CSAM. Even if you don't intend to host it, if your scraper picks it up and it becomes accessible via your archive (even just a link), you are in profoundly serious trouble. "But I don't re-publish attachments" is irrelevant if you're archiving and re-publishing the links. This isn't just fines; this is potential prison time.
Good luck with all of this.
I hope you have a good lawyer, ideally multiple. You might need them.
Ridiculous take. If you're posting in a server that's intentionally open to the public and accessible to anyone with a link or even indexed by server discovery you shouldn't expect privacy. That's just the basic reality of the internet.
No, what's "ridiculous" is this simplistic, black-and-white framing that deliberately ignores any nuance, the concept of contextual integrity or reasonable user expectations.
Of course, no one expects absolute secrecy in a public-facing Discord server. That's a straw man. The issue isn't about some naive belief that messages are invisible. It's about the scope, permanence, and method of access and archiving.
People participating in public Discord spaces have reasonable contextual expectations about how their words will be accessed and by whom. They expect their messages to be seen by current and maybe future server members - not extracted, permanently archived, and made globally searchable by entirely unrelated third parties.
This is similar to how conversations in a public park are technically "public," but most people would be rightfully disturbed if someone recorded everything, transcribed it, published it online with their names attached, and made it all searchable forever. Just because something isn't strictly private doesn't mean any and all forms of collection, republication, and indexing are ethically justified.
If you can't see the distinction between "not perfectly private within this specific semi-public space" and "archived indefinitely, and globally searchable forever by anyone, anywhere, for any reason," then you're either arguing in bad faith or your understanding of these issues is so superficial that further engagement is pointless.
It seems the core concept of contextual integrity is still not landing.
It's not a question of surprise that public data can be scraped - I'm well aware of how the internet functions, thank you. The point, which you seem determined to evade, is about the fundamental ethics of systematically doing so and the vast difference in impact and expectation between, say, a server's own moderation logs or incidental screenshots, and a third party, globally indexed, permanent archive. The former serves limited, often known functions within that specific community; the latter is a privacy-invasive data trawl weaponizing the 'public' label. Just because a thing is technically possible doesn't grant a free pass to ignore privacy implications or users' reasonable expectations of how their contributions will be used and disseminated.
Your attempt to dismantle the 'public park' analogy only underscores your misunderstanding of it. The scenario isn't about someone yelling (an exceptional event, often a public nuisance, that might indeed attract specific attention or recording). It's the equivalent of someone systematically planting listening devices by every park bench, transcribing every casual, low-expectation conversation - like my dinner plans with my girlfriend, or a vent about my boss - and then publishing it all online, forever, simply because the park itself is 'public' and it was a technically possible thing to do. The ethical chasm between observing a public spectacle and conducting mass, indiscriminate surveillance of every day, semi-private interactions within a public space shouldn't be this difficult to grasp. One involves a specific event; the other is a dragnet.
As for flagging, I didn't touch your comment. I have never flagged a single comment on this site. Perhaps others simply disagreed with the quality, relevance or the dismissive tone of your contribution.
I won't continue a discussion with someone who relies on AI for writing, this response you posted presents the tells of someone using a language model to write a response paragraph.
> In the original iteration of Searchcord, it used to work similarly to that. The username was `sha256(userid+guildid)`, truncated to the first 8 characters. Unfortunately, it was pretty hard to follow chats. I will try your idea and see how it works, though.
I suggest you do since tbh you are likely (as others have said) to be violating privacy laws with your current implementation + the Discord ToS. If it's anonymized better, you are less likely to be a target of someone who gets angry about not knowing you exist.
Up to you, your life your circus y'know?
> I have absolutely zero idea what industry would be interested in this, in what form, and if anyone would even pay.
LLM data collection, if it's not already being bought directly via Discord.
Same reason I'd want to use highly anonymized and curated data from the roleplay / writing discords as training data. It is just I'd have to go through and anonymize your data and curate it / clean it up before I would dare to send it to an LLM for legal reasons.
If I send/share PII, I'd be screwed just like you will be if someone gets upset.
> I really don't care what people do with the data, as long as they are not spamming requests or using the data for commercial purposes without permission.
Fair. For me, this is for hobby implementations of solo roleplaying content similar to AI Dungeon, so it's not commercial. My use case would be better served by being able to download a database dump (properly anonymized, or with me doing the anonymizing) for specific servers. Most of the data you collect is useless to me since I've got a specific goal in mind and want to minimize data collection for legal liability reasons (i.e. non-commercial roleplaying with no PII or other privacy-risky info is likely to be a safe use case).
EDIT:
I'd consider dropping attachments + links and only recording text as well for CSAM and other abusive material reasons. I doubt you have the moderation in place to protect yourself.
Pictures, videos, and whatnot are a lot more dangerous to you than text would be. (i.e. despite what people say about it, realistically, most text in a public forum on the internet w/o PII is not going to get you hit with fines)
That said, personally, I would not publish this as you have because I don't have that kind of risk tolerance, but I can see it being "safe enough" for some people. The images/attachments, though, are in "are you really sure you want to do that? You could go bankrupt" territory.
Would you consider making regular dumps of the database available in sharded torrents like Anna's Archive does so that users can back up the data themselves for preservation purposes? This would complicate retroactively removing users' activity, but that data could already be scraped.
And related, I'd like to be able to run this locally for exports of guilds that I'm on myself. Is that even possible with the architecture you've built?
This is absolutely something I want to do, but at the guild level. The database itself is over 13TB, which is much too large to create regular exports of. I will probably provide a SQLite export of each guild, regenerated each week/month. Anyone is free to download whatever they want in real time from the API.
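A per-guild SQLite export along these lines might look roughly like the sketch below. The table layout, column names, and the `export_guild` helper are hypothetical and chosen only for illustration; nothing here reflects Searchcord's actual schema or API.

```python
import sqlite3


def export_guild(conn, guild_id, messages):
    """Write the messages belonging to one guild into a SQLite database.

    `messages` is an iterable of dicts with hypothetical keys:
    message_id, guild_id, channel_id, author, content.
    Returns the number of rows written.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "message_id TEXT PRIMARY KEY, "
        "guild_id TEXT NOT NULL, "
        "channel_id TEXT NOT NULL, "
        "author TEXT NOT NULL, "
        "content TEXT NOT NULL)"
    )
    rows = [
        (m["message_id"], guild_id, m["channel_id"], m["author"], m["content"])
        for m in messages
        if m["guild_id"] == guild_id
    ]
    # INSERT OR REPLACE keeps a regenerated export idempotent.
    conn.executemany("INSERT OR REPLACE INTO messages VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

In use, something like `export_guild(sqlite3.connect("guild_123.sqlite"), "123", messages)` would produce one self-contained file per guild that can be regenerated on a schedule.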
You might try reaching out to Anna's Archive and see if this would be a dataset they'd be interested in helping host/distribute. I think they'd agree that such data is important and should be archived.
This is really cool and actually useful for peeking behind those annoying login walls. What software do you use to store/index/search in so much data? How did you get the data in the first place? Discord isn't exactly known for letting its data be available easily. Have the administrators of the guilds asked you for this? Have you contacted them and made them aware after the fact?
For software, I use ScyllaDB and Elasticsearch. It's split across 6 physical nodes (8 including the CDN). Data collection is handled using standard user accounts, accessing only public, discoverable servers. I plan to write a blog post about the technical aspect of how this was done soon.
Admins of these servers weren't contacted, as the content indexed is already publicly accessible, comparable to a forum like this or public subreddit. That said, I understand the sensitivity around data visibility, and I've made it very simple for any user to opt out of indexing at any time. Private or invite-only servers are, of course, completely excluded.
I suggest you remove the opt-out functionality and let it scrape private servers that it discovers via publicly posted invite links. You don't owe anyone posting on a public forum any privacy. Moreover, the most valuable data to search for is probably somewhat obscured.
Thanks for your suggestions. However, this does not work for a few reasons:
1. Joining servers is protected by increasingly difficult to solve captchas that have no commercially available solver. This is not a battle I want to fight.
2. There are a LOT of CSAM rings that spam invite links in public servers. This is also not something I want to go anywhere near.
Moreover, after the fallout of spy.pet, I think it is very important that users are able to opt out.
It is arguably more important that the world doesn't lose as much information as Discord now contains, or leave it gated behind logins and captchas and whatever other nonsense each server implements.
I already own the hardware, so I only pay for colocation and transit. It's probably a lot less than you think. I hope to find some way to monetize it, but it is cheap enough that I can keep it running for quite a long time without any income.
Wow, that must be quite expensive! You said the files alone are a few PB. So at least 2PB / 8 servers ~= 250TB per server, which would probably put each server at > 20k $ (unless you’re putting it together with duct tape and scraps, but even then the disks will cost a ton).
Not exactly. Attachments are only fetched from Discord as the user requests them. This means that the vast majority of attachments are never stored on my server. Right now, I only have about 280TB of attachments locally on my own infrastructure. You can see more stats here: https://searchcord.io/about
I did consent for discord to have my data, I did NOT consent to you having my data.
The discord TOS clearly state:
> Our services might also provide you with access to other people’s content. You may not use this content without that person’s consent, or as allowed by law.
As I was not informed of the usage BEFORE it was taken, I could neither opt in nor opt out.
GDPR clearly states, even in the case of "legitimate interest" I have to be informed.
I only found this by chance. If I hadn't, I would have had no idea this was happening, so I couldn't have opted out.
Unfortunately TOS consist of words and words cannot constrain technical affordances. There has been a black market of scraped discord data for years (it was even sold on the public web at spy.pet). Stuff like this is probably the only way people will wake up to the realities of digital privacy.
Hopefully this will also wake people up to the issues with putting so much information (announcements, support, documentation, etc) behind a closed platform, thus making efforts like this invaluable for the future.
Do you plan to handle servers where you need to do some action (like send a message) to join all channels?
I was scrolling through the home page and came across a few where the only channels you're allowed to access are the verify-yourself or welcome channels.
Probably not. Discord will aggressively captcha you and every server has a different implementation of verification. It might be possible with a captcha solver and then some LLM to figure out the next steps.
Incredible work! Truly eye-opening to see how some rarer keywords in my native language return pages of relevant results. Meanwhile google gives 0 results or just AI/ad spam.
Finding good Discord servers has been a great thing for me. I was getting super disconnected and isolated, so different Discord servers have made me feel human again.
I personally like the idea of this. It affords opportunity like the farms where people's reprehensible behavior can be documented. I bet the Roblox condo bros are really going to be harping on the "I didn't consent to you documenting my public communications!"
Maybe also exclude messages by bots (e.g. "username has joined the server") from the index to decrease the stalking potential of your site (99.9% of these bot messages have no informational value for the index anyway). Currently you can still search for a username and get a subset of the servers that user is in (even if not active) by finding these bot messages.
This is something I have looked into. Unfortunately, every server uses a different format/bot. It might be possible to develop some sort of machine learning classifier to determine if it is a welcome message.
This is an amazing project. I always wonder how much information is lost in those chat apps, not only Discord but also Telegram. The latter has a huge dev community, specifically around Android ROM development, which migrated from the forum-based XDA to a more flexible chat/support platform like Telegram. I wish that could also be searchable without needing their client.
Telegram is already heavily monitored and scraped due to the large volume of illegal or extremely controversial activity that happens there. This is something I will look into though, my XDA threads rarely get any replies anymore. Thanks for the suggestion!
I'm in more than a hundred Discord servers. I've been wanting to scrape the members of each of them to discover the people with whom I share the most servers but we're not yet friends. Someone with 10+ would highly likely be a new friend since we'd have a lot of shared niche interests
This is something I have been trying to make as a way to learn about graph theory. If I can find a way to make it work efficiently, I will definitely add this.
Yeah that would be awesome! You could build a whole new social network on top of Discord. Whether in this way or others, I believe we'll all be finding an increasing number of hypercompatible people as technology advances.
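The shared-server idea from the comments above boils down to counting co-memberships. A minimal sketch, assuming you already have a mapping from server to member IDs (how that mapping is obtained is out of scope here, and the names `memberships`, `shared_server_counts`, and `candidates` are made up for illustration):

```python
from collections import Counter


def shared_server_counts(memberships, me, friends=frozenset()):
    """Count, for every non-friend user, how many servers they share with `me`.

    `memberships` maps a server ID to the set of user IDs in that server.
    """
    counts = Counter()
    for members in memberships.values():
        if me not in members:
            continue  # only servers I'm in can be shared with me
        for user in members:
            if user != me and user not in friends:
                counts[user] += 1
    return counts


def candidates(counts, minimum):
    """Users sharing at least `minimum` servers, most-shared first."""
    return [user for user, n in counts.most_common() if n >= minimum]
```

With the "10+ shared servers" threshold from the thread, `candidates(shared_server_counts(memberships, me, friends), 10)` would give the shortlist. This is a single pass over all member lists, so the cost is proportional to the total number of memberships in the servers considered.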
Could the Searchcord API be useful for Discord servers that want to archive their chats to their own website?
e.g. I have a Discord server for my product and I want to copy the Q&A threads to the FAQ section of my product website. Would Searchcord be useful for that, or are there better solutions?
This is very much against TOS (selfbots are explicitly against it) and the only way you should be doing this is through the normal bot interface with community consent. Even then you could get in trouble with consent issues.
Wait until you find out about all the three-letter agencies and private scrapers doing the same.
Probably best to not let it keep you up at night, especially on public servers you yourself explicitly decided to *opt-in* to Discord's 'Discovery' feature, but who am I to say.
Some very large servers are eligible for what Discord calls "discovery". This makes their data visible without joining the server. You can find a list of those on Discord's site here: https://discord.com/servers
Suggestion: a bot for smaller servers that do want to be archived like a public forum. Their admins could install the bot themselves and perhaps specify what channels they want archived.
This is something I have already completed but have not finished bug testing. The bot also includes functionality to recover any server in case it was nuked/wiped and Searchcord has a backup of it. It uses webhooks to resend the messages so you have an approximation of what the channels used to be.
This all depends on when the message is scraped. The scraper does not go back and check for edits/deletions. While this could be added, it would be incredibly inefficient.
I for one appreciate your efforts, and hope that you don't cave to the negative feedback you've gotten. There is no reasonable expectation of privacy in public chatrooms available to anyone, it's odd to hear so many people getting their pants in a bunch over it
I do see how this might be useful for certain use cases, but I don't like the fact that someone is scraping my messages without my consent. You might tell me oh well I could just opt-out: but what about those who don't even know about this thing?
I can already see Discord coming after this. Good luck fighting the legal battle with them.
And in its place, rather than a service that censored usernames in the interest of privacy holding the spotlight, in the future we'll surely get another spy.pet, created for the explicit purpose of doxxing, getting all the attention and revenue.
Round of applause all around, great job everyone!
Next we should take down the internet archive, another "abhorrent project blatantly violating users' privacy" for all the forums it's archived over the years. Who knows, maybe there's a post by a gasp 12-year-old in there somewhere! Maybe even a European!
This is great, I'll be returning to this tool often. Thanks.
A few suggestions and ideas for further projects:
- Allow for "keyword", -negate operators, and "multi-word string" searches. [PubMed](https://pubmed.ncbi.nlm.nih.gov/advanced) is what I'd consider an ideal search interface.
- Allow for regex or direct SQL lookups with limited query time, rate-limited by PoW. For example, if the server is under load, require a token from something like [anubis](https://anubis.techaro.lol/) and lower the maximum DB query time.
- Index the title of all discussion/forum-type posts with a vector DB for semantic search, and add an option to sort by replies (like [answer overflow](https://github.com/AnswerOverflow/AnswerOverflow)). This would make it possible to find relevant discussions among ~60B messages.
ScyllaDB doesn't support vector search, so I'd suggest something like [usearch](https://github.com/unum-cloud/usearch) for a detached index. Embedding models are faster and smaller than most people realize, pick whatever's on top of the [mteb leaderboard](https://huggingface.co/spaces/mteb/leaderboard) after deciding on size.
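The proof-of-work rate limiting suggested in the list above could be sketched as a generic hashcash-style puzzle. This is an illustration of the general technique only, not Anubis's actual protocol; the function names, the challenge format, and the load-to-difficulty mapping are all assumptions.

```python
import hashlib
import itertools


def is_valid(challenge: str, nonce: int, difficulty: int) -> bool:
    """True if sha256(challenge + nonce) starts with `difficulty` zero hex digits."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode("utf-8")).hexdigest()
    return digest.startswith("0" * difficulty)


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce. Expected work is about 16**difficulty hashes."""
    for nonce in itertools.count():
        if is_valid(challenge, nonce, difficulty):
            return nonce


def difficulty_for_load(load: float, base: int = 3, maximum: int = 6) -> int:
    """Server side: scale difficulty linearly with load, clamped to [0, 1]."""
    load = min(max(load, 0.0), 1.0)
    return base + round(load * (maximum - base))
```

The server would hand out a fresh random `challenge` per query, verify the returned nonce with a single hash via `is_valid`, and raise the difficulty (while lowering the maximum DB query time) as load increases. Verification stays cheap for the server while each expensive query costs the client real CPU time.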
There are some Discord archives on archive.org too btw.
https://archive.org/search?query=subject%3A%22DiscordChatExp... https://archive.org/search?query=subject%3A%22archiveteam_di... https://wiki.archiveteam.org/index.php/Discord
Hey,
This is interesting, I somehow missed this. Unfortunately, those are not full text searchable. Maybe I will download them and import them into Searchcord, with proper credit of course.
Thanks for this!
This project was (I assume) now taken down after widespread backlash from Discord users who never consented to having their messages scraped, indexed, and published. The opt-out system was not "one click"—it required navigating a server where users were mocked and dismissed for objecting to their inclusion.
While the developer claimed to be protecting privacy, the system displayed usernames and full message content from community spaces that were never intended for public indexing. The burden was placed on users to remove themselves from something they never opted into.
Discord’s own Developer Policy explicitly forbids scraping or mass data harvesting. This wasn’t innovation—it was exploitation dressed in pseudo-academic language. If you can still search by channel + user ID, that’s traceable content. That's not “privacy preserving,” that’s thinly veiled exposure.
The developer's status message (extremely concerning and very inappropriate) and behaviour during take-down tickets further emphasized the lack of empathy behind the project. This wasn’t a public service—it was a boundary violation, and the shutdown was well-earned.
I also want to challenge the idea that this tool “solves the problem of not being able to easily search Discord servers.” That’s not a problem—that’s a design choice. Discord isn’t built for global indexing on purpose. Private communities, support groups, fandoms, and sensitive spaces rely on that separation to feel safe. Treating the lack of global search as a bug instead of a boundary shows a complete disregard for how real people use the platform.
The "community spaces" opted in to the discovery platform. It not only requires confirmation from the server owners but also that the server meets specific metrics to even be eligible to enter. If the users don't want their messages to be public, they should complain to the server owners instead.
The problem of "not being able to easily search Discord server" is a real problem. Not too long ago I was unable to find any information about UE4 modding and had to dig deep until I found a few Discord servers centered on it. They were the only source of information aside from two small docs sites.
What was the developer's status message? There's no easy way to find it now. Can you share what it was? Thanks!
> This is my first large scale project, so I'd love to hear your feedback!
> I have placed restrictions on searching directly by user ID to prevent doxing. I also made the opt out process one click, for those who do not want to be archived.
1) I'd suggest anonymizing the usernames / author ids to something more privacy friendly such as how some image sites were generating 3-4 random words as a human readable unique id. This removes a lot of the reason people would opt out (i.e. posts being tracked down years later)
2) You do not seem to have clear rate limit documentation. If you are asking people to pay for commercial use, I'd suggest making it clear what the rough original limits are as well as the rough price range of what you'd offer.
3) Tbh, the only real thing I want from this project is basically narrative / roleplay / writing content for LLM reasons as I'm trying to build a rules-oriented system that narrates via LLM. If you don't want people using this data for this purpose, I'd suggest making that clear.
Hey,
Thanks for your suggestions.
> 1) I'd suggest anonymizing the usernames / author ids to something more privacy friendly such as how some image sites were generating 3-4 random words as a human readable unique id. This removes a lot of the reason people would opt out (i.e. posts being tracked down years later)
In the original iteration of Searchcord, it used to work similarly to that. The username was `sha256(userid+guildid)`, truncated to the first 8 characters. Unfortunately, it was pretty hard to follow chats. I will try your idea and see how it works, though.
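For anyone curious what the two aliasing schemes in this exchange might look like, here is a minimal sketch, not Searchcord's actual code. `hex_pseudonym` follows the `sha256(userid+guildid)` description above; `word_pseudonym` and the tiny `WORDS` list are made up for illustration of the "3-4 random words" idea.

```python
import hashlib

# Tiny illustrative word list (hypothetical); a real one would be far larger
# so that aliases rarely collide within a server.
WORDS = ["amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "heron",
         "iris", "jade", "kelp", "lotus", "maple", "nova", "onyx", "pine"]


def hex_pseudonym(user_id: str, guild_id: str, length: int = 8) -> str:
    """sha256(userid + guildid), truncated to the first `length` hex characters."""
    digest = hashlib.sha256((user_id + guild_id).encode("utf-8")).hexdigest()
    return digest[:length]


def word_pseudonym(user_id: str, guild_id: str, n_words: int = 3) -> str:
    """Turn the same digest into a few words, which is easier to follow in a chat."""
    digest = hashlib.sha256((user_id + guild_id).encode("utf-8")).digest()
    # Each digest byte selects one word; the alias stays stable per user per guild.
    picked = [WORDS[digest[i] % len(WORDS)] for i in range(n_words)]
    return "-".join(picked)


alias = word_pseudonym("111111111111111111", "222222222222222222")
```

Note that both variants are unkeyed: anyone who already knows a user ID and guild ID can recompute the alias. Mixing in a server-side secret (for example via `hmac`) would be one way to avoid that.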
> 2) You do not seem to have clear rate limit documentation.
This is a good idea. The rate limit varies by endpoint, and I haven't gotten around to documenting each one.
> If you are asking people to pay for commercial use, I'd suggest making it clear what the rough original limits are as well as the rough price range of what you'd offer.
I have absolutely zero idea what industry would be interested in this, in what form, and if anyone would even pay.
> 3) Tbh, the only real thing I want from this project is basically narrative / roleplay / writing content for LLM reasons as I'm trying to build a rules-oriented system that narrates via LLM. If you don't want people using this data for this purpose, I'd suggest making that clear.
I really don't care what people do with the data, as long as they are not spamming requests or using the data for commercial purposes without permission.
The sheer audacity here is quite something. You're stating people can't use your scraped data for commercial purposes "without permission," while your entire project is built on vacuuming up content from countless users without their permission, and in direct violation of Discord's ToS. That's not just a double standard; it's bordering on next-level cognitive dissonance.
And "privacy preserving"? With a one-click opt-out, that 99.999% of the affected users will never even know exists because they have no idea their conversations are now part of your archive, and you want it indexed by search engines? That's not "privacy preserving" - that's a bad joke. If privacy was a genuine concern, this project wouldn't exist in its current form. What you're offering is an opt-out fig leaf for a mass data harvesting operation.
Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context. It's a fundamental misunderstanding (or willful dismissal) of user expectations on what is essentially a semi-public, yet distinctly siloed, platform. This isn't an open-web forum where content is implicitly intended for broad public consumption and indexing.
Look, I get the frustration that (likely) motivated this. Discord has become an information black hole for many communities, and the shift away from open, searchable forums for project support is a genuine problem I've been incredibly frustrated with myself. But this "solution" - creating a massive, non-consensual archive that tramples over user privacy (and platform terms) - creates far graver ethical and practical issues than the one it purports to solve.
> Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context
Honestly, maybe they should. Maybe we need more stuff like this, until people finally wake up about the privacy catastrophe. The now defunct service spy.pet used to sell this kind of data with the stated purpose of doxxing people. There’s black markets for this. And it’s the same kind of data the service providers themselves have full access to.
> The sheer audacity here is quite something. You're stating people can't use your scraped data for commercial purposes "without permission," while your entire project is built on vacuuming up content from countless users without their permission, and in direct violation of Discord's ToS. That's not just a double standard; it's bordering on next-level cognitive dissonance.
Not really, it is not free to host and serve this data. If they want to get the data for free, they can get it directly from Discord. I did that work for them.
> And "privacy preserving"? With a one-click opt-out, that 99.999% of the affected users will never even know exists because they have no idea their conversations are now part of your archive, and you want it indexed by search engines? That's not "privacy preserving" - that's a bad joke. If privacy was a genuine concern, this project wouldn't exist in its current form. What you're offering is an opt-out fig leaf for a mass data harvesting operation.
Again, not really. It's impossible to search for users without already knowing what server they are in. This is functionally identical to Discord's in-built search feature.
> Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context. It's a fundamental misunderstanding (or willful dismissal) of user expectations on what is essentially a semi-public, yet distinctly siloed, platform. This isn't an open-web forum where content is implicitly intended for broad public consumption and indexing.
I believe that people need to realize that their messages were already being logged by many different moderation bots, just not publicized. This also happens on platforms like Telegram, look at the SangMata_BOT for example. Unless the messages are end to end encrypted, it was just a matter of time before they were scooped up and archived.
Thanks for your input, though, I really do want to build a platform that balances privacy and usability.
> I believe that people need to realize that their messages were already being logged by many different moderation bots, just not publicized. Unless the messages are end to end encrypted, it was just a matter of time before they were scooped up and archived.
And that makes it OK for you to do it as well? Bots storing all the messages is also not OK, but they also don't publish them, so it is way less problematic.
Okay, the "not really" and "I'll solve that problem if and when" responses are... something else. It feels like you're speedrunning how to get into a world of trouble while hand-waving away every legitimate concern. Let's try to unpack this again, because your justifications are frankly baffling.
> Again, not really. It's impossible to search for users without already knowing what server they are in. This is functionally identical to Discord's in-built search feature.
That's not quite correct, and frankly it borders on willful obfuscation. In your own words elsewhere in this thread, you're eager for search engines to index this archive. That "privacy preserving" barrier of needing to know both a user ID and a server/channel id evaporates the moment Google or any other search engine hoovers up your pages. At that point, any combination of keywords, usernames, aliases, or snippets could reveal someone's posting history, across contexts and years. How is that "functionally identical" to Discord's walled-garden search or "privacy preserving"?
> I believe that people need to realize that their messages were already being logged by many different moderation bots, just not publicized.
This is a disingenuous deflection.
Your "I really do want to build a platform that balances privacy and usability" line sounds utterly hollow when the entire foundation of the project demonstrates a profound misunderstanding of, or disregard for, basic privacy, consent, and intellectual property. Speaking of which... have you actually thought about the legal Pandora's Box you're prying open? Your casual "I'll deal with Discord's ToS issues if they arise" attitude is quaint, because Discord's ToS is likely the tip of a colossal iceberg of legal trouble.
You're not just 'breaking ToS', you're potentially looking at:
Good luck with all of this. I hope you have a good lawyer, ideally multiple. You might need them.
The COPPA part only applies if it was done knowingly.
did you type this?
Ridiculous take. If you're posting in a server that's intentionally open to the public and accessible to anyone with a link or even indexed by server discovery you shouldn't expect privacy. That's just the basic reality of the internet.
No, what's "ridiculous" is this simplistic, black-and-white framing that deliberately ignores any nuance, the concept of contextual integrity or reasonable user expectations.
Of course, no one expects absolute secrecy in a public-facing Discord server. That's a straw man. The issue isn't about some naive belief that messages are invisible. It's about the scope, permanence, and method of access and archiving.
People participating in public Discord spaces have reasonable contextual expectations about how their words will be accessed and by whom. They expect their messages to be seen by current and maybe future server members - not extracted, permanently archived, and made globally searchable by entirely unrelated third parties.
This is similar to how conversations in a public park are technically "public," but most people would be rightfully disturbed if someone recorded everything, transcribed it, published it online with their names attached, and made it all searchable forever. Just because something isn't strictly private doesn't mean any and all forms of collection, republication, and indexing are ethically justified.
If you can't see the distinction between "not perfectly private within this specific semi-public space" and "archived indefinitely, and globally searchable forever by anyone, anywhere, for any reason," then you're either arguing in bad faith or your understanding of these issues is so superficial that further engagement is pointless.
[flagged]
It seems the core concept of contextual integrity is still not landing.
It's not a question of surprise that public data can be scraped - I'm well aware of how the internet functions, thank you. The point, which you seem determined to evade, is about the fundamental ethics of systematically doing so and the vast difference in impact and expectation between, say, a server's own moderation logs or incidental screenshots, and a third party, globally indexed, permanent archive. The former serves limited, often known functions within that specific community; the latter is a privacy-invasive data trawl weaponizing the 'public' label. Just because a thing is technically possible doesn't grant a free pass to ignore privacy implications or users' reasonable expectations of how their contributions will be used and disseminated.
Your attempt to dismantle the 'public park' analogy only underscores your misunderstanding of it. The scenario isn't about someone yelling (an exceptional event, often a public nuisance, that might indeed attract specific attention or recording). It's the equivalent of someone systematically planting listening devices by every park bench, transcribing every casual, low-expectation conversation - like my dinner plans with my girlfriend, or a vent about my boss - and then publishing it all online, forever, simply because the park itself is 'public' and it was a technically possible thing to do. The ethical chasm between observing a public spectacle and conducting mass, indiscriminate surveillance of everyday, semi-private interactions within a public space shouldn't be this difficult to grasp. One involves a specific event; the other is a dragnet.
As for flagging, I didn't touch your comment. I have never flagged a single comment on this site. Perhaps others simply disagreed with the quality, relevance or the dismissive tone of your contribution.
I won't continue a discussion with someone who relies on AI for writing, this response you posted presents the tells of someone using a language model to write a response paragraph.
[flagged]
> In the original iteration of Searchcord, it used to work similarly to that. The username was `sha256(userid+guildid)`, truncated to the first 8 characters. Unfortunately, it was pretty hard to follow chats. I will try your idea and see how it works, though.
I suggest you do, since tbh you are likely (as others have said) to be violating privacy laws with your current implementation + the Discord ToS. If it's anonymized better, you are less likely to be a target of someone who gets angry about not knowing you exist.
Up to you, your life your circus y'know?
> I have absolutely zero idea what industry would be interested in this, in what form, and if anyone would even pay.
LLM data collection, if it's not already being bought directly from Discord.
Same reason I'd want to use highly anonymized and curated data from the roleplay / writing discords as training data. It is just I'd have to go through and anonymize your data and curate it / clean it up before I would dare to send it to an LLM for legal reasons.
If I send/share PII, I'd be screwed just like you will be if someone gets upset.
> I really don't care what people do with the data, as long as they are not spamming requests or using the data for commercial purposes without permission.
Fair. For me, this is for hobby implementations of solo roleplaying content similar to AI Dungeon and other implementations, so it's not commercial. But my use case (for your purposes) would be better served by just being able to download a database dump (properly anonymized, or me doing it) for specific servers, since most of the data you collect is useless to me: I've got a specific goal in mind and want to minimize data collection for legal liability reasons. (i.e. non-commercial roleplaying with no PII or other privacy risky info is likely to be a safe use case)
EDIT:
I'd consider dropping attachments + links and only recording text as well for CSAM and other abusive material reasons. I doubt you have the moderation in place to protect yourself.
Pictures and videos and what not are a lot more dangerous to you than text would be. (i.e. despite what people say about it, realistically, most text in a public forum on the internet w/o PII is not going to get you hit with fines)
That said, personally, I would not publish this as you have because I don't have that kind of risk tolerance, but I can see it being "safe enough" for some people. But the images/attachments are in "are you really sure you want to do that? You could go bankrupt" territory.
Would you consider making regular dumps of the database available in sharded torrents like Anna's Archive does so that users can back up the data themselves for preservation purposes? This would complicate retroactively removing users' activity, but that data could already be scraped.
And related, I'd like to be able to run this locally for exports of guilds that I'm on myself. Is that even possible with the architecture you've built?
Hey,
This is absolutely something I want to do, but at the guild level. The database itself is over 13TB, which is much too large to create regular exports of. I will probably provide a SQLite export of each guild, regenerated each week/month. Anyone is free to download whatever they want in real time from the API.
Thanks for your question!
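A per-guild SQLite export along the lines described above could be sketched roughly as follows. This is only an illustration using Python's standard `sqlite3` module; the table layout, column names, and the `export_guild` helper are assumptions, not Searchcord's real schema or API.

```python
import sqlite3
from typing import Iterable, Mapping


def export_guild(messages: Iterable[Mapping[str, str]], db_path: str) -> int:
    """Write one guild's messages to a SQLite file and return the row count.

    Each message is expected to carry the (hypothetical) keys used below.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                """CREATE TABLE IF NOT EXISTS messages (
                       message_id   TEXT PRIMARY KEY,
                       channel_id   TEXT NOT NULL,
                       author_alias TEXT NOT NULL,
                       sent_at      TEXT NOT NULL,
                       content      TEXT NOT NULL
                   )"""
            )
            # INSERT OR REPLACE keeps a regenerated export idempotent.
            conn.executemany(
                "INSERT OR REPLACE INTO messages "
                "VALUES (:message_id, :channel_id, :author_alias, :sent_at, :content)",
                (dict(m) for m in messages),
            )
        return conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    finally:
        conn.close()
```

Regenerating the file weekly or monthly would then just mean re-running the export into a fresh path per guild and publishing the result.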
Big +1 for dumps
You might try reaching out to Anna's Archive and see if this would be a dataset they'd be interested in helping host/distribute. I think they'd agree that such data is important and should be archived.
Yes, we'd happily host and torrent this.
I was imagining something like a month-by-month historical export so only recent data would need to be regularly exported.
SQLite dumps would be awesome to work with though.
This is really cool and actually useful for peeking behind those annoying login walls. What software do you use to store/index/search in so much data? How did you get the data in the first place? Discord isn't exactly known for letting its data be available easily. Have the administrators of the guilds asked you for this? Have you contacted them and made them aware after the fact?
Hey,
Thanks for your feedback.
For software, I use ScyllaDB and Elasticsearch. It's split across 6 physical nodes (8 including the CDN). Data collection is handled using standard user accounts, accessing only public, discoverable servers. I plan to write a blog post about the technical aspect of how this was done soon.
Admins of these servers weren't contacted, as the content indexed is already publicly accessible, comparable to a forum like this or public subreddit. That said, I understand the sensitivity around data visibility, and I've made it very simple for any user to opt out of indexing at any time. Private or invite-only servers are, of course, completely excluded.
I suggest you remove the opt-out functionality and let it scrape private servers that it discovers via publicly posted invite links. You don't owe anyone posting on a public forum any privacy. Moreover, the most valuable data to search for is probably somewhat obscured.
Hey,
Thanks for your suggestions. However, this does not work for a few reasons:
1. Joining servers is protected by increasingly difficult to solve captchas that have no commercially available solver. This is not a battle I want to fight.
2. There are a LOT of CSAM rings that spam invite links in public servers. This is also not something I want to go anywhere near.
Moreover, after the fallout of spy.pet, I think it is very important that users are able to opt out.
It is arguably more important that the world doesn't lose as much information as Discord now contains, or leave it gated behind logins and captchas and whatever other nonsense each server implements.
That's a lot of compute, how much does it cost to keep it running? I don't see how that project would generate any income on its own
I already own the hardware, so I only pay for colocation and transit. It's probably a lot less than you think. I hope to find some way to monetize it, but it is cheap enough that I can keep it running for quite a long time without any income.
Wow, that must be quite expensive! You said the files alone are a few PB. So at least 2PB / 8 servers ~= 250TB per server, which would probably put each server at > 20k $ (unless you’re putting it together with duct tape and scraps, but even then the disks will cost a ton).
Hey,
Not exactly. Attachments are only fetched from Discord as the user requests them. This means that the vast majority of attachments are never stored on my server. Right now, I only have about 280TB of attachments locally on my own infrastructure. You can see more stats here: https://searchcord.io/about
Thanks for your question!
Thanks for this. Well good luck with keeping it up, it's a really useful service.
This clearly breaks discord TOS.
I did consent for discord to have my data, I did NOT consent to you having my data.
The discord TOS clearly state: > Our services might also provide you with access to other people’s content. You may not use this content without that person’s consent, or as allowed by law.
As I was not informed of the usage BEFORE it was taken, I could neither opt in nor opt out.
GDPR clearly states, even in the case of "legitimate interest" I have to be informed.
I only found this randomly; if I hadn't, I would have had no idea of the data validation happening, so I couldn't have opted out.
Technically, cool project. Legally, not so cool.
Unfortunately TOS consist of words and words cannot constrain technical affordances. There has been a black market of scraped discord data for years (it was even sold on the public web at spy.pet). Stuff like this is probably the only way people will wake up to the realities of digital privacy.
Hopefully this will also wake people up to the issues with putting so much information (announcements, support, documentation, etc) behind a closed platform, thus making efforts like this invaluable for the future.
Do you plan to handle servers where you need to do some action (like send a message) to join all channels ?
I was scrolling through the home page and came across a few where the only channels you're allowed to access are the verify-yourself or welcome channels.
Probably not. Discord will aggressively captcha you and every server has a different implementation of verification. It might be possible with a captcha solver and then some LLM to figure out the next steps.
nice project. how are you going to handle the issues involved with breaking Discord's TOS?
> "scraping our services without our written consent"
additionally, are these pages indexable? i know of other projects (opt-in) that create pages made from user discussion.
> how are you going to handle the issues involved with breaking Discord's TOS?
Not sure. I will solve that problem if and when Discord takes issue with Searchcord.
> additionally, are these pages indexable?
Yes, I would actually like for search engines to index it as their search is much more contextually aware than mine.
> Not sure. I will solve that problem if and when Discord takes issue with Searchcord.
You are in for such a rude awakening once discord gets to you
I've been looking for something like this for so long, thanks for making!
There's so much stuff locked in Discord now that forums have fallen in popularity, think this sort of thing really helps unlock that knowledge again.
Thanks for your feedback! <3
Incredible work! Truly eye-opening to see how some rarer keywords in my native language return pages of relevant results. Meanwhile google gives 0 results or just AI/ad spam.
How much storage did you have to use? Also, are you shutting down already? It's a shame that I couldn't find like-minded people like me..
Finding good Discord servers has been a great thing for me. I was getting super disconnected and isolated, so different Discord servers have made me feel human again.
I hope Searchcord helps you! <3
I personally like the idea of this. It affords opportunity like the farms where people's reprehensible behavior can be documented. I bet the Roblox condo bros are really going to be harping on the "I didn't consent to you documenting my public communications!"
Maybe also exclude messages by bots (e.g. "username has joined the server") from the index to decrease the stalking potential of your site (99.9% of these bot messages have no informational value for the index anyway). Currently you can still search for a username and get a subset of servers that the username is in (even if not active) by finding these bot messages.
This is something I have looked into. Unfortunately, every server uses a different format/bot. It might be possible to develop some sort of machine learning classifier to determine if it is a welcome message.
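Short of a trained classifier, a crude first pass at the filtering discussed above might be a keyword heuristic applied only to bot-authored messages. The sketch below assumes each message is a dict with hypothetical `author_is_bot` and `content` fields; the phrase list is invented and would miss many server-specific formats.

```python
import re

# Hypothetical phrases commonly seen in join notices; every server differs,
# so this list is only a starting point.
JOIN_PATTERN = re.compile(
    r"\b(welcome|has joined|just joined|joined the server)\b",
    re.IGNORECASE,
)


def is_join_notice(message: dict) -> bool:
    """Return True for bot messages that look like 'X has joined' notices."""
    if not message.get("author_is_bot", False):
        return False  # never drop human-written messages on a keyword match
    return bool(JOIN_PATTERN.search(message.get("content", "")))


def filter_for_index(messages):
    """Yield only the messages that should be indexed."""
    for message in messages:
        if not is_join_notice(message):
            yield message
```

Restricting the match to bot authors keeps false positives from removing real conversations, at the cost of also dropping any bot message that merely contains the word "welcome".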
This is an amazing project, I always wonder how much information is lost in those chat apps, not only Discord, but also Telegram. The latter has a huge dev community specifically around Android ROM development, which migrated from the forum-based XDA to a more flexible chat/support platform like Telegram. I wish that could also be searchable without having their client.
Telegram is already heavily monitored and scraped due to the large volume of illegal or extremely controversial activity that happens there. This is something I will look into though, my XDA threads rarely get any replies anymore. Thanks for the suggestion!
I'm in more than a hundred Discord servers. I've been wanting to scrape the members of each of them to discover the people with whom I share the most servers but we're not yet friends. Someone with 10+ would highly likely be a new friend since we'd have a lot of shared niche interests
This is something I have been trying to make as a way to learn about graph theory. If I can find a way to make it work efficiently, I will definitely add this.
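The counting part of this idea needs no graph library. As a rough sketch, assuming a hypothetical `guild_members` mapping from server ID to the set of member IDs, one could tally how many of "my" servers each other user appears in:

```python
from collections import Counter


def shared_server_counts(guild_members, me, friends=frozenset()):
    """Count, for every non-friend, how many of my servers they are also in.

    guild_members: dict mapping server ID -> set of member IDs (assumed input).
    """
    counts = Counter()
    for members in guild_members.values():
        if me not in members:
            continue  # only servers I am in count as "shared"
        for user in members:
            if user != me and user not in friends:
                counts[user] += 1
    return counts


def likely_new_friends(guild_members, me, friends=frozenset(), threshold=10):
    """Users sharing at least `threshold` servers with me, most shared first."""
    counts = shared_server_counts(guild_members, me, friends)
    return [(user, n) for user, n in counts.most_common() if n >= threshold]
```

This is linear in the total number of memberships, so the expensive part in practice would be collecting the member lists, not the counting.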
Yeah that would be awesome! You could build a whole new social network on top of Discord. Whether in this way or others, I believe we'll all be finding an increasing number of hypercompatible people as technology advances.
Congratulations and all the best.
Could Searchcord API be useful for discord servers which want to archive their chats to their own website?
e.g. I have a Discord server for my product and I want to copy the Q&A threads to the FAQ section of my product website. Will Searchcord be useful for that, or are there better solutions?
https://support.discord.com/hc/en-us/articles/360039598252-P...
> 3 years ago
100% on their radar. surely
This is very much against TOS (selfbots are explicitly against it) and the only way you should be doing this is through the normal bot interface with community consent. Even then you could get in trouble with consent issues.
Even then that is probably against the developer policy & terms
> 20. Do not mine or scrape any data, content, or information available on or through Discord services (as defined in our Terms of Service).
https://support-dev.discord.com/hc/en-us/articles/8563934450... (point 20)
I own one of the servers that you scraped and uploaded without consent.
This violates Discord TOS and a whole host of privacy laws and I will be taking action.
Did you make an account just to post this
Wait until you find out about all the three-letter agencies and private scrapers doing the same.
Probably best to not let it keep you up at night, especially on public servers you yourself explicitly decided to *opt-in* to Discord's 'Discovery' feature, but who am I to say.
cry lol. discord is a company that doesn't care about you
All discord servers require an invitation link as far as I know, do you consider a link you find online as a public server?
Some very large servers are eligible for what Discord calls "discovery". This makes their data visible without joining the server. You can find a list of those on Discord's site here: https://discord.com/servers
Suggestion: a bot for smaller servers that do want to be archived like a public forum. Their admins could install the bot themselves and perhaps specify what channels they want archived.
This is something I have already completed but have not finished bug testing. The bot also includes functionality to recover any server in case it was nuked/wiped and Searchcord has a backup of it. It uses webhooks to resend the messages so you have an approximation of what the channels used to be.
Check out Linen https://www.linen.dev/
How does this tool handle it when a user deletes or edits their message? Is the message still logged on your site?
This all depends on when the message is scraped. The scraper does not go back and check for edits/deletions. This is something that, while it could be added, would be incredibly inefficient.
I for one appreciate your efforts, and hope that you don't cave to the negative feedback you've gotten. There is no reasonable expectation of privacy in public chatrooms available to anyone, it's odd to hear so many people getting their pants in a bunch over it
Any chance for a style option that doesn't have anime girls?
lol. It's only on the homepage, not on any content pages.
How can i talk to you via dms?
Can I download all the messages & attachments?
Sure, but there's a few petabytes of attachments and over 63 billion messages. Feel free to use the API.
So basically Spy.pet v2? Hmm.
I do see how this might be useful for certain use cases, but I don't like the fact that someone is scraping my messages without my consent. You might tell me oh well I could just opt-out: but what about those who don't even know about this thing?
I can already see Discord coming after this. Good luck fighting the legal battle with them.
Hopefully legal action is taken against such an abhorrent project blatantly violating users' privacy.
and in its place, rather than a service that was censoring usernames in the interest of privacy being the leading service holding the spotlight, in the future we'll surely get another spy.pet created for the explicit purpose of doxxing getting all the attention and revenue.
Round of applause all around, great job everyone!
Next we should take down the internet archive, another "abhorrent project blatantly violating users' privacy" for all the forums it's archived over the years. Who knows, maybe there's a post by a gasp 12-year-old in there somewhere! Maybe even a European!
[dead]
This is great, I'll be returning to this tool often. Thanks.
A few suggestions and ideas for further projects.
-allow for "keyword", -negate operators and "multi word string" searches, [Pubmed](https://pubmed.ncbi.nlm.nih.gov/advanced) is what I'd consider an ideal search interface
-allow for regex, or direct SQL lookups with limited query time, rate-limited by PoW. For example, if the server is under load, require a token from something like [anubis](https://anubis.techaro.lol/) and lower the maximum DB query time
-Index the title of all discussion/forum type posts with a VectorDB for semantic search. And add an option to sort by replies. (Like [answer overflow](https://github.com/AnswerOverflow/AnswerOverflow)) This would make it possible to find relevant discussions among ~60B messages. ScyllaDB doesn't support vector search, so I'd suggest something like [usearch](https://github.com/unum-cloud/usearch) for a detached index. Embedding models are faster and smaller than most people realize, pick whatever's on top of the [mteb leaderboard](https://huggingface.co/spaces/mteb/leaderboard) after deciding on size.
-calculate the Jaccard similarity (user overlap) between Discord server members, this would allow for searching in "similar" servers, and potentially, mapping Discord. [github](https://anvaka.github.io/map-of-github) [reddit](https://anvaka.github.io/map-of-reddit)
-fix doxing. Searching by <@userid> is currently possible.
-expect the alternative to the cloudflare captcha to be abused, it's too simple for modern solvers.
-open source the stack? I'm interested in the scraper.
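To make the Jaccard suggestion in the list above concrete: the similarity of two servers is the size of the intersection of their member sets divided by the size of the union. A minimal sketch, where the `guild_members` mapping and the `most_similar` helper are assumptions for illustration:

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def most_similar(guild_members: dict, target: str, top_n: int = 5):
    """Rank the other servers by member overlap with `target`.

    guild_members: dict mapping server ID -> set of member IDs (assumed input).
    """
    base = guild_members[target]
    scored = [
        (other, jaccard(base, members))
        for other, members in guild_members.items()
        if other != target
    ]
    # Highest similarity first; ties broken by server ID for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]
```

Comparing every pair of servers this way is quadratic in the number of servers, so at the scale mentioned in the thread an approximation such as MinHash would probably be needed; that part is not sketched here.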
[dead]
[flagged]
[flagged]
[flagged]