It's great seeing more tools in this space.
I was recently researching ways of anonymizing production data for staging, and I found existing tools either cumbersome to set up or lacking in features.
I stumbled upon clickhouse-obfuscator[1], and really liked that it worked on standalone dump formats (CSV, Parquet, etc.) rather than any specific DBMS. I think that's a great approach for this, since it keeps things simple and generic, and it can be conveniently added as a middle step in the backup-restore pipeline. Unfortunately, the tool is quite barebones, and has issues maintaining referential integrity, so we had to abandon it.
This is still an unsolved problem in our team, so I'll keep an eye on your tool. We would need support for ClickHouse as well, so it's good you're planning support for other DBMSs. Good luck!
[1]: https://clickhouse.com/docs/en/operations/utilities/clickhou...
This is really awesome - and it's so amazing that you've built this as a standalone tool!
I can absolutely speak to the pain of having a dozen pg_dump --exclude-table-data arguments and having a developer experience that makes it difficult to reproduce bugs due to drift between production data and test fixtures (even if they share the same schema, assumptions can change massively!).
Secure and robust database cloning also enables preview apps that actually answer the stakeholder question "can I see/play with what the new code would do, if applied to the actual [document/record/product listing] that motivated the feature/bugfix?" Subsetting and PII masking are both critical for this, and it's amazing to see that you've thought about them as integral parts of the same product.
I really want to see a product like this succeed! The easier the tool is to use, the harder it might be to monetize... but there are so many applications of a tool like this, including ones that can materially improve security at organizations large and small (https://nabeelqu.substack.com/i/150188028/secrets just posted here earlier today remarks on this!) that I'm sure you'll find the right niche!
Hi! Thank you for your kind words.
> The easier the tool is to use, the harder it might be to monetize... but there are so many applications of a tool like this
I agree with you! That’s exactly why we've started developing our Dynamic Staging Environments platform. It will integrate seamlessly with CI/CD systems, enabling you to create and maintain the stateful components of your services more efficiently.
One key insight we’ve gained is that this type of software introduces new responsibilities across the Dev, Business, and Security teams within a company. This can result in complex processes and interactions. For example, changes to schemas often require approvals from both the security team and service owners, which can slow down workflows. That's why I really appreciate how Bytebase (https://www.bytebase.com/) tackled this issue—by providing a platform that combines GitOps for stateful resources with business logic in a streamlined way. We should be looking in this direction.
Important remark:
The current design of Greenmask fits well with the future platform. We remain committed to focusing on simple utilities, as they are essential, especially for small businesses. These tools serve a critical purpose by solving problems quickly for smaller companies and for individuals who need a solution that works ASAP.
Oh, that's interesting - sounds like you're implying that the act of making a change in how a database's data is transformed for different parties/outputs would go through a review/governance/rollout process, in a similar way (but with even more stakeholders) that a schema migration or code deployment would? And you'd provide GUI and governance controls on top of that? That makes a ton of sense!
And I wonder if it would be worth talking to folks who do SOC 2 auditing; if you could say "we can provide a framework that allows you to continue to be SOC 2 certified while letting your developers access real-world data" that would be tremendously valuable.
> that a schema migration or code deployment would?
> provide GUI and governance controls on top of that?
Exactly. Data doesn’t exist in isolation. Databases are dependencies of services, and schemas evolve throughout the software lifecycle, often managed by different data migration tools. In large organisations, regular developers usually don’t have direct access to the data sources, and masking rules along with real data sources are often restricted. Schema changes must be validated by the responsible data governance teams to ensure compliance and accuracy.
That’s why we implemented the validate command even in this standalone tool: it checks for schema differences and, if any schema changes are detected, prevents the dump from running and prints detailed warnings. https://docs.greenmask.io/latest/commands/validate
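The core idea behind such a validate step can be sketched in a few lines of Python: compare a stored schema snapshot against the live schema and refuse to proceed on any drift. The table and column names below are invented for illustration; Greenmask's actual implementation inspects the real PostgreSQL catalog.

```python
def diff_schemas(expected: dict, actual: dict) -> list[str]:
    """Return human-readable warnings for tables/columns that changed."""
    warnings = []
    for table, cols in expected.items():
        if table not in actual:
            warnings.append(f"table dropped: {table}")
        elif actual[table] != cols:
            warnings.append(f"columns changed in {table}: {cols} -> {actual[table]}")
    for table in actual.keys() - expected.keys():
        warnings.append(f"new table: {table}")
    return warnings

# Hypothetical snapshot taken when the masking rules were written,
# versus the schema found at dump time.
snapshot = {"users": ["id", "email"], "orders": ["id", "user_id"]}
live = {"users": ["id", "email", "phone"], "invoices": ["id"]}

problems = diff_schemas(snapshot, live)
if problems:
    # The real tool would abort the dump here with detailed warnings.
    for w in problems:
        print("WARNING:", w)
```

The point of aborting rather than warning is that a new, unmasked column (like `phone` above) is exactly the kind of change that silently leaks PII if the dump proceeds.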
I once presented Greenmask at an event organized by Percona in Cyprus, and one of the questions raised was: “What if we have a staging database, but instead of cleaning up the database and data, we want to add something to the existing dataset?” At the time, I didn’t have an immediate answer. However, this question inspired me to think, and eventually, I found a solution that at least partially covers this case:
You can restore data in topological order by preserving references and ensuring proper dependency handling (https://docs.greenmask.io/latest/commands/restore/#restorati...)
You can exclude non-critical errors to streamline the process without disrupting key operations (https://docs.greenmask.io/latest/configuration/#restoration-...)
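The restore-in-topological-order idea can be sketched with Python's stdlib graphlib: tables holding foreign keys are restored only after the tables they reference. The table names and dependency map here are invented for illustration; the real tool derives the graph from the database's actual foreign-key constraints.

```python
from graphlib import TopologicalSorter

# table -> set of tables it references via foreign keys (hypothetical schema)
deps = {
    "users": set(),
    "products": set(),
    "orders": {"users"},
    "order_items": {"orders", "products"},
}

# static_order() yields each table only after all of its dependencies,
# so inserts never violate referential integrity.
restore_order = list(TopologicalSorter(deps).static_order())
print(restore_order)
```

The same ordering also tells you which rows must be pulled in when subsetting: taking a slice of `order_items` forces the referenced `orders`, `users`, and `products` rows along with it.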
I want to emphasize that this type of software must be flexible and adaptable to meet the ever-evolving needs of businesses… Otherwise, the project is as good as dead.
——————————
> And I wonder if it would be worth talking to folks who do SOC 2 auditing
I’ve had discussions with professionals from Information Security, including those working in SOCs, and you're absolutely pointing in the right direction. At the moment, I’m actively exploring solutions and building a concept. I believe that by 2025, we’ll be able to showcase something new.
I’ve used https://postgresql-anonymizer.readthedocs.io/en/latest/ before to create trimmed down dev databases based on scrubbed and fuzzed production data.
I've had my fair share of struggles with data anonymization. I've tried various techniques, from simple data masking to more complex approaches like pseudonymization and generalization.
However, I've found that these traditional methods often fall short in preserving the intricate relationships and structures within the data. That's why I was excited to discover synthetic data generation, which has been a game-changer for me. By creating artificial data that mirrors the statistical properties of the original, I've been able to share data with others without compromising sensitive information. I've used synthetic data generation for various projects and it's been a valuable tool in my toolkit.
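As a toy illustration of that idea, the sketch below generates artificial values that match the mean and standard deviation of an original column without reusing any real value. Real synthetic-data tools (SDV and the like) model joint distributions and cross-column relationships; this stdlib-only version preserves just two marginal statistics.

```python
import random
import statistics

# Hypothetical sensitive column (e.g. ages, salaries in thousands).
original = [42.0, 51.5, 39.0, 60.2, 48.8, 55.1]
mu, sigma = statistics.mean(original), statistics.pstdev(original)

rng = random.Random(0)  # seeded for reproducibility
raw = [rng.gauss(0, 1) for _ in range(1000)]

# Standardize the sample, then rescale so mean/stdev match the original exactly.
m, s = statistics.mean(raw), statistics.pstdev(raw)
synthetic = [mu + sigma * (x - m) / s for x in raw]
```

None of the `synthetic` values come from the original data, yet downstream code that only depends on these aggregate properties behaves the same.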
Any generation tools you suggest checking out?
I've used a few different tools for synthetic data generation, but some of the ones I've found to be most useful include SDV, Faker and DataSynth.
Congrats on the release! I should be able to switch from datanymizer (unmaintained) now.
The other tool in this space to look at is neosync: https://www.neosync.dev/
Thanks for the shout-out! Co-founder of Neosync here - love seeing more tools in this space and pushing the envelope further. Good luck!
I liked a similar tool, Snaplet; unfortunately, they're dead now. One thing I liked was the option to run a proxy that you could connect to with any tool you like (psql, DBeaver, ...) and see a preview of your transformations. They also had some good (stable) generators for names, emails, etc. (I haven't yet fully checked this in Greenmask).
Anyway, I will definitely try this. It looks really good!
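The "stable generator" idea Snaplet offered can be sketched with a keyed hash: the same real value always maps to the same fake value, so joins and references across tables stay consistent after masking. Everything below (the name list, the secret, the email format) is invented for illustration, not any tool's actual scheme.

```python
import hashlib

FIRST_NAMES = ["alice", "bob", "carol", "dave", "erin", "frank"]
SECRET = b"per-project-secret"  # assumed to be configured once per project

def stable_fake_email(real_email: str) -> str:
    """Deterministically derive a fake email from a real one via keyed BLAKE2b."""
    digest = hashlib.blake2b(real_email.encode(), key=SECRET).digest()
    name = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    suffix = digest[1:4].hex()
    return f"{name}.{suffix}@example.com"

# Stable: masking the same source value twice gives the same result,
# so a users table and an orders table masked separately still join.
a = stable_fake_email("jane.doe@corp.com")
b = stable_fake_email("jane.doe@corp.com")
```

Keying the hash matters: without the secret, anyone with a list of candidate real emails could recompute the mapping and reverse the masking.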
Hi! Noted. Alternatively, you can use the validate command, which will show you the transformation differences:
https://docs.greenmask.io/latest/commands/validate/
Having already jumped from Replibyte to Greenmask, I can say it is a significantly better architecture - hands down.
Thank you! I’m really happy that Greenmask meets your expectations