Open-source AI must reveal its training data, per new OSI definition

37 points | by belter 2 days ago

13 comments

wruza 2 days ago
OSI is not an “open-source” trademark holder though. It’s basically an opinion, cause current OSI-approved licenses do not include this explicitly.
I guess I’m built different, but all this open-washing noise makes no sense to me. I don’t think “oh, open-free as in both” every time I see “open” in the wild from a bigcorp. Especially in an area that just emerged and is clearly nuanced in the “source” part. I mean, yes, it’s not fully correct, but also everyone to whom this matters figures it out sorta immediately. What is even washing here? These semantic arguments are the least of the problems we’ll have with all that power concentration, that is, assuming the current tech is worth anything outside of its bubble.
- koolala 2 days ago
  Wikipedia AI, Archive.org AI, All GPL Code Ever Made AI. True open-source AI is like a Library + a Librarian.
  Would be nice if we could ever have a post-singularity post-scarcity society. An enlightened world with access to all human knowledge across all past cultures.
  - wruza 2 days ago
    Would. The problem was never access to knowledge, but the unwillingness to get it and start creating value. Some people are talking animals who need no post-scarcity future and most people just listen to what’s being told. Quite a tangent, I’d leave it to post-socialistic scifi, cause that’s what it is, at least for gmo-free humans.
Havoc 2 days ago
I doubt anyone in the LLM world is gonna care.
In practice the key criteria seem to be:
1) Can I get the weights
2) Is commercial use permitted
There are more nuances, sure, but if those are met then many consider it open in the non-Stallman sense.
- infotainment 2 days ago
  Agreed. I feel like the "training data must ALSO be open source" argument solely stems from trying to nitpick at Meta's efforts.
  The training data provides almost no value, since it's usually just unstructured text dumped from the internet. However, by demanding it, one creates almost impossible-to-reach goalposts. If, somehow, Meta had also released their training data, I suspect the goalposts would immediately move to something else.
  - papichulo2023 2 days ago
    It will never happen. No legal team will approve it: endless liability.
    Also, I wonder what normal users would do with dozens of TB of synthetic data.
  - blackeyeblitzar 2 days ago
    It’s absolutely valuable. It lets you determine the biases of the model by examining what it was trained on. It also lets you alter the curation, pre-processing, and post-processing to achieve a different model that may be more accurate or truthful (if you can train it). But the transparency is necessary to audit what they do.
    > If, somehow, Meta had also released their training data, I suspect the goalposts would immediately move to something else.
    No, they wouldn’t. There’s a finite list of what is needed for an LLM to be open source. See an example of this in AI2’s OLMo:
    https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...
- rsynnott a day ago
  It'd be pretty bizarre to consider that _open source_, though. On that basis, say, macOS is open source; you can get the binaries for free, and commercial use is permitted.
  Like, words mean something.
- koolala 2 days ago
  Lots of people in the LLM world share open-source data sets. It's incredibly helpful for training new models...
- blackeyeblitzar 2 days ago
  “Commercial use permitted” is only sort of true. All of the models right now are encumbered with all sorts of restrictions on what you are allowed to use them for. This is also true for the ones claiming to be “open”. They’re not actually open.
ChrisArchitect 2 days ago
[dupe] https://news.ycombinator.com/item?id=41951421
talldayo 2 days ago
Once again proving that the OSI is a fringe organization that is almost entirely ignored in practical prosecution of Open Source licensing.
- koolala 2 days ago
  I'm glad they have real principles. It's the whole point.