Hello all,

tl;dr—this is a proposal to focus on where we have consensus right now in RC1 so we can meet both a time-based and quality-based release on 27 Oct 2024. To help the discussion, I have drafted an RC2-STRAW to show what changes could look like:

NB: This is a straw draft for the purposes of discussion; it can be modified or thrown away in the course of this discussion. I wasn’t able to fork the original Markdown file as the OSI GitHub repo is non-public, so I recreated the file. The formatting may not be identical, but the content was identical before I began changes, so it should be possible to see the differences between versions.

Specifically I am proposing:

  1. We reduce the data classes covered in the definition to the first and second (Open Data, Open Access). These classes are defined in the FAQ.
  2. We set aside the third and fourth data classes, Restricted (née Obtainable) and Unavailable (née Unshareable) for further discussion after the 1.0 release.
  3. Following the 1.0 release, advocates for solving the problems around Restricted and Unavailable data classes can continue to work for solutions in a 1.1 release.

Regardless of one’s opinion on the matter of data openness and availability, we can still reach consensus on a definition release that is good enough, even if it is stricter than some would prefer.

If a definition is released that both lacks consensus and is less strict, it will not be practical to make the definition stricter in the future. Once the toothpaste has left the tube, it’s not going back in.

10 Likes

@quaid: Thanks for taking the time to draft this; it’s going to take me a little while to digest, but on first pass I think this is something I could get behind.

It looks like you’ve eliminated the two problematic data classes (thanks also for simplifying the nomenclature — the English language will tell you anything if you torture it enough!) while minimising deviation from the release candidate. This is a pragmatic decision that will minimise impact and resistance due to the sunk cost fallacy, but there’s still a lot of cruft in the resulting document when every single word should be “load-bearing”.

Unfortunately, without the diffs there’s additional cognitive load to see clearly what’s changed. I see you can click through from your HackMD document (OSAID-v1RC2-STRAW) to your repo (quaid/OSAID-WIP), but the OSI’s own documents (e.g., osaid-1-0-RC1) reside in a private repo (OpenSourceOrg/osaid). This is a curious decision for an “open” development process, but likely a deliberate one. If they don’t want to open the kimono, then perhaps you could back out commit 86a6bea, apply the original RC1 reverse engineered from HackMD, and then apply your minimalist changes again so we can see precisely what’s changed? I can help you with this if need be.
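
For the record, here is a rough sketch of the sequence I have in mind; the branch and file names below are placeholders of my own, and only commit 86a6bea comes from your repo:

    # Sketch only; branch and file names are placeholders, adjust to the actual repo layout.
    git checkout -b rc2-straw-clean
    git revert --no-edit 86a6bea              # back out the recreated-file commit
    cp ~/Downloads/osaid-1-0-RC1.md OSAID.md  # drop in the RC1 text exported from HackMD
    git add OSAID.md
    git commit -m "Restore RC1 text as exported from HackMD"
    # ...re-apply the minimalist RC2-STRAW edits and commit, then:
    git diff HEAD~1 -- OSAID.md               # shows precisely what changed relative to RC1

That way anyone can review the diff directly rather than eyeballing two documents side by side.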

4 Likes

Very nice and constructive attempt!

At a quick check, your proposal fixes 7 of the 11 issues that still plague RC1.

The remaining issues here are:

  • Implicit or unspecified formal requirements: if ambiguities in the OSAID are to be resolved for each candidate AI system through a formal certificate issued by the OSI, such a formal requirement should be explicitly stated in the OSAID. (reported here and here)
  • OSI as a single point of failure: since each new version of each candidate Open Source AI system worldwide would have to undergo the certification process again, this would turn the OSI into a vulnerable bottleneck in AI development and a target of unprecedented lobbying from industry. (reported here and here)
  • Unknown “OSI-approved terms”: the definition requires code to be distributed under “OSI-approved licenses”, but requires Data Information and Weights to be distributed under “OSI-approved terms”. Nobody knows what these terms are, and this poses “critical legal concerns for developers and businesses”. (reported here)
  • Underspecified “substantial equivalence”: the definition requires a skilled person to be able to build a “substantially equivalent” system out of the components of an Open Source AI system, but it doesn’t explain (not even in the FAQ) what such equivalence means.
    In computer science, two programs are equivalent if and only if, for any given input, they produce the same output in a comparable amount of time, but the OSI has not specified what such equivalence should mean for AI systems. (reported here, here, here)

I suggest replacing “OSI-approved terms” and “OSI-approved licenses” with “licenses that comply with the Open Source Definition”.

This would relieve the OSI of the burden of certifying each version of every AI system.

As for “build a substantially equivalent system”, I suggest replacing it with “recreate a copy of the system”.

Since my previously censored proposal, I have realized that the adjective “exact” is not really required: let’s trust the courts’ wisdom.

This is an industry-accepted term indicating that the license has been through the OSI approval process and placed on the list.

At the very least, the definition should refer to already well-established licensing processes, like the OSI list for software and the FAIR/O definition.

If the checklist is set up in a very clear way, there is no burden on the OSI to certify each and every AI system’s claims; a simple check against its definitions will suffice.

Whatever is decided with the OSAID, I expect that some players will try to create their own versions of software and data licenses and pass them off as open, so the work of proving them wrong will always be there.

2 Likes

Not if the definition contains subjective (e.g., “sufficiently detailed”, “skilled person”, “substantially equivalent system”, etc.) rather than objective tests, like the status quo for software: “Is the source available under OSI-Approved license/s, and does it produce the software when compiled/executed?”

It also needs to be applied by the OSI to every single candidate, which is infeasible due to both scalability and liability (just ask Bruce). That’s why sticking with an arms-length indirect model like the OSD does with OSI-approved licenses would be safer. For example, MOF Class I may be the first candidate to meet (and exceed) the requirements for Open Source AI — complete with its own machine-readable checklist and tool to create/verify same — but more permissive frameworks that skate closer to the definition could also apply and be approved.

Sticking with “OSI-Approved license” until the unlikely event that some other “instrument” proves necessary resolves another item on @Shamar’s issue list, leaving only problems with the normative part of the specification itself. Here every word should carry its weight, and I’d suggest putting @quaid’s RC2 on a diet could fix all the outstanding issues he identified.

I’m feeling a lot more positive about this approach, and while we’ll see soon enough what the others have to say, we’ve got until the 28th to find fertile ground for it.

@samj and all, this is the new repo location (makes sense in this org anyway) and the diff of the RC1 and RC2-STRAW drafts:

1 Like

This would relegate Open Source AI to a niche, and it would be a tactical mistake, as explained on How we passed the AI conundrums – Open Source Initiative.

This would only kick the problem down the road after we’ve discussed this specific issue for almost 3 years without finding a better solution than what’s in RC1.

There is a growing number of endorsers of RC1 already, coming from large parts of different communities: we can’t make everybody happy, and large corporations are unhappy about the Code requirements (if that helps).

I know you’re coming at this with good intentions but I don’t see what this proposal would achieve besides confusing policy makers even more.

It may, but then the ultimate test, as you have yourself made clear, is whether the definitions uphold the four freedoms. There is a tension, possibly irreconcilable, between the idealistic criterion, the four freedoms, and the pragmatic criterion that some existing systems must fit. It may well be that no current system upholds the four freedoms.

As one who has Emacs on his desktop every day, I’m quite sure I bought the analogy :wink:

Well, when the OSD was written, it had 12 years of actual experience behind it, and every part of it was based on real, practical experiences of deciding whether a package should go into Debian or not.

That’s obviously not a luxury we have today, but it is very clear that it is easier to start narrow and allow more to fit later than to throw things out.

I have a problem with the waterfall methodology here; one shouldn’t think that the definition can be written in stone. One should assume further iterations will have to be done going forward. I certainly see a problem with the assumption that there is a data set and that it has to be open, particularly around federated learning, but I also see that the OSI is not able to tackle it at present.

“Kicking problems down the road” is only negative because you are not prepared to iterate; if the process were designed for iteration, it wouldn’t actually be a bad thing.

4 Likes

In that regard, and whatever decision the OSI board takes, we should already be asking developers to check their projects against the OSAID, to get a more detailed view of the size of the “niche” the definition creates and to ask the Open Source community (*) if they would accept the AI systems that were validated as compliant with the four principles of freedom.

This validation was already done at the beginning of the process [1] and by others [2] [3] [4], and it can presently also be asserted, up to a certain point, with the Model Openness Tool (MoT) [5], complemented by the Foundation Model Transparency Index [6], but I do not know whether any analysis of those systems has been made by the Open Source community (*) at large.

references

[1] Towards a definition of “Open Artificial Intelligence”: First meeting recap – Open Source Initiative
[2] Z. Liu et al., ‘LLM360: Towards Fully Transparent Open-Source LLMs’, Dec. 11, 2023, arXiv: arXiv:2312.06550. doi: 10.48550/arXiv.2312.06550.
[3] I. Solaiman, ‘The Gradient of Generative AI Release: Methods and Considerations’, Feb. 05, 2023, arXiv: arXiv:2302.04844. Accessed: Oct. 17, 2024. [Online]. Available: https://arxiv.org/abs/2302.04844
[4] M. White, I. Haddad, C. Osborne, X.-Y. L. Yanglet, A. Abdelmonsef, and S. Varghese, ‘The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency, and Usability in Artificial Intelligence v5’, Oct. 02, 2024, arXiv: arXiv:2403.13784. doi: 10.48550/arXiv.2403.13784.
[5] https://isitopen.ai/
[6] Foundation Model Transparency Index

(*) I know the term “Open Source community” is too vague and open to wild interpretations, but I wouldn’t know who to name, and I’m certain the OSI could bring together experts in the FLOSS field from the many communities of thought and practice to give their opinion.

quaid’s proposal here makes sense to me and addresses my biggest concerns with RC1.

The points in the How we passed the AI conundrums article seem to miss a key issue with RC1, which is that it doesn’t appear to make any meaningful extension to what is possible with existing open source licenses.

It seems we are aligned that there are four elements that comprise the entirety of an AI system:

  1. Training data
  2. Data information - documentation on training data selection, pre-processing, etc.
  3. Code
  4. Parameters - model weights, configuration settings, etc.

It seems that:

  • Data information and code are already easily covered by established open software licenses.
  • RC1 only requires data information, code and model weights.
  • Therefore the only new element that is included in the RC1 definition is parameters - primarily model weights.

What is the purpose of releasing model weights without training data?

From the article the suggestion is that (in combination with data information and code) this would let users train a new model using their own private data.

For any real-world usage (like the bone cancer example), it’s hard to imagine anyone using model weights from a third party without an external trust relationship (e.g. a paid contract). The weights have exactly the same transparency, trust and security issues as downloading and executing binary blobs built by individuals you don’t know.

So it’s hard to come up with a purpose for including the weights (without training data). Any responsible user who wants to use them for anything meaningful should be retraining the model using their own training data. The only example I can come up with is a sandboxed evaluation tool - but it’s still totally unclear what the benefit/meaning of this blob of data being “open source” is.

If we follow the RC1 logic to its conclusion, then model weights clearly aren’t “source”, and (even in combination with data information and code) they aren’t an artifact that could be responsibly put into use.

So why are model weights even included in the definition?

I think the answer to this is that if they are not included, the only thing left is data information and code, and it’s clear these are already easily covered by established open software licenses. This makes the whole purpose of needing an Open Source AI definition moot - it simply wouldn’t be needed.

**Conclusion**

This is the situation it seems RC1 is in now - it has expanded the definition beyond what is already possible in order to have some reason to exist, but the primary artifact it adds doesn’t seem to add practical value or meaning.

So to come full circle - if we want a meaningful Open Source AI definition, it needs to include open training data. Yes, this might mean that models that inherently rely on private data can’t be “Open Source AI”, but they can still have open source data information and code, which are the primary artifacts that have value to others.

As a reminder, for its first decade (or two) the whole idea of FOSS was considered by the vast majority of serious software users to be a completely impractical niche that would never take off. Yet nowadays it is present (to varying degrees) in pretty much every computer system and website.

I think the right choice here is to take a bold step with a meaningful definition, knowing that over time it has potential to develop into something much more valuable to humanity.

2 Likes

this seems to be the crux of the matter, thanks for clarifying that. You can live with such a compromise (no Open Source AI trained on private data), while the OSI, today and in the past, hasn’t accepted limiting Open Source (or free software, however you want to call it) to only a few fields of use. Let’s agree to disagree.

There is a new version of the FAQ to cover this point:

We can close this thread.

3 Likes

3 posts were split to a new topic: Recounting the history of Open Source

Hold on, the possibilities here aren’t at all well understood. There is a difference between opening the floodgates for open-washing and accommodating open source medical models.

If you could agree to a definition now that there could be broad consensus around, and then iterate on that definition once the distinction is understood, we wouldn’t be in a situation where there is a risk of openwashwater flooding.

I fully agree that a pure mandate for open datasets cannot be the end result, for exactly that reason. But I also have concerns that the current definition will bring Open Source into strong disrepute as it will bring security holes, backdoors and poor regulability.

If, OTOH, someone managed to build a system that could do federated learning on sensitive data in such a way that it can be proven that the original data cannot be reconstructed, then certainly, that would be Open Source AI, even in the absence of an open dataset.

By closing the discussion now, you are making sure that we cannot ask and work towards such a goal. I’m sure academia will be working on it, but that needs to be a central part of the understanding of Open Source AI.

1 Like

Great, we agree on something. Let’s focus on that.

You and I have the same concerns, too. I don’t have a crystal ball though so I don’t know exactly what will happen in the future. It’s easily predictable though that delaying the release to hold more debates over a topic we’ve debated for 2+ years will never let us move forward.

We can only release 1.0 based on the RC text and watch the space carefully, ready to change and adapt. It’s the same thing that happened with the OSD and the FSD, by the way (freedom 0 was added later, when RMS realized it was needed, and the OSD had 9 points, not 10).

1 Like

You seem to assume that medical models that would respect the four freedoms (by making training data available) are not possible, but that is not the case. Data from clinical trials are published (see e.g. https://datashare.nida.nih.gov/), and if you analyze these using an AI technique (and make your source code available etc.), you enable anyone to study/modify the model (including on the basis of the training data), and all of this can qualify as ‘open source AI’ in the sense of this consensus proposal. There are plenty of machine learning models out there based on open data (e.g. using ChEMBL data to make predictive models for bio-activity based on modeling of chemical structures) and these are used in the pharma/medical field, so I don’t buy this ‘medical AI’ argument. Of course there are data in the medical field on which there are restrictions, but that does not mean there are not many possibilities to advance the field based on data that are curated and shared.

2 Likes

the OSI, today and in the past, hasn’t accepted limiting Open Source (or free software, however you want to call it) to only a few fields of use

I don’t think this is restricting fields of use, at least not more than current open source licenses.

There are many examples of cases where a developer may be legally or contractually forbidden from releasing open source software - even if they desired to: NDAs, software patents, export or national security controls etc. Many of these are pretty ubiquitous in certain fields - defense, aerospace, semiconductors, medical device software etc. Should the open source definition also be expanded to allow them to call their binaries (or other non-source artifacts) “open source” even if the source can’t ever be made available to anyone else?

3 Likes

If that was directed at me, then it was not intentional. I wrote that as a response and used the example for the sake of argument. The distinction isn’t between medical and other applications; it is between cases where you can reconstruct sensitive information by interrogating the model and cases where you cannot.

1 Like

Hello @grugnog,

The short answer is “definitively no”.

Explaining a little bit more: any restriction on the user’s abilities regarding the code will fail the test of the four freedoms for anything attempting to be classified as open or free software.

What you are describing falls under the current definition of InnerSource, which “takes the lessons learned from developing open source software and applies them to the way companies develop software internally.”