AI and the Commons

April 8, 2024

Towards a Books Data Commons for AI Training

This white paper describes ways of building a books data commons: a responsibly designed, broadly accessible data set of digitized books to be used in training AI models.

April 4, 2024

AI and the Commons: Participation in AI Governance

During the last AI and the Commons call, we spoke to Tim Davies about including the public in AI governance. In his presentation, Tim talked about Connected by Data’s People’s Panel on AI.

March 27, 2024

Seeing like an algorithm: A closer look at LAION 5B

Researchers at Knowing Machines have published Models all the way down, a visual investigation that takes a detailed look at the construction of the LAION 5B dataset “to better understand its contents, implications, and entanglements.” The investigation provides detailed insight into the internal structure of, and the strategies used to build, one of the largest and most influential datasets used to train the current crop of image generation models. Among other things, the researchers show that the dataset's curators relied heavily on algorithmic selection to assemble the dataset, and as a result…

…there is a circularity inherent to the authoring of AI training sets. [...] Because they need to be so large, their construction necessarily involves the use of other models, which themselves were trained on algorithmically curated training sets. [...] There are models on top of models, and trainings sets on top of training sets. Omissions and biases and blind spots from these stacked-up models and training sets shape all of the resulting new models and new training sets.

One of the key takeaways from the researchers (who, for all their critical observations, give LAION credit for releasing the dataset as open data) is that we need more dataset transparency to understand the structural configuration of today's generative AI systems. This is very much in line with what we’ve been advocating for in the context of the AI Act and will continue to push for in the implementation of the Act.

Screenshot from Models all the way down, © Knowing Machines

March 20, 2024

Common Corpus public domain data set released

A group of AI researchers coordinated by the French start-up Pleias wants to challenge the belief that you need copyrighted materials to train an LLM that competes with the models developed by leading AI companies. Yesterday, they released what has been dubbed the largest open AI training data set consisting entirely of public-domain texts. The collection is called “Common Corpus” and is available on Hugging Face for download. The resource is multilingual – besides English, it includes the largest open collections in French, German, Spanish, Dutch, and Italian, as well as collections for other languages.

Training data is a key resource for developing AI systems. Until very recently, it was commonly believed that LLMs, like those behind popular services such as ChatGPT or Bard, could not be trained without relying on copyrighted content. If this is the case, access to high-quality data may continue to be a significant barrier for independent AI developers seeking to compete in the LLM market.

Datasets consisting only of public domain texts have significant limitations. The most important is that they miss more contemporary information, because they consist of historical sources or older publications whose copyrights have already expired. It remains to be seen whether public domain datasets can indeed compete with datasets containing more contemporary content that is protected by copyright.

March 7, 2024

AI Act fails to set meaningful dataset transparency standards for open source AI

There is an urgent need to set a clear standard for transparency with regard to AI training and access to training datasets. European policymakers avoided answering the questions that could help ensure openness in the context of AI development.

March 6, 2024

AI and the Commons: Open Data Commons Licences and the issues of data governance

The post summarizes the AI and the Commons Community call focused on a recent paper by Melanie Dulong de Rosnay and Yaniv Benhamou.

March 4, 2024

Friction in AI Governance: Performing Participation

In this article, Nadia takes a closer look at a few popular participation practices and debunks them.
