AI_Commons: Open licensing in the age of AI

April 29, 2021

Francesco Vogelezang, Alek Tarkowski

Machine training with openly licensed photographs of faces is a controversial use case of Creative Commons licensed content, identified several years ago. Since the case received media attention in 2019, it has often been raised as an example of an inherent conflict between openness and privacy protection.

In the background, there are growing concerns about the ethics of artificial intelligence and machine learning technologies, especially in relation to biometric data. Emblematically, a 2018 academic publication by Wiley, concerning algorithmic training to distinguish faces of Uyghur people from those of Korean and Tibetan ethnicity, sparked a great deal of controversy about the ethics of academic research on facial recognition training. Likewise, the 2018 revelations that a group of scientists from Stanford University had collected around 12,000 images from a webcam in a San Francisco café to train biometric categorization algorithms signal the inherent tension between users' consent to data processing and facial recognition training.

We are launching the AI_Commons project to find a solution to this issue. By studying this case, we hope to define better how governance of shared resources can balance open sharing with protection of personal data and privacy. We also see this as a case that concerns irrevocability of CC licenses and their unintended uses, and thus the challenge of making the CC licensing stack future-proof.

Finally, this is a case that explores the limits of the traditional approach to sharing, the Open Access Commons, and asks whether for some types of data we need a stronger, more managed commons and data governance. It is also a case that deals with imbalances of power on the web, a Paradox of Open that we have identified.

AI is a challenge for open licensing

These two episodes highlight the need for greater legal clarity on machine training with openly licensed material. On the one hand, this is especially urgent in light of the ever growing need for facial recognition algorithms to be trained and tested on large image datasets. These are increasingly accessed from the Internet, where permissive licenses, such as from, allow the training of facial recognition algorithms with users' images.

On the other hand, the existing legislation does not seem to provide sufficient guidance on the issue. Even in the European Union (EU), the world champion of privacy and data protection, the General Data Protection Regulation (GDPR), despite not granting an obvious legal basis for researchers, lacks adequate clarity, as this issue has not yet been tested in court.

However, this situation might soon change: the newly released EU Proposal for a Regulation on a European Approach for Artificial Intelligence (AI) – the AI Act – sets the scene for greater legal certainty. Specifically, Title IV of the Regulation stipulates important transparency obligations for certain AI systems that interact with natural persons, or that generate content based on the processing of their personal data. In particular, art. 52(2) is crucial here, as it lays down an obligation to notify people when they are exposed to emotion recognition or biometric categorisation systems, other than those available for the public to report a criminal offence.

Considering the potentially destabilizing effects of human impersonation and deception based on the processing of users' data, the measure lays the groundwork for greater user awareness in the context of AI deployment. However, despite its ambitious character, the AI Act does not directly tackle the ethical controversies surrounding the use of permissive copyright licenses for AI facial recognition training. This remains a crucial point of discussion that policymakers have so far failed to address, even though these technologies continue to proliferate well beyond academic research, into defense, military and law enforcement uses. This is why we decided to launch AI_Commons!

AI_Commons – designing policies for machine learning with open datasets

AI_Commons is a new research and policy design process, launched in collaboration with, where we explore in more depth the interconnection between permissive Creative Commons (CC) licenses and facial recognition training. We approached the team after learning about their work on image training datasets, which the Financial Times covers extensively here.

Specifically, is an initiative created by the Berlin-based artist Adam Harvey and the technologist and programmer Jules LaPlace, and is based on the earlier MegaPixels project (2017-2020). This project builds on years of research into image training datasets used for facial recognition and related biometric analysis technologies. After collecting and analyzing hundreds of datasets, Adam and Jules detected a common pattern: millions of images were being downloaded from where permissive licenses allowed the processing of users' biometric data for facial recognition training.

All of this happened without users being fully aware that their images were being processed for such purposes. In other words, if you were a Flickr user and uploaded images containing faces or other biometric information between 2004 and 2020, your photos are highly likely to have been used to train, test, or enhance artificial intelligence surveillance technologies – on, you can check whether your photos were used. Their work even led Microsoft to withdraw a database of 10 million faces, which had been used to train facial recognition systems around the world, including by military researchers and Chinese firms such as SenseTime and Megvii.

As previously mentioned, considering the current policy vacuum, this issue deserves further attention from open movement advocates. A thorough assessment of the intersection of CC licenses with AI training is necessary to better consider the ethical implications of image recognition applications built on openly licensed material. The fact that many users were simply unaware that their data was being processed to train facial recognition technologies requires deep reflection on the suitability of CC licenses for facial recognition training.

A recent survey published by Nature highlights how AI researchers working with facial recognition are currently divided on its ethical implications. Creative Commons previously tried to address the problem with an official statement that shifted responsibility to other legal or ethical organizations. Yet the problem remains open. As AI, and especially facial recognition technologies, continue to proliferate, a new set of ethical standards concerning the use of publicly available images for AI training is becoming more and more urgent. It is time to take action!

Next Steps

AI_Commons aims to facilitate a better understanding of these issues and to provide concrete policy solutions. This joint research and policy design process will investigate the use of open content for AI training. The broader context for this work is defined in our Paradox of Open essay.

Specifically, we are trying to understand how far open licenses are regarded as a signifier that licensed photos can be used to train (facial recognition) algorithms and how such uses relate to the intentions and expectations of licensors. We will present our findings in an inception report, planned for later this year. We are also conducting a study to better understand the attitudes of people who openly share their photographs towards the different ways these are reused, in particular for AI training. Afterwards, we will invite experts and activists from the broad open movement, and those with an interest in regulating machine learning, to join us in a policy design process. Together, we want to identify whether there is a need for change within the open licensing ecosystem, which might consist of new tailor-made licenses, additional guidance or specific regulatory interventions. AI_Commons will finish with concrete recommendations and a public report.

To stay updated on the next developments concerning AI_Commons and similar policy initiatives, subscribe to our monthly newsletter. And if you are interested in collaborating with us in the policy design phase, please get in touch at

AI_Commons syllabus

These 2019 articles by the Financial Times and New York Times provide a general overview of the ethical and normative issues affecting facial recognition research. They list a variety of cases where the training of AI with openly licensed content has raised important privacy considerations. In doing so, they also introduce the work performed by Adam Harvey and Jules LaPlace with There, you can check whether your pictures have been included in datasets which were then used to train facial recognition algorithms.

Ethical considerations

Concerning the various ethical considerations within the AI community, this Nature article written by Richard Van Noorden (2020) explores the normative beliefs and questions at the core of the facial recognition research community. In doing so, it reports the results of a survey investigating ethical attitudes amongst 480 researchers who have published papers on facial recognition.

Using CC licenses for Machine Learning training

In this YouTube video, Brigitte Vezina (2021), CC Director of Policy, discusses whether copyrighted material and Creative Commons-licensed content should be used as input to train Machine Learning systems. The answer is that “it depends…”

Also, Ryan Merkley (former CEO of Creative Commons), in this blogpost, reports CC's official position on the 2019 IBM-Flickr case, where openly licensed material was used to train facial recognition algorithms. Ryan discusses the issue from a use and fair use perspective.

Data protection and copyright considerations

Finally, coming to the various legal tensions at stake, Andres Guadamuz (2019), in this blogpost, discusses the IBM case from a copyright perspective.

On this very point, Margoni and Kretschmer (2021) further analyze the EU copyright regime as they discuss the exceptions provided for Text and Data Mining (TDM) activities. Specifically, as they present the tension between Machine Learning and Copyright, they come to the conclusion that there should be no need for a TDM exception for the act of extracting informational value from protected works.

Catherine Jasserand (2020) discusses this issue from a data protection perspective by explaining when and how the GDPR allows for the processing of biometric data made available in the form of openly licensed content. Further, this blogpost by newtech law discusses the issue of consent from a personality rights perspective.

Last but not least, this paper by Flynn et al. (2020) calls for the international community to take action by implementing user rights for research in the field of Artificial Intelligence. Specifically, the World Intellectual Property Organization (WIPO) is identified as a key international forum to advance multi-stakeholder debate on the issue.

(This text was updated on 21 September 2021.)