Putting ethical principles at the core of the research lifecycle

Machine learning research and applications have the potential to lead to significant benefits, but they also come with risks such as data privacy issues, algorithmic biases, automation risks, and malicious uses. To mitigate these risks, it is essential to formalize the ethical principles that guide our research and development.

Purpose of the Ethical Charter

The purpose of this document is to formalize the ethical principles that we, the multimodal learning group at Hugging Face, adopt for our project. By defining these principles at the beginning of the project, we make them core to our machine learning lifecycle. This document is the result of discussions led by the multimodal learning group at Hugging Face, with the contributions of multiple experts in ethics operationalization, data governance, and personal privacy.

Limitations of this Ethical Charter

This document is a work in progress and reflects our thinking as of May 2022. There is neither a consensus on nor an official definition of "ethical AI," and our considerations are very likely to change over time. When we update this document, we will make the changes directly here, provide the rationale for them, and track the history of updates through GitHub. This document is not intended to be a source of truth about best practices for ethical AI. Imperfect as it is, we believe that thinking about the impact of our research, the potential harms we foresee, and the strategies we can take to mitigate these harms is a step in the right direction for the machine learning community.

Content Policy

Having studied current state-of-the-art multimodal systems, we foresee several potential misuses of the technologies we aim to develop as part of this project. We provide guidelines on some of the use cases we ultimately want to prevent:

Promotion of content and activities which are detrimental in nature, such as violence, harassment, bullying, harm, hate, and all forms of discrimination.
Violation of regulations, privacy, copyrights, human rights, cultural rights, fundamental rights, laws, and any other form of binding documents.
Generating personally identifiable information.
Generating false information without any accountability and/or with the purpose of harming and triggering others.
Incautious usage of the model in high-risk domains - such as medical, legal, finance, and immigration - that can fundamentally damage people’s lives.

Values for the Project

We adopt the following values for our project:

Be transparent: We are transparent and open about the intent, sources of data, tools, and decisions. By being transparent, we expose the weak points of our work to the community and thus are responsible and can be held accountable.
Share open and reproducible work: Openness touches on two aspects: the processes and the results. We believe it is good research practice to share precise descriptions of the data, tools, and experimental conditions. Research artifacts, including tools and model checkpoints, must be accessible - for use within the intended scope - to all without discrimination (e.g., religion, ethnicity, sexual orientation, gender, political orientation, age, ability). We define accessibility as ensuring that our research can be easily explained to an audience beyond the machine learning research community.
Be fair: We define fairness as the equal treatment of all human beings. Being fair implies monitoring and mitigating unwanted biases that are based on characteristics such as race, gender, disabilities, and sexual orientation. To limit negative outcomes as much as possible, especially outcomes that impact marginalized and vulnerable groups, reviews of unfair biases - such as racism in predictive policing algorithms - should be conducted on both the data and the model outputs.
Be self-critical: We are aware of our imperfections and we should constantly look out for ways to better operationalize ethical values and other responsible AI decisions. For instance, this includes better strategies for curating and filtering training data. We should not overclaim or entertain spurious discourses and hype.
Give credit: We should respect and acknowledge people's work through proper licensing and credit attribution.

We note that some of these values can sometimes be in conflict (for instance being fair and sharing open and reproducible work, or respecting individuals' privacy and sharing datasets), and emphasize the need to consider risks and benefits of our decisions on a case-by-case basis.
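
To make the idea of reviewing model outputs for unfair biases more concrete, here is a minimal sketch of one possible check. It is not the procedure used by the group. It compares the rate of positive predictions across groups, a quantity often called the demographic parity gap, which is only one of many fairness measures. The function names, the group labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Return the share of positive (1) predictions within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(predictions, groups):
    """Return the largest difference in positive rate between any two groups."""
    rates = positive_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Toy, made-up predictions and group labels (illustrative only)
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = positive_rate_by_group(preds, groups)
gap = demographic_parity_gap(preds, groups)
print("Positive rate per group:", rates)
print("Demographic parity gap:", gap)
```

A large gap does not by itself establish that a model is unfair, and a small one does not establish that it is fair. A number like this is a prompt for the case-by-case review described above, not a replacement for it.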

Implementing Ethical Principles in Practice

To implement these ethical principles in practice, we will:

Conduct regular reviews of our research and development to ensure that we are meeting our ethical standards.
Implement mechanisms for reporting and addressing any concerns or issues related to our research and development.
Provide training and education to our team members on ethics and responsible AI practices.
Engage with external experts and stakeholders to ensure that our research and development are aligned with best practices and industry standards.

By implementing these ethical principles and practices, we aim to ensure that our research and development are conducted in a responsible and ethical manner, and that we contribute to the development of AI that benefits society as a whole.

Conclusion

The multimodal learning group at Hugging Face is committed to conducting research and development in a responsible and ethical manner. We believe that adopting ethical principles and practices is essential for keeping our work aligned with the needs and values of society, and for contributing to the development of AI that benefits society as a whole.

Code Blocks

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the dataset (expects a 'column' feature and a 'target' label)
df = pd.read_csv('data.csv')

# Preprocess the data: normalize 'column' to lowercase strings
df['column'] = df['column'].astype(str).str.lower()

# Separate features from the label; one-hot encode string columns
# so that the classifier receives numeric inputs only
X = pd.get_dummies(df.drop('target', axis=1))
y = df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a model on the training data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate the model on the testing data
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))

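const fs = require('fs');
// Assumption: the ml-random-forest npm package stands in for scikit-learn,
// and the CSV is parsed by hand instead of through a DataFrame library.
const { RandomForestClassifier } = require('ml-random-forest');

// Load the dataset (assumes a header row, no quoted commas, numeric features
// apart from 'column', and integer class labels in 'target')
const lines = fs.readFileSync('data.csv', 'utf8').trim().split(/\r?\n/);
const [header, ...rows] = lines.map(line => line.split(','));
const targetIdx = header.indexOf('target');
const columnIdx = header.indexOf('column');

// Preprocess the data: lowercase 'column', then encode each distinct value as an integer
const categories = [...new Set(rows.map(row => row[columnIdx].toLowerCase()))];
const encode = (value, i) =>
  i === columnIdx ? categories.indexOf(value.toLowerCase()) : Number(value);
const X = rows.map(row => row.map(encode).filter((_, i) => i !== targetIdx));
const y = rows.map(row => Number(row[targetIdx]));

// Split the data into training and testing sets (first 80% / last 20%, no shuffling)
const cut = Math.floor(rows.length * 0.8);
const X_train = X.slice(0, cut), X_test = X.slice(cut);
const y_train = y.slice(0, cut), y_test = y.slice(cut);

// Train a model on the training data
const model = new RandomForestClassifier({ nEstimators: 100, seed: 42 });
model.train(X_train, y_train);

// Evaluate the model on the testing data
const y_pred = model.predict(X_test);
const correct = y_pred.filter((label, i) => label === y_test[i]).length;
console.log('Accuracy:', correct / y_test.length);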