Mapping initiatives on AI ethics

By Nicole Blommendaal, in collaboration with Lea, Bijal Mehta, Marco-Dennis Moreno and Marta Ziosi.

In this article, you can find a list of interesting initiatives that work to make AI a genuine tool for social good.


There is growing momentum in the academic, private, and public sectors to define the principles by which AI should be governed and designed. While AI systems raise significant ethical concerns, the efforts of developers, governments and policy makers alone are insufficient to address those concerns in all their complexity and in their consequences for the wider population.

In this respect, we think that Civil Society initiatives are key to ensuring that the most fundamental layer of society, citizens, can meaningfully shape the systems that affect them.

We, AI for People, have taken inspiration from the ethical concerns presented in the paper “The Ethics of Algorithms: Mapping the Debate” by Mittelstadt et al. (2016) to start a repository of Civil Society initiatives that are actively working on AI ethical principles. The principles are Accuracy & Robustness, Explainability & Transparency, Bias & Fairness, Privacy, and Accountability.

Here we present a starting repository of what we consider meaningful Civil Society initiatives in the field of AI ethics, working on these principles. Often, one initiative is concerned with more than one principle, so overlap is to be expected. Most importantly, this short article is by no means an exhaustive representation of the Civil Society ecosystem.

Rather, it is a starting point for citizens to find out how to become active in the AI ethics sphere, and an invitation to other Civil Society initiatives to help us expand the repository by adding their own name or the names of other initiatives here.

If you are interested, you can check out our broader efforts on AI ethics by visiting our website section on Ethical AI.

Without any further delay, here are the initiatives we’ve found:

On Accuracy & Robustness

Picture from the IDinsight website: https://www.idinsight.org/innovation-team-projects/data-on-demand
  • Data on Demand is an initiative by IDinsight, a research organisation that describes itself as “helping development leaders maximise their social impact”. Currently focused on India, with a possible future expansion to sub-Saharan Africa, the initiative aims to develop new approaches to survey data collection that make it radically faster and cheaper. Major surveys in India can take a year to implement, and the wait for the resulting data can stretch to four years; Data on Demand aims to significantly shorten this process.
  • The team carries out this mission by building robust targeting tools (sampling frames) that combine electoral databases and satellite imagery, by using a custom machine learning model that automatically detects data quality issues, and by developing a fully automated survey deployment, all aiming to provide a more efficient alternative to the current surveying system. The organisation is also building machine learning algorithms to predict in real time which surveyors are collecting high-quality data and which need to be retrained or let go. The purpose of all of this is to increase the accuracy and efficiency of the surveying system.
  • Reach them via: Email & Twitter

On Transparency & Explainability

  • AlgorithmWatch is a non-profit advocacy organization based in Berlin, Germany, whose work involves keeping watch on, and shedding light on, the ethical impact of algorithmic decision-making (ADM) systems around the world. AlgorithmWatch believes that “the more technology develops, the more complex it becomes”, but that “complexity must not mean incomprehensibility”. By explaining the effects of algorithms to the general public, creating a network of experts from different cultures and disciplines, and assisting in the development of regulation and other oversight institutions, AlgorithmWatch is driven to keep AI and algorithms accountable when they are used in society. New and notable projects include their mapping of COVID-19 ADM systems and their 2020 Automating Society Report, which analyzes ADM applications in Europe’s public sphere.
  • Reach them via: Email, Twitter, Instagram and Facebook

On Bias and Fairness

  • EqualAI is both a nonprofit organization and a movement working to reduce unconscious bias in the development and use of AI. Its mission is to work with companies, policy makers and experts to reduce bias in AI. EqualAI pushes for more diversity in tech teams and addresses biases that already arise in the hiring process. It brings experts, influencers, technology providers and businesses together to write standards for creating unbiased AI, with the aim of securing brand buy-in and commitments to follow them.
  • Reach them via: Email & Twitter

Other bias initiatives are: Black in AI, Data Science Nigeria, Miiafrica, Indigenous AI and Q: The genderless Voice. Other fairness initiatives are: Black Girls Code, Data Justice Lab, AI and Inclusion, Open Sources Diversity and Open Ethics.

On Privacy

  • The World Privacy Forum is a nonprofit, non-partisan public interest research group that operates both nationally (US) and internationally. The organization conducts in-depth research, analysis, and consumer education in the area of data privacy, with an emphasis on pressing and emerging issues. It is one of the only privacy-focused NGOs conducting independent, original, longitudinal research. World Privacy Forum research has provided insight into important issue areas, including predictive analytics, medical identity theft, data brokers, and digital retail data flows, among others. Areas of focus for the World Privacy Forum include technology and data analytics broadly, with particular attention to health care data and privacy, large data sets, machine learning, biometrics, workplace privacy issues, and the financial sector.
  • Reach them via: Email, Twitter and Facebook

Other privacy initiatives are: Big Brother Watch, Future of Privacy Forum and Tor Project.

On Accountability

  • The Algorithmic Justice League (AJL) is a cultural movement and organization working towards equitable and accountable AI. Its mission is to raise public awareness about the impact of AI and to give a voice to the communities affected by it. One of its core pillars is the call for meaningful transparency: the AJL aims for a knowledgeable public that understands what AI can and cannot do. Because it believes individuals should understand in a meaningful way how AI is created and deployed, it also organizes workshops, talks and exhibitions, and heads various projects. The AJL is very active in the field of bias as well, discussed above; its founder, Joy Buolamwini, features in the documentary “Coded Bias”.
  • If you want to learn more about tools and resources that address a lack of transparency, visit their website.
  • Reach them via: Email

Other accountability-related initiatives are: Access Now, Open Rights Group, Digital Freedom Fund, and AWO Agency.



1st AI for People Workshop

The first AI for People Workshop will take place on Saturday 08.08.2020 and Sunday 09.08.2020. The event will be held online and is entirely free. Register here.

AI for People was born out of the idea of shaping Artificial Intelligence technology around human and societal needs. We believe that technology should respect the anthropocentric principle: it should be at the service of people, not vice versa. In order to foster this idea, we need to narrow the gap between civil society and technical experts. This gap is one of knowledge, of action and of tools for change.

In this spirit, we want to share our knowledge with everyone through a hands-on workshop. We are glad to announce 7 speakers offering talks on a diversity of topics about and around Artificial Intelligence. There are 2 basic courses that require little to no programming experience, 4 advanced courses with varying degrees of technical depth, and one invited speaker.

The Schedule:

All times are CET (Central European Time)

The Topics and Speakers:

A hands-on Exercise in Natural Language Processing. The lecturer Philipp Wicke will give a brief introduction to Natural Language Processing (NLP). This lecture is very practice-oriented and Philipp will show an example of topic modeling on real data. More …
Introduction to AI Ethics. Our chair Marta Ziosi will provide a broad introduction to the topics ethically relevant to AI development. Whether you have a technical or social science background, this course will present you with the trade-offs that technology poses to society and provide you with the conceptual tools relevant to the field of AI ethics. More …
Open-Source AI. Our invited speaker Dr. Ibrahim Haddad, Executive Director of the LF AI Foundation, will talk about open source AI. More …
Coding AI. The lecturer Kevin Trebing will give an introduction on how to create AI applications. For this, you will learn the basics of PyTorch, one of the biggest AI frameworks next to TensorFlow. More …
Cultural AI. Maurice Jones will outline how the meaning of AI as a technology is socially constructed and which role cultural factors play in this process. He will give a practical example on how different cultures create different meanings surrounding technologies. More …
Creative AI. The lecturer Gabriele Graffieti will give an introduction on creative AI, what it means for an artificial intelligence to be creative and how to instill creativity into the training process. In this lecture we’ll cover in detail generative models and in particular Generative Adversarial Networks. More …
Continual AI. The lecturer Vincenzo Lomonaco will give a brief introduction to the topic of Continual Learning for AI. This lecture is very practice-oriented and based on slides and runnable code on Google Colaboratory. You will be able to code and play along with the lecture in order to acquire the basic knowledge and skills. More …

The course is free for everyone, yet we ask you to register in advance. We are a non-profit organisation, and all of our work and effort is voluntary. If you like the workshop, we suggest a donation of 15€ for the entire workshop (aiforpeople.org/supporters).

Attendance is limited by the capacity of the virtual classroom and entry is granted on a first-registered, first-served basis, so we ask you to register as soon as possible here: REGISTRATION

If you have any questions, feel free to contact us at: aiapplications@46.101.110.35



The Rise of Sinophobia on Twitter during the Covid-19 Pandemic —  Technical Part 1


Written by Philipp Wicke and Marta Ziosi for AI For People — 23.05.2020

For the conceptual part of this series, follow the link: https://medium.com/ai-for-people/the-rise-of-sinophobia-on-twitter-during-the-covid-19-pandemic-conceptual-part-1-545f81a61619

In this article, we will have a look at the technical underpinnings of the first episode of our series on “Analyzing online discourse for everyone”. In this first part, we will concern ourselves with the data acquisition process. We will outline how to approach a data mining task and how to implement it.

In order to obtain data about an online discourse, we first need to answer a few questions:

  • What is considered online discourse?
  • How can we access online discourse?
  • How much online discourse do we need to observe?

Image [1]: Data Mining. How to obtain information from the internet?

We can consider online discourse to be everything discussed in an online environment. While newspapers, government websites and academic articles are all part of this discourse, we want to focus on a broader, almost unfiltered section of it: Twitter.

Twitter is ideal because it has about 330 million monthly active users (as of Q1 2019). Of these, more than 40 percent use the service on a daily basis, creating about 500 million tweets per day. Furthermore, those tweets are mostly freely accessible! In fact, this provides so much data that we need to create our own filters. Our first step is therefore to see what we can access and what we actually need to access:

We want:

  • Data over a period of time
  • Data relating to a certain topic
  • Data of a considerable proportion

We get:

  • The free Twitter API allows access to the last 7–10 days of tweets
  • Every day sees about 500,000,000 tweets
  • With collection restrictions (rate limits), we can collect about 10,000 tweets per hour.

Now, we need to put together what we want and what we get. In this tutorial we will write Python code that has three simple requirements: Python 3.7+, the Tweepy package and a Twitter account.

The Tweepy library connects Python with the Twitter API.

There are a lot of great tutorials that explain how to use Tweepy, how to create Twitter bots and how to obtain data from Twitter and use it. Therefore, we will keep the description of the tool itself short and instead explain in detail how we use it.

We go to the Twitter Developer page and log in with our Twitter credentials (you might need to apply for a developer account, which is fairly easy and quickly done). Next, we have to create an App and generate our access credentials. Those credentials will be the key that connects the Tweepy API with the Python program. Make sure to store a copy of the access tokens:

Generating Twitter API key and Access Tokens for Tweepy on the twitter-dev website.

Now, we have everything to start writing our Python code. First of all, we need to import the Tweepy package (install with “pip install tweepy”) and we will have to write our access credentials and tokens into the code:

import tweepy as tw

# Replace the placeholders below with your own credentials from the Twitter developer dashboard
consumer_key = "writeYourOwnConsumerKeyHere12345"
consumer_secret = "writeYourOwnConsumerSecretHere12345"
access_token = "writeYourOwnAccessTokenHere12345"
access_token_secret = "writeYourOwnAccessTokenSecretHere12345"

With the right keys and tokens, we can now authenticate our access to Twitter through our Python code:

# Twitter authentication
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

What is happening in the code above? We use Tweepy to create an authentication handler and store it in “auth”, using the consumer key and consumer secret. Basically, we define through which door we want to access Twitter, and with “auth.set_access_token(…)” we provide the key to that door. The open door is then stored, together with certain parameters, in “api”. One of those parameters is the door itself (“auth”); another is “wait_on_rate_limit=True”. The Tweepy documentation tells us that this parameter decides “Whether or not to automatically wait for rate limits to replenish”. Why?

Free Twitter access comes with a rate limit, i.e. you can only download a certain number of tweets before you need to wait, otherwise Twitter kicks you out. But “Rate limits are divided into 15 minute intervals.” When we set wait_on_rate_limit to True, our program automatically waits out that 15-minute interval whenever we exceed the rate limit, so Twitter does not lock us out and data collection resumes on its own!
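
To see what this flag spares us from doing by hand, here is a minimal sketch of manual rate-limit handling, assuming Tweepy 3.x (where exceeding the limit raises tweepy.RateLimitError); the wrapper name limit_handled is our own illustrative choice, not part of the article’s script:

import time
import tweepy as tw

def limit_handled(cursor):
    # Manually do what wait_on_rate_limit=True does for us: whenever Twitter
    # reports that the 15-minute quota is used up, sleep until it resets.
    while True:
        try:
            yield next(cursor)
        except tw.RateLimitError:
            time.sleep(15 * 60)
        except StopIteration:
            return

# Hypothetical usage, with an api object created WITHOUT wait_on_rate_limit=True:
# for tweet in limit_handled(tw.Cursor(api.search, q="#covid19").items()):
#     print(tweet.text)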

Now, we need to specify the parameters for our search:

  • start_day: the date from which to start crawling data, in the format YYYY-MM-DD. It can be at most 7 days in the past.
  • end_day: the date on which to stop crawling data, in the format YYYY-MM-DD. If you want to crawl a single day, set this to the day after start_day.
  • amount: how many tweets you want to collect. Maybe take 15 at the beginning to test everything.
  • label: in order to store the data, you need to label it, otherwise you'll overwrite it every time!
  • search_words: a string that combines your search words with AND or OR connectors. We will look at an example of this.

start_day = "2020–05–23"
end_day = "2020–05–24"
amount = 50
# stores the data here as: test_2020–04–06_15
label = "test_"+start_day+"_n"+str(amount)
search_words = "#covid19 OR #coronavirus OR #ncov2019 OR #2019ncov OR #nCoV OR #nCoV2019 OR #2019nCoV OR #COVID19 -filter:retweets"

The parameters above will collect a test sample of 50 (amount) tweets from the 23rd of May to the 24th of May, so just 50 tweets from one day. Those tweets will be stored in the file “test_2020-05-23_n50”. We now apply the first filter, otherwise we would just collect any sort of tweet from that day. Our search words are common hashtags of the Covid-19 discourse: #covid19, #coronavirus, etc. Furthermore, we want to look at tweets and not retweets, so we exclude retweets with “-filter:retweets”. Now, we can start to obtain the data and run:

tweets = tw.Cursor(api.search,
                   tweet_mode='extended',
                   q=search_words,
                   lang="en",
                   since=start_day,
                   until=end_day).items(amount)

Here, we additionally set the language to “en” (English) and the tweet mode to “extended”, which makes sure the entire tweet text is stored. The rest of the parameters are as we defined them before. In the next two lines of code, we simply reformat the obtained tweets into a list and print the first tweet just to have a look:

tweets = [tweet for tweet in tweets]
print(tweets[0])
Status(_api=<tweepy.api.API object at 0x0FFF37D0>, _json={'created_at': 'Mon Apr 06 23:59:59 +0000 2020', 'id': 124367356449479936, 'id_str': '124367356449479936', 'full_text': 'we will get through this together #Covid19', 'truncated': False, 'display_text_range': [0, 42], 'entities': {'hashtags': [{'text': 'Covid19', 'indices': [34, 42]}], 'symbols': [], 'user_mentions': [], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 45367I723, 'id_str': '45367I723', 'name': 'John Doe', 'screen_name': 'jodoe', 'location': '', 'description': '', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 188, 'friends_count': 611 …

As you can see, this is a ton of information: number of retweets, number of likes, coordinates, profile background image URL… everything about that single tweet! That is why we now filter for user.id and full_text. If you want, you can also access other information such as the location, but for now we are not interested in that. Have a look at the following code; its explanation follows below:

first_entry = None
last_entry = None
all_user_ids = []
raw_tweets = []

for tweet in tweets:

    # Remember when the very first tweet was collected
    if not first_entry:
        first_entry = tweet.created_at.strftime("%Y-%m-%d %H:%M:%S")
        print("First tweet collected at: " + str(first_entry))
        print("-------------------------------------------")

    # Only keep one tweet per user
    if tweet.user.id not in all_user_ids:
        all_user_ids.append(tweet.user.id)
        full_tweet = tweet.full_text.replace('\n', '')
        if full_tweet:
            print("User #" + str(tweet.user.id) + " : ")
            print(full_tweet + "\n------------")
            raw_tweets.append(full_tweet)

# After the loop, "tweet" still refers to the last tweet we processed
last_entry = tweet.created_at.strftime("%Y-%m-%d %H:%M:%S")
print("Last tweet collected at: " + str(last_entry))

This code creates an empty list for all the user ids and then iterates over all tweets. For the first tweet it records the created_at field as first_entry (because we initially set first_entry to None). It then checks whether tweet.user.id is not yet in the list all_user_ids, which means it only keeps tweets from users we have not seen before. Why did we do that?

A scientific analysis of the fake news spread during the 2016 US presidential election showed that about 1% of users accounted for 80% of the fake news exposure, and the authors report that other research suggests 80% of all tweets can be linked to the top 10% of most active users. Therefore, in order to capture a diverse set of opinions that cannot be traced back to just a few users, we filter out multiple tweets from the same user.

Our code then appends the user id (as we have now seen the user) and stores the full tweet. The replace statement (replace('\n', '')) just gets rid of line breaks in tweets. The if full_tweet check is there because we could get an empty tweet (which occasionally happens due to a quirk of the API). We print the full tweet (the "\n------------" is a line break and some dashes so it looks nicer when printed) and store each full tweet in a list called raw_tweets. Finally, once we have reached the very last tweet, we access its created_at field to get its date of creation. Our script will then print some of the collected tweets, which could look like this:

First tweet collected at: 2020-04-06 23:59:59
-------------------------------------------
User #20I3120348:
we will get through this together #Covid19
------------
User #203480I312:
They fear #Trump2020. They created this version of #coronavirus Just to get him out of office. Looks like the plan worked in the UK...
------------
User #96902235II37193185:
Like millions of others I don't see eye to eye with Boris Johnson but I hope he pulls through. Why? Because I'm human. I wouldn't wish this on my worst enemy. I've witnessed someone die of pneumonia and believe me it's NOT pretty. #GetWellBoris #PrayForBoris #COVID19

And that is it for the first part! We have now collected 50 tweets from the 23rd of May 2020 that relate to the Covid-19 discourse. Hopefully, it is clear how this script can be extended to create an entire corpus of thousands of tweets over multiple days. Such a corpus has thankfully been created by various researchers, including ourselves. With this corpus we can then start to investigate the relation between the Covid-19 discourse and Sinophobia.
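
As one possible way to extend the script, here is a minimal sketch, our own illustration rather than the exact code behind the article’s corpus, that repeats the collection for several days and saves each day under the labelling scheme defined earlier; the helper name collect_day, the plain-text output format and the .txt extension are assumptions:

import datetime
import tweepy as tw

def collect_day(api, search_words, day, amount):
    # Collect up to `amount` English, non-retweet tweets for a single day
    # and return their texts, keeping one tweet per user as in the loop above.
    next_day = (datetime.date.fromisoformat(day) + datetime.timedelta(days=1)).isoformat()
    cursor = tw.Cursor(api.search, tweet_mode='extended', q=search_words,
                       lang="en", since=day, until=next_day).items(amount)
    seen_users, texts = set(), []
    for tweet in cursor:
        if tweet.user.id not in seen_users:
            seen_users.add(tweet.user.id)
            text = tweet.full_text.replace('\n', '')
            if text:
                texts.append(text)
    return texts

# Collect several consecutive days and store each day in its own file,
# named like the label above (e.g. test_2020-05-23_n50), here with a .txt extension.
for day in ["2020-05-23", "2020-05-24", "2020-05-25"]:
    raw_tweets = collect_day(api, search_words, day, amount=50)
    with open("test_" + day + "_n50.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(raw_tweets))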

In the next article of this series, we’ll look at some Natural Language Processing, Data Analysis and Topic Modeling to assess the data we have collected!

References

[1] Bucket-wheel excavator 286, Inden surface mine, Germany; the bucket-wheel is under repair. 10. April 2016. https://pixabay.com/en/open-pit-mining-raw-materials-1327116/ pixel2013 (Silvia & Frank) Edit: Cropped and overlay of numbers. CC0 1.0.

