Crisis Collections

This page contains brief descriptions and links to download existing crisis-related collections.

BlackLivesMatterU/T1

Users tweeting about #BlackLivesMatter, labeled by type, race, gender and age.

Data Sources: Twitter
Sampling: hashtag-based query

Check the details »

CrisisLexT26

Tweets from 26 crises, labeled by informativeness, information type and source.

Data Sources: Twitter
Sampling: keyword-based queries

Check the details »

CrisisLexT6

Tweets from 6 crises, labeled by relatedness to the corresponding crisis.

Data Sources: Twitter
Sampling: keyword and geo-based queries

Check the details »

ClimateCovE350

Climate change related events, labeled by relevance, triggers, actions, and news values.

Data Sources: Twitter, GDELT (news)
Sampling: keyword-based queries

Check the details »

SoSItalyT4

Tweets from 4 crises, labeled by the type of information they convey.

Data Sources: Twitter
Sampling: keyword-based queries

Check the details »

ChileEarthquakeT1

Tweets from the Chilean earthquake of 2010, labeled by relatedness.

Data Sources: Twitter
Sampling: keyword-based queries

Check the details »

EnvironmentalPetitionTweets

Petition URLs and tweets containing them.

Data Sources: Twitter
Sampling: URL-based queries

Check the details »

SandyHurricaneGeoT1

Geo-tagged tweets from the Sandy Hurricane.

Data Sources: Twitter
Sampling: geo-based queries

Check the details »

BlackLivesMatterU/T1: Users mentioning #BlackLivesMatter, labeled by type, race, age, gender (March 2016)

This collection includes tweets containing the #BlackLivesMatter hashtag that were posted from April 2012 to May 2015. It also includes about 6000 users who used the hashtag during this period, annotated by type (organizations vs. individuals) and by 3 demographic factors (race, age, gender).

Hashtag | #Tweets | #Users | #Labeled Users
#BlackLivesMatter | 3.54 million | 0.88 million | 6000
  • Contents: ~6000 relevant users, labeled by type (organizations vs. individuals) and 3 demographic criteria (age, race, gender). The collection also contains tweet ids for 3.54 million tweets, representing the tweets containing the #BlackLivesMatter hashtag from April 11, 2012 —when the hashtag was first used— until May 10, 2015.
  • Sampling method: tweets containing the #BlackLivesMatter hashtag.
  • Labels: ~6000 users labeled by crowdsourcing workers according to their type, as organization accounts (e.g. NGO, government agency, media, business) or individuals; individuals are further labeled according to three demographic criteria (age, gender, race).
  • Data format: (tweets) a comma-separated values (.csv) file containing the tweet ID and the time stamp of the tweet; (users) a .csv file containing the user ID, along with the corresponding type, race, gender, and age.
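
As a minimal sketch of working with the users file described above, the following Python snippet tallies labeled users by type. The column names (`user_id`, `type`, `race`, `gender`, `age`), the label values, and the sample rows are assumptions for illustration only; check the header of the actual .csv file before relying on them.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. The real file's header names and label
# values may differ; placeholders stand in for the demographic labels.
SAMPLE_USERS_CSV = """user_id,type,race,gender,age
101,individual,race_a,gender_a,age_a
102,organization,,,
103,individual,race_b,gender_b,age_b
"""


def count_by_column(csv_text, column):
    """Count how many rows carry each value of `column`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


type_counts = count_by_column(SAMPLE_USERS_CSV, "type")
print(type_counts)
```

The same helper can be pointed at any of the other label columns, once their actual names are known.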

If you use the BlackLivesMatterU/T1 collections, please cite:

Annotated data available upon request BlackLivesMatterT1-v1.0.zip (29.1 MB)

ClimateCovE350: Climate events, labeled by relevance, triggers, actions, and news values (April 2015)

This collection includes about 350 events that received medium to high coverage on Twitter, in mainstream media, or both, covering a period of 17 months in 2013 and 2014. The events are labeled by relevance to climate change, triggers, actions, and 6 news values (i.e. extraordinary, unpredictable, high magnitude, negative, conflictive, related to elite persons).

Types (or triggers) | Description | Example
Disaster | Disruption of the functioning of a community that involves widespread human, material, or environmental losses | Typhoon, Tornadoes
Government (all branches) and intergovernmental agencies | Any institution belonging to any government branch (executive, legislative, judicial), or any inter-governmental agency, or any government employee acting in official capacity | Law enforcement agencies, United Nations, Presidency, Ministry
Groups, NGOs, and universities | Any non-profit, nongovernmental group, formally established or not. We include in this category educational and research institutions | GreenPeace, Stanford, WWF
For-profit (excl. media, universities) | Any for-profit organization, including business and corporations but excluding media and universities, which appear in the other categories | Google, Shell
Media | Any media organization | CNN, New York Times, The Guardian, Associated Press
Individuals | Any individual that is not acting as a representative of any of the organization types listed above | Actors, Neil deGrasse Tyson

Sub-types (or actions) | Description | Example
Natural Hazards | Extreme weather and climate events that occur naturally | Typhoon, Drought
Human-Induced Hazards | Hazards having an element of human intent, negligence, error, or involving a failure of a human-made system | Deforestation, Oil Spill
Legal actions | Any action that is legally binding, including new executive orders and new laws, plus any action brought to a court of law, such as lawsuits | New legislation, lawsuits
Publications | Any release of a document to the public, including reports, studies, memoranda, infographics and cartoons | IPCC Reports, Polar bear cartoon
Meetings | Any meeting, conference, convention, etc | IPCC meeting, UN meetings
Other | Other types of actions not belonging to the categories above; in our data this corresponded mostly to campaigns and brief public statements | Campaigns, statements, projects
  • Contents: ~350 climate-related events that have generated spikes of coverage on Twitter, on mainstream media or both from April 2013 to September 2014.
  • Sampling method: via iterative sampling by keywords from tweets included in the 1% sample at the Internet Archive, and by themes/taxonomies from GDELT's news database.
  • Labels: ~350 events were labeled by crowdsourcing workers according to their relevance to climate change (as related, weakly related or borderline, not related), triggers (e.g. disasters, media), actions (e.g. legal actions, publications), and 6 news values (i.e. extraordinary, unpredictable, high magnitude, negative, conflictive, related to elite persons).
  • Data format: comma-separated values (.csv) files containing on each line the event name in the form of a headline, a few sample URLs, and labels for the labeled ones. We also include text (.txt) files with the list of terms for Twitter, and the list of taxonomies and themes for GDELT.

If you use the ClimateCovE350 collection, please cite:

Browse on GitHub ClimateCovE350-v1.0.zip (48 KB)

CrisisLexT26: Tweets from 26 crises, labeled by informativeness, information type and source (Nov 2014)

This collection includes tweets collected during 26 large crisis events in 2012 and 2013, with about 1,000 tweets labeled per crisis for informativeness (i.e. "informative" or "not informative"), information type, and source.

Crisis | Country | Start / Duration | #Tweets | Category | Sub-Category | Type | Development | Spread
2012 Italy earthquakes | Italy | May / 32 days | 7,351 | Natural | Geophysical | Earthquake | Diffused | Instantaneous
2012 Colorado wildfires | US | Jun / 31 days | 4,172 | Natural | Climatological | Wildfire | Diffused | Progressive
2012 Philippines floods | Philippines | Aug / 13 days | 2,950 | Natural | Hydrological | Floods | Diffused | Progressive
2012 Venezuela refinery explosion | Venezuela | Aug / 12 days | 2,736 | Human-induced | Accidental | Explosion | Focalized | Instantaneous
2012 Costa Rica earthquake | Costa Rica | Sep / 13 days | 2,193 | Natural | Geophysical | Earthquake | Diffused | Instantaneous
2012 Guatemala earthquake | Guatemala | Nov / 20 days | 3,261 | Natural | Geophysical | Earthquake | Diffused | Instantaneous
2012 Typhoon Pablo | Philippines | Nov / 21 days | 1,944 | Natural | Meteorological | Typhoon | Diffused | Progressive
2013 Brazil nightclub fire | Brazil | Jan / 16 days | 4,786 | Human-induced | Accidental | Fire | Focalized | Instantaneous
2013 Queensland floods | Australia | Jan / 19 days | 1,223 | Natural | Hydrological | Floods | Diffused | Progressive
2013 Russian meteor | Russia | Feb / 19 days | 8,365 | Natural | Others | Meteorite | Focalized | Instantaneous
2013 Boston bombings | US | Apr / 60 days | 157,454 | Human-induced | Intentional | Bombings | Focalized | Instantaneous
2013 Savar building collapse | Bangladesh | Apr / 36 days | 4,070 | Human-induced | Accidental | Collapse | Focalized | Instantaneous
2013 West Texas explosion | US | Apr / 29 days | 14,505 | Human-induced | Accidental | Explosion | Focalized | Instantaneous
2013 Alberta floods | Canada | Jun / 25 days | 5,887 | Natural | Hydrological | Floods | Diffused | Progressive
2013 Singapore haze | Singapore | Jun / 19 days | 3,639 | Mixed | Others | Haze | Diffused | Progressive
2013 Lac-Megantic train crash | Canada | Jul / 14 days | 2,342 | Human-induced | Accidental | Derailment | Focalized | Instantaneous
2013 Spain train crash | Spain | Jul / 15 days | 3,681 | Human-induced | Accidental | Derailment | Focalized | Instantaneous
2013 Manila floods | Philippines | Aug / 11 days | 2,032 | Natural | Hydrological | Floods | Diffused | Progressive
2013 Colorado floods | US | Sep / 21 days | 1,778 | Natural | Hydrological | Floods | Diffused | Progressive
2013 Australia wildfires | Australia | Oct / 21 days | 1,982 | Natural | Climatological | Wildfire | Diffused | Progressive
2013 Bohol earthquake | Philippines | Oct / 12 days | 2,214 | Natural | Geophysical | Earthquake | Diffused | Instantaneous
2013 Glasgow helicopter crash | UK | Nov / 30 days | 2,558 | Human-induced | Accidental | Crash | Focalized | Instantaneous
2013 LA Airport shootings | US | Nov / 12 days | 2,730 | Human-induced | Intentional | Shootings | Focalized | Instantaneous
2013 NYC train crash | US | Nov / 8 days | 1,066 | Human-induced | Accidental | Derailment | Focalized | Instantaneous
2013 Sardinia floods | Italy | Nov / 13 days | 1,143 | Natural | Hydrological | Floods | Diffused | Progressive
2013 Typhoon Yolanda | Philippines | Nov / 58 days | 38,951 | Natural | Meteorological | Typhoon | Diffused | Progressive
  • Contents: ~250K tweets posted during 26 crisis events in 2012 and 2013, with most events having 2K-4K tweets.
  • Sampling method: by keyword filtering from tweets included in the 1% sample at the Internet Archive.
  • Labels: ~28,000 tweets (about 1,000 in each collection) were labeled by crowdsourcing workers according to informativeness (informative or not informative), information types (e.g. caution and advice, infrastructure damage), and information sources (e.g. governments, NGOs).
  • Data format: comma-separated values (.csv) files containing tweet-ids for the unlabeled tweets, plus the text of the tweets and labels for the labeled ones. Also includes a JSON file with metadata about the collection, including the keywords used to select tweets.
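
The keyword filtering mentioned under "Sampling method" can be approximated as below. This is only a sketch: the JSON structure (a `keywords` list), the case-insensitive substring matching rule, and the sample tweets are all assumptions, not the collection's documented procedure.

```python
import json

# Hypothetical metadata; the real JSON file's structure may differ.
METADATA_JSON = '{"name": "2013_Alberta_floods", "keywords": ["alberta flood", "#abflood"]}'


def matches_any_keyword(text, keywords):
    """Return True if any keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


keywords = json.loads(METADATA_JSON)["keywords"]

# Made-up tweets for illustration.
tweets = [
    "Roads closed because of the Alberta flood",
    "Lovely weather in Calgary today",
    "Stay safe everyone #ABflood",
]

selected = [tweet for tweet in tweets if matches_any_keyword(tweet, keywords)]
print(selected)
```

Plain substring matching is the simplest choice here; a stricter filter might match on whole tokens instead.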

If you use the CrisisLexT26 collection, please cite:

Browse on GitHub CrisisLexT26-v1.0.zip (4.6 MB)

CrisisLexT6: Tweets from 6 crises, labeled by relatedness (June 2014)

This collection includes English tweets across 6 large events in 2012 and 2013, with about 10,000 tweets per event labeled by relatedness to that event (as "on-topic" or "off-topic").

Crisis | Start / Duration | Keyword-based sampling (keywords) | #Tweets (keyword-based) | Geo-based sampling (regions or coordinates) | #Tweets (geo-based)
2012 Sandy Hurricane | 2012-10-28 / 3 days | 4: hurricane, hurricane sandy, frankenstorm, #sandy | 2,775,812 | NY City; Bergen, Ocean, Union, Atlantic, Essex, Cape May, Hudson, Middlesex; Monmouth County, NJ, US | 279,454
2013 Boston Bombings | 2013-04-15 / 5 days | 17: boston explosion, BostonMarathon, boston blast, boston terrorist, boston bomb, boston tragedy, PrayForBoston, boston attack, boston tragic | 3,375,076 | Suffolk and Norfolk Counties, Massachusetts, US | 88,931
2013 Oklahoma Tornado | 2013-05-20 / 11 days | 36: oklahoma tornado, oklahoma storm, oklahoma relief, oklahoma volunteer, oklahoma disaster, #moore, moore relief, moore storm, #ok, #okc | 2,742,588 | long. in [-98.25, -96.75] and lat. in [34.5, 35.75] | 62,237
2013 West Texas Explosion | 2013-04-17 / 11 days | 9: #westexplosion, #westtx, west explosion, waco explosion, texas explosion, tx explosion, texas fertilizer, #prayfortexas, #prayforwest | 508,333 | long. in [-97.5, -96.5] and lat. in [31.5, 32] | 16,033
2013 Alberta Floods | 2013-06-21 / 11 days | 13: alberta flood, #abflood, canada flood, alberta flooding, alberta floods, canada flooding, canada floods, #yycflood, #yycfloods, #yycflooding | 370,762 | Alberta, Canada | 166,012
2013 Queensland Floods | 2013-01-27 / 6 days | 4: #qldflood, #bigwet, queensland flood, australia flood | 5,393 | Queensland, Australia | 27,000
  • Contents: ~60K tweets posted during 6 crisis events in 2012 and 2013.
  • Sampling method: ~10 million tweets in total sampled by keywords and geographical regions or coordinates. Tweets were provided by Twitter's partner Topsy (4 geo-based), or as lists of tweet ids by Twitris v3 (5 keyword-based datasets, thanks to Hemant Purohit) and Twitter's partner GNIP (1 keyword-based, 2 geo-based, thanks to Aron Culotta).
  • Labels: ~60,000 tweets (10,000 in each collection) were labeled by crowdsourcing workers according to relatedness (as "on-topic", or "off-topic").
  • Data format: comma-separated values (.csv) files containing the text of the tweets and labels for the labeled ones.
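
For the events sampled by coordinates, the geo-based filter amounts to a bounding-box test. The sketch below uses the Oklahoma Tornado box from the table above (long. in [-98.25, -96.75], lat. in [34.5, 35.75]); the `BoundingBox` helper, the inclusive boundaries, and the two sample points (approximate coordinates chosen for illustration) are assumptions.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Axis-aligned longitude/latitude box (hypothetical helper type)."""
    min_lon: float
    max_lon: float
    min_lat: float
    max_lat: float

    def contains(self, lon, lat):
        """Return True if the point lies inside the box (edges included)."""
        return (self.min_lon <= lon <= self.max_lon
                and self.min_lat <= lat <= self.max_lat)


# Coordinates taken from the Oklahoma Tornado row of the table.
OKLAHOMA_BOX = BoundingBox(min_lon=-98.25, max_lon=-96.75,
                           min_lat=34.5, max_lat=35.75)

# (longitude, latitude) pairs: one inside the box, one outside it.
points = [(-97.48, 35.34), (-95.99, 36.15)]

inside = [point for point in points if OKLAHOMA_BOX.contains(*point)]
print(inside)
```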

If you use the CrisisLexT6 collection, please cite:

Browse on GitHub CrisisLexT6-v1.0.zip (3.1 MB)

Other Collections

We would like to host and/or provide links to other crisis-related collections. Please contact us to include other collections in this list.

ChileEarthquakeT1: Tweets from the 2010 Chilean earthquake, labeled by relatedness (June 2015)

This collection includes about 2000 tweets in Spanish posted after the Chilean earthquake of 2010, all labeled by relatedness (relevant or not relevant).

Crisis | Year | #Tweets
Chile Earthquake | 2010 | 2187
  • Contents: ~2.1K tweets in Spanish posted during the Chilean earthquake of 2010.
  • Sampling method: tweets sampled by keywords, language and similarity.
  • Labels: ~2.1K tweets were labeled by six annotators (with three independent labels per tweet) according to their relatedness (as TRUE if relevant, and FALSE if not relevant).
  • Data format: comma-separated values (.csv) files containing the text of the tweets and labels, as well as other tweet fields (user, time, etc).
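
With three independent TRUE/FALSE labels per tweet, one common way to derive a single label is a majority vote. The sketch below assumes that aggregation rule, which the collection description does not state; the tweet identifiers and votes are made up.

```python
from collections import Counter


def majority_label(labels):
    """Return the label given by most annotators.

    With three binary votes there is always a strict majority.
    """
    return Counter(labels).most_common(1)[0][0]


# Hypothetical votes: three independent labels per tweet.
votes = {
    "tweet_1": [True, True, False],
    "tweet_2": [False, False, False],
    "tweet_3": [False, True, True],
}

aggregated = {tweet_id: majority_label(labels)
              for tweet_id, labels in votes.items()}
print(aggregated)
```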

If you use the ChileEarthquakeT1 collection, please cite:

coboetal2015_twitter.tar.gz (0.2 MB)

SoSItalyT4: Tweets from 4 crises in Italy, labeled by relatedness and type (June 2015)

This collection includes tweets across 4 different natural disasters that occurred in Italy between 2009 and 2014, with between ~400 and ~3100 tweets per disaster labeled by the type of information they convey (as "damage", "no damage", or "not relevant").

Crisis | Year | #Tweets
Sardegna Flood | 2013 | 976
L'Aquila Earthquake | 2009 | 1,062
Emilia Earthquake | 2012 | 3,170
Genova Floods | 2014 | 434
  • Contents: ~5.6K tweets posted during 4 crisis events (2 earthquakes and 2 floods) in 2009, 2012, 2013 and 2014.
  • Sampling method: tweets sampled by keywords.
  • Labels: ~5.6K tweets (between 400 and 3100 in each collection) were labeled by three annotators according to the type of information they convey (as "damage", "no damage", or "not relevant").
  • Data format: comma-separated values (.csv) files containing the text of the tweets and labels, as well as other tweet fields (user, geo-location, time, etc).

If you use the SoSItalyT4 collection, please cite:

Browse on Dataset Website Cresci-SWDM15-CSV.zip (0.3 MB)

SandyHurricaneGeoT1: Geo-located tweets from the 2012 Sandy Hurricane (June 2015)

This collection includes 6,556,328 geotagged tweets that represent all geotagged tweets from the time and regions impacted by Hurricane Sandy, the largest Atlantic hurricane on record.

Crisis | Year | #Tweets
Sandy Hurricane | 2012 | 6,556,328
  • Contents: tweet ids for 6,556,328 tweets, representing all geotagged tweets in the covered regions from October 22nd, 2012 (the day Sandy formed) until November 2nd, 2012 (the day it dissipated).
  • Sampling method: tweets were geotagged and located in Washington DC or one of 13 US states affected by Sandy: Connecticut, Delaware, Massachusetts, Maryland, New Jersey, New York, North Carolina, Ohio, Pennsylvania, Rhode Island, South Carolina, Virginia, West Virginia. This filter was based on a set of bounding boxes that covered the desired area, which also covered small parts of adjacent states.
  • Labels: no labels. The corpus contains tweets both relevant and irrelevant to Hurricane Sandy (no content based filter was applied).
  • Data format: comma-separated values (.csv) files containing the tweet ID, the time stamp of the tweet, and a field indicating whether the tweet contains the word "sandy".
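
Because no content-based filter was applied, the "sandy" flag is the main hook for separating likely relevant tweets. The following sketch computes, per day, the share of tweets carrying that flag. The column names, the ISO time stamp format, and the `True`/`False` encoding of the flag are assumptions; the actual files may use different conventions.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical rows; header names and value formats are assumed.
SAMPLE_CSV = """tweet_id,timestamp,contains_sandy
1,2012-10-29T18:00:00,True
2,2012-10-29T19:30:00,False
3,2012-10-30T08:15:00,True
"""


def flagged_share_per_day(csv_text):
    """Map each day to the fraction of its tweets flagged as containing 'sandy'."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = datetime.fromisoformat(row["timestamp"]).date().isoformat()
        totals[day] += 1
        if row["contains_sandy"] == "True":
            flagged[day] += 1
    return {day: flagged[day] / totals[day] for day in totals}


shares = flagged_share_per_day(SAMPLE_CSV)
print(shares)
```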

If you use the SandyHurricaneGeoT1 collection, please cite:

Browse on GitHub release.tgz (56.6 MB)

EnvironmentalPetitionTweets: Petition URLs, the tweets containing them and basic stats (May 2016)

This collection includes tweets containing URLs corresponding to various environmental campaigns from Jan 2015 to April 2015. The dataset also contains basic stats about the collected signatures and the petition signature goal.

#Petitions | Time interval | #Tweets
~200 | Jan 1, 2015 to April 14, 2015 | 37,700
  • Contents: URLs corresponding to about 200 petitions from Jan 1, 2015 to April 14, 2015, along with the petition signature goal and number of collected signatures. The collection also contains tweet ids for 37.7 thousand tweets (and retweets) containing the URL of one of the environmental petitions.
  • Sampling method: URL-based filtering.
  • Data format: a comma-separated values (.csv) file containing the petition URL, the tweet URL, the petition's signature goal, and the petition's collected signatures.
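
Since each row pairs a petition's signature goal with its collected signatures, a natural derived quantity is the fraction of the goal reached. The sketch below assumes hypothetical column names and placeholder URLs; it is not the dataset's documented schema.

```python
import csv
import io

# Hypothetical rows; header names and URLs are placeholders.
SAMPLE_CSV = """petition_url,tweet_url,signature_goal,signatures_collected
https://example.org/petition/1,https://example.org/tweet/10,1000,250
https://example.org/petition/1,https://example.org/tweet/11,1000,250
https://example.org/petition/2,https://example.org/tweet/12,500,600
"""


def petition_progress(csv_text):
    """Map each petition URL to collected signatures divided by its goal.

    Petitions with a goal of zero map to None to avoid dividing by zero.
    """
    progress = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        goal = int(row["signature_goal"])
        collected = int(row["signatures_collected"])
        progress[row["petition_url"]] = collected / goal if goal else None
    return progress


progress = petition_progress(SAMPLE_CSV)
print(progress)
```

A value above 1.0 indicates a petition that exceeded its goal.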

If you use the EnvironmentalPetitionTweets collection, please cite:

Browse on GitHub