Create Your Own Lexicon

Do you want to create your own lexicon for this application? Do you want a lexicon for a different application, such as food-related or sports-related tweets?

Build a Lexicon

The lexicon builder generates a set of terms from a series of tweet collections in which about half of the tweets are representative of a specific domain (e.g., food, sports, politics) and the remaining tweets are not. The resulting lexicon contains the terms found to be discriminative for the target domain:

$ python \
        --terms_scoring pmi \
        --output your_new_lexicon.txt \
        --input your_labeled_collections 
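
To give an intuition for what the pmi scoring option refers to, here is a minimal sketch of pointwise mutual information (PMI) between a term and the on-domain label, computed from tweet-level counts. It is an illustration under stated assumptions, not the builder's actual code: the function name `pmi_scores`, the `min_count` parameter, the whitespace tokenization, and the `(text, is_on_domain)` input format are all hypothetical.

```python
import math
from collections import Counter


def pmi_scores(labeled_tweets, min_count=2):
    """Score terms by PMI with the on-domain class (illustrative sketch).

    labeled_tweets: iterable of (text, is_on_domain) pairs; this input
    format is an assumption, not the builder's real collection format.
    """
    n_total = 0
    n_domain = 0
    term_count = Counter()         # number of tweets containing the term
    term_domain_count = Counter()  # number of on-domain tweets containing it

    for text, is_on_domain in labeled_tweets:
        n_total += 1
        n_domain += bool(is_on_domain)
        # Naive whitespace tokenization; count each term once per tweet.
        for term in set(text.lower().split()):
            term_count[term] += 1
            if is_on_domain:
                term_domain_count[term] += 1

    scores = {}
    for term, count in term_count.items():
        joint = term_domain_count[term]
        # Skip rare terms and terms never seen in on-domain tweets.
        if count < min_count or joint == 0:
            continue
        p_joint = joint / n_total
        p_term = count / n_total
        p_domain = n_domain / n_total
        # PMI(term, domain) = log2( P(term, domain) / (P(term) * P(domain)) )
        scores[term] = math.log2(p_joint / (p_term * p_domain))
    return scores


if __name__ == "__main__":
    tweets = [
        ("pizza for dinner", True),
        ("pizza and pasta", True),
        ("traffic for hours", False),
        ("rain and traffic", False),
    ]
    for term, score in sorted(pmi_scores(tweets).items(), key=lambda kv: -kv[1]):
        print(term, round(score, 3))
```

Terms with a high positive score appear in on-domain tweets more often than chance would predict, which is the sense in which a lexicon term is discriminative for the target domain.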

If you also want to ensure that the terms are frequent, while filtering out terms that co-occur:

$ python \
        --terms_scoring pmi \
        --output your_new_lexicon.txt \
        --input your_labeled_collections \
        --hit_ratio \

A. Olteanu, C. Castillo, F. Diaz, S. Vieweg. CrisisLex: A Lexicon for Collecting and Filtering Microblogged Communications in Crises. In Proceedings of the AAAI Conference on Weblogs and Social Media (ICWSM'14). AAAI Press, Ann Arbor, MI, USA.

Other Methods to Generate Lexical Resources

We would like to host or link to other tools that support the creation of lexical resources for crises or related domains. Please contact us to have a tool included in this list.