Software
Software and Tools we have released
The software and tools on this page are available free of charge for educational, research, and in-house use. For information on commercial use of any of these tools, please contact Columbia Technology Ventures (email: techventures@columbia.edu; phone: +1 212-854-8444).
We also maintain the following online repositories for our research artifacts:
- Columbia NLP Hugging Face
- Columbia NLP Lab GitHub
Here is an incomplete list of software and tools Columbia NLP has developed.
- Narrative Summarization Corpus
- Gendered Corpus
- Opinionated Claims Corpus
- Wikipedia Talk Pages Agreement Corpus
- Create Debate Agreement Corpus
- Sentence Fusion Corpus
- Text-to-text Generation
- Quoted Speech Attribution Corpus
- MADA
- LCseg
- LexChainer
- LinkIT
- Centrifuser
- Annotated Bibliography Corpus
- FUF
- CFUF
- Surge
- CREP
- Segmenter
- Verber
Narrative Summarization Corpus
Developed by Jessica Ouyang, Serina Chang, and Kathleen McKeown
Described in "Crowd-Sourced Iterative Annotation for Narrative Summarization Corpora." Personal narratives with aligned extractive and abstractive summaries. Available under the MIT License.
Gendered Corpus
Developed by Serina Chang and Kathleen McKeown
Described in "Automatically Inferring Gender Associations from Language." Online articles written about celebrities and online reviews written by students about professors. Labeled for gender.
Opinionated Claims Corpus
Developed by Sara Rosenthal and Kathleen McKeown
Described in "Detecting Opinionated Claims in Online Discussions." Wikipedia and LiveJournal data with sentence-level annotations of opinionated claims and phrase-based sentiment.
Wikipedia Talk Pages Agreement Corpus
Developed by Sara Rosenthal, Jacob Andreas, and Kathleen McKeown
Post-level agreement annotations for conversational analysis.
Create Debate Agreement Corpus
Developed by Sara Rosenthal and Kathleen McKeown
Sentence-level agreement annotations from discussion threads.
Sentence Fusion Corpus
Developed by Kathleen McKeown, Sara Rosenthal, Kapil Thadani, and Coleman Moore
Resource for text consolidation research.
Text-to-text Generation
Developed by Kapil Thadani and Kathleen McKeown
Software for learning compression and fusion models.
Quoted Speech Attribution Corpus
Developed by David K. Elson
Over 3,000 instances of quoted speech from six works of 19th- and 20th-century literature. Funded by NSF IIS-0935360.
MADA
Developed by Nizar Habash and Owen Rambow
Morphological annotation tool for Modern Standard Arabic.
LCseg
Developed by Michel Galley
Domain-independent discourse segmenter based on lexical cohesion.
LexChainer
Developed by Michel Galley
Locates semantically connected terms in unrestricted documents.
LinkIT
Tool for identifying and relating noun phrases within a document.
Centrifuser
Developed by Min-Yen Kan
Domain- and genre-specific multidocument summarization system focusing on healthcare documents.
Annotated Bibliography Corpus
Developed by Min-Yen Kan
2000 annotated bibliography entries in XML format with semantic annotations.
FUF
Developed by Michael Elhadad
Functional Unification Formalism language.
CFUF
Developed by Michael Elhadad and Mark Kharitonov
Graph-based implementation of the FUF language, written in C and embedded within a Scheme interpreter.
Surge
Developed by Michael Elhadad and Jacques Robin
Syntactic realization grammar for text generation.
CREP
Developed by Duford
Regular expression tool for identifying linguistic patterns.
Segmenter
Developed by Min-Yen Kan
Text segmentation program.
Verber
Developed by Min-Yen Kan, Judith Klavans, and Kathleen McKeown
Groups semantically associated verbs together.