Software - Natural Language Processing @ Columbia University

Software and Tools

Software and tools on this page are available free of charge for educational, research, and in-house use. For information on commercial use of any of these tools, please contact Columbia Technology Ventures (email: techventures@columbia.edu; phone: (+1) 212-854-8444).

We additionally have the following online repositories for our research artifacts:

Columbia NLP Huggingface
Columbia NLP Lab GitHub

The following is a partial list of software and tools developed by Columbia NLP:

Narrative Summarization Corpus
Gendered Corpus
Opinionated Claims Corpus
Wikipedia Talk Pages Agreement Corpus
Create Debate Agreement Corpus
Sentence Fusion Corpus
Text-to-text Generation
Quoted Speech Attribution Corpus
MADA
LCseg
LexChainer
LinkIT
Centrifuser
Annotated Bibliography Corpus
FUF
CFUF
Surge
CREP
Segmenter
Verber

Narrative Summarization Corpus

Developed by Jessica Ouyang, Serina Chang, and Kathleen McKeown

Described in Crowd-Sourced Iterative Annotation for Narrative Summarization Corpora. Personal narratives with aligned extractive and abstractive summaries. Available under MIT License.

Gendered Corpus

Developed by Serina Chang and Kathleen McKeown

Described in Automatically Inferring Gender Associations from Language. Online articles written about celebrities and online reviews written by students about professors. Labeled for gender.

Opinionated Claims Corpus

Developed by Sara Rosenthal and Kathleen McKeown

Described in Detecting Opinionated Claims in Online Discussions. Data from Wikipedia and LiveJournal, with sentence-level annotations of opinionated claims and phrase-based sentiment.

Wikipedia Talk Pages Agreement Corpus

Developed by Sara Rosenthal, Jacob Andreas, and Kathleen McKeown

Post-level agreement annotations for conversational analysis.

Create Debate Agreement Corpus

Developed by Sara Rosenthal and Kathleen McKeown

Sentence-level agreement annotations from discussion threads.

Sentence Fusion Corpus

Developed by Kathleen McKeown, Sara Rosenthal, Kapil Thadani, and Coleman Moore

Resource for text consolidation research.

Text-to-text Generation

Developed by Kapil Thadani and Kathleen McKeown

Software for learning compression and fusion models.

GitHub Repository

Quoted Speech Attribution Corpus

Developed by David K. Elson

Over 3,000 instances of quoted speech from 6 works of 19th- and 20th-century literature. Funded by NSF IIS-0935360.

Licensing Agreement

MADA

Developed by Nizar Habash and Owen Rambow

Morphological annotation tool for Modern Standard Arabic.

LCseg

Developed by Michel Galley

Domain-independent discourse segmenter based on lexical cohesion.

Licensing Agreement

LexChainer

Developed by Michel Galley

Locates semantically connected terms in unrestricted documents.

Licensing Agreement

LinkIT

Tool for identifying and relating noun phrases within a document.

Licensing Agreement

Centrifuser

Developed by Min-Yen Kan

Domain- and genre-specific multidocument summarization system focusing on healthcare documents.

Licensing Agreement

Annotated Bibliography Corpus

Developed by Min-Yen Kan

2,000 annotated bibliography entries in XML format with semantic annotations.

Licensing Agreement

FUF

Developed by Michael Elhadad

Functional Unification Formalism language.

CFUF

Developed by Michael Elhadad and Mark Kharitonov

Graph-based implementation of the FUF language, written in C and embedded within a Scheme interpreter.

Surge

Developed by Michael Elhadad and Jacques Robin

Syntactic realization grammar for text generation.

CREP

Developed by Duford

Regular expression tool for identifying linguistic patterns.

Licensing Agreement

Segmenter

Developed by Min-Yen Kan

Text segmentation program.

Licensing Agreement

Verber

Developed by Min-Yen Kan, Judith Klavans, and Kathleen McKeown

Groups semantically associated verbs together.

Licensing Agreement