Software

Software and Tools we have released

The software and tools on this page are available free of charge for educational, research, and in-house use. For information on commercial use of any of these tools, please contact Columbia Technology Ventures (email: techventures@columbia.edu; phone: +1 212-854-8444).

We also maintain the following online repositories for our research artifacts:

  1. Columbia NLP Hugging Face
  2. Columbia NLP Lab GitHub

The following is a partial list of software and tools developed by Columbia NLP.


Narrative Summarization Corpus

Developed by Jessica Ouyang, Serina Chang, and Kathleen McKeown

Described in Crowd-Sourced Iterative Annotation for Narrative Summarization Corpora. Personal narratives with aligned extractive and abstractive summaries. Available under MIT License.

Download


Gendered Corpus

Developed by Serina Chang and Kathleen McKeown

Described in Automatically Inferring Gender Associations from Language. Online articles written about celebrities and online reviews written by students about professors. Labeled for gender.

Download


Opinionated Claims Corpus

Developed by Sara Rosenthal and Kathleen McKeown

Described in Detecting Opinionated Claims in Online Discussions. Text from Wikipedia and LiveJournal, with sentence-level annotations of opinionated claims and phrase-based sentiment.

Download


Wikipedia Talk Pages Agreement Corpus

Developed by Sara Rosenthal, Jacob Andreas, and Kathleen McKeown

Post-level agreement annotations for conversational analysis.

Download


Create Debate Agreement Corpus

Developed by Sara Rosenthal and Kathleen McKeown

Sentence-level agreement annotations from discussion threads.

Download


Sentence Fusion Corpus

Developed by Kathleen McKeown, Sara Rosenthal, Kapil Thadani, and Coleman Moore

Resource for text consolidation research.

Download


Text-to-text Generation

Developed by Kapil Thadani and Kathleen McKeown

Software for learning compression and fusion models.

GitHub Repository


Quoted Speech Attribution Corpus

Developed by David K. Elson

Over 3,000 instances of quoted speech from six works of 19th- and 20th-century literature. Funded by NSF IIS-0935360.

Licensing Agreement


MADA

Developed by Nizar Habash and Owen Rambow

Morphological annotation tool for Modern Standard Arabic.

More


LCseg

Developed by Michel Galley

Domain-independent discourse segmenter based on lexical cohesion.

Licensing Agreement


LexChainer

Developed by Michel Galley

Locates semantically connected terms in unrestricted documents.

Licensing Agreement


LinkIT

Tool for identifying and relating noun phrases within a document.

Licensing Agreement


Centrifuser

Developed by Min-Yen Kan

Domain- and genre-specific multidocument summarization system focusing on healthcare documents.

Licensing Agreement


Annotated Bibliography Corpus

Developed by Min-Yen Kan

2,000 annotated bibliography entries in XML format with semantic annotations.

Licensing Agreement


FUF

Developed by Michael Elhadad

Functional Unification Formalism language.

Download


CFUF

Developed by Michael Elhadad and Mark Kharitonov

Graph-based implementation of the FUF language, written in C and embedded within a Scheme interpreter.

More


Surge

Developed by Michael Elhadad and Jacques Robin

Syntactic realization grammar for text generation.

Download


CREP

Developed by Duford

Regular expression tool for identifying linguistic patterns.

Licensing Agreement


Segmenter

Developed by Min-Yen Kan

Text segmentation program.

Licensing Agreement


Verber

Developed by Min-Yen Kan, Judith Klavans, and Kathleen McKeown

Groups semantically associated verbs together.

Licensing Agreement