Corpus analysis tools review

1. PML Tree Query - Cons:

  • Requires a Treex installation; the web version is suitable only for teaching and demonstration
  • Limited publicly available treebanks
  • Following a strict Universal Dependencies structure could be difficult for CDLI data

2. Corpus Workbench / Ziggurat - Pros:

  • The IMS Open Corpus Workbench (CWB) is a collection of open-source tools for managing and querying large text corpora with linguistic annotations
  • It is unclear whether the new version added Unicode support
  • Not supported on Windows

3. XML DB with own SVG frontend -

  • SVGs are a good option
  • Dot / Graphviz is a promising approach but would involve a persistent XML DB
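
To illustrate the Dot / Graphviz idea, the following is a minimal sketch of how a small dependency tree could be turned into DOT source, which Graphviz can then render to SVG (for example with `dot -Tsvg`). The function name `tree_to_dot`, the tokens and the relation labels are invented for illustration and are not taken from CDLI data or from any existing MTAAC code.

```python
def tree_to_dot(tokens, edges):
    """Return DOT source for a dependency tree.

    tokens: list of word forms, indexed from 0
    edges:  list of (head_index, dependent_index, relation) tuples
    """
    lines = ["digraph dependency_tree {", "  node [shape=box];"]
    for i, form in enumerate(tokens):
        # Escape double quotes so the label stays a valid DOT string.
        safe = form.replace('"', '\\"')
        lines.append(f'  n{i} [label="{safe}"];')
    for head, dep, rel in edges:
        lines.append(f'  n{head} -> n{dep} [label="{rel}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-token clause; indices refer to positions in `tokens`.
tokens = ["lugal-e", "e2", "mu-du3"]
edges = [(2, 0, "nsubj"), (2, 1, "obj")]

dot_source = tree_to_dot(tokens, edges)
print(dot_source)
```

The DOT text could be generated on request from whatever store holds the annotations, so the SVG itself would not need to be stored.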

4. SPARQL endpoint with visualization - Cons:

  • A SPARQL endpoint is an ephemeral and unstable way to share data
  • Requires data to be stored in RDF format
  • Better suited to key-value style data
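
As a rough illustration of the RDF requirement, the sketch below flattens the key-value annotations of one token into subject-predicate-object triples and prints them in an N-Triples-like form. The `http://example.org/` namespace, the function name `token_to_triples` and the sample annotations are placeholders chosen for this example, not an existing vocabulary.

```python
BASE = "http://example.org/"  # placeholder namespace, not a real vocabulary


def token_to_triples(token_id, annotations):
    """Turn a dict of key-value annotations into (subject, predicate, object) triples."""
    subject = f"<{BASE}token/{token_id}>"
    return [
        (subject, f"<{BASE}prop/{key}>", f'"{value}"')
        for key, value in sorted(annotations.items())
    ]


# Hypothetical annotations for a single token.
triples = token_to_triples("t1", {"form": "e2", "pos": "N"})
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Flat key-value annotations map onto triples directly; nested structures such as syntax trees would need extra triples for each parent-child link.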

5. ANNIS (Annotation of Information Structure) - Pros:

  • Cross platform
  • Browser based search
  • Visualization architecture for complex, multilayer linguistic corpora
  • Concurrently annotate, query and visualize data from varied areas such as syntax, morphology, semantics, etc.

  • Rejected because it is written in Java and would have to be queried as an external API
  • Embedding the search results in the site could be problematic

6. Other suggestions -

A) TIGERsearch - Pros:

  • Searches and browses syntactically annotated and POS-tagged corpora
  • Graphic tree display
  • GUI

B) TGrep2 - Pros:

  • Grep for trees
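
To give an idea of what "grep for trees" means, the sketch below searches a small constituency tree for nodes with one label that immediately dominate a child with another label, which is the kind of relation TGrep2 expresses with a pattern such as `NP < N`. This is plain Python written for illustration, not TGrep2 itself; the tree, its labels and the helper names are assumptions.

```python
def subtrees(tree):
    """Yield every non-leaf node of a tree given as nested (label, child, ...) tuples."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def label(node):
    """Return the label of a node; leaves are plain strings."""
    return node[0] if isinstance(node, tuple) else node


def find_dominating(tree, parent_label, child_label):
    """Return nodes labelled parent_label that have a direct child labelled child_label."""
    return [
        node
        for node in subtrees(tree)
        if node[0] == parent_label
        and any(label(child) == child_label for child in node[1:])
    ]


# Hypothetical parse tree, invented for this example.
tree = ("S",
        ("NP", ("N", "lugal")),
        ("VP", ("NP", ("N", "e2")), ("V", "du3")))

matches = find_dominating(tree, "NP", "N")
print(len(matches))
```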

Maithili Bidhe