Corpus analysis tools review

1. PML Tree Query - Cons:

  • Requires a Treex installation; the web version is suitable only for teaching and demonstration
  • Limited publicly available treebanks
  • Following a strict Universal Dependencies structure could be difficult for CDLI data

2. Corpus Workbench / Ziggurat - Pros:

  • The IMS Open Corpus Workbench (CWB) is a collection of open-source tools for managing and querying large text corpora with linguistic annotations
    Cons:
  • Unclear whether the newer version added Unicode support
  • Not supported on Windows

3. XML DB with own SVG frontend -

  • SVGs are a good option
  • Dot / Graphviz is a promising approach but would require a persistent XML DB
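
As a rough illustration of the Dot / Graphviz idea, the sketch below turns a small dependency tree into DOT source that Graphviz could then render to SVG. It is only a minimal sketch: the function name, the tuple layout (id, form, head, relation) and the sample tokens are assumptions made for this example and do not come from CDLI data or any existing MTAAC code.

```python
def dependency_tree_to_dot(tokens):
    """Render (id, form, head, relation) tuples as Graphviz DOT source."""
    lines = ["digraph dependency_tree {", "  node [shape=box];"]
    # One node per token, labeled with its word form.
    for token_id, form, _head, _relation in tokens:
        label = form.replace('"', '\\"')  # escape quotes inside DOT labels
        lines.append(f'  t{token_id} [label="{label}"];')
    # One edge per non-root token, pointing from head to dependent.
    for token_id, _form, head, relation in tokens:
        if head != 0:  # head 0 marks the root, which has no incoming edge
            lines.append(f'  t{head} -> t{token_id} [label="{relation}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-token sentence; forms and relations are invented
# purely for illustration.
example_tokens = [
    (1, "lugal", 3, "nsubj"),
    (2, "e2", 3, "obj"),
    (3, "mu-du3", 0, "root"),
]
dot_source = dependency_tree_to_dot(example_tokens)
print(dot_source)
```

The resulting text could be passed to the Graphviz command line (`dot -Tsvg`) to produce an SVG, which is where a stored XML representation of the trees would come in as the source of the token tuples.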

4. SPARQL endpoint with visualization - Cons:

  • A SPARQL endpoint is an ephemeral and unstable way to share data (see https://daverog.wordpress.com/2013/06/04/the-enduring-myth-of-the-sparql-endpoint/)
  • Requires data to be stored in RDF format
  • Better suited to key-value style data

5. ANNIS (Annotation of Information Structure) - Pros:

  • Cross platform
  • Browser based search
  • Visualization architecture for complex, multilayer linguistic corpora
  • Concurrently annotate, query and visualize data from varied areas such as syntax, morphology, semantics, etc.

    Cons:
  • Rejected because it is written in Java and would have to be queried like an API
  • Embedding the search results in the site could be problematic

6. Other suggestions -

  • https://linguistics.stanford.edu/resources/corpora/corpus-tools
  • http://linguistlist.org/sp/SearchWRListing-action.cfm?subclassid=7223&SearchType=LF&WRTypeID=2

A) TIGERSearch - http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/TIGERSearch/manual.html Pros:

  • Searches or browses syntactically annotated and POS-tagged corpora
  • Graphic tree display
  • GUI

B) TGrep2 - http://tedlab.mit.edu/~dr/Tgrep2/tgrep2.pdf Pros:

  • A grep-like search tool for syntax trees
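
To make "grep for trees" concrete, the sketch below searches a parse tree for nodes with a given label that dominate a node with another label, which is loosely the kind of question a TGrep2 dominance pattern asks. This is not TGrep2 itself and does not reproduce its query syntax; the nested-tuple tree format, the function names and the toy parse are all assumptions made for this example.

```python
def subtrees(tree):
    """Yield every subtree of a (label, child, ...) tuple tree, root first."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):  # plain strings are leaf words, not subtrees
            yield from subtrees(child)


def dominates(tree, label):
    """Return True if some proper descendant of `tree` carries `label`."""
    return any(sub[0] == label for sub in subtrees(tree) if sub is not tree)


def find_dominating(tree, ancestor_label, descendant_label):
    """Return subtrees labeled `ancestor_label` that dominate `descendant_label`."""
    return [
        sub
        for sub in subtrees(tree)
        if sub[0] == ancestor_label and dominates(sub, descendant_label)
    ]


# Invented toy parse, for illustration only.
parse = (
    "S",
    ("NP", ("DT", "the"), ("NN", "scribe")),
    ("VP",
     ("VBD", "wrote"),
     ("NP",
      ("DT", "a"),
      ("NN", "tablet"),
      ("PP", ("IN", "in"), ("NP", ("NN", "clay"))))),
)
matches = find_dominating(parse, "NP", "PP")
print(len(matches))
```

Only the noun phrase headed by "tablet" contains a prepositional phrase, so it is the single match; a real tree query tool adds many more relations (immediate dominance, precedence, sisterhood) and indexes the corpus for speed.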

Maithili Bidhe