Corpus analysis tools review
1. PML Tree Query - Cons:
- Requires a Treex installation; the web version is suitable only for teaching and demonstration
- Limited publicly available treebanks
- Following a strict Universal Dependencies structure could be difficult for CDLI data
2. Corpus Workbench / Ziggurat - Pros:
- The IMS Open Corpus Workbench (CWB) is a collection of open-source tools for managing and querying large text corpora with linguistic annotations
Cons:
- Unclear whether the new version added Unicode support
- Not supported on Windows
3. XML DB with own SVG frontend -
- SVGs are a good option
- Dot / Graphviz is a promising approach but would involve a persistent XML DB
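To make the Dot / Graphviz option more concrete, here is a minimal Python sketch that turns a small dependency tree into DOT source. The token list, the tuple layout and the `to_dot` helper are assumptions made up for illustration; they are not taken from CDLI data or from any existing MTAAC code.

```python
# Hypothetical token list: (id, form, head id, relation); head 0 marks the root.
TOKENS = [
    (1, "lugal-e", 3, "nsubj"),
    (2, "e2", 3, "obj"),
    (3, "mu-un-du3", 0, "root"),
]

def to_dot(tokens):
    """Return DOT source for a dependency tree given as (id, form, head, rel) tuples."""
    lines = ["digraph sentence {", "  node [shape=box];"]
    # One node per token, labelled with its surface form.
    for tok_id, form, _head, _rel in tokens:
        lines.append(f'  t{tok_id} [label="{form}"];')
    # One edge per dependency, pointing from head to dependent.
    for tok_id, _form, head, rel in tokens:
        if head != 0:  # the root has no incoming edge
            lines.append(f'  t{head} -> t{tok_id} [label="{rel}"];')
    lines.append("}")
    return "\n".join(lines)

dot_source = to_dot(TOKENS)
print(dot_source)
```

If the output is saved as sentence.dot, Graphviz can render it to SVG with `dot -Tsvg sentence.dot -o sentence.svg`. Where the token data would be stored (XML DB or otherwise) is left open here.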
4. SPARQL endpoint with visualization - Cons:
- A SPARQL endpoint is an ephemeral and unstable way to share data (see https://daverog.wordpress.com/2013/06/04/the-enduring-myth-of-the-sparql-endpoint/)
- Requires data to be stored in RDF format
- Better suited to key-value style data
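As a rough illustration of what "stored in RDF format" would mean for token-level annotations, the sketch below writes one token as N-Triples and shows a matching SPARQL query as a plain string. The `http://example.org/corpus/` namespace, the `form` and `pos` properties, the text ID and the `token_to_ntriples` helper are all hypothetical; a real setup would need an agreed vocabulary.

```python
BASE = "http://example.org/corpus/"  # hypothetical namespace, not a real vocabulary

def token_to_ntriples(text_id, index, form, pos):
    """Return N-Triples lines describing one token (form and part of speech)."""
    subject = f"<{BASE}{text_id}/token/{index}>"
    return [
        f'{subject} <{BASE}form> "{form}" .',
        f'{subject} <{BASE}pos> "{pos}" .',
    ]

# Hypothetical text ID and annotation values, for illustration only.
triples = token_to_ntriples("P000001", 1, "lugal", "N")
print("\n".join(triples))

# A SPARQL query that would select every token tagged "N" in such a store.
QUERY = f'SELECT ?token WHERE {{ ?token <{BASE}pos> "N" . }}'
print(QUERY)
```

Each annotation becomes a separate subject-property-value statement, which is why this model fits flat key-value data more naturally than deeply nested tree structure.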
5. ANNIS (Annotation of Information Structure) - Pros:
- Cross platform
- Browser based search
- Visualization architecture for complex, multilayer linguistic corpora
- Concurrently annotate, query and visualize data from varied areas such as syntax, morphology, semantics, etc.
Cons:
- Rejected as it is in Java and would have to be queried like an API
- Embedding the search results in the site could be problematic
6. Other suggestions -
https://linguistics.stanford.edu/resources/corpora/corpus-tools
http://linguistlist.org/sp/SearchWRListing-action.cfm?subclassid=7223&SearchType=LF&WRTypeID=2
A) TIGERSearch - http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/TIGERSearch/manual.html Pros:
- Searches and browses syntactically annotated and POS-tagged corpora
- Graphic tree display
- GUI
B) TGrep2 - http://tedlab.mit.edu/~dr/Tgrep2/tgrep2.pdf Pros:
- Grep for trees: searches parsed corpora for structural patterns
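The "grep for trees" idea can be sketched in a few lines of Python. This is not TGrep2 itself, only an illustration of one relation it supports, immediate dominance (written `VP < NP` in TGrep2 patterns). The `parse` and `immediately_dominates` helpers and the example sentence are assumptions for illustration.

```python
import re

def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def immediately_dominates(tree, parent, child):
    """Yield every subtree labelled `parent` that has a direct child labelled `child`."""
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if label == parent and any(c[0] == child for c in subtrees):
        yield tree
    for c in subtrees:
        yield from immediately_dominates(c, parent, child)

TREE = parse("(S (NP (N king)) (VP (V built) (NP (N temple))))")
matches = list(immediately_dominates(TREE, "VP", "NP"))
print(matches)
```

A dedicated tool adds many more relations (dominance at any depth, precedence, sisterhood) and indexes the corpus, which a toy matcher like this does not attempt.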
Maithili Bidhe