numerals

by Logan Born

Analysis of Accounting Corpora

This project seeks to produce exploratory visualizations of CDLI accounting corpora. The final product will comprise a pipeline which identifies counted objects in a corpus and converts the associated counts from Sumerian to Arabic numerals. A Flask API will return a summary data file listing all of the counted objects in the corpus, all of the counts associated with each object, and all of the collocations between counted objects. An online visualization will display this information to readers in an accessible format, with options to filter the visualization to focus on objects or relationships of interest.
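
As a rough illustration of what such a summary file might contain, the sketch below builds one from entries that have already been labeled. The function name `summarize`, the key names in the output, and the sample objects are all assumptions made for this example; the project's actual JSON schema may well differ.

```python
import json
from collections import Counter, defaultdict
from itertools import combinations


def summarize(documents):
    """Build a JSON-serializable summary from already-labeled entries.

    `documents` is a list of texts; each text is a list of
    (object, count) pairs. This input shape is an assumption.
    """
    counts = defaultdict(list)
    collocations = Counter()
    for entries in documents:
        for obj, count in entries:
            counts[obj].append(count)
        # Count each unordered pair of distinct objects once per text.
        objects = sorted({obj for obj, _ in entries})
        for pair in combinations(objects, 2):
            collocations[pair] += 1
    return {
        "objects": sorted(counts),
        "counts": dict(counts),
        "collocations": [
            {"pair": list(pair), "texts": n}
            for pair, n in sorted(collocations.items())
        ],
    }


if __name__ == "__main__":
    # Hypothetical sample data: two texts with illustrative object labels.
    sample = [[("udu", 3), ("masz2", 2)], [("udu", 5)]]
    print(json.dumps(summarize(sample), indent=2))
```

Counting each pair once per text, rather than once per entry, keeps a long text that repeats the same two objects from dominating the collocation figures; either choice would be defensible.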

The goal is to present the information from these corpora in an accessible manner to facilitate exploration of the ancient economy and society.

Code available here

Objectives and Deliverables

Essential Objectives

1. Numeral conversion
   Deliverables: (i) Python script to convert transliterated cuneiform numerals into Arabic numerals (e.g. 1(disz) 1(u) gin2 ↦ 1.167); (ii) API endpoint to access this script.
   Notes: API documentation here. NB notations can be ambiguous: if the user does not specify a number system to use, the script returns a list of possible readings. Update to handle variant spellings (e.g. disz vs diš). To-do: What is the meaning of |ASZxDISZ|?
   Issues: #1 #2 #3 #4 #5

2. Entry segmentation
   Deliverables: (i) Python script to segment documents into entries.
   Notes: Script segments a text using numerals as delimiters. This works except for the final entry: during objective 3, when totals are recorded (… dur 1(disz) …) remember that dur “total” must be removed as it is not a counted object. (More generally, descriptors and explanations will have to be stripped: consider e.g. 1(asz@c)-ta szu ba-ti ezem sze gu7 {d}nansze-ka sa6-sa6 dam URU-KA-gi-na lugal lagasz{ki}-ka-ke4 e-ne-ba 2(|ASZxDISZ@t|))

3. Commodity identification
   Deliverables: (i) Python script which uses a rule-based system to identify and label counted objects in an entry.
   Notes: Depends on #2. A heuristic, rule-based approach will miss some counted objects and will probably label some objects incorrectly. Any extra time in the project should go towards refining this component as much as possible (see optional objectives).

4. Data extraction
   Deliverables: (i) Python script which produces a JSON summary of counted object frequencies and the number of times goods co-occur; (ii) API endpoint to serve this JSON data.
   Notes: Depends on #3. To ensure the API returns accurate data, it must always fetch the most recent version of the corpus. Needs to wait for the new search API to be ready.

5. Visualizations
   Deliverables: (i) Static HTML page with interactive, exploratory visualizations of the extracted data; (ii) filtering options, implemented as standalone scripts to permit later integration with CDLI search.
   Notes: Depends on #4. See details below.

6. Framework integration
   Deliverables: Integrate the preceding deliverables into the existing CDLI framework wherever possible.
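
To make objectives 1 and 2 more concrete, here is a minimal sketch of numeral conversion and numeral-delimited segmentation. It assumes a single sexagesimal system with only three signs (disz = 1, u = 10, gesz2 = 60) and does not handle units such as gin2, ambiguous readings, or unresolved signs such as |ASZxDISZ|. The function names and sample words are hypothetical and are not taken from the project's code.

```python
import re

# Assumed values for a few signs of one sexagesimal counting system. A real
# converter needs a table per number system, since readings can be ambiguous.
SIGN_VALUES = {"disz": 1, "u": 10, "gesz2": 60}

# Matches tokens such as 1(disz) or 2(u): a count, then a sign name in parentheses.
NUMERAL = re.compile(r"^(\d+)\(([^()]+)\)$")


def normalize(token):
    """Map the variant spelling with š onto the ASCII spelling with sz."""
    return token.replace("š", "sz")


def is_numeral(token):
    """Return True if the token has the shape count(sign)."""
    return NUMERAL.match(normalize(token)) is not None


def convert(tokens):
    """Sum the values of numeral tokens.

    Raises ValueError for non-numerals and KeyError for signs that are
    missing from SIGN_VALUES.
    """
    total = 0
    for token in tokens:
        match = NUMERAL.match(normalize(token))
        if match is None:
            raise ValueError(f"not a numeral: {token}")
        count, sign = match.groups()
        total += int(count) * SIGN_VALUES[sign]
    return total


def segment(text):
    """Split a text into entries, opening a new entry at each run of numerals."""
    entries = []
    previous_was_numeral = False
    for token in text.split():
        numeral = is_numeral(token)
        if numeral and not previous_was_numeral:
            entries.append({"numerals": [], "words": []})
        if not entries:
            # Words before the first numeral belong to no entry; skip them.
            continue
        entries[-1]["numerals" if numeral else "words"].append(token)
        previous_was_numeral = numeral
    return entries
```

With these assumptions, `convert(["2(u)", "3(diš)"])` gives 23, and `segment("1(u) 2(disz) udu 3(disz) masz2")` yields two entries. Words such as dur "total" would still land in the word list of an entry, which is why the notes above call for a later stripping step.
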
Visualization Details

Feedback from prospective users will guide the exact format of the visualizations. Information to be visualized includes:

Additional Objectives

1. Improved commodity identification
   Deliverables: Extend objective 3 (commodity identification) to use POS tags and NER to identify and label counted objects in an entry.
   Notes: If POS tagging is not available at this point in the project, we can align and project annotations from English until a better system comes along.

2. Extra languages
   Deliverables: Extend all components to include support for additional scripts and languages (e.g. Akkadian, proto-cuneiform, proto-Elamite).
   Notes: Resources are more limited for languages which are not Sumerian.

Tentative timeline

Month 1: complete pipeline to extract data from the corpus and API to serve as JSON

Month 2: first batch of visualizations complete, from sketches through filtering tools to final demo

Month 3: all visualizations complete, and final integration with CDLI framework

Week 1: Entry segmentation & commodity identification. Deliverables: scripts for (i) segmentation; (ii) commodity labeling.
Week 2: Data extraction. Deliverables: script to convert corpus text file into JSON data.
Week 3: Full pipeline complete. Deliverables: API endpoint which serves extracted data. Need to update to fetch most recent corpus as soon as the new search API is ready.
Week 4: Sketches and feedback; start filtering tools. Deliverables: sketches delivered to prospective users.
Week 5: Start viz work.
Week 6: Viz.
Week 7: Integration with CDLI framework. Deliverables: demo.
Week 8: Viz: implement demo feedback. Deliverables: demo.
Week 9: Finalize viz interface. Deliverables: final demo.
Week 10: Filtering tools. Deliverables: script implementing filtering tools: query ↦ list of matching texts.
Week 11: Test cross-browser compatibility & visual polish.
Week 12: User guide.
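
The week 10 filtering deliverable (query ↦ list of matching texts) could be sketched along the following lines. The query shape used here (a list of required objects plus an optional minimum count), the function name `matching_texts`, and the text identifiers are assumptions for illustration only; the real query format would follow user feedback and CDLI search.

```python
def matching_texts(corpus, required_objects, min_count=0):
    """Return the ids of texts in which every required object occurs.

    `corpus` maps a text id to a list of (object, count) entries. A text
    matches when each required object is present and its summed count is
    at least `min_count`.
    """
    matches = []
    for text_id, entries in corpus.items():
        # Sum the counts per object within this text.
        totals = {}
        for obj, count in entries:
            totals[obj] = totals.get(obj, 0) + count
        if all(obj in totals and totals[obj] >= min_count
               for obj in required_objects):
            matches.append(text_id)
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical corpus with placeholder text ids and object labels.
    corpus = {
        "text_a": [("udu", 3), ("masz2", 2)],
        "text_b": [("udu", 1)],
    }
    print(matching_texts(corpus, ["udu", "masz2"]))
```

Keeping the filter as a standalone function over plain dictionaries matches the stated aim of implementing filtering as scripts that can later be integrated with CDLI search.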
