numerals

Analysis of Accounting Corpora

This project seeks to produce exploratory visualizations of CDLI accounting corpora. The final product will comprise a pipeline that identifies counted objects in a corpus and converts the associated counts from Sumerian to Arabic numerals. A Flask API will return a summary data file listing all of the counted objects in the corpus, all of the counts associated with each object, and all of the collocations between counted objects. An online visualization will display this information to readers in an accessible format, with options to filter the visualization to focus on objects or relationships of interest.

The goal is to present the information from these corpora in an accessible manner to facilitate exploration of the ancient economy and society.

Code available here

Objectives and Deliverables

Essential Objectives

| # | Objectives | Associated Deliverables | Notes | Issue(s) |
|---|---|---|---|---|
| 1 | Numeral conversion | ~~i. Python script to convert transliterated cuneiform numerals into Arabic numerals (e.g. 1(disz) 1(u) gin2 ↦ 1.167)~~ ~~ii. API endpoint to access this script~~ | API documentation here. NB notations can be ambiguous: if the user does not specify a number system, the script returns a list of possible readings. ~~Update to handle variant spellings (e.g. disz vs diš)~~ ~~To-do: What is the meaning of \|ASZxDISZ\|?~~ | #1 #2 #3 #4 #5 |
| 2 | Entry segmentation | ~~i. Python script to segment documents into entries.~~ | The script segments a text using numerals as delimiters. This works except for the final entry: when totals are recorded (… dur 1(disz) …), dur “total” must be removed during objective 3, as it is not a counted object. (More generally, descriptors and explanations will have to be stripped: consider e.g. 1(asz@c)-ta szu ba-ti ezem sze gu7 {d}nansze-ka sa6-sa6 dam URU-KA-gi-na lugal lagasz{ki}-ka-ke4 e-ne-ba 2(\|ASZxDISZ@t\|)) | |
| 3 | Commodity identification | ~~i. Python script which uses a rule-based system to identify and label counted objects in an entry.~~ | Depends on #2. A heuristic, rule-based approach will miss some counted objects and will probably label some objects incorrectly. Any extra time in the project should go towards refining this component as much as possible (see optional objectives). | |
| 4 | Data extraction | ~~i. Python script which produces a JSON summary of counted object frequencies and the number of times goods co-occur.~~ ~~ii. API endpoint to serve this JSON data.~~ | Depends on #3. To ensure the API returns accurate data, it must always fetch the most recent version of the corpus. This has to wait until the new search API is ready. | |
| 5 | Visualizations | i. Static HTML page with interactive, exploratory visualizations of the extracted data. ii. Filtering options, implemented as standalone scripts to permit later integration with CDLI search. | Depends on #4. See Visualization Details below. | |
| 6 | Framework integration | Integrate the preceding deliverables into the existing CDLI framework wherever possible. | | |
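
Objectives 1 and 2 can be illustrated together. The following is a minimal sketch, not the project's script: it assumes whitespace-separated ATF tokens and covers only the sexagesimal counting signs disz, u, gesz2, gesz'u and szar2. The names `segment`, `to_arabic` and `normalize` are hypothetical. Metrological units such as gin2, ambiguity between number systems, and the stripping of descriptors such as dur are left out.

```python
import re

# Values of the sexagesimal counting signs. Assumption: only this one system
# is covered; the real script also deals with metrological systems.
SEXAGESIMAL = {"disz": 1, "u": 10, "gesz2": 60, "gesz'u": 600, "szar2": 3600}

# A numeral token looks like 1(disz) or 2(gesz2): a count, then a sign name.
NUMERAL = re.compile(r"^(\d+)\(([^()]+)\)$")

SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")


def normalize(sign):
    """Map variant spellings such as 'diš' or 'geš₂' to ASCII ATF."""
    return sign.lower().replace("š", "sz").translate(SUBSCRIPTS)


def segment(line):
    """Split a line into entries, starting a new entry at each run of numerals.

    Words that precede the first numeral are dropped.
    """
    entries = []
    for token in line.split():
        match = NUMERAL.match(token)
        if match:
            # A numeral that follows ordinary words opens a new entry.
            if not entries or entries[-1]["words"]:
                entries.append({"numerals": [], "words": []})
            count, sign = int(match.group(1)), normalize(match.group(2))
            entries[-1]["numerals"].append((count, sign))
        elif entries:
            entries[-1]["words"].append(token)
    return entries


def to_arabic(numerals):
    """Sum count * sign value; raises KeyError for signs outside the table."""
    return sum(count * SEXAGESIMAL[sign] for count, sign in numerals)


for entry in segment("2(u) 3(disz) udu 1(gesz2) 5(diš) masz2"):
    print(to_arabic(entry["numerals"]), " ".join(entry["words"]))
```

Signs outside the small table, such as 1(asz@c), raise a `KeyError` here rather than being guessed at.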

Visualization Details

Feedback from prospective users will guide the exact format of the visualizations. Information to be visualized includes:

- Summaries of item frequency
- Measures of central tendency and variation in an item’s counts; if possible, identifying items with multimodal distributions
- Item collocations
- Similarity between item distributions
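
As a rough illustration of how the first three items could be derived from extracted entries, here is a sketch using only the Python standard library. The input shape (a mapping from text ID to a list of `(object, count)` pairs), the function name `summarize`, the output keys and the sample data are all assumptions made for this example, not the project's actual JSON format. Detecting multimodal distributions and comparing distributions are not covered.

```python
import json
import statistics
from collections import Counter, defaultdict
from itertools import combinations


def summarize(texts):
    """Build a JSON-serializable summary from {text_id: [(object, count), ...]}."""
    counts = defaultdict(list)
    collocations = Counter()
    for entries in texts.values():
        for obj, count in entries:
            counts[obj].append(count)
        # Count each unordered pair of distinct objects once per text.
        for pair in combinations(sorted({obj for obj, _ in entries}), 2):
            collocations[pair] += 1
    objects = {
        obj: {
            "frequency": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            # The sample standard deviation needs at least two values.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
        for obj, values in counts.items()
    }
    return {
        "objects": objects,
        "collocations": [
            {"pair": list(pair), "texts": n}
            for pair, n in collocations.most_common()
        ],
    }


# Hypothetical sample data; the IDs are placeholders, not CDLI text numbers.
texts = {
    "T1": [("udu", 23), ("masz2", 65)],
    "T2": [("udu", 5), ("sze", 120)],
    "T3": [("udu", 2), ("masz2", 1)],
}
print(json.dumps(summarize(texts), indent=2))
```

Collocation is counted per text here; counting per entry or per line would be an equally plausible choice and depends on what prospective users find useful.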

Additional Objectives

| # | Objectives | Associated Deliverables | Notes | Issue(s) |
|---|---|---|---|---|
| 1 | Improved commodity identification | Extend objective 3 (commodity identification) to use POS tags and NER to identify and label counted objects in an entry. | If POS tagging is not available at this point in the project, we can align and project annotations from English until a better system comes along. | |
| 2 | Extra languages | Extend all components to include support for additional scripts and languages (e.g. Akkadian, proto-cuneiform, proto-Elamite). | Resources are more limited for languages other than Sumerian. | |

Tentative timeline

Month 1: complete the pipeline to extract data from the corpus and the API to serve it as JSON

Month 2: first batch of visualizations complete, from sketches through filtering tools to final demo

Month 3: all visualizations complete, and final integration with CDLI framework

| Week | Objectives | Deliverables |
|---|---|---|
| 1 | ~~Entry segmentation~~ & ~~Commodity identification~~ | Scripts for ~~(i) segmentation~~; ~~(ii) commodity labeling~~ |
| 2 | ~~Data extraction~~ | Script to convert corpus text file into JSON data |
| 3 | ~~Full pipeline complete~~ | API endpoint which serves extracted data. Needs updating to fetch the most recent corpus as soon as the new search API is ready. |
| 4 | ~~Sketches and feedback; start filtering tools~~ | Sketches delivered to prospective users |
| 5 | ~~Start Viz Work~~ | |
| 6 | ~~Viz~~ | |
| 7 | ~~Integration with CDLI framework~~ | Demo |
| 8 | Viz: implement demo feedback | Demo |
| 9 | Finalize viz interface | Final Demo |
| 10 | Filtering tools | Script implementing filtering tools: query ↦ list of matching texts |
| 11 | Test cross-browser compatibility & visual polish | |
| 12 | User guide | |
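
The week 10 filtering tool maps a query to a list of matching texts. One possible shape for it is sketched below, assuming the extracted data is available as a mapping from text ID to `(object, count)` pairs. The function name `matching_texts`, the `min_count` parameter and the sample data are hypothetical; the real query format is still to be decided.

```python
def matching_texts(texts, obj, min_count=0):
    """Return sorted IDs of texts with an entry for obj counting at least min_count."""
    return sorted(
        text_id
        for text_id, entries in texts.items()
        if any(o == obj and c >= min_count for o, c in entries)
    )


# Hypothetical sample data; the IDs are placeholders, not CDLI text numbers.
texts = {
    "T1": [("udu", 23), ("masz2", 65)],
    "T2": [("udu", 5), ("sze", 120)],
    "T3": [("masz2", 1)],
}
print(matching_texts(texts, "udu"))
print(matching_texts(texts, "udu", min_count=10))
```

Keeping the filter as a pure function over the extracted data matches the plan to ship filtering as standalone scripts that can later be integrated with CDLI search.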