This week, we will transition into data cleaning.
TODO 1: Our publications table contains many redundant publications, since it is a direct copy from the old artifacts database.
The task here is to run automated Python scripts that merge these existing publications and populate the database.
I wrote a script that scans the “designation” field of each entry and filters out the entries with redundant information. The script merged a large amount of redundant data, reducing the table from 200,000 entries to 5,000.
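
The merging idea can be sketched roughly as follows. This is only an illustration, not the actual script: the dictionary keys `id` and `designation`, the normalization rule (case-folding, dropping punctuation, collapsing whitespace), and the choice to keep the lowest `id` are all assumptions made for the example.

```python
import re
from collections import defaultdict


def normalize_designation(designation: str) -> str:
    """Case-fold, replace punctuation with spaces, and collapse whitespace.

    Assumed rule for deciding that two designations are 'the same'.
    """
    text = designation.casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def plan_merges(rows):
    """Group rows by normalized designation.

    `rows` is a list of dicts with hypothetical keys 'id' and 'designation'.
    Returns a dict mapping each normalized designation to the id that is
    kept and the ids that would be merged into it.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[normalize_designation(row["designation"])].append(row["id"])

    plan = {}
    for key, ids in groups.items():
        keep = min(ids)  # assumption: the lowest id is treated as canonical
        plan[key] = {"keep": keep, "merge": sorted(i for i in ids if i != keep)}
    return plan


if __name__ == "__main__":
    sample = [
        {"id": 1, "designation": "Smith 1990, p. 5"},
        {"id": 2, "designation": "smith  1990 p 5"},
        {"id": 3, "designation": "Jones 2001"},
    ]
    for key, entry in plan_merges(sample).items():
        print(key, entry)
```

Producing a merge plan first, rather than deleting rows directly, makes it possible to review the proposed merges before anything in the database changes.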
TODO 2: Get data on connecting artifacts to proveniences
Rune gave me some data about publications and proveniences. I tried linking their project’s publications with the CDLI ones, but could not match them. Perhaps I could add the publications Rune provided to our publications table?
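
One way to quantify how few publications match is to compare the two lists on a normalized key. The sketch below is hypothetical: it assumes each publication is represented by a single string and that matching on case-folded, whitespace-collapsed text is a reasonable first test. The real data may need a different key.

```python
def match_key(label: str) -> str:
    """Assumed matching key: case-folded text with collapsed whitespace."""
    return " ".join(label.casefold().split())


def match_report(external_labels, cdli_labels):
    """Split the external labels into those found in CDLI and those not found."""
    cdli_keys = {match_key(label) for label in cdli_labels}
    matched = [l for l in external_labels if match_key(l) in cdli_keys]
    unmatched = [l for l in external_labels if match_key(l) not in cdli_keys]
    return matched, unmatched


if __name__ == "__main__":
    # Made-up labels, for illustration only.
    matched, unmatched = match_report(
        ["RIME 1", "Unknown Vol"],
        ["rime  1", "ATU 3"],
    )
    print("matched:", matched)
    print("unmatched:", unmatched)
```

The unmatched list is then the set of candidates to review before adding anything to the publications table.
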
# | Day | Date | A short description of the work done |
---|---|---|---|
1 | Monday | 2022/06/27 | Fixed merge conflicts of refactored website and most recent phoenix/develop branch |
2 | Tuesday | 2022/06/28 | Finalized the merge script of publication entries |
3 | Wednesday | 2022/06/29 | Performed exploratory data analysis on Rune’s data, but found that few of the entries match the CDLI ones. |
4 | Thursday | 2022/06/30 | Took a break because I was not feeling well. |
5 | Friday | 2022/07/01 | Discussed merging publications with Adam and sent him the initially merged data |
6 | Saturday | 2022/07/02 | Coordinated refactor testing with Emilie |
7 | Sunday | 2022/07/03 | Took a break. |