Week 3

by Circle Chen

week
gsoc
gsoc2022
towardsAGeneralPurposeEntityBibliographyLinkingSystem
week#3
eval#1

Week Summary

This week, we will transition into data cleaning.

TODO 1: Our publications table contains many redundant publications because it is a direct copy of the old artifacts database.

The task is to run automated Python scripts that merge these existing publications and populate the database with the result.

I wrote a script that scans each publication's “designation” entry and filters out the entries with redundant information. The script merged a large share of the data, reducing the table from 200,000 entries to 5,000.
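The merge step can be illustrated with a minimal sketch. This is not the actual script: the row layout (`id` and `designation` keys), the normalization rules, and the helper names `normalize_designation` and `merge_publications` are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def normalize_designation(designation):
    """Lowercase, replace punctuation with spaces, and collapse whitespace
    so that near-identical designations compare equal (assumed rule)."""
    text = designation.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def merge_publications(rows):
    """Group rows by normalized designation. The first row in each group is
    kept as the canonical entry, and the ids of all merged rows are recorded."""
    groups = defaultdict(list)
    for row in rows:
        groups[normalize_designation(row["designation"])].append(row)

    merged = []
    for members in groups.values():
        canonical = dict(members[0])
        canonical["merged_ids"] = [m["id"] for m in members]
        merged.append(canonical)
    return merged


# Hypothetical sample rows, not real CDLI data.
rows = [
    {"id": 1, "designation": "Smith 1999, p. 12"},
    {"id": 2, "designation": "smith 1999 p 12"},
    {"id": 3, "designation": "Jones 2005"},
]
merged = merge_publications(rows)
```

Keeping `merged_ids` on the canonical row is a design choice in this sketch: it lets any table that points at a removed publication id be redirected to the surviving entry.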

TODO 2: Get data on connecting artifacts to proveniences

Rune gave me some data about publications and proveniences. I tried linking their project’s publications with the CDLI ones, but could not match them. Maybe I could add the publications Rune gave me to the publications table?
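One way to quantify how poorly two publication lists line up is to compute a simple match rate. The sketch below assumes each publication can be reduced to a single comparable string key; the function name `match_rate` and the sample keys are hypothetical and do not come from either dataset.

```python
def match_rate(external_keys, cdli_keys):
    """Return the fraction of external keys that also appear in the CDLI keys,
    together with the set of external keys that found no match.
    Keys are compared after trimming whitespace and lowercasing."""
    external = {k.strip().lower() for k in external_keys if k and k.strip()}
    cdli = {k.strip().lower() for k in cdli_keys if k and k.strip()}
    if not external:
        return 0.0, set()
    matched = external & cdli
    return len(matched) / len(external), external - cdli


# Hypothetical keys for illustration only.
external_keys = ["Smith 1999", "Jones 2005", "Lee 2010", " "]
cdli_keys = ["smith 1999", "Brown 2001"]
rate, unmatched = match_rate(external_keys, cdli_keys)
```

The unmatched set is the list of candidates that would need to be added to the publications table if the two sources cannot be linked directly.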

Daily Work Update

# | Day | Date | A short description of the work done
1 | Monday | 2022/06/27 | Fixed merge conflicts between the refactored website and the most recent phoenix/develop branch
2 | Tuesday | 2022/06/28 | Finalized the merge script for publication entries
3 | Wednesday | 2022/06/29 | Ran exploratory data analysis on Rune’s data and found that few entries match the CDLI ones
4 | Thursday | 2022/06/30 | Took a break because I was not feeling well
5 | Friday | 2022/07/01 | Discussed merging publications with Adam and sent him the initially merged data
6 | Saturday | 2022/07/02 | Coordinated refactor testing with Emilie
7 | Sunday | 2022/07/03 | Took a break