Towards a General-Purpose Entity-Bibliography Linking System

by Circle Chen

Project Description

Objectives and Deliverables

–> Completed Tasks –> Ongoing Tasks

#	Status	Objectives	Associated Deliverables	issue(s)
1		Adapt the old reference database system to the new one.	1. Publication View Page should work 2. Elasticsearch should work	#1154 and !653; #1244 and !695
2		Clean the current bibliography data.	Data workflow as well as resulting data table.	This repository
3		Identify the reference relationships between our publications and our entities, and populate the database with new data	Data workflow as well as resulting data table.	1. .ipynb for finding provenience-pub relationships 2. Resulting CSV
4		Enable single publication file submission and suggestion for connecting new entities	hello world	#1270 and !710

Additional Objectives

#	Status	Objectives	Associated Deliverables	issue(s)
1		Enable bulk-uploading submission.	New interface on websites	Not yet done
2		Miscellaneous issue fixing.	Numerous functional improvements and patching	Issues: #997 #1053 #1127

Tentative timeline

–> Completed Tasks –> Ongoing Tasks –> Work Demonstration

Week	Objectives	Deliverables
1	Adapt the old reference database system to the new one.	1. Publication View Page should work 2. Elasticsearch should work
2	Adapt the old reference database system to the new one.
3	Tested whether the refactoring broke anything. Initial exploration of the publications dataset to see how to clean it.	Test PR; ipynb notebooks exploring the data
4	Sent the initial round of cleaned data to Adam for proofreading; refined the cleaning script.
5	Explored ML-based methods for reference parsing (they worked poorly). Wrote `regex`-based matching methods and sent the resulting exact_reference values to Emilie for proofreading.
6	Modified the script to account for BibTeX key updates. Prefilled the merged exact_ref on the merge page.	!997
7	Attempted machine-learning-based methods for PDF mining to find publication-provenience relationships.
8	Switched to a pattern-matching approach for PDF mining.
9	Finished PDF mining with pattern matching and generated a preliminary publication-provenience connection dataset on GitHub.	1. .ipynb for finding provenience-pub relationships 2. Resulting CSV
10	Explored how to combine Node.js and Python to run the single-publication file-parsing script.
11	Finalized the Node.js and Python scripts that run the single-publication file parsing. Working on CakePHP code to read the resulting CSV file.
12	Buffer week: clean up the newly converted dataset and fix issues that came up.
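
The pattern-matching step from weeks 8 and 9 can be illustrated with a minimal sketch. This is not the project's actual notebook code: the function names, the CSV column names, the BibTeX keys, and the sample texts below are all hypothetical, and the sketch assumes the PDF text has already been extracted to plain strings. It only shows the general idea of scanning publication text for known provenience names and writing the resulting pairs to a CSV.

```python
import csv
import io
import re


def build_pattern(names):
    """Compile one case-insensitive regex matching any of the names as whole words."""
    # Try longer names first so a multi-word name is preferred over a shorter prefix.
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def find_links(publications, proveniences):
    """Return sorted, de-duplicated (bibtex_key, provenience) pairs found in the texts."""
    pattern = build_pattern(proveniences)
    # Map lowercased matches back to the canonical spelling of each name.
    canonical = {name.lower(): name for name in proveniences}
    links = set()
    for bibtex_key, text in publications.items():
        for match in pattern.finditer(text):
            links.add((bibtex_key, canonical[match.group(0).lower()]))
    return sorted(links)


def to_csv(links):
    """Serialize the pairs as CSV text with a header row (column names are assumptions)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["bibtex_key", "provenience"])
    writer.writerows(links)
    return buffer.getvalue()


# Made-up sample data, for illustration only.
proveniences = ["Nippur", "Ur", "Tell Asmar"]
publications = {
    "Smith1990": "Tablets excavated at Nippur and at Tell Asmar.",
    "Jones2001": "Our survey of urban sites near Ur.",
}

if __name__ == "__main__":
    print(to_csv(find_links(publications, proveniences)))
```

The word-boundary anchors keep a short name such as "Ur" from matching inside unrelated words like "urban" or "Our". A real pipeline would likely also need to handle spelling variants and hyphenation introduced by PDF extraction.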

Week 1

Week 2

Week 3

Week 4

Week 5

Week 6

Eval 1

Week 7

Week 8

Week 9

Week 10

Week 11

Week 12

Eval 2