Towards a General-Purpose Entity-Bibliography Linking System

by Circle Chen

project
gsoc
gsoc2022

Project Description

Objectives and Deliverables

:heavy_check_mark: –> Completed Tasks :white_check_mark: –> Ongoing Tasks

| # | Status | Objectives | Associated Deliverables | Issue(s) |
|---|--------|------------|-------------------------|----------|
| 1 | :heavy_check_mark: | Adapt the old reference database system to the new one. | Publication View Page should work; Elasticsearch should work | #1154 and !653; #1244 and !695 |
| 2 | :heavy_check_mark: | Clean the current bibliography data. | Data workflow as well as resulting data table. | This repository |
| 3 | :heavy_check_mark: | Identify the reference relationships between our publications and our entities, and populate the database with the new data. | Data workflow as well as resulting data table. | 1. .ipynb for finding provenience-pub relationships; 2. Resulting CSV |
| 4 | :heavy_check_mark: | Enable single publication file submission and suggestions for connecting new entities. | hello world | #1270 and !710 |
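Objective 3 relies on pattern matching to find which proveniences a publication mentions and to record the result as a CSV. The following is only a minimal sketch of that idea, not the project's actual notebook: the provenience names, the `pub-001` identifier, the column names, and the helper functions are all hypothetical, and the text is assumed to have already been extracted from the PDF.

```python
import csv
import io
import re

# Hypothetical provenience names; a real list would come from the database.
PROVENIENCES = ["Avalon", "Avalon East", "Camelot"]


def build_pattern(names):
    # Try longer names first so "Avalon East" is not cut short to "Avalon";
    # \b restricts matches to whole words.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(" + alternatives + r")\b")


def find_links(pub_id, text, pattern):
    """Return (pub_id, provenience, mention_count) rows, sorted by name."""
    counts = {}
    for match in pattern.finditer(text):
        name = match.group(1)
        counts[name] = counts.get(name, 0) + 1
    return [(pub_id, name, counts[name]) for name in sorted(counts)]


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["publication_id", "provenience", "mentions"])
    writer.writerows(rows)
    return buffer.getvalue()


sample_text = "Finds from Avalon East and Camelot; later Avalon East material."
rows = find_links("pub-001", sample_text, build_pattern(PROVENIENCES))
print(to_csv(rows))
```

Counting mentions rather than recording a plain yes/no keeps weak matches visible, so a proofreader can decide which candidate links are worth keeping.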

Additional Objectives

| # | Status | Objectives | Associated Deliverables | Issue(s) |
|---|--------|------------|-------------------------|----------|
| 1 | :white_check_mark: | Enable bulk-uploading submission. | New interface on websites | Not yet done |
| 2 | :heavy_check_mark: | Miscellaneous issue fixing. | Numerous functional improvements and patching | Issues: #997 #1053 #1127 |

Tentative timeline

:heavy_check_mark: –> Completed Tasks :white_check_mark: –> Ongoing Tasks :raised_hands: –> Work Demonstration

| Week | Objectives | Deliverables |
|------|------------|--------------|
| 1 | :heavy_check_mark: Adapt the old reference database system to the new one. | :heavy_check_mark: Publication View Page should work :heavy_check_mark: Elasticsearch should work |
| 2 | :heavy_check_mark: Adapt the old reference database system to the new one. | |
| 3 | :heavy_check_mark: Test whether the refactoring broke anything :heavy_check_mark: Initial exploration of the publications dataset to see how to clean it | :heavy_check_mark: Test PR :heavy_check_mark: Write existing ipynb notebooks on the data |
| 4 | :heavy_check_mark: Sent the initial round of data cleaning to Adam for proofreading; refined the script. | |
| 5 | :heavy_check_mark: Explored ML-based methods for reference parsing (they worked poorly). Wrote regex methods for matching and sent them to Emilie to proofread exact_reference. | |
| 6 | :heavy_check_mark: Modified the script to account for BibTeX key updates. :heavy_check_mark: Prefill merge exact_ref in merge page. | !997 |
| 7 | :heavy_check_mark: Attempted machine-learning-based methods for PDF mining to find publication-provenience relationships. | |
| 8 | :heavy_check_mark: Switched to pattern-matching-based PDF mining. | |
| 9 | :heavy_check_mark: Finished PDF mining with pattern matching and generated a preliminary publication-provenience connection dataset on GitHub. | 1. .ipynb for finding provenience-pub relationships; 2. Resulting CSV |
| 10 | :heavy_check_mark: Explored how to combine Node.js and Python to run the single publication file parsing script. | |
| 11 | :heavy_check_mark: Finalized the Node.js and Python scripts that run the single publication file parsing. Trying to get CakePHP to read from the resulting CSV file. | |
| 12 | :heavy_check_mark: Buffer week: cleaned up the newly converted dataset and fixed some issues that came up. | |
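Weeks 5 and 6 mention regex methods for reference parsing, BibTeX keys, and an exact_reference field. As a rough illustration only, the sketch below splits a citation string into a key and an exact reference. The input shape (`<key>, <exact reference>`), the key format (letters, a four-digit year, an optional suffix letter), and the function name `split_citation` are assumptions made for this example; the project's real patterns are not shown in this document.

```python
import re

# Assumed (hypothetical) citation shape: "<bibtex key>, <exact reference>",
# for example "Smith1990, pp. 12-15". A colon is also accepted as separator.
CITATION = re.compile(
    r"^\s*(?P<key>[A-Za-z]+\d{4}[a-z]?)\s*[,:]\s*(?P<exact_ref>.+?)\s*$"
)


def split_citation(citation):
    """Return (key, exact_reference), or None if the string does not fit."""
    match = CITATION.match(citation)
    if match is None:
        return None
    return match.group("key"), match.group("exact_ref")


for raw in ["Smith1990, pp. 12-15", " Doe2001b: no. 7 ", "no separator"]:
    print(repr(raw), "->", split_citation(raw))
```

Returning `None` for strings that do not fit, instead of guessing, leaves the unmatched citations available for manual proofreading.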

Weekly entries, all by Circle Chen:

Week 1
Week 2
Week 3
Week 4
Week 5
Week 6
Eval 1
Week 7
Week 8
Week 9
Week 10
Week 11
Week 12
Eval 2