Week 1 - Tokenisers and Benchmark

by Rachit

project
research
internship
unsupervised
nmt

Welcome to CDLI Blogs.

Week Summary

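The first week covered data preparation and tokeniser experiments. Monolingual and parallel data were pre-processed, BBPE, BPE and BertWordPiece tokenizers were trained on the result and compared, and the CLTK/Akkadian tokenizer was tried out. Train and test files were then created, the data was aligned and prepared for FairSeq and OpenNMT, and newly obtained administrative and non-administrative data was cleaned. The week closed with the model pipeline shell scripts, the GPU server setup, and a finalised benchmark dataset.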

Daily Work Update

| # | Day | Date | A short description of the work done |
|---|-----|------|--------------------------------------|
| 1 | Monday | 2020/06/01 | Wrote and applied pre-processing scripts on monolingual and parallel data |
| 2 | Tuesday | 2020/06/02 | Trained BBPE, BPE and BertWordPiece Tokenizers on the pre-processed text, saved vocabulary and compared results |
| 3 | Wednesday | 2020/06/03 | Experimented with CLTK/Akkadian Tokenizer. Created train and test data files. Added alternate version of data using last year’s preprocessing. |
| 4 | Thursday | 2020/06/04 | Aligned and prepared data according to FairSeq and OpenNMT |
| 5 | Friday | 2020/06/05 | Cleaned and prepared newly obtained ((non-)administrative) data, analysed supervised techniques |
| 6 | Saturday | 2020/06/06 | Completed model pipeline shell scripts and set up GPU server |
| 7 | Sunday | 2020/06/07 | Prepared and Finalised Benchmark Dataset |
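
For readers unfamiliar with the BPE tokenizers mentioned in the Day 2 entry, here is a toy sketch of how byte-pair encoding learns its merge rules. It is not the training code used in this project. The function names (`learn_bpe`, `merge_pair`, `apply_bpe`), the `</w>` end-of-word marker and the tiny corpus are all illustrative assumptions, and a real run would rely on a tokenizer library rather than this loop.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the new merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenise one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, chosen only to make the merges easy to follow.
corpus = ["lugal", "lugal", "lugal", "gal", "lu"]
merges = learn_bpe(corpus, num_merges=4)
print(merges)
print(apply_bpe("lugal", merges))
```

Joining the output symbols always gives back the original word plus the marker, so the segmentation is lossless. The number of merges acts as the vocabulary-size knob: more merges produce longer, more word-like units.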