Week 1: Tokenisers and Benchmark

by Rachit

project

research

internship

unsupervised

nmt


Week Summary

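This week covered pre-processing of the monolingual and parallel data, training and comparing BBPE, BPE and BertWordPiece tokenisers, experiments with the CLTK/Akkadian tokeniser, preparing the data for FairSeq and OpenNMT, completing the model pipeline shell scripts, setting up the GPU server, and finalising the benchmark dataset.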

Daily Work Update

#	Day	Date	A short description of the work done
1	Monday	2020/06/01	Wrote pre-processing scripts and applied them to the monolingual and parallel data
2	Tuesday	2020/06/02	Trained byte-level BPE (BBPE), BPE and BERT WordPiece tokenisers on the pre-processed text, saved the vocabularies and compared the results
3	Wednesday	2020/06/03	Experimented with the CLTK Akkadian tokeniser. Created train and test data files. Added an alternate version of the data using last year’s pre-processing.
4	Thursday	2020/06/04	Aligned the data and prepared it in the formats expected by FairSeq and OpenNMT
5	Friday	2020/06/05	Cleaned and prepared the newly obtained administrative and non-administrative data; analysed supervised techniques
6	Saturday	2020/06/06	Completed the model pipeline shell scripts and set up the GPU server
7	Sunday	2020/06/07	Prepared and finalised the benchmark dataset
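
To illustrate what training a BPE tokeniser (Tuesday's entry) involves, here is a minimal pure-Python sketch of learning and applying merge rules. It is not the code used in the project, which this report does not show. The function names `merge_pair`, `learn_bpe` and `encode`, the `</w>` end-of-word marker and the toy corpus are all assumptions made for illustration. Production tokeniser libraries add normalisation, byte-level handling and special tokens on top of this idea.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of whitespace-split sentences."""
    # Each word starts as a tuple of characters plus a hypothetical end-of-word marker.
    words = Counter()
    for line in corpus:
        for word in line.split():
            words[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself so runs are repeatable.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Split a single word into subword symbols by applying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    toy_corpus = ["abab abab", "abc"]  # toy data, not project data
    rules = learn_bpe(toy_corpus, num_merges=2)
    print(rules)
    print(encode("abab", rules))
    print(encode("abc", rules))
```

The number of merges acts as the vocabulary-size knob: more merges give longer, more word-like units, while fewer merges keep the output closer to individual characters.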