Week 1: Tokenisers and Benchmark

by Rachit


Week Summary

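This week covered pre-processing the monolingual and parallel data, training and comparing BBPE, BPE and BertWordPiece tokenizers, experimenting with the CLTK/Akkadian tokenizer, preparing the data for FairSeq and OpenNMT, completing the model pipeline shell scripts, setting up the GPU server, and finalising the benchmark dataset.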

Daily Work Update

| # | Day | Date | A short description of the work done |
|---|-----|------|--------------------------------------|
| 1 | Monday | 2020/06/01 | Wrote pre-processing scripts and applied them to the monolingual and parallel data |
| 2 | Tuesday | 2020/06/02 | Trained BBPE (byte-level BPE), BPE and BertWordPiece tokenizers on the pre-processed text, saved the vocabularies and compared the results |
| 3 | Wednesday | 2020/06/03 | Experimented with the CLTK/Akkadian tokenizer. Created train and test data files. Added an alternate version of the data using last year's preprocessing |
| 4 | Thursday | 2020/06/04 | Aligned and prepared the data in the formats expected by FairSeq and OpenNMT |
| 5 | Friday | 2020/06/05 | Cleaned and prepared newly obtained (administrative and non-administrative) data, analysed supervised techniques |
| 6 | Saturday | 2020/06/06 | Completed the model pipeline shell scripts and set up the GPU server |
| 7 | Sunday | 2020/06/07 | Prepared and finalised the benchmark dataset |
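
As a rough illustration of the train/test split and alignment steps from Days 3 and 4, here is a minimal sketch of one way to produce line-aligned parallel files. It is not the project's actual script: the function names, the `src`/`tgt` file suffixes, the 10% test fraction and the fixed seed are all assumptions made for the example.

```python
import random
from pathlib import Path


def split_parallel(pairs, test_fraction=0.1, seed=0):
    """Shuffle (source, target) pairs reproducibly and split into train/test.

    test_fraction and seed are illustrative defaults, not project settings.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_test = max(1, int(len(pairs) * test_fraction))
    return pairs[n_test:], pairs[:n_test]


def write_aligned(pairs, out_dir, split, src_ext="src", tgt_ext="tgt"):
    """Write one sentence per line so that line i of both files forms a pair.

    The file naming (<split>.<suffix>) is a hypothetical convention.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    src_path = out_dir / f"{split}.{src_ext}"
    tgt_path = out_dir / f"{split}.{tgt_ext}"
    with open(src_path, "w", encoding="utf-8") as fs, \
            open(tgt_path, "w", encoding="utf-8") as ft:
        for src, tgt in pairs:
            # Collapse whitespace: an embedded newline would break alignment.
            fs.write(" ".join(src.split()) + "\n")
            ft.write(" ".join(tgt.split()) + "\n")
    return src_path, tgt_path
```

Calling `split_parallel` once and then `write_aligned` for each split yields files such as `train.src` and `train.tgt`. Both FairSeq and OpenNMT read parallel text as a pair of plain-text files in which line i of the source file corresponds to line i of the target file, so the same output can be passed to either toolkit.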