How to install SRILM on Ubuntu easily


SRILM is a statistical language modeling toolkit that helps you build language models for tasks such as speech recognition, machine translation, and spelling correction. Language models are useful in many natural language processing tasks. This post will guide you through installing SRILM on Ubuntu and show a little of how to use the toolkit to train a new language model.

Install SRILM on Ubuntu

To install SRILM on Ubuntu, you need to download the installation file from the SRILM download page. At the time of writing, the newest SRILM version is 1.7.3.

The downloaded file is a compressed archive, so we need to extract it and then copy the contents to a system directory, here /usr/share/srilm.
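As a minimal sketch of the extraction step, assuming the download is a gzip-compressed tar archive, the helper below unpacks it into a target directory. The function name extract_srilm and the archive file name in the comment are hypothetical, and writing under /usr/share normally requires root.

```python
import tarfile
from pathlib import Path


def extract_srilm(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


# Hypothetical usage (file name is an assumption; needs write access):
# extract_srilm("srilm-1.7.3.tar.gz", "/usr/share/srilm")
```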

From the /usr/share/srilm directory, edit the Makefile so that the SRILM variable points to the installation directory.

Replace the SRILM definition (line no. 7) with one that sets SRILM to /usr/share/srilm.

Note: You can use vim or gedit with sudo to edit the file.
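If you prefer to script the edit, the sketch below rewrites the first SRILM assignment in the Makefile text, whether it is active or commented out. It assumes the variable appears on a line of its own in the form "SRILM = ..."; the helper name set_srilm_root is hypothetical.

```python
import re


def set_srilm_root(makefile_text, srilm_root):
    """Return makefile_text with the first SRILM assignment set to srilm_root.

    Matches an active line ("SRILM = ...") or a commented one ("# SRILM = ...").
    """
    pattern = re.compile(r"^#?[ \t]*SRILM[ \t]*=.*$", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f"SRILM = {srilm_root}", makefile_text, count=1
    )
    if count == 0:
        raise ValueError("no SRILM assignment found in Makefile text")
    return new_text


# Hypothetical usage (path from the post; needs write access):
# path = "/usr/share/srilm/Makefile"
# text = open(path).read()
# open(path, "w").write(set_srilm_root(text, "/usr/share/srilm"))
```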

Next, you need to install the tcsh package, for example with apt (sudo apt-get install tcsh).

OK, if everything was successful, the ngram-count binary should now run without errors.

Please remember the path to this ngram-count binary; we will use it when training a language model. After the step above, the SRILM installation should be complete (the output should look like the image below).
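To find that path programmatically, here is a small sketch. It assumes the build placed the binaries under bin/<machine-type>/ inside the SRILM directory, which is where SRILM builds normally put them; the helper name find_srilm_binary is hypothetical.

```python
from pathlib import Path


def find_srilm_binary(srilm_root, name="ngram-count"):
    """Return the first bin/<machine-type>/<name> under srilm_root, or None."""
    matches = sorted(Path(srilm_root).glob(f"bin/*/{name}"))
    return matches[0] if matches else None


# Prints None if SRILM is not installed at this location.
print(find_srilm_binary("/usr/share/srilm"))
```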

Next, you can optionally add an SRILM alias to the end of your ~/.bashrc file so that you can call the tools without typing the full path.

Then reload your ~/.bashrc and check that the alias works, for example by running ngram -help:

[Figure: the ngram -help output on success]

That’s all. If installing SRILM fails with an error related to libiconv, refer to the following URL to fix it.

Now that SRILM is installed, I will guide you through using it to train your first language model.

Train a language model with SRILM

Put simply, SRILM has two main programs: (1) ngram and (2) ngram-count. We use ngram-count to train a new language model and ngram to evaluate the trained model.

Training a new language model

To train a new statistical language model, we need a corpus and a vocab file. SRILM trains the model by counting, in the corpus, the n-grams made of words from the vocab. If a word is not in your vocab, the model treats it as out of vocabulary (OOV).
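To make the vocab and OOV idea concrete, here is a minimal sketch that builds a vocabulary from a toy corpus and counts OOV tokens in new text. It assumes plain whitespace tokenization; the function names and the toy sentences are made up for illustration and are not part of SRILM.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Return the sorted list of words seen at least min_count times."""
    counts = Counter(word for line in sentences for word in line.split())
    return sorted(w for w, c in counts.items() if c >= min_count)


def count_oov(sentences, vocab):
    """Count tokens in sentences that are not in vocab."""
    known = set(vocab)
    return sum(1 for line in sentences for w in line.split() if w not in known)


train = ["the cat sat", "the dog sat"]
vocab = build_vocab(train)
print("\n".join(vocab))                    # one word per line, like vocab.txt
print(count_oov(["the bird sat"], vocab))  # "bird" is the only OOV token
```

Raising min_count is one simple way to keep rare words out of the vocab so that they are handled as OOV.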

I have prepared my own data here: a small Vietnamese corpus that you can download to follow this tutorial.

I assume that your vocab file is named vocab.txt (one word per line) and your text corpus is named train.txt (one sentence per line). We can then train a model with ngram-count using the following options:

  • -order: the n-gram order of the language model to train; the default is n = 3
  • -lm: the output file name; it can be an .arpa file or a gzip-compressed (.gz) file
  • -kndiscountn: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Chen and Goodman’s modified Kneser-Ney discounting for N-grams of order n
  • -gtnmin: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the minimal count of N-grams of order n that will be included in the LM. All N-grams with a lower frequency are effectively discounted to 0
  • -text: path to the training corpus
  • -vocab: path to the vocab file
  • -sort: output counts in lexicographic order
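The options above can be combined into a single call. The sketch below only assembles the argument list and prints it; the file names train.txt, vocab.txt, and lm.arpa, and the assumption that ngram-count is reachable by name, are illustrative. Modified Kneser-Ney discounting may fail on very small corpora, in which case the -kndiscount flags can be dropped.

```python
import subprocess


def ngram_count_args(text, vocab, lm, order=3, binary="ngram-count"):
    """Build the argument list for training an n-gram model with ngram-count."""
    args = [binary, "-order", str(order), "-text", text,
            "-vocab", vocab, "-lm", lm, "-sort"]
    # Request modified Kneser-Ney discounting for each order up to `order`.
    args += [f"-kndiscount{n}" for n in range(1, order + 1)]
    return args


def train(text, vocab, lm, order=3, binary="ngram-count"):
    """Run ngram-count; raises CalledProcessError if training fails."""
    subprocess.run(ngram_count_args(text, vocab, lm, order, binary), check=True)


print(" ".join(ngram_count_args("train.txt", "vocab.txt", "lm.arpa")))
```

Pass the full path of the ngram-count binary as binary if you did not set up an alias or PATH entry.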

To learn more about ngram-count, you can refer to this page.

Evaluate a trained language model

After training a language model, we want to see how well it fits our problem. We can use the ngram command to see how the model performs on your text.

Perplexity is the metric used to evaluate a language model. A lower perplexity means the model fits your text well, and a higher value means it fits poorly. If you get a high perplexity, the domain of the training corpus is probably not close to that of your test text, and the model will not work as you expect.
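As a rough illustration of what the number means, the sketch below computes perplexity from a total base-10 log probability. It follows the convention, as I understand SRILM's, of counting one end-of-sentence token per sentence and leaving out OOV and zero-probability words; the input numbers are made up.

```python
def perplexity(logprob, words, sentences, oovs=0, zeroprobs=0):
    """Perplexity from a total log10 probability over the scored tokens."""
    # Each sentence adds one end-of-sentence token; OOVs and zeroprobs are skipped.
    tokens = words - oovs - zeroprobs + sentences
    return 10 ** (-logprob / tokens)


# 8 words in 2 sentences with total log10 probability -20:
# 10 scored tokens at an average of -2 each gives a perplexity of 100.
print(perplexity(-20.0, words=8, sentences=2))
```

A perplexity of 100 can be read as the model being, on average, as uncertain as if it were choosing uniformly among 100 words at each position.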

To evaluate your test text, put it all in one file (one sentence per line) and run ngram with the following options:

  • -order: the n-gram order to use for evaluation; the default is n = 3
  • -lm: path to the trained language model file
  • -ppl: path to the test file
  • -debug: the level of detail of the evaluation (0-4), e.g. entire corpus only (0), individual sentences (1), word level (2)
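Put together, an evaluation call and a reader for its summary might look like the sketch below. The file names are illustrative, and the sample text is a made-up example in the shape I expect ngram's summary lines to take, so check it against your own output before relying on the parser.

```python
import re


def ngram_ppl_args(lm, test_file, order=3, debug=0, binary="ngram"):
    """Build the argument list for a perplexity evaluation with ngram."""
    return [binary, "-order", str(order), "-lm", lm,
            "-ppl", test_file, "-debug", str(debug)]


SUMMARY = re.compile(r"logprob=\s*(\S+)\s+ppl=\s*(\S+)\s+ppl1=\s*(\S+)")


def parse_ppl(output):
    """Extract logprob, ppl and ppl1 from the first summary found in output."""
    match = SUMMARY.search(output)
    if match is None:
        raise ValueError("no perplexity summary found")
    logprob, ppl, ppl1 = (float(x) for x in match.groups())
    return {"logprob": logprob, "ppl": ppl, "ppl1": ppl1}


# Made-up sample (assumed format, not real output).
sample = ("file test.txt: 2 sentences, 8 words, 0 OOVs\n"
          "0 zeroprobs, logprob= -20 ppl= 100 ppl1= 316.228\n")
print(" ".join(ngram_ppl_args("lm.arpa", "test.txt")))
print(parse_ppl(sample))
```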

To learn more about ngram, you can refer to this page.

This was a tutorial on how to install SRILM on Ubuntu, with a simple guide to training a new language model with SRILM. In the next post, I will cover building a large language model that combines corpora from several domains, model pruning, and using a trained language model from Python.

Founder of the Lập Trình Không Khó community, created with the wish to help young people on their way to becoming future programmers. Everything I write here is simply a hobby of recording the knowledge I have accumulated.