联系方式

您当前位置:首页 >> Python编程Python编程

日期:2024-03-03 08:35

CS5012 Mark-Jan Nederhof Practical 1

Practical 1: Part of speech tagging:

three algorithms

This practical is worth 50% of the coursework component of this module. Its due

date is Wednesday 6th of March 2024, at 21:00. Note that MMS is the definitive source

for deadlines and weights.

The purpose of this assignment is to gain understanding of the Viterbi algorithm,

and its application to part-of-speech (POS) tagging. The Viterbi algorithm will be

related to two other algorithms.

You will also get to see the Universal Dependencies treebanks. The main purpose

of these treebanks is dependency parsing (to be discussed later in the module), but

here we only use their part-of-speech tags.

Getting started

We will be using Python3. On the lab (Linux) machines, you need the full path

/usr/local/python/bin/python3, which is set up to work with NLTK. (Plain

python3 won’t be able to find NLTK.)

If you run Python on your personal laptop, then next to NLTK (https://www.

nltk.org/), you will also need to install the conllu package (https://pypi.org/

project/conllu/).

To help you get started, download gettingstarted.py and the other Python

files, and the zip file with treebanks from this directory. After unzipping, run

/usr/local/python/bin/python3 gettingstarted.py. You may, but need not, use

parts of the provided code in your submission.

The three treebanks come from Universal Dependencies. If you are interested,

you can download the entire set of treebanks from https://universaldependencies.

org/.

1

Parameter estimation

First, we write code to estimate the transition probabilities and the emission probabilities of an HMM (Hidden Markov Model), on the basis of (tagged) sentences from

a training corpus from Universal Dependencies. Do not forget to involve the start-ofsentence marker ⟨s⟩ and the end-of-sentence marker ⟨/s⟩ in the estimation.

The code in this part is concerned with:

• counting occurrences of one part of speech following another in a training corpus,

• counting occurrences of words together with parts of speech in a training corpus,

• relative frequency estimation with smoothing.

As discussed in the lectures, smoothing is necessary to avoid zero probabilities for

events that were not witnessed in the training corpus. Rather than implementing a

form of smoothing yourself, you can for this assignment take the implementation of

Witten-Bell smoothing in NLTK (among the implementations of smoothing in NLTK,

this seems to be the most robust one). An example of use for emission probabilities is

in file smoothing.py; one can similarly apply smoothing to transition probabilities.

Three algorithms for POS tagging

Algorithm 1: eager algorithm

First, we implement a naive algorithm that chooses the POS tag for the i-th token

on the basis of the chosen (i − 1)-th tag and the i-th token. To be more precise, we

determine for each i = 1, . . . , n, in this order:

tˆi = argmax

ti

P(ti

| tˆi−1) · P(wi

| ti)

assuming tˆ0 is the start-of-sentence marker ⟨s⟩. Note that the end-of-sentence marker

⟨/s⟩ is not even used here.

Algorithm 2: Viterbi algorithm

Now we implement the Viterbi algorithm, which determines the sequence of tags for a

given sentence that has the highest probability. As discussed in the lectures, this is:

tˆ1 · · ·tˆn = argmax

t1···tn

Yn

i=1

P(ti

| ti−1) · P(wi

| ti)

!

· P(tn+1 | tn)

2

where the tokens of the input sentence are w1 · · ·wn, and t0 = ⟨s⟩ and tn+1 = ⟨/s⟩ are

the start-of-sentence and end-of-sentence markers, respectively.

To avoid underflow for long sentences, we need to use log probabilities.

Algorithm 3: individually most probable tags

We now write code that determines the most probable part of speech for each token

individually. That is, for each i, computed is:

tˆi = argmax

ti

X

t1···ti−1ti+1···tn

Yn

i=1

P(ti

| ti−1) · P(wi

| ti)

!

· P(tn+1 | tn)

To compute this effectively, we need to use forward and backward values, as discussed

in the lectures on the Baum-Welch algorithm, making use of the fact that the above is

equivalent to:

tˆi = argmax

ti

P

t1···ti−1

Qi

k=1 P(tk | tk−1) · P(wk | tk)



·

P

ti+1···tn

Qn

k=i+1 P(tk | tk−1) · P(wk | tk)



· P(tn+1 | tn)

The computation of forward values is very similar to the Viterbi algorithm, so you

may want to copy and change the code you already had, replacing statements that

maximise by corresponding statements that sum values together. Computation of

backward values is similar to computation of forward values.

See logsumexptrick.py for a demonstration of the use of log probabilities when

probabilities are summed, without getting underflow in the conversion from log probabilities to probabilities and back.

Evaluation

Next, we write code to determine the percentages of tags in a test corpus that are

guessed correctly by the above three algorithms. Run experiments for the training

and test corpora of the three included treebanks, and possibly for treebanks of more

languages (but not for more than 5; aim for quality rather than quantity). Compare

the performance of the three algorithms.

You get the best experience out of this practical if you also consider the languages of

the treebanks. What do you know (or what can you find out) about the morphological

and syntactic properties of these languages? Can you explain why POS tagging is more

difficult for some languages than for others?

3

Requirements

Submit your Python code and the report.

It should be possible to run your implementation of the three algorithms on the

three corpora simply by calling from the command line:

python3 p1.py

You may add further functionality, but then add a README file to explain how to run

that functionality. You should include the three treebanks needed to run the code, but

please do not include the entire set of hundreds of treebanks from Universal

Dependencies, because this would be a huge waste of disk space and band

width for the marker.

Marking is in line with the General Mark Descriptors (see pointers below). Evidence of an acceptable attempt (up to 7 marks) could be code that is not functional but

nonetheless demonstrates some understanding of POS tagging. Evidence of a reasonable attempt (up to 10 marks) could be code that implements Algorithm 1. Evidence

of a competent attempt addressing most requirements (up to 13 marks) could be fully

correct code in good style, implementing Algorithms 1 and 2 and a brief report. Evidence of a good attempt meeting nearly all requirements (up to 16 marks) could be

a good implementation of Algorithms 1 and 2, plus an informative report discussing

meaningful experiments. Evidence of an excellent attempt with no significant defects

(up to 18 marks) requires an excellent implementation of all three algorithms, and a

report that discusses thorough experiments and analysis of inherent properties of the

algorithms, as well as awareness of linguistic background discussed in the lectures. An

exceptional achievement (up to 20 marks) in addition requires exceptional understanding of the subject matter, evidenced by experiments, their analysis and reflection in

the report.

Hints

Even though this module is not about programming per se, a good programming style

is expected. Choose meaningful variable and function names. Break up your code into

small functions. Avoid cryptic code, and add code commenting where it is necessary for

the reader to understand what is going on. Do not overengineer your code; a relatively

simple task deserves a relatively simple implementation.

You cannot use any of the POS taggers already implemented in NLTK. However,

you may use general utility functions in NLTK such as ngrams from nltk.util, and

FreqDist and WittenBellProbDist from nltk.

4

When you are reporting the outcome of experiments, the foremost requirement is

reproducibility. So if you give figures or graphs in your report, explain precisely what

you did, and how, to obtain those results.

Considering current class sizes, please be kind to your marker, by making their task

as smooth as possible:

• Go for quality rather than quantity. We are looking for evidence of understanding

rather than for lots of busywork. Especially understanding of language and how

language works from the perpective of the HMM model is what this practical

should be about.

• Avoid Python virtual environments. These blow up the size of the files that

markers need to download. If you feel the need for Python virtual environments,

then you are probably overdoing it, and mistake this practical for a software

engineering project, which it most definitely is not. The code that you upload

would typically consist of three or four .py files.

• You could use standard packages such as numpy or pandas, which the marker will

likely have installed already, but avoid anything more exotic. Assume a version

of Python3 that is the one on the lab machines or older; the marker may not

have installed the latest bleeding-edge version yet.

• We strongly advise against letting the report exceed 10 pages. We do not expect

an essay on NLP or the history of the Viterbi algorithm, or anything of the sort.

• It is fine to include a couple of graphs and tables in the report, but don’t overdo

it. Plotting accuracy against any conceivable hyperparameter, just for the sake

of producing lots of pretty pictures, is not what we are after.

Pointers

• Marking

http://info.cs.st-andrews.ac.uk/student-handbook/

learning-teaching/feedback.html#Mark_Descriptors

• Lateness

http://info.cs.st-andrews.ac.uk/student-handbook/

learning-teaching/assessment.html#lateness-penalties

• Good Academic Practice

https://www.st-andrews.ac.uk/students/rules/academicpractice/

5


相关文章

版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:821613408 微信:horysk8 电子信箱:[email protected]
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:horysk8