We introduced the basic idea of Czech sentence segmentation and the Czech sentence boundaries. We showed the segmentation chart on an example.
We were unsure of the exact definitions of “edge” and “path” between the segments in this part.
We agreed that “edge” here means an oriented edge, whereas “path” means an oriented path between segments in the graph obtained from the original tree by “contracting” all nodes of the same segment into one.
We were unsure why, in the example at the end of Section 3.1, segments 1 and 3 are in the same group; it's because there is an oriented path from 1 to 3 and from 3 to 1 (the word “oriented” is missing in the text; however, without “oriented”, every pair of nodes would be connected by a path, since the analytical tree is a tree).
The graph that is built over the SEGMENTS is not a tree: it is an oriented, connected graph that can contain cycles.
It is cyclic only if we have “weird” segments, which happens because the segments are created according to the commas, not according to the tree. “Group” is another name for a cycle in the segment graph.
In sentence 2, the group is only 1-3, not 1-2-3, because “a” is on the boundary of the segment, so it doesn't count.
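Our reading of “group” can be sketched in a few lines of Python. This is only our interpretation, not the authors' code: the names `segment_graph`, `reachable` and `groups`, the toy tree, and the choice to orient edges from head to dependent are all assumptions, and the special rule about words on a segment boundary (like “a” above) is not modelled.

```python
from collections import defaultdict

def segment_graph(parent, segment_of):
    """Contract tree nodes by segment; return oriented edges between segments.

    Assumption: an edge goes from the segment of the head to the segment
    of the dependent.
    """
    edges = defaultdict(set)
    for child, head in parent.items():
        if head is None:          # the root has no head
            continue
        s_head, s_child = segment_of[head], segment_of[child]
        if s_head != s_child:     # edges inside one segment disappear
            edges[s_head].add(s_child)
    return edges

def reachable(edges, start):
    """Segments reachable from `start` by an oriented path of length >= 1."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def groups(edges, segments):
    """Put two segments in one group iff each is reachable from the other."""
    reach = {s: reachable(edges, s) for s in segments}
    assigned, result = set(), []
    for s in sorted(segments):
        if s in assigned:
            continue
        group = {s} | {t for t in reach[s] if s in reach[t]}
        assigned |= group
        result.append(sorted(group))
    return result

# Toy tree (made up): r is the root, x and z depend on r, y depends on x.
parent = {"r": None, "x": "r", "y": "x", "z": "r"}
segment_of = {"r": 1, "x": 3, "y": 1, "z": 2}

edges = segment_graph(parent, segment_of)
result = groups(edges, {1, 2, 3})
print(result)
```

In the toy tree there is an oriented path 1 to 3 and 3 to 1, so segments 1 and 3 end up in one group, while segment 2 stays alone.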
The most important question, though, is why we do all this, given that the data from the PDT tree are more thorough than the segments that we want to create. So what is the exact reason?
1) To prepare training data?
Probably not, because they don't use any machine learning approach.
2) To prepare testing data?
No, because they already have some manually annotated sentences. Moreover, the described approach (using PDT gold a-trees as input) has only 70% accuracy.
3) As an “oracle experiment” - using gold a-trees is an upper bound for using plaintext only.
Probably not. There are better algorithms (with higher precision than 70%) exploiting gold a-trees.
4) To show some difficult cases with creating segmentation charts (even when gold a-trees are available).
Maybe.
5) It can be just to fill up the space :)
?
At the beginning we talked about the basic set of rules for subordination. Some of them could be improved; for example, the rule for quotes used for highlighting.
We note they have a very strange definition of upper/lower and increasing/decreasing layers. The lower boundary is the one lower in the graph, but increasing it means moving it down. So increasing the lower boundary means making the interval bigger.
Unlike in 3.1, where the output was a single tree, here the output consists of several possibilities.
Is 57% enough? 73% sounds like a more important number, but it is still not enough.
We show that a simpler algorithm is possible: Martin Popel shows a simple recursive algorithm that goes recursively through the tree and assigns a new number to every verb it meets. This creates clauses, not “segments”, but the clauses can be split into segments later.
A reverse question: can't a segment span several clauses? (In other words, are the segments split finely enough?)
Not in correctly written Czech, which has the commas in place. However, this segmentation can't be extended to English, since the comma rules are different in English. It also fails on incorrectly written Czech.
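The recursive idea can be sketched as follows. This is a minimal illustration of how we understood it, not Martin Popel's actual code: the `Node` class, the `is_verb` flag and the hand-built toy tree are assumptions, and real a-trees need more care (coordination, auxiliary verbs, and so on).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    form: str
    is_verb: bool = False
    children: list = field(default_factory=list)

def number_clauses(root):
    """Walk the tree top-down; every verb opens a new clause number,
    every other node inherits the clause number of its head."""
    labels = []
    counter = 0

    def visit(node, current):
        nonlocal counter
        if node.is_verb:
            counter += 1
            current = counter
        labels.append((node.form, current))
        for child in node.children:
            visit(child, current)

    visit(root, 0)
    return labels

# Toy tree for "Kočka, která honila myš, zemřela" (structure assumed by hand).
tree = Node("zemřela", True, [
    Node("Kočka", False, [
        Node("honila", True, [Node("která"), Node("myš")]),
    ]),
])

labels = number_clauses(tree)
print(labels)
```

On the toy tree, “zemřela” and “Kočka” get clause 1 and “honila”, “která” and “myš” get clause 2; no commas are needed for that.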
“Kočka, která honila myš, zemřela” (“The cat that chased the mouse died”).
“Kočka, která honila myš zemřela” is not correct Czech (the second comma is missing); the TectoMT parser could live with that, but the segmentation according to this article couldn't.
The question is why these results are better than in the first experiment.
Here only recall is measured; precision is missing. So if we wanted good results, we could add “0-100” to each one.
On the other hand, it is also shown that there are 2.17 paths per sentence on average. So the precision should not be that low either.
What does “ambiguity 1.32” mean? It is another, confusing name for the number of paths per sentence.
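To make the recall versus ambiguity point concrete, here is a small sketch under our own assumptions: each sentence has one gold chart and a set of candidate charts, recall counts the sentences whose gold chart is among the candidates, and ambiguity is the average number of candidates per sentence. The function names and the toy data are made up, not taken from the paper.

```python
def recall(candidates, gold):
    """Share of sentences whose gold chart is among the candidates."""
    hits = sum(1 for cands, g in zip(candidates, gold) if g in cands)
    return hits / len(gold)

def ambiguity(candidates):
    """Average number of candidate charts per sentence."""
    return sum(len(c) for c in candidates) / len(candidates)

# Toy data (made up): charts written as strings of layer numbers.
gold = ["011", "0", "021"]
candidates = [{"011", "001"}, {"0"}, {"011"}]

# Padding every sentence with extra guesses raises recall for free...
padded = [c | {"021", "012"} for c in candidates]

print(recall(candidates, gold), ambiguity(candidates))
print(recall(padded, gold), ambiguity(padded))
```

The padded run finds every gold chart, but only because the number of candidates per sentence grows; that is why recall needs to be read together with the ambiguity.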
Nice idea: we can do some quick but reliable preprocessing. However, the authors don't show how much it helps the parsers (if at all). The precision is not even reported.
The paper is slightly light on information, and it is strange that it continues a paper from 2001.
We can get a step of size 2 with indirect speech.
řekl, že když se budeš modlit, tak se ti přání splní (“he said that if you pray, your wish will come true”)
We have a step of size two on the first edge (the whole thing is 021). But they give 011 with both PDT and plaintext, too, because they do two steps at the same time.
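What we mean by “step size” can be written down directly, assuming a chart such as 021 is read as one layer number per segment (our reading, with single-digit layers; the helper names are made up):

```python
def parse_layers(chart):
    """Turn a chart string such as "021" into a list of layer numbers."""
    return [int(ch) for ch in chart]

def step_sizes(layers):
    """Absolute layer difference between each pair of neighbouring segments."""
    return [abs(b - a) for a, b in zip(layers, layers[1:])]

ours = step_sizes(parse_layers("021"))
theirs = step_sizes(parse_layers("011"))
print(ours, max(ours))
print(theirs, max(theirs))
```

With 021 the first edge has step 2, while 011 never steps by more than 1.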
Řekl: “ 'kdo lže, ten krade' je přísloví” (He said: “ 'he who lies, steals' is a proverb”)
We can go UP more than 2 steps easily, but when we want the step down, we have to construct it weirdly like this.
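věděl, že když byl na vojně, tak když přišel kapitán do místnosti, tak se všichni postavili. (“he knew that when he was in the army, whenever the captain came into the room, everyone stood up.”)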
It can be argued that the step size 2 is at “tak když přišel”.