courses:rg:2012:segments (created 2012/12/29 15:19 by bilek; current revision 2013/01/03 22:38 by popel)

=====Introduction=====

We introduced the basic idea of Czech sentence segmentation and Czech sentence boundaries, and walked through a segmentation chart on an example.

=====Experiments with Automatic Identification of Segmentation Charts=====

====How to Obtain Segments from Syntactic Tree?====

We were unsure of the exact definition of Edge and Path between segments in this part.

We agreed that "

We were unsure why, in the example at the end of Section 3.1, segments 1 and 3 are in the same group; it is because there is an oriented path from 1 to 3 and from 3 to 1 (the word "

The graph created on the segments is not a tree but an oriented graph: it is fully connected, but cyclic.

It is a cyclic graph only if we have "

In sentence 2, the group is only 1-3, not 1-2-3, because "

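The grouping rule discussed here (two segments belong to one group iff there is an oriented path between them in both directions) is exactly the definition of strongly connected components of a directed graph. A minimal sketch, assuming segments are numbered and edges are the subordination links; the three-segment chart below is illustrative, not taken from the paper:

```python
from collections import defaultdict

def strongly_connected_components(edges, nodes):
    """Kosaraju's algorithm: two DFS passes over the directed graph."""
    graph, reverse = defaultdict(list), defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)

    # First pass: record finish order on the original graph.
    seen, order = set(), []
    def dfs(u):
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                dfs(v)
        order.append(u)
    for n in nodes:
        if n not in seen:
            dfs(n)

    # Second pass: components are DFS trees on the reversed graph,
    # processed in reverse finish order.
    seen.clear()
    components = []
    def dfs_rev(u, comp):
        seen.add(u)
        comp.append(u)
        for v in reverse[u]:
            if v not in seen:
                dfs_rev(v, comp)
    for u in reversed(order):
        if u not in seen:
            comp = []
            dfs_rev(u, comp)
            components.append(sorted(comp))
    return components

# Hypothetical chart: segments 1 and 3 point at each other, 2 is only pointed at.
print(strongly_connected_components([(1, 3), (3, 1), (1, 2)], [1, 2, 3]))
# → [[1, 3], [2]]  — 1 and 3 form one group, 2 stands alone
```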
The most important question, though, is why we do all this at all: the data in the PDT trees are richer than the segments we want to create. So what is the exact reason?

1) To prepare training data?
> Probably not, because they don't use any machine-learning approach.
2) To prepare testing data?
> No, because they already have some manually annotated sentences. Moreover, the described approach (using PDT gold a-trees as input) has only 70% accuracy.
3) As an "
> Probably not. There are better algorithms (with higher precision than 70%) exploiting gold a-trees.
4) To show some difficult cases in creating segmentation charts (even when gold a-trees are available)?
> Maybe.
5) It can be just to fill up the space :)
> ?

====How to Obtain Segments from Plain Text?====

At the beginning we talked about the basic set of rules for subordination. Some of them could be improved; for example, the handling of quotes used for highlighting.

We noted that they have a very strange definition of upper/lower and increasing/decreasing.

As opposed to Section 3.1, where the output was a single tree, here the output is several possibilities.

=====Evaluation and Analysis of the Results=====
====Evaluation of Rules for Syntactic Trees====
Is 57% enough? 73% sounds like a more important number, but it is still not enough.

We showed that it is possible to make a simpler algorithm: Martin Popel sketched a recursive one, where we go through the tree recursively and assign a new number for every verb we meet. This creates clauses, not "

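The recursive idea described above can be sketched roughly as follows; the Node class and the is_verb flag are hypothetical stand-ins for real PDT a-tree nodes and their morphological tags, not the actual PDT API:

```python
from itertools import count

class Node:
    def __init__(self, form, is_verb=False, children=None):
        self.form = form
        self.is_verb = is_verb
        self.children = children or []

def number_clauses(node, clause=0, counter=None):
    """Recursively walk the dependency tree; every verb node opens a new
    clause number, which its whole subtree inherits."""
    if counter is None:
        counter = count(1)
    if node.is_verb:
        clause = next(counter)
    labels = {node.form: clause}
    for child in node.children:
        labels.update(number_clauses(child, clause, counter))
    return labels

# Toy English tree standing in for a PDT a-tree:
tree = Node("said", True, [Node("he"),
                           Node("left", True, [Node("that"), Node("she")])])
print(number_clauses(tree))
# → {'said': 1, 'he': 1, 'left': 2, 'that': 2, 'she': 2}
```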
A reverse question: can't one segment span several clauses? (In other words, are the segments split finely enough?)

Not in correctly written Czech, which has the commas. However, this segmentation cannot be extended to English, since the comma rules are different there; and it also fails on incorrectly written Czech.

====Evaluation of Rules for Plain Text====
The question: why are these results better than in the first experiment?

Here we measure just recall, but precision is missing. So if we wanted good results, we could add "

On the other hand, they also report 2.17 paths per sentence on average, so the precision should not be that low either.

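A back-of-envelope consequence of the 2.17 figure: if each sentence has at most one correct path among those proposed, precision is at least recall divided by the average number of proposed paths. The recall value below is assumed purely for illustration; it is not a number from the paper:

```python
# Hypothetical recall, chosen only to illustrate the bound.
recall = 0.95
avg_paths_per_sentence = 2.17  # figure quoted in the discussion

# precision = correct / proposed >= recall / avg_paths
precision_lower_bound = recall / avg_paths_per_sentence
print(f"precision >= {precision_lower_bound:.2f}")
# → precision >= 0.44
```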
What does "

=====Conclusion=====

A nice idea: quick but reliable preprocessing. However, the authors don't show how much it helps parsers (if at all), and they don't even report precision.

It is slightly light on information,

=====Questions=====
We can make a step of size 2 with indirect speech:
řekl, že když se budeš modlit, tak se ti přání splní ("he said that if you pray, your wish will come true")

We have a step of size two on the first edge (the whole chart is 021). But both the PDT and the plain-text annotations give 011, because they take the two steps at the same time.

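Reading a chart like 021 as the sequence of depth levels of successive segments (an assumption about the encoding, inferred from the 021/011 discussion), the disputed "step size" is just the largest jump between adjacent levels:

```python
def max_step(chart):
    """Largest jump between levels of adjacent segments in a
    segmentation chart written as a digit string, e.g. '021'."""
    levels = [int(c) for c in chart]
    return max(abs(b - a) for a, b in zip(levels, levels[1:]))

print(max_step("021"))  # → 2: the indirect-speech reading needs a step of size 2
print(max_step("011"))  # → 1: the PDT/plain-text annotation merges the two steps
```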
Řekl: " 'kdo lže, ten krade' je přísloví" ("He said: " 'who lies, steals' is a proverb" ")

We can easily go UP by more than 2 steps, but when we want a step down, we have to construct it awkwardly, like this:

věděl, že když byl na vojně, tak když přišel kapitán do místnosti, tak se všichni postavili. ("he knew that when he was in the army, whenever the captain entered the room, everyone stood up.")

It can be argued that the step of size 2 is at "tak když přišel" ("so when he came").