Structured search through multiple annotation files

If you want to perform a detailed search over multiple EAF files, but the options offered by Search multiple EAF (see Searching through multiple annotation files) are not comprehensive enough, you can use yet another search mode: the structured search. It allows you to restrict the search domain to certain tiers, to use regular expressions, etc. while examining multiple annotation files at once. This search function matches whole words as well as parts of words that match the query.

The function can be reached via Search > Structured search multiple eaf.... When you click on this option for the first time, you will be asked to define a search domain in the form of one or more .eaf files. The next time you open the Structured search, it uses the last defined search domain. The search window offers the possibility to define a new search domain: click on Define Domain and do one of the following:

After defining a search domain for the first time or when you open the Structured search with a search domain from the previous usage, the following window will open:

Figure 4.18. Search eaf files



As you can see there are three tabs offering different kinds of search:

Substring Search Tab

This tab offers the simplest search. It just asks for a search string. After entering the search string you can click on Find (or press Enter) to start the search process. This will result in a screen like the one below:

Figure 4.19. Substring Search Results



It shows tokens that contain the search string and some tokens in the context printed in italic typeface. The default number of tokens in the context is three on both sides. When the number of hits exceeds the maximum number the window can contain, you can view the rest of the hits by clicking the < and > buttons that appear above the list of hits to go back or forward one page. To view an annotation in the timeline view of the main window, simply double-click it:

Figure 4.20. Hit in transcription


For further investigation of the results the search window offers a context menu that enables you to view the results in other manners and to save the results. To open the context menu right click on one of the results. The menu has the following options:

  • Show Frequency view: clicking this option shows both frequency and relative frequency (as a percentage) of the tokens found. The relative frequency is relative to the number of hits.
  • Show Frequency view (by frequency): This will display the frequencies, sorted by count.
  • Show Alignment view: This option will show you an aligned view of the search results, and there are a number of options you can set. You can change the time scale, hide or show info balloons and set the visible columns (through the context-menu).
  • Show hit in transcription: clicking this option shows the transcription in the timeline viewer similar to double clicking an annotation.
  • Show Info balloons: by clicking this option you enable ELAN to show you information about a token in an info balloon. This balloon will appear when your mouse cursor is hovering over a token. The information shown in the balloon contains:
    • Transcription file
    • Tier name
    • Tier type
    • Participant
    • Position in tier
    • begin time
    • end time
    • duration
  • Context size: this option offers a sub menu that enables you to decrease and increase the context size of the results. Minimum size is 0 and maximum size is 8 tokens.
  • Font: click this option to change the font and font size of the results.
  • Save hits: when clicking this option, you will be asked to select a directory and enter a file name. The result is a file that contains the following information per token found:
    • Annotation: the annotation token containing the search string.
    • HitPositionInAnnotation: the position of the first character of the search string in the annotation.
    • HitLength: number of characters in the hit
    • HitNumberInAnnotation: if the search string is found more than once in an annotation, this number will give the rank of the hit within the annotation.
    • AnnotationBeginTime: the begin time in ms of an annotation containing the search string.
    • AnnotationEndTime: the end time in ms of an annotation containing the search string.
    • HitPositionInTier: the position of the annotation in a tier.
    • TierName: the name of the tier containing the annotation.
    • TierType: the type of tier containing the annotation.
    • LeftContext: the left context of the annotation.
    • RightContext: the right context of the annotation.
    • TranscriptionName: the path and file name of the transcription in which the annotation is found.
  • Save hit statistics: clicking this option lets you save a file that contains hit statistics. The export dialog contains the following options:
    • Separate hit count per hit value: if checked, there is one line of statistics for each hit value. If not checked, there is one line per file.
    • Include file name column.
    • Include file path column.
    • Time format: specify whether the time format should be in milliseconds (ms) or seconds and milliseconds (sec.ms).

    After clicking OK you can enter a file name and click Save to save the statistics file.
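
The hit-related fields written by Save hits can be illustrated with a minimal Python sketch. It is not ELAN's implementation: the `Hit` class and `find_hits` function are hypothetical names, and the sketch assumes that positions are counted from 0, that hit numbers are counted from 1, and that hits within one annotation do not overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    annotation: str                  # the annotation containing the search string
    hit_position_in_annotation: int  # index of the first matching character (assumed 0-based)
    hit_length: int                  # number of characters in the hit
    hit_number_in_annotation: int    # rank of the hit within the annotation (assumed 1-based)


def find_hits(annotation: str, query: str) -> List[Hit]:
    """Return one Hit per non-overlapping occurrence of query in annotation."""
    hits: List[Hit] = []
    if not query:
        return hits
    start = annotation.find(query)
    while start != -1:
        hits.append(Hit(annotation, start, len(query), len(hits) + 1))
        # Continue searching after the end of the current hit.
        start = annotation.find(query, start + len(query))
    return hits


for hit in find_hits("the man and the dog", "the"):
    print(hit.hit_number_in_annotation, hit.hit_position_in_annotation, hit.hit_length)
```

An annotation that contains the search string twice yields two records that differ only in position and hit number.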

Frequency View

When you are in frequency view or frequency view (by frequency) (Figure 4.21, “Frequency View”), the context menu (right-click) has the following options:

  • Show Concordance view: clicking this option will show the annotation results.
  • Show hit in transcription: clicking this option shows the transcription in the timeline viewer similar to double clicking an annotation.
  • Save frequency info: when clicking this option, you will be asked to select a directory and enter a file name. The result is a file that contains the following information:
    • Annotation
    • Percentage
    • Count
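
The three columns of the frequency file can be reproduced with a short Python sketch. The function name `frequency_table` is hypothetical; the sketch only assumes, as described above, that the percentage is relative to the total number of hits and that the "by frequency" variant sorts by count.

```python
from collections import Counter
from typing import List, Tuple


def frequency_table(hits: List[str]) -> List[Tuple[str, float, int]]:
    """Return (annotation, percentage, count) rows, most frequent first."""
    total = len(hits)
    if total == 0:
        return []
    counts = Counter(hits)
    rows = [(value, 100.0 * n / total, n) for value, n in counts.items()]
    # Sort by descending count; ties are broken alphabetically.
    return sorted(rows, key=lambda row: (-row[2], row[0]))


for annotation, percentage, count in frequency_table(["man", "man", "woman", "man"]):
    print(annotation, percentage, count)
```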

Figure 4.21. Frequency View



Alignment View

The alignment view allows you to view your search results in an aligned time-based view. For detailed information about the Alignment View, see View search results in Alignment View.

Single Layer Search tab

The Single Layer Search tab offers a more elaborate search than the Substring Search tab. The first difference is that the Single Layer Search tab has a query history. Clicking the < and > buttons moves backward and forward one query, respectively. It is also possible to save queries and to load previously saved queries.

Figure 4.22. Single Layer Search



Furthermore, the tab offers different modes to restrict the search. The first mode lets you choose the form of the results. There are three options:

  • Annotation: the search string is part of or exact match in an annotation.
  • N-gram over annotations: each element of the search string (elements are divided by spaces) is part of or exact match in one of several consecutive annotations.
  • N-gram within annotation: each element of the search string (elements are divided by spaces) is part of or exact match in one of several consecutive tokens within one annotation.

The second mode offers the straightforward distinction between case sensitive and case insensitive search. The third mode lets you choose whether the element of the first mode should contain the search string (substring match), should exactly match the search string (exact match), or should be matched by a regular expression (regular expression). Finally, you can restrict the search to one tier, a tier type or a participant.
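
The interaction between case sensitivity and the three match types can be sketched in Python as follows. The function `element_matches` and its parameters are hypothetical, and Python's `re` module is used as a stand-in; ELAN's own regular expression syntax may differ in details.

```python
import re


def element_matches(text: str, query: str, mode: str = "substring",
                    case_sensitive: bool = False) -> bool:
    """Check one annotation or token against a query in the given match mode."""
    flags = 0 if case_sensitive else re.IGNORECASE
    if mode == "substring":
        # The query is taken literally and may occur anywhere in the text.
        return re.search(re.escape(query), text, flags) is not None
    if mode == "exact":
        # The query is taken literally and must cover the whole text.
        return re.fullmatch(re.escape(query), text, flags) is not None
    if mode == "regex":
        # The query is interpreted as a regular expression.
        return re.search(query, text, flags) is not None
    raise ValueError("unknown mode: " + mode)


print(element_matches("The man", "man"))                # substring match
print(element_matches("The man", "man", mode="exact"))  # exact match
```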

Wildcards and negation

When you choose an N-gram to be the form of the result, you can use two more options: a wildcard and a negation. The wildcard takes the form of a #-sign. For instance, the search string the # man with the mode N-gram over annotations would return three annotations per hit: the first annotation contains the (or exactly matches it, if the mode exact match is chosen), the second annotation may contain anything due to the use of the wildcard, and the third annotation contains or exactly matches man. If the mode N-gram within annotation is chosen, each hit contains one annotation. In this annotation there is an N-gram consisting of three tokens where the first token contains or exactly matches the, the second may be anything and the third contains or exactly matches man.

If you want to find N-grams where a token matches anything but one string, you can use the negation operator NOT(...), where you fill in the search string that must not be matched in place of the dots. For instance, the search string the NOT(strange) man would return 3-grams in the same way as described above, but the hits where the second annotation or token matches strange are left out.
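
The wildcard and the negation can be illustrated with a small Python sketch that scans a sequence of tokens (or annotation values) for matching N-grams. The helper names are hypothetical, matching is assumed to be case sensitive, and this is not ELAN's actual search engine.

```python
from typing import List, Tuple


def parse_element(element: str) -> Tuple[str, str]:
    """Classify one query element as wildcard, negation or plain search string."""
    if element == "#":
        return ("any", "")
    if element.startswith("NOT(") and element.endswith(")"):
        return ("not", element[4:-1])
    return ("match", element)


def element_ok(kind: str, value: str, token: str, exact: bool) -> bool:
    """Test a single token against a parsed query element."""
    if kind == "any":
        return True
    hit = (token == value) if exact else (value in token)
    return hit if kind == "match" else not hit


def find_ngrams(tokens: List[str], query: str, exact: bool = False) -> List[List[str]]:
    """Return every run of consecutive tokens that matches the query elements."""
    elements = [parse_element(e) for e in query.split()]
    n = len(elements)
    results = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(element_ok(k, v, t, exact) for (k, v), t in zip(elements, window)):
            results.append(window)
    return results


tokens = "the old man saw the strange man".split()
print(find_ngrams(tokens, "the # man"))
print(find_ngrams(tokens, "the NOT(strange) man"))
```

With the wildcard both 3-grams are found; with NOT(strange) the 3-gram whose middle element matches strange is left out.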

Multiple Layer Search tab

The Multiple Layer Search tab houses the most comprehensive search in ELAN. As in the Single Layer Search tab, a Query History is kept, enabling you to go back and forward one query by clicking < and > respectively. It is also possible to save a query or to load a previously saved query. To do so, click either the Save query or the Load query button. Queries are saved in XML format.

The two modes case sensitive/case insensitive and substring match/exact match/regular expression are also similar to the second tab. The first new element is the Clear-button. Clicking this button will clear all data of a query.

A new option has been included into the menu containing all the different types of matches (i.e. substring match, exact match, regular expression): variable match. As the name says, it has to do with using variables, and it can be used every time you want to search for two or more annotations, contained in two or more different tiers, reporting the same text and/or the same time alignment. See the image below for an example:

Figure 4.23. Variable Match


As you can see in the example, the variable 'X' can match any value that is the same across annotations meeting all other constraints. Here the annotations are in the same time frame (overlap) and reside in the same file (the base constraint is Must be in the same file). In this case 'BONE' is found in the tier 'Gloss RH English' and in 'Gloss LH English', and the same holds for the value '(p-) leg dog'.

It is possible to use more than one variable, e.g. X and Y. This is especially useful in those cases where more than two query fields are filled in.

Figure 4.24. Multiple Variable Match


X and Y can either match different values or the same value. If a variable should be unique, i.e. should never match the same value as any of the other variables, it should be preceded by an exclamation mark, e.g. !Y.
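
The idea behind a single variable X combined with an overlap constraint can be sketched in Python. The `Ann` tuple and the function names are hypothetical, the tier contents are made-up illustration data, and the sketch assumes that annotations which merely touch at a boundary do not overlap.

```python
from typing import List, NamedTuple, Tuple


class Ann(NamedTuple):
    begin: int  # begin time in milliseconds
    end: int    # end time in milliseconds
    value: str  # annotation text


def overlaps(a: Ann, b: Ann) -> bool:
    """True if the two annotations share some stretch of time."""
    return a.begin < b.end and b.begin < a.end


def variable_match(upper: List[Ann], lower: List[Ann]) -> List[Tuple[Ann, Ann]]:
    """Pairs of overlapping annotations on two tiers that carry the same value (variable X)."""
    return [(a, b) for a in upper for b in lower
            if overlaps(a, b) and a.value == b.value]


right_hand = [Ann(0, 500, "BONE"), Ann(600, 900, "DOG")]
left_hand = [Ann(100, 400, "BONE"), Ann(650, 800, "CAT")]
print(variable_match(right_hand, left_hand))
```

Only the two BONE annotations are reported, because they overlap and share the same value.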

The buttons Minimal Duration and Maximal Duration enable you to constrain the minimal and maximal duration of each result. When you click on one of the buttons, a dialog window appears, e.g.:

Figure 4.25. Minimal Duration



Here you can enter the minimal or maximal duration as the total number of milliseconds or in the format hours:minutes:seconds.milliseconds. A value of 0 milliseconds or 00:00:00.000 is treated as undefined. Searching for annotations with a maximum duration that is less than the minimum duration is impossible. Hence, entering conflicting values results in an error message saying that the combination is impossible. After you enter a valid duration, it is displayed on the corresponding button.
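
The two input formats and the consistency check can be sketched in Python. The function names are hypothetical and the exact input validation performed by ELAN is not known here; the sketch only follows the rules stated above.

```python
import re
from typing import Optional

# hours:minutes:seconds with an optional .milliseconds part
_TIME = re.compile(r"^(\d+):(\d{1,2}):(\d{1,2})(?:\.(\d{1,3}))?$")


def parse_duration(text: str) -> Optional[int]:
    """Return the duration in milliseconds, or None if the value is undefined (zero)."""
    text = text.strip()
    if text.isdigit():
        ms = int(text)
    else:
        m = _TIME.match(text)
        if m is None:
            raise ValueError("unrecognised duration: " + text)
        hours, minutes, seconds = int(m.group(1)), int(m.group(2)), int(m.group(3))
        fraction = (m.group(4) or "0").ljust(3, "0")  # ".5" means 500 ms
        ms = ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(fraction)
    return ms or None  # 0 is treated as undefined


def check_durations(minimal: Optional[int], maximal: Optional[int]) -> None:
    """Raise an error for an impossible combination of minimal and maximal duration."""
    if minimal is not None and maximal is not None and maximal < minimal:
        raise ValueError("maximal duration is less than minimal duration")


print(parse_duration("1500"), parse_duration("00:00:01.500"), parse_duration("0"))
```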

The buttons Begin After and End Before give a dialog similar to that of the previous two buttons. They give the possibility to restrict the annotations in the result to begin after a certain time and end before a certain time. Entering a Begin After-time that is greater than the End Before-time or vice versa results in an error message saying it is impossible. After entering a correct time, it will be displayed in the corresponding button.

Search string and constraints

Beneath the buttons discussed above, you will find a table consisting of white and green fields. Search strings are entered in the white fields while a green field between two non-empty white fields must contain a constraint. The fields on one row give the search strings and constraints to be matched by annotations on one tier. The result of having two or more rows in the query table is that the search engine may find annotations on two or more tiers as one hit. Furthermore, it is possible to restrict the search to one (type of) tier for each row by choosing the appropriate option in the pull-down menu on the right of each row.

Let us first take a look at search strings and constraints in one row. If you enter two search strings in two white fields separated by a green field, you must fill in that green field i.e. make a constraint. Clicking the arrow on the green field gives a menu offering the following constraints:

  • = N annotations: between the annotations containing the two search strings, there must be exactly N annotations.
  • > N annotations: between the annotations containing the two search strings, there must be more than N annotations.
  • < N annotations: between the annotations containing the two search strings, there must be less than N annotations.
  • = X milliseconds: between the annotations containing the two search strings, there must be exactly X milliseconds.
  • > X milliseconds: between the annotations containing the two search strings, there must be more than X milliseconds.
  • < X milliseconds: between the annotations containing the two search strings, there must be less than X milliseconds.
  • No constraints: there are no constraints.
  • Clear: clear the current constraint.
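
The distance constraints within one tier can be sketched in Python. The helper names are hypothetical, and the sketch assumes that the number of milliseconds between two annotations is measured from the end of the first to the beginning of the second.

```python
import operator
from typing import List, Tuple

Annotation = Tuple[int, int, str]  # (begin ms, end ms, value)

OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def annotations_between(i: int, j: int) -> int:
    """Number of annotations lying between positions i and j on the same tier."""
    return j - i - 1


def gap_ms(tier: List[Annotation], i: int, j: int) -> int:
    """Milliseconds between the end of annotation i and the begin of annotation j."""
    return tier[j][0] - tier[i][1]


def holds(op: str, limit: int, actual: int) -> bool:
    """Evaluate a constraint such as '= N annotations' or '< X milliseconds'."""
    return OPS[op](actual, limit)


tier = [(0, 400, "the"), (500, 900, "old"), (1000, 1300, "man")]
# "the" and "man" with the constraint "= 1 annotations"
print(holds("=", 1, annotations_between(0, 2)))
# "the" and "man" with the constraint "< 700 milliseconds"
print(holds("<", 700, gap_ms(tier, 0, 2)))
```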

When you click on Find and there is an empty constraint between two non-empty search string fields, you will get an error message. You will also get an error message if an empty search string field, together with its constraint fields, lies between two non-empty search string fields.

As we saw earlier, the search mechanism on this tab makes it possible to construct a query for two or more tiers (up to eight). Besides the constraints on annotations on a tier, you can also apply constraints on annotations on different tiers. This means that if the search engine has found an annotation that matches a search string on one tier, the engine checks whether the search string for another tier can be matched on another tier while taking into account the constraint between the two search strings.

The top down hierarchy of the rows in the query table does not reflect the hierarchy of the tiers in your data. That means, for instance, that search strings and constraints in the upper query table row may be matched by a child tier of the tier that matches search strings and constraints in the middle query table row.

Clicking the arrow in the green field between two search strings gives a menu with the following constraints:

  • Fully aligned: the begin time and end time of both annotations are the same.
  • Overlap: part of both annotations overlap. This includes the other options Fully aligned, Left overlap, Right overlap, Surrounding and Within.
  • Left overlap: the begin time and end time of the annotation matching the lower search string lie before the begin time and end time, respectively, of the annotation matching the upper search string.
  • Right overlap: the begin time and end time of the annotation matching the lower search string lie after the begin time and end time, respectively, of the annotation matching the upper search string.
  • Surrounding: the begin time of the annotation matching the lower search string lies before the begin time of the annotation matching the upper search string and the end time of the annotation matching the lower search string lies after the end time of the annotation matching the upper search string.
  • Within: the begin time of the annotation matching the lower search string lies after the begin time of the annotation matching the upper search string and the end time of the annotation matching the lower search string lies before the end time of the annotation matching the upper search string.
  • No overlap: the begin time of the annotation matching one search string lies after the end time of the annotation matching the other search string, in either order.

  • No annotation: a special case that retrieves annotations matching the upper search string that have no (overlapping) annotation on the lower tier. It is not possible to enter a lower search string; contrary to the No overlap constraint, which still looks for annotations on the lower tier (namely those that don't overlap), this constraint really looks for no annotation in the timespan of the upper annotation (empty slots). The user interface allows specifying constraints on lower levels and to the left and right of this constraint, but the behavior in that case is undefined!
  • begin time - begin time = X milliseconds: the begin time of the annotations matching the upper search string must lie exactly X milliseconds before the begin time of the annotation matching the lower search string.
  • begin time - begin time < X milliseconds: the begin time of the annotations matching the upper search string must lie less than X milliseconds before the begin time of the annotation matching the lower search string.
  • begin time - begin time > X milliseconds: the begin time of the annotations matching the upper search string must lie more than X milliseconds before the begin time of the annotation matching the lower search string.
  • begin time - end time = X milliseconds: the end time of the annotations matching the upper search string must lie exactly X milliseconds before the begin time of the annotation matching the lower search string.
  • begin time - end time < X milliseconds: the end time of the annotations matching the upper search string must lie less than X milliseconds before the begin time of the annotation matching the lower search string.
  • begin time - end time > X milliseconds: the end time of the annotations matching the upper search string must lie more than X milliseconds before the begin time of the annotation matching the lower search string.
  • end time - begin time = X milliseconds: the begin time of the annotations matching the upper search string must lie exactly X milliseconds before the end time of the annotation matching the lower search string.
  • end time - begin time < X milliseconds: the begin time of the annotations matching the upper search string must lie less than X milliseconds before the end time of the annotation matching the lower search string.
  • end time - begin time > X milliseconds: the begin time of the annotations matching the upper search string must lie more than X milliseconds before the end time of the annotation matching the lower search string.
  • end time - end time = X milliseconds: the end time of the annotations matching the upper search string must lie exactly X milliseconds before the end time of the annotation matching the lower search string.
  • end time - end time < X milliseconds: the end time of the annotations matching the upper search string must lie less than X milliseconds before the end time of the annotation matching the lower search string.
  • end time - end time > X milliseconds: the end time of the annotations matching the upper search string must lie more than X milliseconds before the end time of the annotation matching the lower search string.
  • No constraint: there are no constraints.
  • Clear: clear the current constraint.
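
The overlap-related constraints in this list can be summarised in a Python sketch that classifies the position of the lower annotation relative to the upper one. The function `relation` is a hypothetical helper, not part of ELAN. It assumes that annotations which only touch at a boundary do not overlap, and it returns the generic label "overlap" for overlapping cases that none of the more specific definitions cover (for example, equal begin times but different end times).

```python
from typing import Tuple

Span = Tuple[int, int]  # (begin ms, end ms)


def relation(upper: Span, lower: Span) -> str:
    """Name the temporal relation of the lower annotation to the upper annotation."""
    ub, ue = upper
    lb, le = lower
    if lb >= ue or ub >= le:
        return "no overlap"
    if lb == ub and le == ue:
        return "fully aligned"
    if lb < ub and le < ue:
        return "left overlap"
    if lb > ub and le > ue:
        return "right overlap"
    if lb < ub and le > ue:
        return "surrounding"
    if lb > ub and le < ue:
        return "within"
    return "overlap"


upper = (1000, 2000)
for lower in [(1000, 2000), (500, 1500), (1500, 2500), (500, 2500), (1200, 1800), (2000, 3000)]:
    print(lower, relation(upper, lower))
```

Every result other than "no overlap" would also satisfy the general Overlap constraint.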

An example of a Multiple Layer Search with constraints is shown below:

Figure 4.26. Multiple Layer query



As you can see the tiers in the result are indicated by #1 and #2, corresponding to the first and second query table row respectively. The annotations in a tier are surrounded by vertical bars indicating their start and end.

It is possible to add or remove columns and/or layers to your search query. To do so, click the respective button:

  • Fewer Columns
  • More Columns
  • Fewer Layers
  • More Layers

    It is also possible to hide the query once there are search results. This allows you to see more query results within a single window. This can be helpful when using the Alignment View (see View search results in Alignment View).

Figure 4.26, “Multiple Layer query” also illustrates what to do if you would like to use both Exact match and Substring match in one query: use the Regular expression mode. Where you would like to have an exact match, use the ^ and $ signs to match the beginning and end of a string (e.g. ^of$); otherwise, just enter a word for the substring match.

The figure also shows how to use a wildcard to match anything. Instead of using the # as in the Single Layer Search, you can use the regular expression .+ to indicate any character (the dot) one or more times (the plus). See also Appendix A, REGULAR EXPRESSION SEARCH for more on regular expressions. The NOT(...) construction, on the other hand, can be used in the Multiple Layer Search in the same way as described in Single Layer Search tab.
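
The three patterns mentioned here behave as follows in a short Python sketch. Python's `re` module is used as a stand-in for ELAN's regular expression engine, which may differ in details, and the annotation values are made up for illustration.

```python
import re

annotations = ["of", "off", "roof", "n", "and"]

# ^of$ anchors the pattern at both ends, so it acts as an exact match.
exact = [a for a in annotations if re.search(r"^of$", a)]

# A bare word matches anywhere in the annotation, so it acts as a substring match.
substring = [a for a in annotations if re.search(r"of", a)]

# .+ matches any annotation that contains at least one character.
anything = [a for a in annotations if re.search(r".+", a)]

print(exact)
print(substring)
print(anything)
```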

One final but no less important remark concerns the placement of more and less restrictive search strings. Figure 4.26, “Multiple Layer query” shows a very restrictive search string in the upper row: ^n$. The less restrictive, or rather non-restrictive, search string .+ is in the middle row. As we saw earlier, the hierarchy of the rows in the query does not reflect the hierarchy in the data. That means that the search string ^n$ could also be placed in the lower row without affecting the outcome of the search. While this is true, we advise you to place restrictive search strings in the leftmost field of the uppermost row possible and the least restrictive search string in the rightmost field of the lowest row possible. The reason for this is the order in which the search engine considers the search strings in the query. If it finds a restrictive search string it can filter out all the other possibilities, but if it finds a less restrictive search string it has to consider all the matches of this search string. In the example of Figure 4.26, “Multiple Layer query”, if ^n$ is in the bottom row, the search engine first considers all annotations matching .+, which is in fact all annotations in the search domain. Because of this, the search takes much more time than if ^n$ were in the upper row.
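
The effect of the ordering can be made tangible with a toy Python model. This is only an illustration of the reasoning above, not ELAN's search engine: the function `count_checks` is hypothetical, temporal constraints are ignored, and the model simply assumes that the second search string is only tried for annotations that matched the first one.

```python
import re
from typing import List, Tuple


def count_checks(first_pattern: str, second_pattern: str,
                 first_tier: List[str], second_tier: List[str]) -> Tuple[int, int]:
    """Return (number of hits, number of pattern checks) for a two-row query."""
    hits = 0
    checks = 0
    for a in first_tier:
        checks += 1
        if not re.search(first_pattern, a):
            continue  # a restrictive first pattern prunes most candidates here
        for b in second_tier:
            checks += 1
            if re.search(second_pattern, b):
                hits += 1
    return hits, checks


# 100 made-up annotations, exactly one of which is "n".
tier = ["n"] + ["word%d" % i for i in range(99)]

print(count_checks(r"^n$", r".+", tier, tier))  # restrictive string first
print(count_checks(r".+", r"^n$", tier, tier))  # non-restrictive string first
```

Both orders find the same number of hits in this model, but the second order performs far more pattern checks, which mirrors the advice to place the restrictive search string first.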

View search results in Alignment View

From the context menu (right-click the search results), you can view query results from the Multiple Layer Search in Alignment View:

Figure 4.27. Alignment View


There are a number of options you can set when viewing the query results. Firstly, you can adjust the time scale of the results:

  • 1 sec / 2 sec / 5 sec / 10 sec / 15 sec / 20 sec / User defined / Scale to fit.

    When choosing 'Scale to fit', every query result will be scaled to fit the window, which means the time scale for every result will differ.

    There is also the possibility to hide the alignment time scale altogether. To do so, go to the context menu (right-click) and uncheck Show alignment times by clicking on it.

You can set the visible columns to the right of the query results through the context-menu (right-click anywhere in the results). You can show or hide the following columns:

  • Tier Type
  • Annotator
  • Participant
  • Begin Time
  • End Time
  • Duration

The blue bars above every query result graphically show the duration of each annotation and the position of the annotations with respect to each other.

There are also two indicators visible, depending on the length of the query result and the setting of the time scale. These indicators are either red or green.

A green indicator means that the annotation does not fit in the current time scale. In the example above, the bottom annotation 'and then you see um a man in maybe his fifties' has a duration of 5.060 seconds. The time scale is set to 1 second, so 4.060 seconds are outside the current view.

The red indicator means that the annotation in the query result starts outside of the current time scale. The top annotation 'fifties' overlaps the bottom annotation, but starts at 9.177 seconds. This causes it not to be visible in the current time scale, which is set to display 1 second. You would need to set the time scale to 10 seconds to see both annotations visualised completely (as the blue bars) and how they overlap.