Data Model

Conceptualization

From both the user and the task requirements one can derive that four basic data processing functions need to be carried out: data have to be read, persistently saved, searched, and deleted. Furthermore, some kind of user management and multi-user processing is necessary. In addition, the framework should support web technologies, be well documented, and be easy to extend. Ideally, it realizes the MVC pattern.

The TEI guidelines for markup at the word level are compatible with the word structure assumed here. Listing TEI-example for comfortable shows a possible word-level markup for the word comfortable.

TEI-example for comfortable
<w type="adjective">
 <m type="base">
  <m type="prefix" baseForm="con">com</m>
  <m type="root">fort</m>
 </m>
 <m type="suffix">able</m>
</w>

This data model reflects just one theoretical conception of word structure. Crucially, the model assumes that the suffix node is on a par with the word base. On the one hand, this implies that the word stem directly dominates the suffix, but not the prefix. The prefix, on the other hand, is enclosed in the base, which amounts to a stronger lexical, and less abstract, attachment to the root of the word. Modeling prefixes and suffixes on different hierarchical levels has important consequences for the branching direction at the subword level (here right-branching). Theoretical considerations aside, the choice of the TEI standard is reasonable with a view to a sustainable architecture that allows data to be exchanged with little to no additional adjustment.
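
The two hierarchical levels can be made visible by walking the TEI example programmatically. The following is a minimal sketch in Python using only the standard library. The function name morpheme_depths is chosen for illustration and is not part of the application.

```python
import xml.etree.ElementTree as ET

# The TEI example for "comfortable" from the listing above.
TEI_WORD = """
<w type="adjective">
 <m type="base">
  <m type="prefix" baseForm="con">com</m>
  <m type="root">fort</m>
 </m>
 <m type="suffix">able</m>
</w>
"""


def morpheme_depths(xml_text):
    """Return (type, surface form, depth) for every <m> element in document order."""
    word = ET.fromstring(xml_text.strip())
    result = []

    def walk(node, depth):
        # Only direct <m> children are visited here; nesting is handled by recursion.
        for child in node.findall("m"):
            form = (child.text or "").strip()
            result.append((child.get("type"), form, depth))
            walk(child, depth + 1)

    walk(word, 1)
    return result


depths = morpheme_depths(TEI_WORD)
for mtype, form, depth in depths:
    print("  " * depth + f"{mtype}: {form!r}")
```

In the result, base and suffix share depth 1, whereas prefix and root sit at depth 2 inside the base, which is the asymmetry described above.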

The drawback is that the model is not suitable for all languages. It reflects a theoretical construct based on Indo-European languages. As long as attention is paid to the language for which the software is used, this is not problematic. The model fits most languages of the Indo-European family, which (unfortunately) account for the overwhelming majority of all research carried out.

Implementation

It is advantageous to use established standards, and it makes sense to keep the metadata of each corpus separate from the data model used for the words to be analyzed.

For the present case, the TEI standard was identified as an appropriate markup for words. For the implementation, this means that the TEI guidelines have to be implemented as an object type compatible with the chosen repository framework. However, the TEI standard is incomplete with respect to the diachronic dimension, i.e. information on the development of a word. To remain compatible with the elements of the TEI standard while meeting the requirements of the application, some attributes were added. With this solution, the XML files can be processed according to the TEI standard by ignoring the additional attributes, and the additional markup can still be extracted when needed. The additional attributes comprise a link to the corpus metadata as well as the position and occurrence of the affixes. Information on position, and some quantification thereof, is potentially relevant for a wealth of research questions, such as predictions on the productivity of derivatives and their interaction with the phonological or syntactic modules. They were therefore included with a view to future use.
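
The idea of ignoring or extracting the additional attributes can be sketched as follows. This is a minimal illustration, not the application's code: the set TEI_ATTRIBUTES, the two function names, and the sample word with its attribute values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: attributes treated as TEI-conformant in this sketch.
# Everything else counts as an application-specific addition.
TEI_ATTRIBUTES = {"type", "baseForm"}

# Hypothetical word entry carrying additional attributes.
SAMPLE = """
<w type="adjective" corpus="demo-corpus" occurrence="3">
 <m type="base"><m type="root">fort</m></m>
 <m type="suffix" position="1">able</m>
</w>
"""


def strip_additional_attributes(xml_text, keep=TEI_ATTRIBUTES):
    """Return the markup with every attribute outside `keep` removed."""
    root = ET.fromstring(xml_text.strip())
    for element in root.iter():
        for name in list(element.attrib):
            if name not in keep:
                del element.attrib[name]
    return ET.tostring(root, encoding="unicode")


def additional_attributes(xml_text, keep=TEI_ATTRIBUTES):
    """Return (tag, attribute, value) for every attribute outside `keep`."""
    root = ET.fromstring(xml_text.strip())
    return [
        (element.tag, name, value)
        for element in root.iter()
        for name, value in element.attrib.items()
        if name not in keep
    ]


plain_tei = strip_additional_attributes(SAMPLE)
extras = additional_attributes(SAMPLE)
print(plain_tei)
print(extras)
```

A consumer that only understands the standard would work with plain_tei, while the application itself can read the values collected in extras.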

For reasons of efficiency in subsequent processing, the historic dates begin and end were included in both the word data model and the corpus data model. The resulting word data model is given in listing Word Data Model. Whereas the attributes of objecttype are specific to the repository framework, the TEI structure can be recognized in the hierarchy of the metadata element, starting with the element named w (line 17).

Word Data Model
<?xml version="1.0" encoding="UTF-8"?>
<objecttype
 name="morphilo"
 isChild="true"
 isParent="true"
 hasDerivates="true"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="datamodel.xsd">
 <metadata>
  <element name="morphiloContainer" type="xml" style="dontknow"
 notinherit="true" heritable="false">
   <xs:sequence>
    <xs:element name="morphilo">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="w" minOccurs="0" maxOccurs="unbounded">
        <xs:complexType mixed="true">
         <xs:sequence>
          <!-- stem -->
          <xs:element name="m1" minOccurs="0" maxOccurs="unbounded">
           <xs:complexType mixed="true">
            <xs:sequence>
             <!-- base -->
             <xs:element name="m2" minOccurs="0" maxOccurs="unbounded">
              <xs:complexType mixed="true">
               <xs:sequence>
                <!-- root -->
                <xs:element name="m3" minOccurs="0" maxOccurs="unbounded">
                 <xs:complexType mixed="true">
                  <xs:attribute name="type" type="xs:string"/>
                 </xs:complexType>
                </xs:element>
                <!-- prefix -->
                <xs:element name="m4" minOccurs="0" maxOccurs="unbounded">
                 <xs:complexType mixed="true">
                  <xs:attribute name="type" type="xs:string"/>
                  <xs:attribute name="PrefixbaseForm" type="xs:string"/>
                  <xs:attribute name="position" type="xs:string"/>
                 </xs:complexType>
                </xs:element>
               </xs:sequence>
               <xs:attribute name="type" type="xs:string"/>
              </xs:complexType>
             </xs:element>
             <!-- suffix -->
             <xs:element name="m5" minOccurs="0" maxOccurs="unbounded">
              <xs:complexType mixed="true">
               <xs:attribute name="type" type="xs:string"/>
               <xs:attribute name="SuffixbaseForm" type="xs:string"/>
               <xs:attribute name="position" type="xs:string"/>
               <xs:attribute name="inflection" type="xs:string"/>
              </xs:complexType>
             </xs:element>
            </xs:sequence>
            <!-- stem attributes -->
            <xs:attribute name="type" type="xs:string"/>
            <xs:attribute name="pos" type="xs:string"/>
            <xs:attribute name="occurrence" type="xs:string"/>
           </xs:complexType>
          </xs:element>
         </xs:sequence>
         <!-- w attributes at word level -->
         <xs:attribute name="lemma" type="xs:string"/>
         <xs:attribute name="complexType" type="xs:string"/>
         <xs:attribute name="wordtype" type="xs:string"/>
         <xs:attribute name="occurrence" type="xs:string"/>
         <xs:attribute name="corpus" type="xs:string"/>
         <xs:attribute name="begin" type="xs:string"/>
         <xs:attribute name="end" type="xs:string"/>
        </xs:complexType>
       </xs:element>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
  </element>
  <element name="wordtype" type="classification" minOccurs="0" maxOccurs="1">
   <classification id="wordtype"/>
  </element>
  <element name="complexType" type="classification" minOccurs="0" maxOccurs="1">
   <classification id="complexType"/>
  </element>
  <element name="corpus" type="classification" minOccurs="0" maxOccurs="1">
   <classification id="corpus"/>
  </element>
  <element name="pos" type="classification" minOccurs="0" maxOccurs="1">
   <classification id="pos"/>
  </element>
  <element name="PrefixbaseForm" type="classification" minOccurs="0"
  maxOccurs="1">
   <classification id="PrefixbaseForm"/>
  </element>
  <element name="SuffixbaseForm" type="classification" minOccurs="0"
  maxOccurs="1">
   <classification id="SuffixbaseForm"/>
  </element>
  <element name="inflection" type="classification" minOccurs="0" maxOccurs="1">
   <classification id="inflection"/>
  </element>
  <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded" >
   <target type="corpmeta"/>
  </element>
 </metadata>
</objecttype>

Additionally, it is worth mentioning that some attributes are modeled as classifications. Each of these has to be listed as a separate element in the data model. This was done for all attributes whose values are subject to little or no change. In fact, all suffix and prefix morphemes of the language under investigation should be known and are therefore defined as classifications. The same is true for the parts of speech, named pos in the morphilo data model above, for which the Penn Treebank tagset was used. Finally, the morphemic layers, all named m in the standard model, were renamed m1 through m5. This is the only change to the standard that could be problematic if the data is processed elsewhere and the change is not documented explicitly. Yet this change was necessary because the MyCoRe repository otherwise throws errors caused by ambiguities between the different m layers.
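
If the data is to be exchanged with tools that expect the standard element name, the renamed layers can be mapped back to m. The following is a minimal sketch under stated assumptions: the function name to_standard_layers and the sample entry are hypothetical, and the sample has not been validated against the data model.

```python
import re
import xml.etree.ElementTree as ET

# Matches exactly the renamed layer elements m1 through m5.
LAYER_TAG = re.compile(r"m[1-5]")

# Hypothetical entry using the renamed layers of the morphilo data model.
MORPHILO_WORD = """
<w lemma="comfortable">
 <m1 type="stem">
  <m2 type="base">
   <m3 type="root">fort</m3>
   <m4 type="prefix">com</m4>
  </m2>
  <m5 type="suffix">able</m5>
 </m1>
</w>
"""


def to_standard_layers(xml_text):
    """Rename every m1..m5 element to m and return the serialized result."""
    root = ET.fromstring(xml_text.strip())
    for element in root.iter():
        if LAYER_TAG.fullmatch(element.tag):
            element.tag = "m"
    return ET.tostring(root, encoding="unicode")


standard = to_standard_layers(MORPHILO_WORD)
print(standard)
```

The mapping loses no information here because the layer is still recoverable from the type attribute of each element.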

The second data model describes only a few properties of the text corpora from which the words are extracted. Listing Corpus Data Model shows only the metadata element. For the sake of simplicity in the prototype, this data model is kept as simple as possible. The only obligatory field is the name of the corpus. The dates of the corpus are optional because in some cases a text cannot be dated reliably.

Corpus Data Model
<metadata>
  <!-- required fields -->
  <element name="korpusname" type="text" minOccurs="1" maxOccurs="1"/>
  <!-- optional fields -->
  <element name="sprache" type="text" minOccurs="0" maxOccurs="1"/>
  <element name="size" type="number" minOccurs="0" maxOccurs="1"/>
  <element name="datefrom" type="text" minOccurs="0" maxOccurs="1"/>
  <element name="dateuntil" type="text" minOccurs="0" maxOccurs="1"/>
  <!-- number of words -->
  <element name="NoW" type="text" minOccurs="0" maxOccurs="1"/>
  <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded">
    <target type="morphilo"/>
  </element>
</metadata>
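
The split between obligatory and optional fields can be read off the listing via the minOccurs attribute. The sketch below parses a copy of the metadata element (comments omitted) with the Python standard library. The function name split_fields is chosen for illustration, and treating a missing minOccurs as 1 is an assumption borrowed from XML Schema, not a documented rule of the repository framework.

```python
import xml.etree.ElementTree as ET

# Copy of the corpus metadata element from the listing above, without comments.
CORPUS_METADATA = """
<metadata>
  <element name="korpusname" type="text" minOccurs="1" maxOccurs="1"/>
  <element name="sprache" type="text" minOccurs="0" maxOccurs="1"/>
  <element name="size" type="number" minOccurs="0" maxOccurs="1"/>
  <element name="datefrom" type="text" minOccurs="0" maxOccurs="1"/>
  <element name="dateuntil" type="text" minOccurs="0" maxOccurs="1"/>
  <element name="NoW" type="text" minOccurs="0" maxOccurs="1"/>
  <element name="corpuslink" type="link" minOccurs="0" maxOccurs="unbounded">
    <target type="morphilo"/>
  </element>
</metadata>
"""


def split_fields(metadata_xml):
    """Return (required, optional) field names based on minOccurs."""
    root = ET.fromstring(metadata_xml.strip())
    required, optional = [], []
    for element in root.findall("element"):
        # Assumption: a missing minOccurs counts as 1, as in XML Schema.
        min_occurs = int(element.get("minOccurs", "1"))
        (required if min_occurs >= 1 else optional).append(element.get("name"))
    return required, optional


required, optional = split_fields(CORPUS_METADATA)
print("required:", required)
print("optional:", optional)
```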

As a final remark, one might have noticed that all attributes are modeled as strings, although other data types are available and the fields encoding dates or the number of words suggest otherwise. The MyCoRe framework even provides a data type historydate. There is no fully satisfying explanation for not using it. All that can be said is that using data types other than string later leads to problems in the interplay between the search engine and the repository framework. These issues seem to be well known and can be followed on GitHub.
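
One consequence of string-typed fields is that numeric comparisons have to be done after conversion in the consuming code. The following sketch shows one way this could look. It is not taken from the application: the function names are hypothetical, and it assumes that begin and end hold plain years, which the data model does not guarantee.

```python
def parse_year(value):
    """Convert a string-typed year to int; return None if missing or not numeric."""
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        return None


def overlaps(word_attributes, year_from, year_to):
    """Check whether the begin/end range of a word overlaps a queried period."""
    begin = parse_year(word_attributes.get("begin"))
    end = parse_year(word_attributes.get("end"))
    if begin is None or end is None:
        # Undatable entries are excluded rather than guessed at.
        return False
    return begin <= year_to and end >= year_from


# Hypothetical attribute values of a w element.
word = {"lemma": "comfortable", "begin": "1150", "end": "1500"}
print(overlaps(word, 1400, 1600))
```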