Menota data services

This is not currently part of the peer-reviewed material of the project. Do not cite as a research publication.

These pages are designed to assist in the lemmatising of Menota texts, including linking lemmas to ONP’s wordlist.

The features and processes outlined here are only available to logged-in users with appropriate permissions.

The process involves importing the Menotic TEI/XML file and then using the database interface to assist with lemmatising.

Stage 1: importing the XML file and initial modifications

This stage involves selecting a manuscript and opening a form which manages the automated processes. The user uploads a Menotic TEI file and specifies the text to which it is linked.

  1. The TEI file is uploaded
  2. The TEI file is modified: entity declarations are inserted (PHP’s XML parser needs local declarations)
  3. The TEI file is modified: all <w> elements have an xml:id attribute inserted, consisting of an xml-compatible token identifying the manuscript plus an iterated number for the position of the <w> in the file (see the sketch below)
  4. An entry is created in the database for the text, linking it to the corresponding manuscript entry in the database, as well as to the corresponding text (or Menota: Catalogue if multiple texts are included)
  5. The resulting XML file is compressed and saved with the manuscript-text link

This ensures that there is a link between the text and the manuscript, and that all <w> elements in the XML file can be uniquely identified when the database processes them.
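
To illustrate steps 2 and 3 (and the compression in step 5), here is a minimal sketch using PHP’s DOM extension. The file names, the siglum and the entity list are placeholder assumptions, not the project’s actual code.

    <?php
    // Prepend local entity declarations and give every <w> an xml:id.
    $siglum   = 'AM162B';                    // hypothetical xml-compatible manuscript token
    $entities = '<!ENTITY aacute "&#225;">'; // hypothetical local declarations

    $xml = file_get_contents('menota-text.xml');
    // Insert an internal DTD subset before the root element so that entity
    // references can be resolved when the file is parsed in stage 2.
    $xml = preg_replace('/<TEI/', "<!DOCTYPE TEI [ $entities ]>\n<TEI", $xml, 1);

    $doc = new DOMDocument();
    $doc->loadXML($xml); // entity references are kept, not resolved, at this stage
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('tei', 'http://www.tei-c.org/ns/1.0');

    $n = 0;
    foreach ($xpath->query('//tei:w') as $w) {
        // Siglum plus running number uniquely identifies the word in this file.
        $w->setAttributeNS('http://www.w3.org/XML/1998/namespace',
                           'xml:id', $siglum . '-' . ++$n);
    }
    // Compress and store with the manuscript-text link (step 5).
    file_put_contents('menota-text-ids.xml.gz', gzencode($doc->saveXML()));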

No further changes to the XML file are made, but lemmatising information is added to the exported XML at the point of use in the final stage.

Stage 2: importing the words into the database

This stage is fully automated.

  1. The XML file is parsed and any errors are reported (for some reason I can’t get this to work in the first stage)
  2. The entity declarations are parsed into a list for later conversion
  3. XPath is used to identify all <w> and <me:punct> elements
  4. The parser iterates over the words and punctuation
  5. The facs, dipl and norm levels are identified; entity references are resolved and some minor processing is applied, preserving the inner TEI tagging
  6. The lemma and me:msa attributes are identified and the latter is processed, converting it to the corresponding gramm_* columns in the word table (see the sketch after this list)
  7. Each word is inserted into the word table, linked to the text in the database, including the xml:id (word.idx), using an adaptation of the three-level structure used by the Skaldic Project
  8. Punctuation is inserted into the *_after column of the previously-inserted word
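
A minimal sketch of steps 3 to 6, assuming PHP’s DOMXPath. The Menota namespace URI, the file name and the database steps are assumptions rather than the project’s actual code.

    <?php
    // Walk <w> and <me:punct> in document order and extract the three levels
    // plus the lemma and me:msa attributes.
    $doc = new DOMDocument();
    $doc->loadXML(gzdecode(file_get_contents('menota-text-ids.xml.gz')), LIBXML_NOENT);

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('tei', 'http://www.tei-c.org/ns/1.0');
    $xpath->registerNamespace('me',  'http://www.menota.org/ns/1.0'); // assumed URI

    // Serialise an element's children so the inner TEI tagging is preserved.
    function innerXml(?DOMNode $el): string {
        if ($el === null) return '';
        $out = '';
        foreach ($el->childNodes as $child) {
            $out .= $el->ownerDocument->saveXML($child);
        }
        return $out;
    }

    foreach ($xpath->query('//tei:w | //me:punct') as $node) {
        if ($node->localName === 'w') {
            $word = [
                'idx'   => $node->getAttributeNS('http://www.w3.org/XML/1998/namespace', 'id'),
                'facs'  => innerXml($xpath->query('.//me:facs', $node)->item(0)),
                'dipl'  => innerXml($xpath->query('.//me:dipl', $node)->item(0)),
                'norm'  => innerXml($xpath->query('.//me:norm', $node)->item(0)),
                'lemma' => $node->getAttribute('lemma'),
                // me:msa values such as "xNC gM nS" would be split into the
                // gramm_* columns here.
            ];
            // ... INSERT INTO word ... (step 7)
        } else {
            // <me:punct>: append to the *_after columns of the previous word (step 8).
        }
    }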

Stage 3: auto-lemmatising

This stage is fully automated.

  1. A list of headwords is built from the imported ONP wordlist: only unambiguous (for word class and gender) headwords are used, i.e. if there are multiple homographs with the same word class, they are excluded as potentially ambiguous. These must also be linked to the database’s own headword list
  2. This list of headwords and keys is linked to the wordlist according to matches on the headword form, word class, and in the case of nouns, the gender
  3. The resulting key is inserted into the word table as a link to the lemma table

Initial trials suggest this process matches around 80% of words, with a very high level of accuracy for these matches.
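
The matching in steps 1 to 3 might look something like the following PostgreSQL-flavoured statement, issued here via PDO. All table and column names apart from word and the gramm_* prefix are hypothetical.

    <?php
    $pdo = new PDO('pgsql:dbname=menota'); // hypothetical connection
    $pdo->exec(<<<SQL
        -- keep only headwords with no homograph of the same class (and gender)
        WITH unambiguous AS (
            SELECT headword, word_class, gender, MIN(lemma_id) AS lemma_id
            FROM onp_wordlist
            GROUP BY headword, word_class, gender
            HAVING COUNT(*) = 1
        )
        UPDATE word w
        SET lemma_id = u.lemma_id
        FROM unambiguous u
        WHERE w.lemma_id IS NULL
          AND w.lookup_form = u.headword
          AND w.gramm_class = u.word_class
          AND (u.word_class <> 'noun' OR w.gramm_gender = u.gender)
        SQL);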

Stage 4: manual lemmatising

Manual lemmatising is done through a separate form which is linked from the processing form.

The lemmatising interface from the Skaldic Project has been adapted for this process. It is an assisted lemmatiser which remembers the wordforms linked to a particular headword and prompts the user with options based on the previous matches.

The form lists up to 100 words in the text, in order. For longer texts, the already-lemmatised words are not shown.

The first column contains information about the word: the word number, the lookup form (the lemma if it is in the XML file, otherwise the normalised form), and the facsimile and normalised forms. Clicking on the box gives a popup with information about the word, including the surrounding text at the three different levels, plus any morphosyntactic analysis available.

The second column gives a list of headwords previously linked to either this lemma form or this word form, in order of frequency (based on ONP’s citation count). There are sometimes odd results in this list because of previous errors, but these can be edited in the third column.
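
The suggestion list might be produced by something like this query; the wordform_lemma table and all column names are hypothetical.

    <?php
    $pdo = new PDO('pgsql:dbname=menota'); // hypothetical connection
    // Headwords previously linked to the same lookup form, most cited first.
    $sugg = $pdo->prepare(
        'SELECT l.lemma_id, l.headword
         FROM wordform_lemma wl
         JOIN lemma l ON l.lemma_id = wl.lemma_id
         WHERE wl.wordform = :form
         ORDER BY l.onp_citation_count DESC');
    $sugg->execute(['form' => $lookupForm]);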

The third column provides a lookup facility for finding headwords that are not found by the semi-automated process. As the user types, headwords starting with those letters (with some normalisation) are shown in a drop-down list. Some headwords may be available from the recent ONP list that have not yet been imported into the database word list; these appear with an arrow and link to a form (in a new window) where the word can be imported. The user then modifies the search term to refresh the list, and the word will appear in the top part of the list for linking.

The third column also provides popups for further information and editing of the lemma: deleting the link, reverting to the previous link, information about other words linked to the lemma, editing the lemma, listing other instances of the same wordform linked to the lemma, and looking up the lemma in imported dictionaries. The edit lemma button can be used to create new lemmas, but this should be done with caution: apart from some rarer proper nouns, all Old Norse prose words should already be available in the ONP wordlist.

When the update button is pressed, the database saves the lemma links with the words and updates the index of wordform-lemma links with the (potentially) new word forms. New words are then presented in the form for further lemmatising.
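
Behind the update button, the save-and-reindex step might look something like this sketch; the form field, the wordform_lemma table and the column names are hypothetical.

    <?php
    $pdo = new PDO('pgsql:dbname=menota'); // as in the earlier sketches
    $pdo->beginTransaction();

    // Save the lemma links chosen in the form.
    $save = $pdo->prepare('UPDATE word SET lemma_id = :lemma WHERE idx = :idx');
    foreach ($_POST['links'] as $idx => $lemmaId) {
        $save->execute(['lemma' => $lemmaId, 'idx' => $idx]);
    }

    // Remember each confirmed wordform-lemma pairing so the same form can be
    // offered as a suggestion next time (ranked elsewhere by citation count).
    $pdo->exec('INSERT INTO wordform_lemma (wordform, lemma_id)
                SELECT lookup_form, lemma_id FROM word WHERE lemma_id IS NOT NULL
                ON CONFLICT DO NOTHING');

    $pdo->commit();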

Stage 5: exporting the TEI/XML with links

In the final stage, the processing form transforms the XML file using regular-expression parsing. For each matching xml:id value, it inserts a me:ref attribute with URIs representing the word in the database and the linked lemma. If there is no lemma attribute, the ONP lemma is inserted as this attribute, and if there is no me:msa attribute, relevant information from the wordlist is inserted, namely the word class and, in the case of nouns, the gender.

The resulting XML file, which should be valid Menotic XML, is presented in a text box and can be copied and pasted into a plain text file.
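
A sketch of the regular-expression pass, assuming a hypothetical $links map from xml:id values to word and lemma URIs (the me:msa insertion is elided; the example.org URIs are placeholders).

    <?php
    $xml = gzdecode(file_get_contents('menota-text-ids.xml.gz')); // stored in stage 1

    // Hypothetical: built from the database, keyed by xml:id.
    $links = [
        'AM162B-1' => [
            'word_uri'  => 'https://example.org/menota/word/1',
            'lemma_uri' => 'https://example.org/onp/lemma/123',
            'onp_lemma' => 'maðr',
        ],
    ];

    $out = preg_replace_callback(
        '/<w\b([^>]*?)xml:id="([^"]+)"([^>]*)>/',
        function ($m) use ($links) {
            if (!isset($links[$m[2]])) {
                return $m[0]; // no database entry: leave the element untouched
            }
            $l     = $links[$m[2]];
            $attrs = $m[1] . 'xml:id="' . $m[2] . '"' . $m[3];
            $attrs .= ' me:ref="' . $l['word_uri'] . ' ' . $l['lemma_uri'] . '"';
            if (strpos($attrs, 'lemma=') === false) {
                // fall back to the ONP lemma when the file has none
                $attrs .= ' lemma="' . htmlspecialchars($l['onp_lemma'], ENT_XML1) . '"';
            }
            return '<w' . $attrs . '>';
        },
        $xml
    );
    // Presented in a text box rather than saved, since the links derive from
    // the database at the time of the request.
    echo '<textarea rows="40" cols="100">' . htmlspecialchars($out) . '</textarea>';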

The resulting XML file is not saved to the database, as the additional information derives from the database. Rather, the XML is processed at the point of the user’s request and presented in the web page.

Other features

The imported texts are shown on the main page and can each be viewed as a whole text at the three levels, with most formatting removed. A concordance of the linked lemmas is shown with all word forms in the text. Clicking on one of the words shows all the Menota words linked to that headword.

Using the unofficial ONP interface, you can find a headword and see which Menota words (as well as words in other corpora) are linked to it.

To be developed

  • A lemma and grammar search / browse facility
  • A form for morphosyntactic analysis, with some assistance
