XML for Information Management

26.4.-30.4.2010
University of Erlangen-Nuremberg, Computational Linguistics
Instructor: Professor Airi Salminen
http://users.jyu.fi/~airi/

Outline
1. Structured documents
2. Formal grammars in XML
3. Natural languages in XML documents
4. Adding meaning by markup
5. Text indexing
6. Logical structure of XML documents

1. Structured documents

Structured document:
  • structure, content, and external presentation can be separated from each other and processed separately
  • structural components have names
  • structural components can be recognized by software modules
  • the structure can be defined

[Figure: a structured document comprises Structure, Content, and Layout]
  • different languages for defining the structure, e.g., DTD, XML Schema, RELAX NG for XML
  • an open language standard, e.g. SGML, XML
  • different languages for defining the layout, e.g., CSS and XSL for XML

Example: DTD.txt, rhymes-with-ext-dtd.txt, rhymes-with-ext-dtd.xml, rhymes-style.txt, rhymes-style.css, rhymes-with-style-and-ext-dtd.txt, rhymes-with-style-and-ext-dtd.xml

Management of structured documents:
  • document management
  • management of the data contained in documents

Characteristics in the management of structured documents:
  • Design. Adopting the structured document approach in an environment often requires careful planning before any documents are created. This includes schema design and layout design.
  • Content production. Content can be produced by different types of software, e.g. by a syntax-directed editor. The validity of the content is checked against the schema.
  • Evolution. Schema versioning, layout versioning.
  • Operations. The most typical operation is some kind of transformation.
  • Software. Many kinds of software systems are used.
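
As a rough illustration of the "transformation" operation named above, the following sketch turns a small rhyme collection into an HTML fragment with Python's standard library. The element names follow the rhyme collection DTD used in these notes; the sample content, the function name `to_html`, and the choice of HTML output are assumptions made for the example. Note that the parser only checks well-formedness, not validity against a schema.

```python
import xml.etree.ElementTree as ET

# Sample content (an assumption for illustration); only the element names
# rhymecollection, title, rhyme, and line come from the notes.
SOURCE = """<rhymecollection>
  <title>Two short rhymes</title>
  <rhyme><line>One, two,</line><line>buckle my shoe.</line></rhyme>
  <rhyme><line>Three, four,</line><line>shut the door.</line></rhyme>
</rhymecollection>"""


def to_html(xml_text: str) -> str:
    """Transform a rhyme collection into a small HTML fragment."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    body = ET.Element("div")
    title = root.find("title")
    if title is not None:  # the title element is optional
        ET.SubElement(body, "h1").text = title.text
    for rhyme in root.findall("rhyme"):
        # Join the lines of one rhyme into a single paragraph.
        lines = [line.text or "" for line in rhyme.findall("line")]
        ET.SubElement(body, "p").text = " / ".join(lines)
    return ET.tostring(body, encoding="unicode")


html = to_html(SOURCE)
print(html)
```

The structure (element names) drives the transformation, while the layout decision (headings, paragraphs) lives entirely in the transforming code, which reflects the separation of structure, content, and presentation.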

2. Formal grammars in XML

A formal grammar is a way to describe the syntax of a language. It consists of:
  • terminal symbols (alphabet)
  • nonterminal symbols
  • production rules
  • start symbol

The language defined by a grammar consists of all those strings over the alphabet that can be generated by starting with the start symbol and then applying the production rules until no nonterminal symbols are present.

In XML there are two kinds of formal grammars with their own notations:
  • the grammar defining the XML syntax in the XML specification
  • DTD
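
The definition of the language generated by a grammar can be made concrete with a small sketch. The toy grammar below (S → a S b | a b) is an assumption chosen for illustration and is unrelated to XML; the function `generate` is a hypothetical helper that applies production rules, starting from the start symbol, until no nonterminals remain.

```python
from collections import deque

# Toy grammar (an assumption for illustration): S -> a S b | a b,
# which generates the strings a^n b^n for n >= 1.
TERMINALS = {"a", "b"}
NONTERMINALS = {"S"}
PRODUCTIONS = {"S": [["a", "S", "b"], ["a", "b"]]}
START = "S"


def generate(max_len: int) -> list[str]:
    """Return all terminal strings of length <= max_len derivable from START."""
    results = set()
    queue = deque([(START,)])
    seen = {(START,)}
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal in the sentential form.
        idx = next((i for i, s in enumerate(form) if s in NONTERMINALS), None)
        if idx is None:
            results.add("".join(form))  # no nonterminals left: a sentence
            continue
        for rhs in PRODUCTIONS[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            # Prune: without empty productions, forms never get shorter.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(results, key=lambda s: (len(s), s))


sentences = generate(6)
print(sentences)  # ['ab', 'aabb', 'aaabbb']
```

The length bound is needed only because the language is infinite; the grammar itself is fully described by the four components listed above.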

The XML specification uses the EBNF (Extended Backus-Naur Form) notation with the metasymbols ?, *, +, |, and ( ). The syntax of XML 1.0 is described by production rules numbered from [1] to [89]. Some of the rules included in the first edition have been left out of later editions, and some others have been added, for example [28a] and [28b]. The notation of the XML syntax is described in Section 6 of the specification: 6. Notation.

    A?       A is optional
    A | B    A and B are alternatives
    A+       A occurs once or more
    A*       A occurs zero or more times
    A - B    A but not B
    A B      B after A
    ( )      grouping

Example rules in XML 1.0:

    document ::= prolog element Misc*
    prolog   ::= XMLDecl? Misc* (doctypedecl Misc*)?
    Misc     ::= Comment | PI | S
    Comment  ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

Production rules in a DTD:

    <!ELEMENT rhymecollection (title?, rhyme+)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT rhyme (line+)>
    <!ELEMENT line (#PCDATA)>

The element type declarations of a DTD do not describe the concrete syntax of elements, only their hierarchic structure. The details of the concrete syntax (start-tag, end-tag, etc.) are described in the XML specification.

The XML specification defines the concrete syntax of XML documents. The distinction between the concrete and abstract syntax of XML is not quite clear. W3C has developed four slightly different models to describe the abstract syntax:
  • XML Information Set
  • DOM model
  • XPath 1.0 model
  • XQuery 1.0 and XPath 2.0 data model

Analysis of differences in the models: Salminen, A., & Tompa, F.W. (2001). Requirements for XML document database systems. Proc. of the ACM Symposium on Document Engineering (DocEng '01) (pp. 85-94). New York: ACM Press.

3. Natural languages in XML documents

Natural language may occur in XML marked-up text in the:
  • content of elements
  • markup
      • element, attribute, and entity names
      • attribute values
      • comments

Natural language in the markup is NOT utilized by the XML processor, BUT it can be utilized by:
  • human individuals in
      • reading the marked-up text
      • information access
      • communicating with other individuals about the schema or marked-up content
  • some software applications, for example, text analysis software

4. Adding meaning by markup

It is important that the element and attribute names are meaningful to human readers. In the following example they are not:

    <AAA XXX="5">
      <rki YYY="Hamlet">Where wilt thou lead me? speak; I'll go no further.</rki>
      <rki YYY="ghost">Mark me.</rki>
    </AAA>

The names are not useful in information access.

  • Natural language in XML documents provides semantic information to human readers and for human communication.
  • Meaningful markup is useful for human users in information retrieval and in specifying transformations.
  • Markup may provide rich semantic and linguistic information.
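
To illustrate how meaningful names help in information access, here is a minimal sketch using Python's standard library. The element and attribute names `scene`, `number`, `speech`, and `speaker` are hypothetical replacements for the opaque names in the example above; they are assumptions for this sketch, not an established vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: scene, number, speech, and speaker are assumed names
# chosen for readability; only the spoken text comes from the notes.
DOC = """<scene number="5">
  <speech speaker="Hamlet">Where wilt thou lead me? speak; I'll go no further.</speech>
  <speech speaker="ghost">Mark me.</speech>
</scene>"""

root = ET.fromstring(DOC)

# With meaningful names the query reads almost like the information need:
# "the speeches whose speaker is the ghost".
ghost_lines = [s.text for s in root.findall("speech[@speaker='ghost']")]
print(ghost_lines)  # ['Mark me.']
```

The same query against the opaque markup would be `rki[@YYY='ghost']`, which works equally well for the software but tells a human reader nothing about what is being retrieved.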

Example of combining structural, semantic and linguistic markup for the text "She smelled like trees.":

    <Chapter section='1'>
      <Paragraph id='143' FragmentCode='1.12'>
        <Narration narrator='Benjy'>
          <Subject person='Caddy'>She</Subject>
          <Senses mode='smell'>smelled</Senses> like
          <Imagery referent='tree'>trees</Imagery>
        </Narration>
      </Paragraph>
    </Chapter>

Example from Smith, J., Deshaye, J., & Stoicheff, P., Callimachus - Avoiding the pitfalls of XML for collaborative text analysis. Literary and Linguistic Computing 21 (2), 2006, 199-218.

Another markup for the same text:

    <Chapter section='1'>
      <Narration narrator='Benjy'>
        <Imagery place='tree' mode='simile' sense='smell'>
          <Fragment code='1.12'>
            <Paragraph id='143'>
              <Subject person='Caddy'>She</Subject>smelled like trees.
            </Paragraph>
          </Fragment>
        </Imagery>
      </Narration>
    </Chapter>

Example from the same source (Smith, Deshaye, & Stoicheff, 2006).

Some other examples:
http://nrrc.mitre.org/NRRC/Docs_Data/MPQA_04/approval_time.htm
http://www.cs.cmu.edu/~awb/festival_demos/sable.html
http://www.etang.umontreal.ca/bwp1800/essays/flanders_encoding4.html

  • In the Semantic Web, semantic information about the meaning of the markup vocabulary of documents is available as additional metadata in a formal, standardized form.
  • The concepts and meanings are defined in formal ontologies.
  • Software applications can understand the meanings.

5. Text indexing

[Figure: documents, index, search engine, query, answer]

In information retrieval environments, collections of natural language documents are usually indexed; retrieval is based on the index terms included in the index.

6. Logical structure of XML documents

Components of the logical structure:
  • declarations
  • elements
  • comments
  • processing instructions

    document ::= prolog element Misc*

    prolog:   declarations, comments, processing instructions
    element:  elements, comments, processing instructions
    Misc*:    comments, processing instructions

Declarations:
  • XML declaration [23]
  • document type declaration [28]
  • markup declaration [29]
  • element type declaration [45]
  • attribute list declaration [52]
  • entity declaration [70]
  • notation declaration [82]
  • encoding declaration [80]
  • standalone document declaration [32]
  • text declaration [77]

(Figure labels for the list above: some of these declarations serve to constrain the logical structure, others to constrain the physical structure.)

Typical element type declarations:

    <!ELEMENT product (mfg, model, description, clock?)>    element content defined
    <!ELEMENT model (#PCDATA)>                               mixed content defined
    <!ELEMENT description (#PCDATA | feature)*>              mixed content defined
    <!ELEMENT clock EMPTY>                                   empty element defined

Empty element defined:

    <!ELEMENT clock EMPTY>

Two forms of the element are allowed in a well-formed document:

    <clock></clock>
    <clock/>

Element content: definition by content models with metasymbols

    *      iteration (none or more)
    +      iteration (once or more)
    |      alternatives
    ?      optional
    ,      successive
    ( )    grouping

Example from XHTML 1.0 Strict DTD:

    <!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>

#PCDATA is not accepted in the content model!

Mixed content: definition has basically two forms

    (#PCDATA)
    (#PCDATA | e1 | … | en)*

Examples:

    <!ELEMENT text (#PCDATA)>
    <!ELEMENT section (#PCDATA | subsection)*>
    <!ELEMENT section (#PCDATA | subsection | paragraph)*>

#PCDATA is always included in the content specification and comes first in the list of alternatives.

Attribute list declarations:
  • to define the set of attributes pertaining to a given element type
  • to establish type constraints for these attributes
  • to provide default values for attributes

Example:

    <!ATTLIST poem author CDATA #REQUIRED>

    poem         element type
    author       attribute name
    CDATA        attribute type: string
    #REQUIRED    constraint: the attribute must be specified for all elements of type poem

Defining constraints

    [60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue)

    #REQUIRED: the attribute must always be provided in all elements of the given type
    #IMPLIED: the attribute can be provided in an element; no default value is provided
    AttValue: a default value is given between single or double quotes
    #FIXED AttValue: instances of the attribute must match the given default value

Attribute types

    [54] AttType ::= StringType | TokenizedType | EnumeratedType

Tokenized types:
  • ENTITY, ENTITIES: entity names
  • NMTOKEN, NMTOKENS: text tokens consisting of characters accepted in names
  • ID: names that uniquely identify elements
  • IDREF, IDREFS: references to ID type identifiers

Enumerated types:
  • NOTATION: identifies notations
  • enumeration
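
The ID and IDREFS attribute types listed above imply two checks: ID values must be unique within a document, and every IDREFS token must match some ID. A validating parser performs these checks from the DTD; the parsers in Python's standard library do not validate, so the sketch below hard-codes the assumption that `id` is of type ID and `seeline` of type IDREFS on `line` elements. The third line of the sample document and the function name `check_ids` are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: a DTD declares `id` as type ID and `seeline` as type IDREFS
# on element type `line`; this sketch hard-codes that knowledge.
DOC = """<text>
  <line id="r1">This is the first line</line>
  <line id="r2" seeline="r1">This is the second line, but look at the first too</line>
  <line id="r3" seeline="r1 r9">A hypothetical line with a dangling reference</line>
</text>"""


def check_ids(xml_text: str) -> list[str]:
    """Report duplicate ID values and IDREFS tokens that match no ID."""
    root = ET.fromstring(xml_text)
    problems = []
    ids = set()
    for line in root.iter("line"):
        value = line.get("id")
        if value in ids:
            problems.append(f"duplicate ID: {value}")
        ids.add(value)
    for line in root.iter("line"):
        # An IDREFS value is a whitespace-separated list of names.
        for ref in line.get("seeline", "").split():
            if ref not in ids:
                problems.append(f"unresolved IDREF: {ref}")
    return problems


problems = check_ids(DOC)
print(problems)  # ['unresolved IDREF: r9']
```

All IDs are collected before any reference is checked, because an IDREF may point forward to an element that appears later in the document.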

Example:

    <?xml version="1.0"?>
    <!DOCTYPE text [
    <!ELEMENT text (line+)>
    <!ELEMENT line (#PCDATA)>
    <!ATTLIST line id      ID     #REQUIRED
                   seeline IDREFS #IMPLIED>
    ]>
    <text>
    <line id="r1">This is the first line</line>
    <line id="r2" seeline="r1">This is the second line, but look at the first too</line>
    </text>

XML-aware web browsers support the visualization of the tree structure. Example:

    <Chapter section='1'>
      <Narration narrator='Benjy'>
        <Imagery place='tree' mode='simile' sense='smell'>
          <Fragment code='1.12'>
            <Paragraph id='143'>
              <Subject person='Caddy'>She</Subject>smelled like trees.
            </Paragraph>
          </Fragment>
        </Imagery>
      </Narration>
    </Chapter>

Different abstract models describe the tree in slightly different ways.

    <poem author="Murasaki Shikibu" born="974">
    <!-- The poem is translated from Japanese by Kenneth Rexroth -->
    <line>This life of ours would not cause you sorrow</line>
    <line>if you thought of it as like</line>
    <line>the mountain cherry blossoms</line>
    <line>which bloom and fade in a day. </line>
    </poem>

Node types of XPath 1.0 for the poem:

    Root node
      Element node: poem
        Attribute node: author = Murasaki Shikibu
        Attribute node: born = 974
        Comment node: The poem is translated from Japanese by Kenneth Rexroth
        Element node: line
          Text node: This life of ours would not cause you sorrow
        Element node: line
          Text node: if you thought of it as like
        Element node: line
          Text node: the mountain cherry blossoms
        Element node: line
          Text node: which bloom and fade in a day.
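
As a hedged illustration of how one abstract model represents the poem, the sketch below parses it with the DOM implementation in Python's standard library (`xml.dom.minidom`) and counts nodes by type. This is the DOM model, not the XPath 1.0 model: the DOM document node plays roughly the role of the XPath root node, and DOM attributes are not children of their element, so they are counted separately. The helper `count_nodes` is a hypothetical name, and the poem is written without whitespace between elements so that no whitespace-only text nodes arise.

```python
from collections import Counter
from xml.dom import Node, minidom

# The poem from the notes, written without whitespace between elements.
POEM = (
    '<poem author="Murasaki Shikibu" born="974">'
    "<!-- The poem is translated from Japanese by Kenneth Rexroth -->"
    "<line>This life of ours would not cause you sorrow</line>"
    "<line>if you thought of it as like</line>"
    "<line>the mountain cherry blossoms</line>"
    "<line>which bloom and fade in a day. </line>"
    "</poem>"
)

NAMES = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
}


def count_nodes(node, counts):
    """Walk the DOM tree, tallying nodes by type."""
    counts[NAMES.get(node.nodeType, "other")] += 1
    # In the DOM, attributes are not child nodes; count them explicitly.
    if node.nodeType == Node.ELEMENT_NODE:
        counts["attribute"] += len(node.attributes)
    for child in node.childNodes:
        count_nodes(child, counts)
    return counts


counts = count_nodes(minidom.parseString(POEM), Counter())
print(dict(counts))
```

For this input the tally is one document node, five element nodes, two attribute nodes, one comment node, and four text nodes, which matches the XPath 1.0 tree with the document node standing in for the root node.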