General Structure of an SGML Document

A general model of publication

Although the medium and the material may differ vastly, essentially the same process is involved in every publication (Fig. 1). For on-line publications this model makes it possible to automate many of the steps involved (see module on Automated document processing).

Figure 1. A model for the publication process. The same general pattern of steps occurs whatever the publication and whatever the type of material involved.

This model encompasses all the stages described earlier, but in a somewhat more formalized form.

We can summarize the steps as follows:

o Submission: The author submits material to the editor.
o Acquisition: The publisher acquires material. Here we take this to include permissions. Details of the submission are recorded and an acknowledgment is sent to the author.
o Quality assurance: The material is checked. Errors are referred back to the author for correction.
o Production: The material is prepared for publication. This stage includes copy-editing, design, typesetting, printing and binding. Proofs are checked by both the author and the editor, and any typesetting errors are corrected. For books, an ISBN is obtained.
o Distribution: The publication is shipped to stores etc. for sale. It is publicized so that people know that it is available.

The Internet offers advantages for publications of all kinds. These include:

o instant world-wide availability;
o publication features of the World Wide Web;
o eliminating distribution costs;
o reducing production costs - no need to print 'hard copy';
o potential world-wide audience; and
o 'niche' / special interest publishing becomes viable.

The World Wide Web expands the traditional notion of a publication in several ways:

o it is possible to include multimedia elements;
o it is possible to include hyperlinks to information anywhere;
o it is possible to draw together information from many different sources;
o hypermedia books are not limited to the traditional 'linear' structure of printed books. They can, for instance, provide several alternative paths through a set of documents, or allow readers to pursue material to whatever depth they wish; and
o it diminishes the distinction between traditional text-oriented publications and other products, such as databases and on-line software.

Legal issues

Legal issues abound in the publishing business. Although legal issues are not dealt with in detail here, the editor should make every effort to keep up to date with issues and changes. Each publication should be carefully checked to ensure that legal risks are minimized. Some of the legal matters involved in publishing include:

o Contracts: Publishers always need to ensure the legal status of material that they publish. For instance, there is normally some form of contract with the author that spells out the terms and conditions under which the material is published.
o Copyright and permissions: Ensuring permission to reproduce material belonging to others is one of the most regular and time-consuming legal issues that editors have to face.
o Defamation and libel: Publishers often bear the legal burden for offensive or damaging remarks made by an author.
o Liability: Authors, editors and publishers can all be held liable for damages caused by a publication. An example would be a reference work or textbook in which erroneous facts and figures led (say) to errors in a building or circuitry and consequently to severe financial loss.
o Plagiarism: Writers sometimes borrow ideas, words or material from other writers. Whenever they do so, authors should clearly acknowledge the source using references, footnotes or another appropriate device. Failure to do so constitutes plagiarism. The easiest type of plagiarism to prove is where an author reproduces whole slabs of text from another author without acknowledgement. Copying another's ideas can be much harder to prove.

The nature of on-line publishing has several implications for the way documents are prepared for publication:

o Distribute the effort: When an author writes an article, it is not much extra effort for her or him to format the material as well.

Every extra job the editor does has to be multiplied by the number of authors.

o Automation: Wherever possible the computer should carry out operations instead of the editor. Any process that can be automated saves valuable time. For example, correcting a single URL in 100 documents could take an editor all day; an editing program could complete the entire job in seconds.
o Need for flexible markup: It is important to prepare documents in a way that allows elements (e.g. references) to be extracted for other purposes or inserted from other sources.
o Need for standards: The above issues all require that information is always structured and processed in a standard way. For instance, references should be structured in a consistent way. For many purposes on-line forms provide a convenient way of standardizing the content and format of data entry.
o Permissions: The potential volume and speed of on-line publishing demands a different approach to obtaining copyright permissions. One approach that is being rapidly adopted is to centralize permissions through a central agency. Rather than request that permission be given, the use of material is simply registered and issues such as acknowledgement and royalties are handled in a standard way.
o Quality (see below): On-line publishing changes two aspects of quality assurance. First, many different types of material can be published on-line (e.g. text, software, video), so there are potentially many types of quality processing involved. Checking data for correctness and accuracy is very different from evaluating software, or refereeing a scientific manuscript. Secondly, some quality assurance procedures can be automated.

Standards are essential wherever issues of quality arise. Dictionaries define standard ways of spelling words.
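The automation point made above (one URL to correct across 100 documents) can be sketched in a few lines. This is a minimal illustration in Python, not the editing program the text has in mind; the function name `replace_url`, the `.html` file pattern and the example.org URLs are all assumptions made for the demonstration.

```python
from pathlib import Path
import tempfile


def replace_url(doc_dir: Path, old_url: str, new_url: str) -> int:
    """Replace old_url with new_url in every .html file in doc_dir.

    Returns the number of files that were changed.
    """
    changed = 0
    for path in sorted(doc_dir.glob("*.html")):
        text = path.read_text(encoding="utf-8")
        if old_url in text:
            path.write_text(text.replace(old_url, new_url), encoding="utf-8")
            changed += 1
    return changed


def demo() -> int:
    """Build 100 throwaway documents citing one (hypothetical) URL, then fix them."""
    with tempfile.TemporaryDirectory() as tmp:
        doc_dir = Path(tmp)
        for i in range(100):
            (doc_dir / f"doc{i}.html").write_text(
                '<a href="http://old.example.org/ref">ref</a>', encoding="utf-8"
            )
        return replace_url(
            doc_dir, "http://old.example.org/ref", "http://new.example.org/ref"
        )


if __name__ == "__main__":
    print(demo(), "documents updated")
```

Plain substring replacement is the simplest possible choice here; it would also rewrite any longer URL that happens to begin with the old one, so a production tool would more likely parse the links first.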

Any publication needs to define the standards that it applies to the material it publishes. These standards need to cover both the information and its presentation. Examples include:

o standard layout and structure for manuscripts (mss);
o protocol for submission of entries, corrections, etc.; and
o quality control criteria and procedures (see below).

Quality control for different media

The following table indicates some of the ways in which we can check the quality of different sorts of information.

Material    Nature of check
Text        Format, clarity
Document    Structure, typos
Fiction     Plot, clarity, characterization
Research    Peer review of methodology, significance, originality, etc.
Data        Test for accuracy and validity
Software    Test performance, user-friendliness
Images      Resolution, content

'Markup' is the process of entering commands and codes into a file to indicate how it should be processed. Markup systems for text provide several kinds of elements:

o Formatting: These elements define how the text is to be presented, especially its layout and style. See the TROFF example given above.
o Structure: These elements refer to the content of the text and how it is organized. See the examples (DNA sequence and bibliographic entry) above.
o Variable references: These elements refer to variables whose current values are to be substituted into the text. Variables are common in merge [...]
o Anchors: These elements create points of reference within a document or piece of text. One example is an index tag placed within a document so that a word processor can create an index of page numbers.
o Comments: These are explanatory items that do not form part of the text proper.

HyperText Transfer Protocol - HTTP

The 'Hypertext Transfer Protocol' (HTTP) is the communications protocol used on the World Wide Web.

It passes hypertext links from a browser to a server and allows the requested documents and images to be passed back to the browser.

The 'HyperText Markup Language' (HTML) allows three kinds of hypertext references: absolute, relative, and internal. The syntax of the given reference determines which kind of link is implied.

Absolute links (also referred to as physical links) give the full URL of a document, so that it can be located from anywhere in the world. Here is an example of an absolute link:

Hypertext and multimedia

A full link is important when you want to be able to link to a resource anywhere in the world; however, it can become a problem if the resource moves to a new location. An alternative approach is to use a relative link.

Relative links (also referred to as logical links) assume that a requested file is on the same server as the document in which the pointer is located. They give the location of the requested file relative to the location of the current file. So for a file in the same directory, we need give only the file name. If the file is in a subdirectory then we give both the directory name and the file name. Here is an example of a relative link:

Hypertext and multimedia

In the above example the file 'topic14.html' must reside in the same directory as the document you are currently working with.

Internal links (also referred to as name links) locate positions within a document. In HTML they are indicated by a hash marker (#). If no document is indicated, then it is assumed that the location is somewhere within the same document as the pointer.
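The three kinds of reference described above can be compared by resolving each one against the address of the document that contains it. The sketch below uses Python's standard `urllib.parse.urljoin` for the resolution; the host name, directory names and the current document `topic13.html` are assumptions, since the original link markup is not shown here.

```python
from urllib.parse import urljoin

# Hypothetical location of the document that contains the links.
BASE = "http://www.example.org/course/topic13.html"


def resolve(reference: str, base: str = BASE) -> str:
    """Return the full URL that a hypertext reference points to."""
    return urljoin(base, reference)


# One reference of each kind described in the text.
REFERENCES = {
    "absolute": "http://www.example.org/library/topic14.html",
    "relative, same directory": "topic14.html",
    "relative, subdirectory": "figures/map.html",
    "internal": "#links",
}

if __name__ == "__main__":
    for kind, reference in REFERENCES.items():
        print(f"{kind:26} {reference:45} -> {resolve(reference)}")
```

The absolute reference comes back unchanged, the relative references inherit the server and directory of the current document, and the internal reference keeps the current document and only adds the position after the hash marker.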

Here are examples of internal links:

Hypertext
Hypertext and multimedia

Internal links are easily set up using the name attribute, which allows you to jump to a specific location within the current document. The classic example is where you have an index and you want to jump to one of the indexed sections in the document. Refer to the following diagram to illustrate how the name attribute is implemented.

Hypertext documents form a network structure. The nodes of the network are items of information (documents, images etc.) and the edges are links between the documents.

Figure 1. Three ways of organizing information: linear, hierarchy, web.

o Linear patterns are the traditional structure provided in printed books or spoken narrative. In hypertext they would normally indicate a particular chain of thought or line of argument.
o Tree structures arise naturally when we go from the general to the particular, and as we go into greater and greater detail about a given topic. For example, starting from the idea of 'science', we could move down to 'geography' and hence to different aspects, such as climate or mapping. From climate we could move down to different factors, such as rainfall, temperatures and so on.
o Web structures arise from association. Association is a way of making rapid leaps between ideas. For instance, in the above example, mapping has potential links to a variety of different things, such as art, printing, and travel. From these ideas we could quickly move on to such widely different ideas as sculpture and transport.

Traditional books attempt to encompass all of the above ideas, but are constrained by the medium.

Image maps are another feature of a hypertext document that makes it very user friendly. Rather than clicking on a particular hypertext link, the user clicks on an image, for example a particular part (or region) of a map or a picture, and the browser converts the request to a particular hypertext link.
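The conversion from a click on an image to a hypertext link amounts to testing which region contains the clicked point. The following is a minimal sketch of that idea in Python, not of any browser's actual implementation; the region shapes, coordinates and target file names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class Rect:
    """A rectangular region given by its corner coordinates (in pixels)."""
    left: int
    top: int
    right: int
    bottom: int
    href: str

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass
class Circle:
    """A circular region given by its centre and radius (in pixels)."""
    cx: int
    cy: int
    radius: int
    href: str

    def contains(self, x: int, y: int) -> bool:
        return (x - self.cx) ** 2 + (y - self.cy) ** 2 <= self.radius ** 2


Region = Union[Rect, Circle]


def link_for_click(regions: Sequence[Region], x: int, y: int) -> Optional[str]:
    """Return the link of the first region containing (x, y), or None."""
    for region in regions:
        if region.contains(x, y):
            return region.href
    return None


# Hypothetical regions of a 200 x 100 pixel image.
REGIONS = [
    Rect(0, 0, 99, 49, "north.html"),
    Rect(0, 50, 99, 99, "south.html"),
    Circle(150, 50, 25, "island.html"),
]

if __name__ == "__main__":
    print(link_for_click(REGIONS, 10, 10))
```

Regions are checked in order, so where two regions overlap the one listed first wins; a click outside every region yields no link.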

Traditionally image maps required some processing to be completed on the server side, in particular in the cgi-bin directory. There are additional disadvantages to server-side image maps, including:

o possible delays in the server having to process the image coordinates;
o not all image processing is the same; and
o not everyone (in fact very few people) can have access to the cgi-bin directory.

Both Netscape and Explorer support client-side image maps; however, there is different HTML syntax for these.

a. Creating an image map

You obviously need an image map or something that is suitable for a user to click on.

Specifying the image details

Three tags must be included to tell your browser that you are about to define a client-side image map. Refer to the following code first and then note the explanations below.

o <IMG> - you have encountered part of this tag before - it simply indicates you are going to display an image. The USEMAP attribute indicates where the coordinates for the different parts of test.gif are located and what should happen when they are selected (this is simply a name reference).
o <MAP> - is where the details of the map regions are specified. As you would appreciate, each of the shapes in the image needs to be specified.

[...] into a file (the 'target file').

type sourcefile | FRED > targetfile

When used this way, the program FRED is called a filter. In general, filters are non-interactive programs that read from standard input and write to standard output.
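A filter in this sense needs nothing more than a loop from standard input to standard output. The sketch below shows the pattern in Python rather than PERL; it is an illustration of the idea, not one of the demonstration programs referred to in the text, and the names `copy_filter` and `main` are invented.

```python
import io
import sys
from typing import TextIO


def copy_filter(source: TextIO, target: TextIO) -> int:
    """Copy every line from source to target unchanged.

    Returns the number of lines copied. A useful filter would
    transform each line before writing it.
    """
    count = 0
    for line in source:
        target.write(line)
        count += 1
    return count


def main() -> None:
    """Wire the filter to standard input and standard output."""
    copy_filter(sys.stdin, sys.stdout)


def demo() -> str:
    """Run the filter on in-memory streams instead of real pipes."""
    source = io.StringIO("first line\nsecond line\n")
    target = io.StringIO()
    copy_filter(source, target)
    return target.getvalue()
```

Keeping the copying logic separate from `sys.stdin` and `sys.stdout` means the same function can be exercised on in-memory text, as `demo()` does. Saved under a hypothetical name such as copy_filter.py with a call to `main()` at the bottom, it could stand in for FRED in the command shown above.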

PERL scripts are often used as filters. Demonstration Program 2 is a simple filter that could be used in place of FRED above. Notice, however, that it does no processing of the data at all! In the following example it copies the first demonstration script to another file.

type demo1.pl | perl demo2.pl > demo1.new

If the output file name is omitted, then the destination for standard output depends on the context in which the script is run. For example, if the script is run interactively from a DOS prompt (see below), then the output will be listed on the PC screen.

type demo1.pl | perl demo2.pl

Likewise, if the input pipe is omitted then the input source depends on the context. Run interactively on a PC, the script will echo the input typed in by the user from the keyboard, until an end-of-file symbol (Control-Z) is typed. (The end-of-file symbol for Unix systems is Control-D rather than Control-Z.) If the output pipe is given, then this method can be used to type data into a file, as in this example:

perl demo2.pl > test.out

Note that the sign '>' indicates that the pipe will create a NEW file, or replace an existing one. To append data to an existing file we use a double pipe symbol ('>>'). So the following example adds our input to the end of an existing file:

perl demo2.pl >> test.out

The Document Type Definition (DTD) is generally positioned between the SGML Declaration and the Document Instance (text marked up with the tags defined in the DTD). Sometimes the design of the SGML document will include processing calls to specific DTDs rather than include the DTD as a part of the marked-up instance. Regardless of the way it is implemented, the DTD is critical as the formal definition of the structure of an SGML document. SGML systems use DTDs in several ways.

The first thing any SGML system has to do is to parse the text that is presented to it. Parsing consists of picking out tags, and hence building up a picture of a document's structure: the elements, entities, attributes and other features.

o The DTD tells the parser which tags are meaningful and how to interpret them;
o The DTD helps the parser to detect syntax errors: mistakes in markup often produce elements that are inconsistent with the DTD; and
o The DTD defines elements and how they relate to one another, so enabling the system to identify these structures.

DTDs convey several advantages. For instance, they make it unnecessary to include at the start of a document definitions which explain how each tag is to be processed. More generally, they allow similar documents to be processed in a consistent way. For instance, consider the fragment of bibliographic reference from the previous section:

1990 Goldfarb, C. The SGML Handbook Oxford: Oxford University Press

A DTD for this document needs to include definitions for all of the tags used.

A new Internet language (the eXtensible Markup Language - XML) shows promise of significant advantages over HTML and SGML.

The main keys to this promise are that:

o it is very flexible;
o it is easy for humans to read and write;
o anyone can create words for the language;
o it includes machine-readable information about the structure and content of Web pages;
o sourcing information on the Web will be based upon the meaning of search keywords; and
o users will be able to analyse and manipulate document information like raw data.

The eXtensible Markup Language (XML) is an alternative to HTML and the proliferation of add-on technologies that have progressively developed.

You will be aware that HTML effectively describes document format and visual presentation. XML on the other hand, describes data in a human readable format with no indication of how the data is to be displayed. XML has a database-neutral and device-neutral format. XML is quite simply a meta-language that is used to define other domain- or industry-specific languages. For those familiar with SGML, constructing your own XML language (also called a 'vocabulary') requires a specific Document Type Definition (DTD). This is essentially a context-free grammar that provides the rules defining the elements and structure of your new language.
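To make the idea of a vocabulary with rules concrete, the sketch below marks up the Goldfarb reference in a small invented XML vocabulary and checks it against a hand-written rule table. Two things are assumptions: the tag names (`reference`, `year`, `author`, `title`, `publisher`) are hypothetical, and the `RULES` dictionary is only a stand-in for a DTD, because Python's standard `xml.etree.ElementTree` parser does not validate against DTDs.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Stand-in for a DTD: each element maps to the children it must contain, in order.
RULES: Dict[str, List[str]] = {
    "reference": ["year", "author", "title", "publisher"],
    "year": [],
    "author": [],
    "title": [],
    "publisher": [],
}

DOCUMENT = """<reference>
  <year>1990</year>
  <author>Goldfarb, C.</author>
  <title>The SGML Handbook</title>
  <publisher>Oxford: Oxford University Press</publisher>
</reference>"""


def check(element: ET.Element, rules: Dict[str, List[str]] = RULES) -> List[str]:
    """Return a list of structural problems found in element and its children."""
    if element.tag not in rules:
        return [f"unknown element <{element.tag}>"]
    problems = []
    children = [child.tag for child in element]
    if children != rules[element.tag]:
        problems.append(
            f"<{element.tag}> contains {children}, expected {rules[element.tag]}"
        )
    for child in element:
        problems.extend(check(child, rules))
    return problems


if __name__ == "__main__":
    root = ET.fromstring(DOCUMENT)
    print(check(root) or "document matches the rules")
    print(root.findtext("title"))
```

Because the markup describes content rather than appearance, the title can be pulled out as data with `findtext`, which is the sense in which document information can be handled like raw data.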

XML gives you the freedom and power to create your own language. Any browser (or application) with an XML parser can interpret a document instance (produced by 'applying' the tags from the DTD) by 'learning' the rules defined by the DTD. XML is less complex than SGML (although more complex than HTML) and provides 80% of the benefit of SGML with 20% of the effort.

The representations of language structures (or grammars) with which humans and computers deal are called meta-languages and are used to define other languages so that data transmission (communication) is made easier.

The Standard Generalized Markup Language (SGML) is a text structural standard. It is defined in ISO 8879:1986 'Information Processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML)'.

It is relevant to the World Wide Web in two ways. First, HTML is an application of SGML. Secondly, it is used as a storage medium for use with Common Gateway Interface (CGI) processes. For example, it is often used to process forms or to store documents in a more flexible format than HTML provides.

The demand for SGML is motivated by the:

o reduction of paper storage;
o elimination of redundant information;
o ability to query, manipulate, retrieve and reuse information; and
o dramatic increases in labour efficiency (costs?)

(Carr, N. 1994. Implementations of SGML. Pro print magazine, April / May).

SGML is a meta-language that enables the structure of information to be described. It is a 'meta-language' because it is used to define the context-free syntax of markup languages, each of which describes the structure of a different class of documents. Particular markup languages can be formally defined via a 'document type definition' (DTD).

These DTDs are the equivalent of style sheets used by word processing systems. An example is given later.

SGML descriptions of structure are independent of the way the information is to be processed. SGML is concerned with how structure is represented, not what that structure means. For instance, the same document markup could be used to indicate how to process text for formatting and printing as a paper publication (via a word-processing package), for on-line display (after translation into HTML) via the World Wide Web, for transfer into a textual database (say via Oracle), or for transfer to a CD-ROM based information system.

The Hypertext Markup Language (HTML) is an application of SGML for marking up documents for the World Wide Web.
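The independence of structure from processing can be illustrated by taking one structured record and producing two different outputs from it. In this Python sketch the field names and both output layouts are assumptions chosen for the example; only the bibliographic details come from the reference quoted earlier.

```python
from html import escape
from typing import Dict

# One structured record: it says what each part is, not how it should look.
REFERENCE: Dict[str, str] = {
    "year": "1990",
    "author": "Goldfarb, C.",
    "title": "The SGML Handbook",
    "publisher": "Oxford: Oxford University Press",
}


def to_text(ref: Dict[str, str]) -> str:
    """Format the record as a plain-text line, as for a printed reference list."""
    return f"{ref['author']} ({ref['year']}). {ref['title']}. {ref['publisher']}."


def to_html(ref: Dict[str, str]) -> str:
    """Format the same record as an HTML paragraph for on-line display."""
    return (
        f"<p>{escape(ref['author'])} ({escape(ref['year'])}). "
        f"<cite>{escape(ref['title'])}</cite>. {escape(ref['publisher'])}.</p>"
    )


if __name__ == "__main__":
    print(to_text(REFERENCE))
    print(to_html(REFERENCE))
```

Neither function changes the record itself, so a third target (a database row, for instance) could be added without touching the stored information.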

An SGML document (an SGML document entity) consists of the following items:

o An SGML Declaration, which defines the syntax and character sets used in the document;
o A Document Prolog, which defines the context-free grammar (or syntax) with which the document instance is parsed. It contains a Document Type Definition (DTD) which defines the structure of the document; and
o The Text (complete with markup tags), referred to as the Document Instance of the DTD.

The general structure of an SGML Document:

An element of an SGML document may have various attributes associated with it.

An attribute is a piece of data that tells us something about how an element is to be processed. In SGML, an attribute is associated with an element and provides extra information about that element. An attribute's description appears inside the start tag of the element with which it is associated. An attribute is described as follows:

o a start tag open (STAGO);
o the name of the element;
o the following items, repeated as a group as many times as is necessary: at least one space character; the name of the attribute; a 'value indicator' (VI, the 'equals' sign); a LIT or a LITA delimiter (double or single quotation mark); the value of the attribute; and a matching LIT or LITA delimiter (double or single quotation mark); and
o a start tag close (STAGC).

For example, in the start tag <IMG SRC="test.gif">, IMG is the element name, SRC is the attribute name and test.gif is its value.

CGI programs are accessed through the web just as a normal HTML document is, through a URL. The only condition placed upon CGI programs is that they reside somewhere below the cgi-bin directory configured within the HTTP server (httpd, which stands for hypertext transfer protocol daemon, refers to the program that executes the HTTP requests).

The HTTP server is the server that holds the programs that process the HTTP requests. The cgi-bin condition exists so that the server knows whether to return a file as a document, or execute it as a CGI program. The most common use of the CGI is to handle the results of form and query submissions for HTTP.

The CGI process

Information flow through the CGI is illustrated in the figure below.

1. The result of a form or query submission is passed from the client browser to the HTTP server.
2. The HTTP server forwards the information from the browser to the appropriate CGI program.
3. The CGI program processes the input and may require access to certain data files residing on the server, such as a database.
4. The CGI program writes either of the following to standard output:
   o an HTML document; or
   o a pointer to an HTML document in the form of a Location header that contains the URL of the document.
5. The HTTP server passes the output from the CGI process back to the client browser as the result of the form or query submission.

CGI programming

CGI programs can be written in virtually any programming language, the most common being PERL, C and shell scripts.
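Step 4 of the process above can be sketched as a function that writes one of the two permitted outputs. This is a minimal illustration in Python rather than PERL, C or shell; the function name `respond`, its `redirect_to` parameter and the page content are assumptions. The Content-Type header that precedes the document is part of the standard CGI response format, even though the text does not mention it.

```python
import io
from typing import Optional, TextIO


def respond(out: TextIO, redirect_to: Optional[str] = None) -> None:
    """Write a CGI response to out (normally standard output).

    With redirect_to set, emit a Location header pointing at an existing
    document; otherwise emit an HTML document. In both cases a blank
    line ends the header section.
    """
    if redirect_to is not None:
        out.write(f"Location: {redirect_to}\n\n")
    else:
        out.write("Content-Type: text/html\n\n")
        out.write("<html><body><p>Form received.</p></body></html>\n")


def demo() -> str:
    """Capture the HTML-document form of the response in memory."""
    buffer = io.StringIO()
    respond(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(demo(), end="")
```

In a real deployment the function would be called with `sys.stdout`, and the HTTP server would relay whatever is written there to the browser (step 5).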

The basic structure of a CGI program is illustrated by the following diagram.

PERL is the language of choice for many CGI programmers due to its powerful string manipulation and regular expression matching functionality. This makes it well suited both to handling CGI input from HTML forms and to producing dynamic HTML documents.

Handling forms

CGI input

HTML forms are implemented in a way that produces key-value pairs for each of the input variables. These key-value pairs can be passed to the CGI program using one of two methods, GET and POST.

GET

The GET method appends the key-value pairs to the URL. A question mark '?' separates the URL proper from the parameters, which are extracted by the HTTP server and passed to the CGI program via the QUERY_STRING environment variable. Here is a short form to demonstrate the GET method. (When you submit the form, note the URL displayed in the browser Location bar, the QUERY_STRING environment variable, and also the hidden variable.)

POST

The POST method operates such that the HTTP server passes the key-value pairs to the CGI program via standard input. The following form is identical to the previous GET form, but uses the POST method. (When you submit the form, note the URL and QUERY_STRING environment variable compared to using GET; also have a look at what comes in as standard input.)
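The difference between the two methods comes down to where the program looks for the key-value pairs. The sketch below, in Python rather than PERL, reads them from QUERY_STRING for GET and from standard input for POST. REQUEST_METHOD and CONTENT_LENGTH are standard CGI environment variables that the text does not name; the function `read_form` and the sample field names and values are invented for the example.

```python
import io
from typing import Dict, List, Mapping, TextIO
from urllib.parse import parse_qs


def read_form(environ: Mapping[str, str], stdin: TextIO) -> Dict[str, List[str]]:
    """Return the submitted key-value pairs as a dict of value lists."""
    if environ.get("REQUEST_METHOD", "GET").upper() == "POST":
        # POST: the pairs arrive on standard input; CONTENT_LENGTH says how much to read.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        # GET: the server has already split the pairs off the URL.
        raw = environ.get("QUERY_STRING", "")
    return parse_qs(raw, keep_blank_values=True)


# Hypothetical form submission, including a hidden field.
PAIRS = "name=Ada+Lovelace&topic=sgml&hidden=42"


def demo_get() -> Dict[str, List[str]]:
    environ = {"REQUEST_METHOD": "GET", "QUERY_STRING": PAIRS}
    return read_form(environ, io.StringIO(""))


def demo_post() -> Dict[str, List[str]]:
    environ = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": str(len(PAIRS))}
    return read_form(environ, io.StringIO(PAIRS))
```

Both demonstrations decode to the same dictionary, which is the point: the program sees identical data either way, and only the delivery route differs. The sketch reads characters from a text stream, which matches the byte count in CONTENT_LENGTH only for plain ASCII input; a production program would read bytes.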