<?xml version="1.0" standalone="no"?>
<!--
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
-->

<!DOCTYPE s1 SYSTEM "sbk:/style/dtd/document.dtd">

<s1 title="Programming Guide">
    <anchor name="Macro"/>
    <s2 title="Version Macro">
    <p>&XercesCName; defines a numeric preprocessor macro, <code>_XERCES_VERSION</code>,
    that you can test in your code to conditionally compile functionality that
    depends on the version of &XercesCName;. For example:
    </p>
<source>
#if _XERCES_VERSION >= 30102
    // Code specific to Xerces-C++ version 3.1.2 and later.
#else
    // Old code.
#endif
</source>
    <p>The minor and revision (patch level) numbers each occupy two digits,
    which means that '1' becomes '01' and '2' becomes '02' in this example.
    </p>
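    <p>As a rough illustration of this layout, the following sketch composes a
    version number from its parts. The <code>makeVersionNumber</code> helper is
    hypothetical and is not part of &XercesCName;; the formula is an assumption
    inferred from the two-digit layout described above, so check the version
    header before relying on it.
    </p>
<source>
// Hypothetical helper, not a Xerces-C++ API: composes a number using two
// decimal digits each for the minor and revision parts.
constexpr int makeVersionNumber(int major, int minor, int revision)
{
    return major * 10000 + minor * 100 + revision;
}

// Version 3.1.2 yields 30102, the value tested in the example above.
static_assert(makeVersionNumber(3, 1, 2) == 30102, "3.1.2 encodes as 30102");
</source>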
    <p>There are also string macros and constants that represent the Xerces-C++ version.
    Please refer to the <code>xercesc/util/XercesVersion.hpp</code> header for details.
    </p>
    </s2>

    <anchor name="Schema"/>
    <s2 title="Schema Support">
    <p>&XercesCName; contains an implementation of the W3C XML Schema
    Language. See the <jump href="schema-&XercesC3Series;.html">XML Schema Support</jump> page for details.
    </p>
    </s2>

    <anchor name="Progressive"/>
    <s2 title="Progressive Parsing">

    <p>In addition to using the <code>parse()</code> method to parse an XML file,
    you can use two other parsing methods, <code>parseFirst()</code> and <code>parseNext()</code>,
    to do so-called progressive parsing. This way you don't
    have to depend on throwing an exception to terminate the
    parsing operation.
    </p>
    <p>
    Calling <code>parseFirst()</code> will cause the DTD (both internal and
    external subsets) and any pre-content, i.e. everything up to
    but not including the root element, to be parsed. Each subsequent call to
    <code>parseNext()</code> will cause one more piece of markup to be parsed
    and propagated from the core scanning code to the parser (and
    hence either on to you if using SAX/SAX2 or into the DOM tree if
    using DOM).
    </p>
    <p>
    You can quit the parse at any time by simply not
    calling <code>parseNext()</code> again and breaking out of the loop. When
    you call <code>parseNext()</code> and the end of the root element is the
    next piece of markup, the parser will continue on to the end
    of the file and return false to let you know that the parse
    is done. So a typical progressive parse loop will look like
    this:</p>

<source>// Create a progressive scan token
XMLPScanToken token;

if (!parser.parseFirst(xmlFile, token))
{
    cerr &lt;&lt; "parseFirst() failed" &lt;&lt; endl;
    return 1;
}

//
// We started ok, so let's call parseNext()
// until we find what we want or hit the end.
//
bool gotMore = true;
while (gotMore &amp;&amp; !handler.getDone())
    gotMore = parser.parseNext(token);</source>

    <p>In this case, our event handler object (named 'handler')
    is watching for some criteria and will
    return a status from its <code>getDone()</code> method. Since
    the handler sees the SAX events coming out of the SAXParser, it can tell
    when it finds what it wants. So we loop until we get no more
    data or our handler indicates that it saw what it wanted to
    see.</p>

    <p>When doing non-progressive parses, the parser can easily
    know when the parse is complete and ensure that any used
    resources are cleaned up. Even in the case of a fatal parsing
    error, it can clean up all per-parse resources. However, when
    progressive parsing is done, the client code doing the parse
    loop might choose to stop the parse before the end of the
    primary file is reached. In such cases, the parser will not
    know that the parse has ended, so any resources will not be
    reclaimed until the parser is destroyed or another parse is started.</p>

    <p>This might not seem like such a bad thing; however, in this case,
    any files and sockets that were opened in order to parse the
    referenced XML entities will remain open. This could cause
    serious problems. Therefore, you should destroy the parser instance
    in such cases, or start another parse immediately. In a future
    release, a reset method will be provided to do this more cleanly.</p>

    <p>Also note that you must create a scan token and pass it
    back in on each call. This ensures that things don't get done
    out of sequence. When you call <code>parseFirst()</code> or
    <code>parse()</code>, any
    previous scan tokens are invalidated and will cause an error
    if used again. This prevents incorrect mixed use of the two
    different parsing schemes or incorrect calls to
    <code>parseNext()</code>.</p>

    </s2>

    <anchor name="GrammarCache"/>
    <s2 title="Pre-parsing Grammar and Grammar Caching">
    <p>&XercesCName; provides a function to pre-parse a grammar so that users
    can check it for syntax errors before using it. Users can also optionally
    cache these pre-parsed grammars for later use during actual parsing.
    </p>
    <p>Here is an example:</p>
<source>
XercesDOMParser parser;

// Enable schema processing.
parser.setDoSchema(true);
parser.setDoNamespaces(true);

// Let's preparse the schema grammar (.xsd) and cache it.
// Here xmlFile names the schema file to load.
Grammar* grammar = parser.loadGrammar(xmlFile, Grammar::SchemaGrammarType, true);
</source>
    <p>Besides caching pre-parsed schema grammars, users can also cache any
    grammars encountered while parsing an XML document.
    </p>
    <p>Here is an example:</p>
<source>
SAXParser parser;

// Enable grammar caching by setting cacheGrammarFromParse to true.
// The parser will cache any grammar it encounters that does not
// already exist in the pool.
// If the grammar is a DTD, no internal subset is allowed.
parser.cacheGrammarFromParse(true);

// Let's parse our xml file (DTD grammar)
parser.parse(xmlFile);

// We can get the grammar where the root element was declared
// by calling the parser's method getRootGrammar.
// Note: The parser owns the grammar, and the user should not delete it.
Grammar* grammar = parser.getRootGrammar();
</source>
    <p>We can use any previously cached grammars when parsing new XML
    documents. Here are some examples of how to use those cached grammars:
    </p>
<source>
/**
 * Caching and reusing XML Schema (.xsd) grammar
 * Parse an XML document and cache its grammar set. Then, use the cached
 * grammar set in subsequent parses.
 */

XercesDOMParser parser;

// Enable schema processing
parser.setDoSchema(true);
parser.setDoNamespaces(true);

// Enable grammar caching
parser.cacheGrammarFromParse(true);

// Let's parse the XML document. The parser will cache any grammars encountered.
parser.parse(xmlFile);

// No need to enable re-use by setting useCachedGrammarInParse to true. It is
// automatically enabled with grammar caching.
for (int i = 0; i &lt; 3; i++)
    parser.parse(xmlFile);

// This will flush the grammar pool
parser.resetCachedGrammarPool();
</source>

<source>
/**
 * Caching and reusing DTD grammar
 * Preparse a grammar and cache it in the pool. Then, we use the cached grammar
 * when parsing XML documents.
 */

SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();

// Load grammar and cache it
parser->loadGrammar(dtdFile, Grammar::DTDGrammarType, true);

// Enable grammar reuse
parser->setFeature(XMLUni::fgXercesUseCachedGrammarInParse, true);

// Parse xml files
parser->parse(xmlFile1);
parser->parse(xmlFile2);
</source>
    <p>There are some limitations on caching and using cached grammars:</p>
    <ul>
      <li>When caching/reusing DTD grammars, no internal subset is allowed.</li>
      <li>When preparsing grammars with the caching option enabled, if a grammar in the
      result set already exists in the pool (same namespace for schema or same system
      id for DTD), the entire set will not be cached. This behavior is the default but can
      be overridden for XML Schema caching. See the SAX/SAX2/DOM parser features for details.</li>
      <li>When parsing an XML document with the grammar caching option enabled, the
      reuse option is also automatically enabled. The parser will only parse a grammar if it
      does not exist in the pool.</li>
    </ul>
    </s2>

    <anchor name="LoadableMessageText"/>
    <s2 title="Loadable Message Text">

    <p>&XercesCName; supports loadable message text. Although
    the current distribution only supports English, it is capable of
    supporting other
    languages. Anyone interested in contributing any translations
    should contact us. This would be an extremely useful
    service.</p>

    <p>In order to support the local message loading services, all the error messages
    are captured in an XML file in the src/xercesc/NLS/ directory.
    There is a simple program, in the tools/NLS/Xlat/ directory,
    which can translate that text into various formats. It currently
    supports a simple 'in memory' format (i.e. an array of
    strings), the Win32 resource format, and the message catalog
    format. The 'in memory' format is intended for very simple
    installations or for use when porting to a new platform (since
    you can use it until you can get your own local message
    loading support done).</p>

    <p>In the src/xercesc/util/ directory, there is an XMLMsgLoader
    class. This is an abstraction from which any number of
    message loading services can be derived. Your platform driver
    file can create whichever type of message loader it wants to
    use on that platform. &XercesCName; currently has versions for the in
    memory format, the Win32 resource format, the message
    catalog format, and the ICU message loader.
    Some of the platforms can support multiple message
    loaders, in which case a #define token is used to control
    which one is used. You can set this in your build projects to
    control the message loader type used.</p>

    </s2>

    <anchor name="PluggableTranscoders"/>
    <s2 title="Pluggable Transcoders">

    <p>&XercesCName; also supports pluggable transcoding services. The
    XMLTransService class is an abstract API that can be derived
    from to support any desired transcoding
    service. XMLTranscoder is the abstract API for a particular
    instance of a transcoder for a particular encoding. The
    platform driver file decides what specific type of transcoder
    to use, which allows each platform to use its native
    transcoding services, or the ICU service if desired.</p>

    <p>Implementations are provided for Win32 native services, ICU
    services, and the <ref>iconv</ref> services available on many
    Unix platforms. The Win32 version only provides native code
    page services, so it can only handle XML code in the intrinsic
    encodings ASCII, UTF-8, UTF-16 (Big/Little Endian), UCS4
    (Big/Little Endian), the EBCDIC code pages IBM037, IBM1047 and
    IBM1140, ISO-8859-1 (aka Latin1) and Windows-1252. The ICU version
    provides all of the encodings that ICU supports. The
    <ref>iconv</ref> version will support the encodings supported
    by the local system. You can use the transcoders we provide or
    create your own if you feel ours are insufficient in some way,
    or if your platform requires an implementation that &XercesCName; does not
    provide.</p>

    </s2>

    <anchor name="PortingGuidelines"/>
    <s2 title="Porting Guidelines">

    <p>All platform dependent code in &XercesCName; has been
    isolated to a couple of files, which should ease the porting
    effort. The <code>src/xercesc/util</code> directory
    contains all such files. In particular:</p>

    <ul>
      <li>The <code>src/xercesc/util/FileManagers</code> directory
      contains implementations of file managers for various
      platforms.</li>

      <li>The <code>src/xercesc/util/MutexManagers</code> directory
      contains implementations of mutex managers for various
      platforms.</li>

      <li>The <code>src/xercesc/util/Xerces_autoconf_const*</code> files
      provide base definitions for various platforms.</li>
    </ul>

    <p>Other concerns are:</p>

    <ul>
      <li>Does ICU compile on your platform? If not, then you'll need to
      create a transcoder implementation that uses your local transcoding
      services. The iconv transcoder should work for you, though perhaps
      with some modifications.</li>
      <li>What message loader will you use? To get started, you can use the
      "in memory" one, which is very simple and easy. Then, once you get
      going, you may want to adapt the message catalog message loader, or
      write one of your own that uses local services.</li>
      <li>What should I define XMLCh to be? Please refer to <jump
      href="build-misc-&XercesC3Series;.html#XMLChInfo">What should I define XMLCh to be?</jump> for
      further details.</li>
    </ul>

    <p>Finally, you need to decide how to define XMLCh. Generally,
    XMLCh should be defined to be a type suitable for holding a
    UTF-16 encoded (16 bit) value, usually an <code>unsigned short</code>. </p>

    <p>All XML data is handled within &XercesCName; as strings of
    XMLCh characters. Regardless of the size of the
    type chosen, the data stored in variables of type XMLCh
    will always be UTF-16 encoded values. </p>
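    <p>To illustrate UTF-16 encoded values, the following sketch splits a
    Unicode code point into 16 bit code units. It is only an illustration under
    stated assumptions: <code>Utf16Units</code> and <code>encodeUtf16</code> are
    hypothetical names, not &XercesCName; APIs, and <code>char16_t</code> merely
    stands in for XMLCh, whose actual definition is chosen when the library is built.
    </p>
<source>
// Hypothetical sketch, not a Xerces-C++ API. char16_t stands in for XMLCh.
struct Utf16Units
{
    char16_t units[2];  // one or two UTF-16 code units
    int      count;     // number of valid entries in units
};

// Encodes a Unicode code point (up to U+10FFFF) as UTF-16. The caller is
// assumed to have rejected surrogate values and out-of-range input.
constexpr Utf16Units encodeUtf16(char32_t codePoint)
{
    return codePoint >= 0x10000
        // Supplementary plane: subtract 0x10000, then split the remaining
        // 20 bits into a high (lead) and a low (trail) surrogate.
        ? Utf16Units{{char16_t(0xD800 + (codePoint - 0x10000) / 0x400),
                      char16_t(0xDC00 + (codePoint - 0x10000) % 0x400)}, 2}
        // Basic Multilingual Plane: the code point is its own code unit.
        : Utf16Units{{char16_t(codePoint), 0}, 1};
}

// 'A' (U+0041) needs one unit; U+1F600 needs the pair D83D DE00.
static_assert(encodeUtf16(0x41).count == 1, "BMP code point uses one unit");
static_assert(encodeUtf16(0x1F600).units[0] == 0xD83D, "high surrogate");
static_assert(encodeUtf16(0x1F600).units[1] == 0xDE00, "low surrogate");
</source>
    <p>A character outside the Basic Multilingual Plane therefore occupies two
    XMLCh values, so the number of XMLCh values in a string is not always the
    number of characters.</p>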

    <p>Unlike XMLCh, the encoding
    of wchar_t is platform dependent. Sometimes it is UTF-16
    (AIX, Windows), sometimes UCS-4 (Solaris,
    Linux), and sometimes it is not based on Unicode at all
    (HP-UX, AS/400, System/390). </p>

    <p>Some earlier releases of &XercesCName; defined XMLCh to be the