attoparser 2.0.7.RELEASE API
attoparser is a Java parser for XML and HTML markup.
Main features
The main features of attoparser are:
- Fast, lightweight and easy to use.
- Supports and understands both XML and HTML (including HTML5).
- Powerful API. Does not implement the official SAX or DOM standard XML APIs. On purpose.
- Event-based (SAX-style): markup handler objects process the parsing events, and these handlers can be (and often are) chained to achieve the desired final result.
- Although it is event-based, it ships with a handler that turns events into a DOM-style object tree.
- Does not perform any DTD / XSD validation, namespace processing, entity resolution or escaping / unescaping operations. All of this on purpose, too.
- Allows ill-formed markup (XML or HTML) if configured to do so.
- Performs auto-balancing of tags if configured to do so, in both XML and HTML parsing modes (in HTML mode, according to the HTML5 specification).
- Zero loss parsing. Does not lose any information during parsing (keyword case, attribute quoting...), so that the exact original markup can be reconstructed at the handler layer.
- Can perform fast fragment selection operations during parsing, based on powerful markup selection expressions like //div/p#content.
- Loaded with other useful goodies like HTML minimization, event trace building or pretty-HTML reporting.
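To make the "events into a DOM-style tree" feature more concrete, here is a minimal sketch of the general technique: a stack of open elements turns a stream of open/text/close events into a tree. All names here (SketchNode, SketchTreeBuilder, TreeSketch) are hypothetical and use plain strings; attoparser's real handler interfaces and DOM classes are richer and are not reproduced here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical node type for illustration; attoparser's own DOM classes differ.
final class SketchNode {
    final String name; // element name, or null for a text node
    final String text; // text content, or null for an element
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name, String text) {
        this.name = name;
        this.text = text;
    }
}

// Hypothetical handler that receives SAX-style events and grows a tree.
final class SketchTreeBuilder {
    private final SketchNode root = new SketchNode("#document", null);
    // Stack of currently open elements; the top is the parent of the next node.
    private final Deque<SketchNode> open = new ArrayDeque<>();

    SketchTreeBuilder() {
        open.push(root);
    }

    void openElement(String name) {
        SketchNode element = new SketchNode(name, null);
        open.peek().children.add(element);
        open.push(element);
    }

    void text(String content) {
        open.peek().children.add(new SketchNode(null, content));
    }

    void closeElement(String name) {
        // Assumes balanced events; never pops the synthetic document root.
        if (open.size() > 1) {
            open.pop();
        }
    }

    SketchNode root() {
        return root;
    }
}

final class TreeSketch {
    // Replays the events a parser might fire for <ul><li>a</li><li>b</li></ul>.
    static SketchNode build() {
        SketchTreeBuilder builder = new SketchTreeBuilder();
        builder.openElement("ul");
        builder.openElement("li");
        builder.text("a");
        builder.closeElement("li");
        builder.openElement("li");
        builder.text("b");
        builder.closeElement("li");
        builder.closeElement("ul");
        return builder.root();
    }

    public static void main(String[] args) {
        SketchNode ul = build().children.get(0);
        System.out.println(ul.name + " has " + ul.children.size() + " children");
    }
}
```

The sketch relies on balanced events, which is one reason a tree-building handler is easier to write when the parser is configured to auto-balance tags.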
How to use it
Using attoparser can be as simple as:
// Obtain a java.io.Reader on the document to be parsed
final Reader documentReader = ...;
// Create the handler instance. Extending the no-op AbstractMarkupHandler is a good start
final IMarkupHandler handler = new AbstractMarkupHandler() {
... // some events implemented
};
// Create or obtain the parser instance (can be reused). Example uses the default configuration for HTML
final IMarkupParser parser = new MarkupParser(ParseConfiguration.htmlConfiguration());
// Parse it!
parser.parse(documentReader, handler);
A more complex example: say you want to extract only the <div> elements with class "content" from an HTML file and write them to another file. You will need a BlockSelectorMarkupHandler instance to do the selection, with an OutputMarkupHandler chained after it to write the selected markup somewhere:
// Obtain a java.io.Reader on the document to be parsed
final Reader documentReader = ...;
// Obtain a java.io.Writer on the resource you want the results to be written to
final Writer documentWriter = ...;
// The last step of the chain is the OutputMarkupHandler, which writes events as markup to the writer
final OutputMarkupHandler outputHandler = new OutputMarkupHandler(documentWriter);
// Before outputting, we need to select those divs by means of a "markup selector expression", so we chain a selector handler in front
final BlockSelectorMarkupHandler selectorHandler = new BlockSelectorMarkupHandler(outputHandler, "div.content");
// Create or obtain the parser instance (can be reused). We will use the default configuration for HTML
final IMarkupParser parser = new MarkupParser(ParseConfiguration.htmlConfiguration());
// Parse it!
parser.parse(documentReader, selectorHandler);
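The chaining used above follows a decorator-style pattern: each handler receives events and decides what to pass on to the next one. The following is only a conceptual sketch of that pattern, not attoparser's implementation. The SketchHandler interface and the classes built on it are hypothetical, use plain strings instead of char[] buffers, match on a single class value rather than a real selector expression, and write to a StringBuilder instead of a java.io.Writer to keep the example free of checked exceptions.

```java
// Hypothetical, simplified event interface. attoparser's real IMarkupHandler
// declares many more events than these three.
interface SketchHandler {
    void openElement(String name, String cssClass);
    void text(String content);
    void closeElement(String name);
}

// Last link of the chain: writes every event it receives back out as markup.
final class SketchOutputHandler implements SketchHandler {
    private final StringBuilder out;

    SketchOutputHandler(StringBuilder out) {
        this.out = out;
    }

    @Override
    public void openElement(String name, String cssClass) {
        out.append('<').append(name);
        if (cssClass != null) {
            out.append(" class=\"").append(cssClass).append('"');
        }
        out.append('>');
    }

    @Override
    public void text(String content) {
        out.append(content);
    }

    @Override
    public void closeElement(String name) {
        out.append("</").append(name).append('>');
    }
}

// Middle link: forwards events to the next handler only while inside an
// element matching the given name and class; all other events are dropped.
final class SketchBlockSelector implements SketchHandler {
    private final SketchHandler next;
    private final String name;
    private final String cssClass;
    private int depth = 0; // greater than 0 while inside a selected block

    SketchBlockSelector(SketchHandler next, String name, String cssClass) {
        this.next = next;
        this.name = name;
        this.cssClass = cssClass;
    }

    @Override
    public void openElement(String name, String cssClass) {
        boolean matches = this.name.equals(name) && this.cssClass.equals(cssClass);
        if (depth > 0 || matches) {
            depth++;
            next.openElement(name, cssClass);
        }
    }

    @Override
    public void text(String content) {
        if (depth > 0) {
            next.text(content);
        }
    }

    @Override
    public void closeElement(String name) {
        if (depth > 0) {
            next.closeElement(name);
            depth--;
        }
    }
}

final class ChainSketch {
    // Replays the events a parser might fire for:
    // <p>intro</p><div class="content"><b>kept</b></div><div>dropped</div>
    static String selectContentDivs() {
        StringBuilder out = new StringBuilder();
        SketchHandler chain =
                new SketchBlockSelector(new SketchOutputHandler(out), "div", "content");
        chain.openElement("p", null);
        chain.text("intro");
        chain.closeElement("p");
        chain.openElement("div", "content");
        chain.openElement("b", null);
        chain.text("kept");
        chain.closeElement("b");
        chain.closeElement("div");
        chain.openElement("div", null);
        chain.text("dropped");
        chain.closeElement("div");
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(selectContentDivs());
    }
}
```

Because the selector only needs a depth counter, it can filter while events stream through, without building a tree of the whole document first.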
Where to start
The best place to start learning about attoparser from these docs is the IMarkupParser and, especially, the IMarkupHandler interfaces.
Packages
Package
Description
Main parser and handler artifacts: basic interfaces and implementations.
Parser configuration artifacts.
Handlers for discarding markup.
Handlers for creating DOM trees as a result of parsing.
Handlers for duplicating events between more than one handler.
Handlers for minimizing (compacting) HTML markup.
Handlers for outputting markup as a result of parsing.
Handlers for creating a pretty-HTML representation of parsing events.
Handlers for filtering a part or several parts of markup during parsing in a fast and efficient way.
Artifacts for parsing using a simplified version of the handler interfaces.
Handlers for creating traces of parsing events (for testing/debugging).
Utility classes.