attoparser 2.0.7.RELEASE API

attoparser is a Java parser for XML and HTML markup.

Main features

The main features of attoparser are:

  • Fast, lightweight and easy to use.
  • Supports and understands both XML and HTML (including HTML5).
  • Powerful API. Deliberately does not implement the official SAX or DOM standard XML APIs.
  • Event-based (SAX-style): markup handler objects process the parsing events, and handlers can be (and often are) chained to achieve the desired final result.
  • Although it is event-based, it ships with a handler that turns events into a DOM-style object tree.
  • Does not perform any DTD / XSD validation, namespace processing, entity resolution or escaping / unescaping operations. This is on purpose, too.
  • Allows ill-formed markup (XML or HTML) if configured to do so.
  • Performs auto-balancing of tags if configured to do so, in both XML and HTML parsing modes (in HTML mode it follows the HTML5 specification).
  • Zero loss parsing. Does not lose any information during parsing (keyword case, attribute quoting...), so that the exact original markup can be reconstructed at the handler layer.
  • Can perform fast fragment selection operations during parsing, based on powerful markup selection expressions like //div/p#content.
  • Loaded with other useful goodies like HTML minimization, event trace building or pretty-HTML reporting.
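
To make the idea of turning SAX-style events into a DOM-style tree more concrete, here is a minimal, self-contained sketch in plain Java. It does not use attoparser at all: `EventHandler`, `Node` and `TreeBuilder` are hypothetical names invented for this illustration, the event set is reduced to three events, and the events are assumed to arrive already balanced. attoparser's real handler interfaces and DOM classes are richer than this.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustration only: hypothetical types, NOT attoparser's API.
final class TreeBuilderSketch {

    /** A reduced set of parsing events (assumption: a real handler has many more). */
    interface EventHandler {
        void openElement(String name);
        void text(String content);
        void closeElement(String name);
    }

    /** Tree node: an element (name set, text null) or a text node (text set, name null). */
    static final class Node {
        final String name;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String name, String text) {
            this.name = name;
            this.text = text;
        }
    }

    /** Builds a tree by keeping a stack of the elements that are currently open. */
    static final class TreeBuilder implements EventHandler {
        private final Node root = new Node("#document", null);
        private final Deque<Node> open = new ArrayDeque<>();

        TreeBuilder() {
            open.push(root);
        }

        @Override
        public void openElement(String name) {
            Node element = new Node(name, null);
            open.peek().children.add(element); // attach to the innermost open element
            open.push(element);
        }

        @Override
        public void text(String content) {
            open.peek().children.add(new Node(null, content));
        }

        @Override
        public void closeElement(String name) {
            open.pop(); // assumes balanced events, so the top of the stack is the element being closed
        }

        Node root() {
            return root;
        }
    }

    /** Fires the events a parser might produce for <ul><li>a</li><li>b</li></ul>. */
    static Node demo() {
        TreeBuilder builder = new TreeBuilder();
        builder.openElement("ul");
        builder.openElement("li");
        builder.text("a");
        builder.closeElement("li");
        builder.openElement("li");
        builder.text("b");
        builder.closeElement("li");
        builder.closeElement("ul");
        return builder.root();
    }
}
```

The stack is what lets a flat stream of events become a hierarchy: each open event pushes a node, each close event pops one, and every new node is attached to whatever is on top of the stack at that moment.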

How to use it

Using attoparser can be as simple as:


  // Obtain a java.io.Reader on the document to be parsed
  final Reader documentReader = ...;

  // Create the handler instance. Extending the no-op AbstractMarkupHandler is a good start
  final IMarkupHandler handler = new AbstractMarkupHandler() {
      ... // some events implemented
  };

  // Create or obtain the parser instance (can be reused). Example uses the default configuration for HTML
  final IMarkupParser parser = new MarkupParser(ParseConfiguration.htmlConfiguration());

  // Parse it!
  parser.parse(documentReader, handler);

A more complex example: say you want to extract only the <div> elements with class "content" from an HTML file and write them to another file. You will need a BlockSelectorMarkupHandler instance to do the selection, with an OutputMarkupHandler chained after it to write the selected markup somewhere:


    // Obtain a java.io.Reader on the document to be parsed
    final Reader documentReader = ...;

    // Obtain a java.io.Writer on the resource you want the results to be written to
    final Writer documentWriter = ...;

    // Last step of the chain will be the OutputMarkupHandler, which will write events as markup to the writer
    final OutputMarkupHandler outputHandler = new OutputMarkupHandler(documentWriter);

    // Before outputting, we need to select those divs by means of a "markup selector expression", so we chain a selector handler
    final BlockSelectorMarkupHandler selectorHandler = new BlockSelectorMarkupHandler(outputHandler, "div.content");

    // Create or obtain the parser instance (can be reused). We will use the default configuration for HTML
    final IMarkupParser parser = new MarkupParser(ParseConfiguration.htmlConfiguration());

    // Parse it!
    parser.parse(documentReader, selectorHandler);
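
The chaining used above follows a simple pattern: each handler receives events and decides which of them to forward to the next handler. The sketch below shows that pattern in plain Java, with no dependency on attoparser. All type names (`EventHandler`, `SelectBlockHandler`, `OutputHandler`) are hypothetical, selection is done by element name only rather than by a selector expression, and this is not how BlockSelectorMarkupHandler is implemented; it is only meant to show why the output handler is built first and passed to the selector.

```java
// Illustration only: hypothetical types, NOT attoparser's API.
final class HandlerChainSketch {

    /** A reduced set of parsing events (assumption: a real handler has many more). */
    interface EventHandler {
        void openElement(String name);
        void text(String content);
        void closeElement(String name);
    }

    /** Last link of the chain: writes every event it receives back out as markup. */
    static final class OutputHandler implements EventHandler {
        private final StringBuilder out;

        OutputHandler(StringBuilder out) {
            this.out = out;
        }

        @Override
        public void openElement(String name) {
            out.append('<').append(name).append('>');
        }

        @Override
        public void text(String content) {
            out.append(content);
        }

        @Override
        public void closeElement(String name) {
            out.append("</").append(name).append('>');
        }
    }

    /** Intermediate link: forwards only events inside elements with the selected name. */
    static final class SelectBlockHandler implements EventHandler {
        private final EventHandler next;
        private final String selected;
        private int depth = 0; // nesting depth inside a selected block, 0 when outside

        SelectBlockHandler(EventHandler next, String selected) {
            this.next = next;
            this.selected = selected;
        }

        @Override
        public void openElement(String name) {
            if (depth > 0 || name.equals(selected)) {
                depth++;
                next.openElement(name);
            }
        }

        @Override
        public void text(String content) {
            if (depth > 0) {
                next.text(content);
            }
        }

        @Override
        public void closeElement(String name) {
            if (depth > 0) {
                next.closeElement(name);
                depth--;
            }
        }
    }

    /** Fires the events a parser might produce for <body><h1>t</h1><div><p>hi</p></div></body>. */
    static String demo() {
        StringBuilder out = new StringBuilder();
        // Built back to front, like the example above: output first, then the selector wrapping it.
        EventHandler chain = new SelectBlockHandler(new OutputHandler(out), "div");
        chain.openElement("body");
        chain.openElement("h1");
        chain.text("t");
        chain.closeElement("h1");
        chain.openElement("div");
        chain.openElement("p");
        chain.text("hi");
        chain.closeElement("p");
        chain.closeElement("div");
        chain.closeElement("body");
        return out.toString();
    }
}
```

Because each link only knows about the next one, the parser is handed the first handler of the chain (the selector), and the output handler never sees the events the selector chose to drop.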

Where to start

The best place to start learning about attoparser from these docs is the IMarkupParser interface and, especially, the IMarkupHandler interface.

Packages

  Package                   Description
  org.attoparser            Main parser and handler artifacts: basic interfaces and implementations.
  org.attoparser.config     Parser configuration artifacts.
  org.attoparser.discard    Handlers for discarding markup.
  org.attoparser.dom        Handlers for creating DOM trees as a result of parsing.
  org.attoparser.duplicate  Handlers for duplicating events between more than one handler.
  org.attoparser.minimize   Handlers for minimizing (compacting) HTML markup.
  org.attoparser.output     Handlers for outputting markup as a result of parsing.
  org.attoparser.prettyhtml Handlers for creating a pretty-HTML representation of parsing events.
  org.attoparser.select     Handlers for filtering a part or several parts of markup during parsing in a fast and efficient way.
  org.attoparser.simple     Artifacts for parsing using a simplified version of the handler interfaces.
  org.attoparser.trace      Handlers for creating traces of parsing events (for testing/debugging).
  org.attoparser.util       Utility classes.