Class LagartoParser
java.lang.Object
jodd.lagarto.LagartoParser
HTML/XML content parser/tokenizer that uses TagVisitor for callbacks.
Tokenization follows the HTML5 specification, as described by WHATWG.
Differences from the specs:
- text is emitted as a block of text, and not character by character.
- the letter case of tag names (and of other entities) is not changed, but case-sensitive information exists for matching.
- the whole tokenization process is implemented here, without going into tree building. This also covers switching to the RAWTEXT state.
- script tag is emitted separately
- conditional comments added
- xml states and callbacks added
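The first difference listed above (text emitted as a block rather than character by character) can be illustrated with a small standalone sketch. This is not the LagartoParser implementation: the class BlockTextSketch, its tokenize method, and the "TEXT:"/"TAG:" event strings are hypothetical names chosen for illustration, and the tag handling is deliberately naive.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical mini-tokenizer: buffers text and emits it once per run, not per character. */
final class BlockTextSketch {

    /** Returns the emitted events as strings prefixed with "TEXT:" or "TAG:". */
    static List<String> tokenize(String input) {
        List<String> events = new ArrayList<>();
        StringBuilder text = new StringBuilder();    // pending text block
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int close = (c == '<') ? input.indexOf('>', i) : -1;
            if (close != -1) {
                // A tag starts here: flush the pending text as one block first.
                if (text.length() > 0) {
                    events.add("TEXT:" + text);
                    text.setLength(0);
                }
                // The tag content is reported with its original letter case.
                events.add("TAG:" + input.substring(i + 1, close));
                i = close + 1;
            } else {
                text.append(c);                      // keep buffering
                i++;
            }
        }
        if (text.length() > 0) {
            events.add("TEXT:" + text);              // flush trailing text
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello <B>world</B>!"));
    }
}
```

A visitor-based parser would invoke a callback at each point where this sketch appends to the event list.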
-
Nested Class Summary
Nested Classes
Modifier and Type    Class    Description
protected class    LagartoParser.ScriptEscape    Since escaping states inside the SCRIPT tag are rare, we want to use them lazily, only when really needed.
protected class    LagartoParser.XmlDeclaration
-
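The nested class description mentions creating the rarely used SCRIPT escaping states only when they are needed. A minimal sketch of that general idea follows; LazyStatesSketch, EscapeStates, and isInitialized are hypothetical names, and the actual fields and states of LagartoParser.ScriptEscape are not shown in this page.

```java
/** Hypothetical outline of lazily creating a rarely needed group of states. */
final class LazyStatesSketch {

    /** Stand-in for a nested class that holds rarely used tokenizer states. */
    static final class EscapeStates {
        String describe() {
            return "escape states ready";
        }
    }

    private EscapeStates escapeStates;     // stays null until first needed

    /** Creates the holder on first use and reuses the same instance afterwards. */
    EscapeStates escapeStates() {
        if (escapeStates == null) {
            escapeStates = new EscapeStates();
        }
        return escapeStates;
    }

    /** Reports whether the holder has been created yet. */
    boolean isInitialized() {
        return escapeStates != null;
    }

    public static void main(String[] args) {
        LazyStatesSketch sketch = new LazyStatesSketch();
        System.out.println(sketch.isInitialized());          // nothing allocated yet
        System.out.println(sketch.escapeStates().describe());
        System.out.println(sketch.isInitialized());
    }
}
```

The check is not thread-safe; that is usually acceptable for a parser instance that is used from a single thread.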
Field Summary
Fields, grouped by modifier and type (each field is listed individually under Field Details):
protected TagVisitor    visitor
protected ParsedTag    tag
protected ParsedDoctype    doctype
protected final CharsInput    in
protected final LagartoParserConfig    config
protected LagartoParser.ScriptEscape    scriptEscape
protected LagartoParser.XmlDeclaration    xmlDeclaration
protected boolean    parsing
private boolean    conditionalCommentStarted
protected int    attrEndNdx, attrStartNdx, commentStart, doctypeIdNameStart, rawTextEnd, rawTextStart, rcdataTagStart, scriptEndNdx, scriptEndTagName, scriptStartNdx, textLen
protected char[]    rawTagName, rcdataTagName, text
protected State    state, plus one field per tokenizer state (DATA_STATE: Data state.)
private static final char[]    character and keyword constants such as TAG_WHITESPACES, T_SCRIPT, CDATA and CC_IF
private static final char[][]    RAWTEXT_TAGS, RCDATA_TAGS
private static final char    REPLACEMENT_CHAR
private static final CharSequence    _ENDIF
-
Constructor Summary
Constructors
Constructor    Description
LagartoParser(char[] input)    Creates parser on char array.
LagartoParser(CharSequence input)    Creates parser on a char sequence.
LagartoParser(LagartoParserConfig parserConfig, char[] input)    Creates parser on char array.
LagartoParser(LagartoParserConfig parserConfig, CharSequence input)    Creates parser on a char sequence.
-
Method Summary
Modifier and Type    Method    Description
private void    _addAttribute()
private void    _addAttribute(CharSequence attrName, CharSequence attrValue)
private void    _addAttributeWithValue()
private void    _consumeAttrCharacterReference()
private void    _consumeCharacterReference()
private void    _consumeNumber(int unconsumeNdx)
protected void    _error    Prepares error message and reports it to the visitor.
configure(Consumer<LagartoParserConfig> configConsumer)    Configures the parser.
protected void    consumeCharacterReference()
protected void    consumeCharacterReference(char allowedChar)
protected void    emitCData(CharSequence charSequence)
protected void    emitComment(int from, int to)    Emits a comment.
protected void    emitDoctype()
protected void    emitScript(int from, int to)
protected void    emitTag()
protected void    emitText()    Emits text if there is some content.
protected void    emitXml()
private void    ensureCapacity()
private void    ensureCapacity(int growth)
protected void    errorCharReference()
protected void    errorEOF()
protected void    errorInvalidToken()
getConfig    Returns configuration of the parser.
protected void    initialize()    Initializes parser.
private boolean    isAppropriateTagName(char[] lowerCaseNameToMatch, int from, int to)
private boolean    matchTagName(char[] tagNameLowercase)
void    parse(TagVisitor visitor)    Parses content and emits event to provided TagVisitor.
private void    switchTypeToSelfClosing()
protected void    textEmitChar(char c)    Emits characters into the local text buffer.
protected void    textEmitChars(char[] buffer)
protected void    textEmitChars(int from, int to)
protected void    textStart()    Resets text buffer.
protected CharSequence    textWrap()
-
Field Details
-
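visitor
protected TagVisitor visitor -
tag
protected ParsedTag tag -
doctype
protected ParsedDoctype doctype -
in
protected final CharsInput in -
config
protected final LagartoParserConfig config -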
parsing
protected boolean parsing -
DATA_STATE
Data state. -
TAG_OPEN
-
END_TAG_OPEN
-
TAG_NAME
-
BEFORE_ATTRIBUTE_NAME
-
ATTRIBUTE_NAME
-
AFTER_ATTRIBUTE_NAME
-
BEFORE_ATTRIBUTE_VALUE
-
ATTR_VALUE_UNQUOTED
-
ATTR_VALUE_SINGLE_QUOTED
-
ATTR_VALUE_DOUBLE_QUOTED
-
AFTER_ATTRIBUTE_VALUE_QUOTED
-
SELF_CLOSING_START_TAG
-
BOGUS_COMMENT
-
MARKUP_DECLARATION_OPEN
-
rawTextStart
protected int rawTextStart -
rawTextEnd
protected int rawTextEnd -
rawTagName
protected char[] rawTagName -
RAWTEXT
-
RAWTEXT_LESS_THAN_SIGN
-
RAWTEXT_END_TAG_OPEN
-
RAWTEXT_END_TAG_NAME
-
rcdataTagStart
protected int rcdataTagStart -
rcdataTagName
protected char[] rcdataTagName -
RCDATA
-
RCDATA_LESS_THAN_SIGN
-
RCDATA_END_TAG_OPEN
-
RCDATA_END_TAG_NAME
-
commentStart
protected int commentStart -
COMMENT_START
-
COMMENT_START_DASH
-
COMMENT
-
COMMENT_END_DASH
-
COMMENT_END
-
COMMENT_END_BANG
-
DOCTYPE
-
BEFORE_DOCTYPE_NAME
-
DOCTYPE_NAME
-
AFTER_DOCUMENT_NAME
-
doctypeIdNameStart
protected int doctypeIdNameStart -
AFTER_DOCTYPE_PUBLIC_KEYWORD
-
BEFORE_DOCTYPE_PUBLIC_IDENTIFIER
-
DOCTYPE_PUBLIC_IDENTIFIER_DOUBLE_QUOTED
-
DOCTYPE_PUBLIC_IDENTIFIER_SINGLE_QUOTED
-
AFTER_DOCTYPE_PUBLIC_IDENTIFIER
-
BETWEEN_DOCTYPE_PUBLIC_AND_SYSTEM_IDENTIFIERS
-
BOGUS_DOCTYPE
-
AFTER_DOCTYPE_SYSTEM_KEYWORD
-
BEFORE_DOCTYPE_SYSTEM_IDENTIFIER
-
DOCTYPE_SYSTEM_IDENTIFIER_DOUBLE_QUOTED
-
DOCTYPE_SYSTEM_IDENTIFIER_SINGLE_QUOTED
-
AFTER_DOCTYPE_SYSTEM_IDENTIFIER
-
scriptStartNdx
protected int scriptStartNdx -
scriptEndNdx
protected int scriptEndNdx -
scriptEndTagName
protected int scriptEndTagName -
SCRIPT_DATA
-
SCRIPT_DATA_LESS_THAN_SIGN
-
SCRIPT_DATA_END_TAG_OPEN
-
SCRIPT_DATA_END_TAG_NAME
-
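scriptEscape
protected LagartoParser.ScriptEscape scriptEscape -
xmlDeclaration
protected LagartoParser.XmlDeclaration xmlDeclaration -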
text
protected char[] text -
textLen
protected int textLen -
attrStartNdx
protected int attrStartNdx -
attrEndNdx
protected int attrEndNdx -
conditionalCommentStarted
private boolean conditionalCommentStarted -
state
protected State state -
TAG_WHITESPACES
private static final char[] TAG_WHITESPACES -
TAG_WHITESPACES_OR_END
private static final char[] TAG_WHITESPACES_OR_END -
CONTINUE_CHARS
private static final char[] CONTINUE_CHARS -
ATTR_INVALID_1
private static final char[] ATTR_INVALID_1 -
ATTR_INVALID_2
private static final char[] ATTR_INVALID_2 -
ATTR_INVALID_3
private static final char[] ATTR_INVALID_3 -
ATTR_INVALID_4
private static final char[] ATTR_INVALID_4 -
COMMENT_DASH
private static final char[] COMMENT_DASH -
T_DOCTYPE
private static final char[] T_DOCTYPE -
T_SCRIPT
private static final char[] T_SCRIPT -
T_XMP
private static final char[] T_XMP -
T_STYLE
private static final char[] T_STYLE -
T_IFRAME
private static final char[] T_IFRAME -
T_NOFRAMES
private static final char[] T_NOFRAMES -
T_NOEMBED
private static final char[] T_NOEMBED -
T_NOSCRIPT
private static final char[] T_NOSCRIPT -
T_TEXTAREA
private static final char[] T_TEXTAREA -
T_TITLE
private static final char[] T_TITLE -
A_PUBLIC
private static final char[] A_PUBLIC -
A_SYSTEM
private static final char[] A_SYSTEM -
CDATA
private static final char[] CDATA -
CDATA_END
private static final char[] CDATA_END -
XML
private static final char[] XML -
XML_VERSION
private static final char[] XML_VERSION -
XML_ENCODING
private static final char[] XML_ENCODING -
XML_STANDALONE
private static final char[] XML_STANDALONE -
CC_IF
private static final char[] CC_IF -
CC_ENDIF
private static final char[] CC_ENDIF -
CC_ENDIF2
private static final char[] CC_ENDIF2 -
CC_END
private static final char[] CC_END -
RAWTEXT_TAGS
private static final char[][] RAWTEXT_TAGS -
RCDATA_TAGS
private static final char[][] RCDATA_TAGS -
REPLACEMENT_CHAR
private static final char REPLACEMENT_CHAR
-
INVALID_CHARS
private static final char[] INVALID_CHARS -
_ENDIF
private static final CharSequence _ENDIF -
-
Constructor Details
-
LagartoParser
public LagartoParser(LagartoParserConfig parserConfig, char[] input) Creates parser on char array. -
LagartoParser
public LagartoParser(char[] input) Creates parser on char array. -
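LagartoParser
public LagartoParser(LagartoParserConfig parserConfig, CharSequence input) Creates parser on a char sequence. -
LagartoParser
public LagartoParser(CharSequence input) Creates parser on a char sequence.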
-
-
Method Details
-
initialize
protected void initialize() Initializes parser. -
getConfig
Returns configuration of the parser. -
configure
Configures the parser. -
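The summary lists configure as taking a Consumer<LagartoParserConfig>. The general consumer-based configuration pattern can be sketched as below. Everything here is a stand-in: ConfigureSketch and SketchConfig are hypothetical classes, the caseSensitive option is invented for illustration, and returning the parser itself for chaining is an assumption, since the return type is not visible in this page.

```java
import java.util.function.Consumer;

/** Hypothetical parser-like class showing consumer-based configuration. */
final class ConfigureSketch {

    /** Stand-in configuration object; the option name is illustrative only. */
    static final class SketchConfig {
        boolean caseSensitive;
    }

    private final SketchConfig config = new SketchConfig();

    /** Hands the configuration object to the caller's consumer, then returns this (assumed). */
    ConfigureSketch configure(Consumer<SketchConfig> configConsumer) {
        configConsumer.accept(config);
        return this;
    }

    /** Returns the configuration held by this instance. */
    SketchConfig getConfig() {
        return config;
    }

    public static void main(String[] args) {
        ConfigureSketch parser = new ConfigureSketch()
                .configure(cfg -> cfg.caseSensitive = true);
        System.out.println(parser.getConfig().caseSensitive);
    }
}
```

The appeal of this shape is that the caller adjusts options in place without having to build and pass a separate configuration object.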
parse
public void parse(TagVisitor visitor) Parses content and emits events to the provided TagVisitor. -
consumeCharacterReference
protected void consumeCharacterReference(char allowedChar) -
consumeCharacterReference
protected void consumeCharacterReference() -
_consumeCharacterReference
private void _consumeCharacterReference() -
_consumeAttrCharacterReference
private void _consumeAttrCharacterReference() -
_consumeNumber
private void _consumeNumber(int unconsumeNdx) -
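The names consumeCharacterReference and _consumeNumber suggest handling of character references such as &#65; or &#x41;, though this page does not describe their behavior. As a rough illustration only, the sketch below decodes the digits of a numeric reference following the general HTML5 rules (zero, surrogates, and values above U+10FFFF become the replacement character). NumericReferenceSketch and decode are hypothetical names, the value of REPLACEMENT_CHAR is assumed to be U+FFFD, and the special remapping the specification applies to some control-range code points is left out.

```java
/** Hypothetical decoder for the digit part of a numeric character reference. */
final class NumericReferenceSketch {

    /** Assumed replacement value (U+FFFD); this page does not show the constant's value. */
    static final char REPLACEMENT_CHAR = '\uFFFD';

    /**
     * Decodes a body such as "65" (decimal) or "x41" (hexadecimal).
     * Returns null when there are no valid digits, so a caller could treat the input as plain text.
     */
    static String decode(String body) {
        boolean hex = body.startsWith("x") || body.startsWith("X");
        int radix = hex ? 16 : 10;
        int start = hex ? 1 : 0;
        if (start == body.length()) {
            return null;                        // no digits at all
        }
        int value = 0;
        for (int i = start; i < body.length(); i++) {
            char ch = body.charAt(i);
            int digit = ch < 128 ? Character.digit(ch, radix) : -1;
            if (digit < 0) {
                return null;                    // not an ASCII digit in this radix
            }
            if (value <= 0x10FFFF) {            // stop accumulating once out of range
                value = value * radix + digit;
            }
        }
        boolean surrogate = value >= 0xD800 && value <= 0xDFFF;
        if (value == 0 || value > 0x10FFFF || surrogate) {
            return String.valueOf(REPLACEMENT_CHAR);
        }
        return new String(Character.toChars(value));
    }

    public static void main(String[] args) {
        System.out.println(decode("65"));
        System.out.println(decode("x41"));
    }
}
```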
ensureCapacity
private void ensureCapacity() -
ensureCapacity
private void ensureCapacity(int growth) -
textEmitChar
protected void textEmitChar(char c) Emits characters into the local text buffer. -
textStart
protected void textStart() Resets text buffer. -
textEmitChars
protected void textEmitChars(int from, int to) -
textEmitChars
protected void textEmitChars(char[] buffer) -
textWrap
protected CharSequence textWrap() -
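The text, textLen, textStart, textEmitChar, textEmitChars, ensureCapacity, and textWrap members together suggest a growable char buffer. The sketch below shows one plausible way such a buffer could behave; it reuses the member names for readability, but the initial size, the doubling growth policy, and the use of CharBuffer for wrapping are assumptions and not taken from the LagartoParser source.

```java
import java.nio.CharBuffer;
import java.util.Arrays;

/** Hypothetical growable text buffer modelled on the member names listed above. */
final class TextBufferSketch {

    private char[] text = new char[4];   // deliberately small so that growth is exercised
    private int textLen;

    /** Resets the buffer without releasing the backing array. */
    void textStart() {
        textLen = 0;
    }

    /** Appends a single character. */
    void textEmitChar(char c) {
        ensureCapacity(1);
        text[textLen++] = c;
    }

    /** Appends all characters of the given array. */
    void textEmitChars(char[] buffer) {
        ensureCapacity(buffer.length);
        System.arraycopy(buffer, 0, text, textLen, buffer.length);
        textLen += buffer.length;
    }

    /** Grows the backing array when the requested growth does not fit (assumed doubling policy). */
    private void ensureCapacity(int growth) {
        int needed = textLen + growth;
        if (needed > text.length) {
            text = Arrays.copyOf(text, Math.max(needed, text.length * 2));
        }
    }

    /** Wraps the current content as a CharSequence view without copying it. */
    CharSequence textWrap() {
        return CharBuffer.wrap(text, 0, textLen);
    }

    public static void main(String[] args) {
        TextBufferSketch buffer = new TextBufferSketch();
        buffer.textStart();
        buffer.textEmitChar('H');
        buffer.textEmitChars("ello, world".toCharArray());
        System.out.println(buffer.textWrap());
    }
}
```

Because the wrapped sequence is a view over the backing array, a caller that needs to keep the text after further writes should copy it, for example with toString().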
_addAttribute
private void _addAttribute() -
_addAttributeWithValue
private void _addAttributeWithValue() -
_addAttribute
private void _addAttribute(CharSequence attrName, CharSequence attrValue) -
emitTag
protected void emitTag() -
emitComment
protected void emitComment(int from, int to) Emits a comment. Also checks for conditional comments! -
emitText
protected void emitText() Emits text if there is some content. -
emitScript
protected void emitScript(int from, int to) -
emitDoctype
protected void emitDoctype() -
emitXml
protected void emitXml() -
emitCData
protected void emitCData(CharSequence charSequence) -
errorEOF
protected void errorEOF() -
errorInvalidToken
protected void errorInvalidToken() -
errorCharReference
protected void errorCharReference() -
_error
Prepares error message and reports it to the visitor. -
isAppropriateTagName
private boolean isAppropriateTagName(char[] lowerCaseNameToMatch, int from, int to) -
matchTagName
private boolean matchTagName(char[] tagNameLowercase) -
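Both isAppropriateTagName and matchTagName take a lowercase name to compare against. One way to match a region of the input against a lowercase name without altering the input (consistent with the note that tag name case is not changed) is sketched below. TagNameMatchSketch and matchesIgnoringCase are hypothetical names, only ASCII letters are folded, and the real methods may work differently.

```java
/** Hypothetical case-insensitive comparison of an input region against a lowercase name. */
final class TagNameMatchSketch {

    /** Returns true if input[from, to) equals lowerCaseName when ASCII letters are lowercased. */
    static boolean matchesIgnoringCase(char[] input, int from, int to, char[] lowerCaseName) {
        if (from < 0 || to > input.length || to - from != lowerCaseName.length) {
            return false;
        }
        for (int i = 0; i < lowerCaseName.length; i++) {
            char c = input[from + i];
            if (c >= 'A' && c <= 'Z') {
                c = (char) (c + ('a' - 'A'));   // fold on a local copy; the input is untouched
            }
            if (c != lowerCaseName[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char[] input = "</SCRIPT>".toCharArray();
        // The name occupies indices 2 (inclusive) to 8 (exclusive).
        System.out.println(matchesIgnoringCase(input, 2, 8, "script".toCharArray()));
    }
}
```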
switchTypeToSelfClosing
private void switchTypeToSelfClosing()
-