HTML Cleaner is a powerful open source HTML parser written in Java. The HTML found on web pages is usually "dirty": it is poorly formed and unsuitable for further processing. Before it can be used, it has to be put in order by organizing and formatting its tags, attributes and plain text. The program takes the original HTML document, restructures it, and arranges the content in accordance with the standards. The output is a well-formed XML document. By default, the program follows rules very similar to those that most modern web browsers apply when building the document object model.
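To illustrate why "dirty" HTML is unsuitable for further processing, here is a minimal sketch that uses only the standard JDK XML parser, not HTML Cleaner itself. The class name `WellFormedCheck`, the helper `isWellFormed`, and both sample strings are assumptions made for this illustration. The second string is a hand-written well-formed equivalent, not actual HTML Cleaner output.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration; not part of HTML Cleaner.
public class WellFormedCheck {

    // Returns true if a strict XML parser accepts the markup, false otherwise.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler keeps the parser from printing to stderr;
            // fatal (well-formedness) errors are still thrown as SAXException.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the markup is not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical browser-tolerated HTML: <br> and <p> are never closed.
        String dirty = "<html><body><p>First<br><p>Second</body></html>";
        // Hand-written well-formed equivalent (an assumed example).
        String clean = "<html><body><p>First<br/></p><p>Second</p></body></html>";

        System.out.println(isWellFormed(dirty)); // false
        System.out.println(isWellFormed(clean)); // true
    }
}
```

A browser would render both strings, but XML-based tools accept only the second, which is the gap a cleaner is meant to close.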
HTML Cleaner can be used from Java code, as a command-line tool, or as an Ant task. It was designed to be small, fast, flexible, and independent of other packages (except the JRE). The developers' main goal was to create an application that prepares HTML code for further processing with XPath, XQuery and XSLT.
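Once the markup is well-formed XML, standard XML tooling can work with it. The sketch below shows the XPath case using only the XPath API bundled with the JDK. It does not call HTML Cleaner. The class `CleanedXPathDemo`, the helper `linkTargets`, and the sample document are assumptions for illustration, and the input string stands in for whatever a cleaner would produce.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of HTML Cleaner.
public class CleanedXPathDemo {

    // Collects the href value of every <a> element, in document order.
    static List<String> linkTargets(String wellFormedXml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//a/@href",
                    new InputSource(new StringReader(wellFormedXml)),
                    XPathConstants.NODESET);
            List<String> targets = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                targets.add(nodes.item(i).getNodeValue());
            }
            return targets;
        } catch (XPathExpressionException e) {
            // Thrown for an invalid expression or for input that is not well-formed.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample: a hand-written stand-in for cleaned output.
        String cleaned = "<html><head><title>Demo</title></head><body>"
                + "<a href=\"a.html\">A</a>"
                + "<p><a href=\"b.html\">B</a></p>"
                + "</body></html>";

        System.out.println(linkTargets(cleaned)); // [a.html, b.html]
    }
}
```

The same well-formed input could equally be handed to an XSLT or XQuery processor. XPath is used here only because it needs the least setup.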
HTML Cleaner saves effort by transforming unorganized, poorly structured HTML into well-formed, easily processable XML.
- Fast automatic processing and generation of HTML documents;
- the type of the output file can be specified;
- a wide range of configurable options;
- several copies of the program can run simultaneously;
- can be used from Java code;
- depends on only one package (JRE 1.5+).