Releases: jhy/jsoup
jsoup 1.18.2
Improvements
- Optimized the throughput and memory use throughout the input read and parse flows, with heap allocations and GC reduced by between 6% and 89%, and throughput improved by up to 143% for small inputs. Most input sizes will see throughput increases of ~20%. These performance improvements come through recycling the backing `byte[]` and `char[]` arrays used to read and parse the input. 2186
- Speed optimized `html()` and `Entities.escape()` when the input contains Unicode characters in a supplementary plane, by around 49%. 2183
- The form associated elements returned by `FormElement.elements()` now reflect changes made to the DOM subsequent to the original parse. 2140
- In the `TreeBuilder`, the `onNodeInserted()` and `onNodeClosed()` events are now also fired for the outermost / root `Document` node. This enables source position tracking on the Document node (which was previously unset). It also enables the node traversor to see the outer Document node. 2182
- Selected Elements can now be position swapped inline using `Elements#set()`. 2212
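The supplementary-plane case in the list above matters because Java strings are UTF-16: a character outside the Basic Multilingual Plane occupies two `char` values (a surrogate pair), so an escaper has to walk code points rather than `char`s. The following is a minimal sketch of that idea using only the JDK. It is not jsoup's implementation; the class name `CodePointEscapeSketch` and its `escape` method are hypothetical, and it assumes an ASCII-only output charset.

```java
final class CodePointEscapeSketch {
    // Escapes &, <, > and writes every non-ASCII code point as a numeric character reference.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);   // reads a whole surrogate pair if one is present
            i += Character.charCount(cp);    // advance by 1 or 2 chars accordingly
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:
                    if (cp < 0x80) {
                        out.append((char) cp);
                    } else {
                        out.append("&#x").append(Integer.toHexString(cp)).append(';');
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is stored in the string as the surrogate pair \uD83D\uDE00
        System.out.println(escape("a < \uD83D\uDE00")); // prints: a &lt; &#x1f600;
    }
}
```

Iterating by `char` instead would see the two surrogates separately and emit two invalid references.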
Bug Fixes
- `Element.cssSelector()` would fail if the element's class contained a `*` character. 2169
- When tracking source ranges, a text node following an invalid self-closing element may be left untracked. 2175
- When a document has no doctype, or a doctype not named `html`, it should be parsed in Quirks Mode. 2197
- With a selector like `div:has(span + a)`, the `has()` component was not working correctly, as the inner combining query caused the evaluator to match those against the outer's siblings, not children. 2187
- A selector query that included multiple `:has()` components in a nested `:has()` might incorrectly execute. 2131
- When cookie names in a response are duplicated, the simple view of cookies available via `Connection.Response#cookies()` will provide the last one set. Generally it is better to use the Jsoup.newSession method to maintain a cookie jar, as that applies appropriate path selection on cookies when making requests. 1831
- When parsing named HTML entities, base entities should resolve if they are a prefix of the input token (and not in an attribute). 2207
- Fixed incorrect tracking of source ranges for attributes merged from late-occurring elements that were implicitly created (`html` or `body`). 2204
- Follow the current HTML specification in the tokenizer to allow `<` as part of a tag name, instead of emitting it as a character node. 2230
- Similarly, allow a `<` as the start of an attribute name, vs creating a new element. The previous behavior was intended to parse closer to what we anticipated the author's intent to be, but that does not align to the spec or to how browsers behave. 1483
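To illustrate the named-entity item in the list above: in text content, an entity name written without its semicolon can still resolve when a base entity is a prefix of what follows the `&`. The sketch below shows one way to express "longest base-entity prefix wins" with the JDK only. It is not jsoup's tokenizer; the class `EntityPrefixSketch`, its `decode` method, and the five-entry table are hypothetical simplifications, and attribute handling is deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

final class EntityPrefixSketch {
    // A tiny stand-in for the set of base entities (an assumption, not the real table).
    private static final Map<String, String> BASE = new HashMap<>();
    static {
        BASE.put("amp", "&");
        BASE.put("lt", "<");
        BASE.put("gt", ">");
        BASE.put("not", "\u00AC");
        BASE.put("copy", "\u00A9");
    }

    // 'name' is the run of letters that followed '&' in text content.
    static String decode(String name) {
        // Try the longest candidate first, shrinking until a base entity matches.
        for (int end = name.length(); end > 0; end--) {
            String value = BASE.get(name.substring(0, end));
            if (value != null) {
                return value + name.substring(end); // keep the unmatched remainder as text
            }
        }
        return "&" + name; // nothing matched: leave the input as written
    }

    public static void main(String[] args) {
        System.out.println(decode("ampfoo")); // prints: &foo
        System.out.println(decode("xyz"));    // prints: &xyz
    }
}
```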
jsoup-1.18.1
https://jsoup.org/news/release-1.18.1
Improvements
- Stream Parser: A `StreamParser` provides a progressive parse of its input. As each `Element` is completed, it is emitted via a `Stream` or `Iterator` interface. Elements returned will be complete with all their children, and an (empty) next sibling, if applicable. Elements (or their children) may be removed from the DOM during the parse, e.g. to conserve memory, providing a mechanism to parse an input document that would otherwise be too large to fit into memory, yet still providing a DOM interface to the document and its elements. Additionally, the parser provides a `selectFirst(String query)` / `selectNext(String query)`, which will run the parser until a hit is found, at which point the parse is suspended. It can be resumed via another `select()` call, or via the `stream()` or `iterator()` methods. 2096
- Download Progress: added a Response Progress event interface, which reports progress as URLs are downloaded (and parsed). Supported on both a session and a single connection level. 2164, 656
- Added `Path` accepting parse methods: `Jsoup.parse(Path)`, `Jsoup.parse(path, charsetName, baseUri, parser)`, etc. 2055
- Updated the `button` tag configuration to include a space between multiple button elements in the `Element.text()` method. 2105
- Added support for the `ns|*` all elements in namespace Selector. 1811
- When normalising attribute names during serialization, invalid characters are now replaced with `_`, vs being stripped. This should make the process clearer, and generally prevent an invalid attribute name being coerced unexpectedly. 2143
Changes
- Removed previously deprecated internal classes and methods. 2094
- Build change: the built jar's OSGi manifest no longer imports itself. 2158
Bug Fixes
- When tracking source positions, if the first node was a TextNode, its position was incorrectly set to `-1`. 2106
- When connecting (or redirecting) to URLs with characters such as `{`, `}` in the path, a Malformed URL exception would be thrown (if in development), or the URL might otherwise not be escaped correctly (if in production). The URL encoding process has been improved to handle these characters correctly. 2142
- When using `W3CDom` with a custom output Document, a Null Pointer Exception would be thrown. 2114
- The `:has()` selector did not match correctly when using sibling combinators (e.g. `h1:has(+h2)`). 2137
- The `:empty` selector incorrectly matched elements that started with a blank text node and were followed by non-empty nodes, due to an incorrect short-circuit. 2130
- `Element.cssSelector()` would fail with "Did not find balanced marker" when building a selector for elements that had a `(` or `[` in their class names. And selectors with those characters escaped would not match as expected. 2146
- Updated `Entities.escape(string)` to make the escaped text suitable for both text nodes and attributes (previously was only for text nodes). This does not impact the output of `Element.html()` which correctly applies a minimal escape depending on if the use will be for text data or in a quoted attribute. 1278
- Fuzz: a Stack Overflow exception could occur when resolving a crafted `<base href>` URL, in the normalizing regex. 2165
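For readers wondering why `{` and `}` in a path are a problem at all: they are not legal in a URI path and must be percent-encoded before a request is made. The sketch below shows the effect with the JDK's `java.net.URI`, whose multi-argument constructor quotes illegal characters while the single-string constructor rejects them. This is only an illustration of the encoding rule, not the code path jsoup uses; `UrlPathEscapeSketch` and its method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UrlPathEscapeSketch {
    // Builds a URI from its components; illegal path characters are percent-encoded.
    static String escapePath(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns true when the raw string is rejected by the strict single-argument parser.
    static boolean rejectedUnescaped(String url) {
        try {
            new URI(url);
            return false;
        } catch (URISyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(rejectedUnescaped("https://example.com/a{b}"));  // prints: true
        System.out.println(escapePath("https", "example.com", "/a{b}"));    // prints: https://example.com/a%7Bb%7D
    }
}
```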
jsoup Java HTML Parser release 1.17.2
Improvements
- Attribute object accessors: Added `Element.attribute(String)` and `Attributes.attribute(String)` to more simply obtain an `Attribute` object. 2069
- Attribute source tracking: If source tracking is on, and an Attribute's key is changed (via `Attribute.setKey(String)`), the source range is now still tracked in `Attribute.sourceRange()`. 2070
- Wildcard attribute selector: Added support for the `[*]` element with any attribute selector. And also restored support for selecting by an empty attribute name prefix (`[^]`). 2079
Bug Fixes
- Mixed-cased source position: When tracking the source position of attributes, if the source attribute name was mix-cased but the parser was lower-case normalizing attribute names, the source position for that attribute was not tracked correctly. 2067
- Source position NPE: When tracking the source position of a body fragment parse, a null pointer exception was thrown. 2068
- Multi-point emoji entity: A multi-point encoded emoji entity may be incorrectly decoded to the replacement character. 2074
- Selector sub-expressions: (Regression) in a selector like `parent [attr=va], other`, the `, OR` was binding to `[attr=va]` instead of `parent [attr=va]`, causing incorrect selections. The fix includes an EvaluatorDebug class that generates a sexpr to represent the query, allowing simpler and more thorough query parse tests. 2073
- XML CData output: When generating XML-syntax output from parsed HTML, script nodes containing (pseudo) CData sections would have an extraneous CData section added, causing script execution errors. Now, the data content is emitted in a HTML/XML/XHTML polyglot format, if the data is not already within a CData section. 2078
- Thread safety: The `:has` evaluator held a non-thread-safe Iterator, and so if an Evaluator object was shared across multiple concurrent threads, a NoSuchElement exception may be thrown, and the selected results may be incorrect. Now, the iterator object is a thread-local. 2088
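The thread-safety fix in the list above follows a common pattern: keep a reusable, mutable helper per thread rather than per shared object. The sketch below shows that pattern with the JDK's `ThreadLocal`. It is a generic illustration, not jsoup's evaluator code; `ThreadLocalCursorSketch`, `Cursor`, and `anyMatch` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class ThreadLocalCursorSketch {
    // A reusable, mutable cursor: cheap to reset, but unsafe to share between threads.
    static final class Cursor {
        private List<String> items;
        private int pos;
        void reset(List<String> items) { this.items = items; this.pos = 0; }
        boolean hasNext() { return pos < items.size(); }
        String next() { return items.get(pos++); }
    }

    // Each thread lazily gets its own Cursor, so one matcher instance can be shared.
    private final ThreadLocal<Cursor> cursor = ThreadLocal.withInitial(Cursor::new);
    private final Predicate<String> test;

    ThreadLocalCursorSketch(Predicate<String> test) { this.test = test; }

    boolean anyMatch(List<String> items) {
        Cursor c = cursor.get(); // this thread's private cursor
        c.reset(items);
        while (c.hasNext()) {
            if (test.test(c.next())) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ThreadLocalCursorSketch matcher = new ThreadLocalCursorSketch(s -> s.equals("h2"));
        System.out.println(matcher.anyMatch(Arrays.asList("h1", "h2"))); // prints: true
    }
}
```

Had `Cursor` been a plain instance field, two threads calling `anyMatch` at once could reset or advance each other's position mid-scan.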
jsoup 1.17.1
jsoup 1.17.1 is out now with support for request-level authentication, attribute name & value source ranges, stream() iterable support, and a bunch of other improvements and bug fixes.
Many thanks to everyone who contributed to this release!
Improvements
- Request-Level Authentication: Added support for request-level authentication in Jsoup.connect(), enabling authentication to proxies and servers.
- Elements DOM Mutators: In the `Elements` list, added direct support for `Elements#set(int, Element)`, `Elements#remove(int)`, `Elements#remove(Object)`, `Elements#clear()`, `Elements#removeAll()`, `Elements#retainAll()`, `Elements#removeIf()`, `Elements#replaceAll()`. These methods update the original DOM, as well as the Elements list.
- Stream Interface: Introduced the `NodeIterator` class for efficient node tree traversal using the Iterator interface. Added Stream-returning `Element#stream()` and `Node#nodeStream()` methods for fluent composable stream pipelines of node traversals.
- XML OutputSettings: Automatically sets the xhtml `EscapeMode` as default when changing the `OutputSettings` syntax to `XML`.
- is() Selector: Added the `:is(selector list)` pseudo-selector to find elements that match any selectors in the selector list. This enhances readability for large `OR`ed selectors.
- JPMS Module Support: Repackaged the library with native JPMS module support.
- Source Position Fidelity: Improved fidelity of source positions when tracking is enabled. Implicitly created or closed elements are now trackable via `Range.isImplicit()`.
- Attribute Source Positions: Enabled source position for attribute names and values when source tracking is on. `Attribute#sourceRange()` provides the ranges.
- Virtual Threads: Enhanced performance under Java 21+ Virtual Threads by replacing the internal `ConstrainableInputStream` with `ControllableInputStream`.
- XML Mimetype Support: Extended XML mimetype support in `Jsoup.connect()` to include any XML mimetype.
Bug Fixes
- XML Data Nodes: Fixed a bug where HTML elements parsed as data nodes were not correctly emitted as `CDATA` nodes when outputting with `XML` syntax.
- Immediate Parent Selector: Corrected a bug where the Immediate Parent selector `>` could match elements above the root context element.
- Sub-Query Parsing: Resolved a bug where combinators following the `,` Or combinator in a sub-query were incorrectly skipped.
- Empty Doctype: Fixed a bug in `W3CDom` where the conversion would fail if the jsoup input document contained an empty doctype. The doctype is now discarded, and the conversion continues.
- SVG Elements Cleaning: Fixed incorrect nesting when cleaning a document containing SVG elements or other foreign elements with preserved-case names.
- Unknown Self-Closing Tags: Preserved the output style of unknown self-closing tags from the input when cleaning a document.
Build Improvements
- Local Test Proxy: Added a local test proxy implementation for proxy integration tests.
- HTTPS Request Tests: Added tests for HTTPS request support using a local self-signed certificate. Includes proxy tests.
Changes
- Response BodyStream: The InputStream returned in `Connection.Response.bodyStream()` is now a plain `BufferedInputStream`.
jsoup 1.16.2
Improvements
- Optimized the performance of complex CSS selectors, by adding a cost-based query planner. Evaluators are sorted by their relative execution cost, and executed in order of lower to higher cost. This speeds the matching process by ensuring that simpler evaluations (such as a tag name match) are conducted prior to more complex evaluations (such as an attribute regex, or a deep child scan with a :has).
- Added support for `<svg>` and `<math>` tags (and their children). This includes tag namespaces and case preservation on applicable tags and attributes. #2008
- When converting jsoup Documents to W3C Documents in `W3CDom`, HTML documents will be placed in the `http://www.w3.org/1999/xhtml` namespace by default, per the HTML5 spec. This can be controlled by setting `W3CDom#namespaceAware(boolean false)`. #1848
- Speed optimized the Structural Evaluators by memoizing previous evaluations. Particularly the `~` (any preceding sibling) and `:nth-of-type` selectors are improved. #1956
- Tweaked the performance of the `Element` `nextElementSibling`, `previousElementSibling`, `firstElementSibling`, `lastElementSibling`, `firstElementChild`, and `lastElementChild`. They now filter/skip in place in the child-node list, vs having to allocate and scan a complete Element filtered list.
- Optimized internal methods that previously called `Element.children()` to use filter/skip child-node list accessors instead, reducing new Element List allocations.
- Tweaked the performance of parsing `:pseudo` selectors.
- When using the `:empty` pseudo-selector, blank textnodes are now considered empty. Previously, an element containing any whitespace was not considered empty. #1976
- In forms, `<input type="image">` should be excluded from `Element.formData()` (and hence from form submissions). #2010
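The cost-based query planner described at the top of this list comes down to one idea: when several tests must all pass, run the cheap ones first so the expensive ones are skipped whenever a cheap test already fails. A minimal sketch of that ordering, using only the JDK, is below. It is not jsoup's planner; `CostOrderedAndSketch`, `Costed`, and the integer cost values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

final class CostOrderedAndSketch<T> implements Predicate<T> {
    // Pairs a test with an estimated relative cost (lower runs earlier).
    static final class Costed<V> {
        final int cost;
        final Predicate<V> test;
        Costed(int cost, Predicate<V> test) { this.cost = cost; this.test = test; }
    }

    private final List<Costed<T>> parts = new ArrayList<>();

    CostOrderedAndSketch<T> add(int cost, Predicate<T> test) {
        parts.add(new Costed<>(cost, test));
        // List.sort is stable, so equal-cost tests keep their insertion order.
        parts.sort(Comparator.comparingInt((Costed<T> c) -> c.cost));
        return this;
    }

    @Override
    public boolean test(T value) {
        for (Costed<T> part : parts) {
            if (!part.test.test(value)) return false; // short-circuit on first failure
        }
        return true;
    }

    public static void main(String[] args) {
        CostOrderedAndSketch<String> query = new CostOrderedAndSketch<String>()
            .add(100, s -> s.matches(".*\\d+.*"))   // stand-in for an expensive regex test
            .add(1, s -> s.startsWith("div"));      // stand-in for a cheap tag-name test
        System.out.println(query.test("span1")); // prints: false (the regex never runs)
        System.out.println(query.test("div42")); // prints: true
    }
}
```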
Bug Fixes
- Bugfix: `form` elements and empty elements (such as `img`) did not have their attributes de-duplicated. #1950
- If `Document.OutputSettings` was cloned from a clone, an NPE would be thrown when used. #1964
- In `Jsoup.connect(String url)`, URL paths containing a %2B were incorrectly recoded to a '+', or a '+' was recoded to a ' '. Fixed by reverting to the previous behavior of not encoding supplied paths, other than normalizing to ASCII. #1952
- In `Jsoup.connect(String url)`, strings containing supplemental characters (e.g. emoji) were not URL escaped correctly.
- In `Jsoup.connect(String url)`, the ConstrainableInputStream would clear Thread interrupts when reading the body. This precluded callers from spawning a thread, running a number of requests for a length of time, then joining that thread after interrupting it. #1991
- When tracking HTML source positions, the closing tags for `H1`...`H6` elements were not tracked correctly. #1987
- In `Jsoup.connect()`, a `DELETE` method request did not support a request body. #1972
- When calling `Element.cssSelector()` on an extremely deeply nested element, a `StackOverflowError` could occur. Further, a `StackOverflowError` may occur when running the query. #2001
- Appending a node back to its original `Element` after `empty()` would throw an Index out of bounds exception. Also, now the child nodes that were removed have their parent node cleared, fully detaching them from the original parent. #2013
- In `Connection` when adding headers, the value may have been assumed to be an incorrectly decoded `ISO_8859_1` string, and re-encoded as `UTF-8`. The value is now left as-is.
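The thread-interrupt item in the list above describes a caller pattern that only works if blocking code leaves the interrupt flag intact. The sketch below shows that pattern with plain JDK threads: a worker loops until interrupted, and the caller interrupts and then joins it. `Thread.sleep` stands in for a blocking network read; `InterruptAwareLoopSketch` and its methods are hypothetical and do not use jsoup.

```java
final class InterruptAwareLoopSketch {
    // Simulates a worker that makes repeated blocking calls until it is interrupted.
    static int runUntilInterrupted() {
        int completed = 0;
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(5); // stand-in for a blocking read of a response body
                completed++;
            } catch (InterruptedException e) {
                // sleep() clears the flag when it throws; restore it so the loop can exit
                Thread.currentThread().interrupt();
            }
        }
        return completed;
    }

    // Spawns the worker, lets it run briefly, interrupts it, then joins it.
    static boolean workerStopsOnInterrupt() {
        Thread worker = new Thread(InterruptAwareLoopSketch::runUntilInterrupted);
        worker.start();
        try {
            Thread.sleep(50);
            worker.interrupt();
            worker.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(workerStopsOnInterrupt());
    }
}
```

If the blocking call swallowed the interrupt without restoring the flag, the `while` condition would never become false and the `join` would time out.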
Changes
- Removed previously deprecated methods `Document.normalise()`, `Element.forEach(org.jsoup.helper.Consumer<>)`, `Node.forEach(org.jsoup.helper.Consumer<>)`, and the `org.jsoup.helper.Consumer` interface; the latter being a previously required compatibility shim prior to Android's de-sugaring support.
- The previous compatibility shim `org.jsoup.UncheckedIOException` is deprecated in favor of the now supported `java.io.UncheckedIOException`. If you are catching the former, modify your code to catch the latter instead. #1989
- Blocked `noscript` tags from being added to Safelists, due to incompatibilities between parsers with and without script-mode enabled.
jsoup 1.16.1
jsoup Java HTML Parser release 1.16.1
Improvements
- In `Jsoup.connect(String url)`, natively support URLs with Unicode characters in the path or query string, without having to be escaped by the caller. #1914
- Calling `Node.remove()` on a node with no parent is now a no-op, vs a validation error. #1898
Bug Fixes
- Aligned the HTML Tree Builder processing steps for `AfterBody` and `AfterAfterBody` to the updated WHATWG standard, to not pop the stack to close `<body>` or `<html>` elements. This prevents an errant `</html>` closing the preceding structure. Also added appropriate error message outputs in this case. #1851
- Corrected support for ruby elements (`<ruby>`, `<rp>`, `<rt>`, and `<rtc>`) to current spec. #1294
- When using `Node.before(Node)` or `Node.after(Node)`, if the incoming node was a sibling of the context node, the incoming node may be inserted into the wrong relative location. #1898
- In `Jsoup.connect(String url)`, if the input URL had components that were already `%` escaped, they would be escaped again, causing errors when fetched. #1902
- When tracking input source positions, text in tables that was fostered had invalid positions. #1927
- If the `Document.OutputSettings` class was initialized, and then `Entities.escape(String)` called, an NPE may be thrown due to a class loading circular dependency. #1910
- When pretty-printing, the first inline `Element` or `Comment` in a block would not be wrap-indented if it were preceded by a blank text node. #1906
- When pretty-printing a `<pre>` containing block tags, those tags were incorrectly indented. #1891
- When pretty-printing nested inlineable blocks (such as a `<p>` in a `<td>`), the inner element should be indented. #1926
- `<br>` tags should be wrap-indented when in block tags (and not when in inline tags). #1911
- The contents of a sufficiently large `<textarea>` with un-escaped HTML closing tags may be incorrectly parsed to an empty node. #1929
jsoup 1.15.4
jsoup Java HTML Parser release 1.15.4
jsoup 1.15.4 is out now, and includes a bunch of improvements, particularly when pretty-printing HTML, and bug fixes.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Improvements
- Added the ability to escape CSS selectors (tags, IDs, classes) to match elements that don't follow regular CSS syntax. For example, to match by classname `<p class="one.two">`, use `document.select("p.one\\.two");` #838
- When pretty-printing, wrap text that follows a `<br>` tag. #1858
- When pretty-printing, normalize newlines that follow self-closing tags in custom tags. #1852
- When pretty-printing, collapse non-significant whitespace between a block and an inline tag. #1802
- In `Element.forEach()` and `Node.forEachNode()`, use `java.util.function.Consumer` instead of the previous Android compatibility shim `org.jsoup.helper.Consumer`. Subsequently, the latter has been deprecated. #1870
- Added a new method `Document.forms()`, to conveniently retrieve a `List<FormElement>` containing the `<form>` elements in a document.
- Added a new method `Document.expectForm()`, to find the first matching `FormElement`, or blow up trying.
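For the selector-escaping item at the top of this list, a caller usually has a raw class name and needs the escaped form to splice into a query string. The helper below is a simplified sketch of that step using only the JDK: it backslash-escapes ASCII characters that are not letters, digits, `-` or `_`. It is a hypothetical helper, not a jsoup API, and it does not cover every CSS identifier rule (for example, a leading digit).

```java
final class CssIdentEscapeSketch {
    // Backslash-escapes ASCII punctuation so the value can be used as a class or ID in a selector.
    static String escape(String ident) {
        StringBuilder sb = new StringBuilder(ident.length() + 8);
        for (int i = 0; i < ident.length(); i++) {
            char c = ident.charAt(i);
            boolean plain = c >= 0x80               // leave non-ASCII (incl. surrogates) untouched
                || Character.isLetterOrDigit(c)
                || c == '-' || c == '_';
            if (!plain) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Builds the same query string as the example in the release note.
        System.out.println("p." + escape("one.two")); // prints: p.one\.two
    }
}
```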
Bug Fixes
- URLs containing certain characters were not escaped correctly, and would throw a `MalformedURLException` when fetched. #1873
- `Element.cssSelector()` would create invalid selectors for elements where the tag name, ID, or classnames needed to be escaped (e.g. if a class name contained a `:` or `.`). #1742
- `Element.text()` should have a space between a block and an inline element. #1877
- Form data on a previous request was copied to a new request in `newRequest()`, resulting in an accumulation of form data when executing multi-step form submissions, or data sent to later requests incorrectly. Now, `newRequest()` only copies session related settings (cookies, proxy settings, user-agent, etc) but not the request data nor the body. #1778
- Fixed an issue in `Safelist.removeAttributes()` which could throw a `ConcurrentModificationException` when using the `:all` pseudo-attribute.
- Given extremely deeply nested HTML, a number of methods in `Element` could throw a `StackOverflowError` due to excessive recursion. Namely: `#data()`, `#hasText()`, `#parents()`, and `#wrap(html)`. #1864
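The recursion problem in the last item of this list is generic to tree code: a recursive walk uses one call-stack frame per nesting level, so extremely deep input can exhaust the stack. The usual remedy is an explicit stack on the heap. The sketch below shows that technique on a toy node type using only the JDK. It is not jsoup's code; `IterativeTraversalSketch`, its `Node` class, and `deepChain` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class IterativeTraversalSketch {
    // A toy tree node with its own text and a list of children.
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
    }

    // Depth-first search with an explicit stack: depth is limited by heap, not call stack.
    static boolean hasText(Node root) {
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (!n.text.trim().isEmpty()) return true;
            // Push in reverse so children are visited in document order.
            for (int i = n.children.size() - 1; i >= 0; i--) {
                stack.push(n.children.get(i));
            }
        }
        return false;
    }

    // Builds a chain of empty nodes 'depth' levels deep with one leaf at the bottom.
    static Node deepChain(int depth, String leafText) {
        Node root = new Node("");
        Node cur = root;
        for (int i = 0; i < depth; i++) {
            Node child = new Node("");
            cur.children.add(child);
            cur = child;
        }
        cur.children.add(new Node(leafText));
        return root;
    }

    public static void main(String[] args) {
        System.out.println(hasText(deepChain(100_000, "x"))); // prints: true
    }
}
```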
Changes
- Deprecated the unused `Document.normalise()` method. Normalization occurs during the HTML tree construction, and no longer as a distinct phase.
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.
You can also follow me on Mastodon / Fediverse to receive occasional notes about jsoup releases.
jsoup 1.15.3
jsoup 1.15.3 is out now, and includes a security fix for potential XSS attacks, along with other bug fixes and improvements, including more descriptive validation error messages.
jsoup 1.15.2
jsoup 1.15.2 is out now with a bunch of improvements and bug fixes.
jsoup 1.15.1
jsoup 1.15.1 is out now with a bunch of improvements and bug fixes.