This is mainly a stabilization release, adding a few stability improvements for heavily loaded networks and improving support for running mixed clusters while migrating to the 1.1.x series.
The interval at which Riak checks whether primary partitions need to be transferred (along with a few other administrative tasks) has been lowered from 30 seconds to 10 seconds by default. It is also configurable via an application environment variable in app.config:
{riak_core, [{vnode_management_timer, #Millisecs}, ...]}
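For example, to restore the previous 30-second interval (the value is in milliseconds; 30000 here is purely illustrative):
{riak_core, [{vnode_management_timer, 30000}]}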
The /usr/sbin/riak script expects /var/run/riak to exist. On Fedora, and possibly other Linux distributions, /var/run is mounted on tmpfs, so the directory created at package install time goes away on reboot and the /usr/sbin/riak script does not have permission to recreate it.
As a workaround, add this to /etc/rc.local:
mkdir /var/run/riak && chown -R riak /var/run/riak
- erlang_js inlines do not work on XCode 4.3
- Work-around for net_kernel:get_status() race condition
- Handoff reported as started even if rejected for exceeding handoff concurrency
- Intermittent hang with handoff sender
- Handoff of fallback data can be delayed or never performed
- Chicken & egg problem with vnode proxy process start & registration?
- Put FSM crashes if forwarded node goes down
This release includes fixes for systems with heavy MapReduce load.
Pipe maintains a limit on the number of active worker processes in the system. Workers are created on demand when the pipe executes - they are not reserved at pipe creation time.
Prior to 1.1.1, not all fittings (e.g. get object from Riak, map value, reduce) checked that their output was being delivered to the next stage of the pipeline. This has been changed so that the pipe errors out if processing errors occur or system limits are hit.
As a result of this change, if the pipe is overloaded, clients will see errors of this form:
{error,<<"Error sending inputs: {error,worker_limit_reached}">>}
Clients can either retry (preferably with a backoff and retry-limit mechanism; see the sketch below) or, if the hardware has sufficient capacity, increase the number of workers per vnode by setting worker_limit in etc/app.config:
{riak_pipe, [{worker_limit, #Workers}]}
Different numbers of workers are used based on the type of MapReduce query so tuning will be very workload dependent.
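As a sketch of the backoff/retry approach mentioned above, assuming your client surfaces the Erlang error tuple shown earlier (retry_mapred and MapredFun are hypothetical names; MapredFun is a zero-arity fun wrapping whatever MapReduce call your client library makes):
%% Retry an overloaded MapReduce request with exponential backoff.
retry_mapred(MapredFun, Retries, SleepMs) when Retries > 0 ->
    case MapredFun() of
        %% Matches the "Error sending inputs: ..." overload error shown above.
        {error, <<"Error sending inputs: ", _/binary>>} ->
            timer:sleep(SleepMs),
            retry_mapred(MapredFun, Retries - 1, SleepMs * 2);
        Result ->
            Result
    end;
retry_mapred(_MapredFun, 0, _SleepMs) ->
    {error, retries_exhausted}.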
If you are using JavaScript MapReduce you will also need to increase the number of JavaScript virtual machines in the riak_kv section of app.config:
{map_js_vm_count, #MapVMs},
{reduce_js_vm_count, #ReduceVMs}
Again, tuning will be workload-dependent. If sufficient resources are available, setting #MapVMs to the maximum number of JS MapReduce jobs run in parallel will give the best results, with #ReduceVMs set to the number of jobs divided by the number of nodes, plus a margin for overload.
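As a purely illustrative example, assume 8 JS MapReduce jobs run in parallel across a 4-node cluster; following the guideline above, the riak_kv section might look like this:
{riak_kv, [
    {map_js_vm_count, 8},     %% maximum number of parallel JS MapReduce jobs
    {reduce_js_vm_count, 3}   %% 8 jobs / 4 nodes = 2, plus a margin for overload
]}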
- Map-reduce jobs fail during rolling upgrade to 1.1
- Race Condition causes Javascript VMs to never be returned to VM Pool
- Avoid {case_clause, ok} errors when vnode backend fails to start
- Bitcask - Fix incorrect NIF error tuples
- Riak Control preview demo
- Riak Control repository with documentation on how to get started
- Blog post announcing riaknostic
- Riaknostic Homepage
- CRC checks have been added to hintfiles for bitcask
- Snappy (compression algorithm from Google) has been enabled
- The compaction algorithm has been tuned to reduce delays due to compaction
- Multi-backend now supports 2i properly
- Tracing support (see the README)
- Term printing is ~4x faster and much more correct (compared to io:format)
- Bitstring printing support was added
- The MapReduce interface now supports requests with empty queries. This allows the 2i, list-keys, and search inputs to return matching keys to clients without needing to include a reduce_identity query phase.
- MapReduce error messages have been improved. Most error cases should now return helpful information all the way to the client, while also producing less spam in Riak’s logs.
There have been numerous changes to riak_core that address issues with cluster scalability and enable Riak to better handle large clusters and large rings. Specifically, the warnings about potential gossip overload and large ring sizes from the Riak 1.0 release notes no longer apply. However, these improvements bring a few user-visible configuration changes.
First, Riak has a new routing layer that determines how requests are sent to vnodes for servicing. In order to support mixed-version clusters during a rolling upgrade, Riak 1.1 starts in a legacy routing mode that is slower than the new routing layer. After upgrading or deploying a cluster composed solely of Riak 1.1 nodes, the legacy mode can be disabled by adding the following to the riak_core section of app.config:
{legacy_vnode_routing, false}
Second, the handoff_concurrency setting that can be set in the riak_core section of app.config has been changed to limit both incoming and outgoing handoff on a node. Prior to 1.1, this setting only limited the amount of outgoing handoff traffic from a node, and was unable to prevent a node from becoming overloaded by too much inbound traffic.
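For example, to cap a node at 4 concurrent handoffs in either direction (4 is a hypothetical value; the right limit depends on hardware and network capacity):
{riak_core, [{handoff_concurrency, 4}]}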
Finally, the gossip_interval setting that could be set in the riak_core section of app.config has been removed and replaced with a new gossip_limit setting.
The gossip_limit setting takes the form {allowed_gossips, interval_in_ms}. For example, the default setting is {45, 10000}, which limits gossip to 45 gossip messages every 10 seconds.
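To set the default explicitly in the riak_core section of app.config:
{riak_core, [{gossip_limit, {45, 10000}}]}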
The new ring ownership claim algorithm introduced as an optional setting in the 1.0 release has been set as the default for 1.1.
The new claim algorithm significantly reduces the amount of ownership shuffling for clusters with more than N+2 nodes in them. Changes to the cluster membership code in 1.0 made the original claim algorithm fall back to simply reordering nodes in sequence in more cases than it used to, causing massive handoff activity on larger clusters. Mixed clusters including pre-1.0 nodes will still use the original algorithm until all nodes are upgraded.
NOTE: The semantics of the new algorithm differ from the old algorithm when the number of nodes in the cluster is less than the target_n_val setting. When the number of nodes in your cluster is less than target_n_val, Riak will not guarantee that each of your replicas is on a distinct physical node. By default, Riak is set up with a replication value of N=3 and a target_n_val of 4. As such, you must have at least a 4-node cluster in order to have your replicas placed on different physical nodes. If you want to deploy a 3-replica, 3-node cluster, you can override the target_n_val setting in the riak_core section of app.config:
{target_n_val, 3}
In practice, you should set target_n_val
to match the largest replication
value you plan to use across your entire cluster.
Backpressure has been added to listkeys to prevent the node listing keys from being overwhelmed. This required a protocol change so that the key lister can limit the rate at which it receives data.
In mixed clusters where some of the nodes are running a version older than 1.1, set listkeys_backpressure to false in the riak_kv section of app.config until all nodes are upgraded:
{listkeys_backpressure, false}
Once all nodes are upgraded, set listkeys_backpressure to true in the riak_kv section of app.config:
{listkeys_backpressure, true}
The setting can also be changed on running nodes without a restart by running this snippet from the riak console:
application:set_env(riak_kv, listkeys_backpressure, true).
Fresh 1.1.0 and above installs default to using listkeys backpressure - adjust app.config if different behavior is desired.
In previous releases there was no easy way to determine whether a post-commit hook was failing. In this release two counters have been added to riak-admin status that indicate pre- and post-commit hook failures: precommit_fail and postcommit_fail. By default the errors themselves are not logged; the reasoning is that a bad hook could otherwise cause unnecessary I/O overload.
If the errors need to be examined, Lager, Riak’s logging system, allows you to dynamically change the logging level to debug on the function executing the hook. To do that, riak attach to one of the nodes and run the following:
{ok, Trace} = lager:trace_file("<path>/failing-postcommits", [{module, riak_kv_put_fsm}, {function, decode_postcommit}], debug).
This will output all post-commit errors to <path>/failing-postcommits. When you’ve got enough samples you can stop the trace like so:
lager:stop_trace(Trace).
If you are using bidirectional cluster replication and have overridden the defaults for either small_vclock or big_vclock, you should consider setting both to the same value.
The default value of small_vclock has been changed to be equal to big_vclock in order to delay or even prevent unnecessary sibling creation in a Riak deployment with bidirectional cluster replication. When you replicate a pruned vector clock, the other cluster will think it isn’t a descendant, even though it is, and will create a sibling. By raising small_vclock to match big_vclock you reduce the frequency of pruning and thus of siblings. Combined with vnode vclocks, sibling creation for this particular reason may be avoided entirely, since the number of entries will almost always stay below the threshold in a well-behaved cluster (i.e. one not under constant node membership change or network partitions).
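As a sketch, one way to pin both values is via the default bucket properties in the riak_core section of app.config (the placement under default_bucket_props is an assumption to verify against your deployment, and 50 is a hypothetical threshold; use whatever big_vclock value your cluster runs with):
{riak_core, [{default_bucket_props, [{small_vclock, 50}, {big_vclock, 50}]}]}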
- Luwak has been deprecated in the 1.1 release
- bz1160 - Bitcask fails to merge on corrupt file
Depending on the partition count and the default stack size of the operating system, using Innostore as the backend on 32-bit systems can cause problems. A symptom of this problem would be a message similar to this on the console or in the Riak error log:
10:22:38.719 [error] Failed to start riak_kv_innostore_backend for index 1415829711164312202009819681693899175291684651008. Reason: {eagain,[{erlang,open_port,[{spawn,innostore_drv},[binary]]},{innostore,connect,0},{riak_kv_innostore_backend,start,2},{riak_kv_vnode,init,1},{riak_core_vnode,init,1},{gen_fsm,init_it,6},{proc_lib,init_p_do_apply,3}]}
10:22:38.871 [notice] "backend module failed to start."
The workaround for this problem is to reduce the stack size for the
user Riak is run as (riak
for most distributions). For example
on Linux, the current stack size can be viewed using ulimit -s
and
can be altered by adding entries to the /etc/security/limits.conf
file such as these:
riak soft stack 1024
riak hard stack 1024
It has been observed that 1.1.0 clusters can end up in a state with ownership
handoff failing to complete. This state should not happen under normal
circumstances, but has been observed in cases where Riak nodes crashed due
to unrelated issues (e.g. running out of disk space or memory) during cluster
membership changes. The behavior can be identified by looking at the output
of riak-admin ring_status
and looking for transfers that are waiting on an
empty set. As an example:
============================== Ownership Handoff ==============================
Owner: riak@host1
Next Owner: riak@host2
Index: 123456789123456789123456789123456789123456789123
Waiting on: []
Complete: [riak_kv_vnode,riak_pipe_vnode]
To fix this issue, copy and paste the following code sequence into an Erlang shell for each Owner node in riak-admin ring_status for which this case has been identified. The Erlang shell can be accessed with riak attach:
fun() ->
    Node = node(),
    %% Collect transfers owned by this node that are stuck waiting on an empty set.
    {_Claimant, _RingReady, _Down, _MarkedDown, Changes} =
        riak_core_status:ring_status(),
    Stuck =
        [{Idx, Complete} || {{Owner, _NextOwner}, Transfers} <- Changes,
                            {Idx, Waiting, Complete, Status} <- Transfers,
                            Owner =:= Node,
                            Waiting =:= [],
                            Status =:= awaiting],
    %% Ring transition that marks each stuck handoff as complete.
    RT = fun(Ring, _) ->
             NewRing = lists:foldl(fun({Idx, Mods}, Ring1) ->
                           lists:foldl(fun(Mod, Ring2) ->
                               riak_core_ring:handoff_complete(Ring2, Idx, Mod)
                           end, Ring1, Mods)
                       end, Ring, Stuck),
             {new_ring, NewRing}
         end,
    case Stuck of
        [] ->
            ok;
        _ ->
            riak_core_ring_manager:ring_trans(RT, []),
            ok
    end
end().
- bz775 - Start-up script does not recreate /var/run/riak
- bz1283 - erlang_js uses non-thread-safe driver function
- bz1333 - Bitcask attempts to open backup/other files
- Poolboy - Lots of potential bugs fixed, see detailed post by Andrew Thompson
- #26 - don’t make a crash log called ‘undefined’
- #28 - R13A support
- #29 - Don’t unnecessarily quote atoms
- #31 - Better crash reports for proc_lib processes
- #33 - Don’t assume supervisor children are named with atoms
- #35 - Support printing bitstrings (binaries with trailing bits)
- #37 - Don’t generate dynamic atoms