
MarkLogic Replication Failures: Understanding and Fixing XDMP-OLDSTAMP Errors

If you’ve administered a MarkLogic cluster for any length of time, you’ve almost certainly encountered this error in your logs:

XDMP-OLDSTAMP: Timestamp too old for forest Forest-12
(17738307661246550 < 17738307936721930; 17738389937472340)

When it appears, the replica forest goes unavailable. If you have local failover configured, read queries silently redirect back to the primary — you’ve lost your read scalability and redundancy without necessarily realising it. If you don’t have failover configured, queries against the replica simply fail. Either way, your carefully planned high-availability architecture is suddenly running on one leg. This post explains what the error means, how to recover from it, and — more importantly — how to stop it from happening again.

What XDMP-OLDSTAMP Actually Means

Every transaction in MarkLogic operates against a point-in-time snapshot identified by a timestamp. These timestamps are monotonically increasing numbers that represent the state of a forest at a specific moment. When the primary forest writes new data and merges stands, it advances its timestamp. The replica must keep pace.

The three numbers in the error message tell the story:

(17738307661246550 < 17738307936721930; 17738389937472340)
  • First number (17738307661246550): the timestamp the replica is trying to read at
  • Second number (17738307936721930): the oldest timestamp the forest currently supports
  • Third number (17738389937472340): the forest’s current timestamp

In plain terms: the replica is requesting a view of the data that no longer exists. The primary forest has moved on — its merges have consolidated stands and discarded the historical state the replica needed. The replica can’t catch up because the data it needs has been compacted away.
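These timestamps map to wall-clock time: they tick ten million times per second from the Unix epoch (a resolution of 100 nanoseconds). That means you can convert the numbers in the error into human-readable times and see how large the gap actually is. A quick sketch using GNU date (the timestamp here is the second number from the example error):

```shell
# MarkLogic timestamps tick 10,000,000 times per second since the Unix
# epoch, so integer-dividing by 10^7 recovers the Unix time in seconds.
ts=17738307936721930
date -u -d "@$(( ts / 10000000 ))"
```

Inside the server, the built-in xdmp:timestamp-to-wallclock() does the equivalent conversion.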

Why This Happens

Several scenarios cause the replica to fall behind the primary’s merge horizon:

Heavy Write Load with Aggressive Merges

The most common cause. When the primary forest receives a sustained burst of writes, it accumulates in-memory stands that eventually flush to disk. MarkLogic’s background merger then consolidates these stands into larger ones, advancing the forest’s minimum available timestamp. If the replica can’t replicate those changes before the merge completes, it loses the ability to synchronise.

Network Latency or Bandwidth Constraints

Replication between primary and replica forests happens over the network. If the link between hosts is slow, saturated, or experiencing packet loss, the replica falls behind. The primary doesn’t wait — it continues accepting writes and merging. Eventually the gap becomes unbridgeable.

Replica Host Under Resource Pressure

If the replica host is CPU-bound, running out of disk I/O, or memory-constrained, it can’t apply replicated changes fast enough. This is particularly common when the replica host is also serving read queries and those queries compete with the replication process for resources.

Large Merges on the Primary

Even without extreme write volumes, a large merge on the primary can jump the timestamp forward significantly in a single operation. If the replica was slightly behind before the merge, it may find itself completely out of range afterwards. This is especially problematic with forests that have accumulated many small stands over time — the resulting merge is large and the timestamp leap is substantial.

Journal Archiving and Deletion

MarkLogic uses journal files to record every transaction. Replicas rely on these journals to catch up. If journal archiving is configured too aggressively, or if journal files are deleted (manually or by automated cleanup), the replica loses access to the transaction history it needs.

Immediate Recovery: Stop the Bleeding

When you see XDMP-OLDSTAMP errors and replica forests going unavailable, the priority is stabilisation. Here’s the sequence:

Step 1: Reduce or Stop Writes to the Primary Database

This is the single most effective immediate action. Every new write to the primary advances the timestamp further, making recovery harder. If you can pause ingestion or redirect writes temporarily, do it.

For batch loading processes, this usually means pausing the job. For application-driven writes, you may need to put the application into a read-only or maintenance mode. The goal is to give the replica a chance to close the gap without the primary pulling further ahead.

How you stop writes depends on your setup:

  • Batch jobs: Pause the MLCP, CoRB, or custom loader process
  • Application writes: Enable a maintenance mode flag, or temporarily restrict the app server to GET requests
  • Data Hub flows: Pause running flows from the Hub Central interface or via the Management API

Even a partial reduction in write throughput helps. If you can’t stop writes entirely, reducing the rate buys time.
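If you can't reach the writers themselves, another option is to stop updates at the forest level: each forest has an updates-allowed property that can be set to read-only, which prevents new transactions from advancing its timestamp. A sketch via the Management API (the forest name is a placeholder; repeat for each primary forest in the database):

```shell
# Flip a primary forest to read-only so no new transactions advance its
# timestamp. "Forest-12" is a placeholder name. Remember to set the
# property back to "all" once the replica has recovered.
curl -X PUT --digest -u admin:password \
  -H "Content-Type: application/json" \
  -d '{"updates-allowed": "read-only"}' \
  "http://localhost:8002/manage/v2/forests/Forest-12/properties"
```

This is blunt (writes will fail rather than queue), but it reliably freezes the timestamp while you recover.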

Step 2: Check the Current State

With writes paused, assess the situation via the Admin UI or the Management REST API:

# Check forest status
curl -s --digest -u admin:password \
  "http://localhost:8002/manage/v2/forests?format=json" | \
  jq '.["forest-default-list"]["list-items"]["list-item"][] | {name: .nameref, state: .["status-detail"]["state"]}'

Or for a specific forest:

curl -s --digest -u admin:password \
  "http://localhost:8002/manage/v2/forests/Forest-12?format=json"

Look for forests with a state of unmounted or error. Note which primary forests have lost their replicas.

Step 3: Clear and Re-replicate the Affected Replica Forests

If the replica forest is unmounted and can’t recover on its own, you need to clear it and let it re-replicate from the primary:

  1. Unmount the replica forest if it isn’t already unmounted
  2. Clear the replica forest — this deletes all data on the replica and forces a full re-synchronisation from the primary
  3. Mount the replica forest — it will begin bulk-replicating from the primary

You can do this through the Admin UI (Forests > [forest name] > clear) or via the Management API:

# Clear the replica forest
curl -X POST --digest -u admin:password \
  -H "Content-Type: application/json" \
  -d '{"operation": "clear"}' \
  http://localhost:8002/manage/v2/forests/Forest-12-replica

After clearing, the replica will re-synchronise. This can take significant time for large forests — plan accordingly and keep writes paused or minimal until it completes.
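Rather than refreshing the Admin UI, you can poll the replica's state until it comes back. A rough sketch (the forest name is a placeholder, and the jq path is an assumption about the status payload shape; verify the field names against a real Management API response first):

```shell
# Poll the replica forest every 30 seconds until it reports "open".
# Forest name, credentials, and the jq path are assumptions; check them
# against your own Management API output before relying on this.
forest="Forest-12-replica"
while :; do
  state=$(curl -s --digest -u admin:password \
    "http://localhost:8002/manage/v2/forests/${forest}?view=status&format=json" \
    | jq -r '.["forest-status"]["status-properties"]["state"]
             | if type == "object" then .value else . end')
  echo "$(date '+%H:%M:%S') ${forest}: ${state:-unknown}"
  case "$state" in
    open) break ;;
  esac
  sleep 30
done
```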

Step 4: Resume Writes Gradually

Once the replica shows a state of sync replicating or open in the Admin UI, you can begin resuming writes. Start gradually rather than opening the floodgates — let the replica prove it can keep pace before returning to full production load.

Understanding Merge Timestamps

To prevent recurrence, you need to understand how MarkLogic’s merge policy affects the timestamp window available to replicas.

How Merges Work

MarkLogic stores data in stands — immutable segments of a forest. When documents are inserted or updated, new stands are created. The background merger periodically consolidates smaller stands into larger ones, which:

  1. Reclaims space from deleted or updated documents
  2. Improves query performance by reducing the number of stands to search
  3. Advances the forest’s minimum timestamp by discarding the pre-merge state

That third point is the crux of the XDMP-OLDSTAMP problem. Once a merge completes, the historical state captured by the pre-merge stands is gone. Any replica that needed that state to catch up is now stranded.

The Merge Timestamp Property

Each forest has a merge timestamp property that controls how far back the forest retains historical state after merges. This is the key configuration for preventing XDMP-OLDSTAMP errors.

When set to 0 (the default), MarkLogic is free to discard historical state as soon as a merge completes. This is fine for standalone forests but dangerous for replicated ones — there’s no buffer for the replica to fall behind and recover.

You can set the merge timestamp to a specific value to retain historical state up to a fixed point. The value is in MarkLogic's internal timestamp units: ten million ticks per second since the Unix epoch, a resolution of 100 nanoseconds. A fixed positive value pins the retained history at one point in time and never advances, so in practice you set a negative value instead, which MarkLogic interprets as a rolling retention window relative to the current timestamp.

Setting a Safer Merge Timestamp

The most common approach is to configure the merge timestamp on the database to retain a window of historical state. You can do this via the Admin UI under Databases > [database] > Merge Policy, or via XQuery:

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $config := admin:database-set-merge-timestamp(
  $config,
  admin:database-get-id($config, "my-database"),
  -36000000000  (: one hour: 3600 seconds x 10,000,000 ticks per second :)
)
return admin:save-configuration($config)

The same setting via the Management API, retaining one hour of history:

curl -X PUT --digest -u admin:password \
  -H "Content-Type: application/json" \
  -d '{"merge-timestamp": -36000000000}' \
  "http://localhost:8002/manage/v2/databases/my-database/properties"

A negative merge timestamp value tells MarkLogic to retain state for that many timestamp units (ten-millionths of a second) before the current time, as a rolling window that advances with the forest. A value of -36000000000 (3600 seconds × 10,000,000 ticks per second) means "keep at least one hour of history after merges."

The right value depends on your environment:

  • Low-latency, reliable replication links: 30–60 minutes may be sufficient
  • Cross-datacentre replication: 2–4 hours provides more safety margin
  • Known periods of heavy write activity: Set it higher temporarily during bulk loads
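Whichever window you choose, the conversion to a merge-timestamp value is the same arithmetic; the constant is MarkLogic's ten-million-ticks-per-second resolution, and a small helper makes it harder to drop a zero:

```shell
# Convert a retention window (in seconds) to a negative merge-timestamp
# value. MarkLogic timestamps tick 10,000,000 times per second.
window_to_merge_timestamp() {
  echo $(( -$1 * 10000000 ))
}

window_to_merge_timestamp $(( 30 * 60 ))    # 30 minutes -> -18000000000
window_to_merge_timestamp $(( 4 * 3600 ))   # 4 hours    -> -144000000000
```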

The Trade-off: Disk Space

Retaining merge history costs disk space. MarkLogic must keep old stands around until they fall outside the merge timestamp window, which means:

  • Deleted documents continue consuming space until the window passes
  • Updated documents exist in both old and new stands
  • Large forests with frequent updates can see significant space overhead

Monitor your disk usage after adjusting the merge timestamp. In most environments, the additional space is modest compared to the pain of losing a replica, but it’s worth tracking.

Preventing XDMP-OLDSTAMP Long-Term

Beyond setting an appropriate merge timestamp, several practices reduce the likelihood of replication failures:

Monitor Replication Lag

Don’t wait for forests to go unavailable. Monitor the replication lag between primary and replica forests and alert when it exceeds a threshold. The Management API exposes this:

curl -s --digest -u admin:password \
  "http://localhost:8002/manage/v2/forests/Forest-12?view=status&format=json" | \
  jq '.["forest-status"]["replica-forest-status"]'

Set up alerts when lag exceeds 50% of your merge timestamp window. This gives you time to investigate and act before the replica falls out of range.

Right-Size Your Replica Hosts

Replica hosts need enough resources to both serve read queries and apply replicated changes. If you’re running replicas on smaller hardware than the primary — which is tempting for cost reasons — you’re increasing the risk. At minimum, ensure the replica hosts have:

  • Comparable disk I/O throughput to the primary
  • Sufficient CPU to handle replication alongside read queries
  • Enough memory for MarkLogic’s caches and the replication process

Manage Bulk Loads Carefully

Bulk ingestion is the most common trigger for XDMP-OLDSTAMP errors. When you know a large load is coming:

  1. Increase the merge timestamp before the load starts to widen the safety window
  2. Throttle the load to a rate the replica can sustain — MLCP’s -thread_count and -batch_size options help here
  3. Monitor replication lag during the load and pause if the replica falls behind
  4. Reduce the merge timestamp back to normal after the load completes and the replica has caught up
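Putting step 2 into practice, a throttled MLCP import might look like the following sketch (host, credentials, and paths are placeholders; start with conservative values and raise them while watching replication lag):

```shell
# Throttled bulk load with MLCP. Lower -thread_count and -batch_size
# reduce write throughput, giving the replica room to keep pace.
# Host, credentials, and input path below are placeholders.
mlcp.sh import \
  -host ml-host.example.com -port 8000 \
  -username admin -password password \
  -mode local \
  -input_file_path /data/to-load \
  -thread_count 4 \
  -batch_size 50
```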

Consider Merge Scheduling

MarkLogic allows you to control when merges occur. If your write patterns are predictable — for example, heavy batch loads overnight followed by lighter query traffic during the day — you can schedule merges for periods of lower write activity. This prevents the situation where a large merge happens mid-load and the timestamp jumps.

Configure this via the merge blackout periods in the Admin UI under Databases > [database] > Merge Policy > Merge Blackouts.

Journal Configuration

Ensure your journal settings support replication recovery:

  • Don’t delete journal files prematurely — replicas may need them to catch up
  • Size your journal partitions generously — running out of journal space is worse than using extra disk
  • If you archive journals, ensure the archiving window exceeds your merge timestamp window

Quick Reference: Recovery Checklist

When XDMP-OLDSTAMP strikes, work through this sequence:

  1. Stop or reduce writes to the affected database
  2. Assess the damage — which replicas are unavailable?
  3. Clear affected replica forests and let them re-synchronise
  4. Monitor re-synchronisation until replicas are fully caught up
  5. Resume writes gradually once replicas are stable
  6. Investigate the root cause — was it write volume, network issues, resource constraints, or merge timing?
  7. Adjust the merge timestamp if it’s set to 0 or too low
  8. Set up monitoring for replication lag to catch problems earlier next time

Dealing with MarkLogic replication issues or planning a high-availability deployment? Contact us for hands-on help from consultants who’ve managed MarkLogic clusters in production.
