How to Use Norconex Committer to Track Crawl Progress and Status
What Norconex Committer does
Norconex Committer tracks item states during a crawl and manages commits of crawl data to your storage/indexing system. It records statuses such as new, updated, deleted, skipped, and failed, enabling reliable resume, reporting, and accurate downstream indexing.
Key concepts
- Committer: component that receives document events (add/update/delete) and decides when and how to persist changes.
- Commit point: a durable checkpoint representing progress; used to resume or rollback.
- Batching: grouping operations into transactions for efficiency.
- Retry and failure handling: retries, error logging, and marking items as failed to avoid data loss.
- Item fingerprints: unique IDs or hashes to detect duplicates or unchanged items.
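The fingerprint concept above can be sketched in a few lines: hash a document's content and compare it to the last stored fingerprint to decide whether the item is new, modified, or unchanged. This is a generic illustration, not Norconex's actual implementation; the function name and the `seen` store are hypothetical (Norconex keeps this state in its crawl data store).

```python
import hashlib

def classify(reference: str, content: bytes, seen: dict) -> str:
    """Classify a document as new/modified/unchanged via a content hash.

    `seen` maps document references to their last committed fingerprint
    (hypothetical in-memory store for illustration only).
    """
    fingerprint = hashlib.sha256(content).hexdigest()
    previous = seen.get(reference)
    seen[reference] = fingerprint
    if previous is None:
        return "new"
    return "unchanged" if previous == fingerprint else "modified"

store = {}
print(classify("https://example.com/a", b"hello", store))   # new
print(classify("https://example.com/a", b"hello", store))   # unchanged
print(classify("https://example.com/a", b"hello!", store))  # modified
```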
Typical use cases
- Ensuring only successfully processed documents are indexed.
- Generating progress reports during and after crawls.
- Resuming interrupted crawls without reprocessing already committed items.
- Synchronizing deletions between source and target index/storage.
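The last use case, synchronizing deletions, boils down to a set difference: any reference that was committed earlier but is no longer present at the source should be removed from the target. A minimal, tool-agnostic sketch:

```python
def deletions_to_send(committed_refs: set, current_refs: set) -> set:
    """References committed earlier but now gone from the source."""
    return committed_refs - current_refs

committed = {"/a", "/b", "/c"}
crawled = {"/a", "/c", "/d"}
print(deletions_to_send(committed, crawled))  # {'/b'}
```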
Configuration steps (typical)
- Add a Committer section in your Norconex configuration (commonly in crawler-config.xml).
- Choose a committer implementation (e.g., IndexCommitter, FileCommitter, SolrCommitter, ElasticsearchCommitter, or a custom committer).
- Configure batching:
  - set batchSize (number of items per commit)
  - set batchInterval (time-based flush)
- Configure checkpointing:
  - enable durable checkpoints (e.g., using FileCheckpoint or a DB-backed checkpoint)
  - set checkpoint frequency and retention
- Configure retry/failure behavior:
  - maxRetries, retryInterval
  - failureAction (skip, mark, halt)
- Map fields and actions:
  - specify which metadata fields to send
  - set delete handling (e.g., deleteById)
- Enable logging and metrics for progress visibility.
Example snippet (conceptual XML):
```xml
<committers>
  <committer class="com.norconex.committer.core.impl.IndexCommitter">
    <batchSize>100</batchSize>
    <batchInterval>5000</batchInterval>
    <checkpoint>
      <fileCheckpoint directory="/var/norconex/checkpoints"/>
    </checkpoint>
    <retry>
      <maxRetries>3</maxRetries>
      <retryInterval>2000</retryInterval>
    </retry>
  </committer>
</committers>
```
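The batchSize/batchInterval interplay can be illustrated generically: flush queued operations when either the item-count threshold or the elapsed-time threshold is reached, whichever comes first. This is a conceptual sketch, not Norconex code; all names here are illustrative.

```python
import time

class Batcher:
    """Flush queued operations when batch_size is reached or
    batch_interval seconds have passed since the last flush."""

    def __init__(self, batch_size=100, batch_interval=5.0, flush_fn=print):
        self.batch_size = batch_size
        self.batch_interval = batch_interval
        self.flush_fn = flush_fn
        self.queue = []
        self.last_flush = time.monotonic()

    def add(self, op):
        self.queue.append(op)
        due = (len(self.queue) >= self.batch_size
               or time.monotonic() - self.last_flush >= self.batch_interval)
        if due:
            self.flush()

    def flush(self):
        if self.queue:
            self.flush_fn(list(self.queue))  # one commit per batch
            self.queue.clear()
        self.last_flush = time.monotonic()

batches = []
b = Batcher(batch_size=3, flush_fn=batches.append)
for i in range(7):
    b.add(i)
b.flush()  # flush the final partial batch
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```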
Monitoring crawl progress
- Use Committer-provided metrics: committed count, failed count, pending items.
- Inspect checkpoints to see last committed document and timestamp.
- Enable detailed logging (INFO/DEBUG) to trace per-document commit events.
- Integrate with external monitoring (Prometheus/Grafana) by emitting metrics via a custom committer wrapper.
- For web UI, build a simple dashboard that reads checkpoint files or committer logs.
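A dashboard of the kind described above only needs a handful of counters. A minimal sketch that tallies per-status counts from a hypothetical log format (one `STATUS<tab>reference` line per document; the format is an assumption for illustration, not Norconex's log layout):

```python
from collections import Counter

def progress_from_log(lines):
    """Tally per-status counts from 'STATUS\treference' log lines."""
    counts = Counter()
    for line in lines:
        status, _, _ref = line.partition("\t")
        counts[status] += 1
    return counts

log = [
    "COMMITTED\t/a",
    "COMMITTED\t/b",
    "FAILED\t/c",
    "PENDING\t/d",
]
print(progress_from_log(log))
# Counter({'COMMITTED': 2, 'FAILED': 1, 'PENDING': 1})
```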
Resuming and recovery
- On restart, the committer reads the last checkpoint and skips items already committed.
- Failed items can be retried based on configured policies or exported for manual reprocessing.
- Ensure checkpoints are stored in durable storage (not ephemeral containers).
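Resume logic reduces to filtering out references already recorded in the checkpoint. A simplified sketch (a real checkpoint stores more, such as timestamps and fingerprints; file layout here is hypothetical):

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return the set of committed references; empty if no checkpoint yet."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))

def save_checkpoint(path, committed):
    with open(path, "w") as f:
        json.dump(sorted(committed), f)

ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
save_checkpoint(ckpt, {"/a", "/b"})

to_crawl = ["/a", "/b", "/c"]
done = load_checkpoint(ckpt)
remaining = [r for r in to_crawl if r not in done]
print(remaining)  # ['/c']
```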
Best practices
- Set batch sizes based on target system capacity: larger batches for throughput, smaller batches for lower memory use.
- Use durable checkpointing to avoid reprocessing after failures.
- Keep retry policies conservative to prevent cascading failures.
- Monitor and alert on failed commit rates and long-running batches.
- Test with dry runs and incremental loads before full-scale crawls.
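A "conservative retry policy" usually means a bounded retry count with growing delays, so a struggling target is not hammered. A sketch of capped exponential backoff (parameter names echo the config above but the function itself is illustrative, not a Norconex API):

```python
import time

def commit_with_retries(commit_fn, max_retries=3, retry_interval=2.0,
                        sleep=time.sleep):
    """Try commit_fn; on failure wait retry_interval * 2**attempt and retry.

    Re-raises the last error after max_retries failed retries, so the
    item can be marked failed instead of blocking the whole crawl.
    """
    for attempt in range(max_retries + 1):
        try:
            return commit_fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(retry_interval * (2 ** attempt))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("target busy")
    return "ok"

delays = []
print(commit_with_retries(flaky, sleep=delays.append))  # ok
print(delays)  # [2.0, 4.0]
```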
Quick checklist
- Choose correct committer type for your target.
- Configure batching, checkpointing, retries.
- Enable logging and metrics.
- Store checkpoints durably.
- Test resume and failure scenarios.