Jan 2025

MongoDB Aggregation Pipelines: The Power You Might Not Be Using

Most developers use MongoDB as a simple document store: insert documents, query by ID or simple filters, maybe add an index or two. But MongoDB’s aggregation framework is where the real power lives. If you’re not using aggregation pipelines, you’re leaving performance and capability on the table.

What Are Aggregation Pipelines?

An aggregation pipeline is a sequence of stages that process documents. Each stage transforms the documents as they pass through. Think of it like Unix pipes for your data—each stage takes input, does something with it, and passes the result to the next stage.

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: { _id: "$customerId", totalSpent: { $sum: "$amount" } } },
  { $sort: { totalSpent: -1 } },
  { $limit: 10 }
])

This pipeline finds your top 10 customers by spending in four simple stages. Try doing that efficiently with basic queries.
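To make the four stages concrete, here is a rough plain-JavaScript equivalent of the same logic over an in-memory array. It only illustrates what each stage computes, not how MongoDB executes it; the function name `topCustomers` and the sample orders are made up for this example.

```javascript
// Plain-JavaScript sketch of what the pipeline computes.
// `topCustomers` and the sample data are hypothetical, for illustration only.
function topCustomers(orders, limit = 10) {
  const totals = new Map();
  for (const order of orders) {
    if (order.status !== "completed") continue;          // $match
    const soFar = totals.get(order.customerId) ?? 0;
    totals.set(order.customerId, soFar + order.amount);  // $group + $sum
  }
  return [...totals]
    .map(([customerId, totalSpent]) => ({ _id: customerId, totalSpent }))
    .sort((a, b) => b.totalSpent - a.totalSpent)         // $sort descending
    .slice(0, limit);                                    // $limit
}

const sampleOrders = [
  { customerId: "a", status: "completed", amount: 40 },
  { customerId: "b", status: "completed", amount: 100 },
  { customerId: "a", status: "completed", amount: 70 },
  { customerId: "c", status: "pending", amount: 500 },
];

console.log(topCustomers(sampleOrders, 2));
// [ { _id: 'a', totalSpent: 110 }, { _id: 'b', totalSpent: 100 } ]
```

The application-side version needs every order loaded into memory before it can start, which is exactly the cost the pipeline avoids.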

Why Aggregation Beats Application-Side Processing

1. Data Stays in the Database

Moving data from MongoDB to your application, processing it, and potentially writing it back is expensive. Network latency, serialization overhead, and memory usage all add up. Aggregation pipelines process data where it lives.

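2. Indexes Work at the Start of the Pipeline

MongoDB can use indexes in aggregation pipelines, but mainly for stages at the start of the pipeline, such as a leading $match or $sort. Once a stage like $group or $project has reshaped the documents, later stages work on intermediate results and cannot use the collection’s indexes. A pipeline that filters and sorts first, backed by the right index, can be remarkably fast.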
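As a sketch of what a matching index could look like, the snippet below pairs a compound index with a pipeline whose leading $match and $sort line up with it. The field names `status` and `createdAt` come from the examples in this post, but the index itself and the helper `recentCompletedOrders` are assumptions for illustration; confirm the plan with explain() on your own data.

```javascript
// Hypothetical compound index: equality field first, then the sort field.
const ordersIndex = { status: 1, createdAt: -1 };

// Builds a pipeline whose first two stages line up with that index.
function recentCompletedOrders(limit) {
  return [
    { $match: { status: "completed" } }, // equality on the index prefix
    { $sort: { createdAt: -1 } },        // same direction as the index
    { $limit: limit },
  ];
}

// In mongosh this would be used roughly as:
//   db.orders.createIndex(ordersIndex)
//   db.orders.aggregate(recentCompletedOrders(20))
console.log(JSON.stringify(recentCompletedOrders(20)));
```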

3. Memory-Efficient Processing

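Each pipeline stage has a memory limit (100 MB by default). With allowDiskUse enabled, which is the default from MongoDB 6.0 onward, stages such as $sort and $group can spill to temporary files on disk instead of failing. Your application server’s memory is finite and expensive. Let the database handle the heavy lifting.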
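Whether a stage may spill is controlled by the `allowDiskUse` option on `aggregate()`. A minimal sketch, assuming `collection` is either a mongosh collection such as `db.orders` or a Node.js driver collection (both take an options object as the second argument); the wrapper name `aggregateWithSpill` and the stand-in collection are made up so the example runs without a database.

```javascript
// Hypothetical wrapper: runs a pipeline with spilling to disk explicitly enabled.
function aggregateWithSpill(collection, pipeline) {
  return collection.aggregate(pipeline, { allowDiskUse: true });
}

// Stand-in collection so the sketch runs without a database;
// a real collection object would return a cursor instead.
const fakeCollection = {
  aggregate(pipeline, options) {
    return { pipeline, options };
  },
};

const call = aggregateWithSpill(fakeCollection, [
  { $group: { _id: "$customerId", totalSpent: { $sum: "$amount" } } },
  { $sort: { totalSpent: -1 } },
]);
console.log(call.options); // { allowDiskUse: true }
```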

Stages Every Developer Should Know

$match: Filter Early, Filter Often

Always put $match stages as early as possible. This reduces the documents flowing through subsequent stages.

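// Good: filter on stored fields first
{ $match: { createdAt: { $gte: lastMonth } } },
{ $group: { ... } }

// Bad as the only filter: every document is grouped before anything is discarded.
// (A filter on a computed total has to follow $group, so pair it with an early $match.)
{ $group: { ... } },
{ $match: { total: { $gt: 1000 } } }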
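A filter on a computed value such as a group total can only run after $group, so in practice a pipeline often has both an early $match on stored fields and a late one on the aggregate. A sketch, where the function name `bigSpendersSince` and the `total` field are illustrative, and `createdAt`, `customerId`, and `amount` are borrowed from the other examples in this post:

```javascript
function bigSpendersSince(since, minTotal) {
  return [
    // Early filter on a stored field: can use an index and shrinks the input.
    { $match: { createdAt: { $gte: since } } },
    { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
    // Late filter on a computed field: only possible after $group.
    { $match: { total: { $gt: minTotal } } },
  ];
}

// Roughly one month ago (Date rolls over short months, which is fine for a sketch).
const lastMonth = new Date();
lastMonth.setMonth(lastMonth.getMonth() - 1);

const spenders = bigSpendersSince(lastMonth, 1000);
console.log(spenders.map((stage) => Object.keys(stage)[0]));
// [ '$match', '$group', '$match' ]
```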

$lookup: Join Documents Across Collections

Yes, MongoDB can do joins. The $lookup stage performs a left outer join to another collection.

db.orders.aggregate([
  {
    $lookup: {
      from: "customers",
      localField: "customerId",
      foreignField: "_id",
      as: "customer"
    }
  },
  { $unwind: "$customer" }
])

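This embeds the matching customer document in each order that has one. Use it judiciously: the lookup runs once per input document, so it depends on an index on the foreign field (here _id, which is always indexed) and is generally slower than a join in a relational database. It’s there when you need it.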
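One detail worth knowing: a plain $unwind drops orders whose `customer` array is empty, which in practice turns the left outer join into an inner join. If unmatched orders should be kept, $unwind accepts a `preserveNullAndEmptyArrays` option. A sketch using the same collection and field names as above (the constant name `ordersWithCustomer` is made up):

```javascript
const ordersWithCustomer = [
  {
    $lookup: {
      from: "customers",
      localField: "customerId",
      foreignField: "_id", // _id is always indexed
      as: "customer",
    },
  },
  {
    $unwind: {
      path: "$customer",
      // Keep orders that have no matching customer document.
      preserveNullAndEmptyArrays: true,
    },
  },
];

// In mongosh: db.orders.aggregate(ordersWithCustomer)
console.log(JSON.stringify(ordersWithCustomer, null, 2));
```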

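$facet: Several Aggregations in One Stage

Need to run several aggregations on the same data? $facet runs multiple sub-pipelines over the same set of input documents within a single stage and returns all of their results together.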

db.products.aggregate([
  {
    $facet: {
      byCategory: [
        { $group: { _id: "$category", count: { $sum: 1 } } }
      ],
      priceStats: [
        { $group: {
          _id: null,
          avg: { $avg: "$price" },
          min: { $min: "$price" },
          max: { $max: "$price" }
        }}
      ],
      topRated: [
        { $sort: { rating: -1 } },
        { $limit: 5 }
      ]
    }
  }
])

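One query, three different analyses. Perfect for dashboard data. Two caveats: the result comes back as a single document, so it is subject to the 16 MB document size limit, and the sub-pipelines cannot use indexes, so put any $match before the $facet stage.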
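$facet returns a single document with one array per sub-pipeline, so the application usually unpacks it once. The sketch below shows one way to do that; the function name `toDashboard` and the sample result are invented for illustration and only mirror the field names used in the pipeline above.

```javascript
// Turns the single $facet result document into a flatter dashboard object.
function toDashboard(facetResult) {
  const countsByCategory = {};
  for (const { _id, count } of facetResult.byCategory) {
    countsByCategory[_id] = count;
  }
  // priceStats groups with _id: null, so it holds at most one document.
  const [stats = null] = facetResult.priceStats;
  return {
    countsByCategory,
    priceStats: stats && { avg: stats.avg, min: stats.min, max: stats.max },
    topRatedIds: facetResult.topRated.map((product) => product._id),
  };
}

// Made-up result in the shape $facet produces for the pipeline above.
const sampleFacetResult = {
  byCategory: [
    { _id: "books", count: 2 },
    { _id: "games", count: 1 },
  ],
  priceStats: [{ _id: null, avg: 20, min: 10, max: 30 }],
  topRated: [{ _id: 7, rating: 4.9 }, { _id: 3, rating: 4.5 }],
};

console.log(toDashboard(sampleFacetResult));
```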

$bucket: Histograms from Explicit Boundaries

Group documents into buckets based on a field value, using boundaries you specify. Great for analytics and reporting.

db.orders.aggregate([
  {
    $bucket: {
      groupBy: "$amount",
      boundaries: [0, 50, 100, 250, 500, 1000, Infinity],
      default: "Other",
      output: {
        count: { $sum: 1 },
        avgAmount: { $avg: "$amount" }
      }
    }
  }
])

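This groups orders into the amount ranges listed in boundaries in a single stage, with no binning code in the application. If you would rather have MongoDB choose the boundaries, $bucketAuto spreads documents across a requested number of buckets.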
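The boundary rules are easy to get wrong: each range includes its lower bound and excludes its upper bound, the bucket’s _id is the lower bound, and anything outside every range goes to the default bucket. A plain-JavaScript sketch of that assignment for numeric values (the helper `bucketFor` is hypothetical and is not how the server implements it):

```javascript
// Returns the bucket _id ($bucket uses the lower boundary) for a numeric value,
// or `fallback` when the value falls outside every range.
function bucketFor(value, boundaries, fallback) {
  for (let i = 0; i < boundaries.length - 1; i++) {
    // Lower bound is inclusive, upper bound is exclusive.
    if (value >= boundaries[i] && value < boundaries[i + 1]) {
      return boundaries[i];
    }
  }
  return fallback;
}

const boundaries = [0, 50, 100, 250, 500, 1000, Infinity];
console.log(bucketFor(49.99, boundaries, "Other")); // 0
console.log(bucketFor(50, boundaries, "Other"));    // 50
console.log(bucketFor(1200, boundaries, "Other"));  // 1000
console.log(bucketFor(-5, boundaries, "Other"));    // Other
```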

Advanced Patterns

Rolling Averages with $setWindowFields

MongoDB 5.0 introduced window functions. Calculate running totals, moving averages, and rankings directly in your queries.

db.sales.aggregate([
  {
    $setWindowFields: {
      partitionBy: "$region",
      sortBy: { date: 1 },
      output: {
        movingAvg: {
          $avg: "$amount",
          window: { documents: [-6, 0] }
        },
        runningTotal: {
          $sum: "$amount",
          window: { documents: ["unbounded", "current"] }
        }
      }
    }
  }
])

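Moving average and running total, partitioned by region. Note that documents: [-6, 0] counts documents, not days, so this is a seven-day average only when each region has exactly one document per day. For a calendar-based window on a date field, use range: [-6, 0] with unit: "day" instead. This used to require application code or a separate analytics database.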
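To see how a documents window behaves at the start of a partition, here is the same calculation in plain JavaScript for one partition that is already sorted. The function `movingAverage` is a made-up illustration of the window semantics, with `lookBack = 6` corresponding to documents: [-6, 0].

```javascript
// For each position i, average values[i - lookBack .. i], clipped at the start.
function movingAverage(values, lookBack = 6) {
  return values.map((_, i) => {
    const start = Math.max(0, i - lookBack);
    const slice = values.slice(start, i + 1);
    return slice.reduce((sum, v) => sum + v, 0) / slice.length;
  });
}

console.log(movingAverage([10, 20, 30, 40], 2));
// [ 10, 15, 20, 30 ]
```

The first few results average over fewer documents than the full window, which is worth remembering when charting the early part of a series.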

Recursive Lookups with $graphLookup

Navigate hierarchical or graph data structures within MongoDB.

db.employees.aggregate([
  { $match: { name: "CEO" } },
  {
    $graphLookup: {
      from: "employees",
      startWith: "$_id",
      connectFromField: "_id",
      connectToField: "reportsTo",
      as: "allReports",
      maxDepth: 10
    }
  }
])

Find everyone who reports to the CEO, directly or indirectly, up to the maxDepth limit. Organizational charts, category trees, and social graphs are all queryable this way.
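The traversal can be pictured as a breadth-first search: start from the CEO’s _id, find documents whose reportsTo matches, then repeat from their _id values. The plain-JavaScript sketch below mimics that for one starting document; `allReports` and the sample staff are invented, and the `depth` it records corresponds to what the optional depthField would report.

```javascript
// Plain-JavaScript sketch of the traversal for one starting _id.
// depth 0 = direct reports; expansion stops once depth would exceed maxDepth.
function allReports(employees, startId, maxDepth = 10) {
  const found = [];
  const seen = new Set();
  let frontier = [startId];
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const employee of employees) {
      // connectToField (reportsTo) is matched against the current frontier.
      if (frontier.includes(employee.reportsTo) && !seen.has(employee._id)) {
        seen.add(employee._id);
        found.push({ ...employee, depth });
        next.push(employee._id); // connectFromField (_id) feeds the next round
      }
    }
    frontier = next;
  }
  return found;
}

const staff = [
  { _id: 1, name: "CEO", reportsTo: null },
  { _id: 2, name: "VP", reportsTo: 1 },
  { _id: 3, name: "Manager", reportsTo: 2 },
  { _id: 4, name: "Engineer", reportsTo: 3 },
];

console.log(allReports(staff, 1).map((e) => e.name));
// [ 'VP', 'Manager', 'Engineer' ]
console.log(allReports(staff, 1, 0).map((e) => e.name));
// [ 'VP' ]
```

Unlike this sketch, the server does not guarantee any particular order in the output array, so sort in a later stage if order matters.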

Performance Tips

1. Explain Your Pipelines

Use explain() to understand how MongoDB executes your pipeline:

db.orders.explain("executionStats").aggregate([...])

Look for COLLSCAN (a full collection scan, usually bad) versus IXSCAN (an index scan, usually good) in the plan for the first stages.
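Explain output is deeply nested and its layout differs between server versions, so a small helper that walks the whole object can be handier than hard-coding a path. The sketch below is one possible approach; `collectPlanStages` is a hypothetical helper, and the sample object is a hand-written fragment loosely shaped like explain output, not real server output.

```javascript
// Recursively collects every string stored under a "stage" key.
// Walks the whole object instead of assuming a fixed explain() layout.
function collectPlanStages(node, found = []) {
  if (Array.isArray(node)) {
    for (const item of node) collectPlanStages(item, found);
  } else if (node !== null && typeof node === "object") {
    if (typeof node.stage === "string") found.push(node.stage);
    for (const value of Object.values(node)) collectPlanStages(value, found);
  }
  return found;
}

// Hand-written fragment for illustration only.
const sampleExplain = {
  queryPlanner: {
    winningPlan: {
      stage: "FETCH",
      inputStage: { stage: "IXSCAN", indexName: "status_1" },
    },
  },
};

const stages = collectPlanStages(sampleExplain);
console.log(stages);                      // [ 'FETCH', 'IXSCAN' ]
console.log(stages.includes("COLLSCAN")); // false
```

Because it walks everything, the helper also picks up stages from rejected plans, so treat a COLLSCAN hit as a prompt to read the plan rather than as a verdict.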

2. Project Early

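If you only need specific fields, an early $project reduces the size of the documents moving through the pipeline. MongoDB’s optimizer can often work out on its own which fields later stages need, so check with explain() whether the explicit stage actually helps.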

{ $project: { customerId: 1, amount: 1, date: 1 } }

3. Use $merge for Materialized Views

For expensive aggregations that don’t need real-time data, write results to a collection:

db.orders.aggregate([
  // ... complex pipeline ...
  { $merge: { into: "dailySummary", whenMatched: "replace" } }
])

Run this on a schedule and query the summary collection for fast reads.
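As one possible shape for the elided pipeline, the sketch below groups completed orders by calendar day and merges the result into dailySummary. The function name and the `orderCount` and `revenue` fields are assumptions; `status`, `createdAt`, and `amount` are reused from the earlier examples.

```javascript
// Hypothetical daily summary: one output document per calendar day.
function dailySummaryPipeline(since) {
  return [
    { $match: { status: "completed", createdAt: { $gte: since } } },
    {
      $group: {
        // A day string such as "2025-01-31" becomes the summary document's _id.
        _id: { $dateToString: { format: "%Y-%m-%d", date: "$createdAt" } },
        orderCount: { $sum: 1 },
        revenue: { $sum: "$amount" },
      },
    },
    {
      $merge: {
        into: "dailySummary",
        on: "_id",                // match existing summary documents by day
        whenMatched: "replace",   // overwrite a day that was already summarized
        whenNotMatched: "insert", // add days seen for the first time
      },
    },
  ];
}

// In mongosh, a scheduled job could run something like:
//   db.orders.aggregate(dailySummaryPipeline(ISODate("2025-01-01")))
console.log(JSON.stringify(dailySummaryPipeline(new Date(0)), null, 2));
```

Pass a `since` value aligned to a day boundary; otherwise the first day in the range would be replaced with a partial total.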

When Not to Use Aggregation

Aggregation pipelines aren’t always the answer:

Simple queries: Don’t overcomplicate basic find operations
Real-time user-facing queries: Complex pipelines can have unpredictable latency
Transactions: Aggregations can run inside multi-document transactions, but some stages, including $out and $merge, are not allowed there
When you need the full document: If you’re just filtering and need complete documents, find() is simpler

Conclusion

MongoDB’s aggregation framework transforms it from a simple document store into a powerful data processing engine. The learning curve is worth it—you’ll write less application code, reduce data transfer, and often see significant performance improvements.

Start small. Take one piece of data processing logic from your application and move it to an aggregation pipeline. Measure the difference. Then do it again.