Why is dataset search still so hard?

Insights from Dataset Search Research

Survey of 89 Data Professionals across Tech, Finance, Healthcare & More

Hulsebos et al., 2024

Current State of Dataset Search

  • 79% primarily search for initial datasets
  • 70% search internal databases
  • 61% search external repositories
  • Only 9% use specialized dataset search tools
  • Average search time: several hours per week

User Pain Points

  • Inconsistent naming conventions
  • Too many attributes to filter effectively
  • Unclear data granularity levels
  • Limited query flexibility
  • Poor semantic understanding

What Users Actually Want

Content Priorities

  • Table semantics and meaning
  • Data granularity (geographic/temporal)
  • Freshness and update frequency
  • Complete schema information
  • Quality metrics

Metadata Priorities

  • Usage statistics
  • Data lineage
  • Prior queries
  • Related datasets
  • Business context

Recommendations: Search Experience

  • Implement semantic search capabilities
  • Add faceted filtering
  • Support natural language queries
  • Enable progressive refinement
  • Improve result ranking
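
To make faceted filtering and progressive refinement concrete, here is a minimal sketch over a tiny in-memory catalog. The `DatasetEntry` fields, the `CATALOG` entries, and the helper names `facet_counts` and `refine` are all hypothetical, invented for illustration; a real catalog would sit behind a search index rather than a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical catalog record; field names are illustrative only."""
    name: str
    domain: str
    granularity: str  # e.g. "daily", "monthly"
    source: str       # "internal" or "external"


def facet_counts(entries, facet):
    """Count how many entries fall under each value of one facet."""
    counts = {}
    for entry in entries:
        value = getattr(entry, facet)
        counts[value] = counts.get(value, 0) + 1
    return counts


def refine(entries, **selected):
    """Keep only entries that match every selected facet value."""
    return [
        entry for entry in entries
        if all(getattr(entry, facet) == value
               for facet, value in selected.items())
    ]


# Made-up example data, not taken from the survey.
CATALOG = [
    DatasetEntry("trades_daily", "finance", "daily", "internal"),
    DatasetEntry("fx_rates", "finance", "daily", "external"),
    DatasetEntry("claims_monthly", "healthcare", "monthly", "internal"),
]

# Progressive refinement: narrow by one facet, inspect what is left, narrow again.
step1 = refine(CATALOG, domain="finance")
print(facet_counts(step1, "source"))
step2 = refine(step1, source="internal")
print([entry.name for entry in step2])
```

Showing the facet counts after each step lets a user see how many datasets remain under each option before committing to the next filter.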

Recommendations: Metadata

  • Automated quality metrics
  • Usage analytics dashboard
  • Data lineage visualization
  • Freshness indicators
  • Schema relationship mapping
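
A freshness indicator can be as simple as comparing a dataset's last update against how often it is expected to change. The sketch below assumes each dataset records a `last_updated` timestamp and an `expected_interval`; the function name, the three labels, and the 1x/2x thresholds are assumptions chosen for illustration, not something the survey prescribes.

```python
from datetime import datetime, timedelta


def freshness_label(last_updated, expected_interval, now):
    """Classify a dataset by how late its most recent update is.

    Assumed thresholds: within one expected interval is "fresh",
    within two is "stale", anything older is "overdue".
    """
    age = now - last_updated
    if age <= expected_interval:
        return "fresh"
    if age <= 2 * expected_interval:
        return "stale"
    return "overdue"


# Made-up timestamps for a dataset that should refresh daily.
now = datetime(2024, 6, 10)
daily = timedelta(days=1)
print(freshness_label(datetime(2024, 6, 9, 12), daily, now))
print(freshness_label(datetime(2024, 6, 8, 12), daily, now))
print(freshness_label(datetime(2024, 6, 1), daily, now))
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the label reproducible and easy to test.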

Recommendations: Collaboration

  • User annotations and comments
  • Usage examples and patterns
  • Expert identification
  • Search history sharing
  • Team-based collections

Implementation Priorities

Short-term Wins

  • Basic semantic search implementation
  • Quality metrics dashboard
  • User annotations
  • Search history tracking

Long-term Investments

  • Advanced NLP capabilities
  • Automated lineage tracking
  • ML-based recommendations
  • Cross-system search

Success Metrics

  • Reduced time to find datasets
  • Improved search success rate
  • Increased dataset reuse
  • Higher user satisfaction scores
  • Better collaboration indicators
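
The first two metrics can be computed from a simple search log. This is a rough sketch assuming each search session records whether the user found a usable dataset and how many minutes the session took; the log format, its values, and the function names are hypothetical.

```python
from statistics import median

# Made-up session log; a real one would come from search telemetry.
SEARCH_LOG = [
    {"user": "a", "found": True, "minutes": 12},
    {"user": "b", "found": False, "minutes": 45},
    {"user": "c", "found": True, "minutes": 30},
    {"user": "d", "found": True, "minutes": 8},
]


def search_success_rate(sessions):
    """Fraction of sessions that ended with a dataset being found."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if s["found"]) / len(sessions)


def median_time_to_find(sessions):
    """Median minutes spent in successful sessions, or None if there are none."""
    times = [s["minutes"] for s in sessions if s["found"]]
    return median(times) if times else None


print(search_success_rate(SEARCH_LOG))
print(median_time_to_find(SEARCH_LOG))
```

The median is used here instead of the mean so that a few very long sessions do not dominate the time-to-find figure.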

Key Takeaways

  • Focus on semantic understanding
  • Prioritize collaboration features
  • Invest in metadata quality
  • Enable iterative discovery
  • Measure and iterate constantly

References

  • Hulsebos, M., Lin, W., Shankar, S., & Parameswaran, A. (2024). "It Took Longer than I was Expecting:" Why is Dataset Search Still so Hard?