Insights from Dataset Search Research
Survey of 89 Data Professionals across Tech, Finance, Healthcare & More
Hulsebos et al., 2024
Current State of Dataset Search
- 79% primarily search for initial datasets
- 70% search internal databases
- 61% search external repositories
- Only 9% use specialized dataset search tools
- Average search time: several hours per week
User Pain Points
- Inconsistent naming conventions
- Too many attributes to filter effectively
- Unclear data granularity levels
- Limited query flexibility
- Poor semantic understanding
What Users Actually Want: Content Priorities
- Table semantics and meaning
- Data granularity (geographic/temporal)
- Freshness and update frequency
- Complete schema information
- Quality metrics
What Users Actually Want: Metadata Priorities
- Usage statistics
- Data lineage
- Prior queries
- Related datasets
- Business context
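One way to make the content and metadata priorities above concrete is a single catalog record that carries both. The sketch below is illustrative only: the class name DatasetRecord, every field name, and the helper missing_fields are hypothetical and are not taken from the survey.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry covering the priorities listed above."""
    name: str
    description: str = ""                          # table semantics and meaning
    geographic_granularity: Optional[str] = None   # e.g. "county"
    temporal_granularity: Optional[str] = None     # e.g. "daily"
    last_updated: Optional[date] = None            # freshness
    update_frequency: Optional[str] = None         # e.g. "weekly"
    schema: dict[str, str] = field(default_factory=dict)             # column -> type
    quality_metrics: dict[str, float] = field(default_factory=dict)
    query_count_30d: int = 0                       # usage statistics
    upstream_datasets: list[str] = field(default_factory=list)       # lineage
    prior_queries: list[str] = field(default_factory=list)
    related_datasets: list[str] = field(default_factory=list)
    business_context: str = ""


def missing_fields(record: DatasetRecord) -> list[str]:
    """Return the names of fields that are still empty (None, "", {} or [])."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        if value is None or value == "" or value == {} or value == []:
            missing.append(f.name)
    return missing
```

A catalog could call missing_fields on each entry to show which of the priorities are not yet documented for a dataset.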
Recommendations: Search Experience
- Implement semantic search capabilities
- Add faceted filtering
- Support natural language queries
- Enable progressive refinement
- Improve result ranking
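Faceted filtering and progressive refinement can be sketched in a few lines. This is a minimal illustration over an in-memory list, not a description of any particular search tool; the catalog entries, the facet names, and the functions refine and facet_counts are all made up for the example.

```python
from collections import Counter

# Hypothetical in-memory catalog; a real system would query an index instead.
CATALOG = [
    {"name": "store_sales_daily", "domain": "finance", "granularity": "daily"},
    {"name": "store_sales_monthly", "domain": "finance", "granularity": "monthly"},
    {"name": "patient_visits", "domain": "healthcare", "granularity": "daily"},
    {"name": "claims_monthly", "domain": "healthcare", "granularity": "monthly"},
]


def refine(results, facet, value):
    """Keep only the results whose facet equals the chosen value."""
    return [d for d in results if d.get(facet) == value]


def facet_counts(results, facet):
    """Count how many of the current results fall under each facet value."""
    return Counter(d[facet] for d in results if facet in d)


# Progressive refinement: each step narrows the previous result set.
step1 = refine(CATALOG, "domain", "finance")
step2 = refine(step1, "granularity", "daily")
```

Showing facet_counts for the current result set after each step tells the user how many datasets remain under each option before they commit to the next filter.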
Recommendations: Metadata
- Automated quality metrics
- Usage analytics dashboard
- Data lineage visualization
- Freshness indicators
- Schema relationship mapping
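As a rough illustration of a freshness indicator and one automated quality metric, the sketch below labels a dataset by how overdue its last update is and computes column completeness. The thresholds (one and two expected update intervals) and the function names are assumptions chosen for the example, not values from the survey.

```python
from datetime import date


def freshness_label(last_updated: date, expected_interval_days: int, today: date) -> str:
    """Label a dataset by its age relative to its expected update interval.

    Assumed thresholds: within one interval is "fresh", within two is
    "aging", anything older is "stale".
    """
    age_days = (today - last_updated).days
    if age_days <= expected_interval_days:
        return "fresh"
    if age_days <= 2 * expected_interval_days:
        return "aging"
    return "stale"


def completeness(rows: list[dict], column: str) -> float:
    """Fraction of rows with a non-null value in the given column."""
    if not rows:
        return 0.0
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)
```

Passing today explicitly, rather than reading the clock inside the function, keeps the label reproducible when it is recomputed for a dashboard or a test.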
Recommendations: Collaboration
- User annotations and comments
- Usage examples and patterns
- Expert identification
- Search history sharing
- Team-based collections
Implementation Priorities: Short-term Wins
- Basic semantic search implementation
- Quality metrics dashboard
- User annotations
- Search history tracking
Implementation Priorities: Long-term Investments
- Advanced NLP capabilities
- Automated lineage tracking
- ML-based recommendations
- Cross-system search
Success Metrics
- Reduced time to find datasets
- Improved search success rate
- Increased dataset reuse
- Higher user satisfaction scores
- Better collaboration indicators
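The first three metrics above could be computed from a search session log along these lines. The log format (a list of dicts with found, seconds, and dataset keys) and the function names are hypothetical, and the definition of reuse used here (share of selected datasets that were picked in more than one session) is only one possible choice.

```python
from collections import Counter
from statistics import median


def search_success_rate(sessions):
    """Fraction of sessions in which the user found a dataset."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if s["found"]) / len(sessions)


def median_time_to_find(sessions):
    """Median duration in seconds of successful sessions, or None if there are none."""
    times = [s["seconds"] for s in sessions if s["found"]]
    return median(times) if times else None


def reuse_rate(sessions):
    """Share of selected datasets that were chosen in more than one session."""
    picks = Counter(s["dataset"] for s in sessions if s["found"])
    if not picks:
        return 0.0
    return sum(1 for count in picks.values() if count > 1) / len(picks)
```

Tracking these values before and after a change to the search experience gives a baseline for the "measure and iterate" loop.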
Key Takeaways
- Focus on semantic understanding
- Prioritize collaboration features
- Invest in metadata quality
- Enable iterative discovery
- Measure and iterate constantly
References
- Hulsebos, M., Lin, W., Shankar, S., & Parameswaran, A. (2024). "It Took Longer than I was Expecting:" Why is Dataset Search Still so Hard?