Dataset Guidelines

Best practices for creating high-quality computer vision datasets

Overview

Why Quality Matters

High-quality datasets are the foundation of successful computer vision models. The quality of your dataset directly impacts:

  • Model Performance: Better data leads to better model accuracy and generalization.
  • Training Efficiency: Clean, well-organized data reduces training time and resource usage.
  • Usability: Well-documented datasets are easier for others to use and build upon.
  • Value: Higher quality datasets command better prices in the marketplace.

These guidelines will help you create datasets that meet VisionHowl's quality standards and maximize their value and utility.

VisionHowl Quality Criteria

VisionHowl evaluates datasets based on four main criteria:

  1. Metadata Completeness (30%): How thoroughly you've documented your dataset.
  2. Preview Quality (25%): The quality and representativeness of your preview images.
  3. Format Validation (25%): Whether your dataset passes format validation checks.
  4. Documentation (20%): The quality of included documentation files.

Following these guidelines will help you achieve high scores in all these areas.

Data Collection

Diversity and Representativeness

A good dataset should represent the real-world distribution of data that models will encounter:

  • Diverse Scenarios: Include images from different environments, lighting conditions, and contexts.
  • Object Variations: Capture objects from different angles, distances, and with different occlusions.
  • Edge Cases: Include challenging examples that represent rare but important scenarios.
  • Balanced Classes: Ensure a reasonable balance between different classes or categories.

Tip: For object detection datasets, aim for at least 100 instances of each object class, with variations in size, orientation, and context.
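
One quick way to check this is to count instances per class before finalizing the dataset. Below is a minimal sketch assuming YOLO-style label files (one .txt per image, each line starting with a class id); the paths are illustrative:

# Count annotated instances per class across YOLO-style label files.
from collections import Counter
from pathlib import Path

counts = Counter()
for label_file in Path("dataset_name/annotations/train").glob("*.txt"):
    for line in label_file.read_text().splitlines():
        if line.strip():
            counts[int(line.split()[0])] += 1

for class_id, n in sorted(counts.items()):
    flag = "" if n >= 100 else "  <- below the 100-instance guideline"
    print(f"class {class_id}: {n} instances{flag}")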

Image Quality

The quality of individual images affects the overall dataset quality:

  • Resolution: Use appropriate resolution for the task (typically 640x640 or higher for object detection).
  • Clarity: Avoid blurry, over-exposed, or under-exposed images when possible.
  • Consistency: Maintain consistent image quality throughout the dataset.
  • Format: Use common formats like JPEG or PNG with reasonable compression.

Important: While some variation in image quality is natural and even desirable for robustness, extremely poor quality images should be excluded unless they specifically represent important edge cases.
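
As a starting point for screening out such images, the sketch below flags unreadable, low-resolution, or possibly blurry files. It assumes OpenCV (cv2) is installed, and both thresholds are illustrative defaults to tune on your own data:

# Flag unreadable, low-resolution, or possibly blurry images.
import cv2

def check_image(path, min_side=640, blur_threshold=100.0):
    img = cv2.imread(path)
    if img is None:
        return "unreadable"
    h, w = img.shape[:2]
    if min(h, w) < min_side:
        return f"low resolution ({w}x{h})"
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Low variance of the Laplacian indicates few sharp edges, i.e. blur.
    if cv2.Laplacian(gray, cv2.CV_64F).var() < blur_threshold:
        return "possibly blurry"
    return "ok"

print(check_image("dataset_name/images/train/image1.jpg"))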

Dataset Size

The appropriate size depends on the task and complexity:

  • Object Detection: Minimum 1,000 images, ideally 5,000+ for complex scenarios.
  • Classification: At least 100 images per class, ideally 1,000+.
  • Segmentation: Minimum 500 images, ideally 2,000+ for complex scenes.
  • Specialized Datasets: Even smaller datasets (500+ images) can be valuable if they address a specific niche.

Tip: Quality matters more than quantity. A smaller, high-quality dataset is often more valuable than a larger, poorly curated one.

Annotation

Annotation Accuracy

Accurate annotations are crucial for training effective models:

  • Precision: Bounding boxes should tightly fit the objects with minimal padding.
  • Consistency: Apply consistent annotation rules across the entire dataset.
  • Completeness: Annotate all relevant objects in each image.
  • Validation: Review annotations for errors, especially for automated or outsourced annotations.

Note: For object detection, aim for IoU (Intersection over Union) of at least 0.85 between your annotations and the actual objects.
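
For reference, IoU between two [x, y, width, height] boxes can be computed in a few lines:

# IoU of two axis-aligned boxes given as [x, y, width, height].
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou([100, 150, 200, 300], [110, 160, 200, 300]))  # ~0.85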

Annotation Guidelines

Establish clear guidelines for annotation to ensure consistency:

  • Class Definitions: Clearly define what constitutes each class or category.
  • Boundary Rules: Define how to handle partial objects, occlusions, and group instances.
  • Minimum Size: Establish minimum object sizes for annotation (e.g., at least 10x10 pixels).
  • Special Cases: Document how to handle ambiguous or edge cases.

Tip: Include your annotation guidelines in the dataset documentation to help users understand your methodology.

Annotation Tools

Use appropriate tools for your annotation task:

  • Object Detection: Tools like CVAT, LabelImg, or Roboflow.
  • Segmentation: Tools like CVAT, LabelMe, or Supervisely.
  • Classification: Tools like LabelImg or custom scripts for bulk labeling.
  • Keypoints: Specialized tools like OpenPose or CVAT with keypoint configuration.

Tip: Document which annotation tool and version you used, as this can help users understand any tool-specific quirks in the annotations.

Organization

File Structure

A well-organized file structure makes your dataset more usable:

dataset_name/
├── images/
│   ├── train/
│   ├── val/
│   └── test/
├── annotations/
│   ├── train/
│   ├── val/
│   └── test/
├── metadata.json
├── classes.txt
└── README.md

Key principles for file organization:

  • Separation of Concerns: Keep images, annotations, and metadata in separate directories.
  • Split Organization: Clearly separate training, validation, and test sets.
  • Consistent Naming: Use consistent file naming conventions.
  • Flat Hierarchies: Avoid deeply nested directories unless necessary.

Data Splits

Properly splitting your dataset is crucial for model development:

  • Training Set: 70-80% of the data, used to train the model.
  • Validation Set: 10-15% of the data, used to tune hyperparameters and prevent overfitting.
  • Test Set: 10-15% of the data, used only for final evaluation.

Ensure that each split is representative of the overall dataset distribution. Random splitting is often sufficient, but stratified splitting (maintaining class balance) is better for imbalanced datasets.

Important: Ensure there is no overlap or data leakage between splits. Images that are very similar (e.g., frames from the same video) should be in the same split.
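
Here is a minimal sketch of a stratified 70/15/15 split, assuming one label per image and scikit-learn installed; the file names and labels are placeholders:

# Stratified 70/15/15 split that preserves per-class proportions.
from sklearn.model_selection import train_test_split

images = [f"img_{i:03d}.jpg" for i in range(100)]                # placeholders
labels = ["car" if i % 2 == 0 else "truck" for i in range(100)]  # placeholders

# Carve out 30% for val + test, keeping class balance.
train_x, rest_x, train_y, rest_y = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42)
# Split the remainder evenly: 15% val, 15% test overall.
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=42)

print(len(train_x), len(val_x), len(test_x))  # 70 15 15

Splitting in two stages keeps the class balance intact at each step.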

Metadata Organization

Organize metadata to make it easily accessible:

  • Central Metadata File: Include a central JSON or YAML file with dataset-level metadata.
  • Class Information: Provide a separate file listing all classes with descriptions.
  • Statistics: Include basic statistics about the dataset (number of images, class distribution, etc.).
  • Version Information: Clearly indicate the dataset version and date.

Tip: Use VisionHowl's metadata templates to ensure you're capturing all necessary information.
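
As an illustration, the sketch below generates a central metadata.json with basic statistics. The field names are assumptions for this example; consult VisionHowl's metadata templates for the authoritative schema:

# Write dataset-level metadata with per-split image counts.
import json
from datetime import date
from pathlib import Path

root = Path("dataset_name")
splits = {s: len(list((root / "images" / s).glob("*.jpg")))
          for s in ("train", "val", "test")}

metadata = {
    "name": root.name,
    "version": "1.0",
    "date": date.today().isoformat(),
    "num_images": sum(splits.values()),
    "splits": splits,
    "classes": (root / "classes.txt").read_text().splitlines(),
}
(root / "metadata.json").write_text(json.dumps(metadata, indent=2))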

Documentation

README File

A comprehensive README file is essential for any dataset:

  • Dataset Overview: Brief description of the dataset and its purpose.
  • Contents: Description of what's included (number of images, classes, etc.).
  • Structure: Explanation of the file structure and organization.
  • Collection Methodology: How the data was collected and annotated.
  • Usage: Instructions for using the dataset, including code examples if possible.
  • License: Clear information about usage rights and restrictions.
  • Citation: How users should cite your dataset if they use it in their work.
  • Contact: How to reach you for questions or issues.

Tip: Use markdown formatting in your README for better readability. Include a table of contents for longer documentation.

Class Descriptions

Provide detailed information about each class in your dataset:

  • Class Definition: Clear definition of what constitutes each class.
  • Visual Examples: Include example images for each class.
  • Edge Cases: Explain how ambiguous cases were handled.
  • Hierarchy: If applicable, describe the class hierarchy or relationships.

# Example class description format
{
  "classes": [
    {
      "id": 0,
      "name": "car",
      "description": "Four-wheeled motor vehicles designed for passenger transportation.",
      "examples": ["example_car1.jpg", "example_car2.jpg"],
      "notes": "Includes sedans, SUVs, and hatchbacks. Excludes trucks and buses."
    },
    ...
  ]
}

Limitations and Biases

Transparently document the limitations and potential biases in your dataset:

  • Coverage Limitations: What scenarios or variations are underrepresented?
  • Known Biases: Are there demographic, geographic, or other biases?
  • Quality Issues: Are there known quality issues in certain subsets?
  • Annotation Limitations: What are the limitations of your annotation approach?

Note: Being transparent about limitations doesn't devalue your dataset—it helps users understand where and how to use it appropriately.

Quality Checks

Automated Validation

Run automated checks on your dataset before submission:

  • Format Validation: Ensure annotations follow the correct format specification.
  • Reference Integrity: Check that all referenced files exist and are valid.
  • Duplicate Detection: Identify and remove duplicate or near-duplicate images (see the sketch after this list).
  • Annotation Consistency: Check for inconsistent or missing annotations.

Tip: VisionHowl provides a quality check tool that you can use before submission to identify potential issues.
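
Exact duplicates can be caught by hashing file contents, as in the sketch below; near-duplicates (resized or re-encoded copies) require perceptual hashing, for example via the third-party imagehash package:

# Group images by content hash; any group larger than one is a duplicate set.
import hashlib
from collections import defaultdict
from pathlib import Path

groups = defaultdict(list)
for path in Path("dataset_name/images").rglob("*.jpg"):
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    groups[digest].append(path)

for paths in groups.values():
    if len(paths) > 1:
        print("duplicates:", [str(p) for p in paths])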

Manual Review

Complement automated checks with manual review:

  • Random Sampling: Manually review a random sample (at least 5%) of your dataset (see the sketch after this list).
  • Edge Case Review: Pay special attention to potential edge cases or unusual examples.
  • Cross-Validation: Have multiple annotators review the same subset to check for consistency.
  • Documentation Review: Ensure documentation is clear, complete, and accurate.

Important: Manual review is especially important for detecting subtle issues that automated checks might miss, such as semantic errors in annotations.
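
Drawing the review sample with a fixed seed keeps the review reproducible; a minimal sketch (paths are illustrative):

# Select a reproducible 5% sample of images for manual review.
import random
from pathlib import Path

images = sorted(Path("dataset_name/images/train").glob("*.jpg"))
random.seed(42)  # fixed seed so the same sample can be re-drawn later
for path in random.sample(images, max(1, len(images) // 20)):
    print(path)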

Pre-submission Checklist

Use this checklist before submitting your dataset:

  • All annotation files pass format validation for your chosen format (COCO, YOLO, or Pascal VOC).
  • Metadata is complete, including class information, statistics, and version.
  • Preview images are high quality and representative of the dataset.
  • README, class descriptions, and known limitations are documented.
  • Training, validation, and test splits are representative, with no overlap or leakage.
  • Duplicate and corrupt files have been removed.
  • A random sample (at least 5%) of the dataset has been manually reviewed.
  • Privacy, consent, and licensing requirements are satisfied.

Format Standards

COCO Format

The COCO (Common Objects in Context) format is widely used for object detection, segmentation, and keypoint detection:

{
  "info": {
    "year": 2023,
    "version": "1.0",
    "description": "Dataset description",
    "contributor": "Your Name",
    "url": "https://example.com",
    "date_created": "2023-06-15"
  },
  "licenses": [
    {
      "id": 1,
      "name": "Attribution-NonCommercial",
      "url": "https://creativecommons.org/licenses/by-nc/4.0/"
    }
  ],
  "images": [
    {
      "id": 1,
      "width": 640,
      "height": 640,
      "file_name": "image1.jpg",
      "license": 1,
      "date_captured": "2023-01-15"
    }
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 3,
      "bbox": [100, 150, 200, 300],  // [x, y, width, height]
      "area": 60000,
      "segmentation": [...],  // Optional
      "iscrowd": 0
    }
  ],
  "categories": [
    {
      "id": 3,
      "name": "car",
      "supercategory": "vehicle"
    }
  ]
}

Key requirements for COCO format:

  • Use a single JSON file for all annotations
  • Ensure all IDs are unique and consistent
  • Bounding boxes should be in [x, y, width, height] format
  • Include all required fields for each section
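
Several of these checks can be scripted directly. The sketch below covers unique ids, reference integrity, and bbox sanity; the file path is illustrative, and VisionHowl's quality check tool should still be run before submission:

# Basic structural checks on a COCO annotation file.
import json

with open("annotations/instances_train.json") as f:
    coco = json.load(f)

image_ids = {img["id"] for img in coco["images"]}
category_ids = {cat["id"] for cat in coco["categories"]}
assert len(image_ids) == len(coco["images"]), "duplicate image ids"

for ann in coco["annotations"]:
    assert ann["image_id"] in image_ids, f"annotation {ann['id']}: unknown image_id"
    assert ann["category_id"] in category_ids, f"annotation {ann['id']}: unknown category_id"
    x, y, w, h = ann["bbox"]
    assert w > 0 and h > 0, f"annotation {ann['id']}: degenerate bbox"

print("basic COCO checks passed")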

YOLO Format

The YOLO format is simple and efficient for object detection:

# One text file per image, with the same name as the image
# Each line represents one object
# Format: class_id x_center y_center width height
# All values are normalized to [0, 1]

0 0.507 0.428 0.245 0.458
2 0.715 0.331 0.187 0.295

Key requirements for YOLO format:

  • Create a separate text file for each image
  • Use the same filename as the image (but with .txt extension)
  • Normalize all coordinates to [0, 1]
  • Use center coordinates (not top-left)
  • Include a classes.txt file listing all class names
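
As an example of the coordinate convention, the sketch below converts a pixel-space [x, y, width, height] box (as in the COCO example above) into a YOLO label line. Note that YOLO class ids are zero-based indices into classes.txt and need not match COCO category ids:

# Convert a top-left [x, y, width, height] pixel box to a YOLO label line.
def to_yolo(class_id, bbox, img_w, img_h):
    x, y, w, h = bbox
    x_center = (x + w / 2) / img_w
    y_center = (y + h / 2) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {w / img_w:.6f} {h / img_h:.6f}"

print(to_yolo(0, [100, 150, 200, 300], 640, 640))
# 0 0.312500 0.468750 0.312500 0.468750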

Pascal VOC Format

The Pascal VOC format uses XML files for annotations:

<?xml version="1.0" encoding="UTF-8"?>
<annotation>
  <folder>images</folder>
  <filename>image1.jpg</filename>
  <path>/path/to/image1.jpg</path>
  <source>
    <database>Unknown</database>
  </source>
  <size>
    <width>640</width>
    <height>640</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>car</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>100</xmin>
      <ymin>150</ymin>
      <xmax>300</xmax>
      <ymax>450</ymax>
    </bndbox>
  </object>
</annotation>

Key requirements for Pascal VOC format:

  • Create a separate XML file for each image
  • Use absolute pixel coordinates (not normalized)
  • Include all required XML elements
  • Use consistent class names across all annotations
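
Pascal VOC annotations can be read with the Python standard library alone; a minimal sketch (the path is illustrative):

# Extract class names and boxes from a Pascal VOC XML file.
import xml.etree.ElementTree as ET

root = ET.parse("annotations/train/image1.xml").getroot()
for obj in root.iter("object"):
    name = obj.findtext("name")
    box = obj.find("bndbox")
    xmin, ymin, xmax, ymax = (int(box.findtext(tag))
                              for tag in ("xmin", "ymin", "xmax", "ymax"))
    print(name, xmin, ymin, xmax, ymax)  # car 100 150 300 450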

Ethical Considerations

Privacy and Consent

Respect privacy and obtain proper consent:

  • Personal Information: Remove or blur personally identifiable information (faces, license plates, etc.) unless you have explicit consent (see the sketch after this list).
  • Consent: Ensure you have appropriate consent for any identifiable individuals in your dataset.
  • Location Data: Be cautious with geolocation data that could compromise privacy.
  • Documentation: Document your privacy protection measures in the dataset documentation.

Important: VisionHowl requires that all datasets comply with privacy laws and ethical guidelines. Datasets containing unauthorized personal information will be rejected.
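
One simple anonymization step is blurring a known region before publishing. The sketch below assumes OpenCV and a hand-specified box; detecting faces or plates automatically is a separate step:

# Blur a rectangular region (e.g., a face or license plate) in place.
import cv2

img = cv2.imread("dataset_name/images/train/image1.jpg")
x, y, w, h = 120, 80, 60, 60  # region to anonymize (illustrative)
img[y:y + h, x:x + w] = cv2.GaussianBlur(img[y:y + h, x:x + w], (51, 51), 0)
cv2.imwrite("dataset_name/images/train/image1_blurred.jpg", img)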

Bias and Fairness

Address potential biases in your dataset:

  • Representation: Ensure diverse representation across relevant dimensions (e.g., demographics for human subjects).
  • Balance: Aim for balanced representation to avoid reinforcing existing biases.
  • Transparency: Document any known biases or limitations in representation.
  • Context: Consider the social and cultural context of your data collection.

Note: Even with best efforts, datasets may contain unintended biases. Documenting known limitations helps users apply appropriate mitigations.

Licensing and Attribution

Ensure proper licensing and attribution:

  • Source Attribution: Properly attribute any third-party data sources used in your dataset.
  • Clear Licensing: Provide clear licensing terms for your dataset.
  • Derivative Works: If your dataset is derived from other datasets, ensure you comply with their licenses.
  • Commercial Use: Clearly specify whether your dataset can be used for commercial purposes.

Tip: Consider using standard licenses like Creative Commons or Open Data Commons to make your usage terms clear and standardized.

Environmental Impact

Consider the environmental impact of your dataset:

  • Dataset Size: Balance comprehensiveness with efficiency. Unnecessarily large datasets consume more storage and energy to process.
  • Compression: Use appropriate compression to reduce storage and transfer requirements (see the sketch after this list).
  • Efficiency: Design your dataset to enable efficient training (e.g., by including pre-processed versions).

Note: Environmental considerations are becoming increasingly important in AI research and development. A thoughtfully designed dataset can reduce the carbon footprint of models trained on it.
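
As one example, re-encoding with Pillow at a moderate JPEG quality often cuts file size substantially with little visible loss; quality=85 is a common starting point, not a fixed rule:

# Re-encode an image as JPEG with moderate compression.
from PIL import Image

img = Image.open("dataset_name/images/train/image1.png")
img.convert("RGB").save("dataset_name/images/train/image1.jpg", "JPEG", quality=85)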

Conclusion

Final Thoughts

Creating high-quality datasets is both an art and a science. By following these guidelines, you'll be able to create datasets that are valuable, usable, and ethical. Remember that the quality of your dataset directly impacts the quality of models trained on it, so investing time in dataset creation and curation is well worth the effort.

VisionHowl is committed to maintaining a marketplace of high-quality datasets. By adhering to these guidelines, you'll not only increase the chances of your dataset being approved but also maximize its value and impact in the computer vision community.

If you have any questions or need assistance with dataset creation, don't hesitate to contact our support team.