Add Data Collection and Annotation Docs Page and Preprocessing Annotated Data Docs Page (#13253)

9 months ago · bac868c635
parent cbcb494cfc
commit bac868c635
5 changed files with 325 additions and 2 deletions
--- a/docs/en/guides/data-collection-and-annotation.md
+++ b/docs/en/guides/data-collection-and-annotation.md
@@ -0,0 +1,165 @@
+---
+comments: true
+description: Data collection and annotation are vital steps in any computer vision project. Explore the tools, techniques, and best practices for collecting and annotating data.
+keywords: What is Data Annotation, Data Annotation Tools, Annotating Data, Avoiding Bias in Data Collection, Ethical Data Collection, Annotation Strategies
+---
+
+# Data Collection and Annotation Strategies for Computer Vision
+
+## Introduction
+
+Success in any [computer vision project](./steps-of-a-cv-project.md) starts with effective data collection and annotation strategies. The quality of the data directly impacts model performance, so it’s important to understand the best practices for both data collection and data annotation.
+
+Every consideration regarding the data should closely align with [your project's goals](./defining-project-goals.md). Changes in your annotation strategies could shift the project's focus or effectiveness and vice versa. With this in mind, let's take a closer look at the best ways to approach data collection and annotation.
+
+## Setting Up Classes and Collecting Data
+
+Collecting images and video for a computer vision project involves defining the number of classes, sourcing data, and considering ethical implications. Before you start gathering data, you need to be clear about the points covered in the following sections.
+
+### Choosing the Right Classes for Your Project
+
+One of the first questions when starting a computer vision project is how many classes to include. You need to determine class membership, that is, the different categories or labels that you want your model to recognize and differentiate. The number of classes should be determined by the specific goals of your project.
+
+For example, if you want to monitor traffic, your classes might include "car," "truck," "bus," "motorcycle," and "bicycle." On the other hand, for tracking items in a store, your classes could be "fruits," "vegetables," "beverages," and "snacks." Defining classes based on your project goals helps keep your dataset relevant and focused.
+
+When you define your classes, another important distinction to make is whether to choose coarse or fine class counts. 'Count' refers to the number of distinct classes you are interested in. This decision influences the granularity of your data and the complexity of your model. Here are the considerations for each approach:
+
+- **Coarse Class-Count**: These are broader, more inclusive categories, such as "vehicle" and "non-vehicle." They simplify annotation and require fewer computational resources but provide less detailed information, potentially limiting the model's effectiveness in complex scenarios.
+- **Fine Class-Count**: More categories with finer distinctions, such as "sedan," "SUV," "pickup truck," and "motorcycle." They capture more detailed information, improving model accuracy and performance. However, they are more time-consuming and labor-intensive to annotate and require more computational resources.
+
+Starting with more specific classes can be very helpful, especially in complex projects where details are important. More specific classes let you collect more detailed data and gain deeper insights and clearer distinctions between categories. This not only improves the accuracy of the model but also makes it easier to adjust the model later if needed, saving both time and resources.
+
+### Sources of Data
+
+You can use public datasets or gather your own custom data. Public datasets like those on [Kaggle](https://www.kaggle.com/datasets) and [Google Dataset Search](https://datasetsearch.research.google.com/) often offer well-annotated, standardized data, making them great starting points for training and validating models.
+
+Custom data collection, on the other hand, allows you to customize your dataset to your specific needs. You might capture images and videos with cameras or drones, scrape the web for images, or use existing internal data from your organization. Custom data gives you more control over its quality and relevance. Combining both public and custom data sources helps create a diverse and comprehensive dataset.
+
+### Avoiding Bias in Data Collection
+
+Bias occurs when certain groups or scenarios are underrepresented or overrepresented in your dataset. It leads to a model that performs well on some data but poorly on others. It's crucial to avoid bias so that your computer vision model can perform well in a variety of scenarios. 
+
+Here is how you can avoid bias while collecting data:
+
+- **Diverse Sources**: Collect data from many sources to capture different perspectives and scenarios.
+- **Balanced Representation**: Include balanced representation from all relevant groups. For example, consider different ages, genders, and ethnicities.
+- **Continuous Monitoring**: Regularly review and update your dataset to identify and address any emerging biases.
+- **Bias Mitigation Techniques**: Use methods like oversampling underrepresented classes, data augmentation, and fairness-aware algorithms.
+
+Following these practices helps create a more robust and fair model that can generalize well in real-world applications.
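+
+As a rough illustration of the balanced-representation check above, the sketch below counts how often each class appears and flags classes that fall below a chosen share. The function names, the 10% threshold, and the sample labels are all hypothetical; a suitable threshold depends on your project.
+
+```python
+from collections import Counter
+
+
+def class_shares(labels):
+    """Return each class's share of all annotated objects."""
+    counts = Counter(labels)
+    total = sum(counts.values())
+    return {name: count / total for name, count in counts.items()}
+
+
+def underrepresented(labels, min_share=0.1):
+    """List classes whose share falls below min_share (an arbitrary threshold)."""
+    return sorted(name for name, share in class_shares(labels).items() if share < min_share)
+
+
+# Hypothetical labels: one entry per annotated object.
+labels = ["car"] * 80 + ["truck"] * 15 + ["bicycle"] * 5
+print(underrepresented(labels))
+```
+
+Classes flagged this way are candidates for additional collection or for the oversampling and augmentation techniques listed above.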
+
+## What is Data Annotation?
+
+Data annotation is the process of labeling data to make it usable for training machine learning models. In computer vision, this means labeling images or videos with the information that a model needs to learn from. Without properly annotated data, models cannot accurately learn the relationships between inputs and outputs.
+
+### Types of Data Annotation
+
+Depending on the specific requirements of a [computer vision task](../tasks/index.md), there are different types of data annotation. Here are some examples:
+
+- **Bounding Boxes**: Rectangular boxes drawn around objects in an image, used primarily for object detection tasks. These boxes are defined by their top-left and bottom-right coordinates.
+- **Polygons**: Detailed outlines for objects, allowing for more precise annotation than bounding boxes. Polygons are used in tasks like instance segmentation, where the shape of the object is important.
+- **Masks**: Binary masks where each pixel is either part of an object or the background. Masks are used in semantic segmentation tasks to provide pixel-level detail.
+- **Keypoints**: Specific points marked within an image to identify locations of interest. Keypoints are used in tasks like pose estimation and facial landmark detection.
+
+<p align="center">
+  <img width="100%" src="https://labelyourdata.com/img/article-illustrations/types_of_da_light.jpg" alt="Types of Data Annotation">
+</p>
+
+### Common Annotation Formats
+
+After selecting a type of annotation, it’s important to choose the appropriate format for storing and sharing annotations. 
+
+Commonly used formats include [COCO](../datasets/detect/coco.md), which supports various annotation types like object detection, keypoint detection, stuff segmentation, panoptic segmentation, and image captioning, stored in JSON. [Pascal VOC](../datasets/detect/voc.md) uses XML files and is popular for object detection tasks. YOLO, on the other hand, creates a .txt file for each image, with one line per object containing the object class, the box center coordinates, and the box width and height, making it suitable for object detection.
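+
+To make the difference between these formats concrete, the sketch below converts a Pascal VOC-style corner box (`xmin, ymin, xmax, ymax` in pixels) into a YOLO-style label line with a class index followed by a normalized center, width, and height. The function name `voc_box_to_yolo_line` and the sample numbers are hypothetical, and the exact layout should be checked against the format reference for your tool.
+
+```python
+def voc_box_to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
+    """Return 'class x_center y_center width height' with values normalized to [0, 1]."""
+    x_center = (xmin + xmax) / 2 / img_w
+    y_center = (ymin + ymax) / 2 / img_h
+    width = (xmax - xmin) / img_w
+    height = (ymax - ymin) / img_h
+    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
+
+
+# A 200x100 pixel box with its top-left corner at (100, 50) in a 640x480 image.
+print(voc_box_to_yolo_line(0, 100, 50, 300, 150, 640, 480))
+# prints: 0 0.312500 0.208333 0.312500 0.208333
+```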
+
+### Techniques of Annotation
+
+Now, assuming you've chosen a type of annotation and format, it's time to establish clear and objective labeling rules. These rules are like a roadmap for consistency and accuracy throughout the annotation process. Key aspects of these rules include:
+
+- **Clarity and Detail**: Make sure your instructions are clear. Use examples and illustrations to show what's expected.
+- **Consistency**: Keep your annotations uniform. Set standard criteria for annotating different types of data, so all annotations follow the same rules.
+- **Reducing Bias**: Stay neutral. Train yourself to be objective and minimize personal biases to ensure fair annotations.
+- **Efficiency**: Work smarter, not harder. Use tools and workflows that automate repetitive tasks, making the annotation process faster and more efficient.
+
+Regularly reviewing and updating your labeling rules will help keep your annotations accurate, consistent, and aligned with your project goals.
+
+### Popular Annotation Tools
+
+Let's say you are ready to annotate now. There are several open-source tools available to help streamline the data annotation process. Here are some useful open annotation tools: 
+
+- **[Label Studio](https://github.com/HumanSignal/label-studio)**: A flexible tool that supports a wide range of annotation tasks and includes features for managing projects and quality control.
+- **[CVAT](https://github.com/cvat-ai/cvat)**: A powerful tool that supports various annotation formats and customizable workflows, making it suitable for complex projects.
+- **[Labelme](https://github.com/labelmeai/labelme)**: A simple and easy-to-use tool that allows for quick annotation of images with polygons, making it ideal for straightforward tasks.
+
+<p align="center">
+  <img width="100%" src="https://github.com/labelmeai/labelme/raw/main/examples/instance_segmentation/.readme/annotation.jpg" alt="LabelMe Overview">
+</p>
+
+These open-source tools are budget-friendly and provide a range of features to meet different annotation needs.
+
+### Some More Things to Consider Before Annotating Data
+
+Before you dive into annotating your data, there are a few more things to keep in mind. You should be aware of accuracy, precision, outliers, and quality control to avoid labeling your data in a counterproductive manner. 
+
+#### Understanding Accuracy and Precision
+
+It's important to understand the difference between accuracy and precision and how it relates to annotation. Accuracy refers to how close the annotated data is to the true values. It helps us measure how closely the labels reflect real-world scenarios. Precision indicates the consistency of annotations. It checks if you are giving the same label to the same object or feature throughout the dataset. High accuracy and precision lead to better-trained models by reducing noise and improving the model's ability to generalize from the training data.
+
+<p align="center">
+  <img width="100%" src="https://keylabs.ai/blog/content/images/size/w1600/2023/12/new26-3.jpg" alt="Example of Precision">
+</p>
+
+#### Identifying Outliers
+
+Outliers are data points that deviate quite a bit from other observations in the dataset. With respect to annotations, an outlier could be an incorrectly labeled image or an annotation that doesn't fit with the rest of the dataset. Outliers are concerning because they can distort the model's learning process, leading to inaccurate predictions and poor generalization.
+
+You can use various methods to detect and correct outliers:
+
+- **Statistical Techniques**: To detect outliers in numerical features like pixel values, bounding box coordinates, or object sizes, you can use methods such as box plots, histograms, or z-scores.
+- **Visual Techniques**: To spot anomalies in categorical features like object classes, colors, or shapes, use visual methods like plotting images, labels, or heat maps.
+- **Algorithmic Methods**: Use tools like clustering (e.g., K-means clustering, DBSCAN) and anomaly detection algorithms to identify outliers based on data distribution patterns.
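+
+The z-score idea from the first bullet can be sketched as follows, using only the Python standard library. The function name `zscore_outliers`, the threshold of 3, and the sample box areas are assumptions for illustration.
+
+```python
+from statistics import mean, pstdev
+
+
+def zscore_outliers(values, threshold=3.0):
+    """Return indices of values whose absolute z-score exceeds the threshold."""
+    mu = mean(values)
+    sigma = pstdev(values)
+    if sigma == 0:
+        return []  # all values identical, so nothing stands out
+    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]
+
+
+# Hypothetical bounding box areas in pixels; the last one is suspiciously large.
+areas = [1200, 1150, 1300, 1250, 1180, 1220, 1270, 1190, 1240, 1210, 1230, 50000]
+print(zscore_outliers(areas))
+```
+
+On small samples a single extreme value inflates the standard deviation, so a flagged index is a prompt to inspect the annotation rather than proof that it is wrong.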
+
+#### Quality Control of Annotated Data
+
+Just like other technical projects, quality control is a must for annotated data. It is a good practice to regularly check annotations to make sure they are accurate and consistent. This can be done in a few different ways:
+
+- Reviewing samples of annotated data
+- Using automated tools to spot common errors
+- Having another person double-check the annotations
+
+If you are working with multiple people, consistency between different annotators is important. Good inter-annotator agreement means that the guidelines are clear and everyone is following them the same way. It keeps everyone on the same page and the annotations consistent.
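+
+For bounding boxes, one simple way to put a number on agreement between two annotators is the intersection over union (IoU) of their boxes for the same object. This is only one possible agreement measure, and the function name `box_iou` and the sample boxes below are hypothetical.
+
+```python
+def box_iou(a, b):
+    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
+    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
+    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
+    inter = inter_w * inter_h
+    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
+    return inter / union if union > 0 else 0.0
+
+
+# Hypothetical boxes drawn by two annotators around the same object.
+annotator_a = (10, 10, 110, 110)
+annotator_b = (20, 20, 120, 120)
+print(round(box_iou(annotator_a, annotator_b), 3))
+```
+
+Consistently low values for the same object suggest that the labeling guidelines leave room for interpretation.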
+
+While reviewing, if you find errors, correct them and update the guidelines to avoid future mistakes. Provide feedback to annotators and offer regular training to help reduce errors. Having a strong process for handling errors keeps your dataset accurate and reliable.
+
+## FAQs
+
+Here are some questions that you might encounter while collecting and annotating data:
+
+- **Q1:** What is active learning in the context of data annotation?
+    - **A1:** Active learning in data annotation is a technique where a machine learning model iteratively selects the most informative data points for labeling. This improves the model's performance with fewer labeled examples. By focusing on the most valuable data, active learning accelerates the training process and improves the model's ability to generalize from limited data.
+
+<p align="center">
+  <img width="100%" src="https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/63b413cc43a073846453dca4_633a98dcd9b9793e1eebdfb6_HERO_Active%2520Learning%2520.png" alt="Overview of Active Learning">
+</p>
+
+- **Q2:** How does automated annotation work?
+    - **A2:** Automated annotation uses pre-trained models and algorithms to label data without needing human effort. These models, which have been trained on large datasets, can identify patterns and features in new data. Techniques like transfer learning adjust these models for specific tasks, and active learning helps by selecting the most useful data points for labeling. However, this approach is only possible in certain cases where the model has been trained on sufficiently similar data and tasks.
+
+- **Q3:** How many images do I need to collect for [YOLOv8 custom training](../modes/train.md)?
+    - **A3:** For transfer learning and object detection, a good general rule of thumb is to have a minimum of a few hundred annotated objects per class. However, when training a model to detect just one class, it is advisable to start with at least 100 annotated images and train for around 100 epochs. For complex tasks, you may need thousands of images per class to achieve reliable model performance.
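+
+The selection step described in A1 can be sketched as least-confidence sampling, which is one common active learning strategy among several. The function name `least_confident` and the confidence scores are hypothetical; in practice the scores would come from your current model's predictions on unlabeled images.
+
+```python
+def least_confident(predictions, k):
+    """Return the k image ids whose top predicted confidence is lowest."""
+    ranked = sorted(predictions, key=lambda image_id: predictions[image_id])
+    return ranked[:k]
+
+
+# Hypothetical mapping from unlabeled image id to the model's top confidence.
+predictions = {"img_001": 0.97, "img_002": 0.41, "img_003": 0.88, "img_004": 0.55}
+print(least_confident(predictions, 2))
+```
+
+The selected images are the ones to send to annotators next, after which the model is retrained and the loop repeats.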
+
+## Share Your Thoughts with the Community
+
+Bouncing your ideas and queries off other computer vision enthusiasts can help accelerate your projects. Here are some great ways to learn, troubleshoot, and network:
+
+### Where to Find Help and Support
+
+- **GitHub Issues:** Visit the YOLOv8 GitHub repository and use the [Issues tab](https://github.com/ultralytics/ultralytics/issues) to raise questions, report bugs, and suggest features. The community and maintainers are there to help with any issues you face.
+- **Ultralytics Discord Server:** Join the [Ultralytics Discord server](https://ultralytics.com/discord/) to connect with other users and developers, get support, share knowledge, and brainstorm ideas.
+
+### Official Documentation
+
+- **Ultralytics YOLOv8 Documentation:** Refer to the [official YOLOv8 documentation](./index.md) for thorough guides and valuable insights on numerous computer vision tasks and projects.
+
+## Conclusion
+
+By following the best practices for collecting and annotating data, avoiding bias, and using the right tools and techniques, you can significantly improve your model's performance. Engaging with the community and using available resources will keep you informed and help you troubleshoot issues effectively. Remember, quality data is the foundation of a successful project, and the right strategies will help you build robust and reliable models.
--- a/docs/en/guides/index.md
+++ b/docs/en/guides/index.md
@@ -44,6 +44,8 @@ Here's a compilation of in-depth guides to help you master different aspects of
 - [OpenVINO Latency vs Throughput Modes](optimizing-openvino-latency-vs-throughput-modes.md) - Learn latency and throughput optimization techniques for peak YOLO inference performance.
 - [Steps of a Computer Vision Project ](steps-of-a-cv-project.md) 🚀 NEW: Learn about the key steps involved in a computer vision project, including defining goals, selecting models, preparing data, and evaluating results.
 - [Defining A Computer Vision Project's Goals](defining-project-goals.md) 🚀 NEW: Walk through how to effectively define clear and measurable goals for your computer vision project. Learn the importance of a well-defined problem statement and how it creates a roadmap for your project.
+- [Data Collection and Annotation](data-collection-and-annotation.md) 🚀 NEW: Explore the tools, techniques, and best practices for collecting and annotating data to create high-quality inputs for your computer vision models.
+- [Preprocessing Annotated Data](preprocessing_annotated_data.md) 🚀 NEW: Learn about preprocessing and augmenting image data in computer vision projects using YOLOv8, including normalization, dataset augmentation, splitting, and exploratory data analysis (EDA).

 ## Contribute to Our Guides

--- a/docs/en/guides/preprocessing_annotated_data.md
+++ b/docs/en/guides/preprocessing_annotated_data.md
@@ -0,0 +1,154 @@
+---
+comments: true
+description: Data preprocessing and augmentation help prepare datasets for model training in computer vision projects. Learn about various techniques for preprocessing annotated data.
+keywords: What is Data Preprocessing, Data Preprocessing Techniques, What is Data Augmentation, Data Augmentation Methods, Benefits of Data Augmentation
+---
+
+# Data Preprocessing Techniques for Annotated Computer Vision Data
+
+## Introduction
+
+After you’ve defined your computer vision [project’s goals](./defining-project-goals.md) and [collected and annotated data](./data-collection-and-annotation.md), the next step is to preprocess annotated data and prepare it for model training. Clean and consistent data are vital to creating a model that performs well. 
+
+Preprocessing is a step in the [computer vision project workflow](./steps-of-a-cv-project.md) that includes resizing images, normalizing pixel values, augmenting the dataset, and splitting the data into training, validation, and test sets. Let’s explore the essential techniques and best practices for cleaning your data!
+
+## Importance of Data Preprocessing
+
+We are already collecting and annotating our data carefully with multiple considerations in mind. Then, what makes data preprocessing so important to a computer vision project? Well, data preprocessing is all about getting your data into a suitable format for training that reduces the computational load and helps improve model performance. Here are some common issues in raw data that preprocessing addresses:
+
+- **Noise**: Irrelevant or random variations in data.
+- **Inconsistency**: Variations in image sizes, formats, and quality.
+- **Imbalance**: Unequal distribution of classes or categories in the dataset.
+
+## Data Preprocessing Techniques
+
+One of the first steps in data preprocessing is resizing. Some models are designed to handle variable input sizes, but many models require a consistent input size. Resizing images makes them uniform and reduces computational complexity.
+
+### Resizing Images
+
+You can resize your images using the following methods:
+
+- **Bilinear Interpolation**: Smooths pixel values by taking a weighted average of the four nearest pixel values.
+- **Nearest Neighbor**: Assigns the nearest pixel value without averaging, leading to a blocky image but faster computation.
+
+To make resizing a simpler task, you can use the following tools: 
+
+- **OpenCV**: A popular computer vision library with extensive functions for image processing.
+- **PIL (Pillow)**: A Python Imaging Library for opening, manipulating, and saving image files.
+
+With respect to YOLOv8, the ‘imgsz’ parameter during [model training](../modes/train.md) allows for flexible input sizes. When set to a specific size, such as 640, the model will resize input images so their largest dimension is 640 pixels while maintaining the original aspect ratio. 
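+
+The aspect-ratio-preserving resize described above comes down to a small calculation. The sketch below only mirrors that description; it is not taken from the Ultralytics code base, which may also pad the image, and the function name `scaled_size` is hypothetical.
+
+```python
+def scaled_size(width, height, imgsz=640):
+    """Scale (width, height) so the longer side equals imgsz, keeping aspect ratio."""
+    scale = imgsz / max(width, height)
+    return round(width * scale), round(height * scale)
+
+
+print(scaled_size(1920, 1080))  # landscape image
+print(scaled_size(480, 960))  # portrait image
+```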
+
+By evaluating your model's and dataset's specific needs, you can determine whether resizing is a necessary preprocessing step or if your model can efficiently handle images of varying sizes.
+
+### Normalizing Pixel Values
+
+Another preprocessing technique is normalization. Normalization scales the pixel values to a standard range, which helps in faster convergence during training and improves model performance. Here are some common normalization techniques:
+
+- **Min-Max Scaling**: Scales pixel values to a range of 0 to 1.
+- **Z-Score Normalization**: Scales pixel values based on their mean and standard deviation.
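+
+Both techniques can be sketched on a flat list of pixel values using only the standard library. The function names and sample values are hypothetical, and the min-max version assumes 8-bit pixels in the fixed range 0 to 255 rather than the observed minimum and maximum.
+
+```python
+from statistics import mean, pstdev
+
+
+def min_max_scale(pixels, lo=0, hi=255):
+    """Map values from [lo, hi] to [0, 1]; defaults assume 8-bit pixels."""
+    return [(p - lo) / (hi - lo) for p in pixels]
+
+
+def z_score_normalize(pixels):
+    """Shift to zero mean and scale to unit standard deviation."""
+    mu, sigma = mean(pixels), pstdev(pixels)
+    if sigma == 0:
+        return [0.0 for _ in pixels]  # constant input has no spread to scale
+    return [(p - mu) / sigma for p in pixels]
+
+
+pixels = [0, 64, 128, 255]  # hypothetical grayscale values
+print(min_max_scale(pixels))
+print([round(z, 3) for z in z_score_normalize(pixels)])
+```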
+
+With respect to YOLOv8, normalization is seamlessly handled as part of its preprocessing pipeline during model training. YOLOv8 automatically performs several preprocessing steps, including conversion to RGB, scaling pixel values to the range [0, 1], and normalization using predefined mean and standard deviation values. 
+
+### Splitting the Dataset
+
+Once you’ve cleaned the data, you are ready to split the dataset. Splitting the data into training, validation, and test sets is done to ensure that the model can be evaluated on unseen data to assess its generalization performance. A common split is 70% for training, 20% for validation, and 10% for testing. There are various tools and libraries that you can use to split your data like scikit-learn or TensorFlow. 
+
+Consider the following when splitting your dataset:
+
+- **Maintaining Data Distribution**: Ensure that the data distribution of classes is maintained across training, validation, and test sets.
+- **Avoiding Data Leakage**: Typically, data augmentation is done after the dataset is split. Data augmentation and any other preprocessing should only be applied to the training set to prevent information from the validation or test sets from influencing the model training.
+- **Balancing Classes**: For imbalanced datasets, consider techniques such as oversampling the minority class or undersampling the majority class within the training set.
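+
+A split that maintains the class distribution can be sketched without external libraries by splitting each class separately. The function name `stratified_split`, the file names, and the fixed seed are assumptions for illustration; libraries such as scikit-learn provide equivalent functionality.
+
+```python
+import random
+from collections import defaultdict
+
+
+def stratified_split(samples, ratios=(0.7, 0.2, 0.1), seed=0):
+    """Split (item, label) pairs into train/val/test while keeping class proportions."""
+    by_class = defaultdict(list)
+    for item, label in samples:
+        by_class[label].append(item)
+
+    rng = random.Random(seed)  # fixed seed so the split is reproducible
+    train, val, test = [], [], []
+    for label, items in sorted(by_class.items()):
+        rng.shuffle(items)
+        n_train = round(len(items) * ratios[0])
+        n_val = round(len(items) * ratios[1])
+        train += [(i, label) for i in items[:n_train]]
+        val += [(i, label) for i in items[n_train:n_train + n_val]]
+        test += [(i, label) for i in items[n_train + n_val:]]  # remainder goes to test
+    return train, val, test
+
+
+# Hypothetical dataset: 100 car images and 20 truck images.
+samples = [(f"car_{i}.jpg", "car") for i in range(100)]
+samples += [(f"truck_{i}.jpg", "truck") for i in range(20)]
+train, val, test = stratified_split(samples)
+print(len(train), len(val), len(test))
+```
+
+Splitting per class keeps the car-to-truck ratio the same in all three sets, which a plain random split does not guarantee for small classes.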
+
+### What is Data Augmentation?
+
+The most commonly discussed data preprocessing step is data augmentation. Data augmentation artificially increases the size of the dataset by creating modified versions of images. By augmenting your data, you can reduce overfitting and improve model generalization.
+
+Here are some other benefits of data augmentation:
+
+- **Creates a More Robust Dataset**: Data augmentation can make the model more robust to variations and distortions in the input data. This includes changes in lighting, orientation, and scale.
+- **Cost-Effective**: Data augmentation is a cost-effective way to increase the amount of training data without collecting and labeling new data.
+- **Better Use of Data**: Every available data point is used to its maximum potential by creating new variations.
+
+#### Data Augmentation Methods
+
+Common augmentation techniques include flipping, rotation, scaling, and color adjustments. Several libraries, such as Albumentations, Imgaug, and TensorFlow's ImageDataGenerator, can generate these augmentations.
+
+<p align="center">
+  <img width="100%" src="https://i0.wp.com/ubiai.tools/wp-content/uploads/2023/11/UKwFg.jpg?fit=2204%2C775&ssl=1" alt="Overview of Data Augmentations">
+</p>
+
+With respect to YOLOv8, you can [augment your custom dataset](../modes/train.md) by modifying the dataset configuration file, a .yaml file. In this file, you can add an augmentation section with parameters that specify how you want to augment your data.
+
+The [Ultralytics YOLOv8 repository](https://github.com/ultralytics/ultralytics/tree/main) supports a wide range of data augmentations. You can apply various transformations such as:
+
+- Random Crops
+- Flipping: Images can be flipped horizontally or vertically.
+- Rotation: Images can be rotated by specific angles.
+- Distortion
+
+Also, you can adjust the intensity of these augmentation techniques through specific parameters to generate more data variety.
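+
+When an augmentation is applied outside a training pipeline, the annotation has to be transformed together with the image. As a minimal sketch, a horizontal flip of a normalized YOLO-style box only mirrors the x center. The function name `hflip_yolo_box` and the sample box are hypothetical.
+
+```python
+def hflip_yolo_box(box):
+    """Mirror a normalized (class_id, x_center, y_center, width, height) box left to right."""
+    class_id, x_center, y_center, width, height = box
+    return (class_id, 1.0 - x_center, y_center, width, height)
+
+
+print(hflip_yolo_box((0, 0.25, 0.5, 0.2, 0.4)))
+```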
+
+## A Case Study of Preprocessing
+
+Consider a project aimed at developing a model to detect and classify different types of vehicles in traffic images using YOLOv8. We’ve collected traffic images and annotated them with bounding boxes and labels. 
+
+Here’s what each step of preprocessing would look like for this project:
+
+- Resizing Images: Since YOLOv8 handles flexible input sizes and performs resizing automatically, manual resizing is not required. The model will adjust the image size according to the specified ‘imgsz’ parameter during training.
+- Normalizing Pixel Values: YOLOv8 automatically normalizes pixel values to a range of 0 to 1 during preprocessing, so manual normalization is not required.
+- Splitting the Dataset: Divide the dataset into training (70%), validation (20%), and test (10%) sets using tools like scikit-learn. 
+- Data Augmentation: Modify the dataset configuration file (.yaml) to include data augmentation techniques such as random crops, horizontal flips, and brightness adjustments.
+
+These steps make sure the dataset is prepared without any potential issues and is ready for Exploratory Data Analysis (EDA).
+
+## Exploratory Data Analysis Techniques
+
+After preprocessing and augmenting your dataset, the next step is to gain insights through Exploratory Data Analysis. EDA uses statistical techniques and visualization tools to understand the patterns and distributions in your data. You can identify issues like class imbalances or outliers and make informed decisions about further data preprocessing or model training adjustments.
+
+### Statistical EDA Techniques 
+
+Statistical techniques often begin with calculating basic metrics such as mean, median, standard deviation, and range. These metrics provide a quick overview of your image dataset's properties, such as pixel intensity distributions. Understanding these basic statistics helps you grasp the overall quality and characteristics of your data, allowing you to spot any irregularities early on.
+
+### Visual EDA Techniques 
+
+Visualizations are key in EDA for image datasets. For example, class imbalance analysis helps determine whether certain classes are underrepresented in your dataset. Visualizing the distribution of image classes or categories using bar charts can quickly reveal any imbalances. Similarly, outliers can be identified using visualization tools like box plots, which highlight anomalies in pixel intensity or feature distributions. Outlier detection prevents unusual data points from skewing your results.
+
+Common tools for visualizations include:
+
+- Histograms and Box Plots: Useful for understanding the distribution of pixel values and identifying outliers.
+- Scatter Plots: Helpful for exploring relationships between image features or annotations.
+- Heatmaps: Effective for visualizing the distribution of pixel intensities or the spatial distribution of annotated features within images.
+
+### Using Ultralytics Explorer for EDA
+
+For a more advanced approach to EDA, you can use the Ultralytics Explorer tool. It offers robust capabilities for exploring computer vision datasets. By supporting semantic search, SQL queries, and vector similarity search, the tool makes it easy to analyze and understand your data. With Ultralytics Explorer, you can create embeddings for your dataset to find similar images, run SQL queries for detailed analysis, and perform semantic searches, all through a user-friendly graphical interface.
+
+<p align="center">
+  <img width="100%" src="https://github.com/AyushExel/assets/assets/15766192/1b5f3708-be3e-44c5-9ea3-adcd522dfc75" alt="Overview of Ultralytics Explorer">
+</p>
+
+## FAQs
+
+Here are some questions that might come up while you prepare your dataset:
+
+- **Q1:** How much preprocessing is too much?
+    - **A1:** Preprocessing is essential but should be balanced. Overdoing it can lead to loss of critical information, overfitting, increased complexity, and higher computational costs. Focus on necessary steps like resizing, normalization, and basic augmentation, adjusting based on model performance.
+
+- **Q2:** What are the common pitfalls in EDA?
+    - **A2:** Common pitfalls in Exploratory Data Analysis (EDA) include ignoring data quality issues like missing values and outliers, confirmation bias, overfitting visualizations, neglecting data distribution, and overlooking correlations. A systematic approach helps gain accurate and valuable insights.
+
+## Reach Out and Connect
+
+Having discussions about your project with other computer vision enthusiasts can give you new ideas from different perspectives. Here are some great ways to learn, troubleshoot, and network:
+
+### Channels to Connect with the Community
+
+- **GitHub Issues:** Visit the YOLOv8 GitHub repository and use the [Issues tab](https://github.com/ultralytics/ultralytics/issues) to raise questions, report bugs, and suggest features. The community and maintainers are there to help with any issues you face.
+- **Ultralytics Discord Server:** Join the [Ultralytics Discord server](https://ultralytics.com/discord/) to connect with other users and developers, get support, share knowledge, and brainstorm ideas.
+
+### Official Documentation
+
+- **Ultralytics YOLOv8 Documentation:** Refer to the [official YOLOv8 documentation](./index.md) for thorough guides and valuable insights on numerous computer vision tasks and projects.
+
+## Your Dataset Is Ready!
+
+Properly resized, normalized, and augmented data improves model performance by reducing noise and improving generalization. By following the preprocessing techniques and best practices outlined in this guide, you can create a solid dataset. With your preprocessed dataset ready, you can confidently proceed to the next steps in your project.
--- a/docs/en/guides/steps-of-a-cv-project.md
+++ b/docs/en/guides/steps-of-a-cv-project.md
@@ -87,7 +87,7 @@ However, if you choose to collect images or take your own pictures, you’ll nee
  <img width="100%" src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*VhpVAAJnvq5ZE_pv" alt="Different Types of Image Annotation">
 </p>

-Data annotation can be a time-consuming manual effort. Annotation tools can help make this process easier. Here are some useful open annotation tools: [LabeI Studio](https://github.com/HumanSignal/label-studio), [CVAT](https://github.com/cvat-ai/cvat), and [Labelme](https://github.com/labelmeai/labelme).
+[Data collection and annotation](./data-collection-and-annotation.md) can be a time-consuming manual effort. Annotation tools can help make this process easier. Here are some useful open annotation tools: [Label Studio](https://github.com/HumanSignal/label-studio), [CVAT](https://github.com/cvat-ai/cvat), and [Labelme](https://github.com/labelmeai/labelme).

 ## Step 3: Data Augmentation and Splitting Your Dataset

@@ -113,7 +113,7 @@ To understand your data better, you can use tools like [Matplotlib](https://matp
  <img width="100%" src="https://github.com/ultralytics/ultralytics/assets/15766192/feb1fe05-58c5-4173-a9ff-e611e3bba3d0" alt="The Ultralytics Explorer Tool">
 </p> 

-By properly understanding, splitting, and augmenting your data, you can develop a well-trained, validated, and tested model that performs well in real-world applications.
+By properly [understanding, splitting, and augmenting your data](./preprocessing_annotated_data.md), you can develop a well-trained, validated, and tested model that performs well in real-world applications.

 ## Step 4: Model Training

--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -322,6 +322,8 @@ nav:
      - OpenVINO Latency vs Throughput modes: guides/optimizing-openvino-latency-vs-throughput-modes.md
      - Steps of a Computer Vision Project: guides/steps-of-a-cv-project.md
      - Defining A Computer Vision Project's Goals: guides/defining-project-goals.md
+      - Data Collection and Annotation: guides/data-collection-and-annotation.md
+      - Preprocessing Annotated Data: guides/preprocessing_annotated_data.md
      - Explorer:
          - datasets/explorer/index.md
          - Explorer API: datasets/explorer/api.md