Test fix for #11

9 changed files with 29 additions and 1701 deletions
--- a/README.md
+++ b/README.md
@ -1,23 +1,10 @@
-# :sauropod: Grounding DINO 
-
---
-
-
-Grounding DINO Methods |  [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/IDEA-Research/GroundingDINO)
-[![arXiv](https://img.shields.io/badge/arXiv-2303.05499-b31b1b.svg)](https://arxiv.org/abs/2303.05499) 
-[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/wxWDt5UiwY8)
-
-Grounding DINO Demos |
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb)
-[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/cMa77r3YrDk)
-[![HuggingFace space](https://img.shields.io/badge/🤗-HuggingFace%20Space-cyan.svg)](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)
-[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/C4NqaRBz_Kw)
-
-Extensions | [Grounding DINO with Segment Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything); [Grounding DINO with Stable Diffusion](demo/image_editing_with_groundingdino_stablediffusion.ipynb); [Grounding DINO with GLIGEN](demo/image_editing_with_groundingdino_gligen.ipynb)
-
-
-
+# Grounding DINO 
+[📃Paper](https://arxiv.org/abs/2303.05499) | 
+[📽️Video](https://www.youtube.com/watch?v=wxWDt5UiwY8) |
+[📯Demo on Colab](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb) | 
+[🤗Demo on HF (Coming soon)]() 

+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb) \
 [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/zero-shot-object-detection-on-mscoco)](https://paperswithcode.com/sota/zero-shot-object-detection-on-mscoco?p=grounding-dino-marrying-dino-with-grounded) \
 [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/zero-shot-object-detection-on-odinw)](https://paperswithcode.com/sota/zero-shot-object-detection-on-odinw?p=grounding-dino-marrying-dino-with-grounded) \
 [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=grounding-dino-marrying-dino-with-grounded) \
@ -25,48 +12,37 @@ Extensions | [Grounding DINO with Segment Anything](https://github.com/IDEA-Rese



-Official PyTorch implementation of [Grounding DINO](https://arxiv.org/abs/2303.05499), a stronger open-set object detector. Code is available now!
+Official pytorch implementation of [Grounding DINO](https://arxiv.org/abs/2303.05499), a stronger open-set object detector. Code is available now!


-## :bulb: Highlight
+## Highlight

 - **Open-Set Detection.** Detect **everything** with language!
 - **High Performance.** COCO zero-shot **52.5 AP** (training without COCO data!). COCO fine-tune **63.0 AP**.
 - **Flexible.** Collaboration with Stable Diffusion for Image Editing.

-
-
-
-## :fire: News
- **`2023/04/08`**: We release [demos](demo/image_editing_with_groundingdino_gligen.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [GLIGEN](https://github.com/gligen/GLIGEN)  for more controllable image editings.
- **`2023/04/08`**: We release [demos](demo/image_editing_with_groundingdino_stablediffusion.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [Stable Diffusion](https://github.com/Stability-AI/StableDiffusion) for image editings.
- **`2023/04/06`**: We build a new demo by marrying GroundingDINO with [Segment-Anything](https://github.com/facebookresearch/segment-anything) named **[Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)** aims to support segmentation in GroundingDINO.
- **`2023/03/28`**: A YouTube [video](https://youtu.be/cMa77r3YrDk) about Grounding DINO and basic object detection prompt engineering. [[SkalskiP](https://github.com/SkalskiP)]
- **`2023/03/28`**: Add a [demo](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo) on Hugging Face Space!
- **`2023/03/27`**: Support CPU-only mode. Now the model can run on machines without GPUs.
- **`2023/03/25`**: A [demo](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb) for Grounding DINO is available at Colab. [[SkalskiP](https://github.com/SkalskiP)]
- **`2023/03/22`**: Code is available Now!
+## News
+[2023/03/27] Support CPU-only mode. Now the model can run on machines without GPUs.\
+[2023/03/25] A [demo](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb) for Grounding DINO is available at Colab. Thanks to @Piotr! \
+[2023/03/22] Code is available Now!

 <details open>
 <summary><font size="4">
 Description
 </font></summary>
- <a href="https://arxiv.org/abs/2303.05499">Paper</a> introduction.
 <img src=".asset/hero_figure.png" alt="ODinW" width="100%">
-Marrying <a href="https://github.com/IDEA-Research/GroundingDINO">Grounding DINO</a> and <a href="https://github.com/gligen/GLIGEN">GLIGEN</a>
-<img src="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/GD_GLIGEN.png" alt="gd_gligen" width="100%">
 </details>



-## :label: TODO 
+## TODO 

 - [x] Release inference code and demo.
 - [x] Release checkpoints.
- [x] Grounding DINO with Stable Diffusion and GLIGEN demos.
+- [ ] Grounding DINO with Stable Diffusion and GLIGEN demos.
 - [ ] Release training codes.

-## :hammer_and_wrench: Install 
+## Install 

 If you have a CUDA environment, please make sure the environment variable `CUDA_HOME` is set. The package will be compiled in CPU-only mode if CUDA is not available.

@ -74,7 +50,7 @@ If you have a CUDA environment, please make sure the environment variable `CUDA_
 pip install -e .
 ```

-## :arrow_forward: Demo
+## Demo

 ```bash
 CUDA_VISIBLE_DEVICES=6 python demo/inference_on_a_image.py \
@ -87,17 +63,7 @@ CUDA_VISIBLE_DEVICES=6 python demo/inference_on_a_image.py \
 ```
 See `demo/inference_on_a_image.py` for more details.

-**Web UI**
-
-We also provide a demo code to integrate Grounding DINO with Gradio Web UI. See the file `demo/gradio_app.py` for more details.
-
-**Notebooks**
-
- We release [demos](demo/image_editing_with_groundingdino_gligen.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [GLIGEN](https://github.com/gligen/GLIGEN)  for more controllable image editings.
- We release [demos](demo/image_editing_with_groundingdino_stablediffusion.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [Stable Diffusion](https://github.com/Stability-AI/StableDiffusion) for image editings.
-
-
-## :luggage: Checkpoints
+## Checkpoints

 <!-- insert a table -->
 <table>
@ -119,22 +85,13 @@ We also provide a demo code to integrate Grounding DINO with Gradio Web UI. See
      <td>Swin-T</td>
      <td>O365,GoldG,Cap4M</td>
      <td>48.4 (zero-shot) / 57.2 (fine-tune)</td>
-      <td><a href="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth">Github link</a> | <a href="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swint_ogc.pth">HF link</a></td>
+      <td><a href="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth">link</a></td>
      <td><a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/config/GroundingDINO_SwinT_OGC.py">link</a></td>
    </tr>
-    <tr>
-      <th>2</th>
-      <td>GroundingDINO-B</td>
-      <td>Swin-B</td>
-      <td>COCO,O365,GoldG,Cap4M,OpenImage,ODinW-35,RefCOCO</td>
-      <td>56.7 </td>
-      <td><a href="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth">Github link</a>  | <a href="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swinb_cogcoor.pth">HF link</a> 
-      <td><a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/config/GroundingDINO_SwinB.cfg.py">link</a></td>
-    </tr>
  </tbody>
 </table>

-## :medal_military: Results
+## Results

 <details open>
 <summary><font size="4">
@ -154,27 +111,24 @@ ODinW Object Detection Results
 <summary><font size="4">
 Marrying Grounding DINO with <a href="https://github.com/Stability-AI/StableDiffusion">Stable Diffusion</a> for Image Editing
 </font></summary>
-See our example <a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/demo/image_editing_with_groundingdino_stablediffusion.ipynb">notebook</a> for more details.
 <img src=".asset/GD_SD.png" alt="GD_SD" width="100%">
 </details>

-
 <details open>
 <summary><font size="4">
-Marrying Grounding DINO with <a href="https://github.com/gligen/GLIGEN">GLIGEN</a> for more Detailed Image Editing.
+Marrying Grounding DINO with <a href="https://github.com/gligen/GLIGEN">GLIGEN</a> for more Detailed Image Editing
 </font></summary>
-See our example <a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/demo/image_editing_with_groundingdino_gligen.ipynb">notebook</a> for more details.
 <img src=".asset/GD_GLIGEN.png" alt="GD_GLIGEN" width="100%">
 </details>

-## :sauropod: Model: Grounding DINO
+## Model

 Grounding DINO includes a text backbone, an image backbone, a feature enhancer, a language-guided query selection module, and a cross-modality decoder.

 ![arch](.asset/arch.png)


-## :hearts: Acknowledgement
+## Acknowledgement

 Our model is related to [DINO](https://github.com/IDEA-Research/DINO) and [GLIP](https://github.com/microsoft/GLIP). Thanks for their great work!

@ -183,7 +137,7 @@ We also thank great previous work including DETR, Deformable DETR, SMCA, Conditi
 Thanks [Stable Diffusion](https://github.com/Stability-AI/StableDiffusion) and [GLIGEN](https://github.com/gligen/GLIGEN) for their awesome models.


-## :black_nib: Citation
+## Citation

 If you find our work helpful for your research, please consider citing the following BibTeX entry.   

--- a/demo/create_coco_dataset.py
+++ b/demo/create_coco_dataset.py
@ -1,83 +0,0 @@
-import typer
-from groundingdino.util.inference import load_model, load_image, predict
-from tqdm import tqdm
-import torchvision
-import torch
-import fiftyone as fo
-
-
-def main(
-        image_directory: str = 'test_grounding_dino',
-        text_prompt: str = 'bus, car',
-        box_threshold: float = 0.15, 
-        text_threshold: float = 0.10,
-        export_dataset: bool = False,
-        view_dataset: bool = False,
-        export_annotated_images: bool = True,
-        weights_path : str = "groundingdino_swint_ogc.pth",
-        config_path: str = "../../GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
-        subsample: int = None,
-    ):
-
-    model = load_model(config_path, weights_path)
-    
-    dataset = fo.Dataset.from_images_dir(image_directory)
-
-    samples = []
-
-    if subsample is not None: 
-        
-        if subsample < len(dataset):
-            dataset = dataset.take(subsample).clone()
-    
-    for sample in tqdm(dataset):
-
-        image_source, image = load_image(sample.filepath)
-
-        boxes, logits, phrases = predict(
-            model=model, 
-            image=image, 
-            caption=text_prompt, 
-            box_threshold=box_threshold, 
-            text_threshold=text_threshold,
-        )
-
-        detections = [] 
-
-        for box, logit, phrase in zip(boxes, logits, phrases):
-
-            rel_box = torchvision.ops.box_convert(box, 'cxcywh', 'xywh')
-
-            detections.append(
-                fo.Detection(
-                    label=phrase, 
-                    bounding_box=rel_box,
-                    confidence=logit,
-            ))
-
-        # Store detections in a field name of your choice
-        sample["detections"] = fo.Detections(detections=detections)
-        sample.save()
-
-    # loads the voxel fiftyone UI ready for viewing the dataset.
-    if view_dataset:
-        session = fo.launch_app(dataset)
-        session.wait()
-        
-    # exports COCO dataset ready for training
-    if export_dataset:
-        dataset.export(
-            'coco_dataset',
-            dataset_type=fo.types.COCODetectionDataset,
-        )
-        
-    # saves bounding boxes plotted on the input images to disk
-    if export_annotated_images:
-        dataset.draw_labels(
-            'images_with_bounding_boxes',
-            label_fields=['detections']
-        )
-
-
-if __name__ == '__main__':
-    typer.run(main)
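
The removed script converts each predicted box from `cxcywh` to `xywh` with `torchvision.ops.box_convert` before passing it to `fo.Detection`. As a minimal sketch of that arithmetic, assuming the boxes are normalized to [0, 1] as the script's `rel_box` name suggests, the same conversion can be written in plain Python. The helper name `cxcywh_to_xywh` is hypothetical and is not part of this repository.

```python
from typing import List, Sequence


def cxcywh_to_xywh(box: Sequence[float]) -> List[float]:
    """Convert (center_x, center_y, width, height) to (top_left_x, top_left_y, width, height)."""
    cx, cy, w, h = (float(v) for v in box)
    # Shift from the box center to its top-left corner; width and height are unchanged.
    return [cx - w / 2.0, cy - h / 2.0, w, h]


# Example: a box centered in the image, a quarter as wide and half as tall as the image.
print(cxcywh_to_xywh([0.5, 0.5, 0.25, 0.5]))  # prints [0.375, 0.25, 0.25, 0.5]
```

Returning plain floats rather than tensor elements may also make the values easier to store in a detection record, though that is an assumption about the downstream library rather than something this diff shows.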
--- a/demo/gradio_app.py
+++ b/demo/gradio_app.py
@ -1,125 +0,0 @@
-import argparse
-from functools import partial
-import cv2
-import requests
-import os
-from io import BytesIO
-from PIL import Image
-import numpy as np
-from pathlib import Path
-
-
-import warnings
-
-import torch
-
-# prepare the environment
-os.system("python setup.py build develop --user")
-os.system("pip install packaging==21.3")
-os.system("pip install gradio")
-
-
-warnings.filterwarnings("ignore")
-
-import gradio as gr
-
-from groundingdino.models import build_model
-from groundingdino.util.slconfig import SLConfig
-from groundingdino.util.utils import clean_state_dict
-from groundingdino.util.inference import annotate, load_image, predict
-import groundingdino.datasets.transforms as T
-
-from huggingface_hub import hf_hub_download
-
-
-
-# Use this command for evaluate the Grounding DINO model
-config_file = "groundingdino/config/GroundingDINO_SwinT_OGC.py"
-ckpt_repo_id = "ShilongLiu/GroundingDINO"
-ckpt_filenmae = "groundingdino_swint_ogc.pth"
-
-
-def load_model_hf(model_config_path, repo_id, filename, device='cpu'):
-    args = SLConfig.fromfile(model_config_path) 
-    model = build_model(args)
-    args.device = device
-
-    cache_file = hf_hub_download(repo_id=repo_id, filename=filename)
-    checkpoint = torch.load(cache_file, map_location='cpu')
-    log = model.load_state_dict(clean_state_dict(checkpoint['model']), strict=False)
-    print("Model loaded from {} \n => {}".format(cache_file, log))
-    _ = model.eval()
-    return model    
-
-def image_transform_grounding(init_image):
-    transform = T.Compose([
-        T.RandomResize([800], max_size=1333),
-        T.ToTensor(),
-        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
-    ])
-    image, _ = transform(init_image, None) # 3, h, w
-    return init_image, image
-
-def image_transform_grounding_for_vis(init_image):
-    transform = T.Compose([
-        T.RandomResize([800], max_size=1333),
-    ])
-    image, _ = transform(init_image, None) # 3, h, w
-    return image
-
-model = load_model_hf(config_file, ckpt_repo_id, ckpt_filenmae)
-
-def run_grounding(input_image, grounding_caption, box_threshold, text_threshold):
-    init_image = input_image.convert("RGB")
-    original_size = init_image.size
-
-    _, image_tensor = image_transform_grounding(init_image)
-    image_pil: Image = image_transform_grounding_for_vis(init_image)
-
-    # run grounidng
-    boxes, logits, phrases = predict(model, image_tensor, grounding_caption, box_threshold, text_threshold, device='cpu')
-    annotated_frame = annotate(image_source=np.asarray(image_pil), boxes=boxes, logits=logits, phrases=phrases)
-    image_with_box = Image.fromarray(cv2.cvtColor(annotated_frame, cv2.COLOR_BGR2RGB))
-
-
-    return image_with_box
-
-if __name__ == "__main__":
-
-    parser = argparse.ArgumentParser("Grounding DINO demo", add_help=True)
-    parser.add_argument("--debug", action="store_true", help="using debug mode")
-    parser.add_argument("--share", action="store_true", help="share the app")
-    args = parser.parse_args()
-
-    block = gr.Blocks().queue()
-    with block:
-        gr.Markdown("# [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO)")
-        gr.Markdown("### Open-World Detection with Grounding DINO")
-
-        with gr.Row():
-            with gr.Column():
-                input_image = gr.Image(source='upload', type="pil")
-                grounding_caption = gr.Textbox(label="Detection Prompt")
-                run_button = gr.Button(label="Run")
-                with gr.Accordion("Advanced options", open=False):
-                    box_threshold = gr.Slider(
-                        label="Box Threshold", minimum=0.0, maximum=1.0, value=0.25, step=0.001
-                    )
-                    text_threshold = gr.Slider(
-                        label="Text Threshold", minimum=0.0, maximum=1.0, value=0.25, step=0.001
-                    )
-
-            with gr.Column():
-                gallery = gr.outputs.Image(
-                    type="pil",
-                    # label="grounding results"
-                ).style(full_width=True, full_height=True)
-                # gallery = gr.Gallery(label="Generated images", show_label=False).style(
-                #         grid=[1], height="auto", container=True, full_width=True, full_height=True)
-
-        run_button.click(fn=run_grounding, inputs=[
-                        input_image, grounding_caption, box_threshold, text_threshold], outputs=[gallery])
-
-
-    block.launch(server_name='0.0.0.0', server_port=7579, debug=args.debug, share=args.share)
-
--- a/demo/image_editing_with_groundingdino_gligen.ipynb
+++ b/demo/image_editing_with_groundingdino_gligen.ipynb
--- a/demo/image_editing_with_groundingdino_stablediffusion.ipynb
+++ b/demo/image_editing_with_groundingdino_stablediffusion.ipynb
--- a/groundingdino/config/GroundingDINO_SwinB.cfg.py
+++ b/groundingdino/config/GroundingDINO_SwinB.cfg.py
@ -1,43 +0,0 @@
-batch_size = 1
-modelname = "groundingdino"
-backbone = "swin_B_384_22k"
-position_embedding = "sine"
-pe_temperatureH = 20
-pe_temperatureW = 20
-return_interm_indices = [1, 2, 3]
-backbone_freeze_keywords = None
-enc_layers = 6
-dec_layers = 6
-pre_norm = False
-dim_feedforward = 2048
-hidden_dim = 256
-dropout = 0.0
-nheads = 8
-num_queries = 900
-query_dim = 4
-num_patterns = 0
-num_feature_levels = 4
-enc_n_points = 4
-dec_n_points = 4
-two_stage_type = "standard"
-two_stage_bbox_embed_share = False
-two_stage_class_embed_share = False
-transformer_activation = "relu"
-dec_pred_bbox_embed_share = True
-dn_box_noise_scale = 1.0
-dn_label_noise_ratio = 0.5
-dn_label_coef = 1.0
-dn_bbox_coef = 1.0
-embed_init_tgt = True
-dn_labelbook_size = 2000
-max_text_len = 256
-text_encoder_type = "bert-base-uncased"
-use_text_enhancer = True
-use_fusion_layer = True
-use_checkpoint = True
-use_transformer_ckpt = True
-use_text_cross_attention = True
-text_dropout = 0.0
-fusion_dropout = 0.0
-fusion_droppath = 0.1
-sub_sentence_present = True
--- a/groundingdino/util/inference.py
+++ b/groundingdino/util/inference.py
@ -13,10 +13,6 @@ from groundingdino.util.misc import clean_state_dict
 from groundingdino.util.slconfig import SLConfig
 from groundingdino.util.utils import get_phrases_from_posmap

-# ----------------------------------------------------------------------------------------------------------------------
-# OLD API
-# ----------------------------------------------------------------------------------------------------------------------
-

 def preprocess_caption(caption: str) -> str:
    result = caption.lower().strip()
@ -25,9 +21,9 @@ def preprocess_caption(caption: str) -> str:
    return result + "."


-def load_model(model_config_path: str, model_checkpoint_path: str, device: str = "cuda"):
+def load_model(model_config_path: str, model_checkpoint_path: str):
    args = SLConfig.fromfile(model_config_path)
-    args.device = device
+    args.device = "cuda"
    model = build_model(args)
    checkpoint = torch.load(model_checkpoint_path, map_location="cpu")
    model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
@ -54,13 +50,12 @@ def predict(
        image: torch.Tensor,
        caption: str,
        box_threshold: float,
-        text_threshold: float,
-        device: str = "cuda"
+        text_threshold: float
 ) -> Tuple[torch.Tensor, torch.Tensor, List[str]]:
    caption = preprocess_caption(caption=caption)

-    model = model.to(device)
-    image = image.to(device)
+    model = model.cuda()
+    image = image.cuda()

    with torch.no_grad():
        outputs = model(image[None], captions=[caption])
@ -100,143 +95,3 @@ def annotate(image_source: np.ndarray, boxes: torch.Tensor, logits: torch.Tensor
    annotated_frame = cv2.cvtColor(image_source, cv2.COLOR_RGB2BGR)
    annotated_frame = box_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
    return annotated_frame
-
-
-# ----------------------------------------------------------------------------------------------------------------------
-# NEW API
-# ----------------------------------------------------------------------------------------------------------------------
-
-
-class Model:
-
-    def __init__(
-        self,
-        model_config_path: str,
-        model_checkpoint_path: str,
-        device: str = "cuda"
-    ):
-        self.model = load_model(
-            model_config_path=model_config_path,
-            model_checkpoint_path=model_checkpoint_path,
-            device=device
-        ).to(device)
-        self.device = device
-
-    def predict_with_caption(
-        self,
-        image: np.ndarray,
-        caption: str,
-        box_threshold: float = 0.35,
-        text_threshold: float = 0.25
-    ) -> Tuple[sv.Detections, List[str]]:
-        """
-        import cv2
-
-        image = cv2.imread(IMAGE_PATH)
-
-        model = Model(model_config_path=CONFIG_PATH, model_checkpoint_path=WEIGHTS_PATH)
-        detections, labels = model.predict_with_caption(
-            image=image,
-            caption=caption,
-            box_threshold=BOX_THRESHOLD,
-            text_threshold=TEXT_THRESHOLD
-        )
-
-        import supervision as sv
-
-        box_annotator = sv.BoxAnnotator()
-        annotated_image = box_annotator.annotate(scene=image, detections=detections, labels=labels)
-        """
-        processed_image = Model.preprocess_image(image_bgr=image).to(self.device)
-        boxes, logits, phrases = predict(
-            model=self.model,
-            image=processed_image,
-            caption=caption,
-            box_threshold=box_threshold,
-            text_threshold=text_threshold)
-        source_h, source_w, _ = image.shape
-        detections = Model.post_process_result(
-            source_h=source_h,
-            source_w=source_w,
-            boxes=boxes,
-            logits=logits)
-        return detections, phrases
-
-    def predict_with_classes(
-        self,
-        image: np.ndarray,
-        classes: List[str],
-        box_threshold: float,
-        text_threshold: float
-    ) -> sv.Detections:
-        """
-        import cv2
-
-        image = cv2.imread(IMAGE_PATH)
-
-        model = Model(model_config_path=CONFIG_PATH, model_checkpoint_path=WEIGHTS_PATH)
-        detections = model.predict_with_classes(
-            image=image,
-            classes=CLASSES,
-            box_threshold=BOX_THRESHOLD,
-            text_threshold=TEXT_THRESHOLD
-        )
-
-
-        import supervision as sv
-
-        box_annotator = sv.BoxAnnotator()
-        annotated_image = box_annotator.annotate(scene=image, detections=detections)
-        """
-        caption = ", ".join(classes)
-        processed_image = Model.preprocess_image(image_bgr=image).to(self.device)
-        boxes, logits, phrases = predict(
-            model=self.model,
-            image=processed_image,
-            caption=caption,
-            box_threshold=box_threshold,
-            text_threshold=text_threshold)
-        source_h, source_w, _ = image.shape
-        detections = Model.post_process_result(
-            source_h=source_h,
-            source_w=source_w,
-            boxes=boxes,
-            logits=logits)
-        class_id = Model.phrases2classes(phrases=phrases, classes=classes)
-        detections.class_id = class_id
-        return detections
-
-    @staticmethod
-    def preprocess_image(image_bgr: np.ndarray) -> torch.Tensor:
-        transform = T.Compose(
-            [
-                T.RandomResize([800], max_size=1333),
-                T.ToTensor(),
-                T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
-            ]
-        )
-        image_pillow = Image.fromarray(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
-        image_transformed, _ = transform(image_pillow, None)
-        return image_transformed
-
-    @staticmethod
-    def post_process_result(
-            source_h: int,
-            source_w: int,
-            boxes: torch.Tensor,
-            logits: torch.Tensor
-    ) -> sv.Detections:
-        boxes = boxes * torch.Tensor([source_w, source_h, source_w, source_h])
-        xyxy = box_convert(boxes=boxes, in_fmt="cxcywh", out_fmt="xyxy").numpy()
-        confidence = logits.numpy()
-        return sv.Detections(xyxy=xyxy, confidence=confidence)
-
-    @staticmethod
-    def phrases2classes(phrases: List[str], classes: List[str]) -> np.ndarray:
-        class_ids = []
-        for phrase in phrases:
-            try:
-                class_ids.append(classes.index(phrase))
-            except ValueError:
-                class_ids.append(None)
-        return np.array(class_ids)
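
This change hardcodes `"cuda"` in `load_model` and `predict`, where the removed code accepted a `device` argument. Below is a minimal sketch of one way a caller could choose a device string with a CPU fallback. The helper `resolve_device` is hypothetical. In practice the `cuda_available` flag would presumably come from `torch.cuda.is_available()`, which is kept out of the block so that it runs without PyTorch installed.

```python
def resolve_device(requested: str = "cuda", cuda_available: bool = False) -> str:
    """Return the device string to use, falling back to CPU when CUDA is missing."""
    # Any "cuda" or "cuda:<index>" request is downgraded if no GPU is usable.
    if requested.startswith("cuda") and not cuda_available:
        return "cpu"
    return requested


# Example: a machine without a GPU asking for the default device.
print(resolve_device("cuda", cuda_available=False))   # prints "cpu"
# Example: a machine with GPUs asking for a specific one.
print(resolve_device("cuda:1", cuda_available=True))  # prints "cuda:1"
```

The resolved string could then be passed wherever the code expects a device, for example `model.to(device)` and `image.to(device)` as in the removed lines.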
--- a/groundingdino/util/slconfig.py
+++ b/groundingdino/util/slconfig.py
@ -2,7 +2,6 @@
 # Modified from mmcv
 # ==========================================================
 import ast
-import os
 import os.path as osp
 import shutil
 import sys
@ -81,8 +80,6 @@ class SLConfig(object):
            with tempfile.TemporaryDirectory() as temp_config_dir:
                temp_config_file = tempfile.NamedTemporaryFile(dir=temp_config_dir, suffix=".py")
                temp_config_name = osp.basename(temp_config_file.name)
-                if os.name == 'nt':
-                    temp_config_file.close()
                shutil.copyfile(filename, osp.join(temp_config_dir, temp_config_name))
                temp_module_name = osp.splitext(temp_config_name)[0]
                sys.path.insert(0, temp_config_dir)
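
The removed `os.name == 'nt'` branch closed the `NamedTemporaryFile` before `shutil.copyfile` wrote to the same path. Python's `tempfile` documentation notes that whether a named temporary file can be reopened by name while it is still open varies by platform, and that this is not possible on Windows. A minimal sketch of a platform-independent alternative, assuming the goal is only to reserve a unique `.py` name inside the temporary directory, uses `tempfile.mkstemp` and closes the descriptor immediately. The helper `copy_config_to_temp` is hypothetical and is not part of `SLConfig`.

```python
import os
import os.path as osp
import shutil
import tempfile


def copy_config_to_temp(filename: str, temp_dir: str) -> str:
    """Copy a config file into temp_dir under a unique name and return the module name."""
    # mkstemp creates the file and returns an open descriptor plus its path.
    fd, temp_path = tempfile.mkstemp(dir=temp_dir, suffix=".py")
    # Close the descriptor right away so no handle is open when copyfile writes to the path.
    os.close(fd)
    shutil.copyfile(filename, temp_path)
    # The module name is the file name without its ".py" extension.
    return osp.splitext(osp.basename(temp_path))[0]
```

Because no handle stays open during the copy, the same code path can serve every platform without an `os.name` check.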
--- a/requirements.txt
+++ b/requirements.txt
@ -6,5 +6,5 @@ yapf
 timm
 numpy
 opencv-python
-supervision==0.4.0
+supervision==0.3.2
 pycocotools