From 3cd57ea09eb28d5872e44950a6f4c998dcb77088 Mon Sep 17 00:00:00 2001
From: Vadim Pisarevsky <vadim.pisarevsky@gmail.com>
Date: Wed, 16 Oct 2024 15:28:19 +0300
Subject: [PATCH] Merge pull request #26056 from vpisarev:new_dnn_engine

New dnn engine #26056

This is the 1st PR with the new engine; CI is green and PR is ready to be merged, I think.
Merge together with https://github.com/opencv/opencv_contrib/pull/3794

---

**Known limitations:**
* [solved] OpenVINO is temporarily disabled, but is probably easy to restore (it's not a deal breaker to merge this PR, I guess)
* The new engine does not support any backends nor any targets except for the default CPU implementation. But it's possible to choose the old engine when loading a model, then all the functionality is available.
* [Caffe patch is here: #26208] The new engine only supports ONNX. When a model is constructed manually or is loaded from a file of different format (.tf, .tflite, .caffe, .darknet), the old engine is used.
* Even in the case of ONNX some layers are not supported by the new engine, such as all quantized layers (including DequantizeLinear, QuantizeLinear, QLinearConv etc.), LSTM, GRU, .... It's planned, of course, to have full support for ONNX by OpenCV 5.0 gold release. When a loaded model contains unsupported layers, we switch to the old engine automatically  (at ONNX parsing time, not at `forward()` time).
* Some layers , e.g. Expat, are only partially supported by the new engine. In the case of unsupported flavours it switches to the old engine automatically (at ONNX parsing time, not at `forward()` time).
* 'Concat' graph optimization is disabled. The optimization eliminates Concat layer and instead makes the layers that generate tensors to be concatenated to write the outputs to the final destination. Of course, it's only possible when `axis=0` or `axis=N=1`. The optimization is not compatible with dynamic shapes since we need to know in advance where to store the tensors. Because some of the layer implementations have been modified to become more compatible with the new engine, the feature appears to be broken even when the old engine is used.
* Some `dnn::Net` API is not available with the new engine. Also, shape inference may return false if some of the output or intermediate tensors' shapes cannot be inferred without running the model. Probably this can be fixed by a dummy run of the model with zero inputs.
* Some overloads of `dnn::Net::getFLOPs()` and `dnn::Net::getMemoryConsumption()` are not exposed any longer in wrapper generators; but the most useful overloads are exposed (and checked by Java tests).
* [in progress] A few Einsum tests related to empty shapes have been disabled due to crashes in the tests and in Einsum implementations. The code and the tests need to be repaired.
* OpenCL implementation of Deconvolution is disabled. It's very bad and very slow anyway; need to be completely revised.
* Deconvolution3D test is now skipped, because it was only supported by CUDA and OpenVINO backends, both of which are not supported by the new engine.
* Some tests, such as FastNeuralStyle, checked that the in the case of CUDA backend there is no fallback to CPU. Currently all layers in the new engine are processed on CPU, so there are many fallbacks. The checks, therefore, have been temporarily disabled.

---

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
---
 modules/core/include/opencv2/core.hpp         |    7 +
 modules/core/include/opencv2/core/mat.hpp     |  266 ++
 modules/core/include/opencv2/core/mat.inl.hpp |   61 +-
 .../core/include/opencv2/core/operations.hpp  |    6 +
 modules/core/misc/objc/common/IntVector.h     |    4 +-
 modules/core/misc/objc/common/IntVector.mm    |    4 +-
 modules/core/misc/objc/gen_dict.json          |    6 +
 modules/core/src/matrix.cpp                   |  568 ++++
 modules/core/src/matrix_transform.cpp         |   10 +
 modules/core/src/matrix_wrap.cpp              |  127 +
 modules/core/src/out.cpp                      |  182 ++
 modules/core/src/umatrix.cpp                  |   74 +
 .../dnn/include/opencv2/dnn/all_layers.hpp    |  131 +
 modules/dnn/include/opencv2/dnn/dict.hpp      |    4 +
 modules/dnn/include/opencv2/dnn/dnn.hpp       |  313 ++-
 modules/dnn/include/opencv2/dnn/dnn.inl.hpp   |   39 +-
 .../dnn/include/opencv2/dnn/shape_utils.hpp   |   57 +-
 .../dnn/misc/java/src/cpp/dnn_converters.cpp  |   12 +-
 .../dnn/misc/java/src/cpp/dnn_converters.hpp  |    7 +-
 .../misc/java/test/DnnListRegressionTest.java |    6 +-
 modules/dnn/misc/objc/gen_dict.json           |   23 +-
 modules/dnn/misc/python/pyopencv_dnn.hpp      |   43 +-
 modules/dnn/misc/python/test/test_dnn.py      |   12 +-
 modules/dnn/perf/perf_einsum.cpp              |    6 +-
 modules/dnn/perf/perf_net.cpp                 |    4 +-
 .../cuda4dnn/primitives/depth_space_ops.hpp   |   16 +-
 modules/dnn/src/dnn_common.hpp                |    5 +
 modules/dnn/src/dnn_read.cpp                  |    8 +-
 modules/dnn/src/graph_buffer_allocator.cpp    |  336 +++
 modules/dnn/src/graph_const_fold.cpp          |  139 +
 modules/dnn/src/init.cpp                      |   26 +-
 .../dnn/src/int8layers/convolution_layer.cpp  |    4 +-
 modules/dnn/src/int8layers/eltwise_layer.cpp  |   10 +-
 modules/dnn/src/int8layers/pooling_layer.cpp  |    4 +-
 .../dnn/src/int8layers/quantization_utils.cpp |    6 +-
 modules/dnn/src/layer.cpp                     |  107 +-
 modules/dnn/src/layer_internals.hpp           |    2 +-
 modules/dnn/src/layers/accum_layer.cpp        |    7 +-
 modules/dnn/src/layers/arg_layer.cpp          |    4 +-
 modules/dnn/src/layers/blank_layer.cpp        |   19 +-
 modules/dnn/src/layers/concat2_layer.cpp      |  191 ++
 modules/dnn/src/layers/concat_layer.cpp       |    6 +-
 .../dnn/src/layers/constantofshape_layer.cpp  |  149 +
 modules/dnn/src/layers/convolution_layer.cpp  |  180 +-
 modules/dnn/src/layers/correlation_layer.cpp  |    2 +-
 .../src/layers/cpu_kernels/convolution.cpp    |    1 -
 modules/dnn/src/layers/cumsum_layer.cpp       |    5 +-
 .../dnn/src/layers/dequantizelinear_layer.cpp |  348 +++
 modules/dnn/src/layers/einsum_layer.cpp       |  123 +-
 modules/dnn/src/layers/eltwise_layer.cpp      |    8 +-
 modules/dnn/src/layers/expand2_layer.cpp      |  130 +
 modules/dnn/src/layers/expand_layer.cpp       |   30 +-
 modules/dnn/src/layers/flatten_layer.cpp      |   65 +-
 modules/dnn/src/layers/gather2_layer.cpp      |  210 ++
 modules/dnn/src/layers/gemm_layer.cpp         |   59 +-
 modules/dnn/src/layers/layer_norm.cpp         |   33 +-
 modules/dnn/src/layers/layers_common.cpp      |  142 +
 modules/dnn/src/layers/layers_common.hpp      |   32 +
 .../dnn/src/layers/max_unpooling_layer.cpp    |    2 +-
 .../dnn/src/layers/normalize_bbox_layer.cpp   |    4 +-
 modules/dnn/src/layers/not_layer.cpp          |    7 +-
 modules/dnn/src/layers/pad2_layer.cpp         |  377 +++
 modules/dnn/src/layers/padding_layer.cpp      |    6 -
 modules/dnn/src/layers/pooling_layer.cpp      |    2 +-
 modules/dnn/src/layers/prior_box_layer.cpp    |   10 +-
 modules/dnn/src/layers/proposal_layer.cpp     |   23 +-
 .../dnn/src/layers/quantlizelinear_layer.cpp  |  336 +++
 modules/dnn/src/layers/range_layer.cpp        |  224 ++
 modules/dnn/src/layers/reshape2_layer.cpp     |  190 ++
 modules/dnn/src/layers/resize_layer.cpp       |  189 +-
 modules/dnn/src/layers/shape_layer.cpp        |  137 +
 modules/dnn/src/layers/slice2_layer.cpp       |  359 +++
 modules/dnn/src/layers/softmax_layer.cpp      |    5 +-
 modules/dnn/src/layers/split2_layer.cpp       |  266 ++
 modules/dnn/src/layers/squeeze_layer.cpp      |  159 ++
 modules/dnn/src/layers/tile2_layer.cpp        |  304 ++
 modules/dnn/src/layers/tile_layer.cpp         |   10 +-
 modules/dnn/src/layers/transpose_layer.cpp    |  218 ++
 modules/dnn/src/layers/unsqueeze_layer.cpp    |  156 ++
 modules/dnn/src/legacy_backend.hpp            |    2 +-
 modules/dnn/src/model.cpp                     |    2 +-
 modules/dnn/src/net.cpp                       |  106 +-
 modules/dnn/src/net_impl.cpp                  |  196 +-
 modules/dnn/src/net_impl.hpp                  |  144 +
 modules/dnn/src/net_impl2.cpp                 | 1107 ++++++++
 modules/dnn/src/net_impl_backend.cpp          |   12 +
 modules/dnn/src/net_impl_fuse.cpp             |   10 +
 modules/dnn/src/net_openvino.cpp              |    2 +-
 .../dnn/src/ocl4dnn/src/ocl4dnn_softmax.cpp   |   10 +-
 modules/dnn/src/onnx/onnx_importer.cpp        |   95 +-
 modules/dnn/src/onnx/onnx_importer2.cpp       | 2450 +++++++++++++++++
 modules/dnn/src/op_cuda.hpp                   |    2 +-
 modules/dnn/src/tensorflow/tf_importer.cpp    |    8 +-
 modules/dnn/src/tflite/tflite_importer.cpp    |    6 +-
 modules/dnn/test/test_backends.cpp            |    4 +-
 modules/dnn/test/test_common.cpp              |    2 +-
 modules/dnn/test/test_darknet_importer.cpp    |    2 +-
 modules/dnn/test/test_graph_simplifier.cpp    |    6 +
 modules/dnn/test/test_layers.cpp              |   24 +-
 modules/dnn/test/test_layers_1d.cpp           |   56 +-
 modules/dnn/test/test_model.cpp               |   17 +-
 modules/dnn/test/test_onnx_importer.cpp       |   88 +-
 .../java/generator/src/cpp/listconverters.cpp |    2 +-
 .../java/generator/src/cpp/listconverters.hpp |    2 +-
 modules/objc/generator/CMakeLists.txt         |    2 +-
 modules/objc/generator/gen_objc.py            |    2 +-
 modules/ts/src/ts_func.cpp                    |   10 +-
 .../video/src/tracking/tracker_dasiamrpn.cpp  |   11 +-
 platforms/apple/cv_build_utils.py             |    2 +-
 platforms/ios/build_docs.py                   |    2 +-
 platforms/ios/readme.txt                      |    2 +-
 platforms/osx/build_framework.py              |    2 +-
 112 files changed, 11197 insertions(+), 554 deletions(-)
 mode change 100644 => 100755 modules/dnn/misc/python/test/test_dnn.py
 create mode 100644 modules/dnn/src/graph_buffer_allocator.cpp
 create mode 100644 modules/dnn/src/graph_const_fold.cpp
 create mode 100644 modules/dnn/src/layers/concat2_layer.cpp
 create mode 100644 modules/dnn/src/layers/constantofshape_layer.cpp
 create mode 100644 modules/dnn/src/layers/dequantizelinear_layer.cpp
 create mode 100644 modules/dnn/src/layers/expand2_layer.cpp
 create mode 100644 modules/dnn/src/layers/gather2_layer.cpp
 create mode 100644 modules/dnn/src/layers/pad2_layer.cpp
 create mode 100644 modules/dnn/src/layers/quantlizelinear_layer.cpp
 create mode 100644 modules/dnn/src/layers/range_layer.cpp
 create mode 100644 modules/dnn/src/layers/reshape2_layer.cpp
 create mode 100644 modules/dnn/src/layers/shape_layer.cpp
 create mode 100644 modules/dnn/src/layers/slice2_layer.cpp
 create mode 100644 modules/dnn/src/layers/split2_layer.cpp
 create mode 100644 modules/dnn/src/layers/squeeze_layer.cpp
 create mode 100644 modules/dnn/src/layers/tile2_layer.cpp
 create mode 100644 modules/dnn/src/layers/transpose_layer.cpp
 create mode 100644 modules/dnn/src/layers/unsqueeze_layer.cpp
 create mode 100644 modules/dnn/src/net_impl2.cpp
 create mode 100644 modules/dnn/src/onnx/onnx_importer2.cpp
 mode change 100644 => 100755 platforms/apple/cv_build_utils.py

diff --git a/modules/core/include/opencv2/core.hpp b/modules/core/include/opencv2/core.hpp
index e00ef365ed..dfd91ba08f 100644
--- a/modules/core/include/opencv2/core.hpp
+++ b/modules/core/include/opencv2/core.hpp
@@ -1124,6 +1124,13 @@ CV_EXPORTS_W void flipND(InputArray src, OutputArray dst, int axis);
  */
 CV_EXPORTS_W void broadcast(InputArray src, InputArray shape, OutputArray dst);
 
+/** @brief Broadcast the given Mat to the given shape.
+ * @param src input array
+ * @param shape target shape. Note that negative values are not supported.
+ * @param dst output array that has the given shape
+ */
+CV_EXPORTS void broadcast(InputArray src, const MatShape& shape, OutputArray dst);
+
 enum RotateFlags {
     ROTATE_90_CLOCKWISE = 0, //!<Rotate 90 degrees clockwise
     ROTATE_180 = 1, //!<Rotate 180 degrees clockwise
diff --git a/modules/core/include/opencv2/core/mat.hpp b/modules/core/include/opencv2/core/mat.hpp
index f0c421f810..c6a4d9c728 100644
--- a/modules/core/include/opencv2/core/mat.hpp
+++ b/modules/core/include/opencv2/core/mat.hpp
@@ -67,6 +67,113 @@ enum AccessFlag { ACCESS_READ=1<<24, ACCESS_WRITE=1<<25,
 CV_ENUM_FLAGS(AccessFlag)
 __CV_ENUM_FLAGS_BITWISE_AND(AccessFlag, int, AccessFlag)
 
+/**
+ * @brief Enum of data layout for model inference.
+ * @see Image2BlobParams
+ */
+enum DataLayout
+{
+    DATA_LAYOUT_UNKNOWN = 0,
+    DATA_LAYOUT_ND = 1,        //!< OpenCV data layout for 2D data.
+    DATA_LAYOUT_NCHW = 2,      //!< OpenCV data layout for 4D data.
+    DATA_LAYOUT_NCDHW = 3,     //!< OpenCV data layout for 5D data.
+    DATA_LAYOUT_NHWC = 4,      //!< Tensorflow-like data layout for 4D data.
+    DATA_LAYOUT_NDHWC = 5,     //!< Tensorflow-like data layout for 5D data.
+    DATA_LAYOUT_PLANAR = 6,    //!< Tensorflow-like data layout, it should only be used at tf or tflite model parsing.
+    DATA_LAYOUT_BLOCK = 7,     //!< Block layout (also referred to as 'NC1HWC0'), which some accelerators need and even on CPU a better performance may be achieved.
+
+    // for compatibility with the old code in DNN
+    DNN_LAYOUT_UNKNOWN = 0,
+    DNN_LAYOUT_ND = 1,        //!< OpenCV data layout for 2D data.
+    DNN_LAYOUT_NCHW = 2,      //!< OpenCV data layout for 4D data.
+    DNN_LAYOUT_NCDHW = 3,     //!< OpenCV data layout for 5D data.
+    DNN_LAYOUT_NHWC = 4,      //!< Tensorflow-like data layout for 4D data.
+    DNN_LAYOUT_NDHWC = 5,     //!< Tensorflow-like data layout for 5D data.
+    DNN_LAYOUT_PLANAR = 6,    //!< Tensorflow-like data layout, it should only be used at tf or tflite model parsing.
+    DNN_LAYOUT_BLOCK = 7,     //!< Block layout (also referred to as 'NC1HWC0'), which some accelerators need and even on CPU a better performance may be achieved.
+};
+
+CV_EXPORTS std::string layoutToString(DataLayout layout);
+
+/**
+ * @brief Represents shape of a matrix/tensor.
+ *   Previously, MatShape was defined as an alias of std::vector<int>,
+ *   but now we use a special structure that provides a few extra benefits:
+ *   1. avoids any heap operations, since the shape is stored in a plain array. This reduces overhead of shape inference etc.
+ *   2. includes information about the layout, including the actual number of channels ('C') in the case of block layout.
+ *   3. distinguishes between empty shape (total() == 0) and 0-dimensional shape (dims == 0, but total() == 1).
+ */
+struct CV_EXPORTS_W_SIMPLE MatShape
+{
+    enum {MAX_DIMS=10};
+
+    MatShape();
+    explicit MatShape(size_t dims, const int* sizes=nullptr, DataLayout layout=DATA_LAYOUT_UNKNOWN, int C=0);
+    explicit MatShape(size_t dims, int value, DataLayout layout=DATA_LAYOUT_UNKNOWN);
+    explicit MatShape(int dims, int value, DataLayout layout=DATA_LAYOUT_UNKNOWN);
+    explicit MatShape(const std::vector<int>& shape, DataLayout layout=DATA_LAYOUT_UNKNOWN, int C=0);
+    explicit MatShape(const int* begin, const int* end, DataLayout layout=DATA_LAYOUT_UNKNOWN, int C=0);
+    explicit MatShape(std::initializer_list<int> shape);
+    MatShape(const MatShape& shape);
+    MatShape& operator = (const MatShape& shape);
+    static MatShape scalar();
+    template<class _It> MatShape(_It begin, _It end);
+
+    // try to mimic basic std::vector<int> functionality
+    size_t size() const; // returns 0 in the case of scalar tensor. So, please don't use 'size()==0' to check for an empty shape. Use empty() instead.
+    CV_WRAP bool empty() const; // equivalent to total()==0, but may be slightly faster.
+    CV_WRAP bool isScalar() const; // dims==0
+    CV_WRAP void clear();
+    void resize(size_t newSize, int value=0);
+    void reserve(size_t maxSize);
+    void assign(size_t newSize, int value);
+    void assign(int newSize, int value);
+    void assign(const int* begin, const int* end);
+    void assign_(const int* begin, const int* end);
+    template<class _It> void assign(_It begin, _It end);
+    void insert(int* where, int value);
+    void insert(int* where, const int* begin, const int* end);
+    void insert_(int* where, const int* begin, const int* end);
+    void insert(int* where, size_t count, int value);
+    void insert(int* where, int count, int value);
+    template<class _It> void insert(int* where, _It begin, _It end);
+    CV_WRAP void erase(int* where);
+    int* data();
+    const int* data() const;
+    int* begin();
+    const int* begin() const;
+    int* end();
+    const int* end() const;
+    int& back();
+    const int& back() const;
+    void push_back(int value);
+    void emplace_back(int value);
+    int& operator [](size_t idx);
+    const int& operator [](size_t idx) const;
+
+    CV_WRAP bool hasSymbols() const; // negative elements in the shape may denote 'symbols' instead of actual values.
+
+    // compute shape of the result with possible broadcasting
+    CV_WRAP MatShape expand(const MatShape& another) const;
+
+    // convert shape to/from block layout
+    CV_WRAP MatShape toBlock(int C0) const;
+    CV_WRAP MatShape fromBlock(DataLayout newLayout) const;
+
+    size_t total() const; // returns the total number of elements in the tensor (including padding elements, i.e. the method ignores 'C' in the case of block layout). Returns 1 for scalar tensors. Returns 0 for empty shapes.
+
+    operator std::vector<int>() const;
+    std::string str() const;
+
+    int dims;
+    DataLayout layout;
+    int C;
+    int p[MAX_DIMS];
+};
+
+CV_EXPORTS bool operator == (const MatShape& shape1, const MatShape& shape2);
+CV_EXPORTS bool operator != (const MatShape& shape1, const MatShape& shape2);
+
 CV__DEBUG_NS_BEGIN
 
 class CV_EXPORTS _OutputArray;
@@ -227,6 +334,7 @@ public:
     int cols(int i=-1) const;
     int rows(int i=-1) const;
     Size size(int i=-1) const;
+    MatShape shape(int i=-1) const;
     int sizend(int* sz, int i=-1) const;
     bool sameSize(const _InputArray& arr) const;
     size_t total(int i=-1) const;
@@ -236,6 +344,7 @@ public:
     bool isContinuous(int i=-1) const;
     bool isSubmatrix(int i=-1) const;
     bool empty() const;
+    bool empty(int i) const;
     void copyTo(const _OutputArray& arr) const;
     void copyTo(const _OutputArray& arr, const _InputArray & mask) const;
     size_t offset(int i=-1) const;
@@ -367,10 +476,19 @@ public:
     template<typename _Tp> std::vector<std::vector<_Tp> >& getVecVecRef() const;
     ogl::Buffer& getOGlBufferRef() const;
     cuda::HostMem& getHostMemRef() const;
+
     void create(Size sz, int type, int i=-1, bool allowTransposed=false, _OutputArray::DepthMask fixedDepthMask=static_cast<_OutputArray::DepthMask>(0)) const;
     void create(int rows, int cols, int type, int i=-1, bool allowTransposed=false, _OutputArray::DepthMask fixedDepthMask=static_cast<_OutputArray::DepthMask>(0)) const;
     void create(int dims, const int* size, int type, int i=-1, bool allowTransposed=false, _OutputArray::DepthMask fixedDepthMask=static_cast<_OutputArray::DepthMask>(0)) const;
+    void create(const MatShape& shape, int type, int i=-1, bool allowTransposed=false, _OutputArray::DepthMask fixedDepthMask=static_cast<_OutputArray::DepthMask>(0)) const;
     void createSameSize(const _InputArray& arr, int mtype) const;
+
+    void fit(Size sz, int type, int i=-1, bool allowTransposed=false, _OutputArray::DepthMask fixedDepthMask=static_cast<_OutputArray::DepthMask>(0)) const;
+    void fit(int rows, int cols, int type, int i=-1, bool allowTransposed=false, _OutputArray::DepthMask fixedDepthMask=static_cast<_OutputArray::DepthMask>(0)) const;
+    void fit(int dims, const int* size, int type, int i=-1, bool allowTransposed=false, _OutputArray::DepthMask fixedDepthMask=static_cast<_OutputArray::DepthMask>(0)) const;
+    void fit(const MatShape& shape, int type, int i=-1, bool allowTransposed=false, _OutputArray::DepthMask fixedDepthMask=static_cast<_OutputArray::DepthMask>(0)) const;
+    void fitSameSize(const _InputArray& arr, int mtype) const;
+
     void release() const;
     void clear() const;
     void setTo(const _InputArray& value, const _InputArray & mask = _InputArray()) const;
@@ -876,6 +994,20 @@ public:
     */
     Mat(const std::vector<int>& sizes, int type);
 
+    /** @overload
+    @param shape Array shape.
+    @param type Array type. Use CV_8UC1, ..., CV_64FC4 to create 1-4 channel matrices, or
+    CV_8UC(n), ..., CV_64FC(n) to create multi-channel (up to CV_CN_MAX channels) matrices.
+    */
+    Mat(const MatShape& shape, int type);
+
+    /** @overload
+    @param shape Array shape.
+    @param type Array type. Use CV_8UC1, ..., CV_64FC4 to create 1-4 channel matrices, or
+    CV_8UC(n), ..., CV_64FC(n) to create multi-channel (up to CV_CN_MAX channels) matrices.
+    */
+    Mat(std::initializer_list<int> shape, int type);
+
     /** @overload
     @param ndims Array dimensionality.
     @param sizes Array of integers specifying an n-dimensional array shape.
@@ -897,6 +1029,25 @@ public:
     */
     Mat(const std::vector<int>& sizes, int type, const Scalar& s);
 
+    /** @overload
+    @param shape Array shape.
+    @param type Array type. Use CV_8UC1, ..., CV_64FC4 to create 1-4 channel matrices, or
+    CV_8UC(n), ..., CV_64FC(n) to create multi-channel (up to CV_CN_MAX channels) matrices.
+    @param s An optional value to initialize each matrix element with. To set all the matrix elements to
+    the particular value after the construction, use the assignment operator
+    Mat::operator=(const Scalar& value) .
+    */
+    Mat(const MatShape& shape, int type, const Scalar& s);
+
+    /** @overload
+    @param shape Array shape.
+    @param type Array type. Use CV_8UC1, ..., CV_64FC4 to create 1-4 channel matrices, or
+    CV_8UC(n), ..., CV_64FC(n) to create multi-channel (up to CV_CN_MAX channels) matrices.
+    @param s An optional value to initialize each matrix element with. To set all the matrix elements to
+     the particular value after the construction, use the assignment operator
+    Mat::operator=(const Scalar& value) .
+    */
+    Mat(std::initializer_list<int> shape, int type, const Scalar& s);
 
     /** @overload
     @param m Array that (as a whole or partly) is assigned to the constructed matrix. No data is copied
@@ -968,6 +1119,34 @@ public:
     */
     Mat(const std::vector<int>& sizes, int type, void* data, const size_t* steps=0);
 
+    /** @overload
+    @param shape Array shape.
+    @param type Array type. Use CV_8UC1, ..., CV_64FC4 to create 1-4 channel matrices, or
+    CV_8UC(n), ..., CV_64FC(n) to create multi-channel (up to CV_CN_MAX channels) matrices.
+    @param data Pointer to the user data. Matrix constructors that take data and step parameters do not
+    allocate matrix data. Instead, they just initialize the matrix header that points to the specified
+    data, which means that no data is copied. This operation is very efficient and can be used to
+    process external data using OpenCV functions. The external data is not automatically deallocated, so
+    you should take care of it.
+    @param steps Array of ndims-1 steps in case of a multi-dimensional array (the last step is always
+    set to the element size). If not specified, the matrix is assumed to be continuous.
+    */
+    Mat(const MatShape& shape, int type, void* data, const size_t* steps=0);
+
+    /** @overload
+    @param shape Array shape.
+    @param type Array type. Use CV_8UC1, ..., CV_64FC4 to create 1-4 channel matrices, or
+    CV_8UC(n), ..., CV_64FC(n) to create multi-channel (up to CV_CN_MAX channels) matrices.
+    @param data Pointer to the user data. Matrix constructors that take data and step parameters do not
+    allocate matrix data. Instead, they just initialize the matrix header that points to the specified
+    data, which means that no data is copied. This operation is very efficient and can be used to
+    process external data using OpenCV functions. The external data is not automatically deallocated, so
+    you should take care of it.
+    @param steps Array of ndims-1 steps in case of a multi-dimensional array (the last step is always
+    set to the element size). If not specified, the matrix is assumed to be continuous.
+    */
+    Mat(std::initializer_list<int> shape, int type, void* data, const size_t* steps=0);
+
     /** @overload
     @param m Array that (as a whole or partly) is assigned to the constructed matrix. No data is copied
     by these constructors. Instead, the header pointing to m data or its sub-array is constructed and
@@ -1332,6 +1511,18 @@ public:
      */
     Mat reshape(int cn, const std::vector<int>& newshape) const;
 
+    /** @overload
+     * @param cn New number of channels. If the parameter is 0, the number of channels remains the same.
+     * @param newshape New shape in the form of MatShape.
+     */
+    Mat reshape(int cn, const MatShape& newshape) const;
+
+    /** @overload
+     * @param cn New number of channels. If the parameter is 0, the number of channels remains the same.
+     * @param newshape New shape in the form of initializer list.
+     */
+    Mat reshape(int cn, std::initializer_list<int> newshape) const;
+
     /** @brief Transposes a matrix.
 
     The method performs matrix transposition by means of matrix expressions. It does not perform the
@@ -1522,6 +1713,18 @@ public:
     */
     void create(const std::vector<int>& sizes, int type);
 
+    /** @overload
+    @param shape The new shape.
+    @param type New matrix type.
+    */
+    void create(const MatShape& shape, int type);
+
+    /** @overload
+    @param shape The new shape.
+    @param type New matrix type.
+    */
+    void create(std::initializer_list<int> shape, int type);
+
     /** @brief Creates the matrix of the same size as another array.
 
     The method is similar to _OutputArray::createSameSize(arr, type),
@@ -1531,6 +1734,50 @@ public:
     */
     void createSameSize(InputArray arr, int type);
 
+    /** @brief Similar to create(rows, cols, type), but only reallocates memory if the existing buffer size is not enough.
+     @param rows New number of rows.
+     @param cols New number of columns.
+     @param type New matrix type.
+    */
+    void fit(int rows, int cols, int type);
+
+    /** @overload
+    @param size Alternative new matrix size specification: Size(cols, rows)
+    @param type New matrix type.
+    */
+    void fit(Size size, int type);
+
+    /** @overload
+    @param ndims New array dimensionality.
+    @param sizes Array of integers specifying a new array shape.
+    @param type New matrix type.
+    */
+    void fit(int ndims, const int* sizes, int type);
+
+    /** @overload
+    @param sizes Array of integers specifying a new array shape.
+    @param type New matrix type.
+    */
+    void fit(const std::vector<int>& sizes, int type);
+
+    /** @overload
+    @param shape The new shape.
+    @param type New matrix type.
+    */
+    void fit(const MatShape& shape, int type);
+
+    /** @overload
+    @param shape The new shape.
+    @param type New matrix type.
+    */
+    void fit(std::initializer_list<int> shape, int type);
+
+    /** @brief Similar to createSameSize(arr, type), but only reallocates memory if the existing buffer is not enough.
+    @param arr The other array.
+    @param type New matrix type.
+    */
+    void fitSameSize(InputArray arr, int type);
+
     /** @brief Increments the reference counter.
 
     The method increments the reference counter associated with the matrix data. If the matrix header
@@ -1833,6 +2080,10 @@ public:
      */
     size_t step1(int i=0) const;
 
+    /** @brief Returns the shape.
+    */
+    MatShape shape() const;
+
     /** @brief Returns true if the array has no elements.
 
     The method returns true if Mat::total() is 0 or if Mat::data is NULL. Because of pop_back() and
@@ -2522,6 +2773,7 @@ public:
     // number of channels and/or different number of rows. see cvReshape.
     UMat reshape(int cn, int rows=0) const;
     UMat reshape(int cn, int newndims, const int* newsz) const;
+    UMat reshape(int cn, const MatShape& shape) const;
 
     //! matrix transposition by means of matrix expressions
     UMat t() const;
@@ -2549,11 +2801,21 @@ public:
     void create(Size size, int type, UMatUsageFlags usageFlags = USAGE_DEFAULT);
     void create(int ndims, const int* sizes, int type, UMatUsageFlags usageFlags = USAGE_DEFAULT);
     void create(const std::vector<int>& sizes, int type, UMatUsageFlags usageFlags = USAGE_DEFAULT);
+    void create(const MatShape& shape, int type, UMatUsageFlags usageFlags = USAGE_DEFAULT);
+
+    //! fits the new shape into existing data buffer if possible, otherwise reallocates data.
+    void fit(int rows, int cols, int type, UMatUsageFlags usageFlags = USAGE_DEFAULT);
+    void fit(Size size, int type, UMatUsageFlags usageFlags = USAGE_DEFAULT);
+    void fit(int ndims, const int* sizes, int type, UMatUsageFlags usageFlags = USAGE_DEFAULT);
+    void fit(const std::vector<int>& sizes, int type, UMatUsageFlags usageFlags = USAGE_DEFAULT);
+    void fit(const MatShape& shape, int type, UMatUsageFlags usageFlags = USAGE_DEFAULT);
 
     //! allocates new matrix data unless the matrix already has specified size and type.
     // the size is taken from the specified array.
     void createSameSize(InputArray arr, int type, UMatUsageFlags usageFlags = USAGE_DEFAULT);
 
+    void fitSameSize(InputArray arr, int type, UMatUsageFlags usageFlags = USAGE_DEFAULT);
+
     //! increases the reference counter; use with care to avoid memleaks
     void addref();
     //! decreases reference counter;
@@ -2602,6 +2864,10 @@ public:
     //! returns the total number of matrix elements
     size_t total() const;
 
+    /** @brief Returns the shape.
+    */
+    MatShape shape() const;
+
     //! returns N if the matrix is 1-channel (N x ptdim) or ptdim-channel (1 x N) or (N x 1); negative number otherwise
     int checkVector(int elemChannels, int depth=-1, bool requireContinuous=true) const;
 
diff --git a/modules/core/include/opencv2/core/mat.inl.hpp b/modules/core/include/opencv2/core/mat.inl.hpp
index 8f82fd80c0..221f2fee1d 100644
--- a/modules/core/include/opencv2/core/mat.inl.hpp
+++ b/modules/core/include/opencv2/core/mat.inl.hpp
@@ -71,9 +71,6 @@
 
 namespace cv
 {
-CV__DEBUG_NS_BEGIN
-
-
 //! @cond IGNORED
 
 ////////////////////////// Custom (raw) type wrapper //////////////////////////
@@ -86,6 +83,64 @@ int rawType()
     return (int)CV_MAKETYPE(CV_8U, elemSize);
 }
 
+/////////////////////////// MatShape ////////////////////////////////
+
+inline size_t MatShape::size() const { return dims >= 0 ? dims : 0; }
+inline bool MatShape::empty() const { return total() == 0; }
+inline bool MatShape::isScalar() const { return dims == 0; }
+
+inline int* MatShape::data() { return p; }
+inline const int* MatShape::data() const { return p; }
+
+inline int& MatShape::operator [](size_t idx)
+{
+    CV_Assert(idx < (size_t)(dims >= 0 ? dims : 1));
+    return p[idx];
+}
+
+inline const int& MatShape::operator [](size_t idx) const
+{
+    CV_Assert(idx < (size_t)(dims >= 0 ? dims : 1));
+    return p[idx];
+}
+
+template<typename _It> inline MatShape::MatShape(_It begin, _It end)
+{
+    int buf[MAX_DIMS];
+    int count = 0;
+    for (_It it = begin; it != end; ++it, ++count) {
+        CV_Assert(count < MAX_DIMS);
+        buf[count] = (int)*it;
+    }
+    clear();
+    assign_(buf, buf + count);
+}
+
+template<typename _It> inline void MatShape::assign(_It begin, _It end)
+{
+    int buf[MAX_DIMS];
+    int count = 0;
+    for (_It it = begin; it != end; ++it, ++count) {
+        CV_Assert(count < MAX_DIMS);
+        buf[count] = (int)*it;
+    }
+    assign_(buf, buf + count);
+}
+
+template<class _It> inline void MatShape::insert(int* where, _It begin, _It end)
+{
+    int buf[MAX_DIMS];
+    int count = 0;
+    for (_It it = begin; it != end; ++it, ++count) {
+        CV_Assert(count < MAX_DIMS);
+        buf[count] = (int)*it;
+    }
+    insert_(where, buf, buf + count);
+}
+
+CV__DEBUG_NS_BEGIN
+
+
 //////////////////////// Input/Output Arrays ////////////////////////
 
 inline void _InputArray::init(int _flags, const void* _obj)
diff --git a/modules/core/include/opencv2/core/operations.hpp b/modules/core/include/opencv2/core/operations.hpp
index 8e84be6c18..d345e24dc5 100644
--- a/modules/core/include/opencv2/core/operations.hpp
+++ b/modules/core/include/opencv2/core/operations.hpp
@@ -50,6 +50,7 @@
 #endif
 
 #include <cstdio>
+#include <ostream>
 
 #if defined(__GNUC__) || defined(__clang__) // at least GCC 3.1+, clang 3.5+
 #  if defined(__MINGW_PRINTF_FORMAT)  // https://sourceforge.net/p/mingw-w64/wiki2/gnu%20printf/.
@@ -484,6 +485,11 @@ int print(const Matx<_Tp, m, n>& matx, FILE* stream = stdout)
     return print(Formatter::get()->format(cv::Mat(matx)), stream);
 }
 
+// numpy/ONNXRuntime-style matrix pretty-printer
+CV_EXPORTS std::ostream& pprint(std::ostream& strm, InputArray tensor, int indent=0,
+                                int edge=3, int wholeTensorThreshold=100,
+                                char parens='\0');
+
 //! @endcond
 
 ///////////////////////////////// Formatted string generation /////////////////////////////////
diff --git a/modules/core/misc/objc/common/IntVector.h b/modules/core/misc/objc/common/IntVector.h
index 752f6e056c..c0ed2920c6 100644
--- a/modules/core/misc/objc/common/IntVector.h
+++ b/modules/core/misc/objc/common/IntVector.h
@@ -47,8 +47,8 @@ CV_EXPORTS @interface IntVector : NSObject
 * Create IntVector from std::vector<int>
 * @param src The std::vector<int> object to wrap
 */
--(instancetype)initWithStdVector:(std::vector<int>&)src;
-+(instancetype)fromNative:(std::vector<int>&)src;
+-(instancetype)initWithStdVector:(const std::vector<int>&)src;
++(instancetype)fromNative:(const std::vector<int>&)src;
 #endif
 
 #pragma mark - Properties
diff --git a/modules/core/misc/objc/common/IntVector.mm b/modules/core/misc/objc/common/IntVector.mm
index 112a5dcbc1..16ac322c28 100644
--- a/modules/core/misc/objc/common/IntVector.mm
+++ b/modules/core/misc/objc/common/IntVector.mm
@@ -50,7 +50,7 @@
     return v;
 }
 
--(instancetype)initWithStdVector:(std::vector<int>&)src {
+-(instancetype)initWithStdVector:(const std::vector<int>&)src {
     self = [super init];
     if (self) {
         v.insert(v.begin(), src.begin(), src.end());
@@ -58,7 +58,7 @@
     return self;
 }
 
-+(instancetype)fromNative:(std::vector<int>&)src {
++(instancetype)fromNative:(const std::vector<int>&)src {
     return [[IntVector alloc] initWithStdVector:src];
 }
 
diff --git a/modules/core/misc/objc/gen_dict.json b/modules/core/misc/objc/gen_dict.json
index e01b32d6dc..34a916a80d 100644
--- a/modules/core/misc/objc/gen_dict.json
+++ b/modules/core/misc/objc/gen_dict.json
@@ -110,6 +110,12 @@
             "to_cpp": "%(n)s.nativeRef",
             "from_cpp": "[TermCriteria fromNative:%(n)s]"
         },
+        "DataLayout": {
+            "objc_type": "int",
+            "from_cpp": "(int)%(n)s",
+            "to_cpp": "(cv::DataLayout)%(n)s",
+            "is_primitive": true
+        },
         "DMatch": {
             "objc_type": "DMatch*"
         },
diff --git a/modules/core/src/matrix.cpp b/modules/core/src/matrix.cpp
index 9ed492710a..74586e9439 100644
--- a/modules/core/src/matrix.cpp
+++ b/modules/core/src/matrix.cpp
@@ -7,6 +7,409 @@
 
 namespace cv {
 
+std::string layoutToString(DataLayout layout)
+{
+    return
+        layout == DATA_LAYOUT_ND ? "ND" :
+        layout == DATA_LAYOUT_NCHW ? "NCHW" :
+        layout == DATA_LAYOUT_NHWC ? "NHWC" :
+        layout == DATA_LAYOUT_BLOCK ? "NC1HWC0" :
+        layout == DATA_LAYOUT_NCDHW ? "NCDHW" :
+        layout == DATA_LAYOUT_NDHWC ? "NDHWC" :
+        layout == DATA_LAYOUT_PLANAR ? "PLANAR" :
+        layout == DATA_LAYOUT_UNKNOWN ? "Unknown" : "???";
+}
+
+bool operator == (const MatShape& size1, const MatShape& size2)
+{
+    if (size1.dims != size2.dims)
+        return false;
+    if (size1.layout != size2.layout &&
+        size1.layout != DATA_LAYOUT_UNKNOWN &&
+        size2.layout != DATA_LAYOUT_UNKNOWN)
+        return false;
+    if (size1.layout == DATA_LAYOUT_BLOCK &&
+        size2.layout == DATA_LAYOUT_BLOCK &&
+        size1.C != size2.C)
+        return false;
+    for (int i = 0; i < size1.dims; i++) {
+        if (size1.p[i] != size2.p[i])
+            return false;
+    }
+    return true;
+}
+
+bool operator != (const MatShape& size1, const MatShape& size2)
+{
+    return !(size1 == size2);
+}
+
+/////////////////////////// MatShape ////////////////////////////////
+
+MatShape MatShape::scalar()
+{
+    return MatShape(0);
+}
+
+void MatShape::clear()
+{
+    dims = -1;
+    layout = DATA_LAYOUT_UNKNOWN;
+    C = 0;
+    for (int i = 0; i < MAX_DIMS; i++)
+        p[i] = 0;
+}
+
+void MatShape::resize(size_t newSize, int value)
+{
+    CV_Assert(newSize < (size_t)MAX_DIMS);
+    int old_dims = std::max(dims, 0);
+    dims = (int)newSize;
+    for (int i = old_dims; i < dims; i++)
+        p[i] = value;
+}
+
+void MatShape::reserve(size_t)
+{
+    // no op; maybe need to add a check for overflow, but we check it anyway in other operations
+}
+
+void MatShape::assign(size_t newSize, int value)
+{
+    CV_Assert(newSize < (size_t)MAX_DIMS);
+    dims = (int)newSize;
+    for (int i = 0; i < dims; i++)
+        p[i] = value;
+}
+
+void MatShape::assign(int newSize, int value)
+{
+    assign((size_t)newSize, value);
+}
+
+void MatShape::assign(const int* begin, const int* end)
+{
+    assign_(begin, end);
+}
+
+void MatShape::assign_(const int* begin, const int* end)
+{
+    ptrdiff_t newSize = end - begin;
+    CV_Assert(0 <= newSize && newSize < (ptrdiff_t)MAX_DIMS);
+    dims = (int)newSize;
+    for (int i = 0; i < dims; i++)
+        p[i] = begin[i];
+}
+
+int* MatShape::begin() { return p; }
+const int* MatShape::begin() const { return p; }
+int* MatShape::end() { return p + std::max(dims, 0); }
+const int* MatShape::end() const { return p + std::max(dims, 0); }
+int& MatShape::back() { return p[std::max(dims-1, 0)]; }
+const int& MatShape::back() const { return p[std::max(dims-1, 0)]; }
+
+void MatShape::push_back(int value)
+{
+    CV_Assert(dims+1 < MAX_DIMS);
+    dims = std::max(dims+1, 1);
+    p[dims-1] = value;
+}
+
+void MatShape::emplace_back(int value)
+{
+    push_back(value);
+}
+
+void MatShape::insert(int* where, int value)
+{
+    int old_dims = std::max(dims, 0);
+    CV_Assert(old_dims+1 < MAX_DIMS);
+    ptrdiff_t ofs = where - p;
+    CV_Assert(0 <= ofs && ofs <= old_dims);
+    dims = old_dims+1;
+    for (int i = old_dims-1; i >= (int)ofs; i--)
+        p[i+1] = p[i];
+    p[ofs] = value;
+}
+
+void MatShape::insert(int* where, size_t count, int value)
+{
+    int old_dims = std::max(dims, 0);
+    CV_Assert((size_t)(old_dims+count) < (size_t)MAX_DIMS);
+    ptrdiff_t ofs = where - p;
+    CV_Assert(0 <= ofs && ofs <= old_dims);
+    dims = (int)(old_dims+count);
+    for (int i = old_dims-1; i >= (int)ofs; i--)
+        p[i+count] = p[i];
+    for (int i = 0; i < (int)count; i++)
+        p[i+ofs] = value;
+}
+
+void MatShape::insert(int* where, int count, int value)
+{
+    insert(where, (size_t)count, value);
+}
+
+void MatShape::insert(int* where, const int* begin, const int* end)
+{
+    insert_(where, begin, end);
+}
+
+void MatShape::insert_(int* where, const int* begin, const int* end)
+{
+    int old_dims = std::max(dims, 0);
+    ptrdiff_t delta = end - begin;
+    CV_Assert(0 <= delta && old_dims+delta < MAX_DIMS);
+    ptrdiff_t ofs = where - p;
+    CV_Assert(0 <= ofs && ofs <= old_dims);
+    dims = (int)(old_dims+delta);
+    for (int i = old_dims-1; i >= (int)ofs; i--)
+        p[i+delta] = p[i];
+    for (int i = 0; i < (int)delta; i++)
+        p[i+ofs] = begin[i];
+}
+
+void MatShape::erase(int* where)
+{
+    CV_Assert(dims > 0);
+    ptrdiff_t ofs = where - p;
+    CV_Assert(0 <= ofs && ofs <= dims);
+    if (ofs == dims)
+        return;
+    dims--;
+    for (int i = (int)ofs+1; i <= dims; i++)
+        p[i-1] = p[i];
+}
+
+size_t MatShape::total() const
+{
+    size_t result = 1;
+    if (dims < 0)
+        return 0;
+    for (int i = 0; i < dims; i++)
+        result *= p[i];
+    return result;
+}
+
+std::string MatShape::str() const
+{
+    std::stringstream sstrm;
+    if (empty()) {
+        sstrm << "<empty>";
+    } else if (dims == 0) {
+        sstrm << "<scalar>";
+    } else {
+        sstrm << "[";
+        for (int i = 0; i < dims; i++) {
+            sstrm << (i > 0 ? " x " : "") << p[i];
+        }
+        sstrm << "]";
+    }
+    return sstrm.str();
+}
+
+static void finalizeBlockLayout(MatShape& size, int C=0)
+{
+    if (size.layout == DATA_LAYOUT_BLOCK) {
+        CV_Assert(size.dims >= 4);
+        int C0 = size.p[size.dims-1];
+        CV_Assert(C0 > 1 && (C0 & (C0-1)) == 0);
+        size.C = C > 0 ? C : size.p[1]*size.p[size.dims-1];
+    } else {
+        size.C = 0;
+    }
+    for (int i = std::max(size.dims, 0); i < MatShape::MAX_DIMS; i++)
+        size.p[i] = 0;
+    if (size.dims == 0)
+        size.p[0] = 1;
+}
+
+MatShape::MatShape()
+{
+    clear();
+}
+
+MatShape::MatShape(size_t dims_, const int* size_, DataLayout layout_, int C_)
+{
+    layout = layout_;
+    CV_Assert(dims_ <= (size_t)MAX_DIMS);
+    dims = (int)dims_;
+    for (int i = 0; i < dims; i++) {
+        p[i] = size_ ? size_[i] : 0;
+    }
+    finalizeBlockLayout(*this, C_);
+}
+
+MatShape::MatShape(size_t dims_, int value, DataLayout layout_)
+{
+    layout = layout_;
+    CV_Assert(dims_ <= (size_t)MAX_DIMS);
+    dims = (int)dims_;
+    for (int i = 0; i < dims; i++) {
+        p[i] = value;
+    }
+    finalizeBlockLayout(*this, 0);
+}
+
+MatShape::MatShape(std::initializer_list<int> shape)
+{
+    layout = DATA_LAYOUT_UNKNOWN;
+    CV_Assert(shape.size() <= (size_t)MAX_DIMS);
+    dims = (int)shape.size();
+    auto it = shape.begin();
+    for (int i = 0; i < dims; i++, ++it) {
+        p[i] = *it;
+    }
+    finalizeBlockLayout(*this, 0);
+}
+
+MatShape::MatShape(int dims_, int value, DataLayout layout_)
+{
+    layout = layout_;
+    CV_Assert(dims_ <= MAX_DIMS);
+    dims = dims_;
+    for (int i = 0; i < dims; i++) {
+        p[i] = value;
+    }
+    finalizeBlockLayout(*this, 0);
+}
+
+MatShape::MatShape(const std::vector<int>& shape_, DataLayout layout_, int C_)
+{
+    layout = layout_;
+    size_t shape_size = shape_.size();
+    CV_Assert(shape_size < (size_t)MAX_DIMS);
+    dims = (int)shape_size;
+    for (int i = 0; i < dims; i++) {
+        p[i] = shape_[i];
+    }
+    finalizeBlockLayout(*this, C_);
+}
+
+MatShape::MatShape(const int* begin, const int* end, DataLayout layout_, int C_)
+{
+    layout = layout_;
+    ptrdiff_t shape_size = end - begin;
+    CV_Assert(0 <= shape_size && shape_size < MAX_DIMS);
+    dims = (int)shape_size;
+    for (int i = 0; i < dims; i++) {
+        p[i] = begin[i];
+    }
+    finalizeBlockLayout(*this, C_);
+}
+
+MatShape::MatShape(const MatShape& shape)
+{
+    dims = shape.dims;
+    layout = shape.layout;
+    C = shape.C;
+    for (int i = 0; i < MAX_DIMS; i++)
+        p[i] = shape.p[i];
+}
+
+MatShape& MatShape::operator = (const MatShape& shape)
+{
+    if (this != &shape) {
+        dims = shape.dims;
+        layout = shape.layout;
+        C = shape.C;
+        for (int i = 0; i < MAX_DIMS; i++)
+            p[i] = shape.p[i];
+    }
+    return *this;
+}
+
+bool MatShape::hasSymbols() const
+{
+    for (int i = 0; i < dims; i++) {
+        if (p[i] < 0)
+            return true;
+    }
+    return false;
+}
+
+MatShape MatShape::toBlock(int C0) const
+{
+    CV_Assert(dims >= 3);
+    // C0 should be > 1 and be a power-of-2: 2, 4, 8, ...
+    CV_Assert(C0 > 1 && (C0 & (C0-1)) == 0);
+    CV_Assert(layout == DATA_LAYOUT_NCHW || layout == DATA_LAYOUT_NHWC);
+    int c_idx = layout == DATA_LAYOUT_NCHW ? 1 : dims-1;
+
+    MatShape newsize = *this;
+    newsize.layout = DATA_LAYOUT_BLOCK;
+    newsize.C = p[c_idx];
+    newsize.p[newsize.dims++] = C0;
+    newsize.p[c_idx] = (p[c_idx] + C0 - 1)/C0;
+
+    return newsize;
+}
+
+MatShape MatShape::fromBlock(DataLayout newLayout) const
+{
+    CV_Assert(dims >= 4);
+    CV_Assert(layout == DATA_LAYOUT_BLOCK);
+    // C0 should be > 1 and be a power-of-2: 2, 4, 8, ...
+    int C0 = p[dims-1];
+    CV_Assert(C0 > 1 && (C0 & (C0-1)) == 0);
+    CV_Assert(p[1] == (C + C0-1)/C0);
+    CV_Assert(newLayout == DATA_LAYOUT_NCHW || newLayout == DATA_LAYOUT_NHWC);
+    int c_idx = newLayout == DATA_LAYOUT_NCHW ? 1 : dims-2;
+
+    MatShape newsize = *this;
+    newsize.layout = newLayout;
+    newsize.C = 0;
+    newsize.p[c_idx] = C;
+    newsize.dims--;
+
+    return newsize;
+}
+
+MatShape MatShape::expand(const MatShape& another) const
+{
+    if (dims == 0)
+        return another;
+    if (another.dims == 0)
+        return *this;
+
+    if ((layout == DATA_LAYOUT_NCHW || layout == DATA_LAYOUT_NHWC) &&
+        (another.layout == DATA_LAYOUT_NCHW || another.layout == DATA_LAYOUT_NHWC)) {
+        CV_Assert(layout == another.layout);
+    }
+    // [TODO] support block layout
+    CV_Assert(layout != DATA_LAYOUT_BLOCK && another.layout != DATA_LAYOUT_BLOCK);
+
+    MatShape result;
+
+    if (dims < 0 || another.dims < 0)
+        return result;
+
+    result = *this;
+    result.dims = std::max(dims, another.dims);
+    result.layout = layout == DATA_LAYOUT_UNKNOWN ? another.layout :
+        layout == DATA_LAYOUT_ND && (another.layout == DATA_LAYOUT_NCHW ||
+        another.layout == DATA_LAYOUT_NHWC) ? another.layout : layout;
+    for (int i = result.dims-1; i >= 0; i--) {
+        int i1 = i - (result.dims - dims);
+        int i2 = i - (result.dims - another.dims);
+        int sz1 = i1 < 0 ? 1 : p[i1];
+        int sz2 = i2 < 0 ? 1 : another.p[i2];
+        CV_Assert(sz1 == sz2 || sz1 == 1 || sz2 == 1);
+        // [TODO] handle symbolic shapes
+        result.p[i] = std::max(sz1, sz2);
+    }
+    return result;
+}
+
+MatShape::operator std::vector<int>() const
+{
+    if (dims < 0)
+        return std::vector<int>(1, 0);
+    return std::vector<int>(p, p + dims);
+}
+
+/////////////////////////// MatAllocator ////////////////////////////
+
 void MatAllocator::map(UMatData*, AccessFlag) const
 {
 }
@@ -403,6 +806,36 @@ Mat::Mat(const std::vector<int>& _sz, int _type, const Scalar& _s)
     *this = _s;
 }
 
+Mat::Mat(const MatShape& _shape, int _type)
+    : flags(MAGIC_VAL), dims(0), rows(0), cols(0), data(0), datastart(0), dataend(0),
+      datalimit(0), allocator(0), u(0), size(&rows), step(0)
+{
+    create(_shape, _type);
+}
+
+Mat::Mat(std::initializer_list<int> _shape, int _type)
+    : flags(MAGIC_VAL), dims(0), rows(0), cols(0), data(0), datastart(0), dataend(0),
+      datalimit(0), allocator(0), u(0), size(&rows), step(0)
+{
+    create(_shape, _type);
+}
+
+Mat::Mat(const MatShape& _shape, int _type, const Scalar& _s)
+    : flags(MAGIC_VAL), dims(0), rows(0), cols(0), data(0), datastart(0), dataend(0),
+      datalimit(0), allocator(0), u(0), size(&rows), step(0)
+{
+    create(_shape, _type);
+    *this = _s;
+}
+
+Mat::Mat(std::initializer_list<int> _shape, int _type, const Scalar& _s)
+    : flags(MAGIC_VAL), dims(0), rows(0), cols(0), data(0), datastart(0), dataend(0),
+      datalimit(0), allocator(0), u(0), size(&rows), step(0)
+{
+    create(_shape, _type);
+    *this = _s;
+}
+
 Mat::Mat(const Mat& m)
     : flags(m.flags), dims(m.dims), rows(m.rows), cols(m.cols), data(m.data),
       datastart(m.datastart), dataend(m.dataend), datalimit(m.datalimit), allocator(m.allocator),
@@ -557,6 +990,65 @@ void Mat::createSameSize(InputArray m, int type)
     _OutputArray(*this).createSameSize(m, type);
 }
 
+void Mat::fit(int _dims, const int* _sizes, int _type)
+{
+    size_t oldTotalBytes = u ? u->size : 0;
+    size_t esz = CV_ELEM_SIZE(_type), newTotal = _dims >= 0;
+    for (int i = 0; i < _dims; i++)
+        newTotal *= _sizes[i];
+    size_t newTotalBytes = newTotal*esz;
+    if (newTotalBytes > 0 && (!isContinuous() ||
+                              newTotalBytes > oldTotalBytes ||
+                              data != datastart)) {
+        create(_dims, _sizes, _type);
+    } else {
+        flags = (flags & ~Mat::TYPE_MASK) | CV_MAT_TYPE(_type);
+        int _dummy_size = 0;
+        setSize(*this, (_dims >= 0 ? _dims : 1), (_dims >= 0 ? _sizes : &_dummy_size), nullptr, true);
+        finalizeHdr(*this);
+    }
+}
+
+void Mat::fit(const std::vector<int>& _shape, int _type)
+{
+    fit((int)_shape.size(), _shape.data(), _type);
+}
+
+void Mat::fit(const MatShape& _shape, int _type)
+{
+    fit(_shape.dims, _shape.p, _type);
+}
+
+void Mat::fit(std::initializer_list<int> _shape, int _type)
+{
+    int new_shape[MatShape::MAX_DIMS];
+    int new_ndims = (int)_shape.size();
+    CV_Assert(new_ndims <= MatShape::MAX_DIMS);
+    auto it = _shape.begin();
+    for (int i = 0; i < new_ndims; i++, ++it)
+        new_shape[i] = *it;
+    fit(new_ndims, new_shape, _type);
+}
+
+void Mat::fit(int _rows, int _cols, int _type)
+{
+    _type &= TYPE_MASK;
+    int sz[] = {_rows, _cols};
+    fit(2, sz, _type);
+}
+
+void Mat::fit(Size _sz, int _type)
+{
+    fit(_sz.height, _sz.width, _type);
+}
+
+void Mat::fitSameSize(InputArray m, int _type)
+{
+    int _sizes[CV_MAX_DIM];
+    int _dims = m.sizend(_sizes);
+    fit(_dims, _sizes, _type);
+}
+
 void Mat::addref()
 {
     if( u )
@@ -613,6 +1105,10 @@ size_t Mat::total(int startDim, int endDim) const
     return p;
 }
 
+MatShape Mat::shape() const
+{
+    return dims == 0 && data == 0 ? MatShape() : MatShape(dims, size.p);
+}
 
 Mat::Mat(Mat&& m) CV_NOEXCEPT
     : flags(m.flags), dims(m.dims), rows(m.rows), cols(m.cols), data(m.data),
@@ -747,6 +1243,27 @@ void Mat::create(const std::vector<int>& _sizes, int _type)
     create((int)_sizes.size(), _sizes.data(), _type);
 }
 
+void Mat::create(const MatShape& _shape, int _type)
+{
+    if (_shape.dims < 0) {
+        release();
+        return;
+    }
+    create(_shape.dims, _shape.p, _type);
+}
+
+void Mat::create(std::initializer_list<int> _shape, int _type)
+{
+    int new_shape[MatShape::MAX_DIMS];
+    int new_ndims = (int)_shape.size();
+    CV_Assert(new_ndims <= MatShape::MAX_DIMS);
+    auto it = _shape.begin();
+    for (int i = 0; i < new_ndims; i++, ++it)
+        new_shape[i] = *it;
+
+    create(new_ndims, new_shape, _type);
+}
+
 void Mat::copySize(const Mat& m)
 {
     setSize(*this, m.dims, 0, 0);
@@ -870,6 +1387,37 @@ Mat::Mat(const std::vector<int>& _sizes, int _type, void* _data, const size_t* _
     finalizeHdr(*this);
 }
 
+Mat::Mat(const MatShape& _shape, int _type, void* _data, const size_t* _steps)
+    : flags(MAGIC_VAL), dims(0), rows(0), cols(0), data(0), datastart(0), dataend(0),
+      datalimit(0), allocator(0), u(0), size(&rows)
+{
+    flags |= CV_MAT_TYPE(_type);
+    datastart = data = (uchar*)_data;
+    if (_shape.dims >= 0) {
+        setSize(*this, _shape.dims, _shape.p, _steps, true);
+    }
+    else {
+        CV_Assert(!data);
+    }
+    finalizeHdr(*this);
+}
+
+Mat::Mat(std::initializer_list<int> _shape, int _type, void* _data, const size_t* _steps)
+    : flags(MAGIC_VAL), dims(0), rows(0), cols(0), data(0), datastart(0), dataend(0),
+      datalimit(0), allocator(0), u(0), size(&rows)
+{
+    int new_shape[MatShape::MAX_DIMS];
+    int new_ndims = (int)_shape.size();
+    CV_Assert(new_ndims <= MatShape::MAX_DIMS);
+    auto it = _shape.begin();
+    for (int i = 0; i < new_ndims; i++, ++it)
+        new_shape[i] = *it;
+
+    flags |= CV_MAT_TYPE(_type);
+    datastart = data = (uchar*)_data;
+    setSize(*this, new_ndims, new_shape, _steps, true);
+    finalizeHdr(*this);
+}
 
 Mat::Mat(const Mat& m, const Range* ranges)
     : flags(MAGIC_VAL), dims(0), rows(0), cols(0), data(0), datastart(0), dataend(0),
@@ -1297,6 +1845,26 @@ Mat Mat::reshape(int _cn, const std::vector<int>& _newshape) const
     return reshape(_cn, newdims, newdims > 0 ? &_newshape[0] : 0);
 }
 
+Mat Mat::reshape(int _cn, const MatShape& _newshape) const
+{
+    if (_newshape.dims < 0) {
+        int newshape[] = {0};
+        return reshape(_cn, 1, newshape);
+    }
+    return reshape(_cn, _newshape.dims, _newshape.p);
+}
+
+Mat Mat::reshape(int _cn, std::initializer_list<int> newshape_) const
+{
+    int newshape[MatShape::MAX_DIMS];
+    size_t i, newshape_dims = newshape_.size();
+    CV_Assert(newshape_dims <= (size_t)MatShape::MAX_DIMS);
+    auto it = newshape_.begin();
+    for (i = 0; i < newshape_dims; i++, ++it)
+        newshape[i] = *it;
+    return reshape(_cn, (int)newshape_dims, newshape);
+}
+
 Mat Mat::diag(const Mat& d)
 {
     CV_Assert( d.cols == 1 || d.rows == 1 );
diff --git a/modules/core/src/matrix_transform.cpp b/modules/core/src/matrix_transform.cpp
index bad17e7b6b..f59b4903c7 100644
--- a/modules/core/src/matrix_transform.cpp
+++ b/modules/core/src/matrix_transform.cpp
@@ -1081,6 +1081,16 @@ void broadcast(InputArray _src, InputArray _shape, OutputArray _dst) {
     }
 }
 
+void broadcast(InputArray _src, const MatShape& _shape, OutputArray _dst)
+{
+    if (_shape.dims < 0) {
+        _dst.release();
+    } else {
+        Mat shape(1, _shape.dims, CV_32S, (int*)_shape.p);
+        broadcast(_src, shape, _dst);
+    }
+}
+
 static void rotateImpl(InputArray _src, OutputArray _dst, int rotateMode)
 {
     switch (rotateMode)
diff --git a/modules/core/src/matrix_wrap.cpp b/modules/core/src/matrix_wrap.cpp
index 894380c878..b96dc40776 100644
--- a/modules/core/src/matrix_wrap.cpp
+++ b/modules/core/src/matrix_wrap.cpp
@@ -593,6 +593,41 @@ int _InputArray::sizend(int* arrsz, int i) const
     return d;
 }
 
+bool _InputArray::empty(int i) const
+{
+    _InputArray::KindFlag k = kind();
+    if (i >= 0) {
+        if (k == STD_VECTOR_MAT) {
+            auto mv = reinterpret_cast<const std::vector<Mat>*>(obj);
+            CV_Assert((size_t)i < mv->size());
+            return mv->at(i).empty();
+        }
+        else if (k == STD_VECTOR_MAT) {
+            auto umv = reinterpret_cast<const std::vector<UMat>*>(obj);
+            CV_Assert((size_t)i < umv->size());
+            return umv->at(i).empty();
+        }
+        else if (k == STD_VECTOR_VECTOR) {
+            auto vv = reinterpret_cast<const std::vector<std::vector<int> >*>(obj);
+            CV_Assert((size_t)i < vv->size());
+            return vv->at(i).empty();
+        } else {
+            CV_Error(Error::StsNotImplemented, "");
+        }
+    }
+    return empty();
+}
+
+MatShape _InputArray::shape(int i) const
+{
+    int sizes[CV_MAX_DIM];
+    int dims = sizend(sizes, i);
+
+    if (dims == 0 && empty(i))
+        return MatShape();
+    return MatShape(dims, sizes);
+}
+
 bool _InputArray::sameSize(const _InputArray& arr) const
 {
     _InputArray::KindFlag k1 = kind(), k2 = arr.kind();
@@ -1673,12 +1708,104 @@ void _OutputArray::create(int d, const int* sizes, int mtype, int i,
     CV_Error(Error::StsNotImplemented, "Unknown/unsupported array type");
 }
 
+void _OutputArray::create(const MatShape& shape, int mtype, int i,
+                          bool allowTransposed, _OutputArray::DepthMask fixedDepthMask) const
+{
+    if (shape.dims < 0) {
+        release();
+    } else {
+        create(shape.dims, shape.p, mtype, i, allowTransposed, fixedDepthMask);
+    }
+}
+
 void _OutputArray::createSameSize(const _InputArray& arr, int mtype) const
 {
     int arrsz[CV_MAX_DIM], d = arr.sizend(arrsz);
     create(d, arrsz, mtype);
 }
 
+void _OutputArray::fit(int d, const int* sizes, int mtype, int i,
+                       bool allowTransposed, _OutputArray::DepthMask fixedDepthMask) const
+{
+    int size0 = d > 0 ? sizes[0] : 1, size1 = d > 1 ? sizes[1] : 1;
+    _InputArray::KindFlag k = kind();
+    mtype = CV_MAT_TYPE(mtype);
+
+    if( (k == MAT && i < 0) || (k == STD_VECTOR_MAT && i >= 0) )
+    {
+        Mat* m;
+        if (k == MAT)
+            m = (Mat*)obj;
+        else {
+            std::vector<Mat>& v = *(std::vector<Mat>*)obj;
+            CV_Assert((size_t)i < v.size());
+            m = &v[i];
+        }
+        CV_Assert(!(m->empty() && fixedType() && fixedSize()) && "Can't reallocate empty Mat with locked layout (probably due to misused 'const' modifier)");
+        if (!m->empty() && d <= 2 && m->dims <= 2 &&
+            m->type() == mtype &&
+            ((m->rows == size0 && m->cols == size1) ||
+             (allowTransposed && m->rows == size1 && m->cols == size0 && m->isContinuous())))
+        {
+            return;
+        }
+
+        if(fixedType())
+        {
+            if(CV_MAT_CN(mtype) == m->channels() && ((1 << CV_MAT_DEPTH(flags)) & fixedDepthMask) != 0 )
+                mtype = m->type();
+            else
+                CV_CheckTypeEQ(m->type(), CV_MAT_TYPE(mtype), "Can't reallocate Mat with locked type (probably due to misused 'const' modifier)");
+        }
+        if(fixedSize())
+        {
+            CV_CheckEQ(m->dims, d, "Can't reallocate Mat with locked size (probably due to misused 'const' modifier)");
+            for(int j = 0; j < d; ++j)
+                CV_CheckEQ(m->size[j], sizes[j], "Can't reallocate Mat with locked size (probably due to misused 'const' modifier)");
+        }
+        m->fit(d, sizes, mtype);
+        return;
+    }
+
+    if( (k == UMAT && i < 0) || (k == STD_VECTOR_UMAT && i >= 0) )
+    {
+        UMat* m;
+        if (k == UMAT)
+            m = (UMat*)obj;
+        else {
+            std::vector<UMat>& v = *(std::vector<UMat>*)obj;
+            CV_Assert((size_t)i < v.size());
+            m = &v[i];
+        }
+        CV_Assert(!(m->empty() && fixedType() && fixedSize()) && "Can't reallocate empty Mat with locked layout (probably due to misused 'const' modifier)");
+        if (!m->empty() && d <= 2 && m->dims <= 2 &&
+            m->type() == mtype &&
+            ((m->rows == size0 && m->cols == size1) ||
+             (allowTransposed && m->rows == size1 && m->cols == size0 && m->isContinuous())))
+        {
+            return;
+        }
+
+        if(fixedType())
+        {
+            if(CV_MAT_CN(mtype) == m->channels() && ((1 << CV_MAT_DEPTH(flags)) & fixedDepthMask) != 0 )
+                mtype = m->type();
+            else
+                CV_CheckTypeEQ(m->type(), CV_MAT_TYPE(mtype), "Can't reallocate Mat with locked type (probably due to misused 'const' modifier)");
+        }
+        if(fixedSize())
+        {
+            CV_CheckEQ(m->dims, d, "Can't reallocate Mat with locked size (probably due to misused 'const' modifier)");
+            for(int j = 0; j < d; ++j)
+                CV_CheckEQ(m->size[j], sizes[j], "Can't reallocate Mat with locked size (probably due to misused 'const' modifier)");
+        }
+        m->fit(d, sizes, mtype);
+        return;
+    }
+
+    create(d, sizes, mtype, i, allowTransposed, fixedDepthMask);
+}
+
 void _OutputArray::release() const
 {
     CV_Assert(!fixedSize());
diff --git a/modules/core/src/out.cpp b/modules/core/src/out.cpp
index e35fafa1ab..4363a70a38 100644
--- a/modules/core/src/out.cpp
+++ b/modules/core/src/out.cpp
@@ -403,4 +403,186 @@ namespace cv
         }
         return makePtr<DefaultFormatter>();
     }
+
+    template<typename _Tp> struct Fmt
+    {
+        typedef int temp_type;
+        static const char* fmt() { return "%d"; }
+    };
+
+    template<> struct Fmt<uint32_t>
+    {
+        typedef unsigned temp_type;
+        static const char* fmt() { return "%u"; }
+    };
+
+    template<> struct Fmt<int64_t>
+    {
+        typedef long long temp_type;
+        static const char* fmt() { return "%lld"; }
+    };
+
+    template<> struct Fmt<uint64_t>
+    {
+        typedef unsigned long long temp_type;
+        static const char* fmt() { return "%llu"; }
+    };
+
+    template<> struct Fmt<float>
+    {
+        typedef float temp_type;
+        static const char* fmt() { return "%.5g"; }
+    };
+
+    template<> struct Fmt<double>
+    {
+        typedef double temp_type;
+        static const char* fmt() { return "%.5g"; }
+    };
+
+    template<> struct Fmt<hfloat>
+    {
+        typedef float temp_type;
+        static const char* fmt() { return "%.5g"; }
+    };
+
+    template<> struct Fmt<bfloat>
+    {
+        typedef float temp_type;
+        static const char* fmt() { return "%.4g"; }
+    };
+
+    template <typename _Tp>
+    static void pprintRow(std::ostream& strm, const _Tp* ptr, int n, size_t ofs, int edge)
+    {
+        char buf[128];
+        const char* fmt = Fmt<_Tp>::fmt();
+        int i, ndump = edge > 0 ? std::min(n, edge*2+1) : n;
+        if (edge == 0)
+            edge = ndump;
+        for (i = 0; i < ndump; i++) {
+            int j = n == ndump || i < edge ? i : i == edge ? -1 : n-edge*2-1+i;
+            if (i > 0)
+                strm << ", ";
+            if (j >= 0) {
+                snprintf(buf, sizeof(buf), fmt, (typename Fmt<_Tp>::temp_type)ptr[ofs + j]);
+                strm << buf;
+            } else
+                strm << "... ";
+        }
+    }
+
+    static void pprintSlice(std::ostream& strm, const Mat& tensor,
+                            const size_t* step, int d,
+                            size_t ofs, int edge)
+    {
+        MatShape shape = tensor.shape();
+        int ndims = shape.dims;
+        int n = d >= ndims ? 1 : shape[d];
+        if (d >= ndims - 1) {
+            int typ = tensor.depth();
+            void* data = tensor.data;
+            CV_Assert(data);
+            n *= tensor.channels();
+            if (typ == CV_8U)
+                pprintRow(strm, (const uint8_t*)data, n, ofs, edge);
+            else if (typ == CV_8S)
+                pprintRow(strm, (const int8_t*)data, n, ofs, edge);
+            else if (typ == CV_16U)
+                pprintRow(strm, (const uint16_t*)data, n, ofs, edge);
+            else if (typ == CV_16S)
+                pprintRow(strm, (const int16_t*)data, n, ofs, edge);
+            else if (typ == CV_32U)
+                pprintRow(strm, (const unsigned*)data, n, ofs, edge);
+            else if (typ == CV_32S)
+                pprintRow(strm, (const int*)data, n, ofs, edge);
+            else if (typ == CV_64U)
+                pprintRow(strm, (const uint64_t*)data, n, ofs, edge);
+            else if (typ == CV_64S)
+                pprintRow(strm, (const int64_t*)data, n, ofs, edge);
+            else if (typ == CV_32F)
+                pprintRow(strm, (const float*)data, n, ofs, edge);
+            else if (typ == CV_64F)
+                pprintRow(strm, (const double*)data, n, ofs, edge);
+            else if (typ == CV_16F)
+                pprintRow(strm, (const hfloat*)data, n, ofs, edge);
+            else if (typ == CV_16BF)
+                pprintRow(strm, (const bfloat*)data, n, ofs, edge);
+            else if (typ == CV_Bool)
+                pprintRow(strm, (const bool*)data, n, ofs, edge);
+            else {
+                CV_Error(Error::StsNotImplemented, "unsupported type");
+            }
+        } else {
+            int i, ndump = edge > 0 ? std::min(n, edge*2+1) : n;
+            bool dots = false;
+            for (i = 0; i < ndump; i++) {
+                if (i > 0 && !dots) {
+                    int nempty_lines = ndims - 2 - d;
+                    for (int k = 0; k < nempty_lines; k++)
+                        strm << "\n";
+                }
+                if (i > 0)
+                    strm << "\n";
+                int j = n == ndump || i < edge ? i :
+                        i == edge ? -1 :
+                        n - edge*2 - 1 + i;
+                dots = j < 0;
+                if (!dots)
+                    pprintSlice(strm, tensor, step, d+1, ofs + j*step[d], edge);
+                else
+                    strm << "...";
+            }
+        }
+    }
+
+    std::ostream& pprint(std::ostream& strm, InputArray array,
+                         int /*indent*/, int edge_,
+                         int wholeTensorThreshold,
+                         char parens)
+    {
+        char oparen = parens;
+        char cparen = parens == '(' ? ')' :
+                      parens == '[' ? ']' :
+                      parens == '{' ? '}' :
+                      parens == '<' ? '>' :
+                      parens;
+        int edge = edge_ > 0 ? edge_ : 3;
+        wholeTensorThreshold = wholeTensorThreshold > 0 ? wholeTensorThreshold : 100;
+
+        Mat tensor = array.getMat();
+        if (!tensor.isContinuous()) {
+            // [TODO] print non-continous arrays without copy
+            Mat temp;
+            tensor.copyTo(temp);
+            tensor = temp;
+        }
+
+        MatShape shape = tensor.shape();
+        size_t sz_all = tensor.total();
+
+        if (parens)
+            strm << oparen;
+        if (sz_all == 0) {
+            if (!parens)
+                strm << "<empty>";
+        } else {
+            if (sz_all <= (size_t)wholeTensorThreshold)
+                edge = 0;
+
+            int ndims = shape.dims;
+            int cn = tensor.channels();
+            size_t step[MatShape::MAX_DIMS];
+            step[std::max(ndims-1, 0)] = 1;
+            for (int i = ndims-2; i >= 0; i--) {
+                step[i] = step[i+1]*shape[i+1]*cn;
+                cn = 1;
+            }
+            pprintSlice(strm, tensor, step, 0, 0, edge);
+        }
+        if (parens)
+            strm << cparen;
+        return strm;
+    }
+
 } // cv
diff --git a/modules/core/src/umatrix.cpp b/modules/core/src/umatrix.cpp
index a98e98d4b4..cc011da4f4 100644
--- a/modules/core/src/umatrix.cpp
+++ b/modules/core/src/umatrix.cpp
@@ -412,6 +412,10 @@ size_t UMat::total() const
     return p;
 }
 
+MatShape UMat::shape() const
+{
+    return dims == 0 && u == 0 ? MatShape() : MatShape(dims, size.p);
+}
 
 UMat::UMat(UMat&& m)
 : flags(m.flags), dims(m.dims), rows(m.rows), cols(m.cols), allocator(m.allocator),
@@ -751,6 +755,67 @@ void UMat::create(const std::vector<int>& _sizes, int _type, UMatUsageFlags _usa
     create((int)_sizes.size(), _sizes.data(), _type, _usageFlags);
 }
 
+void UMat::create(const MatShape& _shape, int _type, UMatUsageFlags _usageFlags)
+{
+    if (_shape.dims < 0) {
+        release();
+    } else {
+        create(_shape.dims, _shape.p, _type, _usageFlags);
+    }
+}
+
+void UMat::fit(int _dims, const int* _sizes, int _type, UMatUsageFlags _usageFlags)
+{
+    if (_usageFlags == cv::USAGE_DEFAULT)
+        _usageFlags = usageFlags;
+    size_t oldTotalBytes = u ? u->size : 0;
+    size_t esz = CV_ELEM_SIZE(_type), newTotal = _dims >= 0;
+    for (int i = 0; i < _dims; i++)
+        newTotal *= _sizes[i];
+    size_t newTotalBytes = newTotal*esz;
+    if (newTotalBytes > 0 &&
+        (!isContinuous() ||
+         newTotalBytes > oldTotalBytes ||
+         offset != 0 ||
+         _usageFlags != usageFlags)) {
+        create(_dims, _sizes, _type, _usageFlags);
+    } else {
+        flags = (flags & ~Mat::TYPE_MASK) | CV_MAT_TYPE(_type);
+        int _dummy_size = 0;
+        setSize(*this, (_dims >= 0 ? _dims : 1), (_dims >= 0 ? _sizes : &_dummy_size), nullptr, true);
+        finalizeHdr(*this);
+    }
+}
+
+void UMat::fit(const std::vector<int>& _shape, int _type, UMatUsageFlags _usageFlags)
+{
+    fit((int)_shape.size(), _shape.data(), _type, _usageFlags);
+}
+
+void UMat::fit(const MatShape& _shape, int _type, UMatUsageFlags _usageFlags)
+{
+    fit(_shape.dims, _shape.p, _type, _usageFlags);
+}
+
+void UMat::fit(int _rows, int _cols, int _type, UMatUsageFlags _usageFlags)
+{
+    _type &= TYPE_MASK;
+    int sz[] = {_rows, _cols};
+    fit(2, sz, _type, _usageFlags);
+}
+
+void UMat::fit(Size _sz, int _type, UMatUsageFlags _usageFlags)
+{
+    fit(_sz.height, _sz.width, _type, _usageFlags);
+}
+
+void UMat::fitSameSize(InputArray m, int _type, UMatUsageFlags _usageFlags)
+{
+    int _sizes[CV_MAX_DIM];
+    int _dims = m.sizend(_sizes);
+    fit(_dims, _sizes, _type, _usageFlags);
+}
+
 void UMat::copySize(const UMat& m)
 {
     setSize(*this, m.dims, 0, 0);
@@ -1101,6 +1166,15 @@ UMat UMat::reshape(int _cn, int _newndims, const int* _newsz) const
     CV_Error(cv::Error::StsNotImplemented, "Reshaping of n-dimensional non-continuous matrices is not supported yet");
 }
 
+UMat UMat::reshape(int _cn, const MatShape& _newshape) const
+{
+    if (_newshape.dims < 0) {
+        int newshape[] = {0};
+        return reshape(_cn, 1, newshape);
+    }
+    return reshape(_cn, _newshape.dims, _newshape.p);
+}
+
 Mat UMat::getMat(AccessFlag accessFlags) const
 {
     if(!u)
diff --git a/modules/dnn/include/opencv2/dnn/all_layers.hpp b/modules/dnn/include/opencv2/dnn/all_layers.hpp
index 16305e8292..7d8ff6b058 100644
--- a/modules/dnn/include/opencv2/dnn/all_layers.hpp
+++ b/modules/dnn/include/opencv2/dnn/all_layers.hpp
@@ -86,6 +86,15 @@ CV__DNN_INLINE_NS_BEGIN
         static Ptr<Layer> create(const LayerParams &params);
     };
 
+    /**
+     * Constant layer produces the same data blob at an every forward pass.
+     */
+    class CV_EXPORTS ConstantOfShapeLayer : public Layer
+    {
+    public:
+        static Ptr<ConstantOfShapeLayer> create(const LayerParams &params);
+    };
+
     //! LSTM recurrent layer
     class CV_EXPORTS LSTMLayer : public Layer
     {
@@ -360,6 +369,15 @@ CV__DNN_INLINE_NS_BEGIN
         static Ptr<GatherLayer> create(const LayerParams& params);
     };
 
+    // ONNX-compliant implementation of Gather
+    class CV_EXPORTS Gather2Layer : public Layer
+    {
+    public:
+        int axis;
+
+        static Ptr<Gather2Layer> create(const LayerParams& params);
+    };
+
     /** @brief GatherElements layer
     * GatherElements takes two inputs data and indices of the same rank r >= 1 and an optional attribute axis and works such that:
     *   output[i][j][k] = data[index[i][j][k]][j][k] if axis = 0 and r = 3
@@ -462,6 +480,14 @@ CV__DNN_INLINE_NS_BEGIN
         static Ptr<MVNLayer> create(const LayerParams& params);
     };
 
+    class CV_EXPORTS ShapeLayer : public Layer
+    {
+    public:
+        int start, end;
+
+        static Ptr<ShapeLayer> create(const LayerParams& params);
+    };
+
     /* Reshaping */
 
     class CV_EXPORTS ReshapeLayer : public Layer
@@ -473,12 +499,42 @@ CV__DNN_INLINE_NS_BEGIN
         static Ptr<ReshapeLayer> create(const LayerParams& params);
     };
 
+    class CV_EXPORTS Reshape2Layer : public Layer
+    {
+    public:
+        MatShape newShapeDesc;
+
+        static Ptr<Reshape2Layer> create(const LayerParams& params);
+    };
+
     class CV_EXPORTS FlattenLayer : public Layer
     {
     public:
         static Ptr<FlattenLayer> create(const LayerParams &params);
     };
 
+    class CV_EXPORTS SqueezeLayer : public Layer
+    {
+    public:
+        std::vector<int> axes;
+
+        static Ptr<SqueezeLayer> create(const LayerParams& params);
+    };
+
+    class CV_EXPORTS UnsqueezeLayer : public Layer
+    {
+    public:
+        std::vector<int> axes;
+
+        static Ptr<UnsqueezeLayer> create(const LayerParams& params);
+    };
+
+    class CV_EXPORTS RangeLayer : public Layer
+    {
+    public:
+        static Ptr<RangeLayer> create(const LayerParams& params);
+    };
+
     class CV_EXPORTS QuantizeLayer : public Layer
     {
     public:
@@ -487,6 +543,17 @@ CV__DNN_INLINE_NS_BEGIN
         static Ptr<QuantizeLayer> create(const LayerParams &params);
     };
 
+    class CV_EXPORTS QuantizeLinearLayer : public Layer
+    {
+    public:
+        int axis;
+        int block_size;
+        int output_dtype;
+        bool saturate;
+
+        static Ptr<QuantizeLinearLayer> create(const LayerParams& params);
+    };
+
     class CV_EXPORTS DequantizeLayer : public Layer
     {
     public:
@@ -495,6 +562,15 @@ CV__DNN_INLINE_NS_BEGIN
         static Ptr<DequantizeLayer> create(const LayerParams &params);
     };
 
+    class CV_EXPORTS DequantizeLinearLayer : public Layer
+    {
+    public:
+        int axis;
+        int block_size;
+
+        static Ptr<DequantizeLinearLayer> create(const LayerParams& params);
+    };
+
     class CV_EXPORTS RequantizeLayer : public Layer
     {
     public:
@@ -518,6 +594,14 @@ CV__DNN_INLINE_NS_BEGIN
         static Ptr<ConcatLayer> create(const LayerParams &params);
     };
 
+    class CV_EXPORTS Concat2Layer : public Layer
+    {
+    public:
+        int axis;
+
+        static Ptr<Concat2Layer> create(const LayerParams &params);
+    };
+
     class CV_EXPORTS SplitLayer : public Layer
     {
     public:
@@ -526,6 +610,16 @@ CV__DNN_INLINE_NS_BEGIN
         static Ptr<SplitLayer> create(const LayerParams &params);
     };
 
+    // ONNX-compliant version of Split
+    class CV_EXPORTS Split2Layer : public Layer
+    {
+    public:
+        int axis;
+        std::vector<int> split;
+
+        static Ptr<Split2Layer> create(const LayerParams& params);
+    };
+
     /**
      * Slice layer has several modes:
      * 1. Caffe mode
@@ -567,12 +661,31 @@ CV__DNN_INLINE_NS_BEGIN
         static Ptr<SliceLayer> create(const LayerParams &params);
     };
 
+    // ONNX-compliant version of Slice
+    class CV_EXPORTS Slice2Layer : public Layer
+    {
+    public:
+        std::vector<int> starts, ends, axes;
+
+        static Ptr<Slice2Layer> create(const LayerParams &params);
+    };
+
     class CV_EXPORTS PermuteLayer : public Layer
     {
     public:
         static Ptr<PermuteLayer> create(const LayerParams& params);
     };
 
+    // ONNX-compliant version of Transpose
+    // (previously implemented in PermuteLayer)
+    class CV_EXPORTS TransposeLayer : public Layer
+    {
+    public:
+        std::vector<int> perm;
+
+        static Ptr<TransposeLayer> create(const LayerParams& params);
+    };
+
     /**
      * Permute channels of 4-dimensional input blob.
      * @param group Number of groups to split input channels and pick in turns
@@ -616,6 +729,12 @@ CV__DNN_INLINE_NS_BEGIN
         static Ptr<PaddingLayer> create(const LayerParams& params);
     };
 
+    class CV_EXPORTS Pad2Layer : public Layer
+    {
+    public:
+        static Ptr<Pad2Layer> create(const LayerParams& params);
+    };
+
     /* Activations */
     class CV_EXPORTS ActivationLayer : public Layer
     {
@@ -1157,6 +1276,12 @@ CV__DNN_INLINE_NS_BEGIN
         static Ptr<TileLayer> create(const LayerParams& params);
     };
 
+    class CV_EXPORTS Tile2Layer : public Layer
+    {
+    public:
+        static Ptr<Tile2Layer> create(const LayerParams& params);
+    };
+
     class CV_EXPORTS LayerNormLayer : public Layer
     {
     public:
@@ -1188,6 +1313,12 @@ CV__DNN_INLINE_NS_BEGIN
         static Ptr<ExpandLayer> create(const LayerParams &params);
     };
 
+    class CV_EXPORTS Expand2Layer : public Layer
+    {
+    public:
+        static Ptr<Expand2Layer> create(const LayerParams &params);
+    };
+
     class CV_EXPORTS InstanceNormLayer : public Layer {
     public:
         float epsilon;
diff --git a/modules/dnn/include/opencv2/dnn/dict.hpp b/modules/dnn/include/opencv2/dnn/dict.hpp
index 059ce9b28e..870fdc3cdc 100644
--- a/modules/dnn/include/opencv2/dnn/dict.hpp
+++ b/modules/dnn/include/opencv2/dnn/dict.hpp
@@ -138,6 +138,10 @@ public:
     template <typename T>
     T get(const String &key, const T &defaultValue) const;
 
+    //! If the @p key in the dictionary then returns its value, else returns empty vector.
+    template <typename T>
+    std::vector<T> getVector(const String &key) const;
+
     //! Sets new @p value for the @p key, or adds new key-value pair into the dictionary.
     template<typename T>
     const T &set(const String &key, const T &value);
diff --git a/modules/dnn/include/opencv2/dnn/dnn.hpp b/modules/dnn/include/opencv2/dnn/dnn.hpp
index 0234e32e50..66a943fc66 100644
--- a/modules/dnn/include/opencv2/dnn/dnn.hpp
+++ b/modules/dnn/include/opencv2/dnn/dnn.hpp
@@ -42,6 +42,7 @@
 #ifndef OPENCV_DNN_DNN_HPP
 #define OPENCV_DNN_DNN_HPP
 
+#include <ostream>
 #include <vector>
 #include <opencv2/core.hpp>
 #include "opencv2/core/async.hpp"
@@ -61,7 +62,6 @@ CV__DNN_INLINE_NS_BEGIN
 //! @addtogroup dnn
 //! @{
 
-    typedef std::vector<int> MatShape;
     typedef int MatType;
 
     /**
@@ -107,21 +107,30 @@ CV__DNN_INLINE_NS_BEGIN
         DNN_TARGET_CPU_FP16, // Only the ARM platform is supported. Low precision computing, accelerate model inference.
     };
 
-    /**
-     * @brief Enum of data layout for model inference.
-     * @see Image2BlobParams
-     */
-    enum DataLayout
+    enum TracingMode
+    {
+        DNN_TRACE_NONE = 0, //!< Don't trace anything
+        DNN_TRACE_ALL = 1, //!< Print all executed operations along with the output tensors, more or less compatible with ONNX Runtime
+        DNN_TRACE_OP = 2 //!< Print all executed operations. Types and shapes of all inputs and outputs are printed, but the content is not.
+    };
+
+    enum ProfilingMode
     {
-        DNN_LAYOUT_UNKNOWN = 0,
-        DNN_LAYOUT_ND = 1,        //!< OpenCV data layout for 2D data.
-        DNN_LAYOUT_NCHW = 2,      //!< OpenCV data layout for 4D data.
-        DNN_LAYOUT_NCDHW = 3,      //!< OpenCV data layout for 5D data.
-        DNN_LAYOUT_NHWC = 4,      //!< Tensorflow-like data layout for 4D data.
-        DNN_LAYOUT_NDHWC = 5,      //!< Tensorflow-like data layout for 5D data.
-        DNN_LAYOUT_PLANAR = 6,     //!< Tensorflow-like data layout, it should only be used at tf or tflite model parsing.
+        DNN_PROFILE_NONE = 0, //!< Don't do any profiling
+        DNN_PROFILE_SUMMARY = 1, //!< Collect the summary statistics by layer type (e.g. all "Conv2D" or all "Add") and print it in the end, sorted by the execution time (most expensive layers first). Note that it may introduce some overhead and cause slowdown, especially in the case of non-CPU backends.
+        DNN_PROFILE_DETAILED = 2 //!< Print execution time of each single layer. Note that it may introduce some overhead and cause slowdown, especially in the case of non-CPU backends.
     };
 
+    enum ModelFormat {
+        DNN_MODEL_GENERIC = 0, //!< Some generic model format
+        DNN_MODEL_ONNX = 1, //!< ONNX model
+        DNN_MODEL_TF = 2, //!< TF model
+        DNN_MODEL_TFLITE = 3, //!< TFLite model
+        DNN_MODEL_CAFFE = 4, //!< Caffe model
+    };
+
+    CV_EXPORTS std::string modelFormatToString(ModelFormat modelFormat);
+
     CV_EXPORTS std::vector< std::pair<Backend, Target> > getAvailableBackends();
     CV_EXPORTS_W std::vector<Target> getAvailableTargets(dnn::Backend be);
 
@@ -218,6 +227,40 @@ CV__DNN_INLINE_NS_BEGIN
         int hostMatDepth = -1;
     };
 
+    struct CV_EXPORTS Arg
+    {
+        Arg();
+        explicit Arg(int idx_);
+        bool empty() const;
+        operator bool() const;
+        // idx > 0: the Arg is input or output argument of some operation inside inference graph
+        // idx < 0: the Arg is input or output argument of a pattern
+        // idx == 0: no/empty argument; used in operations where some of the inputs/outputs are optional.
+        int idx;
+    };
+
+    enum ArgKind {
+        DNN_ARG_EMPTY=0, //!< valid only for Arg.idx==0. It's "no-arg"
+        DNN_ARG_CONST=1, //!< a constant argument.
+        DNN_ARG_INPUT=2, //!< input of the whole model. Before Net::forward() or in Net::forward() all inputs must be set
+        DNN_ARG_OUTPUT=3, //!< output of the model.
+        DNN_ARG_TEMP=4,   //!< intermediate result, a result of some operation and input to some other operation(s).
+        DNN_ARG_PATTERN=5 //!< not used for now
+    };
+
+    CV_EXPORTS std::string argKindToString(ArgKind kind);
+
+    struct CV_EXPORTS ArgData
+    {
+        ArgData();
+        std::string name;
+        ArgKind kind;
+        MatShape shape;
+        int type;
+    };
+
+    class CV_EXPORTS Net;
+    class CV_EXPORTS Graph;
     class CV_EXPORTS ActivationLayer;
 
     /** @brief This interface class allows to build new Layers - are building blocks of networks.
@@ -231,6 +274,11 @@ CV__DNN_INLINE_NS_BEGIN
 
         //! List of learned parameters must be stored here to allow read them by using Net::getParam().
         CV_PROP_RW std::vector<Mat> blobs;
+        std::vector<Arg> inputs;
+        std::vector<Arg> outputs;
+        void* netimpl;
+
+        virtual std::vector<Ptr<Graph> >* subgraphs() const;
 
         /** @brief Computes and sets internal parameters according to inputs, outputs and blobs.
          *  @deprecated Use Layer::finalize(InputArrayOfArrays, OutputArrayOfArrays) instead
@@ -413,10 +461,30 @@ CV__DNN_INLINE_NS_BEGIN
                               std::vector<MatType>&internals) const;
 
         virtual int64 getFLOPS(const std::vector<MatShape> &inputs,
-                               const std::vector<MatShape> &outputs) const {CV_UNUSED(inputs); CV_UNUSED(outputs); return 0;}
+                               const std::vector<MatShape> &outputs) const;
 
         virtual bool updateMemoryShapes(const std::vector<MatShape> &inputs);
 
+        // returns true if the operation takes a single input and can always be performed in-place,
+        // assuming that the input is contiguous.
+        // Examples of such operations are: Reshape, Flatten, Squeeze, Unsqueeze,
+        // as well many unary element-wise operations (ReLU, Tanh, ...)
+        virtual bool alwaysSupportInplace() const;
+
+        // returns false if the shape of Layer outputs is defined only by the shapes of inputs.
+        // Sometimes the shape depends on the content of the input(s), then the method should return true.
+        // In such a rare case forward() method should take care of proper allocation of the output tensors.
+        // On the other hand, when this method returns false, the engine takes care of proper allocation of the outputs,
+        // so that forward() can assume that the outputs are already allocated.
+        virtual bool dynamicOutputShapes() const;
+
+        // dumps attributes of the layer (e.g. strides, dilations in Convolution, MaxPool)
+        virtual std::ostream& dumpAttrs(std::ostream& strm, int indent) const;
+
+        // dumps information about the layer. The default implementation is usually good enough,
+        // just override dumpAttrs().
+        virtual std::ostream& dump(std::ostream& strm, int indent, bool comma) const;
+
         CV_PROP String name; //!< Name of the layer instance, can be used for logging or other internal purposes.
         CV_PROP String type; //!< Type name which was used for creating layer by layer factory.
         CV_PROP int preferableTarget; //!< prefer target for layer forwarding
@@ -427,6 +495,32 @@ CV__DNN_INLINE_NS_BEGIN
         virtual ~Layer();
     };
 
+    /** @brief Represents graph or subgraph of a model.
+     * The graph (in mathematical terms it's rather a multigraph) is represented
+     * as a topologically-sorted linear sequence of operations.
+     * Each operation is a smart pointer to a Layer (some of its derivative class instance), which
+     * includes a list of inputs and outputs, as well as an optional list of subgraphs (e.g. 'If' contains 2 subgraphs).
+     */
+    class CV_EXPORTS Graph
+    {
+    public:
+        static Ptr<Graph> create(void* netimpl, const std::string& name,
+                                 const std::vector<Arg>& inputs);
+        virtual ~Graph();
+        virtual bool empty() const = 0;
+        virtual void clear() = 0;
+        virtual std::string name() const = 0;
+        virtual const std::vector<Arg>& append(Ptr<Layer>& layer,
+                    const std::vector<std::string>& outnames=std::vector<std::string>()) = 0;
+        virtual Arg append(Ptr<Layer>& layer, const std::string& outname=std::string()) = 0;
+        virtual std::ostream& dump(std::ostream& strm, int indent, bool comma) = 0;
+        virtual const std::vector<Arg>& inputs() const = 0;
+        virtual const std::vector<Arg>& outputs() const = 0;
+        virtual void setOutputs(const std::vector<Arg>& outputs) = 0;
+        virtual const std::vector<Ptr<Layer> >& prog() const = 0;
+        virtual void setProg(const std::vector<Ptr<Layer> >& newprog) = 0;
+    };
+
     /** @brief This class allows to create and manipulate comprehensive artificial neural networks.
      *
      * Neural network is presented as directed acyclic graph (DAG), where vertices are Layer instances,
@@ -491,6 +585,10 @@ CV__DNN_INLINE_NS_BEGIN
          *  Call method after setInput(). To see correct backend, target and fusion run after forward().
         */
         CV_WRAP void dumpToPbtxt(CV_WRAP_FILE_PATH const String& path);
+        /** @brief Dump net structure, hyperparameters, backend, target and fusion to the specified output stream
+         *  @param strm   the target stream
+        */
+        void dumpToStream(std::ostream& strm) const;
 
         /** @brief Adds new layer to the net.
          *  @param name   unique name of the adding layer.
@@ -650,6 +748,33 @@ CV__DNN_INLINE_NS_BEGIN
          */
         CV_WRAP void setPreferableTarget(int targetId);
 
+        /**
+         * @brief Set the tracing mode
+         * @param[in] tracingMode the tracing mode, see DNN_TRACE_*
+         */
+        CV_WRAP void setTracingMode(TracingMode tracingMode);
+
+        /**
+         * @brief Retrieve the current tracing mode
+         */
+        CV_WRAP TracingMode getTracingMode() const;
+
+        /**
+         * @brief Set the profiling mode
+         * @param[in] profilingMode the profiling mode, see DNN_PROFILE_*
+         */
+        CV_WRAP void setProfilingMode(ProfilingMode profilingMode);
+
+        /**
+         * @brief Retrieve the current profiling mode
+         */
+        CV_WRAP ProfilingMode getProfilingMode() const;
+
+        /**
+         * @brief Retrieve the current model format, see DNN_MODEL_*
+         */
+        CV_WRAP ModelFormat getModelFormat() const;
+
         /** @brief Sets the new input value for the network
          *  @param blob        A new blob. Should have CV_32F or CV_8U depth.
          *  @param name        A name of input layer.
@@ -703,20 +828,25 @@ CV__DNN_INLINE_NS_BEGIN
          *  @param inLayersShapes output parameter for input layers shapes;
          * order is the same as in layersIds
          *  @param outLayersShapes output parameter for output layers shapes;
-         * order is the same as in layersIds
+         * order is the same as in layersIds.
+         *
+         * This overload should be deprecated
          */
-        CV_WRAP void getLayersShapes(const std::vector<MatShape>& netInputShapes,
-                                     const std::vector<int>& netInputTypes,
-                                     CV_OUT std::vector<int>& layersIds,
-                                     CV_OUT std::vector<std::vector<MatShape> >& inLayersShapes,
-                                     CV_OUT std::vector<std::vector<MatShape> >& outLayersShapes) const;
+        void getLayersShapes(const std::vector<MatShape>& netInputShapes,
+                             const std::vector<int>& netInputTypes,
+                             CV_OUT std::vector<int>& layersIds,
+                             CV_OUT std::vector<std::vector<MatShape> >& inLayersShapes,
+                             CV_OUT std::vector<std::vector<MatShape> >& outLayersShapes) const;
 
-        /** @overload */
-        CV_WRAP void getLayersShapes(const MatShape& netInputShape,
-                                     const int& netInputType,
-                                     CV_OUT std::vector<int>& layersIds,
-                                     CV_OUT std::vector<std::vector<MatShape> >& inLayersShapes,
-                                     CV_OUT std::vector<std::vector<MatShape> >& outLayersShapes) const;
+        /** @overload
+         *
+         * This overload should be deprecated
+        */
+        void getLayersShapes(const MatShape& netInputShape,
+                             const int& netInputType,
+                             CV_OUT std::vector<int>& layersIds,
+                             CV_OUT std::vector<std::vector<MatShape> >& inLayersShapes,
+                             CV_OUT std::vector<std::vector<MatShape> >& outLayersShapes) const;
 
         /** @brief Returns input and output shapes for layer with specified
          * id in loaded model; preliminary inferencing isn't necessary.
@@ -727,15 +857,20 @@ CV__DNN_INLINE_NS_BEGIN
          * order is the same as in layersIds
          *  @param outLayerShapes output parameter for output layers shapes;
          * order is the same as in layersIds
-         */
-        CV_WRAP void getLayerShapes(const MatShape& netInputShape,
-                                    const int& netInputType,
-                                    const int layerId,
-                                    CV_OUT std::vector<MatShape>& inLayerShapes,
-                                    CV_OUT std::vector<MatShape>& outLayerShapes) const; // FIXIT: CV_WRAP
+         *
+         * This overload should be deprecated
+        */
+        void getLayerShapes(const MatShape& netInputShape,
+                            const int& netInputType,
+                            const int layerId,
+                            CV_OUT std::vector<MatShape>& inLayerShapes,
+                            CV_OUT std::vector<MatShape>& outLayerShapes) const; // FIXIT: CV_WRAP
 
-        /** @overload */
-        void getLayerShapes(const std::vector<MatShape>& netInputShapes,
+        /** @overload
+         *
+         * The only overload of getLayerShapes that should be kept in 5.x
+        */
+        CV_WRAP void getLayerShapes(const std::vector<MatShape>& netInputShapes,
                                     const std::vector<int>& netInputTypes,
                                     const int layerId,
                                     CV_OUT std::vector<MatShape>& inLayerShapes,
@@ -748,17 +883,19 @@ CV__DNN_INLINE_NS_BEGIN
          */
         CV_WRAP int64 getFLOPS(const std::vector<MatShape>& netInputShapes,
                                const std::vector<int>& netInputTypes) const;
+        /** @overload
+            These overloads should be deprecated
+        */
+        int64 getFLOPS(const MatShape& netInputShape,
+                       const int& netInputType) const;
         /** @overload */
-        CV_WRAP int64 getFLOPS(const MatShape& netInputShape,
-                               const int& netInputType) const;
-        /** @overload */
-        CV_WRAP int64 getFLOPS(const int layerId,
-                               const std::vector<MatShape>& netInputShapes,
-                               const std::vector<int>& netInputTypes) const;
+        int64 getFLOPS(const int layerId,
+                       const std::vector<MatShape>& netInputShapes,
+                       const std::vector<int>& netInputTypes) const;
         /** @overload */
-        CV_WRAP int64 getFLOPS(const int layerId,
-                               const MatShape& netInputShape,
-                               const int& netInputType) const;
+        int64 getFLOPS(const int layerId,
+                       const MatShape& netInputShape,
+                       const int& netInputType) const;
 
         /** @brief Returns list of types for layer used in model.
          * @param layersTypes output parameter for returning types.
@@ -778,20 +915,26 @@ CV__DNN_INLINE_NS_BEGIN
          * @param weights output parameter to store resulting bytes for weights.
          * @param blobs output parameter to store resulting bytes for intermediate blobs.
          */
-        void getMemoryConsumption(const std::vector<MatShape>& netInputShapes,
+        CV_WRAP void getMemoryConsumption(const std::vector<MatShape>& netInputShapes,
                                           const std::vector<int>& netInputTypes,
-                                          CV_OUT size_t& weights, CV_OUT size_t& blobs) const; // FIXIT: CV_WRAP
-        /** @overload */
-        CV_WRAP void getMemoryConsumption(const MatShape& netInputShape,
+                                          CV_OUT size_t& weights, CV_OUT size_t& blobs) const;
+        /** @overload
+            It should be deprecated
+        */
+        void getMemoryConsumption(const MatShape& netInputShape,
                                           const int& netInputType,
                                           CV_OUT size_t& weights, CV_OUT size_t& blobs) const;
-        /** @overload */
-        CV_WRAP void getMemoryConsumption(const int layerId,
+        /** @overload
+            It should be deprecated
+        */
+        void getMemoryConsumption(const int layerId,
                                           const std::vector<MatShape>& netInputShapes,
                                           const std::vector<int>& netInputTypes,
                                           CV_OUT size_t& weights, CV_OUT size_t& blobs) const;
-        /** @overload */
-        CV_WRAP void getMemoryConsumption(const int layerId,
+        /** @overload
+            It should be deprecated
+        */
+        void getMemoryConsumption(const int layerId,
                                           const MatShape& netInputShape,
                                           const int& netInputType,
                                           CV_OUT size_t& weights, CV_OUT size_t& blobs) const;
@@ -803,18 +946,23 @@ CV__DNN_INLINE_NS_BEGIN
          * @param layerIds output vector to save layer IDs.
          * @param weights output parameter to store resulting bytes for weights.
          * @param blobs output parameter to store resulting bytes for intermediate blobs.
-         */
+         *
+         * It should be deprecated
+        */
         void getMemoryConsumption(const std::vector<MatShape>& netInputShapes,
                                           const std::vector<int>& netInputTypes,
                                           CV_OUT std::vector<int>& layerIds,
                                           CV_OUT std::vector<size_t>& weights,
-                                          CV_OUT std::vector<size_t>& blobs) const; // FIXIT: CV_WRAP
-        /** @overload */
+                                          CV_OUT std::vector<size_t>& blobs) const;
+        /** @overload
+         *
+         *  It should be deprecated
+         */
         void getMemoryConsumption(const MatShape& netInputShape,
                                           const int& netInputType,
                                           CV_OUT std::vector<int>& layerIds,
                                           CV_OUT std::vector<size_t>& weights,
-                                          CV_OUT std::vector<size_t>& blobs) const; // FIXIT: CV_WRAP
+                                          CV_OUT std::vector<size_t>& blobs) const;
 
         /** @brief Enables or disables layer fusion in the network.
          * @param fusion true to enable the fusion, false to disable. The fusion is enabled by default.
@@ -837,6 +985,28 @@ CV__DNN_INLINE_NS_BEGIN
          */
         CV_WRAP int64 getPerfProfile(CV_OUT std::vector<double>& timings);
 
+        // Get the main model graph
+        Ptr<Graph> getMainGraph() const;
+
+        const ArgData& argData(Arg arg) const;
+        const std::string& argName(Arg arg) const;
+        ArgKind argKind(Arg arg) const;
+
+        // if the name is empty, always creates a new argument;
+        // if it's not empty, returns argument with the specific name if it already exists,
+        // otherwise creates new argument with the specified name
+        Arg getArg(const std::string& name);
+        bool haveArg(const std::string& name) const;
+
+        bool isConstArg(Arg arg) const;
+        Mat& argTensor(Arg arg) const;
+        int argType(Arg arg) const;
+
+        int findDim(const std::string& name, bool insert=false);
+
+        std::ostream& dumpArg(std::ostream& strm, Arg arg, int indent,
+                              bool comma=true, bool dump_details=false) const;
+        std::ostream& dumpDim(std::ostream& strm, int value) const;
 
         struct Impl;
         inline Impl* getImpl() const { return impl.get(); }
@@ -846,6 +1016,13 @@ CV__DNN_INLINE_NS_BEGIN
         Ptr<Impl> impl;
     };
 
+    enum EngineType
+    {
+        ENGINE_CLASSIC=1, //!< Force use the new dnn engine. The engine does not support non CPU back-ends for now.
+        ENGINE_NEW=2,     //!< Force use the old dnn engine similar to 4.x branch
+        ENGINE_AUTO=3     //!< Try to use the new engine and then fall back to the classic version.
+    };
+
     /** @brief Reads a network model stored in <a href="https://pjreddie.com/darknet/">Darknet</a> model files.
     *  @param cfgFile      path to the .cfg file with text description of the network architecture.
     *  @param darknetModel path to the .weights file with learned network.
@@ -962,6 +1139,9 @@ CV__DNN_INLINE_NS_BEGIN
       *                  * `*.cfg` (Darknet, https://pjreddie.com/darknet/)
       *                  * `*.xml` (OpenVINO, https://software.intel.com/openvino-toolkit)
       * @param[in] framework Explicit framework name tag to determine a format.
+      * @param[in] engine select DNN engine to be used. With auto selection the new engine is used first and falls back to classic.
+      * Please pay attention that the new DNN does not support non-CPU back-ends for now.
+      * Use ENGINE_CLASSIC if you want to use other back-ends.
       * @returns Net object.
       *
       * This function automatically detects an origin framework of trained model
@@ -969,7 +1149,10 @@ CV__DNN_INLINE_NS_BEGIN
       * or @ref readNetFromDarknet. An order of @p model and @p config
       * arguments does not matter.
       */
-     CV_EXPORTS_W Net readNet(CV_WRAP_FILE_PATH const String& model, CV_WRAP_FILE_PATH const String& config = "", const String& framework = "");
+     CV_EXPORTS_W Net readNet(CV_WRAP_FILE_PATH const String& model,
+                              CV_WRAP_FILE_PATH const String& config = "",
+                              const String& framework = "",
+                              int engine = ENGINE_AUTO);
 
      /**
       * @brief Read deep learning network represented in one of the supported formats.
@@ -978,10 +1161,14 @@ CV__DNN_INLINE_NS_BEGIN
       * @param[in] framework    Name of origin framework.
       * @param[in] bufferModel  A buffer with a content of binary file with weights
       * @param[in] bufferConfig A buffer with a content of text file contains network configuration.
+      * @param engine select DNN engine to be used. With auto selection the new engine is used first and falls back to classic.
+      * Please pay attention that the new DNN does not support non-CPU back-ends for now.
+      * Use ENGINE_CLASSIC if you want to use other back-ends.
       * @returns Net object.
       */
      CV_EXPORTS_W Net readNet(const String& framework, const std::vector<uchar>& bufferModel,
-                              const std::vector<uchar>& bufferConfig = std::vector<uchar>());
+                              const std::vector<uchar>& bufferConfig = std::vector<uchar>(),
+                              int engine = ENGINE_AUTO);
 
     /** @brief Load a network from Intel's Model Optimizer intermediate representation.
      *  @param[in] xml XML configuration file with network's topology.
@@ -1016,28 +1203,34 @@ CV__DNN_INLINE_NS_BEGIN
     Net readNetFromModelOptimizer(const uchar* bufferModelConfigPtr, size_t bufferModelConfigSize,
                                            const uchar* bufferWeightsPtr, size_t bufferWeightsSize);
 
+
     /** @brief Reads a network model <a href="https://onnx.ai/">ONNX</a>.
      *  @param onnxFile path to the .onnx file with text description of the network architecture.
+     *  @param engine select DNN engine to be used. With auto selection the new engine is used first and falls back to classic.
+     *  Please pay attention that the new DNN does not support non-CPU back-ends for now.
      *  @returns Network object that ready to do forward, throw an exception in failure cases.
      */
-    CV_EXPORTS_W Net readNetFromONNX(CV_WRAP_FILE_PATH const String &onnxFile);
+    CV_EXPORTS_W Net readNetFromONNX(CV_WRAP_FILE_PATH const String &onnxFile, int engine=ENGINE_AUTO);
 
     /** @brief Reads a network model from <a href="https://onnx.ai/">ONNX</a>
      *         in-memory buffer.
      *  @param buffer memory address of the first byte of the buffer.
      *  @param sizeBuffer size of the buffer.
+     *  @param engine select DNN engine to be used. With auto selection the new engine is used first and falls back to classic.
      *  @returns Network object that ready to do forward, throw an exception
      *        in failure cases.
      */
-    CV_EXPORTS Net readNetFromONNX(const char* buffer, size_t sizeBuffer);
+    CV_EXPORTS Net readNetFromONNX(const char* buffer, size_t sizeBuffer, int engine=ENGINE_AUTO);
 
     /** @brief Reads a network model from <a href="https://onnx.ai/">ONNX</a>
      *         in-memory buffer.
      *  @param buffer in-memory buffer that stores the ONNX model bytes.
+     *  @param engine select DNN engine to be used. With auto selection the new engine is used first and falls back to classic.
+     *  Please pay attention that the new DNN does not support non-CPU back-ends for now.
      *  @returns Network object that ready to do forward, throw an exception
      *        in failure cases.
      */
-    CV_EXPORTS_W Net readNetFromONNX(const std::vector<uchar>& buffer);
+    CV_EXPORTS_W Net readNetFromONNX(const std::vector<uchar>& buffer, int engine=ENGINE_AUTO);
 
     /** @brief Creates blob from .pb file.
      *  @param path to the .pb file with input tensor.
diff --git a/modules/dnn/include/opencv2/dnn/dnn.inl.hpp b/modules/dnn/include/opencv2/dnn/dnn.inl.hpp
index 8312a418f3..bc6b7c7833 100644
--- a/modules/dnn/include/opencv2/dnn/dnn.inl.hpp
+++ b/modules/dnn/include/opencv2/dnn/dnn.inl.hpp
@@ -116,12 +116,24 @@ inline int64 DictValue::get<int64>(int idx) const
 template<>
 inline int DictValue::get<int>(int idx) const
 {
-    return (int)get<int64>(idx);
+    return saturate_cast<int>(get<int64>(idx));
 }
 
 inline int DictValue::getIntValue(int idx) const
 {
-    return (int)get<int64>(idx);
+    return saturate_cast<int>(get<int64>(idx));
+}
+
+template<>
+inline std::vector<int> DictValue::get<std::vector<int> >(int idx) const
+{
+    CV_Assert(idx == -1);
+    int size_ = size();
+    std::vector<int> values(size_);
+
+    for (int i = 0; i < size_; i++)
+        values[i] = get<int>(i);
+    return values;
 }
 
 template<>
@@ -368,6 +380,17 @@ inline T Dict::get(const String &key, const T &defaultValue) const
         return defaultValue;
 }
 
+template <typename T>
+inline std::vector<T> Dict::getVector(const String &key) const
+{
+    _Dict::const_iterator i = dict.find(key);
+
+    if (i != dict.end())
+        return i->second.get<std::vector<T> >();
+    else
+        return std::vector<T>();
+}
+
 template<typename T>
 inline const T &Dict::set(const String &key, const T &value)
 {
@@ -405,6 +428,18 @@ inline std::map<String, DictValue>::const_iterator Dict::end() const
     return dict.end();
 }
 
+/////////////////////////////////////////////////////////////////
+
+inline Arg::Arg() : idx(0) {}
+
+inline Arg::Arg(int idx_) : idx(idx_) {}
+
+inline bool Arg::empty() const { return idx == 0; }
+
+inline Arg::operator bool() const { return idx != 0; }
+
+inline bool operator == (const Arg& a, const Arg& b) { return a.idx == b.idx; }
+
 CV__DNN_INLINE_NS_END
 }
 }
diff --git a/modules/dnn/include/opencv2/dnn/shape_utils.hpp b/modules/dnn/include/opencv2/dnn/shape_utils.hpp
index 660a87743f..ec5914ce3b 100644
--- a/modules/dnn/include/opencv2/dnn/shape_utils.hpp
+++ b/modules/dnn/include/opencv2/dnn/shape_utils.hpp
@@ -125,17 +125,12 @@ static inline MatShape shape(const int* dims, const int n)
 
 static inline MatShape shape(const Mat& mat)
 {
-    return shape(mat.size.p, mat.dims);
-}
-
-static inline MatShape shape(const MatSize& sz)
-{
-    return shape(sz.p, sz.dims());
+    return mat.shape();
 }
 
 static inline MatShape shape(const UMat& mat)
 {
-    return shape(mat.size.p, mat.dims);
+    return mat.shape();
 }
 
 #if 0  // issues with MatExpr wrapped into InputArray
@@ -152,10 +147,9 @@ namespace {inline bool is_neg(int i) { return i < 0; }}
 
 static inline MatShape shape(int a0, int a1=-1, int a2=-1, int a3=-1)
 {
-    int dims[] = {a0, a1, a2, a3};
-    MatShape s = shape(dims, 4);
-    s.erase(std::remove_if(s.begin(), s.end(), is_neg), s.end());
-    return s;
+    int shape_[] = {a0, a1, a2, a3};
+    int dims = 1 + (a1 >= 0) + (a1 >= 0 && a2 >= 0) + (a1 >= 0 && a2 >= 0 && a3 >= 0);
+    return shape(shape_, dims);
 }
 
 static inline int total(const MatShape& shape, int start = -1, int end = -1)
@@ -206,11 +200,41 @@ static inline int total(const Mat& mat, int start = -1, int end = -1)
 static inline MatShape concat(const MatShape& a, const MatShape& b)
 {
     MatShape c = a;
-    c.insert(c.end(), b.begin(), b.end());
-
+    size_t a_size = a.size(), b_size = b.size(), c_size = a_size + b_size;
+    c.resize(c_size);
+    for (size_t i = 0; i < b_size; i++) {
+        c[i+a_size] = b[i];
+    }
     return c;
 }
 
+static inline std::ostream& operator << (std::ostream& strm, const MatShape& shape)
+{
+    strm << '[';
+    if (shape.empty()) {
+        strm << "<empty>";
+    } else {
+        size_t n = shape.size();
+        if (n == 0) {
+            strm << "<scalar>";
+        } else {
+            for(size_t i = 0; i < n; ++i)
+                strm << (i > 0 ? " x " : "") << shape[i];
+        }
+    }
+    strm << "]";
+    return strm;
+}
+
+static inline std::string toString(const MatShape& shape, const String& name = "")
+{
+    std::ostringstream ss;
+    if (!name.empty())
+        ss << name << ' ';
+    ss << shape;
+    return ss.str();
+}
+
 template<typename _Tp>
 static inline std::string toString(const std::vector<_Tp>& shape, const String& name = "")
 {
@@ -269,14 +293,11 @@ Range normalize_axis_range(const Range& r, int axisSize)
 static inline
 bool isAllOnes(const MatShape &inputShape, int startPos, int endPos)
 {
-    CV_Assert(!inputShape.empty());
-
-    CV_CheckGE((int) inputShape.size(), startPos, "");
     CV_CheckGE(startPos, 0, "");
     CV_CheckLE(startPos, endPos, "");
-    CV_CheckLE((size_t)endPos, inputShape.size(), "");
+    CV_CheckLE(endPos, inputShape.dims, "");
 
-    for (size_t i = startPos; i < endPos; i++)
+    for (int i = startPos; i < endPos; i++)
     {
         if (inputShape[i] != 1)
             return false;
diff --git a/modules/dnn/misc/java/src/cpp/dnn_converters.cpp b/modules/dnn/misc/java/src/cpp/dnn_converters.cpp
index 95184c0e90..f03adb4d87 100644
--- a/modules/dnn/misc/java/src/cpp/dnn_converters.cpp
+++ b/modules/dnn/misc/java/src/cpp/dnn_converters.cpp
@@ -8,19 +8,19 @@
 
 #define LOG_TAG "org.opencv.dnn"
 
-void Mat_to_MatShape(cv::Mat& mat, MatShape& matshape)
+void Mat_to_MatShape(cv::Mat& mat, cv::MatShape& matshape)
 {
     matshape.clear();
     CHECK_MAT(mat.type()==CV_32SC1 && mat.cols==1);
-    matshape = (MatShape) mat;
+    matshape = (cv::MatShape) mat;
 }
 
-void MatShape_to_Mat(MatShape& matshape, cv::Mat& mat)
+void MatShape_to_Mat(cv::MatShape& matshape, cv::Mat& mat)
 {
     mat = cv::Mat(matshape, true);
 }
 
-std::vector<MatShape> List_to_vector_MatShape(JNIEnv* env, jobject list)
+std::vector<cv::MatShape> List_to_vector_MatShape(JNIEnv* env, jobject list)
 {
     static jclass juArrayList       = ARRAYLIST(env);
     jmethodID m_size       = LIST_SIZE(env, juArrayList);
@@ -29,13 +29,13 @@ std::vector<MatShape> List_to_vector_MatShape(JNIEnv* env, jobject list)
     static jclass jMatOfInt = MATOFINT(env);
 
     jint len = env->CallIntMethod(list, m_size);
-    std::vector<MatShape> result;
+    std::vector<cv::MatShape> result;
     result.reserve(len);
     for (jint i=0; i<len; i++)
     {
         jobject element = static_cast<jobject>(env->CallObjectMethod(list, m_get, i));
         cv::Mat& mat = *((cv::Mat*) GETNATIVEOBJ(env, jMatOfInt, element) );
-        MatShape matshape = (MatShape) mat;
+        cv::MatShape matshape = (cv::MatShape) mat;
         result.push_back(matshape);
         env->DeleteLocalRef(element);
     }
diff --git a/modules/dnn/misc/java/src/cpp/dnn_converters.hpp b/modules/dnn/misc/java/src/cpp/dnn_converters.hpp
index e1f63e0a00..d4757911d7 100644
--- a/modules/dnn/misc/java/src/cpp/dnn_converters.hpp
+++ b/modules/dnn/misc/java/src/cpp/dnn_converters.hpp
@@ -15,14 +15,13 @@
 #define LAYER(ENV) static_cast<jclass>(ENV->NewGlobalRef(ENV->FindClass("org/opencv/dnn/Layer")))
 #define LAYER_CONSTRUCTOR(ENV, CLS) ENV->GetMethodID(CLS, "<init>", "(J)V")
 
-
 using namespace cv::dnn;
 
-void Mat_to_MatShape(cv::Mat& mat, MatShape& matshape);
+void Mat_to_MatShape(cv::Mat& mat, cv::MatShape& matshape);
 
-void MatShape_to_Mat(MatShape& matshape, cv::Mat& mat);
+void MatShape_to_Mat(cv::MatShape& matshape, cv::Mat& mat);
 
-std::vector<MatShape> List_to_vector_MatShape(JNIEnv* env, jobject list);
+std::vector<cv::MatShape> List_to_vector_MatShape(JNIEnv* env, jobject list);
 
 jobject vector_Ptr_Layer_to_List(JNIEnv* env, std::vector<cv::Ptr<cv::dnn::Layer> >& vs);
 
diff --git a/modules/dnn/misc/java/test/DnnListRegressionTest.java b/modules/dnn/misc/java/test/DnnListRegressionTest.java
index d30c9fcc86..21c7dfb338 100644
--- a/modules/dnn/misc/java/test/DnnListRegressionTest.java
+++ b/modules/dnn/misc/java/test/DnnListRegressionTest.java
@@ -94,26 +94,24 @@ public class DnnListRegressionTest extends OpenCVTestCase {
     }
 
     public void testGetMemoryConsumption() {
-        int layerId = 1;
         List<MatOfInt> netInputShapes = new ArrayList();
         netInputShapes.add(new MatOfInt(1, 3, 224, 224));
         MatOfInt netInputTypes = new MatOfInt(5);
         long[] weights=null;
         long[] blobs=null;
         try {
-            net.getMemoryConsumption(layerId, netInputShapes, netInputTypes, weights, blobs);
+            net.getMemoryConsumption(netInputShapes, netInputTypes, weights, blobs);
         } catch(Exception e) {
             fail("Net getMemoryConsumption failed: " + e.getMessage());
         }
     }
 
     public void testGetFLOPS() {
-        int layerId = 1;
         List<MatOfInt> netInputShapes = new ArrayList();
         netInputShapes.add(new MatOfInt(1, 3, 224, 224));
         MatOfInt netInputTypes = new MatOfInt(5);
         try {
-            net.getFLOPS(layerId, netInputShapes, netInputTypes);
+            net.getFLOPS(netInputShapes, netInputTypes);
         } catch(Exception e) {
             fail("Net getFLOPS failed: " + e.getMessage());
         }
diff --git a/modules/dnn/misc/objc/gen_dict.json b/modules/dnn/misc/objc/gen_dict.json
index 8aab0a5500..e45023e9d8 100644
--- a/modules/dnn/misc/objc/gen_dict.json
+++ b/modules/dnn/misc/objc/gen_dict.json
@@ -5,8 +5,8 @@
             "(Net*)readNetFromCaffe:(ByteVector*)bufferProto bufferModel:(ByteVector*)bufferModel" : { "readNetFromCaffe" : {"name" : "readNetFromCaffeBuffer"} },
             "(Net*)readNetFromDarknet:(NSString*)cfgFile darknetModel:(NSString*)darknetModel" : { "readNetFromDarknet" : {"name" : "readNetFromDarknetFile"} },
             "(Net*)readNetFromDarknet:(ByteVector*)bufferCfg bufferModel:(ByteVector*)bufferModel" : { "readNetFromDarknet" : {"name" : "readNetFromDarknetBuffer"} },
-            "(Net*)readNetFromONNX:(NSString*)onnxFile" : { "readNetFromONNX" : {"name" : "readNetFromONNXFile"} },
-            "(Net*)readNetFromONNX:(ByteVector*)buffer" : { "readNetFromONNX" : {"name" : "readNetFromONNXBuffer"} },
+            "(Net*)readNetFromONNX:(NSString*)onnxFile engine:(int)engine" : { "readNetFromONNX" : {"name" : "readNetFromONNXFile"} },
+            "(Net*)readNetFromONNX:(ByteVector*)buffer engine:(int)engine" : { "readNetFromONNX" : {"name" : "readNetFromONNXBuffer"} },
             "(Net*)readNetFromTensorflow:(NSString*)model config:(NSString*)config" : { "readNetFromTensorflow" : {"name" : "readNetFromTensorflowFile"} },
             "(Net*)readNetFromTensorflow:(ByteVector*)bufferModel bufferConfig:(ByteVector*)bufferConfig" : { "readNetFromTensorflow" : {"name" : "readNetFromTensorflowBuffer"} },
             "(Net*)readNetFromTFLite:(NSString*)model" : { "readNetFromTFLite" : {"name" : "readNetFromTFLiteFile"} },
@@ -16,14 +16,8 @@
             "(void)forward:(NSMutableArray<Mat*>*)outputBlobs outputName:(NSString*)outputName" : { "forward" : {"name" : "forwardOutputBlobs"} },
             "(void)forward:(NSMutableArray<Mat*>*)outputBlobs outBlobNames:(NSArray<NSString*>*)outBlobNames" : { "forward" : {"name" : "forwardOutputBlobs"} },
             "(void)forwardAndRetrieve:(NSMutableArray<NSMutableArray<Mat*>*>*)outputBlobs outBlobNames:(NSArray<NSString*>*)outBlobNames" : { "forward" : {"swift_name" : "forwardAndRetrieve"} },
-            "(long)getFLOPS:(IntVector*)netInputShape" : { "getFLOPS" : {"name" : "getFLOPSWithNetInputShape"} },
-            "(long)getFLOPS:(NSArray<IntVector*>*)netInputShapes" : { "getFLOPS" : {"name" : "getFLOPSWithNetInputShapes"} },
-            "(long)getFLOPS:(int)layerId netInputShape:(IntVector*)netInputShape" : { "getFLOPS" : {"name" : "getFLOPSWithLayerId"} },
-            "(long)getFLOPS:(int)layerId netInputShapes:(NSArray<IntVector*>*)netInputShapes" : { "getFLOPS" : {"name" : "getFLOPSWithLayerId"} },
             "(Layer*)getLayer:(NSString*)layerName" : { "getLayer" : {"name" : "getLayerByName"} },
             "(Layer*)getLayer:(DictValue*)layerId" : { "getLayer" : {"name" : "getLayerByDictValue"} },
-            "(void)getLayersShapes:(IntVector*)netInputShape layersIds:(IntVector*)layersIds inLayersShapes:(NSMutableArray<NSMutableArray<IntVector*>*>*)inLayersShapes outLayersShapes:(NSMutableArray<NSMutableArray<IntVector*>*>*)outLayersShapes" : { "getLayersShapes" : {"name" : "getLayersShapesWithNetInputShape"} },
-            "(void)getLayersShapes:(NSArray<IntVector*>*)netInputShapes layersIds:(IntVector*)layersIds inLayersShapes:(NSMutableArray<NSMutableArray<IntVector*>*>*)inLayersShapes outLayersShapes:(NSMutableArray<NSMutableArray<IntVector*>*>*)outLayersShapes" : { "getLayersShapes" : {"name" : "getLayersShapesWithNetInputShapes"} },
             "(Mat*)getParam:(NSString*)layerName numParam:(int)numParam" : { "getParam" : {"name" : "getParamByName"} },
             "(void)setParam:(NSString*)layerName numParam:(int)numParam blob:(Mat*)blob" : { "setParam" : {"name" : "setParamByName"} }
         }
@@ -31,17 +25,20 @@
     "type_dict": {
         "MatShape": {
             "objc_type": "IntVector*",
-            "to_cpp": "%(n)s.nativeRef",
-            "from_cpp": "[IntVector fromNative:%(n)s]",
-            "cast_to": "std::vector<int>"
+            "to_cpp": "cv::MatShape(%(n)s.nativeRef)",
+            "from_cpp": "[IntVector fromNative:(std::vector<int>)%(n)s]"
         },
         "vector_MatShape": {
             "objc_type": "IntVector*",
-            "v_type": "IntVector"
+            "to_cpp": "cv::MatShape(%(n)s.nativeRef)",
+            "from_cpp": "[IntVector fromNative:(std::vector<int>)%(n)s]",
+            "v_type": "MatShape"
         },
         "vector_vector_MatShape": {
             "objc_type": "IntVector*",
-            "v_v_type": "IntVector"
+            "to_cpp": "cv::MatShape(%(n)s.nativeRef)",
+            "from_cpp": "[IntVector fromNative:(std::vector<int>)%(n)s]",
+            "v_v_type": "MatShape"
         },
         "LayerId": {
             "objc_type": "DictValue*",
diff --git a/modules/dnn/misc/python/pyopencv_dnn.hpp b/modules/dnn/misc/python/pyopencv_dnn.hpp
index 0fc0c45e82..0cc2fe63cc 100644
--- a/modules/dnn/misc/python/pyopencv_dnn.hpp
+++ b/modules/dnn/misc/python/pyopencv_dnn.hpp
@@ -1,7 +1,7 @@
 #ifdef HAVE_OPENCV_DNN
 typedef dnn::DictValue LayerId;
-typedef std::vector<dnn::MatShape> vector_MatShape;
-typedef std::vector<std::vector<dnn::MatShape> > vector_vector_MatShape;
+typedef std::vector<MatShape> vector_MatShape;
+typedef std::vector<std::vector<MatShape> > vector_vector_MatShape;
 
 template<>
 bool pyopencv_to(PyObject *o, dnn::DictValue &dv, const ArgInfo& info)
@@ -143,37 +143,16 @@ public:
         return Ptr<dnn::Layer>(new pycvLayer(params, it->second.back()));
     }
 
-    virtual bool getMemoryShapes(const std::vector<std::vector<int> > &inputs,
-                                 const int,
-                                 std::vector<std::vector<int> > &outputs,
-                                 std::vector<std::vector<int> > &) const CV_OVERRIDE
-    {
-        PyGILState_STATE gstate;
-        gstate = PyGILState_Ensure();
-
-        PyObject* args = PyList_New(inputs.size());
-        for(size_t i = 0; i < inputs.size(); ++i)
-            PyList_SetItem(args, i, pyopencv_from_generic_vec(inputs[i]));
-
-        PyObject* res = PyObject_CallMethodObjArgs(o, PyString_FromString("getMemoryShapes"), args, NULL);
-        Py_DECREF(args);
-        PyGILState_Release(gstate);
-        if (!res)
-            CV_Error(Error::StsNotImplemented, "Failed to call \"getMemoryShapes\" method");
-        CV_Assert(pyopencv_to_generic_vec(res, outputs, ArgInfo("", 0)));
-        return false;
-    }
-
     virtual void forward(InputArrayOfArrays inputs_arr, OutputArrayOfArrays outputs_arr, OutputArrayOfArrays) CV_OVERRIDE
     {
         PyGILState_STATE gstate;
         gstate = PyGILState_Ensure();
 
-        std::vector<Mat> inputs, outputs;
-        inputs_arr.getMatVector(inputs);
-        outputs_arr.getMatVector(outputs);
+        std::vector<Mat> ins, outs;
+        inputs_arr.getMatVector(ins);
+        outputs_arr.getMatVector(outs);
 
-        PyObject* args = pyopencv_from(inputs);
+        PyObject* args = pyopencv_from(ins);
         PyObject* res = PyObject_CallMethodObjArgs(o, PyString_FromString("forward"), args, NULL);
         Py_DECREF(args);
         if (!res)
@@ -184,12 +163,12 @@ public:
         Py_DECREF(res);
         PyGILState_Release(gstate);
 
-        CV_Assert(pyOutputs.size() == outputs.size());
-        for (size_t i = 0; i < outputs.size(); ++i)
+        CV_Assert(pyOutputs.size() == outs.size());
+        for (size_t i = 0; i < outs.size(); ++i)
         {
-            CV_Assert(pyOutputs[i].size == outputs[i].size);
-            CV_Assert(pyOutputs[i].type() == outputs[i].type());
-            pyOutputs[i].copyTo(outputs[i]);
+            CV_Assert(pyOutputs[i].size == outs[i].size);
+            CV_Assert(pyOutputs[i].type() == outs[i].type());
+            pyOutputs[i].copyTo(outs[i]);
         }
     }
 
diff --git a/modules/dnn/misc/python/test/test_dnn.py b/modules/dnn/misc/python/test/test_dnn.py
old mode 100644
new mode 100755
index af67c1ffbf..d55b810e9b
--- a/modules/dnn/misc/python/test/test_dnn.py
+++ b/modules/dnn/misc/python/test/test_dnn.py
@@ -134,7 +134,7 @@ class dnn_test(NewOpenCVTests):
         paramNet.mean = [0.485, 0.456, 0.406]
         paramNet.scalefactor = [0.229, 0.224, 0.225]
         paramNet.swapRB = False
-        paramNet.datalayout = cv.dnn.DNN_LAYOUT_NCHW
+        paramNet.datalayout = cv.DATA_LAYOUT_NCHW
         paramNet.paddingmode = cv.dnn.DNN_PMODE_LETTERBOX
         rBlob = np.zeros(shape=(20, 4), dtype=np.int32)
         rImg = paramNet.blobRectsToImageRects(rBlob, (356, 356))
@@ -148,7 +148,7 @@ class dnn_test(NewOpenCVTests):
         paramNet.mean = [0.485, 0.456, 0.406]
         paramNet.scalefactor = [0.229, 0.224, 0.225]
         paramNet.swapRB = False
-        paramNet.datalayout = cv.dnn.DNN_LAYOUT_NCHW
+        paramNet.datalayout = cv.DATA_LAYOUT_NCHW
         paramNet.paddingmode = cv.dnn.DNN_PMODE_LETTERBOX
         rBlob = np.zeros(shape=(20, 4), dtype=np.int32)
         rImg = paramNet.blobRectToImageRect((0, 0, 0, 0), (356, 356))
@@ -198,11 +198,11 @@ class dnn_test(NewOpenCVTests):
         param.size = (6, 7)
         param.mean = mean
         param.swapRB=True
-        param.datalayout = cv.dnn.DNN_LAYOUT_NHWC
+        param.datalayout = cv.DATA_LAYOUT_NHWC
 
         blob = cv.dnn.blobFromImageWithParams(img, param)
         blob_args = cv.dnn.blobFromImageWithParams(img, cv.dnn.Image2BlobParams(scalefactor=scalefactor, size=(6, 7), mean=mean,
-                                                                      swapRB=True, datalayout=cv.dnn.DNN_LAYOUT_NHWC))
+                                                                      swapRB=True, datalayout=cv.DATA_LAYOUT_NHWC))
         normAssert(self, blob, blob_args)
 
         target2 = cv.resize(img, (width, height), interpolation=cv.INTER_LINEAR).astype(np.float32)
@@ -374,6 +374,8 @@ class dnn_test(NewOpenCVTests):
 
         self.assertTrue(all(cv.dnn.NMSBoxes(rects, confs, 0, 0.6).ravel() == (0, 1)))
 
+    # BUG: https://github.com/opencv/opencv/issues/26200
+    @unittest.skip("custom layers are partially broken with transition to the new dnn engine")
     def test_custom_layer(self):
         class CropLayer(object):
             def __init__(self, params, blobs):
@@ -510,7 +512,7 @@ class dnn_test(NewOpenCVTests):
         for backend, target in self.dnnBackendsAndTargets:
             printParams(backend, target)
 
-            net = cv.dnn.readNet(model_path)
+            net = cv.dnn.readNet(model_path, "", "", engine=cv.dnn.ENGINE_CLASSIC)
 
             node_name = net.getLayerNames()[0]
             w = net.getParam(node_name, 0) # returns the original tensor of three-dimensional shape
diff --git a/modules/dnn/perf/perf_einsum.cpp b/modules/dnn/perf/perf_einsum.cpp
index bad9d956be..5c8bbb204f 100644
--- a/modules/dnn/perf/perf_einsum.cpp
+++ b/modules/dnn/perf/perf_einsum.cpp
@@ -10,8 +10,8 @@ struct EinsumParams {
     int inputSize;
     int outputSize;
     std::string equation;
-    std::vector<MatShape> einsumInpShapes;
-    EinsumParams(std::string equation_, std::vector<MatShape> einsumInpShapes_ = std::vector<MatShape>())
+    std::vector<std::vector<int> > einsumInpShapes;
+    EinsumParams(std::string equation_, std::vector<std::vector<int> > einsumInpShapes_ = std::vector<std::vector<int> >())
     {
         inputSize = einsumInpShapes_.size();
         equation = equation_;
@@ -80,7 +80,7 @@ PERF_TEST_P_(Layer_Einsum, einsum) {
 
     for (int i = 0; i < params.inputSize; ++i) {
         // create inputs
-        inputs.emplace_back(Mat(params.einsumInpShapes[i].size(), params.einsumInpShapes[i].data(), CV_32FC1));
+        inputs.emplace_back(Mat(params.einsumInpShapes[i], CV_32FC1));
 
         // connect each input to the layer
         net.connect(0, i, id, i);
diff --git a/modules/dnn/perf/perf_net.cpp b/modules/dnn/perf/perf_net.cpp
index 1adb8a16fb..be489e627f 100644
--- a/modules/dnn/perf/perf_net.cpp
+++ b/modules/dnn/perf/perf_net.cpp
@@ -65,7 +65,9 @@ public:
         size_t weightsMemory = 0, blobsMemory = 0;
         net.getMemoryConsumption(netMatShapes, netMatTypes, weightsMemory, blobsMemory);
         int64 flops = net.getFLOPS(netMatShapes, netMatTypes);
-        CV_Assert(flops > 0);
+        // [TODO] implement getFLOPS in the new engine
+        // Issue: https://github.com/opencv/opencv/issues/26199
+        CV_Assert(flops > 0 || net.getMainGraph());
         std::cout << "Memory consumption:" << std::endl;
         std::cout << "    Weights(parameters): " << divUp(weightsMemory, 1u<<20) << " Mb" << std::endl;
         std::cout << "    Blobs: " << divUp(blobsMemory, 1u<<20) << " Mb" << std::endl;
diff --git a/modules/dnn/src/cuda4dnn/primitives/depth_space_ops.hpp b/modules/dnn/src/cuda4dnn/primitives/depth_space_ops.hpp
index 7846881e70..92a2e0ed68 100644
--- a/modules/dnn/src/cuda4dnn/primitives/depth_space_ops.hpp
+++ b/modules/dnn/src/cuda4dnn/primitives/depth_space_ops.hpp
@@ -22,17 +22,19 @@ namespace cv { namespace dnn { namespace cuda4dnn {
     public:
         using wrapper_type = GetCUDABackendWrapperType<T>;
 
-        DepthSpaceOps(csl::Stream stream_, const std::vector<int> &internal_shape_,
+        DepthSpaceOps(csl::Stream stream_, const MatShape& internal_shape_,
                      const std::vector<size_t> &permutation_)
             : stream(std::move(stream_)), internal_shape(internal_shape_),
               permutation(permutation_)
         {
-            transposed_internal_shape = std::vector<int>(internal_shape.size());
-            for (size_t i = 0; i < permutation.size(); i++) {
-                transposed_internal_shape[i] = internal_shape[permutation[i]];
+            int dims = internal_shape.dims;
+            int nperm = (int)permutation_.size();
+            transposed_internal_shape = MatShape(dims);
+            for (int i = 0; i < nperm; i++) {
+                transposed_internal_shape[i] = internal_shape[(int)permutation[i]];
             }
 
-            size_t num_elements = std::accumulate(internal_shape.begin(), internal_shape.end(), 1, std::multiplies<size_t>());
+            size_t num_elements = internal_shape.total();
             csl::WorkspaceBuilder builder;
             builder.require<T>(num_elements);
             scratch_mem_in_bytes = builder.required_workspace_size();
@@ -64,9 +66,9 @@ namespace cv { namespace dnn { namespace cuda4dnn {
 
     private:
         csl::Stream stream;
-        std::vector<int> internal_shape;
+        MatShape internal_shape;
         std::vector<size_t> permutation;
-        std::vector<int> transposed_internal_shape;
+        MatShape transposed_internal_shape;
 
         std::size_t scratch_mem_in_bytes;
     };
diff --git a/modules/dnn/src/dnn_common.hpp b/modules/dnn/src/dnn_common.hpp
index 83709443ad..13f124d7e3 100644
--- a/modules/dnn/src/dnn_common.hpp
+++ b/modules/dnn/src/dnn_common.hpp
@@ -172,6 +172,11 @@ static inline Scalar_<double> broadcastRealScalar(const Scalar_<double>& _scale)
     return scale;
 }
 
+static inline void prindent(std::ostream& strm, int indent)
+{
+    for (int i = 0; i < indent; i++)
+        strm << ' ';
+}
 
 CV__DNN_INLINE_NS_END
 
diff --git a/modules/dnn/src/dnn_read.cpp b/modules/dnn/src/dnn_read.cpp
index bb9fedb29a..ed42d57942 100644
--- a/modules/dnn/src/dnn_read.cpp
+++ b/modules/dnn/src/dnn_read.cpp
@@ -10,7 +10,7 @@ namespace dnn {
 CV__DNN_INLINE_NS_BEGIN
 
 
-Net readNet(const String& _model, const String& _config, const String& _framework)
+Net readNet(const String& _model, const String& _config, const String& _framework, int engine)
 {
     String framework = toLowerCase(_framework);
     String model = _model;
@@ -49,17 +49,17 @@ Net readNet(const String& _model, const String& _config, const String& _framewor
     }
     if (framework == "onnx" || modelExt == "onnx")
     {
-        return readNetFromONNX(model);
+        return readNetFromONNX(model, engine);
     }
     CV_Error(Error::StsError, "Cannot determine an origin framework of files: " + model + (config.empty() ? "" : ", " + config));
 }
 
 Net readNet(const String& _framework, const std::vector<uchar>& bufferModel,
-        const std::vector<uchar>& bufferConfig)
+        const std::vector<uchar>& bufferConfig, int engine)
 {
     String framework = toLowerCase(_framework);
     if (framework == "onnx")
-        return readNetFromONNX(bufferModel);
+        return readNetFromONNX(bufferModel, engine);
     else if (framework == "caffe")
         return readNetFromCaffe(bufferConfig, bufferModel);
     else if (framework == "tensorflow")
diff --git a/modules/dnn/src/graph_buffer_allocator.cpp b/modules/dnn/src/graph_buffer_allocator.cpp
new file mode 100644
index 0000000000..1d10b1034c
--- /dev/null
+++ b/modules/dnn/src/graph_buffer_allocator.cpp
@@ -0,0 +1,336 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "precomp.hpp"
+#include "net_impl.hpp"
+
+namespace cv { namespace dnn {
+CV__DNN_INLINE_NS_BEGIN
+
+using std::vector;
+using std::string;
+
+/* Assigns buffers for all intermediate tensors of the graph/model
+
+ The algorithm is quite simple, but there are some nuances in the attempt to re-use memory more efficiently:
+
+ All layer arguments in graph and sub-graphs are classified into 4 categories:
+ a) inputs, b) outputs, c) constants and d) temporary values/tensors.
+
+ Except for the temporary values ("d" category), each other argument gets
+ its own dedicated storage, which makes things more clear and predictable.
+ So, this algorithm assigns buffers only for the temporary values.
+
+ During the inference process, each temporary value is computed
+ by one of the layers and then used by zero or more subsequent layers (only as input).
+ An example of a model where some tensors are used more than once is Resnet.
+ After a tensor is used for the last time and
+ won't be used in any subsequent layer, the memory buffer for that tensor could be re-used for
+ other arguments. We want to assign each temporary tensor to some temporary buffer,
+ and it's typically N:1 mapping.
+
+ We do it using 2-stage algorithm:
+
+ 1. First, we calculate, how many times each argument is used and store the counters into 'usecounts'.
+ 2. Second, we scan the layers in topologically sorted order
+ 2.0. Sanity check: We check that each input argument of the operation is either input or constant,
+ or it's a temporary tensor with the buffer assigned to it.
+ If not, then the layers are not sorted in a topological order.
+ 2.1. For in-place reshape operations, such as squeeze/unsqueeze/flatten etc.
+ or for unary element-wise operations,
+ we check whether the input is a temporary value and is not used in any subsequent operations.
+ If these checks all pass, we assign output argument to the same buffer as input. Note that
+ we don't try to reuse inputs of binary/ternary etc. operation because of the broadcasting.
+ We need to do symbolic shape inference to proof that the output is of the same shape as one of the inputs.
+ 2.2. Otherwise, for each output argument of operation, which is not a network output argument.
+ we assign the most recently-used free buffer (i.e. the top buffer in the stack of free buffers).
+ If there is no free buffers, i.e. the stack is empty, we create a new buffer, and use it.
+ 2.3. For each input we decrement the corresponding element of 'usecounts'. If the counter reaches 0 and the input
+ is not aliased with one of the outputs (see 2.1),
+ we push the corresponding buffer index into the stack of free buffers.
+ 2.4. In the case of in-place operations and sometimes when using subgraphs (e.g. in If, Loop operations) we may
+ re-use the same buffer for several arguments
+ (which can be ouputs for some operations and inputs for some subsequent operations).
+ In order to handle it all properly, during the buffer assignment algorithm we maintain use counter for each
+ buffer, which should not be confused with use counters for arguments. A pool of free buffers contains zero or
+ more "spare" buffers with 0 use counts. A buffer in use has the corresponding usage count > 0.
+ When some argument is not needed anymore, and if it's not a constant, it decrements the usage counter of the buffer
+ where it resides. When the counter reaches zero, we return the buffer into the pool of free buffers and then
+ we can reuse the same buffer for another argument (or probably different shape and/or type, see below).
+ In principle, we could 'protect' some buffers from the premature release and re-use by incrementing the use counts
+ of the respective arguments that reside in those buffers, but that would make the bookkeeping much more complex.
+
+ Please, note that when we reuse buffers, we don't check any types, shape or a total size of the buffer needed.
+ We reallocate each buffer at runtime to fit each single argument that it's used for. For example, let's say the buffer #3
+ is used for arguments #5 (10x10x10 FP32), #10 (6x6x32 FP32) and #14 (300x1 UINT64). Then during the the first run of
+ the inference the buffer #3 will be reallocated from 0 bytes to 1000*4 bytes to fit arg #10,
+ then from 4000 to 6*6*32*4=4608 bytes to fit arg #10 and then it will fit arg #14 without reallocations.
+ During the second run of inference with the same resolution input the buffer will not be reallocated.
+
+ The reallocation is done using Buffer.fit() function.
+ */
+
+struct BufferAllocator
+{
+    Net::Impl* netimpl;
+    vector<int> usecounts;
+    vector<int> freebufs;
+    vector<int> buf_usecounts;
+    vector<int> bufidxs;
+    int nbufs = 0;
+
+    BufferAllocator(Net::Impl* netimpl_) : netimpl(netimpl_) {}
+
+    /*
+     Here are 3 workhorse methods that abstract the use and bookkeeping of buffers:
+     1. getFreeBuffer() takes the first spare buffer from the pool of free buffers. Since
+     we don't necessarily know the shape/type of tensor type at this stage, this is quite
+     reasonable behaviour - we cannot do anything more complex that that. On the positive side,
+     since the pool of free buffers operates like a stack, the first free buffer is the most
+     recently released buffer, so we improve cache locality using this pattern.
+     When we don't have spare buffers in the pool, we "virtually" create a new buffer
+     (by incrementing the number of buffers used) and return it.
+
+     For the retrieved buffer we set its use count to 1.
+     2. releaseBuffer(bufidx) decrements the buffer use count and returns it to the pool
+     of free buffers as long as the use counter reaches 0.
+     3. shareBuffer(from_arg, to_arg) takes two argument indices.
+     It makes argument 'to_arg' use the same buffer as 'from_arg'.
+     Use counter for the assigned to 'to_arg' buffer (if any) is decremented.
+     Use counter for the 'from_arg' buffer is incremented, correpondingly.
+     */
+
+    int getFreeBuffer()
+    {
+        if (freebufs.empty()) {
+            freebufs.push_back(nbufs);
+            buf_usecounts.push_back(0);
+            //printf("added buf %d\n", nbufs);
+            nbufs++;
+        }
+        int outidx = freebufs.back();
+        freebufs.pop_back();
+        buf_usecounts[outidx] = 1;
+        return outidx;
+    }
+
+    void releaseBuffer(int bufidx)
+    {
+        if (bufidx >= 0) {
+            CV_Assert(buf_usecounts[bufidx] > 0);
+            if (--buf_usecounts[bufidx] == 0)
+                freebufs.push_back(bufidx);
+        }
+    }
+
+    void shareBuffer(Arg fromArg, Arg toArg)
+    {
+        CV_Assert(!netimpl->isConstArg(fromArg) && !netimpl->isConstArg(toArg));
+        int fromBuf = bufidxs[fromArg.idx], toBuf = bufidxs[toArg.idx];
+        CV_Assert(fromBuf >= 0);
+        bufidxs[toArg.idx] = fromBuf;
+        buf_usecounts[fromBuf]++;
+        if (toBuf >= 0)
+            releaseBuffer(toBuf);
+    }
+
+    void assign()
+    {
+        netimpl->useCounts(usecounts);
+        size_t nargs = usecounts.size();
+        bufidxs.assign(nargs, -1);
+        nbufs = 0;
+        assign(netimpl->mainGraph);
+        netimpl->bufidxs = bufidxs;
+        netimpl->buffers.resize(nbufs);
+        for (int i = 0; i < nbufs; i++)
+            netimpl->buffers[i] = Mat();
+    }
+
+    void assign(const Ptr<Graph>& graph)
+    {
+        if (!graph)
+            return;
+        const std::vector<Ptr<Layer> >& prog = graph->prog();
+        for (const auto& layer: prog) {
+            bool inplace = false;
+            Arg reuseArg;
+
+            if (!layer) continue;
+
+            const std::vector<Arg>& inputs = layer->inputs;
+            const std::vector<Arg>& outputs = layer->outputs;
+            size_t ninputs = inputs.size();
+            size_t noutputs = outputs.size();
+
+            /*
+             Determine if we can possibly re-use some of the input buffers for the output as well,
+             in other words, whether we can run the operation in-place.
+             Not only it saves memory, but it can also:
+             1. improve L2/L3 cache re-use
+             2. effectively convert some copy/re-shape operations
+             (Identity, Flatten, Reshape, Squeeze, Unsqueeze)
+             into Nop (no-operation).
+             */
+            //const ElemwiseOp* elemwise_op = dynamic_cast<const ElemwiseOp*>(op);
+
+            if (/*dynamic_cast<const BatchNormOp*>(op) != 0 ||
+                dynamic_cast<const FlattenOp*>(op) != 0 ||
+                (elemwise_op != 0 && elemwise_op->getActivation(CV_32F) != 0) ||
+                dynamic_cast<const ReshapeOp*>(op) != 0 ||
+                dynamic_cast<const SqueezeOp*>(op) != 0 ||
+                dynamic_cast<const UnsqueezeOp*>(op) != 0*/
+                layer->alwaysSupportInplace()) {
+                CV_Assert(ninputs >= 1);
+                Arg inp0 = inputs[0];
+                inplace = netimpl->argKind(inp0) == DNN_ARG_TEMP && usecounts[inp0.idx] == 1;
+                reuseArg = inp0;
+            }
+
+            /*
+             Unless the operation is in-place, assign buffers for each output.
+             We do it before we recursively process subgraphs inside If/Loop/Scan.
+             this way we avoid any possible influence of buffer allocation inside a subgraph
+             to the parent graphs.
+             */
+            //if (layer->type == "Softmax")
+            //    putchar('.');
+            if (noutputs > 0) {
+                Arg out0 = outputs[0];
+                if (inplace &&
+                    noutputs == 1 &&
+                    netimpl->argKind(out0) == DNN_ARG_TEMP &&
+                    bufidxs.at(out0.idx) < 0)
+                    shareBuffer(reuseArg, out0);
+                else {
+                    for (auto out: outputs) {
+                        if (netimpl->argKind(out) == DNN_ARG_TEMP &&
+                            bufidxs.at(out.idx) < 0) {
+                            bufidxs.at(out.idx) = getFreeBuffer();
+                        }
+                    }
+                }
+            }
+
+            std::string opname = layer->type;
+
+            if (opname == "If") {
+                /*
+                 Pre-allocate buffers for the output nodes of then- and else- branches.
+                 We try to alias them with the corresponding t_out[i] elements, so
+                 that we save one copy operation.
+                 [TODO]
+                 It's not the most optimal buffer allocation.
+                 In the ideal case, e.g. when both then- and else- branches
+                 are just sequences of element-wise operations that can be executed in-place,
+                 we could simply use a single buffer for both then- and else- branches.
+                 Here we will use separate buffers, but let's assume we could
+                 optimize out such trivial branches at the graph fusion level
+                 (especially when we have JIT).
+                 */
+                auto branches = layer->subgraphs();
+                CV_Assert(branches->size() == 2);
+
+                const Ptr<Graph>& thenBranch = branches->at(0);
+                const Ptr<Graph>& elseBranch = branches->at(1);
+                const vector<Arg>& thenOutargs = thenBranch->outputs();
+                const vector<Arg>& elseOutargs = elseBranch->outputs();
+                CV_Assert(thenOutargs.size() == noutputs && elseOutargs.size() == noutputs);
+                for (size_t i = 0; i < noutputs; i++) {
+                    Arg outarg = outputs[i];
+                    Arg thenOutarg = thenOutargs[i];
+                    Arg elseOutarg = elseOutargs[i];
+
+                    if (!netimpl->isConstArg(thenOutarg) && usecounts[thenOutarg.idx] == 1)
+                        shareBuffer(outarg, thenOutarg);
+                    if (!netimpl->isConstArg(elseOutarg) && usecounts[elseOutarg.idx] == 1)
+                        shareBuffer(outarg, elseOutarg);
+                }
+
+                assign(thenBranch);
+                assign(elseBranch);
+
+                for (size_t i = 0; i < noutputs; i++) {
+                    Arg thenOutarg = thenOutargs[i];
+                    Arg elseOutarg = elseOutargs[i];
+                    releaseBuffer(bufidxs[thenOutarg.idx]);
+                    releaseBuffer(bufidxs[elseOutarg.idx]);
+                }
+            } else if (opname == "Loop") {
+                /*
+                 In the case of loop we try to alias t_v_in[i] and t_v_out[i] so that
+                 we eliminate some copy operations after each loop iteration.
+                 */
+                //LoopLayer* loop = dynamic_cast<LoopLayer*>(op);
+                CV_Assert(ninputs >= 2);
+                auto subgraphs = layer->subgraphs();
+                CV_Assert(subgraphs && subgraphs->size() == 1);
+                const Ptr<Graph>& body = subgraphs->at(0);
+                Arg trip_count = inputs[0];
+                const std::vector<Arg>& body_inputs = body->inputs();
+                const std::vector<Arg>& body_outputs = body->outputs();
+                size_t body_ninputs = body_inputs.size();
+                size_t body_noutputs = body_outputs.size();
+                int n_state_vars = (int)(ninputs - 2);
+                int n_accums = (int)(body_noutputs - n_state_vars - 1);
+                CV_Assert(body_ninputs == ninputs);
+                CV_Assert(body_noutputs == noutputs+1);
+                CV_Assert(n_state_vars >= 0 && n_accums >= 0);
+                Arg inp0 = inputs[0];
+                if (inp0.idx > 0 && usecounts[inp0.idx] > 0) {
+                    CV_Assert(!netimpl->isConstArg(inp0));
+                    if (!netimpl->isConstArg(trip_count))
+                        shareBuffer(trip_count, inputs[0]);
+                    else
+                        bufidxs.at(inputs[0].idx) = getFreeBuffer();
+                }
+
+                for (int i = -1; i < n_state_vars; i++) {
+                    Arg inparg = body_inputs[i+2];
+                    Arg outarg = body_outputs[i+1];
+                    Arg v_inp = inputs[i+2];
+                    Arg v_out = i >= 0 ? outputs[i] : Arg();
+                    if (inparg.idx > 0 && usecounts[inparg.idx] > 0) {
+                        CV_Assert(!netimpl->isConstArg(inparg));
+                        if (!netimpl->isConstArg(v_inp))
+                            shareBuffer(v_inp, inparg);
+                        else
+                            bufidxs[inparg.idx] = getFreeBuffer();
+                    }
+                    if (!netimpl->isConstArg(v_out)) {
+                        if (!netimpl->isConstArg(outarg) && usecounts[outarg.idx] == 1)
+                            shareBuffer(v_out, outarg);
+                    }
+                }
+
+                assign(body);
+                for (auto body_out: body_outputs)
+                    releaseBuffer(bufidxs.at(body_out.idx));
+            }
+
+            for (auto out: outputs) {
+                if (usecounts[out.idx] == 0)
+                    releaseBuffer(bufidxs.at(out.idx));
+            }
+            // let's release inputs in the reverse order to keep the buffer allocation consistent across the network
+            for (size_t i = 0; i < ninputs; i++) {
+                Arg inp = inputs[ninputs-i-1];
+                int bufidx = bufidxs[inp.idx];
+                if (bufidx >= 0) {
+                    if (--usecounts.at(inp.idx) == 0)
+                        releaseBuffer(bufidx);
+                }
+            }
+        }
+    }
+};
+
+void Net::Impl::assignBuffers()
+{
+    BufferAllocator buf_allocator(this);
+    buf_allocator.assign();
+}
+
+CV__DNN_INLINE_NS_END
+}}
diff --git a/modules/dnn/src/graph_const_fold.cpp b/modules/dnn/src/graph_const_fold.cpp
new file mode 100644
index 0000000000..8cfaca617c
--- /dev/null
+++ b/modules/dnn/src/graph_const_fold.cpp
@@ -0,0 +1,139 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "precomp.hpp"
+#include "net_impl.hpp"
+
+namespace cv { namespace dnn {
+CV__DNN_INLINE_NS_BEGIN
+
+using std::vector;
+using std::string;
+
+typedef std::pair<int, int> int_pair;
+typedef std::pair<int, Arg> int_arg_pair;
+
+struct ConstFolding
+{
+    Net::Impl* netimpl;
+    std::vector<int> usecounts;
+
+    ConstFolding(Net::Impl* netimpl_) : netimpl(netimpl_) {}
+
+    void process()
+    {
+        size_t nargs = netimpl->args.size();
+        netimpl->__tensors__.resize(nargs);
+        netimpl->useCounts(usecounts);
+        netimpl->scratchBufs.clear();
+        processGraph(netimpl->mainGraph);
+        netimpl->scratchBufs.clear();
+    }
+
+    Layer* getLayer(std::vector<Ptr<Layer> >& newprog, int op_idx) const
+    {
+        return op_idx >= 0 ? newprog.at(op_idx).get() : 0;
+    }
+
+    void unuse(Arg inp)
+    {
+        CV_Assert(usecounts[inp.idx] > 0);
+        if (--usecounts[inp.idx] == 0 && netimpl->isConstArg(inp)) {
+            netimpl->__tensors__[inp.idx] = Mat(); // deallocate unused tensor
+        }
+    }
+
+    bool processGraph(Ptr<Graph>& graph)
+    {
+        bool modified = false;
+        const std::vector<Ptr<Layer> >& prog = graph->prog();
+        size_t i, nops = prog.size();
+        std::vector<Ptr<Layer> > newprog;
+        std::vector<Arg> removed_args;
+        std::vector<Mat> inpMats, tempMats;
+        std::vector<int> inpTypes, outTypes, tempTypes;
+        std::vector<MatShape> inpShapes, outShapes, tempShapes;
+
+        for (i = 0; i < nops; i++) {
+            const Ptr<Layer>& layer = prog[i];
+            std::vector<Ptr<Graph> >* subgraphs = layer->subgraphs();
+            if (subgraphs) {
+                for (Ptr<Graph>& g: *subgraphs) {
+                    if (processGraph(g))
+                        modified = true;
+                }
+            }
+            const std::vector<Arg>& inputs = layer->inputs;
+            const std::vector<Arg>& outputs = layer->outputs;
+            size_t j, ninputs = inputs.size(), noutputs = outputs.size();
+            bool all_const = true;
+            inpMats.assign(ninputs, Mat());
+            inpTypes.resize(ninputs);
+            inpShapes.resize(ninputs);
+            for (j = 0; j < ninputs; j++) {
+                Arg inp = inputs[j];
+                bool const_arg = netimpl->isConstArg(inp);
+                if (!const_arg)
+                    all_const = false;
+                if (all_const) {
+                    const Mat& m = netimpl->argTensor(inp);
+                    inpMats[j] = m;
+                    inpTypes[j] = m.type();
+                    inpShapes[j] = m.shape();
+                }
+            }
+
+            if (all_const /*&&
+                op->supportBlockLayout(0, (int)ninputs) <= 0 // we don't currently support constant folding
+                                               // for block-layout operations (Convolution, MaxPool, AveragePool)
+                */) {
+                // Use a fresh vector of Mat's for outputs since we want to make these outputs the new constant tensors.
+                // So, they must be unique and don't interfere with other tensors.
+                std::vector<Mat> outMats(noutputs);
+                std::vector<std::pair<uchar*, size_t> > outOrigData;
+                if (!layer->dynamicOutputShapes())
+                    netimpl->allocateLayerOutputs(layer, inpTypes, inpShapes, outTypes,
+                                                  outShapes, outOrigData, outMats, tempTypes, tempShapes, tempMats,
+                                                  netimpl->scratchBufs, false);
+                layer->finalize(inpMats, outMats);
+                layer->forward(inpMats, outMats, tempMats);
+                CV_Assert(outMats.size() == noutputs);
+                for (j = 0; j < noutputs; j++) {
+                    Arg out = outputs[j];
+                    ArgData& out_data = netimpl->args.at(out.idx);
+                    const Mat& m = outMats[j];
+                    out_data.type = m.type();
+                    out_data.shape = m.shape();
+                    out_data.kind = DNN_ARG_CONST; // re-classify each output as constant
+                    netimpl->__tensors__.at(out.idx) = m;
+                }
+
+                modified = true;
+                for (size_t i = 0; i < ninputs; i++)
+                    unuse(inputs[i]);
+                //printf("folded %s: %s\n", op->name().data(), node->name().data());
+                // we don't add operation into the new program,
+                // because the output of the all-const inputs operation is now a constant,
+                // stored in a separate tensor
+            } else {
+                newprog.push_back(layer);
+            }
+        }
+
+        if (modified) {
+            graph->setProg(newprog);
+        }
+
+        return modified;
+    }
+};
+
+void Net::Impl::constFold()
+{
+    ConstFolding constfolder(this);
+    constfolder.process();
+}
+
+CV__DNN_INLINE_NS_END
+}}
diff --git a/modules/dnn/src/init.cpp b/modules/dnn/src/init.cpp
index e3f9ebf712..8e8932aab6 100644
--- a/modules/dnn/src/init.cpp
+++ b/modules/dnn/src/init.cpp
@@ -84,14 +84,29 @@ void initializeLayerFactory()
     static ProtobufShutdown protobufShutdown; CV_UNUSED(protobufShutdown);
 #endif
 
-    CV_DNN_REGISTER_LAYER_CLASS(Slice,          SliceLayer);
-    CV_DNN_REGISTER_LAYER_CLASS(Split,          SplitLayer);
     CV_DNN_REGISTER_LAYER_CLASS(Concat,         ConcatLayer);
-    CV_DNN_REGISTER_LAYER_CLASS(Reshape,        ReshapeLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Concat2,        Concat2Layer);
+    CV_DNN_REGISTER_LAYER_CLASS(ConstantOfShape, ConstantOfShapeLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(CropAndResize,  CropAndResizeLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(DequantizeLinear, DequantizeLinearLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Expand2,        Expand2Layer);
     CV_DNN_REGISTER_LAYER_CLASS(Flatten,        FlattenLayer);
-    CV_DNN_REGISTER_LAYER_CLASS(Resize,         ResizeLayer);
     CV_DNN_REGISTER_LAYER_CLASS(Interp,         InterpLayer);
-    CV_DNN_REGISTER_LAYER_CLASS(CropAndResize,  CropAndResizeLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Pad2,           Pad2Layer);
+    CV_DNN_REGISTER_LAYER_CLASS(QuantizeLinear, QuantizeLinearLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Range,          RangeLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Reshape,        ReshapeLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Reshape2,       Reshape2Layer);
+    CV_DNN_REGISTER_LAYER_CLASS(Resize,         ResizeLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Shape,          ShapeLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Slice,          SliceLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Slice2,         Slice2Layer);
+    CV_DNN_REGISTER_LAYER_CLASS(Split,          SplitLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Split2,         Split2Layer);
+    CV_DNN_REGISTER_LAYER_CLASS(Squeeze,        SqueezeLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Tile2,          Tile2Layer);
+    CV_DNN_REGISTER_LAYER_CLASS(Transpose,      TransposeLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Unsqueeze,      UnsqueezeLayer);
 
     CV_DNN_REGISTER_LAYER_CLASS(Convolution,    ConvolutionLayer);
     CV_DNN_REGISTER_LAYER_CLASS(Deconvolution,  DeconvolutionLayer);
@@ -158,6 +173,7 @@ void initializeLayerFactory()
     CV_DNN_REGISTER_LAYER_CLASS(Arg,            ArgLayer);
     CV_DNN_REGISTER_LAYER_CLASS(Reciprocal,     ReciprocalLayer);
     CV_DNN_REGISTER_LAYER_CLASS(Gather,         GatherLayer);
+    CV_DNN_REGISTER_LAYER_CLASS(Gather2,        Gather2Layer);
     CV_DNN_REGISTER_LAYER_CLASS(GatherElements, GatherElementsLayer);
     CV_DNN_REGISTER_LAYER_CLASS(LayerNormalization, LayerNormLayer);
     CV_DNN_REGISTER_LAYER_CLASS(Expand,         ExpandLayer);
diff --git a/modules/dnn/src/int8layers/convolution_layer.cpp b/modules/dnn/src/int8layers/convolution_layer.cpp
index 18c9205c22..aca01c3e9d 100644
--- a/modules/dnn/src/int8layers/convolution_layer.cpp
+++ b/modules/dnn/src/int8layers/convolution_layer.cpp
@@ -172,7 +172,7 @@ public:
     MatShape computeColRowShape(const MatShape &inpShape, const MatShape &outShape) const CV_OVERRIDE
     {
         CV_Assert(!blobs.empty());
-        int dims = inpShape.size();
+        int dims = (int)inpShape.size();
         int inpD = dims == 5 ? inpShape[2] : 1;
         int inpH = inpShape[dims - 2];
         int inpW = inpShape.back();
@@ -236,7 +236,7 @@ public:
                      "be multiple of %d but got %d", weightShape[1], inpCn));
         CV_Assert(ngroups > 0 && inpCn % ngroups == 0 && outCn % ngroups == 0);
 
-        outputs.resize(1, outShape);
+        outputs.resize(1, MatShape(outShape));
 
         return false;
     }
diff --git a/modules/dnn/src/int8layers/eltwise_layer.cpp b/modules/dnn/src/int8layers/eltwise_layer.cpp
index 214d11525a..a42f45e070 100644
--- a/modules/dnn/src/int8layers/eltwise_layer.cpp
+++ b/modules/dnn/src/int8layers/eltwise_layer.cpp
@@ -233,8 +233,8 @@ public:
 
         for (size_t i = 0; i < inputs.size(); i++)
         {
-            MatShape inpShape = shape(inputs[i].size);
-            if (isAllOnes(inpShape, 2, inputs[i].dims))
+            MatShape inpShape = inputs[i].shape();
+            if (isAllOnes(inpShape, 2, inpShape.dims))
             {
                 hasVecInput = true;
                 return;
@@ -679,15 +679,15 @@ public:
         {
             for (size_t i = 0; i < inputs.size(); i++)
             {
-                MatShape inpShape = shape(inputs[i].size);
+                MatShape inpShape = inputs[i].shape();
                 bool allOnes = isAllOnes(inpShape, 2, inputs[i].dims);
 
                 if (allOnes)
                 {
                     Mat tmpInput = inputs[i];
-                    MatShape outShape = shape(outputs[0].size);
+                    MatShape outShape = outputs[0].shape();
                     size_t xSize = outShape[2];
-                    for (size_t j = 3; j < outShape.size(); j++)
+                    for (int j = 3; j < outShape.dims; j++)
                         xSize *= outShape[j];
 
                     int dimVec[3] = {outShape[0], outShape[1], (int) xSize};
diff --git a/modules/dnn/src/int8layers/pooling_layer.cpp b/modules/dnn/src/int8layers/pooling_layer.cpp
index cfd04bd2f4..77b0754b5b 100644
--- a/modules/dnn/src/int8layers/pooling_layer.cpp
+++ b/modules/dnn/src/int8layers/pooling_layer.cpp
@@ -706,7 +706,7 @@ public:
         std::vector<size_t> local_kernel;
         if (globalPooling) {
             for (int i = 0; i < inpShape.size(); i++) {
-                int idx = isGlobalPooling.size() - inpShape.size() + i;
+                size_t idx = isGlobalPooling.size() - inpShape.size() + i;
                 local_kernel.push_back(isGlobalPooling[idx] ? inpShape[i] : kernel_size[idx]);
             }
         } else {
@@ -741,7 +741,7 @@ public:
                                  std::vector<size_t>(local_kernel.size(), 1), outShape);
         }
 
-        outputs.assign(1, outShape);
+        outputs.assign(1, MatShape(outShape));
         return false;
     }
 
diff --git a/modules/dnn/src/int8layers/quantization_utils.cpp b/modules/dnn/src/int8layers/quantization_utils.cpp
index d9c64150f2..10ca90a1f8 100644
--- a/modules/dnn/src/int8layers/quantization_utils.cpp
+++ b/modules/dnn/src/int8layers/quantization_utils.cpp
@@ -185,7 +185,8 @@ public:
                          std::vector<MatShape> &internals) const CV_OVERRIDE
     {
         CV_Check(inputs.size(), inputs.size() >= 1 && inputs.size() <= 3, "Number of inputs must be between 1 and 3 inclusive.");
-        Layer::getMemoryShapes(inputs, requiredOutputs, outputs, internals);
+        CV_Assert(requiredOutputs <= 1);
+        outputs.assign(1, inputs[0]);
         return false;
     }
 
@@ -356,7 +357,8 @@ public:
                          std::vector<MatShape> &internals) const CV_OVERRIDE
     {
         CV_Check(inputs.size(), inputs.size() >= 1 && inputs.size() <= 3, "Number of inputs must be between 1 and 3 inclusive.");
-        Layer::getMemoryShapes(inputs, requiredOutputs, outputs, internals);
+        CV_Assert(requiredOutputs <= 1);
+        outputs.assign(1, inputs[0]);
         return false;
     }
 
diff --git a/modules/dnn/src/layer.cpp b/modules/dnn/src/layer.cpp
index 6d42cbd867..717105a10f 100644
--- a/modules/dnn/src/layer.cpp
+++ b/modules/dnn/src/layer.cpp
@@ -3,19 +3,24 @@
 // of this distribution and at http://opencv.org/license.html.
 
 #include "precomp.hpp"
+#include "net_impl.hpp"
 
 namespace cv {
 namespace dnn {
 CV__DNN_INLINE_NS_BEGIN
 
 
-Layer::Layer() { preferableTarget = DNN_TARGET_CPU; }
+Layer::Layer() {
+    netimpl = nullptr;
+    preferableTarget = DNN_TARGET_CPU;
+}
 
 Layer::Layer(const LayerParams& params)
     : blobs(params.blobs)
     , name(params.name)
     , type(params.type)
 {
+    netimpl = nullptr;
     preferableTarget = DNN_TARGET_CPU;
 }
 
@@ -273,10 +278,110 @@ void Layer::getTypes(const std::vector<MatType>&inputs,
     internals.assign(requiredInternals, inputs[0]);
 }
 
+int64 Layer::getFLOPS(const std::vector<MatShape>&,
+                      const std::vector<MatShape>&) const
+{
+    return 0;
+}
+
 bool Layer::updateMemoryShapes(const std::vector<MatShape>& inputs)
 {
     return true;
 }
 
+std::vector<Ptr<Graph> >* Layer::subgraphs() const
+{
+    return nullptr;
+}
+
+bool Layer::alwaysSupportInplace() const
+{
+    return false;
+}
+
+bool Layer::dynamicOutputShapes() const
+{
+    return false;
+}
+
+std::ostream& Layer::dumpAttrs(std::ostream& strm, int) const
+{
+    return strm;
+}
+
+std::ostream& Layer::dump(std::ostream& strm, int indent, bool comma) const
+{
+    CV_Assert(netimpl);
+    size_t ninputs = inputs.size();
+    size_t noutputs = outputs.size();
+    size_t nblobs = blobs.size();
+    const std::vector<Ptr<Graph> >* subgraphs_ = subgraphs();
+    size_t nsubgraphs = subgraphs_ ? subgraphs_->size() : 0;
+    Net::Impl* netimpl = getNetImpl(this);
+    int delta_indent = netimpl->dump_indent;
+    int subindent = indent + delta_indent;
+    int argindent = subindent + delta_indent;
+    prindent(strm, indent);
+    std::string opname = type;
+    strm << opname << " {\n";
+    prindent(strm, subindent);
+    strm << "name: \"" << name << "\",\n";
+
+    if (!blobs.empty()) {
+        prindent(strm, subindent);
+        strm << "blobs: [\n";
+        for (size_t i = 0; i < nblobs; i++) {
+            if (i > 0)
+                strm << ",\n";
+            const Mat& blob = blobs[i];
+            prindent(strm, argindent);
+            netimpl->dumpTypeShape(strm, blob.type(), blob.shape());
+        }
+        strm << "\n";
+        prindent(strm, subindent);
+        strm << "],\n";
+    }
+    dumpAttrs(strm, subindent);
+    prindent(strm, subindent);
+    strm << "inputs: [\n";
+    for (size_t i = 0; i < ninputs; i++) {
+        netimpl->dumpArg(strm, inputs[i], argindent, i+1 < ninputs, true);
+    }
+    prindent(strm, subindent);
+    strm << "],\n";
+    prindent(strm, subindent);
+    strm << "outputs: [\n";
+    for (size_t i = 0; i < noutputs; i++) {
+        netimpl->dumpArg(strm, outputs[i], argindent, i+1 < noutputs, true);
+    }
+    prindent(strm, subindent);
+    strm << "],\n";
+
+    if (nsubgraphs > 0) {
+        std::vector<std::string> names;
+        if (opname == "If")
+            names = {"then", "else"};
+        else if (opname == "Loop")
+            names = {"body"};
+        else {
+            CV_Error(Error::StsError,
+                     format("unsupported operation '%s' with subgraphs",
+                            std::string(opname).c_str()));
+        }
+        CV_Assert(names.size() == nsubgraphs);
+        for (size_t i = 0; i < nsubgraphs; i++) {
+            prindent(strm, subindent);
+            strm << names[i] << ": ";
+            subgraphs_->at(i)->dump(strm, argindent, i+1 < nsubgraphs);
+        }
+    }
+    prindent(strm, indent);
+    strm << '}';
+    if (comma)
+        strm << ',';
+    strm << '\n';
+    return strm;
+}
+
 CV__DNN_INLINE_NS_END
 }}  // namespace cv::dnn
diff --git a/modules/dnn/src/layer_internals.hpp b/modules/dnn/src/layer_internals.hpp
index 3cb0297400..4a3e045dd8 100644
--- a/modules/dnn/src/layer_internals.hpp
+++ b/modules/dnn/src/layer_internals.hpp
@@ -322,7 +322,7 @@ struct DataLayer : public Layer
             std::vector<MatShape>& outputs,
             std::vector<MatShape>& internals) const CV_OVERRIDE
     {
-        CV_Assert(inputs.size() == requiredOutputs);
+        CV_Assert(inputs.size() == requiredOutputs || requiredOutputs == 0);
         outputs.assign(inputs.begin(), inputs.end());
         return false;
     }
diff --git a/modules/dnn/src/layers/accum_layer.cpp b/modules/dnn/src/layers/accum_layer.cpp
index 72bbf04b87..56124831ad 100644
--- a/modules/dnn/src/layers/accum_layer.cpp
+++ b/modules/dnn/src/layers/accum_layer.cpp
@@ -28,7 +28,7 @@ public:
                                  std::vector<MatShape> &outputs,
                                  std::vector<MatShape> &internals) const CV_OVERRIDE
     {
-        std::vector<int> outShape;
+        MatShape outShape;
         int batch = inputs[0][0];
         outShape.push_back(batch);
 
@@ -85,8 +85,13 @@ public:
     virtual void finalize(InputArrayOfArrays inputs_arr, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
     {
         LayerParams resizeParams;
+        Mat out = outputs_arr.getMat(0);
         resizeParams.set("interpolation", "bilinear");
         resizeParams.set("align_corners", true);
+        if (out.dims == 4) {
+            resizeParams.set("height", out.size[2]);
+            resizeParams.set("width", out.size[3]);
+        }
         resize = ResizeLayer::create(resizeParams);
     }
 
diff --git a/modules/dnn/src/layers/arg_layer.cpp b/modules/dnn/src/layers/arg_layer.cpp
index caa2399396..39206b0f76 100644
--- a/modules/dnn/src/layers/arg_layer.cpp
+++ b/modules/dnn/src/layers/arg_layer.cpp
@@ -71,7 +71,7 @@ public:
 
         const int axis_ = normalize_axis(axis, inpShape);
         // handle dims = 0 situation
-        if (!inpShape.empty())
+        if (inpShape.dims > 0)
             handleKeepDims(inpShape, axis_);
         outputs.assign(1, inpShape);
 
@@ -97,7 +97,7 @@ public:
         outputs_arr.getMatVector(outputs);
 
         CV_Assert_N(inputs.size() == 1, outputs.size() == 1);
-        std::vector<int> outShape = shape(outputs[0]);
+        MatShape outShape = shape(outputs[0]);
         Mat output(outShape, CV_32SC1);
 
         switch (op)
diff --git a/modules/dnn/src/layers/blank_layer.cpp b/modules/dnn/src/layers/blank_layer.cpp
index 41eab0cd1c..133edfa91f 100644
--- a/modules/dnn/src/layers/blank_layer.cpp
+++ b/modules/dnn/src/layers/blank_layer.cpp
@@ -78,7 +78,9 @@ public:
                          std::vector<MatShape> &outputs,
                          std::vector<MatShape> &internals) const CV_OVERRIDE
     {
-        Layer::getMemoryShapes(inputs, requiredOutputs, outputs, internals);
+        CV_Assert(!inputs.empty());
+        outputs.assign(std::max(requiredOutputs, 1), inputs[0]);
+        internals.clear();
         return true;
     }
 
@@ -88,8 +90,9 @@ public:
         std::vector<MatType>& outputs,
         std::vector<MatType>& internals) const CV_OVERRIDE
     {
-        CV_Assert(inputs.size());
-        outputs = inputs;
+        CV_Assert(!inputs.empty());
+        outputs.assign(std::max(requiredOutputs, 1), inputs[0]);
+        internals.clear();
     }
 
 
@@ -126,10 +129,12 @@ public:
         inputs_arr.getMatVector(inputs);
         outputs_arr.getMatVector(outputs);
 
-        size_t i, n = outputs.size();
-        for (i = 0; i < n; ++i)
-            if (outputs[i].data != inputs[i].data)
-                inputs[i].copyTo(outputs[i]);
+        size_t i, ninputs = inputs.size(), noutputs = outputs.size();
+        for (i = 0; i < noutputs; ++i) {
+            const Mat& inp = inputs[i < ninputs ? i : 0];
+            if (outputs[i].data != inp.data)
+                inp.copyTo(outputs[i]);
+        }
     }
 
 #ifdef HAVE_CANN
diff --git a/modules/dnn/src/layers/concat2_layer.cpp b/modules/dnn/src/layers/concat2_layer.cpp
new file mode 100644
index 0000000000..896a6a2542
--- /dev/null
+++ b/modules/dnn/src/layers/concat2_layer.cpp
@@ -0,0 +1,191 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    Concat layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__Concat.html
+
+    Opset's 1 to 13 are covered.
+*/
+
+// out must be pre-allocated
+static void concat(const std::vector<Mat>& inps, Mat& out, int axis)
+{
+    CV_Assert(out.isContinuous());
+
+    MatShape outShape = out.shape();
+    int ndims = outShape.dims, nslices = 1;
+    size_t esz = out.elemSize();
+    size_t sliceSize = esz;
+    size_t totalSize = 0;
+    size_t outStep = 0;
+    int ninputs = (int)inps.size();
+    for (int i = ndims-1; i > axis; i--)
+        sliceSize *= outShape[i];
+    outStep = sliceSize*outShape[axis];
+    for (int i = 0; i < axis; i++)
+        nslices *= outShape[i];
+    for (int i = 0; i < ninputs; i++) {
+        CV_Assert(inps[i].isContinuous());
+        totalSize += inps[i].total()*esz;
+    }
+
+    parallel_for_(Range(0, ninputs), [&](const Range& r) {
+        for (int k = r.start; k < r.end; k++) {
+            const Mat& inp_k = inps[k];
+            uchar* outptr = out.data;
+            const uchar* inptr_k = inp_k.data;
+            int sz_a;
+            for (int i = 0; i < k; i++) {
+                sz_a = inps[i].size[axis];
+                outptr += sliceSize*sz_a;
+            }
+            sz_a = inp_k.size[axis];
+            size_t sliceSize_k = sliceSize*sz_a;
+            for (int i = 0; i < nslices; i++)
+                memcpy(outptr + i*outStep, inptr_k + i*sliceSize_k, sliceSize_k);
+        }
+    }, (totalSize > 1000000 ? ninputs : 1));
+}
+
+class Concat2LayerImpl CV_FINAL : public Concat2Layer
+{
+public:
+    Concat2LayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+        axis = params.get<int>("axis", 1);
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    MatShape getOutShape(const std::vector<MatShape>& inpShapes) const
+    {
+        size_t ninputs = inpShapes.size();
+        CV_Assert(ninputs == inputs.size());
+
+        const MatShape& inpShape0 = inpShapes[0];
+        int inpDims = inpShape0.dims;
+        int axis_ = normalize_axis(axis, inpDims);
+        CV_Assert(0 <= axis_ && axis_ < inpDims);
+        MatShape outShape = inpShape0;
+        outShape[axis_] = 0;
+
+        for (size_t i = 0; i < ninputs; i++) {
+            const MatShape& inpShape_i = inpShapes[i];
+            CV_Assert(inpShape_i.dims == inpDims);
+            for (int j = 0; j < inpDims; j++) {
+                if (j == axis_) {
+                    outShape[j] += inpShape_i[j];
+                    continue;
+                }
+                CV_Assert(inpShape0[j] == inpShape_i[j]);
+            }
+        }
+
+        return outShape;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape> &inputs,
+                         const int,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        outputs.assign(1, getOutShape(inputs));
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs > 0);
+        for (size_t i = 1; i < ninputs; i++) {
+            CV_Assert(inputs[i] == inputs[0]);
+        }
+        outputs.assign(requiredOutputs, inputs[0]);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+
+        CV_Assert(ninputs > 0);
+
+        std::vector<MatShape> inpShapes(ninputs);
+        int inpType = inputs_arr.type(0);
+
+        for (int i = 0; i < ninputs; i++) {
+            inpShapes[i] = inputs_arr.shape(i);
+            CV_Assert(inputs_arr.type(i) == inpType);
+        }
+
+        MatShape outShape = getOutShape(inpShapes);
+        int outKind = outputs_arr.kind();
+        int axis_ = normalize_axis(axis, inpShapes[0].dims);
+
+        CV_Assert(outKind == _InputArray::STD_VECTOR_MAT ||
+                  outKind == _InputArray::STD_VECTOR_UMAT);
+
+        if (outKind == _InputArray::STD_VECTOR_MAT) {
+            std::vector<Mat> inps;
+            inputs_arr.getMatVector(inps);
+            std::vector<Mat>& outs = outputs_arr.getMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outShape, inpType);
+            runOp(inps, outs[0], axis_);
+        } else {
+            // [TODO] more efficient OpenCL implementation
+            std::vector<Mat> inps;
+            inputs_arr.getMatVector(inps);
+            std::vector<UMat>& outs = outputs_arr.getUMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outShape, inpType);
+            Mat temp(outShape, inpType);
+            runOp(inps, temp, axis_);
+            temp.copyTo(outs[0]);
+        }
+    }
+
+    void runOp(const std::vector<Mat>& inps, Mat& out, int axis_)
+    {
+        concat(inps, out, axis_);
+    }
+};
+
+Ptr<Concat2Layer> Concat2Layer::create(const LayerParams& params)
+{
+    return Ptr<Concat2Layer>(new Concat2LayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/concat_layer.cpp b/modules/dnn/src/layers/concat_layer.cpp
index a38e3baa9e..a42551a390 100644
--- a/modules/dnn/src/layers/concat_layer.cpp
+++ b/modules/dnn/src/layers/concat_layer.cpp
@@ -109,11 +109,9 @@ public:
                 }
             }
 
-            axisSum += (!curShape.empty()) ? curShape[cAxis] : 1;
-        }
-        if (inputs[0].empty()){
-            outputs[0] = MatShape(1);
+            axisSum += curShape.dims >= cAxis ? curShape[cAxis] : 1;
         }
+        outputs[0].dims = std::max(outputs[0].dims, 1);
         outputs[0][cAxis] = axisSum;
         return false;
     }
diff --git a/modules/dnn/src/layers/constantofshape_layer.cpp b/modules/dnn/src/layers/constantofshape_layer.cpp
new file mode 100644
index 0000000000..3129ddf925
--- /dev/null
+++ b/modules/dnn/src/layers/constantofshape_layer.cpp
@@ -0,0 +1,149 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    ConstantOfShape layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__ConstantOfShape.html
+
+    Opset's 9 to 23 are covered.
+*/
+
+// out must be pre-allocated
+static void constantOfShape(const Mat& value, Mat& out)
+{
+    CV_Assert(value.total() == 1);
+    CV_Assert(out.isContinuous());
+    CV_CheckEQ(value.type(), out.type(), "input and output tensor types must be the same");
+
+    size_t esz = value.elemSize();
+    size_t total = out.total();
+    const uchar* inpdata_ = value.data;
+    uchar* outdata_ = out.data;
+
+    #undef IMPL_CONST_OF_SHAPE
+    #define IMPL_CONST_OF_SHAPE(T) \
+        T val = *(const T*)inpdata_; \
+        T* outdata = (T*)outdata_; \
+        for (size_t i = 0; i < total; i++) \
+            outdata[i] = val
+
+    if (esz == 1) {
+        IMPL_CONST_OF_SHAPE(uint8_t);
+    } else if (esz == 2) {
+        IMPL_CONST_OF_SHAPE(uint16_t);
+    } else if (esz == 4) {
+        IMPL_CONST_OF_SHAPE(uint32_t);
+    } else if (esz == 8) {
+        IMPL_CONST_OF_SHAPE(uint64_t);
+    } else {
+        CV_Error_(Error::StsNotImplemented, ("invalid/unsupported tensor type: %s", typeToString(value.type()).c_str()));
+    }
+}
+
+class ConstantOfShapeLayerImpl CV_FINAL : public ConstantOfShapeLayer
+{
+public:
+    ConstantOfShapeLayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    virtual bool dynamicOutputShapes() const CV_OVERRIDE
+    {
+        Net::Impl* netimpl_ = getNetImpl(this);
+        CV_Assert(netimpl_);
+        CV_Assert(this->inputs.size() == 1);
+        return !netimpl_->isConstArg(this->inputs[0]);
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape>&,
+                         const int,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        CV_Assert(!dynamicOutputShapes());
+
+        CV_Assert(this->inputs.size() == (size_t)1);
+        Net::Impl* netimpl_ = getNetImpl(this);
+        Mat shapeTensor = netimpl_->argTensor(this->inputs[0]);
+        MatShape shape = tensorToShape(shapeTensor);
+        outputs.assign(1, shape);
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        CV_Assert(blobs.size() == 1);
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == (size_t)1);
+        outputs.assign(requiredOutputs, blobs[0].type());
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_Assert(blobs.size() == 1);
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+        CV_Assert(ninputs == 1);
+
+        const Mat& value = blobs[0];
+        Mat shapeTensor = inputs_arr.getMat(0);
+        MatShape shape = tensorToShape(shapeTensor);
+
+        auto kind = outputs_arr.kind();
+        if (kind == _InputArray::STD_VECTOR_MAT) {
+            std::vector<Mat>& outs = outputs_arr.getMatVecRef();
+            outs.resize(1);
+            outs[0].fit(shape, value.type());
+            constantOfShape(value, outs[0]);
+        } else if (kind == _InputArray::STD_VECTOR_UMAT) {
+            std::vector<UMat>& outs = outputs_arr.getUMatVecRef();
+            outs.resize(1);
+            outs[0].fit(shape, value.type());
+            Mat temp(shape, value.type());
+            constantOfShape(value, temp);
+            temp.copyTo(outs[0]);
+        } else {
+            CV_Error(Error::StsNotImplemented, "");
+        }
+    }
+};
+
+Ptr<ConstantOfShapeLayer> ConstantOfShapeLayer::create(const LayerParams& params)
+{
+    return Ptr<ConstantOfShapeLayer>(new ConstantOfShapeLayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/convolution_layer.cpp b/modules/dnn/src/layers/convolution_layer.cpp
index de599ae23b..cbee52ac70 100644
--- a/modules/dnn/src/layers/convolution_layer.cpp
+++ b/modules/dnn/src/layers/convolution_layer.cpp
@@ -80,21 +80,16 @@ class BaseConvolutionLayerImpl : public ConvolutionLayer
 public:
     bool fusedWeights, fusedBias;
     std::vector<double> weightsMultipliers;
-#ifdef HAVE_WEBNN
     int groups;
-#endif
+
     BaseConvolutionLayerImpl(const LayerParams &params)
     {
         setParamsFrom(params);
         getConvolutionKernelParams(params, kernel_size, pads_begin, pads_end, strides, dilations,
                                    padMode, adjust_pads, useWinograd);
 
-        numOutput = params.get<int>("num_output");
-        int ngroups = params.get<int>("group", 1);
-#ifdef HAVE_WEBNN
-        groups = ngroups;
-#endif
-        CV_Assert(numOutput % ngroups == 0);
+        numOutput = -1;
+        groups = params.get<int>("group", 1);
 
         if (kernel_size.size() == 2) {
             kernel = Size(kernel_size[1], kernel_size[0]);
@@ -122,10 +117,11 @@ public:
 
         CV_Assert((inputs.size() > outputs.size() && blobs.empty()) ||
                   (!inputs.empty() && (blobs.size() == 1 || blobs.size() == 2)));
-        MatSize weightShape = blobs.empty() ? inputs[1].size : blobs[0].size;
+        MatShape weightShape = blobs.empty() ? inputs[1].shape() : blobs[0].shape();
+        numOutput = weightShape[0];
 
         CV_Assert(inputs[0].dims == outputs[0].dims);
-        if (weightShape.dims() == 3)
+        if (weightShape.dims == 3)
         {
             kernel_size.resize(1, kernel_size[0]);
             strides.resize(1, strides[0]);
@@ -133,7 +129,7 @@ public:
             pads_begin.resize(1, pads_begin[0]);
             pads_end.resize(1, pads_end[0]);
         }
-        CV_Assert(weightShape.dims() == kernel_size.size() + 2);
+        CV_Assert(weightShape.dims == kernel_size.size() + 2);
         for (int i = 0; i < kernel_size.size(); i++) {
             CV_Assert(weightShape[i + 2] == kernel_size[i]);
         }
@@ -338,7 +334,8 @@ public:
         if (padMode.empty())
         {
             for (int i = 0; i < inpShape.size(); i++)
-                outShape.push_back((inpShape[i] + pads_begin[i] + pads_end[i] - dilations[i] * (kernel_size[i] - 1) - 1) / strides[i] + 1);
+                outShape.push_back((inpShape[i] + pads_begin[i] + pads_end[i] -
+                                    dilations[i] * (kernel_size[i] - 1) - 1) / strides[i] + 1);
         }
         else
         {
@@ -351,7 +348,7 @@ public:
                      "be multiple of %d but got %d", weightShape[1], inpCn));
         CV_Assert(ngroups > 0 && inpCn % ngroups == 0 && outCn % ngroups == 0);
 
-        outputs.resize(1, outShape);
+        outputs.resize(1, MatShape(outShape));
 
         return false;
     }
@@ -1329,13 +1326,11 @@ public:
     MatShape computeColRowShape(const MatShape &inpShape, const MatShape &outShape) const CV_OVERRIDE
     {
         int dims = inpShape.size();
-        int inpCn = inpShape[1];
         int inpD = dims == 5 ? inpShape[2] : 1;
         int inpH = inpShape[dims - 2];
         int inpW = inpShape.back();
         int outCn = outShape[1];
-        int ngroups = inpCn / blobs[0].size[0];
-        int outGroupCn = outCn / ngroups;
+        int outGroupCn = outCn / groups;
         int ksize = outGroupCn * std::accumulate(kernel_size.begin(), kernel_size.end(),
                                                  1, std::multiplies<size_t>());
         return shape(ksize, inpD * inpH * inpW);
@@ -1372,10 +1367,14 @@ public:
                          std::vector<MatShape> &outputs,
                          std::vector<MatShape> &internals) const CV_OVERRIDE
     {
-        CV_Assert(!hasBias() || blobs[1].total() == (size_t)numOutput);
         CV_Assert(inputs.size() != 0);
 
         int outCn = numOutput;
+        if (outCn < 0) {
+            CV_Assert(inputs.size() > 1 || !blobs.empty());
+            MatShape weightShape = blobs.empty() ? inputs[1] : blobs[0].shape();
+            outCn = weightShape[1]*groups;
+        }
         std::vector<int> outShape;
         outShape.push_back(inputs[0][0]);  // batch
         outShape.push_back(outCn);
@@ -1398,13 +1397,12 @@ public:
             CV_Error(Error::StsError, "Unsupported padding mode " + padMode);
 
         CV_Assert(outCn % blobs[0].size[1] == 0);
-        int ngroups = outCn / blobs[0].size[1];
 
         int inpCn = inputs[0][1];
-        CV_Assert(inpCn % ngroups == 0 && outCn % ngroups == 0);
+        CV_Assert(inpCn % groups == 0 && outCn % groups == 0);
         CV_Assert(blobs[0].size[0] == inpCn);
 
-        outputs.resize(1, outShape);
+        outputs.resize(1, MatShape(outShape));
 
         if (!is1x1())
             internals.push_back(computeColRowShape(inputs[0], outputs[0]));
@@ -1412,6 +1410,17 @@ public:
         return false;
     }
 
+    void getTypes(const std::vector<MatType> &inputs,
+                  const int requiredOutputs,
+                  const int requiredInternals,
+                  std::vector<MatType> &outputs,
+                  std::vector<MatType> &internals) const CV_OVERRIDE
+    {
+        CV_Assert(inputs.size() > 0);
+        outputs.assign(requiredOutputs, inputs[0]);
+        internals.assign(requiredInternals, CV_32F);
+    }
+
     void finalize(InputArrayOfArrays inputs_arr, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
     {
         BaseConvolutionLayerImpl::finalize(inputs_arr, outputs_arr);
@@ -1420,6 +1429,11 @@ public:
         inputs_arr.getMatVector(inputs);
         outputs_arr.getMatVector(outputs);
 
+        CV_Assert(inputs.size() > 1 || !blobs.empty());
+
+        MatShape weightShape = blobs.empty() ? inputs[1].shape() : blobs[0].shape();
+        numOutput = weightShape[1]*groups;
+
         std::vector<int> inpShape;
         std::vector<int> outShape;
         for (int i = 2; i < inputs[0].dims; i++) {
@@ -1436,11 +1450,13 @@ public:
         }
 
         weightsMultipliers.assign(numOutput, 1.0);
-        if (weightsMat.empty())
-        {
+
+        if (weightsMat.empty() && !blobs.empty()) {
             transpose(blobs[0].reshape(1, blobs[0].size[0]), weightsMat);
-            biasesMat = hasBias() ? blobs[1].reshape(1, numOutput)
-                                  : Mat::zeros(numOutput, 1, CV_32F);
+        }
+
+        if (biasesMat.empty() && blobs.size() >= 2) {
+            biasesMat = blobs[1].reshape(1, numOutput);
         }
     }
 
@@ -1754,33 +1770,40 @@ public:
         if (is1x1())
             return false;
 
-        if (umat_weights.empty())
-        {
+        if (umat_weights.empty() || inputs.size() >= 2) {
+            Mat temp;
             if (fusedWeights)
                 weightsMat.copyTo(umat_weights);
-            else
-                transpose(blobs[0].reshape(1, inpCn), umat_weights);
+            else if (!blobs.empty()) {
+                transpose(blobs[0].reshape(1, inpCn), temp);
+                temp.copyTo(umat_weights);
+            }
+            else {
+                transpose(inputs[1].reshape(1, inpCn), temp);
+                temp.copyTo(umat_weights);
+            }
+        }
 
+        if (umat_biases.empty() || inputs.size() >= 3) {
             if (fusedBias)
                 biasesMat.copyTo(umat_biases);
+            else if (blobs.size() > 1)
+                blobs[1].reshape(1, outCn).copyTo(umat_biases);
+            else if (inputs.size() >= 3)
+                inputs[2].reshape(1, outCn).copyTo(umat_biases);
             else
-            {
-                if (hasBias())
-                    blobs[1].reshape(1, outCn).copyTo(umat_biases);
-                else
-                    umat_biases = UMat::zeros(outCn, 1, CV_32F);
-            }
+                umat_biases = UMat::zeros(outCn, 1, CV_32F);
         }
 
         String buildopt = format("-DT=%s ", ocl::typeToStr(inputs[0].type()));
         buildopt += format("-DPAD_H=%d -DPAD_W=%d -DKERNEL_H=%d -DKERNEL_W=%d -DSTRIDE_H=%d -DSTRIDE_W=%d ",
                            pad.height, pad.width, kernel.height, kernel.width, stride.height, stride.width);
 
-        for (size_t ii = 0; ii < outputs.size(); ii++)
+        //for (size_t ii = 0; ii < outputs.size(); ii++)
         {
-            int ngroups = outCn / blobs[0].size[1];
-            int inpGroupCn = inpCn / ngroups;
-            int outGroupCn = blobs[0].size[1];
+            int ii = 0;
+            int inpGroupCn = inpCn / groups;
+            int outGroupCn = outCn / groups;
             const UMat& inp = inputs[ii];
             UMat& out = outputs[ii];
             int numImg = inp.size[0];
@@ -1789,21 +1812,21 @@ public:
 
             MatShape inpshape = shape(numImg*inpCn, inpH*inpW);
             MatShape outshape = shape(numImg*outCn, outH*outW);
-            UMat convBlob = inputs[ii].reshape(1, inpshape.size(), &inpshape[0]);
-            UMat decnBlob = out.reshape(1, outshape.size(), &outshape[0]);
-            int rows = internals[0].rows / ngroups;
+            UMat convBlob = inputs[ii].reshape(1, inpshape);
+            UMat decnBlob = out.reshape(1, outshape);
+            int rows = internals[0].rows / groups;
 
             for (int n = 0; n < numImg; n++)
             {
-                for (int g = 0; g < ngroups; g++)
+                for (int g = 0; g < groups; g++)
                 {
                     UMat colMat = internals[0].rowRange(_Range(g * rows, rows));
-                    UMat convMat = convBlob.rowRange(_Range((g + n * ngroups) * inpGroupCn, inpGroupCn));
+                    UMat convMat = convBlob.rowRange(_Range((g + n * groups) * inpGroupCn, inpGroupCn));
                     UMat wghtMat = umat_weights.colRange(_Range(g * inpGroupCn, inpGroupCn));
                     gemm(wghtMat, convMat, 1, noArray(), 0, colMat, 0);
                 }
 
-                for (int g = 0; g < ngroups; g++)
+                for (int g = 0; g < groups; g++)
                 {
                     int total = outGroupCn * decnBlob.cols;
                     int index = 0;
@@ -1826,7 +1849,7 @@ public:
                     k.set(index++, ocl::KernelArg::PtrReadOnly(umat_biases));
                     k.set(index++, (int)(g * outGroupCn * umat_biases.cols));
                     k.set(index++, ocl::KernelArg::PtrWriteOnly(decnBlob));
-                    k.set(index++, (int)((g + n * ngroups) * outGroupCn * decnBlob.cols));
+                    k.set(index++, (int)((g + n * groups) * outGroupCn * decnBlob.cols));
 
                     size_t global[] = { (size_t)total };
                     bool ret = k.run(1, global, NULL, false);
@@ -1845,38 +1868,67 @@ public:
         CV_TRACE_FUNCTION();
         CV_TRACE_ARG_VALUE(name, "name", name.c_str());
 
-        CV_OCL_RUN(IS_DNN_OPENCL_TARGET(preferableTarget),
-                   forward_ocl(inputs_arr, outputs_arr, internals_arr));
+        // For some reason, tests for deconvolution fail;
+        // Also, the current implementation is super-inefficient,
+        // Just disabled it. Need to rewrite it and then uncomment back these lines
+        //CV_OCL_RUN(IS_DNN_OPENCL_TARGET(preferableTarget),
+        //           forward_ocl(inputs_arr, outputs_arr, internals_arr));
 
-        if (inputs_arr.depth() == CV_16F)
+        if (inputs_arr.depth(0) == CV_16F)
         {
             forward_fallback(inputs_arr, outputs_arr, internals_arr);
             return;
         }
 
-        std::vector<Mat> inputs, outputs, internals;
+        auto kind = outputs_arr.kind();
+        std::vector<Mat> inputs, internals;
         inputs_arr.getMatVector(inputs);
-        outputs_arr.getMatVector(outputs);
         internals_arr.getMatVector(internals);
 
         int outCn = numOutput;
         int inpCn = inputs[0].size[1];
         bool is1x1flag = is1x1();
         int nstripes = getNumThreads();
+        /*CV_Assert(outputs.size() == 1);
+        CV_Assert(inputs[0].size[0] == outputs[0].size[0]);
+        CV_Assert(outCn == outputs[0].size[1]);*/
 
-        if( weightsMat.empty() )
-        {
-            transpose(blobs[0].reshape(1, inpCn), weightsMat);
-            biasesMat = hasBias() ? blobs[1].reshape(1, outCn) : Mat::zeros(outCn, 1, CV_32F);
+        if (weightsMat.empty() || inputs.size() >= 2) {
+            Mat inpWeights = !blobs.empty() ? blobs[0] : inputs[1];
+            transpose(inpWeights.reshape(1, inpCn), weightsMat);
+        }
+
+        if (biasesMat.empty() || inputs.size() >= 3) {
+            Mat inpBias = blobs.size() >= 2 ? blobs[1] : inputs.size() >= 3 ? inputs[2] : Mat();
+            Mat biasesMat_ = !inpBias.empty() ? inpBias.reshape(1, outCn) : Mat::zeros(outCn, 1, CV_32F);
+            biasesMat_.copyTo(biasesMat);
         }
 
-        for (size_t ii = 0; ii < outputs.size(); ii++)
+        /*printf("DeConvolution Input: ");
+        pprint(std::cout, inputs[0], 0, 3, 100, '[');
+        printf("\nDeConvolution Weights: ");
+        pprint(std::cout, weightsMat, 0, 3, 100, '[');
+        printf("\nDeConvolution Bias: ");
+        pprint(std::cout, biasesMat, 0, 3, 100, '[');
+        printf("\n");*/
+
+        //for (size_t ii = 0; ii < outputs.size(); ii++)
         {
-            int ngroups = outCn / blobs[0].size[1];
-            int inpGroupCn = inpCn / ngroups;
-            int outGroupCn = blobs[0].size[1];
+            int ii = 0;
+            int inpGroupCn = inpCn / groups;
+            int outGroupCn = outCn / groups;
             const Mat& inp = inputs[ii];
-            Mat& out = outputs[ii];
+            MatShape outshape = outputs_arr.shape(0);
+            CV_Assert(outshape.dims == inp.dims);
+            CV_Assert(outshape[0] == inp.size[0]);
+            CV_Assert(outshape[1] == outCn);
+            Mat out;
+            if (kind == _InputArray::STD_VECTOR_MAT) {
+                out = outputs_arr.getMat(0);
+            }
+            else {
+                out.create(outshape, inp.type());
+            }
             int numImg = inp.size[0];
             int inpH = inp.size[2], inpW = inp.size[3];
             int outH = out.size[2], outW = out.size[3];
@@ -1886,12 +1938,12 @@ public:
 
             for (int n = 0; n < numImg; n++)
             {
-                for (int g = 0; g < ngroups; g++)
+                for (int g = 0; g < groups; g++)
                 {
-                    Mat dstMat = decnBlob.rowRange(_Range((g + n * ngroups) * outGroupCn, outGroupCn));
+                    Mat dstMat = decnBlob.rowRange(_Range((g + n * groups) * outGroupCn, outGroupCn));
                     Mat &colMat = is1x1flag ? dstMat : internals[0];
 
-                    Mat convMat = convBlob.rowRange(_Range((g + n * ngroups) * inpGroupCn, inpGroupCn));
+                    Mat convMat = convBlob.rowRange(_Range((g + n * groups) * inpGroupCn, inpGroupCn));
                     Mat wghtMat = weightsMat.colRange(_Range(g * inpGroupCn, inpGroupCn));
                     Mat curBiasMat = biasesMat.rowRange(_Range(g * outGroupCn, outGroupCn));
 
@@ -1905,6 +1957,10 @@ public:
                                        curBiasMat.ptr<float>(), is1x1flag);
                 }
             }
+            if (kind == _InputArray::STD_VECTOR_UMAT) {
+                std::vector<UMat>& u_outputs = outputs_arr.getUMatVecRef();
+                out.copyTo(u_outputs[0]);
+            }
         }
     }
 
diff --git a/modules/dnn/src/layers/correlation_layer.cpp b/modules/dnn/src/layers/correlation_layer.cpp
index cfb3b8eefc..aa239dd839 100644
--- a/modules/dnn/src/layers/correlation_layer.cpp
+++ b/modules/dnn/src/layers/correlation_layer.cpp
@@ -44,7 +44,7 @@ public:
         int neighborhood_grid_radius = max_displacement / stride_2;
         int neighborhood_grid_width = neighborhood_grid_radius * 2 + 1;
 
-        std::vector<int> outShape;
+        MatShape outShape;
 
         int num = inputs[0][0];
         outShape.push_back(num);
diff --git a/modules/dnn/src/layers/cpu_kernels/convolution.cpp b/modules/dnn/src/layers/cpu_kernels/convolution.cpp
index 33fb62a47b..ee769f0350 100644
--- a/modules/dnn/src/layers/cpu_kernels/convolution.cpp
+++ b/modules/dnn/src/layers/cpu_kernels/convolution.cpp
@@ -175,7 +175,6 @@ Ptr<FastConv> initFastConv(
 #endif
 
     Mat weightsMat = _weightsMat.getMat();
-    auto wShape = shape(weightsMat);
     const size_t wstep = weightsMat.step1();
 
     conv->useFP16 = false;
diff --git a/modules/dnn/src/layers/cumsum_layer.cpp b/modules/dnn/src/layers/cumsum_layer.cpp
index 50533a1c2a..f8126a0ced 100644
--- a/modules/dnn/src/layers/cumsum_layer.cpp
+++ b/modules/dnn/src/layers/cumsum_layer.cpp
@@ -46,7 +46,7 @@ public:
         std::vector<MatType>& outputs,
         std::vector<MatType>& internals) const CV_OVERRIDE
     {
-        CV_CheckType(inputs[0], inputs[0] == CV_32F || inputs[0] == CV_32S || inputs[0] == CV_64S || inputs[0] == CV_16F, "");
+        CV_CheckType(inputs[0], inputs[0] == CV_32F || inputs[0] == CV_64F || inputs[0] == CV_32S || inputs[0] == CV_64S || inputs[0] == CV_16F, "");
         outputs.assign(1, inputs[0]);
     }
 
@@ -78,6 +78,9 @@ public:
             case CV_64S:
                 forwardImpl<int64_t>(inputs, outputs);
                 break;
+            case CV_64F:
+                forwardImpl<double>(inputs, outputs);
+                break;
             default:
                 CV_Error(Error::BadDepth, "");
         }
diff --git a/modules/dnn/src/layers/dequantizelinear_layer.cpp b/modules/dnn/src/layers/dequantizelinear_layer.cpp
new file mode 100644
index 0000000000..1981cfa471
--- /dev/null
+++ b/modules/dnn/src/layers/dequantizelinear_layer.cpp
@@ -0,0 +1,348 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    DequantizeLinear layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__DequantizeLinear.html
+
+    Opset's 10 to 23 are covered.
+*/
+
+template <typename _InpTp, typename _ScaleTp, typename _OutTp>
+static void dequantizeLinear(const _InpTp* inp_, const _ScaleTp* scale_,
+                             const _InpTp* zp_, _OutTp* out_,
+                             int64_t nslices, int sz_a_,
+                             int64_t slice_size_, int block_size_)
+{
+    int bsz_ = std::max(block_size_, 1);
+    int nblocks_per_axis = (sz_a_ + bsz_ - 1) / bsz_;
+    int64_t nmacro_blocks = nslices * nblocks_per_axis;
+    CV_Assert(nmacro_blocks <= (int64_t)INT_MAX);
+
+    parallel_for_(Range(0, (int)nmacro_blocks), [&](const Range& r) {
+        int sz_a = sz_a_;
+        int64_t slice_size = slice_size_;
+        int block_size = block_size_;
+        int delta = 0;
+        int64_t scale_step = block_size > 0 ? slice_size : 1;
+        int64_t zp_step = zp_ ? scale_step : 0;
+
+        for (int i = r.start; i < r.end; i += delta) {
+            int slice_idx = i / nblocks_per_axis;
+            int block_idx = i - slice_idx * nblocks_per_axis;
+            int64_t block_ofs, scale_ofs;
+            if (block_size > 0) {
+                delta = std::min(nblocks_per_axis - block_idx, r.end - i);
+                block_ofs = (slice_idx*sz_a + block_idx*block_size)*slice_size;
+                scale_ofs = (slice_idx*nblocks_per_axis + block_idx)*slice_size;
+            } else {
+                delta = std::min(sz_a - block_idx, r.end - i);
+                block_ofs = (slice_idx*sz_a + block_idx)*slice_size;
+                scale_ofs = block_idx;
+            }
+            const _InpTp* inp = inp_ + block_ofs;
+            const _InpTp* zp = zp_ ? zp_ + scale_ofs : nullptr;
+            const _ScaleTp* sc = scale_ + scale_ofs;
+            _OutTp* out = out_ + block_ofs;
+
+            // [TODO] vectorize using intrinsics
+            if (slice_size > 1) {
+                for (int k = 0; k < delta; k++, inp += slice_size, out += slice_size,
+                                                sc += scale_step, zp += zp_step) {
+                    float scval = (float)*sc;
+                    _InpTp zpval = zp ? *zp : (_InpTp)0;
+
+                    for (int64_t j = 0; j < slice_size; j++)
+                        out[j] = _OutTp((inp[j] - zpval)*scval);
+                }
+            } else if (block_size > 0 ) {
+                int bsz = block_size;
+                for (int k = 0; k < delta; k++, inp += bsz, out += bsz) {
+                    bsz = std::min(bsz, sz_a - (block_idx + k)*block_size);
+                    float scval = (float)sc[k];
+                    _InpTp zpval = zp ? zp[k] : (_InpTp)0;
+
+                    for (int j = 0; j < bsz; j++)
+                        out[j] = _OutTp((inp[j] - zpval)*scval);
+                }
+                sc += delta;
+                zp += zp ? delta : 0;
+            } else {
+                if (zp) {
+                    for (int j = 0; j < delta; j++) {
+                        float scval = (float)sc[j];
+                        _InpTp zpval = zp[j];
+                        out[j] = _OutTp((inp[j] - zpval)*scval);
+                    }
+                } else {
+                    for (int j = 0; j < delta; j++) {
+                        float scval = (float)sc[j];
+                        out[j] = _OutTp(inp[j]*scval);
+                    }
+                }
+                inp += delta;
+                out += delta;
+            }
+        }
+    });
+}
+
+// Dequantize INT8/UINT8 to FP32/FP16; out must be preallocated
+static void dequantizeLinear(const Mat& inp, const Mat& scale_, const Mat& zp,
+                             int axis, int block_size, Mat& out)
+{
+    Mat scale = scale_;
+    CV_Assert(inp.isContinuous());
+    CV_Assert(scale.isContinuous());
+    CV_Assert(out.isContinuous());
+
+    int inptype = inp.type();
+    int outtype = out.type();
+    int sctype = scale.type();
+    int zptype = zp.type();
+    MatShape inpshape = inp.shape();
+    MatShape scshape = scale.shape();
+    MatShape zpshape = zp.shape();
+    int i, ndims = inpshape.dims;
+    int64_t nslices = 1, slice_size = 1;
+
+    CV_Assert(inptype == CV_8U || inptype == CV_8S || inptype == CV_32S);
+    CV_Assert(sctype == CV_32F || sctype == CV_16F);
+    CV_Assert(outtype == CV_32F || outtype == CV_16F);
+
+    if (!zp.empty()) {
+        CV_Assert(zp.isContinuous());
+        CV_Assert(zptype == inptype);
+        CV_Assert(zpshape == scshape);
+    }
+
+    axis = normalize_axis(axis, ndims);
+    for (i = 0; i < axis; i++)
+        nslices *= inpshape[i];
+    for (i = axis+1; i < ndims; i++)
+        slice_size *= inpshape[i];
+    int sz_a = inpshape[axis];
+
+    if (block_size == 0) {
+        size_t sc_total = scshape.total();
+        CV_Assert(scale.dims <= 1);
+        CV_Assert(sc_total == 1 || sc_total == (size_t)sz_a);
+
+        // unroll the innermost loop if the scale's/zp's are the same
+        if (sc_total == 1) {
+            slice_size *= sz_a;
+            sz_a = 1;
+        }
+
+        // avoid FP16 => FP32 conversion for scale inside the innermost loop
+        if (sctype == CV_16F && slice_size == 1 && nslices > 1) {
+            Mat temp;
+            scale_.convertTo(temp, CV_32F);
+            scale = temp;
+            sctype = CV_32F;
+        }
+    } else {
+        CV_Assert(block_size > 0);
+        CV_Assert(scale.dims == ndims);
+        for (int i = 0; i < ndims; i++) {
+            int inp_i = inpshape[i];
+            int sc_i = scshape[i];
+            if (i == axis) {
+                CV_Assert((inp_i + block_size - 1)/block_size == sc_i);
+            } else {
+                CV_Assert(sc_i == inp_i);
+            }
+        }
+    }
+
+    if (inptype == CV_8U && sctype == CV_32F && outtype == CV_32F)
+        dequantizeLinear(reinterpret_cast<const uint8_t*>(inp.data),
+                         reinterpret_cast<const float*>(scale.data),
+                         reinterpret_cast<const uint8_t*>(zp.data),
+                         reinterpret_cast<float*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (inptype == CV_8U && sctype == CV_16F && outtype == CV_32F)
+        dequantizeLinear(reinterpret_cast<const uint8_t*>(inp.data),
+                         reinterpret_cast<const hfloat*>(scale.data),
+                         reinterpret_cast<const uint8_t*>(zp.data),
+                         reinterpret_cast<float*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (inptype == CV_8U && sctype == CV_32F && outtype == CV_16F)
+        dequantizeLinear(reinterpret_cast<const uint8_t*>(inp.data),
+                         reinterpret_cast<const float*>(scale.data),
+                         reinterpret_cast<const uint8_t*>(zp.data),
+                         reinterpret_cast<hfloat*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (inptype == CV_8U && sctype == CV_16F && outtype == CV_16F)
+        dequantizeLinear(reinterpret_cast<const uint8_t*>(inp.data),
+                         reinterpret_cast<const hfloat*>(scale.data),
+                         reinterpret_cast<const uint8_t*>(zp.data),
+                         reinterpret_cast<hfloat*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (inptype == CV_8S && sctype == CV_32F && outtype == CV_32F)
+        dequantizeLinear(reinterpret_cast<const int8_t*>(inp.data),
+                         reinterpret_cast<const float*>(scale.data),
+                         reinterpret_cast<const int8_t*>(zp.data),
+                         reinterpret_cast<float*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (inptype == CV_8S && sctype == CV_16F && outtype == CV_32F)
+        dequantizeLinear(reinterpret_cast<const int8_t*>(inp.data),
+                         reinterpret_cast<const hfloat*>(scale.data),
+                         reinterpret_cast<const int8_t*>(zp.data),
+                         reinterpret_cast<float*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (inptype == CV_8S && sctype == CV_32F && outtype == CV_16F)
+        dequantizeLinear(reinterpret_cast<const int8_t*>(inp.data),
+                         reinterpret_cast<const float*>(scale.data),
+                         reinterpret_cast<const int8_t*>(zp.data),
+                         reinterpret_cast<hfloat*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (inptype == CV_8S && sctype == CV_16F && outtype == CV_16F)
+        dequantizeLinear(reinterpret_cast<const int8_t*>(inp.data),
+                         reinterpret_cast<const hfloat*>(scale.data),
+                         reinterpret_cast<const int8_t*>(zp.data),
+                         reinterpret_cast<hfloat*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (inptype == CV_32S && sctype == CV_32F && outtype == CV_32F)
+        dequantizeLinear(reinterpret_cast<const int32_t*>(inp.data),
+                         reinterpret_cast<const float*>(scale.data),
+                         reinterpret_cast<const int32_t*>(zp.data),
+                         reinterpret_cast<float*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (inptype == CV_32S && sctype == CV_16F && outtype == CV_32F)
+        dequantizeLinear(reinterpret_cast<const int32_t*>(inp.data),
+                         reinterpret_cast<const hfloat*>(scale.data),
+                         reinterpret_cast<const int32_t*>(zp.data),
+                         reinterpret_cast<float*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (inptype == CV_32S && sctype == CV_32F && outtype == CV_16F)
+        dequantizeLinear(reinterpret_cast<const int32_t*>(inp.data),
+                         reinterpret_cast<const float*>(scale.data),
+                         reinterpret_cast<const int32_t*>(zp.data),
+                         reinterpret_cast<hfloat*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (inptype == CV_32S && sctype == CV_16F && outtype == CV_16F)
+        dequantizeLinear(reinterpret_cast<const int32_t*>(inp.data),
+                         reinterpret_cast<const hfloat*>(scale.data),
+                         reinterpret_cast<const int32_t*>(zp.data),
+                         reinterpret_cast<hfloat*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else {
+        CV_Error_(Error::StsNotImplemented,
+                  ("the following combination of types is not supported in "
+                   "DequantizeLinear: inp=%s, scale=%s, out=%s",
+                   typeToString(inptype).c_str(),
+                   typeToString(sctype).c_str(),
+                   typeToString(outtype).c_str()));
+    }
+}
+
+class DequantizeLinearLayerImpl CV_FINAL : public DequantizeLinearLayer
+{
+public:
+    DequantizeLinearLayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+
+        axis = params.get<int>("axis", 1);
+        block_size = params.get<int>("block_size", 0);
+        CV_Assert(block_size >= 0);
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV || backendId == DNN_BACKEND_INFERENCE_ENGINE_NGRAPH;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape> &inputs,
+                         const int requiredOutputs,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(2 <= ninputs && ninputs <= 3);
+        CV_Assert(requiredOutputs == 1);
+        outputs.assign(1, inputs[0]);
+        return true;
+    }
+
+    int getOutType() const
+    {
+        Net::Impl* netimpl_ = getNetImpl(this);
+        return netimpl_->enableFP16 ? CV_16F : CV_32F;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(2 <= ninputs && ninputs <= 3);
+        if (ninputs == 3) {
+            CV_Assert(inputs[0] == inputs[2]);
+        }
+        outputs.assign(1, getOutType());
+    }
+
+    virtual void finalize(InputArrayOfArrays inputs_arr, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        int ninputs = inputs_arr.size(-1).area();
+        CV_Assert(2 <= ninputs && ninputs <= 3);
+
+        Mat inp = inputs_arr.getMat(0);
+        Mat scale = inputs_arr.getMat(1);
+        Mat zeropoint;
+        int outtype = getOutType();
+        MatShape inpshape = inp.shape();
+
+        if (ninputs >= 3) {
+            zeropoint = inputs_arr.getMat(2);
+        }
+
+        auto kind = outputs_arr.kind();
+
+        if (kind == _InputArray::STD_VECTOR_MAT) {
+            std::vector<Mat>& outs = outputs_arr.getMatVecRef();
+            outs.resize(1);
+            outs[0].fit(inpshape, outtype);
+            dequantizeLinear(inp, scale, zeropoint, axis, block_size, outs[0]);
+        } else if (kind == _InputArray::STD_VECTOR_UMAT) {
+            std::vector<UMat>& outs = outputs_arr.getUMatVecRef();
+            outs.resize(1);
+            outs[0].fit(inpshape, outtype);
+            Mat temp(inpshape, outtype);
+            dequantizeLinear(inp, scale, zeropoint, axis, block_size, temp);
+            temp.copyTo(outs[0]);
+        } else {
+            CV_Error(Error::StsNotImplemented, "");
+        }
+    }
+};
+
+Ptr<DequantizeLinearLayer> DequantizeLinearLayer::create(const LayerParams& params)
+{
+    return Ptr<DequantizeLinearLayer>(new DequantizeLinearLayerImpl(params));
+}
+
+}}
diff --git a/modules/dnn/src/layers/einsum_layer.cpp b/modules/dnn/src/layers/einsum_layer.cpp
index 771b4a097e..4a50526460 100644
--- a/modules/dnn/src/layers/einsum_layer.cpp
+++ b/modules/dnn/src/layers/einsum_layer.cpp
@@ -15,7 +15,7 @@ namespace dnn
 {
 
 static bool IsTransposeReshapeForEinsum(const std::vector<size_t>& perm,
-                                        std::vector<int> input_dims,
+                                        const MatShape& input_dims,
                                         MatShape& new_shape) {
     // As long as the dims with values > 1 stay in the same order, it's a reshape.
     // Example: Shape=(1,1,1024,4096) -> perm=(2,0,3,1).
@@ -59,7 +59,8 @@ static Mat Transpose(
     Mat output;
     MatShape order(permutation.begin(), permutation.end());
 
-    cv::transposeND((reshape ? input_reshaped : input), order, output);
+    std::vector<int> order_(order.begin(), order.end());
+    cv::transposeND((reshape ? input_reshaped : input), order_, output);
     return output;
 }
 
@@ -352,6 +353,9 @@ public:
     // Backend for fastgemm
     FastGemmOpt opt;
 
+    mutable bool outputShapeComputed;
+    mutable MatShape cachedOutputShape;
+
     void parseEquation(String equation);
     void processEquation(const std::vector<MatShape>& inputs);
     void processBroadcastedDims();
@@ -375,31 +379,27 @@ public:
         const MatShape& input2ShapeOverride
     );
 
+    void computeOutputShape(const std::vector<MatShape>& inputs) const {
+        if (!outputShapeComputed) {
+            // Copy of the existing computation logic
+            const_cast<LayerEinsumImpl*>(this)->processEquation(inputs);
+            const_cast<LayerEinsumImpl*>(this)->processBroadcastedDims();
+            const_cast<LayerEinsumImpl*>(this)->validateOutputSubscript();
+            const_cast<LayerEinsumImpl*>(this)->calculateOutputShape();
+
+            cachedOutputShape = einsumOutDims;
+            outputShapeComputed = true;
+        }
+    }
+
     // constructor
     LayerEinsumImpl(const LayerParams& params)
+        : outputShapeComputed(false)
     {
         setParamsFrom(params);
         equation = params.get<String>("equation");
-        int outputSize = params.get<int>("outputSize");
-        numInputs  = params.get<int>("inputSize");
-
-        CV_CheckEQ(outputSize, 1, "Einsum layer should only have one output");
-
-        // get the input shapes from onnx importer
-        for (int i=0; i < numInputs; i++){
-            auto param = params.get("inputShapes" + cv::format("%d", i));
-            int inputDims = param.size();
-            std::vector<int> shape;
-            for (int i = 0; i < inputDims; ++i)
-                shape.emplace_back(param.get<int>(i));
-            einsumInpShapes.emplace_back(shape);
-        }
-
         opt.init();
 
-        // Maintains a mapping between input indices and their corresponding subscript labels for each input
-        inputSubscriptIndices.reserve(numInputs);
-
         // We allocate space for 10 values as a precaution,
         // assuming that we won't encounter any input with a rank greater than 10.
         // In such cases, the value of num_subscript_indices_ would be greater than 10.
@@ -413,15 +413,6 @@ public:
         // parser equation and extract tokens from the equation
         // save token to lhs_eq_tokens variable
         parseEquation(equation); // TODO: return lhs_eq_tokens
-
-        // Start preprocessing related to equation parsing
-        // and dimention broadcasting
-        processEquation(einsumInpShapes);
-        processBroadcastedDims();
-
-        // calculate output shape
-        validateOutputSubscript();
-        calculateOutputShape();
     }
 
     virtual bool supportBackend(int backendId) CV_OVERRIDE {
@@ -435,21 +426,27 @@ public:
                          std::vector<MatShape> &outputs,
                          std::vector<MatShape> &internals) const CV_OVERRIDE
     {
+        CV_UNUSED(requiredOutputs);
         CV_UNUSED(internals);
 
-        // check if passed and parsed inputs match up in number and dimensions
-        CV_CheckEQ(static_cast<int>(inputs.size()), numInputs,
-            "Number of inputs in forward and inputs during graph constructions do not match");
-        for (int i = 0; i < numInputs; i++)
-        {
-            if (inputs[i] != einsumInpShapes[i])
-                CV_Error(Error::StsAssert, "Passed input shapes do not match with parsed input shapes!");
+        // check if input einsumInputShapes is empty
+        if (einsumInpShapes.empty()) {
+            outputShapeComputed = false;
+        } else {
+            // check weather shapes in inputs are compatible with shapes in einsumInpShapes
+            for (int i = 0; i < inputs.size(); i++) {
+                if (inputs[i] != einsumInpShapes[i]) {
+                    outputShapeComputed = false;
+                    break;
+                }
+            }
         }
 
+        computeOutputShape(inputs);
+
         outputs.clear();
-        outputs.emplace_back(einsumOutDims);
+        outputs.emplace_back(cachedOutputShape);
         return true;
-
     } // getMemoryShape
 
     // forward
@@ -699,10 +696,29 @@ void LayerEinsumImpl::parseEquation(String equation)
 
     // split lhs_eq by ',' - comma and put all created token - splits
     // into lhs_eq_tokens vector
-    std::stringstream src(lhs_eq);
-    for (std::string token; std::getline(src, token, ',');) {
-        lhs_eq_tokens.emplace_back(token);
+    // the implementation does not ignore empty tokens and trailing comma
+    size_t start = 0;
+    while(start < lhs_eq.size())
+    {
+        size_t comma = lhs_eq.find(',', start);
+        if (comma != std::string::npos)
+        {
+            std::string token = lhs_eq.substr(start, comma-start);
+            lhs_eq_tokens.push_back(token);
+            start = comma+1;
+        }
+        else
+        {
+            std::string token = lhs_eq.substr(start);
+            lhs_eq_tokens.push_back(token);
+            start = lhs_eq.size()+1;
+        }
     }
+
+    // trailing comma without token
+    if (lhs_eq[lhs_eq.size()-1] == ',')
+        lhs_eq_tokens.push_back(std::string());
+
 }
 
 
@@ -764,6 +780,9 @@ void LayerEinsumImpl::calculateOutputShape()
             subscriptIndicesToOutputIndices[mappedIndex] = outputDimCounter++;
         }
     }
+    if (rhs_eq.empty()) {
+        einsumOutDims = MatShape(0, 0); // handle scalar output case
+    }
 }
 
 void LayerEinsumImpl::validateOutputSubscript()
@@ -873,10 +892,19 @@ void LayerEinsumImpl::processBroadcastedDims()
 void LayerEinsumImpl::processEquation(const std::vector<MatShape>& inputs)
 {
 
+    // fill in the einsumInpShapes
+    for (const auto& input : inputs) {
+        einsumInpShapes.emplace_back(input);
+    }
+
+
+    numInputs = inputs.size();
+    inputSubscriptIndices.reserve(numInputs);
     // Check if number of tokens in equal to number of inputs.
     // For install "ij, jk -> ik" needs to have 2 inputs tensors
     int num_input_tensors = inputs.size();
-    if (lhs_eq_tokens.empty() || (lhs_eq_tokens.size() == 1 && lhs_eq_tokens[0].empty() && lhs_eq == ",") ) {
+    if (lhs_eq_tokens.empty() || (lhs_eq == ",") ) {
+        inputSubscriptIndices.resize(numInputs);
         return;
     }
     // if we have only one token and two inputs lets skip the check
@@ -1006,9 +1034,9 @@ Mat LayerEinsumImpl::FinalizeOutput(
     const std::vector<int>& subscript_indices_to_output_indices = subscriptIndicesToOutputIndices;
     const auto output_dims = einsumOutDims;
 
-    MatShape output_shape = output_dims;
     const auto output_rank = output_dims.size();
 
+    // MatShape output_shape = output_dims;
     // CV_CheckEQ((int) candidateOutput.dims,  (int) output_shape.size(),
     //           "Einsum op: The candidate output cannot be reshaped into the op's output");
 
@@ -1025,6 +1053,7 @@ Mat LayerEinsumImpl::FinalizeOutput(
     output_permutation.resize(output_rank, 0);
     size_t output_iter = 0;
 
+
     for (size_t iter = 0, end = ordered_subscript_indices_in_candidate.size(); iter < end; ++iter)
     {
         auto output_index = subscript_indices_to_output_indices[ordered_subscript_indices_in_candidate[iter]];
@@ -1345,6 +1374,7 @@ Mat LayerEinsumImpl::batchwiseMatMul(
     Mat reshapedInput1 = input1;
     Mat reshapedInput2 = input2;
 
+
     Mat output;
     if (batches > 1)
     {
@@ -1373,10 +1403,11 @@ Mat LayerEinsumImpl::batchwiseMatMul(
             reshapedInput2 = input2.reshape(1, 2, shape2);
         }
 
+
         output = Mat(M, N, reshapedInput1.type());
-        if ((shape(reshapedInput1).empty() && shape(reshapedInput2).empty())  ||
-            (shape(reshapedInput1).empty() && !shape(reshapedInput2).empty()) ||
-            (!shape(reshapedInput1).empty() && shape(reshapedInput2).empty()))
+        if ((reshapedInput1.dims == 0 && reshapedInput2.dims == 0)  ||
+            (reshapedInput1.dims == 0 && reshapedInput2.dims != 0) ||
+            (reshapedInput1.dims != 0 && reshapedInput2.dims == 0))
         {
             output = reshapedInput1.mul(reshapedInput2); // fastGemm does not support 0D * 0D multiplication
         } else {
diff --git a/modules/dnn/src/layers/eltwise_layer.cpp b/modules/dnn/src/layers/eltwise_layer.cpp
index 9219ca6f52..e42242a9cf 100644
--- a/modules/dnn/src/layers/eltwise_layer.cpp
+++ b/modules/dnn/src/layers/eltwise_layer.cpp
@@ -280,7 +280,7 @@ public:
 
         for (size_t i = 0; i < inputs.size(); i++)
         {
-            MatShape inpShape = shape(inputs[i].size);
+            MatShape inpShape = inputs[i].shape();
             if (isAllOnes(inpShape, 0, inputs[i].dims))
             {
                 hasVecInput = true;
@@ -710,15 +710,15 @@ public:
         {
             for (size_t i = 0; i < inputs.size(); i++)
             {
-                MatShape inpShape = shape(inputs[i].size);
+                MatShape inpShape = inputs[i].shape();
                 bool allOnes = isAllOnes(inpShape, 2, inputs[i].dims);
 
                 if (allOnes)
                 {
                     Mat tmpInput = inputs[i];
-                    MatShape outShape = shape(outputs[0].size);
+                    MatShape outShape = outputs[0].shape();
                     size_t xSize = outShape[2];
-                    for (size_t j = 3; j < outShape.size(); j++)
+                    for (int j = 3; j < outShape.dims; j++)
                         xSize *= outShape[j];
 
                     int dimVec[3] = {outShape[0], outShape[1], (int) xSize};
diff --git a/modules/dnn/src/layers/expand2_layer.cpp b/modules/dnn/src/layers/expand2_layer.cpp
new file mode 100644
index 0000000000..c7392207a4
--- /dev/null
+++ b/modules/dnn/src/layers/expand2_layer.cpp
@@ -0,0 +1,130 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    Expand layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__Expand.html
+
+    Opset's 8 to 13 are covered.
+*/
+
+class Expand2LayerImpl CV_FINAL : public Expand2Layer
+{
+public:
+    Expand2LayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    virtual bool dynamicOutputShapes() const CV_OVERRIDE
+    {
+        Net::Impl* netimpl_ = getNetImpl(this);
+        CV_Assert(netimpl_);
+        size_t ninputs = this->inputs.size();
+        CV_Assert(ninputs == 2);
+        return !netimpl_->isConstArg(this->inputs[1]);
+    }
+
+    MatShape getOutShape(const MatShape& inpshape, const Mat& shapeTensor) const
+    {
+        MatShape shape0 = tensorToShape(shapeTensor);
+        MatShape shape = inpshape.expand(shape0);
+        // according to ONNX specification, the specified shape can be smaller than the input!
+        // so we comment off the check
+        // CV_Assert(shape == shape0); // check that input can be expanded to the specified shape
+        return shape;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape>& inputs,
+                         const int,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        CV_Assert(!dynamicOutputShapes());
+
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == (size_t)2);
+        Net::Impl* netimpl_ = getNetImpl(this);
+
+        Mat shapeTensor = netimpl_->argTensor(this->inputs[1]);
+
+        outputs.assign(1, getOutShape(inputs[0], shapeTensor));
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == (size_t)2);
+        outputs.assign(requiredOutputs, inputs[0]);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+        CV_Assert(ninputs == 2);
+
+        Mat inp = inputs_arr.getMat(0);
+        int inptype = inp.type();
+        Mat shapeTensor = inputs_arr.getMat(1);
+
+        MatShape outshape = getOutShape(inp.shape(), shapeTensor);
+
+        auto kind = outputs_arr.kind();
+        if (kind == _InputArray::STD_VECTOR_MAT) {
+            std::vector<Mat>& outs = outputs_arr.getMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outshape, inptype);
+            broadcast(inp, outshape, outs[0]);
+        } else if (kind == _InputArray::STD_VECTOR_UMAT) {
+            std::vector<UMat>& outs = outputs_arr.getUMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outshape, inptype);
+            Mat temp(outshape, inptype);
+            broadcast(inp, outshape, temp);
+            temp.copyTo(outs[0]);
+        } else {
+            CV_Error(Error::StsNotImplemented, "");
+        }
+    }
+};
+
+Ptr<Expand2Layer> Expand2Layer::create(const LayerParams& params)
+{
+    return Ptr<Expand2Layer>(new Expand2LayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/expand_layer.cpp b/modules/dnn/src/layers/expand_layer.cpp
index 6b5be6c67a..24b8d1df9b 100644
--- a/modules/dnn/src/layers/expand_layer.cpp
+++ b/modules/dnn/src/layers/expand_layer.cpp
@@ -16,13 +16,23 @@ public:
         setParamsFrom(params);
 
         // shape as param
-        CV_CheckTrue(params.has("shape"), "DNN/Expand: shape is required in Expand layer initialization");
-        DictValue param_shape = params.get("shape");
-        int ndims_shape = param_shape.size();
-        CV_CheckGT(ndims_shape, 0, "DNN/Expand: ndims of shape must be > 0");
-        target_shape.resize(ndims_shape);
-        for (int i = 0; i < ndims_shape; i++) {
-            target_shape[i] = param_shape.get<int>(i);
+        if (params.has("shape")) {
+            DictValue param_shape = params.get("shape");
+            int ndims_shape = param_shape.size();
+            CV_CheckGT(ndims_shape, 0, "DNN/Expand: ndims of shape must be > 0");
+            target_shape.resize(ndims_shape);
+            for (int i = 0; i < ndims_shape; i++) {
+                target_shape[i] = param_shape.get<int>(i);
+            }
+        } else if (params.blobs.size() == 1) {
+            Mat expand_shape = params.blobs[0];
+            CV_Assert(expand_shape.total() > 0);
+            target_shape.resize(expand_shape.total());
+            for (int i = 0; i < expand_shape.total(); i++) {
+                target_shape[i] = expand_shape.at<int64_t>(i);
+            }
+        } else {
+            CV_Error(Error::StsBadArg, "DNN/Expand: shape is required in Expand layer initialization");
         }
 
         // FIXME: remove when 0d/1d mat is available
@@ -45,7 +55,7 @@ public:
 
         MatShape input_shape = inputs[0]; // 1d tensor is represented as 2d mat, e.g. [3] -> [3, 1]
         if (const_input_1d) {
-            input_shape = {inputs[0][0]};
+            input_shape = shape(inputs[0][0]);
         }
 
         auto& moreDimension = input_shape.size() > target_shape.size() ? input_shape : target_shape;
@@ -96,7 +106,7 @@ public:
         const auto &input = inputs[0];
         auto input_shape = shape(input);
         if (const_input_1d) {
-            input_shape = {input_shape[0]};
+            input_shape = shape(input_shape[0]);
         }
 
         auto& moreDimension = input_shape.size() > target_shape.size() ? input_shape : target_shape;
@@ -105,7 +115,7 @@ public:
         MatShape final_target_shape(moreDimension.size(), 1);
         for (int i = 0; i < moreDimension.size(); i++) {
             int d = moreDimension[i];
-            int j = i - (moreDimension.size() - lessDimension.size());
+            int j = i - (int)(moreDimension.size() - lessDimension.size());
             if (j >= 0) {
                 final_target_shape[i] = std::max(lessDimension[j], d);
             } else {
diff --git a/modules/dnn/src/layers/flatten_layer.cpp b/modules/dnn/src/layers/flatten_layer.cpp
index 6fd3c2568e..5ac79da771 100644
--- a/modules/dnn/src/layers/flatten_layer.cpp
+++ b/modules/dnn/src/layers/flatten_layer.cpp
@@ -65,10 +65,14 @@ namespace dnn
 class FlattenLayerImpl CV_FINAL : public FlattenLayer
 {
 public:
+    bool _onnxMode;
+
     FlattenLayerImpl(const LayerParams &params)
     {
         _startAxis = params.get<int>("axis", 1);
         _endAxis = params.get<int>("end_axis", -1);
+        _onnxMode = params.get<bool>("onnx", false);
+
         setParamsFrom(params);
     }
 
@@ -94,24 +98,50 @@ public:
             CV_Assert(inputs[i] == inputs[0]);
         }
 
-        int numAxes = inputs[0].size();
+        MatShape outputShapeVec;
+
+        int numAxes = (int)inputs[0].size();
+        /*
+           Ticket: https://github.com/opencv/opencv/issues/26197
+           [TODO] this is not quite correct,
+           in ONNX Flatten valid range is [0, numAxes],
+           not [0, numAxes-1] which normalize_axis() produces.
+           But if we fix it, flatten_const.onnx from opencv_extra
+           is not processed correctly.
+           libprotobuf-c reads it correctly,
+           but the current version of libprotobuf does not
+        */
         int startAxis = normalize_axis(_startAxis, numAxes);
         int endAxis = normalize_axis(_endAxis, numAxes);
 
-        CV_Assert(startAxis >= 0);
-        CV_Assert(endAxis >= startAxis && endAxis < (int)numAxes);
+        CV_Assert(startAxis >= 0 && startAxis <= numAxes);
 
-        size_t flattenedDimensionSize = total(inputs[0], startAxis, endAxis + 1);
+        if (_onnxMode) {
+            size_t outer = 1, inner = 1;
+            int i = 0;
+            for (; i < startAxis; i++)
+                outer *= inputs[0][i];
+            for (; i < numAxes; i++)
+                inner *= inputs[0][i];
 
-        MatShape outputShapeVec;
-        for (int i = 0; i < startAxis; i++)
-        {
-            outputShapeVec.push_back(inputs[0][i]);
+            CV_Assert_N(inner <= (size_t)INT_MAX, outer < (size_t)INT_MAX);
+            outputShapeVec.push_back((int)outer);
+            outputShapeVec.push_back((int)inner);
         }
-        outputShapeVec.push_back(flattenedDimensionSize);
-        for (size_t i = endAxis + 1; i < numAxes; i++)
-        {
-            outputShapeVec.push_back(inputs[0][i]);
+        else {
+            CV_Assert(endAxis >= startAxis && endAxis <= numAxes);
+
+            size_t flattenedDimensionSize = total(inputs[0], startAxis, endAxis + 1);
+
+            for (int i = 0; i < startAxis; i++)
+            {
+                outputShapeVec.push_back(inputs[0][i]);
+            }
+            outputShapeVec.push_back(flattenedDimensionSize);
+            for (size_t i = endAxis + 1; i < numAxes; i++)
+            {
+                outputShapeVec.push_back(inputs[0][i]);
+            }
         }
 
         outputs.resize(inputs.size(), outputShapeVec);
@@ -126,18 +156,9 @@ public:
         std::vector<MatType>& internals) const CV_OVERRIDE
     {
         CV_Assert(inputs.size());
-        for (auto input : inputs)
-        {
-            if (preferableTarget == DNN_TARGET_OPENCL_FP16)
-                CV_CheckType(input, input == CV_16F || input == CV_32S || input == CV_64S || input == CV_8S || input == CV_8U || input == CV_Bool, "");
-            else
-                CV_CheckType(input, input == CV_32F || input == CV_32S || input == CV_64S || input == CV_8S || input == CV_8U || input == CV_Bool, "");
-        }
-
         outputs.assign(requiredOutputs, inputs[0]);
     }
 
-
     void finalize(InputArrayOfArrays inputs_arr, OutputArrayOfArrays) CV_OVERRIDE
     {
         std::vector<Mat> inputs;
@@ -165,7 +186,7 @@ public:
         {
             MatShape outShape = shape(outputs[i]);
             UMat& output = outputs_arr.getUMatRef(i);
-            output = inputs[i]->reshape(1, (int)outShape.size(), &outShape[0]);
+            inputs[i]->reshape(1, (int)outShape.size(), &outShape[0]).copyTo(output);
         }
 
         return true;
diff --git a/modules/dnn/src/layers/gather2_layer.cpp b/modules/dnn/src/layers/gather2_layer.cpp
new file mode 100644
index 0000000000..f023e12720
--- /dev/null
+++ b/modules/dnn/src/layers/gather2_layer.cpp
@@ -0,0 +1,210 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+//#include "../op_cuda.hpp"
+//#include "../op_inf_engine.hpp"
+//#include "../ie_ngraph.hpp"
+//#include "../op_webnn.hpp"
+//#include "../op_timvx.hpp"
+//#include "../op_cann.hpp"
+
+//#include <opencv2/dnn/shape_utils.hpp>
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    Gather layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__Gather.html
+
+    Opset's 1 to 13 are covered.
+*/
+
+// out must be pre-allocated
+static void gather(const Mat& data, const Mat& ind, Mat& out, int axis)
+{
+    CV_Assert_N(data.isContinuous(), ind.isContinuous(), out.isContinuous());
+    int indType = ind.type();
+    CV_Assert(indType == CV_32S || indType == CV_64S);
+
+    MatShape dataShape = data.shape();
+    MatShape indShape = ind.shape();
+    MatShape outShape = out.shape();
+    int dataDims = dataShape.dims;
+    int indDims = indShape.dims;
+    int outDims = outShape.dims;
+
+    CV_Assert(outDims == dataDims + indDims - 1);
+    size_t indTotal = indShape.total(), nslices = 1;
+    size_t elemSize = data.elemSize();
+    size_t sliceSize = elemSize;
+
+    for(int j = 0; j < dataDims; j++) {
+        int szj = dataShape[j];
+        if (j < axis)
+            nslices *= szj;
+        else if (j > axis)
+            sliceSize *= szj;
+    }
+    size_t dataStep = sliceSize * dataShape[axis];
+    size_t outStep = sliceSize * indTotal;
+    volatile bool globOutOfRangeIdx = false;
+
+    parallel_for_(Range(0, (int)indTotal), [&](const Range& r) {
+        int shape_a = dataShape[axis];
+        const uchar* dataptr0 = data.data;
+        uchar* outptr0 = out.data;
+        const int32_t* ind32 = indType == CV_32S ? ind.ptr<int32_t>() : nullptr;
+        const int64_t* ind64 = indType == CV_64S ? ind.ptr<int64_t>() : nullptr;
+        bool outOfRangeIdx = globOutOfRangeIdx;
+        for (int j = r.start; j < r.end && !outOfRangeIdx; j++) {
+            int k = ind32 ? (int)ind32[j] : (int)ind64[j];
+            uchar* outptr = outptr0 + j*sliceSize;
+            const uchar* dataptr = dataptr0;
+            for (size_t i = 0; i < nslices; i++, dataptr += dataStep, outptr += outStep) {
+                k += k < 0 ? shape_a : 0;
+                if (k < 0 || k >= shape_a) {
+                    outOfRangeIdx = true;
+                    break;
+                }
+                memcpy(outptr, dataptr + k*sliceSize, sliceSize);
+            }
+        }
+        if (outOfRangeIdx)
+            globOutOfRangeIdx = true;
+    }, std::min((double)indTotal, (double)sliceSize*nslices*indTotal/1e6));
+
+    if (globOutOfRangeIdx) {
+        CV_Error(Error::StsOutOfRange, "some of indices are outside of range");
+    }
+}
+
+class Gather2LayerImpl CV_FINAL : public Gather2Layer
+{
+public:
+    Gather2LayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+        axis = params.get<int>("axis", 0);
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    MatShape getOutShape(const MatShape& dataShape, const MatShape& indShape) const
+    {
+        int dataDims = dataShape.dims;
+        int indDims = indShape.dims;
+
+        int axis_ = normalize_axis(axis, dataDims);
+        CV_Assert(0 <= axis_ && axis_ < dataDims);
+        MatShape outShape(dataDims + indDims - 1);
+
+        for (int i = 0; i < outShape.dims; i++) {
+            if (i < axis_) {
+                outShape[i] = dataShape[i];
+            } else {
+                int j = i - axis_;
+                outShape[i] = j < indDims ? indShape[j] : dataShape[i - indDims + 1];
+            }
+        }
+        return outShape;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape> &inputs,
+                         const int,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        CV_Assert(inputs.size() == 2);
+        outputs.assign(1, getOutShape(inputs[0], inputs[1]));
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == 2);
+        int dataType = inputs[0];
+        int indType = inputs[1];
+        CV_Assert(indType == CV_32S || indType == CV_64S);
+        outputs.assign(requiredOutputs, dataType);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+
+        CV_Assert(ninputs == 2);
+
+        MatShape dataShape = inputs_arr.shape(0);
+        MatShape indShape = inputs_arr.shape(1);
+        int dataType = inputs_arr.type(0);
+        int indType = inputs_arr.type(1);
+        CV_Assert(indType == CV_32S || indType == CV_64S);
+
+        MatShape outShape = getOutShape(dataShape, indShape);
+        int outKind = outputs_arr.kind();
+        int axis_ = normalize_axis(axis, dataShape.dims);
+
+        CV_Assert(outKind == _InputArray::STD_VECTOR_MAT ||
+                  outKind == _InputArray::STD_VECTOR_UMAT);
+
+        if (outKind == _InputArray::STD_VECTOR_MAT) {
+            Mat data = inputs_arr.getMat(0);
+            Mat ind = inputs_arr.getMat(1);
+            std::vector<Mat>& outs = outputs_arr.getMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outShape, dataType);
+            runOp(data, ind, outs[0], axis_);
+        } else {
+            // [TODO] more efficient OpenCL implementation
+            Mat data = inputs_arr.getMat(0);
+            Mat ind = inputs_arr.getMat(1);
+            std::vector<UMat>& outs = outputs_arr.getUMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outShape, dataType);
+            Mat temp(outShape, dataType);
+            runOp(data, ind, temp, axis_);
+            temp.copyTo(outs[0]);
+        }
+    }
+
+    void runOp(const Mat& data, const Mat& ind, Mat& out, int axis_)
+    {
+        gather(data, ind, out, axis_);
+    }
+};
+
+Ptr<Gather2Layer> Gather2Layer::create(const LayerParams& params)
+{
+    return Ptr<Gather2Layer>(new Gather2LayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/gemm_layer.cpp b/modules/dnn/src/layers/gemm_layer.cpp
index ac0914c2c2..79a0d01dcb 100644
--- a/modules/dnn/src/layers/gemm_layer.cpp
+++ b/modules/dnn/src/layers/gemm_layer.cpp
@@ -30,9 +30,28 @@ public:
         alpha = params.get<float>("alpha", 1.0f);
         beta = params.get<float>("beta", 1.0f);
 
-        const_B = params.get<bool>("constB", false); // true means blobs[0] is B
-        const_C = params.get<bool>("constC", false); // true means blobs.back() is C
-        have_bias = params.get<bool>("have_bias", false); // NOTE: have_bias being true does not mean bias is constant
+        if (params.has("constB") || params.has("constB") || params.has("have_bias"))
+        {
+            // The params are not part of ONNX, but set by old ONNX parser
+            const_B = params.get<bool>("constB", false); // true means blobs[0] is B
+            const_C = params.get<bool>("constC", false); // true means blobs.back() is C
+            have_bias = params.get<bool>("have_bias", false); // NOTE: have_bias being true does not mean bias is constant
+        }
+        else
+        {
+            // TODO: With the new parser the function should be smart enough to figure out
+            // the operation mode from the number of 'inputs' and number of 'blobs'.
+            // note, however, that 'inputs' may not be set yet in the constructor
+            // Ticket: https://github.com/opencv/opencv/issues/26209
+
+            if (!blobs.empty()) {
+                const_B = const_C = true;
+            } else {
+                const_B = const_C = false;
+            }
+
+            have_bias = blobs.size() > 1 || params.get<bool>("have_bias", false); // NOTE: have_bias being true does not mean bias is constant
+        }
 
         real_ndims_C = params.get<int>("real_ndims_C", -1);
     }
@@ -67,18 +86,24 @@ public:
         int N = trans_b ? mb : nb;
         int K_a = trans_a ? ma : na;
         int K_b = trans_b ? nb : mb;
+
+
         CV_CheckEQ(K_a, K_b, "DNN/Gemm: Invalid dimension of dim K");
 
+        bool have_bias_ = have_bias || inputs.size() == 3;
+
         // Check whether C can be unidirectional broadcast to (M, N). Handle carefully with 1D Mat.
-        if (have_bias) {
+        if (have_bias_) {
             const auto shape_C = const_C ? shape(blobs.back()) : inputs.back();
 
             auto ndims_C = shape_C.size();
             CV_CheckLE(ndims_C, static_cast<size_t>(2), "DNN/Gemm: C can only be 0d (scalar) / 1d / 2d tensor");
 
-            if (real_ndims_C == 1) { // (1,) or (N,)
+            int real_ndims_C_ = real_ndims_C >= 0 ? real_ndims_C : ndims_C;
+
+            if (real_ndims_C_ == 1) { // (1,) or (N,)
                 CV_Check(shape_C[0], shape_C[0] == 1 || shape_C[0] == N, "DNN/Gemm: invalid dimension of C");
-            } else if (real_ndims_C == 2) { // (1, 1) or (1, N) or (M, 1) or (M, N)
+            } else if (real_ndims_C_ == 2) { // (1, 1) or (1, N) or (M, 1) or (M, N)
                 // printf("shape_C=[%d, %d]\n", shape_C[0], shape_C[1]);
                 CV_Check(shape_C[0], (shape_C[0] == 1 && shape_C[1] == 1) ||
                                      (shape_C[0] == 1 && shape_C[1] == N) ||
@@ -104,22 +129,23 @@ public:
     // TODO: replace with cv::broadcast() once 1d mat is supported
     // FIXME: fix if conditions if 1d mat is supported properly
     void broadcastCWtihBeta(int M, int N, const Mat &C) {
+        broadcast_C.clear();
+        broadcast_C.resize(M * N, 0.f);
         if (beta != 0 && !C.empty()) {
-            broadcast_C.clear();
-            broadcast_C.resize(M * N, 0.f);
+            int real_ndims_C_ = real_ndims_C >= 0 ? real_ndims_C : C.dims;
 
             const float *ptr_c = C.ptr<const float>();
             const auto shape_C = shape(C);
-            if ((real_ndims_C == 0) || (real_ndims_C == 1 && shape_C[0] == 1) ||
-                (real_ndims_C == 2 && shape_C[0] == 1 && shape_C[1] == 1)) {
+            if ((real_ndims_C_ == 0) || (real_ndims_C_ == 1 && shape_C[0] == 1) ||
+                (real_ndims_C_ == 2 && shape_C[0] == 1 && shape_C[1] == 1)) {
                 // (), (1,), (1, 1)
                 float c = *ptr_c;
                 int total = M * N;
                 for (int i = 0; i < total; ++i) {
                     broadcast_C[i] = beta * c;
                 }
-            } else if ((real_ndims_C == 1 && shape_C[0] == N) ||
-                       (real_ndims_C == 2 && shape_C[0] == 1 && shape_C[1] == N)) {
+            } else if ((real_ndims_C_ == 1 && shape_C[0] == N) ||
+                       (real_ndims_C_ == 2 && shape_C[0] == 1 && shape_C[1] == N)) {
                 // (N,), (1, N)
                 for (int i = 0; i < M; ++i) {
                     int step = i * N;
@@ -127,7 +153,7 @@ public:
                         broadcast_C[step + j] = beta * ptr_c[j];
                     }
                 }
-            } else if (real_ndims_C == 2 && shape_C[0] == M && shape_C[1] == 1) {
+            } else if (real_ndims_C_ == 2 && shape_C[0] == M && shape_C[1] == 1) {
                 // (M, 1)
                 for (int i = 0; i < M; ++i) {
                     int step = i * N;
@@ -191,11 +217,12 @@ public:
         size_t dims_Y = shape_Y.size();
         int M = shape_Y[dims_Y - 2], N = shape_Y[dims_Y - 1];
         int K = trans_a ? ma : na;
+        bool have_bias_ = have_bias || inputs.size() == 3;
 
         // broadcast C and copy C to output
-        if (have_bias) {
-            if (!const_C) {
-                broadcastCWtihBeta(M, N, inputs.back());
+        if (have_bias_) {
+            if (!const_C || broadcast_C.empty()) {
+                broadcastCWtihBeta(M, N, (inputs.size() >= 3 ? inputs.back() : blobs.back()));
             }
             int step = M * N;
             CV_CheckEQ(broadcast_C.size(), static_cast<size_t>(step), "DNN/Gemm: C is not broadcast properly");
diff --git a/modules/dnn/src/layers/layer_norm.cpp b/modules/dnn/src/layers/layer_norm.cpp
index 487383efdc..20e809e35d 100644
--- a/modules/dnn/src/layers/layer_norm.cpp
+++ b/modules/dnn/src/layers/layer_norm.cpp
@@ -36,12 +36,14 @@ class LayerNormLayerImpl CV_FINAL : public LayerNormLayer
 #endif
 
 public:
+    int axis0;
+
     LayerNormLayerImpl(const LayerParams& params)
     {
         setParamsFrom(params);
 
         // standard attr
-        axis = params.get<int>("axis", -1);
+        axis = axis0 = params.get<int>("axis", -1);
         epsilon = params.get<float>("epsilon", 1e-5);
     }
 
@@ -61,6 +63,9 @@ public:
                                  std::vector<MatShape> &outputs,
                                  std::vector<MatShape> &internals) const CV_OVERRIDE
     {
+        int noutputs = std::max(requiredOutputs > 0 ? requiredOutputs : (int)this->outputs.size(), 1);
+        CV_Assert(noutputs == 1 || noutputs == 3);
+
         // check shapes of weight and bias if existed
         // inputs >= 2 (X and Weight are required, bias is optional)
         int num_inputs = inputs.size() + blobs.size();
@@ -69,14 +74,16 @@ public:
         auto x_shape = inputs[0];
         int x_ndims = static_cast<int>(x_shape.size());
 
+        int axis_ = normalize_axis(axis0, x_shape.dims);
+
         // Weight and bias are either constants or variable
         auto w_shape = blobs.empty() ? inputs[1] : shape(blobs.front());
         // if axis == last_dim, scale and b are both 1d tensor (represented as 2d mat nx1)
         int w_ndims = static_cast<int>(w_shape.size());
-        w_ndims = (axis == x_ndims - 1 && w_ndims == 2) ? w_ndims - 1 : w_ndims;
-        CV_CheckEQ(x_ndims - axis, w_ndims, "LayerNorm: shape of weight does not match with given axis and shape of input");
+        w_ndims = (axis_ == x_ndims - 1 && w_ndims == 2) ? w_ndims - 1 : w_ndims;
+        CV_CheckEQ(x_ndims - axis_, w_ndims, "LayerNorm: shape of weight does not match with given axis and shape of input");
         for (int i = 0; i < w_ndims; ++i)
-            CV_CheckEQ(x_shape[axis+i], w_shape[i], "LayerNorm: weight dimensions does not match with input dimensions");
+            CV_CheckEQ(x_shape[axis_+i], w_shape[i], "LayerNorm: weight dimensions does not match with input dimensions");
         if (num_inputs >= 3)
         {
             auto b_shape = blobs.empty() ? inputs[2] : shape(blobs.back());
@@ -85,7 +92,18 @@ public:
                 CV_CheckEQ(w_shape[i], b_shape[i], "LayerNorm: bias dimensions does not match with weight dimensions");
         }
 
-        outputs.assign(1, inputs[0]);
+        outputs.resize(noutputs, inputs[0]);
+
+        /*
+            even though OpenCV currently does not compute the other outputs
+            of LayerNormalization op, we correctly compute their shapes,
+            according to the specs:
+            https://onnx.ai/onnx/operators/onnx__LayerNormalization.html
+        */
+        for (int i = 1; i < noutputs; i++) {
+            for (int j = axis_; j < x_ndims; j++)
+                outputs[i][j] = 1;
+        }
         return false;
     }
 
@@ -94,7 +112,7 @@ public:
         inputs_arr.getMatVector(inputs);
 
         const auto input_shape = shape(inputs[0]);
-        axis = normalize_axis(axis, static_cast<int>(input_shape.size()));
+        axis = normalize_axis(axis0, static_cast<int>(input_shape.size()));
 
 #ifdef HAVE_OPENCL
         weight_umat.release();
@@ -124,6 +142,8 @@ public:
         const auto &scale = blobs.empty() ? inputs[1] : blobs.front();
         auto &output = outputs[0];
 
+        axis = normalize_axis(axis0, input.dims);
+
         if ((inputs.size() + blobs.size()) >= 3) {
             const auto &bias = blobs.empty() ? inputs[2] : blobs.back();
             fastNorm(input, scale, bias, output, epsilon, static_cast<size_t>(axis));
@@ -150,6 +170,7 @@ public:
         auto &output = outputs[0];
 
         const auto input_shape = shape(input);
+        axis = normalize_axis(axis0, input_shape.dims);
         size_t loops = static_cast<size_t>(total(input_shape, 0, axis)),
                norm_size = static_cast<size_t>(total(input_shape, axis));
         float inv_norm_size = 1.f / norm_size;
diff --git a/modules/dnn/src/layers/layers_common.cpp b/modules/dnn/src/layers/layers_common.cpp
index 3b3a007b06..9a8fa9d0b5 100644
--- a/modules/dnn/src/layers/layers_common.cpp
+++ b/modules/dnn/src/layers/layers_common.cpp
@@ -264,5 +264,147 @@ double getWeightScale(const Mat& weightsMat)
     return (realMax == realMin) ? 1.0 : std::max(-realMin, realMax)/127;
 }
 
+void tensorToIntVec(const Mat& tensor, std::vector<int>& vec)
+{
+    if (tensor.empty()) {
+        vec.clear();
+    } else {
+        int type = tensor.type();
+        CV_Assert(type == CV_32S || type == CV_64S);
+        CV_Assert(tensor.dims <= 1);
+        int size = (int)tensor.total();
+        vec.resize(size);
+        for (int i = 0; i < size; i++) {
+            vec[i] = type == CV_32S ? tensor.at<int>(i) :
+                saturate_cast<int>(tensor.at<int64_t>(i));
+        }
+    }
+}
+
+void tensorToFloatVec(const Mat& tensor, std::vector<float>& vec)
+{
+    if (tensor.empty()) {
+        vec.clear();
+    } else {
+        int type = tensor.type();
+        MatShape shape = tensor.shape();
+        CV_Assert(type == CV_32F || type == CV_16F);
+        CV_Assert(shape.dims <= 1);
+        int size = (int)shape.total();
+        vec.resize(size);
+        for (int i = 0; i < size; i++) {
+            vec[i] = type == CV_32F ? tensor.at<float>(i) :
+                (float)tensor.at<hfloat>(i);
+        }
+    }
+}
+
+void reshapeAndCopyFirst(InputArrayOfArrays inputs,
+                         OutputArrayOfArrays outputs,
+                         const MatShape& shape)
+{
+    int inpKind = inputs.kind(), outKind = outputs.kind();
+    CV_Assert(inpKind == outKind);
+    CV_Assert(inpKind == _InputArray::STD_VECTOR_MAT ||
+              inpKind == _InputArray::STD_VECTOR_UMAT);
+    CV_Assert(inputs.isContinuous(0));
+    int inpType = inputs.type(0);
+    if (inpKind == _InputArray::STD_VECTOR_MAT) {
+        Mat inp = inputs.getMat(0);
+        std::vector<Mat>& outref = outputs.getMatVecRef();
+        outref.resize(1);
+        outref[0].fit(shape, inpType);
+        CV_Assert(outref[0].isContinuous());
+        Mat inp_ = inp.reshape(0, shape);
+        if (inp_.data != outref[0].data)
+            inp_.copyTo(outref[0]);
+    }
+    else {
+        UMat inp = inputs.getUMat(0);
+        std::vector<UMat>& outref = outputs.getUMatVecRef();
+        outref.resize(1);
+        outref[0].fit(shape, inpType);
+        CV_Assert(outref[0].isContinuous());
+        UMat inp_ = inp.reshape(0, shape);
+        inp_.copyTo(outref[0]);
+    }
+}
+
+MatShape tensorToShape(const Mat& shapeTensor)
+{
+    std::vector<int> shapeSpecVec;
+    tensorToIntVec(shapeTensor, shapeSpecVec);
+    return MatShape(shapeSpecVec);
+}
+
+void tensorToScalar(const Mat& tensor, int type, void* value)
+{
+    CV_Assert(tensor.total() == 1);
+    int type0 = tensor.type();
+    int depth = CV_MAT_DEPTH(type), cn = CV_MAT_CN(type);
+    CV_Assert(cn == 1);
+    double v = 0;
+    int64_t iv = 0;
+    bool isflt = type0 == CV_32F || type0 == CV_64F || type0 == CV_16F || type0 == CV_16BF;
+
+    if (type0 == CV_8U)
+        iv = *tensor.ptr<uint8_t>();
+    else if (type0 == CV_8S)
+        iv = *tensor.ptr<uint8_t>();
+    else if (type0 == CV_16U)
+        iv = *tensor.ptr<uint8_t>();
+    else if (type0 == CV_16S)
+        iv = *tensor.ptr<int16_t>();
+    else if (type0 == CV_32U)
+        iv = *tensor.ptr<uint32_t>();
+    else if (type0 == CV_32S)
+        iv = *tensor.ptr<int32_t>();
+    else if (type0 == CV_64S)
+        iv = *tensor.ptr<int64_t>();
+    else if (type0 == CV_32F)
+        v = *tensor.ptr<float>();
+    else if (type0 == CV_64F)
+        v = *tensor.ptr<double>();
+    else if (type0 == CV_16F)
+        v = (float)*tensor.ptr<hfloat>();
+    else if (type0 == CV_16BF)
+        v = (float)*tensor.ptr<bfloat>();
+    else if (type0 == CV_Bool)
+        iv = *tensor.ptr<uint8_t>() != 0;
+    else {
+        CV_Error_(Error::StsNotImplemented, ("type %s is not supported", typeToString(type0).c_str()));
+    }
+
+    if (depth == CV_8U)
+        *reinterpret_cast<uint8_t*>(value) = isflt ? saturate_cast<uint8_t>(v) : saturate_cast<uint8_t>(iv);
+    else if (depth == CV_8S)
+        *reinterpret_cast<int8_t*>(value) = isflt ? saturate_cast<int8_t>(v) : saturate_cast<int8_t>(iv);
+    else if (depth == CV_16U)
+        *reinterpret_cast<uint16_t*>(value) = isflt ? saturate_cast<uint16_t>(v) : saturate_cast<uint16_t>(iv);
+    else if (depth == CV_16S)
+        *reinterpret_cast<int16_t*>(value) = isflt ? saturate_cast<int16_t>(v) : saturate_cast<int16_t>(iv);
+    else if (depth == CV_32U)
+        *reinterpret_cast<uint32_t*>(value) = isflt ? saturate_cast<uint32_t>(v) : saturate_cast<uint32_t>(iv);
+    else if (depth == CV_32S)
+        *reinterpret_cast<int32_t*>(value) = isflt ? saturate_cast<int32_t>(v) : saturate_cast<int32_t>(iv);
+    else if (depth == CV_64U)
+        *reinterpret_cast<uint64_t*>(value) = isflt ? saturate_cast<uint64_t>(v) : saturate_cast<uint64_t>(iv);
+    else if (depth == CV_64S)
+        *reinterpret_cast<int64_t*>(value) = isflt ? saturate_cast<int64_t>(v) : iv;
+    else if (depth == CV_32F)
+        *reinterpret_cast<float*>(value) = isflt ? (float)v : saturate_cast<float>(iv);
+    else if (depth == CV_64F)
+        *reinterpret_cast<double*>(value) = isflt ? v : saturate_cast<double>(iv);
+    else if (depth == CV_16F)
+        *reinterpret_cast<hfloat*>(value) = isflt ? saturate_cast<hfloat>(v) : saturate_cast<hfloat>(iv);
+    else if (depth == CV_16BF)
+        *reinterpret_cast<bfloat*>(value) = isflt ? saturate_cast<bfloat>(v) : saturate_cast<bfloat>(iv);
+    else if (depth == CV_Bool)
+        *reinterpret_cast<uint8_t*>(value) = isflt ? (uint8_t)(v != 0) : (uint8_t)(iv != 0);
+    else {
+        CV_Error_(Error::StsNotImplemented, ("type %s is not supported", typeToString(depth).c_str()));
+    }
+}
+
 }
 }
diff --git a/modules/dnn/src/layers/layers_common.hpp b/modules/dnn/src/layers/layers_common.hpp
index 4510f6b106..39938035c2 100644
--- a/modules/dnn/src/layers/layers_common.hpp
+++ b/modules/dnn/src/layers/layers_common.hpp
@@ -76,6 +76,38 @@ void getConvPoolPaddings(const std::vector<int>& inp, const std::vector<size_t>&
 
 // Used in quantized model. It will return the (Max_element - Min_element)/127.
 double getWeightScale(const Mat& weightsMat);
+
+// Several ONNX operations take list of integer's or float's,
+// e.g. to specify list of axes (Squeeze, Unsqueeze, Transpose, Reduce*, ...),
+// coordinates, repetitions etc. (Slice, Tile, ...), scale factors (Resize, ...).
+// Here are helper functions to extract this data
+void tensorToIntVec(const Mat& tensor, std::vector<int>& vec);
+void tensorToFloatVec(const Mat& tensor, std::vector<float>& vec);
+void tensorToScalar(const Mat& tensor, int type, void* value);
+template<typename _Tp> _Tp tensorToScalar(const Mat& tensor)
+{
+    _Tp value = _Tp(0);
+    tensorToScalar(tensor, DataType<_Tp>::type, &value);
+    return value;
+}
+
+// tensor to mat shape
+MatShape tensorToShape(const Mat& shapeTensor);
+
+// inputs and outputs are both vector<Mat>'s or both are vector<UMat>'s.
+// the function does the following:
+//
+// 1. resizes output vector to 1-element vector
+// 2. outputs[0].fit(shape, inputs[0].type())
+// 3. temp = inputs[0].reshape(shape);
+// 4. temp.copyTo(outputs[0]) // detect in-place case and do nothing in this case
+//
+// the function helps to implement DL operations
+// 'Reshape', 'Flatten', 'Squeeze', 'Unsqueeze', 'Identity'.
+void reshapeAndCopyFirst(InputArrayOfArrays inputs,
+                         OutputArrayOfArrays outputs,
+                         const MatShape& shape);
+
 }
 }
 
diff --git a/modules/dnn/src/layers/max_unpooling_layer.cpp b/modules/dnn/src/layers/max_unpooling_layer.cpp
index c8e489eb52..66e3b13e52 100644
--- a/modules/dnn/src/layers/max_unpooling_layer.cpp
+++ b/modules/dnn/src/layers/max_unpooling_layer.cpp
@@ -221,7 +221,7 @@ public:
         std::vector<MatShape> outShapes, internals;
         for (int i = 0; i < nodes.size(); ++i) {
             std::vector<size_t> shape = nodes[i].dynamicCast<InfEngineNgraphNode>()->node.get_shape();
-            inpShapes[i] = std::vector<int>(shape.begin(), shape.end());
+            inpShapes[i] = MatShape(shape.begin(), shape.end());
         }
         getMemoryShapes(inpShapes, 1, outShapes, internals);
 
diff --git a/modules/dnn/src/layers/normalize_bbox_layer.cpp b/modules/dnn/src/layers/normalize_bbox_layer.cpp
index 1f34e32b3c..abb03b0bf6 100644
--- a/modules/dnn/src/layers/normalize_bbox_layer.cpp
+++ b/modules/dnn/src/layers/normalize_bbox_layer.cpp
@@ -127,8 +127,8 @@ public:
         startAxis = normalize_axis(startAxis, inp0.dims);
         endAxis = normalize_axis(endAxis, inp0.dims);
 
-        size_t num = total(shape(inp0.size), 0, startAxis);
-        size_t numPlanes = total(shape(inp0.size), startAxis, endAxis + 1);
+        size_t num = total(inp0.shape(), 0, startAxis);
+        size_t numPlanes = total(inp0.shape(), startAxis, endAxis + 1);
         size_t planeSize = inp0.total() / (num * numPlanes);
         MatShape s = shape(1, inputs[0].total());
         UMat inp = inputs[0].reshape(1, s.size(), &s[0]).reshape(1, num);
diff --git a/modules/dnn/src/layers/not_layer.cpp b/modules/dnn/src/layers/not_layer.cpp
index 7f589a9ba6..4ff354805a 100644
--- a/modules/dnn/src/layers/not_layer.cpp
+++ b/modules/dnn/src/layers/not_layer.cpp
@@ -56,11 +56,14 @@ public:
         CV_CheckTypeEQ(inputs[0].type(), CV_Bool, "");
         CV_CheckTypeEQ(outputs[0].type(), CV_Bool, "");
 
+        CV_Assert(inputs[0].isContinuous());
+        CV_Assert(outputs[0].isContinuous());
+
         bool* input = inputs[0].ptr<bool>();
         bool* output = outputs[0].ptr<bool>();
-        int size = inputs[0].total();
+        size_t size = inputs[0].total();
 
-        for (int i = 0; i < size; ++i)
+        for (size_t i = 0; i < size; ++i)
             output[i] = !input[i];
     }
 
diff --git a/modules/dnn/src/layers/pad2_layer.cpp b/modules/dnn/src/layers/pad2_layer.cpp
new file mode 100644
index 0000000000..1c0e04c16e
--- /dev/null
+++ b/modules/dnn/src/layers/pad2_layer.cpp
@@ -0,0 +1,377 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+
+namespace cv
+{
+namespace dnn
+{
+
+static constexpr int PAD_MAX_DIMS = 5;
+
+/*
+    Padding layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__Pad.html
+
+    Opset's 1 to 23 are covered.
+*/
+
+// out must be pre-allocated
+// pads[] should contains as many elements as inp.dims*2
+static void pad(const Mat& inp, const std::vector<int>& pads_, int mode_, const Mat& value, Mat& out)
+{
+    int inptype = inp.type();
+    MatShape inpshape_ = inp.shape();
+    MatShape outshape_ = out.shape();
+    double buf = 0;
+    Mat vbuf(1, 1, inptype, &buf);
+
+    int inpshape[PAD_MAX_DIMS];
+    int outshape[PAD_MAX_DIMS];
+    int pads[PAD_MAX_DIMS*2];
+    int64_t inpstep[PAD_MAX_DIMS];
+    int64_t outstep[PAD_MAX_DIMS];
+    std::vector<int> tab[PAD_MAX_DIMS];
+
+    int ndims = inp.dims, delta = PAD_MAX_DIMS - ndims;
+    int64_t esz = inp.elemSize();
+
+    CV_Assert(inp.isContinuous());
+    CV_Assert(out.isContinuous());
+    CV_Assert(inp.type() == out.type());
+    CV_Assert(esz == 1 || esz == 2 || esz == 4 || esz == 8);
+    CV_Assert(inp.dims == out.dims);
+    CV_Assert(inp.dims <= PAD_MAX_DIMS);
+
+    if (!value.empty()) {
+        CV_Assert(value.dims <= 2 && value.total() == 1 && value.channels() == 1);
+        tensorToScalar(value, inptype, &buf);
+    }
+
+    for (int i = 0; i < PAD_MAX_DIMS; i++) {
+        inpshape[i] = outshape[i] = 1;
+        pads[i] = pads[i + PAD_MAX_DIMS] = 0;
+    }
+
+    for (int i = 0; i < ndims; i++) {
+        inpshape[i+delta] = inpshape_[i];
+        outshape[i+delta] = outshape_[i];
+        pads[i+delta] = pads_[i];
+        pads[i+delta + PAD_MAX_DIMS] = pads_[i + ndims];
+
+        // initialize lookup table along the corresponding axis
+        int inpsz_i = inpshape_[i];
+        int outsz_i = outshape_[i];
+        tab[i+delta].resize(outsz_i);
+        int* tab_i = tab[i+delta].data();
+        int before = pads_[i];
+        for (int j = 0; j < outsz_i; j++)
+            tab_i[j] = borderInterpolate(j - before, inpsz_i, mode_);
+    }
+
+    for (int i = PAD_MAX_DIMS-1; i >= 0; i--) {
+        if (i == PAD_MAX_DIMS-1)
+            inpstep[i] = outstep[i] = 1;
+        else {
+            inpstep[i] = inpstep[i+1]*inpshape[i+1];
+            outstep[i] = outstep[i+1]*outshape[i+1];
+        }
+    }
+
+    int nplanes = outshape[0]*outshape[1]*outshape[2];
+
+    CV_Assert(!tab[4].empty());
+
+    #undef IMPL_PAD
+    #define IMPL_PAD(T) \
+    parallel_for_(Range(0, nplanes), [&](const Range& r) { \
+        int mode = mode_; \
+        int sz1 = outshape[1], sz2 = outshape[2], sz3 = outshape[3], sz4 = outshape[4]; \
+        const int* tab0 = tab[0].data(); \
+        const int* tab1 = tab[1].data(); \
+        const int* tab2 = tab[2].data(); \
+        const int* tab3 = tab[3].data(); \
+        const int* tab4 = tab[4].data(); \
+        const T* inpdata0 = (const T*)inp.data; \
+        T val0 = *reinterpret_cast<T*>(vbuf.data); \
+        T* outdata0 = (T*)out.data; \
+        int p0 = pads[PAD_MAX_DIMS-1], p1 = pads[PAD_MAX_DIMS*2-1]; \
+        int p0_ = std::max(p0, 0), p1_ = std::max(p1, 0); \
+        for (int plane = r.start; plane < r.end; plane++) { \
+            int plane_ = plane; \
+            int i2 = plane_ % sz2; \
+            plane_ /= sz2; \
+            int i1 = plane_ % sz1; \
+            int i0 = plane_ / sz1; \
+            int ii0 = tab0 ? tab0[i0] : i0; \
+            int ii1 = tab1 ? tab1[i1] : i1; \
+            int ii2 = tab2 ? tab2[i2] : i2; \
+            for (int i3 = 0; i3 < sz3; i3++) { \
+                int ii3 = tab3 ? tab3[i3] : i3; \
+                T* outdata = outdata0 + i0*outstep[0] + i1*outstep[1] + i2*outstep[2] + i3*outstep[3]; \
+                int i4 = 0; \
+                if ((ii0|ii1|ii2|ii3) < 0) { \
+                    for (; i4 < sz4; i4++) \
+                        outdata[i4] = val0; \
+                    continue; \
+                } \
+                const T* inpdata = inpdata0 + ii0*inpstep[0] + ii1*inpstep[1] + ii2*inpstep[2] + ii3*inpstep[3]; \
+                if (mode == BORDER_CONSTANT) {\
+                    for (; i4 < p0_; i4++) \
+                        outdata[i4] = val0; \
+                } else { \
+                    for (; i4 < p0_; i4++) \
+                        outdata[i4] = inpdata[tab4[i4]]; \
+                } \
+                for (; i4 < sz4 - p1_; i4++) \
+                    outdata[i4] = inpdata[i4 - p0]; \
+                if (mode == BORDER_CONSTANT) { \
+                    for (; i4 < sz4; i4++) \
+                        outdata[i4] = val0; \
+                } else { \
+                    for (; i4 < sz4; i4++) \
+                        outdata[i4] = inpdata[tab4[i4]]; \
+                } \
+            } \
+        } \
+    })
+
+    if (esz == 1) {
+        IMPL_PAD(uint8_t);
+    } else if (esz == 2) {
+        IMPL_PAD(uint16_t);
+    } else if (esz == 4) {
+        IMPL_PAD(uint32_t);
+    } else {
+        CV_Assert(esz == 8);
+        IMPL_PAD(uint64_t);
+    }
+}
+
+class Pad2LayerImpl CV_FINAL : public Pad2Layer
+{
+public:
+    std::vector<int> pads0;
+    float value0 = 0.f;
+    int mode = BORDER_CONSTANT;
+
+    Pad2LayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+        std::vector<int> pads0_ = params.getVector<int>("paddings");
+        // [TODO] remove this transposition after the original transposition is removed from onnx importer 2
+        if (!pads0_.empty()) {
+            int i, ndims = (int)(pads0_.size()/2);
+            pads0.resize(ndims*2);
+            for (i = 0; i < ndims; i++) {
+                pads0[i] = pads0_[i*2];
+                pads0[i + ndims] = pads0_[i*2+1];
+            }
+        }
+        std::string strmode = params.get<std::string>("mode", "constant");
+        if (strmode == "constant")
+            mode = BORDER_CONSTANT;
+        else if (strmode == "reflect")
+            mode = BORDER_REFLECT101;
+        else if (strmode == "edge")
+            mode = BORDER_REPLICATE;
+        else if (strmode == "wrap")
+            mode = BORDER_WRAP;
+        else {
+            CV_Error_(Error::StsNotImplemented, ("mode '%s' is not supported", strmode.c_str()));
+        }
+        value0 = params.get<float>("value", 0.f);
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    virtual bool dynamicOutputShapes() const CV_OVERRIDE
+    {
+        Net::Impl* netimpl_ = getNetImpl(this);
+        CV_Assert(netimpl_);
+        size_t ninputs = this->inputs.size();
+        CV_Assert(1 <= ninputs && ninputs <= 4);
+        return (ninputs >= 2 && !netimpl_->isConstArg(this->inputs[1])) ||
+               (ninputs >= 4 && !netimpl_->isConstArg(this->inputs[3]));
+    }
+
+    void getPads(int ndims, const Mat& pads_, const Mat& axes_, std::vector<int>& pads) const
+    {
+        int atype = axes_.type(), ptype = pads_.type();
+        CV_Assert(ndims <= PAD_MAX_DIMS);
+
+        const int32_t* adata_i32 = nullptr;
+        const int64_t* adata_i64 = nullptr;
+        const int32_t* pdata_i32 = nullptr;
+        const int64_t* pdata_i64 = nullptr;
+
+        bool axismask[PAD_MAX_DIMS];
+        int naxes = !axes_.empty() ? (int)axes_.total() : ndims;
+
+        CV_Assert(pads_.dims == 1);
+        CV_Assert(ptype == CV_32S || ptype == CV_64S);
+
+        if (ptype == CV_32S)
+            pdata_i32 = reinterpret_cast<const int32_t*>(pads_.data);
+        else
+            pdata_i64 = reinterpret_cast<const int64_t*>(pads_.data);
+
+        if (!axes_.empty()) {
+            CV_Assert(axes_.dims == 1);
+            CV_Assert(atype == CV_32S || atype == CV_64S);
+            CV_Assert(pads_.total() == axes_.total()*2);
+            CV_Assert(axes_.total() <= (size_t)ndims);
+
+            if (atype == CV_32S)
+                adata_i32 = reinterpret_cast<const int32_t*>(axes_.data);
+            else
+                adata_i64 = reinterpret_cast<const int64_t*>(axes_.data);
+        } else {
+            CV_Assert(pads_.total() == (size_t)ndims*2);
+        }
+
+        pads.resize(ndims*2);
+
+        for (int i = 0; i < ndims; i++) {
+            pads[i] = pads[i+ndims] = 0;
+            axismask[i] = false;
+        }
+
+        for (int i = 0; i < naxes; i++) {
+            int a = adata_i32 ? (int)adata_i32[i] : adata_i64 ? (int)adata_i64[i] : i;
+            a = normalize_axis(a, ndims);
+            if (axismask[a]) {
+                CV_Error_(Error::StsBadArg, ("duplicate axis %d in Pad", a));
+            }
+            axismask[a] = true;
+            int p0 = pdata_i32 ? (int)pdata_i32[i] : pdata_i64 ? (int)pdata_i64[i] : 0;
+            int p1 = pdata_i32 ? (int)pdata_i32[i+naxes] : pdata_i64 ? (int)pdata_i64[i+naxes] : 0;
+            pads[a] = p0;
+            pads[a+ndims] = p1;
+            // p0, p1 can be positive, zero or even negative, according to ONNX specification.
+            // so we don't put any checks here.
+        }
+    }
+
+    MatShape getOutShape(const MatShape& inpshape, const std::vector<int>& pads) const
+    {
+        MatShape outshape = inpshape;
+        int ndims = inpshape.dims;
+        for (int i = 0; i < ndims; i++) {
+            outshape[i] += pads[i] + pads[i+ndims];
+            CV_Assert(outshape[i] >= 0);
+        }
+        return outshape;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape>& inputs,
+                         const int,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        CV_Assert(!dynamicOutputShapes());
+
+        size_t ninputs = inputs.size();
+        CV_Assert(1 <= ninputs && ninputs <= 4);
+        Net::Impl* netimpl_ = getNetImpl(this);
+
+        std::vector<int> padsbuf;
+        const std::vector<int>* pads = &pads0;
+
+        if (ninputs >= 2) {
+            int ndims = inputs[0].dims;
+            Mat padsTensor = netimpl_->argTensor(this->inputs[1]);
+            Mat axesTensor;
+            if (ninputs >= 4)
+                axesTensor = netimpl_->argTensor(this->inputs[3]);
+            getPads(ndims, padsTensor, axesTensor, padsbuf);
+            pads = &padsbuf;
+        }
+
+        outputs.assign(1, getOutShape(inputs[0], *pads));
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(1 <= ninputs && ninputs <= 4);
+        outputs.assign(requiredOutputs, inputs[0]);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+        CV_Assert(1 <= ninputs && ninputs <= 4);
+
+        Mat inp = inputs_arr.getMat(0);
+        Mat value(1, 1, CV_32F, &value0);
+        int inptype = inp.type();
+        std::vector<int> padsbuf;
+        const std::vector<int>* pads = &pads0;
+
+        if (ninputs >= 2) {
+            int ndims = inp.dims;
+            Mat padsTensor = inputs_arr.getMat(1);
+            Mat axesTensor;
+            if (ninputs >= 4)
+                axesTensor = inputs_arr.getMat(3);
+            getPads(ndims, padsTensor, axesTensor, padsbuf);
+            pads = &padsbuf;
+            if (ninputs >= 3)
+                value = inputs_arr.getMat(2);
+        }
+
+        MatShape inpshape = inp.shape();
+        MatShape outshape = getOutShape(inpshape, *pads);
+
+        auto kind = outputs_arr.kind();
+        if (kind == _InputArray::STD_VECTOR_MAT) {
+            std::vector<Mat>& outs = outputs_arr.getMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outshape, inptype);
+            pad(inp, *pads, mode, value, outs[0]);
+        } else if (kind == _InputArray::STD_VECTOR_UMAT) {
+            std::vector<UMat>& outs = outputs_arr.getUMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outshape, inptype);
+            Mat temp(outshape, inptype);
+            pad(inp, *pads, mode, value, temp);
+            temp.copyTo(outs[0]);
+        } else {
+            CV_Error(Error::StsNotImplemented, "");
+        }
+    }
+};
+
+Ptr<Pad2Layer> Pad2Layer::create(const LayerParams& params)
+{
+    return Ptr<Pad2Layer>(new Pad2LayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/padding_layer.cpp b/modules/dnn/src/layers/padding_layer.cpp
index ac574ea0e7..5c3405f11f 100644
--- a/modules/dnn/src/layers/padding_layer.cpp
+++ b/modules/dnn/src/layers/padding_layer.cpp
@@ -58,13 +58,7 @@ public:
     {
         CV_Assert(inputs.size() == 1);
         const MatShape& inpShape = inputs[0];
-        if (inpShape.empty()){
-            CV_Assert(paddings.size() == 1);
-            outputs.resize(1, MatShape(1, paddings[0].first + paddings[0].second + 1));
-            return false;
-        }
         CV_Assert(inpShape.size() >= paddings.size());
-
         CV_Assert(inputDims == -1 || inpShape.size() == inputDims || inpShape.size() > paddings.size());
 
         outputs.resize(1, inpShape);
diff --git a/modules/dnn/src/layers/pooling_layer.cpp b/modules/dnn/src/layers/pooling_layer.cpp
index 45e8a1026b..1f3932dc7b 100644
--- a/modules/dnn/src/layers/pooling_layer.cpp
+++ b/modules/dnn/src/layers/pooling_layer.cpp
@@ -1250,7 +1250,7 @@ public:
         int numOutputs = requiredOutputs ? requiredOutputs : (type == MAX ? 2 : 1);
         CV_Assert(numOutputs == 1 || (numOutputs == 2 && type == MAX));
 
-        outputs.assign(numOutputs, outShape);
+        outputs.assign(numOutputs, MatShape(outShape));
 
         return false;
     }
diff --git a/modules/dnn/src/layers/prior_box_layer.cpp b/modules/dnn/src/layers/prior_box_layer.cpp
index bb3aa99cd3..1c5dd59ef3 100644
--- a/modules/dnn/src/layers/prior_box_layer.cpp
+++ b/modules/dnn/src/layers/prior_box_layer.cpp
@@ -580,10 +580,12 @@ public:
         auto image_shape = image_wrapper->getShape();
 
         PriorBoxConfiguration config;
-        config.feature_map_width = feature_map_shape.rbegin()[0];
-        config.feature_map_height = feature_map_shape.rbegin()[1];
-        config.image_width = image_shape.rbegin()[0];
-        config.image_height = image_shape.rbegin()[1];
+        int fm_dims = feature_map_shape.dims;
+        int im_dims = image_shape.dims;
+        config.feature_map_width = feature_map_shape.p[fm_dims-1];
+        config.feature_map_height = feature_map_shape.p[fm_dims-2];
+        config.image_width = image_shape[im_dims-1];
+        config.image_height = image_shape[im_dims-2];
 
         config.num_priors = _numPriors;
         config.box_widths = _boxWidths;
diff --git a/modules/dnn/src/layers/proposal_layer.cpp b/modules/dnn/src/layers/proposal_layer.cpp
index 66559ab9ff..5b655e796e 100644
--- a/modules/dnn/src/layers/proposal_layer.cpp
+++ b/modules/dnn/src/layers/proposal_layer.cpp
@@ -162,16 +162,16 @@ public:
         // Scores permute layer.
         Mat scores = getObjectScores(inputs[0]);
         layerInputs.assign(1, scores);
-        layerOutputs.assign(1, Mat(shape(scores.size[0], scores.size[2],
-                                         scores.size[3], scores.size[1]), CV_32FC1));
+        layerOutputs.assign(1, Mat({scores.size[0], scores.size[2],
+                                    scores.size[3], scores.size[1]}, CV_32FC1));
         scoresPermute->finalize(layerInputs, layerOutputs);
 
         // BBox predictions permute layer.
         const Mat& bboxDeltas = inputs[1];
         CV_Assert(bboxDeltas.dims == 4);
         layerInputs.assign(1, bboxDeltas);
-        layerOutputs.assign(1, Mat(shape(bboxDeltas.size[0], bboxDeltas.size[2],
-                                         bboxDeltas.size[3], bboxDeltas.size[1]), CV_32FC1));
+        layerOutputs.assign(1, Mat({bboxDeltas.size[0], bboxDeltas.size[2],
+                                    bboxDeltas.size[3], bboxDeltas.size[1]}, CV_32FC1));
         deltasPermute->finalize(layerInputs, layerOutputs);
     }
 
@@ -287,9 +287,22 @@ public:
         Mat& detections = internals[3];
 
         CV_Assert(imInfo.total() >= 2);
+        int imInfo0, imInfo1;
+        if (imInfo.type() == CV_32F) {
+            imInfo0 = cvRound(imInfo.at<float>(0));
+            imInfo1 = cvRound(imInfo.at<float>(1));
+        } else if (imInfo.type() == CV_32S) {
+            imInfo0 = imInfo.at<int>(0);
+            imInfo1 = imInfo.at<int>(1);
+        } else if (imInfo.type() == CV_64S) {
+            imInfo0 = (int)imInfo.at<int64_t>(0);
+            imInfo1 = (int)imInfo.at<int64_t>(1);
+        } else {
+            CV_Error(Error::StsBadArg, "unsupported type of input[2]: must be 32f, 32s or 64s");
+        }
         // We've chosen the smallest data type because we need just a shape from it.
         // We don't allocate memory but just need the shape is correct.
-        Mat fakeImageBlob(shape(1, 1, imInfo.at<float>(0), imInfo.at<float>(1)), CV_8UC1, NULL);
+        Mat fakeImageBlob({1, 1, imInfo0, imInfo1}, CV_8UC1, NULL);
 
         // Generate prior boxes.
         std::vector<Mat> layerInputs(2), layerOutputs(1, priorBoxes);
diff --git a/modules/dnn/src/layers/quantlizelinear_layer.cpp b/modules/dnn/src/layers/quantlizelinear_layer.cpp
new file mode 100644
index 0000000000..429da2fdf3
--- /dev/null
+++ b/modules/dnn/src/layers/quantlizelinear_layer.cpp
@@ -0,0 +1,336 @@
+
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    QuantizeLinear layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__QuantizeLinear.html
+
+    Opset's 10 to 23 are covered.
+*/
+
+template <typename _InpTp, typename _ScaleTp, typename _OutTp>
+static void quantizeLinear(const _InpTp* inp_, const _ScaleTp* scale_,
+                           const _OutTp* zp_, _OutTp* out_,
+                           int64_t nslices, int sz_a_,
+                           int64_t slice_size_, int block_size_)
+{
+    int bsz_ = std::max(block_size_, 1);
+    int nblocks_per_axis = (sz_a_ + bsz_ - 1) / bsz_;
+    int64_t nmacro_blocks = nslices * nblocks_per_axis;
+    CV_Assert(nmacro_blocks <= (int64_t)INT_MAX);
+
+    parallel_for_(Range(0, (int)nmacro_blocks), [&](const Range& r) {
+        int sz_a = sz_a_;
+        int64_t slice_size = slice_size_;
+        int block_size = block_size_;
+        int delta = 0;
+        int64_t scale_step = block_size > 0 ? slice_size : 1;
+        int64_t zp_step = zp_ ? scale_step : 0;
+
+        for (int i = r.start; i < r.end; i += delta) {
+            int slice_idx = i / nblocks_per_axis;
+            int block_idx = i - slice_idx * nblocks_per_axis;
+            int64_t block_ofs, scale_ofs;
+            if (block_size > 0) {
+                delta = std::min(nblocks_per_axis - block_idx, r.end - i);
+                block_ofs = (slice_idx*sz_a + block_idx*block_size)*slice_size;
+                scale_ofs = (slice_idx*nblocks_per_axis + block_idx)*slice_size;
+            } else {
+                delta = std::min(sz_a - block_idx, r.end - i);
+                block_ofs = (slice_idx*sz_a + block_idx)*slice_size;
+                scale_ofs = block_idx;
+            }
+            const _InpTp* inp = inp_ + block_ofs;
+            const _OutTp* zp = zp_ ? zp_ + scale_ofs : nullptr;
+            const _ScaleTp* sc = scale_ + scale_ofs;
+            _OutTp* out = out_ + block_ofs;
+
+            // [TODO] vectorize using intrinsics
+            if (slice_size > 1) {
+                for (int k = 0; k < delta; k++, inp += slice_size, out += slice_size,
+                                                sc += scale_step, zp += zp_step) {
+                    float scval = 1.f/(float)(*sc);
+                    _OutTp zpval = zp ? *zp : (_InpTp)0;
+
+                    for (int64_t j = 0; j < slice_size; j++)
+                        out[j] = saturate_cast<_OutTp>(inp[j]*scval + zpval);
+                }
+            } else if (block_size > 0 ) {
+                int bsz = block_size;
+                for (int k = 0; k < delta; k++, inp += bsz, out += bsz) {
+                    bsz = std::min(bsz, sz_a - (block_idx + k)*block_size);
+                    float scval = 1.f/(float)sc[k];
+                    _OutTp zpval = zp ? zp[k] : (_InpTp)0;
+
+                    for (int j = 0; j < bsz; j++)
+                        out[j] = saturate_cast<_OutTp>(inp[j]*scval + zpval);
+                }
+                sc += delta;
+                zp += zp ? delta : 0;
+            } else {
+                // here we assume that scale's have been inversed in advance in the parent function
+                if (zp) {
+                    for (int j = 0; j < delta; j++) {
+                        float scval = (float)sc[j];
+                        _OutTp zpval = zp[j];
+                        out[j] = saturate_cast<_OutTp>(inp[j]*scval + zpval);
+                    }
+                } else {
+                    for (int j = 0; j < delta; j++) {
+                        float scval = (float)sc[j];
+                        out[j] = saturate_cast<_OutTp>(inp[j]*scval);
+                    }
+                }
+                inp += delta;
+                out += delta;
+            }
+        }
+    });
+}
+
+// Dequantize INT8/UINT8 to FP32/FP16; out must be preallocated
+static void quantizeLinear(const Mat& inp, const Mat& scale_, const Mat& zp,
+                           int axis, int block_size, Mat& out)
+{
+    Mat scale = scale_;
+    CV_Assert(inp.isContinuous());
+    CV_Assert(scale.isContinuous());
+    CV_Assert(out.isContinuous());
+
+    int inptype = inp.type();
+    int outtype = out.type();
+    int sctype = scale.type();
+    int zptype = zp.type();
+    MatShape inpshape = inp.shape();
+    MatShape scshape = scale.shape();
+    MatShape zpshape = zp.shape();
+    int i, ndims = inpshape.dims;
+    int64_t nslices = 1, slice_size = 1;
+
+    CV_Assert(inptype == CV_32F || inptype == CV_16F);
+    CV_Assert(sctype == CV_32F || sctype == CV_16F);
+    CV_Assert(outtype == CV_8U || outtype == CV_8S);
+
+    if (!zp.empty()) {
+        CV_Assert(zp.isContinuous());
+        CV_Assert(zptype == outtype);
+        CV_Assert(zpshape == scshape);
+    }
+
+    axis = normalize_axis(axis, ndims);
+    for (i = 0; i < axis; i++)
+        nslices *= inpshape[i];
+    for (i = axis+1; i < ndims; i++)
+        slice_size *= inpshape[i];
+    int sz_a = inpshape[axis];
+
+    if (block_size == 0) {
+        size_t sc_total = scshape.total();
+        CV_Assert(scale.dims <= 1);
+        CV_Assert(sc_total == 1 || sc_total == (size_t)sz_a);
+
+        // unroll the innermost loop if the scale's/zp's are the same
+        if (sc_total == 1) {
+            slice_size *= sz_a;
+            sz_a = 1;
+        }
+
+        // avoid repeated inversion and FP16 => FP32 conversion inside the innermost loop
+        if (slice_size == 1) {
+            Mat temp(scale.size(), CV_32F);
+            const float* scdata_32f = reinterpret_cast<const float*>(scale.data);
+            const hfloat* scdata_16f = reinterpret_cast<const hfloat*>(scale.data);
+            float* tempdata = temp.ptr<float>();
+
+            for (size_t i = 0; i < sc_total; i++)
+                tempdata[i] = 1.f/(sctype == CV_32F ? scdata_32f[i] : (float)scdata_16f[i]);
+            scale = temp;
+            sctype = CV_32F;
+        }
+    } else {
+        CV_Assert(block_size > 0);
+        CV_Assert(scale.dims == ndims);
+        for (int i = 0; i < ndims; i++) {
+            int inp_i = inpshape[i];
+            int sc_i = scshape[i];
+            if (i == axis) {
+                CV_Assert((inp_i + block_size - 1)/block_size == sc_i);
+            } else {
+                CV_Assert(sc_i == inp_i);
+            }
+        }
+    }
+
+    if (outtype == CV_8U && sctype == CV_32F && inptype == CV_32F)
+        quantizeLinear(reinterpret_cast<const float*>(inp.data),
+                         reinterpret_cast<const float*>(scale.data),
+                         reinterpret_cast<const uint8_t*>(zp.data),
+                         reinterpret_cast<uint8_t*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (outtype == CV_8U && sctype == CV_16F && inptype == CV_32F)
+        quantizeLinear(reinterpret_cast<const float*>(inp.data),
+                         reinterpret_cast<const hfloat*>(scale.data),
+                         reinterpret_cast<const uint8_t*>(zp.data),
+                         reinterpret_cast<uint8_t*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (outtype == CV_8U && sctype == CV_32F && inptype == CV_16F)
+        quantizeLinear(reinterpret_cast<const hfloat*>(inp.data),
+                         reinterpret_cast<const float*>(scale.data),
+                         reinterpret_cast<const uint8_t*>(zp.data),
+                         reinterpret_cast<uint8_t*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (outtype == CV_8U && sctype == CV_16F && inptype == CV_16F)
+        quantizeLinear(reinterpret_cast<const hfloat*>(inp.data),
+                         reinterpret_cast<const hfloat*>(scale.data),
+                         reinterpret_cast<const uint8_t*>(zp.data),
+                         reinterpret_cast<uint8_t*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (outtype == CV_8S && sctype == CV_32F && inptype == CV_32F)
+        quantizeLinear(reinterpret_cast<const float*>(inp.data),
+                         reinterpret_cast<const float*>(scale.data),
+                         reinterpret_cast<const int8_t*>(zp.data),
+                         reinterpret_cast<int8_t*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (outtype == CV_8S && sctype == CV_16F && inptype == CV_32F)
+        quantizeLinear(reinterpret_cast<const float*>(inp.data),
+                         reinterpret_cast<const hfloat*>(scale.data),
+                         reinterpret_cast<const int8_t*>(zp.data),
+                         reinterpret_cast<int8_t*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (outtype == CV_8S && sctype == CV_32F && inptype == CV_16F)
+        quantizeLinear(reinterpret_cast<const hfloat*>(inp.data),
+                         reinterpret_cast<const float*>(scale.data),
+                         reinterpret_cast<const int8_t*>(zp.data),
+                         reinterpret_cast<int8_t*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else if (outtype == CV_8S && sctype == CV_16F && inptype == CV_16F)
+        quantizeLinear(reinterpret_cast<const hfloat*>(inp.data),
+                         reinterpret_cast<const hfloat*>(scale.data),
+                         reinterpret_cast<const int8_t*>(zp.data),
+                         reinterpret_cast<int8_t*>(out.data),
+                         nslices, sz_a, slice_size, block_size);
+    else {
+        CV_Error_(Error::StsNotImplemented,
+                  ("the following combination of types is not supported in "
+                   "QuantizeLinear: inp=%s, scale=%s, out=%s",
+                   typeToString(inptype).c_str(),
+                   typeToString(sctype).c_str(),
+                   typeToString(outtype).c_str()));
+    }
+}
+
+class QuantizeLinearLayerImpl CV_FINAL : public QuantizeLinearLayer
+{
+public:
+    QuantizeLinearLayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+
+        axis = params.get<int>("axis", 1);
+        block_size = params.get<int>("block_size", 0);
+        saturate = params.get<bool>("saturate", true);
+        output_dtype = params.get<int>("output_dtype", -1);
+        CV_Assert(block_size >= 0);
+        CV_Assert(saturate);
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV || backendId == DNN_BACKEND_INFERENCE_ENGINE_NGRAPH;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape> &inputs,
+                         const int requiredOutputs,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(2 <= ninputs && ninputs <= 3);
+        CV_Assert(requiredOutputs == 1);
+        outputs.assign(1, inputs[0]);
+        return true;
+    }
+
+    int getOutType(int zptype) const
+    {
+        return output_dtype >= 0 ? output_dtype : zptype;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(2 <= ninputs && ninputs <= 3);
+        int zptype = CV_8U;
+        if (ninputs == 3) {
+            zptype = inputs[2];
+        }
+        outputs.assign(1, getOutType(zptype));
+    }
+
+    virtual void finalize(InputArrayOfArrays inputs_arr, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        int ninputs = inputs_arr.size(-1).area();
+        CV_Assert(2 <= ninputs && ninputs <= 3);
+
+        Mat inp = inputs_arr.getMat(0);
+        Mat scale = inputs_arr.getMat(1);
+        Mat zeropoint;
+        int zptype = CV_8U, outtype;
+        MatShape inpshape = inp.shape();
+
+        if (ninputs >= 3) {
+            zeropoint = inputs_arr.getMat(2);
+            zptype = zeropoint.type();
+        }
+
+        outtype = getOutType(zptype);
+        auto kind = outputs_arr.kind();
+
+        if (kind == _InputArray::STD_VECTOR_MAT) {
+            std::vector<Mat>& outs = outputs_arr.getMatVecRef();
+            outs.resize(1);
+            outs[0].fit(inpshape, outtype);
+            quantizeLinear(inp, scale, zeropoint, axis, block_size, outs[0]);
+        } else if (kind == _InputArray::STD_VECTOR_UMAT) {
+            std::vector<UMat>& outs = outputs_arr.getUMatVecRef();
+            outs.resize(1);
+            outs[0].fit(inpshape, outtype);
+            Mat temp(inpshape, outtype);
+            quantizeLinear(inp, scale, zeropoint, axis, block_size, temp);
+            temp.copyTo(outs[0]);
+        } else {
+            CV_Error(Error::StsNotImplemented, "");
+        }
+    }
+};
+
+Ptr<QuantizeLinearLayer> QuantizeLinearLayer::create(const LayerParams& params)
+{
+    return Ptr<QuantizeLinearLayer>(new QuantizeLinearLayerImpl(params));
+}
+
+}}
diff --git a/modules/dnn/src/layers/range_layer.cpp b/modules/dnn/src/layers/range_layer.cpp
new file mode 100644
index 0000000000..01849a1611
--- /dev/null
+++ b/modules/dnn/src/layers/range_layer.cpp
@@ -0,0 +1,224 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    Range layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__Range.html
+
+    Opset's 11 to 11 are covered.
+*/
+
+static int rangeSize(double start, double limit, double delta)
+{
+    return std::max((int)ceil((limit - start)/delta), 0);
+}
+
+static int rangeSize(int64_t start, int64_t limit, int64_t delta)
+{
+    return delta > 0 ?
+        std::max((int)((limit - start + delta - 1)/delta), 0) :
+        std::max((int)((start - limit - delta - 1)/-delta), 0);
+}
+
+// out must be pre-allocated
+template <typename _Tp>
+static void makeRange(_Tp start, _Tp limit, _Tp delta, Mat& out)
+{
+    int nout = rangeSize(start, limit, delta);
+    CV_Assert(out.dims == 1);
+    CV_Assert(out.total() == (size_t)nout);
+    uchar* outdata_ = out.data;
+
+    int type = out.type();
+
+    #undef IMPL_RANGE
+    #define IMPL_RANGE(T) \
+        T* outdata = (T*)outdata_; \
+        for (int i = 0; i < nout; i++) \
+            outdata[i] = saturate_cast<T>(start + i*delta)
+
+    if (type == CV_32F) {
+        IMPL_RANGE(float);
+    } else if (type == CV_64F) {
+        IMPL_RANGE(double);
+    } else if (type == CV_32S) {
+        IMPL_RANGE(int32_t);
+    } else if (type == CV_64S) {
+        IMPL_RANGE(int64_t);
+    } else {
+        CV_Error_(Error::StsNotImplemented, ("invalid/unsupported tensor type: %s", typeToString(out.type()).c_str()));
+    }
+}
+
+class RangeLayerImpl CV_FINAL : public RangeLayer
+{
+public:
+    RangeLayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    virtual bool dynamicOutputShapes() const CV_OVERRIDE
+    {
+        Net::Impl* netimpl_ = getNetImpl(this);
+        CV_Assert(netimpl_);
+        CV_Assert(this->inputs.size() == 3);
+        return  !netimpl_->isConstArg(this->inputs[0]) ||
+                !netimpl_->isConstArg(this->inputs[1]) ||
+                !netimpl_->isConstArg(this->inputs[2]);
+    }
+
+    int getRangeParams(const Mat& startTensor, const Mat& limitTensor, const Mat& deltaTensor,
+                       double& fstart, double& flimit, double& fdelta,
+                       int64_t& istart, int64_t& ilimit, int64_t& idelta, bool& isflt) const
+    {
+        CV_Assert(startTensor.total() == (size_t)1);
+        CV_Assert(limitTensor.total() == (size_t)1);
+        CV_Assert(deltaTensor.total() == (size_t)1);
+
+        int rtype = startTensor.type();
+        CV_Assert(rtype == limitTensor.type());
+        CV_Assert(rtype == deltaTensor.type());
+
+        fstart = flimit = fdelta = 0.;
+        istart = ilimit = idelta = 0;
+
+        isflt = rtype == CV_32F || rtype == CV_64F || rtype == CV_16F || rtype == CV_16BF;
+
+        if (isflt) {
+            fstart = tensorToScalar<double>(startTensor);
+            flimit = tensorToScalar<double>(limitTensor);
+            fdelta = tensorToScalar<double>(deltaTensor);
+
+            return rangeSize(fstart, flimit, fdelta);
+        } else {
+            istart = tensorToScalar<int64_t>(startTensor);
+            ilimit = tensorToScalar<int64_t>(limitTensor);
+            idelta = tensorToScalar<int64_t>(deltaTensor);
+
+            return rangeSize(istart, ilimit, idelta);
+        }
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape>& inputs,
+                         const int,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        CV_Assert(!dynamicOutputShapes());
+
+        CV_Assert(inputs.size() == (size_t)3);
+        CV_Assert(inputs.size() == this->inputs.size());
+        Net::Impl* netimpl_ = getNetImpl(this);
+
+        Mat startTensor = netimpl_->argTensor(this->inputs[0]);
+        Mat limitTensor = netimpl_->argTensor(this->inputs[1]);
+        Mat deltaTensor = netimpl_->argTensor(this->inputs[2]);
+
+        double fstart, flimit, fdelta;
+        int64_t istart, ilimit, idelta;
+        bool isflt;
+
+        int nout = getRangeParams(startTensor, limitTensor, deltaTensor,
+                                  fstart, flimit, fdelta, istart, ilimit, idelta, isflt);
+        MatShape shape(1);
+        shape[0] = nout;
+        outputs.assign(1, shape);
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == (size_t)3);
+        CV_Assert(inputs[0] == inputs[1]);
+        CV_Assert(inputs[0] == inputs[2]);
+        outputs.assign(requiredOutputs, inputs[0]);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+        CV_Assert(ninputs == 3);
+
+        Mat startTensor = inputs_arr.getMat(0);
+        Mat limitTensor = inputs_arr.getMat(1);
+        Mat deltaTensor = inputs_arr.getMat(2);
+
+        double fstart, flimit, fdelta;
+        int64_t istart, ilimit, idelta;
+        bool isflt;
+
+        int nout = getRangeParams(startTensor, limitTensor, deltaTensor,
+                                  fstart, flimit, fdelta, istart, ilimit, idelta, isflt);
+        MatShape shape(1);
+        shape[0] = nout;
+
+        int rtype = startTensor.type();
+
+        auto kind = outputs_arr.kind();
+        if (kind == _InputArray::STD_VECTOR_MAT) {
+            std::vector<Mat>& outs = outputs_arr.getMatVecRef();
+            outs.resize(1);
+            outs[0].fit(shape, rtype);
+            if (isflt) {
+                makeRange(fstart, flimit, fdelta, outs[0]);
+            } else {
+                makeRange(istart, ilimit, idelta, outs[0]);
+            }
+        } else if (kind == _InputArray::STD_VECTOR_UMAT) {
+            std::vector<UMat>& outs = outputs_arr.getUMatVecRef();
+            outs.resize(1);
+            outs[0].fit(shape, rtype);
+            Mat temp(shape, rtype);
+            if (isflt) {
+                makeRange(fstart, flimit, fdelta, temp);
+            } else {
+                makeRange(istart, ilimit, idelta, temp);
+            }
+            temp.copyTo(outs[0]);
+        } else {
+            CV_Error(Error::StsNotImplemented, "");
+        }
+    }
+};
+
+Ptr<RangeLayer> RangeLayer::create(const LayerParams& params)
+{
+    return Ptr<RangeLayer>(new RangeLayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/reshape2_layer.cpp b/modules/dnn/src/layers/reshape2_layer.cpp
new file mode 100644
index 0000000000..081d9a739e
--- /dev/null
+++ b/modules/dnn/src/layers/reshape2_layer.cpp
@@ -0,0 +1,190 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+//#include "../op_cuda.hpp"
+//#include "../op_inf_engine.hpp"
+//#include "../ie_ngraph.hpp"
+//#include "../op_webnn.hpp"
+//#include "../op_timvx.hpp"
+//#include "../op_cann.hpp"
+
+//#include <opencv2/dnn/shape_utils.hpp>
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    Reshape2 layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__Reshape.html
+
+    Opset's 1 to 23 are covered.
+
+    The layers Flatten, Reshape2, Squeeze and Unsqueeze all share the same
+    implementation idea:
+    1. calculate shape of the output tensor
+    2. assuming that the input is continuous, just copy all the data to output tensor.
+       reshapeAndCopyFirst() does that.
+       The engine buffer allocator recognizes all these operations and tries to run
+       them in-place. In such a case no copy operation is actually done,
+       so the operations are really cheap.
+*/
+
+class Reshape2LayerImpl CV_FINAL : public Reshape2Layer
+{
+public:
+    bool dynamicShapeSpec;
+
+    Reshape2LayerImpl(const LayerParams& params)
+    {
+        dynamicShapeSpec = true;
+        setParamsFrom(params);
+        if (params.has("shape"))
+        {
+            dynamicShapeSpec = false;
+
+            const DictValue& shapeParam = params.get("shape");
+            int i, ndims = shapeParam.size();
+            newShapeDesc.resize(ndims);
+            for (i = 0; i < ndims; i++) {
+                int sz = shapeParam.get<int>(i);
+                if (sz <= 0)
+                    dynamicShapeSpec = true;
+                newShapeDesc[i] = sz;
+            }
+        }
+    }
+
+    virtual bool dynamicOutputShapes() const CV_OVERRIDE
+    {
+        // [TODO] fix. If the 'shape' spec is attribute,
+        // or if shape is a constant 2nd input of the layer,
+        // then the output shape can be inferred from the input tensor shape.
+        // That is, dynamicShapeSpec is not quite incorrect.
+        return dynamicShapeSpec;
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    bool haveShapeSpec() const
+    {
+        return newShapeDesc.dims >= 0;
+    }
+
+    MatShape getOutShape(const MatShape& inpShape, MatShape& shapeSpec) const
+    {
+        MatShape outShape = shapeSpec;
+        int m1idx = -1;
+        int i, ndims = outShape.dims;
+        int64_t outTotal = 1;
+        for (i = 0; i < ndims; i++) {
+            if (outShape[i] < 0) {
+                CV_Assert(outShape[i] == -1);
+                if (m1idx >= 0) {
+                    CV_Error(Error::StsBadArg, "invalid shape spec, there must be at most one '-1'");
+                }
+                m1idx = i;
+            }
+            else {
+                if (outShape[i] == 0) {
+                    if (i >= inpShape.dims) {
+                        CV_Error(Error::StsBadArg, "cannot copy dimension from the input tensor");
+                    }
+                    outShape[i] = inpShape[i];
+                }
+                outTotal *= outShape[i];
+            }
+        }
+
+        int64_t inpTotal = (int64_t)inpShape.total();
+        if (m1idx >= 0) {
+            int64_t autoSize = inpTotal/outTotal;
+            CV_Assert(autoSize <= INT_MAX && autoSize*outTotal == inpTotal);
+            outShape[m1idx] = (int)autoSize;
+        } else {
+            CV_Assert(outTotal == inpTotal);
+        }
+
+        return outShape;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape> &inputs,
+                         const int,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        bool haveShapeSpec_ = haveShapeSpec();
+        CV_Assert((inputs.size() == 1 && haveShapeSpec_) ||
+                  (inputs.size() == 2 && !haveShapeSpec_));
+        MatShape shapeSpec = newShapeDesc, outShape;
+
+        if (inputs.size() == 2)
+        {
+            CV_Assert(this->inputs.size() == 2);
+            Net::Impl* netimpl_ = getNetImpl(this);
+            Mat shapeTensor = netimpl_->argTensor(this->inputs[1]);
+            shapeSpec = tensorToShape(shapeTensor);
+        } else {
+            CV_Assert(shapeSpec.dims >= 0);
+        }
+        outputs.assign(1, getOutShape(inputs[0], shapeSpec));
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == 1 || ninputs == 2);
+        outputs.assign(requiredOutputs, inputs[0]);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+        bool haveShapeSpec_ = haveShapeSpec();
+        CV_Assert((ninputs == 1 && haveShapeSpec_) ||
+                  (ninputs == 2 && !haveShapeSpec_));
+
+        MatShape inpShape = inputs_arr.shape(0);
+        MatShape shapeSpec = newShapeDesc;
+        if (!haveShapeSpec_) {
+            Mat shapeTensor = inputs_arr.getMat(1);
+            shapeSpec = tensorToShape(shapeTensor);
+        }
+        MatShape outShape = getOutShape(inpShape, shapeSpec);
+        reshapeAndCopyFirst(inputs_arr, outputs_arr, outShape);
+    }
+};
+
+Ptr<Reshape2Layer> Reshape2Layer::create(const LayerParams& params)
+{
+    return Ptr<Reshape2Layer>(new Reshape2LayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/resize_layer.cpp b/modules/dnn/src/layers/resize_layer.cpp
index f52e153b37..219e282aa4 100644
--- a/modules/dnn/src/layers/resize_layer.cpp
+++ b/modules/dnn/src/layers/resize_layer.cpp
@@ -9,6 +9,7 @@
 #include "../op_cuda.hpp"
 #include "../op_inf_engine.hpp"
 #include "../op_cann.hpp"
+#include "../net_impl.hpp"
 #include <opencv2/imgproc.hpp>
 
 #ifdef HAVE_DNN_NGRAPH
@@ -26,13 +27,14 @@ namespace cv { namespace dnn {
 class ResizeLayerImpl : public ResizeLayer
 {
 public:
+    int outWidth0, outHeight0;
     ResizeLayerImpl(const LayerParams& params) : zoomFactorWidth(params.get<float>("zoom_factor_x", params.get<float>("zoom_factor", 0))),
                                                  zoomFactorHeight(params.get<float>("zoom_factor_y", params.get<float>("zoom_factor", 0))),
                                                  scaleWidth(0), scaleHeight(0)
     {
         setParamsFrom(params);
-        outWidth = params.get<float>("width", 0);
-        outHeight = params.get<float>("height", 0);
+        outWidth = outWidth0 = params.get<float>("width", 0);
+        outHeight = outHeight0 = params.get<float>("height", 0);
         if (params.has("zoom_factor"))
         {
             CV_Assert(!params.has("zoom_factor_x") && !params.has("zoom_factor_y"));
@@ -50,20 +52,65 @@ public:
             halfPixelCenters = true;
     }
 
+    bool dynamicOutputShapes() const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        if (ninputs <= 1 &&
+            ((outWidth0 > 0 && outHeight0 > 0) ||
+            (zoomFactorWidth > 0 && zoomFactorHeight > 0)))
+            return false;
+        Net::Impl* netimpl_ = getNetImpl(this);
+        if (!netimpl_)
+            return true;
+        for (size_t i = 1; i < ninputs; i++) {
+            if (!netimpl_->isConstArg(inputs[i]))
+                return true;
+        }
+        return false;
+    }
+
+    MatShape getOutShape(const MatShape& inpShape, const std::vector<int>& sizes,
+                         const std::vector<float>& scales) const
+    {
+        CV_Assert((sizes.size() == 4 && scales.empty()) ||
+                  (scales.size() == 4 && sizes.empty()));
+        MatShape outShape = inpShape;
+        if (!sizes.empty()) {
+            outShape[2] = sizes[2];
+            outShape[3] = sizes[3];
+        } else {
+            outShape[2] = (float)(inpShape[2]*scales[2]);
+            outShape[3] = (float)(inpShape[3]*scales[3]);
+        }
+        return outShape;
+    }
+
     bool getMemoryShapes(const std::vector<MatShape> &inputs,
                          const int requiredOutputs,
                          std::vector<MatShape> &outputs,
                          std::vector<MatShape> &internals) const CV_OVERRIDE
     {
-        CV_Assert_N(inputs.size() == 1 || inputs.size() == 2, inputs[0].size() == 4);
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == 1 || ninputs == 2 || ninputs >= 4);
         outputs.resize(1, inputs[0]);
-        if (inputs.size() == 1) {
-            outputs[0][2] = zoomFactorHeight > 0 ? (outputs[0][2] * zoomFactorHeight) : outHeight;
-            outputs[0][3] = zoomFactorWidth > 0 ? (outputs[0][3] * zoomFactorWidth) : outWidth;
-        } else {
-            CV_CheckGE(inputs[1].size(), (size_t)4, "");
+        if (ninputs == 1) {
+            outputs[0][2] = zoomFactorHeight > 0 ? (int)(inputs[0][2] * zoomFactorHeight) : outHeight0;
+            outputs[0][3] = zoomFactorWidth > 0 ? (int)(inputs[0][3] * zoomFactorWidth) : outWidth0;
+        } else if (ninputs == 2 && inputs[1].dims == 4) {
+            // [TODO] this workaround needs to be removed
             outputs[0][2] = inputs[1][2];
             outputs[0][3] = inputs[1][3];
+        } else {
+            Net::Impl* netimpl_ = getNetImpl(this);
+            std::vector<int> sizes;
+            std::vector<float> scales;
+            if (ninputs >= 4) {
+                Mat sizesTensor = netimpl_->argTensor(this->inputs[3]);
+                tensorToIntVec(sizesTensor, sizes);
+            }
+            Mat scalesTensor = netimpl_->argTensor(this->inputs[ninputs >= 4 ? 2 : 1]);
+            tensorToFloatVec(scalesTensor, scales);
+            outputs[0] = getOutShape(inputs[0], sizes, scales);
         }
         // We can work in-place (do nothing) if input shape == output shape.
         return (outputs[0][2] == inputs[0][2]) && (outputs[0][3] == inputs[0][3]);
@@ -87,59 +134,117 @@ public:
         return backendId == DNN_BACKEND_OPENCV;
     }
 
-    virtual void finalize(InputArrayOfArrays inputs_arr, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    void updateOutSizeAndScale(const MatShape& inpShape, const MatShape& outShape)
     {
-        std::vector<Mat> inputs, outputs;
-        inputs_arr.getMatVector(inputs);
-        outputs_arr.getMatVector(outputs);
-
-        outHeight = outputs[0].size[2];
-        outWidth = outputs[0].size[3];
+        CV_Assert(outShape.dims == 4);
+        outHeight = outShape[2];
+        outWidth = outShape[3];
         if (alignCorners && outHeight > 1)
-            scaleHeight = static_cast<float>(inputs[0].size[2] - 1) / (outHeight - 1);
+            scaleHeight = float(inpShape[2] - 1) / (outHeight - 1);
         else
-            scaleHeight = static_cast<float>(inputs[0].size[2]) / outHeight;
+            scaleHeight = float(inpShape[2]) / outHeight;
 
         if (alignCorners && outWidth > 1)
-            scaleWidth = static_cast<float>(inputs[0].size[3] - 1) / (outWidth - 1);
+            scaleWidth = float(inpShape[3] - 1) / (outWidth - 1);
         else
-            scaleWidth = static_cast<float>(inputs[0].size[3]) / outWidth;
+            scaleWidth = float(inpShape[3]) / outWidth;
+    }
+
+    virtual void finalize(InputArrayOfArrays inputs_arr, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+        if (!dynamicOutputShapes())
+        {
+            MatShape inpShape = inputs_arr.shape(0);
+            MatShape outShape = outputs_arr.shape(0);
+            updateOutSizeAndScale(inpShape, outShape);
+        }
     }
 
-    void forward(InputArrayOfArrays inputs_arr, OutputArrayOfArrays outputs_arr, OutputArrayOfArrays internals_arr) CV_OVERRIDE
+    void forward(InputArrayOfArrays inputs_arr, OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays internals_arr) CV_OVERRIDE
     {
         CV_TRACE_FUNCTION();
         CV_TRACE_ARG_VALUE(name, "name", name.c_str());
 
-        if (inputs_arr.depth() == CV_16F)
-        {
-            forward_fallback(inputs_arr, outputs_arr, internals_arr);
-            return;
+        std::vector<Mat> inputs;
+        inputs_arr.getMatVector(inputs);
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs > 0);
+
+        Mat& inp_ = inputs[0];
+
+        MatShape inpShape = inp_.shape();
+        MatShape outShape;
+
+        if (ninputs == 1) {
+            outShape = inpShape;
+            outShape[2] = zoomFactorHeight > 0 ? (int)(inpShape[2] * zoomFactorHeight) : outHeight0;
+            outShape[3] = zoomFactorWidth > 0 ? (int)(inpShape[3] * zoomFactorWidth) : outWidth0;
+        } else if (ninputs == 2 && inputs[0].dims == 4 && inputs[1].dims == 4) {
+            outShape = inpShape;
+            outShape[2] = inputs[1].size[2];
+            outShape[3] = inputs[1].size[3];
+        } else {
+            std::vector<int> sizes;
+            std::vector<float> scales;
+            if (ninputs >= 4) {
+                Mat sizesTensor = inputs[3];
+                tensorToIntVec(sizesTensor, sizes);
+            }
+            Mat scalesTensor = inputs[ninputs >= 4 ? 2 : 1];
+            tensorToFloatVec(scalesTensor, scales);
+            outShape = getOutShape(inpShape, sizes, scales);
         }
 
-        std::vector<Mat> inputs, outputs, internals;
-        inputs_arr.getMatVector(inputs);
-        outputs_arr.getMatVector(outputs);
-        internals_arr.getMatVector(internals);
+        //printf("name: %s, outShape: %d x %d x %d x %d\n", name.c_str(), outShape[0], outShape[1], outShape[2], outShape[3]);
 
-        if (outHeight == inputs[0].size[2] && outWidth == inputs[0].size[3])
-        {
-            // outputs[0] = inputs[0] doesn't work due to BlobManager optimizations
-            if (inputs[0].data != outputs[0].data)
+        updateOutSizeAndScale(inpShape, outShape);
+
+        auto kind = outputs_arr.kind();
+        Mat out_;
+        UMat uout_;
+        if (kind == _InputArray::STD_VECTOR_MAT) {
+            std::vector<Mat>& outputs = outputs_arr.getMatVecRef();
+            outputs[0].fit(outShape, inp_.type());
+            out_ = outputs[0];
+
+            if (outShape == inpShape)
+            {
+                inp_.copyTo(out_);
+                return;
+            }
+        }
+        else {
+            CV_Assert(kind == _InputArray::STD_VECTOR_UMAT);
+            std::vector<UMat>& u_outputs = outputs_arr.getUMatVecRef();
+            u_outputs[0].fit(outShape, inp_.type());
+            uout_ = u_outputs[0];
+            if (outShape == inpShape)
             {
-                inputs[0].copyTo(outputs[0]);
+                inp_.copyTo(uout_);
+                return;
             }
-            return;
+            out_.create(outShape, inp_.type());
+        }
+
+        int depth = inp_.type(), orig_depth = depth;
+
+        Mat inp, out;
+        if (depth != CV_32F && depth != CV_8S) {
+            inp_.convertTo(inp, CV_32F);
+            out.fit(outShape, CV_32F);
+            depth = CV_32F;
+        } else {
+            inp = inp_;
+            out = out_;
         }
 
-        Mat& inp = inputs[0];
-        Mat& out = outputs[0];
-        int depth = inp.depth();
         if ((interpolation == "nearest" && !alignCorners && !halfPixelCenters) || (interpolation == "opencv_linear" && depth != CV_8S) ||
             (interpolation == "bilinear" && halfPixelCenters && depth != CV_8S))
         {
             // INTER_LINEAR Resize mode does not support INT8 inputs
             InterpolationFlags mode = interpolation == "nearest" ? INTER_NEAREST : INTER_LINEAR;
+            // [TODO] this is a really slow approach; need to rewrite it completely.
             for (size_t n = 0; n < inputs[0].size[0]; ++n)
             {
                 for (size_t ch = 0; ch < inputs[0].size[1]; ++ch)
@@ -305,6 +410,16 @@ public:
         }
         else
             CV_Error(Error::StsNotImplemented, "Unknown interpolation: " + interpolation);
+
+        if (orig_depth != depth) {
+            if (!uout_.empty())
+                out.convertTo(uout_, orig_depth);
+            else
+                out.convertTo(out_, orig_depth);
+        }
+        else if (!uout_.empty()) {
+            out.copyTo(uout_);
+        }
     }
 
 #ifdef HAVE_CANN
diff --git a/modules/dnn/src/layers/shape_layer.cpp b/modules/dnn/src/layers/shape_layer.cpp
new file mode 100644
index 0000000000..dcfcb35f0f
--- /dev/null
+++ b/modules/dnn/src/layers/shape_layer.cpp
@@ -0,0 +1,137 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+//#include "../op_cuda.hpp"
+//#include "../op_inf_engine.hpp"
+//#include "../ie_ngraph.hpp"
+//#include "../op_webnn.hpp"
+//#include "../op_timvx.hpp"
+//#include "../op_cann.hpp"
+
+//#include <opencv2/dnn/shape_utils.hpp>
+
+namespace cv
+{
+namespace dnn
+{
+
+class ShapeLayerImpl CV_FINAL : public ShapeLayer
+{
+public:
+    typedef int64_t shape_type_t;
+    int shapeType;
+
+    ShapeLayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+
+        start = params.get<int>("start", 0);
+        end = params.get<int>("end", INT_MAX);
+        shapeType = DataType<shape_type_t>::type;
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    Range getShapeRange(const MatShape& inpShape) const
+    {
+        int outDims = inpShape.dims;
+        int start_ = start < 0 ? start + outDims : start;
+        int end_ = end >= outDims ? outDims : end < 0 ? end + outDims : end;
+
+        CV_Assert(0 <= start_);
+        CV_Assert(start_ <= end_);
+        CV_Assert(end_ <= outDims);
+
+        return Range(start_, end_);
+    }
+
+    MatShape getOutShape(const MatShape& inpShape) const
+    {
+        MatShape outShape;
+        outShape.dims = 1;
+
+        Range r = getShapeRange(inpShape);
+
+        outShape[0] = r.end - r.start;
+        return outShape;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape> &inputs,
+                         const int requiredOutputs,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        CV_Assert(inputs.size() == 1);
+
+        outputs.assign(1, getOutShape(inputs[0]));
+        internals.clear();
+
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        CV_Assert(inputs.size() == 1);
+        outputs.assign(requiredOutputs, shapeType);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+        CV_Assert(ninputs == 1);
+
+        MatShape inpShape = inputs_arr.shape(0);
+        Range r = getShapeRange(inpShape);
+
+        shape_type_t shapeData[CV_MAX_DIM];
+        for (int i = r.start; i < r.end; i++)
+            shapeData[i] = (shape_type_t)inpShape[i];
+
+        Mat shape({r.end - r.start}, shapeType, shapeData);
+
+        int outKind = outputs_arr.kind();
+
+        if (outKind == _InputArray::STD_VECTOR_MAT) {
+            std::vector<Mat>& out = outputs_arr.getMatVecRef();
+            CV_Assert(out.size() == 1);
+            shape.copyTo(out[0]);
+        } else if (outKind == _InputArray::STD_VECTOR_UMAT) {
+            std::vector<UMat>& out = outputs_arr.getUMatVecRef();
+            CV_Assert(out.size() == 1);
+            shape.copyTo(out[0]);
+        } else {
+            CV_Error_(Error::StsBadArg, ("invalid/unsupported outputs_arr kind: %d", outKind));
+        }
+    }
+};
+
+Ptr<ShapeLayer> ShapeLayer::create(const LayerParams& params)
+{
+    return Ptr<ShapeLayer>(new ShapeLayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/slice2_layer.cpp b/modules/dnn/src/layers/slice2_layer.cpp
new file mode 100644
index 0000000000..cd3a460bef
--- /dev/null
+++ b/modules/dnn/src/layers/slice2_layer.cpp
@@ -0,0 +1,359 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+//#include "../op_cuda.hpp"
+//#include "../op_inf_engine.hpp"
+//#include "../ie_ngraph.hpp"
+//#include "../op_webnn.hpp"
+//#include "../op_timvx.hpp"
+//#include "../op_cann.hpp"
+
+//#include <opencv2/dnn/shape_utils.hpp>
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    Slice2 layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__Slice2.html
+
+    Opset's 1 to 13 are covered.
+*/
+
+/* Slice op for CPU.
+   starts_, ends_ and steps_ must contain as many elements as
+   the dimensionality in inp and out.
+*/
+static void slice(const Mat& inp, const int* starts_,
+                  const int*, const int* steps_,
+                  Mat& out)
+{
+    /// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+    /// in this function steps can be negative, so
+    /// please don't replace int64_t's with size_t's
+    /// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+    enum {SLICE_MAX_DIMS=7};
+
+    CV_Assert_N(inp.isContinuous(), out.isContinuous());
+    CV_Assert(inp.type() == out.type());
+    CV_Assert_N(inp.dims <= SLICE_MAX_DIMS, inp.dims == out.dims);
+
+    MatShape inpShape = inp.shape();
+    MatShape outShape = out.shape();
+    int64_t esz = (int64_t)inp.elemSize();
+
+    int ndims = inpShape.dims;
+    int starts[SLICE_MAX_DIMS], steps[SLICE_MAX_DIMS];
+    int inpsz[SLICE_MAX_DIMS], outsz[SLICE_MAX_DIMS];
+    int64_t inpstep[SLICE_MAX_DIMS];
+
+    int delta = SLICE_MAX_DIMS - ndims;
+    bool emptyOut = false;
+
+    for (int i = 0; i < SLICE_MAX_DIMS; i++) {
+        inpsz[i] = outsz[i] = steps[i] = 1;
+        starts[i] = 0;
+    }
+
+    for (int i = 0; i < ndims; i++) {
+        inpsz[delta + i] = inpShape[i];
+        outsz[delta + i] = outShape[i];
+        starts[delta + i] = starts_[i];
+        steps[delta + i] = steps_[i];
+        if (outShape[i] == 0)
+            emptyOut = true;
+    }
+
+    for (int i = SLICE_MAX_DIMS-1; i >= 0; i--)
+        inpstep[i] = i == SLICE_MAX_DIMS-1 ? 1 : inpstep[i+1]*inpsz[i+1];
+
+    const uchar* inptr0 = inp.data;
+
+    for (int i = 0; i < SLICE_MAX_DIMS; i++) {
+        inptr0 += starts[i]*inpstep[i]*esz;
+        inpstep[i] *= steps[i];
+    }
+
+    int sz0 = outsz[6], sz1 = outsz[5];
+    int sz2 = outsz[4], sz3 = outsz[3];
+    int sz4 = outsz[2], sz5 = outsz[1], sz6 = outsz[0];
+    int64_t p0 = inpstep[6], p1 = inpstep[5];
+    int64_t p2 = inpstep[4], p3 = inpstep[3];
+    int64_t p4 = inpstep[2], p5 = inpstep[1], p6 = inpstep[0];
+
+    #undef CV_IMPLEMENT_SLICE
+    #define CV_IMPLEMENT_SLICE(typ) \
+        typ* outptr = (typ*)(out.data); \
+        for(int i6 = 0; i6 < sz6; i6++) { \
+        for(int i5 = 0; i5 < sz5; i5++) { \
+        for(int i4 = 0; i4 < sz4; i4++) { \
+        for(int i3 = 0; i3 < sz3; i3++) { \
+        for(int i2 = 0; i2 < sz2; i2++) { \
+        for(int i1 = 0; i1 < sz1; i1++, outptr += sz0) { \
+            const typ* inptr = (const typ*)inptr0 + i6*p6 + \
+                    i5*p5 + i4*p4 + i3*p3 + i2*p2 + i1*p1; \
+            int i0 = 0; \
+            if (p0 == 1) { \
+                for (; i0 < sz0; i0++) \
+                    outptr[i0] = inptr[i0]; \
+            } \
+            else { \
+                for (; i0 <= sz0 - 4; i0 += 4) { \
+                    int64_t ip0 = i0*p0; \
+                    typ t0 = inptr[ip0], t1 = inptr[ip0 + p0]; \
+                    typ t2 = inptr[ip0 + p0*2], t3 = inptr[ip0 + p0*3]; \
+                    outptr[i0] = t0; outptr[i0+1] = t1; \
+                    outptr[i0+2] = t2; outptr[i0+3] = t3; \
+                } \
+                for (; i0 < sz0; i0++) \
+                    outptr[i0] = inptr[i0*p0]; \
+            } \
+        }}}}}}
+
+    if (emptyOut) return;
+    if (esz == 4) {
+        CV_IMPLEMENT_SLICE(int)
+    } else if (esz == 2) {
+        CV_IMPLEMENT_SLICE(int16_t)
+    } else if (esz == 1) {
+        CV_IMPLEMENT_SLICE(int8_t)
+    } else if (esz == 8) {
+        CV_IMPLEMENT_SLICE(int64_t)
+    } else {
+        CV_Error(Error::StsNotImplemented, "");
+    }
+}
+
+class Slice2LayerImpl CV_FINAL : public Slice2Layer
+{
+public:
+    Slice2LayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+        axes = params.getVector<int>("axes");
+        starts = params.getVector<int>("starts");
+        ends = params.getVector<int>("ends");
+    }
+
+    void checkNumInputs(size_t ninputs) const
+    {
+        CV_Assert(ninputs == 1 || (3 <= ninputs && ninputs <= 5));
+    }
+
+    virtual bool dynamicOutputShapes() const CV_OVERRIDE
+    {
+        Net::Impl* netimpl_ = getNetImpl(this);
+        size_t ninputs = inputs.size();
+
+        for (size_t i = 1; i < ninputs; i++) {
+            if (!netimpl_->isConstArg(inputs[i]))
+                return true;
+        }
+        return false;
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    MatShape getOutShape(const MatShape& inpShape,
+                         const std::vector<int>& starts_,
+                         const std::vector<int>& ends_,
+                         const std::vector<int>& axes_,
+                         const std::vector<int>& steps_,
+                         int* allStarts = nullptr,
+                         int* allEnds = nullptr,
+                         int* allSteps = nullptr) const
+    {
+        bool sliceMask[MatShape::MAX_DIMS];
+
+        int ndims = inpShape.dims;
+        int nstarts = (int)starts_.size(), nends = (int)ends_.size();
+        int naxes = (int)axes_.size(), nsteps = (int)steps_.size();
+
+        CV_Assert_N(nstarts > 0, nstarts <= ndims, nstarts == nends);
+        CV_Assert(naxes == 0 || naxes == nstarts);
+        CV_Assert(nsteps == 0 || nsteps == nstarts);
+
+        MatShape outShape = inpShape;
+
+        for (int i = 0; i < ndims; i++) {
+            sliceMask[i] = false;
+            if (allStarts)
+                allStarts[i] = 0;
+            if (allEnds)
+                allEnds[i] = inpShape[i];
+            if (allSteps)
+                allSteps[i] = 1;
+        }
+
+        for (int i = 0; i < nstarts; i++) {
+            int axis = i;
+            if (!axes_.empty()) {
+                axis = axes_[i];
+                axis = normalize_axis(axis, ndims);
+                if (sliceMask[axis]) {
+                    CV_Error(Error::StsBadArg, "duplicate axis occurs in Slice");
+                }
+            }
+            sliceMask[axis] = true;
+            int inpsz = inpShape[axis];
+            int start = starts_[i];
+            int end = ends_[i];
+            int step = 1;
+            if (!steps_.empty())
+                step = steps_[i];
+            CV_Assert(step != 0);
+            start = start < 0 ? std::max(start + inpsz, 0) :
+                                std::min(start, inpsz - (step < 0));
+            end = end < 0 ? std::max(end + inpsz, -(step < 0)) :
+                            std::min(end, inpsz);
+            if (allStarts)
+                allStarts[axis] = start;
+            if (allEnds)
+                allEnds[axis] = end;
+            if (allSteps)
+                allSteps[axis] = step;
+            int outsz = step > 0 ? (end - start + step-1)/step :
+                                   (start - end - step-1)/(-step);
+            CV_Assert(outsz >= 0);
+            outShape[axis] = outsz;
+        }
+
+        return outShape;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape> &inputs,
+                         const int,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        checkNumInputs(ninputs);
+        std::vector<int> tempStarts, tempEnds, tempAxes, steps;
+        const std::vector<int> *starts_ = &starts, *ends_ = &ends, *axes_ = &axes;
+
+        if (ninputs > 1) {
+            Net::Impl* netimpl_ = getNetImpl(this);
+            Mat startsTensor = netimpl_->argTensor(this->inputs[1]);
+            tensorToIntVec(startsTensor, tempStarts);
+            starts_ = &tempStarts;
+            Mat endsTensor = netimpl_->argTensor(this->inputs[2]);
+            tensorToIntVec(endsTensor, tempEnds);
+            ends_ = &tempEnds;
+            if (ninputs > 3) {
+                Mat axesTensor = netimpl_->argTensor(this->inputs[3]);
+                tensorToIntVec(axesTensor, tempAxes);
+                axes_ = &tempAxes;
+            }
+            if (ninputs > 4) {
+                Mat stepsTensor = netimpl_->argTensor(this->inputs[4]);
+                tensorToIntVec(stepsTensor, steps);
+            }
+        }
+        MatShape outShape = getOutShape(inputs[0], *starts_, *ends_, *axes_, steps);
+        outputs.assign(1, outShape);
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        checkNumInputs(ninputs);
+        outputs.assign(requiredOutputs, inputs[0]);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+        checkNumInputs(ninputs);
+
+        int inpType = inputs_arr.type(0);
+        MatShape inpShape = inputs_arr.shape(0);
+        std::vector<int> tempStarts, tempEnds, tempAxes, steps;
+        const std::vector<int> *starts_ = &starts, *ends_ = &ends, *axes_ = &axes;
+
+        if (ninputs > 1) {
+            Mat startsTensor = inputs_arr.getMat(1);
+            tensorToIntVec(startsTensor, tempStarts);
+            starts_ = &tempStarts;
+            Mat endsTensor = inputs_arr.getMat(2);
+            tensorToIntVec(endsTensor, tempEnds);
+            ends_ = &tempEnds;
+            if (ninputs > 3) {
+                Mat axesTensor = inputs_arr.getMat(3);
+                tensorToIntVec(axesTensor, tempAxes);
+                axes_ = &tempAxes;
+            }
+            if (ninputs > 4) {
+                Mat stepsTensor = inputs_arr.getMat(4);
+                tensorToIntVec(stepsTensor, steps);
+            }
+        }
+        int allStarts[MatShape::MAX_DIMS];
+        int allEnds[MatShape::MAX_DIMS];
+        int allSteps[MatShape::MAX_DIMS];
+        MatShape outShape = getOutShape(inpShape, *starts_, *ends_, *axes_, steps,
+                                        allStarts, allEnds, allSteps);
+
+        int outKind = outputs_arr.kind();
+
+        CV_Assert(outKind == _InputArray::STD_VECTOR_MAT ||
+                  outKind == _InputArray::STD_VECTOR_UMAT);
+
+        if (outKind == _InputArray::STD_VECTOR_MAT) {
+            Mat inp = inputs_arr.getMat(0);
+            std::vector<Mat>& outs = outputs_arr.getMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outShape, inpType);
+            runOp(inp, allStarts, allEnds, allSteps, outs[0]);
+        } else {
+            // [TODO] more efficient OpenCL implementation
+            Mat inp = inputs_arr.getMat(0);
+            std::vector<UMat>& outs = outputs_arr.getUMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outShape, inpType);
+            Mat temp(outShape, inpType);
+            runOp(inp, allStarts, allEnds, allSteps, temp);
+            temp.copyTo(outs[0]);
+        }
+    }
+
+    void runOp(const Mat& inp, const int* starts_,
+               const int* ends_, const int* steps_, Mat& out)
+    {
+        slice(inp, starts_, ends_, steps_, out);
+    }
+};
+
+Ptr<Slice2Layer> Slice2Layer::create(const LayerParams& params)
+{
+    return Ptr<Slice2Layer>(new Slice2LayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/softmax_layer.cpp b/modules/dnn/src/layers/softmax_layer.cpp
index 63a49f6233..2bce8d60f8 100644
--- a/modules/dnn/src/layers/softmax_layer.cpp
+++ b/modules/dnn/src/layers/softmax_layer.cpp
@@ -91,9 +91,10 @@ public:
     {
         bool inplace = Layer::getMemoryShapes(inputs, requiredOutputs, outputs, internals);
         MatShape shape = inputs[0];
-        int cAxis = normalize_axis(axisRaw, shape.size());
-        if (!shape.empty())
+        if (shape.dims > 0) {
+            int cAxis = normalize_axis(axisRaw, shape.dims);
             shape[cAxis] = 1;
+        }
         internals.assign(1, shape);
         return inplace;
     }
diff --git a/modules/dnn/src/layers/split2_layer.cpp b/modules/dnn/src/layers/split2_layer.cpp
new file mode 100644
index 0000000000..cbe7489004
--- /dev/null
+++ b/modules/dnn/src/layers/split2_layer.cpp
@@ -0,0 +1,266 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+//#include "../op_cuda.hpp"
+//#include "../op_inf_engine.hpp"
+//#include "../ie_ngraph.hpp"
+//#include "../op_webnn.hpp"
+//#include "../op_timvx.hpp"
+//#include "../op_cann.hpp"
+
+//#include <opencv2/dnn/shape_utils.hpp>
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    Split2 layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__Split2.html
+
+    Opset's 1 to 13 are covered.
+*/
+
+// all outputs must be pre-allocated.
+// axis must be normalized
+static void split(const Mat& inp, std::vector<Mat>& outs, int axis)
+{
+    CV_Assert(inp.isContinuous());
+
+    MatShape inpShape = inp.shape();
+    int ndims = inpShape.dims;
+
+    CV_Assert_N(0 <= axis, axis <= inp.dims);
+
+    int nslices = 1;
+    int inpType = inp.type();
+    size_t esz = inp.elemSize();
+    size_t sliceSize = esz;
+    size_t inpStep = 0;
+    size_t totalSize = inp.total()*esz;
+    int outSize_a = 0;
+    for (int i = ndims-1; i > axis; i--)
+        sliceSize *= inpShape[i];
+    inpStep = sliceSize*inpShape[axis];
+    for (int i = 0; i < axis; i++)
+        nslices *= inpShape[i];
+
+    size_t noutputs = outs.size();
+    for (size_t k = 0; k < noutputs; k++) {
+        Mat& out = outs[k];
+        MatShape outShape = out.shape();
+        CV_Assert(out.isContinuous());
+        CV_Assert(out.type() == inpType);
+        CV_Assert(out.dims == ndims);
+        for (int i = 0; i < ndims; i++) {
+            if (i == axis)
+                outSize_a += outShape[i];
+            else {
+                CV_Assert(inpShape[i] == outShape[i]);
+            }
+        }
+    }
+
+    CV_Assert(outSize_a == inpShape[axis]);
+
+    parallel_for_(Range(0, (int)noutputs), [&](const Range& r) {
+        for (int k = r.start; k < r.end; k++) {
+            const uchar* inptr = inp.data;
+            Mat& out_k = outs[k];
+            uchar* outptr_k = out_k.data;
+            int sz_a;
+            for (int i = 0; i < k; i++) {
+                sz_a = outs[i].size[axis];
+                inptr += sliceSize*sz_a;
+            }
+            sz_a = out_k.size[axis];
+            size_t sliceSize_k = sliceSize*sz_a;
+            for (int i = 0; i < nslices; i++)
+                memcpy(outptr_k + i*sliceSize_k, inptr + i*inpStep, sliceSize_k);
+        }
+    }, (totalSize > 1000000 ? noutputs : 1));
+}
+
+class Split2LayerImpl CV_FINAL : public Split2Layer
+{
+public:
+    Split2LayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+        axis = params.get<int>("axis", 1);
+        split = params.getVector<int>("split");
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    void getOutShapes(const MatShape& inpShape, int axis_,
+                      const std::vector<int>& split,
+                      std::vector<MatShape>& outShapes) const
+    {
+        size_t noutputs = split.size();
+        CV_Assert(noutputs == outputs.size());
+
+        int inpDims = inpShape.dims;
+        CV_Assert(0 <= axis_ && axis_ < inpDims);
+        int totalSize_a = 0;
+
+        outShapes.resize(noutputs);
+        for (size_t i = 0; i < noutputs; i++) {
+            MatShape outShape = inpShape;
+            int s = split[i];
+            CV_Assert(s >= 0);
+            CV_Assert(s <= inpShape[axis_] - totalSize_a);
+            outShape[axis_] = s;
+            outShapes[i] = outShape;
+            totalSize_a += s;
+        }
+    }
+
+    void makeDefaultSplit(int totalSize, size_t noutputs, std::vector<int>& split_) const
+    {
+        split_.resize(noutputs);
+        int chunkSize = (int)((totalSize + noutputs - 1) / noutputs);
+        for (size_t i = 0; i < noutputs; i++) {
+            int sz_i = std::min(totalSize, chunkSize);
+            split_[i] = sz_i;
+            totalSize -= sz_i;
+        }
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape> &inputs,
+                         const int noutputs,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        CV_Assert(noutputs == (int)this->outputs.size());
+
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == 1 || ninputs == 2);
+
+        MatShape inpShape = inputs[0];
+        std::vector<int> tempSplit;
+        const std::vector<int>* split_ = &split;
+        int axis_ = normalize_axis(axis, inpShape.dims);
+
+        if (ninputs == 2) {
+            Net::Impl* netimpl_ = getNetImpl(this);
+            Mat splitTensor = netimpl_->argTensor(this->inputs[1]);
+            tensorToIntVec(splitTensor, tempSplit);
+            split_ = &tempSplit;
+        }
+        else if (split.empty()) {
+            makeDefaultSplit(inpShape[axis_], noutputs, tempSplit);
+            split_ = &tempSplit;
+        }
+
+        getOutShapes(inputs[0], axis_, *split_, outputs);
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == 1 || ninputs == 2);
+        outputs.assign(requiredOutputs, inputs[0]);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+        int noutputs = (int)outputs.size();
+
+        CV_Assert(ninputs == 1 || ninputs == 2);
+
+        int inpType = inputs_arr.type(0);
+        MatShape inpShape = inputs_arr.shape(0);
+        std::vector<int> tempSplit;
+        const std::vector<int>* split_ = &split;
+        std::vector<MatShape> outShapes;
+
+        int axis_ = normalize_axis(axis, inpShape.dims);
+
+        if (ninputs == 2) {
+            Mat splitTensor = inputs_arr.getMat(1);
+            tensorToIntVec(splitTensor, tempSplit);
+            split_ = &tempSplit;
+        }
+        else if (split.empty()) {
+            makeDefaultSplit(inpShape[axis_], noutputs, tempSplit);
+            split_ = &tempSplit;
+        }
+        getOutShapes(inpShape, axis_, *split_, outShapes);
+        CV_Assert(outShapes.size() == (size_t)noutputs);
+
+        int outKind = outputs_arr.kind();
+
+        CV_Assert(outKind == _InputArray::STD_VECTOR_MAT ||
+                  outKind == _InputArray::STD_VECTOR_UMAT);
+
+        if (outKind == _InputArray::STD_VECTOR_MAT) {
+            Mat inp = inputs_arr.getMat(0);
+            std::vector<Mat>& outs = outputs_arr.getMatVecRef();
+            outs.resize(noutputs);
+            for (int i = 0; i < noutputs; i++) {
+                MatShape outShape = outShapes[i];
+                outs[i].fit(outShape, inpType);
+            }
+            runOp(inp, outs, axis_);
+        } else {
+            // [TODO] more efficient OpenCL implementation
+            Mat inp = inputs_arr.getMat(0);
+            std::vector<UMat>& outs = outputs_arr.getUMatVecRef();
+            outs.resize(noutputs);
+
+            std::vector<Mat> temps(noutputs);
+            for (int i = 0; i < noutputs; i++) {
+                MatShape outShape = outShapes[i];
+                temps[i].fit(outShape, inpType);
+            }
+            runOp(inp, temps, axis_);
+            for (int i = 0; i < noutputs; i++) {
+                MatShape outShape = outShapes[i];
+                outs[i].fit(outShape, inpType);
+                temps[i].copyTo(outs[i]);
+                temps[i].release();
+            }
+        }
+    }
+
+    void runOp(const Mat& inp, std::vector<Mat>& outs, int axis_)
+    {
+        cv::dnn::split(inp, outs, axis_);
+    }
+};
+
+Ptr<Split2Layer> Split2Layer::create(const LayerParams& params)
+{
+    return Ptr<Split2Layer>(new Split2LayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/squeeze_layer.cpp b/modules/dnn/src/layers/squeeze_layer.cpp
new file mode 100644
index 0000000000..b80587dcb4
--- /dev/null
+++ b/modules/dnn/src/layers/squeeze_layer.cpp
@@ -0,0 +1,159 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+//#include "../op_cuda.hpp"
+//#include "../op_inf_engine.hpp"
+//#include "../ie_ngraph.hpp"
+//#include "../op_webnn.hpp"
+//#include "../op_timvx.hpp"
+//#include "../op_cann.hpp"
+
+//#include <opencv2/dnn/shape_utils.hpp>
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    Squeeze layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__Squeeze.html
+
+    Opset's 1 to 13 are covered.
+
+    See description in reshape2_layer.cpp
+    for more some common implementation details.
+*/
+class SqueezeLayerImpl CV_FINAL : public SqueezeLayer
+{
+public:
+    SqueezeLayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+        axes = params.getVector<int>("axes");
+    }
+
+    virtual bool dynamicOutputShapes() const CV_OVERRIDE
+    {
+        Net::Impl* netimpl_ = getNetImpl(this);
+        return inputs.size() == 2 && !netimpl_->isConstArg(inputs[1]);
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    MatShape getOutShape(const MatShape& inpShape, const std::vector<int>& axes_) const
+    {
+        bool squeezeMask[MatShape::MAX_DIMS];
+
+        if (axes_.empty()) {
+            // remove all 1's
+            for (int i = 0; i < inpShape.dims; i++)
+                squeezeMask[i] = inpShape[i] == 1;
+        } else {
+            for (int i = 0; i < inpShape.dims; i++)
+                squeezeMask[i] = false;
+            for (int a: axes_) {
+                int a_ = normalize_axis(a, inpShape.dims);
+                if (squeezeMask[a_]) {
+                    CV_Error_(Error::StsBadArg, ("duplicate squeezed axis #%d", a));
+                }
+                if (inpShape[a_] != 1) {
+                    CV_Error_(Error::StsBadArg, ("squeezed axis #%d (== %d) != 1", a, inpShape[a_]));
+                }
+                squeezeMask[a_] = true;
+            }
+        }
+
+        MatShape outShape(inpShape.dims);
+        int j = 0;
+        for (int i = 0; i < inpShape.dims; i++) {
+            if (!squeezeMask[i])
+                outShape[j++] = inpShape[i];
+        }
+        outShape.dims = j;
+        return outShape;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape> &inputs,
+                         const int,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        CV_Assert(inputs.size() == 1 || inputs.size() == 2);
+        MatShape outShape;
+        std::vector<int> tempAxes;
+        const std::vector<int>* axes_ = &axes;
+
+        if (inputs.size() == 2)
+        {
+            CV_Assert(axes.empty()); // if we have a dedicated 'axes' input,
+                                     // we should not have 'axes' attribute at the same time
+            Net::Impl* netimpl_ = getNetImpl(this);
+            Mat axesTensor = netimpl_->argTensor(this->inputs[1]);
+            tensorToIntVec(axesTensor, tempAxes);
+            axes_ = &tempAxes;
+        }
+        outputs.assign(1, getOutShape(inputs[0], *axes_));
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == 1 || ninputs == 2);
+        outputs.assign(requiredOutputs, inputs[0]);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+        CV_Assert(ninputs == 1 || ninputs == 2);
+
+        MatShape inpShape = inputs_arr.shape(0);
+        std::vector<int> tempAxes;
+        const std::vector<int>* axes_ = &axes;
+
+        if (ninputs == 2)
+        {
+            CV_Assert(axes.empty()); // if we have a dedicated 'axes' input,
+                                     // we should not have 'axes' attribute at the same time
+            Mat axesTensor = inputs_arr.getMat(1);
+            tensorToIntVec(axesTensor, tempAxes);
+            axes_ = &tempAxes;
+        }
+        MatShape outShape = getOutShape(inpShape, *axes_);
+        reshapeAndCopyFirst(inputs_arr, outputs_arr, outShape);
+    }
+};
+
+Ptr<SqueezeLayer> SqueezeLayer::create(const LayerParams& params)
+{
+    return Ptr<SqueezeLayer>(new SqueezeLayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/tile2_layer.cpp b/modules/dnn/src/layers/tile2_layer.cpp
new file mode 100644
index 0000000000..d8e7f0e6be
--- /dev/null
+++ b/modules/dnn/src/layers/tile2_layer.cpp
@@ -0,0 +1,304 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+
+namespace cv
+{
+namespace dnn
+{
+
+static constexpr int TILE_MAX_DIMS = 6;
+
+/*
+    Tile layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__Tile.html
+
+    Opset's 1 to 13 are covered.
+*/
+
+// out must be pre-allocated
+// repeats_[] should contains as many elements as inp.dims (== out.dims)
+static void tile(const Mat& inp, const int* repeats_, Mat& out)
+{
+    MatShape inpshape_ = inp.shape();
+    MatShape outshape_ = out.shape();
+    const uchar* inpdata0 = inp.data;
+    uchar* outdata0_ = out.data;
+
+    int inpshape[TILE_MAX_DIMS];
+    int outshape[TILE_MAX_DIMS];
+    int repeats[TILE_MAX_DIMS];
+    int64_t inpstep[TILE_MAX_DIMS];
+    int64_t outstep[TILE_MAX_DIMS];
+
+    int ndims = inp.dims, delta = TILE_MAX_DIMS - ndims;
+    int64_t esz = inp.elemSize();
+    int64_t total_size = 1, total_repeats = 1;
+
+    CV_Assert(inp.isContinuous());
+    CV_Assert(out.isContinuous());
+    CV_Assert(inp.type() == out.type());
+    CV_Assert(esz == 1 || esz == 2 || esz == 4 || esz == 8);
+    CV_Assert(inp.dims == out.dims);
+    CV_Assert(inp.dims <= TILE_MAX_DIMS);
+
+    for (int i = 0; i < TILE_MAX_DIMS; i++) {
+        inpshape[i] = outshape[i] = repeats[i] = 1;
+    }
+
+    for (int i = 0; i < ndims; i++) {
+        inpshape[i + delta] = inpshape_[i];
+        outshape[i + delta] = outshape_[i];
+        repeats[i + delta] = repeats_[i];
+
+        CV_Assert(inpshape_[i]*repeats_[i] == outshape_[i]);
+
+        total_size *= outshape_[i];
+        total_repeats *= repeats_[i];
+    }
+
+    for (int i = TILE_MAX_DIMS-1; i >= 0; i--) {
+        if (i == TILE_MAX_DIMS-1)
+            inpstep[i] = outstep[i] = 1;
+        else {
+            inpstep[i] = inpstep[i+1]*inpshape[i+1];
+            outstep[i] = outstep[i+1]*outshape[i+1];
+        }
+    }
+
+    int ntasks = 8;
+    if (ntasks > total_repeats)
+        ntasks = (int)total_repeats;
+    if (total_size < 1000000)
+        ntasks = 1;
+
+    parallel_for_(Range(0, ntasks), [&](const Range& r)
+    {
+        int sz0 = inpshape[0], sz1 = inpshape[1], sz2 = inpshape[2];
+        int sz3 = inpshape[3], sz4 = inpshape[4], sz5 = inpshape[5];
+
+        int64_t outstep_prelast = outstep[TILE_MAX_DIMS-2];
+        int64_t j0 = r.start*total_repeats/ntasks, j1 = r.end*total_repeats/ntasks;
+
+        for (int64_t j = j0; j < j1; j++)
+        {
+            // convert raw tile index into n-dim tile index.
+            // but we don't need this nd-index itself, we just need the
+            // offset of the tile in the output tensor
+            int64_t j_ = j, rawofs = 0;
+            for (int k = TILE_MAX_DIMS-1; k >= 0; k--) {
+                int r = repeats[k];
+                int64_t q = j_ / r;
+                rawofs += (j_ - q*r)*inpshape[k]*outstep[k];
+                j_ = q;
+            }
+
+            #undef IMPL_COPY_TILE
+            #define IMPL_COPY_TILE(T) \
+                T* inpdata = (T*)inpdata0; \
+                T* outdata0 = (T*)outdata0_ + rawofs; \
+                for (int i0 = 0; i0 < sz0; i0++) { \
+                for (int i1 = 0; i1 < sz1; i1++) { \
+                for (int i2 = 0; i2 < sz2; i2++) { \
+                for (int i3 = 0; i3 < sz3; i3++) { \
+                    T* outdata = outdata0 + i0*outstep[0] + i1*outstep[1] + i2*outstep[2] + i3*outstep[3]; \
+                    for (int i4 = 0; i4 < sz4; i4++, outdata += outstep_prelast, inpdata += sz5) { \
+                        for (int i5 = 0; i5 < sz5; i5++) \
+                            outdata[i5] = inpdata[i5]; \
+                    } \
+                }}}}
+
+            if (esz == 1) {
+                IMPL_COPY_TILE(uint8_t)
+            } else if (esz == 2) {
+                IMPL_COPY_TILE(uint16_t)
+            } else if (esz == 4) {
+                IMPL_COPY_TILE(uint32_t)
+            } else {
+                IMPL_COPY_TILE(uint64_t)
+            }
+        }
+    }
+    , ntasks);
+}
+
+class Tile2LayerImpl CV_FINAL : public Tile2Layer
+{
+public:
+    Tile2LayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    virtual bool dynamicOutputShapes() const CV_OVERRIDE
+    {
+        Net::Impl* netimpl_ = getNetImpl(this);
+        CV_Assert(netimpl_);
+        size_t ninputs = this->inputs.size();
+        CV_Assert(ninputs == 2 || ninputs == 3);
+        return  !netimpl_->isConstArg(this->inputs[1]) ||
+                (ninputs == 3 && !netimpl_->isConstArg(this->inputs[2]));
+    }
+
+    void getRepeats(const Mat& repeats_, const Mat& axes_, int ndims, int* repeats) const
+    {
+        int atype = axes_.type(), rtype = repeats_.type();
+        CV_Assert(ndims <= TILE_MAX_DIMS);
+
+        const int32_t* adata_i32 = nullptr;
+        const int64_t* adata_i64 = nullptr;
+        const int32_t* rdata_i32 = nullptr;
+        const int64_t* rdata_i64 = nullptr;
+
+        bool axismask[TILE_MAX_DIMS];
+
+        CV_Assert(repeats_.dims == 1);
+        CV_Assert(rtype == CV_32S || rtype == CV_64S);
+
+        if (rtype == CV_32S)
+            rdata_i32 = reinterpret_cast<const int32_t*>(repeats_.data);
+        else
+            rdata_i64 = reinterpret_cast<const int64_t*>(repeats_.data);
+
+        if (!axes_.empty()) {
+            CV_Assert(axes_.dims == 1);
+            CV_Assert(atype == CV_32S || atype == CV_64S);
+            CV_Assert(repeats_.total() == axes_.total());
+            CV_Assert(axes_.total() <= (size_t)ndims);
+
+            if (atype == CV_32S)
+                adata_i32 = reinterpret_cast<const int32_t*>(axes_.data);
+            else
+                adata_i64 = reinterpret_cast<const int64_t*>(axes_.data);
+        } else {
+            CV_Assert(repeats_.total() == (size_t)ndims);
+        }
+
+        for (int i = 0; i < ndims; i++) {
+            repeats[i] = 1;
+            axismask[i] = false;
+        }
+
+        int nrepeats = (int)repeats_.total();
+        for (int i = 0; i < nrepeats; i++) {
+            int a = adata_i32 ? (int)adata_i32[i] : adata_i64 ? (int)adata_i64[i] : i;
+            a = normalize_axis(a, ndims);
+            if (axismask[a]) {
+                CV_Error_(Error::StsBadArg, ("duplicate axis %d in Tile", a));
+            }
+            axismask[a] = true;
+            int r = rdata_i32 ? (int)rdata_i32[i] : rdata_i64 ? (int)rdata_i64[i] : 1;
+            repeats[a] = r;
+        }
+    }
+
+    MatShape getOutShape(const MatShape& inpshape, const int* repeats) const
+    {
+        MatShape outshape = inpshape;
+        for (int i = 0; i < outshape.dims; i++)
+            outshape[i] *= repeats[i];
+        return outshape;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape>& inputs,
+                         const int,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        CV_Assert(!dynamicOutputShapes());
+
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == (size_t)2 || ninputs == (size_t)3);
+        Net::Impl* netimpl_ = getNetImpl(this);
+
+        int repeats[TILE_MAX_DIMS];
+
+        Mat repeatsTensor = netimpl_->argTensor(this->inputs[1]);
+        Mat axesTensor;
+        if (ninputs > 2)
+            axesTensor = netimpl_->argTensor(this->inputs[2]);
+
+        int ndims = inputs[0].dims;
+        getRepeats(repeatsTensor, axesTensor, ndims, repeats);
+
+        outputs.assign(1, getOutShape(inputs[0], repeats));
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == (size_t)2 || ninputs == (size_t)3);
+        outputs.assign(requiredOutputs, inputs[0]);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+        CV_Assert(ninputs == 2 || ninputs == 3);
+
+        Mat inp = inputs_arr.getMat(0);
+        Mat repeatsTensor = inputs_arr.getMat(1);
+        Mat axesTensor;
+        int repeats[TILE_MAX_DIMS];
+        int inptype = inp.type();
+        int ndims = inp.dims;
+
+        if (ninputs > 2)
+            axesTensor = inputs_arr.getMat(2);
+
+        getRepeats(repeatsTensor, axesTensor, ndims, repeats);
+        MatShape outshape = getOutShape(inp.shape(), repeats);
+
+        auto kind = outputs_arr.kind();
+        if (kind == _InputArray::STD_VECTOR_MAT) {
+            std::vector<Mat>& outs = outputs_arr.getMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outshape, inptype);
+            tile(inp, repeats, outs[0]);
+        } else if (kind == _InputArray::STD_VECTOR_UMAT) {
+            std::vector<UMat>& outs = outputs_arr.getUMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outshape, inptype);
+            Mat temp(outshape, inptype);
+            tile(inp, repeats, temp);
+            temp.copyTo(outs[0]);
+        } else {
+            CV_Error(Error::StsNotImplemented, "");
+        }
+    }
+};
+
+Ptr<Tile2Layer> Tile2Layer::create(const LayerParams& params)
+{
+    return Ptr<Tile2Layer>(new Tile2LayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/tile_layer.cpp b/modules/dnn/src/layers/tile_layer.cpp
index 3d766ca75b..d24d11786c 100644
--- a/modules/dnn/src/layers/tile_layer.cpp
+++ b/modules/dnn/src/layers/tile_layer.cpp
@@ -44,20 +44,21 @@ public:
                                  std::vector<MatShape> &internals) const CV_OVERRIDE
     {
         CV_CheckEQ(inputs.size(), 1ull, "Tile: one input is expected");
+        int nrepeats = (int)repeats.size();
 
         // repeats must have the same length as input's dimension number
         if (inputs[0].size() > 1) {
             CV_CheckEQ(inputs[0].size(), repeats.size(), "Tile: repeats must be a 1D tensor of the same length as input's dimension number");
             outputs.assign(1, inputs[0]);
-            for (int i = 0; i < repeats.size(); i++)
+            for (int i = 0; i < nrepeats; i++)
             {
                 outputs[0][i] *= repeats[i];
             }
         } else {
-            CV_CheckGE((int)repeats.size(), 1, "Tile: Provide at least one repeat along any dimension");
-            outputs.assign(1, repeats);
+            CV_CheckGE(nrepeats, 1, "Tile: Provide at least one repeat along any dimension");
+            outputs.assign(1, MatShape(repeats));
             if (inputs[0].size() == 1)
-                outputs[0][repeats.size() - 1] *= inputs[0][0];
+                outputs[0][nrepeats - 1] *= inputs[0][0];
         }
 
         return false;
@@ -85,7 +86,6 @@ public:
         Mat& out = outputs[0];
 
         Mat tmp = data.clone();
-        MatShape tmp_shape = shape(tmp);
         MatShape out_shape = shape(out);
         int rep_i, ndims = data.dims;
         int dims = 1;
diff --git a/modules/dnn/src/layers/transpose_layer.cpp b/modules/dnn/src/layers/transpose_layer.cpp
new file mode 100644
index 0000000000..2d6906cbac
--- /dev/null
+++ b/modules/dnn/src/layers/transpose_layer.cpp
@@ -0,0 +1,218 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+//#include "../op_cuda.hpp"
+//#include "../op_inf_engine.hpp"
+//#include "../ie_ngraph.hpp"
+//#include "../op_webnn.hpp"
+//#include "../op_timvx.hpp"
+//#include "../op_cann.hpp"
+
+//#include <opencv2/dnn/shape_utils.hpp>
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    Transpose layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__Transpose.html
+
+    Opset's 1 to 23 are covered.
+*/
+
+static void transpose(const Mat& inp, const std::vector<int>& perm, Mat& out)
+{
+    enum {TRANSPOSE_MAX_DIMS=7};
+    MatShape inpShape = inp.shape();
+    MatShape outShape = out.shape();
+    int ndims = inpShape.dims;
+    size_t esz = inp.elemSize();
+    CV_Assert(esz == 1 || esz == 2 || esz == 4 || esz == 8);
+
+    int perm_[TRANSPOSE_MAX_DIMS];
+    int inpShape_[TRANSPOSE_MAX_DIMS];
+    int outShape_[TRANSPOSE_MAX_DIMS];
+    size_t inpStep_[TRANSPOSE_MAX_DIMS];
+    int delta = TRANSPOSE_MAX_DIMS - ndims;
+
+    CV_Assert(ndims <= TRANSPOSE_MAX_DIMS);
+    CV_Assert(inp.isContinuous());
+    CV_Assert(out.isContinuous());
+
+    for (int i = 0; i < TRANSPOSE_MAX_DIMS; i++) {
+        perm_[i] = i;
+        inpShape_[i] = outShape_[i] = 1;
+        inpStep_[i] = 0;
+    }
+    inpStep_[TRANSPOSE_MAX_DIMS-1] = 1; // step's are measured in elements, not bytes
+
+    for(int i = 0; i < ndims; i++) {
+        int j = perm.empty() ? ndims - i - 1 : perm[i];
+        if (j < 0)
+            j += ndims;
+        CV_Assert(0 <= j && j < ndims);
+        perm_[i + delta] = j + delta;
+        int inpsz = inpShape[j];
+        int outsz = outShape[i];
+        CV_Assert(inpsz == outsz);
+        inpShape_[i + delta] = inpShape[i];
+        outShape_[i + delta] = outShape[i];
+    }
+
+    for (int i = TRANSPOSE_MAX_DIMS-2; i >= 0; i--)
+        inpStep_[i] = inpStep_[i+1]*inpShape_[i+1];
+
+    int sz6 = outShape_[0], sz5 = outShape_[1];
+    int sz4 = outShape_[2], sz3 = outShape_[3];
+    int sz2 = outShape_[4], sz1 = outShape_[5], sz0 = outShape_[6];
+    size_t p6 = inpStep_[perm_[0]], p5 = inpStep_[perm_[1]];
+    size_t p4 = inpStep_[perm_[2]], p3 = inpStep_[perm_[3]];
+    size_t p2 = inpStep_[perm_[4]], p1 = inpStep_[perm_[5]], p0 = inpStep_[perm_[6]];
+
+#undef CV_IMPLEMENT_TRANSPOSE
+#define CV_IMPLEMENT_TRANSPOSE(typ) \
+    const typ* inptr0 = (const typ*)inp.data; \
+    typ* outptr = (typ*)out.data; \
+    for (int i6 = 0; i6 < sz6; i6++) { \
+    for (int i5 = 0; i5 < sz5; i5++) { \
+    for (int i4 = 0; i4 < sz4; i4++) { \
+    for (int i3 = 0; i3 < sz3; i3++) { \
+    for (int i2 = 0; i2 < sz2; i2++) { \
+    for (int i1 = 0; i1 < sz1; i1++, outptr += sz0) { \
+        int i0 = 0; \
+        const typ* inptr = inptr0 + i6*p6 + i5*p5 + i4*p4 + i3*p3 + i2*p2 + i1*p1; \
+        for (; i0 <= sz0 - 3; i0 += 3) { \
+            size_t ip0 = i0*p0; \
+            typ t0 = inptr[ip0]; \
+            typ t1 = inptr[ip0+p0]; \
+            typ t2 = inptr[ip0+p0*2]; \
+            outptr[i0] = t0; \
+            outptr[i0+1] = t1; \
+            outptr[i0+2] = t2; \
+        } \
+        for (; i0 < sz0; i0++) \
+            outptr[i0] = inptr[i0*p0]; \
+    }}}}}}
+
+    if (esz == 4) {
+        CV_IMPLEMENT_TRANSPOSE(int)
+    } else if (esz == 2) {
+        CV_IMPLEMENT_TRANSPOSE(short)
+    } else if (esz == 1) {
+        CV_IMPLEMENT_TRANSPOSE(char)
+    } else if (esz == 8) {
+        CV_IMPLEMENT_TRANSPOSE(int64_t)
+    }
+}
+
+class TransposeLayerImpl CV_FINAL : public TransposeLayer
+{
+public:
+    TransposeLayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+        perm = params.getVector<int>("perm");
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    MatShape getOutShape(const MatShape& inpShape) const
+    {
+        MatShape outShape(inpShape.dims);
+        CV_Assert(perm.empty() || perm.size() == (size_t)inpShape.dims);
+
+        for (int i = 0; i < inpShape.dims; i++) {
+            int j = perm.empty() ? inpShape.dims - i - 1 : perm[i];
+            CV_Assert(0 <= j && j < inpShape.dims);
+            outShape[i] = inpShape[j];
+        }
+
+        return outShape;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape> &inputs,
+                         const int,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        CV_Assert(inputs.size() == 1);
+        outputs.assign(1, getOutShape(inputs[0]));
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        CV_Assert(inputs.size() == 1);
+        outputs.assign(requiredOutputs, inputs[0]);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+        CV_Assert(ninputs == 1);
+
+        MatShape inpShape = inputs_arr.shape(0);
+        MatShape outShape = getOutShape(inpShape);
+        int inpType = inputs_arr.type(0);
+        int outKind = outputs_arr.kind();
+
+        CV_Assert(outKind == _InputArray::STD_VECTOR_MAT ||
+                  outKind == _InputArray::STD_VECTOR_UMAT);
+
+        if (outKind == _InputArray::STD_VECTOR_MAT) {
+            Mat inp = inputs_arr.getMat(0);
+            std::vector<Mat>& outs = outputs_arr.getMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outShape, inpType);
+            runOp(inp, outs[0]);
+        } else {
+            // [TODO] more efficient OpenCL implementation
+            Mat inp = inputs_arr.getMat(0);
+            std::vector<UMat>& outs = outputs_arr.getUMatVecRef();
+            outs.resize(1);
+            outs[0].fit(outShape, inpType);
+            Mat temp(outShape, inpType);
+            runOp(inp, temp);
+            temp.copyTo(outs[0]);
+        }
+    }
+
+    void runOp(const Mat& inp, Mat& out)
+    {
+        transpose(inp, perm, out);
+    }
+};
+
+Ptr<TransposeLayer> TransposeLayer::create(const LayerParams& params)
+{
+    return Ptr<TransposeLayer>(new TransposeLayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/layers/unsqueeze_layer.cpp b/modules/dnn/src/layers/unsqueeze_layer.cpp
new file mode 100644
index 0000000000..a68c203f24
--- /dev/null
+++ b/modules/dnn/src/layers/unsqueeze_layer.cpp
@@ -0,0 +1,156 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "layers_common.hpp"
+#include "../net_impl.hpp"
+//#include "../op_cuda.hpp"
+//#include "../op_inf_engine.hpp"
+//#include "../ie_ngraph.hpp"
+//#include "../op_webnn.hpp"
+//#include "../op_timvx.hpp"
+//#include "../op_cann.hpp"
+
+//#include <opencv2/dnn/shape_utils.hpp>
+
+namespace cv
+{
+namespace dnn
+{
+
+/*
+    Unsqueeze layer, as defined in ONNX specification:
+    https://onnx.ai/onnx/operators/onnx__Unsqueeze.html
+
+    Opset's 1 to 23 are covered.
+
+    See description in reshape2_layer.cpp
+    for more some common implementation details.
+*/
+class UnsqueezeLayerImpl CV_FINAL : public UnsqueezeLayer
+{
+public:
+    UnsqueezeLayerImpl(const LayerParams& params)
+    {
+        setParamsFrom(params);
+        axes = params.getVector<int>("axes");
+    }
+
+    virtual bool dynamicOutputShapes() const CV_OVERRIDE
+    {
+        Net::Impl* netimpl_ = getNetImpl(this);
+        return inputs.size() == 2 && !netimpl_->isConstArg(inputs[1]);
+    }
+
+    virtual bool supportBackend(int backendId) CV_OVERRIDE
+    {
+        return backendId == DNN_BACKEND_OPENCV;
+    }
+
+    MatShape getOutShape(const MatShape& inpShape, const std::vector<int>& axes_) const
+    {
+        bool unsqueezeMask[MatShape::MAX_DIMS];
+
+        int outDims = inpShape.dims + (int)axes_.size();
+        CV_Assert(0 <= outDims && outDims <= MatShape::MAX_DIMS);
+
+        for (int i = 0; i < outDims; i++)
+            unsqueezeMask[i] = false;
+        for (int a: axes_) {
+            int a_ = normalize_axis(a, outDims);
+            if (unsqueezeMask[a_]) {
+                CV_Error_(Error::StsBadArg, ("duplicate unsqueezed axis #%d", a));
+            }
+            unsqueezeMask[a_] = true;
+        }
+
+        MatShape outShape(outDims);
+        int j = 0;
+        for (int i = 0; i < outDims; i++) {
+            if (unsqueezeMask[i])
+                outShape[i] = 1;
+            else {
+                CV_Assert(j < inpShape.dims);
+                outShape[i] = inpShape[j++];
+            }
+        }
+        return outShape;
+    }
+
+    bool getMemoryShapes(const std::vector<MatShape> &inputs,
+                         const int,
+                         std::vector<MatShape> &outputs,
+                         std::vector<MatShape> &internals) const CV_OVERRIDE
+    {
+        CV_Assert((inputs.size() == 1 && !axes.empty()) ||
+                  (inputs.size() == 2 && axes.empty()));
+        MatShape outShape;
+        std::vector<int> tempAxes;
+        const std::vector<int>* axes_ = &axes;
+
+        if (inputs.size() == 2)
+        {
+            Net::Impl* netimpl_ = getNetImpl(this);
+            Mat axesTensor = netimpl_->argTensor(this->inputs[1]);
+            tensorToIntVec(axesTensor, tempAxes);
+            axes_ = &tempAxes;
+        }
+        outputs.assign(1, getOutShape(inputs[0], *axes_));
+        internals.clear();
+        return true;
+    }
+
+    void getTypes(const std::vector<MatType>& inputs,
+        const int requiredOutputs,
+        const int requiredInternals,
+        std::vector<MatType>& outputs,
+        std::vector<MatType>& internals) const CV_OVERRIDE
+    {
+        size_t ninputs = inputs.size();
+        CV_Assert(ninputs == 1 || ninputs == 2);
+        outputs.assign(requiredOutputs, inputs[0]);
+        CV_Assert(requiredInternals == 0);
+        internals.clear();
+    }
+
+    void finalize(InputArrayOfArrays, OutputArrayOfArrays outputs_arr) CV_OVERRIDE
+    {
+    }
+
+    void forward(InputArrayOfArrays inputs_arr,
+                 OutputArrayOfArrays outputs_arr,
+                 OutputArrayOfArrays) CV_OVERRIDE
+    {
+        CV_TRACE_FUNCTION();
+        CV_TRACE_ARG_VALUE(name, "name", name.c_str());
+
+        Size size = inputs_arr.size();
+        int ninputs = size.area();
+        CV_Assert((ninputs == 1 && !axes.empty()) ||
+                  (ninputs == 2 && axes.empty()));
+
+        MatShape inpShape = inputs_arr.shape(0);
+        std::vector<int> tempAxes;
+        const std::vector<int>* axes_ = &axes;
+
+        if (ninputs == 2)
+        {
+            CV_Assert(axes.empty()); // if we have a dedicated 'axes' input,
+                                     // we should not have 'axes' attribute at the same time
+            Mat axesTensor = inputs_arr.getMat(1);
+            tensorToIntVec(axesTensor, tempAxes);
+            axes_ = &tempAxes;
+        }
+        MatShape outShape = getOutShape(inpShape, *axes_);
+        reshapeAndCopyFirst(inputs_arr, outputs_arr, outShape);
+    }
+};
+
+Ptr<UnsqueezeLayer> UnsqueezeLayer::create(const LayerParams& params)
+{
+    return Ptr<UnsqueezeLayer>(new UnsqueezeLayerImpl(params));
+}
+
+}
+}
diff --git a/modules/dnn/src/legacy_backend.hpp b/modules/dnn/src/legacy_backend.hpp
index 3709c1f7f2..afa94d76aa 100644
--- a/modules/dnn/src/legacy_backend.hpp
+++ b/modules/dnn/src/legacy_backend.hpp
@@ -191,7 +191,7 @@ public:
             std::map<LayerPin, int>::const_iterator refIt;
 
             const int targetTotal = total(shape);
-            int bestBlobTotal = INT_MAX;
+            size_t bestBlobTotal = INT_MAX;
 
             for (hostIt = memHosts.begin(); hostIt != memHosts.end(); ++hostIt)
             {
diff --git a/modules/dnn/src/model.cpp b/modules/dnn/src/model.cpp
index 0f6207a0aa..951fa526d1 100644
--- a/modules/dnn/src/model.cpp
+++ b/modules/dnn/src/model.cpp
@@ -117,7 +117,7 @@ public:
         net.setInput(blob);
 
         // Faster-RCNN or R-FCN
-        if (net.getLayer(0)->outputNameToIndex("im_info") != -1)
+        if (!net.getMainGraph() && net.getLayer(0)->outputNameToIndex("im_info") != -1)
         {
             Mat imInfo(Matx13f(size.height, size.width, 1.6f));
             net.setInput(imInfo, "im_info");
diff --git a/modules/dnn/src/net.cpp b/modules/dnn/src/net.cpp
index 0b531700a7..fdd51e6180 100644
--- a/modules/dnn/src/net.cpp
+++ b/modules/dnn/src/net.cpp
@@ -175,17 +175,33 @@ String Net::dump()
 {
     CV_TRACE_FUNCTION();
     CV_Assert(impl);
+    if (impl->mainGraph) {
+        std::stringstream sstrm;
+        dumpToStream(sstrm);
+        return sstrm.str();
+    }
     CV_Assert(!empty());
     return impl->dump(true);
 }
 
+void Net::dumpToStream(std::ostream& strm) const
+{
+    if (impl->mainGraph) {
+        impl->dump(strm);
+    }
+}
+
 void Net::dumpToFile(const String& path)
 {
     CV_TRACE_FUNCTION();
     CV_Assert(impl);
     CV_Assert(!empty());
     std::ofstream file(path.c_str());
-    file << dump();
+    if (impl->mainGraph) {
+        impl->dump(file);
+    } else {
+        file << dump();
+    }
     file.close();
 }
 
@@ -411,5 +427,93 @@ int64 Net::getPerfProfile(std::vector<double>& timings)
     return impl->getPerfProfile(timings);
 }
 
+bool Net::isConstArg(Arg arg) const
+{
+    return argKind(arg) == DNN_ARG_CONST;
+}
+
+const ArgData& Net::argData(Arg arg) const
+{
+    CV_Assert(impl);
+    CV_Assert((size_t)arg.idx < impl->args.size());
+    return impl->args[arg.idx];
+}
+
+const std::string& Net::argName(Arg arg) const { return argData(arg).name; }
+
+ArgKind Net::argKind(Arg arg) const { return argData(arg).kind; }
+
+Mat& Net::argTensor(Arg arg) const {
+    CV_Assert(impl);
+    return impl->argTensor(arg);
+}
+
+Arg Net::getArg(const std::string& name)
+{
+    CV_Assert(impl);
+    return impl->getArg(name);
+}
+
+bool Net::haveArg(const std::string& name) const
+{
+    CV_Assert(impl);
+    return impl->haveArg(name);
+}
+
+Ptr<Graph> Net::getMainGraph() const
+{
+    CV_Assert(impl);
+    return impl->mainGraph;
+}
+
+std::ostream& Net::dumpArg(std::ostream& strm, Arg arg, int indent,
+                           bool comma, bool dump_details) const
+{
+    CV_Assert(impl);
+    return impl->dumpArg(strm, arg, indent, comma, dump_details);
+}
+
+int Net::findDim(const std::string& dimname, bool insert)
+{
+    CV_Assert(impl);
+    return impl->findDim(dimname, insert);
+}
+
+std::ostream& Net::dumpDim(std::ostream& strm, int value) const
+{
+    CV_Assert(impl);
+    return impl->dumpDim(strm, value);
+}
+
+void Net::setTracingMode(TracingMode tracingMode)
+{
+    CV_Assert(impl);
+    impl->tracingMode = tracingMode;
+}
+
+TracingMode Net::getTracingMode() const
+{
+    CV_Assert(impl);
+    return impl->tracingMode;
+}
+
+void Net::setProfilingMode(ProfilingMode profilingMode)
+{
+    CV_Assert(impl);
+    impl->profilingMode = profilingMode;
+}
+
+ProfilingMode Net::getProfilingMode() const
+{
+    CV_Assert(impl);
+    return impl->profilingMode;
+}
+
+ModelFormat Net::getModelFormat() const
+{
+    CV_Assert(impl);
+    return impl->modelFormat;
+}
+
 CV__DNN_INLINE_NS_END
 }}  // namespace cv::dnn
diff --git a/modules/dnn/src/net_impl.cpp b/modules/dnn/src/net_impl.cpp
index 9f0c59634f..68c8b41cfd 100644
--- a/modules/dnn/src/net_impl.cpp
+++ b/modules/dnn/src/net_impl.cpp
@@ -59,11 +59,35 @@ Net::Impl::Impl()
     preferableTarget = DNN_TARGET_CPU;
     hasDynamicShapes = false;
     useWinograd = true;
+
+    ////////////// extra initialization for the new engine /////////////////
+
+    modelFormat = DNN_MODEL_GENERIC;
+    originalLayout = DATA_LAYOUT_NCHW;
+    onnx_opset = 0;
+
+    accuracy = CV_32F;
+    enableFP16 = haveFP16 = false;
+    // FP16 is not ready yet in the new DNN engine
+    // Ticket: https://github.com/opencv/opencv/issues/26196
+    /*if (checkHardwareSupport(CV_CPU_FP16)) {
+        enableFP16 = haveFP16 = true;
+    }*/
+
+    tracingMode = DNN_TRACE_NONE;
+    profilingMode = DNN_PROFILE_NONE;
+
+    dump_strm = &std::cout;
+    dump_indent = 3;
+
+    clear();
 }
 
 
 bool Net::Impl::empty() const
 {
+    if (mainGraph)
+        return false;
     return layers.size() <= 1;  // first layer is default Data layer
 }
 
@@ -92,6 +116,34 @@ void Net::Impl::clear()
     }
     netWasAllocated = false;
     layersTimings.clear();
+
+    /////////////// for the new inference engine //////////////////
+
+    modelFormat = DNN_MODEL_GENERIC;
+
+    dimnames = NamesHash();
+    dimnames_vec = std::vector<std::string>();
+
+    args = std::vector<ArgData>();
+    argnames = NamesHash();
+
+    __tensors__ = std::vector<Mat>();
+    bufidxs = std::vector<int>();
+    buffers = std::vector<Mat>();
+
+    mainGraph = Ptr<Graph>();
+
+    ArgData adata;
+    adata.name = "";
+    adata.kind = DNN_ARG_CONST;
+
+    args.push_back(adata);
+    argnames.insert(std::make_pair(std::string(""), 0));
+    __tensors__.push_back(Mat());
+    bufidxs.push_back(-1);
+
+    prepared = false;
+    finalizeLayers = true;
 }
 
 
@@ -208,8 +260,22 @@ void Net::Impl::setUpNet(const std::vector<LayerPin>& blobsToKeep_)
 
 Ptr<Layer> Net::Impl::getLayer(int layerId) const
 {
-    LayerData& ld = getLayerData(layerId);
-    return getLayerInstance(ld);
+    if (mainGraph) {
+        CV_Assert(0 <= layerId && layerId < totalLayers);
+        int graph_ofs = 0;
+        for (const Ptr<Graph>& graph : allgraphs) {
+            const std::vector<Ptr<Layer> >& prog = graph->prog();
+            int nops = (int)prog.size();
+            CV_Assert(layerId >= graph_ofs);
+            if (layerId < graph_ofs + nops)
+                return prog[layerId - graph_ofs];
+            graph_ofs += nops;
+        }
+        CV_Error_(Error::StsObjectNotFound, ("layer #%d is not found", layerId));
+    } else {
+        LayerData& ld = getLayerData(layerId);
+        return getLayerInstance(ld);
+    }
 }
 
 
@@ -351,7 +417,7 @@ int Net::Impl::addLayer(const String& name, const String& type, const int& dtype
     {
         if (!DNN_DIAGNOSTICS_RUN || type != "NotImplemented")
         {
-            CV_Error(Error::StsBadArg, "Layer \"" + name + "\" already into net");
+            CV_Error(Error::StsBadArg, "Layer \"" + name + "\" has been already added into net");
             return -1;
         }
         else
@@ -613,12 +679,23 @@ void Net::Impl::allocateLayers(const std::vector<LayerPin>& blobsToKeep_)
 }
 
 
+#define TRACE_INFERENCE 0
+
 void Net::Impl::forwardLayer(LayerData& ld)
 {
     CV_TRACE_FUNCTION();
 
     Ptr<Layer> layer = ld.layerInstance;
 
+#if TRACE_INFERENCE
+    if (layer) {
+        printf("------------------------------------------------\n");
+        printf("Running layer '%s' (%s)\n",
+               layer->name.c_str(),
+               layer->type.c_str());
+    }
+#endif
+
     if (!ld.skip)
     {
         TickMeter tm;
@@ -842,6 +919,29 @@ void Net::Impl::forwardLayer(LayerData& ld)
         tm.stop();
         int64 t = tm.getTimeTicks();
         layersTimings[ld.id] = (t > 0) ? t : t + 1;  // zero for skipped layers only
+#if TRACE_INFERENCE
+        size_t noutputs = ld.outputBlobs.size();
+        for (size_t i = 0; i < noutputs; i++) {
+            const Mat& out = ld.outputBlobs[i];
+            printf("Output %zu.\n", i);
+            printf("  Type: %s\n", typeToString(out.type()).c_str());
+            printf("  Shape: ");
+            if (out.empty()) {
+                printf("<empty>\n");
+            } else if (out.dims == 0) {
+                printf("<scalar>\n");
+            } else {
+                for (int j = 0; j < out.dims; j++) {
+                    printf("%s%d", (j == 0 ? "[" : " x "), out.size[j]);
+                }
+                printf("]\n");
+            }
+            //fflush(stdout);
+            //pprint(std::cout, out, 0, 3, 100, '[');
+            //std::cout.flush();
+            //printf("\n");
+        }
+#endif
     }
     else
     {
@@ -890,6 +990,12 @@ Mat Net::Impl::forward(const String& outputName)
     CV_Assert(!empty());
     FPDenormalsIgnoreHintScope fp_denormals_ignore_scope;
 
+    if (mainGraph) {
+        if (!outputName.empty())
+            CV_Error(Error::StsNotImplemented, "The new dnn engine doesn't support inference until a specified layer. If you want to run the whole model, please don't set the outputName argument in the forward() call. If you want to run the model until a specified layer, please use the old dnn engine");
+        return forwardWithSingleOutput(outputName);
+    }
+
     String layerName = outputName;
 
     if (layerName.empty())
@@ -912,6 +1018,9 @@ AsyncArray Net::Impl::forwardAsync(const String& outputName)
     CV_Assert(!empty());
     FPDenormalsIgnoreHintScope fp_denormals_ignore_scope;
 
+    if (mainGraph)
+        CV_Error(Error::StsNotImplemented, "The new dnn engine doesn't support the async inference. If you want to run the sync inference, please call forward() instead of forwardAsync(). If you want to run the async inference, please use the old dnn engine");
+
     String layerName = outputName;
 
     if (layerName.empty())
@@ -940,6 +1049,13 @@ void Net::Impl::forward(OutputArrayOfArrays outputBlobs, const String& outputNam
     CV_Assert(!empty());
     FPDenormalsIgnoreHintScope fp_denormals_ignore_scope;
 
+    if (mainGraph) {
+        if (!outputName.empty())
+            CV_Error(Error::StsNotImplemented, "The new dnn engine doesn't support inference until a specified layer. If you want to run the whole model, please don't set the outputName argument in the forward() call. If you want to run the model until a specified layer, please use the old dnn engine");
+        forwardWithMultipleOutputs(outputBlobs, {});
+        return;
+    }
+
     String layerName = outputName;
 
     if (layerName.empty())
@@ -1028,6 +1144,11 @@ void Net::Impl::forward(OutputArrayOfArrays outputBlobs,
     CV_Assert(!empty());
     FPDenormalsIgnoreHintScope fp_denormals_ignore_scope;
 
+    if (mainGraph) {
+        forwardWithMultipleOutputs(outputBlobs, outBlobNames);
+        return;
+    }
+
     std::vector<LayerPin> pins;
     for (int i = 0; i < outBlobNames.size(); i++)
     {
@@ -1266,11 +1387,18 @@ void Net::Impl::getLayerShapes(const ShapesVec& netInputShapes,
         const int layerId,
         LayerShapes& shapes)
 {
-    LayersShapesMap inOutShapes;
-    inOutShapes[0].in = netInputShapes;  // insert shape for first input layer
-    inOutShapes[0].inTypes = netInputTypes;
-    getLayerShapesRecursively(layerId, inOutShapes);
-    shapes = inOutShapes[layerId];
+    if (mainGraph) {
+        std::vector<MatShape> shapeCache;
+        std::vector<int> typeCache;
+        CV_Assert(layerId == 0);
+        tryInferShapes(netInputShapes, netInputTypes, shapes, shapeCache, typeCache);
+    } else {
+        LayersShapesMap inOutShapes;
+        inOutShapes[0].in = netInputShapes;  // insert shape for first input layer
+        inOutShapes[0].inTypes = netInputTypes;
+        getLayerShapesRecursively(layerId, inOutShapes);
+        shapes = inOutShapes[layerId];
+    }
 }
 
 void Net::Impl::updateLayersShapes()
@@ -1411,6 +1539,13 @@ void Net::Impl::setInput(InputArray blob, const String& name, double scalefactor
 {
     FPDenormalsIgnoreHintScope fp_denormals_ignore_scope;
 
+    if (mainGraph) {
+        CV_Assert(scalefactor == 1);
+        CV_Assert(mean.val[0] == 0 && mean.val[1] == 0 && mean.val[2] == 0 && mean.val[3] == 0);
+        setMainGraphInput(blob, name);
+        return;
+    }
+
     LayerPin pin;
     pin.lid = 0;
     pin.oid = resolvePinOutputName(getLayerData(pin.lid), name);
@@ -2154,13 +2289,23 @@ std::vector<Ptr<Layer>> Net::Impl::getLayerInputs(int layerId) const
 std::vector<String> Net::Impl::getLayerNames() const
 {
     std::vector<String> res;
-    res.reserve(layers.size());
 
-    Impl::MapIdToLayerData::const_iterator it;
-    for (it = layers.begin(); it != layers.end(); it++)
-    {
-        if (it->second.id)  // skip Data layer
-            res.push_back(it->second.name);
+    if (mainGraph) {
+        res.reserve(totalLayers);
+        for (const Ptr<Graph>& graph: allgraphs) {
+            const std::vector<Ptr<Layer> >& prog = graph->prog();
+            for (const Ptr<Layer>& layer: prog)
+                res.push_back(layer->name);
+        }
+    } else {
+        res.reserve(layers.size());
+
+        Impl::MapIdToLayerData::const_iterator it;
+        for (it = layers.begin(); it != layers.end(); it++)
+        {
+            if (it->second.id)  // skip Data layer
+                res.push_back(it->second.name);
+        }
     }
 
     return res;
@@ -2199,6 +2344,15 @@ std::vector<int> Net::Impl::getUnconnectedOutLayers() const
 // FIXIT drop "unconnected" API
 std::vector<String> Net::Impl::getUnconnectedOutLayersNames() /*const*/
 {
+    if (mainGraph) {
+        std::vector<std::string> outnames;
+        const std::vector<Arg>& outargs = mainGraph->outputs();
+        for (auto out: outargs) {
+            const ArgData& adata = args.at(out.idx);
+            outnames.push_back(adata.name);
+        }
+        return outnames;
+    }
     std::vector<int> ids = getUnconnectedOutLayers();
     const size_t n = ids.size();
     std::vector<String> names(n);
@@ -2368,6 +2522,20 @@ void Net::Impl::enableWinograd(bool useWinograd_)
 void Net::Impl::getLayerTypes(std::vector<String>& layersTypes) const
 {
     layersTypes.clear();
+    if (mainGraph) {
+        std::set<std::string> layersTypesSet;
+        for (const Ptr<Graph>& g: allgraphs) {
+            const std::vector<Ptr<Layer> >& prog = g->prog();
+            for (const Ptr<Layer>& layer: prog) {
+                if (!layer)
+                    continue;
+                layersTypesSet.insert(layer->type);
+            }
+        }
+        for (auto it = layersTypesSet.begin(); it != layersTypesSet.end(); ++it)
+            layersTypes.push_back(*it);
+        return;
+    }
 
     std::map<String, int> layers_type_map;
     for (MapIdToLayerData::const_iterator it = layers.begin(); it != layers.end(); it++)
diff --git a/modules/dnn/src/net_impl.hpp b/modules/dnn/src/net_impl.hpp
index 999454aebb..209786d7b9 100644
--- a/modules/dnn/src/net_impl.hpp
+++ b/modules/dnn/src/net_impl.hpp
@@ -25,6 +25,8 @@
 
 #include "legacy_backend.hpp"  // wrapMat BlobManager OpenCLBackendWrapper
 
+#include <unordered_map>
+
 namespace cv {
 namespace dnn {
 CV__DNN_INLINE_NS_BEGIN
@@ -32,6 +34,8 @@ CV__DNN_INLINE_NS_BEGIN
 using std::make_pair;
 using std::string;
 
+typedef std::unordered_map<std::string, int64_t> NamesHash;
+
 // NB: Implementation is divided between of multiple .cpp files
 struct Net::Impl : public detail::NetImplBase
 {
@@ -66,6 +70,35 @@ struct Net::Impl : public detail::NetImplBase
     bool useWinograd;
     std::vector<int64> layersTimings;
 
+    std::string modelFileName;
+    ModelFormat modelFormat;
+    DataLayout originalLayout;
+    int onnx_opset;
+
+    NamesHash argnames;
+    NamesHash dimnames;
+    NamesHash graphofs;
+    size_t totalLayers;
+    std::vector<std::string> dimnames_vec;
+    std::vector<ArgData> args;
+    std::vector<Mat> __tensors__;
+    std::vector<int> bufidxs;
+    std::vector<Mat> buffers;
+    std::vector<Mat> scratchBufs;
+    std::vector<Ptr<Graph> > allgraphs;
+
+    Ptr<Graph> mainGraph;
+    int globGraphIdx;
+
+    int accuracy;
+    bool enableFP16, haveFP16;
+    bool prepared; // need to rerun graph transformations/optimizations
+    bool finalizeLayers; // need to initialize each layer
+    TracingMode tracingMode;
+    ProfilingMode profilingMode;
+    std::vector<int64_t> dimvalues;
+    std::ostream* dump_strm;
+    int dump_indent;
 
     virtual bool empty() const;
     virtual void setPreferableBackend(Net& net, int backendId);
@@ -282,8 +315,119 @@ struct Net::Impl : public detail::NetImplBase
 
     void dumpNetworkToFile() const;
 
+    ///////////////////////////// the new engine ////////////////////////////
+
+    // Create a new graph/subgraph, mode 2: we construct the graph manually.
+    // First, we create empty graph with certain input Args (they may or may not have names).
+    // once the graph is constructed, we set the graph outputs using Graph::setOutputs().
+    // When it's the first created graph, it automatically becomes the main model graph.
+    Ptr<Graph> newGraph(const std::string& name,
+                        const std::vector<Arg>& inputs,
+                        bool isMainGraph);
+
+    const ArgData& argData(Arg arg) const;
+    const std::string& argName(Arg arg) const;
+    ArgKind argKind(Arg arg) const;
+
+    // if the name is empty, always creates a new argument;
+    // if it's not empty, returns argument with the specific name if it already exists,
+    // otherwise creates new argument with the specified name
+    Arg getArg(const std::string& name);
+    bool haveArg(const std::string& name) const;
+
+    Arg newConstArg(const std::string& name, const Mat& m);
+    Arg newConstScalarArg(const std::string& name, int type, const void* value);
+    Arg newArg(const std::string& name, ArgKind kind, bool allowEmptyName=false);
+    bool isConstArg(Arg arg) const;
+    Mat& argTensor(Arg arg) const;
+    int argType(Arg arg) const;
+    void checkArg(Arg arg) const;
+    void checkArgs(const std::vector<Arg>& args) const;
+
+    int findDim(const std::string& name, bool insert=false);
+
+    void prepareForInference();
+
+    // pre-allocates memory for output tensors.
+    // if useBufferPool==true, the method uses 'buffers'
+    // for outputs (according to bufidxs)
+    // instead of allocating fresh outputs
+    void allocateLayerOutputs(const Ptr<Layer>& layer,
+                              const std::vector<int>& inpTypes,
+                              const std::vector<MatShape>& inpShapes,
+                              std::vector<int>& outTypes,
+                              std::vector<MatShape>& outShapes,
+                              std::vector<std::pair<uchar*, size_t> >& outOrigData,
+                              std::vector<Mat>& outputs, // [TODO] replace with something else to cover other backends
+                              std::vector<int>& tempTypes,
+                              std::vector<MatShape>& tempShapes,
+                              std::vector<Mat>& temps, // [TODO] ditto
+                              std::vector<Mat>& globalTemps,
+                              bool useBufferPool
+                              );
+
+    // set input of the model before running it
+    void setMainGraphInput(InputArray blob, const std::string& name);
+    // set input in some graph, the main one or a subgraph
+    void setGraphInput(Ptr<Graph>& graph, size_t idx, const Mat& m);
+    // run graph or subgraph.
+    void forwardGraph(Ptr<Graph>& graph, InputArrayOfArrays inputs, OutputArrayOfArrays outputs, bool isMainGraph);
+    // run the whole model
+    void forwardMainGraph(InputArrayOfArrays inputs, OutputArrayOfArrays outputs);
+    // run the whole model, convenience wrapper
+    Mat forwardWithSingleOutput(const std::string& outname);
+    // run the whole model, convenience wrapper
+    void forwardWithMultipleOutputs(OutputArrayOfArrays outputBlobs,
+                                    const std::vector<std::string>& outBlobNames);
+    // try infer shapes; if some layers produce tensors with dynamic shapes, shape inference is impossible
+    bool tryInferShapes(const std::vector<MatShape>& suggestedInpShapes,
+                        const std::vector<MatType>& suggestedInpTypes,
+                        LayerShapes& shapes,
+                        std::vector<MatShape>& shapeCache,
+                        std::vector<MatType>& typeCache) const;
+    bool tryInferGraphShapes(const Ptr<Graph>& graph,
+                             std::vector<MatShape>& shapeCache,
+                             std::vector<MatType>& typeCache) const;
+
+    // helper function for useCounts()
+    void updateUseCounts(const Ptr<Graph>& graph, std::vector<int>& usecounts) const;
+    // computes how many times each argument is used, i.e. on output usecounts.size() == args.size()
+    void useCounts(std::vector<int>& usecounts) const;
+
+    int updateGraphOfs(const Ptr<Graph>& graph, int currofs, bool ismain);
+
+    // deals with numeric and symblic shape values.
+    void checkAndUpdateDim(const Ptr<Graph>& graph, const Layer* layer, Arg inp, int j, int64_t value);
+
+    // dump information about certain input or output argument of an operation
+    void traceArg(std::ostream& strm_, const char* prefix, size_t i, Arg arg, bool dumpdata);
+    std::ostream& dumpArg(std::ostream& strm, Arg arg, int indent,
+                          bool comma, bool dump_details) const;
+    std::ostream& dumpDim(std::ostream& strm, int value) const;
+    std::ostream& dumpTypeShape(std::ostream& strm, int type, const MatShape& shape) const;
+    std::ostream& dump(std::ostream& strm);
+
+    // infers all types
+    void inferTypes();
+    // infers all shapes
+    void inferShapes(bool symbolic);
+    // sets certain buffer index for each intermediate argument (Arg)
+    void assignBuffers();
+    //void useBlockLayout();
+    void fuse();
+    void constFold();
+    void constArgs();
+
 };  // Net::Impl
 
+inline Net::Impl* getNetImpl(const Layer* layer)
+{
+    return reinterpret_cast<Net::Impl*>(layer->netimpl);
+}
+
+Net readNetFromONNX2(const String&);
+Net readNetFromONNX2(const char*, size_t);
+Net readNetFromONNX2(const std::vector<uchar>&);
 
 CV__DNN_INLINE_NS_END
 }}  // namespace cv::dnn
diff --git a/modules/dnn/src/net_impl2.cpp b/modules/dnn/src/net_impl2.cpp
new file mode 100644
index 0000000000..003c9e4056
--- /dev/null
+++ b/modules/dnn/src/net_impl2.cpp
@@ -0,0 +1,1107 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "precomp.hpp"
+
+#include "net_impl.hpp"
+
+namespace cv {
+namespace dnn {
+CV__DNN_INLINE_NS_BEGIN
+
+std::string modelFormatToString(ModelFormat modelFormat)
+{
+    return
+        modelFormat == DNN_MODEL_ONNX ? "ONNX" :
+        modelFormat == DNN_MODEL_TF ? "TF" :
+        modelFormat == DNN_MODEL_TFLITE ? "TFLite" :
+        modelFormat == DNN_MODEL_CAFFE ? "Caffe" : "Unknown/Generic";
+}
+
+std::string argKindToString(ArgKind kind)
+{
+    return
+        kind == DNN_ARG_CONST ? "Const" :
+        kind == DNN_ARG_INPUT ? "Input" :
+        kind == DNN_ARG_OUTPUT ? "Output" :
+        kind == DNN_ARG_TEMP ? "Temp" :
+        kind == DNN_ARG_PATTERN ? "Pattern" : "???";
+}
+
+ArgData::ArgData()
+{
+    kind = DNN_ARG_EMPTY;
+    type = -1;
+}
+
+class GraphImpl : public Graph
+{
+public:
+    GraphImpl(Net::Impl* netimpl, const std::string& name,
+              const std::vector<Arg>& inputs)
+    {
+        netimpl_ = netimpl;
+        name_ = name;
+        inputs_ = inputs;
+    }
+
+    virtual ~GraphImpl()
+    {
+    }
+
+    virtual std::string name() const override { return name_; }
+    virtual bool empty() const override { return prog_.empty(); }
+    virtual void clear() override
+    {
+        prog_.clear();
+    }
+
+    /*Ptr<Graph> clone(Net* newnet) const
+    {
+        Graph g = std::make_shared<GraphData>((newnet ? *newnet : *net_), name_, inputs_, ispattern_);
+        g->outputs_ = outputs_;
+        g->backend_ = backend_;
+        // don't copy optigraph_. It has to be re-created
+        for (auto n : prog_) {
+            g->prog_.push_back(n->clone(g->net_));
+        }
+        return g;
+    }*/
+
+    virtual const std::vector<Arg>& append(Ptr<Layer>& layer,
+                const std::vector<std::string>& outnames) override
+    {
+        CV_Assert(layer);
+        int i, noutputs = (int)outnames.size();
+        //CV_Assert(layer->minNumOutputs() <= noutputs && noutputs <= layer->maxNumOutputs());
+
+        layer->outputs.resize(noutputs);
+        for (i = 0; i < noutputs; i++) {
+            Arg outarg = netimpl_->getArg(outnames[i]);
+            ArgKind kind = netimpl_->argKind(outarg);
+            CV_Assert(kind == DNN_ARG_TEMP || kind == DNN_ARG_OUTPUT);
+            layer->outputs[i] = outarg;
+        }
+
+        prog_.push_back(layer);
+        return layer->outputs;
+    }
+
+    virtual Arg append(Ptr<Layer>& layer,
+               const std::string& outname) override
+    {
+        std::vector<std::string> outnames = {outname};
+        const std::vector<Arg>& outputs = append(layer, outnames);
+        CV_Assert(outputs.size() == 1);
+        return outputs[0];
+    }
+
+    virtual std::ostream& dump(std::ostream& strm, int indent, bool comma) override
+    {
+        CV_Assert(netimpl_);
+        size_t ninputs = inputs_.size(), noutputs = outputs_.size();
+        int delta_indent = netimpl_->dump_indent;
+        int subindent = indent + delta_indent;
+        int argindent = subindent + delta_indent;
+        strm << "{\n";
+        prindent(strm, subindent);
+        strm << "name: ";
+        if (name_.empty())
+            strm << "<noname>\n";
+        else
+            strm << '\"' << name_ << "\"\n";
+        prindent(strm, subindent);
+        strm << "inputs: [\n";
+        for (size_t i = 0; i < ninputs; i++) {
+            netimpl_->dumpArg(strm, inputs_[i], argindent, i+1 < ninputs, true);
+        }
+        prindent(strm, subindent);
+        strm << "],\n";
+        prindent(strm, subindent);
+        strm << "outputs: [\n";
+        for (size_t i = 0; i < noutputs; i++) {
+            netimpl_->dumpArg(strm, outputs_[i], argindent, i+1 < noutputs, true);
+        }
+        prindent(strm, subindent);
+        strm << "],\n";
+        prindent(strm, subindent);
+        strm << "nodes: [\n";
+        size_t nlayers = prog_.size();
+        for (size_t i = 0; i < nlayers; i++) {
+            prindent(strm, argindent);
+            strm << "// op #" << i << "\n";
+            const Ptr<Layer>& layer = prog_[i];
+            layer->dump(strm, argindent, i+1 < nlayers);
+        }
+        prindent(strm, subindent);
+        strm << "]\n";
+        prindent(strm, indent);
+        strm << '}';
+        if (comma)
+            strm << ',';
+        strm << '\n';
+        return strm;
+    }
+
+    //virtual Net* net() const override { return net_; }
+    virtual const std::vector<Arg>& inputs() const override { return inputs_; }
+    virtual const std::vector<Arg>& outputs() const override { return outputs_; }
+
+    virtual void setOutputs(const std::vector<Arg>& outputs) override {
+        CV_Assert(netimpl_);
+        netimpl_->checkArgs(outputs);
+        outputs_ = outputs;
+    }
+    virtual const std::vector<Ptr<Layer> >& prog() const override { return prog_; }
+    virtual void setProg(const std::vector<Ptr<Layer> >& newprog) override { prog_ = newprog; }
+
+protected:
+    Net::Impl* netimpl_;
+    std::string name_;
+    std::vector<Arg> inputs_;
+    std::vector<Arg> outputs_;
+    std::vector<Ptr<Layer> > prog_;
+};
+
+Ptr<Graph> Graph::create(void* netimpl, const std::string& name,
+                         const std::vector<Arg>& inputs)
+{
+    return Ptr<Graph>(new GraphImpl(reinterpret_cast<Net::Impl*>(netimpl), name, inputs));
+}
+
+Graph::~Graph() {}
+
+bool Net::Impl::isConstArg(Arg arg) const
+{
+    return argKind(arg) == DNN_ARG_CONST;
+}
+
+const ArgData& Net::Impl::argData(Arg arg) const
+{
+    CV_Assert((size_t)arg.idx < args.size());
+    return args[arg.idx];
+}
+
+const std::string& Net::Impl::argName(Arg arg) const
+{
+    return argData(arg).name;
+}
+
+ArgKind Net::Impl::argKind(Arg arg) const
+{
+    return argData(arg).kind;
+}
+
+Mat& Net::Impl::argTensor(Arg arg) const
+{
+    const ArgData& adata = argData(arg);
+    if (adata.kind == DNN_ARG_TEMP) {
+        CV_Assert(__tensors__.at(arg.idx).empty());
+        int bufidx = bufidxs.at(arg.idx);
+        return const_cast<Mat&>(buffers.at(bufidx));
+    }
+    return const_cast<Mat&>(__tensors__.at(arg.idx));
+}
+
+Arg Net::Impl::getArg(const std::string& name)
+{
+    auto it = argnames.find(name);
+    if (it != argnames.end()) {
+        return Arg((int)it->second);
+    }
+    return newArg(name, DNN_ARG_TEMP);
+}
+
+bool Net::Impl::haveArg(const std::string& name) const
+{
+    return argnames.find(name) != argnames.end();
+}
+
+Arg Net::Impl::newConstArg(const std::string& name, const Mat& m)
+{
+    if (name.empty()) {
+        CV_Assert(m.empty());
+        return Arg();
+    }
+    Arg arg = newArg(name, DNN_ARG_CONST, true);
+    __tensors__[arg.idx] = m;
+    ArgData& adata = args[arg.idx];
+    adata.type = m.type();
+    adata.shape = m.shape();
+    return arg;
+}
+
+Arg Net::Impl::newArg(const std::string& name, ArgKind kind, bool allowEmptyName)
+{
+    CV_Assert(allowEmptyName || !name.empty());
+    int idx = (int)args.size();
+
+    if (!name.empty()) {
+        CV_Assert(argnames.find(name) == argnames.end());
+        argnames.insert(std::make_pair(name, (int64_t)idx));
+    }
+
+    ArgData adata;
+    adata.name = name;
+    adata.kind = kind;
+    args.push_back(adata);
+    __tensors__.push_back(Mat());
+    bufidxs.push_back(-1);
+
+    return Arg(idx);
+}
+
+
+int Net::Impl::findDim(const std::string& dimname, bool insert)
+{
+    if (!dimname.empty()) {
+        auto it = dimnames.find(dimname);
+        if (it != dimnames.end()) {
+            return (int)it->second;
+        }
+    }
+    if (!insert) {
+        CV_Error_(Error::StsObjectNotFound, ("symbolic dimension '%s' is not found",
+                                             dimname.empty() ? "<some unique name>" : dimname.c_str()));
+    }
+    int value = -(int)dimnames_vec.size() - 1;
+    std::string inserted_dimname = dimname.empty() ? format("N!%d", -value) : dimname;
+    dimnames.insert(std::make_pair(inserted_dimname, (int64_t)value));
+    dimnames_vec.push_back(inserted_dimname);
+    return value;
+}
+
+Ptr<Graph> Net::Impl::newGraph(const std::string& name_, const std::vector<Arg>& inpargs, bool ismain)
+{
+    if (ismain)
+        globGraphIdx = 0;
+    std::string name = name_;
+    if (name_.empty())
+        name = ismain ? std::string("main") : format("subgraph_%d", globGraphIdx);
+    globGraphIdx++;
+    Ptr<Graph> graph = Graph::create(this, name, inpargs);
+    if (ismain)
+        mainGraph = graph;
+    return graph;
+}
+
+void Net::Impl::prepareForInference()
+{
+    if (!prepared) {
+        constFold();
+        //inferTypes();
+        //constArgs();
+        //inferShapes(true);
+        //fuse();
+        //useBlockLayout();
+        //inferShapes(true);
+        assignBuffers();
+        totalLayers = updateGraphOfs(mainGraph, 0, true);
+        prepared = true;
+        finalizeLayers = true;
+    }
+}
+
+void Net::Impl::allocateLayerOutputs(
+                          const Ptr<Layer>& layer,
+                          const std::vector<int>& inpTypes,
+                          const std::vector<MatShape>& inpShapes,
+                          std::vector<int>& outTypes,
+                          std::vector<MatShape>& outShapes,
+                          std::vector<std::pair<uchar*, size_t> >& outOrigData,
+                          std::vector<Mat>& outputs,
+                          std::vector<int>& tempTypes,
+                          std::vector<MatShape>& tempShapes,
+                          std::vector<Mat>& temps,
+                          std::vector<Mat>& globalTemps,
+                          bool useBufferPool)
+{
+    // In theory, when
+    // 1) useBufferPool==true,
+    // 2) the buffers in the pool are already big enough (e.g. when we already run inference a few times to let them grow)
+    // 3) getMemoryShapes() and getTypes() are implemented efficiently without any memory allocations
+    // the method allocateLayerOutputs() should not allocate any memory either.
+    //
+    // Well, currently it still may do so, because Mat::fit() may create e.g. 4D tensor on top of 1D buffer and then
+    // MatSize and MatStep will require dynamic memory allocation (those are very small buffers though).
+    // But we plan to make MatSize and MatStep lighter so that they don't use dynamic memory.
+    size_t noutputs = layer->outputs.size();
+    outShapes.clear();
+    outTypes.clear();
+    tempShapes.clear();
+    tempTypes.clear();
+    layer->getMemoryShapes(inpShapes, (int)noutputs, outShapes, tempShapes);
+    layer->getTypes(inpTypes, (int)noutputs, (int)tempShapes.size(), outTypes, tempTypes);
+    CV_Assert(tempShapes.size() == tempTypes.size());
+    CV_Assert(outShapes.size() == outTypes.size());
+    CV_Assert(outShapes.size() == noutputs);
+    outputs.assign(noutputs, Mat());
+    outOrigData.resize(noutputs);
+    for (size_t i = 0; i < noutputs; i++) {
+        Arg out = layer->outputs[i];
+        if (useBufferPool) {
+            Mat& out_t = argTensor(out);
+            out_t.fit(outShapes[i], outTypes[i]);
+            outputs[i] = out_t;
+        } else {
+            outputs[i].fit(outShapes[i], outTypes[i]);
+        }
+        outOrigData[i].first = outputs[i].u ? outputs[i].u->data : nullptr;
+        outOrigData[i].second = outputs[i].u ? outputs[i].u->size : 0;
+    }
+    // [TODO] probably there should be a smarter algorithm that e.g. sorts
+    // temp buffers by size in decreasing order and assigns global temps accordingly
+    // in order to minimize the total size of temp buffers
+    size_t ntemps = tempShapes.size();
+    temps.resize(ntemps);
+    globalTemps.resize(std::max(ntemps, globalTemps.size()));
+    for (size_t i = 0; i < ntemps; i++) {
+        globalTemps[i].fit(tempShapes[i], tempTypes[i]);
+        temps[i] = globalTemps[i];
+    }
+}
+
+void Net::Impl::forwardMainGraph(InputArrayOfArrays inputs, OutputArrayOfArrays outputs)
+{
+    if (!mainGraph) {
+        CV_Error(Error::StsNullPtr, "the model was not loaded");
+    }
+    // ************ uncomment one of the lines below for debugging **********
+    //tracingMode = DNN_TRACE_OP;
+    //tracingMode = DNN_TRACE_ALL;
+    // [TODO] initialize profile, tracer, symbolic shapes etc.
+    size_t nsymdims = dimnames_vec.size();
+    dimvalues.assign(nsymdims, -1);
+    layersTimings.assign(totalLayers + 1, 0.);
+
+    forwardGraph(mainGraph, inputs, outputs, true);
+
+    // reset finalizeLayer so that layers are only initialized once.
+    // [TODO] if a target or backend change or there are some other important
+    // global changes in configuration, finalizeLayers should be set to 'true' again
+    finalizeLayers = false;
+}
+
+Mat Net::Impl::forwardWithSingleOutput(const std::string& outname)
+{
+    if (!mainGraph) {
+        CV_Error(Error::StsNullPtr, "the model was not loaded");
+    }
+    const std::vector<Arg>& outargs = mainGraph->outputs();
+    CV_Assert(outargs.size() > 0);
+    if (!outname.empty()) {
+        const ArgData& outdata = args.at(outargs[0].idx);
+        CV_Assert(outdata.name == outname);
+    }
+    std::vector<Mat> inps={}, outs;
+    forwardMainGraph(inps, outs);
+    return outs[0];
+}
+
+void Net::Impl::forwardWithMultipleOutputs(OutputArrayOfArrays outblobs, const std::vector<std::string>& outnames)
+{
+    if (!mainGraph) {
+        CV_Error(Error::StsNullPtr, "the model was not loaded");
+    }
+    const std::vector<Arg>& outargs = mainGraph->outputs();
+    std::vector<int> outidxs;
+    int i, j, noutputs = (int)outargs.size();
+    if (!outnames.empty()) {
+        CV_CheckEQ((int)outnames.size(), noutputs, "the number of requested and actual outputs must be the same");
+        if (noutputs == 1 && outnames[0].empty())
+            ;
+        else {
+            for (i = 0; i < noutputs; i++) {
+                const std::string& outname = outnames[i];
+                for (j = 0; j < noutputs; j++) {
+                    const ArgData& adata = args.at(outargs[j].idx);
+                    if (adata.name == outname) {
+                        outidxs.push_back((int)j);
+                        break;
+                    }
+                }
+                if (j == noutputs) {
+                    CV_Error_(Error::StsObjectNotFound, ("the required output '%s' is not found", outname.c_str()));
+                }
+            }
+        }
+    }
+    std::vector<Mat> inps={}, outs;
+    forwardMainGraph(inps, outs);
+    CV_Assert(outs.size() == noutputs);
+    std::vector<Mat>* outMats = nullptr;
+    std::vector<UMat>* outUMats = nullptr;
+    _InputArray::KindFlag outKind = outblobs.kind();
+    if (outKind == _InputArray::STD_VECTOR_MAT) {
+        outMats = &outblobs.getMatVecRef();
+        outMats->resize(noutputs);
+    } else if (outKind == _InputArray::STD_VECTOR_UMAT) {
+        outUMats = &outblobs.getUMatVecRef();
+        outUMats->resize(noutputs);
+    } else if (outKind == _InputArray::MAT || outKind == _InputArray::UMAT) {
+        CV_Assert(noutputs == 1);
+    } else {
+        CV_Error(Error::StsBadArg, "outputs must be Mat, UMat, a vector of Mat's or a vector of UMat's");
+    }
+    for (i = 0; i < noutputs; i++) {
+        int j = outidxs.empty() ? i : outidxs[i];
+        Mat src = outs[j];
+        if (outMats) {
+            src.copyTo(outMats->at(i));
+        } else if (outUMats) {
+            src.copyTo(outUMats->at(i));
+        } else {
+            src.copyTo(outblobs);
+        }
+    }
+}
+
+/*void Net::Impl::checkAndUpdateDim(const Ptr<Graph>& g, const Ptr<Layer>& layer, Arg inp, int j, int value)
+{
+    const ArgData& adata = args[inp.idx];
+    int64_t value0 = adata.size.size[j];
+    if (value0 >= 0) {
+        if (value0 != value) {
+            CV_Error_(Error::StsBadArg, ("graph '%s': node '%s': %d-th dimension of argument '%s' is wrong: %lld given, %lld expected",
+                                        g->name().data(), node ? node->name().data() : "none (graph input)", j, adata.name.c_str(), value, value0));
+        }
+    } else {
+        int64_t idx = -value0-1;
+        CV_Assert(0 <= idx && idx < (int64_t)dimvalues.size());
+        value0 = dimvalues[idx];
+        if (value0 < 0) {
+            dimvalues[idx] = value;
+        } else if (value0 != value) {
+            CV_Error_(Error::StsBadArg,
+            ("graph '%s': node '%s': %d-th dimension '%s' of argument '%s' is wrong: %lld given, but '%s' is already set to %lld",
+                    g->name().data(), node ? node->name().data() : "none (graph input)",
+                    j, dimnames_[idx].c_str(), adata.name.c_str(),
+                    value, dimnames_[idx].c_str(), value0));
+        }
+    }
+}*/
+
+void Net::Impl::traceArg(std::ostream& strm_, const char* prefix, size_t i, Arg arg, bool dumpdata)
+{
+    const int PPRINT_CONTEXT = 3;
+    const int PPRINT_CONST_THRESHOLD = 16;
+    const int PPRINT_ALL_THRESHOLD = 100;
+    const Mat& m = argTensor(arg);
+    const ArgData& adata = args.at(arg.idx);
+    bool constArg = adata.kind == DNN_ARG_CONST;
+    // [TODO] replace with type compatibility check
+    // CV_Assert(m.type() == adata.type);
+    strm_ << prefix << " " << i << ". Name: " << (arg.idx > 0 ? adata.name.c_str() : "<empty>") << "\n";
+    if (arg.idx == 0)
+        return;
+    strm_ << "  Buf: " << bufidxs.at(arg.idx) << "\n";
+    strm_ << "  Type: " << typeToString(adata.type) << " \n";
+    MatShape shape = m.shape();
+    strm_ << "  Shape: " << shape;
+    if (constArg && m.total() <= PPRINT_CONST_THRESHOLD) {
+        strm_ << " /* ";
+        pprint(strm_, m, 0, PPRINT_CONTEXT, PPRINT_CONST_THRESHOLD, '{');
+        strm_ << " */";
+    }
+    strm_ << "\n  Layout: " << layoutToString(shape.layout) << "\n";
+    if (dumpdata && !constArg) {
+        // [TODO] when we support block layout, block-layout tensor
+        // should be converted to the original layout before printing it
+        pprint(strm_, m, 0, PPRINT_CONTEXT, PPRINT_ALL_THRESHOLD, '[');
+        strm_ << "\n";
+    }
+}
+
+void Net::Impl::setMainGraphInput(InputArray m, const std::string& inpname)
+{
+    CV_Assert(mainGraph);
+    const std::vector<Arg>& gr_inputs = mainGraph->inputs();
+    size_t i, ninputs = gr_inputs.size();
+    if (inpname.empty()) {
+        CV_Assert(ninputs == 1 && "empty name can only be used to set input if there is just one input");
+        i = 0;
+    } else {
+        for (i = 0; i < ninputs; i++) {
+            const ArgData& adata = args.at(gr_inputs[i].idx);
+            CV_Assert(adata.kind == DNN_ARG_INPUT);
+            if (adata.name == inpname)
+                break;
+        }
+        if ((i == ninputs) && (!isdigit(inpname[0]) || !sscanf(inpname.c_str(), "%zu", &i))) {
+            CV_Error_(Error::StsObjectNotFound, ("input '%s' is not found", inpname.c_str()));
+        }
+    }
+    setGraphInput(mainGraph, i, m.getMat());
+}
+
+void Net::Impl::setGraphInput(Ptr<Graph>& graph, size_t idx, const Mat& m)
+{
+    int mtype = m.type();
+    MatShape mshape = m.shape();
+    const std::vector<Arg>& gr_inputs = graph->inputs();
+    CV_Assert(idx < gr_inputs.size());
+    Arg inp = gr_inputs[idx];
+    const ArgData& adata = args.at(inp.idx);
+    /*
+     [TODO] add more detailed shape check
+     if (adata.shape.dims != mshape.dims) {
+     CV_Error_(Error::StsBadArg, ("wrong dimensionality of argument '%s': %d given, %d expected",
+     adata.name.c_str(), tsize.ndims, adata.size.ndims));
+     }
+
+     for (int k = 0; k < mshape.dims; k++) {
+        checkAndUpdateDim(graph, Node(), inp, k, tsize.size[k]);
+     }
+    */
+
+    if (adata.kind == DNN_ARG_INPUT) {
+        int adata_type = adata.type;
+        if ((adata_type == CV_16F || adata_type == CV_16BF) && !enableFP16)
+            adata_type = CV_32F;
+        // [TODO] need to analyze this situation more carefully
+        if (adata_type == CV_64F)
+            adata_type = CV_32F;
+        if (adata_type != mtype &&
+            !((adata_type == CV_64F || adata_type == CV_32F || adata_type == CV_16F || adata_type == CV_16BF) &&
+              (mtype == CV_64F || mtype == CV_32F || mtype == CV_16F || mtype == CV_16BF)))
+        {
+            CV_Error_(Error::StsBadArg, ("incompatible type of input tensor #%zu '%s': %s given, %s expected",
+                                         idx, adata.name.c_str(), typeToString(mtype).c_str(),
+                                         typeToString(adata.type).c_str()));
+        }
+        Mat& inp_t = argTensor(inp);
+        if (inp_t.shape() != mshape || inp_t.type() != adata_type)
+            finalizeLayers = true;
+        inp_t.fit(mshape, adata_type);
+        m.convertTo(inp_t, adata_type);
+    } else if (adata.kind == DNN_ARG_TEMP) {
+        int bufidx = bufidxs.at(inp.idx);
+        Mat& buf = buffers.at(bufidx);
+        buf.fit(mshape, mtype); // minimize reallocations
+        m.copyTo(buf);
+    } else {
+        CV_Error_(Error::StsBadArg, ("graph %s: argument '%s' must be 'INPUT' or 'TEMP', not '%s'",
+                                     graph->name().data(), adata.name.c_str(),
+                                     argKindToString(adata.kind).c_str()));
+    }
+}
+
+void Net::Impl::forwardGraph(Ptr<Graph>& graph, InputArrayOfArrays inputs_,
+                             OutputArrayOfArrays outputs_, bool isMainGraph)
+{
+    auto graphofs_it = graphofs.find(graph->name());
+    if (graphofs_it == graphofs.end()) {
+        CV_Error_(Error::StsObjectNotFound, ("graph '%s' does not belong to the model", graph->name().c_str()));
+    }
+
+    std::ostream& strm_ = dump_strm ? *dump_strm : std::cout;
+    const std::vector<Ptr<Layer> >& prog = graph->prog();
+    size_t i, nops = prog.size();
+    const std::vector<Arg>& gr_inputs = graph->inputs();
+    const std::vector<Arg>& gr_outputs = graph->outputs();
+    size_t n_gr_inputs = gr_inputs.size(), n_gr_outputs = gr_outputs.size();
+    std::vector<Mat> inpMats, outMats, tempMats;
+    std::vector<int> inpTypes, outTypes, tempTypes;
+    std::vector<std::pair<uchar*, size_t> > outOrigData;
+    std::vector<MatShape> inpShapes, outShapes, tempShapes;
+    double tickfreq = getTickFrequency();
+    int64_t timestamp = 0;
+
+    size_t graph_ofs = (size_t)graphofs_it->second;
+    CV_Assert(graph_ofs + nops <= totalLayers);
+
+    if (inputs_.empty()) {
+        // inputs are already set; it's only possible to do with the main graph
+        CV_Assert(isMainGraph);
+        for (i = 0; i < n_gr_inputs; i++)
+            CV_CheckFalse(argTensor(gr_inputs[i]).empty(), "Some of the model inputs were not set");
+    }
+    else {
+        if (inputs_.total() != n_gr_inputs) {
+            CV_Error_(Error::StsBadArg, ("wrong number of inputs in graph '%s': %zu given, %zu expected",
+                                         graph->name().data(), inputs_.total(), n_gr_inputs));
+        }
+        for (i = 0; i < n_gr_inputs; i++) {
+            Mat m = inputs_.getMat((int)i);
+            setGraphInput(graph, i, m);
+        }
+    }
+
+    for (size_t opidx = 0; opidx < nops; opidx++) {
+        const Ptr<Layer>& layer = prog.at(opidx);
+        if (!layer) // in theory we shouldn't have any 'nops' at this stage, but just in case we skip them.
+            continue;
+        const std::vector<Arg>& inputs = layer->inputs;
+        const std::vector<Arg>& outputs = layer->outputs;
+        size_t ninputs = inputs.size(), noutputs = outputs.size();
+
+        inpMats.resize(ninputs);
+        inpTypes.resize(ninputs);
+        inpShapes.resize(ninputs);
+        outMats.clear();
+        outOrigData.clear();
+
+        for (i = 0; i < ninputs; i++) {
+            Arg inp = inputs[i];
+            //const ArgData& adata = args[inp.idx];
+            const Mat& m = argTensor(inp);
+            inpMats[i] = m;
+            inpTypes[i] = m.type();
+            inpShapes[i] = m.shape();
+        }
+
+        if (tracingMode != DNN_TRACE_NONE) {
+            strm_ << "-----------\n";
+            strm_ << "'" << graph->name() << "' [" << opidx << "/" << nops << "]. " << layer->type << " node: " << layer->name << "\n";
+            for (i = 0; i < ninputs; i++) {
+                Arg inp = inputs[i];
+                traceArg(strm_, "Input", i, inp, false);
+            }
+        }
+
+        bool dynamicOutShapes = layer->dynamicOutputShapes();
+        if (!dynamicOutShapes) {
+            allocateLayerOutputs(layer, inpTypes, inpShapes, outTypes, outShapes, outOrigData, outMats,
+                                 tempTypes, tempShapes, tempMats, scratchBufs, true);
+        } else {
+            outMats.resize(noutputs);
+            for (i = 0; i < noutputs; i++) {
+                Arg out = outputs[i];
+                outMats[i] = argTensor(out);
+            }
+            tempMats = scratchBufs;
+        }
+
+        timestamp = getTickCount();
+
+        // [TODO] handle If/Loop/...
+        CV_Assert(!layer->subgraphs());
+        if (finalizeLayers)
+            layer->finalize(inpMats, outMats);
+        layer->forward(inpMats, outMats, tempMats);
+        CV_Assert(outMats.size() == noutputs);
+
+        for (i = 0; i < noutputs; i++) {
+            Arg out = outputs[i];
+            ArgData& adata = args[out.idx];
+            const Mat& m = outMats[i];
+            //checkRange(m, false);
+            adata.type = m.type();
+            adata.shape = m.shape();
+            if (adata.kind == DNN_ARG_TEMP) {
+                int bufidx = bufidxs.at(out.idx);
+                Mat& buf = buffers.at(bufidx);
+
+                if (!dynamicOutShapes) {
+                    // a sanity check: make sure that the data was not reallocated during Layer::forward()
+                    // if the layer claims it does not produce dynamic-shape outputs.
+                    CV_Assert_N(buf.u == m.u,
+                                buf.shape() == m.shape(),
+                                buf.type() == m.type(),
+                                (!m.u || m.u->data == outOrigData[i].first),
+                                (!m.u || m.u->size == outOrigData[i].second));
+                } else if (!buf.u || m.u->size > buf.u->size) {
+                    buf = m;
+                } else {
+                    // this branch means that the layer still calls
+                    // 'create()' rather than 'fit()'; that needs to be fixed, but
+                    // we provide workaround here at the expense of extra copy.
+                    buf.fit(m.shape(), m.type());
+                    m.copyTo(buf);
+                }
+            } else {
+                __tensors__.at(out.idx) = m;
+            }
+        }
+
+        timestamp = getTickCount() - timestamp;
+        layersTimings[opidx + graph_ofs + 1] += timestamp;
+
+        if (tracingMode != DNN_TRACE_NONE) {
+            strm_ << "TIME (\"" << layer->name << "\", \"" << layer->type << "\"): " <<
+                format("%.2fms", (double)timestamp*1000./tickfreq) << "\n";
+            for (i = 0; i < noutputs; i++) {
+                Arg out = outputs[i];
+                traceArg(strm_, "Output", i, out, tracingMode == DNN_TRACE_ALL);
+            }
+        }
+    }
+
+    std::vector<Mat>& outputsVec = outputs_.getMatVecRef();
+    outputsVec.resize(n_gr_outputs);
+    for (i = 0; i < n_gr_outputs; i++) {
+        Arg out = gr_outputs[i];
+        const Mat& outm = argTensor(out);
+        if (isMainGraph) {
+            outputsVec[i].fit(outm.shape(), outm.type());
+            outm.copyTo(outputsVec[i]);
+        } else {
+            outputsVec[i] = outm;
+        }
+    }
+}
+
+
+void Net::Impl::updateUseCounts(const Ptr<Graph>& graph, std::vector<int>& usecounts) const
+{
+    if (!graph)
+        return;
+    const std::vector<Ptr<Layer> >& prog = graph->prog();
+    for (const Ptr<Layer>& layer: prog) {
+        const std::vector<Arg>& inputs = layer->inputs;
+        for (const Arg& input: inputs) {
+            CV_Assert(input.idx < (int)usecounts.size());
+            usecounts[input.idx]++;
+        }
+        const std::vector<Ptr<Graph> >* subgraphs = layer->subgraphs();
+        if (subgraphs) {
+            for (const Ptr<Graph>& subgraph: *subgraphs) {
+                updateUseCounts(subgraph, usecounts);
+            }
+        }
+    }
+}
+
+void Net::Impl::useCounts(std::vector<int>& usecounts) const
+{
+    size_t nargs = args.size();
+    usecounts.assign(nargs, 0);
+    usecounts[0] = 1; // empty Arg() is always useful
+    updateUseCounts(mainGraph, usecounts);
+}
+
+int Net::Impl::updateGraphOfs(const Ptr<Graph>& graph, int currofs, bool ismain)
+{
+    CV_Assert(currofs >= 0);
+    if (ismain) {
+        graphofs.clear();
+        allgraphs.clear();
+        layerNameToId.clear();
+    }
+    const std::vector<Ptr<Layer> >& prog = graph->prog();
+    size_t i, nops = prog.size();
+    int subgraph_ofs = currofs + (int)nops;
+    std::string name = graph->name();
+    graphofs.insert(std::make_pair(name, currofs));
+    allgraphs.push_back(graph);
+    for (i = 0; i < nops; i++) {
+        const Ptr<Layer>& layer = prog[i];
+        layerNameToId.insert(std::make_pair(layer->name, currofs + (int)i));
+        const std::vector<Ptr<Graph> >* subgraphs = layer->subgraphs();
+        if (subgraphs) {
+            for (const Ptr<Graph>& subgraph : *subgraphs) {
+                subgraph_ofs = updateGraphOfs(subgraph, subgraph_ofs, false);
+            }
+        }
+    }
+    return subgraph_ofs;
+}
+
+bool Net::Impl::tryInferShapes(const std::vector<MatShape>& suggestedInpShapes,
+                               const std::vector<MatType>& suggestedInpTypes,
+                               LayerShapes& result,
+                               std::vector<MatShape>& shapeCache,
+                               std::vector<MatType>& typeCache) const
+{
+    result.in.clear();
+    result.out.clear();
+    result.inTypes.clear();
+    result.outTypes.clear();
+
+    CV_Assert(mainGraph);
+    size_t nargs = args.size();
+    shapeCache.assign(nargs, MatShape());
+    typeCache.assign(nargs, -1);
+
+    const std::vector<Arg>& inputs = mainGraph->inputs();
+    const std::vector<Arg>& outputs = mainGraph->outputs();
+
+    size_t ninputs = inputs.size();
+    size_t noutputs = outputs.size();
+
+    size_t nsuggestedShapes = suggestedInpShapes.size();
+    size_t nsuggestedTypes = suggestedInpTypes.size();
+    CV_Assert(nsuggestedShapes == 0 || nsuggestedShapes == ninputs ||
+
+              // workaround, but this is not quite correct usage of the function
+              (nsuggestedShapes == 1 && suggestedInpShapes[0].empty())
+              );
+    CV_Assert(nsuggestedTypes <= 1 || nsuggestedTypes == ninputs);
+    bool dynamicInputShapes = false;
+
+    result.in.resize(ninputs);
+    result.inTypes.resize(ninputs);
+
+    for (size_t i = 0; i < ninputs; i++) {
+        Arg inp = inputs[i];
+        const ArgData& adata = args.at(inp.idx);
+        CV_Assert(adata.kind == DNN_ARG_INPUT);
+
+        int type;
+        MatShape shape;
+        const Mat& tensor = argTensor(inp);
+        if (!tensor.empty()) {
+            type = tensor.type();
+            shape = tensor.shape();
+        } else {
+            type = adata.type;
+            shape = adata.shape;
+        }
+
+        if (nsuggestedTypes) {
+            int suggestedType = suggestedInpTypes[i < nsuggestedTypes ? i : 0];
+            if (suggestedType == -1)
+                suggestedType = type;
+            if (adata.type == type ||
+                ((adata.type == CV_32F || adata.type == CV_16F || adata.type == CV_16BF) &&
+                 (suggestedType == CV_32F || suggestedType == CV_16F || suggestedType == CV_16BF)))
+                ;
+            else {
+                CV_Error_(Error::StsBadArg, ("mismatched type for model input '%s': %s provided, %s expected",
+                                             adata.name.c_str(), typeToString(suggestedType).c_str(),
+                                             typeToString(adata.type).c_str()));
+            }
+            type = suggestedType;
+        }
+
+        if (nsuggestedShapes) {
+            MatShape suggestedShape = suggestedInpShapes[i < nsuggestedShapes ? i : 0];
+            if (suggestedShape.empty()) {
+                suggestedShape = shape;
+            }
+            // [TODO] shut up it for now;
+            // too many ONNX conformance tests
+            // depend on this "liberal" behaviour
+            //
+            // CV_Assert(suggestedShape.dims == adata.shape.dims);
+            shape = suggestedShape;
+        }
+
+        typeCache[inp.idx] = type;
+        shapeCache[inp.idx] = shape;
+
+        if (shape.hasSymbols()) {
+            CV_LOG_WARNING(NULL, format("the shape of model input '%s' includes symbols. Shape inference is impossible without prior calls to setInput()",
+                adata.name.c_str()));
+            dynamicInputShapes = true;
+            shape = MatShape();
+        }
+
+        result.inTypes[i] = type;
+        result.in[i] = shape;
+    }
+
+    bool inferenced = false;
+    if (!dynamicInputShapes)
+        inferenced = tryInferGraphShapes(mainGraph, shapeCache, typeCache);
+    bool missingOutputs = false;
+
+    result.outTypes.resize(noutputs, -1);
+    result.out.resize(noutputs);
+
+    for (size_t i = 0; i < noutputs; i++) {
+        Arg out = outputs[i];
+        const ArgData adata = args.at(out.idx);
+        int type = typeCache.at(out.idx);
+        MatShape shape = shapeCache.at(out.idx);
+        if (type < 0) {
+            if (!inferenced)
+                type = adata.type;
+            if (type < 0) {
+                CV_LOG_WARNING(NULL, format("type for output '%s' was not inferred",                        adata.name.c_str()));
+                missingOutputs = true;
+            }
+        }
+
+        result.outTypes[i] = type;
+        result.out[i] = shape;
+    }
+
+    return inferenced && !missingOutputs;
+}
+
+// [TODO]
+// The current 'pure' shape inference is quite fragile, it does not handle any dynamic cases
+// or even some seemingly dynamic cases.
+// It would be nice maybe to some optional speculative forward() with some dummy inputs when
+// straight-forward shape inference mechanism failed.
+bool Net::Impl::tryInferGraphShapes(const Ptr<Graph>& graph,
+                                    std::vector<MatShape>& shapeCache,
+                                    std::vector<MatType>& typeCache) const
+{
+    if (!graph)
+        return true;
+
+    const std::vector<Ptr<Layer> >& prog = graph->prog();
+
+    std::vector<MatShape> inpShapes, outShapes, tempShapes;
+    std::vector<int> inpTypes, outTypes, tempTypes;
+
+    for (const Ptr<Layer>& layer: prog) {
+        if (!layer)
+            continue;
+
+        const std::vector<Ptr<Graph> >* subgraphs = layer->subgraphs();
+        if (subgraphs) {
+            CV_LOG_WARNING(NULL, format("shape inference for the model with subgraphs (node %s (%s)) is not supported yet", layer->name.c_str(), layer->type.c_str()));
+        }
+
+        if (layer->dynamicOutputShapes()) {
+            CV_LOG_WARNING(NULL, format("DNN/InferShape: Layer '%s' (%s) output shapes cannot be inferenced without running forward()", layer->name.c_str(), layer->type.c_str()));
+            return false;
+        }
+
+        const std::vector<Arg>& inputs = layer->inputs;
+        const std::vector<Arg>& outputs = layer->outputs;
+
+        int ninputs = (int)inputs.size();
+        int noutputs = (int)outputs.size();
+
+        inpShapes.resize(ninputs);
+        inpTypes.resize(ninputs);
+        outShapes.clear();
+        outTypes.clear();
+        tempShapes.clear();
+        tempTypes.clear();
+
+        for (int i = 0; i < ninputs; i++) {
+            Arg inp = inputs[i];
+            const ArgData& adata = args.at(inp.idx);
+            MatShape shape;
+            int type;
+
+            if (adata.kind == DNN_ARG_CONST || adata.kind == DNN_ARG_EMPTY) {
+                shape = adata.shape;
+                type = adata.type;
+
+                // unnecessary, but nice to have for consistency
+                shapeCache[inp.idx] = shape;
+                typeCache[inp.idx] = type;
+            } else {
+                shape = shapeCache[inp.idx];
+                type = typeCache[inp.idx];
+                if (type < 0) {
+                    CV_Error_(Error::StsInternal, ("input '%s' of operation '%s' (%s) does not have a proper type",
+                                                   adata.name.c_str(), layer->name.c_str(), layer->type.c_str()));
+                }
+            }
+            inpShapes[i] = shape;
+            inpTypes[i] = type;
+        }
+
+        layer->getMemoryShapes(inpShapes, noutputs, outShapes, tempShapes);
+        CV_Assert((int)outShapes.size() == noutputs);
+        layer->getTypes(inpTypes, noutputs, (int)tempShapes.size(), outTypes, tempTypes);
+        CV_Assert((int)outTypes.size() == noutputs);
+
+        for (int i = 0; i < noutputs; i++) {
+            Arg out = outputs[i];
+            if (out.idx == 0)
+                continue;
+            shapeCache[out.idx] = outShapes[i];
+            typeCache[out.idx] = outTypes[i];
+        }
+    }
+
+    return true;
+}
+
+void Net::Impl::checkArgs(const std::vector<Arg>& args_) const
+{
+    for (const Arg& a: args_) {
+        checkArg(a);
+    }
+}
+
+void Net::Impl::checkArg(Arg a) const
+{
+    CV_Assert(a.idx >= 0);
+    CV_Assert(a.idx < (int)args.size());
+}
+
+std::ostream& Net::Impl::dumpDim(std::ostream& strm, int value) const
+{
+    if (value >= 0) {
+        strm << value;
+    } else {
+        size_t idx = -value;
+        if (idx < dimnames_vec.size())
+            strm << dimnames_vec[idx];
+        else
+            strm << "sym(" << idx << ")";
+    }
+    return strm;
+}
+
+std::ostream& Net::Impl::dumpTypeShape(std::ostream& strm, int type, const MatShape& shape) const
+{
+    if (shape.empty()) {
+        strm << "<empty>";
+    } else {
+        strm << typeToString(type);
+        if (shape.dims > 0 && shape.layout != DATA_LAYOUT_UNKNOWN) {
+            strm << " " << layoutToString(shape.layout);
+        }
+        strm << " [";
+        for (int i = 0; i < shape.dims; i++) {
+            strm << (i > 0 ? " x " : "");
+            dumpDim(strm, shape[i]);
+        }
+        strm << "]";
+    }
+    return strm;
+}
+
+std::ostream& Net::Impl::dumpArg(std::ostream& strm, Arg arg, int indent,
+                                 bool comma, bool dump_details) const
+{
+    checkArg(arg);
+    const ArgData& adata = args.at(arg.idx);
+    prindent(strm, indent);
+    if (arg.empty()) {
+        strm << "<empty>" << (comma ? "," : "");
+    } else {
+        strm << '\"' << adata.name << (comma ? "\"," : "\"");
+        if (dump_details && arg.idx > 0) {
+            strm << " // ";
+            strm << (adata.kind == DNN_ARG_INPUT ? "<Input>" :
+                     adata.kind == DNN_ARG_OUTPUT ? "<Output>" :
+                     adata.kind == DNN_ARG_CONST ? "<Const>" :
+                     adata.kind == DNN_ARG_TEMP ? "<Temp>" :
+                     "<Uknown kind>");
+            if (adata.type >= 0) {
+                strm << " ";
+                dumpTypeShape(strm, adata.type, adata.shape);
+            }
+            if (adata.kind == DNN_ARG_TEMP && ((size_t)arg.idx < bufidxs.size()))
+                strm << " (buf #" << bufidxs[arg.idx] << ")";
+        }
+    }
+    strm << "\n";
+    return strm;
+}
+
+std::ostream& Net::Impl::dump(std::ostream& strm)
+{
+    int indent = dump_indent;
+    strm << "{\n";
+    prindent(strm, indent);
+    strm << "model_format: \"" << modelFormatToString(modelFormat) << "\",\n";
+    if (modelFormat == DNN_MODEL_ONNX) {
+        prindent(strm, indent);
+        strm << "onnx_opset: " << onnx_opset << ",\n";
+    }
+    prindent(strm, indent);
+    strm << "layout: \"" << layoutToString(originalLayout) << "\",\n";
+    if (mainGraph) {
+        prindent(strm, indent);
+        strm << "main_graph: ";
+        mainGraph->dump(strm, indent, false);
+    }
+    strm << "}\n";
+    return strm;
+}
+
+CV__DNN_INLINE_NS_END
+}}  // namespace cv::dnn
diff --git a/modules/dnn/src/net_impl_backend.cpp b/modules/dnn/src/net_impl_backend.cpp
index 4ec202dd8f..fc62cc991c 100644
--- a/modules/dnn/src/net_impl_backend.cpp
+++ b/modules/dnn/src/net_impl_backend.cpp
@@ -188,6 +188,13 @@ void Net::Impl::setPreferableBackend(Net& net, int backendId)
 
     if (preferableBackend != backendId)
     {
+        if (mainGraph)
+        {
+            CV_LOG_WARNING(NULL, "Back-ends are not supported by the new graph egine for now");
+            preferableBackend = backendId;
+            return;
+        }
+
         clear();
         if (backendId == DNN_BACKEND_INFERENCE_ENGINE_NGRAPH)
         {
@@ -217,6 +224,11 @@ void Net::Impl::setPreferableBackend(Net& net, int backendId)
 
 void Net::Impl::setPreferableTarget(int targetId)
 {
+    if (mainGraph)
+    {
+        CV_LOG_WARNING(NULL, "Targtes are not supported by the new graph egine for now");
+        return;
+    }
     if (netWasQuantized && targetId != DNN_TARGET_CPU &&
         targetId != DNN_TARGET_OPENCL && targetId != DNN_TARGET_OPENCL_FP16 && targetId != DNN_TARGET_NPU)
     {
diff --git a/modules/dnn/src/net_impl_fuse.cpp b/modules/dnn/src/net_impl_fuse.cpp
index b81bf14acc..57afa92315 100644
--- a/modules/dnn/src/net_impl_fuse.cpp
+++ b/modules/dnn/src/net_impl_fuse.cpp
@@ -662,6 +662,15 @@ void Net::Impl::fuseLayers(const std::vector<LayerPin>& blobsToKeep_)
         if (preferableBackend != DNN_BACKEND_OPENCV && preferableBackend != DNN_BACKEND_CUDA)
             continue;  // Go to the next layer.
 
+        // [TODO] temporarily disabled Concat optimization.
+        //
+        // Ticket: https://github.com/opencv/opencv/issues/26195
+        //
+        // It's not quite compatible with dynamic shapes,
+        // so we need to make sure that we correctly predicted shapes
+        // of all the concatenated tensors and their offsets inside the result
+        // and also properly allocated that concatenated tensor in advance
+#if 0
         // the optimization #2. if there is concat layer that concatenates channels
         // from the inputs together (i.e. axis == 1) then we make the inputs of
         // the concat layer to write to the concatenation output buffer
@@ -832,6 +841,7 @@ void Net::Impl::fuseLayers(const std::vector<LayerPin>& blobsToKeep_)
                 }
             }
         }
+#endif
     }
 }
 
diff --git a/modules/dnn/src/net_openvino.cpp b/modules/dnn/src/net_openvino.cpp
index 501a596e5d..9b18308c96 100644
--- a/modules/dnn/src/net_openvino.cpp
+++ b/modules/dnn/src/net_openvino.cpp
@@ -667,7 +667,7 @@ Net NetImplOpenVINO::createNetworkFromModelOptimizer(std::shared_ptr<ov::Model>&
     {
         inputsNames.push_back(it->get_friendly_name());
         std::vector<size_t> dims = it->get_shape();
-        inp_shapes.push_back(std::vector<int>(dims.begin(), dims.end()));
+        inp_shapes.push_back(MatShape(dims.begin(), dims.end()));
     }
     // nGraph models produce output "Result" layers which have "/sink_port" suffix in their names.
     // Their inputs are actual model outputs and we change friendly name to it.
diff --git a/modules/dnn/src/ocl4dnn/src/ocl4dnn_softmax.cpp b/modules/dnn/src/ocl4dnn/src/ocl4dnn_softmax.cpp
index 7b32189fdc..3208fd02a5 100644
--- a/modules/dnn/src/ocl4dnn/src/ocl4dnn_softmax.cpp
+++ b/modules/dnn/src/ocl4dnn/src/ocl4dnn_softmax.cpp
@@ -57,17 +57,17 @@ OCL4DNNSoftmax<Dtype>::OCL4DNNSoftmax(OCL4DNNSoftmaxConfig config)
     inner_num_ = 1;
     outer_num_ = 1;
     count_ = 1;
-    int32_t scale_sz = 1;
-    for (int32_t i = softmax_axis_ + 1; i < config.in_shape.size(); i++)
+    int scale_sz = 1;
+    for (int i = (int)softmax_axis_ + 1; i < config.in_shape.dims; i++)
         inner_num_ *= config.in_shape[i];
     use_slm_ = (config.in_shape[softmax_axis_] * inner_num_ + inner_num_ * 17) <= 8192;
-    for (int32_t i = 0; i < softmax_axis_; i++)
+    for (int i = 0; i < (int)softmax_axis_; i++)
         outer_num_ *= config.in_shape[i];
     count_ = inner_num_ + outer_num_;
 
-    std::vector<int32_t> scale_dims = config.in_shape;
+    auto scale_dims = config.in_shape;
     scale_dims[softmax_axis_] = use_slm_ ? 1 : 17;
-    for (int32_t i = 0; i < scale_dims.size(); i++)
+    for (int i = 0; i < scale_dims.dims; i++)
         scale_sz *= scale_dims[i];
 
     scale_data_.create(1, scale_sz, CV_32FC1);
diff --git a/modules/dnn/src/onnx/onnx_importer.cpp b/modules/dnn/src/onnx/onnx_importer.cpp
index 4be7124768..90f72c3638 100644
--- a/modules/dnn/src/onnx/onnx_importer.cpp
+++ b/modules/dnn/src/onnx/onnx_importer.cpp
@@ -6,12 +6,12 @@
 // Third party copyrights are property of their respective owners.
 
 #include "../precomp.hpp"
-#include <opencv2/dnn/shape_utils.hpp>
+#include "../net_impl.hpp"
 
+#include <opencv2/dnn/shape_utils.hpp>
 #include <opencv2/dnn/layer_reg.private.hpp>
 
 #include <opencv2/core/utils/fp_control_utils.hpp>
-
 #include <opencv2/core/utils/logger.defines.hpp>
 #undef CV_LOG_STRIP_LEVEL
 #define CV_LOG_STRIP_LEVEL CV_LOG_LEVEL_VERBOSE + 1
@@ -2314,10 +2314,10 @@ void ONNXImporter::parseUnsqueeze(LayerParams& layerParams, const opencv_onnx::N
     int axis = axes.getIntValue(0);
     axis = axis < 0 ? axis + (int)inpShape.size() + 1 : axis;
     CV_Assert(0 <= axis && axis <= inpShape.size());
-    std::vector<int> outShape = inpShape;
+    MatShape outShape = inpShape;
     outShape.insert(outShape.begin() + axis, 1);
     layerParams.type = (depth == CV_8S) ? "ReshapeInt8" : "Reshape";
-    layerParams.set("dim", DictValue::arrayInt(&outShape[0], outShape.size()));
+    layerParams.set("dim", DictValue::arrayInt(&outShape[0], (int)outShape.size()));
     if (hasDynamicShapes)
     {
         std::vector<int> dynamicAxes;
@@ -2328,8 +2328,8 @@ void ONNXImporter::parseUnsqueeze(LayerParams& layerParams, const opencv_onnx::N
         }
         for (int index = 0; index < inpShape.size(); ++index)
             inputIndices.push_back(index);
-        layerParams.set("dynamic_axes", DictValue::arrayInt(dynamicAxes.data(), dynamicAxes.size()));
-        layerParams.set("input_indices", DictValue::arrayInt(inputIndices.data(), inputIndices.size()));
+        layerParams.set("dynamic_axes", DictValue::arrayInt(dynamicAxes.data(), (int)dynamicAxes.size()));
+        layerParams.set("input_indices", DictValue::arrayInt(inputIndices.data(), (int)inputIndices.size()));
     }
     addLayer(layerParams, node_proto);
 }
@@ -2527,10 +2527,11 @@ void ONNXImporter::parseConstantFill(LayerParams& layerParams, const opencv_onnx
     else
         fill_value = layerParams.get("value", 0);
 
-    MatShape inpShape = getIntBlob(node_proto, 0);
-    for (int i = 0; i < inpShape.size(); i++)
+    std::vector<int> inpShape = getIntBlob(node_proto, 0);
+    size_t i, total = inpShape.size();
+    for (i = 0; i < total; i++)
         CV_CheckGT(inpShape[i], 0, "");
-    Mat tensor(inpShape.size(), &inpShape[0], depth, Scalar(fill_value));
+    Mat tensor(inpShape, depth, Scalar(fill_value));
     addConstant(node_proto.output(0), tensor);
 }
 
@@ -2645,7 +2646,7 @@ void ONNXImporter::parseConcat(LayerParams& layerParams, const opencv_onnx::Node
         for (size_t i = 0; i < inputs.size(); ++i)
         {
             inputs[i] = getBlob(node_proto, (int)i);
-            if (inputs[i].size.dims() > (int)inputShape.size())
+            if (inputs[i].dims > inputShape.dims)
             {
                 inputShape = shape(inputs[i]);
             }
@@ -3229,7 +3230,7 @@ void ONNXImporter::parseEinsum(LayerParams& layerParams, const opencv_onnx::Node
     for (int j = 0; j < node_proto.input_size(); j++)
     {
         // create Const layer for constants and mark its shape
-        std::vector<int> input_shape;
+        MatShape input_shape;
         if (layer_id.find(node_proto.input(j)) == layer_id.end()) {
             Mat blob = getBlob(node_proto, j);
 
@@ -4061,19 +4062,79 @@ void ONNXImporter::buildDispatchMap_COM_MICROSOFT(int opset_version)
 }
 
 
-Net readNetFromONNX(const String& onnxFile)
+Net readNetFromONNX(const String& onnxFile, int engine)
 {
-    return detail::readNetDiagnostic<ONNXImporter>(onnxFile.c_str());
+    static const int engine_forced = (int)utils::getConfigurationParameterSizeT("OPENCV_FORCE_DNN_ENGINE", ENGINE_AUTO);
+    if(engine_forced != ENGINE_AUTO)
+        engine = engine_forced;
+
+    switch(engine)
+    {
+        case ENGINE_CLASSIC:
+            return detail::readNetDiagnostic<ONNXImporter>(onnxFile.c_str());
+        case ENGINE_NEW:
+            return readNetFromONNX2(onnxFile);
+        case ENGINE_AUTO:
+        {
+            Net net = readNetFromONNX2(onnxFile);
+            if (!net.empty())
+                return net;
+            else
+                return detail::readNetDiagnostic<ONNXImporter>(onnxFile.c_str());
+        }
+        default:
+            CV_Error(Error::StsBadArg, "Invalid DNN engine selected!");
+    }
 }
 
-Net readNetFromONNX(const char* buffer, size_t sizeBuffer)
+Net readNetFromONNX(const char* buffer, size_t sizeBuffer, int engine)
 {
-    return detail::readNetDiagnostic<ONNXImporter>(buffer, sizeBuffer);
+    static const int engine_forced = (int)utils::getConfigurationParameterSizeT("OPENCV_FORCE_DNN_ENGINE", ENGINE_AUTO);
+    if(engine_forced != ENGINE_AUTO)
+        engine = engine_forced;
+
+    switch(engine)
+    {
+        case ENGINE_CLASSIC:
+            return detail::readNetDiagnostic<ONNXImporter>(buffer, sizeBuffer);
+        case ENGINE_NEW:
+            return readNetFromONNX2(buffer, sizeBuffer);
+        case ENGINE_AUTO:
+        {
+            Net net = readNetFromONNX2(buffer, sizeBuffer);
+            if (!net.empty())
+                return net;
+            else
+                return detail::readNetDiagnostic<ONNXImporter>(buffer, sizeBuffer);
+        }
+        default:
+            CV_Error(Error::StsBadArg, "Invalid DNN engine selected!");
+    }
 }
 
-Net readNetFromONNX(const std::vector<uchar>& buffer)
+Net readNetFromONNX(const std::vector<uchar>& buffer, int engine)
 {
-    return readNetFromONNX(reinterpret_cast<const char*>(buffer.data()), buffer.size());
+    static const int engine_forced = (int)utils::getConfigurationParameterSizeT("OPENCV_FORCE_DNN_ENGINE", ENGINE_AUTO);
+    if(engine_forced != ENGINE_AUTO)
+        engine = engine_forced;
+
+    switch(engine)
+    {
+        case ENGINE_CLASSIC:
+            return readNetFromONNX(reinterpret_cast<const char*>(buffer.data()), buffer.size());
+        case ENGINE_NEW:
+            return readNetFromONNX2(buffer);
+        case ENGINE_AUTO:
+        {
+            Net net = readNetFromONNX2(buffer);
+            if (!net.empty())
+                return net;
+            else
+                return readNetFromONNX(reinterpret_cast<const char*>(buffer.data()), buffer.size());
+        }
+        default:
+            CV_Error(Error::StsBadArg, "Invalid DNN engine selected!");
+    }
 }
 
 Mat readTensorFromONNX(const String& path)
diff --git a/modules/dnn/src/onnx/onnx_importer2.cpp b/modules/dnn/src/onnx/onnx_importer2.cpp
new file mode 100644
index 0000000000..92054fa18b
--- /dev/null
+++ b/modules/dnn/src/onnx/onnx_importer2.cpp
@@ -0,0 +1,2450 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html.
+
+#include "../precomp.hpp"
+#include "../net_impl.hpp"
+
+#include <opencv2/dnn/shape_utils.hpp>
+#include <opencv2/dnn/layer_reg.private.hpp>
+
+#include <opencv2/core/utils/fp_control_utils.hpp>
+#include <opencv2/core/utils/logger.defines.hpp>
+#undef CV_LOG_STRIP_LEVEL
+#define CV_LOG_STRIP_LEVEL CV_LOG_LEVEL_VERBOSE + 1
+#include <opencv2/core/utils/logger.hpp>
+
+#include <opencv2/core/utils/configuration.private.hpp>
+
+#ifdef HAVE_PROTOBUF
+
+#include <algorithm>
+#include <array>
+#include <iostream>
+#include <fstream>
+#include <limits>
+#include <set>
+#include <string>
+
+#if defined _MSC_VER && _MSC_VER < 1910/*MSVS 2017*/
+#pragma warning(push)
+#pragma warning(disable: 4503)  // decorated name length exceeded, name was truncated
+#endif
+
+#if defined(__GNUC__) && __GNUC__ >= 5
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wsuggest-override"
+#endif
+#include "opencv-onnx.pb.h"
+#if defined(__GNUC__) && __GNUC__ >= 5
+#pragma GCC diagnostic pop
+#endif
+
+#include "onnx_graph_simplifier.hpp"
+#endif
+
+namespace cv {
+namespace dnn {
+CV__DNN_INLINE_NS_BEGIN
+
+extern bool DNN_DIAGNOSTICS_RUN;
+
+#ifdef HAVE_PROTOBUF
+
+template <typename T>
+static T getScalarFromMat(Mat m)
+{
+    CV_Assert(m.total() == 1);
+    return m.at<T>(0);
+}
+
+static int dataType2cv(opencv_onnx::TensorProto_DataType dt)
+{
+    return
+        dt == opencv_onnx::TensorProto_DataType_UINT8 ? CV_8U :
+        dt == opencv_onnx::TensorProto_DataType_INT8 ? CV_8S :
+        dt == opencv_onnx::TensorProto_DataType_UINT16 ? CV_16U :
+        dt == opencv_onnx::TensorProto_DataType_INT16 ? CV_16S :
+        dt == opencv_onnx::TensorProto_DataType_UINT32 ? CV_32U :
+        dt == opencv_onnx::TensorProto_DataType_INT32 ? CV_32S :
+        dt == opencv_onnx::TensorProto_DataType_UINT64 ? CV_64U :
+        dt == opencv_onnx::TensorProto_DataType_INT64 ? CV_64S :
+        dt == opencv_onnx::TensorProto_DataType_FLOAT ? CV_32F :
+        dt == opencv_onnx::TensorProto_DataType_DOUBLE ? CV_64F :
+        dt == opencv_onnx::TensorProto_DataType_FLOAT16 ? CV_16F :
+        dt == opencv_onnx::TensorProto_DataType_COMPLEX64 ? CV_32FC2 :
+        dt == opencv_onnx::TensorProto_DataType_COMPLEX128 ? CV_64FC2 :
+        dt == opencv_onnx::TensorProto_DataType_BOOL ? CV_Bool : -1;
+}
+
+static std::string dataType2str(opencv_onnx::TensorProto_DataType dt)
+{
+    const char* str =
+    dt == opencv_onnx::TensorProto_DataType_UNDEFINED ? "UNDEFINED" :
+    dt == opencv_onnx::TensorProto_DataType_STRING ? "STRING" :
+    dt == opencv_onnx::TensorProto_DataType_UINT8 ? "UINT8" :
+    dt == opencv_onnx::TensorProto_DataType_INT8 ? "INT8" :
+    dt == opencv_onnx::TensorProto_DataType_UINT16 ? "UINT16" :
+    dt == opencv_onnx::TensorProto_DataType_INT16 ? "INT16" :
+    dt == opencv_onnx::TensorProto_DataType_UINT32 ? "UINT32" :
+    dt == opencv_onnx::TensorProto_DataType_INT32 ? "INT32" :
+    dt == opencv_onnx::TensorProto_DataType_UINT64 ? "UINT64" :
+    dt == opencv_onnx::TensorProto_DataType_INT64 ? "INT64" :
+    dt == opencv_onnx::TensorProto_DataType_FLOAT ? "FLOAT" :
+    dt == opencv_onnx::TensorProto_DataType_FLOAT16 ? "FLOAT16" :
+    dt == opencv_onnx::TensorProto_DataType_BOOL ? "BOOL" :
+    dt == opencv_onnx::TensorProto_DataType_COMPLEX64 ? "COMPLEX64" :
+    dt == opencv_onnx::TensorProto_DataType_COMPLEX128 ? "COMPLEX128" : nullptr;
+    if (!str)
+        return format("<unknown_type #%d>", (int)dt);
+    return std::string(str);
+}
+
+static Mat getMatFromTensor2(const opencv_onnx::TensorProto& tensor_proto)
+{
+    Mat m = getMatFromTensor(tensor_proto, false);
+    m.dims = (int)tensor_proto.dims_size();
+    return m;
+}
+
+class ONNXImporter2
+{
+public:
+    ONNXImporter2();
+
+    Net parseFile(const char *onnxFile);
+    Net parseBuffer(const void* buffer, size_t sizeBuffer);
+
+protected:
+    FPDenormalsIgnoreHintScope fp_denormals_ignore_scope;
+    opencv_onnx::ModelProto model_proto;
+
+    Net parseModel();
+    Ptr<Graph> parseGraph(opencv_onnx::GraphProto* graph_proto, bool mainGraph);
+    void parseNode(const opencv_onnx::NodeProto& node_proto);
+    bool parseValueInfo(const opencv_onnx::ValueInfoProto& valueInfoProto, ArgData& data);
+    Mat parseTensor(const opencv_onnx::TensorProto& tensorProto);
+    void rememberMissingOp(const std::string& opname);
+
+    LayerParams getLayerParams(const opencv_onnx::NodeProto& node_proto);
+
+    void addLayer(LayerParams& layerParams,
+                  const opencv_onnx::NodeProto& node_proto,
+                  int max_inputs = std::numeric_limits<int>::max());
+    void setParamsDtype(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+
+    void raiseError() {
+        have_errors = true;
+    }
+
+    Net net;
+    Net::Impl* netimpl;
+    std::string onnxFilename;
+    Ptr<Graph> curr_graph;
+    opencv_onnx::GraphProto* curr_graph_proto;
+    std::vector<Ptr<Layer> > curr_prog;
+    std::vector<Arg> node_inputs, node_outputs;
+
+    std::string framework_name;
+    std::set<std::string> missing_ops;
+
+    // Used when Onnx does not contain node names.
+    // In this case each node is assigned a name 'onnx_node!<current global_node_idx value>'
+    int global_node_idx;
+    bool have_errors;
+
+    typedef void (ONNXImporter2::*ONNXImporterNodeParser)(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    typedef std::map<std::string, ONNXImporterNodeParser> DispatchMap;
+    typedef std::map<std::string, DispatchMap> DomainDispatchMap;
+
+    DomainDispatchMap domain_dispatch_map;
+    std::string getLayerTypeDomain(const opencv_onnx::NodeProto& node_proto);
+    const DispatchMap& getDispatchMap(const opencv_onnx::NodeProto& node_proto);
+    void buildDispatchMap_ONNX_AI(int opset_version);
+    void buildDispatchMap_COM_MICROSOFT(int opset_version);
+
+    // Domain: 'ai.onnx' (default)
+    // URL: https://github.com/onnx/onnx/blob/master/docs/Operators.md
+    void parseAbs                  (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseArgMinMax            (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseAveragePool          (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseBatchNormalization   (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseCast                 (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseClip                 (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseConcat               (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseConstant             (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseConstantOfShape      (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseConv                 (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseConvTranspose        (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseCumSum               (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseDepthSpaceOps        (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseDetectionOutput      (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseEinsum               (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseElementWise          (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseElu                  (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseExpand               (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseFlatten              (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseGather               (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseGatherElements       (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseGemm                 (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseGlobalPool           (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseGRU                  (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseImageScaler          (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseInstanceNormalization(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseLayerNorm            (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseLeakyRelu            (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseLRN                  (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseLSTM                 (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseMatMul               (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseMaxPool              (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseMaxUnpool            (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseNeg                  (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parsePad                  (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parsePRelu                (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseRange                (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseReduce               (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseRelu                 (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseResize               (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseReshape              (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseScatter              (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseShape                (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseSimpleLayers         (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseSlice                (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseSoftMax              (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseSplit                (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseSqueeze              (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseTanh                 (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseTile                 (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseTranspose            (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseUnsqueeze            (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseUpsample             (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+
+    // Domain: com.microsoft
+    // URL: https://github.com/microsoft/onnxruntime/blob/master/docs/ContribOperators.md
+    void parseAttention            (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseDequantizeLinear     (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseQuantizeLinear       (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    void parseCustomLayer          (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    //void parseQAvgPool             (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    //void parseQConcat              (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    //void parseQConv                (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    //void parseQEltwise             (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    //void parseQGemm                (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    //void parseQLeakyRelu           (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    //void parseQMatMul              (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    //void parseQSigmoid             (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+    //void parseQSoftmax             (LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto);
+
+    int onnx_opset;  // OperatorSetIdProto for 'onnx' domain
+    std::map<std::string, int> onnx_opset_map;  // map from OperatorSetIdProto
+    void parseOperatorSet();
+
+    const std::string str_domain_ai_onnx = "ai.onnx";
+
+    bool useLegacyNames;
+    bool getParamUseLegacyNames()
+    {
+        //bool param = utils::getConfigurationParameterBool("OPENCV_DNN_ONNX_USE_LEGACY_NAMES", false);
+        //return param;
+        return true;
+    }
+    std::string extractNodeName(const opencv_onnx::NodeProto& node_proto);
+};
+
+ONNXImporter2::ONNXImporter2() :
+    onnx_opset(0),
+    useLegacyNames(getParamUseLegacyNames())
+{
+    netimpl = net.getImpl();
+}
+
+Net ONNXImporter2::parseFile(const char *onnxFilename_)
+{
+    CV_Assert(onnxFilename_);
+    onnxFilename = onnxFilename_;
+    CV_LOG_DEBUG(NULL, "DNN/ONNX: processing ONNX model from file: " << onnxFilename);
+
+    std::fstream input(onnxFilename, std::ios::in | std::ios::binary);
+    if (!input)
+    {
+        CV_Error(Error::StsBadArg, format("Can't read ONNX file: %s", onnxFilename_));
+    }
+
+    if (!model_proto.ParseFromIstream(&input))
+    {
+        CV_Error(Error::StsUnsupportedFormat, format("Failed to parse ONNX model: %s", onnxFilename_));
+    }
+
+    return parseModel();
+}
+
+Net ONNXImporter2::parseBuffer(const void* buffer, size_t sizeBuffer)
+{
+    onnxFilename = std::string();
+    CV_LOG_DEBUG(NULL, "DNN/ONNX: processing in-memory ONNX model (" << sizeBuffer << " bytes)");
+
+    struct _Buf: public std::streambuf
+    {
+        _Buf(const void* buffer, size_t sizeBuffer)
+        {
+            char* p = (char*)buffer;
+            setg(p, p, p + sizeBuffer);
+        }
+    };
+
+    _Buf buf(buffer, sizeBuffer);
+    std::istream input(&buf);
+
+    if (!model_proto.ParseFromIstream(&input))
+        CV_Error(Error::StsUnsupportedFormat, "Failed to parse onnx model from in-memory byte array.");
+
+    return parseModel();
+}
+
+
+inline void replaceLayerParam(LayerParams& layerParams, const String& oldKey, const String& newKey)
+{
+    if (layerParams.has(oldKey)) {
+        layerParams.set(newKey, layerParams.get(oldKey));
+        layerParams.erase(oldKey);
+    }
+}
+
+/*static void runLayer(LayerParams& params, const std::vector<Mat>& inputs,
+              std::vector<Mat>& outputs)
+{
+    Ptr<Layer> layer = LayerFactory::createLayerInstance(params.type, params);
+    CV_Assert((bool)layer);
+
+    std::vector<MatShape> inpShapes(inputs.size());
+    std::vector<MatType> inpTypes(inputs.size());
+    for (size_t i = 0; i < inputs.size(); ++i)
+    {
+        inpShapes[i] = shape(inputs[i]);
+        inpTypes[i] = inputs[i].type();
+    }
+
+    std::vector<MatShape> outShapes, internalShapes;
+    std::vector<MatType> outTypes, internalTypes;
+    layer->getMemoryShapes(inpShapes, 0, outShapes, internalShapes);
+    layer->getTypes(inpTypes, outShapes.size(), internalShapes.size(), outTypes, internalTypes);
+
+    std::vector<Mat> internals(internalShapes.size());
+    outputs.resize(outShapes.size());
+    for (size_t i = 0; i < outShapes.size(); ++i)
+        outputs[i].create(outShapes[i], outTypes[i]);
+    for (size_t i = 0; i < internalShapes.size(); ++i)
+        internals[i].create(internalShapes[i], internalTypes[i]);
+
+    layer->finalize(inputs, outputs);
+    layer->forward(inputs, outputs, internals);
+}*/
+
+/*std::map<std::string, Mat> ONNXImporter2::getGraphTensors(
+                                        const opencv_onnx::GraphProto& graph_proto)
+{
+    std::map<std::string, Mat> layers_weights;
+
+    for (int i = 0; i < graph_proto.initializer_size(); i++)
+    {
+        const opencv_onnx::TensorProto& tensor_proto = graph_proto.initializer(i);
+        dumpTensorProto(i, tensor_proto, "initializer");
+        Mat mat = getMatFromTensor2(tensor_proto);
+        releaseONNXTensor(const_cast<opencv_onnx::TensorProto&>(tensor_proto));  // drop already loaded data
+
+        if (DNN_DIAGNOSTICS_RUN && mat.empty())
+            continue;
+
+        layers_weights.insert(std::make_pair(tensor_proto.name(), mat));
+        constBlobsExtraInfo.insert(std::make_pair(tensor_proto.name(), TensorInfo(tensor_proto.dims_size())));
+    }
+    return layers_weights;
+}*/
+
+static DictValue parse(const ::google::protobuf::RepeatedField< ::google::protobuf::int64>& src) {
+    std::vector<int32_t> dst(src.size());
+    convertInt64ToInt32(src, dst, src.size());
+    return DictValue::arrayInt(&dst[0], src.size());
+}
+
+static DictValue parseStr(const ::google::protobuf::RepeatedPtrField< ::std::string>& src) {
+    return DictValue::arrayString(src.begin(), static_cast<int>(src.size()));
+}
+
+LayerParams ONNXImporter2::getLayerParams(const opencv_onnx::NodeProto& node_proto)
+{
+    LayerParams lp;
+    for(int i = 0; i < node_proto.attribute_size(); i++)
+    {
+        opencv_onnx::AttributeProto attribute_proto = node_proto.attribute(i);
+        std::string attribute_name = attribute_proto.name();
+
+        try
+        {
+            if(attribute_name == "kernel_shape")
+            {
+                CV_Assert(attribute_proto.ints_size() == 1 || attribute_proto.ints_size() == 2 || attribute_proto.ints_size() == 3);
+                lp.set("kernel_size", parse(attribute_proto.ints()));
+            }
+            else if(attribute_name == "strides")
+            {
+                CV_Assert(attribute_proto.ints_size() == 1 || attribute_proto.ints_size() == 2 || attribute_proto.ints_size() == 3);
+                lp.set("stride", parse(attribute_proto.ints()));
+            }
+            else if(attribute_name == "pads")
+            {
+                if (node_proto.op_type() == "Pad")
+                {
+                    // Padding layer.
+                    // Paddings are in order begin0, begin1, .. beginN, end0, end1, ..., endN.
+                    // We need to shuffle it to begin0, end0, begin1, end1, ...
+                    CV_Assert(attribute_proto.ints_size() % 2 == 0);
+                    const int dims = attribute_proto.ints_size() / 2;
+                    std::vector<int32_t> paddings;
+                    paddings.reserve(attribute_proto.ints_size());
+                    for (int i = 0; i < dims; ++i)
+                    {
+                        paddings.push_back(attribute_proto.ints(i));
+                        paddings.push_back(attribute_proto.ints(dims + i));
+                    }
+                    lp.set("paddings", DictValue::arrayInt(&paddings[0], paddings.size()));
+                }
+                else
+                {
+                    // Convolution or pooling.
+                    CV_Assert(attribute_proto.ints_size() == 2 || attribute_proto.ints_size() == 4 || attribute_proto.ints_size() == 6);
+                    lp.set("pad", parse(attribute_proto.ints()));
+                }
+            }
+            else if(attribute_name == "auto_pad")
+            {
+                if (attribute_proto.s() == "SAME_UPPER" || attribute_proto.s() == "SAME_LOWER") {
+                    lp.set("pad_mode",  "SAME");
+                }
+                else if (attribute_proto.s() == "VALID") {
+                    lp.set("pad_mode", "VALID");
+                }
+            }
+            else if(attribute_name == "dilations")
+            {
+                CV_Assert(attribute_proto.ints_size() == 1 || attribute_proto.ints_size() == 2 || attribute_proto.ints_size() == 3);
+                lp.set("dilation", parse(attribute_proto.ints()));
+            }
+            else if(attribute_name == "activations" && node_proto.op_type() == "LSTM")
+            {
+                lp.set(attribute_name, parseStr(attribute_proto.strings()));
+            }
+            else if (attribute_proto.has_i())
+            {
+                ::google::protobuf::int64 src = attribute_proto.i();
+                if (src < std::numeric_limits<int32_t>::min() || src > std::numeric_limits<int32_t>::max())
+                    CV_Error(Error::StsOutOfRange, "Input is out of OpenCV 32S range");
+                else
+                    lp.set(attribute_name, saturate_cast<int32_t>(src));
+            }
+            else if (attribute_proto.has_f())
+            {
+                lp.set(attribute_name, attribute_proto.f());
+            }
+            else if (attribute_proto.has_s())
+            {
+                lp.set(attribute_name, attribute_proto.s());
+            }
+            else if (attribute_proto.floats_size() > 0)
+            {
+                lp.set(attribute_name, DictValue::arrayReal(
+                    attribute_proto.floats().data(), attribute_proto.floats_size()));
+            }
+            else if (attribute_proto.ints_size() > 0)
+            {
+                lp.set(attribute_name, parse(attribute_proto.ints()));
+            }
+            else if (attribute_proto.has_t())
+            {
+                opencv_onnx::TensorProto tensor = attribute_proto.t();
+                Mat blob = getMatFromTensor2(tensor);
+                lp.blobs.push_back(blob);
+                lp.set("original_dims_of_mat", tensor.dims_size());
+            }
+            else if (attribute_proto.has_g())
+            {
+                CV_Error(Error::StsNotImplemented, format("DNN/ONNX/Attribute[%s]: 'Graph' is not supported", attribute_name.c_str()));
+            }
+            else if (attribute_proto.graphs_size() > 0)
+            {
+                CV_Error(Error::StsNotImplemented,
+                        format("DNN/ONNX/Attribute[%s]: 'Graphs' (%d) in attributes is not supported",
+                                attribute_name.c_str(), attribute_proto.graphs_size())
+                );
+            }
+            else if (attribute_proto.strings_size() > 0)
+            {
+                std::string msg = format("DNN/ONNX/Attribute[%s]: 'Strings' (%d) are not supported",
+                        attribute_name.c_str(), attribute_proto.strings_size());
+                CV_LOG_ERROR(NULL, msg);
+                for (int i = 0; i < attribute_proto.strings_size(); i++)
+                {
+                    CV_LOG_ERROR(NULL, "    Attribute[" << attribute_name << "].string(" << i << ") = '" << attribute_proto.strings(i) << "'");
+                }
+                CV_Error(Error::StsNotImplemented, msg);
+            }
+            else if (attribute_proto.tensors_size() > 0)
+            {
+                CV_Error(Error::StsNotImplemented,
+                        format("DNN/ONNX/Attribute[%s]: 'Tensors' (%d) in attributes are not supported",
+                                attribute_name.c_str(), attribute_proto.tensors_size())
+                );
+            }
+            else
+            {
+                CV_Error(Error::StsNotImplemented, format("DNN/ONNX/Attribute[%s]: unsupported attribute format", attribute_name.c_str()));
+            }
+        }
+        catch (const cv::Exception& e)
+        {
+            CV_UNUSED(e);
+            if (DNN_DIAGNOSTICS_RUN)
+            {
+                CV_LOG_ERROR(NULL, "DNN/ONNX: Potential problem with processing attributes for node " << node_proto.name() << " Attribute " << attribute_name.c_str()
+                );
+                continue;
+            }
+            throw;
+        }
+    }
+    return lp;
+}
+
+void ONNXImporter2::parseOperatorSet()
+{
+    int ir_version = model_proto.has_ir_version() ? static_cast<int>(model_proto.ir_version()) : -1;
+    if (ir_version < 3)
+        return;
+
+    int opset_size = model_proto.opset_import_size();
+    if (opset_size <= 0)
+    {
+        CV_LOG_INFO(NULL, "DNN/ONNX: missing opset information")
+        return;
+    }
+
+    for (int i = 0; i < opset_size; ++i)
+    {
+        const ::opencv_onnx::OperatorSetIdProto& opset_entry = model_proto.opset_import(i);
+        const std::string& domain = opset_entry.has_domain() ? opset_entry.domain() : std::string();
+        int version = opset_entry.has_version() ? opset_entry.version() : -1;
+        if (domain.empty() || domain == str_domain_ai_onnx)
+        {
+            // ONNX opset covered by specification: https://github.com/onnx/onnx/blob/master/docs/Operators.md
+            onnx_opset = std::max(onnx_opset, version);
+            onnx_opset_map[str_domain_ai_onnx] = onnx_opset;
+        }
+        else
+        {
+            CV_LOG_DEBUG(NULL, "DNN/ONNX: using non-standard ONNX opset[" << i << "]: domain='" << domain << "' version=" << version);
+            onnx_opset_map[domain] = onnx_opset;
+        }
+    }
+
+    CV_LOG_INFO(NULL, "DNN/ONNX: ONNX opset version = " << onnx_opset);
+
+    buildDispatchMap_ONNX_AI(onnx_opset);
+    for (const auto& pair : onnx_opset_map)
+    {
+        if (pair.first == str_domain_ai_onnx)
+        {
+            continue;  // done above
+        }
+        else if (pair.first == "com.microsoft")
+        {
+            buildDispatchMap_COM_MICROSOFT(pair.second);
+        }
+        else
+        {
+            CV_LOG_INFO(NULL, "DNN/ONNX: unknown domain='" << pair.first << "' version=" << pair.second << ". No dispatch map, you may need to register 'custom' layers.");
+        }
+    }
+}
+
+/*static bool ifInt8Output(const String& layerType)
+{
+    // Contains all node types whose output should be int8 when it get int8 input.
+    // ai.onnx opset 15
+    static std::vector<String> input8output8List = {
+            "QuantizeLinear",
+            "QLinearAdd",
+            "QLinearMul",
+            "QLinearAveragePool",
+            "QLinearGlobalAveragePool",
+            "QLinearLeakyRelu",
+            "QLinearSigmoid",
+            "QLinearConcat",
+            "QGemm",
+            "QLinearSoftmax",
+            "QLinearConv",
+            "QLinearMatMul",
+            "MaxPool",
+            "ReduceMax",
+            "ReduceMin",
+            "Split",
+            "Clip",
+            "Abs",
+            "Transpose",
+            "Squeeze",
+            "Flatten",
+            "Unsqueeze",
+            "Expand",
+            "Reshape",
+            "Pad",
+            "Gather",
+            "Concat",
+            "Resize",
+            "SpaceToDepth",
+            "DepthToSpace",
+            "Pow",
+            "Add",
+            "Sub",
+            "Mul",
+            "Div"
+    };
+    auto layerIt = std::find(input8output8List.begin(), input8output8List.end(), layerType);
+    return layerIt != input8output8List.end();
+}*/
+
+Net ONNXImporter2::parseModel()
+{
+    global_node_idx = 0;
+    have_errors = false;
+    CV_Assert(model_proto.has_graph());
+    opencv_onnx::GraphProto* graph_proto = model_proto.mutable_graph();
+
+    std::string framework_version;
+    if (model_proto.has_producer_name())
+        framework_name = model_proto.producer_name();
+    if (model_proto.has_producer_version())
+        framework_version = model_proto.producer_version();
+
+    CV_LOG_INFO(NULL, "DNN/ONNX: loading ONNX"
+            << (model_proto.has_ir_version() ? format(" v%d", (int)model_proto.ir_version()) : cv::String())
+            << " model produced by '" << framework_name << "'"
+            << (framework_version.empty() ? cv::String() : format(":%s", framework_version.c_str()))
+            << ". Number of nodes = " << graph_proto->node_size()
+            << ", initializers = " << graph_proto->initializer_size()
+            << ", inputs = " << graph_proto->input_size()
+            << ", outputs = " << graph_proto->output_size()
+            );
+
+    parseOperatorSet();
+    Ptr<Graph> mainGraph = parseGraph(graph_proto, true);
+    netimpl->mainGraph = mainGraph;
+    netimpl->modelFormat = DNN_MODEL_ONNX;
+    netimpl->originalLayout = DATA_LAYOUT_NCHW;
+    netimpl->onnx_opset = onnx_opset;
+
+    if (have_errors) {
+        std::stringstream sstrm;
+        sstrm << "DNN/ONNX: the model ";
+        if (!onnxFilename.empty())
+            sstrm << "'"  << onnxFilename << "' ";
+        sstrm << "cannot be loaded with the new parser. Trying the older parser. ";
+        if (!missing_ops.empty()) {
+            sstrm << " Unsupported operations:\n";
+            auto it = missing_ops.begin();
+            size_t i, nmissing = missing_ops.size();
+            for (i = 0; i < nmissing; i++, ++it) {
+                sstrm << "\t" << *it << (i+1 < nmissing ? ",\n" : "\n");
+            }
+        }
+        CV_LOG_WARNING(NULL, sstrm.str());
+        return Net();
+    }
+    netimpl->prepareForInference();
+    // ************ uncomment for debugging **********
+    //net.dumpToStream(std::cout);
+    return net;
+}
+
+bool ONNXImporter2::parseValueInfo(const opencv_onnx::ValueInfoProto& valueInfoProto, ArgData& data)
+{
+    CV_Assert(valueInfoProto.has_name());
+    CV_Assert(valueInfoProto.has_type());
+    const opencv_onnx::TypeProto& typeProto = valueInfoProto.type();
+    CV_Assert(typeProto.has_tensor_type());
+    const opencv_onnx::TypeProto::Tensor& tensor = typeProto.tensor_type();
+    CV_Assert(tensor.has_shape());
+    const opencv_onnx::TensorShapeProto& tensorShape = tensor.shape();
+    auto elem_type = tensor.elem_type();
+
+    data.type = dataType2cv(elem_type);
+    if (data.type < 0) {
+        CV_Error(Error::StsNotImplemented, format("unsupported datatype '%s'", dataType2str(elem_type).c_str()));
+    }
+
+    int dim_size = tensorShape.dim_size();
+    CV_CheckGE(dim_size, 0, "");
+    MatShape shape(dim_size);
+    for (int j = 0; j < dim_size; ++j)
+    {
+        const opencv_onnx::TensorShapeProto_Dimension& dimension = tensorShape.dim(j);
+        int64_t val_j;
+        if (dimension.has_dim_value()) {
+            val_j = dimension.dim_value();
+        } else if (dimension.has_dim_param()) {
+            const std::string& param_j = dimension.dim_param();
+            val_j = net.findDim(param_j, true);
+        } else {
+            raiseError();
+            return false;
+        }
+        //CV_Assert(0 <= val_j && val_j <= INT_MAX);
+        shape[j] = (int)val_j;
+    }
+    data.shape = shape;
+    return true;
+}
+
+Mat ONNXImporter2::parseTensor(const opencv_onnx::TensorProto& tensor_proto)
+{
+    return getMatFromTensor2(tensor_proto);
+}
+
+Ptr<Graph> ONNXImporter2::parseGraph(opencv_onnx::GraphProto* graph_proto, bool mainGraph_)
+{
+    CV_LOG_DEBUG(NULL, "DNN/ONNX: parsing graph '" << graph_proto->name() << "' of " << graph_proto->node_size() << " nodes");
+    simplifySubgraphs(*graph_proto);
+    int n_nodes = graph_proto->node_size();
+    CV_LOG_DEBUG(NULL, "DNN/ONNX: simplified the graph to " << n_nodes << " nodes");
+
+    opencv_onnx::GraphProto* saved_graph_proto = curr_graph_proto;
+    Ptr<Graph> saved_graph = curr_graph;
+    std::vector<Ptr<Layer> > saved_prog;
+
+    curr_graph_proto = graph_proto;
+    std::vector<Arg> inputs, outputs;
+
+    // parse constant tensors
+    int n_consts = graph_proto->initializer_size();
+    for (int i = 0; i < n_consts; i++) {
+        //const opencv_onnx::
+        const opencv_onnx::TensorProto& const_i = graph_proto->initializer(i);
+        Mat t = parseTensor(const_i);
+        netimpl->newConstArg(const_i.name(), t);
+    }
+
+    // parse graph inputs
+    int n_inputs = graph_proto->input_size();
+    for (int i = 0; i < n_inputs; i++) {
+        const opencv_onnx::ValueInfoProto& input_i = graph_proto->input(i);
+        if (net.haveArg(input_i.name()))
+            continue;
+        Arg arg = netimpl->newArg(input_i.name(), mainGraph_ ? DNN_ARG_INPUT : DNN_ARG_TEMP);
+        if (!parseValueInfo(input_i, netimpl->args.at(arg.idx))) {
+            raiseError();
+            return Ptr<Graph>();
+        }
+        inputs.push_back(arg);
+    }
+
+    // parse graph outputs
+    int n_outputs = graph_proto->output_size();
+    for (int i = 0; i < n_outputs; i++) {
+        const opencv_onnx::ValueInfoProto& output_i = graph_proto->output(i);
+        Arg arg = netimpl->newArg(output_i.name(), mainGraph_ ? DNN_ARG_OUTPUT : DNN_ARG_TEMP);
+        if (!parseValueInfo(output_i, netimpl->args.at(arg.idx))) {
+           raiseError();
+           return Ptr<Graph>();
+        }
+        outputs.push_back(arg);
+    }
+
+    curr_graph = netimpl->newGraph(graph_proto->name(), inputs, mainGraph_);
+    curr_graph->setOutputs(outputs);
+
+    std::swap(saved_prog, curr_prog);
+
+    std::vector<Ptr<Layer> > prog;
+    for (int i = 0; i < n_nodes && !have_errors; i++) {
+        parseNode(graph_proto->node(i));
+    }
+
+    curr_graph->setProg(curr_prog);
+    curr_prog = saved_prog;
+
+    Ptr<Graph> just_constructed = curr_graph;
+    curr_graph_proto = saved_graph_proto;
+    curr_graph = saved_graph;
+
+    return just_constructed;
+}
+
+std::string ONNXImporter2::getLayerTypeDomain(const opencv_onnx::NodeProto& node_proto)
+{
+    if (!node_proto.has_domain())
+        return str_domain_ai_onnx;
+    const std::string& domain = node_proto.domain();
+    if (domain.empty())
+        return str_domain_ai_onnx;
+    return domain;
+}
+
+const ONNXImporter2::DispatchMap& ONNXImporter2::getDispatchMap(const opencv_onnx::NodeProto& node_proto)
+{
+    static DispatchMap empty_map;
+    const std::string& layer_type_domain = getLayerTypeDomain(node_proto);
+    auto it = domain_dispatch_map.find(layer_type_domain);
+    if (it == domain_dispatch_map.end())
+    {
+        return empty_map;
+    }
+
+    return it->second;
+}
+
+std::string ONNXImporter2::extractNodeName(const opencv_onnx::NodeProto& node_proto)
+{
+    // We need to rework DNN outputs API, this is a workaround for #21698
+    if (node_proto.has_name() && !node_proto.name().empty())
+    {
+        if (useLegacyNames)
+            return node_proto.name();
+        return format("onnx_node!%s", node_proto.name().c_str());
+    }
+    return format("onnx_node!%d", global_node_idx++);
+}
+
+void ONNXImporter2::rememberMissingOp(const std::string& opname)
+{
+    missing_ops.insert(opname);
+    raiseError();
+}
+
+void ONNXImporter2::parseNode(const opencv_onnx::NodeProto& node_proto)
+{
+    CV_Assert(node_proto.output_size() >= 1);
+    std::string node_name = extractNodeName(node_proto);
+    const std::string& layer_type = node_proto.op_type();
+    std::string layer_type_domain = getLayerTypeDomain(node_proto);
+    const auto& dispatch = getDispatchMap(node_proto);
+
+    /*CV_LOG_INFO(NULL, "DNN/ONNX: processing node '" << node_name << "' ("
+                << layer_type << ") with " << node_proto.input_size() << " inputs and "
+                << node_proto.output_size() << " outputs from domain '"
+                << layer_type_domain << "'");*/
+
+    if (dispatch.empty())
+    {
+        CV_LOG_ERROR(NULL, "DNN/ONNX: missing dispatch map for domain='" << layer_type_domain << "'");
+        rememberMissingOp(layer_type);
+        return;
+    }
+
+    node_inputs.clear();
+    node_outputs.clear();
+
+    int n_inputs = node_proto.input_size();
+    for (int i = 0; i < n_inputs; i++) {
+        const std::string& arg_name = node_proto.input(i);
+        if (!net.haveArg(arg_name)) {
+            CV_LOG_ERROR(NULL, "DNN/ONNX: unknown input '" << arg_name << "' of node '" << node_name << "'");
+            raiseError();
+        }
+        Arg arg = net.getArg(arg_name);
+        /*ArgData adata = net.argData(arg);
+        printf("%s (%s), arg '%s'/'%s': adata.kind = %s, type=%s\n", node_name.c_str(), layer_type.c_str(),
+               arg_name.c_str(), adata.name.c_str(),
+               argKindToString(adata.kind).c_str(), typeToString(adata.type).c_str());*/
+        node_inputs.push_back(arg);
+    }
+
+    int n_outputs = node_proto.output_size();
+    for (int i = 0; i < n_outputs; i++) {
+        const std::string& arg_name = node_proto.output(i);
+        Arg arg = net.getArg(arg_name);
+        node_outputs.push_back(arg);
+    }
+
+    LayerParams layerParams;
+    try
+    {
+        layerParams = getLayerParams(node_proto);
+
+        layerParams.name = node_name;
+        layerParams.type = layer_type;
+
+        DispatchMap::const_iterator iter = dispatch.find(layer_type);
+        if (iter != dispatch.end())
+        {
+            if (!have_errors)
+                CALL_MEMBER_FN(*this, iter->second)(layerParams, node_proto);
+        } else if (!have_errors) {
+            //try customly parsing the layer without explicit dispatch map
+            parseCustomLayer(layerParams, node_proto);
+        } else {
+            rememberMissingOp(layer_type);
+        }
+    }
+    catch (const cv::Exception& e)
+    {
+        raiseError();
+        CV_LOG_INFO(NULL, "DNN/ONNX: error '" << e.what() << "' occurred when processing node '" << node_name
+                    << "' (" << layer_type << ") with "
+                    << node_proto.input_size() << " inputs and "
+                    << node_proto.output_size() << " outputs from domain '"
+                    << layer_type_domain << "'");
+        for (int i = 0; i < n_inputs; i++)
+        {
+            CV_LOG_INFO(NULL, "    Input[" << i << "] = '" << node_proto.input(i) << "'");
+        }
+        for (int i = 0; i < n_outputs; i++)
+        {
+            CV_LOG_INFO(NULL, "    Output[" << i << "] = '" << node_proto.output(i) << "'");
+        }
+    }
+}
+
+void ONNXImporter2::addLayer(LayerParams& layerParams,
+                             const opencv_onnx::NodeProto& node_proto,
+                             int max_inputs)
+{
+    Ptr<Layer> layer = LayerFactory::createLayerInstance(layerParams.type, layerParams);
+    if (!layer) {
+        rememberMissingOp(layerParams.type);
+        return;
+    }
+    size_t actual_inputs = std::min((size_t)max_inputs, node_inputs.size());
+    layer->inputs = node_inputs;
+    layer->inputs.resize(actual_inputs);
+    layer->outputs = node_outputs;
+    layer->netimpl = netimpl;
+    CV_Assert(netimpl->dump_indent == 3);
+    curr_prog.push_back(layer);
+}
+
+void ONNXImporter2::parseNeg(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "Power";
+    layerParams.set("scale", -1);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseCustomLayer(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    parseSimpleLayers(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseArgMinMax(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    const std::string& layer_type = node_proto.op_type();
+    layerParams.type = "Arg";
+    layerParams.set("op", layer_type == "ArgMax" ? "max" : "min");
+    addLayer(layerParams, node_proto);
+}
+
+static void setCeilMode(LayerParams& layerParams)
+{
+    // auto_pad attribute is deprecated and uses ceil
+    if (layerParams.has("pad_mode"))
+    {
+        layerParams.set("ceil_mode", true);
+    }
+    else if (!layerParams.has("ceil_mode"))
+    {
+        layerParams.set("ceil_mode", false);
+    }
+}
+
+void ONNXImporter2::parseMaxUnpool(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "MaxUnpool";
+
+    DictValue kernel_shape = layerParams.get("kernel_size");
+    CV_Assert(kernel_shape.size() == 2);
+    layerParams.set("pool_k_w", kernel_shape.get<int>(0));
+    layerParams.set("pool_k_h", kernel_shape.get<int>(1));
+
+    int pool_pad_w = 0, pool_pad_h = 0;
+    if (layerParams.has("pad"))
+    {
+        DictValue pads = layerParams.get("pad");
+        CV_CheckEQ(pads.size(), 2, "");
+        pool_pad_w = pads.get<int>(0);
+        pool_pad_h = pads.get<int>(1);
+    }
+    layerParams.set("pool_pad_w", pool_pad_w);
+    layerParams.set("pool_pad_h", pool_pad_h);
+
+
+    int pool_stride_w = 1, pool_stride_h = 1;
+    if (layerParams.has("stride"))
+    {
+        DictValue strides = layerParams.get("stride");
+        CV_CheckEQ(strides.size(), 2, "");
+        pool_stride_w = strides.get<int>(0);
+        pool_stride_h = strides.get<int>(1);
+    }
+    layerParams.set("pool_stride_w", pool_stride_w);
+    layerParams.set("pool_stride_h", pool_stride_h);
+
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseMaxPool(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    int depth = layerParams.get<int>("depth", CV_32F);
+    layerParams.type = (depth == CV_8S) ? "PoolingInt8" : "Pooling";
+    layerParams.set("pool", "MAX");
+    setCeilMode(layerParams);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseAveragePool(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "Pooling";
+    layerParams.set("pool", "AVE");
+    setCeilMode(layerParams);
+    layerParams.set("ave_pool_padded_area", framework_name == "pytorch");
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseGlobalPool(LayerParams &layerParams, const opencv_onnx::NodeProto &node_proto_)
+{
+    opencv_onnx::NodeProto node_proto = node_proto_;
+    const std::string& layer_type = node_proto.op_type();
+    const std::string output_name = node_proto.output(0);
+
+    CV_Assert(node_proto.input_size() == 1);
+    layerParams.type = "Pooling";
+    String pool;
+    if (layer_type == "GlobalMaxPool")
+        pool = "MAX";
+    else if (layer_type == "GlobalAveragePool")
+        pool = "AVE";
+    else
+        CV_Error(Error::StsNotImplemented, "Unsupported Pooling type of " + layer_type + " operation.");
+
+    CV_Assert(!layerParams.has("axes"));
+    layerParams.set("global_pooling", true);
+    layerParams.set("pool", pool);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseReduce(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "Reduce";
+    const auto& op_type = node_proto.op_type();
+    String reduce_type;
+    if (op_type == "ReduceMax")
+        reduce_type = "MAX";
+    else if (op_type == "ReduceMean")
+        reduce_type = "MEAN";
+    else if (op_type == "ReduceMin")
+        reduce_type = "MIN";
+    else if (op_type == "ReduceProd")
+        reduce_type = "PROD";
+    else if (op_type == "ReduceSum")
+        reduce_type = "SUM";
+    else if (op_type == "ReduceL1")
+        reduce_type = "L1";
+    else if (op_type == "ReduceL2")
+        reduce_type = "L2";
+    else if (op_type == "ReduceLogSum")
+        reduce_type = "LOG_SUM";
+    else if (op_type == "ReduceLogSumExp")
+        reduce_type = "LOG_SUM_EXP";
+    else if (op_type == "ReduceSumSquare")
+        reduce_type = "SUM_SQUARE";
+    else
+        CV_Error(Error::StsNotImplemented, "DNN/ONNX: " + op_type + " is not supported.");
+    layerParams.set("reduce", reduce_type);
+
+    int num_inputs = node_proto.input_size();
+    CV_Check(num_inputs, num_inputs >= 1 && num_inputs <= 2, "DNN/ONNX: Reduce layers should have at least one input and at most two inputs");
+
+    bool const_axis_input = false;
+    if (num_inputs >= 2) {
+        CV_CheckTrue(net.isConstArg(node_inputs[1]), "Reduce layer doesn't support non contant axes");
+        const_axis_input = true;
+    }
+
+    // "axes" is turned to one of the inputs since opset 18,
+    // except for ReduceSum, which has "axes" input since opset 13.
+    if (const_axis_input) {
+        Mat mat_axes = net.argTensor(node_inputs[1]);
+        int num_axes = (int)mat_axes.total();
+        std::vector<int> axes(num_axes);
+        for (int i = 0; i < num_axes; ++i)
+            axes[i] = mat_axes.at<int64_t>(i);
+        layerParams.set("axes", DictValue::arrayInt(&axes[0], num_axes));
+    }
+
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseSlice(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    int ninputs = node_proto.input_size();
+    CV_Assert(ninputs == 1 || (3 <= ninputs && ninputs <= 5));
+    layerParams.type = "Slice2";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseSplit(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_CheckGE(node_proto.input_size(), 1, "");
+    CV_CheckLE(node_proto.input_size(), 2, "");
+    layerParams.type = "Split2";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseConstant(LayerParams& layerParams, const opencv_onnx::NodeProto&)
+{
+    CV_Assert(node_inputs.empty());
+    CV_Assert(node_outputs.size() == 1);
+    CV_Assert(layerParams.blobs.size() == 1);
+    Mat m = layerParams.blobs[0];
+    Arg out = node_outputs[0];
+    ArgData& data = netimpl->args.at(out.idx);
+    data.kind = DNN_ARG_CONST;
+    data.type = m.type();
+    data.shape = m.shape();
+    netimpl->__tensors__.at(out.idx) = m;
+}
+
+// BUG: https://github.com/opencv/opencv/issues/26308
+void ONNXImporter2::parseLSTM(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto_)
+{
+    rememberMissingOp(node_proto_.op_type());
+}
+
+ // BUG: https://github.com/opencv/opencv/issues/26309
+void ONNXImporter2::parseGRU(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto_)
+{
+    rememberMissingOp(node_proto_.op_type());
+}
+
+void ONNXImporter2::parseImageScaler(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    const float scale = layerParams.has("scale") ? layerParams.get<float>("scale") : 1.0f;
+    layerParams.erase("scale");
+
+    if (layerParams.has("bias"))
+    {
+        layerParams.type = "Scale";
+        layerParams.blobs.push_back(
+                Mat(Size(1,  layerParams.get("bias").size()), CV_32FC1, scale));
+
+        layerParams.set("bias_term", true);
+        Mat bias(1, layerParams.get("bias").size(), CV_32FC1);
+        for (int j = 0; j < bias.total(); j++) {
+            bias.at<float>(0, j) = layerParams.get("bias").getRealValue(j);
+        }
+        layerParams.blobs.push_back(bias);
+        layerParams.erase("bias");
+    }
+    else {
+        layerParams.set("scale", scale);
+        layerParams.type = "Power";
+    }
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseClip(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "ReLU6";
+    float min_value = -FLT_MAX, max_value = FLT_MAX;
+    int input_size = node_proto.input_size();
+    CV_Check(input_size, 1 <= input_size && input_size <= 3, "");
+
+    if (input_size >= 2 && !node_proto.input(1).empty())
+    {
+        Mat m;
+        CV_Assert(net.isConstArg(node_inputs[1]));
+        net.argTensor(node_inputs[1]).convertTo(m, CV_32F);
+        CV_Assert(m.total() == 1);
+        min_value = m.at<float>(0);
+    }
+
+    if (input_size == 3 && !node_proto.input(2).empty())
+    {
+        Mat m;
+        CV_Assert(net.isConstArg(node_inputs[2]));
+        net.argTensor(node_inputs[2]).convertTo(m, CV_32F);
+        CV_Assert(m.total() == 1);
+        max_value = m.at<float>(0);
+    }
+
+    layerParams.set("min_value", layerParams.get<float>("min", min_value));
+    layerParams.set("max_value", layerParams.get<float>("max", max_value));
+    addLayer(layerParams, node_proto, 1);
+}
+
+void ONNXImporter2::parseLeakyRelu(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "ReLU";
+    layerParams.set("negative_slope", layerParams.get<float>("alpha", 0.01));
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseRelu(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "ReLU";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseElu(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "ELU";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseTanh(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "TanH";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseAbs(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "AbsVal";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parsePRelu(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "PReLU";
+    CV_Assert(node_inputs.size() == 2);
+    CV_Assert(net.isConstArg(node_inputs[1]));
+    layerParams.blobs.push_back(net.argTensor(node_inputs[1]));
+    addLayer(layerParams, node_proto, 1);
+}
+
+void ONNXImporter2::parseLRN(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    replaceLayerParam(layerParams, "size", "local_size");
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseInstanceNormalization(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto) {
+    int num_inputs = node_proto.input_size();
+    CV_CheckEQ(num_inputs, 3, "DNN/ONNXImporter2 - InstanceNorm: three inputs are required");
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseBatchNormalization(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    if (node_proto.input_size() != 5)
+        CV_Error(Error::StsNotImplemented,
+                 "Expected input, scale, bias, mean and var");
+
+    layerParams.type = "BatchNorm";
+    replaceLayerParam(layerParams, "epsilon", "eps");
+    replaceLayerParam(layerParams, "spatial", "use_global_stats");
+
+    CV_Assert(net.isConstArg(node_inputs[3]));
+    CV_Assert(net.isConstArg(node_inputs[4]));
+
+    Mat meanData = net.argTensor(node_inputs[3]);
+    Mat stdData =  net.argTensor(node_inputs[4]);
+
+    layerParams.blobs.push_back(meanData);
+    layerParams.blobs.push_back(stdData);
+
+    if (!node_proto.input(1).empty()) {
+        layerParams.set("has_weight", true);
+        CV_Assert(net.isConstArg(node_inputs[1]));
+        layerParams.blobs.push_back(net.argTensor(node_inputs[1]));  // weightData
+    } else {
+        layerParams.set("has_weight", false);
+    }
+
+    if (!node_proto.input(2).empty()) {
+        layerParams.set("has_bias", true);
+        CV_Assert(net.isConstArg(node_inputs[1]));
+        layerParams.blobs.push_back(net.argTensor(node_inputs[2]));  // biasData
+    } else {
+        layerParams.set("has_bias", false);
+    }
+    addLayer(layerParams, node_proto, 1);
+}
+
+void ONNXImporter2::parseGemm(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "Gemm";
+    int n_inputs = node_proto.input_size();
+    CV_Assert(2 <= n_inputs && n_inputs <= 3);
+
+    if (net.isConstArg(node_inputs[1]) && (n_inputs == 2 || net.isConstArg(node_inputs[2]))) {
+        Mat B = net.argTensor(node_inputs[1]);
+        layerParams.blobs.push_back(B);
+        if (n_inputs > 2) {
+            Mat bias = net.argTensor(node_inputs[2]);
+            layerParams.blobs.push_back(bias);
+        }
+        n_inputs = 1;
+    }
+
+    addLayer(layerParams, node_proto, n_inputs);
+}
+
+void ONNXImporter2::parseMatMul(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto) {
+    int n_inputs = node_proto.input_size();
+    CV_Assert(2 <= n_inputs && n_inputs <= 3);
+
+    if (net.isConstArg(node_inputs[1]) && (n_inputs == 2 || net.isConstArg(node_inputs[2]))) {
+        Mat B = net.argTensor(node_inputs[1]);
+        layerParams.blobs.push_back(B);
+        if (n_inputs > 2) {
+            Mat bias = net.argTensor(node_inputs[2]);
+            layerParams.blobs.push_back(bias);
+        }
+        n_inputs = 1;
+    }
+    addLayer(layerParams, node_proto, n_inputs);
+}
+
+void ONNXImporter2::parseConv(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    int n_inputs = node_proto.input_size();
+    CV_Assert(2 <= n_inputs && n_inputs <= 3);
+    layerParams.type = "Convolution";
+
+    if (net.isConstArg(node_inputs[1]) && (n_inputs == 2 || net.isConstArg(node_inputs[2]))) {
+        Mat weights = net.argTensor(node_inputs[1]);
+        layerParams.blobs.push_back(weights);
+        if (n_inputs > 2) {
+            Mat bias = net.argTensor(node_inputs[2]);
+            layerParams.blobs.push_back(bias);
+        }
+        n_inputs = 1;
+    }
+    addLayer(layerParams, node_proto, n_inputs);
+}
+
+void ONNXImporter2::parseConvTranspose(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    int n_inputs = node_proto.input_size();
+    CV_Assert(2 <= n_inputs && n_inputs <= 3);
+    layerParams.type = "Deconvolution";
+
+    layerParams.set("bias_term", node_proto.input_size() == 3);
+
+    if (net.isConstArg(node_inputs[1]) && (n_inputs == 2 || net.isConstArg(node_inputs[2]))) {
+        Mat weights = net.argTensor(node_inputs[1]);
+        layerParams.blobs.push_back(weights);
+        if (n_inputs > 2) {
+            Mat bias = net.argTensor(node_inputs[2]);
+            layerParams.blobs.push_back(bias);
+        }
+        n_inputs = 1;
+    }
+
+    if (!layerParams.has("kernel_size"))
+        CV_Error(Error::StsNotImplemented,
+                 "Required attribute 'kernel_size' is not present.");
+
+    if (layerParams.has("output_shape"))
+    {
+        const DictValue& outShape = layerParams.get("output_shape");
+        DictValue strides = layerParams.get("stride");
+        DictValue kernel = layerParams.get("kernel_size");
+
+        String padMode;
+        std::vector<int> adjust_pads;
+        if (layerParams.has("pad_mode"))
+        {
+            padMode = toUpperCase(layerParams.get<String>("pad_mode"));
+            if (padMode != "SAME" && padMode != "VALID")
+                CV_Error(Error::StsError, "Unsupported padding mode " + padMode);
+
+            for (int i = 0; i < strides.size(); i++)
+            {
+                int sz = outShape.get<int>(2 + i);
+                int stride = strides.get<int>(i);
+                adjust_pads.push_back(padMode == "SAME"? (sz - 1) % stride :
+                                                         (sz - kernel.get<int>(i)) % stride);
+            }
+            layerParams.set("adj", DictValue::arrayInt(&adjust_pads[0], (int)adjust_pads.size()));
+        }
+    }
+    else if (layerParams.has("output_padding"))
+    {
+        replaceLayerParam(layerParams, "output_padding", "adj");
+    }
+    addLayer(layerParams, node_proto, n_inputs);
+}
+
+void ONNXImporter2::parseTranspose(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_Assert(node_proto.input_size() == 1);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseSqueeze(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "Squeeze";
+    CV_Assert(node_proto.input_size() <= 2);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseFlatten(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_Assert(node_proto.input_size() == 1);
+    layerParams.set("onnx", true);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseUnsqueeze(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_Assert((node_proto.input_size() == 1 && layerParams.has("axes")) ||
+              node_proto.input_size() == 2);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseExpand(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "Expand2";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseReshape(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    bool have_shape_attr = layerParams.has("shape");
+    CV_Assert((node_proto.input_size() == 2 && !have_shape_attr) ||
+              (node_proto.input_size() == 1 && have_shape_attr));
+    layerParams.type = "Reshape2";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parsePad(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "Pad2";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseShape(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_Assert(node_proto.input_size() == 1);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseCast(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    opencv_onnx::TensorProto_DataType onnx_type = (opencv_onnx::TensorProto_DataType)layerParams.get<int>("to");
+    int type = dataType2cv(onnx_type);
+
+    layerParams.type = "Cast";
+    layerParams.set("outputType", type);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseConstantOfShape(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "ConstantOfShape";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseGather(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    layerParams.type = "Gather2";
+    CV_CheckEQ(node_proto.input_size(), 2, "");
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseGatherElements(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_CheckEQ(node_proto.input_size(), 2, "GatherElements: two inputs are required");
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseConcat(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_CheckEQ(node_proto.output_size(), 1, "");
+    layerParams.type = "Concat2";
+    addLayer(layerParams, node_proto);
+}
+
+// https://github.com/onnx/onnx/blob/master/docs/Operators.md#Resize
+void ONNXImporter2::parseResize(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    int ninputs = node_proto.input_size();
+    layerParams.type = "Resize";
+    String interp_mode = layerParams.get<String>("coordinate_transformation_mode", "half_pixel");
+
+    CV_Assert(interp_mode != "tf_crop_and_resize");
+    bool halfPixel = interp_mode == "tf_half_pixel_for_nn" || interp_mode == "half_pixel" || interp_mode == "pytorch_half_pixel";
+
+    layerParams.set("align_corners", interp_mode == "align_corners");
+    layerParams.set("half_pixel_centers", halfPixel);
+    if (layerParams.get<String>("mode") == "linear")
+    {
+        layerParams.set("mode", halfPixel ? "opencv_linear" : "bilinear");
+    }
+
+    if (layerParams.get<String>("mode") == "linear" && framework_name == "pytorch")
+        layerParams.set("mode", "opencv_linear");
+
+    // opset-10: input = [X, scales]
+    // opset-11: input = [X, roi, scales] or [x, roi, scales, sizes]
+    // opset-13: may have empty input, [X, "", "", sizes] or [x, "", scales]
+    int scalesInputId = ninputs == 2 ? 1 : 2;
+    Arg scalesArg = node_inputs[scalesInputId];
+    Mat scales;
+    if(scalesArg.idx > 0 && netimpl->isConstArg(scalesArg))
+        scales = netimpl->argTensor(scalesArg);
+
+    if (ninputs >= 3 && interp_mode == "tf_crop_and_resize") {
+        int roiInputId = 1;
+        Arg roiArg = node_inputs[roiInputId];
+        if (!netimpl->isConstArg(roiArg)) {
+            CV_Error(Error::StsNotImplemented, "ONNX/Resize: only empty ROI is supported");
+        }
+        Mat roi = netimpl->argTensor(roiArg);
+        if (!roi.empty()) {
+            CV_Error(Error::StsNotImplemented, "ONNX/Resize: only empty ROI is supported");
+        }
+    }
+
+    if (scales.total() == 4)
+    {
+        CV_CheckEQ(scales.total(), (size_t)4, "HCHW layout is expected");
+        CV_CheckEQ(scales.type(), CV_32F, "Scales should have 32F type");
+        layerParams.set("zoom_factor_y", scales.at<float>(2));
+        layerParams.set("zoom_factor_x", scales.at<float>(3));
+        ninputs = 1;
+    }
+    else if (ninputs >= 4)  // opset-11 [x, roi, scales, sizes] or opset-13: input = [X, "", "", sizes]
+    {
+        Arg sizesArg = node_inputs[3];
+        if (netimpl->isConstArg(sizesArg))
+        {
+            Mat shapes_ = netimpl->argTensor(sizesArg), shapes;
+            CV_CheckEQ(shapes_.total(), (size_t)4, "HCHW layout is expected");
+            shapes_.convertTo(shapes, CV_32S);
+            layerParams.set("width", shapes.at<int>(3));
+            layerParams.set("height", shapes.at<int>(2));
+            ninputs = 1;
+        }
+    }
+    replaceLayerParam(layerParams, "mode", "interpolation");
+    addLayer(layerParams, node_proto, ninputs);
+}
+
+void ONNXImporter2::parseUpsample(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    int n_inputs = node_proto.input_size();
+    //fused from Resize Subgraph
+    if (layerParams.has("coordinate_transformation_mode"))
+    {
+        String interp_mode = layerParams.get<String>("coordinate_transformation_mode");
+        CV_Assert(interp_mode != "tf_crop_and_resize");
+
+        bool halfPixel = interp_mode == "tf_half_pixel_for_nn" || interp_mode == "half_pixel" || interp_mode == "pytorch_half_pixel";
+
+        layerParams.set("align_corners", interp_mode == "align_corners");
+        layerParams.set("half_pixel_centers", halfPixel);
+        if (layerParams.get<String>("mode") == "linear")
+        {
+            layerParams.set("mode", halfPixel ? "opencv_linear" : "bilinear");
+        }
+    }
+    if (layerParams.get<String>("mode") == "linear" && framework_name == "pytorch")
+        layerParams.set("mode", "opencv_linear");
+
+    layerParams.type = "Resize";
+    if (layerParams.has("scales"))
+    {
+        // Pytorch layer
+        DictValue scales = layerParams.get("scales");
+        CV_Assert(scales.size() == 4);
+        layerParams.set("zoom_factor_y", scales.getIntValue(2));
+        layerParams.set("zoom_factor_x", scales.getIntValue(3));
+    }
+    else if (layerParams.has("height_scale") && layerParams.has("width_scale"))
+    {
+        // Caffe2 layer
+        replaceLayerParam(layerParams, "height_scale", "zoom_factor_y");
+        replaceLayerParam(layerParams, "width_scale", "zoom_factor_x");
+    }
+    else
+    {
+        CV_Assert(n_inputs >= 2);
+        // scales as input
+        if (net.isConstArg(node_inputs[1])) {
+            Mat scales;
+            net.argTensor(node_inputs[1]).convertTo(scales, CV_32F);
+            CV_Assert(scales.total() == 4);
+            layerParams.set("zoom_factor_y", scales.at<float>(2));
+            layerParams.set("zoom_factor_x", scales.at<float>(3));
+            n_inputs = 1;
+        }
+    }
+    replaceLayerParam(layerParams, "mode", "interpolation");
+    addLayer(layerParams, node_proto, n_inputs);
+}
+
+void ONNXImporter2::parseSoftMax(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    const std::string& layer_type = node_proto.op_type();
+    int axis;
+    if (onnx_opset != 0 && onnx_opset <= 11) {
+        axis = layerParams.get<int>("axis", 1);
+    } else {
+        axis = layerParams.get<int>("axis", -1);
+    }
+    layerParams.set<int>("axis", axis);
+    layerParams.type = "Softmax";
+    layerParams.set("log_softmax", layer_type == "LogSoftmax");
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseDetectionOutput(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_CheckEQ(node_proto.input_size(), 3, "");
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseCumSum(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    int ninputs = node_proto.input_size();
+    CV_Assert(ninputs == 2);
+    layerParams.type = "CumSum";
+    if (net.isConstArg(node_inputs[1]))
+    {
+        Mat axisTensor;
+        net.argTensor(node_inputs[1]).convertTo(axisTensor, CV_32S);
+        CV_Assert(axisTensor.total() == 1);
+        int axis = axisTensor.at<int>(0);
+        layerParams.set("axis", axis);
+        ninputs = 1;
+    }
+    addLayer(layerParams, node_proto, ninputs);
+}
+
+// "Equal" "Greater" "Less" "Pow" "Add" "Sub" "Mul" "Div" "Sum" "Min" "Max" "GreaterOrEqual" "LessOrEqual" "And" "Or" "Xor"
+void ONNXImporter2::parseElementWise(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto_)
+{
+    opencv_onnx::NodeProto node_proto = node_proto_;
+    String op_type = toLowerCase(node_proto.op_type());
+
+    layerParams.type = "NaryEltwise";
+    layerParams.set("operation", toLowerCase(node_proto.op_type()));
+    if (node_proto.op_type() == "Mod") {
+        if (layerParams.get<int>("fmod", 0)) {
+            layerParams.set("operation", "fmod");
+        };
+    }
+    // add element-wise layer
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseDepthSpaceOps(LayerParams &layerParams, const opencv_onnx::NodeProto& node_proto) {
+    CV_CheckTrue(layerParams.has("blocksize"), "blocksize is required but not found");
+    addLayer(layerParams, node_proto);
+}
+
+// Currently we only support range with all constant inputs
+void ONNXImporter2::parseRange(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_Assert(node_proto.input_size() == 3); // 0 - start, 1 - limit, 2 - delta
+    layerParams.type = "Range";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseScatter(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_CheckEQ(node_proto.input_size(), 3, "Scatter: three inputs are required.");
+    layerParams.type = "Scatter";
+    if (node_proto.op_type() == "ScatterND")
+        layerParams.type = "ScatterND";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseTile(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    // for Tile>1, only the case of 'repeats' being constant is supported.
+    // 'repeats' is treated as a parameter instead of an input to determine shape in pre-run.
+    layerParams.type = "Tile2";
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseLayerNorm(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    int n_inputs = node_proto.input_size();
+    CV_Assert(2 <= n_inputs && n_inputs <= 3);
+    if (net.isConstArg(node_inputs[1]) && (n_inputs == 2 || net.isConstArg(node_inputs[2]))) {
+        Mat scale = net.argTensor(node_inputs[1]);
+        layerParams.blobs.push_back(scale);
+        if (n_inputs > 2) {
+            Mat bias = net.argTensor(node_inputs[2]);
+            layerParams.blobs.push_back(bias);
+        }
+        n_inputs = 1;
+    }
+    addLayer(layerParams, node_proto, n_inputs);
+}
+
+void ONNXImporter2::parseSimpleLayers(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseEinsum(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    // Check if of equation is valid
+    std::string equation = layerParams.get<std::string>("equation");
+    CV_CheckFalse(equation.empty(), "Equation is empty");
+
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseDequantizeLinear(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseQuantizeLinear(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    int dt = layerParams.get<int>("output_dtype", -1);
+    if (dt >= 0)
+    {
+        layerParams.set<int>("output_dtype", dataType2cv((opencv_onnx::TensorProto_DataType)dt));
+    }
+    addLayer(layerParams, node_proto);
+}
+
+// BUG: https://github.com/opencv/opencv/issues/26310
+/*void ONNXImporter2::parseQConv(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto_)
+{
+    opencv_onnx::NodeProto node_proto = node_proto_;
+    int ninputs = node_proto.input_size();
+    CV_Assert(ninputs == 8 || ninputs == 9);
+
+    float inp_sc = getScalarFromMat<float>(getBlob(node_proto, 1));
+    int inp_zp = (int)getScalarFromMat<int8_t>(getBlob(node_proto, 2));
+
+    if (layerParams.has("pad"))
+    {
+        bool asymmetricPadding = false;
+        DictValue pads = layerParams.get("pad");
+        const int dims = pads.size() / 2;
+
+        for (int i = 0; i < dims; ++i)
+        {
+            if (pads.get<int>(i) != pads.get<int>(i + dims))
+            {
+                asymmetricPadding = true;
+                break;
+            }
+        }
+        if (asymmetricPadding && pads.size() == 4)
+        {
+            layerParams.erase("pad");
+            std::vector<int> paddings(4, 0);
+            for (int i = 0; i < dims; ++i)
+            {
+                paddings.push_back(pads.get<int>(i));
+                paddings.push_back(pads.get<int>(dims + i));
+            }
+            LayerParams padLp;
+            padLp.name = layerParams.name + "/pad";
+            padLp.type = "PaddingInt8";
+            padLp.set("paddings", DictValue::arrayInt(&paddings[0], paddings.size()));
+            padLp.set("depth", CV_8S);
+            padLp.set<double>("value", (double)inp_zp);
+
+            opencv_onnx::NodeProto proto;
+            proto.add_input(node_proto.input(0));
+            proto.add_output(padLp.name);
+
+            addLayer(padLp, proto);
+            node_proto.set_input(0, padLp.name);
+        }
+    }
+
+    Mat weights = getBlob(node_proto, 3);
+    int outCn = weights.size[0];
+    Mat w_scale = getBlob(node_proto, 4);
+    CV_Assert(w_scale.total() == 1 || w_scale.total() == outCn);
+    bool per_channel = w_scale.total() == outCn;
+    Mat wt_sc = (w_scale.total() == outCn) ? w_scale : Mat(1, outCn, CV_32F, Scalar(w_scale.at<float>(0)));
+
+    float out_sc = getScalarFromMat<float>(getBlob(node_proto, 6));
+    int8_t out_zp = getScalarFromMat<int8_t>(getBlob(node_proto, 7));
+
+    Mat bias = (ninputs == 9) ? getBlob(node_proto, 8) : Mat::zeros(1, outCn, CV_32S);
+
+    Mat weights_2d = weights.reshape(1, outCn);
+    Mat biasFused(1, outCn, CV_32S);
+    Mat outputMultiplier(1, outCn, CV_32F);
+    for (int i = 0; i < outCn; i++)
+    {
+        biasFused.at<int>(i) = bias.at<int>(i) - inp_zp*(cv::sum(weights_2d.row(i))[0]);
+        outputMultiplier.at<float>(i) = (inp_sc * wt_sc.at<float>(i)) / out_sc;
+    }
+
+    layerParams.type = "ConvolutionInt8";
+    layerParams.set("num_output", outCn);
+    layerParams.set("input_zeropoint", inp_zp);
+    layerParams.set("input_scale",inp_sc);
+    layerParams.set("zeropoints", out_zp);
+    layerParams.set("scales", out_sc);
+    layerParams.set("per_channel", per_channel);
+    layerParams.blobs.push_back(weights);
+    layerParams.blobs.push_back(biasFused);
+    layerParams.blobs.push_back(outputMultiplier);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseQMatMul(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    int ninputs = node_proto.input_size();
+    CV_Assert(ninputs == 8);
+
+    if (constBlobs.find(node_proto.input(3)) == constBlobs.end())
+        CV_Error(Error::StsNotImplemented, "Variable weights is not supported");
+
+    int firstInpDims = outShapes[node_proto.input(0)].size();
+
+    float inp_sc = getScalarFromMat<float>(getBlob(node_proto, 1));
+    int8_t inp_zp = getScalarFromMat<int8_t>(getBlob(node_proto, 2));
+
+    Mat weights = getBlob(node_proto, 3).t();
+    int outCn = weights.size[0];
+    int secondInpDims = weights.dims;
+
+    Mat w_scale = getBlob(node_proto, 4);
+    CV_Assert(w_scale.total() == 1 || w_scale.total() == outCn);
+    bool per_channel = w_scale.total() == outCn ? true : false;
+    Mat wt_sc = (w_scale.total() == outCn) ? w_scale : Mat(1, outCn, CV_32F, Scalar(w_scale.at<float>(0)));
+
+    float out_sc = getScalarFromMat<float>(getBlob(node_proto, 6));
+    int8_t out_zp = getScalarFromMat<int8_t>(getBlob(node_proto, 7));
+
+    Mat bias(1, outCn, CV_32S);
+    Mat outputMultiplier(1, outCn, CV_32F);
+    for (int i = 0; i < outCn; i++)
+    {
+        bias.at<int>(i) = -inp_zp*(cv::sum(weights.row(i))[0]);
+        outputMultiplier.at<float>(i) = (inp_sc * wt_sc.at<float>(i)) / out_sc;
+    }
+
+    layerParams.type = "InnerProductInt8";
+    layerParams.set("num_output", outCn);
+    layerParams.set("axis", firstInpDims - secondInpDims + 1);
+    layerParams.set("input_scale", inp_sc);
+    layerParams.set("input_zeropoint", inp_zp);
+    layerParams.set("zeropoints", out_zp);
+    layerParams.set("scales", out_sc);
+    layerParams.set("per_channel", per_channel);
+
+    layerParams.blobs.push_back(weights);
+    layerParams.blobs.push_back(bias);
+    layerParams.blobs.push_back(outputMultiplier);
+    addLayer(layerParams, node_proto);
+}
+
+// A * B + C = Y, we require that the dimension of A is [m, k], and the dimension of B is [n, k].
+// And the dim of output Y is [m, n]
+void ONNXImporter2::parseQGemm(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    int ninputs = node_proto.input_size();
+    CV_Assert(ninputs == 8 || ninputs == 9);
+
+    layerParams.type = "InnerProductInt8";
+
+    if (constBlobs.find(node_proto.input(3)) == constBlobs.end())
+        CV_Error(Error::StsNotImplemented, "Variable weights is not supported");
+
+    Mat weights = getBlob(node_proto, 3);
+
+    if (!layerParams.get<int>("transB", 0))
+    {
+        transpose(weights, weights);
+    }
+
+    CV_Assert(layerParams.get<float>("alpha", 1) == 1.0f);
+    CV_Assert(layerParams.get<int>("transA", 0) == 0);
+
+    int firstInpDims = outShapes[node_proto.input(0)].size();
+
+    float inp_sc = getScalarFromMat<float>(getBlob(node_proto, 1));
+    int8_t inp_zp = getScalarFromMat<int8_t>(getBlob(node_proto, 2));
+
+    int outCn = weights.size[0];
+    int secondInpDims = weights.dims;
+
+    Mat w_scale = getBlob(node_proto, 4);
+    CV_Assert(w_scale.total() == 1 || w_scale.total() == outCn);
+    bool per_channel = w_scale.total() == outCn;
+    Mat wt_sc = (w_scale.total() == outCn) ? w_scale : Mat(1, outCn, CV_32F, Scalar(w_scale.at<float>(0)));
+
+    Mat w_zp = getBlob(node_proto, 5);
+    int8_t* ptrZp = w_zp.ptr<int8_t>(0);
+
+    for (int i = 0; i < w_zp.total(); i++)
+    {
+        if (ptrZp[i] != (int8_t)0)
+            CV_Error(Error::StsUnsupportedFormat, "The zero-point non-zero case of W is not supported!");
+    }
+
+    float out_sc = getScalarFromMat<float>(getBlob(node_proto, 7));
+    int8_t out_zp = ninputs == 9 ? getScalarFromMat<int8_t>(getBlob(node_proto, 8)) : 0;
+
+    Mat bias;
+    if (constBlobs.find(node_proto.input(6)) != constBlobs.end())
+        bias = getBlob(node_proto, 6);
+    if (bias.empty())
+        bias = Mat::zeros(1, outCn, CV_32S);
+
+    Mat biasFused(1, outCn, CV_32S);
+    Mat outputMultiplier(1, outCn, CV_32F);
+    for (int i = 0; i < outCn; i++)
+    {
+        biasFused.at<int>(i) = bias.at<int>(i) - inp_zp*(cv::sum(weights.row(i))[0]);
+        outputMultiplier.at<float>(i) = (inp_sc * wt_sc.at<float>(i)) / out_sc;
+    }
+
+    layerParams.type = "InnerProductInt8";
+    layerParams.set("num_output", outCn);
+    layerParams.set("axis", firstInpDims - secondInpDims + 1);
+    layerParams.set("input_scale", inp_sc);
+    layerParams.set("input_zeropoint", inp_zp);
+    layerParams.set("scales", out_sc);
+    layerParams.set("zeropoints", out_zp);
+    layerParams.set("per_channel", per_channel);
+
+    layerParams.blobs.push_back(weights);
+    layerParams.blobs.push_back(biasFused);
+    layerParams.blobs.push_back(outputMultiplier);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseQEltwise(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto_)
+{
+    opencv_onnx::NodeProto node_proto = node_proto_;
+    CV_Assert(node_proto.input_size() == 7 || node_proto.input_size() == 8);
+    std::string op = (node_proto.op_type() == "QLinearAdd") ? "sum" : "prod";
+    int constId = -1;
+    for (int i = 0; i < 4; i += 3)
+    {
+        if (constBlobs.find(node_proto.input(i)) != constBlobs.end())
+            constId = i;
+    }
+
+    float inp_0_sc = getScalarFromMat<float>(getBlob(node_proto, 1));
+    int8_t inp_0_zp = getScalarFromMat<int8_t>(getBlob(node_proto, 2));
+
+    float inp_1_sc = getScalarFromMat<float>(getBlob(node_proto, 4));
+    int8_t inp_1_zp = getScalarFromMat<int8_t>(getBlob(node_proto, 5));
+
+    // Set 2nd input as the const input
+    if (constId == 0)
+    {
+        cv::swap(inp_0_sc, inp_1_sc);
+        cv::swap(inp_0_zp, inp_1_zp);
+    }
+
+    float out_sc = getScalarFromMat<float>(getBlob(node_proto, 6));
+
+    int8_t out_zp = 0;
+    if (node_proto.input_size() == 8)
+        out_zp = getScalarFromMat<int8_t>(getBlob(node_proto, 7));
+
+    std::vector<float> inp_scales = {inp_0_sc, inp_1_sc};
+    std::vector<int8_t> inp_zps = {inp_0_zp, inp_1_zp};
+
+    std::vector<float> coeffs;
+    float offset;
+    if (op == "sum")
+    {
+        coeffs = {inp_scales[0]/out_sc, inp_scales[1]/out_sc};
+        offset = out_zp - coeffs[0]*inp_zps[0] - coeffs[1]*inp_zps[1];
+    }
+    else
+    {
+        coeffs = {inp_scales[0]/out_sc, inp_scales[1]};
+        offset = out_zp;
+    }
+
+    if (constId != -1)
+    {
+        Mat blob = getBlob(node_proto, constId);
+        if (blob.total() == 1)
+        {
+            float val = inp_scales[1] * (blob.at<int8_t>(0) - inp_zps[1]);
+            float scale = inp_scales[0] / out_sc;
+            if (op == "prod")
+                scale *= val;
+
+            float shift = out_zp - scale*inp_zps[0];
+            if (op == "sum")
+                shift += (val/out_sc);
+
+            LayerParams rescaleParams;
+            rescaleParams.name = layerParams.name;
+            rescaleParams.type = "Requantize";
+            rescaleParams.set("depth", CV_8S);
+            rescaleParams.set("scale", scale);
+            rescaleParams.set("shift", shift);
+            rescaleParams.set("isEltwise", true);
+            addLayer(rescaleParams, node_proto);
+            return;
+        }
+        else
+        {
+            MatShape inpShape = outShapes[node_proto.input(3 - constId)];
+            if (blob.dims == 2)
+                blob = blob.t();
+
+            if (shape(blob) == inpShape)
+            {
+                LayerParams constParams;
+                constParams.name = layerParams.name + "/const";
+                constParams.type = "ConstInt8";
+                constParams.set("depth", CV_8S);
+                constParams.set("scales", inp_1_sc);
+                constParams.set("zeropoints", inp_1_zp);
+                constParams.blobs.push_back(blob);
+
+                int id = net.addLayer(constParams.name, constParams.type, CV_8S, constParams);
+                layer_id.insert(std::make_pair(constParams.name, LayerInfo(id, 0, CV_8S)));
+                outShapes[constParams.name] = shape(blob);
+                node_proto.set_input(constId, constParams.name);
+
+                layerParams.type = "EltwiseInt8";
+                layerParams.set("operation", op);
+                layerParams.set("coeff", DictValue::arrayReal(coeffs.data(), coeffs.size()));
+                layerParams.set("offset", offset);
+            }
+            else
+            {
+                layerParams.type = "ScaleInt8";
+                layerParams.set("bias_term", op == "sum");
+                int axis = 1;
+                for (int i = 0; i < graph_proto->initializer_size(); i++)
+                {
+                    opencv_onnx::TensorProto tensor_proto = graph_proto->initializer(i);
+                    if (tensor_proto.name() == node_proto.input(constId))
+                    {
+                        axis = inpShape.size() - tensor_proto.dims_size();
+                        break;
+                    }
+                }
+                layerParams.set("axis", axis);
+                blob = blob.reshape(1, 1);
+                Mat blob_dequantized;
+                blob.convertTo(blob_dequantized, CV_32F, inp_scales[1], -(inp_scales[1] * inp_zps[1]));
+                layerParams.blobs.push_back(blob_dequantized);
+            }
+        }
+    }
+    else if (outShapes[node_proto.input(0)] == outShapes[node_proto.input(3)])
+    {
+        layerParams.type = "EltwiseInt8";
+        layerParams.set("operation", op);
+        layerParams.set("coeff", DictValue::arrayReal(coeffs.data(), coeffs.size()));
+        layerParams.set("offset", offset);
+    }
+    else
+    {
+        layerParams.type = "ScaleInt8";
+        layerParams.set("bias_term", op == "sum");
+    }
+
+    layerParams.set("input_scales", DictValue::arrayReal(inp_scales.data(), inp_scales.size()));
+    layerParams.set("input_zeropoints", DictValue::arrayInt(inp_zps.data(), inp_zps.size()));
+    layerParams.set("scales", out_sc);
+    layerParams.set("zeropoints", out_zp);
+
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseQLeakyRelu(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_Assert(node_proto.input_size() == 4 || node_proto.input_size() == 5);
+
+    float slope = layerParams.get<float>("alpha");
+    float inp_sc = getScalarFromMat<float>(getBlob(node_proto, 1));
+    int8_t inp_zp = getScalarFromMat<int8_t>(getBlob(node_proto, 2));
+    float out_sc = getScalarFromMat<float>(getBlob(node_proto, 3));
+    int8_t out_zp = node_proto.input_size() == 4 ? 0 : getScalarFromMat<int8_t>(getBlob(node_proto, 4));
+
+    Mat lookUpTable(1, 256, CV_8S);
+    int8_t* table = lookUpTable.ptr<int8_t>();
+    for (int i = -128; i < 128; i++)
+    {
+        float x = inp_sc*(i - inp_zp);
+        float y = x >= 0.f ? x : slope*x;
+        int quantized = out_zp + cvRound(y/out_sc);
+        table[i+128] = saturate_cast<int8_t>(quantized);
+    }
+
+    layerParams.type = "ReLUInt8";
+    layerParams.set("input_scale", inp_sc);
+    layerParams.set("input_zeropoint", inp_zp);
+    layerParams.set("scales", out_sc);
+    layerParams.set("zeropoints", out_zp);
+    layerParams.set("slope", slope);
+    layerParams.blobs.push_back(lookUpTable);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseQSigmoid(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_Assert(node_proto.input_size() == 4 || node_proto.input_size() == 5);
+
+    float inp_sc = getScalarFromMat<float>(getBlob(node_proto, 1));
+    int8_t inp_zp = getScalarFromMat<int8_t>(getBlob(node_proto, 2));
+    float out_sc = getScalarFromMat<float>(getBlob(node_proto, 3));
+    int8_t out_zp = node_proto.input_size() == 4 ? 0 : getScalarFromMat<int8_t>(getBlob(node_proto, 4));
+
+    Mat lookUpTable(1, 256, CV_8S);
+    int8_t* table = lookUpTable.ptr<int8_t>();
+    for (int i = -128; i < 128; i++)
+    {
+        float x = inp_sc*(i - inp_zp);
+        float y = 1.f/(1.f + std::exp(-x));
+        int quantized = out_zp + cvRound(y/out_sc);
+        table[i+128] = saturate_cast<int8_t>(quantized);
+    }
+
+    layerParams.type = "SigmoidInt8";
+    layerParams.set("input_scale", inp_sc);
+    layerParams.set("input_zeropoint", inp_zp);
+    layerParams.set("scales", out_sc);
+    layerParams.set("zeropoints", out_zp);
+    layerParams.blobs.push_back(lookUpTable);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseQAvgPool(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_Assert(node_proto.input_size() == 4 || node_proto.input_size() == 5);
+
+    float inp_sc = getScalarFromMat<float>(getBlob(node_proto, 1));
+    int8_t inp_zp = getScalarFromMat<int8_t>(getBlob(node_proto, 2));
+    float out_sc = getScalarFromMat<float>(getBlob(node_proto, 3));
+    int8_t out_zp = node_proto.input_size() == 4 ? 0 : getScalarFromMat<int8_t>(getBlob(node_proto, 4));
+
+    layerParams.type = "PoolingInt8";
+    layerParams.set("pool", "ave");
+    layerParams.set("global_pooling", node_proto.op_type() == "QLinearGlobalAveragePool");
+    layerParams.set("multiplier", inp_sc/out_sc);
+    layerParams.set("input_scale", inp_sc);
+    layerParams.set("input_zeropoint", inp_zp);
+    layerParams.set("scales", out_sc);
+    layerParams.set("zeropoints", out_zp);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseQConcat(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto_)
+{
+    opencv_onnx::NodeProto node_proto = node_proto_;
+    layerParams.type = "ConcatInt8";
+    int num_inputs = node_proto.input_size();
+
+    float out_scale = getScalarFromMat<float>(getBlob(node_proto, 0));
+    int8_t out_zp = getScalarFromMat<int8_t>(getBlob(node_proto, 1));
+
+    for (int i = 2; i < num_inputs; i += 3)
+    {
+        float inp_scale = getScalarFromMat<float>(getBlob(node_proto, i + 1));
+        int8_t inp_zp = getScalarFromMat<int8_t>(getBlob(node_proto, i + 2));
+
+        if (inp_scale != out_scale || inp_zp != out_zp)
+        {
+            float scale = inp_scale/out_scale;
+            float shift = out_zp - scale*inp_zp;
+
+            if (constBlobs.find(node_proto.input(i)) != constBlobs.end())
+            {
+                Mat blob = getBlob(node_proto, i);
+                Mat blob_rescaled;
+                blob.convertTo(blob_rescaled, CV_8S, scale, shift);
+                constBlobs[node_proto.input(i)] = blob_rescaled;
+            }
+            else
+            {
+                LayerParams rescaleParams;
+                rescaleParams.name = node_proto.input(i) + "/rescale";
+                rescaleParams.type = "Requantize";
+                rescaleParams.set("depth", CV_8S);
+                rescaleParams.set("scale", scale);
+                rescaleParams.set("shift", shift);
+                rescaleParams.set("isEltwise", false);
+
+                opencv_onnx::NodeProto proto;
+                proto.add_input(node_proto.input(i));
+                proto.add_output(rescaleParams.name);
+                addLayer(rescaleParams, proto);
+                node_proto.set_input(i, rescaleParams.name);
+            }
+        }
+    }
+
+    bool hasVariableInps = false;
+    for (int i = 2; i < num_inputs; i += 3)
+    {
+        if (layer_id.find(node_proto.input(i)) != layer_id.end())
+        {
+            hasVariableInps = true;
+            break;
+        }
+    }
+
+    if (!hasVariableInps)
+    {
+        std::vector<Mat> inputs, concatenated;
+        MatShape inputShape;
+        for (size_t i = 2; i < num_inputs; i += 3)
+        {
+            Mat blob = getBlob(node_proto, i);
+            if (blob.size.dims() > inputShape.size())
+            {
+                inputShape = shape(blob);
+            }
+            inputs.push_back(blob);
+        }
+
+        int axis = layerParams.get<int>("axis", 1);
+        for (size_t i = 0; i < inputs.size(); ++i)
+        {
+            MatShape targetShape = inputShape;
+            targetShape[axis] = shape(inputs[i])[axis];
+            CV_CheckEQ(total(targetShape), total(shape(inputs[i])), "");
+            inputs[i] = inputs[i].reshape(0, targetShape);
+        }
+        runLayer(layerParams, inputs, concatenated);
+        CV_Assert(concatenated.size() == 1);
+        addConstant(layerParams.name, concatenated[0]);
+        return;
+    }
+    else
+    {
+        for (int i = 2; i < num_inputs; i += 3)
+        {
+            if (constBlobs.find(node_proto.input(i)) != constBlobs.end())
+            {
+                LayerParams constParams;
+                constParams.name = node_proto.input(i);
+                constParams.type = "ConstInt8";
+                constParams.blobs.push_back(getBlob(node_proto, i));
+                constParams.set("depth", CV_8S);
+
+                opencv_onnx::NodeProto proto;
+                proto.add_output(constParams.name);
+                addLayer(constParams, proto);
+            }
+        }
+    }
+    layerParams.set("scales", out_scale);
+    layerParams.set("zeropoints", out_zp);
+    addLayer(layerParams, node_proto);
+}
+
+void ONNXImporter2::parseQSoftmax(LayerParams& layerParams, const opencv_onnx::NodeProto& node_proto)
+{
+    CV_CheckEQ(node_proto.input_size(), 5, "DNN/ONNX: QLinearSoftmax requires 5 inputs, X, X_scale, X_zero_point, Y_scale, Y_zero_point");
+
+    int opset = layerParams.get<int>("opset");
+    if (opset < 13) {
+        layerParams.set("coerced_2d", true);
+    }
+
+    float x_scale = getScalarFromMat<float>(getBlob(node_proto, 1));
+    int8_t x_zero_point = getScalarFromMat<int8_t>(getBlob(node_proto, 2));
+    float y_scale = getScalarFromMat<float>(getBlob(node_proto, 3));
+    int8_t y_zero_point = getScalarFromMat<int8_t>(getBlob(node_proto, 4));
+
+    layerParams.type = "SoftmaxInt8";
+    // layerParams also has "axis" and "opset" attrs
+    layerParams.set("input_scale", x_scale);
+    layerParams.set("input_zeropoint", x_zero_point);
+    layerParams.set("scales", y_scale);
+    layerParams.set("zeropoints", y_zero_point);
+    addLayer(layerParams, node_proto);
+}*/
+
+void ONNXImporter2::parseAttention(LayerParams& params, const opencv_onnx::NodeProto& node_proto) {
+    int i, n_inputs = node_proto.input_size();
+    CV_CheckTrue(params.has("num_heads"), "ONNXImporter2/parseAttention: num_heads is required but missing");
+    CV_CheckTrue(params.has("qkv_hidden_sizes"), "ONNXImporter2/parseAttention: qkv_hidden_sizes is required but missing");
+
+    auto param_qkv_hidden_sizes = params.get("qkv_hidden_sizes");
+    CV_CheckEQ(param_qkv_hidden_sizes.size(), 3, "ONNXImporter2/parseAttention: qkv_hidden_sizes is must and only have three elements");
+
+    for (i = 1; i < n_inputs; i++) {
+        if (!net.isConstArg(node_inputs[i]))
+            break;
+    }
+
+    if (i == n_inputs) {
+        for (i = 1; i < n_inputs; i++) {
+            Mat blob = net.argTensor(node_inputs[i]);
+            params.blobs.push_back(blob);
+        }
+        n_inputs = 1;
+    }
+
+    addLayer(params, node_proto, n_inputs);
+}
+
+// Domain: ai.onnx (default)
+// URL: https://github.com/onnx/onnx/blob/master/docs/Operators.md
+void ONNXImporter2::buildDispatchMap_ONNX_AI(int opset_version)
+{
+    CV_UNUSED(opset_version);
+    DispatchMap dispatch;
+
+    dispatch["ArgMax"] = dispatch["ArgMin"] = &ONNXImporter2::parseArgMinMax;
+    dispatch["MaxUnpool"] = &ONNXImporter2::parseMaxUnpool;
+    dispatch["MaxPool"] = &ONNXImporter2::parseMaxPool;
+    dispatch["AveragePool"] = &ONNXImporter2::parseAveragePool;
+    dispatch["GlobalAveragePool"] = dispatch["GlobalMaxPool"] = &ONNXImporter2::parseGlobalPool;
+    dispatch["ReduceMax"] = dispatch["ReduceMin"] = dispatch["ReduceMean"] = dispatch["ReduceSum"] =
+            dispatch["ReduceSumSquare"] = dispatch["ReduceProd"] = dispatch["ReduceL1"] =
+            dispatch["ReduceL2"] = dispatch["ReduceLogSum"] = dispatch["ReduceLogSumExp"] = &ONNXImporter2::parseReduce;
+    dispatch["Slice"] = &ONNXImporter2::parseSlice;
+    dispatch["Split"] = &ONNXImporter2::parseSplit;
+    dispatch["Neg"] = &ONNXImporter2::parseNeg;
+    dispatch["Constant"] = &ONNXImporter2::parseConstant;
+    dispatch["LSTM"] = &ONNXImporter2::parseLSTM;
+    dispatch["GRU"] = &ONNXImporter2::parseGRU;
+    dispatch["ImageScaler"] = &ONNXImporter2::parseImageScaler;
+    dispatch["Clip"] = &ONNXImporter2::parseClip;
+    dispatch["LeakyRelu"] = &ONNXImporter2::parseLeakyRelu;
+    dispatch["Relu"] = &ONNXImporter2::parseRelu;
+    dispatch["Elu"] = &ONNXImporter2::parseElu;
+    dispatch["Tanh"] = &ONNXImporter2::parseTanh;
+    dispatch["Abs"] = &ONNXImporter2::parseAbs;
+    dispatch["PRelu"] = &ONNXImporter2::parsePRelu;
+    dispatch["LRN"] = &ONNXImporter2::parseLRN;
+    dispatch["InstanceNormalization"] = &ONNXImporter2::parseInstanceNormalization;
+    dispatch["BatchNormalization"] = &ONNXImporter2::parseBatchNormalization;
+    dispatch["Gemm"] = &ONNXImporter2::parseGemm;
+    dispatch["MatMul"] = &ONNXImporter2::parseMatMul;
+    dispatch["Conv"] = &ONNXImporter2::parseConv;
+    dispatch["ConvTranspose"] = &ONNXImporter2::parseConvTranspose;
+    dispatch["Transpose"] = &ONNXImporter2::parseTranspose;
+    dispatch["Squeeze"] = &ONNXImporter2::parseSqueeze;
+    dispatch["Flatten"] = &ONNXImporter2::parseFlatten;
+    dispatch["Unsqueeze"] = &ONNXImporter2::parseUnsqueeze;
+    dispatch["Expand"] = &ONNXImporter2::parseExpand;
+    dispatch["Reshape"] = &ONNXImporter2::parseReshape;
+    dispatch["Pad"] = &ONNXImporter2::parsePad;
+    dispatch["Shape"] = &ONNXImporter2::parseShape;
+    dispatch["Cast"] = &ONNXImporter2::parseCast;
+    dispatch["ConstantFill"] = dispatch["ConstantOfShape"] = &ONNXImporter2::parseConstantOfShape;
+    dispatch["Gather"] = &ONNXImporter2::parseGather;
+    dispatch["GatherElements"] = &ONNXImporter2::parseGatherElements;
+    dispatch["Concat"] = &ONNXImporter2::parseConcat;
+    dispatch["Resize"] = &ONNXImporter2::parseResize;
+    dispatch["Upsample"] = &ONNXImporter2::parseUpsample;
+    dispatch["SoftMax"] = dispatch["Softmax"] = dispatch["LogSoftmax"] = &ONNXImporter2::parseSoftMax;
+    dispatch["DetectionOutput"] = &ONNXImporter2::parseDetectionOutput;
+    dispatch["CumSum"] = &ONNXImporter2::parseCumSum;
+    dispatch["SpaceToDepth"] = dispatch["DepthToSpace"] = &ONNXImporter2::parseDepthSpaceOps;
+    dispatch["ScatterElements"] = dispatch["Scatter"] = dispatch["ScatterND"] = &ONNXImporter2::parseScatter;
+    dispatch["Tile"] = &ONNXImporter2::parseTile;
+    dispatch["LayerNormalization"] = &ONNXImporter2::parseLayerNorm;
+    dispatch["GroupNormalization"] = &ONNXImporter2::parseInstanceNormalization;
+
+    dispatch["Equal"] = dispatch["Greater"] = dispatch["Less"] = dispatch["Pow"] = dispatch["Add"] =
+            dispatch["Sub"] = dispatch["Mul"] = dispatch["Div"] = dispatch["GreaterOrEqual"] =
+            dispatch["LessOrEqual"] = dispatch["Mod"] = dispatch["And"] = dispatch["Or"] = dispatch["Xor"] = &ONNXImporter2::parseElementWise;
+
+    dispatch["Sum"] = dispatch["Min"] = dispatch["Max"] = dispatch["Mean"] = &ONNXImporter2::parseElementWise;
+    dispatch["Where"] = &ONNXImporter2::parseElementWise;
+    dispatch["Range"] = &ONNXImporter2::parseRange;
+    dispatch["Einsum"] = &ONNXImporter2::parseEinsum;
+
+    std::vector<std::string> simpleLayers {
+        "Acos", "Acosh", "Asin", "Asinh", "Atan", "Atanh", "Ceil", "Celu", "Cos",
+        "Cosh", "Dropout", "Erf", "Exp", "Floor", "HardSigmoid", "HardSwish",
+        "Identity", "Log", "Not", "Round", "Reciprocal", "Selu", "Sign", "Sigmoid", "Sin", "Sinh",
+        "Softplus", "Softsign", "Shrink", "Sqrt", "Tan", "ThresholdedRelu", "Gelu",
+        "GeluApproximation"
+    };
+    for (const auto& name : simpleLayers)
+    {
+        dispatch[name] = &ONNXImporter2::parseSimpleLayers;
+    }
+
+    // BUG: https://github.com/opencv/opencv/issues/26310
+    // ai.onnx: opset 10+
+    dispatch["DequantizeLinear"] = &ONNXImporter2::parseDequantizeLinear;
+    dispatch["QuantizeLinear"] = &ONNXImporter2::parseQuantizeLinear;
+    //dispatch["QLinearConv"] = &ONNXImporter2::parseQConv;
+    //dispatch["QLinearMatMul"] = &ONNXImporter2::parseQMatMul;
+
+    // com.microsft: This operator is added for compatibility via onnx graph simplifier.
+    //               Opset domain cannot be modified from onnx_graph_simplifier.cpp so this
+    //               operator cannot be parsed if only added in buildDispatchMap_COM_MICROSOFT
+    dispatch["Attention"] = &ONNXImporter2::parseAttention;
+
+    domain_dispatch_map[str_domain_ai_onnx] = dispatch;
+}
+
+// Domain: com.microsoft
+// URL: https://github.com/microsoft/onnxruntime/blob/master/docs/ContribOperators.md
+void ONNXImporter2::buildDispatchMap_COM_MICROSOFT(int opset_version)
+{
+    CV_UNUSED(opset_version);
+    DispatchMap dispatch;
+
+    // BUG: https://github.com/opencv/opencv/issues/26310
+    //dispatch["QLinearAdd"] = dispatch["QLinearMul"] = &ONNXImporter2::parseQEltwise;
+    //dispatch["QLinearAveragePool"] = dispatch["QLinearGlobalAveragePool"] = &ONNXImporter2::parseQAvgPool;
+    //dispatch["QLinearLeakyRelu"] = &ONNXImporter2::parseQLeakyRelu;
+    //dispatch["QLinearSigmoid"] = &ONNXImporter2::parseQSigmoid;
+    //dispatch["QLinearConcat"] = &ONNXImporter2::parseQConcat;
+    //dispatch["QGemm"] = &ONNXImporter2::parseQGemm;
+    //dispatch["QLinearSoftmax"] = &ONNXImporter2::parseQSoftmax;
+    dispatch["Attention"] = &ONNXImporter2::parseAttention;
+
+    domain_dispatch_map["com.microsoft"] = dispatch;
+}
+
+
+Net readNetFromONNX2(const String& onnxFile)
+{
+    ONNXImporter2 importer;
+    Net net = importer.parseFile(onnxFile.c_str());
+    if (net.getMainGraph()) {
+        net.getImpl()->modelFileName = onnxFile;
+    }
+    return net;
+}
+
+Net readNetFromONNX2(const char* buffer, size_t size)
+{
+    ONNXImporter2 importer;
+    return importer.parseBuffer(buffer, size);
+}
+
+Net readNetFromONNX2(const std::vector<uchar>& buffer)
+{
+    ONNXImporter2 importer;
+    return importer.parseBuffer(buffer.data(), buffer.size());
+}
+
+#else  // HAVE_PROTOBUF
+
+#define DNN_PROTOBUF_UNSUPPORTED() CV_Error(Error::StsError, "DNN/ONNX: Build OpenCV with Protobuf to import ONNX models")
+
+Net readNetFromONNX2(const String&) {
+    DNN_PROTOBUF_UNSUPPORTED();
+}
+
+Net readNetFromONNX2(const char*, size_t) {
+    DNN_PROTOBUF_UNSUPPORTED();
+}
+
+Net readNetFromONNX2(const std::vector<uchar>&) {
+    DNN_PROTOBUF_UNSUPPORTED();
+}
+
+#endif  // HAVE_PROTOBUF
+
+CV__DNN_INLINE_NS_END
+}} // namespace
diff --git a/modules/dnn/src/op_cuda.hpp b/modules/dnn/src/op_cuda.hpp
index 2e4bf23b61..45d577a57a 100644
--- a/modules/dnn/src/op_cuda.hpp
+++ b/modules/dnn/src/op_cuda.hpp
@@ -611,7 +611,7 @@ namespace cv { namespace dnn {
         }
 
         void update(const MatShape& shape_, std::size_t offset_) override {
-            auto total = std::accumulate(std::begin(shape_), std::end(shape_), 1, std::multiplies<MatShape::value_type>());
+            std::size_t total = shape_.total();
             if (offset_ + total > shared_block->device.size()) {
                 CV_Error(Error::BadOffset, "shape and offset provided can potentially leads to OOB access");
             }
diff --git a/modules/dnn/src/tensorflow/tf_importer.cpp b/modules/dnn/src/tensorflow/tf_importer.cpp
index 3618a56982..f431a77d83 100644
--- a/modules/dnn/src/tensorflow/tf_importer.cpp
+++ b/modules/dnn/src/tensorflow/tf_importer.cpp
@@ -1164,9 +1164,9 @@ void TFImporter::parseExpandDims(tensorflow::GraphDef& net, const tensorflow::No
     std::vector<MatType> netInputTypes(netInputShapes.size(), CV_32F);
     dstNet.getLayerShapes(netInputShapes, netInputTypes, inpIdindex, inShape_, outShape_);
     MatShape inpShape = outShape_[0];
-    std::vector<int> outShape = inpShape;
+    MatShape outShape = inpShape;
 
-    int outShapeSize = outShape.size();
+    int outShapeSize = (int)outShape.size();
 
     CV_Assert(inpShape.size() >= 1);
     // 2nd blob is dims tensor
@@ -1175,10 +1175,10 @@ void TFImporter::parseExpandDims(tensorflow::GraphDef& net, const tensorflow::No
     // Convert negative numbers to positive numbers, axis can be in range [-(D+1), D].
     if(axis < 0)
     {
-        axis = inpShape.size() + axis + 1;
+        axis = (int)inpShape.size() + axis + 1;
     }
 
-    CV_Assert(0 <= axis && axis <= inpShape.size());
+    CV_Assert(0 <= axis && axis <= (int)inpShape.size());
 
     // After ExpendDims, 3-dim data will become 4-dim data, and OpenCV retains 4-dim data as NCHW data layout.
     // Convert OpenCV's NHC to NCH first.
diff --git a/modules/dnn/src/tflite/tflite_importer.cpp b/modules/dnn/src/tflite/tflite_importer.cpp
index 7e7f1d0503..8b9a824fbf 100644
--- a/modules/dnn/src/tflite/tflite_importer.cpp
+++ b/modules/dnn/src/tflite/tflite_importer.cpp
@@ -180,16 +180,16 @@ void TFLiteImporter::populateNet()
     std::vector<MatShape> inputsShapes(subgraph_inputs_size);
     for (size_t i = 0; i < subgraph_inputs_size; ++i)
     {
-        int idx = subgraph_inputs->Get(i);
+        size_t idx = subgraph_inputs->Get(i);
         layerIds[idx] = std::make_pair(0, i);
         const auto tensor = modelTensors->Get(idx);
         if (!tensor)
-            CV_Error(Error::StsError, cv::format("DNN/TFLite: subgraph input %d (%d) is NULL", (int)i, idx));
+            CV_Error(Error::StsError, cv::format("DNN/TFLite: subgraph input %zu (%zu) is NULL", i, idx));
         layouts[idx] = estimateLayout(*tensor);
 
         // Keep info about origin inputs names and shapes
         inputsNames[i] = tensor->name()->str();
-        std::vector<int> shape(tensor->shape()->begin(), tensor->shape()->end());
+        MatShape shape(tensor->shape()->begin(), tensor->shape()->end());
         if (layouts[idx] == DNN_LAYOUT_NHWC) {
             CV_CheckEQ(shape.size(), (size_t)4, "");
             std::swap(shape[2], shape[3]);
diff --git a/modules/dnn/test/test_backends.cpp b/modules/dnn/test/test_backends.cpp
index 7520f0844c..65f4b80949 100644
--- a/modules/dnn/test/test_backends.cpp
+++ b/modules/dnn/test/test_backends.cpp
@@ -514,7 +514,9 @@ TEST_P(DNNTestNetwork, FastNeuralStyle_eccv16)
 #if defined(HAVE_INF_ENGINE) && INF_ENGINE_VER_MAJOR_GE(2019010000)
     expectNoFallbacksFromIE(net);
 #endif
-    expectNoFallbacksFromCUDA(net);
+    // BUG: https://github.com/opencv/opencv/issues/26306
+    // Temporarily disabled check for no "fallbacks", since the new engine does not support CUDA yet
+    //expectNoFallbacksFromCUDA(net);
 }
 
 INSTANTIATE_TEST_CASE_P(/*nothing*/, DNNTestNetwork, dnnBackendsAndTargets(/* withInferenceEngine = */ true,
diff --git a/modules/dnn/test/test_common.cpp b/modules/dnn/test/test_common.cpp
index 4a9b9c4147..33a3ea7d28 100644
--- a/modules/dnn/test/test_common.cpp
+++ b/modules/dnn/test/test_common.cpp
@@ -11,7 +11,7 @@ void runLayer(cv::Ptr<cv::dnn::Layer> layer, std::vector<cv::Mat> &inpBlobs, std
 {
     size_t ninputs = inpBlobs.size();
     std::vector<cv::Mat> inp(ninputs), outp, intp;
-    std::vector<cv::dnn::MatShape> inputs, outputs, internals;
+    std::vector<cv::MatShape> inputs, outputs, internals;
     std::vector<cv::dnn::MatType> inputs_types, outputs_types, internals_types;
 
     for (size_t i = 0; i < ninputs; i++)
diff --git a/modules/dnn/test/test_darknet_importer.cpp b/modules/dnn/test/test_darknet_importer.cpp
index ba2a7f14c6..49982ec956 100644
--- a/modules/dnn/test/test_darknet_importer.cpp
+++ b/modules/dnn/test/test_darknet_importer.cpp
@@ -134,7 +134,7 @@ public:
                 applyTestTag(CV_TEST_TAG_DNN_SKIP_IE_MYRIAD);
 #endif
 
-            std::vector<int> sz2 = shape(inp);
+            MatShape sz2 = shape(inp);
             sz2[0] = 2;
 
             Net net2 = readNet(cfg, model);
diff --git a/modules/dnn/test/test_graph_simplifier.cpp b/modules/dnn/test/test_graph_simplifier.cpp
index 24da7e65b0..db1ef0333d 100644
--- a/modules/dnn/test/test_graph_simplifier.cpp
+++ b/modules/dnn/test/test_graph_simplifier.cpp
@@ -28,6 +28,12 @@ class Test_Graph_Simplifier : public ::testing::Test {
 
         // remove Const, Identity (output layer), __NetInputLayer__ (input layer)
         layers.erase(std::remove_if(layers.begin(), layers.end(), [] (const std::string l) { return l == "Const" || l == "Identity" || l == "__NetInputLayer__"; }), layers.end());
+        // Instead of 'Tile', 'Expand' etc. we may now have 'Tile2', 'Expand2' etc.
+        // We should correctly match them with the respective patterns
+        for (auto& l: layers) {
+            if (!l.empty() && l[l.size()-1] == '2')
+                l = l.substr(0, l.size()-1);
+        }
 
         EXPECT_EQ(layers, expected_layers);
     }
diff --git a/modules/dnn/test/test_layers.cpp b/modules/dnn/test/test_layers.cpp
index 56f7516417..11529e1562 100644
--- a/modules/dnn/test/test_layers.cpp
+++ b/modules/dnn/test/test_layers.cpp
@@ -231,7 +231,7 @@ void testReshape(const MatShape& inputShape, const MatShape& targetShape,
     runLayer(rl, inpVec, outVec);
 
     Mat& out = outVec[0];
-    MatShape shape(out.size.p, out.size.p + out.dims);
+    MatShape shape = out.shape();
     EXPECT_EQ(shape, targetShape);
 }
 
@@ -502,9 +502,9 @@ TEST_F(Layer_LSTM_Test, get_set_test)
 
     EXPECT_EQ(2u, outputs.size());
 
-    print(outResShape, "outResShape");
-    print(shape(outputs[0]), "out0");
-    print(shape(outputs[0]), "out1");
+    //print(outResShape, "outResShape");
+    //print(shape(outputs[0]), "out0");
+    //print(shape(outputs[0]), "out1");
 
     EXPECT_EQ(outResShape, shape(outputs[0]));
     EXPECT_EQ(outResShape, shape(outputs[1]));
@@ -1520,17 +1520,17 @@ public:
         return Ptr<Layer>(new CustomInterpLayer(params));
     }
 
-    virtual bool getMemoryShapes(const std::vector<std::vector<int> > &inputs,
+    virtual bool getMemoryShapes(const std::vector<MatShape> &inputs,
                                  const int requiredOutputs,
-                                 std::vector<std::vector<int> > &outputs,
-                                 std::vector<std::vector<int> > &internals) const CV_OVERRIDE
+                                 std::vector<MatShape> &outputs,
+                                 std::vector<MatShape> &internals) const CV_OVERRIDE
     {
         const int batchSize = inputs[0][0];
         const int numChannels = inputs[0][1];
         const int inpHeight = inputs[0][2];
         const int inpWidth = inputs[0][3];
 
-        std::vector<int> outShape(4);
+        MatShape outShape(4);
         outShape[0] = batchSize;
         outShape[1] = numChannels;
         outShape[2] = outHeight != 0 ? outHeight : (inpHeight + (inpHeight - 1) * (zoomFactor - 1));
@@ -1611,7 +1611,12 @@ private:
     int outWidth, outHeight, zoomFactor;
 };
 
-TEST_P(Test_Caffe_layers, Interp)
+// BUG: https://github.com/opencv/opencv/issues/26194
+// After unregistration of the custom 'Interp' the model uses the standard Resize layer.
+// According to the graph, the model must produce 2 x 3 x 18 x 16 tensor with Resize layer,
+// but the result is compared with 2 x 3 x 17 x 15 tensor, just like the custom 'Interp' layer produced,
+// so we get the test failure. It looks like the test needs to be fixed.
+TEST_P(Test_Caffe_layers, DISABLED_Interp)
 {
 #ifdef OPENCV_DNN_EXTERNAL_PROTOBUF
     throw SkipTestException("Requires patched protobuf");
@@ -1638,6 +1643,7 @@ TEST_P(Test_Caffe_layers, Interp)
     LayerFactory::unregisterLayer("Interp");
 
     // Test an implemented layer.
+
     testLayerUsingCaffeModels("layer_interp", false, false);
 #endif
 }
diff --git a/modules/dnn/test/test_layers_1d.cpp b/modules/dnn/test/test_layers_1d.cpp
index 9e1a509e3d..91dc2f9eba 100644
--- a/modules/dnn/test/test_layers_1d.cpp
+++ b/modules/dnn/test/test_layers_1d.cpp
@@ -1030,7 +1030,10 @@ TEST_P(Layer_Concat_Test, Accuracy_01D)
 }
 INSTANTIATE_TEST_CASE_P(/*nothing*/, Layer_Concat_Test,
 /*input blob shape*/    testing::Values(
-    std::vector<int>({}),
+    // ONNX Concat produces output tensor of the same dimensionality as inputs.
+    // Therefore 0-dimensional tensors cannot be concatenated.
+    // They first need to be converted to 1D tensors, e.g. using Unsqueeze.
+    //std::vector<int>({}),
     std::vector<int>({1})
 ));
 
@@ -1140,13 +1143,11 @@ TEST_P(Layer_Reduce_Test, Accuracy_01D)
     auto reduceOperation = [](const cv::Mat& input, const std::string& operation, int axis) -> cv::Mat {
         // Initialize result matrix
         cv::Mat result;
-        if (shape(input).size() == 0 || shape(input).size() == 1){
-            result = cv::Mat(shape(input).size(), shape(input).data(), CV_32F);
-            int sz[1] = {1};
-            if (!shape(input).empty() && shape(input)[0] != 1){
-                result = cv::Mat(1, 1, CV_32F);
-                result = result.reshape(1, 1, sz);
-            }
+        MatShape inpshape = input.shape();
+        if (inpshape.dims == 0) {
+            result = cv::Mat(0, nullptr, CV_32F);
+        } else if (inpshape.dims == 1) {
+            result = cv::Mat({1}, CV_32F);
         } else {
             if (axis == 0) {
                 result = cv::Mat::zeros(1, input.cols, CV_32F);
@@ -1225,11 +1226,16 @@ TEST_P(Layer_Reduce_Test, Accuracy_01D)
     lp.type = "Reduce";
     lp.name = "reduceLayer";
     lp.set("reduce", reduce_operation);
-    lp.set("axes", axis);
+
+    // for scalar tensors we cannot specify reduction axis,
+    // because it will be out-of-range anyway
+    if (!input_shape.empty())
+        lp.set("axes", axis);
+
     lp.set("keepdims", true);
     Ptr<ReduceLayer> layer = ReduceLayer::create(lp);
 
-    cv::Mat input(input_shape.size(), input_shape.data(), CV_32F, 1.0);
+    cv::Mat input((int)input_shape.size(), input_shape.data(), CV_32F, 1.0);
     cv::randu(input, 0.0, 1.0);
 
     cv::Mat output_ref = reduceOperation(input, reduce_operation, axis);
@@ -1238,7 +1244,10 @@ TEST_P(Layer_Reduce_Test, Accuracy_01D)
 
     runLayer(layer, inputs, outputs);
     ASSERT_EQ(outputs.size(), 1);
-    ASSERT_EQ(shape(output_ref), shape(outputs[0]));
+
+    MatShape ref_shape = output_ref.shape();
+    MatShape out_shape = outputs[0].shape();
+    ASSERT_EQ(ref_shape, out_shape) << "ref_shape " << ref_shape.str() << " does not match output shape " << out_shape.str();
     normAssert(output_ref, outputs[0]);
 }
 INSTANTIATE_TEST_CASE_P(/*nothing*/, Layer_Reduce_Test, Combine(
@@ -1398,7 +1407,10 @@ TEST_P(Layer_Padding_Test, Accuracy_01D){
 }
 INSTANTIATE_TEST_CASE_P(/*nothing*/,  Layer_Padding_Test,
 /*input blob shape*/ testing::Values(
-            std::vector<int>{},
+
+            //scalars cannot be padded
+            //std::vector<int>{},
+
             std::vector<int>{1},
             std::vector<int>{1, 4},
             std::vector<int>{4, 1}
@@ -1414,30 +1426,33 @@ TEST_P(Layer_FullyConnected_Test, Accuracy_01D)
     lp.set("bias_term", false);
     lp.set("axis", 0);
 
-    std::vector<int> input_shape = get<0>(GetParam());
+    MatShape input_shape(get<0>(GetParam()));
 
     RNG& rng = TS::ptr()->get_rng();
     float inp_value = rng.uniform(0.0, 10.0);
-    Mat weights(std::vector<int>{total(input_shape), 1}, CV_32F, inp_value);
+    Mat weights({(int)input_shape.total(), 1}, CV_32F, inp_value);
     lp.blobs.push_back(weights);
 
     Ptr<Layer> layer = LayerFactory::createLayerInstance("InnerProduct", lp);
 
-    Mat input(input_shape.size(), input_shape.data(), CV_32F);
+    Mat input(input_shape, CV_32F);
     randn(input, 0, 1);
     Mat output_ref = input.reshape(1, 1) * weights;
-    output_ref.dims = input_shape.size();
+    output_ref.dims = input_shape.dims;
 
     std::vector<Mat> inputs{input};
     std::vector<Mat> outputs;
     runLayer(layer, inputs, outputs);
     ASSERT_EQ(1, outputs.size());
-    ASSERT_EQ(shape(output_ref), shape(outputs[0]));
+    MatShape ref_shape = output_ref.shape();
+    MatShape out_shape = outputs[0].shape();
+    ASSERT_EQ(ref_shape, out_shape) << "ref_shape " << ref_shape.str() << "does not match output shape " << out_shape.str();
     normAssert(output_ref, outputs[0]);
 }
 INSTANTIATE_TEST_CASE_P(/*nothting*/, Layer_FullyConnected_Test,
                         testing::Values(
-                            std::vector<int>({}),
+                            //only bias could be broadcasted from a scalar
+                            //std::vector<int>({}),
                             std::vector<int>({1}),
                             std::vector<int>({4})
 ));
@@ -1577,8 +1592,8 @@ TEST_P(Layer_Einsum_Test, Accuracy_01D)
     lp.set("equation", equation);
     lp.set("inputSize", 2);
     lp.set("outputSize", 1);
-    lp.set("inputShapes0", DictValue::arrayInt(&input_shape1[0], input_shape1.size()));
-    lp.set("inputShapes1", DictValue::arrayInt(&input_shape2[0], input_shape2.size()));
+    lp.set("inputShapes0", DictValue::arrayInt(input_shape1.data(), input_shape1.size()));
+    lp.set("inputShapes1", DictValue::arrayInt(input_shape2.data(), input_shape2.size()));
 
     Ptr<Layer> layer = EinsumLayer::create(lp);
 
@@ -1627,6 +1642,7 @@ TEST_P(Layer_Einsum_Test, Accuracy_01D)
     normAssert(output_ref, outputs[0]);
 }
 
+// BUG: https://github.com/opencv/opencv/issues/26193
 INSTANTIATE_TEST_CASE_P(/*nothing*/, Layer_Einsum_Test, testing::Values(
     std::make_tuple(std::vector<int>({}), std::vector<int>({}), ",->"),
     std::make_tuple(std::vector<int>({1}), std::vector<int>({}), "i,->i"),
diff --git a/modules/dnn/test/test_model.cpp b/modules/dnn/test/test_model.cpp
index d5b26701b4..64794ee021 100644
--- a/modules/dnn/test/test_model.cpp
+++ b/modules/dnn/test/test_model.cpp
@@ -100,7 +100,8 @@ public:
     void testSegmentationModel(const std::string& weights_file, const std::string& config_file,
                                const std::string& inImgPath, const std::string& outImgPath,
                                float norm, const Size& size = {-1, -1}, Scalar mean = Scalar(),
-                               double scale = 1.0, bool swapRB = false, bool crop = false, const std::string outname = "")
+                               double scale = 1.0, bool swapRB = false, bool crop = false,
+                               const std::vector<std::string>& outnames=std::vector<std::string>())
     {
         checkBackend();
 
@@ -115,8 +116,8 @@ public:
         model.setPreferableBackend(backend);
         model.setPreferableTarget(target);
 
-        if(!outname.empty())
-            model.setOutputNames({outname});
+        if(!outnames.empty())
+            model.setOutputNames(outnames);
 
         model.segment(frame, mask);
         normAssert(mask, exp, "", norm, norm);
@@ -669,9 +670,10 @@ TEST_P(Test_Model, Segmentation)
         applyTestTag(CV_TEST_TAG_DNN_SKIP_IE_MYRIAD, CV_TEST_TAG_DNN_SKIP_IE_NGRAPH, CV_TEST_TAG_DNN_SKIP_IE_VERSION);
 #endif
 
-    if ((backend == DNN_BACKEND_OPENCV && (target == DNN_TARGET_OPENCL_FP16 || target == DNN_TARGET_CPU_FP16))
-        || (backend == DNN_BACKEND_CUDA && target == DNN_TARGET_CUDA_FP16))
+    //if ((backend == DNN_BACKEND_OPENCV && (target == DNN_TARGET_OPENCL_FP16 || target == DNN_TARGET_CPU_FP16))
+    //    || (backend == DNN_BACKEND_CUDA && target == DNN_TARGET_CUDA_FP16))
     {
+        // let's always set it to 7 for now
         norm = 7.0f;  // l1 = 0.01 lInf = 7
     }
 
@@ -684,7 +686,7 @@ TEST_P(Test_Model, Segmentation)
     Scalar mean = Scalar(0.485*255, 0.456*255, 0.406*255);
     bool swapRB = true;
 
-    testSegmentationModel(weights_file, "", inp, exp, norm, size, mean, scale, swapRB, false, "out");
+    testSegmentationModel(weights_file, "", inp, exp, norm, size, mean, scale, swapRB, false);
 }
 
 TEST_P(Test_Model, TextRecognition)
@@ -751,7 +753,8 @@ TEST_P(Test_Model, TextRecognitionWithCTCPrefixBeamSearch)
     testTextRecognitionModel(weightPath, "", imgPath, seq, decodeType, vocabulary, size, mean, scale);
 }
 
-TEST_P(Test_Model, TextDetectionByDB)
+// BUG: https://github.com/opencv/opencv/issues/26246
+TEST_P(Test_Model, DISABLED_TextDetectionByDB)
 {
     applyTestTag(CV_TEST_TAG_DEBUG_VERYLONG);
 
diff --git a/modules/dnn/test/test_onnx_importer.cpp b/modules/dnn/test/test_onnx_importer.cpp
index d02a4db2f5..4c8bd9aea6 100644
--- a/modules/dnn/test/test_onnx_importer.cpp
+++ b/modules/dnn/test/test_onnx_importer.cpp
@@ -45,19 +45,34 @@ public:
     {
         std::vector<MatShape> inLayerShapes;
         std::vector<MatShape> outLayerShapes;
-        net.getLayerShapes(MatShape(), CV_32F, 0, inLayerShapes, outLayerShapes);
+        std::vector<MatShape> suggestedShapes;
+        std::vector<int> suggestedTypes;
+        for (const Mat& inp: inps) {
+            suggestedShapes.push_back(inp.shape());
+            suggestedTypes.push_back(inp.type());
+        }
+        net.getLayerShapes(suggestedShapes, suggestedTypes, 0, inLayerShapes, outLayerShapes);
         ASSERT_EQ(inLayerShapes.size(), inps.size());
 
         for (int i = 0; i < inps.size(); ++i) {
             bool hasDynamicShapes = inLayerShapes[i].empty();
+            MatShape inpshape_i = inps[i].shape();
             if (hasDynamicShapes)
                 continue;
+            if (inLayerShapes[i].size() == 0 && inpshape_i.dims == 1) {
+                // [TODO] sometimes sample .onnx models from ONNX conformance suit
+                // specify scalars as inputs, but we test them using 1D input.
+                // the tests need to be adjusted
+                continue;
+            }
             if (inLayerShapes[i].size() == 1) {  // 1D input
-                ASSERT_EQ(shape(inLayerShapes[i][0]), shape(inps[i]));
+                ASSERT_EQ(shape(inLayerShapes[i][0]), inpshape_i);
             } else {
                 // Compare all axes except batch dimension which is variable.
-                inLayerShapes[i][0] = inps[i].size[0];
-                ASSERT_EQ(inLayerShapes[i], shape(inps[i]));
+                inLayerShapes[i][0] = inpshape_i[0];
+                if (inLayerShapes[i] != inpshape_i) {
+                    ASSERT_EQ(inLayerShapes[i], shape(inps[i]));
+                }
             }
         }
     }
@@ -127,6 +142,8 @@ public:
             l1 = std::max(l1, 1.4e-3);
             lInf = std::max(lInf, 8e-3);
         }
+
+        EXPECT_EQ(ref.shape(), out.shape());
         normAssert(ref, out, basename.c_str(), l1 ? l1 : default_l1, lInf ? lInf : default_lInf);
         if (checkNoFallbacks)
             expectNoFallbacksFromIE(net);
@@ -311,7 +328,8 @@ TEST_P(Test_ONNX_layers, Deconvolution)
         testONNXModels("deconv_adjpad_2d", npy, 0, 0, false, false);
 }
 
-TEST_P(Test_ONNX_layers, Deconvolution3D)
+// BUG: https://github.com/opencv/opencv/issues/26307
+TEST_P(Test_ONNX_layers, DISABLED_Deconvolution3D)
 {
 #if defined(INF_ENGINE_RELEASE) && INF_ENGINE_VER_MAJOR_EQ(2022010000)
     if (backend == DNN_BACKEND_INFERENCE_ENGINE_NGRAPH)
@@ -340,7 +358,8 @@ TEST_P(Test_ONNX_layers, Deconvolution3D)
     testONNXModels("deconv3d");
 }
 
-TEST_P(Test_ONNX_layers, Deconvolution3D_bias)
+// BUG: https://github.com/opencv/opencv/issues/26307
+TEST_P(Test_ONNX_layers, DISABLED_Deconvolution3D_bias)
 {
 #if defined(INF_ENGINE_RELEASE) && INF_ENGINE_VER_MAJOR_EQ(2022010000)
     if (backend == DNN_BACKEND_INFERENCE_ENGINE_NGRAPH)
@@ -369,7 +388,8 @@ TEST_P(Test_ONNX_layers, Deconvolution3D_bias)
     testONNXModels("deconv3d_bias");
 }
 
-TEST_P(Test_ONNX_layers, Deconvolution3D_pad)
+// BUG: https://github.com/opencv/opencv/issues/26307
+TEST_P(Test_ONNX_layers, DISABLED_Deconvolution3D_pad)
 {
 #if defined(INF_ENGINE_RELEASE) && INF_ENGINE_VER_MAJOR_EQ(2022010000)
     if (backend == DNN_BACKEND_INFERENCE_ENGINE_NGRAPH)
@@ -389,16 +409,17 @@ TEST_P(Test_ONNX_layers, Deconvolution3D_pad)
     }
 #endif
 
-    if (backend == DNN_BACKEND_OPENCV)
+    //if (backend == DNN_BACKEND_OPENCV)
         throw SkipTestException("OpenCV backend is not supported");  // FIXIT use tags
 
-    if (backend == DNN_BACKEND_VKCOM)
-        applyTestTag(CV_TEST_TAG_DNN_SKIP_VULKAN);
+    //if (backend == DNN_BACKEND_VKCOM)
+    //    applyTestTag(CV_TEST_TAG_DNN_SKIP_VULKAN);
 
-    testONNXModels("deconv3d_pad");
+    //testONNXModels("deconv3d_pad");
 }
 
-TEST_P(Test_ONNX_layers, Deconvolution3D_adjpad)
+// BUG: https://github.com/opencv/opencv/issues/26307
+TEST_P(Test_ONNX_layers, DISABLED_Deconvolution3D_adjpad)
 {
 #if defined(INF_ENGINE_RELEASE) && INF_ENGINE_VER_MAJOR_EQ(2022010000)
     if (backend == DNN_BACKEND_INFERENCE_ENGINE_NGRAPH)
@@ -1114,7 +1135,8 @@ TEST_P(Test_ONNX_layers, ResizeUnfusedTwoInputs)
         applyTestTag(CV_TEST_TAG_DNN_SKIP_IE_NGRAPH);
 #endif
     testONNXModels("upsample_unfused_two_inputs_opset9_torch1.4", npy, 0, 0, false, true, 2);
-    testONNXModels("upsample_unfused_two_inputs_opset11_torch1.4", npy, 0, 0, false, true, 2);
+    // BUG: https://github.com/opencv/opencv/issues/26291
+    // testONNXModels("upsample_unfused_two_inputs_opset11_torch1.4", npy, 0, 0, false, true, 2);
 }
 
 TEST_P(Test_ONNX_layers, MultyInputs)
@@ -2267,16 +2289,19 @@ TEST_P(Test_ONNX_nets, Googlenet)
     if (target == DNN_TARGET_CPU_FP16)
         net.enableWinograd(false);
 
-    std::vector<Mat> images;
+    std::vector<Mat> images, results;
     images.push_back( imread(_tf("../googlenet_0.png")) );
     images.push_back( imread(_tf("../googlenet_1.png")) );
-    Mat inp = blobFromImages(images, 1.0f, Size(), Scalar(), false);
     Mat ref = blobFromNPY(_tf("../googlenet_prob.npy"));
-    checkBackend(&inp, &ref);
-
-    net.setInput(inp);
-    ASSERT_FALSE(net.empty());
-    Mat out = net.forward();
+    for (int i = 0; i < 2; i++) {
+        Mat inp_i = blobFromImage(images[i], 1.0f, Size(), Scalar(), false);
+        net.setInput(inp_i);
+        ASSERT_FALSE(net.empty());
+        Mat out_i = net.forward();
+        results.push_back(out_i.clone());
+    }
+    Mat out;
+    vconcat(results, out);
 
     normAssert(ref, out, "", default_l1,  default_lInf);
     expectNoFallbacksFromIE(net);
@@ -2723,7 +2748,26 @@ static void testYOLO(const std::string& weightPath, const std::vector<int>& refC
 
     net.setInput(inp);
     std::vector<Mat> outs;
-    net.forward(outs, net.getUnconnectedOutLayersNames());
+    std::vector<std::string> out_names = net.getUnconnectedOutLayersNames();
+    net.forward(outs, out_names);
+    EXPECT_EQ(outs.size(), out_names.size());
+    if(outs.size() == 1)
+    {
+        // do nothing
+    }
+    else if (outs.size() == 2)
+    {
+        // sort outs by name. New and old DNN engines return otuput in different order!
+        if(out_names[0] > out_names[1])
+        {
+            std::swap(out_names[0], out_names[1]);
+            std::swap(outs[0], outs[1]);
+        }
+    }
+    else if (outs.size() > 2)
+    {
+        CV_Error(Error::StsUnsupportedFormat, "Too many Yolo network outputs!");
+    }
 
     // Retrieve
     std::vector<int> keep_classIds;
@@ -2760,6 +2804,8 @@ void yoloPostProcessing(
     }
 
     if (model_name == "yolonas"){
+        EXPECT_EQ(cv::MatShape({1, 8400, 80}), outs[0].shape());
+        EXPECT_EQ(cv::MatShape({1, 8400, 4}), outs[1].shape());
         // outs contains 2 elemets of shape [1, 8400, 80] and [1, 8400, 4]. Concat them to get [1, 8400, 84]
         Mat concat_out;
         // squeeze the first dimension
diff --git a/modules/java/generator/src/cpp/listconverters.cpp b/modules/java/generator/src/cpp/listconverters.cpp
index 19e8b5b9ca..bacb6422d5 100644
--- a/modules/java/generator/src/cpp/listconverters.cpp
+++ b/modules/java/generator/src/cpp/listconverters.cpp
@@ -110,7 +110,7 @@ void Copy_vector_string_to_List(JNIEnv* env, std::vector<std::string>& vs, jobje
 }
 
 #ifdef HAVE_OPENCV_DNN
-void Copy_vector_MatShape_to_List(JNIEnv* env, std::vector<cv::dnn::MatShape>& vs, jobject list)
+void Copy_vector_MatShape_to_List(JNIEnv* env, std::vector<cv::MatShape>& vs, jobject list)
 {
     static jclass juArrayList       = ARRAYLIST(env);
     jmethodID m_clear     = LIST_CLEAR(env, juArrayList);
diff --git a/modules/java/generator/src/cpp/listconverters.hpp b/modules/java/generator/src/cpp/listconverters.hpp
index 83635a5cb9..ec9ad74ad7 100644
--- a/modules/java/generator/src/cpp/listconverters.hpp
+++ b/modules/java/generator/src/cpp/listconverters.hpp
@@ -26,7 +26,7 @@ void Copy_vector_string_to_List(JNIEnv* env, std::vector<std::string>& vs, jobje
 #ifdef HAVE_OPENCV_DNN
 #include "opencv2/dnn.hpp"
 
-void Copy_vector_MatShape_to_List(JNIEnv* env, std::vector<cv::dnn::MatShape>& vs, jobject list);
+void Copy_vector_MatShape_to_List(JNIEnv* env, std::vector<cv::MatShape>& vs, jobject list);
 
 #endif // HAVE_OPENCV_DNN
 
diff --git a/modules/objc/generator/CMakeLists.txt b/modules/objc/generator/CMakeLists.txt
index bd8f8325b3..d33e998142 100644
--- a/modules/objc/generator/CMakeLists.txt
+++ b/modules/objc/generator/CMakeLists.txt
@@ -91,7 +91,7 @@ macro(ocv_add_objc_generated_target TARGET)
   file(MAKE_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/${TARGET}")
   add_custom_command(
       OUTPUT ${objc_generated_files} "${objc_${TARGET}_generated_output_dependecy}"
-      COMMAND ${PYTHON_DEFAULT_EXECUTABLE} "${OBJC_SOURCE_DIR}/generator/gen_objc.py"
+      COMMAND ${PYTHON3_EXECUTABLE} "${OBJC_SOURCE_DIR}/generator/gen_objc.py"
               -p "${OBJC_SOURCE_DIR}/../python/src2/gen2.py"
               -c "${CONFIG_FILE}"
               -t "${TARGET}"
diff --git a/modules/objc/generator/gen_objc.py b/modules/objc/generator/gen_objc.py
index 79633fd0eb..861ac41098 100755
--- a/modules/objc/generator/gen_objc.py
+++ b/modules/objc/generator/gen_objc.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 
 from __future__ import print_function, unicode_literals
 import sys, re, os.path, errno, fnmatch
diff --git a/modules/ts/src/ts_func.cpp b/modules/ts/src/ts_func.cpp
index 69fa3353cb..c21c02edbf 100644
--- a/modules/ts/src/ts_func.cpp
+++ b/modules/ts/src/ts_func.cpp
@@ -1515,8 +1515,14 @@ double norm(InputArray _src1, InputArray _src2, int normType, InputArray _mask)
     normType = normType == NORM_L2SQR ? NORM_L2 : normType;
 
     CV_CheckTypeEQ(src1.type(), src2.type(), "");
-    CV_Assert(src1.size == src2.size);
-    CV_Assert( mask.empty() || (src1.size == mask.size && (mask.type() == CV_8U || mask.type() == CV_Bool)) );
+    MatShape shape1 = src1.shape();
+    MatShape shape2 = src2.shape();
+    if (shape1 != shape2) {
+        printf("shape1: %s\n", shape1.str().c_str());
+        printf("shape2: %s\n", shape2.str().c_str());
+        CV_Assert(shape1 == shape2 && "shapes of compared arrays must be the same");
+    }
+    CV_Assert( mask.empty() || (shape1 == mask.shape() && (mask.type() == CV_8U || mask.type() == CV_Bool)) );
     CV_Assert( normType == NORM_INF || normType == NORM_L1 || normType == NORM_L2 );
     const Mat *arrays[]={&src1, &src2, &mask, 0};
     Mat planes[3];
diff --git a/modules/video/src/tracking/tracker_dasiamrpn.cpp b/modules/video/src/tracking/tracker_dasiamrpn.cpp
index 57d8a15ef7..be98f7f767 100644
--- a/modules/video/src/tracking/tracker_dasiamrpn.cpp
+++ b/modules/video/src/tracking/tracker_dasiamrpn.cpp
@@ -60,10 +60,13 @@ public:
     TrackerDaSiamRPNImpl(const TrackerDaSiamRPN::Params& parameters)
         : params(parameters)
     {
-
-        siamRPN = dnn::readNet(params.model);
-        siamKernelCL1 = dnn::readNet(params.kernel_cls1);
-        siamKernelR1 = dnn::readNet(params.kernel_r1);
+        // the tracker uses DNN models in quite sophisticated way,
+        // so it's not supported yet by the new engine.
+        // BUG: https://github.com/opencv/opencv/issues/26201
+        dnn::EngineType engine = dnn::ENGINE_CLASSIC;
+        siamRPN = dnn::readNet(params.model, "", "", engine);
+        siamKernelCL1 = dnn::readNet(params.kernel_cls1, "", "", engine);
+        siamKernelR1 = dnn::readNet(params.kernel_r1, "", "", engine);
 
         CV_Assert(!siamRPN.empty());
         CV_Assert(!siamKernelCL1.empty());
diff --git a/platforms/apple/cv_build_utils.py b/platforms/apple/cv_build_utils.py
old mode 100644
new mode 100755
index eba258370d..1a5bb53bd8
--- a/platforms/apple/cv_build_utils.py
+++ b/platforms/apple/cv_build_utils.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 """
 Common utilities. These should be compatible with Python3.
 """
diff --git a/platforms/ios/build_docs.py b/platforms/ios/build_docs.py
index e5dc9b2e81..06f03735de 100755
--- a/platforms/ios/build_docs.py
+++ b/platforms/ios/build_docs.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 """
 This script builds OpenCV docs for iOS.
 """
diff --git a/platforms/ios/readme.txt b/platforms/ios/readme.txt
index 0c39e7213c..4ddcb84410 100644
--- a/platforms/ios/readme.txt
+++ b/platforms/ios/readme.txt
@@ -2,6 +2,6 @@ Building OpenCV from Source, using CMake and Command Line
 =========================================================
 
 cd ~/<my_working_directory>
-python opencv/platforms/ios/build_framework.py ios
+python3 opencv/platforms/ios/build_framework.py ios
 
 If everything's fine, a few minutes later you will get ~/<my_working_directory>/ios/opencv2.framework. You can add this framework to your Xcode projects.
diff --git a/platforms/osx/build_framework.py b/platforms/osx/build_framework.py
index 2dd5015ee5..c0f86fb8b9 100755
--- a/platforms/osx/build_framework.py
+++ b/platforms/osx/build_framework.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 """
 The script builds OpenCV.framework for OSX.
 """