Learning the self-supervised task
Posted: Mon Dec 23, 2024 5:05 am
Here the researchers refer to training the larger neural network as the outer loop and training W within each layer as the inner loop. The two differ in what the gradient is taken with respect to: the inner loop differentiates with respect to W (i.e. the parameters of the inner model f), while the outer loop differentiates with respect to θ_rest, the parameters of the rest of the network. A toy sketch of the two gradient computations is given below.

Learning the self-supervised task

Arguably the most important design choice is the self-supervised task, because it determines the kind of features learned from the test sequence. Here the researchers took a more end-to-end approach: the self-supervised task is optimized directly for the final goal of next-token prediction.
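To make the inner/outer distinction concrete, here is a minimal sketch, assuming a toy linear inner model f(x; W) = W x and placeholder names (x_t, theta_rest, etc. are illustrative, not the paper's code): the inner loop takes the gradient of a self-supervised reconstruction loss with respect to W only, while the outer loop takes the gradient of the next-token loss with respect to the remaining parameters.

```python
# Toy sketch of the inner vs. outer gradient (placeholder names, not the paper's code).
import torch

d = 16
x_t = torch.randn(d)                                 # current token's representation
W = torch.zeros(d, d, requires_grad=True)            # hidden state of the TTT layer
theta_rest = torch.randn(d, d, requires_grad=True)   # stand-in for the rest of the network

# Inner loop: one gradient-descent step on W, using a self-supervised
# reconstruction loss ell(W; x_t) as a stand-in.
ell = (x_t @ W - x_t).pow(2).sum()
(grad_W,) = torch.autograd.grad(ell, W, create_graph=True)
eta = 0.1
W_next = W - eta * grad_W                            # the TTT update rule

# Outer loop: ordinary training, gradient taken w.r.t. theta_rest using the
# network's next-token prediction loss (a dummy scalar loss here).
z_t = (x_t @ W_next) @ theta_rest                    # output computed with the updated W
outer_loss = z_t.pow(2).mean()
outer_loss.backward()                                # fills theta_rest.grad
```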
Specifically, starting from the naive reconstruction task in the formula above, some outer-loop parameters are added to make the task itself learnable. The new self-supervised loss is:

ℓ(W; x_t) = || f(θ_K x_t; W) - θ_V x_t ||²

where θ_K x_t is the training view and θ_V x_t is the label view. In the inner loop only W is optimized, so it is written as the argument of ℓ; the θ's are the "hyperparameters" of this loss function. In the outer loop, θ_K, θ_V and θ_Q are optimized together with θ_rest, while W is just a hidden state, not a parameter (θ_Q defines the test view used in the output rule, z_t = f(θ_Q x_t; W_t)). The figure illustrates this difference in code, where θ_K and θ_V are implemented as parameters of the TTT layer, analogous to the Key and Value parameters in self-attention.
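As a concrete illustration of the learnable views, here is a minimal sketch under the same toy assumptions as above (a linear inner model, illustrative names; not the paper's implementation): θ_K produces the training view, θ_V the label view the inner model must reconstruct, and θ_Q the test view used in the output rule.

```python
# Toy sketch of the learnable multi-view reconstruction loss (not the paper's code).
import torch

d = 16
x_t = torch.randn(d)

# Outer-loop parameters: the three views, learned like the Key/Value/Query
# projections of self-attention.
theta_K = torch.randn(d, d, requires_grad=True)
theta_V = torch.randn(d, d, requires_grad=True)
theta_Q = torch.randn(d, d, requires_grad=True)

# Inner-loop state: W parameterizes the inner model f (here simply f(x; W) = W x).
W = torch.zeros(d, d, requires_grad=True)

def f(x, W):
    return W @ x

def ttt_loss(W, x_t):
    # Reconstruct the label view theta_V x_t from the training view theta_K x_t.
    return (f(theta_K @ x_t, W) - theta_V @ x_t).pow(2).sum()

# Inner loop: only W is updated by this loss; the thetas act as its
# "hyperparameters" and receive gradients only through the outer loop.
eta = 0.1
(grad_W,) = torch.autograd.grad(ttt_loss(W, x_t), W, create_graph=True)
W_t = W - eta * grad_W

# Output rule: apply the updated inner model to the test view theta_Q x_t.
z_t = f(theta_Q @ x_t, W_t)
```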
In general, all possible choices of θ_K, θ_V and θ_Q constitute a family of multi-view reconstruction tasks, and the outer loop can be understood as selecting one task from this family. For simplicity, the researchers design all of the views as linear projections.

Parallelization

The naive TTT layer developed so far is already efficient in terms of the number of floating-point operations (FLOPs). However, its update rule

W_t = W_{t-1} - η ∇ℓ(W_{t-1}; x_t)

cannot be parallelized across time steps, because W_{t-1} appears in two places: before the minus sign and inside ∇ℓ. In response, the researchers propose mini-batch gradient descent with a TTT batch size b: all gradients within a mini-batch are taken at the state from the end of the previous mini-batch, W_{t'}, with t' = t - mod(t, b).
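Here is a minimal sketch of why that parallelizes, under the same toy assumptions as before (a linear inner model with a closed-form gradient; not the paper's kernels): because every gradient inside a mini-batch is taken at the same base state, the b gradient computations are independent of one another, and a cumulative sum recovers the intermediate states needed by the output rule.

```python
# Toy sketch of mini-batch TTT (not the paper's kernels): all gradients inside
# a mini-batch are taken at the same base state, so they can be computed in parallel.
import torch

d, b, T = 16, 4, 16
eta = 0.01
xs = torch.randn(T, d)            # a toy token sequence
theta_K = torch.randn(d, d)
theta_V = torch.randn(d, d)
W = torch.zeros(d, d)             # W_0

def grad_ell(W, x):
    # Closed-form gradient of ||W (theta_K x) - theta_V x||^2 w.r.t. W
    # for this linear toy model.
    k, v = theta_K @ x, theta_V @ x
    return 2.0 * torch.outer(W @ k - v, k)

for start in range(0, T, b):
    batch = xs[start:start + b]
    # Every gradient in the mini-batch uses the same W (the state at the end
    # of the previous mini-batch); batched here, parallel on real hardware.
    grads = torch.stack([grad_ell(W, x) for x in batch])
    # Cumulative sums recover W_t for every position inside the mini-batch.
    W_inside = W - eta * torch.cumsum(grads, dim=0)
    # The last state becomes the base for the next mini-batch.
    W = W_inside[-1]
```

The trade-off is between parallelism and fidelity to fully online gradient descent: a larger b gives more parallel work per step, but each gradient is taken at an older W.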