
Conversation

gs-olive
Collaborator

@gs-olive gs-olive commented Oct 28, 2022

Description

Resolves a CUDA 710 error arising when compiling BERT models with 3+ inputs. The issue stems from the role of the third input tensor in inference computations. Specifically, as specified in the BERT model code linked here, the third argument, token_type_ids, is of type torch.LongTensor but can only take indices in $\{0, 1\}$. When values outside of this set are used, the input is invalid.

This becomes problematic when the inputs are, for example, indices into a dictionary or embedding, which seems to be the case here. Specifically, aten::embedding is called with Tensors derived from token_type_ids. The issue traces to a single line in the shape_analysis code, previewed below, which initializes a random tensor with integer values in the range $[0, 4]$.

// shape_analysis.cpp [Line 23, Commit 5f3a5a3]
auto in = at::randint(5, shape, {at::kCUDA}).to(type);

This tensor is run through the module's forward function to determine output shapes, and it causes the compile-time error, as featured here in the shape analysis code.
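To make the failure mode concrete, below is a minimal standalone sketch (my own illustration, not code from this repository; the shapes are arbitrary) of how out-of-range indices break aten::embedding for a 2-row token-type embedding table:

// Standalone LibTorch illustration (not part of this PR)
#include <torch/torch.h>

int main() {
  // BERT's token-type embedding table has only 2 rows, so token_type_ids must be 0 or 1
  auto weight = torch::randn({2, 8});

  // Valid indices: sampled from [0, 2)
  auto ok_ids = torch::randint(2, {1, 4}).to(torch::kLong);
  auto ok = torch::embedding(weight, ok_ids);  // works

  // Invalid indices: sampling from [0, 5) can produce values > 1, which are out of
  // range for the 2-row table; on CUDA this surfaces as a device-side assert
  // (CUDA error 710) rather than a clean exception
  auto bad_ids = torch::randint(5, {1, 4}).to(torch::kLong);
  // auto bad = torch::embedding(weight, bad_ids);  // fails whenever an index exceeds 1
  return 0;
}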

I have added a temporary fix by narrowing the range of values the random number generator uses when creating input tensors to 0-1, instead of 0-4, and am working on a more robust fix. The interim idea is sketched below.
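For illustration only, the change amounts to a narrower sampling bound along these lines (a sketch, not the exact diff; the change that actually landed samples floats and casts to the target type, as shown in the review thread below):

// Sketch of the interim fix: narrow the sampling range so generated indices are always 0 or 1
auto in = at::randint(2, shape, {at::kCUDA}).to(type);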

Fixes #1418

Type of change

Please delete options that are not relevant and/or add your own.

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • [x] My code follows the style guidelines of this project (You can use the linters)
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas and hacks
  • [x] I have made corresponding changes to the documentation
  • [x] I have added tests to verify my fix or my feature
  • [x] New and existing unit tests pass locally with my changes
  • [x] I have added the relevant labels to my PR so that relevant reviewers are notified

@narendasan
Collaborator

@bowang007 Make sure to review this

@narendasan
Collaborator

From my perspective, I see nothing wrong with sampling in $[0, 1)$.

@gs-olive gs-olive added the release: v1.3 Tagged to be included in v1.3 label Nov 1, 2022
@gs-olive gs-olive self-assigned this Nov 1, 2022
- Issue arising when compiling BERT models with 3+ inputs
- Added temporary fix by narrowing the range of values the random number
generator uses when creating input tensors to [0, 2), instead of [0, 5)
- Used random float inputs in the range [0, 2) instead of ints, then cast to the desired
type. With regard to bug pytorch#1418, the ultimate effect of this change is that
random floats are selected in the range [0, 2) and then cast to Int, effectively restricting
the allowed ints to {0, 1}, as required by the model
- More robust fix to follow

// Make the value range for input tensor a uniform (float) distribution
// over [LoValIncl, HiValExcl), then cast to the desired dtype
auto in = ((HiValExcl - LoValIncl) * at::rand(shape, {at::kCUDA}) + LoValIncl).to(type);
Collaborator Author

@gs-olive gs-olive Nov 3, 2022

Used float inputs in the range $[LoValIncl, HiValExcl)$, then cast to the desired type, to avoid divide-by-zero errors that could arise from selecting only integer random values (even for float tensors). Currently, $LoValIncl = 0$ and $HiValExcl = 2$, but these bounds will be made optionally user-customizable in a later PR, as discussed in RFC #1425.
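For concreteness, here is a small standalone sketch (my own illustration, not code from this PR) of what the new sampling produces when cast to an integer versus a floating-point target dtype:

// Standalone LibTorch illustration of the [0, 2) float sampling
#include <torch/torch.h>
#include <iostream>

int main() {
  const double LoValIncl = 0.0, HiValExcl = 2.0;
  std::vector<int64_t> shape = {2, 3};

  // Uniform floats in [LoValIncl, HiValExcl) = [0, 2)
  auto base = (HiValExcl - LoValIncl) * torch::rand(shape) + LoValIncl;

  // Cast to Long: values collapse to {0, 1}, i.e. valid token_type_ids indices
  auto as_int = base.to(torch::kLong);

  // Cast to Float: values remain spread over [0, 2), so they are almost surely nonzero,
  // which avoids divide-by-zero issues for float inputs
  auto as_float = base.to(torch::kFloat);

  std::cout << as_int << "\n" << as_float << std::endl;
  return 0;
}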

Collaborator

This seems a bit hard-coded for this model only, but it will be resolved once the input range is opened up to users via RFC #1425.

Collaborator

@bowang007 bowang007 left a comment

LGTM

@gs-olive gs-olive merged commit 1951525 into pytorch:master Nov 9, 2022
@gs-olive gs-olive deleted the cuda_error_bugfix branch November 9, 2022 02:44