Skip to content

Escape quotes and control chars in DOT #14657

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 18, 2025

Conversation

lukaszsamson
Copy link
Contributor

DOT syntax requires escaping quotes in quoted string Other control characters are not strictly required to be escaped but a DOT graph containing them will most likely not be parsed by other tools See reference DOT syntax https://graphviz.org/doc/info/lang.html

DOT syntax requires escaping quotes in quoted string
Other control characters are not strictly required to be escaped but a DOT graph containing them will most likely not be parsed by other tools
See reference DOT syntax https://graphviz.org/doc/info/lang.html
@josevalim
Copy link
Member

I am reading the docs and it says that:

In quoted strings in DOT, the only escaped character is double-quote ". That is, in quoted strings, the dyad " is converted to "; all other characters are left unchanged. In particular, \ remains \. Layout engines may apply additional escape sequences.

So it seems that we only need to escape quotes?

@lukaszsamson
Copy link
Contributor Author

I'm not certain what is the correct thing to do.

Is this still a valid graph when the string contains a newline?

digraph "graph" {
  "foo
bar"
}

Or suppose the string ends with backslash - then it will escape the quote

digraph "graph" {
  "foo\"
}

@josevalim
Copy link
Member

Is this still a valid graph when the string contains a newline? I tried it both online and in the CLI and it seems that yes, it is valid. And it keeps the newline.

@lukaszsamson
Copy link
Contributor Author

Then perhaps we should escape only backslash and quote?

@josevalim
Copy link
Member

According to the document above, from the page you shared, we should escape only quotes. They say escapes themselves don't need it.

@lukaszsamson
Copy link
Contributor Author

I did some testing online

  • quotes
"fo"o" - invalid
"fo\"o" - valid
"fo\\"o" - invalid
"fo\\\"o" - valid
  • backslash
"fo\o" - valid
"fo\\o" - valid
"fo\\\o" - valid
  • backslash at the end
"fo\" - invalid
"fo\\" - valid
"fo\\\" - invalid

So the string must not end in even number of backslashes and each quote char in the string must be proceeded by odd number of backslashes

@lukaszsamson
Copy link
Contributor Author

I changed the implementation to only escape quotes and append a backslash if the string has an odd number of trailing backslashes

Comment on lines 486 to 516
escaped_data = escape_dot_string(string)
[?", escaped_data, ?"]
end

# Escape a string for DOT format according to GraphViz specification https://graphviz.org/doc/info/lang.html
# - Only quotes need escaping
# - String must not end with an odd number of backslashes (would escape the closing quote)
defp escape_dot_string(string) do
escape_dot_string(string, [], 0)
end

defp escape_dot_string(<<>>, acc, backslash_count) do
if rem(backslash_count, 2) == 1 do
# Odd number of trailing backslashes - add one more to make it even
[acc, "\\"]
else
acc
end
end

defp escape_dot_string(<<?"::utf8, rest::binary>>, acc, _backslash_count) do
escape_dot_string(rest, [acc, "\\\""], 0)
end

defp escape_dot_string(<<"\\"::utf8, rest::binary>>, acc, backslash_count) do
escape_dot_string(rest, [acc, "\\"], backslash_count + 1)
end

defp escape_dot_string(<<char::utf8, rest::binary>>, acc, _backslash_count) do
escape_dot_string(rest, [acc, char], 0)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
escaped_data = escape_dot_string(string)
[?", escaped_data, ?"]
end
# Escape a string for DOT format according to GraphViz specification https://graphviz.org/doc/info/lang.html
# - Only quotes need escaping
# - String must not end with an odd number of backslashes (would escape the closing quote)
defp escape_dot_string(string) do
escape_dot_string(string, [], 0)
end
defp escape_dot_string(<<>>, acc, backslash_count) do
if rem(backslash_count, 2) == 1 do
# Odd number of trailing backslashes - add one more to make it even
[acc, "\\"]
else
acc
end
end
defp escape_dot_string(<<?"::utf8, rest::binary>>, acc, _backslash_count) do
escape_dot_string(rest, [acc, "\\\""], 0)
end
defp escape_dot_string(<<"\\"::utf8, rest::binary>>, acc, backslash_count) do
escape_dot_string(rest, [acc, "\\"], backslash_count + 1)
end
defp escape_dot_string(<<char::utf8, rest::binary>>, acc, _backslash_count) do
escape_dot_string(rest, [acc, char], 0)
end
escape_dot_string(string, <<?">>)
end
# Escape a string for DOT format according to GraphViz specification https://graphviz.org/doc/info/lang.html
# - Only quotes need escaping
# - The ending quote should not be escaped (which requires an even of trailing backslashes)
defp escape_dot_string(<<?\\, ?\\, rest::binary>>, acc) do
escape_dot_string(rest, <<acc::binary, ?\\, ?\\>>)
end
defp escape_dot_string(<<?", rest::binary>>, acc) do
escape_dot_string(rest, <<acc::binary, ?\\, ?">>)
end
defp escape_dot_string(<<char, rest::binary>>, acc) do
escape_dot_string(rest, <<acc::binary, char>>)
end
defp escape_dot_string(<<?\\>>, acc) do
<<acc::binary, ?\\, ?\\, ?">>
end
defp escape_dot_string(<<>>, acc) do
<<acc::binary, ?">>
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above should be more efficient too as it should apply the binary optimizations for traversal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Won't it create binary copies on each char? The caller already builds an iolist so the iolist was a natural choice.
  2. escape_dot_string(<<char, rest::binary>>, acc) will split multibyt utf8 chars into bytes, it may accidentally match on ?". Shouldn't that be escape_dot_string(<<char::utf8, rest::binary>>, acc)?

Copy link
Member

@josevalim josevalim Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. No, Erlang can optimize it, and mutate the binary in place. It is more efficient than iolists (see below)
  2. utf8 bytes can not have the lower 127 ASCII characters in them, so we are good. There is no way to partially match on something and have ?" as left over.
$ ERL_COMPILER_OPTIONS=bin_opt_info elixir lib/mix/lib/mix/utils.ex
    warning: redefining module Mix.Utils (current version loaded from lib/mix/ebin/Elixir.Mix.Utils.beam)
    │
  8 │ defmodule Mix.Utils do
    │ ~~~~~~~~~~~~~~~~~~~~~~
    │
    └─ lib/mix/lib/mix/utils.ex:8: Mix.Utils (module)

     warning: OPTIMIZED: match context reused
     │
 493 │     escape_dot_string(rest, <<acc::binary, ?\\, ?\\>>)
     │     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/mix/lib/mix/utils.ex:493

     warning: OPTIMIZED: match context reused
     │
 497 │     escape_dot_string(rest, <<acc::binary, ?\\, ?">>)
     │     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/mix/lib/mix/utils.ex:497

     warning: OPTIMIZED: match context reused
     │
 505 │     escape_dot_string(rest, <<acc::binary, char>>)
     │     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/mix/lib/mix/utils.ex:505

@josevalim josevalim merged commit d4b0b0e into elixir-lang:main Jul 18, 2025
13 checks passed
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

@lukaszsamson
Copy link
Contributor Author

@josevalim this seems invalid. It will split multibyte utf8 chars into bytes

  defp escape_dot_string(<<char, rest::binary>>, acc) do
    escape_dot_string(rest, <<acc::binary, char>>)
  end

Won't the other clauses match on the rest of a split multibyte char?

@josevalim
Copy link
Member

I replied above: utf8 bytes can not have the lower 127 ASCII characters in them, so we are good. There is no way to partially match on something and have ?" as left over.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants