mirror of https://github.com/nodejs/node.git synced 2024-11-24 03:07:54 +01:00
nodejs/src/embedded_data.cc
Joyee Cheung 72df124e38
build: encode non-ASCII Latin1 characters as one byte in JS2C
Previously we had two encodings for JS files:

1. If a file contains only ASCII characters, encode it as a one-byte
  string (interpreted as uint8_t array during loading).
2. If a file contains any characters with code point above 127,
  encode it as a two-byte string (interpreted as uint16_t array
  during loading).

This was done because V8 only supports Latin-1 and UTF-16 as the
underlying representations for strings. To store the JS code as
external strings, and so save encoding cost and memory overhead,
we need to follow the representations supported by V8.
Note that this left a gap: code points in the upper Latin-1 range
(128-255) were encoded as two-byte, which was an undocumented TODO
for a long time. That was fine previously because every file that
contained code points beyond the 0-127 range also contained code
points >255. Now we have undici, whose code points all fall in the
range 0-255 (apart from one replaceable code point >255). So this
patch adds handling for the 128-255 range to reduce the size overhead
caused by encoding such files as two-byte. This could reduce the size
of the binary by ~500KB and helps future files with this kind of
code point.
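
As a rough illustration of the rule described above (not the actual
JS2C code), the choice of encoding can be sketched as a scan for the
largest code point. The names `Encoding` and `ChooseEncoding` are
hypothetical, and the input is assumed to be already decoded into
code points:

```cpp
#include <vector>

// Hypothetical names, used only for this sketch.
enum class Encoding { kOneByte, kTwoByte };

// Returns kOneByte when every code point fits in Latin-1 (0-255),
// so the file can be stored as a uint8_t array. A single code point
// above 255 forces the two-byte (uint16_t) representation.
Encoding ChooseEncoding(const std::vector<char32_t>& code_points) {
  for (char32_t cp : code_points) {
    if (cp > 0xFF) return Encoding::kTwoByte;
  }
  return Encoding::kOneByte;
}
```

Under the old rule the threshold would have been 127 rather than 255,
so a file using only `é` (U+00E9) beyond ASCII would have been stored
as two-byte.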

Drive-by: replace `’` with `'` in undici.js to make it a Latin-1
only string. That could be removed if undici updates itself to
replace this character in the comment.

PR-URL: https://github.com/nodejs/node/pull/51605
Reviewed-By: Daniel Lemire <daniel@lemire.me>
Reviewed-By: Ethan Arrowood <ethan@arrowood.dev>
2024-02-17 17:09:24 +00:00

34 lines
1.1 KiB
C++

#include "embedded_data.h"
#include <vector>

namespace node {

std::string ToOctalString(const uint8_t ch) {
  // We can print most printable characters directly. The exceptions are '\'
  // (escape characters), " (would end the string), and ? (trigraphs). The
  // latter may be overly conservative: we compile with C++17 which doesn't
  // support trigraphs.
  if (ch >= ' ' && ch <= '~' && ch != '\\' && ch != '"' && ch != '?') {
    return std::string(1, static_cast<char>(ch));
  }
  // All other characters are blindly output as octal.
  const char c0 = '0' + ((ch >> 6) & 7);
  const char c1 = '0' + ((ch >> 3) & 7);
  const char c2 = '0' + (ch & 7);
  return std::string("\\") + c0 + c1 + c2;
}

std::vector<std::string> GetOctalTable() {
  size_t size = 1 << 8;
  std::vector<std::string> code_table(size);
  for (size_t i = 0; i < size; ++i) {
    code_table[i] = ToOctalString(static_cast<uint8_t>(i));
  }
  return code_table;
}

const std::string& GetOctalCode(uint8_t index) {
  static std::vector<std::string> table = GetOctalTable();
  return table[index];
}

}  // namespace node
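
A minimal sketch of how escapes like these could be assembled into a C++ string literal. This is an illustration only, not how the file's callers are actually written: the `sketch` namespace and the names `EscapeByte` and `ToCStringLiteral` are hypothetical, and the escaping rule is repeated so the example compiles on its own.

```cpp
#include <cstdint>
#include <string>
#include <vector>

namespace sketch {

// Same rule as ToOctalString above: printable ASCII passes through,
// except '\', '"' and '?'; everything else becomes a 3-digit octal escape.
std::string EscapeByte(uint8_t ch) {
  if (ch >= ' ' && ch <= '~' && ch != '\\' && ch != '"' && ch != '?') {
    return std::string(1, static_cast<char>(ch));
  }
  std::string out = "\\";
  out += static_cast<char>('0' + ((ch >> 6) & 7));  // top 2 bits
  out += static_cast<char>('0' + ((ch >> 3) & 7));  // middle 3 bits
  out += static_cast<char>('0' + (ch & 7));         // low 3 bits
  return out;
}

// Hypothetical helper: wraps the escaped bytes in double quotes so the
// result can be pasted into generated C++ source.
std::string ToCStringLiteral(const std::vector<uint8_t>& bytes) {
  std::string out = "\"";
  for (uint8_t b : bytes) out += EscapeByte(b);
  out += '"';
  return out;
}

}  // namespace sketch
```

For example, the Latin-1 byte 0xE9 (233 decimal) splits into the octal digits 3, 5, 1 and is emitted as `\351`. Always emitting three digits matters: an octal escape consumes at most three digits, so a digit character that follows the escaped byte cannot be absorbed into it.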