Mirror of https://github.com/nodejs/node.git, synced 2024-11-24 03:07:54 +01:00, at 72df124e38
Previously we had two encodings for JS files:

1. If a file contains only ASCII characters, encode it as a one-byte string (interpreted as a uint8_t array during loading).
2. If a file contains any character with a code point above 127, encode it as a two-byte string (interpreted as a uint16_t array during loading).

This was done because V8 only supports Latin-1 and UTF-16 as the underlying representations for strings. To store the JS code as external strings, and so save encoding cost and memory overhead, we need to follow the representations supported by V8.

Notice that there is a gap in the Latin-1 range (128-255) that we encoded as two-byte, which was an undocumented TODO for a long time. That was fine previously because the files that contained code points beyond the 0-127 range also contained code points >255. Now we have undici, which contains code points in the range 0-255 (minus a replaceable code point >255). So this patch adds handling for the 128-255 range to reduce the size overhead caused by encoding them as two-byte. This could reduce the size of the binary by ~500KB and helps future files with this kind of code point.

Drive-by: replace `’` with `'` in undici.js to make it a Latin-1 only string. That could be removed if undici updates itself to replace this character in the comment.

PR-URL: https://github.com/nodejs/node/pull/51605
Reviewed-By: Daniel Lemire <daniel@lemire.me>
Reviewed-By: Ethan Arrowood <ethan@arrowood.dev>
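The encoding choice described in the commit message can be sketched as a small classifier. This is only an illustration under stated assumptions: the names `SourceEncoding` and `ClassifySource` are hypothetical and are not the identifiers used in Node.js, and the input is assumed to be already decoded into UTF-16 code units.

```cpp
#include <cstdint>
#include <string>

// Hypothetical names for illustration; not taken from the Node.js sources.
enum class SourceEncoding { kAscii, kLatin1, kTwoByte };

// Picks the narrowest representation that can hold every code unit:
//   0-127 only           -> one-byte (ASCII)
//   0-255, some above 127 -> one-byte (Latin-1), the case the patch adds
//   anything above 255    -> two-byte
SourceEncoding ClassifySource(const std::u16string& code_units) {
  bool has_latin1 = false;
  for (const char16_t unit : code_units) {
    if (unit > 0xFF) return SourceEncoding::kTwoByte;
    if (unit > 0x7F) has_latin1 = true;
  }
  return has_latin1 ? SourceEncoding::kLatin1 : SourceEncoding::kAscii;
}
```

Under this sketch, a single `’` (U+2019) in a comment is enough to force a whole file into the two-byte case, which is why the drive-by replacement matters for undici.js.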
34 lines
1.1 KiB
C++
#include "embedded_data.h"
#include <vector>

namespace node {
std::string ToOctalString(const uint8_t ch) {
  // We can print most printable characters directly. The exceptions are '\'
  // (escape characters), " (would end the string), and ? (trigraphs). The
  // latter may be overly conservative: we compile with C++17 which doesn't
  // support trigraphs.
  if (ch >= ' ' && ch <= '~' && ch != '\\' && ch != '"' && ch != '?') {
    return std::string(1, static_cast<char>(ch));
  }
  // All other characters are blindly output as octal.
  const char c0 = '0' + ((ch >> 6) & 7);
  const char c1 = '0' + ((ch >> 3) & 7);
  const char c2 = '0' + (ch & 7);
  return std::string("\\") + c0 + c1 + c2;
}

std::vector<std::string> GetOctalTable() {
  size_t size = 1 << 8;
  std::vector<std::string> code_table(size);
  for (size_t i = 0; i < size; ++i) {
    code_table[i] = ToOctalString(static_cast<uint8_t>(i));
  }
  return code_table;
}

const std::string& GetOctalCode(uint8_t index) {
  static std::vector<std::string> table = GetOctalTable();
  return table[index];
}
}  // namespace node
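A minimal sketch of how the per-byte escapes might be assembled into a C++ string literal. `ToOctalString` is copied from the file above so the example compiles on its own; `ToQuotedLiteral` is a hypothetical helper written for illustration and is not part of Node.js.

```cpp
#include <cstdint>
#include <string>

// Copied from the file above so this sketch is self-contained.
std::string ToOctalString(const uint8_t ch) {
  if (ch >= ' ' && ch <= '~' && ch != '\\' && ch != '"' && ch != '?') {
    return std::string(1, static_cast<char>(ch));
  }
  const char c0 = '0' + ((ch >> 6) & 7);
  const char c1 = '0' + ((ch >> 3) & 7);
  const char c2 = '0' + (ch & 7);
  return std::string("\\") + c0 + c1 + c2;
}

// Hypothetical helper (an assumption, not Node.js code): escapes every byte
// and wraps the result in double quotes to form a string literal.
std::string ToQuotedLiteral(const std::string& bytes) {
  std::string out = "\"";
  for (const char c : bytes) {
    out += ToOctalString(static_cast<uint8_t>(c));
  }
  out += '"';
  return out;
}
```

For example, the Latin-1 byte 0xE9 becomes the four characters `\351`, and `?` becomes `\077`. Always emitting exactly three octal digits keeps each escape unambiguous even when the next character in the literal is itself a digit.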