Replacing a lot of text in browser's addon

Question

I'm trying to develop a Firefox add-on that transliterates the text on any page into specific language. Actually it's just a set of 2D arrays which I iterate and use this code

function escapeRegExp(str) {
    return str.replace(/([.*+?^=!:${}()|\[\]\/\\])/g, "\\$1");
}

function replaceAll(find, replace) {
    return document.body.innerHTML.replace(new RegExp(escapeRegExp(find), 'g'), replace);
}

function convert2latin() {
    for (var i = 0; i < Table.length; i++) {
        document.body.innerHTML = replaceAll(Table[i][1], Table[i][0]);
    }
}

It works, and I can ignore HTML tags, as it can be in english only, but the problem is performance. Of course it's very very poor. As I have no experience in JS, I tried to google and found that maybe documentFragment can help.
Maybe I should use another approach at all?

Is your `Table` dynamic, or static? If dynamic, why does it change and how often does it change? What sort of contents does `Table` contain (e.g. a list of English and Latin characters (guess form `convert2latin`)?)? You are clearly attempting to accommodate the possibility that `Table` contains some special characters. What is the source for the contents of `Table`? Please provide some examples of input and output. We need to know what input and output you are expecting in order to evaluate other potential approaches to the problem. — Makyen, Jun 01 '16 at 09:41
@Makyen That's my python test script I used. JS table is just copy-paste of it with changes in syntax. https://github.com/Pugnator/JLPTrainer/blob/master/jpconvert.py The main idea is to iterate through text in russian and convert each syllable into japanese analogue. This addon targets to help people to remember japanese symbols in more easy way, without grinding — pugnator, Jun 01 '16 at 09:50
Your Python code uses `string.replace()` which does not use regular expressions. Is there a reason you are using a regular expression for the match? If you just use a normal string for the match it should be significantly faster. If you are going to use a RegExp, with what appears to be a static array of substitutions which you perform multiple times, you are usually better off creating each RegExp only once and storing it (e.g. in `Table[i][3]`) for use rather than re-creating each RegExp (expensive) each time. — Makyen, Jun 01 '16 at 10:12
@Makyen no, there is no specific reason for it. I'm new to web and JS. I was told the main reason for huge lags is DOM rebuild on each replace. And this way it will be always laggy — pugnator, Jun 01 '16 at 10:16
That is true, which should have given you enough information to significantly increase performance. — Makyen, Jun 01 '16 at 10:17
@Makyen post an answer, I think you gave me enough info to "redesign" the addon — pugnator, Jun 01 '16 at 10:29

Makyen · Accepted Answer · 2017-06-14T20:22:40.513

Based on your comments, you appear to have already been told that the most expensive thing is the DOM rebuild that happens when you completely replace the entire contents of the page (i.e. when you assign to document.body.innerHTML). You are currently doing that for each substitution. This results in Firefox re-rendering the entire page for each substitution you are making. You only need assign to document.body.innerHTML once, after you have made all of the substitutions.

The following should provide a first pass at making it faster:

function escapeRegExp(str) {
    return str.replace(/([.*+?^=!:${}()|\[\]\/\\])/g, "\\$1");
}

function convert2latin() {
    newInnerHTML = document.body.innerHTML
    for (let i = 0; i < Table.length; i++) {
        newInnerHTML = newInnerHTML.replace(new RegExp(escapeRegExp(Table[i][1]), 'g'), Table[i][0]);
    }
    document.body.innerHTML = newInnerHTML
}

You mention in comments that there is no real need to use a RegExp for the match, so the following would be even faster:

function convert2latin() {
    newInnerHTML = document.body.innerHTML
    for (let i = 0; i < Table.length; i++) {
        newInnerHTML = newInnerHTML.replace(Table[i][1], Table[i][0]);
    }
    document.body.innerHTML = newInnerHTML
}

If you really need to use a RegExp for the match, and you are going to perform these exact substitutions multiple times, you are better off creating all of the RegExp prior to the first use (e.g. when Table is created/changed) and storing them (e.g. in Table[i][2]).

However, assigning to `document.body.innerHTML` is a bad way to do this:

As the8472 mentioned, replacing the entire content of document.body.innerHTML is a very heavy handed way to perform this task, which has some significant disadvantages including probably breaking the functionality of other JavaScript in the page and potential security issues. A better solution would be to change only the textContent of the text nodes.

One method of doing this is to use a TreeWalker. The code to do so, could be something like:

function convert2latin(text) {
    for (let i = 0; i < Table.length; i++) {
        text = text.replace(Table[i][1], Table[i][0]);
    }
    return text
}

//Create the TreeWalker
let treeWalker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT,{
    acceptNode: function(node) { 
        if(node.textContent.length === 0
            || node.parentNode.nodeName === 'SCRIPT' 
            || node.parentNode.nodeName === 'STYLE'
        ) {
            //Don't include 0 length, <script>, or <style> text nodes.
            return NodeFilter.FILTER_SKIP;
        } //else
        return NodeFilter.FILTER_ACCEPT;
    }
}, false );
//Make a list of nodes prior to modifying the DOM. Once the DOM is modified the TreeWalker
//  can become invalid (i.e. stop after the first modification). Doing so is not needed
//  in this case, but is a good habit for when it is needed.
let nodeList=[];
while(treeWalker.nextNode()) {
    nodeList.push(treeWalker.currentNode);
}
//Iterate over all text nodes, changing the textContent of the text nodes 
nodeList.forEach(function(el){

    el.textContent = convert2latin(el.textContent));
});

@pugnator, Added code showing a TreeWalker changing the `textContent` of all the text nodes in the document. — Makyen, Jun 01 '16 at 11:18
@pugnator, Firefox tends to have a large quantity of blank text nodes in the DOM. I have added skipping those so time is not spent trying to make changes to an empty string. This also shows a basic use of the `TreeWalker` `acceptNode` function to select the nodes you want to work on based on the properties of the node other than just the type selected by [`TreeWalker.whatToShow`](https://developer.mozilla.org/en-US/docs/Web/API/TreeWalker/whatToShow). — Makyen, Jun 01 '16 at 12:08
I just checked it and it works like a charm. even heavy wikipedia page refreshes very quick (it was about ~30secs before). TreeWalker did the trick. Now I can concentrate on making in more useful for students. Thank you again — pugnator, Jun 01 '16 at 12:38
*"However, doing so will incur the performance hit of doing a partial page re-layout for each node that is changed. "* - No. Modifying the DOM only marks the layout as as dirty, it will not immediately re-layout the page unless you *read* some style-related properties or the javascript yields back to the UI event loop. See http://www.phpied.com/rendering-repaint-reflowrelayout-restyle/ scroll down for the **Browsers are smart** section. And you're still missing the `setTimeout` part to reduce jank. — the8472, Jun 01 '16 at 14:32
Also, you might want to skip whitespace-only nodes, not just empty ones. — the8472, Jun 01 '16 at 14:38

the8472 · Answer 2 · 2016-06-01T10:37:00.677

Don't use innerhtml, it would destroy any javascript event handlers registered on the DOM nodes or make references to dom nodes held by the page's javascript obsolete. In other words, you could easily break a page with that. And of course it's inefficient.

You can use a treewalker instead and filter for text nodes only. The walking can be incrementalized by deferring the next step with window.setTimeout every 1000th text node or something like that.

If you register your addon script early enough you could also use a mutation observer to get notified about text nodes as soon as they get inserted and replace them incrementally which should make things less janky.

Replacing a lot of text in browser's addon

2 Answers2

However, assigning to `document.body.innerHTML` is a bad way to do this:

Linked

Replacing a lot of text in browser's addon

2 Answers2

However, assigning to document.body.innerHTML is a bad way to do this:

Linked

However, assigning to `document.body.innerHTML` is a bad way to do this: