4

I have the exact problem described here:

removing Hebrew "niqqud" using r

I have been struggling to remove niqqud (diacritical signs used to represent vowels or to distinguish between alternative pronunciations of letters of the Hebrew alphabet). For instance, I have this variable: sample1 <- "הֻסְמַק"

And I cannot find an effective way to remove the signs below the letters.

In my case, however, I have to do this in JavaScript.

Based on the UTF-8 values table described here, I have tried this regex without success.

Dorad

2 Answers

13

Just a slight problem with your regex. Try the following:

const input = "הֻסְמַק";
console.log(input)
console.log(input.replace(/[\u0591-\u05C7]/g, ''));

/*
$ node index.js
הֻסְמַק
הסמק
*/
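
As an alternative sketch, not what this answer uses: instead of a hard-coded range, a Unicode property escape can match nonspacing marks by category. The helper name `stripNonspacingMarks` is made up for illustration. `\p{Mn}` needs the `u` flag, and it matches combining marks from every script, not only Hebrew, so it suits text that is known to be Hebrew only. Punctuation such as maqaf (U+05BE) is not a nonspacing mark and is left in place.

```javascript
// Hypothetical helper: remove every nonspacing combining mark (category Mn).
// Assumption: the input is Hebrew-only text, since \p{Mn} is not script-specific.
function stripNonspacingMarks(text) {
  return text.replace(/\p{Mn}/gu, '');
}

// He, qubuts, samekh, sheva, mem, patah, qof (written as escapes for clarity)
const pointed = '\u05D4\u05BB\u05E1\u05B0\u05DE\u05B7\u05E7';
console.log(stripNonspacingMarks(pointed)); // letters only: he, samekh, mem, qof
```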
nj_
    There is an issue with maqaf (`U+05BE`) in the text: it is removed and the two words are concatenated. For example, _כִּי־טֽוֹב_ becomes _כיטוב_. – Avraham Jan 30 '18 at 22:16
5

nj_’s answer is great.

Just to add a bit (because I don’t have enough reputation points to comment directly) -

[\u0591-\u05C7] may be too broad a brush. See the relevant table here: https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet#Compact_table

Rows 059x and 05Ax are for t'amim (accents/cantillation marks). Niqqud per se is in rows 05Bx and 05Cx.
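
To check which of these rows a particular string actually uses, it can help to list its code points. The following is only a sketch: the helper name `describeHebrewMarks` and its labels are invented for illustration, and the boundaries simply follow the rows named above. Rows 05Bx and 05Cx also hold a few punctuation characters (maqaf at U+05BE, for example), so the label says "niqqud row" rather than claiming every character there is a vowel point.

```javascript
// Hypothetical helper: label each code point by the table row it falls in.
function describeHebrewMarks(text) {
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    const hex = 'U+' + cp.toString(16).toUpperCase().padStart(4, '0');
    let kind = 'other';
    if (cp >= 0x0591 && cp <= 0x05AF) kind = "t'amim";          // rows 059x and 05Ax
    else if (cp === 0x05BE) kind = 'maqaf';                      // word joiner, not a vowel point
    else if (cp >= 0x05B0 && cp <= 0x05C7) kind = 'niqqud row';  // rows 05Bx and 05Cx
    else if (cp >= 0x05D0 && cp <= 0x05EA) kind = 'letter';      // alef through tav
    return `${hex} ${kind}`;
  });
}

// Tet, holam, bet
console.log(describeHebrewMarks('\u05D8\u05B9\u05D1'));
// [ 'U+05D8 letter', 'U+05B9 niqqud row', 'U+05D1 letter' ]
```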

And as Avraham commented, you can run into an issue if two words are joined by a maqaf (U+05BE): removing it leaves you with run-on words.

If you want to remove only t’amim but keep niqqud, use /[\u0591-\u05AF]/g. If you want to avoid the issue raised by Avraham, you have two options, either keep the maqaf or replace it with a dash:

// Keep the original maqaf characters (U+05BE is excluded from the class)
const input = "כִּי־טֽוֹב";
console.log(input);
console.log(input.replace(/[\u05B0-\u05BD\u05BF-\u05C7]/g, ""));

// Replace each maqaf with a dash first, then strip the rest of the range
console.log(input.replace(/\u05BE/g, "-").replace(/[\u05B0-\u05C7]/g, ""));

/*
$ node index.js
כִּי־טֽוֹב
כי־טוב
כי-טוב
*/
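
The choices above can be folded into a single function. This is a sketch under assumptions, not a definitive implementation: the name `stripHebrewMarks` and its `cantillation`, `niqqud`, and `maqaf` options are invented for illustration, and the ranges are the same ones used in this answer.

```javascript
// Hypothetical helper combining the ranges discussed in this answer.
function stripHebrewMarks(text, options = {}) {
  const { cantillation = true, niqqud = true, maqaf = 'keep' } = options;
  let result = text;
  // Rows 059x and 05Ax: t'amim (cantillation marks)
  if (cantillation) result = result.replace(/[\u0591-\u05AF]/g, '');
  // Rows 05Bx and 05Cx, skipping maqaf (U+05BE)
  if (niqqud) result = result.replace(/[\u05B0-\u05BD\u05BF-\u05C7]/g, '');
  // Maqaf handling: 'keep' leaves it alone, 'dash' or 'space' substitutes it
  if (maqaf === 'dash') result = result.replace(/\u05BE/g, '-');
  else if (maqaf === 'space') result = result.replace(/\u05BE/g, ' ');
  return result;
}

// Kaf, hiriq, dagesh, yod, maqaf, tet, meteg, vav, holam, bet
const sample = '\u05DB\u05B4\u05BC\u05D9\u05BE\u05D8\u05BD\u05D5\u05B9\u05D1';
console.log(stripHebrewMarks(sample));                    // maqaf kept
console.log(stripHebrewMarks(sample, { maqaf: 'dash' })); // maqaf becomes "-"
```

Note that the 05Bx and 05Cx rows also contain paseq (U+05C0) and sof pasuq (U+05C3), which are punctuation rather than vowel points. Narrow the character class if those should survive.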
zuchmir