-3

My Sample Text:

I have a file that contains lines like the following (sample):

Manolito                     Mapi
MapleStory                   MEEBO
MEEBO_audio                  MEEBO_unknown
MEEBO_video                  MGCP
MGCP_control                 MGCP_rtp
Microsoft\ Exchange          Microsoft\ Exchange_generic
Mig33                        MMS
Mojo                         Move
MPEG                         MPlus

I want to remove the spaces between the words, so that the output is:

Manolito
Mapi
MapleStory
MEEBO
MEEBO_audio
MEEBO_unknown
MEEBO_video
MGCP
MGCP_control
MGCP_rtp
Microsoft\ Exchange
Microsoft\ Exchange_generic
Mig33
MMS
Mojo
Move
MPEG
MPlus

Please note that there should not be any trailing spaces after each word.

Please suggest an awk command or any other script to achieve this.

Thanks,

Kumar

mpapec
Kumar

5 Answers

4

Something like this:

awk -F"  +"  '{print $1 RS $2}' file
Manolito
Mapi
MapleStory
MEEBO
MEEBO_audio
MEEBO_unknown
MEEBO_video
MGCP
MGCP_control
MGCP_rtp
Microsoft\ Exchange
Microsoft\ Exchange_generic
Mig33
MMS
Mojo
Move
MPEG
MPlus

This sets the field separator FS to two or more spaces, then prints field 1, a newline (RS, the record separator, defaults to a newline), and field 2.
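
For readers who prefer Python, the same idea can be sketched as follows. This is only an illustration, assuming that columns are always separated by at least two spaces and that an escaped space inside a name is a single space; the function name `split_columns` is made up for this example.

```python
import re

def split_columns(line):
    """Split a line into fields separated by two or more spaces."""
    # strip() removes leading/trailing whitespace so no empty field
    # appears at either end and no trailing spaces survive.
    return re.split(r" {2,}", line.strip())

sample = "Microsoft\\ Exchange          Microsoft\\ Exchange_generic   "
for field in split_columns(sample):
    print(field)
```

Because the separator must be at least two spaces long, the single space in `Microsoft\ Exchange` is never treated as a separator.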


PS: The following does not work as intended; the field separator eats the last character of the first field.
The idea was to handle the case where spaces that must not be split on are escaped with a backslash:

awk -F'[^\\\\] +'  '{print $1"\n"$2}' file
Manolit
Mapi
MapleStor
MEEBO
MEEBO_audi
MEEBO_unknown
MEEBO_vide
MGCP
MGCP_contro
MGCP_rtp
Microsoft\ Exchang
Microsoft\ Exchange_generic
Mig3
MMS
Moj
Move
MPE
MPlus

Or, if there may be tabs too:

awk -F'[^\\\\][ \t]+'  '{print $1"\n"$2}' file
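
The lost character can be reproduced outside awk. The sketch below uses Python's `re` module purely as an illustration (it is not part of the awk answer): a negated character class consumes the character in front of the spaces, while a negative lookbehind, which POSIX awk regular expressions do not offer, only inspects it.

```python
import re

line = "Microsoft\\ Exchange          Microsoft\\ Exchange_generic"

# The negated class [^\\] matches (and therefore consumes) the character
# before the spaces, so that character becomes part of the separator.
consuming = re.split(r"[^\\] +", line)

# A negative lookbehind only checks that the preceding character is not
# a backslash; nothing but the spaces themselves is consumed.
lookbehind = re.split(r"(?<!\\) +", line)

print(consuming)
print(lookbehind)
```

The first split loses the final `e` of `Exchange`, which matches the truncated output shown above; the second keeps both fields intact.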
Jotne
  • 1
    Close, but OP showed in his example that stuff like "Microsoft\ Exchange" should stay together. – DarkDust Dec 08 '14 at 12:02
  • What if input is `Microsoft\ Exchange Microsoft\ Exchange_generic` – anubhava Dec 08 '14 at 12:06
  • 1
    @anubhava Then nothing would work, since only a human would see that Microsoft should come before Exchange. Yours would fail too. – Jotne Dec 08 '14 at 12:08
  • No, an unescaped space is a delimiter as per my understanding. I added an answer to take care of this case. – anubhava Dec 08 '14 at 12:09
  • @anubhava I do see two escaped spaces, but that does not have to mean that every space in the text is escaped. OP needs to answer that. – Jotne Dec 08 '14 at 12:22
  • 1
    @anubhava Updated my post to handle escaped space. – Jotne Dec 08 '14 at 13:52
  • @Jotne the new one cuts the last letter off most of the lines (looks like the second field of every line) –  Dec 08 '14 at 14:54
  • @Jidder Uffff. Thanks for pointing that out. The `[^\\\\]` eats one character other than `\`... – Jotne Dec 08 '14 at 15:33
  • @jotne maybe you could use FPAT instead of FS? –  Dec 08 '14 at 15:53
  • @Jidder I did look at that, but could not work out how, and it would then need `gnu awk` – Jotne Dec 08 '14 at 16:18
  • @jotne don't you need gnu awk anyway for regex field separators? –  Dec 08 '14 at 18:43
  • @Jidder That I am not sure about. I do know about the one-character limitation in `RS` – Jotne Dec 08 '14 at 19:09
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/66417/discussion-between-jidder-and-jotne). –  Dec 08 '14 at 19:18
2

I assume that you're trying to replace runs of two or more spaces with a newline character. If so, you could use the sed command below.

$ sed 's/[[:space:]]\{2,\}/\n/g' file
Manolito
Mapi
MapleStory
MEEBO
MEEBO_audio
MEEBO_unknown
MEEBO_video
MGCP
MGCP_control
MGCP_rtp
Microsoft\ Exchange
Microsoft\ Exchange_generic
Mig33
MMS
Mojo
Move
MPEG
MPlus

[[:space:]]\{2,\} matches two or more consecutive whitespace characters. Replacing each such run with a newline character gives the desired output.
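
The same substitution can be sketched in Python, shown here only as an illustration. The function name `columns_to_lines` is made up, and the pattern uses `[ \t]` rather than a general whitespace class because, unlike sed, this sketch processes the whole text at once and must not swallow the existing newlines.

```python
import re

def columns_to_lines(text):
    """Replace every run of two or more spaces or tabs with one newline."""
    return re.sub(r"[ \t]{2,}", "\n", text)

text = "Mojo                         Move\nMPEG                         MPlus\n"
print(columns_to_lines(text), end="")
```

Note that a line ending in two or more trailing spaces would gain an empty line with this approach, so stripping each line first may be needed on real data.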

Avinash Raj
  • What if input is `Microsoft\ Exchange Microsoft\ Exchange_generic`? – anubhava Dec 08 '14 at 12:08
  • 1
    @anubhava: OP didn't specify any constraints (in fact, almost no details at all) so IMHO it's useless to speculate whether these cases are to be considered. Whether the backslash is used to escape something or not has not been specified. – DarkDust Dec 08 '14 at 12:10
  • @DarkDust: That is my understanding from the data provided in the question (of course I can be wrong too). I will let OP speak on this. – anubhava Dec 08 '14 at 12:16
2

In Python:

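import re

# Split on whitespace runs that are not preceded by a backslash, so that
# escaped spaces such as "Microsoft\ Exchange" stay inside one field.
with open("in.txt") as infile, open("out.txt", "w") as outfile:
    for line in infile:
        fields = re.split(r"(?<!\\)\s+", line.strip())
        outfile.write("\n".join(fields) + "\n")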
bosnjak
  • I don't know the details of Python's `split()`, but I guess it's behaving like other languages and will split on any whitespace by default. Thus it would split "Microsoft\ Exchange" even though OP wants it to stay together, doesn't it? – DarkDust Dec 08 '14 at 12:06
  • Fixed in the answer. Thanks for the hint, I missed that detail from the question. – bosnjak Dec 08 '14 at 12:09
1

Using grep -oP you can do:

grep -oP '\w.*?\w(?= |$)' file
Manolito
Mapi
MapleStory
MEEBO
MEEBO_audio
MEEBO_unknown
MEEBO_video
MGCP
MGCP_control
MGCP_rtp
Microsoft\ Exchange
Microsoft\ Exchange_generic
Mig33
MMS
Mojo
Move
MPEG
MPlus
anubhava
1

Another awk approach that works with any number of fields and does not require multiple spaces between them, as long as every space that should not become a newline is escaped with a backslash:

awk -vORS= '{for(i=1;i<=NF;i++)print $i ($i~/\\$/?" ":"\n")}' file

Or

awk -vRS=" +"  'ORS=/\\/?" ":"\n"' file

Output

Manolito
Mapi
MapleStory
MEEBO
MEEBO_audio
MEEBO_unknown
MEEBO_video
MGCP
MGCP_control
MGCP_rtp
Microsoft\ Exchange
Microsoft\ Exchange_generic
Mig33
MMS
Mojo
Move
MPEG
MPlus
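
The logic of the first one-liner (print each whitespace-separated token, followed by a space if it ends with a backslash and by a newline otherwise) can be sketched in Python as below. This is a rough illustration under the same assumption that unescaped whitespace is always a delimiter; `split_escaped` is a made-up name.

```python
def split_escaped(line):
    """Split on whitespace, rejoining tokens that end with a backslash."""
    fields = []
    pending = ""  # pieces of a field whose spaces were escaped
    for token in line.split():
        if token.endswith("\\"):
            # The space after this token was escaped: keep it in the field.
            pending += token + " "
        else:
            fields.append(pending + token)
            pending = ""
    if pending:
        # A dangling backslash at the end of the line: keep what we have.
        fields.append(pending.rstrip())
    return fields

for field in split_escaped("Microsoft\\ Exchange Microsoft\\ Exchange_generic"):
    print(field)
```

Unlike the two-or-more-spaces approaches, this also handles fields separated by a single unescaped space.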
Community