SED/regex help 
Joined: Thu Apr 23, 2009 6:36 pm
Posts: 5150
Location: /dev/tty0
Hi all,

I have a number of files with a line like this:
Quote:
<h2 class="title" style="clear: both"><a xmlns="http://www.w3.org/1999/xhtml" id="id2826402"/>Conclusion: From Custom to Customary Law</h2></div></div><p xmlns="http://www.w3.org/1999/xhtml">We have examined the customs which regulate the ownership and control


I have a statement doing this:
Code:
# Mod <p> and <h> tags to preserve them
cat $FILE | sed 's/<p.*>/[p]/g' | sed 's/<\/p>/[\/p]/g' | sed 's/<h\([1-3]\)\(.\)*></[h\1]</g' | sed 's/<\/h\([1-3]\)\(.\)*></[\/h\1]</g' > $tmp1


This cats the file, converts the <p> and </p> tags to [p] and [/p], and converts the <h1> to <h3> tags and their closing tags to [hN] and [/hN]. However, the statement produces this:
Quote:
[h2][p]We have examined the customs which regulate the ownership and control


The third sed expression matches everything up to the last '>' (just before the [p] tag). How do I make it match only up to the FIRST '>' it comes to? I.e. I want this:

Quote:
[h2]<a xmlns="http://www.w3.org/1999/xhtml" id="id2826402"/>Conclusion: From Custom to Customary Law[/h2]</div></div>[p]We have examined the customs which regulate the ownership and control


Thanks,
Ben


Fri Jan 22, 2010 2:10 pm
Solved :D

I split the statement into two lines in the end when I cleaned up the script. I also played around and got it to do what I wanted:
Code:
# Mod <p> and <h> tags to preserve them
# [^>]* cannot cross a '>', so each match stays inside a single tag
cat "$FILE" | sed 's/<p[^>]*>/[p]/g' | sed 's/<\/p>/[\/p]/g' > "$tmp1"
cat "$tmp1" | sed 's/<h\([1-3]\)[^>]*>/[h\1]/g' | sed 's/<\/h\([1-3]\)[^>]*>/[\/h\1]/g' > "$tmp2"


Simple once I re-read my book on pattern matching. I had missed it the first few times I scanned through it.
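
For anyone hitting the same thing, here is a minimal sketch of the difference. It uses a shortened, made-up line rather than the real file, and the variable names (line, greedy, bounded) are only for illustration. The greedy `.*` runs to the last '>' on the line, while `[^>]*` cannot cross a '>' and so stops at the first one:

```shell
# Made-up sample line, much shorter than the real input
line='<h2 class="title"><a id="x"/>Heading</h2></div><p>Body text'

# Greedy: .* swallows everything up to the LAST '>' on the line
greedy=$(printf '%s\n' "$line" | sed 's/<h\([1-3]\).*>/[h\1]/')

# Bounded: [^>]* ends the match at the FIRST '>' after <h2
bounded=$(printf '%s\n' "$line" | sed 's/<h\([1-3]\)[^>]*>/[h\1]/')

printf '%s\n' "$greedy" "$bounded"
# [h2]Body text
# [h2]<a id="x"/>Heading</h2></div><p>Body text
```

The same bounded class should work for any tag where the match has to stay inside one pair of angle brackets.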

Now, after a few more scripts, I've got a legal, up-to-date copy of the book "The Cathedral and the Bazaar" by Eric Raymond. It's a good read :D


Sat Jan 23, 2010 5:36 pm
Joined: Thu Apr 23, 2009 9:40 pm
Posts: 5288
Location: ln -s /London ~
My turn:

It's been a bit of a long day and my brain's clearly missing something obvious. I have a file containing lines of input. I want to use grep to select those that end in a forward slash (ultimately I want to select everything but them, but that's a simple flag). What regexp do I need? I thought I'd tried everything obvious:

Code:
egrep "/$" < input


Edd

_________________
timark_uk wrote:
Gay sex is better than no sex

timark_uk wrote:
Edward Armitage is Awesome. Yes, that's right. Awesome with a A.


Tue Feb 02, 2010 4:27 pm
Joined: Thu Apr 23, 2009 11:36 pm
Posts: 3527
Location: Portsmouth
Argh my head has just imploded.

We had to write Sed in C last year. Absolute hell!!!!!!!



Tue Feb 02, 2010 8:34 pm
EddArmitage wrote:
My turn:

It's been a bit of a long day and my brain's clearly missing something obvious. I have a file containing lines of input. I want to use grep to select those that end in a forward slash (ultimately I want to select everything but them, but that's a simple flag). What regexp do I need? I thought I'd tried everything obvious:

Code:
egrep "/$" < input


Edd


I'd do something like:
Code:
egrep \/$ < input
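
A quick way to check what the shell actually hands to egrep in each spelling is sketched below. The variable names (unquoted, quoted) are made up for the example. In a POSIX shell both forms should arrive as the same two characters, so the backslash version would not be expected to behave any differently:

```shell
# Capture the argument the command would receive in each case
set -- \/$
unquoted=$1

set -- "/$"
quoted=$1

printf '%s\n' "$unquoted" "$quoted"
# /$
# /$
```

Single quotes ('/$') are probably the safest habit, since nothing inside them is expanded.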


Wed Feb 03, 2010 11:51 am
forquare1 wrote:
EddArmitage wrote:
Code:
egrep "/$" < input

I'd do something like:
Code:
egrep \/$ < input

It worked fine in the end as it was, when the input was piped straight in from the previous stage. I swear there must be something installed that uses hamsters as line endings on these damn CSC machines!
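
If the culprit was stray carriage returns in the file (an assumption, this was never confirmed), the effect can be reproduced with a small sketch like the one below. The sample data and the names (sample, plain, stripped, rest) are invented for the example:

```shell
# Made-up sample: the second line carries a DOS-style carriage return
sample() { printf 'dir/\nother/\r\nfile.txt\n'; }

# '/$' misses "other/" because the CR sits between the slash and the line end
plain=$(sample | grep -c '/$')

# Deleting the CRs first lets both slash-terminated lines match
stripped=$(sample | tr -d '\r' | grep -c '/$')

# The inverse selection: everything that does NOT end in a slash
rest=$(sample | tr -d '\r' | grep -v '/$')

printf '%s %s %s\n' "$plain" "$stripped" "$rest"
# 1 2 file.txt
```

Running the input through `od -c` is one way to see whether any \r characters are really there.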



Wed Feb 03, 2010 11:57 am