The UNIX Forums
"Join the Network of UNIX System Users"


 
Subject: copying text between two unique text patterns
-jay-
Post at Jun 20, 2007 10:15 AM
dear colleagues:
i have .rtf files containing a collection of newspaper articles. each article starts with a variation of the phrase "document * of 20" and is separated from the next article by a line of "=" characters ("===================").

i would like to take the text of each article from between these two patterns and write it to a separate, uniquely named file. i've been experimenting with sed, grep, cut and csplit, but nothing has worked so far. i have regular expressions that match the two lines "document * of 20" and "=====" independently, but i can't figure out how to capture and process the text between them. i hope you can help.

yours,
simon j. kiss


mainienator
Post at Jun 20, 2007 10:16 AM
hi simon,
though there could be a smarter solution, i have used the following approach to solve this problem.

assume the file /tmp/mynewarticlefile.rtf has the following contents:

cat /tmp/mynewarticlefile.rtf

html code:

times of india
edition-1
date:27 th may

document 1 of 20

all blah blah goes here
ad page
blah

================================

document 2 of 20

all blah blah goes here
ad page
blah

================================

document 3 of 20

all blah blah goes here
ad page
blah

================================
document 4 of 20

all blah blah goes here
ad page
blah

================================
end of the edition
thanks
editor

i have written the following script, which processes the above file to generate the output.
it assumes that the document has 20 pages.

code:
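#!/bin/ksh
# write each "document N of 20" article to its own file
page=1
while (( page <= 20 )); do
    # include " of " in the pattern so that page 1 does not also match pages 10-19
    sed -n "/document $page of /,/^=====/p" /tmp/mynewarticlefile.rtf > "/tmp/articlesplitpage-$page"
    (( page += 1 ))
done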

upon execution of the above script i get one file per document number, for pages 1 to 20.

cat /tmp/articlesplitpage-1

html code:
document 1 of 20

all blah blah goes here
ad page
blah

================================
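
if the page count is not known in advance, or the header and separator lines should be left out of the output, a single pass with awk may be simpler. this is only a sketch: the sample file, the article-NN.txt names and the header regex are assumptions for illustration, not part of the original data.

```shell
#!/bin/sh
# sketch: one-pass split with awk (file names and regexes are assumptions)
workdir=$(mktemp -d)

# small stand-in for the real input file
cat > "$workdir/sample.txt" <<'EOF'
times of india
document 1 of 20

first article
================================

document 2 of 20

second article
================================
end of the edition
EOF

(
cd "$workdir" &&
awk '
    /^document .* of 20/ {          # header line: start a new output file
        if (out != "") close(out)
        n++
        out = sprintf("article-%02d.txt", n)
        next
    }
    /^=====/ {                      # separator line: stop copying
        if (out != "") close(out)
        out = ""
        next
    }
    out != "" { print > out }       # body line between the two markers
' sample.txt
)

ls "$workdir"
```

lines before the first header and after the last separator are skipped, so the edition banner and the editor's sign-off do not end up in any article file.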

thanks,
bokevoll
Post at Jun 20, 2007 10:16 AM
hi.

for the sample data file "data1":

code:

document * of 20
hello

=====
document one of 20

world

=====
document 44 of 20

now is

=====
document "climatology review" in of 20

with no documents at the beginning of the time.

=====
i ran this script:

code:
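#!/bin/sh

# @(#) s1       demonstrate csplit.

# input file: first argument, defaulting to data1
f=${1-data1}

# split before every header line; -z and {*} are GNU csplit extensions
csplit -k -s -z "$f" "/^document.*of/" '{*}'

echo
for file in xx*
do
        echo
        echo "file: $file"
        # show the first three lines of each piece, numbered
        head -n 3 "$file" |
        cat -n
done

exit 0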
to produce this:

code:
% ./s1


file: xx00
     1  document * of 20
     2  hello
     3

file: xx01
     1  document one of 20
     2
     3  world

file: xx02
     1  document 44 of 20
     2
     3  now is

file: xx03
     1  document "climatology review" in of 20
     2
     3  with no documents at the beginning of the time.
this assumes that the "=====" lines are only visual sugar; csplit leaves them at the end of each output file.

cheers, drl
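
if the "=====" lines are not wanted in the pieces, they can be filtered out afterwards. a minimal sketch, assuming pieces named xx00, xx01, ... as csplit produces by default; the two sample pieces and the .txt suffix are made up for illustration.

```shell
#!/bin/sh
# sketch: drop the separator lines from csplit-style pieces
workdir=$(mktemp -d)

# two hypothetical pieces, shaped like the csplit output above
printf '%s\n' 'document 1 of 20' 'hello' '=====' > "$workdir/xx00"
printf '%s\n' 'document 2 of 20' 'world' '=====' > "$workdir/xx01"

for piece in "$workdir"/xx*
do
        # keep every line except the separator; write to a new file
        grep -v '^=====' "$piece" > "$piece.txt"
done

cat "$workdir/xx00.txt"
```

the original pieces are left untouched, so the filter can be rerun with a different pattern if the separator turns out to vary between files.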
All times are GMT.