Discussion:
[Tutor] Extracting data between strings
richard kappler
2015-05-27 13:26:15 UTC
I'm writing a script that reads from an in-service log file in xml format
that can grow to a couple gigs in 24 hours, then gets zipped out and
restarts at zero. My script must check to see if new entries have been
made, find specific lines based on 2 different start tags, and from those
lines extract data between the start and end tags (hopefully including the
tags) and write it to a file. I've got the script to read the file, see if
it's grown, find the appropriate lines and write them to a file. I still
need to strip out just the data I need (between the open and close tags)
instead of writing the entire line, and also to reset eof when the nightly
zip / new log file creation occurs. I could use some guidance on stripping
out the data, at the moment I'm pretty lost, and I've got an idea about the
nightly reset but any comments about that would be welcome as well. Oh, and
the painful bit is that I can't use any modules that aren't included in the
initial Python install. My code is appended below.

regards, Richard


import time

while True:
    # open the log file containing the data
    file = open('log.txt', 'r')
    # find initial End Of File offset
    file.seek(0, 2)
    eof = file.tell()
    # set the file size again
    file.seek(0, 2)
    neweof = file.tell()
    # if the file is larger...
    if neweof > eof:
        # go back to last position...
        file.seek(eof)
        # open file to which the lines will be appended
        f1 = open('newlog.txt', 'a')
        # read new lines in log.txt
        for line in file.readlines():
            # check if line contains needed data
            if "usertag1" in line or "SeMsg" in line:

                ############################################################
                #### this should extract the data between usertag1 and ####
                #### /usertag1, and between SeMsg and /SeMsg, writing  ####
                #### just that data to the new file for analysis. For  ####
                #### now, write the entire line to the file.           ####
                ############################################################

                # if yes, send line to new file
                f1.write(line)
        # update log.txt file size
        eof = neweof

        ###########################################################
        #### need an elif for when neweof < eof (nightly      ####
        #### reset of log file) that continues the loop with  ####
        #### reset eof and neweof                             ####
        ###########################################################
        file.close()
        f1.close()
    # wait x number of seconds until back to beginning of loop
    # set at ten for dev and test, set to 300 for production
    time.sleep(10)
Steven D'Aprano
2015-05-27 14:38:43 UTC
Hi Richard,

I'm not sure how advanced you are, whether you have any experience or if
you're a total beginner. If anything I say below doesn't make sense,
please ask! Keep your replies on the list, and somebody will be happy to
answer.
Post by richard kappler
I'm writing a script that reads from an in-service log file in xml format
that can grow to a couple gigs in 24 hours, then gets zipped out and
restarts at zero.
Please be more specific -- does the log file already get zipped up each
day, and you have to read from it, or is your script responsible for
reading AND zipping it up?

The best solution to this is to use your operating system's logrotate
command to zip up and rotate the logs. On Linux, Mac OS X and many Unix
systems, logrotate is a standard utility. You should use that, it is
reliable, configurable, heavily tested and debugged and better than
anything you will write.

On Windows, I don't know if there is any equivalent to logrotate.
Probably not -- Windows' standard utilities are painfully impoverished
and underpowered compared to what Linux and many Unixes provide as
standard.

If you must write your own logrotate, just write a script that zips up
and renames the log file. At *most*, check whether the file is "big
enough". In pseudo-code:


# logrotate.py
if the file is more than X bytes in size:
    rename the log to log.temp
    create a new log file for the process to write to
    zip up log.temp as log.date.zip
    check for any old log.zip files that need deleting
    (e.g. "keep maximum of five log files, deleting the oldest")


Then use your operating system's scheduler to schedule this script to
run every day at (say) 3am. Don't schedule anything between 1am and 3am!
If you do, then when daylight savings changes, it may run twice, or not
run at all. Again, Linux and Unix have a standard scheduler, cron; I
expect that even Windows will have one of those too.

The point being, don't re-invent the wheel! If your system already has a
log rotator, use that; if it has a scheduler, use that; only if it lacks
both should you write your own.


To read the config settings (e.g. how big is X bytes?), use the
configparser module. To do the zipping, use the zipfile module. shutil
and os modules will also be useful. Do you need more pointers or is that
enough to get you started?
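
For example, a settings file and the few lines needed to read it could
look like this (the file name, section and option names are invented for
illustration; on Python 2 the module is spelled ConfigParser):

# rotate.cfg might contain:
#
#     [rotate]
#     max_bytes = 524288000
#     keep = 5

import configparser   # "import ConfigParser" on Python 2

cfg = configparser.ConfigParser()
cfg.read("rotate.cfg")
max_bytes = cfg.getint("rotate", "max_bytes")
keep = cfg.getint("rotate", "keep")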

Now, on to the script that extracts data from the logs...
Post by richard kappler
My script must check to see if new entries have been
made, find specific lines based on 2 different start tags, and from those
lines extract data between the start and end tags (hopefully including the
tags) and write it to a file. I've got the script to read the file, see if
it's grown, find the appropriate lines and write them to a file.
It's probably better to use a directory listener rather than keep
scanning the file over and over again.

Perhaps you can adapt this?

http://code.activestate.com/recipes/577968-log-watcher-tail-f-log/
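
The heart of that recipe is essentially a polling loop. A minimal
stdlib-only version, assuming the rotated file is recreated under the
same name, might look like:

import os
import time

def follow(path, interval=1.0):
    """Yield lines as they are appended to path; reopen after rotation."""
    f = open(path, "r")
    f.seek(0, 2)   # start at the current end of file
    while True:
        line = f.readline()
        if line:
            yield line
        elif os.path.getsize(path) < f.tell():
            # the file shrank: the nightly rotation happened, start over
            f.close()
            f = open(path, "r")
        else:
            time.sleep(interval)

# usage:
# for line in follow('log.txt'):
#     ...extract and write the tagged data...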
Post by richard kappler
I still
need to strip out just the data I need (between the open and close tags)
instead of writing the entire line, and also to reset eof when the nightly
zip / new log file creation occurs. I could use some guidance on stripping
out the data, at the moment I'm pretty lost, and I've got an idea about the
nightly reset but any comments about that would be welcome as well. Oh, and
the painful bit is that I can't use any modules that aren't included in the
initial Python install. My code is appended below.
regards, Richard
import time
#open the log file containing the data
file = open('log.txt', 'r')
#find initial End Of File offset
file.seek(0,2)
eof = file.tell()
#set the file size again
file.seek(0,2)
neweof = file.tell()
#if the file is larger...
You can tell how big the file is without opening it:

import os
filesize = os.path.getsize('path/to/file')
Post by richard kappler
#go back to last position...
file.seek(eof)
# open file to which the lines will be appended
f1 = open('newlog.txt', 'a')
# read new lines in log.txt
For huge files, it is ****MUCH**** more efficient to write:

for line in file: ...

than to use file.readlines(), which reads the entire file into a list
in memory before you get to process the first line.
Post by richard kappler
#check if line contains needed data
############################################################
#### this should extract the data between usertag1 and ####
#### and /usertag1, and between SeMsg and /SeMsg, ####
#### writing just that data to the new file for ####
#### analysis. For now, write entire line to file ####
############################################################
You say "between usertag1 ... AND between SeMsg..." -- what if only one
set of tags are available? I'm going to assume that it's possible that
both tags could exist.

for line in file:
    for tag in ("usertag1", "SeMsg"):
        if tag in line:
            start = line.find(tag)
            end = line.find("/" + tag, start)
            if end == -1:
                print("warning: no /%s found!" % tag)
            else:
                data = line[start:end + len(tag) + 1]
                f1.write(data)
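
A quick worked example with a made-up log line shows exactly what that
slice keeps (note that the angle brackets themselves fall outside it):

line = "stuff <usertag1>payload</usertag1> more stuff"   # invented sample
tag = "usertag1"
start = line.find(tag)                # index of "u" inside the opening tag
end = line.find("/" + tag, start)     # index of "/" inside the closing tag
print(line[start:end + len(tag) + 1])  # prints: usertag1>payload</usertag1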



Does this help?
--
Steve
richard kappler
2015-05-27 14:52:40 UTC
Post by Steven D'Aprano
Hi Richard,
I'm not sure how advanced you are, whether you have any experience or if
you're a total beginner. If anything I say below doesn't make sense,
please ask! Keep your replies on the list, and somebody will be happy to
answer.

In between. I've been learning Python off and on as a hobby for about 4-5
years, learning what I need to know to do what I want to do, but I just
took a new job and part of it requires ramping up my Python knowledge as
I'm the only Python (and Linux) guy here.
Post by Steven D'Aprano
I'm writing a script that reads from an in-service log file in xml format
that can grow to a couple gigs in 24 hours, then gets zipped out and
restarts at zero.
Please be more specific -- does the log file already get zipped up each
day, and you have to read from it, or is your script responsible for
reading AND zipping it up?

No, all I have to do is read it and recognize when it gets emptied
(zipped). The Linux system it's on already rotates and zips. My script just
needs to extract the specific data we're looking for.


Now, on to the script that extracts data from the logs...
Post by Steven D'Aprano
My script must check to see if new entries have been
made, find specific lines based on 2 different start tags, and from those
lines extract data between the start and end tags (hopefully including the
tags) and write it to a file. I've got the script to read the file, see if
it's grown, find the appropriate lines and write them to a file.
It's probably better to use a directory listener rather than keep
scanning the file over and over again.

Perhaps you can adapt this?

http://code.activestate.com/recipes/577968-log-watcher-tail-f-log/

I'll check it out, thanks.
Post by Steven D'Aprano
I still
need to strip out just the data I need (between the open and close tags)
instead of writing the entire line, and also to reset eof when the nightly
zip / new log file creation occurs. I could use some guidance on stripping
out the data, at the moment I'm pretty lost, and I've got an idea about the
nightly reset but any comments about that would be welcome as well. Oh, and
the painful bit is that I can't use any modules that aren't included in the
initial Python install. My code is appended below.
regards, Richard
import time
#open the log file containing the data
file = open('log.txt', 'r')
#find initial End Of File offset
file.seek(0,2)
eof = file.tell()
#set the file size again
file.seek(0,2)
neweof = file.tell()
#if the file is larger...
You can tell how big the file is without opening it:

import os
filesize = os.path.getsize('path/to/file')

Brilliant! We were just discussing the potential interference and overhead
of opening and closing the file more frequently than every 5 minutes; this
solves that problem.
Post by Steven D'Aprano
#go back to last position...
file.seek(eof)
# open file to which the lines will be appended
f1 = open('newlog.txt', 'a')
# read new lines in log.txt
For huge files, it is ****MUCH**** more efficient to write:

for line in file: ...

than to use file.readlines().

Ok.
Post by Steven D'Aprano
#check if line contains needed data
############################################################
#### this should extract the data between usertag1 and ####
#### and /usertag1, and between SeMsg and /SeMsg, ####
#### writing just that data to the new file for ####
#### analysis. For now, write entire line to file ####
############################################################
You say "between usertag1 ... AND between SeMsg..." -- what if only one
set of tags are available? I'm going to assume that it's possible that
both tags could exist.
Right, I should have been more specific, sorry. Each line will contain one
or the other or neither, never both.

for line in file:
    for tag in ("usertag1", "SeMsg"):
        if tag in line:
            start = line.find(tag)
            end = line.find("/" + tag, start)
            if end == -1:
                print("warning: no /%s found!" % tag)
            else:
                data = line[start:end + len(tag) + 1]
                f1.write(data)



Does this help?

I think it does, I'll go give it a try and post back this afternoon. Thanks
Steven!
--
Windows assumes you are an idiot…Linux demands proof.
Laura Creighton
2015-05-27 15:09:48 UTC
Post by richard kappler
Windows assumes you are an idiot…Linux demands proof.
We're all rolling on the floor laughing from this one. Is it
original with you?

Thank you.
Laura


Peter Otten
2015-05-27 16:43:52 UTC
Post by richard kappler
I'm writing a script that reads from an in-service log file in xml format
that can grow to a couple gigs in 24 hours, then gets zipped out and
restarts at zero. My script must check to see if new entries have been
made, find specific lines based on 2 different start tags, and from those
lines extract data between the start and end tags (hopefully including the
tags) and write it to a file. I've got the script to read the file, see if
it's grown, find the appropriate lines and write them to a file. I still
need to strip out just the data I need (between the open and close tags)
instead of writing the entire line, and also to reset eof when the nightly
zip / new log file creation occurs. I could use some guidance on stripping
out the data, at the moment I'm pretty lost, and I've got an idea about
the nightly reset but any comments about that would be welcome as well.
Oh, and the painful bit is that I can't use any modules that aren't
included in the initial Python install. My code is appended below.
You'll probably end up with something closer to your original idea and
Steven's suggestions, but let me just state that lines and xml live in two
distinct worlds. Here's my (sketchy) attempt to treat the xml as xml
rather than as lines in a text file. (I got some hints here:
http://effbot.org/zone/element-iterparse.htm).

import os
import sys
import time

from xml.etree.ElementTree import iterparse, tostring

class Rollover(Exception):
    pass

class File:
    def __init__(self, filename, sleepinterval=1):
        self.size = 0
        self.sleepinterval = sleepinterval
        self.filename = filename
        self.file = open(filename, "rb")

    def __enter__(self):
        return self

    def __exit__(self, etype, evalue, traceback):
        self.file.close()

    def read(self, size=0):
        while True:
            s = self.file.read(size)
            if s:
                return s
            else:
                time.sleep(self.sleepinterval)
                self.check_rollover()

    def check_rollover(self):
        newsize = os.path.getsize(self.filename)
        if newsize < self.size:
            raise Rollover()
        self.size = newsize

WANTED_TAGS = {"usertag1", "SeMsg"}

while True:
    try:
        with File("log.txt") as f:
            context = iterparse(f, events=("start", "end"))
            event, root = next(context)
            wanted_count = 0

            for event, element in context:
                if event == "start" and element.tag in WANTED_TAGS:
                    wanted_count += 1
                else:
                    assert event == "end"
                    if element.tag in WANTED_TAGS:
                        wanted_count -= 1
                        print("LOGGING")
                        print(tostring(element))
                    if wanted_count == 0:
                        root.clear()
    except Rollover as err:
        print("doing a rollover", file=sys.stderr)

