Discussion:
[Tutor] parsing sendmail logs
nibudh
2008-07-14 07:29:16 UTC
Hi List,

I'm looking for some support libraries that will help me to parse sendmail
logs.

I'm confused about whether I need a "parser" per se, and if I do, which
parser to use. I found this website
http://nedbatchelder.com/text/python-parsers.html which compares a slew of
python parsers.

Initially I'm wanting to be able to report on who the recipients of a
particular email were, based on an email address or host.
Another report I'm interested in creating is who sent email to a particular
email address.

These simple reports I have already written using unix tools like grep,
sort and awk:

---
1. grep 'email_address' ../maillog* |awk '{print $6}' |sort -u |awk -F:
'{print $1}' >phis.txt
2. for i in `cat ./phis.txt` ; do grep $i ../maillog* >>./maillog; done
3. grep "to=<" maillog |awk '{print $7}' |sort -u >recipients

'email_address' is user supplied, and it would be nice to default to just
maillog but to let the user specify maillog.* or maillog.[1..6]

I whipped these up in a few minutes.
'phis.txt' contains a unique list of message IDs.
'maillog' is a filtered raw log of matching lines based on the message IDs.
'recipients' gives me a list of email addresses, sometimes with multiple
email addresses on one line, comma separated.
---

I really want to just tidy this up into a python script as a programming
exercise.

So that's the background.

How do I go about representing the structure of the sendmail log file in my
script? I'm imagining having to filter through the logs and build up some
kind of data structure which I can use to report from. Should this be as
simple as a dash of regex and str.split(), or are there better tools that
provide a richer framework to work within?
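
For example, this is roughly what I'm imagining (a rough, untested sketch; the
field positions and the to=<...> format are assumptions based on my awk
one-liners above):

---
import re

# map each sendmail queue id to the set of addresses it was delivered to
recipients_by_id = {}
to_pat = re.compile(r'to=<([^>]+)>')

for line in open('maillog'):
    fields = line.split()
    if len(fields) < 6:
        continue
    qid = fields[5].rstrip(':')        # awk's $6, with the trailing ':' removed
    m = to_pat.search(line)
    if m:
        # note: lines with several comma-separated recipients are not
        # handled here; only the first address is picked up
        recipients_by_id.setdefault(qid, set()).add(m.group(1))
---

Is that the sort of thing, or is there a better way to structure it?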

I would love to extend or write further scripts to analyze the logs and pick
up things like someone suddenly emailing 500 people, but crawling before
running seems like the order of the day.

Cheers,

nibudh.
Kent Johnson
2008-07-14 10:47:15 UTC
Post by nibudh
Hi List,
I'm looking for some support libraries that will help me to parse sendmail
logs.
Lire looks like it might be useful:
http://www.logreport.org/lire.html
Post by nibudh
Initially I'm wanting to be able to report on who the recipients of a
particular email were, based on an email address or host.
Another report I'm interested in creating is who sent email to a particular
email address.
These simple reports i have written already using unix tools like grep,
---
'{print $1}' >phis.txt
2. for i in `cat ./phis.txt` ; do grep $i ../maillog* >>./maillog; done
3. grep "to=<" maillog |awk '{print $7}' |sort -u >recipients
'email _address' is user supplied and it would be nice to default to just
maillog but to let the user specify maillog.* or maillog.[1..6]
How do i go about representing the structure of the sendmail log file to my
script. I'm imagining having to filter through the logs and building up some
kind of data structure which i can use to report from. should this just be
as simple as a dash of regex and str.split() ? or are there better tools
that provide a richer framework to work within?
These should all be pretty straightforward to program in Python using
standard facilities for file reading and writing, lists and
dictionaries, and string operations. You probably don't need regular
expressions unless you want to build a regex that will find all mail
ids of interest.
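For example, something like this (untested, and assuming the queue id is the
sixth whitespace-separated field, as in your awk command):

ids_of_interest = set()
for line in open('maillog'):
    if 'email_address' in line:            # plain substring test, no regex needed
        fields = line.split()
        if len(fields) > 5:
            ids_of_interest.add(fields[5].rstrip(':'))

# second pass: keep every line whose queue id we collected
matching = []
for line in open('maillog'):
    fields = line.split()
    if len(fields) > 5 and fields[5].rstrip(':') in ids_of_interest:
        matching.append(line)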

To begin with I don't think you need to be parsing the log lines into
a structure. If you decide that is useful, you might want to look at
the loghetti project as a starting point. It parses web server logs
but it might be a useful model.
http://code.google.com/p/loghetti/

Kent
Monika Jisswel
2008-07-14 18:10:49 UTC
Lire & loghetti are options.
But if you want to go on your own, I believe awk, grep and sort are extremely
extremely extremely (yes, 3 times!) powerful tools, so giving them up is a
bad decision I guess, whether we're talking about their speed or what they
let you do in a few lines of code. So what I would advise is to write
a python program that uses them through the subprocess module; this way you
have the best of both worlds. Finally, you should set up some sort of database
to hold your data & to have a real-time view of what's going on.
So what modules do you need?

subprocess : for running awk & grep.
csv : for loading data from the resulting files.
MySQLdb : to connect to a mysql db.
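
A very rough sketch of the subprocess part (untested, file names are just
examples; the csv and MySQLdb steps would follow the same idea):

import subprocess

# let the shell run the same pipeline you already have, and capture its output
cmd = "grep 'to=<' maillog | awk '{print $7}' | sort -u"
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
output, _ = p.communicate()
recipients = output.splitlines()
print(recipients)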
nibudh
2008-07-15 00:25:08 UTC
Post by Monika Jisswel
Lire & loghetti are options.
But if you want to go on your own, I believe awk, grep and sort are extremely
extremely extremely (yes, 3 times!) powerful tools, so giving them up is a
bad decision I guess, whether we're talking about their speed or what they
let you do in a few lines of code.
Hi Monika,

You are right. awk, grep and sort etc. are extremely powerful. As a unix
sysadmin I use them every day. I guess I'm looking for a couple of simple
projects to strengthen my python scripting. So whilst I usually look for the
best tool for the job, in this case python is my hammer and everything looks
like a nail <grin>
Post by Monika Jisswel
so what I would advice is to write a python program that uses them thru
subprocess module, this way you have the best of both worlds, finaly you
should set up some sort of database to hold your data & to have a real-time
view of whats going on.
Initially I was thinking of writing some python scripts to do some of the
automation tasks that I need to do. I'll still do this because I want
to write more code to keep the practice up.

But what I'd really like to do is write some scripts that analyze my email
logs and catch anomalies and report them to me. Like someone emailing 500
recipients in a day or one external person emailing 500 of my users.

So thinking it through, my first thought was: how do I get the data from the
mail logfiles into a usable state for analysis?

It seems some people just break down the data with regex.

I made an assumption that because I wanted to parse (in a generic sense) the
sendmail logs, perhaps using a "parser" would be of some benefit. But from
researching this angle, there are a lot of choices, and "parser land" has
lots of terminology that I simply don't understand yet.

I guess I'm trying to figure out what I don't know.

Any pragmatic advice on building or working with a framework to get to the
point where I can do analysis on my logs would be cool.

Cheers,

nibudh.
Monika Jisswel
2008-07-15 11:15:58 UTC
Post by Alan Gauld
but in that case use bash or ksh
Hi Alan,

To tell the truth, I never thought about the "additional overhead of getting
the input/output data transferred", because the subprocess itself will contain
the (bash) pipe to redirect output to the next utility used, not the python
subprocess.PIPE pipe, so it will be like one subprocess with each utility
piping stdout to the next as if run from the shell. What does python come in
for? Well, it's always sweet to work with python, as it lets you turn whatever
logic you have in your head into real life with ease, and at the end of the
subprocess you can always parse the stdout using python this time & load the
results into some database.

I have to say that I have seen awk, grep & sort, wc, work on files of
hundreds of Mbytes in a matter of 1 or 2 seconds ... why would I replace
such fast tools?

Alan, do you think python can beat awk in speed when it comes to replacing
text? I've always wanted to know!

Post by nibudh
Any pragmatic advice on building or working with a framework to get to the
point where I can do analysis on my logs would be cool.
OK! So your program parses the sendmail log and returns some data; your python
program will then parse this data & depending on the results will either send
an email saying 'everything OK' or will take measures like running
subprocess.Popen('/etc/init.d/sendmail stop', shell = 1), or adding some email
address or hostname or ip to the spamassassin blacklist & reloading it ...
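
For example (a very rough sketch; the threshold, the dict and the command are
only placeholders):

import subprocess   # only needed if you uncomment the sendmail stop below

SUSPICIOUS = 500    # placeholder threshold

def check(recipients_per_sender):
    # recipients_per_sender: dict mapping sender -> number of recipients seen
    for sender, count in recipients_per_sender.items():
        if count > SUSPICIOUS:
            # this is where you could mail an alert, add the sender to the
            # spamassassin blacklist, or even stop sendmail:
            # subprocess.call('/etc/init.d/sendmail stop', shell=True)
            print('suspicious sender: %s -> %d recipients' % (sender, count))

check({'someone@example.com': 750})    # toy example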

But some problems will arise when we talk about the frequency at which you
will run your scans: every 1 minute? every 5 minutes? 500 mails could be
received by then, so it would be of no use.

This said, this program cannot be used effectively for real-time problem
detection ... but it will be very effective for end-of-day statistics ...
that's why programs such as spamassassin were created, I guess ...
Martin Walsh
2008-07-15 13:26:16 UTC
Post by Monika Jisswel
to say the truth I never thought about "additional overhead of getting
the input/output data transferred" because the subprocess itself will
contain the (bash)pipe to redirect output to the next utility used not
the python subprocess.PIPE pipe so it will be like one subprocess with
each utility piping stdout to the next as if run from the shell, what
I agree with Alan. Personally, I find trying to replace shell scripts
with python code, just plain awk-ward ... ahem, please forgive the pun.
:) No doubt the subprocess module is quite handy. But in this case it
would be hard to beat a shell script, for simplicity, with a chain of
subprocess.Popen calls. I realize this is subjective, of course.
Post by Monika Jisswel
python comes in for ? well, its always sweet to work with python as it
will allow you to make whatever logic you have in your head into real
life with ease and at the end of the subprocess you can always parse the
stdout using python this time & load results to some database.
If you follow the unix philosophy(tm) it might make more sense to pipe
the result (of a shell pipeline) to a python script that does only the
database load.
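Something along these lines, for example (untested; the table, column and
connection details are made up):

# load_recipients.py -- read addresses from stdin and insert them into a table
# usage: grep 'to=<' maillog | awk '{print $7}' | sort -u | python load_recipients.py
import sys
import MySQLdb   # the MySQLdb package mentioned earlier

conn = MySQLdb.connect(host='localhost', user='logs', passwd='secret', db='maillog')
cur = conn.cursor()
for line in sys.stdin:
    cur.execute("INSERT INTO recipients (address) VALUES (%s)", (line.strip(),))
conn.commit()
conn.close()

That keeps the shell doing what it is good at, and Python doing only the part
the shell can't.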
Post by Monika Jisswel
I have to say that I have seen awk, grep & sort, wc, work on files of
hundreds of Mbytes in a matter of 1 or 2 seconds ... why would I replace
such fast tools?
I can think of a few reasons, not the least of which is the OP's -- as
"a programming exercise".
Post by Monika Jisswel
Alan do you think python can beat awk in speed when it comes to
replacing text ? I always wanted to know it !
Well, maybe. But IMHO, the question should really be whether python is 'fast
enough', especially when you consider how the OP is using awk in the first
place. But the only way to know is to try it out.
Post by Monika Jisswel
Any pragmatic advice on building or working with a framework to get
to the point where i can do analysis on my logs would be cool.
As an exercise, I think it would be a reasonable approach to write
python derivatives of the shell commands being used, perhaps tailored to
the data set, to get a feel for working with text data in python. Then
ask questions here if you get stuck, or need optimization advice. I
think you'll find you can accomplish this with just a few lines of
python code for each (sort -u, grep, awk '{print $n}', etc), given your
use of the commands in the examples provided. Write each as a function,
and you'll end up with code you can reuse for other log analysis
projects. Bonus!
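
For instance, rough (untested) versions might look like this:

def unique_sorted(lines):                  # roughly: sort -u
    return sorted(set(lines))

def grep(pattern, lines):                  # roughly: grep, plain substring match
    return [line for line in lines if pattern in line]

def field(n, lines):                       # roughly: awk '{print $n}', 1-based
    return [line.split()[n - 1] for line in lines if len(line.split()) >= n]

# your step 3 then becomes something like:
lines = open('maillog').readlines()
print('\n'.join(unique_sorted(field(7, grep('to=<', lines)))))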

HTH,
Marty
nibudh
2008-07-18 04:56:24 UTC
Post by Martin Walsh
Post by nibudh
Any pragmatic advice on building or working with a framework to get
to the point where i can do analysis on my logs would be cool.
As an exercise, I think it would be a reasonable approach to write
python derivatives of the shell commands being used, perhaps tailored to
the data set, to get a feel for working with text data in python. Then
ask questions here if you get stuck, or need optimization advice. I
think you'll find you can accomplish this with just a few lines of
python code for each (sort -u, grep, awk '{print $n}', etc), given your
use of the commands in the examples provided. Write each as a function,
and you'll end up with code you can reuse for other log analysis
projects. Bonus!
HTH,
Marty
Hi Marty,

Thanks for the input. I like the idea of writing tailored versions of the
standard unix tools. I think what has held me back from writing more than a
few scripts based on someone else's code is a lack of clarity about what I'm
wanting to write.

I've done a bit of web programming in ASP, PHP and Perl, mostly in the
presentation layer, but the shift in domain to sysadmin tasks has for some
reason been difficult.

Essentially re-writing "sort -u" in python has the advantage for me that I
use sort _a lot_, so I'm familiar with the concept and it's a small enough
task to feel doable. :-)

Your suggestion also provides an insight into how to program that I find
easy to forget, which is to break things down into smaller pieces.

Cheers,

nibudh.
Alan Gauld
2008-07-15 19:58:36 UTC
Post by Monika Jisswel
to say the truth I never thought about "additional overhead of getting the
input/output data transferred" because the subprocess itself will contain the
(bash) pipe to redirect output to the next utility used, not the python
subprocess.PIPE pipe, so it will be like one subprocess with each utility
piping stdout to the next as if run from the shell,
If you run a pipeline chain from within subprocess, every part of the
chain will be a separate process; that's a lot of overhead. That's why
admins tend to prefer writing utilities in Perl rather than bash these
days.
Also, for the pipeline to work every element must work with text,
which may not be the native data type, so we have to convert it
to/from ints or floats etc.

But mostly I was thinking about the I/O to/from the Python program.
If the subprocess or pipeline is feeding data into Python it will usually
be much more efficient to store the data directly in Python variables
than to write it to stdout and read it back from stdin (which is what
happens in the pipeline).
Post by Monika Jisswel
I have to say that I have seen awk, grep & sort, wc, work on files of
hundreds of Mbytes in a matter of 1 or 2 seconds ... why would I replace
such fast tools?
awk is not very fast. It is an interpreted language in the old sense
of interpreted language: it literally interprets line by line. I have not
compared awk to Python directly, but I have compared it to perl,
which is around 3 times faster in my experience, and more if you
run awk from Perl rather than doing the equivalent in Perl. Now
Python is usually a little bit slower than Perl - especially when
using regex - but not that much slower, so I'd expect Python to
be about 2 times faster than awk. (Not sure how nawk or gawk
compare, they may be compiled to byte code like perl/python.)

But as you say, even awk is fast enough for normal sized use,
so unless you are processing large numbers of files, spawning
an awk process is probably not a killer; it just seems redundant
given that Python is just as capable for processing text and much
better at processing collections.
Post by Monika Jisswel
Alan do you think python can beat awk in speed when it
comes to replacing text ? I always wanted to know it !
It would be interesting to try but I'd expect it to be significantly
faster, yes.
--
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld
Jeff Younker
2008-07-18 17:52:35 UTC
Post by Alan Gauld
If you run a pipeline chain from within subprocess every part of the
chain will be a separate process, thats a lot of overhead. Thats why
admins tend to prefer writing utilities in Perl rather than bash
these days.
...
Nobody I know uses perl for systems administration because it doesn't
have the cost of forking a process. They use it because you have real
data structures, functions, access to system calls, etc. with one
*cough*
consistent syntax. One of perl's big selling points for system
administration is that it is *so* easy to fork off a shell and read
its output.

The argument that processes are too expensive might have had some
truth back in '92 or '93 when machines had 8M or 16M of memory, had
500M hard drives, and ran at a fraction of the speed they do now, but
that was a long time ago.

-jeff

Alan Gauld
2008-07-15 07:47:13 UTC
Post by Monika Jisswel
but if you want to go on your own I believe awk, grep, sort are extremely
extremely extremely (yes 3 times !) powerful tools, so giving them up is a
...
a python program that uses them through the subprocess module,
I am a big fan of awk but I'd never write a Python script
to call an awk script via subprocess unless the awk script already
existed and was very big. There is nothing that awk can do that
Python cannot, and the amount of code is about the same in each case.
And the overhead of starting a new awk process from subprocess
is quite high, and then you have the additional overhead of getting
the input/output data transferred. It is all a very cumbersome process
that is not justified IMHO unless the awk code already exists
and is too big to rewrite. (The same applies to sort/grep if the Python
script has to use the data; otherwise sort is a valid option.)

Does that mean you can't combine awk/python? No. If you are
not using the awk data in the python script then run awk separately
from Python, but in that case use bash or ksh to coordinate, not Python.
But I struggle to think of a good case for using awk via subprocess
(or os.system/popen etc).

Alan G.
nibudh
2008-07-14 23:57:04 UTC
Post by Kent Johnson
http://www.logreport.org/lire.html
I have seen lire; of course it's Perl, but looking over the source would
give me some ideas. In fact, from a practical point of view, if it does the
job I should look into it more :-)

-- snip
Post by Kent Johnson
Post by nibudh
How do I go about representing the structure of the sendmail log file in my
script? I'm imagining having to filter through the logs and build up some
kind of data structure which I can use to report from. Should this be as
simple as a dash of regex and str.split(), or are there better tools that
provide a richer framework to work within?
These should all be pretty straightforward to program in Python using
standard facilities for file reading and writing, lists and
dictionaries, and string operations. You probably don't need regular
expressions unless you want to build a regex that will find all mail
ids of interest.
I will probably want to do something like this: extract the mail ids and use
those to filter the logs further.
Post by Kent Johnson
To begin with I don't think you need to be parsing the log lines into
a structure. If you decide that is useful, you might want to look at
the loghetti project as a starting point. It parses web server logs
but it might be a useful model.
http://code.google.com/p/loghetti/
Kent
Thanks for the pointer. loghetti looks interesting; I've downloaded the
source and from a quick review I can imagine extending it to be sendmail
aware.

Cheers,

nibudh.
Jeff Younker
2008-07-15 18:21:56 UTC
Post by nibudh
Hi List,
I'm looking for some support libraries that will help me to parse
sendmail logs.
I'm confused about whether i need a "parser" per se, and if i do
which parser to use. I found this website http://nedbatchelder.com/text/python-parsers.html
which compares a slew of python parsers.
Parsers as referenced in your link are intended for heavy-duty lifting
such as parsing programming languages.

Sendmail logs are very simple, so those parsers are vast overkill. Split
on whitespace combined with regular expressions should be up to the job.
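
For a typical delivery line that means something like this (the sample line is
made up, but follows the usual syslog layout):

import re

line = ('Jul 14 07:29:16 mailhost sendmail[12345]: m6E7TGxx012345: '
        'to=<someone@example.com>, delay=00:00:01, stat=Sent')

fields = line.split()
queue_id = fields[5].rstrip(':')              # 'm6E7TGxx012345'
match = re.search(r'to=<([^>]+)>', line)
if match:
    print(match.group(1))                     # 'someone@example.com'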

-jeff
nibudh
2008-07-18 04:59:42 UTC
Post by Jeff Younker
Parsers as referenced in your link are intended for heavy-duty lifting
such as parsing programming languages.
Sendmail logs are very simple, so those parsers are vast overkill. Split
on whitespace combined with regular expressions should be up to the job.
-jeff
Jeff,

Thanks for the clarification, I was looking at these parsers thinking they
were overkill.

Split and regex for version 0.0.0.1 of my script look like the order of the
day :-)

Cheers,

nibudh.