Differences

This shows you the differences between the selected revision and the current version of the page.

daily_email_statistics 2008/01/02 01:06 daily_email_statistics 2008/07/02 18:59 current
Line 1: Line 1:
-  Date: Tue, 1 Jan 2008 13:42:02 +0100 +  Date: Sat, 3 May 2008 16:22:19 +0200
  From: Uwe Scholz <turboscholz@xxx>   From: Uwe Scholz <turboscholz@xxx>
-  To: crm114-general@lists.sourceforge.net +  Subject: daily email statistics 
-  Message-ID: <20080101124202.GA6268@localhost> +  Message-ID: <20080503142219.GA4791@uwe-desktop>
-  Subject: [Crm114-general] Daily spam summary (and auto-delete) HOWTO +
   
-  Hello!+  Hello Paolo,
   
-  A few weeks ago I saw that there is an empty entry in the +  today I've made some changes to my daily mail summary script. Now it is 
-  HOWTO section of the wiki, namely the "Daily email statistics" thing. +  possible to choose whether the summary of mails is sorted by date of 
 +  arrival or by spam probability. For this purpose, I had to include two 
 +  new files (scan.spam_dat and scan.spam_pos) and delete an old one 
 +  (scan.spam).
   
-  So I decided to translate and publish the stuff I wrote about +  I'm still using 20041231.BlameSanAndreas (TRE 0.7.5 (LGPL)), because 
-  this issue some time ago, and I hope that some people can check it +  the crm maintainers at Gentoo haven't marked a later version of 64-bit crm as 
-  and look for program (and spelling) mistakes. If all is OK, we can copy +  stable so far. So I don't know whether this script works with later 
-  the text and the accompanying scripts into the wiki. +  versions, too.
   
-  It would be great if anyone could look through and try the scripts I have +  What I want to say is that I changed the files in my 
-  written. They all work well and I can't complain about anything. +  university directory. Now there is only one archive for all the script 
 +  files:
   
-  Ciao and a happy new year! +  http://wwwstud.rz.uni-leipzig.de/~pge03gej/daten/spam/spam-summary.tar.gz
-  Uwe +
- +
------------------------------------------------------------------------- +
-  Because of a newspaper article about CRM114 as a programming language +
-  which can also work as a spam filter, I decided to have a look at it. How +
-  it works is described in great detail in the CRM114_Mailfilter_HOWTO +
-  inside the CRM114 package. The approach described here is not limited +
-  to CRM114, but can also be used with SpamAssassin or with +
-  Procmail alone. All you have to do then is adjust your procmailrc and +
-  all the stuff with nmh, but just read on... +
   
-  Everything ran great and my spam was from now on filtered by CRM114 and +  I have also included your comments from 
-  Procmail. But something was missing ... I was sooo lazy and had no +  http://crm114.sourceforge.net/wiki/doku.php?id=daily_email_statistics 
-  desire to check the spam folder every day for wrongly sorted mail +  in the README.txt. So I think you can delete the page entry and just 
-  that had landed there. What to do? There had to be a script that +  insert a small link to the archive above with a short explanation or 
-  sends me a daily summary of new spam mails. Therefore I wrote this +  so. That way it is easier for me to update the script as fast as 
 +  possible in my web directory.
   
-   +  Thanks in advance  
-  SPAM-MAIL-STATISTIC- (AND AUTO-DELETE)-HOWTO +  Uwe
-  ******************************************** +
-   +
-   +
-  1: Installing the Spam-Filter and Getting the Summary Script Working +
-  ******************************************************************** +
-   +
-   +
-  If you have a working spam filter installed (no matter whether +
-  SpamAssassin, CRM114 or whatever) and sort your spam with Procmail +
-  into a spam folder in mh format, you can read on at section 1b. +
-   +
-  1a) Sorting out Spam with Procmail and CRM114 +
-  ********************************************* +
-   +
-  First, I installed CRM114 as described in the +
-  CRM114_Mailfilter_HOWTO. Mails recognized as spam are now sorted out by +
-  Procmail. The necessary entries in the .procmailrc are (see "Step 5" +
-  of the mentioned HOWTO): +
-   +
-    MAILDIR=$HOME/Mail +
-    :0fw: .crm114.lock +
-    | /usr/bin/crm -u /home/uwe/.crm114 mailfilter.crm +
-   +
-    :0: +
-    * ^X-CRM114-Status: SPAM.* +
-    | rcvstore +Spam +
-   +
-   +
-  The second recipe differs from the .procmailrc configuration given in the +
-  CRM114 HOWTO in its action line: +
-   +
-    | rcvstore +Spam +
-   +
-  By way of explanation, a short excerpt about the mh mailbox format from +
-  http://www.linux-user.de: +
-   +
-  ---  +
-  "Mh" is the abbreviation for "mail handler" (mh/nmh). It is a system of +
-  command-line tools for modular handling of email -- but here we are +
-  only interested in the structure of an mh folder. It contains +
-  consecutively numbered files, where every file is one mail. When a new +
-  mail arrives, the next free number is simply used. +
-  ---  +
-   +
-  Such an mh folder is very convenient and practical for our problem, +
-  because we can identify individual spam messages simply by their file +
-  name. +
-   +
-  When a mail is deleted inside the mh folder by the MUA, the mail file still +
-  exists afterwards, but its file name is prefixed with a comma. These +
-  files can then be deleted by a cron job. The comma thing is nice, because +
-  it gives you the possibility to undelete mails. +
- +
-//that's like "T" flag/suffix for files in maildir format --paolo// +
- +
-  What does rcvstore do? Rcvstore is a program from the nmh package (a +
-  package of mail processing programs for the console). In order to +
-  execute programs of this package, it is advisable to add the paths +
-  /usr/lib/mh and /usr/bin/mh to the PATH variable. +
-   +
-  Rcvstore stores mail messages coming from procmail in the specified +
-  folder "Spam" in the mail directory in mh format. To let rcvstore know +
-  where this directory is, you have to run the command +
-   +
-    install-mh +
-   +
-  after the installation of nmh. It will ask you for your standard mail +
-  directory. Usually, this is "Mail" in your home directory. "rcvstore +
-  +Spam" will then save the mail in the folder "~/Mail/Spam" (so please +
-  run "mkdir ~/Mail/Spam"). +
-   +
-  For an MUA such as "mutt" you have to create the file ".mh_sequences" in +
-  the Spam folder, so that it knows this folder is in mh format: +
-   +
-    touch ~/Mail/Spam/.mh_sequences  +
-   +
-  The .procmailrc-entry +
-   +
-    :0: +
-    * ^X-CRM114-Status: SPAM.* +
-    | rcvstore +Spam +
- +
-//and for maildir, we'd use:// +
-    | safecat .Spam/tmp .Spam/new +
-//since procmail sets its CWD to Mail, and maildir folder start with dot (usually) --paolo// +
- +
-  thus stores mails marked as spam in the directory ~/Mail/Spam in +
-  mh format. Each mail gets a separate file, with consecutive +
-  numbers as file names, starting at 1. +
-   +
-  We have now solved half of the problem. +
-   +
-  1b) Sending the Summary Message +
-  ******************************* +
-   +
-  New spam mails now land in the folder ~/Mail/Spam. The mails have +
-  consecutive numbers: 1, 2, 3, ... +
-   +
-  How do we get Linux to send us a daily summary of the new spam mails? +
-   +
-  Logically, the responsible script needs a file in +
-  which it records which mails have already been reviewed. +
-   +
-    touch ~/Mail/Spam/.spam_check +
-   +
-  The script I wrote for this task can be downloaded [1]. In it, you +
-  have to set some user-specific values: the standard mail directory MAIL +
-  (default: Mail), the spam folder SPAM inside the mail directory +
-  (default: Spam) and the spam-check file SPAMCHECK that we have just +
-  created. +
-   +
-  How the script works: +
-   +
-  The principle is quite simple: for each mail file in the SPAMDIR folder +
-  it is checked whether there exists an entry for the mail in the file +
-  ".spam_check". If not, then the mail file is new, and a "check" entry +
-  with file name and the current time (we need it later) is written in the +
-  SPAMCHECK file. At the next review there is already a check-entry and +
-  the associated mail is not considered new. +
-   +
-  The format of a line in the SPAMCHECK file is of the following form: +
-  FILENAME:TIME, where the time is the moment when the entry was created. +
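
The bookkeeping described above can be sketched roughly as follows. This is not the author's spam_check script (which is only linked, not shown here); it is a minimal Python illustration that assumes TIME is stored as whole seconds since the epoch and that mail files are exactly the entries with purely numeric names. The function names load_entries and register_new_mails are made up for this sketch.

```python
import os
import time


def load_entries(spamcheck_path):
    """Read FILENAME:TIME lines into a dict {filename: time}."""
    entries = {}
    if not os.path.exists(spamcheck_path):
        return entries
    with open(spamcheck_path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            # Split at the last colon so the file name part stays intact.
            name, _, stamp = line.rpartition(":")
            entries[name] = int(stamp)  # assumption: TIME is epoch seconds
    return entries


def register_new_mails(spam_dir, spamcheck_path, now=None):
    """Append a check entry for every mail file that has none yet.

    Returns the names of the newly registered mails, i.e. the ones
    that would appear in the next summary.
    """
    now = int(time.time()) if now is None else int(now)
    entries = load_entries(spamcheck_path)
    # Assumption: mh mail files have purely numeric names, so ",3"
    # and dot files such as ".spam_check" are skipped.
    mails = sorted((n for n in os.listdir(spam_dir) if n.isdigit()), key=int)
    new = [n for n in mails if n not in entries]
    with open(spamcheck_path, "a") as fh:
        for name in new:
            fh.write(f"{name}:{now}\n")
    return new
```

Calling register_new_mails a second time without new mail returns an empty list, which corresponds to "at the next review there is already a check entry".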
-   +
-  The date, the sender and the subject of every new spam mail (exactly as +
-  many as there are new entries in the SPAMCHECK file) are read out with the +
-  program "scan" from the nmh package. This summary is then sent by mail to +
-  the user. With a cron job the script can be run once a day, and from now +
-  on we get a nice and concise summary of new spam mails. +
-   +
-  I didn't like the default output of the scan program so much, so I wrote my +
-  own format file [2] for the output. This file can be used by passing the +
-  option "-form scan.spam" to the scan command. Examples of such files are +
-  included in the nmh package. +
-   +
-  You can see what a summary mail looks like in this screenshot [4] (the +
-  subject is in German because I didn't translate the script just for +
-  the screenshot). +
-   +
-  A little note for those who want to know it exactly: +
-  **************************************************** +
-   +
-  New mails always receive the next free number. Unfortunately, we +
-  get a problem if one deletes the newest mail with the MUA. For example, +
-  suppose we have three mails: 1, 2 and 3. All mails have been +
-  reviewed by the spam_check script. Now we delete the newest mail with +
-  the MUA, so we get: 1, 2, and ",3". After this a new mail arrives +
-  and gets the next free number: "3". Now the spam_check script +
-  could "think" that the mail numbered 3 has already been checked, because +
-  its entry in the SPAMCHECK file exists. But this 3 is a new file! In +
-  order to prevent this error, the spam_check script compares the age of +
-  the mail file with the corresponding time in the SPAMCHECK file: if the +
-  mail file is younger than the check entry, it must be a new mail. +
-  It is scanned too and the check entry gets updated. +
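
The age comparison could look something like the following Python sketch. It is only an illustration of the idea, not the code from spam_check: it assumes the check times are seconds since the epoch and uses the file's modification time as "the age of the mail file". The name is_new_mail is hypothetical.

```python
import os


def is_new_mail(mail_path, entries):
    """Return True if the mail has to be scanned (again).

    `entries` maps file names to the time their check entry was
    written (assumption: seconds since the epoch).
    """
    name = os.path.basename(mail_path)
    if name not in entries:
        return True  # never registered, so certainly new
    # A file modified after its check entry was written must be a new
    # mail that re-used the number of a mail deleted in the MUA.
    return os.path.getmtime(mail_path) > entries[name]
```

In the example from the text, the new file "3" is younger than the old entry for "3", so the function returns True and the caller would then overwrite that entry with the current time.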
- +
-//no such problems with maildir instead, since filenames are unique for sure --paolo// +
- +
-  2) Automatically Deleting Spam +
-  ****************************** +
-   +
-  That was already great, right? But ... +
-   +
-  Now it bugged me that the number of stupid spam mails on my +
-  hard disk kept growing, so I decided to write a script to +
-  automatically delete old spam [3]. +
-   +
-  What should be avoided in any case is deleting mails +
-  that the user has never seen. Just imagine the following situation: A user +
-  receives a summary of new spam mails every morning. In the course of the +
-  day, new e-mails are collected by fetchmail. One of them is mistakenly +
-  marked as spam and therefore lands in the "Spam" folder. In the evening +
-  the user switches off the PC. Now he goes on holiday for half a month. After +
-  the relaxing(!) days with his family he returns home and turns on the +
-  PC again. Because of the settings in his spam_delete script, spam mails +
-  older than a week have to be deleted, and so even the "wrong" spam +
-  mail gets deleted, although it was not spam. ERROR! +
-   +
-//a simple/safe rule could be: set the grace period to twice the longest expected full user vacation (no mail checking) time  --paolo// +
-   +
-  How can we prevent this? +
-   +
-  In the SPAMCHECK file the spam_check script records when mails have been +
-  registered. If we check when each mail in the SPAMCHECK file was +
-  registered, we get a fairly safe concept: the only mails that can be +
-  deleted from the spam folder are those that the user +
-  has already seen in a summary mail and whose review was already a +
-  certain time ago. Mails which haven't been checked by the spam_check +
-  script cannot be deleted. +
-   +
-  If the check entry in the SPAMCHECK file is older than the specified age +
-  MAXAGE (in seconds), the script deletes the mail file and the corresponding +
-  entry in the SPAMCHECK file. +
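
A rough Python sketch of this deletion rule, under stated assumptions: check times are seconds since the epoch, `entries` is a dict of file name to check time, and the function name delete_old_spam and its dry_run switch are invented here for illustration (the linked spam_delete script achieves a similar effect by shipping with its "rm" lines commented out).

```python
import os
import time

SEVEN_DAYS = 7 * 24 * 3600  # default MAXAGE mentioned in the text, in seconds


def delete_old_spam(spam_dir, entries, max_age=SEVEN_DAYS, now=None, dry_run=True):
    """Apply the MAXAGE rule to registered mails only.

    Returns (expired_names, remaining_entries). With dry_run=True no
    file is removed; the result only shows what would happen.
    """
    now = time.time() if now is None else now
    expired = []
    remaining = {}
    for name, stamp in entries.items():
        if now - stamp > max_age:
            path = os.path.join(spam_dir, name)
            if not dry_run and os.path.exists(path):
                os.remove(path)
            expired.append(name)
        else:
            remaining[name] = stamp
    return expired, remaining
```

Because the loop only walks over names that have a check entry, a mail that was never registered (and therefore never listed in a summary) cannot be removed by it, which is the safety property the text relies on.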
-   +
-  BUT PLEASE be careful and have a look at your summary mail every day. If +
-  you use the spam_delete script, mail is deleted some days after it has +
-  been listed in a summary sent to you. If you don't look at the summary you +
-  can lose important mail! +
-   +
-//to be even safer, one could keep track of the spamreport msg/file, check if it's been read, then mark OK-to-delete entries in that summary, after their grace period expired. --paolo// +
-   +
-  If the user deletes mail from the spam folder with the MUA, it still +
-  exists due to the mh mail format (as mentioned above). Only the file name is +
-  changed. File "1" becomes ",1" when deleted with the +
-  MUA. Files starting with a comma are also deleted by the +
-  spam_delete script once the specified maximum age is exceeded. +
-   +
-  Again it is necessary to set some important things: the user's MAIL +
-  folder, the spam directory SPAMDIR, the SPAMCHECK file and additionally +
-  the maximum age of spam mails after review in seconds (default is 7 +
-  days). +
-   +
-  The deletion step of the script is commented out by default, +
-  so that the functioning of the script can be tested first. If +
-  everything runs to our satisfaction, we can simply uncomment the lines +
-  with "rm" (delete the "#"). +
-   +
-  Now we can set up a cron job for the spam_delete script to let it run +
-  every day or week or so. +
-  - http://wwwstud.rz.uni-leipzig.de/~pge03gej/daten/spam/spam_check +-----------------------------------------------------------------------------------
-  - http://wwwstud.rz.uni-leipzig.de/~pge03gej/daten/spam/.scan.spam +
-  - http://wwwstud.rz.uni-leipzig.de/~pge03gej/daten/spam/spam_delete +
-  - http://wwwstud.rz.uni-leipzig.de/~pge03gej/daten/spam/spam_check.png+
 +//sorry for the delay; also, Uwe's mailserver is refusing my mails... -- p//
 +
------------------------------------------------------------------------- -------------------------------------------------------------------------
[[https://lists.sourceforge.net/lists/listinfo/crm114-general]] [[https://lists.sourceforge.net/lists/listinfo/crm114-general]]
 
 
daily_email_statistics.txt · Last modified: 2008/07/02 18:59 by oopla
 