|
daily_email_statistics 2008/01/02 01:06 |
daily_email_statistics 2008/07/02 18:59 current |
| - | Date: Tue, 1 Jan 2008 13:42:02 +0100 | + | Date: Sat, 3 May 2008 16:22:19 +0200 |
| | From: Uwe Scholz <turboscholz@xxx> | | From: Uwe Scholz <turboscholz@xxx> |
| - | To: crm114-general@lists.sourceforge.net | + | Subject: daily email statistics |
| - | Message-ID: <20080101124202.GA6268@localhost> | + | Message-ID: <20080503142219.GA4791@uwe-desktop> |
| - | Subject: [Crm114-general] Daily spam summary (and auto-delete) HOWTO | + | |
| | | | |
| - | Hello! | + | Hello Paolo, |
| | | | |
| - | A few weeks ago I saw that there is an empty entry in the | + | today I've made some changes to my daily mail summary script. Now it is |
| - | HOWTO section of the wiki, namely the "Daily email statistics" thing. | + | possible to choose whether the summary of mails is sorted by date of |
| | + | arrival or by spam probability. For this purpose, I had to include two |
| | + | new files (scan.spam_dat and scan.spam_pos) and delete an old one |
| | + | (scan.spam). |
| | | | |
| - | So I decided to translate and publish the material I wrote about | + | I'm still using 20041231.BlameSanAndreas (TRE 0.7.5 (LGPL)), because |
| - | this issue some time ago, and I hope that some people can check it and | + | the Gentoo crm maintainers have not yet marked a later version of 64-bit crm as |
| - | look for program (and spelling) mistakes. If all is OK, we can copy | + | stable. So I don't know whether this script works with later |
| - | the text and the accompanying scripts into the wiki. | + | versions, too. |
| | | | |
| - | It would be great if anyone could look through and try the scripts I have | + | What I want to say is that I changed the files in my |
| - | written. They all work well and I can't complain about anything. | + | university directory. Now there is only one archive containing all the script |
| | + | files: |
| | | | |
| - | Ciao and a happy new year! | + | http://wwwstud.rz.uni-leipzig.de/~pge03gej/daten/spam/spam-summary.tar.gz |
| - | Uwe | + | |
| - | | + | |
| - | ------------------------------------------------------------------------ | + | |
| - | Because of a newspaper article about CRM114 as a programming language | + | |
| - | which can also work as a spam filter, I decided to have a look at it. How | + | |
| - | it works is described in great detail in the CRM114_Mailfilter_HOWTO | + | |
| - | inside the CRM114 package. The approach described here is not limited | + | |
| - | to CRM114; it can also be used with SpamAssassin or with Procmail | + | |
| - | alone. All you have to do then is adjust your procmailrc and | + | |
| - | the nmh setup, but just read on... | + | |
| | | | |
| - | Everything ran great, and from then on my spam was filtered by CRM114 and | + | I have also included your comments from |
| - | Procmail. But something was missing ... I was sooo lazy and had no | + | http://crm114.sourceforge.net/wiki/doku.php?id=daily_email_statistics |
| - | desire to check the spam folder every day for wrongly sorted mail | + | into the README.txt. So I think you can delete the page entry and just |
| - | that had landed there. What to do? There had to be a script that | + | insert a small link to the archive above with a short explanation or |
| - | sends me a daily summary of new spam mails. Therefore I wrote this | + | so. That way it is easier for me to update the script as quickly as |
| | + | possible in my web directory. |
| | | | |
| - | | + | Thanks in advance |
| - | SPAM-MAIL-STATISTIC- (AND AUTO-DELETE)-HOWTO | + | Uwe |
| - | ******************************************** | + | |
| - | | + | |
| - | | + | |
| - | 1: Installing the Spam-Filter and Getting the Summary Script Working | + | |
| - | ******************************************************************** | + | |
| - | | + | |
| - | | + | |
| - | If you have a working spam filter installed (no matter whether | + | |
| - | SpamAssassin, CRM114 or whatever) and separate your mail with Procmail | + | |
| - | into a spam folder in mh format, you can read on at section 1b. | + | |
| - | | + | |
| - | 1a) Sorting out Spam with Procmail and CRM114 | + | |
| - | ********************************************* | + | |
| - | | + | |
| - | First, I installed CRM114 as described in the | + | |
| - | CRM114_Mailfilter_HOWTO. Mails recognized as spam are now sorted out by | + | |
| - | Procmail. The necessary entries in the .procmailrc are (see "Step 5" | + | |
| - | of the mentioned HOWTO): | + | |
| - | | + | |
| - | MAILDIR=$HOME/Mail | + | |
| - | :0fw: .crm114.lock | + | |
| - | | /usr/bin/crm -u /home/uwe/.crm114 mailfilter.crm | + | |
| - | | + | |
| - | :0: | + | |
| - | * ^X-CRM114-Status: SPAM.* | + | |
| - | | rcvstore +Spam | + | |
| - | | + | |
| - | | + | |
| - | The second part differs from the .procmailrc configuration of the | + | |
| - | CRM114 HOWTO in its action line | + | |
| - | | + | |
| - | | rcvstore +Spam | + | |
| - | | + | |
| - | As an explanation, here is a small excerpt about the mh mailbox format from | + | |
| - | http://www.linux-user.de: | + | |
| - | | + | |
| - | --- | + | |
| - | "Mh" is the abbreviation for "mail handler" (mh/nmh). It is a system of | + | |
| - | command-line tools for handling email in a modular way -- but here we are | + | |
| - | only interested in the structure of an mh folder. It contains | + | |
| - | consecutively numbered files, where every file is one mail. When a new | + | |
| - | mail arrives, the next free number is simply used. | + | |
| - | --- | + | |
| - | | + | |
| - | Such an mh folder is very convenient and practical for our problem, | + | |
| - | because we can identify individual spam messages simply by their file | + | |
| - | name. | + | |
| - | | + | |
| - | When a mail is deleted inside the mh folder by the MUA, the mail file still | + | |
| - | exists afterwards, but its file name is preceded by a comma. These | + | |
| - | files can then be deleted by a cron job. The comma convention is nice, because | + | |
| - | it gives you the possibility to undelete mails. | + | |
| - | | + | |
| - | //that's like "T" flag/suffix for files in maildir format --paolo// | + | |
| - | | + | |
| - | What does rcvstore do? Rcvstore is a program from the nmh package (a | + | |
| - | package of mail processing programs for the console). In order to | + | |
| - | execute programs of this package, it is advisable to add the paths | + | |
| - | /usr/lib/mh and /usr/bin/mh to the PATH variable. | + | |
| - | | + | |
| - | Rcvstore stores mail messages coming from procmail in the specified | + | |
| - | folder "Spam" in the mail directory in mh format. To let rcvstore know | + | |
| - | where this directory is, you have to run the command | + | |
| - | | + | |
| - | install-mh | + | |
| - | | + | |
| - | after the installation of nmh. It will ask you for your standard mail | + | |
| - | directory. Usually, this is "Mail" in your home directory. "rcvstore | + | |
| - | +Spam" will then save the mail in the folder "~/Mail/Spam" (so please | + | |
| - | run "mkdir ~/Mail/Spam"). | + | |
| - | | + | |
| - | For MUAs such as "mutt" you have to create the file ".mh_sequences" in | + | |
| - | the Spam folder, so that the MUA knows that this folder is in mh format: | + | |
| - | | + | |
| - | touch ~/Mail/Spam/.mh_sequences | + | |
| - | | + | |
| - | The .procmailrc-entry | + | |
| - | | + | |
| - | :0: | + | |
| - | * ^X-CRM114-Status: SPAM.* | + | |
| - | | rcvstore +Spam | + | |
| - | | + | |
| - | //and for maildir, we'd use:// | + | |
| - | | safecat .Spam/tmp .Spam/new | + | |
| - | //since procmail sets its CWD to Mail, and maildir folders start with a dot (usually) --paolo// | + | |
| - | | + | |
| - | thus stores mails marked as spam in the directory ~/Mail/Spam in | + | |
| - | mh format. This gives each mail a separate file with consecutive | + | |
| - | numbers as file names, starting at 1. | + | |
| - | | + | |
| - | Half of the problem is now solved. | + | |
| - | | + | |
| - | 1b) Sending the Summary Message | + | |
| - | ******************************* | + | |
| - | | + | |
| - | New spam mails now go into the folder ~/Mail/Spam. The mails have | + | |
| - | consecutive numbers: 1, 2, 3, ... | + | |
| - | | + | |
| - | How do we get Linux to send us a daily summary of the new spam mails? | + | |
| - | | + | |
| - | Logically, the responsible script needs a file in | + | |
| - | which it records which mails have already been reviewed once. | + | |
| - | | + | |
| - | touch ~/Mail/Spam/.spam_check | + | |
| - | | + | |
| - | The script I wrote for this task can be downloaded [1]. In it, you | + | |
| - | have to enter user-specific settings: the standard mail directory MAIL (default: | + | |
| - | Mail), the spam folder SPAM inside the mail directory | + | |
| - | (default: Spam) and the spam-check file SPAMCHECK that we have just | + | |
| - | created. | + | |
| - | | + | |
| - | How the script works: | + | |
| - | | + | |
| - | The principle is quite simple: for each mail file in the SPAMDIR folder, | + | |
| - | the script checks whether an entry for the mail exists in the file | + | |
| - | ".spam_check". If not, the mail file is new, and a "check" entry | + | |
| - | with the file name and the current time (we need it later) is written to the | + | |
| - | SPAMCHECK file. At the next review there is already a check entry, and | + | |
| - | the associated mail is not considered new. | + | |
| - | | + | |
| - | A line in the SPAMCHECK file has the following form: | + | |
| - | FILENAME:TIME, where TIME is the moment when the entry was created. | + | |
| - | | + | |
| - | The date, the sender and the subject of every new spam mail (exactly as | + | |
| - | many as there are new entries in the SPAMCHECK file) are read out with the | + | |
| - | program "scan" from the nmh package. They are then summarized and sent by mail to | + | |
| - | the user. With a cron job the script can be run once a day, and from then | + | |
| - | on we get a nice and concise summary of new spam mails. | + | |
| - | | + | |
| - | I didn't like the overview produced by the scan program so much, so I wrote my | + | |
| - | own file [2] for the format of the output. This file can be passed to the scan command with the | + | |
| - | option "-form scan.spam". Examples of such files are | + | |
| - | included in the nmh package. | + | |
| - | | + | |
| - | You can see what a summary mail looks like in this screenshot [4] (the | + | |
| - | subject is in German because I didn't translate the script just for | + | |
| - | the screenshot). | + | |
| - | | + | |
| - | A little note for those who want to know it exactly: | + | |
| - | **************************************************** | + | |
| - | | + | |
| - | New mails always receive the next number after the highest one in use. Unfortunately, we | + | |
| - | get a problem if one deletes the newest mail with the MUA. For example, | + | |
| - | suppose we have three mails: 1, 2 and 3. All mails have been | + | |
| - | reviewed by the spam_check script. Now we delete the newest mail with | + | |
| - | the MUA, so we get: 1, 2, and ",3". After this a new mail arrives and | + | |
| - | gets the next available number: "3". Now the spam_check script | + | |
| - | could "think" that the mail numbered 3 has already been checked, because | + | |
| - | its entry in the SPAMCHECK file exists. But this 3 is a new file! In | + | |
| - | order to prevent this error, the spam_check script compares the age of | + | |
| - | the mail file with the corresponding time in the SPAMCHECK file: if the | + | |
| - | mail file is younger than the check entry, it must be a new mail. | + | |
| - | It is scanned too and the check entry is updated. | + | |
| - | | + | |
| - | //no such problems with maildir instead, since filenames are unique for sure --paolo// | + | |
| - | | + | |
| - | 2) Automatically Deleting Spam | + | |
| - | ****************************** | + | |
| - | | + | |
| - | That was already great, right? But ... | + | |
| - | | + | |
| - | Now it bugged me that the number of stupid spam mails on my | + | |
| - | hard disk kept growing, so I decided to write a script to | + | |
| - | automatically delete old spam [3]. | + | |
| - | | + | |
| - | What must be avoided in any case is deleting mails | + | |
| - | that the user has never seen. Just imagine the following situation: A user | + | |
| - | receives a summary of new spam mails every morning. In the course of the | + | |
| - | day, new e-mails are collected by fetchmail. One of them is mistakenly | + | |
| - | marked as spam and therefore lands in the "Spam" folder. In the evening | + | |
| - | the user switches off the PC. Now he goes on holiday for half a month. After | + | |
| - | the relaxing(!) days with his family he returns home and turns on the | + | |
| - | PC again. Because of the settings in his spam_delete script, spam mails | + | |
| - | older than a week are deleted, and so even the "wrong" spam | + | |
| - | mail gets deleted, although it was not spam. ERROR! | + | |
| - | | + | |
| - | //a simple/safe rule could be: set the grace period to twice the longest expected full user vacation (no mail checking) time --paolo// | + | |
| - | | + | |
| - | How can we prevent this? | + | |
| - | | + | |
| - | The SPAMCHECK file records when mails have been | + | |
| - | registered. If the spam_delete script checks when each mail in the SPAMCHECK file was | + | |
| - | registered, we get a fairly safe concept: the mails that can be | + | |
| - | deleted from the spam folder are at most those that the user | + | |
| - | has already seen in the summary mail and whose review was already a | + | |
| - | certain time ago. Mails which haven't been checked by the spam_check | + | |
| - | script cannot be deleted. | + | |
| - | | + | |
| - | If the check entry in the SPAMCHECK file is older than the specified age | + | |
| - | MAXAGE (in seconds), the script deletes the file and the corresponding | + | |
| - | entry in the SPAMCHECK file. | + | |
| - | | + | |
| - | BUT PLEASE be careful and have a look at your summary mail every day. If | + | |
| - | you use the spam_delete script, mail is deleted some time after it has | + | |
| - | been sent to you in a summary. If you don't look into the summary you | + | |
| - | can lose important mail! | + | |
| - | | + | |
| - | //to be even safer, one could keep track of the spamreport msg/file, check if it's been read, then mark OK-to-delete entries in that summary, after their grace period expired. --paolo// | + | |
| - | | + | |
| - | If the user deletes mail from the spam folder with the MUA, it still | + | |
| - | exists due to the mh mail format (as mentioned above). Only the file name is | + | |
| - | changed. File "1" becomes ",1" when deleted with the | + | |
| - | MUA. Files starting with commas are also deleted by the | + | |
| - | spam_delete script once they exceed the specified maximum age. | + | |
| - | | + | |
| - | Again it is necessary to set some important things: the user's MAIL | + | |
| - | folder, the spam directory SPAMDIR, the SPAMCHECK file and additionally | + | |
| - | the maximum age of spam mails after review in seconds (default is 7 | + | |
| - | days). | + | |
| - | | + | |
| - | The deletion step of the script is commented out by default, | + | |
| - | so that the functioning of the script can be tested first. If | + | |
| - | everything runs to our satisfaction, we can simply uncomment the lines | + | |
| - | with "rm" (delete the "#"). | + | |
| - | | + | |
| - | Now we can set up a cron job for the spam_delete script to let it run | + | |
| - | every day or week or so. | + | |
| | | | |
| - | - http://wwwstud.rz.uni-leipzig.de/~pge03gej/daten/spam/spam_check | + | ----------------------------------------------------------------------------------- |
| - | - http://wwwstud.rz.uni-leipzig.de/~pge03gej/daten/spam/.scan.spam | + | |
| - | - http://wwwstud.rz.uni-leipzig.de/~pge03gej/daten/spam/spam_delete | + | |
| - | - http://wwwstud.rz.uni-leipzig.de/~pge03gej/daten/spam/spam_check.png | + | |
| | | | |
| | + | //sorry for delay; also, Uwe's mailserver is refusing mail mails... -- p// |
| | + | |
| | ------------------------------------------------------------------------- | | ------------------------------------------------------------------------- |
| | | | |
| | [[https://lists.sourceforge.net/lists/listinfo/crm114-general]] | | [[https://lists.sourceforge.net/lists/listinfo/crm114-general]] |
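
The bookkeeping described in sections 1b and 2 of the older revision (FILENAME:TIME entries, the "file younger than its check entry" rule, and the MAXAGE test) can be illustrated with a small sketch. This is not Uwe's script: it is written in Python purely for illustration, the function names load_check_entries, find_new_mails and find_expired are hypothetical, and it assumes that TIME is stored as integer seconds since the epoch, which the HOWTO does not state explicitly.

```python
import os
import time


def load_check_entries(check_path):
    """Read FILENAME:TIME lines into a dict {filename: registered_time}.

    Assumption: TIME is an integer number of seconds since the epoch.
    Malformed or empty lines are skipped.
    """
    entries = {}
    if not os.path.exists(check_path):
        return entries
    with open(check_path) as fh:
        for line in fh:
            name, sep, stamp = line.strip().partition(":")
            if sep and stamp.isdigit():
                entries[name] = int(stamp)
    return entries


def find_new_mails(spam_dir, entries, now=None):
    """Return the names of new mails and update `entries` in place.

    A mail counts as new if it has no check entry yet, or if its file is
    younger than its check entry (the number was reused after a deletion).
    """
    now = int(time.time()) if now is None else now
    # Only purely numeric names are mh messages; this skips ",3",
    # ".mh_sequences" and ".spam_check".
    names = sorted((n for n in os.listdir(spam_dir) if n.isdigit()), key=int)
    new = []
    for name in names:
        mtime = os.path.getmtime(os.path.join(spam_dir, name))
        registered = entries.get(name)
        if registered is None or mtime > registered:
            new.append(name)
            entries[name] = now
    return new


def find_expired(entries, max_age, now=None):
    """Return names whose check entry is older than max_age seconds."""
    now = int(time.time()) if now is None else now
    return sorted(
        (name for name, stamp in entries.items() if now - stamp > max_age),
        key=int,
    )
```

Only mails that already have a check entry can show up in the result of find_expired, which mirrors the safety argument in section 2: a mail that was never listed in a summary is never a deletion candidate. The sketch leaves out sending the summary, rewriting the SPAMCHECK file and the handling of comma-prefixed files.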