<?xml version='1.0' encoding='utf-8' ?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
<!ENTITY % BOOK_ENTITIES SYSTEM "Administrator_Guide.ent">
%BOOK_ENTITIES;
]>
<chapter id="chap-Administrator_Guide-Combating_Spam">
    <title>Combating Spam</title>
    <para>
        Kolab Groupware includes <application>SpamAssassin</application>, a fast, well-established anti-spam solution with a large community of supporters contributing not only to the code, but to rulesets as well.
    </para>
    <para>
        Combating spam is always a tricky situation. On the organizational level, a strategy has to be formulated to combat spam in order to achieve the maximum flexibility and effectiveness for individual users, separate organizations, and the deployment as a whole.
    </para>
    <para>
        A common deployment is to define deployment-wide user preferences and to use a single, deployment-wide set of rules for <application>SpamAssassin</application> to operate with, including the Bayes database(s).
    </para>
    <para>
        The problems start when individual users mark legitimate email as spam, most notably the company newsletter or correspondence they opted in to some time ago, but no longer wish to receive. Users tend to ignore the long-term effects of marking these messages as spam, if they are aware of any at all, and just want such messages out of their way.
    </para>
    <para>
        Common examples of the sort of messages that are often marked as spam while being legitimate traffic include:
    </para>
    <para>
        <itemizedlist>
            <listitem>
                <para>
                    Newsletters, where users, rather than unsubscribe, mark legitimate messages as spam,
                </para>

            </listitem>
            <listitem>
                <para>
                    Notifications from social networks such as Google+, Facebook, Twitter, etc., where users, rather than adjust their notification preferences, mark legitimate messages as spam,
                </para>

            </listitem>
            <listitem>
                <para>
                    Notifications from forums and/or services.
                </para>

            </listitem>

        </itemizedlist>

    </para>
    <para>
        If enough users mark these messages as spam, the system will start to recognize these messages as spam, and other users may be prevented from receiving the same or similar messages in their INBOX.
    </para>
    <para>
        <application>Amavis</application>, the default content filter performing anti-virus and anti-spam filtering, wraps around <application>SpamAssassin</application> to achieve this flexibility.
    </para>
    <para>
        Separate Bayes databases can be used on a per-recipient and per-policy-bank basis, by means of separate <application>SpamAssassin</application> configuration files and SQL Bayes usernames.
    </para>
    <para>
        Without over-complicating things, a common scenario that sufficiently serves the anti-spam effort includes the following aspects:
    </para>
    <para>
        <itemizedlist>
            <listitem>
                <para>
                    A <code>shared/Spam</code> folder is created, with permissions for all users to look up, read, and insert messages. The intention is that users move or copy messages they think are spam into this folder.
                </para>
                <note>
                    <para>
                        Note that, optionally, the permission for users to maintain the 'seen' state of messages could be withheld, which in combination with <literal>sharedseen</literal> could provide a mechanism that allows users to view which messages have been learned as spam in the past.
                    </para>

                </note>

            </listitem>

        </itemizedlist>

    </para>
    <section id="sect-Administrator_Guide-Combating_Spam-Learning_About_New_Spam">
        <title>Learning About New Spam</title>
        <para>
            Optionally, find all folders named "Spam" or "Junk":
        </para>
        <para>

<screen>$ <userinput>find /var/spool/imap/ -type d \( -name "Spam" -o -name "Junk" \)</userinput></screen>

        </para>
        <note>
            <para>
                Finding all folders called "Spam" or "Junk" can potentially take a long time, depending on the size of the spool.
            </para>

        </note>
        <remark> We would like to do the very same through IMAP, with annotations / SPECIAL-USE. </remark>
        <para>
            Learn the messages in each of these folders as spam:
        </para>
        <para>

<screen>$ <userinput>sa-learn --spam /path/to/folder/[0-9]*.</userinput></screen>

        </para>
        <note>
            <para>
                <application>SpamAssassin</application> will not learn about messages it has learned about before. There is no need to purge or delete the messages that <application>SpamAssassin</application> has already learned about; purging or deleting those messages only helps to speed up the learning run.
            </para>

        </note>
        <warning>
            <para>
                Do NOT delete the messages from the filesystem directly. Please refer to <xref linkend="sect-Administrator_Guide-Combating_Spam-Expiring_Messages_from_SpamHam_Shared_Folders" /> for ways to purge, expire and/or delete messages from spam folders in a sustainable way.
            </para>

        </warning>

    </section>

    <section id="sect-Administrator_Guide-Combating_Spam-Preseeding_the_Bayes_Database">
        <title>Preseeding the Bayes Database</title>
        <para>
            As Bayes only becomes effective after it has learned about at least 200 ham and 200 spam messages (the <application>SpamAssassin</application> defaults), it is recommended to preseed the Bayes database with some high-quality ham and spam. This can be done using the <emphasis>SpamAssassin Public Corpus</emphasis>, which consists of many messages qualified as ham and spam, collected from a variety of sources.
        </para>
        <para>
            The SpamAssassin Public Corpus can be found at <ulink url="http://spamassassin.apache.org/publiccorpus/" />.
        </para>
        <procedure id="proc-Administrator_Guide-Preseeding_the_Bayes_Database-Preseeding_the_Bayes_Database_using_SpamAssassin_Public_Corpus">
            <title>Preseeding the Bayes Database using SpamAssassin Public Corpus</title>
            <step>
                <para>
                    Obtain the ham and spam archives:
                </para>
                <para>

<screen># mkdir -p /tmp/salearn
# cd /tmp/salearn
# wget --recursive --timestamping --no-directories --level=1 --reject=gif,png,html,=A,=D http://spamassassin.apache.org/publiccorpus/</screen>

                </para>

            </step>
            <step>
                <para>
                    Extract the archives:
                </para>
                <para>

<screen># for archive in *.tar.bz2; do tar jxf "$archive"; done</screen>

                </para>

            </step>
            <step>
                <para>
                    For all files in the ham directories, learn those messages as ham:
                </para>
                <para>

<screen># sa-learn --progress --ham *ham*/*</screen>

                </para>
                <note>
                    <para>
                        The total number of messages is about 7000. Learning about all of them may take quite a while. We recommend running the command in a <command>screen</command> session.
                    </para>

                </note>

            </step>
            <step>
                <para>
                    For all files in the spam directories, learn those messages as spam:
                </para>
                <para>

<screen># sa-learn --progress --spam *spam*/*</screen>

                </para>
                <note>
                    <para>
                        The total number of messages is about 2500. Learning about all of them may take quite a while. We recommend running the command in a <command>screen</command> session.
                    </para>

                </note>

            </step>

        </procedure>


    </section>

    <section id="sect-Administrator_Guide-Combating_Spam-Trapping_Massive_Amounts_of_Spam">
        <title>Trapping Massive Amounts of Spam</title>
        <para>
            To learn about spam quickly, allow the Cyrus IMAP postuser to post into a shared folder that will be included in the regular <command>sa-learn</command> run.
        </para>
        <procedure id="proc-Administrator_Guide-Trapping_Massive_Amounts_of_Spam-Setting_a_Trap_for_Spam">
            <title>Setting a Trap for Spam</title>
            <step>
                <para>
                    Set up the Cyrus IMAP postuser, using the <literal>postuser</literal> setting in <filename>/etc/imapd.conf</filename>.
                </para>
                <para>
                    If, for example, the <literal>postuser</literal> is set to <literal>bb</literal>, mail sent to <literal>bb+shared.blah</literal> will be delivered to the <literal>shared.blah</literal> folder.
                </para>
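                <para>
                    As a minimal sketch, and assuming the example user name <literal>bb</literal> mentioned above, the corresponding line in <filename>/etc/imapd.conf</filename> could look as follows. The user name is only an illustration; choose one that fits your deployment.
                </para>
                <para>

<screen>postuser: bb</screen>

                </para>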

            </step>
            <step>
                <para>
                    Create a folder such as <literal>shared/Spam</literal>.
                </para>

            </step>
            <step>
                <para>
                    Set permissions:
                </para>
                <para>

<screen>cyradm&gt; sam shared/Spam &lt;postuser&gt; p</screen>

                </para>

            </step>
            <step>
                <para>
                    Submit / subscribe to known spam aggregators (search Google for "free email offers")
                </para>

            </step>
            <step>
                <para>
                    Optionally, set the <literal>luser_relay</literal> option in Postfix, to trap all messages sent to non-existent recipients.
                </para>
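                <para>
                    The following is a sketch for <filename>/etc/postfix/main.cf</filename>, not a tested configuration. It assumes the <literal>postuser</literal> is <literal>bb</literal>, that the trap folder can be addressed as <literal>shared.Spam</literal>, and that <literal>example.org</literal> stands in for your own domain; all three are placeholders. Note that <literal>luser_relay</literal> only applies to mail handled by the Postfix <command>local</command> delivery agent, and that <literal>local_recipient_maps</literal> must be empty, as otherwise Postfix rejects mail for unknown local recipients before <literal>luser_relay</literal> is consulted.
                </para>
                <para>

<screen>luser_relay = bb+shared.Spam@example.org
local_recipient_maps =</screen>

                </para>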

            </step>

        </procedure>


    </section>

    <section id="sect-Administrator_Guide-Combating_Spam-Tweaking_Bayes_Scores">
        <title>Tweaking Bayes' Scores</title>
        <para>
            The score Bayes contributes depends on the probability Bayes assigns to a message being spam. The rules used to match a message's probability of being spam are systematically prefixed with <literal>BAYES_</literal>, followed by the percentage of likelihood of the message being spam.
        </para>
        <para>
            Because there is rarely a 100% certainty of a message being spam, the highest percentage is 99%. By default, the configuration attaches a 3.5 score to this probability. Depending on the configuration value for <literal>$sa_tag2_level_deflt</literal> supplied in the Amavis configuration file, <literal>6.31</literal> by default, it is unlikely spam will reach the cut-off point of actually being marked as spam solely on the basis of Bayes' probability score.
        </para>
        <para>
            It is therefore recommended to increase the score attached to messages with a 99% probability of being spam to at least <literal>6.308</literal>, if not simply <literal>6.31</literal>. Using <literal>6.308</literal>, spam is not tagged solely on the basis of Bayes' 99% probability score. In addition, the message must match at least one other pattern, even one with a very small <literal>0.01</literal> score, such as the message being in HTML (and HTML only) or using a big font.
        </para>
        <para>
            Some spam is submitted through systems listed at <ulink url="http://dnswl.org" />, a collaborative false-positive protection mechanism, for which a default score of 1 is subtracted from the total score. If such spam reaches you, consider increasing the score for <literal>BAYES_99</literal> by another point.
        </para>
        <procedure id="proc-Administrator_Guide-Tweaking_Bayes_Scores-Adjusting_the_Score_for_BAYES_99">
            <title>Adjusting the Score for <literal>BAYES_99</literal></title>
            <step>
                <para>
                    Edit <filename>/etc/mail/spamassassin/local.cf</filename>, and make sure the following line is present:
                </para>
                <para>

<screen>score BAYES_99 7.308</screen>

                </para>

            </step>
            <step>
                <para>
                    Reload or restart the <literal>amavisd-new</literal> service:
                </para>
                <para>

<screen># service amavisd-new restart</screen>

                </para>

            </step>

        </procedure>


    </section>

    <section id="sect-Administrator_Guide-Combating_Spam-Learning_about_Ham">
        <title>Learning about Ham</title>
        <para>
            It is important to not just learn about spam, but learn about ham, legitimate email messages, as well. When not learning about ham, the anti-spam system will become heavily biased towards spam, and ultimately classify all email messages as such.
        </para>
        <para>
            Learning about ham follows a slightly different doctrine than learning about spam. Most importantly, ham is not to be posted to a shared folder whose contents everyone else can read. Instead, users are most commonly instructed to copy or move messages that have been classified as spam, but are actually ham, to a "Not Junk" or "Ham" folder in their personal namespace.
        </para>
        <para>
            It is recommended both to instruct users to use ham folders and to create these folders by default, regardless of whether they are called "Ham", "Not Junk", or an equivalent localized name.
        </para>
        <para>
            Alternatively, you could learn about ham from people's INBOX folders.
        </para>

    </section>

    <section id="sect-Administrator_Guide-Combating_Spam-Expiring_Messages_from_SpamHam_Shared_Folders">
        <title>Expiring Messages from Spam/Ham (Shared) Folders</title>
        <para>
            When you share folders to which users can move or copy ham and/or spam messages, it is sensible to purge the contents of those folders regularly, or the folder size continues to increase indefinitely. Run the expiry after <command>sa-learn</command> has been run.
        </para>
        <note>
            <para>
                Running <command>ipurge</command> to purge messages from mail folders works independently of the <literal>expunge_mode</literal> setting, and independently of the <literal>expire</literal> annotation as well.
            </para>
            <para>
                Using the <literal>expire</literal> annotation is sufficient to purge the contents of the folder. Whether <literal>expunge_mode</literal> is set to <literal>delayed</literal> or <literal>immediate</literal>, the Bayes database is only updated with messages Bayes has not learned about before.
            </para>

        </note>
        <para>
            To purge the contents of a mail folder, run <command>ipurge</command>:
        </para>
        <para>

<screen>$ <userinput>/usr/lib/cyrus-imapd/ipurge -d 1 user/folder/name@domain.tld</userinput>
(...output abbreviated...)
$ <userinput>/usr/lib/cyrus-imapd/ipurge -i -d 1 user/folder/name@domain.tld</userinput></screen>

        </para>

    </section>

    <section id="sect-Administrator_Guide-Combating_Spam-Updating_the_Spam_Rules">
        <title>Updating the Spam Rules</title>
        <para>
            As part of the <application>SpamAssassin</application> package, a utility is provided to update the rulesets from the configured channels.
        </para>
        <para>
            For systems on which either the SpamAssassin daemon or the Amavis daemon is running, the software packages automatically install a nightly cronjob to ensure the rules are updated frequently.
        </para>
        <para>
            The spam rulesets are updated using the <command>sa-update</command> command, supplying one or more <literal>--channel</literal> options specifying the names of the ruleset channels to update, and (optionally) one or more <literal>--gpgkey</literal> options specifying the Pretty Good Privacy (PGP) keys that are trusted to have signed the rulesets.
        </para>
        <para>
            The cronjob that is installed by default executes <command>sa-update</command> for all channels defined in <filename>/etc/mail/spamassassin/channels.d/</filename>, with one file per channel.
        </para>
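        <para>
            As an illustration of the options described above, a manual run for a single channel could look like the following sketch. The channel <literal>updates.spamassassin.org</literal> is the official SpamAssassin rules channel; <replaceable>KEYID</replaceable> is a placeholder for the identifier of the key you trust for that channel, and is not a value taken from this guide.
        </para>
        <para>

<screen># sa-update --channel updates.spamassassin.org --gpgkey <replaceable>KEYID</replaceable></screen>

        </para>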

    </section>

    <section id="sect-Administrator_Guide-Combating_Spam-Bayes_SQL_Database_for_Distributed_Systems">
        <title>Bayes SQL Database for Distributed Systems</title>
        <para>
            If more than one system needs to make use of the Bayes database, consider using a network SQL Bayes database.
        </para>
        <procedure id="proc-Administrator_Guide-Bayes_SQL_Database_for_Distributed_Systems-Setting_Up_the_Bayes_Database">
            <title>Setting Up the Bayes Database</title>
            <step>
                <para>
                    Create the database:
                </para>
                <para>

<screen># mysql -e 'create database <emphasis>kolab_bayes</emphasis><footnote> <para>
                        Replace with desired database name
                    </para>
                    </footnote>;'</screen>

                </para>

            </step>
            <step>
                <para>
                    Create a user:
                </para>
                <para>

<screen># mysql -e "CREATE USER '<emphasis>kolab_bayes</emphasis><footnote> <para>
                        Replace with desired username
                    </para>
                    </footnote>'@'%' IDENTIFIED BY '<emphasis>Welcome2KolabSystems</emphasis><footnote> <para>
                        Replace with desired password
                    </para>
                    </footnote>';"</screen>

                </para>

            </step>
            <step>
                <para>
                    Grant the appropriate privileges:
                </para>
                <para>

<screen># mysql -e "GRANT USAGE ON * . * TO '<emphasis>kolab_bayes</emphasis><footnote> <para>
                        Replace with desired username
                    </para>
                    </footnote>'@'%' IDENTIFIED BY '<emphasis>Welcome2KolabSystems</emphasis><footnote> <para>
                        Replace with desired password
                    </para>
                    </footnote>' WITH MAX_QUERIES_PER_HOUR 0 MAX_CONNECTIONS_PER_HOUR 0 MAX_UPDATES_PER_HOUR 0 MAX_USER_CONNECTIONS 0;"
# mysql -e "GRANT ALL PRIVILEGES on \`<emphasis>kolab_bayes</emphasis><footnote> <para>
                        Replace with desired database name
                    </para>
                    </footnote>\` . * TO '<emphasis>kolab_bayes</emphasis><footnote> <para>
                        Replace with desired username
                    </para>
                    </footnote>'@'%';"</screen>

                </para>

            </step>
            <step>
                <para>
                    Reload the privileges:
                </para>
                <para>

<screen># mysql -e 'FLUSH PRIVILEGES;'</screen>

                </para>

            </step>
            <step>
                <para>
                    Download the latest Bayes Database SQL schema file from <ulink url="http://svn.apache.org/repos/asf/spamassassin/trunk/sql/bayes_mysql.sql">bayes_mysql.sql</ulink> (when using MySQL):
                </para>
                <para>

<screen># wget <ulink url="http://svn.apache.org/repos/asf/spamassassin/trunk/sql/bayes_mysql.sql" /></screen>

                </para>

            </step>
            <step>
                <para>
                    Insert this schema into the database:
                </para>
                <para>

<screen># mysql <emphasis>kolab_bayes</emphasis><footnote> <para>
                        Replace with database name used.
                    </para>
                    </footnote> &lt; bayes_mysql.sql</screen>

                </para>

            </step>

        </procedure>

        <procedure id="proc-Administrator_Guide-Bayes_SQL_Database_for_Distributed_Systems-Migrating_Existing_Bayes_Databases">
            <title>Migrating Existing Bayes Database(s)</title>
            <step>
                <para>
                    First, export any existing Bayes databases by running the following command on each server with a Bayes database:
                </para>
                <para>

<screen># sa-learn --backup &gt; <filename>/path/to/sa_backup.txt</filename></screen>

                </para>

            </step>
            <step>
                <para>
                    A recommended, but completely optional, step is to clear the current copy of the database:
                </para>
                <para>

<screen># sa-learn --clear</screen>

                </para>

            </step>
            <step>
                <para>
                    Modify <filename>/etc/mail/spamassassin/local.cf</filename> to contain the following settings:
                </para>
                <para>

<screen>bayes_store_module        Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn            DBI:mysql:<emphasis>kolab_bayes</emphasis><footnote> <para>
                        Replace with database name
                    </para>
                    </footnote>:<emphasis>mysql.domain.tld</emphasis><footnote> <para>
                        Replace with MySQL server host address
                    </para>
                    </footnote>
bayes_sql_username        <emphasis>kolab_bayes</emphasis><footnote> <para>
                        Replace with the username for database access
                    </para>
                    </footnote>
bayes_sql_password        <emphasis>Welcome2KolabSystems</emphasis><footnote> <para>
                        Replace with the user password
                    </para>
                    </footnote>
bayes_sql_override_username  root</screen>

                </para>

            </step>
            <step>
                <para>
                    Import the exported Bayes database(s):
                </para>
                <para>

<screen># sa-learn --restore <filename>/path/to/sa_backup.txt</filename></screen>

                </para>

            </step>

        </procedure>


    </section>

    <section id="sect-Administrator_Guide-Combating_Spam-Ensuring_Availability_of_Messages_Spam_Score">
        <title>Ensuring Availability of Messages' Spam Score</title>
        <para>
            For the purpose of troubleshooting, or in deployments with clients that have spam filtering capabilities, it is sensible to always insert the spam headers into email messages. This avoids clients scanning the message again, and helps when troubleshooting why mail may or may not have been filtered.
        </para>
        <para>
            To always insert the spam score into the message headers, find the line in <filename>/etc/amavisd/amavisd.conf</filename> that starts with <literal>$sa_tag_level_deflt</literal> and replace it with:
        </para>
        <para>

<screen>$sa_tag_level_deflt  = -10;</screen>

        </para>
        <para>
            While the score is available from the log files if the logging verbosity has been increased, some scenarios require the spam score to be included regardless of whether the traffic is inbound or outbound. An example is a mail gateway that handles both inbound and outbound traffic for an unknown, large, dynamic, or regularly changing list of domain name spaces, and that needs to protect senders as well as receivers from spam.
        </para>
        <para>
            Normally only inbound traffic is tagged, if at all, by recognizing the recipient domain name space as local. The setting <literal>@local_domains</literal> or, in later versions of Amavis, <literal>@local_domains_acl</literal> is used for this purpose.
        </para>
        <para>
            To recognize all domain name spaces as local, edit <filename>/etc/amavisd/amavisd.conf</filename> and make sure the following settings are not configured:
        </para>
        <para>
            <itemizedlist>
                <listitem>
                    <para>
                        <literal>@local_domains</literal>,
                    </para>

                </listitem>
                <listitem>
                    <para>
                        <literal>@local_domains_acl</literal>, and
                    </para>

                </listitem>
                <listitem>
                    <para>
                        any <literal>read_hash(\%local_domains)</literal> setting.
                    </para>

                </listitem>

            </itemizedlist>

        </para>
        <para>
            Ensure that the following setting is configured like so:
        </para>
        <para>

<screen language="Perl">$local_domains_re = new_RE( qr'.*' );</screen>

        </para>

    </section>


</chapter>