<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>Agile Scoring   </title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi</link>
    <description>Practical predictive modeling.</description>
    <language>en</language>

  <item>
    <title>More Data Beats Better Algorithms</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2008/04/07#more_data</link>
    <description>&lt;p&gt;Anand Rajaraman argues that &lt;b&gt;more data&lt;/b&gt; &amp;gt; &lt;b&gt;better algorithm&lt;/b&gt;.  Hear, hear.
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://anand.typepad.com/datawocky/2008/03/more-data-usual.html&quot;&gt;Part 1&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://anand.typepad.com/datawocky/2008/04/data-versus-alg.html&quot;&gt;Part 2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Good software is still required to crunch all this additional data, though.</description>
  </item>
  <item>
    <title>New site design</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2008/04/02#new_web_design</link>
    <description>&lt;p&gt;We're rolling out the redesigned ArrowModel web site today.</description>
  </item>
  <item>
    <title>ArrowModel 0.2</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2008/01/11#beta2</link>
    <description>
&lt;p&gt;Second beta of ArrowModel is out.  Registered users can 
&lt;a href=&quot;/download.html&quot;&gt;download it now&lt;/a&gt;.

&lt;p&gt;If you are not a registered user, but would like to give 
ArrowModel a try, please &lt;a href=&quot;/beta.html&quot;&gt;sign up&lt;/a&gt;.

&lt;p&gt;Highlights of the new version include:

&lt;ul&gt;
  &lt;li&gt;CSV import on Windows is at least 2x faster&lt;/li&gt;
  &lt;li&gt;Help files and Assistant (help viewer) GUI translations improved&lt;/li&gt;
  &lt;li&gt;Can check/uncheck all predictors using main menu or keyboard shortcuts&lt;/li&gt;
  &lt;li&gt;Lots of small usability improvements and bugfixes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ARM file format remains unchanged.  You will be getting
warning messages when opening models created with previous
versions, but they should work.

&lt;p&gt;Thank you for the feedback and support.</description>
  </item>
  <item>
    <title>Wir sprechen Deutsch</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/09/17#deutsch</link>
    <description>&lt;p&gt;&lt;a href=&quot;http://arrowmodel.com&quot;&gt;ArrowModel site&lt;/a&gt;  
&lt;a href=&quot;http://arrowmodel.com/de/index.html&quot;&gt;in German&lt;/a&gt;.</description>
  </item>
  <item>
    <title>Second beta (build 888)</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/07/19#odbc</link>
    <description>&lt;p&gt;Second beta of ArrowModel is out.  The biggest new feature is ODBC connectivity.</description>
  </item>
  <item>
    <title>New hosting</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/05/09#new_hosting</link>
    <description>ArrowModel moved to new hosting.  We apologize for the downtime and inconvenience.</description>
  </item>
  <item>
    <title>SAS language idiosyncrasies</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/05/08#sas</link>
    <description>&lt;p&gt;Bjarne Stroustrup once said that there are only two kinds of
programming languages: &lt;br&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;those people always bitch about and &lt;/li&gt;
&lt;li&gt;those nobody uses. &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I searched SAS-L for any criticism of SAS and found almost
none! That's kind of strange since I know that SAS is widely used. &lt;br&gt;
&lt;/p&gt;
&lt;p&gt;I have used SAS since it came out on the market in the early 70's.
At the time, I was delighted with the DATA step which saved me from
writing silly
little FORTRAN programs to manipulate my data into the form expected by
BMDP. That DATA step is the main reason SAS blew all its competitors
out of the water. The rest, as the saying goes, is history. Alas, when
a product becomes dominant, it often endows its developers with an
undesirable arrogance and a tendency to respond
&quot;That's the way we do it!&quot; to all suggestions for improvement.
&lt;/p&gt;
&lt;p&gt;I only realized the problem with SAS many years later, when I closely
studied other programming languages:
&lt;span style=&quot;font-weight: bold;&quot;&gt;&lt;br&gt;
SAS is probably among the worst widely used languages I know of.&lt;/span&gt;
&lt;/p&gt;
&lt;p&gt;Here are just a few examples: &lt;/p&gt;
&lt;hr width=&quot;30&quot;&gt;
&lt;p&gt;There are two ways to write &lt;span style=&quot;font-weight: bold;&quot;&gt;comments&lt;/span&gt; 
in SAS:
&lt;br&gt;
&lt;/p&gt;
&lt;pre&gt;/* C-like */&lt;br&gt;* and Fortran-like (with trailing semicolon);&lt;br&gt;&lt;/pre&gt;
&lt;p&gt; Neither of those can be nested. To comment out a block of code,
one needs to resort to the following trick: &lt;/p&gt;
&lt;pre&gt; %macro skip;&lt;br&gt;&lt;br&gt; Stuff here is not executed &lt;br&gt;&lt;br&gt; %mend skip;&lt;br&gt;&lt;/pre&gt;
&lt;p&gt;Why? We know it can be fixed. But SAS won't do it. &lt;/p&gt;
&lt;hr width=&quot;30&quot;&gt;
&lt;p&gt;There's a concept of &lt;span style=&quot;font-weight: bold;&quot;&gt;NULL &lt;/span&gt;(missing
value) in SAS, but it is not universally 
applied.  For example, an arithmetic operation between a missing value and
anything else results in a missing value, which is perfectly logical. But if you
compare a missing value to 
a numeric variable &lt;b&gt;&amp;mdash; surprise &amp;mdash;&lt;/b&gt; the result
is &lt;b&gt;NOT&lt;/b&gt; a missing value. &lt;/p&gt;
&lt;p&gt;Got that? In a comparison with a number, a missing value is
treated as if it is, of 
all things, minus infinity. Why? &lt;br&gt;
&lt;/p&gt;
&lt;hr width=&quot;30&quot;&gt;
&lt;p&gt;The notion of &lt;span style=&quot;font-weight: bold;&quot;&gt;naming convention&lt;/span&gt;
seems to elude SAS language
designers. Compare &lt;tt&gt;proc import&lt;/tt&gt; and &lt;tt&gt;proc export&lt;/tt&gt;:
&lt;pre&gt;
proc import 
     datafile='/somewhere/myfile.csv' 
     out=mydataset
     dbms=csv;
     run; 
proc export 
     data=mydataset
     outfile='/somewhere/myfile.csv' 
     dbms=csv;
     run;
&lt;/pre&gt;

&lt;p&gt;But why not this:&lt;/p&gt;
&lt;pre&gt;proc import 
    in='/somewhere/myfile.csv' 
    out=mydataset 
    dbms=csv;
    run; 
proc export 
    in=mydataset
    out='/somewhere/myfile.csv'
    dbms=csv;
    run;
&lt;/pre&gt;
&lt;p&gt;Which one is easier to remember?&lt;/p&gt;

&lt;hr width=&quot;30&quot;&gt;

&lt;p&gt;And speaking of &lt;tt&gt;proc import&lt;/tt&gt;, SAS will never finish if launched
on UNIX from the command line. Why? &lt;a
href=&quot;http://support.sas.com/techsup/unotes/SN/003/003610.html&quot;&gt;SAS
note SN-003610&lt;/a&gt; says: &lt;/p&gt;
&lt;blockquote&gt;&quot;When trying to use PROC EXPORT or PROC IMPORT in batch
mode on UNIX systems, you may receive the following error:&lt;span
style=&quot;font-family: monospace;&quot;&gt;&lt;br&gt;
&lt;br&gt;
&lt;/span&gt;ERROR: Cannot open X display. Check the name/server access
authorization.&lt;br&gt;
&lt;p&gt;This happens because, even in batch mode, these procedures try
to display the SAS SESSION MANAGER icon, which requires a valid X
display. For any version 8 procedures that you want to run in batch
mode without a terminal present you will need to use the -NOTERMINAL
option when invoking SAS. &lt;/p&gt;

&lt;p&gt;For example: &lt;/p&gt;
&lt;pre&gt; sas myprogram.sas -noterminal&lt;br&gt;&lt;/pre&gt;
This will prevent the session manager icon from trying to display.&quot; &lt;/blockquote&gt;
&lt;p&gt;Translation: &quot;&lt;b&gt;SAS will hang forever&lt;/b&gt; on &lt;tt&gt;proc export&lt;/tt&gt;, and you
won't even see the error message in the log, because the log is not
flushed to disk until you kill SAS, and this is not a bug, it's been
like that since the dawn of days, and we won't fix it because 
it's not a bug, it's perfectly OK to hang, but as a workaround, you can use
the -noterminal option.&quot;&lt;/p&gt;

&lt;hr width=&quot;30&quot;&gt;
&lt;p&gt;Here is a third example of a problem with &lt;tt&gt;proc import&lt;/tt&gt;: 
when using it to read Titanic3.csv, a public dataset describing the
1,309 passengers of the Titanic, &lt;b&gt;SAS truncates hundreds of values&lt;/b&gt; of name, cabin and home destination &lt;b&gt;without any warning or error.&lt;/b&gt; You can get the file
here. 
&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;a
href=&quot;http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/DataSets&quot;&gt;http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/DataSets&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Of course, it is easy to fix; and it will not 
affect your analysis, but still, is this what you expect from a leading product? &lt;br&gt;
&lt;/p&gt;
&lt;hr width=&quot;30&quot;&gt;
&lt;p&gt;&lt;b&gt;Arbitrary limits are everywhere.&lt;/b&gt; You create a string variable
and by default its length is limited to 8. You assign something to it
and it gets silently truncated. You import a 
file and the line length is limited to 256. Of course you can change 
it by using the &lt;tt&gt;lrecl=&lt;/tt&gt; 
option, but why can't SAS do it? &lt;br&gt;
&lt;/p&gt;
&lt;hr width=&quot;30&quot;&gt;
&lt;p&gt;&lt;tt&gt;Proc sql&lt;/tt&gt; is just like SQL, but not always. &lt;br&gt;
&lt;tt&gt;GROUP
BY&lt;/tt&gt; a variable works as expected. &lt;br&gt;
But can you guess what &lt;tt&gt;GROUP BY&lt;/tt&gt; any expression does? &lt;br&gt;
&lt;br&gt;
&lt;b&gt;Nothing!!!&lt;/b&gt; &lt;br&gt;
&lt;br&gt;
&lt;hr width=&quot;30&quot;&gt;
&lt;p&gt;Error messages are not always helpful in identifying the
problem. If logistic regression fails to provide any output except for
the cryptic
&lt;pre&gt;&quot;Error: There are no valid observations&quot;, &lt;/pre&gt;
what exactly does it mean?
Why not just say 
&lt;pre&gt;&quot;Warning: all values of variable FOO are missing&quot;&lt;/pre&gt;
exclude it from the list of predictors, and go on? &lt;br&gt;
&lt;hr width=&quot;30&quot;&gt;
&lt;p&gt;You are sorting a dataset in-place, and it's taking too long.
You decide
to cancel it. 
The dataset is still there, but it's now empty. Not
unsorted,
but &lt;b&gt;empty&lt;/b&gt;. As in 
&lt;b&gt;no observations!&lt;/b&gt; Of course, everyone
knows that you should
have used the &lt;tt&gt;out=&lt;/tt&gt; 
option to redirect the output to another dataset, so
that your data can take twice as much disk space. 

&lt;hr width=&quot;30&quot;&gt;
&lt;p&gt;&lt;tt&gt;Proc sql&lt;/tt&gt; again. Guess what will be the name of the
second variable in the &lt;tt&gt;new_table&lt;/tt&gt;: &lt;/p&gt;
&lt;pre&gt;proc sql;
    CREATE TABLE new_table AS
        SELECT foo, 
               COUNT(*) &quot;cnt&quot;
        FROM old_table
        GROUP BY foo;
    quit;
&lt;/pre&gt;
&lt;p&gt;Of course it's &lt;tt&gt;_TEMA001&lt;/tt&gt;, because &lt;tt&gt;cnt&lt;/tt&gt; is the &lt;b&gt;label&lt;/b&gt;,
not the variable name.  Bizarre, but you can make it work with&lt;br&gt;
&lt;/p&gt;
&lt;pre&gt;proc sql;
    CREATE TABLE new_table AS 
         (SELECT * FROM 
             (SELECT foo, 
                     COUNT(*)
              FROM old_table
              GROUP BY foo
             ) x ( foo, cnt )
          );
    quit;
&lt;/pre&gt;
&lt;hr width=&quot;30&quot;&gt;
&lt;p&gt;I can go on like this for a while, but I think you get the idea. 
&lt;/p&gt;
&lt;p&gt;The strange thing is that people who use SAS on a day to day basis
tend not to see 
how unnatural it is. It looks like the&lt;span style=&quot;font-weight: bold;&quot;&gt;
&lt;a href=&quot;http://en.wikipedia.org/wiki/Sapir-Whorf_hypothesis&quot;&gt;Sapir-Whorf
hypothesis&lt;/a&gt;&lt;/span&gt;
in action.&lt;br&gt;
&lt;/p&gt;
&lt;hr width=&quot;30&quot;&gt;
&lt;p&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;But isn't it true that all the old
languages have their
quirks? &lt;/span&gt;&lt;br&gt;
&lt;br&gt;
&lt;b&gt;No!&lt;/b&gt; While it's true that Cobol and Basic will rot your brain 
because of the paucity of  
their features, many old languages were either done right from the start, 
or evolved into coherent ones:
LISP, for instance, SQL, C or R (an interesting alternative to SAS for doing statistics).
Together with more recent  
languages like Java or Ruby, they are much more consistent than SAS.</description>
  </item>
  <item>
    <title>Why look at histograms?</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/05/07#whyhist</link>
    <description>&lt;p&gt;Statisticians look at histograms the way generals look at maps. There
is just no way around it.
But if you're not a statistician, what are you supposed to look for?
&lt;p&gt;It's hard to answer, but it is easy to look at a few simple examples.
&lt;p&gt;The first dataset we'll use contains data from the real customers of a
bank.  One variable, SAVBAL, contains the balance of the customers' savings
account.
&lt;p&gt;This is what the histogram of the raw variable looks like:
&lt;p&gt; &lt;center&gt;&lt;img alt=&quot;savbal&quot; src=&quot;/blog/jc/savbal.png&quot;
style=&quot;width: 605px; height: 390px;&quot;&gt;&lt;/center&gt;

&lt;p&gt;We clearly have lots of zero values, more than half in fact, since the
median is zero.
&lt;p&gt;It is worth remembering that the vast majority of savings balances are below $20,000.
&lt;p&gt;We also have a very long tail to the right, which 
immediately makes us want to take the logarithm of SAVBAL + 1. 
We do this in SQL with one line after the &amp;quot;select&amp;quot;:

&lt;p&gt;
&lt;pre&gt; log(savbal + 1) &quot;lsav&quot; &lt;/pre&gt;

&lt;p&gt;Why the + 1? So that we don't have to deal with log(0), which, as you
may remember, is minus infinity.
This way, since log(1) = 0, a zero remains a zero.
&lt;p&gt;
&lt;center&gt;&lt;img alt=&quot;log( savbal )&quot; src=&quot;/blog/jc/lsav.png&quot;
style=&quot;width: 606px; height: 388px;&quot;&gt;&lt;/center&gt;

&lt;p&gt;Again, we have the same large number of zero values. 
But now, we can see them more easily as a completely separate bunch. 
And that takes us to the main point worth remembering: 

&lt;br&gt;&lt;br&gt;&lt;center&gt;&lt;span
style=&quot;font-weight: bold;&quot;&gt;When a histogram has two distinct humps, do
not continue! &lt;br&gt;
Break the population in two and conduct two analyses, one for each
population. &lt;/span&gt;&lt;/center&gt;

&lt;p&gt;In our case, this means separating the customers who 
save anything from those who do not save at all,
an easy step to take with this SQL line:

&lt;p&gt;
&lt;pre&gt;CASE WHEN savbal &amp;gt; 0 THEN log(savbal) ELSE NULL END &quot;lsav2&quot;&lt;br&gt;
&lt;/pre&gt;

&lt;p&gt;&lt;center&gt;&lt;img alt=&quot;log (nonzero-savbal)&quot; src=&quot;/blog/jc/lsav2.png&quot;
style=&quot;width: 605px; height: 390px;&quot;&gt;&lt;/center&gt;

&lt;p&gt;Now, even though the resulting graph is not completely symmetric, nor
very close to normal, it is much better than the raw SAVBAL, and this is what we want to use.

&lt;p&gt;If we invested a lot more time, we'd notice that savbal raised to the
power 0.1 gives a better approximation of the normal distribution. The SQL needed is
&lt;p&gt;
&lt;pre&gt;CASE WHEN savbal &amp;gt; 0 THEN pow(savbal, 0.1) ELSE NULL END &quot;s01&quot;&lt;br&gt;&lt;/pre&gt;
&lt;p&gt;
&lt;center&gt;&lt;img alt=&quot;savbal to the power 0.1&quot;
src=&quot;/blog/jc/sav01.png&quot; style=&quot;width: 605px; height: 387px;&quot;&gt;&lt;/center&gt;

&lt;p&gt;But the added work needed to find 0.1 is not worth it. 

&lt;p&gt;In conclusion, we have identified two populations, savers and
non-savers, and the savers' balances are approximately log-normal.
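&lt;p&gt;For readers who prefer to experiment outside SQL, here is a minimal Python sketch of the same two steps: the log(savbal + 1) transform, then the split into savers and non-savers. The balances are made-up, assumed non-negative numbers rather than the bank data, and the variable names are only illustrative.
&lt;pre&gt;
```python
import math

# Made-up savings balances (assumed non-negative), not the bank data
balances = [0.0, 0.0, 0.0, 150.0, 2500.0, 18000.0, 0.0, 90000.0]

# log(savbal + 1): zeros stay at zero because log(1) = 0
lsav = [math.log(b + 1) for b in balances]

# Split the population: count the non-savers, log-transform the savers
savers = [b for b in balances if b != 0]
non_saver_count = len(balances) - len(savers)
lsav2 = [math.log(b) for b in savers]

print(non_saver_count, [round(v, 2) for v in lsav2])
```
&lt;/pre&gt;
&lt;p&gt;The zeros form their own group in &lt;tt&gt;lsav&lt;/tt&gt;, which is the separate hump seen in the histogram, while &lt;tt&gt;lsav2&lt;/tt&gt; keeps only the savers.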
</description>
  </item>
  <item>
    <title>ArrowModel beta FAQ</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/05/04#beta_faq</link>
    <description>&lt;p&gt;ArrowModel is going through its first beta test.  Here are some of the frequently asked questions so far:

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;How can I get my data from SAS to ArrowModel?&lt;/b&gt;
&lt;p&gt;
In SAS, export the data to CSV:
&lt;pre&gt;
proc export data=mydataset outfile='c:\temp\mydataset.csv'
  dbms=csv replace;
run;
&lt;/pre&gt;
&lt;p&gt;Then import the CSV file in ArrowModel.
&lt;li&gt;&lt;b&gt;Why is my KS so low, and why does the ROC look like a bow string (left picture) rather than like a bow (right picture)?&lt;/b&gt;
&lt;p&gt;
&lt;center&gt;
&lt;table border=0 cellspacing=6&gt;
&lt;tr align=center&gt;
&lt;td&gt;&lt;img src=&quot;/blog/jeff/bowstring.png&quot;&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;/blog/jeff/bow.png&quot;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;You probably decided to override ArrowModel's recommendation 
in the Stratify step and told it
to keep 100% of events and non-events.  After all, it is usually a good idea to use 
all the available data rather than to throw it away.
&lt;p&gt;But there's also rounding.  The output of logistic regression 
is the estimated probability of the event in the training sample (or non-event,
if you choose the high value of score to indicate low probability of event,
but it does not matter in the end).  If the event 
is rare, this value is going to be close to zero for most observations.
To get a score, which in the case of ArrowModel is an integer between 0 and 99, the
output of logistic regression is multiplied by 100 and then rounded down to
the nearest integer.  For many observations small differences in estimated
probability will be lost due to this rounding.
&lt;p&gt;Check the score distribution in the Test step.  If it is severely skewed, try
going back to the Stratify step and restoring the defaults.
&lt;li&gt;&lt;b&gt;How can I insert a chart from ArrowModel in my presentation?&lt;/b&gt;
&lt;p&gt; Right-click on a chart, select &amp;quot;Save image as...&amp;quot; from the
pop-up menu, then use the resulting PNG file.
&lt;/ul&gt;
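&lt;p&gt;To see how much the rounding described in the KS question above can cost, here is a small Python sketch. It assumes the score is simply the estimated probability multiplied by 100 and rounded down; the cap at 99, the function name, and the probabilities are assumptions made for illustration, not ArrowModel internals.
&lt;pre&gt;
```python
import math

def to_score(probability):
    # Multiply by 100 and round down; capping at 99 is an assumption
    # made here so that the score stays within 0..99
    return min(99, math.floor(probability * 100))

# Made-up estimated probabilities for a rare event
probabilities = [0.0012, 0.0034, 0.0051, 0.0078, 0.0093, 0.0121, 0.0158]
scores = [to_score(p) for p in probabilities]

# Count how many distinct scores survive the rounding
distinct_scores = len(set(scores))
print(scores, distinct_scores)
```
&lt;/pre&gt;
&lt;p&gt;Seven different probabilities end up in only two score values, so most of the ordering information the model produced is lost.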

&lt;p&gt;&lt;b&gt;[UPDATE 5/4]&lt;/b&gt; Pictures added to illustrate the differences in ROC curves.
</description>
  </item>
  <item>
    <title>Not Quite Normal</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/04/19#not_quite_normal</link>
    <description>
&lt;p&gt;A lot of statistical magic relies on the premise that 
stuff is normally distributed.

&lt;p&gt;&lt;center&gt;&lt;img src=&quot;/blog/jeff/normal.png&quot; alt=&quot;Normal distribution&quot;&gt;
&lt;br&gt;Normal distribution&lt;/center&gt;

&lt;p&gt;The normal distribution has nice properties that make 
things easy analytically, but chances are that, most of 
the time, you'll see distributions that look like this:

&lt;p&gt;&lt;center&gt;&lt;img src=&quot;/blog/jeff/not_quite_normal.png&quot; 
alt=&quot;Not quite normal distribution&quot;&gt;&lt;br&gt;Not quite normal 
distribution&lt;/center&gt;

&lt;p&gt;Of course I'm generalizing and there are exceptions, but 
it's clear that the good old normal distribution belongs 
on the endangered species list.

&lt;p&gt;There are several reasons why:

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Counts&lt;/em&gt;.  We often count events, and a count cannot 
be negative, hence it is not really normal: 
the number of accidents at a given time and place will tend to be
Poisson distributed.  Other everyday quantities are not normal either: an account number will tend 
to be uniformly distributed and the waiting time for your next 
e-mail will tend to be exponential. 
  &lt;li&gt;&lt;em&gt;Long tail&lt;/em&gt; aka &lt;em&gt;outliers&lt;/em&gt;.  If income were normally
  distributed there would be no Bill Gates or Warren Buffett.
  &lt;li&gt;&lt;em&gt;Six degrees of separation&lt;/em&gt;, or 
  &lt;a href=&quot;http://www.stat.columbia.edu/~cook/movabletype/archives/2007/04/nobody_knows_an.html&quot;&gt;everything is connected&lt;/a&gt;,
  making the law of large numbers and central limit theorem not 
  really applicable.
&lt;/ul&gt;

&lt;p&gt;So what is the poor modeler to do?

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Look at the distributions&lt;/em&gt; before plugging your data 
  straight into the model.  Even if you don't have time.
  &lt;li&gt;&lt;em&gt;Winsorize&lt;/em&gt;.  ArrowModel does it automatically for you.
  &lt;li&gt;&lt;em&gt;Transform variables&lt;/em&gt; when needed, e.g., use 
  &lt;tt&gt;log(income+K)&lt;/tt&gt;, where K is a constant, or 
  &amp;radic;&amp;radic;&lt;tt&gt;income&lt;/tt&gt; instead of &lt;tt&gt;income&lt;/tt&gt;.
  I dislike &lt;tt&gt;log(income+K)&lt;/tt&gt; because of the arbitrariness of K.
  &lt;li&gt;&lt;em&gt;Look for a natural break&lt;/em&gt; in the distribution to see 
  if a continuous variable can be transformed into an indicator 
  (dummy variable) like this: 
&lt;pre&gt;CASE WHEN foo &gt; 9 THEN 1 ELSE 0 END&lt;/pre&gt;
&lt;/ul&gt;
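&lt;p&gt;Winsorizing, mentioned in the list above, is easy to sketch. The Python below is a minimal illustration, not ArrowModel's actual procedure: the 10th and 90th percentile cutoffs, the nearest-rank percentile rule, the function name, and the income figures are all assumptions.
&lt;pre&gt;
```python
def winsorize(values, lower_pct=0.1, upper_pct=0.9):
    # Clamp every value to the chosen lower and upper percentiles
    # (nearest-rank rule on the sorted values; an assumed convention)
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[int(round(lower_pct * last))]
    high = ordered[int(round(upper_pct * last))]
    return [min(max(v, low), high) for v in values]

# Made-up incomes in thousands, with one extreme value
incomes = [12, 25, 30, 31, 33, 35, 38, 40, 45, 5000]
clamped = winsorize(incomes)
print(clamped)
```
&lt;/pre&gt;
&lt;p&gt;The extreme value is pulled back to the upper cutoff instead of being dropped, so the observation stays in the sample without dominating the fit.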

&lt;p&gt;There are more elaborate ways of dealing with not quite normally
distributed data such as Johnson's SU functions and multivariate 
adaptive regression splines (MARS) which this margin is too narrow 
to contain.</description>
  </item>
  <item>
    <title>Spanish translation</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/04/16#es</link>
    <description>&lt;p&gt;ArrowModel &lt;a href=&quot;http://arrowmodel.com/es/index.html&quot;&gt;speaks Spanish&lt;/a&gt;, too.</description>
  </item>
  <item>
    <title>Site translations</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/04/01#site_translations</link>
    <description>&lt;p&gt;The main &lt;a href=&quot;http://arrowmodel.com&quot;&gt;ArrowModel site&lt;/a&gt; now has  
&lt;a href=&quot;http://arrowmodel.com/fr/index.html&quot;&gt;French&lt;/a&gt; and 
&lt;a href=&quot;http://arrowmodel.com/ru/index.html&quot;&gt;Russian&lt;/a&gt; translations.
Really.  More i18n is underway. </description>
  </item>
  <item>
    <title>Information Value</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/02/25#information_value</link>
    <description>
&lt;p&gt;Deciding which predictors to use is one of the key steps
in model building.  A good place to start is to examine predictors 
individually to see how good they are in a univariate sense.

&lt;p&gt;Information value is a metric that is often used to tell how good
a predictor is.  Let's follow the calculations step by step.

&lt;ol&gt;
&lt;li&gt;Start by ranking the data by the predictor in question.  The number of
ranks is not very critical and, in most cases, deciles will do just fine.

&lt;li&gt;Calculate the total number of goods (&lt;tt&gt;total_good_ct&lt;/tt&gt;)&lt;br /&gt; and
the total number of bads (&lt;tt&gt;total_bad_ct&lt;/tt&gt;);

&lt;li&gt;For each rank
   &lt;ul&gt;
    &lt;li&gt;Calculate the number of goods (&lt;tt&gt;good_ct&lt;/tt&gt;)&lt;br /&gt;
    and the number of bads (&lt;tt&gt;bad_ct&lt;/tt&gt;);

    &lt;li&gt;&lt;tt&gt;good_pct = good_ct / total_good_ct&lt;/tt&gt;,&lt;br /&gt;
    &lt;tt&gt;bad_pct = bad_ct / total_bad_ct&lt;/tt&gt;;

    &lt;li&gt;&lt;tt&gt;diff_pct = good_pct - bad_pct&lt;/tt&gt;;

    &lt;li&gt;&lt;tt&gt;info_odds = good_pct / bad_pct&lt;/tt&gt;;

    &lt;li&gt;Weight of evidence: &lt;tt&gt;woe = log(info_odds)&lt;/tt&gt;;

    &lt;li&gt;Information value: &lt;tt&gt;inf_val = diff_pct * woe&lt;/tt&gt;;
  &lt;/ul&gt;

&lt;li&gt;Finally, sum up &lt;tt&gt;inf_val&lt;/tt&gt; for all the ranks.  This is 
the predictor's information value.

&lt;/ol&gt;

&lt;p&gt;As you can see, the information value for each rank reflects 
log odds, but the order of ranks does not have any effect.
This nicely takes care of nonlinearity and outliers.
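&lt;p&gt;The steps listed above translate almost line by line into Python. This is a sketch under stated assumptions: the function name and the counts are made up, and every rank is assumed to contain at least one good and one bad, since a zero count would lead to a division by zero or a logarithm of zero.
&lt;pre&gt;
```python
import math

def information_value(ranks):
    # ranks: list of (good_ct, bad_ct) pairs, one pair per rank
    total_good_ct = sum(good_ct for good_ct, bad_ct in ranks)
    total_bad_ct = sum(bad_ct for good_ct, bad_ct in ranks)
    inf_val_sum = 0.0
    for good_ct, bad_ct in ranks:
        good_pct = good_ct / total_good_ct
        bad_pct = bad_ct / total_bad_ct
        diff_pct = good_pct - bad_pct
        woe = math.log(good_pct / bad_pct)  # weight of evidence
        inf_val_sum += diff_pct * woe       # this rank's information value
    return inf_val_sum

# Made-up counts for five ranks (quintiles instead of deciles, for brevity)
example = [(400, 10), (300, 20), (200, 30), (70, 20), (30, 20)]
print(round(information_value(example), 4))
```
&lt;/pre&gt;
&lt;p&gt;Reversing or shuffling &lt;tt&gt;example&lt;/tt&gt; gives the same result, which is the order independence noted above.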

&lt;p&gt;Ordering predictors by information value and taking the 
top N is a tempting strategy, but not a very prudent approach.  The 
predictors selected this way can turn out to be redundant, 
regression is rather sensitive to outliers, and we haven't 
done anything about nonlinearity yet.  But it's a good way 
to screen out the least likely candidates.
</description>
  </item>
  <item>
    <title>Receiver Operating Characteristic</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/02/24#roc</link>
    <description>
&lt;p&gt;ROC curves were first used during World War II to graphically show 
the separation of radar signals from background noise. They are commonly used to 
graphically show the added value of any predictive model.
To plot 
the &lt;I&gt;receiver operating characteristic&lt;/I&gt;, or ROC curve, one  
plots &lt;I&gt;B(s)&lt;/I&gt; vs. &lt;I&gt;G(s)&lt;/I&gt; for all values of &lt;I&gt;s&lt;/I&gt;.  This 
curve goes from (0, 0) to (1, 1).  The curve of an ideal model (complete separation) goes through 
(0, 1), while the curve of a totally useless model (no separation) is a straight 
diagonal line.  The curve looks like a banana, hence the nickname
&lt;i&gt;banana chart&lt;/i&gt;.

&lt;center&gt;
&lt;table border=0 cellspacing=6&gt;
&lt;tr align=center&gt;
&lt;td&gt;&lt;img src=&quot;/blog/jeff/roc_strong.png&quot;&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;/blog/jeff/roc_weak.png&quot;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr align=center&gt;
&lt;td&gt;Very strong separation&lt;/td&gt;
&lt;td&gt;Weak separation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr align=center&gt;
&lt;td&gt;Excellent model&lt;/td&gt;
&lt;td&gt;Mediocre model&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;
  
&lt;P&gt;The KS query from  
&lt;a href=&quot;/cgi-bin/blosxom.cgi/jeff/index.html#ks&quot;&gt;this post&lt;/a&gt; 
can be easily modified to return coordinates of the points on the 
ROC curve:

&lt;PRE&gt;
SELECT s
     , cdf.b &quot;Sensitivity&quot;
     , cdf.g &quot;1-Specificity&quot;
FROM ( SELECT a.s                                          &quot;s&quot;
            , SUM(FLOAT(distr.bad_cnt)) /
              ( SELECT COUNT(*) FROM t WHERE outcome = 1 ) &quot;b&quot;
            , SUM(FLOAT(distr.good_cnt)) /
              ( SELECT COUNT(*) FROM t WHERE outcome = 0 ) &quot;g&quot;
       FROM ( SELECT DISTINCT s FROM t ) a
       JOIN (
              SELECT s                &quot;s&quot;
                   , SUM(outcome)     &quot;bad_cnt&quot;
                   , SUM(1 - outcome) &quot;good_cnt&quot;
              FROM t
              GROUP BY s 
            ) distr
         ON distr.s &lt;= a.s
         GROUP BY a.s 
     ) cdf
;
&lt;/PRE&gt;

&lt;P&gt;In the context of an ROC plot, &lt;I&gt;B(s)&lt;/I&gt; is often called &lt;I&gt;sensitivity&lt;/I&gt;
or &lt;I&gt;true positive fraction&lt;/I&gt;, and &lt;I&gt;G(s)&lt;/I&gt; is called &lt;I&gt;1-specificity&lt;/I&gt;
or &lt;I&gt;false positive fraction&lt;/I&gt;.
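&lt;P&gt;The same coordinates can also be computed outside the database. The Python sketch below is an illustration only: the function name, the (score, outcome) row layout, and the sample rows are assumptions, with outcome 1 meaning bad and 0 meaning good.
&lt;PRE&gt;
```python
def roc_points(rows):
    # rows: list of (score, outcome) pairs, outcome 1 = bad, 0 = good
    total_bad = sum(outcome for score, outcome in rows)
    total_good = len(rows) - total_bad
    points = []
    bad_so_far = 0
    good_so_far = 0
    for s in sorted(set(score for score, outcome in rows)):
        # Add the counts at this score to the running totals
        bad_so_far += sum(o for sc, o in rows if sc == s)
        good_so_far += sum(1 - o for sc, o in rows if sc == s)
        # x = G(s), the false positive fraction; y = B(s), the sensitivity
        points.append((good_so_far / total_good, bad_so_far / total_bad))
    return points

# Made-up sample: low scores are mostly bad, high scores mostly good
rows = [(10, 1), (20, 1), (20, 0), (30, 1), (40, 0), (50, 0)]
points = roc_points(rows)
print(points)
```
&lt;/PRE&gt;
&lt;P&gt;Keeping running totals over the sorted scores avoids the partial Cartesian product of the join, and the last point is always (1, 1).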

</description>
  </item>
  <item>
    <title>Kolmogorov-Smirnov Test</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/02/23#ks</link>
    <description>
&lt;P&gt;One of the most widely (mis)used measures of scorecard performance 
is the &lt;I&gt;Kolmogorov-Smirnov test&lt;/I&gt; (KS), colloquially known as
&lt;i&gt;the vodka test&lt;/i&gt;.  In this post, I'll explain what KS is, and show 
a way to calculate it in SQL.

&lt;P&gt;Given two samples of a continuous random variable, the two-sample 
K-S test is used to answer the following question: did these two samples come 
from the same distribution or didn't they?
The idea is simply to compute the largest absolute difference between the 
two empirical cumulative distributions and to conclude that there is a significant 
difference if the difference is large enough.

&lt;P&gt;Consider a risk score that predicts the probability of a customer
defaulting (we'll call that 'going bad').  KS is the greatest difference 
between the cumulative distribution functions of the scores of the 
good and the bad populations:

&lt;CENTER&gt;
&lt;I&gt;KS = max&lt;SUB&gt;s&lt;/SUB&gt;&lt;/I&gt;|&lt;I&gt;B(s) - G(s)&lt;/I&gt;|,
&lt;/CENTER&gt;

&lt;P&gt;where
&lt;UL&gt;
  &lt;LI&gt;&lt;I&gt;s&lt;/I&gt; is the score,&lt;/LI&gt;
  &lt;LI&gt;&lt;I&gt;B(s)&lt;/I&gt; is the number of bads with a score less than or equal
  to &lt;I&gt;s&lt;/I&gt; divided by the total number of bads, and&lt;/LI&gt;
  &lt;LI&gt;&lt;I&gt;G(s)&lt;/I&gt; is the number of goods with a score less than or equal
  to &lt;I&gt;s&lt;/I&gt; divided by the total number of goods.&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;KS is often multiplied by 100 for convenience.  In many contexts 
40 is considered to be a good KS. 
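&lt;P&gt;Before turning to SQL, the definition can be checked with a few lines of Python. This is a minimal sketch: the function name, the (score, outcome) row layout, and the sample rows are assumptions, with outcome 1 meaning bad and 0 meaning good.
&lt;PRE&gt;
```python
def ks_statistic(rows):
    # rows: list of (score, outcome) pairs, outcome 1 = bad, 0 = good
    total_bad = sum(outcome for score, outcome in rows)
    total_good = len(rows) - total_bad
    bad_so_far = 0
    good_so_far = 0
    ks = 0.0
    for s in sorted(set(score for score, outcome in rows)):
        # Running counts give the two empirical CDFs at score s
        bad_so_far += sum(o for sc, o in rows if sc == s)
        good_so_far += sum(1 - o for sc, o in rows if sc == s)
        b = bad_so_far / total_bad      # B(s)
        g = good_so_far / total_good    # G(s)
        ks = max(ks, abs(b - g))
    return 100 * ks                     # multiplied by 100 for convenience

# Made-up sample: low scores are mostly bad, high scores mostly good
rows = [(10, 1), (20, 1), (20, 0), (30, 1), (40, 0), (50, 0)]
print(round(ks_statistic(rows), 2))
```
&lt;/PRE&gt;
&lt;P&gt;Taking the absolute value means the result does not depend on whether a high score indicates a good or a bad customer.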
   
&lt;P&gt;Let's try an example.  Start with the table &lt;TT&gt;t&lt;/TT&gt; 
that contains initial data:

&lt;CENTER&gt;
&lt;P&gt;
&lt;TABLE BORDER=1&gt;
  &lt;TR&gt;
    &lt;TD&gt;&lt;B&gt;Column&lt;/B&gt;&lt;/TD&gt;
    &lt;TD&gt;&lt;B&gt;Description&lt;/B&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;&lt;TT&gt;id&lt;/TT&gt;&lt;/TD&gt;
    &lt;TD&gt;Unique identifier&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;&lt;TT&gt;s&lt;/TT&gt;&lt;/TD&gt;
    &lt;TD&gt;Score&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
  &lt;TD&gt;&lt;TT&gt;outcome&lt;/TT&gt;&lt;/TD&gt;
    &lt;TD&gt;1 is bad, 0 is good&lt;/TD&gt;
  &lt;/TR&gt;
&lt;/TABLE&gt;
&lt;/CENTER&gt;

&lt;P&gt;The following query calculates the KS:

&lt;PRE&gt;
SELECT MAX(ABS(cdf.b - cdf.g)) * 100                       &quot;KS&quot;
FROM ( SELECT a.s                                          &quot;s&quot;
            , SUM(FLOAT(distr.bad_cnt)) /
              ( SELECT COUNT(*) FROM t WHERE outcome = 1 ) &quot;b&quot;
            , SUM(FLOAT(distr.good_cnt)) /
              ( SELECT COUNT(*) FROM t WHERE outcome = 0 ) &quot;g&quot;
       FROM ( SELECT DISTINCT s FROM t ) a
       JOIN (
              SELECT s                &quot;s&quot;
                   , SUM(outcome)     &quot;bad_cnt&quot;
                   , SUM(1 - outcome) &quot;good_cnt&quot;
              FROM t
              GROUP BY s 
            ) distr
         ON distr.s &lt;= a.s
         GROUP BY a.s 
     ) cdf
;
&lt;/PRE&gt;

&lt;P&gt;The easiest way to understand how the query works is by 
decomposing it into smaller pieces. In this case there are five
uncorrelated subqueries.

&lt;P&gt;This subquery returns distribution of goods and bads by score:

&lt;PRE&gt;
SELECT s                &quot;s&quot;
     , SUM(outcome)     &quot;bad_cnt&quot;
     , SUM(1 - outcome) &quot;good_cnt&quot;
FROM t
GROUP BY s
&lt;/PRE&gt;

&lt;P&gt;Note how it relies on the fact that &lt;TT&gt;outcome&lt;/TT&gt; can be 
either &lt;TT&gt;0&lt;/TT&gt; or &lt;TT&gt;1&lt;/TT&gt;.

&lt;P&gt;This subquery returns the list of all possible score values:

&lt;PRE&gt;
SELECT DISTINCT s FROM t
&lt;/PRE&gt;

&lt;P&gt;This subquery returns the total number of bads:

&lt;PRE&gt;
SELECT COUNT(*) FROM t WHERE outcome = 1
&lt;/PRE&gt;

&lt;P&gt;This subquery returns the total number of goods:

&lt;PRE&gt;
SELECT COUNT(*) FROM t WHERE outcome = 0
&lt;/PRE&gt;

&lt;P&gt;Finally, this subquery (abbreviated for clarity) makes the 
distributions cumulative:

&lt;PRE&gt;
SELECT a.s                                     &quot;s&quot;
     , SUM(FLOAT(distr.bad_cnt)) / total_bad   &quot;b&quot;
     , SUM(FLOAT(distr.good_cnt)) / total_good &quot;g&quot;
FROM a
JOIN distr
  ON distr.s &lt;= a.s
GROUP BY a.s
&lt;/PRE&gt;

&lt;P&gt;Note that it is rather inefficient because the join results in 
a partial Cartesian product. There's a better way to do the cumulation
if your flavor of SQL supports online analytical processing (OLAP) 
functions:

&lt;PRE&gt;
SELECT s                                                   &quot;s&quot;
     , SUM(FLOAT(bad_cnt)) OVER (ORDER BY s) / total_bad   &quot;b&quot;
     , SUM(FLOAT(good_cnt)) OVER (ORDER BY s) / total_good &quot;g&quot;
FROM distr
&lt;/PRE&gt;

&lt;P&gt;Now the only thing left to do is to pick the maximum difference.  
This is the KS.</description>
  </item>
  <item>
    <title>I'm new at predictive modeling.  Help!</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/02/22#beginners_resources</link>
    <description>
&lt;p&gt;It's true that there does not seem to be a lot of information
on scoring and predictive modeling available online, and that many
articles are written in rather heavy language, peppered with statistical
jargon.  But don't panic.  To help you navigate the uncharted waters, 
here are some good places to start.

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://clickz.com/showPage.html?page=3504911&quot;&gt;
Using Predictive Models&lt;/a&gt; by Brian Teasley 
(&lt;a href=&quot;http://clickz.com/showPage.html?page=3508091&quot;&gt;part 2&lt;/a&gt;, 
&lt;a href=&quot;http://clickz.com/showPage.html?page=3512076&quot;&gt;part 3&lt;/a&gt;)
gives a very nice non-technical introduction;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://en.wikipedia.org&quot;&gt;Wikipedia&lt;/a&gt; 
is hard to beat when you need to know what 
&lt;a href=&quot;http://en.wikipedia.org/wiki/Generalized_linear_model&quot;&gt;generalized 
linear model&lt;/a&gt; or
&lt;a href=&quot;http://en.wikipedia.org/wiki/Logistic_regression&quot;&gt;logistic 
regression&lt;/a&gt; is.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also a few exceptionally good books.  My favorites are:

&lt;ul&gt;
&lt;li&gt;&lt;a 
href=&quot;http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS&quot;&gt;Regression 
Modeling Strategies&lt;/a&gt; by Frank E. Harrell, Jr.&lt;/li&gt;
&lt;li&gt;&lt;a 
href=&quot;http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471356328.html&quot;&gt;Applied 
Logistic Regression (Second Edition)&lt;/a&gt; by David Hosmer and Stanley
Lemeshow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, these two classes by the SAS Institute are worth taking:

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://support.sas.com/training/us/crs/pmlr.html&quot;&gt;Predictive 
Modeling Using Logistic Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.sas.com/apps/wtraining2/coursedetails.jsp?course_code=pmadv&amp;ctry=us&quot;&gt;Advanced 
Predictive Modeling Using SAS Enterprise Miner&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
  </item>
  <item>
    <title>First post</title>
    <link>http://arrowmodel.com/cgi-bin/blosxom.cgi/2007/02/22#first_post</link>
    <description>&lt;p&gt;Welcome to the newly-minted practical predictive modeling blog.
We'll share tips, tricks, and techniques to make your life as a modeler easier.</description>
  </item>
  </channel>
</rss>