Applying Forensic Accounting Techniques Using SQL And R

Kevin Feasel (@feaselkl)

http://CSmore.info/on/outliers

Who Am I? What Am I Doing Here?

Catallaxy Services
@feaselkl
Curated SQL
We Speak Linux

A Case Study

Wake County transportation fraud.

A Case Study

Quick Story

An auto parts company, working in conjunction with several county employees, defrauded Wake County of at least $5 million over a multi-year period.

Participants took advantage of lax vendor policy for small purchases and submitted fraudulent invoices, all under the cutoff for further scrutiny.

A Case Study

Fraud went up to the top, to the head of the transportation department.

A Case Study

The end result was several guilty pleas and jail time for participants.

A Case Study

Wake County was able to claw back $5 million and Barnes Motor & Parts paid back $3 million, including a $2.5 million criminal fine.

A Case Study

As we go over techniques forensic accountants use to find potentially fraudulent data, we will return to this case several times.

Motivation

Most of us don't spend time hunting down fraudsters using our data. But the practices and principles these forensic accountants use can still be valuable for us.

We can use these to find dirty data in our own environments. The end goal isn't to find recalcitrant users and toss them into jail; it is to improve our overall data quality, as you can't find bad data if you don't know what it looks like.

Agenda

  1. Know Your Data
  2. Basic Analysis
  3. Cohort Analysis
  4. Market Basket Analysis
  5. Time Series Analysis
  6. Numeral Analysis

Know Your Data

This is a quick overview of data samples used in this talk. Ideally, you want to know what the data elements represent and have as detailed an understanding of your data as possible.

Patterns on their own are interesting, but are not indicative of malicious or irresponsible behavior.

Know Your Data

Data Sets

  • Raleigh police incidents, 2005-2014
  • Durham crime, 2013-2017
  • Wake County transactions, FY 16-17
  • Durham city budget, FY 2010-2015
  • North Carolina population by city, 2013
  • My HOA's budgets, 2013-2015

Know Your Data

For your data sets, you should ideally have a detailed understanding of what data points represent and how valid the data is.

For these open data sets, sometimes the data is purposefully incomplete or inaccurate. For example, police incident data does not include the acting officer or exact incident location, to protect all parties involved.

Also, data pulled from Kaggle competitions and the like will have relevant features masked.

Agenda

  1. Know Your Data
  2. Basic Analysis
  3. Cohort Analysis
  4. Market Basket Analysis
  5. Time Series Analysis
  6. Numeral Analysis

Basic Analysis

Our basic analysis will focus on a few important areas:

  • Growth Analysis
  • Gaps In Data
  • Duplicates & Cardinality
  • Regression Analysis

Basic Analysis

Growth Analysis

Growth analysis looks at changes in ratios over time. Our goal is to look for major changes which look to be far outside the norm.

Typically, we will perform growth analysis against important additive measures: company growth, revenue, cost, number of customers, etc.

Basic Analysis

Growth Analysis

Demo Time

Basic Analysis

Gaps

Sometimes, it's the lack of a value which is interesting. We want to look for sequences which should be complete but aren't--examples include receipts, purchase orders, check numbers, invoice numbers, etc.

There are several T-SQL constructs which can help us find missing values, including the Gaps and Islands solutions and using a tally table.

Basic Analysis

Gaps

Demo Time

Basic Analysis

Gaps

Warning: if you generate invoice numbers (for example) using a SQL Server identity integer, gaps may occur when transactions get rolled back. It may be better to use a surrogate key for storage but generate an invoice number only after the new transaction saves successfully.

Basic Analysis

Duplicates

True duplicates are exact matches on relevant columns. For true duplicates, are there one-time codes or unenforced unique keys which show duplicates? Ex: internal tracking IDs.

"Mostly" duplicates are cases in which there are sometimes-unexpected differences. For "mostly" duplicates, are there combinations of fields which ought to be unique but aren't? Ex: both "Fundation" and "Foundation" showing up in company name.

Basic Analysis

Duplicates

There are several measures for the similarity of two text phrases, including Levenshtein distance, soundex, and even pattern matching using LIKE. Depending upon the level of sophistication necessary, getting the similarity of two items based on text descriptors can take quite a bit of effort!

Basic Analysis

Duplicates

Tying back to the Wake County fraud case, there were 24 separate days in which the auto supplier submitted 50+ invoices under $2500 to the county. Here, the duplication is in vendor and date and is well outside the expected behavior for a vendor on the up-and-up.

Basic Analysis

Regression Analysis

Regression analysis can be extremely complicated, given issues like multicollinearity, serial correlation, and heteroskedasticity.

Still, a basic ordinary least squares linear regression can tell us a lot.

Basic Analysis

Agenda

  1. Know Your Data
  2. Basic Analysis
  3. Cohort Analysis
  4. Market Basket Analysis
  5. Time Series Analysis
  6. Numeral Analysis

Cohort Analysis

Looking at top-level aggregates or a single slice (customer, vendor, group) might help us find problems, and looking at a broad-based comparison of these slices may give us a clearer picture.

A data set can be sliced many ways, so there could be a large number of relevant cohorts.

Cohort Analysis

Box Plots

Box plots are valuable for comparing cohorts, as they show, by group, the median, 25th percentile, 75th percentile, 1.5 * interquartile ranges, and outliers.

Cohort Analysis

Outliers

Plotting a trend and looking for outliers (defined either in a technical sense or in the general sense of "well outside the norm") can dig up strange behavior. Real data sets will always have outliers, though, so the existence of outliers is not in itself suspicious.

Cohort Analysis

Demo Time

Cohort Analysis

Clustering Behavior

Something interesting when analyzing a cohort is to look for clustering around certain values. Examples: expense reports and invoices.

Going back to our Wake County example, there was a cluster around $2500, as that was the cutoff point after which a second person needed to sign off on a payment order.

Cohort Analysis

Difference Analysis

When comparing different members of a cohort, remember that there are two types of difference: differences between groups and differences within a group.

Differences between groups may indicate suspicious behavior if one member of the group acts well outside the norm, such as in the number of invoices submitted per day.

Looking at differences within a group can help explain anomalous aggregate behavior.

Cohort Analysis

Demo Time

Agenda

  1. Know Your Data
  2. Basic Analysis
  3. Cohort Analysis
  4. Market Basket Analysis
  5. Time Series Analysis
  6. Numeral Analysis

Market Baskets

Market basket analysis (AKA an association study) is an analysis of groups of items generally purchased together. For example, beer and diapers.

Credit card companies use association studies to help determine if there is fraudulent card use. For example, purchasing an airline ticket is lower-risk if the cardholder also purchased a hotel room and rental car the same day.

Other industries use association rules as well for fraud analysis, including the insurance industry.

Market Baskets

When building association rules, the more common path is to look for sets of items together: government employee X always signs invoices under $2500 for vendor Y.

It may also be interesting (though computationally expensive!) to look for sets of items not expected together: for example, a purchase in one city at the same time as a purchase in another city.

Agenda

  1. Know Your Data
  2. Basic Analysis
  3. Cohort Analysis
  4. Market Basket Analysis
  5. Time Series Analysis
  6. Numeral Analysis

Time Series Analysis

Simple Time Series

The easiest time series analysis is a count of items per day.

Going back to the Wake County example, auditors could have found something suspicious with the Barnes company pretty easily: there were 24 separate days with at least 50 invoices, certainly a suspicious scenario.

Other time series examples include the sums of measures by day or month, such as revenue, cost, number of orders, and amounts of refunds.

Time Series Analysis

Balances

For semi-additive measures such as account balances, look at the state of the measure at fixed points in time, such as looking at account balances every day at 9 PM UTC.

Time Series Analysis

Date Checks

Look for "odd" date combinations: order date later than ship date? Adjustments after a journal closes?

Look at behavior based on date: focus on day of week, day of month, weekdays vs weekends, mid-week versus early week vs late week. Sum and count measures grouped on these date slices.

Time Series Analysis

Holidays

Holidays can lead to different behavior like more sales of higher amounts. They can also lead to less activity, like how when invoices are due on a holiday, they may be submitted the next working day.

Be aware of floating holidays like Easter, as well as observed holidays like Christmas Day observed. It's easiest to store holidays in a date table to avoid trying to calculate these rules on the fly.

Time Series Analysis

Demo Time

Agenda

  1. Know Your Data
  2. Basic Analysis
  3. Cohort Analysis
  4. Market Basket Analysis
  5. Time Series Analysis
  6. Numeral Analysis

Numerals

Round Numbers

One high-level analysis of values is to get a count of "round numbers." We group into types based on the number of 0s at the end (ignoring cents): $5000 (type 3), $10,200 (type 2), $180 (type 1), $17,999 (type 0).

Numerals

Demo Time

Numerals

Benford's Law

Benford's Law is an interesting phenomenon, where the first digit (or two digits, or three digits) of a set of numbers tends to follow a specific pattern.

Numerals

Demo Time

Numerals

Last-Digit Analysis

As we saw in the last demo, Benford's Law doesn't apply to the last digit in a sequence; instead, we assume those are uniform unless there is a reason to believe otherwise.

Wrapping Up

This has been a quick survey of forensic accounting and outlier detection techniques. You can apply these techniques to a number of data sets, not only to find potential instances of fraud, but also to search for potential data integrity issues and even to gain more insight into your data.

Wrapping Up

To learn more, go here: http://CSmore.info/on/outliers

And for help, contact me: feasel@catallaxyservices.com | @feaselkl