We Speak Linux
Wake County transportation fraud.
An auto parts company, working in conjunction with several county employees, defrauded Wake County of at least $5 million over a multi-year period.
Participants took advantage of lax vendor policy for small purchases and submitted fraudulent invoices, all under the cutoff for further scrutiny.
The fraud went all the way to the top, reaching the head of the transportation department.
The end result was several guilty pleas and jail time for participants.
Wake County was able to claw back $5 million and Barnes Motor & Parts paid back $3 million, including a $2.5 million criminal fine.
As we go over techniques forensic accountants use to find potentially fraudulent data, we will return to this case several times.
Most of us don't spend time hunting down fraudsters using our data. But the practices and principles these forensic accountants use can still be valuable for us.
We can use these to find dirty data in our own environments. The end goal isn't to find recalcitrant users and toss them into jail; it is to improve our overall data quality, as you can't find bad data if you don't know what it looks like.
This is a quick overview of data samples used in this talk. Ideally, you want to know what the data elements represent and have as detailed an understanding of your data as possible.
Patterns on their own are interesting, but they are not, by themselves, indicative of malicious or irresponsible behavior.
For your data sets, you should ideally have a detailed understanding of what data points represent and how valid the data is.
For these open data sets, sometimes the data is purposefully incomplete or inaccurate. For example, police incident data does not include the acting officer or exact incident location, to protect all parties involved.
Also, data pulled from Kaggle competitions and the like will have relevant features masked.
Our basic analysis will focus on a few important areas:
Growth analysis looks at changes in ratios over time. Our goal is to find major changes which fall far outside the norm.
Typically, we will perform growth analysis against important additive measures: company growth, revenue, cost, number of customers, etc.
Sometimes, it's the lack of a value which is interesting. We want to look for sequences which should be complete but aren't: examples include receipts, purchase orders, check numbers, and invoice numbers.
There are several T-SQL constructs which can help us find missing values, including the Gaps and Islands solutions and using a tally table.
Warning: if you generate invoice numbers (for example) using a SQL Server identity integer, gaps may occur when transactions get rolled back. It may be better to use a surrogate key for storage but generate an invoice number only after the new transaction saves successfully.
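The talk's demos use T-SQL, but the gap-hunting idea is easy to sketch in Python. A minimal version, using hypothetical invoice numbers:

```python
def find_gaps(numbers):
    """Return (start, end) ranges of missing values in a sequence
    which ought to be contiguous, such as invoice numbers."""
    gaps = []
    ordered = sorted(set(numbers))
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > 1:
            gaps.append((prev + 1, curr - 1))
    return gaps

# Hypothetical invoice numbers with two gaps.
invoices = [1001, 1002, 1005, 1006, 1008]
print(find_gaps(invoices))  # [(1003, 1004), (1007, 1007)]
```

In T-SQL, comparing each row against its predecessor with LAG() over the ordered sequence accomplishes the same check.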
True duplicates are exact matches on relevant columns. For true duplicates, are there one-time codes or unenforced unique keys which show duplicates? Ex: internal tracking IDs.
"Mostly" duplicates are cases in which there are sometimes-unexpected differences. For "mostly" duplicates, are there combinations of fields which ought to be unique but aren't? Ex: both "Fundation" and "Foundation" showing up in company name.
There are several measures for the similarity of two text phrases, including Levenshtein distance, Soundex, and even pattern matching using LIKE. Depending upon the level of sophistication necessary, getting the similarity of two items based on text descriptors can take quite a bit of effort!
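Levenshtein distance, for instance, is a dynamic-programming count of the minimum number of single-character edits between two strings. A minimal sketch in Python (the company names are hypothetical):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# "Fundation" is one insertion away from "Foundation".
print(levenshtein("Fundation", "Foundation"))  # 1
```

Small distances between supposedly distinct names are a good signal that two records are "mostly" duplicates.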
Tying back to the Wake County fraud case, there were 24 separate days in which the auto supplier submitted 50+ invoices under $2500 to the county. Here, the duplication is in vendor and date and is well outside the expected behavior for a vendor on the up-and-up.
Regression analysis can be extremely complicated, given issues like multicollinearity, serial correlation, and heteroskedasticity.
Still, a basic ordinary least squares linear regression can tell us a lot.
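As a sketch of what even simple OLS buys us, here is a two-variable fit in plain Python; the monthly totals are made up, and in practice the points with large residuals against the fitted line are the ones worth investigating:

```python
def ols_fit(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical monthly invoice totals: steady growth with a little noise.
months = [1, 2, 3, 4, 5]
totals = [10.0, 12.1, 13.9, 16.0, 18.1]
slope, intercept = ols_fit(months, totals)
```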
Looking at top-level aggregates or a single slice (customer, vendor, group) might help us find problems, but a broad-based comparison across these slices can give us a clearer picture.
A data set can be sliced many ways, so there could be a large number of relevant cohorts.
Box plots are valuable for comparing cohorts, as they show, by group, the median, the 25th and 75th percentiles, whiskers extending 1.5 interquartile ranges past those percentiles, and outliers.
Plotting a trend and looking for outliers (defined either in a technical sense or in the general sense of "well outside the norm") can dig up strange behavior. Real data sets will always have outliers, though, so the existence of outliers is not in itself suspicious.
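The box plot's outlier rule is easy to apply directly. A minimal sketch, using hypothetical per-day invoice counts for one vendor:

```python
import statistics

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the same
    fences a box plot uses for its whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical per-day invoice counts: one wild day stands out.
daily_counts = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(daily_counts))  # [95]
```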
Something interesting when analyzing a cohort is to look for clustering around certain values. Examples: expense reports and invoices.
Going back to our Wake County example, there was a cluster around $2500, as that was the cutoff point after which a second person needed to sign off on a payment order.
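Detecting that kind of pile-up can be as simple as counting amounts just under the cutoff. A sketch with hypothetical invoice amounts (the $100 window is an arbitrary choice):

```python
def just_under_threshold(amounts, cutoff, window=100):
    """Count amounts which fall just under a review cutoff, such as a
    threshold requiring a second sign-off on a payment order."""
    return sum(1 for a in amounts if cutoff - window <= a < cutoff)

# Hypothetical invoice amounts: a suspicious cluster just under $2,500.
invoices = [2499.00, 2450.75, 2480.10, 1200.00, 300.00, 2499.99, 2600.00]
print(just_under_threshold(invoices, 2500))  # 4
```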
When comparing different members of a cohort, remember that there are two types of difference: differences between groups and differences within a group.
Differences between groups may indicate suspicious behavior if one member of the group acts well outside the norm, such as in the number of invoices submitted per day.
Looking at differences within a group can help explain anomalous aggregate behavior.
Market basket analysis (AKA an association study) is an analysis of groups of items generally purchased together; the classic example is beer and diapers.
Credit card companies use association studies to help determine if there is fraudulent card use. For example, purchasing an airline ticket is lower-risk if the cardholder also purchased a hotel room and rental car the same day.
Other industries use association rules as well for fraud analysis, including the insurance industry.
When building association rules, the more common path is to look for sets of items together: government employee X always signs invoices under $2500 for vendor Y.
It may also be interesting (though computationally expensive!) to look for sets of items not expected together: for example, a purchase in one city at the same time as a purchase in another city.
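A minimal sketch of support and confidence for item pairs, the building blocks of association rules (the baskets are hypothetical):

```python
from itertools import combinations
from collections import Counter

def pair_stats(baskets):
    """For each item pair (a, b): support is the share of baskets
    containing both; confidence is P(b in basket | a in basket)."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    n = len(baskets)
    return {(a, b): {"support": c / n, "confidence": c / item_counts[a]}
            for (a, b), c in pair_counts.items()}

# Hypothetical transactions.
baskets = [
    {"beer", "diapers"},
    {"beer", "diapers", "chips"},
    {"beer", "chips"},
    {"diapers", "wipes"},
]
stats = pair_stats(baskets)
```

Here "beer and diapers" appears in half the baskets (support 0.5); rules which hold with unusually high confidence for one employee-vendor pair are worth a second look.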
The easiest time series analysis is a count of items per day.
Going back to the Wake County example, auditors could have flagged the Barnes company pretty easily: 24 separate days with at least 50 invoices apiece is well outside normal vendor behavior.
Other time series examples include the sums of measures by day or month, such as revenue, cost, number of orders, and amounts of refunds.
For semi-additive measures such as account balances, look at the state of the measure at fixed points in time, such as looking at account balances every day at 9 PM UTC.
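The count-per-day check from the Wake County example takes only a few lines. A sketch with hypothetical invoice dates:

```python
from collections import Counter

def suspicious_days(invoice_dates, threshold=50):
    """Count invoices per day and return the days at or above the
    threshold, mirroring the 50-invoice days in the Wake County case."""
    per_day = Counter(invoice_dates)
    return sorted(day for day, n in per_day.items() if n >= threshold)

# Hypothetical invoice dates: most days quiet, one day flooded.
dates = ["2011-03-01"] * 3 + ["2011-03-02"] * 55 + ["2011-03-03"] * 2
print(suspicious_days(dates))  # ['2011-03-02']
```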
Look for "odd" date combinations: order date later than ship date? Adjustments after a journal closes?
Look at behavior based on date: focus on day of week, day of month, weekdays versus weekends, and early week versus mid-week versus late week. Sum and count measures grouped by these date slices.
Holidays can lead to different behavior, such as more sales at higher amounts. They can also lead to less activity: for example, invoices due on a holiday may be submitted on the next working day.
Be aware of floating holidays like Easter, as well as observed holidays like Christmas Day observed. It's easiest to store holidays in a date table rather than trying to calculate these rules on the fly.
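A date table can be as simple as precomputed attributes per date. A minimal sketch (the holiday list is illustrative, not complete):

```python
import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical holiday list; a real date table covers every year in use.
HOLIDAYS = {datetime.date(2024, 1, 1), datetime.date(2024, 12, 25)}

def date_attributes(d):
    """Precomputable attributes for slicing measures by date behavior."""
    return {
        "day_of_week": WEEKDAYS[d.weekday()],  # Monday = 0
        "is_weekend": d.weekday() >= 5,
        "is_holiday": d in HOLIDAYS,
    }

attrs = date_attributes(datetime.date(2024, 12, 25))
```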
One high-level analysis of values is to get a count of "round numbers." We group into types based on the number of 0s at the end (ignoring cents): $5000 (type 3), $10,200 (type 2), $180 (type 1), $17,999 (type 0).
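The trailing-zeros typing above is straightforward to compute:

```python
def round_number_type(amount):
    """Number of trailing zeros in the whole-dollar amount, cents
    ignored: $5,000 -> 3, $10,200 -> 2, $180 -> 1, $17,999 -> 0."""
    dollars = int(amount)
    if dollars == 0:
        return 0
    zeros = 0
    while dollars % 10 == 0:
        zeros += 1
        dollars //= 10
    return zeros

print([round_number_type(a) for a in (5000, 10200, 180, 17999)])  # [3, 2, 1, 0]
```

An unusually high share of type-2 and type-3 amounts in, say, expense reports is a common sign of fabricated figures.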
Benford's Law is an interesting phenomenon, where the first digit (or two digits, or three digits) of a set of numbers tends to follow a specific pattern.
As we saw in the last demo, Benford's Law doesn't apply to the last digit in a sequence; instead, we assume those are uniform unless there is a reason to believe otherwise.
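Benford's Law says the probability of a leading digit d is log10(1 + 1/d), so digit 1 should lead about 30.1% of the time. A sketch comparing observed and expected shares (the sample values are hypothetical):

```python
import math
from collections import Counter

def leading_digit(n):
    """First digit of a positive integer."""
    while n >= 10:
        n //= 10
    return n

def benford_check(values):
    """For each digit 1-9, return (observed share, Benford-expected share)."""
    digits = Counter(leading_digit(v) for v in values if v > 0)
    total = sum(digits.values())
    return {d: (digits.get(d, 0) / total, math.log10(1 + 1 / d))
            for d in range(1, 10)}

# Hypothetical amounts; in a real check, compare the shares per digit.
observed, expected = benford_check([1, 19, 150, 2, 35])[1]
```

Large divergences between observed and expected shares, over a large enough sample, suggest the numbers were made up rather than generated organically.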
This has been a quick survey of forensic accounting and outlier detection techniques. You can apply these techniques to a number of data sets, not only to find potential instances of fraud, but also to search for potential data integrity issues and even to gain more insight into your data.
To learn more, go here: http://CSmore.info/on/outliers