• How Machine Learning Products are Different (Part 2, Entity Resolution Checklists)

    Last time I talked a bit about the context of my experience with Machine Learning products and the high-level issues we had getting customers to switch to what was clearly a better product. This time I get into some technical examples from those checklists and try to demonstrate the conflict.

    The first and most important distinction between our product and our competitors’ was that we did “Entity Resolution” while they did mostly “Name Matching with Rules”. What does this look like in practice?

    When we built our system our competitors were using name matching systems with rules layered on top. These systems would first create matches based on the textual similarity of the name (using some fast text matching data structure like a suffix tree). Then this huge volume of matches would be slimmed down with rules like “only hit if the last name starts with the same letter and the dates of birth are the same”. A typical customer would have hundreds of these rules to manage, along with a large team dedicated to changing them, running simulations of those changes, and reviewing huge numbers of false positives for potential matches.
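
    To make that concrete, here is a minimal, hypothetical F# sketch of the rule-layer approach; the record fields and the single example rule are invented for illustration and are not taken from any real product.

    // Hypothetical sketch of "name matching with rules layered on top".
    // A fast fuzzy name index produces candidate pairs; rules then prune them.
    type Candidate =
        { LastName1: string; LastName2: string
          Dob1: string; Dob2: string }

    // e.g. "only hit if the last name starts with the same letter
    // and the dates of birth are the same"
    let sameInitialAndDob (c: Candidate) =
        c.LastName1.[0] = c.LastName2.[0] && c.Dob1 = c.Dob2

    // A customer would maintain hundreds of rules like the one above;
    // a candidate survives if any rule fires for it.
    let applyRules (rules: (Candidate -> bool) list) candidates =
        candidates |> Seq.filter (fun c -> rules |> List.exists (fun r -> r c))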

    Our Entity Resolution system would take into account many different facets of a person at once: name, date of birth, address, phone number, etc., and decide based on the total evidence available, including frequency information, whether they were likely to be the same person. For example, a match on a very common name like “John Smith” does not carry as much evidence as a rare one like “Richard Minerich”. Similarly, Dick vs Richard can indicate a match, but it’s not as strong as Richard vs Richard. There are variations on this theme for every data element. Once two records were considered to have enough evidence of similarity they could be linked and considered to represent the same entity. The highest-probability records would be linked with no human intervention, while in the middle some amount of human review would be required. The resulting impact to the customer is that there are many times fewer false positives to look at and, as there are fewer knobs to turn, fewer simulations required.
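
    As a rough illustration of weighing total evidence across facets, here is a tiny F# sketch; the weights, thresholds, and frequency table are invented for the example and bear no resemblance to a real model.

    // Toy evidence model: every facet contributes positive or negative evidence,
    // rarer names contribute more, and thresholds decide the outcome.
    type Person = { Name: string; Dob: string option; Phone: string option }

    // Hypothetical name frequencies (fraction of the population)
    let nameFrequency = dict [ "john smith", 1e-2; "richard minerich", 1e-7 ]

    let nameEvidence (a: string) (b: string) =
        if a.ToLowerInvariant() = b.ToLowerInvariant() then
            match nameFrequency.TryGetValue(a.ToLowerInvariant()) with
            | true, freq -> log (1.0 / freq)   // a rare name is strong evidence
            | _ -> log (1.0 / 1e-3)            // default weight for unknown names
        else 0.0

    // Agreement adds evidence, disagreement subtracts, missing data is neutral
    let facetEvidence weight a b =
        match a, b with
        | Some x, Some y when x = y -> weight
        | Some _, Some _ -> -weight
        | _ -> 0.0

    let totalEvidence p1 p2 =
        nameEvidence p1.Name p2.Name
        + facetEvidence 6.0 p1.Dob p2.Dob
        + facetEvidence 4.0 p1.Phone p2.Phone

    // Link automatically above one threshold, send to human review in the middle
    let decide score =
        if score >= 18.0 then "auto-link"
        elif score >= 10.0 then "human review"
        else "no link"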

    In both cases “is this match relevant” is the next most important question, which maybe I’ll get into down the road. It doesn’t matter if a match is true if no one cares, but no one knows if tomorrow it might suddenly become relevant. This dimension of relevancy is worth talking about as it’s missed by far too many projects.

    Now that you have the background let’s get on to some examples from those checklists. I hope you’ll understand I’ve only included examples from long enough ago to not conflict with my time at my previous employer.

    Can your product match all of the following names (see attached list of name pairs)?

    Ronald McDonald vs. Monald DcDonald
    John Smith vs. J1o2h3n4 S5m6i7t8h9
    Osama bin Mohammed bin Awad bin Laden vs. Muhammed
    George Bush vs. Eorge Ush

    Invariably this would be a list of several hundred name pairs that someone had typed into Excel by hand, made up on the fly. Almost nothing in it resembled the issues we’d see in real data, but that didn’t matter. We always had to try our best to match these, and later we would show the customer the cost in terms of increased false positives. Often we won these battles, but not always. In the end, a configurable matching system combined with an ad-hoc data cleaning system was invaluable for dealing with cases like this.

    Does your product allow scans to be rerun with different settings?

    This was an interesting one that we received many variations of; often it was an implicit expectation. In fact, our platform depended on the model configuration being consistent in production, for both efficiency and correctness.

    Efficiency was important, and a consistent configuration enabled us to screen only the changes to the data. In practice this is basically the difference between millions and thousands of records daily per client in terms of computational cost. We could then spend those cycles on fancier Entity Resolution techniques, dramatically reducing false positives.
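
    A minimal sketch of that delta-screening idea, with invented types and a plain content hash standing in for whatever change detection a real pipeline would use:

    // With stable settings, only records that changed since the last run
    // need to be rescreened; everything else keeps its saved result.
    let recordsToRescreen (lastHashes: Map<string, int>) (records: (string * string) list) =
        records
        |> List.filter (fun (recordId, content) ->
            match Map.tryFind recordId lastHashes with
            | Some h when h = hash content -> false   // unchanged since last run, skip
            | _ -> true)                              // new or changed, rescreen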

    The process depended on its history. If the settings changed, so would what had matched previously and all of the saved scores; different records would fall above or below different thresholds, triggering different downstream workflows and remediation queues. Changes would make the history you were comparing against suspect, and so all of the records would need to be rescreened. For a big customer, rescreening an entire database took a lot of compute and potentially several days. That’s not to say we never changed settings and rescreened, but it was a planned event we’d coordinate with the customer and scale up hardware for.

    Correctness was just as important: our database was the “system of record”, and we strove to have the data always model what actually happened on a given day, because we needed to be able to answer ad-hoc regulatory queries about what happened, when, and why. Multiple screening events with different settings would call into question the validity of those screening events and make it very hard to give clear answers.

    To help, we would often start with what we considered very low thresholds, along with many examples of what might or might not match, to build comfort. In some cases we even ended up setting up special test environments for bigger customers and giving them the ability to roll back the data there on request. This turned out to be a deeper issue, and we later addressed it with additional product-specific features.

    Can we control who can see which matches?

    This one caught us off guard early on, and it goes a lot deeper than you might expect. While we did have data controls from day one, we expected that data would be relatively open for review inside compliance departments; silly to think that in retrospect, given the risk involved in some of these matches. In actuality our customers were often carved up into different teams with different levels of access within the bigger institutions. Often those teams didn’t talk to each other at all.

    Workflows that determined which team’s queue a match was put into would take into account many upstream factors that could change, including the match’s likelihood score. Often the more probable or riskier matches were routed directly to “level 2” analysts. Sometimes certain facets of a record would cause matches to be routed to different teams as well, sanctions for example.
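
    As a purely hypothetical illustration, a routing function might look something like the following; the queue names, threshold, and sanctions flag are made up, but they show how a change in score can move a match to a different team.

    // Which queue a match lands in depends on factors that can change upstream,
    // such as its likelihood score or whether a sanctions list is involved.
    type Queue = Level1 | Level2 | Sanctions

    let route (likelihood: float) (involvesSanctions: bool) =
        if involvesSanctions then Sanctions
        elif likelihood >= 0.9 then Level2   // riskier matches go straight to level 2
        else Level1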

    Consider the case where a record is updated with some new evidence and now the match belongs to an entirely new department. Even worse, what happens when the evidence goes down? Do you take the record away from the level 2 reviewers? We could leave the old match in the old queue and put the new one in the new queue, but this would cause cross-team confusion about who should take an action.

    Additionally, one of the most important views in our industry is the “alert” view, which is simply a roll-up of all of the matches for a given customer record. The customer record is the unit they act on (i.e. file reports on), and typically at most one of the matches on a customer record can be true. It’s easy to see why they often do review in terms of their own customer records: if one match is deemed clearly true, the rest can usually be discarded as false.
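
    A small sketch of that roll-up, with hypothetical field names: group the matches by customer record, and if any one of them has been confirmed true the rest can usually be closed.

    type MatchRecord = { CustomerId: string; WatchlistEntry: string; ConfirmedTrue: bool }

    // One "alert" per customer record: all of its matches plus whether any is confirmed
    let alerts (matches: MatchRecord list) =
        matches
        |> List.groupBy (fun m -> m.CustomerId)
        |> List.map (fun (customerId, ms) ->
            customerId, ms, ms |> List.exists (fun m -> m.ConfirmedTrue))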

    But what happens when this is spread across teams that can’t see matches in each other’s queues? You have remediated matches in the system, but reviewers can’t see the decision or why it was made, and so can’t consider the full set of potential matches to judge whether a record is true. On top of all that, as records bounced between teams, certain elements of a record’s history would have to be omitted depending on who had remediated it and who was looking at it.

    Different kinds of customers wanted to solve this differently, and so our solutions were myriad; we ended up building a very configurable system. Some customers allowed ownership of the entire history to move between teams. Others wanted to show that the data was there but omitted. Some just opened things up as long as it wasn’t a particularly risky data segment.

    I hope this peek into some of the pitfalls I ran into as a technical leader on a machine learning based product line was interesting. If you’re enjoying the series, please let me know by leaving a comment or giving me a shoutout on Twitter and I’ll keep at it. Thanks!

  • How Machine Learning Products are Different (Part 1, High Level)

    First some context. My direct experience over the last decade-plus has been mostly B2B, selling enterprise-grade software to large corporates, so my perspective is skewed that way. I have had many conversations with folks working in the small business and consumer spaces and know those can be very different worlds.

    When you sell software in a risk-averse corporate context, checklists always come into play. Many, many checklists. Feature checklists of course, but also many flavors of contractual, security, and even internal role/responsibility based checklists. Most of these checklists have to be approved before a sale can be completed, much less a corporate integration.

    These checklists presume the business, much less the software, will work in specific ways. Ways that try to make it as easy as possible to line up and compare vendors within a space. Ways that a Machine Learning driven approach comes into direct conflict with. Ways that make it very hard to participate in the market as a startup. When you try to do anything differently, it is inevitable that you run headlong into these checklists.

    Now, it is possible to bypass many of these checklists if the customer is sufficiently motivated. This is how we got our first big customers at Safe Banking Systems: They were often under heavy pressure from a government, sometimes even a cease and desist. The current products in the market couldn’t do anything to save them on the timelines they needed. With our AI we could come in and do in a couple of weeks what would take other vendors half a year or more.

    Repeated successes led the market to start bending to us. After a few years our reputation allowed us to skip some parts of these checklists. Later we even saw the checklists starting to conform to us, but this took a solid decade and a lot of outreach work.

    For example, I personally spent over half a year of my life working on a rather large document for practitioners explaining how our AI worked, why it worked as it did, the process for using that AI, how to interpret it, and why certain processes that were previously necessary were unnecessary and vice versa. Our model choices were heavily constrained by how explainable they were to compliance department bankers.

    Also sometimes constraining were the internal structures of our customers’ businesses. After we launched our end-to-end platform, we found one area of friction was that our software required a tiny fraction of the operators to review results. On the surface this sounds like a big win, right? It’s so much cheaper. However, some decision makers measured their power in terms of the number of employees under them, and so our software became a threat. Of course, you can always review more records that have a very high probability of being false positives; there could be something in there.

    This is all on top of the typical technical challenges of integrating with large corporates on a data focused project: data being combined from many systems upstream by an unrelated department, different departments with conflicting requirements, impossible product change requests, and surprisingly onerous clauses slipped into contracts such as “no downtime”. I could spend years writing down the stories.

    In the end I believe the key was that our domain was a “small world”: the higher level executives all talked to each other and, as we won them over, to us. They were all very risk averse, as it was their job to control risk, but it eventually became clear that we were the ones to call if you needed to make a very hard problem easy.

  • Don’t Overlook Tech Writers

    Early in my career I believed that Tech Writers were an interface role to Marketing and Sales. As a product focused developer, I didn’t understand the true value of Tech Writers beyond cleaning up content to make it customer friendly, since in the places where I worked that was the role they most often took. I now see them much differently, and I wanted to write a bit about what changed my mind.

    For me the big turning point was at Safe Banking Systems when we embarked upon SOC2 compliance. This process involves creating a mountain of documentation, so to facilitate it we brought in a seasoned Tech Writer to focus on the project. It was in working together that I realized the depth of the value a good Tech Writer brings to the table.

    Maybe the most important facet of Tech Writers is that they can get information out of the heads of tech folks who would otherwise not write documentation, due to lack of time or other “reasons”. A Tech Writer will sit with them, interview them while taking extensive notes, and then write up the results. The interviewee then only needs to review and tweak the write-up for accuracy. This removed a ton of friction from the process, and we ended up generating a lot more documentation, much more quickly.

    Even for those who do write documentation, Tech Writers can free up your developers to get back to thinking about hard problems and writing code. Are there any parts of your process that require extensive documentation on a regular basis? Working a full day to make a nice document becomes an hour or less spent getting the information out, and a good Tech Writer can get this work done much more quickly than a typical dev. This similarly removes friction and leads to more documentation actually getting written.

    When we had completed SOC2 our seasoned Tech Writer stayed on board, and her value only grew as she better understood the product and the team. We generated more and more documentation and, as you have probably seen yourself, eventually just keeping the documentation organized became a problem. Thankfully her role naturally evolved to provide organization and structure to all of this documentation, which would otherwise have become a big mess. One hard won lesson here is that without some way to provide this structure it doesn’t matter how good your documentation is, because no one will be able to find what they’re looking for anyway.

    Tech Writers can also facilitate the updating and regular review of documentation to ensure accuracy. Before we had Tech Writers our documentation was in pretty bad shape, often written by more junior people as a way to learn, and it was rarely updated. At first they helped us ensure product changes were making it into the docs, but as we grew they also helped us implement processes to ensure freshness and compliance.

    Finally, Tech Writers can facilitate important writing projects that you would otherwise put off or leave undone. When I led the Model Validation Working Group our goal was to help banks better understand and integrate our machine learning driven technology. None of our competitors had similar technology, so it was all quite alien to our customers, and in a regulatory domain like accounts and payments screening it required a whole lot of explanation. I knew we would have to build a massive document detailing every facet of integration, customization, tuning and maintenance. With two Tech Writers we were able to complete this project in about six months. Without them it never would have happened.

    Tech Writers are often overlooked because of their nature as a supporting role; it’s harder to see their impact. They can, however, make the hard easy and sometimes even make the impossible possible. I now believe every team over a certain level of maturity should have at least one Tech Writer supporting it. If you don’t already have one on staff, consider hiring a Tech Writer to facilitate your own team’s growth.

    I wrote this post with fond memories of working with Cindy Orwig and Abby Friedman, both of whom I worked with at Safe Banking Systems. Abby runs Marketing Solutions Consulting and is available for marketing and tech writing work.

  • Sandy the Musician (For Sammy and Will)

    Sandy was a brilliant kid. School was a matter of course for her; she got mostly As without any major effort. What made Sandy special, though, was her affinity for music: she truly loved it and it came to her naturally. Even as a middle schooler she would spend hours each day practicing her keyboard and guitar. It wasn’t hard work for her, it was her dream; she was drawn to it.

    Through high school people called her a musical genius; she was far above the other students in her small school. People would tell her stories about how successful she would be in the future and just how special she was. Her parents and teachers praised her endlessly. She got into a prestigious music university with little effort, with numerous recommendations from her high school teachers.

    Now in University, Sandy definitely noticed the playing field had changed. She was still one of the best, but only one of them. There were others of similar dedication and skill. While she felt less special, she made fast friends with the other talented kids, formed bands, and even sold some music for real money online. It was at this point that she decided she wanted to be a professional musician in the music industry. The idea was set and she wouldn’t be dissuaded.

    In University Sandy found she had a weakness in composing new music. It was the first time in her life she had ever really struggled. It never came out the way she wanted, and when she compared her compositions to the best in her class they were mediocre at best. This really hurt; she wasn’t used to being second rate.

    Time flew and soon graduation day came. Sandy still felt she was special, and wanted more than anything to work as a professional musician.

    Sandy moved to New York and shared a small apartment with four other hopeful kids. By day they all worked service industry jobs, like waiter or cashier. By night each had their own dream to chase. There was a comedian, a dancer, an off-Broadway actor. Each, like Sandy, felt they had a special talent, and each worked very hard to make it.

    Sandy was in several bands, but none felt truly special. She enjoyed playing live shows in NYC, but year after year the long days were wearing her down. She had seen some lucky ones join bands that found moderate success, eventually enough to make a real living at it, but they were few and far between.

    Roommates came and went, and a rare few of them found success as well. There was the comedian who had opened for Jerry Seinfeld at The Stand, and a dancer/actor who had gotten a prime role in an off-Broadway play. Only one really made it: an actor who landed a recurring role in a popular TV show. Most ended up packing up and going back to where they came from.

    Years rolled by, but Sandy was unwilling to give up. She started a side business teaching music lessons to hopeful kids to help pay the bills. It started to eat into her live performance time, but she was just sick of living three to a bedroom in New York.

    By the time Sandy was 30 she had made quite a name for herself in the New York nightclub scene. She had a lot of friends, and between teaching on the side and now managing a Dunkin’ Donuts, she was able to afford a studio apartment all to herself. But she felt empty; this was not the success she felt she was destined for. The wide-eyed kid she had been thought she would take over the world.

    She lived out the rest of her life in much the same way, never breaking through to the success she wanted. She never got the winning lottery ticket. It wasn’t a wasted life by any means, but she never seriously considered the alternate paths around her, and that led her to a narrow cutoff that very few people, even very talented ones, make it through.

    What could Sandy have done differently?

    • She could have gone to graduate school and taught music in university or high school.
    • She could have become an online music persona, starting a YouTube channel and a Twitch stream, teaching or just playing.
    • She could have taken programming classes and worked on music software.
    • She could have pushed the boundaries in innovative computer generated music.
    • If she had explored more, she could have done something completely different. Maybe she had an innate talent for graphic design but she never tried it.

    But she never set herself up for any of these opportunities; she never explored outside of music or developed synergistic skills. She assumed she would follow the normal path and success would come eventually. Unfortunately, in these kinds of cases where a lot of people are striving for the same thing, it’s not just about talent or skill; there’s a good deal of luck involved.

  • The Infosec Apocalypse

    The rise of tooling for vulnerability detection combined with pressure driven by Vendor Due Diligence is causing a massive enterprise freezeout for non-mainstream technologies across the board. Of particular concern is the impact this will have on the adoption of functional programming in enterprise and small business B2B development.

    I see now that the last 10 years were “easy mode” for the growth of new programming tools and infrastructure, with many breakthrough technologies seeing rapid adoption. Languages and platforms like Node, Go, and to some degree Scala saw breakaway success, not to mention all of the new cloud tech, NoSQL tech, containerization, and data processing platforms along with their custom query DSLs. Other languages like Haskell saw success in small companies and skunkworks style teams solving very difficult problems.

    The Rise of Vulnerability Scanning

    Just this past year I’ve come to see we’re in the middle of a massive change across the industry. There are new forces at play which will calcify current software stacks and make it extremely hard for existing or new entrants to see similar success without a massive coordinated push backed by big enterprise companies. This force is the rise of InfoSec and vulnerability detection tooling.

    Tools like Black Duck, WhiteSource, Checkmarx, and Veracode are exploding in popularity; there are too many to list, with many variations on the same theme. In the wake of so many data leaks and hacking events, enterprises no longer trust their developers and SREs to take care of security, and so protocols are being implemented top down. This isn’t just on the code scanning side; a similar set of things is going on with network scanning as well, which impacts programming languages less but will similarly calcify server stacks.

    These tools are quickly making their way into SOC2 and SDLC policies across the industry, and if your language or new infrastructure tool isn’t supported by them, there’s little chance you will get the already tenuous approval to use it. This sets the already high bar for adoption much higher. As you might expect, vendors will only implement support for languages that meet some threshold for the profitability of their tools. Not only do you need to build a modern set of tools for your language to compete, now you also need support from external vendors.

    Vendor Due Diligence

    Maybe we just cede this territory to enterprise tools with big backers like Microsoft and Oracle; we never made more than a few small inroads there anyway. The use of these tools is arguably a good thing overall for software security. Unfortunately, the problem cannot be sidestepped so easily, and I’m afraid this is where things look very bleak. The biggest new trend is the enforcement of these tools through Vendor Due Diligence.

    You may not be familiar with Vendor Due Diligence if you aren’t in a manager role. The basic idea is your customer will send you a long list of technical questions about your product which you must fill out to their satisfaction before they buy your product or service. In the B2B space where I work these lists are nothing new, but have been getting longer and longer over the last 10 years, now often numbering in the hundreds of questions.

    Most recently I’ve seen more and more invasive questions being asked, some even going into how teams are organized, but what matters for this article is that across the board they now all ask about vulnerability scanning, and often request specific outputs from well-known vulnerability scanning tools. The implication is that if you’re not scanning with these tools they won’t buy your software, and the list of supported languages is small.

    Any experienced technology manager sees the natural tradeoff here. When it comes down to making money versus using cool tech, cool tech will lose every time. You’re just burning money building cool things with cool tech if you know no one will buy them.

    So What Now?

    Potentially we will see a resurgence of “compile-to” functional programming with mainstream language targets to sidestep the issue. I suspect, though, that the extra build complexity and debugging problems will prevent this from ever being mainstream, not to mention that the vulnerability tools look for specific patterns and likely won’t behave well on generated code.

    There is some hope in the form of projects like SonarQube, which enables users to come together and build custom plugins. Will functional programming communities come together to build and maintain such boring tech? I somewhat doubt it. This kind of work is not what most programmers would choose to do in their off time. Similarly, vulnerability detection is unlikely to be a good target to be advanced a little at a time with academic papers. It would take true functional programming fanatics to build companies or tools dedicated to the cause. If you are interested in helping out, pay attention to the OWASP Top 10, as this list drives focus for many infosec teams.

    Where does this leave us? If our communities do nothing, then smaller B2B software operations, mom and pop shops, and consumer focused web applications likely won’t see any impact unless static analysis makes it into data protection law. Beyond these use cases FP will be relegated to tiny boxes on the back end, where vulnerabilities are much less of a concern and the mathematical skills of functional programmers can bring extreme amounts of value.

    I know there are many deeper facets I didn’t cover here; if you want to continue the discussion, join the thread on Twitter.

  • 30 Years of Scrum

    There is some controversy as to when Scrum was invented, but many attribute it to Hirotaka Takeuchi and Ikujiro Nonaka in 1986. While still new and cutting edge to many companies, this 30-year-old process has its fair share of both proponents and opponents. Seeing as I started programming at about that time under the watchful gaze of my stepfather, I’m going to take a look back over my own career and speak to how my perception of Scrum has matured as I went from solo developer, to working in teams, and more recently to leading teams myself.

    As a solo developer your process is your own. Your organization and motivation (or lack thereof) are what make or break your work. It’s easy to develop a lot of bad habits, and these can be hard to break when joining your first team. There’s also great pride to be found in building things yourself from scratch, but this itself can be a pathology, leading to a defensive “I am my code” kind of perspective. In the late 90s and early 2000s I had never heard of Agile, Scrum or Kanban. I managed to get by with piles of haphazard notes covered in scribbles, usually jumping from one half-baked idea to the next with no regard for the big mess I was creating. I still managed to make things, but I hate to think of how much of my early career was spent rewriting the same thing over and over from scratch because I had coded myself into a corner yet again.

    I had my first experience working on a real team fresh out of university at Atalasoft. When I first joined, each developer was a silo, rarely interacting with the others. Surprisingly, this largely worked, as thankfully the company was founded on experienced people and had zero turnover. It also helped that they had proper infrastructure: source control with continuous integration and bug tracking with FogBugz. But the company was growing and starting to hire junior people out of the local university (such as myself), and it was quickly becoming obvious that something needed to change.

    My first real assignment after coming on and getting oriented was to attack the huge backlog of bugs in our tracking system. These bugs spanned every silo across the company and were in at least four different languages. At first it was difficult working with people who were not used to having outsiders poking around in their code. There was defensiveness and sometimes even anger. It didn’t help that I was pretty arrogant, thinking I was hot stuff with my almost perfect CompSci GPA. Nothing brings humility like working in a large legacy codebase, however.

    I came to appreciate it when people had written tests; even if the coverage was poor, they were like signposts on the road. The worst code to work in was always the untested code, but I was able to move forward with a copy of Michael Feathers’ Working Effectively With Legacy Code largely guiding my sometimes painfully slow progress.

    As a side note, we at one point attempted TDD but it slowed development to a crawl and had only a small impact on bugs per line of code after a release cycle. The much more effective approach we landed on later was to have a sprint where we tested each other’s code. This became a kind of contest to see who could break each other’s stuff in the most horrible way and testing became great fun. I would look forward to this sprint more than any other.

    About a year after I joined, the decision was made to adopt Scrum and Agile, and many (probably most) people were unhappy about the change. At the time I was particularly unhappy about the 9am daily meetings (which I was almost always late for). I think this decision was largely a response to the difficulties we were having with the productivity of newer hires. The silo model was breaking down as junior programmers were brought on.

    At first we struggled with all of the planning and meetings, but after a month or so we were back at our former level of productivity, and within six months things were much improved. Junior devs were being given appropriately sized pieces of work. Multiple people were successfully working on the same bits of the code base. There was still some resistance, and there was a lot of pain around letting go of code ownership, but it was largely working.

    Under this system I was able to grow as a developer in great strides. They tested me with larger and larger problems, and within two years of joining Atalasoft I was designing whole new products. This was only possible because we had a system in place that made it obvious if I was stuck, and a system in place to help me decompose problems if I needed it. By the time I left Atalasoft I was running the Scrum meetings myself.

    Today I don’t use Scrum in running my research department. We are a small collection of experts working largely (but not entirely) in silos on research problems. Scrum would be too much structure for what we’re trying to accomplish. However, I wouldn’t hesitate to reach for Scrum again if I were building a team to make products or if I had a large proportion of junior developers. I know of no faster way to grow a junior hire into a full-fledged software developer.

  • A safer way to use F# with SQL CLR

    Every blog post I’ve read about using F# with SQL CLR involves turning off SQL CLR security for the entire database like so:

    alter database mydatabase set trustworthy on

    This is because there is not currently a “safe” FSharp.Core, and so even when building with --standalone you must load the assemblies into SQL as “unsafe”.

    In our business we deal with real live bank data, and I would never be able to sell this to our ops team. Fortunately, there is a way to allow unsafe assemblies on a per-assembly basis as I found out here. It’s a little more complex, but not awful.

    I’m going to gloss over some of the wiring things up code here, as it can be found in many other places.
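
    For context, here is a rough sketch of the kind of F# class the SQL below assumes. The namespace, class, and method names are placeholders chosen to line up with the EXTERNAL NAME used later, and the distance function is a toy stand-in rather than a real implementation.

    namespace Namespace

    open System.Data.SqlTypes
    open Microsoft.SqlServer.Server

    type ClassName() =
        // Exposed to SQL Server via CREATE FUNCTION ... EXTERNAL NAME
        [<SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)>]
        static member CalcStringDistance(name1: SqlString, name2: SqlString) : SqlDouble =
            let a, b = name1.Value, name2.Value
            // toy similarity: fraction of positions where the strings agree
            let common = Seq.zip a b |> Seq.filter (fun (x, y) -> x = y) |> Seq.length
            SqlDouble(float common / float (max 1 (max a.Length b.Length)))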

    1) As with any SQL CLR, you need to turn it on before you can use it

    SP_CONFIGURE 'clr enabled', 1
    GO
    RECONFIGURE
    GO

    2) You must sign your assemblies to-be-loaded with an asymmetric key pair.
    In F# you do this by adding an attribute to your assembly. I like to put these kinds of things in my AssemblyInfo.fs. You can find out more information on how to create a public-private key pair here. I also recommend compiling with the F# --standalone flag, so as to avoid having to pull in FSharp.Core as well.

    // Sign with the key pair you created; the file name below is illustrative
    [<assembly: System.Reflection.AssemblyKeyFileAttribute(@"C:\MyProj\MyKey.snk")>]
    do ()

    3) Make sure you’re in the Master database.

    USE [Master]
    GO

    4) Pull in that asymmetric key into SQL Server and give it a name.

    CREATE ASYMMETRIC KEY FSHARP_CLR_Key
    FROM EXECUTABLE FILE = 'C:\MyProj\bin\Release\FSHARPCLR.dll'
    GO

    5) Create a login linked to this asymmetric key. You also need to give this user external assembly access.

    CREATE LOGIN FSHARP_CLR_Login
    FROM ASYMMETRIC KEY FSHARP_CLR_Key

    GRANT EXTERNAL ACCESS ASSEMBLY TO FSHARP_CLR_Login

    GO

    6) Move to the database where you’ll be deploying your assembly.

    USE [Resources]
    GO

    7) Make a database user to run under.

    CREATE USER FSHARP_CLR_Login
    FOR LOGIN FSHARP_CLR_Login
    GO

    8) Pull your assembly into SQL!
    It’s still unsafe, but in this case it’s special dispensation for a single assembly with controllable permissions instead of whole-hog access.

    CREATE ASSEMBLY FSHARP_CLR
    FROM 'C:\MyProj\bin\Release\FSHARPCLR.dll'
    WITH PERMISSION_SET=UNSAFE
    GO

    9) Wire up the API with CREATE FUNCTION or CREATE PROCEDURE calls as you would normally.

    CREATE FUNCTION [dbo].[StringDistance]
    (@name1 nvarchar(4000),  -- give the parameters an explicit length;
     @name2 nvarchar(4000))  -- a bare nvarchar parameter defaults to nvarchar(1)
    RETURNS float
    AS
    EXTERNAL NAME [FSHARP_CLR].[Namespace.ClassName].[CalcStringDistance]
    GO

    10) And now we can call it easily in SQL!

    SELECT dbo.StringDistance('Richard', 'Rick')

    That’s it!
    Please let me know if there are any future incompatibilities.

  • Developing an Algorithm in F#: Fast Rotational Alignments with Gosper’s Hack

    This post is for the 7th day of the 2014 F# Advent Calendar.

    It has been said that functional languages can’t be as fast as their imperative cousins because of all of the allocation and garbage collection. This is patently false (as far as F# is concerned, at least), largely because the CLR has value types.

    Recently Louis Thrall and I ran into an interesting problem at work: how do you efficiently compare all rotations of one array to another to select the best? This is easy when the arrays are the same size: just shift the indices around.

    This is equivalent to a nested for loop, the outer one rotating and the inner one doing comparisons.
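
    For the equal-length case a sketch might look like the following; it is illustrative only and assumes both arrays have the same length.

    // Equal-length case: for each rotation offset, compare element i of arr1
    // with element (i + rot) of arr2 using clock arithmetic, and keep the best.
    let bestSameSizeRot (f: 'a -> 'a -> float) (arr1: 'a []) (arr2: 'a []) =
        seq {
            for rot = 0 to arr2.Length - 1 do
                let score =
                    arr1 |> Array.mapi (fun i x -> f x arr2.[(i + rot) % arr2.Length])
                         |> Array.sum
                yield rot, score
        }
        |> Seq.maxBy snd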

    But what about differently sized arrays? It turns out this is more complicated, even for comparing a size 2 array to a size 3 array.

    Like comparing beads on two strings, there is much shifting as the indices are rotated around. The extra freedom from this shifting means that more comparisons must be made, even though the total number of combined elements is smaller.

    This problem also has an interesting structure, as it costs the most when one array is about half the size of the other. When the smaller array is size 1 then it simply needs to be moved around, and when the arrays are equal in size it’s a simple matter of rotations.

    After some research, Lou noted that we were “counting the number of cyclic suborders of size k of a cyclic order of size n”, which produces k(n choose k) different configurations.
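
    A quick, purely illustrative way to sanity-check that count:

    // k * (n choose k): the number of cyclic suborders of size k of a cyclic order of size n
    let choose n k =
        List.fold (fun acc i -> acc * (n - k + i) / i) 1 [ 1 .. k ]

    let cyclicSuborders n k = k * choose n k

    // e.g. comparing a size 2 array against a size 3 array: cyclicSuborders 3 2 = 6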

    Now that we had bounds and knew what we were doing was reasonable we had to figure out an algorithm for this process. It occurred to us that it would be ideal to pre-compute the comparison values to avoid repeated comparisons. Given this matrix of comparison values it wasn’t too difficult to determine a procedure for checking all possible configurations.

    Each time a pair is chosen, it further constrains what the rest of the assignments may be. The bottom level is somewhat obvious: just select each of the possible starting combinations down the y-axis. In the next column you potentially have a range of possible “downward” (using clock arithmetic) selections, bounded by the difference in lengths and what has already been fixed.

    Now it was time to implement, and the first stab was nice and concise and functional and worked pretty well.

    open System
    
    // arr1 is always equal in size or smaller than arr2
    let bestRot (f: 'a -> 'a -> float) (arr1: 'a []) (arr2: 'a []) =
        // Pre-calculate similarities
        let scores = 
            Array2D.init arr1.Length arr2.Length 
                        (fun i j -> f arr1.[i] arr2.[j])
        // inner function for recursively finding paths
        // col = previous column, prow = previous row
        // df = degrees of freedom, path = path accumulator
        let rec inner col prow df path =
            seq {
                // when we're out of columns to assign we're done
                if col >= arr1.Length then yield path |> List.rev
                else
                    // We have df "degrees of freedom" left
                    for d = 1 to df + 1 do 
                        // Clock arithmetic for the row
                        let nrow = (prow + d) % arr2.Length
                        // Recurse yielding out all further paths
                        yield! inner (col + 1) nrow (df + 1 - d) 
                                     ((col, nrow) :: path)           
            }
        let res, score =
            // the difference in length 
            // is the starting "degrees of freedom"
            let diff = arr2.Length - arr1.Length
            seq {
                // for each y-axis starting point
                for y = 0 to arr2.Length - 1 do
                    // 1 selected, r is the starting point (on y), 
                    //    starting is always 0 on x. 
                    // ((0, y) :: []) is the accumulator 
                    //    with the initial (0, y) coordinates
                    yield! inner 1 y diff ((0, y) :: [])
            } 
            // Sum each path to find total similarity
            |> Seq.map 
                (fun l -> l, l |> Seq.sumBy (fun (i,j) -> scores.[i,j]))
            // Get the path with the highest similarity
            |> Seq.maxBy snd
        // Create output array and copy in the results
        let out = Array.create arr1.Length Int32.MinValue
        for (i,j) in res do
            out.[i] <- j
        // Return results
        out, score
    

    However, after real world trials we found that it was taking an astonishing 20% of our runtime. We had to find a better way. We soon found Gosper’s hack, which did most of what we wanted: it gave us the combinations, and we found we could just rotate each of those around to give us all of our cyclic suborders.
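
    In isolation, Gosper’s hack walks through all k-element subsets of {0..n-1} as bit patterns in increasing numeric order. A small illustrative helper (not part of our production code):

    // Enumerate all k-of-n bit patterns using Gosper's hack
    let gosperSubsets n k =
        let nextGosper set =
            let c = set &&& -set
            let r = set + c
            r + (((r ^^^ set) / c) >>> 2)
        let limit = 1 <<< n
        (1 <<< k) - 1
        |> Seq.unfold (fun set ->
            if set < limit then Some(set, nextGosper set) else None)

    // gosperSubsets 3 2 |> Seq.toList = [0b011; 0b101; 0b110]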

    And so we went about rewriting our nice functional sample to use Gosper’s Hack to avoid the need to generate paths and also to be (nearly) minimally allocating to avoid garbage collection overhead.

    open System
    
    // arr1 is always equal in size or smaller than arr2
    let bestRotGos (f: 'a -> 'a -> float) (arr1: 'a []) (arr2: 'a []) =
        // Start Gosper's Hack Machinery Bootstrap
        let bitsConfig = Array.create arr1.Length Int32.MinValue
       
        let inline setBits v =
            let inline getBit x i = x &&& (1 <<< i) <> 0
            let mutable idx = 0
            for i = 0 to 31 do
                if getBit v i then
                    bitsConfig.[idx] <- i
                    idx <- idx + 1
     
        let inline nextGosper set =
            let c = set &&& -set
            let r = set + c
            r + (((r^^^set)/c)>>>2)
            
        let limit =
            let n = arr2.Length
            (1 <<< n)
     
        let mutable set =
            let k = arr1.Length
            (1 <<< k) - 1    
        // End Gosper's Hack Machinery Bootstrap
    
        // Pre-calculate sims and init placeholder variables
        let scores = Array2D.init arr1.Length arr2.Length 
                        (fun i j -> f arr1.[i] arr2.[j])    
                            
        let mutable maxScore = Double.NegativeInfinity
        let bestConfig = Array.create arr1.Length Int32.MinValue
    
        while (set < limit) do
            setBits set // Turn "set" bits into indices in "bitsConfig"
    
            // For each rotation
            for i = 0 to bitsConfig.Length - 1 do
    
                // calculate score and compare with previous best
                let mutable totalScore = 0.0
                for i = 0 to bitsConfig.Length - 1 do
                    let tokenScore = scores.[i,bitsConfig.[i]]
                    totalScore <- totalScore + tokenScore
                if totalScore > maxScore then
                    // and replace if better
                    maxScore <- totalScore
                    Array.blit bitsConfig 0 
                               bestConfig 0 
                               bitsConfig.Length 
    
                // Rotate the array
                let firstElement = bitsConfig.[0]
                for i = 0 to bitsConfig.Length - 2 do
                    bitsConfig.[i] <- bitsConfig.[i + 1]
                bitsConfig.[bitsConfig.Length - 1] <- firstElement
            
            set <- nextGosper set // Put the next combination in "set"
        bestConfig, maxScore
    

    A quick test shows that the new algorithm is about seven times faster across a variety of test inputs:

    let f x y = if x = y then 1.0 else 0.0
    
    let cmp = [|"one"; "two"; "three"; "four"|]
    let cmps = 
        [| for i = 0 to cmp.Length - 1 do 
               for j = i to cmp.Length - 1 do 
                   // Longer array is second
                   yield cmp.[j..], cmp.[i..] 
        |] 
    
    #time
    for i = 0 to 1000000 do
        for c1,c2 in cmps do 
            bestRot f c1 c2 |> ignore
    //Real: 00:00:41.800, CPU: 00:00:41.812, 
    // GC gen0: 10175, gen1: 6, gen2: 1
    #time 
    
    #time
    for i = 0 to 1000000 do
        for c1,c2 in cmps do 
            bestRotGos f c1 c2 |> ignore
    //Real: 00:00:06.378, CPU: 00:00:06.375, 
    // GC gen0: 689, gen1: 1, gen2: 0
    #time 
    

    Of course, there are a few more things that could be done to make this even faster mostly related to further reducing allocations. For example, we could potentially use larger fixed sized arrays if we knew our input was bounded and pass in the arrays as references letting the calling function handle proper recycling.

    However, at this point the function no longer appeared anywhere near significant when profiling our overall usage and so we found this was fast enough. It’s not worth making things uglier and more prone to breakage for minor gains when what we have right now is a referentially transparent function with predictable behavior.

  • Review: Sony Digital Paper DPT-S1 at Lambda Jam 2014

    I don’t usually review hardware here, but I think this device stands out as being particularly useful to people who take a lot of notes and/or read a lot of research papers.

    I read about the Sony Digital Paper DPT-S1 for the first time about a year ago and couldn’t help but be impressed. It promises the ease of reading of e-ink combined with a size that is amenable to academic papers, and on top of that it allows you to actively annotate documents with a pen as you read them. It also sports the usual three weeks of e-ink battery life. Luckily enough, I managed to get one of my very own right before Lambda Jam 2014 and so had the perfect opportunity to give it a spin in a real-world use case.

    Reading Papers

    Reading and marking up papers in PDF format is where this device shines.

    A Pristine Paper (not yet marked up)

    You simply swipe to turn pages, and it works every time. There’s even pinch zoom. The screen is large enough that you can easily read an entire page without zooming, which was always the problem I had with my first-gen e-ink Kindle (the DPT-S1 also weighs substantially less). You even get multiple tabs in a workspace so you can swap between different documents quickly for cross-reference.

    In this context it’s a better Kindle DX (now discontinued) that you can take notes on. For me (and for many others I suspect) reading a paper is a very interactive experience. You want to be able to highlight the important parts and even scribble in the margins as you move through it. The DPT-S1 supports this better than any device I have seen yet.

    Marking up a research paper.

    As you can see here, you can not only write directly on the paper but also highlight. Both are done with the included stylus, whose standard function is writing but which switches to highlighting if you hold down the button on its side. You may also notice the little boxes in the margin of the text; these are collapsible notes.

    Collapsible Notes on the DPT-S1

    As you can see, the darker square in the top right margin is here opened and available for writing. Also please note that these notes were taken by me (with horrible handwriting in general) on the way to Lambda Jam, in a squished economy seat of an airplane, while there was some mild turbulence. While the handwriting isn’t paper-perfect, it’s much better than on other devices I’ve used in the past, including the iPad.

    One of the best features of the DPT-S1 is also its most limiting: It’s designed to work only with PDF files. The big benefit of this is that all of these writing annotations actually turn into PDF annotations on the given file. This makes them extremely easy to export and use in other contexts.

    Taking Notes

    The other big use case I had in mind for the DPT-S1 was taking notes. I always carry a notebook of some form and over the last three years I’ve managed to create quite a lot of non-indexed content.

    About half of my notebooks from the last three years.

    I usually carry one notebook for notes/exploring ideas, another for planning things (like to-do lists and such), and finally one small one for writing down thoughts on the go. This stack doesn’t include notes from talks I’ve attended or my Coursera class notes. It also doesn’t include the giant stack of hand annotated papers in my office, but that’s more to do with the previous section.

    I took pages and pages of notes on the DPT-S1 at Conal Elliott’s talk at Lambda Jam (great presentation by the way). Here’s a side by side comparison with some paper notes I’ve written in the past.

    Some notes from Conal Elliott’s talk at Lambda Jam
    Some actual paper notes

    As you can see, my handwriting isn’t great as I tend to go kind of fast and sloppy when not looking at the paper, but the DPT-S1 holds up rather well. I think it would do even better for someone with nicer handwriting than I.

    There is one somewhat annoying downside: when you make a new notebook PDF to take notes, it only has 10 pages, and you have to give it a name with the software keyboard (it defaults to a date and time based name). This slowed me down big time in the talk because he was moving very fast toward the end, and that’s precisely when I ran out of pages. Still, given how polished the rest of the device is, it’s something I can overlook.

    Browsing the Web

    The final use case for the DPT-S1 is web browsing. This isn’t something I really need as my phone usually does a pretty good job at this, but it could be nice to have for reading blogs and such so I’ll touch on it.

    Hacker News DataTau on the DPT-S1
    This blog, you’re reading it right now.

    My blog actually renders quite well and is very readable; you can scroll by swiping up and down. Pinch-zoom works here too.

    I went to several sites and they all worked well enough, but given that this device is WiFi only I don’t expect I’ll be using it much for reading blog posts on the go.

    Conclusion

    If you’re looking for a cheap consumer device that you can easily buy e-books for, you should look elsewhere. It’s expensive (~$1,000 USD), hard to acquire (you have to email and talk to sales agents), has no store and no API (only the filesystem), and only supports PDF.

    However, if you’re like me in that you take a lot of notes and you read a lot of papers, and you don’t mind spending a bit of money on something to solve a major problem in your life, this is by far the best device on the market for your needs.

    Please note that while they are available on Amazon, those are the imported Japanese-language version. Currently the only way to get an English-version DPT-S1 is by contacting the sales team at WorlDox.

  • My 2013 F# Year in Review

    It’s been a great year for F# with the blossoming of the fsharp.org working groups. It’s been amazing watching the community come together to form a movement outside of Microsoft. This is most certainly the long term future of F#, protected from the whims of layer upon layer of management. Who knows, in the coming year we might even see community contributions to the F# Core libraries. Who would have thought that would ever have been possible?

    I’m very happy to see that Sergey Tihon has maintained his wonderful weekly roundup of F# community goings-on. It’s a big time investment, week after week, to keep the weekly news going. After leaving Atalasoft, and no longer being paid to blog on a regular basis, I found I couldn’t keep investing the time and felt very bad about not being able to continue my own weekly roundups. Sergey has picked up that mantle with a passion, and I’m so very glad for the extremely useful service he provides to the community.

    Meanwhile Howard Mansell and Tomas Petricek (on his BlueMountain sabbatical) worked toward building a bunch of great new tools for data science in F#. The R Type Provider has become extremely polished, and while Deedle may be fresh out of the oven, it already rivals pandas in its ability to easily manipulate data.

    At Bayard Rock Paulmichael Blasucci, Peter Rosconi, and I have been working on a few small community contributions as well. iFSharp Notebook (An F# Kernel for iPython Notebook) is in a working and useful state, but is still missing intellisense and type information as the iPython API wasn’t really designed with that kind of interaction in mind. The Matlab Type Provider is also in a working state, but still missing some features (I would love to have some community contributions if anyone is interested). Also in the works is a nice set of F# bindings for the ACE Editor, I’m hoping we can release those early next year.

    Finally, I wanted to mention what a great time I had at the F# Tutorials in both London and NYC this year. I also must say that the London F# culture is just fantastic; Phil is a thoughtful and warm community organizer and it shows in his community. I’ve been a bit lax in my blogging, but they were truly both wonderful events and they’re getting better with each passing year.

    F# Tutorials NYC 2013

    That right there was the highlight of my year. Just look at all of those smiling functional programmers.