James McCaffrey's blog
June 22
How Close are Two Strings? Part II

Let's continue looking at the problem of determining how close two strings are. If the two strings have equal lengths, then the Hamming distance is a good answer. But this technique does not work for the general case. One answer is to compute the Levenshtein distance. The Levenshtein distance between two strings is the minimum number of operations necessary to convert one string into the other, where an operation is an insertion, deletion, or substitution of a single character. For example, if:

    A = cat
    B = cots

then the Levenshtein distance is 2. Starting with "cat", you must change the 'a' to 'o', and then add an 's'. There is a very cool visual algorithm to compute the Levenshtein distance. I'll walk you through it. First construct this matrix:

            c   a   t
        0   1   2   3
    c   1
    o   2
    t   3
    s   4

The shorter string goes across the top, the longer string goes down the left side, and each letter gets a 1-based index value.
Now, working by columns, from top to bottom, for each cell in the matrix, first compute a temp value: 0 if the corresponding characters match, and 1 if the characters do not match. Then set the cell to the minimum of these three values:

    1. the value in the cell above plus 1.
    2. the value in the cell to the left plus 1.
    3. the value in the cell to the upper left plus the temp value.

So the first cell in the example above gets a temp value of 0 (because the 'c' characters match) and then gets a final value of 0 (because value #3, the upper-left cell's 0 plus the temp value of 0, is the minimum). The first column ends up like this:

            c   a   t
        0   1   2   3
    c   1   0
    o   2   1
    t   3   2
    s   4   3

The first cell in the second column gets a temp value of 1 because 'a' does not equal 'c'. Its final value is 1 because value #2, the cell to the left (0) plus 1, is the minimum. The second column ends up like this:

            c   a   t
        0   1   2   3
    c   1   0   1
    o   2   1   1
    t   3   2   2
    s   4   3   3

In the same way, the last column becomes:

            c   a   t
        0   1   2   3
    c   1   0   1   2
    o   2   1   1   2
    t   3   2   2   1
    s   4   3   3   2

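The matrix-filling procedure walked through above can be sketched in C# roughly as follows. This is only a minimal illustration, not necessarily how you would write production code; the class name LevenshteinSketch and the method name Distance are my own, chosen for the example.

```csharp
using System;

public static class LevenshteinSketch
{
    // Returns the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn s into t.
    public static int Distance(string s, string t)
    {
        // Rows follow the characters of t, columns follow the characters of s,
        // and row 0 / column 0 hold the index values, as in the matrix above.
        int[,] d = new int[t.Length + 1, s.Length + 1];

        for (int i = 0; i <= t.Length; ++i) d[i, 0] = i; // index column
        for (int j = 0; j <= s.Length; ++j) d[0, j] = j; // index row

        // Work by columns, from top to bottom.
        for (int j = 1; j <= s.Length; ++j)
        {
            for (int i = 1; i <= t.Length; ++i)
            {
                int temp = (t[i - 1] == s[j - 1]) ? 0 : 1;   // 0 on match, 1 otherwise
                int above = d[i - 1, j] + 1;                 // value #1
                int left = d[i, j - 1] + 1;                  // value #2
                int upperLeft = d[i - 1, j - 1] + temp;      // value #3
                d[i, j] = Math.Min(Math.Min(above, left), upperLeft);
            }
        }

        return d[t.Length, s.Length]; // lower right corner
    }

    public static void Main()
    {
        Console.WriteLine(Distance("cat", "cots")); // prints 2, matching the matrix above
    }
}
```

The full matrix is kept here only to mirror the walk-through; since each column depends only on the column before it, a version that stores two columns would also work.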
Now, the Levenshtein distance is the value in the lower right corner, 2. This is an interesting algorithm and one which can be easily implemented.

June 20
How Close are Two Strings? Part I

Suppose you have two strings. How can you write a function which returns a number which is a measure of how different the strings are? This distance function has a wide range of uses in software development and testing. One possible answer is to use what is called the Hamming distance. This is a standard topic in a Discrete Mathematics course, which is a part of a standard computer science degree program at most universities. The Hamming distance between two binary strings of equal length is just the number of positions in which the bits differ. For example, suppose:

    A = 011010
    B = 001011

The Hamming distance between these two binary strings is 2 because they differ at the 2nd and last positions. Interestingly, in the case of binary strings, the Hamming distance is just the "weight" (number of "1" bits) of the result of XORing the two strings. For the two binary strings above, A XOR B = 010001 and the weight of that result is 2. This gives a convenient way to compute the Hamming distance for binary strings. For character strings, we can generalize and just compute the number of positions where characters differ. So, if:

    A = computer
    B = colleges

the Hamming distance is 5. Well, this is great but unfortunately, the Hamming distance applies only to strings with equal length. I'll describe how to deal with strings of unequal length in the next blog entry.

June 13
Web Testing, Part IV

There are two basic ways to test a Web application through the app's UI. The first way is to write test automation which exercises the app using JavaScript to access the IE Document Object Model. This is effective but does have some limitations -- you can only access the parts of the Web app's functionality which are explicitly exposed through the IE DOM.
The second way to write Web application UI automation is to directly call into the mshtml.dll and shdocvw.dll Windows libraries. Click on the image at the bottom of this entry to see what I mean. I call this technique "low-level Web UI testing". I could write my harness using C++ and then call the mshtml.dll and shdocvw.dll functions directly. But this is difficult. An easier approach is to write a test harness using C#. The C# code uses the "P/Invoke" mechanism to call into the Win32 libraries -- much easier. Launching the Internet Explorer browser looks something like this:

    // Needs the System and System.Diagnostics namespaces,
    // plus a reference to the SHDocVw library.
    SHDocVw.InternetExplorer ie = null;  // attached to the running browser later
    Process p = Process.Start("iexplore.exe", "about:blank");
    if (p == null)
      throw new Exception("Could not launch IE");
    Console.WriteLine("Process handle = " + p.MainWindowHandle.ToString());

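After you load the application under test, you can manipulate it like this:

    // theDoc (the loaded document) and documentComplete (a wait handle)
    // are set up elsewhere in the harness.
    Console.WriteLine("Clicking search button");
    HTMLInputElement butt = (HTMLInputElement)theDoc.getElementById("Button1");
    butt.click();
    documentComplete.WaitOne();  // block until the page signals it is done
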
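To show the waiting idea in isolation, here is a small self-contained sketch that needs no browser. It assumes the wait handle is an AutoResetEvent and uses a background thread as a hypothetical stand-in for the page load; in a real harness the Set() call would presumably live in a handler for the browser's document-complete notification, which is not shown here.

```csharp
using System;
using System.Threading;

public static class WaitForSignalSketch
{
    // Starts unsignaled; set when the simulated operation finishes.
    private static readonly AutoResetEvent operationComplete = new AutoResetEvent(false);

    // Hypothetical stand-in for the browser finishing a page load.
    private static void SimulatedPageLoad()
    {
        Thread.Sleep(300);        // duration the harness cannot know in advance
        operationComplete.Set();  // signal the waiting harness thread
    }

    public static bool RunOnce()
    {
        Thread worker = new Thread(SimulatedPageLoad);
        worker.Start();

        // Block until signaled, but give up after 10 seconds rather than hang forever.
        bool signaled = operationComplete.WaitOne(10000);
        worker.Join();
        return signaled;
    }

    public static void Main()
    {
        Console.WriteLine("Waiting for the operation to complete");
        Console.WriteLine(RunOnce() ? "Operation completed" : "Timed out");
    }
}
```

The timeout argument is a design choice of this sketch: a harness that waits forever on a page that never finishes loading is hard to diagnose, so failing after a generous limit is usually easier to work with.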
Conceptually, the technique is pretty easy but there are a lot of tricky details. For example, in the code snippet above, the WaitOne() method call is a key trick which halts the test harness thread of execution until after the click operation completes. You can't pause using something like Thread.Sleep(2000) because there's really no way to tell how long to pause. I devote an entire chapter of my book ".NET Test Automation Recipes" to all the details of writing low-level Web application UI testing. See http://www.amazon.com/gp/product/1590596633/qid=1150213081/sr=2-1/ref=pd_bbs_b_2_1/103-1065536-4310244?s=books&v=glance&n=283155 for details.

XML Canonicalization Testing

Software testers have to have a good grasp of techniques which involve XML. One concept which is not widely known is XML canonicalization. The W3C defines canonical XML, which you can think of as XML in a standard form. This means, for example, that whitespace inside tags is normalized, attribute values are always delimited by double quotes, XML declarations are removed, and attributes are placed in a fixed sorted order. The most usual form of canonicalization is called C14N. It is quite complex.

In a testing situation, you might run into XML canonicalization if you are testing some system which produces XML files or in-memory documents. To test this system, you need an expected result, which is an XML file/document. One strategy you can use to compare the actual XML result with an expected XML result is to canonicalize both the actual and the expected XML. Then you can do a simple string comparison for equality.

XML canonicalization was actually created for use in security-related scenarios. The idea is that XML is often transmitted over a network. To verify that the XML has not been maliciously or accidentally corrupted, the sender can compute a hash (such as MD5 or SHA1) of the XML and publish that hash value. The XML receiver can compute a hash of the received data to validate the data.
Unfortunately, a lot of minor changes can happen over a network or the Internet -- whitespace differences in particular. So, a way is needed to standardize XML so that hashes can be meaningfully compared. I devote a section of chapter 12 of my book ".NET Test Automation Recipes" to the details of C14N XML canonicalization. See http://www.amazon.com/gp/product/1590596633/qid=1150213081/sr=2-1/ref=pd_bbs_b_2_1/103-1065536-4310244?s=books&v=glance&n=283155 for details.
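The compare-by-canonicalizing strategy described above might look something like the following sketch. It uses the XmlDsigC14NTransform class from the System.Security.Cryptography.Xml namespace, which I am assuming is available to your project (on the .NET Framework that means a reference to System.Security.dll; on newer .NET it is a separate package). The class name, method name, and the two sample XML strings are made up for illustration.

```csharp
using System;
using System.IO;
using System.Security.Cryptography.Xml;
using System.Text;
using System.Xml;

public static class CanonicalCompareSketch
{
    // Returns the C14N canonical form of an XML string.
    public static string Canonicalize(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true; // keep text whitespace; the transform handles the rest
        doc.LoadXml(xml);

        XmlDsigC14NTransform transform = new XmlDsigC14NTransform();
        transform.LoadInput(doc);

        using (Stream output = (Stream)transform.GetOutput(typeof(Stream)))
        using (StreamReader reader = new StreamReader(output, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Hypothetical expected and actual results: same content, different surface form
        // (XML declaration, quote style, attribute order, empty-element syntax).
        string expected = "<?xml version=\"1.0\"?><item id=\"7\" name=\"widget\"></item>";
        string actual = "<item   name='widget' id='7' />";

        bool pass = Canonicalize(expected) == Canonicalize(actual);
        Console.WriteLine(pass ? "Pass" : "FAIL");
    }
}
```

With these two inputs the canonical forms should compare equal, so the test case should pass. Note that whitespace between elements is significant to C14N, so two documents that differ only in indentation will not compare equal unless you deal with that separately.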
June 09
What is the Best Language for Software Test Automation?

What is the best programming language for writing software test automation? Like most things in life, it depends. In a .NET environment, your two best choices are C# and VB.NET. Because of the unifying effect of the underlying .NET framework, either language works very well and your choice will be based mostly on whether you feel more comfortable with C/C++/Java syntax (C#) or Visual Basic syntax (VB.NET).

In a Microsoft, non-.NET environment your main choices are Visual Basic, C++, and Perl. All three languages have significant downsides. Visual Basic is easier to work with than the other two languages but it is less powerful, meaning the test automation you write will be less powerful. For example, writing UI automation with VB is rather awkward. C++ gives you the most power of the three language choices, but coding with C++ can be nightmarish, especially when dealing with pointers and memory allocation/deallocation. Perl is a compromise between the other two languages in terms of power and ease-of-use, so I consider it the best choice in most situations. I especially like the large number of libraries available when writing test automation with Perl.

In a non-Microsoft environment, your main choices are Perl, Java, and C++. As before, all three have pros and cons. Because of its inherently terse syntax, Perl is difficult to maintain. C++ gives you the most power but at the cost of complexity, which usually means it takes quite a bit longer to write automation with C++. Java is a good compromise in many non-Microsoft test automation scenarios.

A relatively recent, fourth programming language choice for non-Microsoft software test automation is Ruby. I'm not a big Ruby fan myself, but some of my colleagues, whose opinions I respect, tell me Ruby is much like an improved Perl.
As soon as Ruby libraries mature a bit, I intend to take a serious look at Ruby for test automation in both Microsoft and non-Microsoft environments.

Let me note at least one specific exception to my opinions in this entry. If you are testing software components which are intended to be called by multiple languages (like a general math library for example), you should test those components using multiple languages.

And finally, the usual disclaimer: no one programming language is the best choice for all test automation situations. The more languages you can use for automation, the better. But the reality is that you only have so much bandwidth to learn technologies so you should think about issues like the one I've presented here so you can be most efficient with your personal skill upgrade strategy.

June 06
Estimating Software Testing Time and Cost

If you test software, at some point you'll have to estimate how long some testing effort will take, or how much the effort will cost. In most situations, this boils down to estimating time because cost is generally time multiplied by some money rate. So, let's just think about estimating the time required for a testing effort. The answer is pretty easy: you just have to estimate based on your experience with similar projects (as well as other experienced engineers' estimates).

Now that said, you can provide a better answer than just a wild guess. One simple technique I like to use is called the beta probability distribution technique. It is very simple. You estimate an optimistic time, a best-guess time, and a pessimistic time, and then compute a weighted average with weights 1, 4, and 1. For example, suppose you are trying to estimate how long it will take to write a set of 3,000 unit tests. If your optimistic time estimate is 3 weeks, your best-guess estimate is 5 weeks, and your pessimistic estimate is 9 weeks, your beta distribution estimate is ( 1*3 + 4*5 + 1*9 ) / 6 = 32/6 = 5.33 weeks.
Additionally, if x is your optimistic estimate and y is your pessimistic estimate, then the standard deviation of the beta estimate is sqrt( (y-x)*(y-x) / 36 ) = sqrt( (9-3)*(9-3) / 36 ) = sqrt(1) = 1.00 weeks. So, you can make an interval estimate like, "Using an 80% statistical confidence level, I estimate the testing effort will take 5.33 weeks, plus or minus 1.28 weeks." Like any project estimate, it is crude. But even a crude estimate is better than no estimate at all.
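The arithmetic above is easy to wrap in a couple of helper methods. The following is just a sketch with names of my own choosing; the 1.28 multiplier is the one used in the 80% example above.

```csharp
using System;
using System.Globalization;

public static class BetaEstimateSketch
{
    // Weighted average of the three estimates with weights 1, 4, and 1.
    public static double Estimate(double optimistic, double bestGuess, double pessimistic)
    {
        return (1.0 * optimistic + 4.0 * bestGuess + 1.0 * pessimistic) / 6.0;
    }

    // sqrt( (y-x)*(y-x) / 36 ), where x is optimistic and y is pessimistic.
    public static double StdDev(double optimistic, double pessimistic)
    {
        double range = pessimistic - optimistic;
        return Math.Sqrt(range * range / 36.0);
    }

    public static void Main()
    {
        double estimate = Estimate(3.0, 5.0, 9.0);
        double sd = StdDev(3.0, 9.0);
        double multiplier = 1.28; // 80% confidence, as in the example above

        // Invariant culture so the decimal separator is always a period.
        Console.WriteLine(string.Format(CultureInfo.InvariantCulture,
            "{0:F2} weeks, plus or minus {1:F2} weeks", estimate, multiplier * sd));
        // prints: 5.33 weeks, plus or minus 1.28 weeks
    }
}
```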