How to Start a Software Company 2.0

by Richard Rodger

       
 
How To Make Your Globs Specific

Let’s say you’ve got a list of globs: '*', 'a*', '*a', '*a*'. Now sort them. Sort them based on how specific they are. More specific globs match fewer things. Less specific globs match more things.

If Dr. Seuss had had anything to say about globs, what would he have said?


Glob on String
Blob on Thing
Glob on String for Blog on Thing
Blog on Glob for Thing on String

When Globs on Strings match Blogs on Things
Then... Blogs on Globs match Strings on Things

And When Blogs match Globs and Strings match Things
Then... Blogs on Globs match Blogs on Things on Globs on Strings!

So how do you sort globs? And why would you want to sort them? I’m working on a project where you use globs to pick out the error conditions that you are interested in. These error conditions have names. So you can match entire classes of error by using globs. For example 'foo.*' matches the errors 'foo.bar' and 'foo.baz'.

But what if I also add '*.*' to the mix. This also matches 'foo.bar' and 'foo.baz'. But I want 'foo.*' to match first. It’s more specific. For some definition of specific.

Again, 'a*' is more specific than '*a*'. 'a*' can match fewer things. Let’s look at this more closely. Say we have the set of words: [aa,ab,ba,bb]. Then

* matches aa,ab,ba,bb
a* matches aa, ab
*a matches aa, ba
*a* matches aa, ab, ba

So we can order the globs like so: 'a*', '*a', '*a*', '*', in order of increasing matches. The ones at the front are more specific. They match fewer things.

What about 'a*' and '*a'. How do we decide which comes first. Rather arbitrarily, lets say that the more specific prefix of 'a*' makes it the winner. Prefixes are more specific than suffixes. Hmm. That’s getting way too philosophical. Take Java packages: 'com.ricebridge.csvman.*' is more specific than '*.csvman.test'. The latter can occur in any top level package, but the former in only one. Yeah it’s pretty weak, but hey!

Now here’s the thing: what is the rule for sorting globs? I want a rule which:

1. is human computable 'just by looking' — that is, the rule is obvious
2. creates a full ordering of any set of globs — no arbitrary ordering of non-identical globs
3. has reasonable efficiency

The first idea: Use the number of stars. More stars means more specific. When the number of stars is the same, sort alphabetically. Neat and easy. And wrong. Sure it puts 'a*b*c' before 'a*c', but it also puts '*a*' before 'a*', which is incorrect.

Let’s look at globs. There are stars, and then there are normal characters. They always form an alternating pattern: 'a*b*c'. Or '*a*b*'. So this looks like an essential (and not accidental) feature of globs. The number of stars and the number of 'normals'. Yes, in general, the more stars you have, the more constrained you are, but there’s more going on.

After thinking about this for a bit, I realised that the outside stars are special. Very special. A glob with a star at the start or end, or both, can match a lot of things. Way more things that a glob with stars inside other characters. What if we just define an ordering for the outside stars. And we have one already. Form the preceding discussion: 'a', 'a*', '*a', '*a*', '*'.

Now that 'a' can stand for anything inside the outside stars. Other globs in fact. But those globs can’t have outside stars! The glob '**' is the same as the glob '*'. Adjacent stars merge into one. So '*a*', with a = '*b*' becomes '**b**' becomes '*b*'. On the other hand with a = 'b*c', '*a*' becomes '*b*c*'. A different beast entirely.

Now at this point you might be saying, "hold on a sec, you’re trying to order regular expressions! That's insane! You need to evaluate them to do that!" Because globs are just regular expressions: 'a*' is really /a.*/ in regex land. True. But I think that for the special case of globs built from stars only (let’s drop the use of '?'), we can in fact define an ordering without requiring evaluation. The problem is sufficiently complex to require some thought, but it's not a 'hard' problem.

I would think that due to back-tracking, solving the sorting problem for standard regular expressions is probably going to involve solving the halting problem, since you have to evaluate each regex. But I could be wrong. I often am.

Back to the globs. We can sort based on the outside stars. Fine. We have five subsets to sort now: 'a', 'a*', '*a', '*a*', '*'. '*' is a special case — it’s always last. It's the least specific glob of all, matching anything. The glob with no stars, 'a', is also easy. Just sort alphabetically.

For the other three we note that a will always take the forms 'b', 'b*b', 'b*b*b', ..., and so on. That is, the essential thing is the number of inside stars. In this case, the more stars you have, the less you can match. For example 'a*a' will match 'azbza', 'azcza'. But 'a*b*a' will only match 'azbza'. More stars are more specific. Problem solved!

Nope. Not quite. What about globs having the same number of inside stars? What about 'a*a' versus 'a*ab'? Which is more specific? I think 'a*ab' is. It has more information, more constraints on what it can match. For any given finite set of words, there are more ways to end in 'a' alone than 'ab'. So the next criteria is: more normal characters, more specific.

You can see the next catch, can't you? What if we have the same number of normal characters? Alphabetic sort? No, doesn’t work – the characters may be spread out between in the stars in different ways: 'a*ab' versus 'ab*a'. Which is more specific? Hmm. Let’s invoke our prefix rule again. This makes 'ab*a' more specific than 'a*ab', because it has a longer prefix. The criteria is: whoever’s got the mostest on the leftest.

And finally, if the character spread is identical, then we go alphabetic. So 'aa*a' precedes 'ab*a'. Let's see what we have...

The human rules are:

1. Prefixes are more specific than suffixes
2. Outside stars first: 'a', 'a*', '*a','*a*'
3. Inside stars: more stars means more specific
4. Normal characters: more characters, more to the left, is more specific
5. Otherwise go alphabetic

Here’s a sample list of globs, starting with the most specific. See if this list matches your idea of how they should be ordered.

com.ricebridge.csvman.CsvManager
com.*.*Manager
com.ricebridge.*.CsvManager
com.ricebridge.csvman.*
*.ricebridge.csvman
*.ricebridge.*
*

This is almost right. But really I think com.ricebridge.*.CsvManager is more specific than com.*.*Manager. How can we arrange this? Do more stars really mean more specific? Or do normal characters provide far more specificity?

The more normal characters you use, the more specific you are being. Stars provide very little information. But strings of normal characters really narrow things down. But in which set of words? What is the total set of words that we are matching in the com.ricebridge scenario? Infinite? Is there some structure? If these are the names of Java classes (my target domain), then com.ricebridge.*.CsvManager is definitely more specific (matches fewer items) than com.*.*Manager. Seems like prefixes really do rule. I wonder is this somehow connected to Zipf's Law...

Anyway, let’s drop rules 3 and 4, and replace them with:

Inside characters: longest prefix wins

So even if you have loads of stars, if you have a short prefix, you lose. This gives:

com.ricebridge.csvman.CsvManager
com.ricebridge.*.CsvManager
com.*.*Manager
com.ricebridge.csvman.*
*.ricebridge.csvman
*.ricebridge.*
*

That’s a lot better. Let’s restate the human rules:

1. Outside stars first: 'a', 'a*', '*a', '*a*'
2. Inside characters: longest prefix wins
4. Prefixes same length? more stars wins
3. Otherwise go alphabetic

That new rule number 4 means that 'a*b*c' is more specific than 'a*b'. So you only compare the minimum number of prefixes. If they're are equal, take the longest glob as more specific.

A much better ruleset! And it pretty much conforms to the criteria for the sorting algorithm: human computable, full ordering, and reasonable performance.

Time to go code it up. Post a comment if you want to save me from myself!

@ 05:07 PM GMT+00:00 [ comments [2] ]   email this   links to this

If you liked this entry, please consider bookmarking it &mdash Thanks!
Bookmark How To Make Your Globs Specific at del.icio.us Digg How To Make Your Globs Specific at Digg.com Bookmark How To Make Your Globs Specific at reddit.com Bookmark How To Make Your Globs Specific at YahooMyWeb Bookmark How To Make Your Globs Specific at Spurl.net Bookmark How To Make Your Globs Specific at Simpy.com Bookmark Polyphasic Mutants at NewsVine Blink this How To Make Your Globs Specific at blinklist.com Bookmark How To Make Your Globs Specific at Furl.net Fark How To Make Your Globs Specific at Fark.com

 
 
Trackback URL: http://www.richardrodger.com/roller/trackback/richard/Weblog/how_to_make_your_globs
Comments:

I think you need to solve the halting problem first..
i'll get started and get back to you when i'm done :-)

Posted by wilkem (80.196.192.238) on August 30, 2006 at 05:56 PM GMT+00:00 #

I have been driven insane by the other 'hatter'. Dope fiend Seuss had little on LSD addict Carroll. `Not the same thing a bit!' said the Hatter. `You might just as well say that "I see what I eat" is the same thing as "I eat what I see"!'

Posted by Flipside (159.134.148.147) on August 31, 2006 at 08:20 PM GMT+00:00 #

Comments for this have been disabled. Please send me a mail if you want to comment and I will activate comments again.
 
YahooBloglines
NewsgatorMSN
Google Readerdel.icio.us FurlSubscribe with myFeedster
« August 2006 »
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
26
27
28
29
31
  
       
Today

All | General | Java | Business | Fun | Perl | Rant | Ireland | Web
[This is a Roller site]
[Valid Atom 1.0] [Valid RSS]
Technology Blog Top Sites
Blogarama - The Blogs Directory

Blog Directory & Search engine

Blog Flux Directory
Irish Blogs
 View My Public Stats on MyBlogLog.com

Performancing
Enter your Email


Powered by FeedBlitz
Theme adapted from Sotto.
 
Ricebridge XML Manager
  • Convert XML to a table of data
  • Convert XML to CSV, and CSV to XML
  • High-speed, single-pass XPath
  • Memory-stable and fault-tolerant
  • Loads of documentation
  • Cut-and-paste code examples
  • Find a bug, get a gift cert
Ricebridge Java XML Manager Component


Ricebridge CSV Manager
  • Convert CSV to a table of data
  • Handle any type of delimited file
  • Memory-stable and fault-tolerant
  • Loads of documentation
  • Cut-and-paste code examples
  • Find a bug, get a gift cert
Ricebridge Java CSV Manager Component


Popular Posts

 Sign up for MyBlogLog.com
Alertra Website Monitoring Service
Get Chitika eMiniMalls
Solo Tees
BlogJet