Designing a Scripting Language: Suggestions?

TEK NINJA

Hello Everyone,

I'm beginning a project to design a scripting language of my own. It will be object-oriented and centered around ease of use, clean language design and performance. Some languages that inspire me are LISP, Python, Ruby, C++ and Java.

I'm doing this:
- "for the heck of it" (or more specifically, for the experience I will gain)
- Because I feel capable of doing it (because I can!)
- Because I'm slightly disappointed by existing options

I have used Python for a few medium-scale projects and found that it was simple and worked well. Yet its design feels slightly unclean. All attributes of a class are public by default, private ones are declared using the ugly __ notation, and the whitespace indentation breaks with some editors and doesn't clearly signify the end of some code blocks. It also seems that Python is rather poor when it comes to GUI programming. There is no standard Python GUI library... I also found that the Python reliance on "self-documentation" was often an excuse not to document code at all.

Ruby is also a nice scripting language, but it's rather obscure. It has a lot of syntactic tricks and oddities that make it unfriendly for any newcomer (even those with previous Python experience, for example)... It's also difficult to find simple documentation, like a proper language reference... Lots of small tutorials that work "by example", but nothing to actually explain all the language oddities... It seems to me that the only reason Ruby has any popularity at all is Ruby on Rails... And as soon as something else becomes more fashionable, its limited popularity will decline.

Java has ups and downs. The main advantage with it is that everyone knows Java (or almost)... The main downside is that it's not such a practical language. Your platform must have an up-to-date JVM... And well, you're mostly limited to libraries that do have Java bindings. Java does have a GUI lib, but no stock OpenGL support at this time, which limits its usefulness for developing videogames, for example. Java may be fast for a scripting language, but when you consider that it's actually not a scripting language, and that it performs very poorly when it comes to anything computationally intensive... You may be disappointed.

My personal goal is simple: to have my cake and eat it too ;)

I would like my language to come with OpenGL support by default, and to encourage "clean" programming practices (e.g. commenting, REAL documentation, not making all attributes of your class public). As for cross-platform GUI support, my idea is to simply implement it on top of OpenGL! That way the language can support all kinds of effects you can imagine, provide fast rendering, and guarantee that things look the same (or almost) on every platform.

What I am looking for are two simple things...

First of all, feature suggestions. If you have used scripting languages before, what did you find was good, what did you find was lacking, what would you like to see in an "ideal" scripting language and why? What would you like not to see in a scripting language?

Another thing I would like to find is expertise. If anyone here has experience with programming a garbage collector, a compiler, or developing their own scripting language, for example, I would like to chat with you (send me a PM with your IM contact info if you are interested). I would like to develop a high-performance moving GC (like C# has), and would require some technical help. I may also be interested in working with collaborators if we have enough in common ;)
 
BillLeeLee said:
How much do you know about writing grammars, compilers, language design...?

I've made a LISP interpreter of my own before, with a simple one-pass ad-hoc parser. I've also made a VM that resembles the "virtual RISC" model and could be programmed in assembly. I also know how to use Flex and Bison to generate a parser (which I plan to do for this project). I'm a CS student, so I've taken courses on grammars and programming languages. Currently taking a compilers class. I know most of what is being taught there already.

I feel fairly confident that I can implement this. I plan to begin with a simple interpreted model, and perhaps add a bytecode compiled path at a later point.
 
I've taken some programming languages and compilers courses, so I know a little bit about the subject. My big suggestion is to fully specify your language before you start coding anything. If you halfway define it and then kinda fill in the rest as you go, then your language will most likely end up being confusing as hell and not as robust as you'd like it to be.

I'm not sure if I agree with your stance on GUI design. Do you really want a universal look and feel across platforms? Generally Windows users want their programs to look like Windows programs, Mac users want their programs to look like Mac programs, etc. Having a universal look kinda breaks that expectation. Also, writing your own widget set is no small undertaking. And I don't know if I like having the GUI so closely tied to the language. I know that was one of your goals, but I think there's a good reason why most languages tend not to do that. Perhaps you'd be better off (at least initially) just writing bindings for wxWidgets or something similar. Then later on, once you get your language working pretty well, you could go back and write your own GUI toolkit.

Also, have you researched languages other than the major ones you've mentioned? There are quite a few languages out there that aren't so mainstream, and perhaps there are some that are a bit closer to what you are looking for.
 
I've done a minimal amount of research into existing languages, mostly for sources of inspiration. My goal remains to make one, so I'll make one no matter what ;)

I did find through my research that most "non mainstream" languages have some major downsides, which usually means they are impractical.
 
TEK NINJA said:
First of all, feature suggestions. If you have used scripting languages before, what did you find was good, what did you find was lacking, what would you like to see in an "ideal" scripting language and why? What would you like not to see in a scripting language?

What I notice is a lack of robust secondary tool support: debuggers (including post-mortem support, detached support, and so on), profilers, and so on. Some languages (and implementations) have these, but at best they're scattered and not in a cohesive package. Some languages do a better job, but few are close to the professional, integrated, and cohesive implementations you'll find in more mature languages.

Your plan is quite an ambitious undertaking. Your note hints at a lot of things, and I think they'll add up to six or eight man-years of development time if you do them from scratch. Have you taken inventory of what you need to do, and what you need to write? What's your timeframe for your first release?

TEK NINJA said:
Another thing I would like to find is expertise. If anyone here has experience with programming a garbage collector, a compiler, or developing their own scripting language, for example, I would like to chat with you

I've experience in these areas, but I'm not exactly drowning in copious free time. If you have questions, feel free to post them here and I'll be happy to help when I can.
 
Well, I would like to have a working interpreter in perhaps one or two months, something that can compile and run simple scripts. I will then add support for useful libraries over the summer and work on improving the interpreter.

One of my biggest performance woes is the GC. I'm thinking of having a very simple allocator that allocates the first rightmost available space on the heap, and then compacts everything to the left. I've heard C# uses this sort of mechanism, but I'm wondering how it does it. Does it keep a list of back-references to every allocated object? If so, how are these stored efficiently? I'm also wondering what's the most efficient way to implement a hash map for fast lookups.
 
Do you intend to dogfood your own language? That is, will you use your language to write your libraries? What platforms are you intending to support?

A first-fit allocator is usually prone to bad fragmentation; I'd avoid it. It's fine for your first few cuts, but as soon as you start writing any meaningful code it's going to get you in trouble. I'm sure you'll take care to make a replaceable interface for your memory manager so that you can swap in different implementations as you go.
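To make the fragmentation concern concrete, here's a minimal Python sketch of a toy first-fit allocator. It only models addresses and sizes, and the names (`FirstFitHeap`, `alloc`, `free`, `total_free`) are made up for illustration; a real memory manager would look quite different.

```python
# Toy first-fit allocator over a fixed-size heap (illustrative only).
# The heap is modelled as a sorted list of (start, size) free blocks.

class FirstFitHeap:
    def __init__(self, size):
        self.free_blocks = [(0, size)]

    def alloc(self, size):
        """Return the start address of the first free block that fits, or None."""
        for i, (start, block_size) in enumerate(self.free_blocks):
            if block_size >= size:
                if block_size == size:
                    del self.free_blocks[i]
                else:
                    # Carve the request off the front of the block.
                    self.free_blocks[i] = (start + size, block_size - size)
                return start
        return None

    def free(self, start, size):
        """Return a block to the free list, merging adjacent free blocks."""
        self.free_blocks.append((start, size))
        self.free_blocks.sort()
        merged = [self.free_blocks[0]]
        for s, sz in self.free_blocks[1:]:
            last_s, last_sz = merged[-1]
            if last_s + last_sz == s:
                merged[-1] = (last_s, last_sz + sz)
            else:
                merged.append((s, sz))
        self.free_blocks = merged

    def total_free(self):
        return sum(sz for _, sz in self.free_blocks)


heap = FirstFitHeap(64)
addrs = [heap.alloc(8) for _ in range(8)]   # fill the heap with 8-unit objects
for a in addrs[::2]:                        # free every other object
    heap.free(a, 8)

# Half the heap is free, but it is split into holes of 8 units each,
# so a 16-unit request cannot be satisfied.
big = heap.alloc(16)
```

Here `big` ends up as `None` even though `heap.total_free()` reports 32 free units, which is the kind of trouble I mean.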

One of the other roles your memory manager will play is to isolate the application from the platform. You'll need to take special care in designing your memory manager so that it fits each platform you're targeting very, very well.

You can read lots about the implementation of the .NET garbage collector (and the heap that it runs on) on MSDN. The basics are that, indeed, there's a reference tracked for each active object. But collections happen based on generations, which allows the GC to touch far less memory than a simple "find orphaned" mechanism.
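As a rough illustration of why generations let a collection touch less memory, here's a toy two-generation model in Python. This is a sketch under my own simplifying assumptions (two generations, a write barrier feeding a remembered set, promotion after one survival), not a description of how .NET actually implements it, and every name in it is hypothetical.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []          # outgoing references to other Obj instances


class GenerationalHeap:
    def __init__(self):
        self.young = set()      # recently allocated objects
        self.old = set()        # objects that survived a minor collection
        self.remembered = set() # old objects that may point into the young generation

    def new(self, name):
        obj = Obj(name)
        self.young.add(obj)
        return obj

    def write_ref(self, src, dst):
        """Write barrier: record old-to-young pointers for later minor collections."""
        src.refs.append(dst)
        if src in self.old and dst in self.young:
            self.remembered.add(src)

    def minor_collect(self, roots):
        """Trace only young objects; return how many objects were visited."""
        live = set()
        stack = [r for r in roots if r in self.young]
        # Old objects are not traced. Only the remembered ones are scanned,
        # which may conservatively keep some young garbage alive.
        for old_obj in self.remembered:
            stack.extend(ref for ref in old_obj.refs if ref in self.young)
        while stack:
            obj = stack.pop()
            if obj in live:
                continue
            live.add(obj)
            stack.extend(ref for ref in obj.refs if ref in self.young)
        self.old |= live          # promote survivors
        self.young = set()        # the rest of the young generation is garbage
        self.remembered.clear()   # no young objects remain to point at
        return len(live)


heap = GenerationalHeap()
config = heap.new("config")
heap.minor_collect(roots=[config])      # config survives and is promoted

for i in range(1000):                   # short-lived temporaries
    heap.new(f"temp {i}")
cache = heap.new("cache")
heap.write_ref(config, cache)           # an old object now points at a young one

touched = heap.minor_collect(roots=[config])
```

The second collection visits a single object (`cache`) and discards the 1000 temporaries without ever looking at them or re-tracing the old generation.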

Rico Mariani is a performance architect on .NET (er, or was, last I talked with him) and has a paper called Garbage Collector Basics and Performance Hints on MSDN. The Garbage Collection overview article is another good place to start.

One way to understand the design is to consider some alternatives. What would happen if you simply refcounted, for example? Let's consider this pseudocode:

Code:
nDuckCount = 0
while file isn't ended
	string = get next line from my file
	if the string contains "duck" then
		print "I found a duck!"
		nDuckCount = nDuckCount + 1
	end if
wend

Each line you read from the file, depending on how your language is implemented, might return a new string object. The string object is referenced, then goes out of scope, only for a new string to be created in the next iteration.

Ref counting would work, but you'd allocate and free memory at each iteration. Maybe that's bad for performance, depending on how your memory management works. Maybe you end up at a different address for each string, and that's bad for performance too because of caching.
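Here's a small Python sketch of that per-iteration churn, using a made-up `CountingRuntime` that only counts allocations and frees. It models the reference counts by hand and is not tied to any real runtime.

```python
class CountingRuntime:
    """Toy runtime that counts allocations and frees under reference counting."""

    def __init__(self):
        self.allocations = 0
        self.frees = 0

    def new_string(self, text):
        self.allocations += 1
        return {"text": text, "refcount": 1}

    def release(self, obj):
        obj["refcount"] -= 1
        if obj["refcount"] == 0:
            self.frees += 1     # memory would go back to the allocator here


def count_ducks(lines, rt):
    duck_count = 0
    for text in lines:
        s = rt.new_string(text)     # each line becomes a fresh string object
        if "duck" in s["text"]:
            duck_count += 1
        rt.release(s)               # last reference dropped at the end of the iteration
    return duck_count


rt = CountingRuntime()
ducks = count_ducks(["a duck", "a goose", "duck duck"], rt)
```

Three lines in means three allocations and three frees: the memory manager is hit twice per iteration no matter how short-lived the string is.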

You can't accumulate the dead references, either -- again, with the caching issues, and then you'd grow memory in a sawtooth pattern until you decided to collect again.

Since you control both the language and the runtime, you might want to try to have the language "give" the runtime smarts about the scope and usage of variables. That's very interesting, but also quite complex.

If you're going to do an OO language, you'd better figure out if you want deterministic destructors, too. Does the above code call the string destructor when it goes out of scope, or only when it is destroyed (some day, eventually)?
 
Oh, and CiteSeer has lots of papers about garbage collection. I'm trying to find my favorite ones and I'll update this note when I get them. (I'm on vacation and not at work, and so on -- so maybe I'll never find it. It's printed out in a drawer at my office.)

Here it is, I think: Appel's Generational Garbage Collection paper.
 
mikeblas said:
Do you intend to dogfood your own language? That is, will you use your language to write your libraries? What platforms are you intending to support?

I plan to write all the libraries in C++ for speed (and because many will require platform-dependent bindings). Windows, Linux and Mac should be supported directly. BSD platforms too, if I can get enough help, although it should ideally not be too difficult to port from Linux. I'm thinking of setting up a website with some sort of library repository so users can contribute to the language by submitting comments and potential ports (which other users could give ratings for). The highest rated library ports could eventually be included in the official language distribution.

A first-fit allocator is usually prone to bad fragmentation; I'd avoid it. It's fine for your first few cuts, but as soon as you start writing any meaningful code it's going to get you in trouble. I'm sure you'll take care to make a replaceable interface for your memory manager so that you can swap in different implementations as you go.

You don't really see what I'm talking about. In the moving GC I envision, the heap is *compacted*. That is, you have a certain amount of memory pre-allocated for your heap, and when the GC runs, it removes the objects that need to be freed, and then *moves* the objects that are already allocated towards the lower addresses, so that there is always a large free block at the rightmost part of the heap (after all allocated units). So allocating is really only a matter of incrementing a pointer.
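Roughly what I have in mind, as a minimal Python sketch. It assumes references go through a handle table, so that moving an object only requires updating one table entry; that is just one possible design, not necessarily what C# does, and all the names (`CompactingHeap`, `alloc`, `read`, `compact`) are placeholders.

```python
class CompactingHeap:
    def __init__(self, size):
        self.memory = [None] * size   # one cell per "word"
        self.top = 0                  # bump pointer: first free cell
        self.handles = {}             # handle -> (address, length)
        self.next_handle = 0

    def alloc(self, words):
        """Bump allocation: hand out cells at `top` and advance it."""
        if self.top + len(words) > len(self.memory):
            return None               # a real runtime would trigger a collection here
        addr = self.top
        self.memory[addr:addr + len(words)] = words
        self.top += len(words)
        handle = self.next_handle
        self.next_handle += 1
        self.handles[handle] = (addr, len(words))
        return handle

    def read(self, handle):
        addr, length = self.handles[handle]
        return self.memory[addr:addr + length]

    def compact(self, live_handles):
        """Slide live objects toward address 0 and drop everything else."""
        live = sorted((self.handles[h][0], h) for h in live_handles)
        new_top = 0
        new_handles = {}
        for addr, h in live:
            length = self.handles[h][1]
            # Objects are visited in address order, so the destination never
            # overtakes a live object that has not been moved yet.
            self.memory[new_top:new_top + length] = self.memory[addr:addr + length]
            new_handles[h] = (new_top, length)
            new_top += length
        self.handles = new_handles
        self.top = new_top


heap = CompactingHeap(8)
a = heap.alloc(["a1", "a2"])
b = heap.alloc(["b1", "b2", "b3"])
c = heap.alloc(["c1", "c2"])
full = heap.alloc(["d1", "d2"])      # only one cell left, so this fails

heap.compact(live_handles=[a, c])    # pretend tracing found b unreachable
d = heap.alloc(["d1", "d2"])         # now fits in the free block on the right
```

After the compaction `c` lives at address 2 instead of 5, the free space is one contiguous block, and the next allocation is again just a pointer bump.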

The problems I'm concerned with are:
- When to run the GC, how often?
- Should the GC perform a full tracing cycle, or is there a way to do this only partially?
- Should I use back-references, and if so, where should these be stored?

You can read lots about the implementation of the .NET garbage collector (and the heap that it runs on) on MSDN. The basics are that, indeed, there's a reference tracked for each active object. But collections happen based on generations, which allows the GC to touch far less memory than a simple "find orphaned" mechanism.

Rico Mariani is a performance architect on .NET (er, or was, last I talked with him) and has a paper called Garbage Collector Basics and Performance Hints on MSDN. The Garbage Collection overview article is another good place to start.

mikeblas said:
Oh, and CiteSeer has lots of papers about garbage collection. I'm trying to find my favorite ones and I'll update this note when I get them. (I'm on vacation and not at work, and so on -- so maybe I'll never find it. It's printed out in a drawer at my office.)

Here it is, I think: Appel's Generational Garbage Collection paper.

That might be useful. I will read it when I get time... Hopefully it goes into the details.
 
I read the Generational Garbage Collection paper and it seems good enough for my needs... But there is one thing I am not sure about. When a major collection is being performed, where do the allocated objects get copied? In the same region or in a new one? (If it's in the same region, how do they ensure nothing gets overwritten?) Furthermore, when a major collection occurs, it's bound to consume a lot of CPU time. Are more generations usually advised?
 