Test log output
(see here for discussion thread.)
This week I presented a case study of the lack of test log output standardization in the majority of packages used to build current Linux distributions. This was presented as a BoF (https://www.linuxplumbersconf.org/2016/ocw/proposals/3555) during the Linux Plumbers Conference.
It was a productive discussion that let us share the problems we face in the projects we use every day to build a distribution (whether an embedded or a cloud-based one). Open source projects don't follow a standard log format for printing the passing and failing tests they run during packaging time ("make test" or "make check").
The Clear Linux project is using a simple Perl script that helps them count the number of passing and failing tests (which would be trivial if we had a single standard output format across all the projects, but we don't):
# perl count.pl <build.log>
Examples of real package build logs:
So far that simple (and not well engineered) parser has found 26 "standard" outputs (and counting). The script has the flaw that it does not recognize the names of the tests, so it cannot detect regressions: one test may have been passing in the previous release and be failing in the new one, while the number of failing tests remains the same.
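To make the approach concrete, here is a minimal sketch of that kind of multi-format pass/fail counter, written in Python rather than Perl (this is not the actual count.pl; the single Automake-style pattern below is just an illustrative assumption, where the real script reportedly knows 26 formats):

```python
import re

# One (pass_regex, fail_regex) pair per known log format.
# Only an Automake-style summary ("# PASS: 10" / "# FAIL: 2") is
# shown here; a real counter would carry one entry per format.
FORMATS = [
    (re.compile(r'^#?\s*PASS:\s*(\d+)'), re.compile(r'^#?\s*FAIL:\s*(\d+)')),
]

def count_results(lines):
    """Sum passing and failing test counts across all known formats."""
    passed = failed = 0
    for line in lines:
        for pass_re, fail_re in FORMATS:
            m = pass_re.match(line)
            if m:
                passed += int(m.group(1))
            m = fail_re.match(line)
            if m:
                failed += int(m.group(1))
    return passed, failed

log = ["make[1]: Entering directory", "# PASS: 10", "# FAIL: 2"]
print(count_results(log))  # -> (10, 2)
```

Note that this illustrates exactly the flaw described above: it returns only totals, with no test names to compare across releases.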
To be honest, before presenting at LPC I was very confident that this script (or a much smarter version of it) could be the beginning of a solution to the problem we have. However, during the discussion at LPC I came to understand that solving the nightmare we already have might be an equally huge effort, if not a bigger one.
Tim Bird responded: A few remarks about this. This will be something of a stream of ideas, not very well organized. I'd like to avoid requiring too many different language skills in Fuego. In order to write a test for Fuego, we already require knowledge of shell script, Python (for the benchmark parsers) and JSON (for the test specs and plans). I'd be hesitant to adopt something in Perl, but maybe there's a way to leverage the expertise embedded in your script.
I'm not that fond of the idea of integrating all the parsers into a single program. I think it's conceptually simpler to have a parser per log file format. However, I haven't looked in detail at your parser, so I can't really comment on its complexity. I note that 0day has a parser per test (but I haven't checked to see if they re-use common parsers between tests). Possibly some combination of code-driven and data-driven parsers is best, but I don't have the experience you guys do with your parser.
If I understood your presentation, you are currently parsing logs for thousands of packages. I thought you said that about half of the 20,000 packages in a distro have unit tests, and I thought you said that your parser was covering about half of those (so, about 5000 packages currently). And this is with 26 log formats parsed so far.
I'm guessing that packages have a "long tail" of formats, with them getting weirder and weirder the farther out on the tail of formats you get.
Please correct my numbers if I'm mistaken.
> So far that simple (and not well engineered) parser has found 26
> “standard” outputs ( and counting ) .
This is actually remarkable, as Fuego is only handling the formats for the standalone tests we ship with Fuego. As I stated in the BOF, we have two mechanisms, one for functional tests that uses shell, grep and diff, and one for benchmark tests that uses a very small Python program that uses regexes. So, currently we only have 50 tests covered, but many of these parsers use very simple one-line grep regexes.
Neither of these Fuego log results parser methods supports tracking individual subtest results.
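For readers unfamiliar with the style being described, a Fuego-like benchmark parser might look something like the sketch below. This is purely illustrative (it is not Fuego's actual parser API), and the log line and metric name are invented for the example:

```python
import re

def parse_dhrystone(log_text):
    """Pull a single benchmark measure out of a log with one regex,
    in the spirit of Fuego's small per-test Python parsers."""
    m = re.search(r'Dhrystones per Second:\s*([\d.]+)', log_text)
    return float(m.group(1)) if m else None

sample = "Microseconds for one run: 1.2\nDhrystones per Second: 833333.3\n"
print(parse_dhrystone(sample))  # -> 833333.3
```

As the surrounding text notes, a parser of this shape extracts one number and nothing else: there is no notion of named sub-test results that could be tracked over time.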
> The script has the fail that it does not recognize the name of the
> tests in order to detect regressions. Maybe one test was passing in
> the previous release and in the new one is failing, and then the
> number of failing tests remains the same.
This is a concern with the Fuego log parsing as well.
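The fix both parsers need is to key results by test name rather than just counting. A minimal sketch of the comparison step, assuming results have already been parsed into name-to-status maps (the test names here are invented):

```python
def find_regressions(previous, current):
    """Compare per-test results between two releases.

    Both arguments map test name -> 'PASS' or 'FAIL'. Counting alone
    misses a swap (one new failure offset by one new pass); comparing
    by name does not.
    """
    regressions = [name for name, result in current.items()
                   if result == 'FAIL' and previous.get(name) == 'PASS']
    fixes = [name for name, result in current.items()
             if result == 'PASS' and previous.get(name) == 'FAIL']
    return regressions, fixes

old = {'test_io': 'PASS', 'test_net': 'FAIL'}
new = {'test_io': 'FAIL', 'test_net': 'PASS'}  # same totals: 1 pass, 1 fail
print(find_regressions(old, new))  # -> (['test_io'], ['test_net'])
```

In the example, the failure count is identical between releases, yet `test_io` has regressed, which is exactly the case the count-only scripts cannot see.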
I would like to modify Fuego's parser to not just parse out counts, but to also convert the results to something where individual sub-tests can be tracked over time. Daniel Sangorrin's recent work converting the output of LTP into Excel format might be one way to do this (although I'm not that comfortable with using a proprietary format - I would prefer CSV or JSON, but I think Daniel is going for ease of use first).
I need to do some more research, but I'm hoping that there are Jenkins plugins (maybe xUnit) that will provide tools to automatically handle visualization of test and sub-test results over time. If so, I might try converting the Fuego parsers to produce that format.
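The Jenkins xUnit and JUnit plugins consume JUnit-style XML, so converting parsed results into that shape is one plausible target. A minimal sketch, assuming results are already in a name-to-status dict (the suite and test names are invented, and this is a reduced subset of the JUnit schema):

```python
import xml.etree.ElementTree as ET

def to_junit_xml(suite_name, results):
    """Emit minimal JUnit-style XML from a dict of
    test name -> 'PASS' or 'FAIL'."""
    suite = ET.Element('testsuite', name=suite_name,
                       tests=str(len(results)),
                       failures=str(sum(r == 'FAIL' for r in results.values())))
    for name, result in results.items():
        case = ET.SubElement(suite, 'testcase', name=name)
        if result == 'FAIL':
            ET.SubElement(case, 'failure', message='test failed')
    return ET.tostring(suite, encoding='unicode')

print(to_junit_xml('LTP', {'syscalls': 'PASS', 'mm': 'FAIL'}))
```

A real converter would also carry durations, skip states, and error output, but even this reduced form is enough for a plugin to plot per-test history.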
I do think we share the goal of producing a standard, or at least a recommendation, for a common test log output format. This would help the industry going forward. Even if individual tests don't produce the standard format, it will help 3rd parties write parsers that conform the test output to the format, as well as encourage the development of tools that utilize the format for visualization or regression checking.
Do you feel confident enough to propose a format? I don't at the moment. I'd like to survey the industry for 1) existing formats produced by tests (which you have good experience with, and which may already be captured well by your Perl script), and 2) existing tools that use common formats as input (e.g. the Jenkins xUnit plugin). From this I'd like to develop some ideas about the fields that are most commonly used, and a good language to express those fields. My preference would be JSON - I'm something of an XML naysayer, but I could be talked into YAML. Under no circumstances do I want to invent a new language for this.
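Purely as a strawman of the kind of JSON record such a survey might converge on, here is one hypothetical shape (nothing below is an agreed or proposed standard; the package, field names, and values are all invented for illustration):

```python
import json

# Hypothetical per-package result record: identity of the package,
# plus one entry per named test with its status and duration.
result = {
    "package": "coreutils",
    "version": "8.25",
    "tests": [
        {"id": "tests/ls/hex-option", "result": "PASS", "duration_s": 0.04},
        {"id": "tests/rm/deep-1",     "result": "FAIL", "duration_s": 1.31},
    ],
}
print(json.dumps(result, indent=2))
```

The point of the structure is that it carries test names, so regression checking and per-test visualization fall out naturally, unlike count-only output.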
Here is how I propose moving forward on this. I'd like to get a group together to study this issue. I wrote down a list of people at LPC who seem to be working on test issues. I'd like to do the following:
- perform a survey of the areas I mentioned above
- write up a draft spec
- send it around for comments (to which individuals and lists is an open issue)
- discuss it at a future face-to-face meeting (probably at ELC or maybe next year's plumbers)
- publish it as a standard endorsed by the Linux Foundation
Victor wrote later:
After talking with Guillermo, we came up with the idea of moving our parsers into Fuego modules.
We are going to attack this problem with two solutions; happy to hear feedback:
- 1) Merge the parsers we have into the Fuego infrastructure
- 2) Provide an API for new developers (and current maintainers of the existing packages) to check whether their logs are easy to track
- 'easy to track' means that we can get the status and name of each test
- if the parser can't read the log file, we suggest the developer fit their test to a standard (such as CMake or autotools)
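One way such a check could work, as a minimal sketch: the API tries to recover a (status, name) pair for every result line, and reports the log as trackable only if it finds any. The `PASS:`/`FAIL:` line format assumed below is invented for the example, not the actual API:

```python
import re

# Assumed result-line format for illustration only.
RESULT_RE = re.compile(r'^(PASS|FAIL):\s+(\S+)')

def easy_to_track(log_lines):
    """Return the (status, name) pairs found in the log, or None if
    no recognizable named results are present."""
    results = []
    for line in log_lines:
        m = RESULT_RE.match(line)
        if m:
            results.append((m.group(1), m.group(2)))
    return results or None

log = ["PASS: test_open", "FAIL: test_close"]
print(easy_to_track(log))  # -> [('PASS', 'test_open'), ('FAIL', 'test_close')]
```

A `None` return is the signal to suggest that the developer move their test to a standard harness, as described above.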
To be honest, it seems like a titanic amount of work to change all the packages to a standard log output (especially since some of these things date from the '80s), but we can make the new ones fit the standards we have and suggest that maintainers of existing packages adopt one.
Tim, I think we should make a call to action to the Linux community. Do you think a publication might be useful? Maybe LWN or somewhere else?
- the Clear Linux project has a program, count.pl (a Perl script), which has about 26 different log-format parsers embedded in it, and can produce counts of passing and failing tests, based on build logs and test logs (produced using 'make' and 'make test' for the packages)
- it produces text output with a comma-separated list of numbers
- something like '<package>,100,80,20,0,0'
- visualization is done by combining the CSV files and creating graphs from the data
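A small sketch of the "combine the CSV files" step. The meaning of the five numbers in a line like '<package>,100,80,20,0,0' is not spelled out above; purely as an assumption for illustration, they are treated here as total/pass/fail/skip/error:

```python
import csv
import io

def load_counts(csv_text):
    """Parse '<package>,total,pass,fail,skip,error' lines into a dict
    suitable for graphing (field meanings are assumed, see above)."""
    rows = {}
    for pkg, *nums in csv.reader(io.StringIO(csv_text)):
        total, passed, failed, skipped, errors = map(int, nums)
        rows[pkg] = {'total': total, 'pass': passed, 'fail': failed,
                     'skip': skipped, 'error': errors}
    return rows

data = load_counts("bash,100,80,20,0,0\nopenssl,50,50,0,0,0\n")
print(data['bash']['fail'])  # -> 20
```

From a structure like `data`, producing the per-package graphs mentioned above is a straightforward plotting step.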
- Fuego 1.0 does not provide counts or fancy visualization at the moment (just pass/fail at the level of a Jenkins job (Fuego test), and plots for some benchmark measures)
- There are some existing systems for testing packages in Debian and Yocto:
- essential elements of a good output format are:
- per testcase:
- test identifier (string)
- duration (?)