The health endpoint histogram has a large amount of cardinality for a
simple endpoint. Introduce a new "Slim" set of buckets for `/health` to
reduce the metrics load on large deployments. Especially those that have
per-node DNS caching services.
Add a metric to count internal health check failures rather than use the
timeout value as side effect monitor of the check error. This avoids
incorrectly recording the timeout value if there is an error that is not
a timeout (ex. refused)
Signed-off-by: SuperQ <superq@gmail.com>
* For caddy v1 in our org
This RP changes all imports for caddyserver/caddy to coredns/caddy. This
is the v1 code of caddy.
For the coredns/caddy repo the following changes have been made:
* anything not needed by us is deleted
* all `telemetry` stuff is deleted
* all its import paths are also changed to point to coredns/caddy
* the v1 branch has been moved to the master branch
* a v1.1.0 tag has been added to signal the latest release
Signed-off-by: Miek Gieben <miek@miek.nl>
* Fix imports
Signed-off-by: Miek Gieben <miek@miek.nl>
* Group coredns/caddy with out plugins
Signed-off-by: Miek Gieben <miek@miek.nl>
* remove this file
Signed-off-by: Miek Gieben <miek@miek.nl>
* Relax import ordering
github.com/coredns is now also a coredns dep, this makes
github.com/coredns/caddy fit more natural in the list.
Signed-off-by: Miek Gieben <miek@miek.nl>
* Fix final import
Signed-off-by: Miek Gieben <miek@miek.nl>
Went over all generated manual pages and fixed some markdown issues,
mostly escaping "_" to avoid underlining entire paragraphs.
Some textual fixes in route53 and other cloud DNS plugins.
Regenerated the markdown with mmark.
Signed-off-by: Miek Gieben <miek@miek.nl>
* Move to CODEOWNERS
No change in who own what; just a move to CODEOWNERS. This allows
dreck cleanups.
Added .dreck.yaml for alias and exec.
Fixes: #3486
Signed-off-by: Miek Gieben <miek@miek.nl>
* stickler bot
Signed-off-by: Miek Gieben <miek@miek.nl>
* sort the file
Signed-off-by: Miek Gieben <miek@miek.nl>
Caught my eye, we name things directive still, esp when talking about
the prometheus *plugin*. Rename everything that needs to be plugin to
'plugin'. Also make sure Metrics is a H2 section (not H1).
Signed-off-by: Miek Gieben <miek@miek.nl>
Abstract the caddy call and make it simpler.
See #3261 for some part of the discussion.
Go from:
~~~ go
func init() {
caddy.RegisterPlugin("any", caddy.Plugin{
ServerType: "dns",
Action: setup,
})
}
~~~
To:
~~~ go
func init() { plugin.Register("any", setup) }
~~~
This requires some external documents in coredns.io to be updated as
well; the old way still works, so it's backwards compatible.
Signed-off-by: Miek Gieben <miek@miek.nl>
* Update Caddy to 1.0.1, and update import path
This fix updates caddy to 1.0.1 and also
updates the import path to github.com/caddyserver/caddy
This fix fixes 2959
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
* Also update plugin.cfg
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
* Update and bump zplugin.go
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Add OnReStartFailed which makes the health plugin stay up if the
Corefile is corrupt and we revert to the previous version.
Also needs a fix for the channel handling
See #2659
Testing it will log the following when restarting with a corrupted
Corefile
~~~
2019-05-04T18:01:59.431Z [INFO] linux/amd64, go1.12.4,
CoreDNS-1.5.0
linux/amd64, go1.12.4,
[INFO] SIGUSR1: Reloading
[INFO] Reloading
[ERROR] Restart failed: Corefile:5 - Error during parsing: Unknown directive 'bdhfhdhj'
[ERROR] SIGUSR1: starting with listener file descriptors: Corefile:5 - Error during parsing: Unknown directive 'bdhfhdhj'
~~~
After which the curl still works.
This also needed a change to reset the channel used for the metrics
go-routine which gets closed on shutdown, otherwise you'll see:
~~~
^C[INFO] SIGINT: Shutting down
panic: close of closed channel
goroutine 90 [running]:
github.com/coredns/coredns/plugin/health.(*health).OnFinalShutdown(0xc000089bc0, 0xc000063d88, 0x4afe6d)
~~~
Signed-off-by: Miek Gieben <miek@miek.nl>
Small, trivial cleanup: got triggered because I saw a comment on how
health plugins polls other plugins which isn't true.
* Remove useless newHealth function
* healthParse -> parse
* Remove useless constants
Net deletion of code.
Signed-off-by: Miek Gieben <miek@miek.nl>
* plugin/health: remove ability to poll other plugins
This mechanism defeats the purpose any plugin (mostly) caching can still
be alive, we can probably forward queries still. Don't poll plugins,
just tell the world we're up and running.
It was only actually used in kubernetes; and there specifically would
mean any network hiccup would NACK the entire server health.
Fixes: #2534
Signed-off-by: Miek Gieben <miek@miek.nl>
* update docs based on feedback
Signed-off-by: Miek Gieben <miek@miek.nl>
* - ensure plugins that use prometheus.MustRegister, re-register after reload
- removing once.Do on the startup function was simplest way to do it.
* - fix underscored names (advice of bot)
* - tune existing UT for reload, and add a test verifying failing reload does not prevent correct registering for metrics
* - ensure different ports for tests that can run in same time ..
* Clean up tests logging
This cleans up the travis logs so you can see the failures better.
Older tests in tests/ would call log.SetOutput(ioutil.Discard) in
a haphazard way. This add log.Discard and put an `init` function in each
package's dir (no way to do this globally). The cleanup in tests/ is
clear.
All plugins also got this init function to have some uniformity and kill
any (future) logging there in the tests as well.
There is a one-off in pkg/healthcheck because that does log.
Signed-off-by: Miek Gieben <miek@miek.nl>
* bring back original log_test.go
Signed-off-by: Miek Gieben <miek@miek.nl>
* suppress logging here as well
Signed-off-by: Miek Gieben <miek@miek.nl>
Prevent future; "remove trailing whitespace" PR, but adding a simple
presubmit that checks for this.
This presubmit flagged quite some offenders, remove all trailing
whitespace from. Apart from that there aren't any other changes.
Signed-off-by: Miek Gieben <miek@miek.nl>
* plugin/health: update README
Make more clear in the readme that health is limited to 1 server.
Fixes#1722
* rephrase and remove ~~~ corefile because it will fail
* update docs
* plugins: use plugin specific logging
Hooking up pkg/log also changed NewWithPlugin to just take a string
instead of a plugin.Handler as that is more flexible and for instance
the Root "plugin" doesn't implement it fully.
Same logging from the reload plugin:
.:1043
2018/04/22 08:56:37 [INFO] CoreDNS-1.1.1
2018/04/22 08:56:37 [INFO] linux/amd64, go1.10.1,
CoreDNS-1.1.1
linux/amd64, go1.10.1,
2018/04/22 08:56:37 [INFO] plugin/reload: Running configuration MD5 = ec4c9c55cd19759ea1c46b8c45742b06
2018/04/22 08:56:54 [INFO] Reloading
2018/04/22 08:56:54 [INFO] plugin/reload: Running configuration MD5 = 9e2bfdd85bdc9cceb740ba9c80f34c1a
2018/04/22 08:56:54 [INFO] Reloading complete
* update docs
* better doc
* reload: use OnRestart
Close the listener on OnRestart for health and metrics so the default
setup function can setup the listener when the plugin is "starting up".
Lightly test with some SIGUSR1-ing. Also checked the reload plugin with
this, seems fine:
.com.:1043
.:1043
2018/04/20 15:01:25 [INFO] CoreDNS-1.1.1
2018/04/20 15:01:25 [INFO] linux/amd64, go1.10,
CoreDNS-1.1.1
linux/amd64, go1.10,
2018/04/20 15:01:25 [INFO] Running configuration MD5 = aa8b3f03946fb60546ca1f725d482714
2018/04/20 15:02:01 [INFO] Reloading
2018/04/20 15:02:01 [INFO] Running configuration MD5 = b34a96d99e01db4015a892212560155f
2018/04/20 15:02:01 [INFO] Reloading complete
^C2018/04/20 15:02:06 [INFO] SIGINT: Shutting down
With this corefile:
.com {
proxy . 127.0.0.1:53
prometheus :9054
whoami
reload
}
. {
proxy . 127.0.0.1:53
prometheus :9054
whoami
reload
}
The prometheus port was 9053, changed that to 54 so reload would pick it
up.
From a cursory look it seems this also fixes:
Fixes#1604#1618#1686#1492
* At least make it test
* Use onfinalshutdown
* reload: add reload test
This test #1604 adn right now fails.
* Address review comments
* Add bug section explaining things a bit
* compile tests
* Fix tests
* fixes
* slightly less crazy
* try to make prometheus setup less confusing
* Use ephermal port for test
* Don't use the listener
* These are shared between goroutines, just use the boolean in the main
structure.
* Fix text in the reload README,
* Set addr to TODO once stopping it
* Morph fturb's comment into test, to test reload and scrape health and
metric endpoint
* Update all plugins to use plugin/pkg/log
I wish this could have been done with sed. Alas manually changed all
callers to use the new plugin/pkg/log package.
* Error -> Info
* Add docs to debug plugin as well
* plugin/health: make reload work
Remove the once.Do from the startup, so we can re-bind the HTTP
listener. Also clarify the usage of health in multiple server blocks
(this is not the best approach - but there isn't a generic solution at
this point).
Manual tested as we lack testing infra, i.e kill -SIGUSR1 and some
CURLing of the health endpoint.
* Readme test fix
* update
* dont need this
This should have everyone, but the process was quite manual. The rename
from middleware -> plugin also meant I had to do some extra digging on
who actually submitted the PR. I also double checked the current list of
people with commit access.
Every plugin now has an OWNERS, except *reverse*. I'll file a bug for
that.
* plugin/health: add lameduck mode
Add a way to configure lameduck more, i.e. set health to false, stop
polling plugins. Then wait for a duration before shutting down. As the
health middleware is configured early on in the plugin list, it will
hold up all other shutdown, meaning we still answer queries.
* Add New
* More tests
* golint
* remove confusing text
* plugin/health: add 'overloaded metrics'
Query our on health endpoint and record (and export as a metric) the
time it takes. The Get has a 5s timeout, that, when reached, will set
the metric duration to 5s. The actually call "I'm I overloaded" is left
to an external entity.
* README
* golint and govet
* and the tests
* Add manual pages
Generate manual pages from the README and extend README with Name and
Description sections.
The generation requires 'ronn' which may not be available. Just check in
all generated manual pages.
Switched health and autopath plugin to allow any plugins to be used instead
of a hardcoded list. I did not switch federation over since it wasn't
obvious that anything other than kubernetes could be used with it.
Fixes#1291
Implement health.Healther in erratic and kubernetes plugin. The
kubernetes' healtcheck is only performed on startup - i.e. turn
healthy after the initial loading.
Erratic follow the drop count: every query%drop turns the healthcheck
unhealthy.
Fixes: #985
* doc update
Go through all README and fix mistakes, extend example and let more
corefile snippets be test for validity.
* Cant use spefic addr in test